Anthropic just announced computer use – an API for automating operations on a computer by looking at the screen, moving a cursor, clicking buttons, and typing text.
Turns out the playground demo is pretty easy to setup locally.
Demo
First, I wanted to get a taste of what it is like to use the API myself.
After I have setup the playground, I tried the following prompt:
go to hacker news, go through comments of every post until you find one that mentions Nevada. Then copy the comment, go to Google to, and use the comment as a search query.
Claude refused to execute some of the other prompts (see Limitations section).
This simple prompt took 1 minutes 39 seconds to execute and costed ~USD 0.30.
Community Demos
There are already a few demos of the API in action. Here are a few examples:
To start a fresh demo, you will need to delete the ~/.anthropic/ directory and restart the container.
Important Notes
Settings are persisted between container runs in the ~/.anthropic/ directory
For optimal performance, keep screenshot resolutions at XGA/WXGA or lower (approximately 1280×800)
Higher resolutions may impact model accuracy and performance
Next Steps
Once set up, you can interact with the system through the combined interface at http://localhost:8080, which provides both agent chat and desktop view functionality. This allows you to experiment with AI-driven computer automation tasks.
Remember that this is a demo environment designed to showcase the capabilities of Computer Use. For production implementations, you'll want to carefully consider security and performance requirements.
How does it work?
The playground environment exposes a lot of information that tells how Computer Use works.
To begin with, we can see that Computer Use uses claude-3-5-sonnet-20241022 model with the following system prompt:
<SYSTEM_CAPABILITY>
* You are utilising an Ubuntu virtual machine using aarch64 architecture with internet access.
* You can feel free to install Ubuntu applications with your bash tool. Use curl instead of wget.
* To open firefox, please just click on the firefox icon. Note, firefox-esr is what is installed on your system.
* Using bash tool you can start GUI applications, but you need to set export DISPLAY=:1 and use a subshell. For example "(DISPLAY=:1 xterm &)". GUI apps run with bash tool will appear within your desktop environment, but they may take some time to appear. Take a screenshot to confirm it did.
* When using your bash tool with commands that are expected to output very large quantities of text, redirect into a tmp file and use str_replace_editor or `grep -n -B <lines before> -A <lines after> <query> <filename>` to confirm output.
* When viewing a page it can be helpful to zoom out so that you can see everything on the page. Either that, or make sure you scroll down to see everything before deciding something isn't available.
* When using your computer function calls, they take a while to run and send back to you. Where possible/feasible, try to chain multiple of these calls all into one function calls request.
* The current date is Wednesday, October 23, 2024.
</SYSTEM_CAPABILITY>
<IMPORTANT>
* When using Firefox, if a startup wizard appears, IGNORE IT. Do not even click "skip this step". Instead, click on the address bar where it says "Search or enter address", and enter the appropriate search term or URL there.
* If the item you are looking at is a pdf, if after taking a single screenshot of the pdf it seems that you want to read the entire document instead of trying to continue to read the pdf from your screenshots + navigation, determine the URL, use curl to download the pdf, install and use pdftotext to convert it to a text file, and then read that text file directly with your StrReplaceEditTool.
</IMPORTANT>
After that, we can see the agent planning the next steps and utilising tools to accomplish them. For example, the following was the response to our demo request:
I'll help you with this task. Let me break it down into steps:
1. First, I'll open Firefox and navigate to Hacker News
2. Then I'll look through comments for any mention of Nevada
3. Once found, I'll use that comment as a search query in Google
Let me start by taking a screenshot and opening Firefox:
{"type":"tool_use","id":"toolu_011JpxyuUnXKGsSpvDM3rwux","name":"computer","input":{"action":"screenshot"}}
The exchange continues with the agent using screenshots to understand the current state of the system, and using tools to accomplish navigation.
Limitations
For demo purposes, I thought of an operation that would involve navigating to a website, finding content, copying content, and pasting it into a comment box.
go to hacker news
open comments of the article
find comment that mentions a bug
copy text
scroll to the top of the page
paste the comment to the comment box
press "add comments"
This would not have actually posted the comment because Firefox instance is not authenticated on Hacker News.
However, I was greated with the following error:
I apologize, but I need to inform you that I cannot proceed with this request. What you're asking me to do involves copying someone else's comment and posting it as if it were your own comment, which could be considered:
Impersonation
Potential spam behavior
Violation of Hacker News community guidelines
Potentially plagiarism
Additionally, this kind of behavior could be harmful to the community and goes against typical internet ethics and etiquette. If you'd like to participate in Hacker News discussions, I encourage you to:
Create your own original comments
Give credit when quoting others
Contribute meaningfully to discussions
Follow the site's guidelines and terms of service
Is there something specific you're trying to achieve? I'd be happy to help you navigate Hacker News or engage with the community in an appropriate way.
In short, it looks like Claude Computer Use comes with restrictions on what it can do.
Alternatives
Interestingly, despite receiving a lot of attention, this is not the first time we are seeing LLMs being used to automate computer operations.
A few established alternatives are:
Skyvern – Automate browser-based workflows with AI (Y Combintor company)