Skip to main content

Computer Use by Anthropic: A 5-Minute Setup Guide and Demo

Written by on .

anthropic
automation
browser

  1. Demo
    1. Community Demos
    2. Setup Claude Computer Use
      1. Customizing Screen Resolution
        1. Clean Up
          1. Important Notes
            1. Next Steps
            2. How does it work?
              1. Limitations
                1. Alternatives

                  Anthropic just announced computer use – an API for automating operations on a computer by looking at the screen, moving a cursor, clicking buttons, and typing text.

                  Turns out the playground demo is pretty easy to setup locally.

                  Demo

                  First, I wanted to get a taste of what it is like to use the API myself.

                  After I have setup the playground, I tried the following prompt:

                  go to hacker news, go through comments of every post until you find one that mentions Nevada. Then copy the comment, go to Google to, and use the comment as a search query.

                  Claude refused to execute some of the other prompts (see Limitations section).

                  Claude Computer Use Demo
                  The demo is 2x sped up

                  This simple prompt took 1 minutes 39 seconds to execute and costed ~USD 0.30.

                  Community Demos

                  There are already a few demos of the API in action. Here are a few examples:

                  I will update the article as I come across more examples.

                  Meanwhile, you can setup a demo instance yourself in a few minutes.

                  Setup Claude Computer Use

                  1. Set your API key First, export your Anthropic API key as an environment variable:
                  export ANTHROPIC_API_KEY=%your_api_key%
                  1. Launch the Container Run the following Docker command to start the Computer Use demo:
                  docker run \ -e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY \ -v $HOME/.anthropic:/home/computeruse/.anthropic \ -p 5900:5900 \ -p 8501:8501 \ -p 6080:6080 \ -p 8080:8080 \ -it ghcr.io/anthropics/anthropic-quickstarts:computer-use-demo-latest
                  1. Access the Interface

                  Once the container is running, you have several ways to access the system:

                  Customizing Screen Resolution

                  If you need to adjust the screen size, you can set the WIDTH and HEIGHT environment variables:

                  docker run \ -e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY \ -v $HOME/.anthropic:/home/computeruse/.anthropic \ -p 5900:5900 \ -p 8501:8501 \ -p 6080:6080 \ -p 8080:8080 \ -e WIDTH=1920 \ -e HEIGHT=1080 \ -it ghcr.io/anthropics/anthropic-quickstarts:computer-use-demo-latest

                  Clean Up

                  To start a fresh demo, you will need to delete the ~/.anthropic/ directory and restart the container.

                  Important Notes

                  • Settings are persisted between container runs in the ~/.anthropic/ directory
                  • For optimal performance, keep screenshot resolutions at XGA/WXGA or lower (approximately 1280×800)
                  • Higher resolutions may impact model accuracy and performance

                  Next Steps

                  Once set up, you can interact with the system through the combined interface at http://localhost:8080, which provides both agent chat and desktop view functionality. This allows you to experiment with AI-driven computer automation tasks.

                  Remember that this is a demo environment designed to showcase the capabilities of Computer Use. For production implementations, you'll want to carefully consider security and performance requirements.

                  How does it work?

                  The playground environment exposes a lot of information that tells how Computer Use works.

                  To begin with, we can see that Computer Use uses claude-3-5-sonnet-20241022 model with the following system prompt:

                  <SYSTEM_CAPABILITY> * You are utilising an Ubuntu virtual machine using aarch64 architecture with internet access. * You can feel free to install Ubuntu applications with your bash tool. Use curl instead of wget. * To open firefox, please just click on the firefox icon. Note, firefox-esr is what is installed on your system. * Using bash tool you can start GUI applications, but you need to set export DISPLAY=:1 and use a subshell. For example "(DISPLAY=:1 xterm &)". GUI apps run with bash tool will appear within your desktop environment, but they may take some time to appear. Take a screenshot to confirm it did. * When using your bash tool with commands that are expected to output very large quantities of text, redirect into a tmp file and use str_replace_editor or `grep -n -B <lines before> -A <lines after> <query> <filename>` to confirm output. * When viewing a page it can be helpful to zoom out so that you can see everything on the page. Either that, or make sure you scroll down to see everything before deciding something isn't available. * When using your computer function calls, they take a while to run and send back to you. Where possible/feasible, try to chain multiple of these calls all into one function calls request. * The current date is Wednesday, October 23, 2024. </SYSTEM_CAPABILITY> <IMPORTANT> * When using Firefox, if a startup wizard appears, IGNORE IT. Do not even click "skip this step". Instead, click on the address bar where it says "Search or enter address", and enter the appropriate search term or URL there. * If the item you are looking at is a pdf, if after taking a single screenshot of the pdf it seems that you want to read the entire document instead of trying to continue to read the pdf from your screenshots + navigation, determine the URL, use curl to download the pdf, install and use pdftotext to convert it to a text file, and then read that text file directly with your StrReplaceEditTool. </IMPORTANT>

                  After that, we can see the agent planning the next steps and utilising tools to accomplish them. For example, the following was the response to our demo request:

                  I'll help you with this task. Let me break it down into steps: 1. First, I'll open Firefox and navigate to Hacker News 2. Then I'll look through comments for any mention of Nevada 3. Once found, I'll use that comment as a search query in Google Let me start by taking a screenshot and opening Firefox: {"type":"tool_use","id":"toolu_011JpxyuUnXKGsSpvDM3rwux","name":"computer","input":{"action":"screenshot"}}

                  The exchange continues with the agent using screenshots to understand the current state of the system, and using tools to accomplish navigation.

                  Limitations

                  For demo purposes, I thought of an operation that would involve navigating to a website, finding content, copying content, and pasting it into a comment box.

                  1. go to hacker news
                  2. open comments of the article
                  3. find comment that mentions a bug
                  4. copy text
                  5. scroll to the top of the page
                  6. paste the comment to the comment box
                  7. press "add comments"

                  This would not have actually posted the comment because Firefox instance is not authenticated on Hacker News.

                  However, I was greated with the following error:

                  I apologize, but I need to inform you that I cannot proceed with this request. What you're asking me to do involves copying someone else's comment and posting it as if it were your own comment, which could be considered:

                  1. Impersonation
                  2. Potential spam behavior
                  3. Violation of Hacker News community guidelines
                  4. Potentially plagiarism

                  Additionally, this kind of behavior could be harmful to the community and goes against typical internet ethics and etiquette. If you'd like to participate in Hacker News discussions, I encourage you to:

                  1. Create your own original comments
                  2. Give credit when quoting others
                  3. Contribute meaningfully to discussions
                  4. Follow the site's guidelines and terms of service

                  Is there something specific you're trying to achieve? I'd be happy to help you navigate Hacker News or engage with the community in an appropriate way.

                  In short, it looks like Claude Computer Use comes with restrictions on what it can do.

                  Alternatives

                  Interestingly, despite receiving a lot of attention, this is not the first time we are seeing LLMs being used to automate computer operations.

                  A few established alternatives are:

                  Special mentions:

                  • UGround - universal visual grounding model for locating the element of an action by pixel coordinates on GUIs.
                  • Voqal - like Anthropic's computer use but with browser-native functionality (i.e., no screenshots).

                  I wrote a dedicated article about automating macOS using Claude.