Screenbox

Real desktops for AI agents.

Screenbox gives any MCP-compatible AI agent (Claude, Cursor, Copilot, etc.) its own isolated virtual desktop with a real Chromium browser. Your agents see, click, type, and navigate -- just like a human would. You watch them work via RDP or VNC. You take control when they need help.

Each desktop is a fully isolated Docker container. No bind mounts -- files move only through explicit API calls. Save and restore state with snapshots. Everything runs on your machine.

Quick Start

Option A: Docker Compose (dashboard, multi-desktop)

Full setup with dashboard, multi-desktop support, and web UI.

git clone https://github.com/dklymentiev/screenbox.git
cd screenbox
./setup.sh          # generates .env, builds desktop image + services
docker compose up -d

Dashboard: http://localhost:16000
MCP endpoint: http://localhost:8080/mcp

Add to your MCP client (Claude Desktop, Claude Code, Cursor):

{
  "mcpServers": {
    "screenbox": {
      "url": "http://localhost:8080/mcp"
    }
  }
}

Option B: pip install (single agent, no dashboard)

Lightweight setup -- MCP server runs locally via stdio.

pip install screenbox-mcp
docker build -f docker/Dockerfile -t screenbox:latest docker/

Add to your MCP client config:

{
  "mcpServers": {
    "screenbox": {
      "command": "python3",
      "args": ["-m", "screenbox"]
    }
  }
}

Then tell your agent:

"Create a desktop and go to github.com"

Authentication

Screenbox supports three auth modes depending on your setup.

Strict Mode (default: on)

Strict mode requires every request to authenticate. To disable it, set SCREENBOX_REQUIRE_AUTH=false in .env. With strict mode off, all agents have full access without authentication, which is suitable for single-user or VPN-protected setups.

Admin Access

Set SCREENBOX_ADMIN_KEY in .env -- full access to all desktops. The SCREENBOX_API_TOKEN (Bearer token) also grants admin access.

The token can be passed in any of three ways, checked in this priority order:

  1. X-API-Key header

  2. Authorization: Bearer <token> header

  3. ?token=<token> query parameter in URL
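
The priority order above can be sketched as a small lookup function. This is an illustration only, not Screenbox's actual code: the function name resolve_token and its arguments are hypothetical, and the real server may parse requests differently.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def resolve_token(headers: dict, url: str) -> Optional[str]:
    """Return the first token found, checking sources in the documented order.

    Hypothetical helper: name and signature are assumptions for illustration.
    """
    # HTTP header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}

    # 1. X-API-Key header
    api_key = lowered.get("x-api-key")
    if api_key:
        return api_key

    # 2. Authorization: Bearer <token> header
    scheme, _, credentials = lowered.get("authorization", "").partition(" ")
    if scheme.lower() == "bearer" and credentials.strip():
        return credentials.strip()

    # 3. ?token=<token> query parameter in the URL
    values = parse_qs(urlparse(url).query).get("token")
    return values[0] if values else None
```

When more than one source is present, the earlier one wins: a request carrying both an X-API-Key header and a ?token= parameter is treated as authenticated by the header.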

Agent Registration (multi-agent setups)

1. Register:  desktop_manage(action="register", agent_id="my-bot", label="My Bot")
              -> returns api_key (save it!)

2. Login:     desktop_manage(action="login", agent_id="my-bot", text="<api_key>")
              -> session stored on server for this MCP connection

3. Create:    desktop_manage(action="create", desktop_id="work-1")
              -> desktop owned by "my-bot"

4. Work:      desktop_screenshot("work-1"), desktop_click("work-1", ...) etc.
              -> only "my-bot" can access "work-1"

Step 2 (login) is needed once per session. Alternatively, pass the api_key via header or ?token= URL param to skip the login step.

Ownership Rules

  • Desktop created by an agent belongs to that agent (persists across restarts)

  • Admin-created desktops are shared (any agent can use them)

  • Agents see only their own desktops + shared desktops

  • Admin sees and manages all desktops
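
The ownership rules above amount to a small access check. The sketch below is a minimal model of them, assuming a shared desktop is represented by an owner of None and the admin by a fixed identifier. Both conventions, and all names in the block, are hypothetical rather than taken from Screenbox's source.

```python
from dataclasses import dataclass
from typing import List, Optional

ADMIN = "admin"  # hypothetical identifier for the admin principal


@dataclass(frozen=True)
class Desktop:
    desktop_id: str
    owner: Optional[str]  # None marks a shared (admin-created) desktop


def can_access(agent_id: str, desktop: Desktop) -> bool:
    """Admin sees everything; agents see their own and shared desktops."""
    if agent_id == ADMIN:
        return True
    return desktop.owner is None or desktop.owner == agent_id


def visible_desktops(agent_id: str, desktops: List[Desktop]) -> List[str]:
    """Return the IDs of the desktops this agent may list and use."""
    return [d.desktop_id for d in desktops if can_access(agent_id, d)]
```

For example, with desktops work-1 (owned by my-bot), shared-1 (shared), and other-1 (owned by other-bot), my-bot would see work-1 and shared-1, while the admin would see all three.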

MCP Client Config

Option A -- headers (if your MCP client supports them):

{
  "mcpServers": {
    "screenbox": {
      "url": "http://localhost:8080/mcp",
      "headers": {
        "Authorization": "Bearer <your-api-token>"
      }
    }
  }
}

Option B -- token in URL (works with any MCP client):

{
  "mcpServers": {
    "screenbox": {
      "url": "http://localhost:8080/mcp?token=<your-api-token>"
    }
  }
}

Option C -- no auth (strict mode off):

{
  "mcpServers": {
    "screenbox": {
      "url": "http://localhost:8080/mcp"
    }
  }
}

What Your Agent Can Do

Agent: desktop_manage(action="create", desktop_id="browser-1")
Agent: desktop_chrome(desktop_id="browser-1", action="navigate", url="https://github.com")
Agent: desktop_screenshot("browser-1")                           -- sees the page
Agent: desktop_chrome(desktop_id="browser-1", action="page_map") -- structured page content
Agent: desktop_look("browser-1", cell=5)                         -- OCR a grid cell for precise coords
Agent: desktop_click("browser-1", 640, 360)                      -- clicks
Agent: desktop_type("browser-1", "hello world")                  -- types

Chrome Recovery

If Chrome crashes or MCP restarts, relaunch Chrome with the Screenbox extension:

Agent: desktop_manage(action="app_launch", app="chrome", app_args="https://example.com")
       -> launched: true, extension_ready: true

This uses start-chrome.sh which handles singleton locks, service worker cache, and extension loading automatically.

Architecture

All desktop operations go through a single path: MCP server -> manager -> Docker API. Dashboard, MCP tools, and HTTP API all use the same manager.exec() for screenshots, shell commands, and container lifecycle. No direct docker CLI calls.

A custom Docker API proxy (docker-proxy.py) sits between MCP and the Docker daemon, whitelisting allowed endpoints and properly streaming exec stdout for reliable binary data transfer (screenshots, file reads).
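
To make the endpoint whitelisting concrete, here is a minimal sketch of an allowlist check over Docker Engine API paths. It is not the contents of docker-proxy.py: the ALLOWED list and the is_allowed function are assumptions chosen for illustration, and the real proxy may permit a different set of endpoints.

```python
import re

# Hypothetical allowlist: (HTTP method, path pattern) pairs.
# The paths follow the Docker Engine API layout; the selection is an assumption.
ALLOWED = [
    ("GET", re.compile(r"^/containers/json$")),
    ("POST", re.compile(r"^/containers/[A-Za-z0-9_.-]+/exec$")),
    ("POST", re.compile(r"^/exec/[A-Za-z0-9]+/start$")),
]

# Docker clients usually prefix paths with an API version such as /v1.43.
VERSION_PREFIX = re.compile(r"^/v\d+\.\d+")


def is_allowed(method: str, path: str) -> bool:
    """Return True only if the request matches an allowlisted endpoint."""
    # Drop the query string and the optional version prefix before matching.
    bare_path = VERSION_PREFIX.sub("", path.split("?", 1)[0])
    return any(
        allowed_method == method.upper() and pattern.match(bare_path)
        for allowed_method, pattern in ALLOWED
    )
```

Anything not explicitly listed is rejected, so a request such as POST /images/create never reaches the Docker daemon in this model.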

Security

Screenbox gives AI agents full desktop access -- browser, shell, files. Run it responsibly:

  • Do not expose MCP API to the public internet. Use localhost or VPN only.

  • Use unique API tokens. setup.sh generates them automatically.

  • Desktops are isolated containers but not hardened sandboxes. Do not run untrusted agents without review.

  • Enable Docker API proxy for shared or multi-tenant environments.

See SECURITY.md for vulnerability reporting and detailed security architecture.

Features

  • MCP-native -- works with Claude Desktop, Claude Code, Cursor, or any MCP client

  • Real Chromium -- not headless, not Playwright. A real browser with DevTools and extensions

  • Fully isolated -- each desktop is an isolated Docker container. No bind mounts, no host access

  • Snapshots -- save and restore desktop state (files, sessions) on demand

  • Observable -- watch agents work live via RDP or VNC

  • Human-in-the-loop -- take mouse/keyboard control, help the agent, release control

  • Semantic element map -- agents get a structured map of all interactive elements with coordinates

  • Cross-platform -- Linux (native Docker), macOS (Docker Desktop), Windows (WSL2)

  • Lightweight -- ~2 GB RAM per desktop, no GPU needed

  • Knowledge compilation -- agents learn from past sessions. Action logs are compiled into reusable knowledge facts that are auto-injected into future interactions

Knowledge Compilation

Agents lose learned knowledge between sessions. The knowledge compilation pipeline solves this:

Session logs (action history)
    |  desktop_compile_knowledge()
    v
Candidate facts (declarative, not imperative)
    |  desktop_merge_knowledge(mode="preview")
    v
Diff: new / updated / unchanged
    |  desktop_merge_knowledge(mode="apply")
    v
Stored knowledge (auto-injected into screenshot/look responses)
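
The preview step in the pipeline above classifies each candidate fact as new, updated, or unchanged. A minimal sketch of that classification follows, assuming facts are stored as a mapping from a fact key to its text. That storage shape, the function names, and the example keys are hypothetical; the real tool may match facts differently, for instance through the LLM.

```python
def diff_knowledge(stored: dict, candidates: dict) -> dict:
    """Classify candidate facts against stored ones (the "preview" step)."""
    result = {"new": [], "updated": [], "unchanged": []}
    for key, text in candidates.items():
        if key not in stored:
            result["new"].append(key)
        elif stored[key] != text:
            result["updated"].append(key)
        else:
            result["unchanged"].append(key)
    return result


def apply_merge(stored: dict, candidates: dict) -> dict:
    """Return a merged copy (the "apply" step); the input is left untouched."""
    merged = dict(stored)
    merged.update(candidates)
    return merged


stored = {"github.login": "Sign in link is at the top right"}
candidates = {
    "github.login": "Sign in link is at the top right",
    "github.search": "Search box is in the page header",
}
print(diff_knowledge(stored, candidates))
```

Keeping preview and apply as separate steps lets a human or agent inspect the diff before stored knowledge changes.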

Configure any OpenAI-compatible LLM in .env:

SCREENBOX_LLM_ENDPOINT=https://openrouter.ai/api/v1
SCREENBOX_LLM_MODEL=google/gemini-2.5-flash
SCREENBOX_LLM_KEY=sk-...

MCP Tools

Screenbox exposes 19 MCP tools: 8 core, 4 dispatchers, 4 knowledge, 2 system, and 1 debug tool.

Core Tools (8)

| Tool | Description |
|---|---|
| desktop_screenshot | Capture screen as JPEG (grid overlay, enhance options) |
| desktop_look | OCR a grid cell -- get precise text and coordinates for clicking |
| desktop_click | Click at (x, y) with observe mode -- returns OCR around click point |
| desktop_type | Type text via keyboard |
| desktop_key | Key combo (Ctrl+C, Enter, Alt+F4, etc.) |
| desktop_shell | Run shell command in container |
| desktop_batch | Execute multiple actions in sequence (reduce round-trips) |
| desktop_help | Show tool reference and workflow patterns |

Dispatcher Tools (4)

Each dispatcher consolidates related actions behind a single action parameter:

| Tool | Actions |
|---|---|
| desktop_chrome | navigate, page_map, page_read, view_read, cursor_read, eval, tabs, new_tab, close_tab, switch_tab, back, forward, wait_for, screenshot, search, extract, dom, page_info, cookies, set_cookies, clear_cookies, pdf, click, type, performance, network, console_start, console_stop, console_get, ready, ssl_errors, emulate, geolocation |
| desktop_window | list, activate, minimize, maximize, restore, resize, move, close, show_desktop |
| desktop_file | upload, download, list, upload_tar |
| desktop_manage | create, destroy, list, status, pause, resume, acquire, release, smart_acquire, heartbeat, health, snapshot_save, snapshot_restore, snapshot_list, clipboard_get, clipboard_set, grid_on, grid_off, overlay, install, uninstall, app_launch, proc_list, proc_kill, scroll, drag, mouse_move, mouse_down, mouse_up, right_click, wait_window, wait_idle |

Debug Tools (1)

| Tool | Actions |
|---|---|
| desktop_debug | on_screen (AT-SPI/OCR/Vision cascade), text, click_text, wait_text, element (AI vision), hover, a11y_apps, a11y_tree, a11y_find, a11y_activate, a11y_set_text, inspect_cell, menu_click |

Debug tools are for advanced automation and accessibility inspection. Normal agent workflow should use screenshot -> look -> click.

Knowledge Tools (4)

| Tool | Description |
|---|---|
| desktop_add_knowledge | Teach the agent facts about specific apps (auto-injected into screenshots) |
| desktop_knowledge_search | Search or list knowledge. Empty call = list all available knowledge |
| desktop_compile_knowledge | Compile session action logs into knowledge facts via LLM |
| desktop_merge_knowledge | Preview or apply merge of compiled facts into existing knowledge |

System Tools (2)

| Tool | Description |
|---|---|
| screenbox_info | Architecture, config, and running desktops overview |
| screenbox_logs | Read action history for a desktop session |

Workflow: screenshot -> look -> click

The recommended interaction pattern:

1. desktop_screenshot("my-desktop")              -- see the full screen
2. desktop_look("my-desktop", cell=5)             -- OCR cell 5 for precise coordinates
3. desktop_click("my-desktop", x=642, y=358)      -- click using coordinates from look

desktop_click returns an image + OCR around the click point by default (observe=true), so you often don't need a separate screenshot after clicking.

How Page Map Works

desktop_chrome(action="page_map") returns semantic page structure -- headings, links, forms -- with viewport coordinates:

{
  "u": "https://github.com",
  "t": "GitHub",
  "v": [1280, 720],
  "n": 42,
  "e": [
    {"i": 1, "t": "a", "l": "Sign in", "r": [1150, 12, 60, 24]},
    {"i": 2, "t": "input", "l": "Search GitHub", "r": [320, 10, 400, 32]},
    {"i": 3, "t": "button", "l": "Search", "r": [730, 10, 50, 32]}
  ]
}

Each element has: index (i), type (t), label (l), and viewport rect (r: [x, y, w, h]). Click the center: desktop_click(x + w/2, y + h/2). No vision model needed -- faster and cheaper than screenshot-based agents.
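
The center calculation can be written out as a short helper. The sketch below reuses the element data from the example above; the center and find_by_label functions are hypothetical helpers for illustration and are not part of the Screenbox API.

```python
from typing import Optional, Tuple

# Elements copied from the page_map example above.
PAGE_MAP = {
    "u": "https://github.com",
    "e": [
        {"i": 1, "t": "a", "l": "Sign in", "r": [1150, 12, 60, 24]},
        {"i": 2, "t": "input", "l": "Search GitHub", "r": [320, 10, 400, 32]},
        {"i": 3, "t": "button", "l": "Search", "r": [730, 10, 50, 32]},
    ],
}


def center(rect) -> Tuple[int, int]:
    """Return the integer center of a viewport rect given as [x, y, w, h]."""
    x, y, w, h = rect
    return x + w // 2, y + h // 2


def find_by_label(page_map: dict, label: str) -> Optional[dict]:
    """Return the first element whose label matches exactly, or None."""
    for element in page_map["e"]:
        if element["l"] == label:
            return element
    return None


sign_in = find_by_label(PAGE_MAP, "Sign in")
print(center(sign_in["r"]))  # (1180, 24)
```

The resulting pair is what an agent would pass as the x and y arguments of desktop_click for that desktop.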

Architecture

MCP Client (Claude, Cursor, any agent)
    |
    | MCP protocol (stdio, streamable-http, or SSE)
    |
Screenbox MCP Server (Python, docker.sock)
    |
    +-- HTTP API (:8080) -- REST + SSE events
    |       |
    |   Dashboard (pure UI, no docker access)
    |       +-- VNC/RDP proxy to desktops
    |       +-- State from MCP SSE events
    |       +-- Screenshots from MCP API
    |
    +-- Desktop 1: Xvnc + xrdp + Chromium + CDP extension
    +-- Desktop 2: ...
    +-- Desktop N: ...
            |
            +-- xrdp (port 3389) -- RDP viewer
            +-- Xvnc (port 5900) -- VNC protocol
            +-- Chrome CDP (port 9222) -- semantics, navigate, eval
            +-- WS bridge (port 8765) -- extension communication

Data & Isolation

Desktops are fully isolated -- no bind mounts between container and host. Files only move through explicit API calls.

~/.screenbox/
  config.json                         # Settings
  desktops/{id}/                      # Desktop metadata
  snapshots/{id}/snapshot-*.tar.gz    # Saved desktop states
  logs/                               # Action logs

Save state before destroying:

Agent: desktop_manage(action="snapshot_save", desktop_id="browser-1", label="logged-into-github")
Agent: desktop_manage(action="destroy", desktop_id="browser-1")

Restore later:

Agent: desktop_manage(action="create", desktop_id="browser-1")
Agent: desktop_manage(action="snapshot_restore", desktop_id="browser-1")

Clone a desktop:

Agent: desktop_manage(action="snapshot_save", desktop_id="template")
Agent: desktop_manage(action="create", desktop_id="worker-1")
Agent: desktop_manage(action="snapshot_restore", desktop_id="worker-1")

Docker Images

Build the desktop container image (setup.sh does this automatically):

docker build -f docker/Dockerfile -t screenbox:latest docker/

| Image | Size | Use case |
|---|---|---|
| screenbox:latest | ~920 MB | Default -- XFCE desktop + Xvnc + xrdp + Chromium |
| screenbox:mate | ~1.7 GB | Full MATE desktop + Chromium + file manager + terminal |

Configuration

~/.screenbox/config.json:

{
  "max_desktops": 5,
  "memory_per_desktop": "2048m",
  "default_viewport": "1920x1080",
  "idle_pause_minutes": 20,
  "lease_ttl": 600,
  "image": "screenbox:latest"
}

| Key | Default | Description |
|---|---|---|
| max_desktops | 5 (3 on macOS/WSL2) | Maximum concurrent desktops |
| memory_per_desktop | 2048m | Docker memory limit per container |
| default_viewport | 1920x1080 | Screen resolution |
| idle_pause_minutes | 20 | Auto-pause inactive desktops (0 = disabled) |
| lease_ttl | 600 | Seconds before acquired desktop auto-releases (0 = no expiry) |
| image | screenbox:latest | Default Docker image for new desktops |
| chrome_args | [] | Extra Chrome launch arguments |
| port_bind_address | 127.0.0.1 | Address to bind container ports |
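
As an illustration of how these defaults could combine with a partial config.json, here is a minimal sketch that overlays user-supplied keys on the documented defaults. The helper names are hypothetical, the platform-specific default of 3 desktops on macOS/WSL2 is not modeled, and Screenbox's own loader may behave differently.

```python
import copy
import json
from pathlib import Path
from typing import Tuple

# Defaults taken from the table above (Linux value for max_desktops).
DEFAULTS = {
    "max_desktops": 5,
    "memory_per_desktop": "2048m",
    "default_viewport": "1920x1080",
    "idle_pause_minutes": 20,
    "lease_ttl": 600,
    "image": "screenbox:latest",
    "chrome_args": [],
    "port_bind_address": "127.0.0.1",
}


def merge_config(overrides: dict) -> dict:
    """Overlay user-supplied keys on a deep copy of the defaults."""
    config = copy.deepcopy(DEFAULTS)
    config.update(overrides)
    return config


def load_config(path: Path) -> dict:
    """Read config.json if present; otherwise return the defaults."""
    if not path.exists():
        return merge_config({})
    return merge_config(json.loads(path.read_text(encoding="utf-8")))


def parse_viewport(value: str) -> Tuple[int, int]:
    """Split a "WIDTHxHEIGHT" string into integers."""
    width, height = value.lower().split("x")
    return int(width), int(height)


config = load_config(Path.home() / ".screenbox" / "config.json")
print(parse_viewport(config["default_viewport"]))
```

With this approach a config.json that sets only max_desktops still yields a complete configuration, because every omitted key falls back to its default.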

Remote Mode (Streamable HTTP)

Run Screenbox as a remote MCP server:

python3 -m screenbox --http
# or
SCREENBOX_TRANSPORT=streamable-http SCREENBOX_PORT=8080 python3 -m screenbox

Connect from any MCP client:

{
  "mcpServers": {
    "screenbox": {
      "url": "http://your-server:8080/mcp"
    }
  }
}

Streamable HTTP is stateless -- survives container restarts without breaking client connections. SSE (--sse, /sse endpoint) is also supported but deprecated.

Docker Compose

./setup.sh           # one-time: generates .env, builds all images
docker compose up -d # start MCP server + dashboard

setup.sh generates an API token, creates data directories, and builds the desktop image. After setup, docker compose up -d is all you need.

The MCP server has direct docker.sock access and acts as the single controller for all desktop operations. The dashboard is a pure UI that proxies everything through the MCP HTTP API.

For reverse proxy setups, see the Docker Compose documentation.

Upgrading

git pull
./setup.sh

setup.sh detects update vs first install automatically. On update it rebuilds all images, restarts services, and tells you to recreate desktops.

After an update, recreate your desktops via the dashboard UI or the API, because existing containers still run the old image.

Old Docker images are preserved (untagged as `<none>`). Only `docker image prune` removes them.

## Requirements

- Docker 20.10+
- Python 3.10+
- 2 GB RAM per desktop (minimum)
- `--shm-size=512m` for Chrome (handled automatically)

## vs Alternatives

| | Screenbox | Browserbase | Browser MCP | Computer Use |
|---|-----------|-------------|-------------|--------------|
| Full desktop | Yes | No (browser only) | No (bridge) | Yes (cloud) |
| Self-hosted | Yes | No (SaaS) | Yes | No |
| MCP-native | Yes | Yes | Yes | No |
| Container isolation | Yes | Cloud | No | Cloud |
| Persistent state | Yes (snapshots) | No | Shared browser | No |
| Observable (live) | Yes (RDP/VNC) | No | No | No |
| Open source | AGPL-3.0 | Partial | Yes | No |
| Semantic map | Yes (DOM) | Yes (AI) | No | No (vision) |

## License

AGPL-3.0 -- see [LICENSE](LICENSE)

## Links

- Website: [screenbox.dev](https://screenbox.dev)