Enables AI-powered web browsing automation using Google's Gemini 2.5 Computer Use API to navigate websites, click buttons, fill forms, and interact with web pages through intelligent visual analysis and browser control
Gemini Web Automation MCP
Production-ready Model Context Protocol (MCP) server providing AI-powered web browsing automation using Google's Gemini 2.5 Computer Use API. Built with FastMCP and optimized for 4-5x faster performance than baseline implementations.
Overview
Gemini Web Automation MCP enables Claude Desktop and other MCP clients to perform intelligent web browsing automation. The AI agent navigates websites, clicks buttons, fills forms, searches for information, and interacts with web pages like a human user.
Key Statistics:
7 MCP tools for comprehensive browser control
13 browser actions supported (click, type, scroll, navigate, etc.)
4-5x faster than naive implementations
90% context reduction with compact progress mode
1440x900 optimized resolution (Gemini recommended)
Features
Core Capabilities
7 Production-Ready MCP Tools:
browse_web - Synchronous web browsing with immediate completion
start_web_task - Start long-running tasks in background
check_web_task - Monitor progress with compact/full modes
wait - Intelligent rate limiting (1-60 seconds)
stop_web_task - Cancel running background tasks
list_web_tasks - View all active and completed tasks
get_web_screenshots - Retrieve session screenshots for verification
Advanced Features:
Real-time progress tracking with timestamped events
Automatic screenshot capture at each step
Safety decision framework (Gemini safety controls)
Context-aware polling with recommended delays
Background task management with status tracking
Session-based screenshot organization
Technical Highlights
MCP Protocol Compliant: Follows 2025 Model Context Protocol best practices
Performance Optimized: Conditional wait states (0.3-3s), fast page loads, single screenshot per turn
Safety-First: Implements Gemini's safety decision acknowledgment framework
Production Ready: Comprehensive error handling, logging, and validation
User-Friendly: Action-oriented tool names, clear descriptions, helpful examples
Quick Start
Prerequisites
Python 3.10 or higher
UV package manager (Install UV)
Gemini API key (Get one here)
Installation (3 Steps)
1. Clone and install dependencies:
2. Configure environment:
3. Test the server:
The MCP Inspector will open at http://localhost:6274, where you can test all tools interactively.
Verification
Run validation tests to ensure everything is set up correctly:
All tests should pass with ✓ markers.
Installation
Method 1: Claude Desktop
Add to your Claude Desktop config (~/Library/Application Support/Claude/claude_desktop_config.json on macOS):
Important: Replace /absolute/path/to/ with your actual path. Use pwd in the repository root to get the full path. Restart Claude Desktop after editing the config.
Method 2: Direct Clone
Method 3: UVX (Future)
Once published to PyPI:
Configuration
Create a .env file in the project root with the following variables:
Required Configuration
Optional Configuration
Note: Screen resolution of 1440x900 is optimized for Gemini's Computer Use model. Other resolutions may degrade performance.
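As a rough illustration of how these settings might be read, here is a minimal sketch. GEMINI_API_KEY and HEADLESS are the variable names used elsewhere in this README; SCREEN_WIDTH, SCREEN_HEIGHT, the Settings class, and the headless-by-default behavior are assumptions made for the example, not the project's confirmed configuration.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    api_key: str
    headless: bool = True       # assumption: hidden unless HEADLESS=false
    screen_width: int = 1440    # recommended resolution from this README
    screen_height: int = 900


def load_settings(env: Mapping[str, str]) -> Settings:
    """Build Settings from an environment-like mapping such as os.environ."""
    api_key = env.get("GEMINI_API_KEY", "").strip()
    if not api_key:
        raise ValueError("GEMINI_API_KEY environment variable not set")
    # HEADLESS=false shows the browser window; any other value keeps it hidden.
    headless = env.get("HEADLESS", "true").strip().lower() != "false"
    # SCREEN_WIDTH / SCREEN_HEIGHT are hypothetical names for this sketch.
    width = int(env.get("SCREEN_WIDTH", "1440"))
    height = int(env.get("SCREEN_HEIGHT", "900"))
    return Settings(api_key, headless, width, height)
```

Passing a mapping instead of reading os.environ directly keeps the function easy to test; in real use, call load_settings(os.environ) after the .env file has been loaded.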
Usage
Synchronous Workflow
For quick tasks that complete in under 10 seconds, use the synchronous browse_web tool:
Asynchronous Workflow
For long-running tasks (15+ seconds), use the async workflow to monitor progress:
Step 1: Start the task
Step 2: Check progress
Step 3: Get final result
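The three steps can be sketched as a small polling loop. This is a minimal sketch, not the project's client code: check and wait are caller-supplied stand-ins for the check_web_task and wait tools, and poll_until_done, delay_seconds, and max_checks are hypothetical names. The status values are the ones documented for check_web_task.

```python
from typing import Any, Callable, Dict

# Statuses after which no further polling is needed.
TERMINAL = {"completed", "failed", "cancelled"}


def poll_until_done(
    check: Callable[[str], Dict[str, Any]],
    wait: Callable[[int], Any],
    task_id: str,
    delay_seconds: int = 5,
    max_checks: int = 60,
) -> Dict[str, Any]:
    """Poll a background task until it reaches a terminal status."""
    for _ in range(max_checks):
        status = check(task_id)
        # Stop on a tool-level error or on any terminal status.
        if not status.get("ok", False) or status.get("status") in TERMINAL:
            return status
        wait(delay_seconds)  # "pending" or "running": pause, then check again
    raise TimeoutError(f"task {task_id} still running after {max_checks} checks")
```

The caller starts the task with start_web_task, passes the returned task_id to this loop, and reads the result field of the final response when the status is "completed".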
All 7 MCP Tools
1. browse_web
Synchronous web browsing that waits for task completion.
Parameters:
task (string, required): Natural language description of what to accomplish
url (string, optional): Starting URL (defaults to Google)
Returns:
ok (boolean): Success status
data (string): Task completion message with results
session_id (string): Unique session identifier
screenshot_dir (string): Path to saved screenshots
progress (array): Full progress history with timestamps
error (string): Error message if task failed
Example:
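One way to picture a call is the MCP tools/call request an MCP client sends on the user's behalf. The sketch below builds that JSON-RPC message for browse_web; the helper name browse_web_request and the sample task text are illustrative assumptions, while the task and url arguments come from the parameter list above.

```python
import json
from typing import Any, Dict, Optional


def browse_web_request(task: str, url: Optional[str] = None, request_id: int = 1) -> Dict[str, Any]:
    """Build an MCP tools/call request for the browse_web tool."""
    if not task.strip():
        raise ValueError("task must be a non-empty description")
    arguments: Dict[str, Any] = {"task": task}
    if url is not None:
        arguments["url"] = url  # when omitted, the server uses its default start page
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "browse_web", "arguments": arguments},
    }


# Illustrative task and URL only.
print(json.dumps(browse_web_request("Find the page title", "https://example.com"), indent=2))
```

In Claude Desktop this message is produced automatically; writing it out by hand is mainly useful when testing the server with a custom client.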
2. start_web_task
Start a long-running web task in the background.
Parameters:
task (string, required): Natural language task description
url (string, optional): Starting URL
Returns:
ok (boolean): Task started successfully
task_id (string): Unique ID for checking progress
status (string): Always "running" initially
message (string): Instructions for checking progress
Example:
3. check_web_task
Monitor progress of a background task.
Parameters:
task_id (string, required): Task ID from start_web_task
compact (boolean, optional): Return summary only (default: true)
Returns (compact mode):
ok (boolean): Success status
task_id (string): Task identifier
status (string): "pending" | "running" | "completed" | "failed" | "cancelled"
progress_summary (object): Recent actions and total steps
result (object): Task results (when completed)
error (string): Error message (when failed)
recommended_poll_after (string): ISO timestamp for next check
Example:
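A client can turn recommended_poll_after into a value for the wait tool. The following is a minimal sketch under stated assumptions: the helper name seconds_until is hypothetical, timestamps without a timezone are treated as UTC, and the result is clamped to the 1-60 second range that wait accepts.

```python
import math
from datetime import datetime, timezone


def seconds_until(poll_after: str, now: datetime) -> int:
    """Seconds to pass to wait before the next check, clamped to 1-60."""
    # Python 3.10's fromisoformat does not accept a trailing "Z".
    target = datetime.fromisoformat(poll_after.replace("Z", "+00:00"))
    if target.tzinfo is None:
        target = target.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    remaining = (target - now).total_seconds()
    return max(1, min(60, math.ceil(remaining)))
```

Here now must be timezone-aware, for example datetime.now(timezone.utc). A timestamp already in the past yields the minimum wait of 1 second.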
4. wait
Pause execution for rate limiting and polling delays.
Parameters:
seconds (integer, required): Wait time in seconds (1-60)
Returns:
ok (boolean): Success status
waited_seconds (integer): Actual wait time
message (string): Confirmation message
Example:
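The behavior described above can be approximated in a few lines. This is a sketch of the documented contract, not the server's source: the injectable sleep parameter, the message wording, and the error field for out-of-range input are assumptions.

```python
import time
from typing import Any, Callable, Dict


def wait(seconds: int, sleep: Callable[[float], None] = time.sleep) -> Dict[str, Any]:
    """Pause for `seconds` (1-60) and report the outcome in the {"ok": ...} shape."""
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(seconds, bool) or not isinstance(seconds, int) or not 1 <= seconds <= 60:
        return {"ok": False, "error": "seconds must be an integer between 1 and 60"}
    sleep(seconds)
    return {"ok": True, "waited_seconds": seconds, "message": f"Waited {seconds} seconds"}
```

Accepting sleep as a parameter lets tests substitute a no-op instead of actually pausing.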
5. stop_web_task
Cancel a running background task.
Parameters:
task_id (string, required): Task ID to cancel
Returns:
ok (boolean): Cancellation success
message (string): Confirmation message
task_id (string): Cancelled task ID
error (string): Error if task not found
Example:
6. list_web_tasks
View all tasks (active and completed).
Parameters: None
Returns:
ok (boolean): Success status
tasks (array): List of task status objects
count (integer): Total number of tasks
active_count (integer): Number of running tasks
Example:
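The relationship between the returned fields can be shown with a short sketch. The function name summarize_tasks and the sample task objects are hypothetical; the response keys and the meaning of active_count (tasks whose status is "running") follow the list above.

```python
from typing import Any, Dict, List


def summarize_tasks(tasks: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Assemble a list_web_tasks-style response from task status objects."""
    active = sum(1 for task in tasks if task.get("status") == "running")
    return {"ok": True, "tasks": tasks, "count": len(tasks), "active_count": active}


sample = [
    {"task_id": "a", "status": "running"},
    {"task_id": "b", "status": "completed"},
]
print(summarize_tasks(sample)["active_count"])  # 1
```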
7. get_web_screenshots
Retrieve screenshots from a completed session.
Parameters:
session_id (string, required): Session ID from browse_web or check_web_task
Returns:
ok (boolean): Success status
screenshots (array): List of screenshot file paths
session_id (string): Session identifier
count (integer): Number of screenshots
error (string): Error if session not found
Example:
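A lookup of this kind might be implemented roughly as follows. The directory layout (one folder per session containing PNG files), the base_dir parameter, and the function name list_screenshots are assumptions for illustration; the README does not document how screenshots are stored on disk.

```python
from pathlib import Path
from typing import Any, Dict


def list_screenshots(base_dir: Path, session_id: str) -> Dict[str, Any]:
    """Return the screenshot paths for one session, sorted by file name."""
    session_dir = base_dir / session_id  # hypothetical layout: <base_dir>/<session_id>/
    if not session_dir.is_dir():
        return {"ok": False, "error": f"Session not found: {session_id}"}
    files = sorted(str(path) for path in session_dir.glob("*.png"))
    return {"ok": True, "screenshots": files, "session_id": session_id, "count": len(files)}
```

Sorting by file name returns the steps in capture order only if the file names sort chronologically, which is also an assumption here.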
Architecture
How It Works
Task Submission: MCP client sends natural language task + optional URL
Browser Launch: Playwright launches Chromium (1440x900 viewport)
Gemini Loop: Screenshot → Gemini vision analysis → Browser actions → Screenshot (repeat)
Completion: Gemini returns text response when task is complete
Cleanup: Browser closed, screenshots saved to session directory
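The loop described in these steps can be sketched with the browser and the model abstracted behind callables. This is a simplified outline, not the code in browser_agent.py: run_agent, take_screenshot, ask_model, execute, and the turns field are hypothetical names, and the default of 30 turns mirrors the limit mentioned in this README.

```python
from typing import Callable, List, Optional, Tuple

# A model step returns (actions, final_text); final_text is set only when the task is done.
ModelStep = Callable[[str, bytes], Tuple[List[dict], Optional[str]]]


def run_agent(
    task: str,
    take_screenshot: Callable[[], bytes],
    ask_model: ModelStep,
    execute: Callable[[dict], None],
    max_turns: int = 30,
) -> dict:
    """Screenshot, ask the model, run its actions, and repeat until it answers in text."""
    for turn in range(1, max_turns + 1):
        screenshot = take_screenshot()          # one screenshot per turn
        actions, final_text = ask_model(task, screenshot)
        if final_text is not None:              # a text reply signals completion
            return {"ok": True, "data": final_text, "turns": turn}
        for action in actions:
            execute(action)                     # e.g. click, type, scroll
    return {"ok": False, "error": f"Reached max_turns ({max_turns})", "turns": max_turns}
```

Injecting the three callables keeps the control flow testable without Playwright or an API key; a real implementation would also close the browser and save screenshots in a finally block.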
Browser Actions
The Gemini Computer Use API supports 13 browser automation actions:
Action | Description | Example Use |
| Go to a URL | Navigate to |
| Click at normalized coordinates (0-999) | Click button at position (500, 300) |
| Hover at normalized coordinates | Hover over menu item |
| Type text at coordinates | Type search query in input field |
| Press keyboard combinations | Press Enter, Ctrl+A, etc. |
| Page-level scrolling | Scroll down one page |
| Scroll at specific coordinates | Scroll within specific element |
| Drag from source to destination | Reorder list items |
| Navigate backward in history | Return to previous page |
| Navigate forward in history | Go to next page |
| Navigate to Google search | Start a Google search |
| Wait 5 seconds | Wait for dynamic content |
| No-op (browser already open) | - |
Coordinate System: Gemini uses a normalized 1000x1000 grid. The agent automatically converts to actual pixels based on your screen resolution.
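The conversion amounts to scaling each axis by the screen size. A minimal sketch, assuming the default 1440x900 viewport and a hypothetical helper name denormalize; integer arithmetic is used so the result does not depend on floating-point rounding.

```python
from typing import Tuple


def denormalize(x: int, y: int, width: int = 1440, height: int = 900) -> Tuple[int, int]:
    """Map a point on the 0-999 grid to pixel coordinates."""
    if not (0 <= x <= 999 and 0 <= y <= 999):
        raise ValueError("normalized coordinates must be in the range 0-999")
    return x * width // 1000, y * height // 1000


print(denormalize(500, 300))  # (720, 270)
```

With the default viewport, the click at position (500, 300) from the table above lands on pixel (720, 270).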
Performance
Benchmark Results
Task Type | Duration | Turns | Description |
Simple | 5-8 seconds | 1-2 | Navigate to a page, verify content |
Medium | 20-40 seconds | 5-8 | Search, click, extract information |
Complex | 60-120 seconds | 15-30 | Multi-step workflows, data compilation |
Optimization Strategies
Before Optimization:
6 seconds wait after every action
networkidle page loads (waits for all network requests)
3 screenshots captured per turn
Sequential action execution
After Optimization:
0.3-3 seconds conditional waits (navigation only)
domcontentloaded page loads (DOM ready)
1 screenshot per turn (reused for all responses)
Parallel function execution when possible
Performance Improvement: 4-5x faster execution
Latency Breakdown (Per Turn)
Gemini API call: 2-4 seconds (vision processing)
Browser action: 0.3-3 seconds (optimized waits)
Screenshot capture: <0.5 seconds
Total per turn: ~3-8 seconds
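The per-turn figure gives a quick way to bound how long a task may take. The sketch below simply multiplies the approximate 3-8 second range by a turn count; estimate_duration is a hypothetical helper and the result is a coarse envelope, not a measurement.

```python
from typing import Tuple

PER_TURN_SECONDS = (3, 8)  # "Total per turn: ~3-8 seconds"


def estimate_duration(turns: int) -> Tuple[int, int]:
    """Return (low, high) bounds in seconds for a task of the given number of turns."""
    if turns < 0:
        raise ValueError("turns must be non-negative")
    low, high = PER_TURN_SECONDS
    return turns * low, turns * high


print(estimate_duration(5))  # (15, 40)
```

For a 5-turn task the bounds are 15 to 40 seconds, which brackets most of the 20-40 second range reported for medium tasks in the benchmark table.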
Best Practices
Task Design
1. Be Specific and Clear
2. Choose the Right Tool
3. Use Appropriate Polling
Error Handling
4. Handle All Status States
5. Use Compact Mode
Security
6. Review Screenshots
Always check saved screenshots to verify agent behavior, especially for sensitive operations.
7. Environment Variables
Never commit API keys. Always use environment variables or secure vaults.
8. Rate Limiting
Use the wait tool to respect rate limits and avoid overwhelming services.
9. Domain Validation
Be cautious with user-provided URLs. Consider implementing domain allowlists for production.
10. Logging and Audit
All actions are logged with timestamps. Review logs for debugging and compliance.
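For the domain validation point, an allowlist check could look like the sketch below. It is not part of the server (domain allowlists appear on the roadmap as a planned feature); is_allowed and the example domains are hypothetical.

```python
from typing import Iterable
from urllib.parse import urlparse


def is_allowed(url: str, allowed_domains: Iterable[str]) -> bool:
    """Accept only http(s) URLs whose host is an allowed domain or one of its subdomains."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    host = parsed.hostname.lower()
    domains = [domain.lower() for domain in allowed_domains]
    # Match the exact host or a dot-separated subdomain, never a bare suffix.
    return any(host == domain or host.endswith("." + domain) for domain in domains)
```

Comparing against "." + domain rather than the bare suffix keeps a host such as evil-example.com from matching an allowlist entry for example.com.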
Troubleshooting
Common Issues
Error: 400 INVALID_ARGUMENT (safety decision)
Solution: This is fixed in v1.0.0. Update to the latest version.
Context window filling up too fast
Solution: Use compact=true in check_web_task (default behavior).
Tasks timing out at 30 turns
Solution: Break complex tasks into smaller subtasks or increase max_turns in browser_agent.py.
Browser not visible during execution
Solution: Set HEADLESS=false in .env to see the browser window.
"No module named 'mcp'" error
Solution: Activate virtual environment and run uv sync.
"GEMINI_API_KEY environment variable not set"
Solution: Create .env file with your API key or set it in Claude Desktop config.
"Executable doesn't exist" (Playwright)
Solution: Run playwright install chromium.
MCP server not showing in Claude Desktop
Solution: Verify absolute paths in config, ensure uv is in PATH, restart Claude Desktop completely.
FAQ
Q: Can I use both synchronous and asynchronous workflows?
A: Yes! Use browse_web for quick tasks (<10s) and start_web_task for long tasks (>15s).
Q: What happens to old completed tasks?
A: Tasks auto-cleanup after 24 hours to free memory.
Q: Can I check progress of a synchronous task?
A: No, but the response includes full progress history after completion.
Q: How many tasks can run simultaneously?
A: Unlimited. Each task runs in its own browser instance and thread.
Q: Does Claude automatically poll async tasks?
A: No, Claude must manually call check_web_task() multiple times with wait() between calls.
Q: Can I cancel a task mid-execution?
A: Yes, use stop_web_task(task_id) to cancel any running task.
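The 24-hour cleanup mentioned above can be pictured with a small sketch. The in-memory dictionary, the finished_at field, and the function name prune_finished are assumptions for illustration; the README does not describe how task_manager.py stores or expires tasks.

```python
from datetime import datetime, timedelta
from typing import Dict


def prune_finished(
    tasks: Dict[str, dict],
    now: datetime,
    max_age: timedelta = timedelta(hours=24),
) -> Dict[str, dict]:
    """Keep unfinished tasks and tasks that finished less than max_age ago."""
    kept = {}
    for task_id, task in tasks.items():
        finished_at = task.get("finished_at")  # hypothetical field; None while still running
        if finished_at is None or now - finished_at < max_age:
            kept[task_id] = task
    return kept
```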
Development
Setup for Contributors
1. Fork and clone:
2. Install dependencies:
3. Create
4. Test your setup:
Testing
Manual Testing:
Validation Tests:
Coding Standards
Follow PEP 8 style guide
Use type hints for all function signatures
Write docstrings for public functions/classes
Keep functions focused and small
Use meaningful variable names
Example Function:
Contributing
How to Contribute:
Fork the repository
Create a feature branch: git checkout -b feature/amazing-feature
Make your changes following coding standards
Test your changes with MCP Inspector
Update documentation if needed
Commit with clear message: git commit -m "Add: Brief description"
Push to your fork: git push origin feature/amazing-feature
Open a Pull Request
Commit Message Prefixes:
Add: - New features
Fix: - Bug fixes
Update: - Updates to existing features
Refactor: - Code refactoring
Docs: - Documentation changes
Test: - Adding or updating tests
Pull Request Checklist:
All tests pass
Documentation updated
Code follows style guidelines
No breaking changes (or clearly documented)
Commit messages are clear
Tool Design Principles
When adding or modifying MCP tools:
User-focused naming: browse_web not execute_browser_automation
Clear descriptions: Explain what users accomplish, not technical details
Action-oriented: Use verbs (check, start, stop, wait)
Proper validation: Validate inputs and provide helpful error messages
Consistent responses: Always return {"ok": bool, ...} format
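One way to keep responses consistent is to build them through a pair of small helpers. This is a sketch of the convention, with hypothetical names ok_response and error_response; the error key matches the field used in the tool descriptions in this README.

```python
from typing import Any, Dict


def ok_response(**fields: Any) -> Dict[str, Any]:
    """Success payload: the caller's fields plus ok=True."""
    return {**fields, "ok": True}


def error_response(message: str, **fields: Any) -> Dict[str, Any]:
    """Failure payload: ok=False with a human-readable error message."""
    return {**fields, "ok": False, "error": message}


print(ok_response(task_id="t1", status="running"))
print(error_response("Task not found", task_id="t1"))
```

Placing ok and error after the caller's fields means they cannot be overwritten by accident.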
Deployment
Pre-Deployment Checklist
Project Files:
README.md with comprehensive documentation
LICENSE (MIT)
CHANGELOG.md
pyproject.toml with proper metadata
.gitignore with comprehensive rules
.env.sample template
Core files (server.py, browser_agent.py, task_manager.py)
Code Quality:
All tests passing
Safety decision bug fixed
Compact progress mode implemented
MCP best practices followed
Performance optimizations applied
GitHub Release Process
1. Create GitHub Repository:
2. Push to GitHub:
3. Create Release Tag:
4. Create GitHub Release:
Go to Releases → Create a new release
Choose tag: v1.0.0
Release title: v1.0.0 - Initial Production Release
Description: Copy from CHANGELOG.md
Publish release
Repository Settings
Configure:
Description: Production-ready MCP server for AI-powered web automation
Topics: mcp, gemini, browser-automation, claude-desktop, ai-agents, playwright
Enable Issues for bug reports
Enable Discussions for community support
Distribution Methods
Method 1: Direct Git Clone (Recommended)
Method 2: UVX (Future - when published to PyPI)
Roadmap
Planned Features
Priority 1: Security Enhancements
Human-in-the-loop confirmation UI for safety decisions
Domain allowlist/blocklist for navigation
Input sanitization and validation
Container-based sandboxing for production
Priority 2: Reliability
Retry logic with exponential backoff
Better error recovery mechanisms
Network resilience improvements
Connection timeout handling
Priority 3: Functionality
Cookie and session management
Form auto-fill templates
Multi-tab support
Mobile viewport emulation
Priority 4: Developer Experience
Proxy support for corporate environments
Custom wait conditions
Advanced screenshot comparison
Performance profiling tools
Known Limitations
Maximum 30 turns per task (configurable but not recommended to increase)
Browser automation only (no desktop OS-level control)
Single browser instance per task (no tab switching)
Limited to Chromium (Firefox/WebKit not supported)
Safety confirmations not yet implemented (future enhancement)
Changelog
[1.0.0] - 2025-01-17
Initial production-ready release
Added:
7 MCP tools for comprehensive browser automation
Real-time progress tracking with compact mode (90% size reduction)
Safety decision framework (fixes 400 INVALID_ARGUMENT error)
Context-aware polling with recommended delay timestamps
Performance optimizations (4-5x faster than baseline)
Automatic screenshot capture at each step
Background task management with status tracking
Comprehensive documentation and examples
Performance:
Conditional wait states (0.3-3s vs 6s per action)
Fast page loads (domcontentloaded vs networkidle)
Single screenshot per turn (eliminates duplicates)
Parallel function execution for batch operations
Security:
Safety decision acknowledgment implementation
Environment-based API key configuration
Comprehensive logging for audit trails
Screenshot-based verification capability
Security
Safety Decision Framework
This MCP server implements Gemini's safety decision framework to prevent:
Financial transactions without confirmation
Sensitive data access without review
System-level changes without approval
CAPTCHA bypassing attempts
Potentially harmful actions
Best Practices
Sandboxing: Run in containerized environment for production deployment
API Key Security: Use environment variables, never commit keys to version control
Rate Limiting: Respect built-in polling delays (recommended 5 seconds)
Human-in-Loop: Review outputs before taking actions based on results
Logging: All actions are logged with timestamps for audit trail
Screenshot Review: Check saved screenshots to verify agent behavior
Reporting Vulnerabilities
For security vulnerabilities, please email security@example.com with:
Description of the vulnerability
Steps to reproduce
Potential impact
Suggested fix (if available)
Do not open public issues for security vulnerabilities.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
Built with excellence using:
Google Gemini Team - Gemini 2.5 Computer Use API
Anthropic - Model Context Protocol specification
FastMCP - Excellent MCP server framework
Playwright - Robust browser automation library
Support
Get Help
Documentation: You're reading it!
GitHub Issues: Report bugs or request features
GitHub Discussions: Ask questions and share ideas
Useful Links
Community
Share your use cases and automations
Contribute improvements and bug fixes
Help others in GitHub Discussions
Star the repository if you find it useful
Built with Gemini 2.5 Computer Use API
MCP Protocol 2025 | Production Ready | Open Source
local-only server
The server can only run on the client's local machine because it depends on local resources.