Supports configuration via .env files for settings like debug mode, parser URL, and log levels
Integrates with GitHub for repository management, including cloning repositories and accessing project files
Provides testing infrastructure for validating OmniMCP functionality
Built on Python with async support for UI automation and interaction
Implements rich debug information and context visualization for UI automation
Includes linting support using Ruff for code quality assurance
OmniMCP
OmniMCP provides rich UI context and interaction capabilities to AI models through the Model Context Protocol (MCP) and microsoft/OmniParser. It focuses on enabling deep understanding of user interfaces through visual analysis, structured planning, and precise interaction execution.
Core Features
- Visual Perception: Understands UI elements using OmniParser.
- LLM Planning: Plans next actions based on goal, history, and visual state.
- Agent Executor: Orchestrates the perceive-plan-act loop (omnimcp/agent_executor.py).
- Action Execution: Controls mouse/keyboard via pynput (omnimcp/input.py).
- CLI Interface: Simple entry point (cli.py) for running tasks.
- Auto-Deployment: Optional OmniParser server deployment to AWS EC2 with auto-shutdown.
- Debugging: Generates timestamped visual logs per step.
Overview
cli.py uses AgentExecutor to run a perceive-plan-act loop. It captures the screen (VisualState), plans using an LLM (core.plan_action_for_ui), and executes actions (InputController).
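The perceive-plan-act loop described above can be sketched in a few lines. This is a minimal illustration, not the code in omnimcp/agent_executor.py: the names Action, LoopResult, run_loop, and demo are hypothetical, and the three callables stand in for VisualState, core.plan_action_for_ui, and InputController, whose real signatures are not shown here.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional


@dataclass
class Action:
    """Hypothetical planner output: the next action, or a signal that the goal is met."""
    kind: str                      # e.g. "click", "type", or "done"
    target: Optional[str] = None   # element to act on, if any
    text: Optional[str] = None     # text to type, if any


@dataclass
class LoopResult:
    steps: int                     # number of actions actually executed
    completed: bool                # True if the planner reported "done"
    history: List[Action] = field(default_factory=list)


def run_loop(
    goal: str,
    perceive: Callable[[], Any],
    plan: Callable[[str, List[Action], Any], Action],
    act: Callable[[Action], None],
    max_steps: int = 10,
) -> LoopResult:
    """Repeat perceive -> plan -> act until the planner says "done" or steps run out."""
    history: List[Action] = []
    for step in range(1, max_steps + 1):
        state = perceive()                    # stand-in for screenshot + parse
        action = plan(goal, history, state)   # stand-in for the LLM planner
        if action.kind == "done":
            return LoopResult(steps=step - 1, completed=True, history=history)
        act(action)                           # stand-in for mouse/keyboard input
        history.append(action)
    return LoopResult(steps=max_steps, completed=False, history=history)


def demo() -> LoopResult:
    """Drive the loop with stubs that 'type' 5*9= one key per step (no real I/O)."""
    keys = list("5*9=")
    typed: List[str] = []

    def perceive() -> dict:
        return {"typed": "".join(typed)}

    def plan(goal: str, history: List[Action], state: dict) -> Action:
        n = len(history)
        return Action("type", text=keys[n]) if n < len(keys) else Action("done")

    def act(action: Action) -> None:
        typed.append(action.text)

    return run_loop("compute 5*9", perceive, plan, act)
```

Passing the three stages in as callables keeps the loop testable without a screen or an LLM; the max_steps bound is an assumption added so a planner that never reports completion cannot run forever.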
Demos
- Real Action (Calculator): python cli.py opens Calculator and computes 5*9.
- Synthetic UI (Login): python demo_synthetic.py uses generated images (no real I/O). (Note: Pending refactor to use AgentExecutor.)
Prerequisites
- Python >=3.10, <3.13
- uv installed (pip install uv)
- Linux Runtime Requirement: Requires an active graphical session (X11/Wayland) for pynput. May need system libraries (libx11-dev, etc.); see the pynput docs.
(macOS display scaling dependencies are handled automatically during installation.)
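The prerequisites above can be checked before a run with a short script. This is a rough sketch, not part of OmniMCP: the function names are hypothetical, and looking for the DISPLAY or WAYLAND_DISPLAY environment variables is only a heuristic for "an active graphical session", so it can pass even when pynput would still fail.

```python
import os
import sys


def python_supported(version=sys.version_info[:2]) -> bool:
    """True when the interpreter satisfies the stated range: >=3.10, <3.13."""
    return (3, 10) <= tuple(version) < (3, 13)


def has_graphical_session(platform=sys.platform, env=os.environ) -> bool:
    """On Linux, look for an X11 or Wayland session; other platforms are not checked."""
    if not platform.startswith("linux"):
        return True
    return bool(env.get("DISPLAY") or env.get("WAYLAND_DISPLAY"))


if __name__ == "__main__":
    if not python_supported():
        sys.exit("Python >=3.10, <3.13 is required")
    if not has_graphical_session():
        sys.exit("No X11/Wayland session detected; pynput needs one on Linux")
    print("Prerequisite checks passed")
```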
For AWS Deployment Features
Requires AWS credentials in .env (see .env.example). Warning: Creates AWS resources (EC2, Lambda, etc.) incurring costs. Use python -m omnimcp.omniparser.server stop to clean up.
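Because deployment creates billable resources, it can help to confirm that .env actually contains credentials before starting. The sketch below is an illustration only: parse_env and missing_keys are hypothetical helpers, and the key names AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are the standard names used by AWS SDKs, assumed here rather than taken from .env.example, which remains the authoritative list.

```python
from typing import Dict, Iterable, List

# Assumed key names; check .env.example for the ones the project expects.
REQUIRED_KEYS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding quotes if the value was quoted.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: Dict[str, str], required: Iterable[str] = REQUIRED_KEYS) -> List[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]
```

For example, missing_keys(parse_env(open(".env").read())) returns an empty list when both assumed keys are set. This parser handles only the simplest .env syntax; a dedicated library is a better fit for anything beyond that.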
Installation
Quick Start
Ensure environment is activated and .env is configured.
Debug outputs are saved in runs/<timestamp>/.
Note on MCP Server: An experimental MCP server (OmniMCP class in omnimcp/mcp_server.py) exists but is separate from the primary cli.py/AgentExecutor workflow.
Architecture
- CLI (cli.py) - Entry point, setup, starts Executor.
- Agent Executor (omnimcp/agent_executor.py) - Orchestrates loop, manages state/artifacts.
- Visual State Manager (omnimcp/visual_state.py) - Perception (screenshot, calls parser).
- OmniParser Client & Deploy (omnimcp/omniparser/) - Manages OmniParser server communication/deployment.
- LLM Planner (omnimcp/core.py) - Generates action plan.
- Input Controller (omnimcp/input.py) - Executes actions (mouse/keyboard).
- (Optional) MCP Server (omnimcp/mcp_server.py) - Experimental MCP interface.
Development
Environment Setup & Checks
Debug Support
Running python cli.py saves timestamped runs in runs/, including:
- step_N_state_raw.png
- step_N_state_parsed.png (with element boxes)
- step_N_action_highlight.png (with action highlight)
- final_state.png
Detailed logs are in logs/run_YYYY-MM-DD_HH-mm-ss.log (LOG_LEVEL=DEBUG in .env recommended).
(Note: Details like timings, counts, IPs, instance IDs, and specific plans will vary)
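The file naming described in this section can be reproduced with a small helper, which is handy when writing scripts that collect or inspect run artifacts. This is a sketch under assumptions: step_artifacts and log_path are hypothetical names, N is taken to be the integer step index, and the format of the per-run directory name under runs/ is not specified above, so it is passed in rather than generated.

```python
from datetime import datetime
from pathlib import Path
from typing import Dict


def step_artifacts(run_dir: Path, step: int) -> Dict[str, Path]:
    """Paths of the three per-step images listed above for a given step number."""
    prefix = f"step_{step}"
    return {
        "raw": run_dir / f"{prefix}_state_raw.png",
        "parsed": run_dir / f"{prefix}_state_parsed.png",        # with element boxes
        "highlight": run_dir / f"{prefix}_action_highlight.png",  # with action highlight
    }


def log_path(started: datetime, logs_dir: Path = Path("logs")) -> Path:
    """Log file name in the run_YYYY-MM-DD_HH-mm-ss.log pattern."""
    return logs_dir / started.strftime("run_%Y-%m-%d_%H-%M-%S.log")
```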
Roadmap & Limitations
Key limitations & future work areas:
- Performance: Reduce OmniParser latency (explore local models, caching, etc.) and optimize state management (avoid full re-parse).
- Robustness: Improve LLM planning reliability (prompts, techniques like ReAct), add action verification/error recovery, enhance element targeting.
- Target API/Architecture: Evolve towards a higher-level declarative API (e.g., @omni.publish style) and potentially integrate loop logic with the experimental MCP Server (OmniMCP class).
- Consistency: Refactor demo_synthetic.py to use AgentExecutor.
- Features: Expand action space (drag/drop, hover).
- Testing: Add E2E tests, broaden cross-platform validation, define evaluation metrics.
- Research: Explore fine-tuning, process graphs (RAG), framework integration.
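As one illustration of the caching idea in the Performance item, a parse result can be reused when the screen has not changed between steps. This is a hypothetical sketch, not a planned design: ParseCache is an invented name, the parse callable stands in for a call to the OmniParser server, and hashing the raw screenshot bytes only helps when two captures are byte-identical.

```python
import hashlib
from typing import Callable, Dict, List


class ParseCache:
    """Reuse parser output for screenshots whose bytes are identical."""

    def __init__(self, parse: Callable[[bytes], List[dict]]):
        self._parse = parse                      # stand-in for the slow parser call
        self._results: Dict[str, List[dict]] = {}
        self.misses = 0                          # how often the parser was invoked

    def get(self, screenshot: bytes) -> List[dict]:
        key = hashlib.sha256(screenshot).hexdigest()
        if key not in self._results:
            self.misses += 1
            self._results[key] = self._parse(screenshot)
        return self._results[key]
```

The cache here is unbounded and exact-match only; tolerating small pixel differences or re-parsing only a changed region would need a different approach.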
Project Status
Core loop via cli.py/AgentExecutor is functional for basic tasks. Performance and robustness need significant improvement. MCP integration is experimental.
Contributing
- Fork repository
- Create feature branch
- Implement changes & add tests
- Ensure checks pass (uv run ruff format ., uv run ruff check . --fix, uv run pytest tests/)
- Submit pull request
License
MIT License
Contact
- Issues: GitHub Issues
- Questions: Discussions
- Security: security@openadapt.ai