OmniMCP

by OpenAdaptAI

hybrid server

The server can run either locally or remotely, depending on how it is configured.

Integrations

  • Supports configuration via .env files for settings like debug mode, parser URL, and log levels

  • Integrates with GitHub for repository management, including cloning repositories and accessing project files

  • Provides testing infrastructure for validating OmniMCP functionality

OmniMCP

OmniMCP provides rich UI context and interaction capabilities to AI models through Model Context Protocol (MCP) and microsoft/OmniParser. It focuses on enabling deep understanding of user interfaces through visual analysis, structured responses, and precise interaction.

Core Features

  • Rich Visual Context: Deep understanding of UI elements
  • Natural Language Interface: Target and analyze elements using natural descriptions
  • Comprehensive Interactions: Full range of UI operations with verification
  • Structured Types: Clean, typed responses using dataclasses
  • Robust Error Handling: Detailed error context and recovery strategies

Overview

  1. Spatial Feature Understanding: OmniMCP begins by developing a deep understanding of the user interface's visual layout. Leveraging microsoft/OmniParser, it performs detailed visual parsing, segmenting the screen and identifying all interactive and informational elements. This includes recognizing their types, content, spatial relationships, and attributes, creating a rich representation of the UI's static structure.
  2. Temporal Feature Understanding: To capture the dynamic aspects of the UI, OmniMCP tracks user interactions and the resulting state transitions. It records sequences of actions and changes within the UI, building a Process Graph that represents the flow of user workflows. This temporal understanding allows AI models to reason about interaction history and plan future actions based on context.
  3. Internal API Generation: Utilizing the rich spatial and temporal context it has acquired, OmniMCP leverages a Large Language Model (LLM) to generate an internal, context-specific API. Through In-Context Learning (prompting), the LLM dynamically creates a set of functions and parameters that accurately reflect the understood spatiotemporal features of the UI. This internal API is tailored to the current state and interaction history, enabling precise and context-aware interactions.
  4. External API Publication (MCP): Finally, OmniMCP exposes this dynamically generated internal API through the Model Context Protocol (MCP). This provides a consistent and straightforward interface for both humans (via natural language translated by the LLM) and AI models to interact with the UI. Through this MCP interface, a full range of UI operations can be performed with verification, all powered by the AI model's deep, dynamically created understanding of the UI's spatiotemporal context.
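
The Process Graph from the temporal step can be pictured as a directed graph whose nodes are UI states and whose edges are actions. The following is a minimal sketch under that assumption, not OmniMCP's actual data structure: `ProcessGraph`, `record`, and `next_actions` are hypothetical names, and states are reduced to plain string labels.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ProcessGraph:
    """Hypothetical directed graph: nodes are UI states, edges are actions."""

    # edges[state] -> list of (action, resulting_state) pairs, in recorded order
    edges: Dict[str, List[Tuple[str, str]]] = field(
        default_factory=lambda: defaultdict(list)
    )

    def record(self, before: str, action: str, after: str) -> None:
        """Store one observed transition, skipping exact duplicates."""
        transition = (action, after)
        if transition not in self.edges[before]:
            self.edges[before].append(transition)

    def next_actions(self, state: str) -> List[str]:
        """Return actions previously observed from the given state."""
        return [action for action, _ in self.edges.get(state, [])]


graph = ProcessGraph()
graph.record("login_form", "type username", "login_form_filled")
graph.record("login_form_filled", "click Submit button", "dashboard")
# A repeated transition is ignored rather than stored twice.
graph.record("login_form_filled", "click Submit button", "dashboard")
```

A model planning its next step could call `next_actions` with the current state label to see which actions have been observed from that state before.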

Installation

    pip install omnimcp

    # Or from source:
    git clone https://github.com/OpenAdaptAI/omnimcp.git
    cd omnimcp
    ./install.sh

Quick Start

    import asyncio

    from omnimcp import OmniMCP
    from omnimcp.types import UIElement, ScreenState, InteractionResult

    async def main():
        mcp = OmniMCP()

        # Get current UI state
        state: ScreenState = await mcp.get_screen_state()

        # Analyze specific element
        description = await mcp.describe_element(
            "error message in red text"
        )
        print(f"Found element: {description}")

        # Interact with UI
        result = await mcp.click_element(
            "Submit button",
            click_type="single"
        )
        if not result.success:
            print(f"Click failed: {result.error}")

    asyncio.run(main())

Core Types

    from dataclasses import dataclass, field
    from typing import Any, Dict, List, Optional

    @dataclass
    class UIElement:
        type: str                # button, text, slider, etc
        content: str             # Text or semantic content
        bounds: Bounds           # Normalized coordinates
        confidence: float        # Detection confidence
        attributes: Dict[str, Any] = field(default_factory=dict)

        def to_dict(self) -> Dict:
            """Convert to serializable dict"""

    @dataclass
    class ScreenState:
        elements: List[UIElement]
        dimensions: tuple[int, int]
        timestamp: float

        def find_elements(self, query: str) -> List[UIElement]:
            """Find elements matching natural query"""

    @dataclass
    class InteractionResult:
        success: bool
        element: Optional[UIElement]
        error: Optional[str] = None
        context: Dict[str, Any] = field(default_factory=dict)
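
Because `bounds` holds normalized coordinates and `ScreenState.dimensions` holds the screen size, a pixel click target can be derived from the two. The sketch below assumes `Bounds` is a top-left corner plus a size, each expressed as a fraction of the screen. The source does not define `Bounds`, so this field layout is an assumption, and `center_in_pixels` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Bounds:
    """Assumed layout: top-left corner plus size, each as a fraction of the screen."""
    x: float
    y: float
    width: float
    height: float


def center_in_pixels(bounds: Bounds, dimensions: Tuple[int, int]) -> Tuple[int, int]:
    """Map the center of a normalized box to integer pixel coordinates."""
    screen_w, screen_h = dimensions
    cx = (bounds.x + bounds.width / 2) * screen_w
    cy = (bounds.y + bounds.height / 2) * screen_h
    # Clamp so a box touching the right or bottom edge still yields a valid pixel.
    return (min(int(cx), screen_w - 1), min(int(cy), screen_h - 1))
```

Keeping coordinates normalized until the last moment means the same detected element stays valid if the capture is rescaled before the click is issued.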

MCP Implementation and Framework API

OmniMCP provides a powerful yet intuitive API for model interaction through the Model Context Protocol (MCP). This standardized interface enables seamless integration between large language models and UI automation capabilities.

Core API

    async def describe_current_state() -> str:
        """Get rich description of current UI state"""

    async def find_elements(query: str) -> List[UIElement]:
        """Find elements matching natural query"""

    async def take_action(
        description: str,
        image_context: Optional[bytes] = None
    ) -> ActionResult:
        """Execute action described in natural language with optional visual context"""
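
The source does not say how `find_elements` resolves a natural-language query. As a rough illustration only, the sketch below uses plain word overlap weighted by detection confidence; the real implementation may well rely on an LLM instead. `Element` and `rank_elements` are hypothetical names, and `Element` is a simplified stand-in for `UIElement`.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Element:
    """Simplified stand-in for UIElement (only the fields used here)."""
    type: str
    content: str
    confidence: float


def rank_elements(query: str, elements: List[Element]) -> List[Element]:
    """Return elements sharing at least one word with the query, best match first."""
    query_words = set(query.lower().split())

    def score(element: Element) -> float:
        element_words = set(f"{element.type} {element.content}".lower().split())
        overlap = len(query_words & element_words)
        # Weight word overlap by detection confidence to break ties.
        return overlap * element.confidence

    matches = [e for e in elements if score(e) > 0]
    return sorted(matches, key=score, reverse=True)
```

With a query such as "Submit button", an element of type `button` whose content is `Submit` matches on two words and ranks above other buttons that match on the type alone.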

Architecture

Core Components

  1. Visual State Manager
    • Element detection
    • State management and caching
    • Rich context extraction
    • History tracking
  2. MCP Tools
    • Tool definitions and execution
    • Typed responses
    • Error handling
    • Debug support
  3. UI Parser
    • Element detection
    • Text recognition
    • Visual analysis
    • Element relationships
  4. Input Controller
    • Precise mouse control
    • Keyboard input
    • Action verification
    • Movement optimization
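
The component list names action verification without describing a mechanism. One plausible approach is to compare the parsed screen before and after an action. The sketch below assumes each element can be identified by a `(type, content)` pair; `diff_states` and `verify_action` are hypothetical helpers, not part of the documented API.

```python
from typing import Dict, Set, Tuple

ElementKey = Tuple[str, str]  # (type, content), an assumed identity for an element


def diff_states(
    before: Set[ElementKey], after: Set[ElementKey]
) -> Dict[str, Set[ElementKey]]:
    """Report which elements appeared and disappeared between two snapshots."""
    return {"added": after - before, "removed": before - after}


def verify_action(before: Set[ElementKey], after: Set[ElementKey]) -> bool:
    """Treat an action as verified if the screen changed in any way."""
    changes = diff_states(before, after)
    return bool(changes["added"] or changes["removed"])
```

A stricter check could require a specific expected element in `added`, which would catch actions that changed the screen but not in the intended way.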

Development

Environment Setup

    # Create development environment
    ./install.sh --dev

    # Run tests
    pytest tests/

    # Run linting
    ruff check .

Debug Support

    @dataclass
    class DebugContext:
        """Rich debug information"""
        tool_name: str
        inputs: Dict[str, Any]
        result: Any
        duration: float
        visual_state: Optional[ScreenState]
        error: Optional[Dict] = None

        def save_snapshot(self, path: str) -> None:
            """Save debug snapshot for analysis"""

    # Enable debug mode
    mcp = OmniMCP(debug=True)

    # Get debug context
    debug_info = await mcp.get_debug_context()
    print(f"Last operation: {debug_info.tool_name}")
    print(f"Duration: {debug_info.duration}ms")
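
To show how fields such as `tool_name`, `inputs`, `result`, `duration`, and `error` might be filled in, here is a minimal sketch of a decorator that records one entry per tool call. It is an illustration under assumptions, not OmniMCP's implementation: `ToolRecord` and `traced` are hypothetical names, the wrapped tool is synchronous for brevity, and the duration is assumed to be in milliseconds because the print statement labels it "ms".

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass
class ToolRecord:
    """Reduced stand-in for DebugContext (no visual_state)."""
    tool_name: str
    inputs: Dict[str, Any]
    result: Any
    duration: float  # milliseconds (assumed unit)
    error: Optional[Dict[str, str]] = None


def traced(log: List[ToolRecord]) -> Callable:
    """Decorator factory: append one ToolRecord per call to `log`."""
    def decorator(func: Callable) -> Callable:
        def wrapper(**kwargs: Any) -> Any:
            start = time.perf_counter()
            result, error = None, None
            try:
                result = func(**kwargs)
                return result
            except Exception as exc:
                # Keep the failure details, then let the caller see the exception.
                error = {"type": type(exc).__name__, "message": str(exc)}
                raise
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000
                log.append(ToolRecord(func.__name__, kwargs, result, elapsed_ms, error))
        return wrapper
    return decorator
```

Recording in a `finally` block means failed calls are logged with their error details as well as successful ones, which is usually when the debug record matters most.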

Configuration

    # .env or environment variables
    OMNIMCP_DEBUG=1                    # Enable debug mode
    OMNIMCP_PARSER_URL=http://...      # Custom parser URL
    OMNIMCP_LOG_LEVEL=DEBUG            # Log level
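
For readers wiring these variables into their own tooling, the sketch below reads them with the standard library only. The variable names come from the section above; everything else is an assumption made for illustration, including the `Settings` and `load_settings` names, the accepted truthy values, and the defaults. Loading a `.env` file into the environment is not shown.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    debug: bool
    parser_url: Optional[str]
    log_level: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from a mapping of environment variables."""
    return Settings(
        # Assumption: "1", "true", and "yes" (any case) enable debug mode.
        debug=env.get("OMNIMCP_DEBUG", "0").strip().lower() in {"1", "true", "yes"},
        # Assumption: an unset or empty URL means no custom parser is configured.
        parser_url=env.get("OMNIMCP_PARSER_URL") or None,
        # Assumption: INFO is the default log level.
        log_level=env.get("OMNIMCP_LOG_LEVEL", "INFO").upper(),
    )
```

Passing the mapping as a parameter keeps the function easy to test without touching the real process environment.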

Performance Considerations

  1. State Management
    • Smart caching
    • Incremental updates
    • Background processing
    • Efficient invalidation
  2. Element Targeting
    • Efficient search
    • Early termination
    • Result caching
    • Smart retries
  3. Visual Analysis
    • Minimal screen captures
    • Region-based updates
    • Parser optimization
    • Result caching
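
Result caching for visual analysis could take many forms. One simple possibility is to skip the parser whenever the screenshot bytes have not changed. The sketch below assumes a cache keyed by the SHA-256 hash of the screenshot with oldest-first eviction; `ParseCache` and `parse_fn` are hypothetical, and the parser output is reduced to a list of strings.

```python
import hashlib
from typing import Callable, Dict, List


class ParseCache:
    """Cache parser output keyed by a hash of the screenshot bytes."""

    def __init__(self, parse_fn: Callable[[bytes], List[str]], max_entries: int = 8):
        self._parse_fn = parse_fn
        self._max_entries = max_entries
        self._entries: Dict[str, List[str]] = {}  # insertion-ordered
        self.misses = 0  # number of times the underlying parser actually ran

    def parse(self, screenshot: bytes) -> List[str]:
        key = hashlib.sha256(screenshot).hexdigest()
        if key in self._entries:
            return self._entries[key]
        self.misses += 1
        result = self._parse_fn(screenshot)
        if len(self._entries) >= self._max_entries:
            # Evict the oldest entry (dicts preserve insertion order).
            self._entries.pop(next(iter(self._entries)))
        self._entries[key] = result
        return result
```

Exact-byte hashing only helps when the screen is pixel-identical between captures. Region-based updates, also listed above, would need a finer-grained key such as one hash per screen tile.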

Limitations and Future Work

Current limitations include:

  • Need for more extensive validation across UI patterns
  • Optimization of pattern recognition in process graphs
  • Refinement of spatial-temporal feature synthesis

Future Research Directions

Beyond reinforcement learning integration, we plan to explore:

  • Fine-tuning Specialized Models: Training domain-specific models on UI automation tasks to improve efficiency and reduce token usage
  • Process Graph Embeddings with RAG: Embedding generated process graph descriptions and retrieving relevant interaction patterns via Retrieval Augmented Generation
  • Development of comprehensive evaluation metrics
  • Enhanced cross-platform generalization
  • Integration with broader LLM architectures
  • Collaborative multi-agent UI automation frameworks

Contributing

  1. Fork repository
  2. Create feature branch
  3. Implement changes
  4. Add tests
  5. Submit pull request

License

MIT License

Project Status

Active development - API may change


For detailed implementation guidance, see CLAUDE.md. For API reference, see API.md.

Contact

Remember: OmniMCP focuses on providing rich UI context through visual understanding. Design for clarity, build with structure, and maintain robust error handling.


