# CLAUDE.md - OmniMCP Implementation Guide
## Overview
This document describes how to implement OmniMCP, a system for UI automation through visual understanding and the Model Context Protocol (MCP).
## Core Architecture
The system consists of these essential components:
1. VisualState - Current screen state
2. MCP Server - Protocol implementation
3. Input Control - UI actions
4. UI Parser Integration - Visual analysis
## Implementation Approach
### 1. Start with VisualState
```python
class VisualState:
    def __init__(self):
        self.elements = []
        self.timestamp = None
        self.screen_dimensions = None

    def update(self, screenshot):
        """Update visual state from screenshot.

        Critical function that maintains screen state.
        Must handle:
        - Screenshot capture
        - UI element parsing
        - State updates
        - Coordinate normalization
        """
```
### 2. Implement Core MCP Server
```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("omnimcp")

@mcp.tool()
async def get_screen_state() -> ScreenState:
    """Get current state of visible UI elements"""

@mcp.tool()
async def click_element(description: str) -> ClickResult:
    """Click UI element matching description"""

@mcp.tool()
async def type_text(text: str) -> TypeResult:
    """Type text"""
```
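With the tools registered, the server can be started directly. A minimal sketch, assuming the MCP Python SDK's `FastMCP.run()` entry point (stdio transport by default):

```python
if __name__ == "__main__":
    # Serve the registered tools so an MCP client (e.g. Claude Desktop)
    # can connect to this process over stdio.
    mcp.run()
```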
### 3. Build Element Targeting
```python
def find_element(description: str) -> Element:
    """Find UI element matching description.

    Critical for action reliability.
    Consider:
    - Text matching
    - Element type
    - Location/context
    - Confidence scores
    """
```
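One way to realize this is a weighted score over fuzzy text similarity and parser confidence. A minimal sketch (the weights and acceptance threshold are illustrative assumptions, and the element list is passed explicitly rather than read from the visual state):

```python
from difflib import SequenceMatcher
from typing import List, Optional

def find_best_match(description: str, elements: List[UIElement]) -> Optional[UIElement]:
    """Return the element whose content best matches the description."""
    best, best_score = None, 0.0
    for element in elements:
        # Fuzzy similarity between the query and the element's text content
        text_score = SequenceMatcher(
            None, description.lower(), element.content.lower()
        ).ratio()
        # Blend in the parser's confidence in the detection itself
        score = 0.7 * text_score + 0.3 * element.confidence
        if score > best_score:
            best, best_score = element, score
    # Reject weak matches so unrelated elements are not clicked
    return best if best_score > 0.5 else None
```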
## Implementation Order
1. Visual State Management
- Screenshot capture
- UI parsing
- State updates
- Basic caching
2. MCP Protocol
- Observe endpoint
- Simple actions
- Rich responses
- Error handling
3. Action System
- Element targeting
- Input simulation
- Action verification
- Error recovery
## Key Considerations
### State Management
- Always update before actions
- Cache intelligently
- Track history when needed
- Clear cache-invalidation rules
### Error Handling
- Rich error context
- Recovery strategies
- Debug information
- Verification
### Performance
- Minimize updates
- Smart caching (see the sketch below)
- Async where beneficial
- Efficient targeting
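A simple way to combine "minimize updates" with "smart caching" is a freshness window: skip re-parsing when the last capture is recent enough. A minimal sketch (the 0.5 s TTL is an illustrative assumption):

```python
import time

class CachedVisualState(VisualState):
    """VisualState that skips re-parsing while the last update is fresh."""

    def __init__(self, ttl_seconds: float = 0.5):
        super().__init__()
        self.ttl_seconds = ttl_seconds

    def update_if_stale(self, screenshot):
        now = time.time()
        # Reuse cached elements while they are inside the TTL window
        if self.timestamp is not None and now - self.timestamp < self.ttl_seconds:
            return self.elements
        self.update(screenshot)
        self.timestamp = now
        return self.elements
```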
## MCP Protocol Details
### Observe
```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class UIElement:
    content: str
    type: str
    bounds: Bounds
    confidence: float

@dataclass
class ScreenState:
    elements: List[UIElement]
    dimensions: tuple[int, int]
    timestamp: float

@dataclass
class ActionResult:
    success: bool
    element: Optional[UIElement]
    error: Optional[str] = None
```
## Code Structure
Current implementation:
```
./
├── omnimcp/                  # Main package directory
│   ├── omnimcp.py            # Core implementation with OmniMCP class and VisualState
│   ├── input.py              # Input controller for UI interactions
│   ├── types.py              # Type definitions (Bounds, UIElement, etc.)
│   ├── utils.py              # Utilities for screenshots, coordinates, etc.
│   ├── config.py             # Centralized configuration
│   └── omniparser/           # UI parsing functionality
│       ├── client.py         # Parser client and provider
│       └── server.py         # Parser deployment and management
├── tests/                    # Test directory
│   ├── test_synthetic_ui.py  # Synthetic UI generation for testing
│   └── test_omnimcp.py       # Core functionality tests
└── run_omnimcp.py            # Command-line entry point
```
Planned expansion:
```
./
├── utils.py # Core utilities and input control
├── omniparser/ # UI parsing functionality
│ ├── client.py # Parser client and provider
│ └── server.py # Parser deployment and management
├── core/ # Future: Core state management
│ ├── visual_state.py
│ └── element.py
└── mcp/ # Future: MCP implementation
└── server.py
```
## Package Management
OmniMCP uses `uv` for dependency management. When adding new dependencies, use:
```bash
uv add <package-name> # Add a regular dependency
uv add --dev <package-name> # Add a development dependency
uv pip install -e . # Install the project (and its dependencies) in editable mode
```
This ensures dependencies are properly recorded in pyproject.toml.
## Configuration System
OmniMCP now uses a centralized configuration system with:
- Settings loaded from environment variables and `.env` file
- Default values for all settings
- Support for various configuration types:
  - Claude API settings
  - OmniParser connection settings
  - AWS deployment configuration
  - Debug and logging settings
To configure OmniMCP, create a `.env` file in the project root with your settings:
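The variable names below are illustrative assumptions; check `omnimcp/config.py` for the setting names the loader actually expects.

```
# Claude API settings
ANTHROPIC_API_KEY=sk-ant-...

# OmniParser connection settings
OMNIPARSER_URL=http://localhost:8000

# AWS deployment configuration
AWS_REGION=us-east-1

# Debug and logging settings
DEBUG=false
LOG_LEVEL=INFO
```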
## Implementation Notes
### Core Principles
1. Visual state is always current
2. Every action verifies completion
3. Rich error context always available
4. Debug information accessible
### Critical Functions
1. VisualState.update()
2. MCPServer.observe()
3. find_element()
4. verify_action()
### Error Handling
```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ToolError:
    message: str
    visual_context: Optional[bytes]  # Screenshot
    attempted_action: str
    element_description: str
    recovery_suggestions: List[str]
```
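For illustration, a tool that cannot locate its target might populate the error like this (a sketch; the helper name and suggestion text are assumptions):

```python
def element_not_found_error(description: str, screenshot: bytes) -> ToolError:
    """Build a rich error when no element matches the description."""
    return ToolError(
        message=f"No element matching '{description}' was found on screen",
        visual_context=screenshot,  # Attach the screenshot for debugging
        attempted_action="click",
        element_description=description,
        recovery_suggestions=[
            "Re-run get_screen_state and inspect the visible elements",
            "Use a more specific description (e.g. include the element type)",
            "Wait for the UI to finish loading, then retry",
        ],
    )
```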
### Testing Requirements
1. Unit tests for core logic
2. Integration tests for flows
3. Visual verification
4. Performance benchmarks
### Synthetic UI Testing
OmniMCP includes tools for generating synthetic test UIs with:
- Predefined UI elements (buttons, text fields, checkboxes)
- Before/after image pairs for action verification
- Element visualization for debugging
This approach offers several advantages:
- Works across all platforms
- Runs in any environment (including CI)
- Provides deterministic results
- Doesn't require actual displays
- Enables testing different scenarios
## Example Implementation Flow
1. **Setup Visual State**
```python
visual_state = VisualState()
visual_state.update(take_screenshot())
```
2. **Find Target Element**
```python
element = visual_state.find_element_by_content("Submit")
if not element:
    raise MCPError("Element not found", context=visual_state.to_dict())
```
3. **Take Action**
```python
success = await input_controller.click(element.center)
if not success:
    raise MCPError("Click failed", context={"element": element})
```
4. **Verify Result**
```python
@dataclass
class ActionVerification:
    success: bool
    before_state: bytes  # Screenshot
    after_state: bytes
    changes_detected: List[BoundingBox]
    confidence: float

async def verify_tool_execution(
    action_result: ActionResult,
    verification: ActionVerification,
) -> bool:
    """Verify tool executed successfully"""
```
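A simple way to populate `changes_detected` is to diff the before/after screenshots. A sketch using Pillow's `ImageChops` (treating any non-zero pixel difference as a change is an illustrative heuristic):

```python
import io
from PIL import Image, ImageChops

def detect_change_region(before_state: bytes, after_state: bytes):
    """Return the bounding box of pixels that changed, or None if identical."""
    before = Image.open(io.BytesIO(before_state)).convert("RGB")
    after = Image.open(io.BytesIO(after_state)).convert("RGB")
    # Pixel-wise difference; getbbox() returns the smallest box containing
    # every non-zero (i.e. changed) pixel, or None when nothing changed.
    return ImageChops.difference(before, after).getbbox()
```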
## Remember
1. Focus on core functionality first
2. Build incrementally
3. Test thoroughly
4. Keep it simple but robust
5. Always verify actions
6. Maintain current state
7. Provide rich error context
This implementation guide focuses on the essential components needed for effective UI automation through visual understanding and action. Follow the implementation order strictly and ensure each component is solid before moving to the next.
---
Here's a high-level description of the ideal OmniMCP system:
# OmniMCP System Design
## Core Purpose
OmniMCP is a Model Context Protocol (MCP) server that enables AI models (particularly Claude) to:
1. Understand UI elements on screen through visual analysis
2. Take actions through mouse and keyboard control
3. Get rich visual context about UI elements using Claude's vision capabilities
## Key Components
### 1. MCP Server
```python
class MCPServer:
    """Core MCP server implementing the Model Context Protocol.

    Primary interface for AI models to interact with the UI.
    """

    async def get_screen_state(self) -> Dict:
        """Get current screen state with UI elements."""

    async def analyze_ui(self, query: str, max_elements: int = 5) -> Dict:
        """Analyze UI elements matching a natural language query."""

    async def click_element(self, descriptor: str) -> Dict:
        """Click UI element by description."""

    async def type_text(self, text: str) -> Dict:
        """Type text using keyboard."""

    async def press_key(self, key: str) -> Dict:
        """Press a keyboard key."""
```
### 2. Visual Analysis
```python
class VisualState:
    """Represents current screen state with UI elements."""

    def update_from_parser(self, parser_result: Dict):
        """Update state from UI parser results."""

    def find_element_by_content(self, content: str) -> Optional[Element]:
        """Find UI element by content."""

    def to_mcp_description(self) -> Dict:
        """Convert state to MCP format."""
```
### 3. UI Parser Integration
```python
class OmniParserClient:
    """Client for interacting with the OmniParser API."""

    def parse_image(self, image: Image.Image) -> Dict[str, Any]:
        """Parse an image using the OmniParser service."""

    def check_server_available(self) -> bool:
        """Check if the OmniParser server is available."""

class OmniParserProvider:
    """Provider for OmniParser services with deployment capabilities."""

    def deploy(self) -> bool:
        """Deploy OmniParser if not already running."""

    def is_available(self) -> bool:
        """Check if parser is available."""
```
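A typical call sequence using only the interface above (a sketch; the deploy-then-parse flow mirrors the deployment options described later):

```python
def parse_current_screen(screenshot):
    """Parse a screenshot, deploying the parser service first if needed."""
    provider = OmniParserProvider()
    if not provider.is_available():
        # Spin up the parser service before attempting to parse
        provider.deploy()

    client = OmniParserClient()
    if not client.check_server_available():
        raise RuntimeError("OmniParser server is unreachable")
    return client.parse_image(screenshot)
```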
### 4. Input Control
```python
class InputController:
    """Handles mouse and keyboard input."""

    def click(self, x: float, y: float):
        """Click at coordinates."""

    def type_text(self, text: str):
        """Type text."""

    def press_key(self, key: str):
        """Press keyboard key."""
```
### 5. Claude Vision Integration
```python
class ClaudeVision:
    """Handles visual analysis using Claude."""

    async def describe_elements(
        self,
        elements: List[Element],
        context: Optional[Image] = None,
    ) -> List[str]:
        """Get detailed descriptions of UI elements."""

    async def analyze_visual_query(
        self,
        query: str,
        screenshot: Image,
        elements: List[Element],
    ) -> Dict:
        """Answer questions about UI using Claude's vision."""
```
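Under the hood, `analyze_visual_query` could send the screenshot to Claude as an image block. A sketch using the `anthropic` SDK (the model name and prompt handling are illustrative assumptions):

```python
import base64
import io

import anthropic

async def ask_claude_about_screenshot(query: str, screenshot) -> str:
    """Send a PIL screenshot plus a question to Claude; return the answer."""
    buffer = io.BytesIO()
    screenshot.save(buffer, format="PNG")
    image_data = base64.b64encode(buffer.getvalue()).decode()

    client = anthropic.AsyncAnthropic()  # Reads ANTHROPIC_API_KEY from the environment
    response = await client.messages.create(
        model="claude-3-5-sonnet-20241022",  # Illustrative model name
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64",
                            "media_type": "image/png",
                            "data": image_data}},
                {"type": "text", "text": query},
            ],
        }],
    )
    return response.content[0].text
```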
## MCP Tools Interface
```python
@mcp.tool()
async def get_screen_state() -> ScreenState:
    """Get current state of visible UI elements"""
    state = await visual_state.capture()
    return state

@mcp.tool()
async def find_element(description: str) -> Optional[UIElement]:
    """Find UI element matching natural language description"""
    state = await get_screen_state()
    return semantic_element_search(state.elements, description)

@mcp.tool()
async def click_element(description: str) -> ClickResult:
    """Click UI element matching description"""
    element = await find_element(description)
    if not element:
        return ClickResult(success=False, error="Element not found")
    return await perform_click(element)

@mcp.tool()
async def type_text(text: str) -> TypeResult:
    """Type text using keyboard"""
    try:
        await keyboard.type_text(text)
        return TypeResult(success=True, text_entered=text)
    except Exception as e:
        return TypeResult(success=False, error=str(e))

@mcp.tool()
async def press_key(
    key: str,
    modifiers: Optional[List[str]] = None,
) -> ActionResult:
    """Press keyboard key with optional modifiers"""
```
## Key Features
1. **Smart UI Analysis**
- Visual element detection
- Natural language queries
- Rich context through Claude vision
- Element relationships and hierarchy
2. **Robust Actions**
- Smart element targeting
- Coordinate normalization
- Input verification
- Action confirmation
3. **Development Support**
- Debug visualizations
- Action logging
- Error diagnostics
- Performance metrics
4. **Deployment Options**
- Local parser
- Remote parser service
- Auto-deployment
- Service management
---
# OmniMCP Implementation Approach
## Core Design Principles
1. MCP server is the primary interface
2. Visual state is always current
3. Errors are descriptive and actionable
4. Debug information is always available
## Implementation Path
### 1. Foundation (Based on proven code)
```python
class OmniMCP:
    def __init__(self):
        self.visual_state = VisualState()
        self.ui_parser = UIParserProvider()
        self.keyboard = KeyboardController()
        self.mouse = MouseController()

    def update_visual_state(self):
        screenshot = take_screenshot()
        parser_result = self.ui_parser.parse_screenshot(screenshot)
        self.visual_state.update_from_parser(parser_result)
```
### 2. MCP Server First
- Implement core MCP tools based on our working server.py
- Each tool updates visual state before acting
- All tools return structured responses
- Debug screenshots for each action
### 3. Visual Analysis Pipeline
1. Screenshot capture
2. UI element parsing
3. State management
4. Claude vision integration for rich context
### 4. Action System
1. Element targeting
2. Coordinate handling
3. Input simulation
4. Action verification
### 5. Debug Infrastructure
- Visual state snapshots
- Action logging (see the sketch below)
- Error context
- Performance metrics
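Action logging can be a thin wrapper applied to every tool. A sketch using the standard `logging` module (the decorator approach is an illustrative choice, not the project's existing implementation):

```python
import functools
import logging
import time

logger = logging.getLogger("omnimcp.actions")

def log_action(func):
    """Log each tool invocation with its arguments, duration, and outcome."""
    @functools.wraps(func)
    async def wrapper(*args, **kwargs):
        start = time.time()
        logger.info("action=%s args=%r kwargs=%r", func.__name__, args, kwargs)
        try:
            result = await func(*args, **kwargs)
            logger.info("action=%s ok in %.2fs", func.__name__, time.time() - start)
            return result
        except Exception:
            logger.exception("action=%s failed", func.__name__)
            raise
    return wrapper
```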
## Key Implementation Details
### MCP Server
- Use FastMCP for protocol compatibility
- Structured responses for all actions
- Visual state always updated before actions
- Rich error context in responses
### Visual State
- Keep normalized and absolute coordinates
- Track element confidence scores
- Maintain element relationships
- Cache recent states for context
### UI Parser Integration
- Start with local parser
- Remote parser as fallback
- Smart deployment management
- Connection recovery
### Input Control
- Use the proven pynput implementation (see the sketch below)
- Coordinate normalization
- Action verification
- Error recovery
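A minimal sketch of that input layer on top of pynput, including normalized-to-absolute coordinate conversion (passing the screen size in explicitly is an illustrative assumption):

```python
from pynput.keyboard import Controller as KeyboardController
from pynput.mouse import Button, Controller as MouseController

class PynputInputController:
    """Mouse and keyboard control backed by pynput."""

    def __init__(self, screen_width: int, screen_height: int):
        self.mouse = MouseController()
        self.keyboard = KeyboardController()
        self.screen_width = screen_width
        self.screen_height = screen_height

    def click(self, x: float, y: float) -> None:
        """Click at normalized (0..1) screen coordinates."""
        # Convert normalized coordinates to absolute pixels before moving
        self.mouse.position = (int(x * self.screen_width),
                               int(y * self.screen_height))
        self.mouse.click(Button.left)

    def type_text(self, text: str) -> None:
        """Type a string via the keyboard controller."""
        self.keyboard.type(text)
```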
## Critical Considerations
1. **Error Handling**
- Clear error messages
- Recovery strategies
- Debug context
- User feedback
2. **Performance**
- Minimize visual state updates
- Cache when possible
- Async where beneficial
- Smart retries
3. **Reliability**
- Verify actions
- Handle edge cases
- Recover from failures
- Maintain state consistency
---
# OmniMCP Core Protocol
## Core Concept
MCP for OmniMCP is fundamentally about enabling AI models to:
1. Understand what's on screen through rich context
2. Take actions using natural language descriptions
## Essential Tools
```python
@mcp.tool()
async def get_screen_state() -> ScreenState:
    """Get current state of visible UI elements

    Returns:
        ScreenState containing all visible UI elements with their properties
    """

@mcp.tool()
async def find_element(description: str) -> Optional[UIElement]:
    """Find UI element matching natural language description

    Args:
        description: Natural language description of element (e.g. "the submit button")
    """

@mcp.tool()
async def click_element(description: str) -> ClickResult:
    """Click UI element matching description

    Args:
        description: Natural language description of element to click
    """

@mcp.tool()
async def type_text(text: str) -> TypeResult:
    """Type text using keyboard

    Args:
        text: Text to type
    """

@mcp.tool()
async def press_key(
    key: str,
    modifiers: Optional[List[str]] = None,
) -> ActionResult:
    """Press keyboard key with optional modifiers

    Args:
        key: Key to press (e.g. "enter", "tab")
        modifiers: Optional modifier keys (e.g. ["ctrl", "shift"])
    """
```
## Key Design Points
1. **Simplicity**
- Two core endpoints: observe and act
- Analysis as enhancement of observation
- Clear, consistent response structure
2. **Stateful Context**
- Server maintains current visual state
- Actions update state automatically
- Historical context available when needed
3. **Natural Language Interface**
- Element targeting by description
- Rich analysis of visual state
- Error messages in natural language
4. **Verification**
- Actions confirm completion
- Visual state updates verify changes
- Clear error reporting
This represents the minimal, essential MCP interface needed for effective UI automation through visual understanding and action.
### Prompt Templates
Use template utilities for clean, maintainable prompts:
```python
from omnimcp.utils import create_prompt_template, render_prompt

# Create a reusable template
analyze_template = create_prompt_template("""
Analyze this UI element:
{{ element.description }}
Location: {{ element.bounds }}
Type: {{ element.type }}

Suggest interactions based on:
{% for attr in element.attributes %}
- {{ attr }}
{% endfor %}
""")

# Render with data
prompt = analyze_template.render(element=ui_element)

# Or use the one-step helper
prompt = render_prompt("""
Quick analysis: {{ element.description }}
""", element=ui_element)
```
## Implementation Status
Note: The current implementation in `omnimcp.py` represents the API design based on MCP specifications but has not been tested with actual MCP server implementations yet. The types and tools are defined but require:
1. Integration testing with MCP SDK
2. Verification of tool definitions
3. Testing with Claude and other MCP clients
4. Implementation of actual tool logic
This design serves as a starting point for implementing a compliant MCP server for UI understanding.
## Testing Strategy
### Synthetic UI Testing
For testing visual understanding without relying on real UIs or displays, we'll use programmatically generated images:
```python
from PIL import Image, ImageDraw

def generate_test_ui():
    """Generate synthetic UI image with known elements."""
    # Create blank canvas
    img = Image.new('RGB', (800, 600), color='white')
    draw = ImageDraw.Draw(img)

    # Draw UI elements with known positions
    elements = []

    # Button
    draw.rectangle([(100, 100), (200, 150)], fill='blue', outline='black')
    draw.text((110, 115), "Submit", fill="white")
    elements.append({
        "type": "button",
        "content": "Submit",
        "bounds": {"x": 100, "y": 100, "width": 100, "height": 50},
        "confidence": 1.0,
    })

    # Text field
    draw.rectangle([(300, 100), (500, 150)], fill='white', outline='black')
    draw.text((310, 115), "Username", fill="gray")
    elements.append({
        "type": "text_field",
        "content": "Username",
        "bounds": {"x": 300, "y": 100, "width": 200, "height": 50},
        "confidence": 1.0,
    })

    return img, elements
```
### Action Verification Testing
For testing action verification, we'll generate before/after image pairs:
```python
from PIL import ImageDraw

def generate_action_test_pair(action_type="click"):
    """Generate before/after UI image pair for a specific action."""
    before_img, elements = generate_test_ui()
    after_img = before_img.copy()
    after_draw = ImageDraw.Draw(after_img)

    if action_type == "click":
        # Show button in pressed state
        after_draw.rectangle([(100, 100), (200, 150)], fill='darkblue', outline='black')
        after_draw.text((110, 115), "Submit", fill="white")
        # Add success message
        after_draw.text((100, 170), "Form submitted!", fill="green")
    elif action_type == "type":
        # Show text entered in field
        after_draw.rectangle([(300, 100), (500, 150)], fill='white', outline='black')
        after_draw.text((310, 115), "testuser", fill="black")

    return before_img, after_img, elements
```
### Test Implementation
Testing Claude integration with synthetic images:
```python
from unittest.mock import patch

async def test_element_finding():
    """Test Claude's ability to find elements in synthetic UI."""
    # Generate test image with known elements
    test_img, elements = generate_test_ui()

    # Mock screenshot capture to return test image
    with patch('omnimcp.utils.take_screenshot', return_value=test_img):
        # Setup OmniMCP with mock parser that returns our elements
        # ...

        # Test with various descriptions
        descriptions = [
            "submit button",
            "blue button",
            "the username field",
            "textbox in the middle",
        ]
        for desc in descriptions:
            # Call find_element with each description
            element = await mcp._visual_state.find_element(desc)
            # Verify the correct element was found
            # ...
```
This testing approach:
- Works across all platforms
- Runs in any environment (including CI)
- Provides deterministic results
- Doesn't require actual displays or UI
- Allows testing a variety of scenarios
For real UI action testing, we'll start with manual verification while developing more sophisticated test environments.
Focus on implementing the core functionality first, then expand the testing framework.