Where's Waldo Rick

ARCHITECTURE.md•20.1 KiB

# Architecture Research - MCP Server with Vision Capabilities ## Executive Summary **Architecture Pattern**: Tool as Orchestrator - MCP tools delegate to service layer for testability and separation of concerns. **Key Innovation**: Multi-platform adapter pattern for screenshot capture + rate-limited Gemini integration + filesystem storage with JSON index. **Build Order**: Foundation → Capture → Comparison → Operations (prevents external API dependencies early) --- ## System Architecture ### High-Level Components ``` ┌─────────────────────────────────────────────────────────────┐ │ Claude Code │ │ (MCP Client) │ └──────────────────────┬────────────────────────────────────────┘ │ JSON-RPC ▼ ┌─────────────────────────────────────────────────────────────┐ │ MCP Server │ │ ┌───────────────────────────────────────────────────────┐ │ │ │ Tool Layer (Thin Wrappers) │ │ │ │ - visual_capture() │ │ │ │ - visual_compare() │ │ │ │ - visual_prepare() │ │ │ │ - visual_cleanup() │ │ │ └──────────────────────┬────────────────────────────────┘ │ │ │ │ │ ┌──────────────────────▼────────────────────────────────┐ │ │ │ Service Layer (Business Logic) │ │ │ │ ┌──────────────┐ ┌──────────────┐ ┌────────────┐ │ │ │ │ │ Capture │ │ Compare │ │ Storage │ │ │ │ │ │ Service │ │ Service │ │ Service │ │ │ │ │ └──────────────┘ └──────────────┘ └────────────┘ │ │ │ └──────────────────────────────────────────────────────┘ │ └─────────────────────────────────────────────────────────────┘ │ ┌──────────────────┼──────────────────┐ │ │ │ ▼ ▼ ▼ ┌───────────┐ ┌──────────────┐ ┌──────────┐ │ Platform │ │ Gemini │ │ Local │ │ Adapters │ │ Agentic │ │ Storage │ │ │ │ Vision API │ │ │ │ - macOS │ │ │ │ Filesyst. │ │ - iOS │ │ Rate Limited │ │ + JSON │ │ - Web │ │ │ │ │ └───────────┘ └──────────────┘ └──────────┘ ``` --- ## Component Boundaries ### 1. Tool Layer (MCP Protocol) **Responsibility**: Expose functionality via MCP protocol, handle JSON-RPC **Components**: - `visual_capture(name, description)` - Capture and store screenshot - `visual_compare(before, after, expected_changes)` - Compare with agentic vision - `visual_prepare(expected_changes_list)` - Declare expected changes - `visual_cleanup(phase_to_remove)` - Remove old screenshots **Characteristics**: - Thin wrappers (minimal logic) - Argument validation - Error handling (MCP protocol errors) - Delegate to service layer **Output**: MCP Tool definitions, JSON-RPC handlers --- ### 2. Service Layer (Business Logic) #### 2.1 Capture Service **Responsibility**: Screenshot capture across platforms **Interface**: ```python class CaptureService: def capture(name: str, description: str = "") -> Screenshot def get_screenshot(path: str) -> Screenshot def list_screenshots() -> List[Screenshot] ``` **Platform Adapters**: ```python class CaptureAdapter(ABC): @abstractmethod def capture(self) -> Image: ... class MacOSCocoaAdapter(CaptureAdapter): """Use MSS for macOS""" def capture(self) -> Image: with mss.mss() as sct: monitor = sct.monitors[1] screenshot = sct.grab(monitor) return Image.frombytes('RGB', screenshot.size, screenshot.rgb) class IOSSimulatorAdapter(CaptureAdapter): """Use simctl for iOS simulators""" def capture(self) -> Image: subprocess.run([ 'xcrun', 'simctl', 'io', 'booted', 'screenshot', f'{self.temp_path}.png' ]) return Image.open(f'{self.temp_path}.png') class WebChromeAdapter(CaptureAdapter): """Use chrome-devtools MCP for web""" def capture(self) -> Image: # Call chrome-devtools MCP take_screenshot tool # Download and return image ... ``` **Orchestration**: ```python class CaptureOrchestrator: def __init__(self): self.adapters = [ MacOSCocoaAdapter(), IOSSimulatorAdapter(), WebChromeAdapter(), ] def auto_detect_and_capture(self) -> Screenshot: for adapter in self.adapters: if adapter.is_available(): return adapter.capture() raise CaptureError("No capture adapter available") ``` --- #### 2.2 Compare Service **Responsibility**: Image comparison with agentic vision analysis **Interface**: ```python class CompareService: def compare( self, before: Screenshot, after: Screenshot, expected_changes: List[str] = None ) -> ComparisonResult def pixel_diff(before: Image, after: Image) -> DiffResult def agentic_analysis( before: Image, after: Image, diff: DiffResult ) -> AgenticResult ``` **Pipeline**: ```python class CompareService: def __init__(self, gemini_client, storage): self.gemini = gemini_client self.storage = storage def compare(self, before_path, after_path, expected_changes): # 1. Load images before = Image.open(before_path) after = Image.open(after_path) # 2. Pixel diff (fast, local) diff_result = self.pixel_diff(before, after) # 3. Agentic vision analysis (slow, API) agentic_result = self.agentic_analysis( before, after, diff_result ) # 4. Validate against expectations validation = self.validate_expectations( expected_changes, agentic_result ) # 5. Generate annotated output annotated = self.generate_annotated( after, diff_result, agentic_result ) return ComparisonResult( pixel_diff=diff_result, agentic_analysis=agentic_result, validation=validation, annotated_screenshot=annotated ) def pixel_diff(self, before, after): # OpenCV diffing with 2px Gaussian blur # Returns: bounding boxes, change count, heatmap ... def agentic_analysis(self, before, after, diff_result): # Gemini 3 Flash iterative analysis # Returns: text description, found regions, annotations ... ``` **Gemini Integration**: ```python class GeminiClient: def __init__(self, api_key, rate_limiter): self.model = genai.GenerativeModel('gemini-3-flash-preview') self.rate_limiter = rate_limiter # Token bucket (15 req/min) async def analyze_with_agentic_vision( self, before: Image, after: Image, query: str ) -> str: # Rate limiting await self.rate_limiter.acquire() # Low-res pass (cheaper) response_low = await self.model.generate_content([ query, resize_image(before, 512), resize_image(after, 512), "What changed? Be specific." ]) # Medium-res if needed (more expensive) if "zoom" in response_low.text.lower(): response_med = await self.model.generate_content([ query, resize_image(after, 1024), "Deeper analysis of changed regions" ]) return response_med.text return response_low.text ``` --- #### 2.3 Storage Service **Responsibility**: Filesystem storage with JSON index **Schema**: ``` .screenshots/ ├── phases/ │ ├── 01-before-card-update.png │ ├── 01-before-card-update.json # Metadata │ ├── 02-after-card-update.png │ ├── 02-after-card-update.json │ └── .index.json # Global index └── comparisons/ ├── 01-vs-02-diff.png ├── 01-vs-02-diff.json └── 01-vs-02-annotation.png ``` **Metadata Format**: ```json { "name": "01-before-card-update", "timestamp": "2025-02-04T10:30:00Z", "description": "Baseline before card layout update", "phase": 3, "platform": "macos", "dimensions": {"width": 1920, "height": 1080}, "file_size": 2048576, "hash": "sha256:abc123..." } ``` **Interface**: ```python class StorageService: def save_screenshot(self, image: Image, metadata: dict) -> str def load_screenshot(self, path: str) -> Screenshot def list_screenshots(self, phase: int = None) -> List[Screenshot] def delete_screenshot(self, path: str) def cleanup_old_screenshots(self, keep_last_n: int = 3) ``` **Cleanup Policy**: ```python class StorageService: def cleanup_old_screenshots(self, keep_last_n=3): """Delete all screenshots except last N phases""" screenshots = self.list_screenshots() sorted_by_phase = sorted(screenshots, key=lambda s: s.phase) to_delete = sorted_by_phase[:-keep_last_n] for screenshot in to_delete: self.delete_screenshot(screenshot.path) self.update_index(remove=screenshot) ``` --- ## Data Flow ### Screenshot Capture Flow ``` User/AI Agent │ │ "Capture screenshot" ▼ MCP Tool: visual_capture() │ │ Delegate to service ▼ CaptureOrchestrator │ │ Auto-detect platform ├─→ macOS available? │ └─→ MacOSCocoaAdapter.capture() ├─→ iOS simulator running? │ └─→ IOSSimulatorAdapter.capture() └─→ Chrome devtools available? └─→ WebChromeAdapter.capture() │ │ Returns: Screenshot object ▼ StorageService.save_screenshot() │ ├─→ Save PNG to .screenshots/phases/ ├─→ Write metadata JSON └─→ Update .index.json │ ▼ Return: Path to saved screenshot ``` --- ### Comparison Flow ``` User/AI Agent │ │ "Compare these screenshots" ▼ MCP Tool: visual_compare(before, after, expected) │ │ Delegate to service ▼ CompareService.compare() │ ├─→ Load both images ├─→ Pixel diff (OpenCV, local) │ └─→ Returns: bounding boxes, heatmap │ ├─→ Agentic vision (Gemini API) │ └─→ GeminiClient.analyze_with_agentic_vision() │ ├─→ Rate limiter (token bucket) │ ├─→ Low-res pass (cheap) │ ├─→ Medium-res if needed (expensive) │ └─→ Returns: text analysis, annotations │ ├─→ Validate expectations │ └─→ Compare found changes vs expected_changes │ └─→ Mark ✅ expected, ⚠️ unexpected │ └─→ Generate annotated output └─→ Draw bounding boxes, labels on screenshot └─→ Save to .screenshots/comparisons/ │ ▼ Return: ComparisonResult - annotated_screenshot.png - pixel_diff_summary - agentic_analysis_text - validation_report (✅ expected, ⚠️ regressions) ``` --- ## MCP Tool Contracts ### Tool: visual_capture **Input Schema**: ```typescript { name: "visual_capture", description: "Capture screenshot and store with metadata", inputSchema: { type: "object", properties: { name: { type: "string", description: "Screenshot name (e.g., '03-phase-complete')" }, description: { type: "string", description: "Optional description of what this screenshot captures", default: "" } }, required: ["name"] } } ``` **Output**: ```typescript { success: true, data: { path: ".screenshots/phases/03-phase-complete.png", timestamp: "2025-02-04T10:30:00Z", phase: 3, dimensions: { width: 1920, height: 1080 } } } ``` --- ### Tool: visual_compare **Input Schema**: ```typescript { name: "visual_compare", description: "Compare two screenshots with agentic vision analysis", inputSchema: { type: "object", properties: { before: { type: "string", description: "Path to baseline screenshot" }, after: { type: "string", description: "Path to current screenshot" }, expected_changes: { type: "array", items: { type: "string" }, description: "List of expected changes (e.g., ['Card padding +2px'])", default: [] } }, required: ["before", "after"] } } ``` **Output**: ```typescript { success: true, data: { annotated_screenshot: ".screenshots/comparisons/01-vs-02-annotation.png", pixel_diff: { total_changes: 47, changed_pixels: 1234, diff_percentage: 0.08 }, agentic_analysis: { summary: "Card padding increased by 2px top/bottom. Button shifted 20px right.", found_changes: [ { "region": "card", "change": "+2px padding" }, { "region": "button", "change": "moved 20px right" } ], unexpected_changes: [ { "region": "title", "change": "shifted 5px down", "severity": "warning" } ] }, validation: { expected_matched: ["Card padding +2px", "Button moved 20px right"], expected_missed: [], unexpected_found: ["Title shifted 5px down"], regressions: true } } } ``` --- ## Error Handling Strategies ### 1. Rate Limit Errors (Gemini API) **Detection**: HTTP 429, timeout after 30s **Strategy**: Token bucket rate limiter + exponential backoff retry ```python class TokenBucketRateLimiter: def __init__(self, rate=15, per=60): self.rate = rate # 15 requests self.per = per # per 60 seconds self.allowance = rate self.last_check = time.time() async def acquire(self): with threading.Lock(): current = time.time() time_passed = current - self.last_check self.last_check = current self.allowance += time_passed * (self.rate / self.per) if self.allowance > self.rate: self.allowance = self.rate if self.allowance < 1: sleep_time = (self.per - self.allowance) / self.rate await asyncio.sleep(sleep_time) self.allowance = 0 else: self.allowance -= 1 ``` **Retry Logic**: ```python async def call_gemini_with_retry(self, prompt, image): for attempt in range(3): try: return await self.model.generate_content([prompt, image]) except RateLimitError: if attempt == 2: raise await asyncio.sleep(2 ** attempt) # 1s, 2s, 4s ``` --- ### 2. Capture Failures **Detection**: Platform adapter unavailable, screenshot fails **Strategy**: Fallback chain + clear error messages ```python class CaptureOrchestrator: def auto_detect_and_capture(self): for adapter in self.adapters: try: if adapter.is_available(): return adapter.capture() except CaptureError as e: logger.warning(f"{adapter} failed: {e}") continue raise CaptureError( "No capture method available. Tried: " f"{[a.__class__.__name__ for a in self.adapters]}" ) ``` --- ### 3. Storage Errors **Detection**: Disk full, permissions, corrupted files **Strategy**: Validate before write, clean up on error ```python class StorageService: def save_screenshot(self, image, metadata): # Validate disk space if not self.has_disk_space(image.size): raise StorageError("Insufficient disk space") # Validate permissions if not self.can_write_to_directory(): raise StorageError("No write permissions") try: path = self.write_image(image) self.write_metadata(path, metadata) self.update_index(metadata) except Exception as e: # Rollback: remove partial files self.cleanup_partial_write(path) raise StorageError(f"Failed to save: {e}") ``` --- ## Build Order ### Phase 1: Foundation **Components**: - Storage service (filesystem + JSON index) - Type definitions (Screenshot, ComparisonResult, etc.) - Utilities (image I/O, hashing) **Why First**: No external dependencies, pure Python, testable in isolation **Deliverable**: Can save/load screenshots with metadata --- ### Phase 2: Capture **Components**: - Capture orchestrator - Platform adapters (macOS, iOS, Web) - Capture service **Why Second**: Depends on storage, no external APIs **Deliverable**: Can capture screenshots from all platforms **Risk**: Platform tool availability needs verification --- ### Phase 3: Comparison (HIGH RISK) **Components**: - Pixel diffing (OpenCV) - Gemini client (rate-limited) - Agentic vision workflow - Compare service **Why Third**: Most complex, external API dependency **Deliverable**: Full comparison with agentic vision **Risk**: HIGH - Gemini API capabilities need phase-specific research --- ### Phase 4: Operations **Components**: - Cleanup policies - MCP tool exposure - Query operations (list, find) **Why Fourth**: Builds on all previous components **Deliverable**: Complete MCP server with all tools --- ## Phase-Specific Research Flags ### Phase 2: Platform Capture **Verify**: axe CLI availability for macOS **Research**: iOS simctl screenshot capabilities **Test**: chrome-devtools MCP integration --- ### Phase 3: Gemini Integration (HIGH PRIORITY) **Test**: Actual Gemini API for iterative zoom/crop/annotate **Measure**: Token usage for typical comparison **Verify**: Code execution works in free tier **Calibrate**: Blur strength for 2px detection --- ## Open Questions ### MCP Server Stability **Question**: Can MCP server handle async operations without blocking? **Research Needed**: Test MCP Python SDK with async tool execution **Impact**: If not, need background job processing --- ### Screenshot Quality **Question**: What resolution/format for optimal comparison? **Research Needed**: Test PNG vs JPEG, resolution vs API cost **Impact**: Affects storage, API costs, accuracy --- ### Annotated Output Format **Question**: How to present annotated results to users? **Options**: - Overlay on screenshot - Side-by-side comparison - Separate HTML report - JSON with bounding box coordinates **Research Needed**: User testing with actual workflows --- *Last updated: 2025-02-04* *Confidence: HIGH (MCP architecture) / MEDIUM (Gemini integration) / LOW (iOS capture)*

Loading blob content...

Latest Blog Posts

Redis vs ioredis vs valkey-glide
By punkpeye on January 26, 2026.
benchmark
Redis
valkey
Quickstart: Publish an MCP Server to the MCP Registry
By punkpeye on January 24, 2026.
mcp
official reference mirror
Official MCP Registry Server.json Requirements
By punkpeye on January 24, 2026.
mcp
official reference mirror

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/bretbouchard/gemini-vision-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server

ARCHITECTURE.md•20.1 KiB