Where's Waldo Rick

gemini-vision-mcp
.planning

ROADMAP.md•17.1 KiB

# Roadmap — Where's Waldo Rick **Project**: Visual Regression MCP Server with Agentic Vision **Version**: 1.0 **Last Updated**: 2025-02-04 --- ## Overview This roadmap defines the execution plan for building "Where's Waldo Rick" - an MCP server for visual regression testing using Gemini 3 Flash's agentic vision capabilities. **Total Phases**: 5 **Total Plans**: 27 (5-7 plans per phase) **Total Requirements**: 34 **Estimated Duration**: 3-4 weeks (sequential execution) --- ## Phase 1: Foundation (3-4 days) **Goal**: Establish MCP server infrastructure, type system, and storage layer **Risk**: LOW **Dependencies**: None **Requirements Mapped**: 3 ### Plans (5) #### P1-PLAN-1: MCP Server Skeleton with FastMCP **Requirements**: REQ-RELIABILITY-03 (MCP JSON-RPC Compliance) **Deliverables**: - FastMCP project structure - Tool stubs: `visual_capture`, `visual_prepare`, `visual_compare`, `visual_cleanup` - Stderr-only logging (prevents JSON-RPC breaks) - Exception handler with graceful degradation - MCP manifest file - Success Criteria: `mcp list-tools` returns 4 tool stubs, stdout contains only valid JSON #### P1-PLAN-2: Type System and Domain Models **Requirements**: REQ-RELIABILITY-03 (MCP JSON-RPC Compliance) **Deliverables**: - Pydantic models: `Screenshot`, `Baseline`, `Comparison`, `ExpectedChanges` - Enums: `Platform` (macOS, iOS, Web), `Quality` (1x, 2x, 3x), `Format` (PNG, JPEG, WebP) - Validation schemas for expected changes input - Success Criteria: All models validate correctly, type hints complete #### P1-PLAN-3: Storage Service (Filesystem + JSON Index) **Requirements**: REQ-RELIABILITY-03 (MCP JSON-RPC Compliance) **Deliverables**: - `.screenshots/phases/` directory structure - JSON index schema: `index.json` (metadata for all screenshots) - Storage service class with CRUD operations - Atomic write operations (prevent corruption) - Success Criteria: Can create/read/update/delete screenshot metadata #### P1-PLAN-4: Configuration Management **Requirements**: REQ-USABILITY-01 (Zero-Config Baseline) **Deliverables**: - `.screenshots/config.json` schema - Default configuration: 2x resolution, PNG format, 2px threshold - Config loader with fallback to defaults - Environment variable support (GEMINI_API_KEY, etc.) - Success Criteria: Works out-of-box with no config file #### P1-PLAN-5: Utility Functions (Image Hashing, Path Helpers) **Requirements**: REQ-RELIABILITY-03 (MCP JSON-RPC Compliance) **Deliverables**: - SHA256 hash function for image pairs (cache keys) - Path helpers: `get_phase_dir()`, `get_baseline_path()` - File naming utilities: `{timestamp}-{phase-name}.png` - Success Criteria: Hashes consistent for identical images, paths resolve correctly --- ## Phase 2: Capture & Baselines (4-5 days) **Goal**: Screenshot capture from multiple platforms with baseline management **Risk**: MEDIUM **Dependencies**: Phase 1 complete **Requirements Mapped**: 9 ### Plans (6) #### P2-PLAN-1: Platform Detection and Auto-Selection **Requirements**: REQ-CAPTURE-01 (Multi-Platform Screenshot Capture) **Deliverables**: - Platform detection logic (check for iOS Simulator, macOS, chrome-devtools MCP) - Fallback to manual platform selection - Platform adapter interface - Success Criteria: Auto-detects platform in 95% of cases #### P2-PLAN-2: macOS Screenshot Adapter (MSS) **Requirements**: REQ-CAPTURE-01, REQ-PERF-01 (Capture Latency < 100ms) **Deliverables**: - MSS integration for fast capture (16-47ms) - Retina support (2x/3x scaling) - Format selection (PNG/JPEG/WebP) - Success Criteria: Capture time < 100ms measured #### P2-PLAN-3: iOS Simulator Adapter (simctl) **Requirements**: REQ-CAPTURE-01, REQ-PERF-01 (Capture Latency < 500ms) **Deliverables**: - simctl CLI wrapper for iOS screenshots - Device boot check (error if simulator not running) - Simulator selection (multiple devices support) - Success Criteria: Capture time < 500ms measured #### P2-PLAN-4: Web Screenshot Adapter (chrome-devtools MCP) **Requirements**: REQ-CAPTURE-01, REQ-PERF-01 (Capture Latency < 1s) **Deliverables**: - chrome-devtools MCP integration - Page screenshot capture (viewport or full page) - Multi-tab support - Success Criteria: Capture time < 1s measured #### P2-PLAN-5: Baseline Declaration with Expected Changes **Requirements**: REQ-BASELINE-01, REQ-BASELINE-02, REQ-EXPECTED-01 **Deliverables**: - `visual_prepare` tool implementation - Expected changes input format (JSON + natural language) - Baseline ID generation (timestamp-based) - Metadata storage (phase, description, expected changes) - Success Criteria: Can declare baseline with expected changes #### P2-PLAN-6: Screenshot Organization and Listing **Requirements**: REQ-CAPTURE-02 (Phase-Based Organization) **Deliverables**: - `visual_capture` tool implementation - Phase directory auto-creation - JSON index updates on each capture - `visual_list` tool for browsing screenshots - Success Criteria: Screenshots organized by phase, searchable --- ## Phase 3: Comparison Engine with Gemini Integration (7-10 days) ⚠️ HIGH RISK **Goal**: Pixel-perfect diffing with Gemini 3 Flash agentic vision analysis **Risk**: HIGH **Dependencies**: Phase 2 complete, Gemini API access **Requirements Mapped**: 15 ### Plans (7) #### P3-PLAN-1: OpenCV Integration for Pixel Diffing **Requirements**: REQ-COMPARE-01, REQ-PERF-02 (Comparison Speed < 2s) **Deliverables**: - OpenCV 4.10+ installation and import - Absolute difference calculation - Configurable threshold (1px, 2px, 3px) - Total changed pixels and percentage - Success Criteria: Comparison time < 2s for 1080p images #### P3-PLAN-2: Anti-Aliasing Noise Reduction **Requirements**: REQ-COMPARE-02, REQ-PERF-02 **Deliverables**: - 2px Gaussian blur preprocessing - Enable/disable toggle - Anti-aliasing test suite (10 known cases) - Success Criteria: >80% false positive reduction #### P3-PLAN-3: Heatmap Diff Visualization **Requirements**: REQ-COMPARE-03 **Deliverables**: - Color gradient generator (yellow → orange → red) - Overlay composer (original + heatmap) - Adjustable opacity (20-80%) - Export functionality - Success Criteria: Heatmaps clearly show change intensity #### P3-PLAN-4: Gemini 3 Flash Client Integration 🔥 CRITICAL PATH **Requirements**: REQ-AGENTIC-01, REQ-PERF-03 (Agentic Vision Response < 10s) **Deliverables**: - google-genai SDK setup - gemini-3-flash-preview model configuration - Agentic vision API wrapper (multi-resolution) - Code execution for image manipulation (crop, annotate) - Error handling with retry logic - Success Criteria: Successful agentic vision call with annotated output **Risk Mitigation**: - Start with low-resolution analysis (720p) to minimize cost - Implement rate limiter BEFORE first API call - Have fallback to pixel-only diffing if Gemini fails #### P3-PLAN-5: Token Bucket Rate Limiter **Requirements**: REQ-AGENTIC-06, REQ-RELIABILITY-01 (Rate Limit Compliance) **Deliverables**: - Token bucket algorithm (15 tokens/min refill rate) - Request queue with position tracking - Queue status display (position, ETA) - Burst allowance (up to 15 if full bucket) - Backpressure notification (queue > 5) - Success Criteria: 99.9% compliance with 15 req/min limit #### P3-PLAN-6: Intelligent Change Interpretation **Requirements**: REQ-AGENTIC-02, REQ-EXPECTED-02 **Deliverables**: - Gemini prompt engineering for UI element analysis - Natural language change descriptions ("padding increased by 2px on top of card") - Confidence scoring (0-100%) - Fallback to pixel coordinates if interpretation fails - Success Criteria: Interpretations match human assessment in >80% of cases #### P3-PLAN-7: Intended vs Unintended Classification **Requirements**: REQ-EXPECTED-02, REQ-EXPECTED-03, REQ-VALIDATION-01, REQ-VALIDATION-02 **Deliverables**: - Expected changes parser (natural language → structured) - Classification algorithm (match by proximity, description, bounding box) - Visual distinction: green (intended), red (unintended), yellow (ambiguous) - Severity scoring: critical, major, minor - Regression alerting with summary counts - Pass/fail determination based on thresholds - Success Criteria: Correctly classifies known intended/unintended changes --- ## Phase 4: Operations & Validation (3-4 days) **Goal**: Robust error handling, cleanup, reporting, and caching **Risk**: LOW **Dependencies**: Phase 3 complete **Requirements Mapped**: 5 ### Plans (4) #### P4-PLAN-1: Aggressive Caching System **Requirements**: REQ-AGENTIC-04, REQ-RELIABILITY-02 (Cache Hit Rate > 60%) **Deliverables**: - Hash-based cache key (image pair + threshold) - Cache storage (filesystem: `.screenshots/cache/`) - Cache invalidation on config change - Cache statistics (hit rate, API calls saved) - Success Criteria: >60% cache hit rate in repeated comparisons #### P4-PLAN-2: Progressive Resolution Strategy **Requirements**: REQ-AGENTIC-05 **Deliverables**: - Resolution ladder: 720p → 1080p → 4K → 8K - Confidence threshold check (upgrade if < 80%) - Token usage tracking per comparison - Cost estimation display - Success Criteria: Average token savings > 30% vs full-resolution #### P4-PLAN-3: Comparison Report Generation **Requirements**: REQ-VALIDATION-03, REQ-BASELINE-03 **Deliverables**: - Markdown report template - Embedded screenshots with annotations - Pass/fail badge with criteria - Change summary table (intended vs unintended) - Report storage (`.screenshots/reports/`) - Success Criteria: Reports readable and actionable #### P4-PLAN-4: Cleanup and Maintenance Tools **Requirements**: REQ-USABILITY-02 (Clear Error Messages) **Deliverables**: - `visual_cleanup` tool implementation - 7-day retention policy (configurable) - Old screenshot deletion - Cache clearing - Storage stats display - Actionable error messages for all failure modes - Success Criteria: Storage doesn't bloat, errors have remediation steps --- ## Phase 5: Polish & Conversational Investigation (3-4 days) **Goal**: Multi-turn conversational UI with progressive disclosure **Risk**: LOW-MEDIUM **Dependencies**: Phase 4 complete **Requirements Mapped**: 5 ### Plans (5) #### P5-PLAN-1: Multi-Turn Conversation Context **Requirements**: REQ-CONV-01, REQ-AGENTIC-03 **Deliverables**: - Conversation state management (5-turn limit) - Context persistence (comparison, questions, answers) - Turn counter with graceful exit - Conversation summary on exit - Success Criteria: Can maintain context across 5 turns #### P5-PLAN-2: Focused Follow-Up Analysis **Requirements**: REQ-CONV-02 **Deliverables**: - Follow-up question parser ("Where exactly did the padding change?") - Focused region analysis (zoom, crop) - Annotated screenshot generation (circles, arrows) - Bounding box coordinate returns - Success Criteria: Follow-up questions generate targeted annotations #### P5-PLAN-3: Conversational UI Pattern Matching **Requirements**: REQ-CONV-03 **Deliverables**: - Pattern recognition: "Zoom in on {region}", "Crop to {element}", "Annotate {change}", "Measure {dimension}", "Compare {region A} with {region B}" - Fuzzy intent matching - Fallback to generic analysis - Success Criteria: Recognizes 5+ common patterns #### P5-PLAN-4: Conversation History Persistence **Requirements**: REQ-CONV-04 **Deliverables**: - Conversation storage (`.screenshots/conversations/`) - JSON format (timestamp, question, answer, annotations) - Search by comparison ID, date, content - Export to markdown - Success Criteria: Conversations searchable and auditable #### P5-PLAN-5: Progressive Disclosure and Usability Polish **Requirements**: REQ-USABILITY-03 (Progressive Disclosure) **Deliverables**: - Opt-in flags for advanced features (conversational mode, custom thresholds) - Simplified default workflow (capture → compare → report) - Help text and examples - Success Criteria: New users can capture and compare without reading docs --- ## Phase Summary | Phase | Name | Plans | Requirements | Risk | Duration | |-------|------|-------|--------------|------|----------| | **1** | Foundation | 5 | 3 | LOW | 3-4 days | | **2** | Capture & Baselines | 6 | 9 | MEDIUM | 4-5 days | | **3** | Comparison Engine | 7 | 15 | **HIGH** | 7-10 days | | **4** | Operations & Validation | 4 | 5 | LOW | 3-4 days | | **5** | Polish & Conversational | 5 | 5 | LOW-MEDIUM | 3-4 days | | **Total** | | **27** | **34** | | **20-27 days** | --- ## Success Criteria by Phase ### Phase 1 Success - ✅ MCP server starts and registers 4 tool stubs - ✅ Stdout contains only valid JSON (no pollution) - ✅ Storage service can create/read/delete metadata - ✅ Configuration works with zero setup ### Phase 2 Success - ✅ Can capture screenshots from macOS, iOS Simulator, and web - ✅ Capture times meet performance targets (< 100ms macOS, < 500ms iOS, < 1s web) - ✅ Screenshots organized by phase with searchable index - ✅ Baselines declared with expected changes ### Phase 3 Success (CRITICAL) - ✅ Pixel-level comparison completes in < 2s - ✅ Anti-aliasing false positives reduced by >80% - ✅ Gemini 3 Flash integration returns annotated analysis - ✅ Rate limiter maintains 99.9% compliance with 15 req/min - ✅ Intended vs unintended classification works accurately - ✅ **Risk**: If Gemini integration fails, fallback to pixel-only diffing ### Phase 4 Success - ✅ Cache hit rate > 60% - ✅ Progressive resolution saves > 30% tokens - ✅ Comparison reports are readable and actionable - ✅ Storage doesn't bloat (cleanup works) ### Phase 5 Success - ✅ Multi-turn conversations maintain context across 5 turns - ✅ Follow-up questions generate targeted annotations - ✅ Conversational patterns recognized (zoom, crop, annotate, measure, compare) - ✅ New users can use default workflow without reading docs --- ## Risk Mitigation Strategies ### 🔥 Critical Risk: Phase 3 Gemini Integration **Risk**: Gemini API fails, rate limits exceeded, or costs blow up **Mitigation**: 1. **Token Bucket First**: Implement rate limiter BEFORE first API call 2. **Low-Resolution First**: Start with 720p, upgrade only if needed 3. **Aggressive Caching**: Cache every analysis result 4. **Fallback Strategy**: Pixel-only diffing if Gemini fails 5. **Cost Tracking**: Display token usage and estimated cost per comparison 6. **Free Tier Monitoring**: Alert at 80% of daily/monthly limits ### Secondary Risk: Anti-Aliasing False Positives **Risk**: Too many false positives from rendering artifacts **Mitigation**: 1. **2px Gaussian Blur**: Standard preprocessing step 2. **Calibration Suite**: Test with 10 known anti-aliasing cases 3. **Smart Thresholding**: Adjust threshold based on image complexity 4. **User Override**: Allow manual threshold adjustment ### Tertiary Risk: Storage Bloat **Risk**: Screenshots consume too much disk space **Mitigation**: 1. **7-Day Retention**: Auto-cleanup old screenshots 2. **Git LFS**: Use Git LFS for version control 3. **Compression**: JPEG/WebP options for large screenshots 4. **Storage Stats**: Display disk usage and cleanup recommendations --- ## Dependencies ### External Dependencies - **MCP Python SDK 1.26.0+**: MCP server framework - **MSS**: Fast screenshot capture for macOS - **OpenCV 4.10+**: Pixel-level diffing - **google-genai SDK**: Gemini 3 Flash integration - **Pydantic**: Type validation ### Platform Dependencies - **macOS**: MSS for screenshots - **iOS Simulator**: simctl CLI tool - **Web**: chrome-devtools MCP server ### API Keys Required - **GEMINI_API_KEY**: For Gemini 3 Flash access (free tier) --- ## Execution Strategy ### Sequential Execution (Required for z.ai 1-concurrent limit) - All 27 plans executed sequentially - Each plan must complete successfully before next starts - Checkpoints after each phase (commit and tag) ### Milestones 1. **M1 - Foundation Complete**: Phase 1 done, MCP server functional 2. **M2 - Capture Complete**: Phase 2 done, screenshots can be captured 3. **M3 - Comparison Complete** 🔥: Phase 3 done, Gemini integration working (CRITICAL) 4. **M4 - Operations Complete**: Phase 4 done, caching and reporting working 5. **M5 - v1 Ready**: Phase 5 done, conversational investigation working ### Rollback Strategy - Each phase committed separately with git tags - If Phase 3 fails, rollback to Phase 2 and use pixel-only diffing - If rate limits exceeded, implement more aggressive caching --- ## Post-v1 Considerations (Future) ### Potential v2 Features - Historical comparison across all phases - Visual change timeline/gifs - A/B test comparison - Performance regression (screenshot + timing) - Accessibility regression overlay - Automated regression test suite ### Technical Debt to Track - Migration path if Gemini free tier changes - Alternative vision models (OpenAI GPT-4V, Claude 3.5 Sonnet) - Cloud storage option for teams - CI/CD integration (if rate limits allow) --- ## Open Questions 1. **Gemini API Key Management**: How to securely store and rotate API keys? 2. **Cost Monitoring**: Should we implement daily/monthly budget limits? 3. **Fallback Mode**: If Gemini is unavailable, how degraded should the experience be? 4. **Multi-User Support**: Should v1 support multiple users sharing screenshot storage? 5. **Platform Expansion**: Should we add Android or Windows support in v1? --- **Next Step**: Begin Phase 1 execution with `/gsd:execute-phase 1` **Critical Success Factor**: Phase 3 (Gemini Integration) - this phase determines project viability

Loading blob content...

Latest Blog Posts

Redis vs ioredis vs valkey-glide
By punkpeye on January 26, 2026.
benchmark
Redis
valkey
Quickstart: Publish an MCP Server to the MCP Registry
By punkpeye on January 24, 2026.
mcp
official reference mirror
Official MCP Registry Server.json Requirements
By punkpeye on January 24, 2026.
mcp
official reference mirror

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/bretbouchard/gemini-vision-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server

ROADMAP.md•17.1 KiB