Where's Waldo Rick

STACK.md•12.1 KiB

# Stack Research - MCP Server + Agentic Vision

## Executive Summary

**Recommended Stack (2025)**: MCP Python SDK 1.26.0+ + MSS (screenshot) + OpenCV (diffing) + Gemini 3 Flash (agentic vision) + google-genai SDK

**Key Innovation**: Gemini 3 Flash's agentic vision capabilities enable iterative zoom/crop/annotate analysis that no existing visual regression tool provides.

**Free Tier Strategy**: Strategic screenshot usage (not mass capture) + aggressive caching + progressive resolution (start low, increase only when needed)

---

## Core Technologies

### MCP Server Framework

**MCP Python SDK** `v1.26.0+` (Official)
- **Rationale**: Official SDK, production-ready, FastMCP decorator-based development
- **Installation**: `pip install mcp`
- **Confidence**: HIGH - Official PyPI package (v1.26.0, 2026-01-24)
- **Docs**: https://github.com/modelcontextprotocol/python-sdk
- **Why NOT alternatives**:
  - `mcp-server-typescript`: TypeScript-based, not Python
  - Custom HTTP implementation: Reinventing the wheel, missing protocol updates

**Python** `3.12+`
- **Rationale**: Stable release, good performance, modern type hints
- **Confidence**: HIGH - Standard Python version
- **Note**: 3.11 also acceptable if dependencies require it

### Screenshot Capture

**MSS** (Multi-Screen Shot) `latest`
- **Rationale**: Fastest cross-platform screenshot library (16-47ms capture time)
- **Installation**: `pip install mss`
- **Confidence**: HIGH - PyPI official, multiple 2025 sources confirming 2.5x faster than PyAutoGUI
- **Performance**: 16-47ms vs PyAutoGUI's ~100ms
- **Why NOT alternatives**:
  - `PIL.ImageGrab`: Slower, less reliable
  - `pyautogui`: Bloated, includes unnecessary features
  - `screenshot`: Windows-only

**Platform-specific tools** (integrated via MSS adapters):
- **macOS**: Built-in MSS support (native)
- **iOS**: `simctl` (Xcode tooling) - needs separate research
- **Web**: `chrome-devtools` MCP server (existing)

### Image Processing

**OpenCV** `cv2` `4.10+`
- **Rationale**: Superior accuracy for pixel-level diffing, faster than Pillow
- **Installation**: `pip install opencv-python`
- **Confidence**: MEDIUM - Multiple 2025 blog posts comparing vs Pillow (no official docs)
- **Use cases**: Pixel diffing, heatmap generation, bounding box annotation
- **Why NOT Pillow only**: Slower, less accurate for pixel operations

**Pillow** `10.4+` (for metadata, I/O)
- **Rationale**: Needed alongside OpenCV for image save/load, metadata handling
- **Installation**: `pip install Pillow`
- **Confidence**: HIGH - Standard Python imaging library

**NumPy** `1.26+` (for array operations)
- **Rationale**: Required by OpenCV, efficient array operations
- **Installation**: `pip install numpy`
- **Confidence**: HIGH - Standard

### Agentic Vision API

**Gemini 3 Flash** `gemini-3-flash-preview`
- **Rationale**: Only API with agentic vision (iterative zoom/crop/annotate + code execution)
- **Installation**: `pip install google-genai`
- **SDK**: `google-genai` (GA May 2025, replaces deprecated `generative-ai-python`)
- **Confidence**: HIGH - Official Google AI documentation (2026-01-29)
- **Free Tier**: 15 requests/minute, 250K tokens/minute, 1K requests/day
- **Pricing**: $0.075/1M input tokens, $0.30/1M output tokens
- **Key Features**:
  - Multi-resolution processing (low, medium, high, ultra high)
  - Code execution for image manipulation
  - Iterative visual analysis
  - Conversational image editing
- **Why NOT alternatives**:
  - `GPT-4V`: No agentic vision, expensive
  - `Claude 3.5 Sonnet`: No iterative zoom/crop workflow
  - `OpenAI Vision`: Limited to static analysis

### Storage

**Filesystem** (built-in Python)
- **Rationale**: Simple, no external dependencies, fast
- **Location**: `<project>/.screenshots/phases/`
- **Format**: PNG for lossless compression
- **Metadata**: JSON sidecar files
- **Cleanup**: Python `glob` + `os.remove`
- **Why NOT database**:
  - Overkill for image storage
  - Adds dependency (SQLite, PostgreSQL)
  - Harder to inspect/debug manually

**Git LFS** (optional, for baseline images)
- **Rationale**: Store large images in Git without bloating repo
- **Installation**: `git lfs install`
- **Use case**: Reference baselines, not transient diffs
- **Confidence**: HIGH - Industry standard

---

## Integration Patterns

### MCP Server Structure

```python
from mcp.server.fastmcp import FastMCP

# capture_service and compare_service are the project's service-layer objects (defined elsewhere)
app = FastMCP("wheres-waldo-rick")


@app.tool()
async def visual_capture(name: str) -> str:
    """Capture screenshot and store with metadata"""
    # Delegate to service layer
    return capture_service.capture(name)


@app.tool()
async def visual_compare(before: str, after: str) -> str:
    """Compare two screenshots with agentic vision"""
    # Delegate to service layer
    return compare_service.compare(before, after)


if __name__ == "__main__":
    # FastMCP serves over stdio by default
    app.run()
```

### Gemini Integration Pattern

```python
import os

from google import genai
from google.genai import types
from PIL import Image

client = genai.Client(api_key=os.getenv("GEMINI_API_KEY"))

MODEL = "gemini-3-flash-preview"
CONFIG = types.GenerateContentConfig(
    temperature=0.0,  # Deterministic for diffing
    top_p=0.5,
    safety_settings=[
        types.SafetySetting(
            category=types.HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT,
            threshold=types.HarmBlockThreshold.BLOCK_NONE,
        )
    ],
)


async def analyze_with_agentic_vision(image_path: str, query: str) -> str:
    """Iterative visual analysis using Gemini"""
    image = Image.open(image_path)

    # Low-res pass
    response_1 = await client.aio.models.generate_content(
        model=MODEL,
        contents=[query, image, "What do you see? What should we investigate?"],
        config=CONFIG,
    )
    text = response_1.text or ""

    # Medium-res pass if needed
    if "zoom" in text.lower():
        # crop_region is a project helper (not shown) that maps the reply to a crop
        cropped = crop_region(image, text)
        response_2 = await client.aio.models.generate_content(
            model=MODEL, contents=[query, cropped, "Deeper analysis"], config=CONFIG
        )
        return response_2.text or ""

    # No zoom requested: the first pass is the final answer
    return text
```

### Screenshot Capture Pattern

```python
from pathlib import Path

import mss
import mss.tools


def capture_screenshot(name: str) -> str:
    """Capture screenshot using MSS"""
    path = Path(".screenshots/phases") / f"{name}.png"
    path.parent.mkdir(parents=True, exist_ok=True)  # Create the storage directory on first use

    with mss.mss() as sct:
        # monitors[0] is the combined virtual screen; monitors[1] is the primary monitor
        monitor = sct.monitors[1]
        screenshot = sct.grab(monitor)

        # Save as PNG
        mss.tools.to_png(screenshot.rgb, screenshot.size, output=str(path))

    return str(path)
```

---

## Development Tools

### Testing

**pytest** `8.0+` (unit tests)
- **Rationale**: Standard Python testing
- **Installation**: `pip install pytest pytest-asyncio`

**pytest-mock** `3.14+` (mock Gemini API)
- **Rationale**: Don't burn API credits during tests
- **Installation**: `pip install pytest-mock`

### Code Quality

**ruff** `0.6+` (linting + formatting, replaces black, flake8, isort)
- **Rationale**: Fast, all-in-one tool
- **Installation**: `pip install ruff`

**mypy** `1.10+` (type checking)
- **Rationale**: Catch type errors early
- **Installation**: `pip install mypy`

---

## Build & Deployment

### Build System

**uv** `0.5+` (fast Python package installer)
- **Rationale**: 10-100x faster than pip, reliable locking
- **Installation**: `curl -LsSf https://astral.sh/uv/install.sh | sh`
- **Confidence**: HIGH - Emerging standard (2025)

### Packaging

**`pyproject.toml`** (modern Python packaging)
- **Rationale**: Replaces setup.py, standard format
- **Build backend**: `hatchling` or `setuptools`

---

## Free Tier Considerations

### Gemini API Limits

**Free Tier** (as of 2026-01-29):
- **Rate limit**: 15 requests/minute
- **Tokens**: 250K tokens/minute
- **Daily**: 1K requests/day

**Cost Projection**:
- Screenshot analysis: ~1000 tokens/image
- Comparison: ~2000 tokens (2 images)
- Annotated output: ~500 tokens
- **Total per comparison**: ~3500 tokens ≈ $0.0004 at the rates listed above (3,000 input tokens at $0.075/1M plus 500 output tokens at $0.30/1M, treating the annotated output as output tokens)

**Strategy**:
1. **Cache aggressively**: Don't re-analyze the same screenshots
2. **Progressive resolution**: Start with low-res (cheaper), increase only if needed
3. **Batch when possible**: Compare multiple screenshots in a single request
4. **Rate limiting**: Token bucket algorithm to stay within 15 req/min

### Storage Costs

**Screenshot size**:
- PNG: 1-5 MB per screenshot (depending on resolution)
- Metadata JSON: <1 KB

**Phase storage**:
- 3 phases × 2 screenshots = 6 screenshots
- Estimated: 6-30 MB total per project

**Git LFS**:
- Free tier: 1 GB storage
- **Sufficient for**: roughly 30-170 projects with 3 phases each (1 GB divided by 6-30 MB per project)

---

## Alternatives Considered

### MCP Framework

❌ **mcp-server-typescript**
- Rationale: TypeScript-based
- Why NOT: Project is Python, mixing languages adds complexity

❌ **Custom HTTP server**
- Rationale: Build MCP protocol from scratch
- Why NOT: Reinventing the wheel, missing protocol updates, maintenance burden

### Screenshot Libraries

❌ **PIL.ImageGrab**
- Rationale: Built into Pillow
- Why NOT: Slower than MSS (2.5x), less reliable cross-platform

❌ **pyautogui**
- Rationale: Popular screenshot library
- Why NOT: Bloated (includes mouse/keyboard automation), slower (~100ms)

❌ **selenium**
- Rationale: Web automation can capture screenshots
- Why NOT: Heavy dependency, overkill for static screenshots

### Vision APIs

❌ **GPT-4V (OpenAI)**
- Rationale: Popular vision model
- Why NOT: No agentic vision, expensive ($0.01-0.03/image), no iterative zoom/crop

❌ **Claude 3.5 Sonnet (Anthropic)**
- Rationale: Strong vision capabilities
- Why NOT: No agentic vision API, limited to static analysis, more expensive

❌ **Stable Diffusion**
- Rationale: Image generation
- Why NOT: Wrong use case (generation vs analysis), no diffing capabilities

### Image Processing

❌ **Pillow-only (no OpenCV)**
- Rationale: Reduce dependencies
- Why NOT: Slower, less accurate for pixel operations, missing advanced features

---

## Open Questions

### iOS Screenshot Capture

**Status**: Need separate research for iOS simulator capture

**Current knowledge**:
- Xcode `simctl` command-line tool can capture screenshots
- Need to verify availability and programmatic access
- May require `xcrun simctl io booted screenshot` pattern

**Research needed**: Phase 2 should investigate iOS capture patterns

### Gemini Code Execution Limits

**Status**: Free tier rate limiting clear, but code execution specifics unclear

**Current knowledge**:
- 15 requests/minute applies to all API calls
- Code execution may have additional constraints
- Need to test actual token usage for iterative zoom/crop workflows

**Research needed**: Phase 3 should test code execution patterns and measure actual costs

### Optimal Blur Strength for Anti-Aliasing

**Status**: Industry standard is 2px Gaussian blur, but edge cases may differ

**Current knowledge**:
- 2px blur eliminates anti-aliasing noise
- May miss some edge cases (subtle 1px changes)
- Need calibration with regression suite

**Research needed**: Phase 2 should test blur strength against 1px, 2px, 3px changes

---

## Recommendations

### v1 Stack

**Core**:
- MCP Python SDK 1.26.0+
- MSS (screenshot capture)
- OpenCV 4.10+ (pixel diffing)
- Gemini 3 Flash (agentic vision)
- google-genai SDK (Gemini integration)

**Supporting**:
- Python 3.12+
- Pillow 10.4+ (metadata/I/O)
- NumPy 1.26+ (array operations)

**Development**:
- uv (package management)
- pytest (testing)
- ruff (linting/formatting)
- mypy (type checking)

### Phase 0 Dependencies

**Before Phase 1**:
- [ ] Get Gemini API key from https://ai.google.dev
- [ ] Verify free tier limits (15 req/min, 250K TPM, 1K RPD)
- [ ] Test basic agentic vision call (zoom/crop/annotate)
- [ ] Verify MSS availability on development machine
- [ ] Set up Git LFS for baseline storage

### Phase-Specific Research

**Phase 2** (Screenshot Capture):
- [ ] Verify axe CLI availability for macOS
- [ ] Research iOS simctl screenshot capabilities
- [ ] Test chrome-devtools MCP integration

**Phase 3** (Gemini Integration):
- [ ] Test actual Gemini API for iterative zoom/crop/annotate
- [ ] Measure token usage for typical comparison
- [ ] Verify code execution works in free tier
- [ ] Calibrate blur strength for 2px detection

---

*Last updated: 2025-02-04*

*Confidence: HIGH (MCP SDK, Gemini, MSS) / MEDIUM (OpenCV, image processing)*
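
The token bucket rate limiting listed under Free Tier Considerations could be sketched roughly as below. This is a minimal illustration using only the standard library, not the project's implementation: the `TokenBucket` class, its parameters, and the injectable `clock` argument are all hypothetical names introduced here.

```python
import time
from typing import Callable


class TokenBucket:
    """Hypothetical helper: holds up to `capacity` tokens, refilled continuously."""

    def __init__(
        self,
        capacity: float,
        refill_per_second: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock  # Injectable so tests can control time
        self.tokens = float(capacity)  # Start with a full bucket
        self.updated = clock()

    def _refill(self) -> None:
        # Add tokens for the time elapsed since the last update, capped at capacity
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated = now

    def try_acquire(self) -> bool:
        """Consume one token if available; return False when the caller must wait."""
        self._refill()
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

    def seconds_until_ready(self) -> float:
        """Time until the next token is available (0.0 if one is ready now)."""
        self._refill()
        return max(0.0, (1.0 - self.tokens) / self.refill_per_second)
```

For the 15 requests/minute limit, `TokenBucket(capacity=1, refill_per_second=15 / 60)` paces calls at one every 4 seconds. A larger capacity permits bursts on top of the steady refill, which can exceed 15 requests in a single minute, so a small capacity is the safer choice if the limit is enforced per window.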



