# Features Research - Visual Regression Tools
## Executive Summary
**Table Stakes Identified**: Screenshot capture, side-by-side comparison, diff visualization (heatmap), baseline management, pass/fail determination.
**Key Differentiator**: **Agentic vision analysis using Gemini 3 Flash** - No existing tool provides iterative, intelligent analysis that can distinguish intended vs unintended changes.
**Anti-Features Validated**: Mass CI/CD integration, cross-browser testing matrices, pixel-perfect strict mode, automated screenshot scheduling.
**Competitive Gap**: All existing tools (Playwright, Chromatic, Applitools) focus on automated CI/CD workflows with static reports. None provide conversational, on-demand analysis for AI agents.
---
## Feature Categories
### Table Stakes (Must-Have)
Users expect these features from any visual regression tool. Without these, the tool is non-viable.
#### 1. Screenshot Capture
**What**: Ability to capture screenshots from various sources
**Capabilities**:
- Multiple input formats (PNG, JPG, PDF)
- Platform support (macOS, iOS, Web)
- Manual trigger (on-demand capture)
- Metadata tagging (phase name, timestamp, description)
**Evidence**:
- Playwright: `page.screenshot()` is core feature
- Chromatic: Screenshot capture is first step in workflow
- Applitools: Multi-platform capture is foundational
**Complexity**: LOW - Well-understood problem, standard solutions
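
As a sketch of what on-demand capture with metadata tagging might look like for the web case, assuming Playwright's Python API; the `shots/` directory and JSON sidecar layout are illustrative, not a decided design:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

from playwright.sync_api import sync_playwright


def capture(url: str, phase: str, description: str, out_dir: str = "shots") -> Path:
    """Capture a full-page screenshot and write a metadata sidecar next to it."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    png_path = out / f"{phase}-{stamp}.png"

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        page.screenshot(path=str(png_path), full_page=True)
        browser.close()

    # Metadata tagging: phase name, timestamp, description, source URL
    png_path.with_suffix(".json").write_text(json.dumps(
        {"phase": phase, "timestamp": stamp, "description": description, "source": url},
        indent=2,
    ))
    return png_path
```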
---
#### 2. Side-by-Side Comparison
**What**: Visual display of before/after screenshots
**Capabilities**:
- Synchronized scrolling
- Zoomable views
- Overlay mode (fade between images)
- Pixel-level inspection
**Evidence**:
- Playwright: Built-in diff reporter with side-by-side view
- Chromatic: "Diff view" is primary UI
- All tools: Side-by-side is expected default
**Complexity**: LOW - Standard UI pattern
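
A minimal composition sketch using Pillow; the gutter width and white background are arbitrary presentation choices:

```python
from PIL import Image


def side_by_side(before_path: str, after_path: str, out_path: str, gutter: int = 16) -> None:
    """Compose before/after screenshots on one canvas for quick visual review."""
    before = Image.open(before_path).convert("RGB")
    after = Image.open(after_path).convert("RGB")
    canvas = Image.new(
        "RGB",
        (before.width + gutter + after.width, max(before.height, after.height)),
        "white",
    )
    canvas.paste(before, (0, 0))
    canvas.paste(after, (before.width + gutter, 0))
    canvas.save(out_path)
```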
---
#### 3. Diff Visualization (Heatmap)
**What**: Visual representation of pixel differences
**Capabilities**:
- Color-coded heatmaps (red = changed, green = unchanged)
- Adjustable diff threshold
- Ignore regions (dynamic content)
- Bounding boxes around changes
**Evidence**:
- Playwright: `expect(page).toHaveScreenshot()` writes a diff image highlighting changed pixels
- Chromatic: "Diff map" shows pixel changes
- Industry standard: Red/green heatmap is universal pattern
**Complexity**: MEDIUM - Requires pixel diffing algorithm
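
A minimal pixel-diff heatmap sketch with NumPy and Pillow; the per-channel tolerance of 16 levels is an illustrative default, not a calibrated value:

```python
import numpy as np
from PIL import Image


def diff_heatmap(before_path: str, after_path: str, out_path: str, tol: int = 16) -> float:
    """Overlay changed pixels in red on the new screenshot; return the changed-pixel ratio."""
    before = np.asarray(Image.open(before_path).convert("RGB"), dtype=np.int16)
    after = np.asarray(Image.open(after_path).convert("RGB"), dtype=np.int16)
    assert before.shape == after.shape, "screenshots must share dimensions"

    # A pixel counts as changed if any channel moved by more than `tol` levels
    changed = (np.abs(before - after) > tol).any(axis=-1)

    overlay = after.astype(np.uint8).copy()
    overlay[changed] = (255, 0, 0)            # red = changed, original pixels elsewhere
    Image.fromarray(overlay).save(out_path)
    return float(changed.mean())              # fraction of pixels that changed
```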
---
#### 4. Baseline Management
**What**: Store and retrieve baseline screenshots for comparison
**Capabilities**:
- Save baseline images
- Update baseline (accept changes)
- Restore previous baseline
- Baseline versioning
**Evidence**:
- Playwright: `toMatchSnapshot()` stores baselines in a `*-snapshots` directory next to the test file
- Chromatic: "Baseline" concept is core to workflow
- Applitools: "Baseline" vs "Checkpoint" terminology
**Complexity**: MEDIUM - Requires storage strategy, cleanup policies
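
One possible shape for baseline save/update/restore, assuming a local `baselines/<name>/` directory with timestamped archives; the layout and retention policy are open questions:

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path

BASELINE_ROOT = Path("baselines")   # storage layout is illustrative, not a decided design


def baseline_path(name: str) -> Path:
    return BASELINE_ROOT / name / "current.png"


def update_baseline(name: str, new_screenshot: Path) -> None:
    """Accept changes: archive the old baseline, then promote the new screenshot."""
    current = baseline_path(name)
    current.parent.mkdir(parents=True, exist_ok=True)
    if current.exists():
        stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
        shutil.copy2(current, current.parent / f"archived-{stamp}.png")
    shutil.copy2(new_screenshot, current)


def restore_previous(name: str) -> bool:
    """Roll back to the most recently archived baseline, if any exists."""
    archives = sorted(baseline_path(name).parent.glob("archived-*.png"))
    if not archives:
        return False
    shutil.copy2(archives[-1], baseline_path(name))
    return True
```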
---
#### 5. Pass/Fail Determination
**What**: Boolean result + threshold configuration
**Capabilities**:
- Configurable diff threshold (0.1% - 5%)
- Pass if below threshold
- Fail if above threshold
- Report pass/fail status
**Evidence**:
- Playwright: `toHaveScreenshot()` accepts `maxDiffPixels` / `maxDiffPixelRatio` threshold options
- Chromatic: "Changes detected" = fail
- All tools: Threshold-based pass/fail is standard
**Complexity**: LOW - Simple comparison logic
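
The pass/fail step reduces to comparing the measured changed-pixel ratio against the configured threshold; a sketch, with the 0.1% default taken from the range above:

```python
def verdict(changed_ratio: float, threshold: float = 0.001) -> dict:
    """Pass when the changed-pixel ratio stays at or below the threshold (default 0.1%)."""
    return {
        "passed": changed_ratio <= threshold,
        "changed_ratio": changed_ratio,
        "threshold": threshold,
    }

# e.g. verdict(0.0004)               -> passed (0.04% <= 0.1%)
#      verdict(0.02, threshold=0.01) -> failed (2% > 1%)
```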
---
### Differentiators (Competitive Advantage)
Features that make this tool unique and superior to existing solutions.
#### 1. Agentic Vision Analysis ⭐
**What**: Iterative, intelligent visual analysis using Gemini 3 Flash
**Capabilities**:
- Iterative zoom/crop/annotate workflow
- Conversational investigation ("What changed in the header?")
- Context-aware understanding ("Not the card, the box it's in")
- Distinguish intended vs unintended changes
- Multi-resolution analysis (low → medium → high → ultra)
**Evidence**:
- **No existing tool offers this** - Verified through web search
- Playwright: Static diff only, no AI analysis
- Chromatic: Manual review required for interpretation
- Applitools: AI-powered but focused on layout, not conversational
**Why It Matters**:
- Current tools show WHAT changed, not WHY or IF it's intentional
- Requires human interpretation of every diff
- Can't handle nuanced conversations ("not that box, the child item")
**Complexity**: HIGH - Emerging technology, no established patterns
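
A rough sketch of what the Gemini integration could look like via the `google-generativeai` Python SDK. The model ID is a stand-in (this doc targets Gemini 3 Flash), the prompts are invented, and whether the iterative zoom/crop loop works at useful quality is exactly the Phase 3 open question below:

```python
import os

import google.generativeai as genai
from PIL import Image

genai.configure(api_key=os.environ["GEMINI_API_KEY"])   # env var name is a project choice
model = genai.GenerativeModel("gemini-1.5-flash")        # stand-in for whichever Flash model is chosen


def analyze_pair(before_path: str, after_path: str, question: str) -> str:
    """Single-shot comparison: send both screenshots plus a question, get prose back."""
    before, after = Image.open(before_path), Image.open(after_path)
    response = model.generate_content([
        "The first image is the baseline, the second is the new screenshot.",
        before,
        after,
        question,
    ])
    return response.text


def zoom_and_reask(image_path: str, box: tuple[int, int, int, int], question: str) -> str:
    """One iteration of the zoom/crop loop: crop a region of interest and re-query."""
    crop = Image.open(image_path).crop(box)   # box = (left, upper, right, lower)
    return model.generate_content([crop, question]).text
```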
---
#### 2. Expected vs Unintended Change Detection
**What**: Validate changes against declared expectations
**Capabilities**:
- Declare expected changes before work
- Compare actual changes against expectations
- Mark ✅ expected changes
- Flag ⚠️ unintended regressions
- Generate regression report
**Evidence**:
- **No existing tool offers this** - Verified through web search
- Playwright: All diffs are failures, no "expected" concept
- Chromatic: Manual review required to determine intent
- Current workflow: Human must interpret every change
**Why It Matters**:
- Catches regressions automatically
- Reduces false positives (expected changes aren't errors)
- Enables "visual TDD" - declare expectations, verify against them
**Complexity**: HIGH - Requires semantic understanding of changes
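
One way this could be wired up is by folding declared expectations into the comparison prompt and asking the model to label each observed difference; the dataclass and prompt wording below are assumptions, not a settled design:

```python
from dataclasses import dataclass


@dataclass
class Expectation:
    region: str   # e.g. "primary button"
    change: str   # e.g. "background color changes from blue to green"


def build_review_prompt(expectations: list[Expectation]) -> str:
    """Ask the model to classify each observed difference as expected or unintended."""
    declared = "\n".join(f"- {e.region}: {e.change}" for e in expectations)
    return (
        "Compare the two screenshots. The following changes were declared in advance:\n"
        f"{declared}\n"
        "List every visual difference you find. Mark it EXPECTED if it matches a declared "
        "change; otherwise mark it UNINTENDED and describe the likely regression."
    )
```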
---
#### 3. Conversational Investigation
**What**: Natural language queries about visual changes
**Capabilities**:
- Ask: "What changed in the navigation?"
- Ask: "Show me the 2 pixels that were added"
- Ask: "Is the title aligned with the button?"
- Follow-up questions based on answers
**Evidence**:
- **No existing tool offers this** - Verified through web search
- Playwright: No query interface, static reports only
- Chromatic: Manual visual inspection required
- Current tools: Cannot answer questions about diffs
**Why It Matters**:
- Enables nuanced UI conversations
- Reduces back-and-forth ("what specifically changed?")
- Proof of work with visual specificity
**Complexity**: VERY HIGH - Requires agentic vision + natural language understanding
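
If the chat-session API holds up, a multi-turn investigation might look like the sketch below (again via `google-generativeai`; model ID, file names, and prompts are placeholders, and conversation depth is an open question flagged later):

```python
import os

import google.generativeai as genai
from PIL import Image

genai.configure(api_key=os.environ["GEMINI_API_KEY"])   # env var name is a project choice
model = genai.GenerativeModel("gemini-1.5-flash")        # stand-in model ID

# A chat session keeps earlier turns in context, so follow-ups can refine the question
chat = model.start_chat()
before, after = Image.open("baseline.png"), Image.open("current.png")

first = chat.send_message([before, after, "What changed in the navigation?"])
followup = chat.send_message("Not the logo, the navigation items. Show me the spacing changes.")
print(first.text)
print(followup.text)
```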
---
#### 4. Regional Analysis ("The Box It's In")
**What**: Investigate specific regions with progressive zoom
**Capabilities**:
- Focus on specific UI element
- Zoom into child elements
- Analyze component hierarchy
- Ignore parent, inspect children
**Evidence**:
- **No existing tool offers this** - Verified through web search
- Playwright: Can crop but not iteratively investigate
- Current tools: Static region-of-interest only
**Why It Matters**:
- Enables "not that box, the child item" conversations
- Analyzes complex component hierarchies
- Proof of specific element changes
**Complexity**: HIGH - Requires agentic vision with iterative zoom
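
A speculative sketch of the progressive-zoom loop: ask the model for a bounding box of the target element, crop to it, and descend. It assumes the model can be coaxed into returning clean JSON pixel bounds, which Phase 3 still needs to verify:

```python
import json

from PIL import Image


def progressive_zoom(model, image_path: str, target: str, depth: int = 3) -> Image.Image:
    """Ask for the bounding box of `target`, crop to it, and descend one level per turn."""
    img = Image.open(image_path)
    for _ in range(depth):
        reply = model.generate_content([
            img,
            "Return only a JSON object with integer 'left', 'upper', 'right', 'lower' "
            f"pixel bounds of: {target}",
        ])
        box = json.loads(reply.text)              # assumes the model returned bare JSON
        img = img.crop((box["left"], box["upper"], box["right"], box["lower"]))
    return img
```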
---
### Anti-Features (Deliberately NOT Building)
Features that are commonly requested but problematic for this use case.
#### 1. Mass CI/CD Integration
**What**: Automated screenshot testing in CI/CD pipeline
**Why NOT**:
- Burns through free tier API limits (15 req/min)
- Not the target use case (AI agents, not automated pipelines)
- Adds complexity without value for manual workflows
**Evidence**:
- Playwright: integrates cleanly with CI/CD, but running Gemini analysis at that cadence would burn API credits
- Chromatic: Designed for CI/CD but expensive
- Our goal: Manual/on-demand for AI agents, not automated
---
#### 2. Cross-Browser Testing Matrix
**What**: Test across multiple browsers (Chrome, Firefox, Safari)
**Why NOT**:
- Multiplies API costs (browser × screenshot × comparison)
- Not the target use case (focused on macOS/iOS/Web dev)
- Adds complexity (browser automation, driver management)
**Evidence**:
- Playwright: Supports multi-browser but heavy setup
- Chromatic: Multi-browser is premium feature
- Our goal: Platform-specific (macOS app, iOS app, or web), not cross-browser
---
#### 3. Pixel-Perfect Strict Mode
**What**: Fail on ANY pixel difference, no threshold
**Why NOT**:
- Anti-aliasing noise creates false positives
- Font rendering varies between machines
- Dynamic content (timestamps) always changes
- Makes tool unusable in practice
**Evidence**:
- Applitools: its pixel-to-pixel "Exact" match level is documented as noisy; even Applitools steers users toward "Strict"
- Industry best practice: Use thresholds (0.1% - 1%)
- Our goal: Detect real regressions, not anti-aliasing differences
---
#### 4. Automated Screenshot Scheduling
**What**: Periodic screenshot capture (cron-based)
**Why NOT**:
- Burns through free tier limits
- Generates storage bloat
- Not the target use case (manual/on-demand)
**Evidence**:
- Some tools offer "monitoring" features
- Our goal: Manual trigger at major milestones, not automated monitoring
---
## Competitive Analysis
### Playwright (Microsoft)
**Strengths**:
- Free, open-source
- Excellent CI/CD integration
- Cross-browser support
**Weaknesses**:
- Static diff only, no AI analysis
- Cannot distinguish intended vs unintended changes
- No conversational interface
- Heatmap requires manual interpretation
**Gap**: No agentic vision, no intelligent change interpretation
### Chromatic
**Strengths**:
- Git integration (branch-based testing)
- Good UI for diff visualization
- Team collaboration features
**Weaknesses**:
- Expensive for teams ($20+/user/month)
- Static diff only
- No AI-powered analysis
- Focused on React/web components
**Gap**: No conversational investigation, no expected change validation
### Applitools (Eyes)
**Strengths**:
- AI-powered layout matching
- Cross-platform support
- Enterprise features
**Weaknesses**:
- Very expensive ($100+/month)
- AI is for layout matching, not conversational
- Complex setup
- Overkill for manual workflows
**Gap**: No agentic vision, no expected vs unintended detection
---
## Feature Dependencies
### Core Dependencies
**Screenshot Capture** ← Must exist before:
- Side-by-side comparison
- Diff visualization
- Baseline management
**Diff Visualization** ← Depends on:
- Screenshot capture
- Pixel diffing algorithm
**Baseline Management** ← Depends on:
- Screenshot capture
- Storage strategy
### Agentic Vision Dependencies
**Agentic Vision Analysis** ← Depends on:
- Screenshot capture
- Gemini API integration
- Diff visualization (for context)
**Expected vs Unintended** ← Depends on:
- Agentic vision analysis
- User-declared expectations
- Diff visualization
**Conversational Investigation** ← Depends on:
- Agentic vision analysis
- Natural language understanding
- Multi-resolution analysis
### Build Order
1. **Screenshot capture** (foundation)
2. **Diff visualization** (needs screenshots)
3. **Baseline management** (needs screenshots + storage)
4. **Agentic vision** (needs all above)
5. **Expected vs unintended** (needs agentic vision)
6. **Conversational investigation** (needs agentic vision)
---
## MVP Definition
### v1 MVP (Minimum Viable Product)
**Table Stakes** (Must have):
- ✅ Screenshot capture (manual trigger)
- ✅ Side-by-side comparison
- ✅ Diff visualization (heatmap)
- ✅ Baseline management (save, retrieve, update)
- ✅ Pass/fail with threshold
**Differentiators** (Competitive advantage):
- ✅ Agentic vision analysis (iterative zoom/crop/annotate)
- ✅ Expected vs unintended change detection
- ⚠️ Conversational investigation (basic - single question)
**Out of Scope** (Anti-features):
- ❌ CI/CD integration
- ❌ Cross-browser testing
- ❌ Automated scheduling
- ❌ Pixel-perfect strict mode
### v1+ (Post-MVP)
**Advanced Features**:
- Full conversational investigation (multi-turn dialog)
- Regional analysis ("the box it's in")
- Historical comparison across all phases
- Element-level analysis with bounding boxes
- Context-aware ignoring (semantic dynamic content)
---
## False Positive Reduction Strategies
### Problem: Anti-Aliasing Noise
**Symptom**: Font rendering creates pixel-level noise, false positives
**Solution**: 2-pixel Gaussian blur preprocessing
- **Evidence**: Industry standard (Stack Overflow, GitHub issues)
- **Implementation**: OpenCV `cv2.GaussianBlur(img, (5, 5), 2)` (5×5 kernel, sigma 2)
- **Effect**: Eliminates anti-aliasing noise while preserving real changes
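
A sketch of the blur-then-diff preprocessing step with OpenCV, using the kernel and sigma quoted above; the change tolerance of 16 levels is illustrative:

```python
import cv2
import numpy as np


def blurred_diff_ratio(before_path: str, after_path: str, tol: int = 16) -> float:
    """Blur both screenshots before diffing so anti-aliasing noise falls below tolerance."""
    before = cv2.imread(before_path)
    after = cv2.imread(after_path)
    before = cv2.GaussianBlur(before, (5, 5), 2)   # 5x5 kernel, sigma = 2
    after = cv2.GaussianBlur(after, (5, 5), 2)
    diff = np.abs(before.astype(np.int16) - after.astype(np.int16))
    changed = (diff > tol).any(axis=-1)
    return float(changed.mean())
```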
### Problem: Dynamic Content
**Symptom**: Timestamps, counters, animations always change
**Solution**: Semantic understanding (agentic vision advantage)
- **Traditional tools**: Manual masking, ignore regions
- **Our approach**: AI understands "timestamp changed" is not a regression
- **Implementation**: Gemini 3 Flash semantic analysis
### Problem: Layout Shifts
**Symptom**: Entire UI shifts 1px, hundreds of "changes" detected
**Solution**: Group adjacent changes, identify root cause
- **Traditional tools**: Hundreds of red boxes, overwhelming
- **Our approach**: "Card padding increased by 2px, causing title to shift 2px"
- **Implementation**: Agentic vision pattern recognition
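
Grouping is the mechanical half of this: adjacent changed pixels can be collapsed into a handful of bounding boxes with OpenCV connected components (sketch below, with an arbitrary speckle cutoff); explaining the root cause remains the agentic-vision step:

```python
import cv2
import numpy as np


def group_changes(changed_mask: np.ndarray, min_area: int = 20) -> list[tuple[int, int, int, int]]:
    """Collapse adjacent changed pixels into bounding boxes (x, y, w, h)."""
    mask = changed_mask.astype(np.uint8) * 255
    # Default 8-connectivity merges diagonally touching pixels into one region
    num, _labels, stats, _centroids = cv2.connectedComponentsWithStats(mask)
    boxes = []
    for i in range(1, num):                       # label 0 is the background
        x, y, w, h, area = stats[i]
        if area >= min_area:                      # drop isolated speckle noise
            boxes.append((int(x), int(y), int(w), int(h)))
    return boxes
```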
---
## Feature Complexity Assessment
| Feature | Complexity | Rationale |
|---------|-----------|-----------|
| Screenshot Capture | LOW | Standard APIs, well-understood |
| Side-by-Side Comparison | LOW | Basic UI pattern |
| Diff Visualization | MEDIUM | Pixel diffing algorithm, heatmap generation |
| Baseline Management | MEDIUM | Storage strategy, cleanup policies |
| Pass/Fail Determination | LOW | Simple threshold comparison |
| **Agentic Vision Analysis** | **HIGH** | Emerging tech, no established patterns |
| **Expected vs Unintended** | **HIGH** | Requires semantic understanding |
| **Conversational Investigation** | **VERY HIGH** | Multi-turn dialog, context management |
| Regional Analysis | HIGH | Iterative zoom, component hierarchy |
---
## Open Questions
### Gemini API Capabilities
**Question**: Can Gemini 3 Flash actually do iterative zoom/crop/annotate?
**Status**: Marketing materials say yes, but need to test actual API
**Research needed**: Phase 3 should verify:
- Zoom into specific regions
- Crop areas of interest
- Annotate findings dynamically
- Multi-resolution processing
### Conversation Depth
**Question**: How many turns can conversational investigation support?
**Status**: Unclear if single-question or multi-turn dialog
**Research needed**: Phase 3 should test:
- Single question: "What changed in the header?"
- Follow-up: "Not the logo, the navigation items"
- Follow-up: "Show me the spacing changes"
### Cost Per Comparison
**Question**: How much does one comparison cost in Gemini API credits?
**Estimate**:
- Screenshot analysis: ~1000 tokens/image
- Comparison: ~2000 tokens (2 images)
- Annotated output: ~500 tokens
- **Total**: ~3500 tokens = $0.0026
**Research needed**: Phase 3 should measure actual usage with real screenshots
---
## Roadmap Implications
### Phase Structure
**Phase 1**: Table Stakes (screenshot, diff, baseline, pass/fail)
**Phase 2**: MCP Integration (expose via MCP protocol)
**Phase 3**: Agentic Vision (THE DIFFERENTIATOR)
**Phase 4**: Advanced Features (conversational, regional)
### Risk Areas
**High Risk**: Phase 3 (Agentic Vision)
- Gemini API capabilities unverified
- Cost per comparison unknown
- No established patterns to follow
**Mitigation**: Phase-specific research before implementation
---
*Last updated: 2025-02-04*
*Confidence: HIGH (table stakes, differentiators, anti-features)*