AI Vision MCP Server

web-context-detection-plan.md•13.7 KiB

# Plan: Web Context Detection and HTML Element Classification **Date**: 2025-01-10 **Author**: Claude Code **Issue**: Enhance object detection to automatically detect web page contexts and use appropriate HTML element names for better automation compatibility ## Problem Statement The current object detection system uses generic element names regardless of context: 1. **Context-Agnostic Naming**: Uses "button", "input", "text" for all interfaces (web, mobile, desktop) 2. **Missed Semantic Opportunities**: Web pages could benefit from HTML-specific element names 3. **Automation Mismatch**: Generic names don't align with CSS selector targeting for web automation 4. **Limited Specificity**: Cannot distinguish between HTML input types (text, email, password, etc.) ## Solution: Context-Aware System Instructions ### Core Approach - **Automatic Web Detection**: Enhance system instructions to identify web page interfaces - **HTML Element Classification**: Use semantic HTML element names when web context is detected - **Fallback Mechanism**: Maintain current generic naming for non-web contexts - **Progressive Enhancement**: Start with basic detection, enhance over time ## Implementation Strategy ### Phase 1: Enhanced System Instructions (Day 1-2) **Objective**: Modify the detection system instruction to include web context detection logic. **Key Changes**: 1. **Context Detection Prompting**: Add web interface identification step 2. **HTML Element Vocabulary**: Provide comprehensive HTML element list 3. **Conditional Logic**: Use HTML names for web contexts, generic names otherwise 4. **Input Type Specificity**: Detect specific input types when possible **Updated System Instruction Structure**: ``` 1. CONTEXT DETECTION: - Analyze if image shows a web page, browser interface, or web application - Look for indicators: address bars, browser UI, web-style layouts, form elements - If web context detected → use HTML element names - If non-web context → use generic object names 2. HTML ELEMENT CLASSIFICATION (Web Context Only): - Interactive Elements: button, input[type], select, textarea, a - Form Elements: form, label, fieldset, legend - Structural: nav, header, footer, main, section, article - Content: h1-h6, p, img, video, ul, ol, li 3. INPUT TYPE DETECTION: - Analyze visual cues for input specificity - text, email, password, search, tel, url, number, date - checkbox, radio, file, submit, reset 4. FALLBACK NAMING: - Non-web contexts: button, text, image, icon, object, container ``` ### Phase 2: Web Context Indicators (Day 2-3) **Visual Indicators for Web Detection**: - **Browser Elements**: Address bar, navigation buttons, tabs, bookmarks bar - **Web UI Patterns**: Navigation menus, breadcrumbs, pagination, form layouts - **Typography**: Web fonts, text rendering typical of browsers - **Layout Patterns**: Grid systems, responsive design indicators, web-style spacing - **Form Elements**: Standard HTML form controls with web styling **Detection Logic**: ```typescript const WEB_CONTEXT_INDICATORS = [ // Browser UI 'address bar', 'url bar', 'browser tab', 'bookmark bar', 'browser window', 'navigation buttons', // Web Interface Patterns 'navigation menu', 'breadcrumb', 'pagination', 'web form', 'login form', 'search bar', 'dropdown menu', 'checkbox', 'radio button', 'submit button', 'hyperlink', // Layout Patterns 'web page', 'website', 'web application', 'responsive design', 'grid layout', 'sidebar', 'header', 'footer', 'navigation' ]; ``` ### Phase 3: HTML Element Mapping (Day 3-4) **Interactive Elements** (High Priority): ``` button → <button>, <input type="submit">, <input type="button"> input → <input type="text">, <input type="email">, <input type="password"> select → <select> textarea → <textarea> link → <a> checkbox → <input type="checkbox"> radio → <input type="radio"> ``` **Structural Elements**: ``` navigation → <nav> header → <header> footer → <footer> main → <main> section → <section> article → <article> ``` **Content Elements**: ``` heading → <h1>, <h2>, <h3>, <h4>, <h5>, <h6> paragraph → <p> image → <img> video → <video> list → <ul>, <ol> listitem → <li> ``` ### Phase 4: Enhanced System Instruction Implementation **Strategy: Two-Phase Detection with Conditional Logic** The key is to make the LLM first analyze the context, then conditionally apply different naming schemes based on that analysis. **Updated DETECTION_SYSTEM_INSTRUCTION**: ```typescript const DETECTION_SYSTEM_INSTRUCTION = ` You are a precise visual detection assistant with context-aware element naming. STEP 1 - CONTEXT ANALYSIS: First, analyze this image to determine the interface type: WEB INTERFACE INDICATORS (look for these): - Browser elements: address bar, tabs, navigation buttons, bookmarks - Web UI patterns: navigation menus, breadcrumbs, form layouts - Web typography: typical web fonts and text rendering - HTML form controls: standard web input fields, buttons, dropdowns - Web layout patterns: grid systems, responsive design, web-style spacing - URL visible, web page content, browser interface If you detect 2 or more web indicators → WEB CONTEXT = TRUE If unclear or mobile/desktop app → WEB CONTEXT = FALSE STEP 2 - CONDITIONAL ELEMENT NAMING: IF WEB CONTEXT = TRUE: Use specific HTML element names: - Buttons: "button", "input[type=\"submit\"]", "input[type=\"button\"]" - Text inputs: "input[type=\"text\"]", "input[type=\"email\"]", "input[type=\"password\"]" - Form controls: "select", "textarea", "label", "fieldset" - Links: "a" - Navigation: "nav", "ul", "ol", "li" - Structure: "header", "footer", "main", "section", "article" - Content: "h1", "h2", "h3", "h4", "h5", "h6", "p", "img", "video" IF WEB CONTEXT = FALSE: Use generic element names: - "button", "input", "text", "image", "icon", "container", "object" STEP 3 - OUTPUT FORMAT: For each detected element, return: { "object": "<conditional name based on context analysis>", "label": "<descriptive label>", "normalized_box_2d": [ymin, xmin, ymax, xmax] } STEP 4 - EXAMPLES: Web Context Example: { "object": "input[type=\"email\"]", "label": "Email address field", "normalized_box_2d": [180, 200, 220, 600] } Non-Web Context Example: { "object": "input", "label": "Email address field", "normalized_box_2d": [180, 200, 220, 600] } Return only valid JSON array - no explanatory text, no context description. `; ``` ## Practical Implementation in Code ### Method 1: Single System Instruction (Recommended) **File**: `src/tools/detect_objects_in_image.ts` The LLM handles context detection internally through the enhanced system instruction: ```typescript // Enhanced system instruction with conditional logic const DETECTION_SYSTEM_INSTRUCTION = ` You are a precise visual detection assistant with context-aware element naming. STEP 1 - CONTEXT ANALYSIS: First, analyze this image to determine the interface type: WEB INTERFACE INDICATORS (look for these): - Browser elements: address bar, tabs, navigation buttons, bookmarks - Web UI patterns: navigation menus, breadcrumbs, form layouts - Web typography: typical web fonts and text rendering - HTML form controls: standard web input fields, buttons, dropdowns - Web layout patterns: grid systems, responsive design, web-style spacing - URL visible, web page content, browser interface If you detect 2 or more web indicators → WEB CONTEXT = TRUE If unclear or mobile/desktop app → WEB CONTEXT = FALSE STEP 2 - CONDITIONAL ELEMENT NAMING: IF WEB CONTEXT = TRUE: Use specific HTML element names: - Buttons: "button", "input[type=\"submit\"]", "input[type=\"button\"]" - Text inputs: "input[type=\"text\"]", "input[type=\"email\"]", "input[type=\"password\"]" - Form controls: "select", "textarea", "label", "fieldset" - Links: "a" - Navigation: "nav", "ul", "ol", "li" - Structure: "header", "footer", "main", "section", "article" - Content: "h1", "h2", "h3", "h4", "h5", "h6", "p", "img", "video" IF WEB CONTEXT = FALSE: Use generic element names: - "button", "input", "text", "image", "icon", "container", "object" Return only valid JSON array with conditional naming applied. `; // No code changes needed - LLM handles context detection automatically export async function detect_objects_in_image( args: ObjectDetectionArgs, config: Config, imageProvider: VisionProvider, imageFileService: FileService ): Promise<ObjectDetectionResponse> { // ... existing code ... const options: AnalysisOptions = { // ... existing options ... systemInstruction: DETECTION_SYSTEM_INSTRUCTION, // Enhanced instruction }; // LLM automatically applies conditional naming based on context analysis const result = await imageProvider.analyzeImage( processedImageSource, detectionPrompt, options ); // ... rest of existing code unchanged ... } ``` ## Implementation Approach: Single System Instruction **How it works:** - Single API call with enhanced system instruction - LLM internally analyzes context and applies conditional naming - No code changes to existing flow **Benefits:** - ✅ **Simplest Implementation**: No additional API calls or complexity - ✅ **Cost Efficient**: Single request instead of two - ✅ **Faster Execution**: No sequential API calls - ✅ **Atomic Operation**: Context and detection in one step - ✅ **No Code Refactoring**: Drop-in replacement for existing system instruction **Implementation Strategy:** - Start with single instruction approach - Monitor accuracy through logging and user feedback - Modern LLMs are capable of following complex conditional instructions reliably ## Example Outputs with Conditional Naming ### Same Screenshot, Different Contexts **Web Page Screenshot:** ```json [ { "object": "input[type=\"email\"]", "label": "Email address field", "normalized_box_2d": [180, 200, 220, 600] }, { "object": "button[type=\"submit\"]", "label": "Login submit button", "normalized_box_2d": [245, 720, 290, 850] }, { "object": "nav", "label": "Main navigation menu", "normalized_box_2d": [50, 100, 120, 900] } ] ``` **Mobile App Screenshot (Same Visual Elements):** ```json [ { "object": "input", "label": "Email address field", "normalized_box_2d": [180, 200, 220, 600] }, { "object": "button", "label": "Login submit button", "normalized_box_2d": [245, 720, 290, 850] }, { "object": "container", "label": "Main navigation menu", "normalized_box_2d": [50, 100, 120, 900] } ] ``` **Key Difference:** Same visual elements get HTML-specific names for web contexts, generic names for non-web contexts. ## CSS Selector Integration **Enhanced Summary Generation**: ```typescript function suggestCSSSelectors(detection: DetectedObject): string[] { const selectors = []; // HTML element-specific selectors if (detection.object.startsWith('input[type=')) { const inputType = detection.object.match(/type="([^"]+)"/)?.[1]; selectors.push(`input[type="${inputType}"]`); selectors.push(`input[name="${detection.label.toLowerCase().replace(/\s+/g, '_')}"]`); } else if (detection.object === 'button') { selectors.push('button[type="submit"]'); selectors.push(`button:has-text("${detection.label}")`); } else if (detection.object === 'select') { selectors.push('select'); selectors.push(`select[name="${detection.label.toLowerCase().replace(/\s+/g, '_')}"]`); } return selectors; } ``` ## Benefits - **Perfect CSS Alignment**: HTML element names directly map to selectors - **Automation Accuracy**: More precise element targeting for web automation - **Context Awareness**: Appropriate naming based on interface type - **Backward Compatibility**: Non-web contexts unchanged - **Framework Integration**: Better integration with Playwright, Puppeteer, Cypress - **Accessibility Support**: HTML semantics align with accessibility standards - **Developer Experience**: More intuitive element identification ## Risk Assessment **Low Risk:** - **Fallback Mechanism**: Non-web contexts remain unchanged - **Gradual Implementation**: Can be deployed incrementally - **No Breaking Changes**: Existing functionality preserved **Medium Risk:** - **Detection Accuracy**: Web context detection may have false positives/negatives - **HTML Specificity**: Need to ensure HTML element names are accurate **Mitigation Strategies:** - **Conservative Detection**: Err on side of generic naming when uncertain - **Validation Testing**: Extensive testing across web and non-web contexts - **User Feedback**: Monitor accuracy and adjust detection logic ## Success Metrics 1. **Web Detection Accuracy**: >90% correct web vs non-web classification 2. **HTML Element Accuracy**: >85% appropriate HTML element names for web contexts 3. **CSS Selector Utility**: Functional CSS selectors for detected web elements 4. **User Adoption**: Increased usage for web automation scenarios ## Implementation Timeline - **Day 1**: Update system instructions with web context detection - **Day 2**: Add HTML element classification logic - **Day 3**: Integrate with CSS selector generation - **Day 4**: Testing and validation across different interface types - **Day 5**: Documentation and deployment ## Future Enhancements - **Machine Learning**: Train models to better detect web contexts - **Framework-Specific**: Tailored outputs for different automation frameworks - **Accessibility**: Include ARIA attributes and accessibility information - **Mobile Web**: Detect mobile web interfaces vs native mobile apps - **Component Libraries**: Recognize common UI component patterns ## Conclusion This enhancement will significantly improve the tool's value for web automation by providing context-aware element classification. The progressive approach ensures backward compatibility while adding substantial value for web-based use cases, making the tool more aligned with modern web development and automation practices.

Loading blob content...

Latest Blog Posts

Redis vs ioredis vs valkey-glide
By punkpeye on January 26, 2026.
benchmark
Redis
valkey
Quickstart: Publish an MCP Server to the MCP Registry
By punkpeye on January 24, 2026.
mcp
official reference mirror
Official MCP Registry Server.json Requirements
By punkpeye on January 24, 2026.
mcp
official reference mirror

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/honeyvig/ai-vision-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server

web-context-detection-plan.md•13.7 KiB