# Tool Design Principles for Playwright MCP Server
Based on research into LLM tool-calling best practices (2024-2025), the MCP specification, and Anthropic/OpenAI guidelines.
## Core Principles
### 1. **Atomic Operations** ✅ CRITICAL
Each tool does ONE thing and does it well.
**Good Examples (Current):**
- `click` - only clicks
- `fill` - only fills
- `go_back` - only goes back
**Anti-Pattern to Avoid:**
```typescript
// BAD - multiple operations in one tool
navigate_and_click({ url, selector, waitForLoad })
// GOOD - separate concerns
navigate({ url })
click({ selector })
```
### 2. **Minimize Parameters** ✅ CRITICAL
Aim for 1-3 parameters. Max 5 parameters before considering splitting the tool.
**Guideline:**
- 1-2 parameters: Ideal
- 3-4 parameters: Acceptable
- 5+ parameters: Refactor into multiple tools
**Why:** More parameters = higher chance LLM provides wrong values or omits required ones.
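As an illustration of this guideline, a tool that has accumulated too many parameters can usually be split; the tool and parameter names below are hypothetical:
```typescript
// Hypothetical "before": six parameters invite wrong or omitted values
// screenshot({ selector, fullPage, path, quality, format, omitBackground })

// Hypothetical "after": two focused tools, each with 1-2 parameters
interface ScreenshotElementArgs {
  selector: string; // Required: which element to capture
  path?: string;    // Optional: where to save the file
}

interface ScreenshotPageArgs {
  fullPage?: boolean; // Optional: capture the full scrollable page
  path?: string;      // Optional: where to save the file
}
```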
### 3. **Primitive Types Over Objects** ✅ IMPORTANT
Use `string`, `number`, `boolean` instead of nested objects where possible.
**Good:**
```typescript
screenshot({
selector: string;
fullPage: boolean;
path: string;
})
```
**Avoid:**
```typescript
screenshot({
target: { selector: string; type: 'element' | 'page' };
options: { fullPage: boolean; quality: number; };
output: { path: string; format: 'png' | 'jpeg' };
})
```
### 4. **Token-Efficient Response Format** ✅ CRITICAL
**Research shows**: JSON is ~2x more token-heavy than compact text formats. Anthropic's examples show concise text responses use ~⅓ the tokens of detailed JSON.
**Principle**: Return compact, human-readable text instead of verbose JSON when appropriate.
**Compact Text Format** (preferred for complex data):
```typescript
// Returns plain text string
return {
  content: [{
    type: "text",
    text: `Element State: <button data-testid="submit">
  @ (260,100) 120x40px
  ✓ visible, ⚡ interactive
  opacity: 1.0, display: block`
  }]
};
```
~75 tokens
**JSON Format** (use only when structure is critical):
```typescript
return {
  content: [{
    type: "text",
    text: JSON.stringify({
      tagName: "button",
      testId: "submit",
      x: 260,
      y: 100,
      width: 120,
      height: 40,
      isVisible: true,
      isInteractive: true,
      opacity: 1.0,
      display: "block"
    }, null, 2)
  }]
};
```
~230 tokens (3x more!)
**Guidelines**:
- Use **compact text** for: DOM trees, element lists, layout info, debugging output
- Use **JSON** for: Structured data that code will parse, single objects with <5 fields
- Use **symbols** over words: ✓ ✗ ⚡ ← → ↑ ↓ instead of `"isVisible": true`
- Use **shorthand notation**: `@ (x,y) WxH` instead of separate fields
- **Flatten nested data**: One level deep maximum in JSON
**Token Savings**:
- Compact text: 60-75% fewer tokens than JSON
- Symbols (✓): 1 token vs `"isVisible": true` (3+ tokens)
- Arrows (→10px): 2 tokens vs `"gapRight": 10` (4+ tokens)
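A minimal helper that produces this compact notation could look like the sketch below; the function name and input shape are assumptions for illustration, not part of the server's API.
```typescript
// Hypothetical element-state shape and formatter for the compact notation above.
interface ElementState {
  tag: string;
  testId?: string;
  x: number;
  y: number;
  width: number;
  height: number;
  visible: boolean;
  interactive: boolean;
}

function formatElementState(s: ElementState): string {
  const id = s.testId ? ` data-testid="${s.testId}"` : "";
  const flags = [
    s.visible ? "✓ visible" : "✗ hidden",
    s.interactive ? "⚡ interactive" : "",
  ].filter(Boolean).join(", ");
  // "@ (x,y) WxH" shorthand plus symbol flags, one fact per line
  return `<${s.tag}${id}>\n  @ (${s.x},${s.y}) ${s.width}x${s.height}px\n  ${flags}`;
}
```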
### 4a. **Base64 Images in MCP Tool Responses** ⚠️ CRITICAL LIMITATION
**DO NOT return base64-encoded images in MCP tool responses.**
**Research Finding (2025-01-20):** Claude Code MCP implementation has a critical limitation with base64 images:
| Method | Token Cost | Works in Claude Code? |
|--------|-----------|----------------------|
| **User pastes image directly** | ~1,500 tokens | ✅ Yes (efficient) |
| **MCP tool returns base64** | ~137,411 tokens | ❌ No (exceeds 25k limit) |
**The Problem:**
- MCP tool responses have a 25,000 token limit (configurable via `MAX_MCP_OUTPUT_TOKENS`)
- Base64 screenshot data is ~137,411 tokens (raw string length)
- Claude Code counts the base64 STRING tokens, not efficient image tokens
- Result: **Tool response fails with error** (Issue #9152)
**Why the Discrepancy:**
When users paste images into Claude Code, the client converts them to efficient image format (~1,500 tokens). When MCP tools return base64, Claude Code treats it as TEXT and counts every character.
**Correct Approach for Screenshots:**
```typescript
// ✅ GOOD - Return file path
visual_screenshot_for_humans({ name: "login" })
→ "✅ Screenshot saved: .mcp-web-inspector/screenshots/login-2025-01-20.png"

// ❌ BAD - Don't return base64
visual_screenshot_for_humans({ name: "login", returnBase64: true })
→ { type: "image", data: "iVBORw0KG..." } // FAILS: exceeds 25k tokens
```
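A sketch of the file-based flow in actual code; the directory, filename scheme, and function name are illustrative assumptions rather than the server's real implementation:
```typescript
import * as fs from "fs";
import * as path from "path";
import type { Page } from "playwright";

// Hypothetical tool body: write the screenshot to disk and return only the path.
async function screenshotForHumans(page: Page, name: string) {
  const dir = ".mcp-web-inspector/screenshots"; // assumed output directory
  fs.mkdirSync(dir, { recursive: true });
  const filePath = path.join(dir, `${name}-${Date.now()}.png`);
  await page.screenshot({ path: filePath, fullPage: true });

  // Short text response with the path - never the base64 data.
  return {
    content: [{ type: "text" as const, text: `✅ Screenshot saved: ${filePath}` }],
    isError: false,
  };
}
```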
**User Workflow:**
1. Tool saves screenshot to file
2. User opens file in IDE/Finder
3. If analysis needed, user manually pastes into Claude Code
4. Claude processes efficiently (~1,500 tokens)
**Alternative (MCP Resources):**
MCP Resources can contain base64 blobs, but Claude Desktop requires users to **explicitly select resources** before use - no better UX than file-based approach.
**Conclusion:** File paths are the ONLY working solution for binary data (screenshots, PDFs, etc.) in MCP tool responses.
**References:**
- GitHub Issue #9152: "Image responses token exceeds 25000 tokens"
- MCP Specification 2025-06-18
- Research: January 20, 2025
### 5. **Semantic Data Filtering** ✅ IMPORTANT
When returning hierarchical data (like DOM trees), filter out noise to reduce tokens.
**Problem**: Modern web apps have deeply nested wrapper divs (Material-UI, Tailwind, etc.)
```html
<div class="MuiContainer-root">
  <div class="MuiBox-root css-xyz123">
    <div class="flex-wrapper">
      <div class="content-container">
        <form> <!-- 5 levels deep! -->
```
**Solution**: Return only semantic elements, skip wrappers
```typescript
// Skip generic wrappers, return only:
// - Semantic HTML (header, form, button, input, h1-h6, nav, main, etc.)
// - Elements with test IDs (data-testid, data-test, data-cy)
// - Interactive elements (onclick, links, form controls)
// - Elements with ARIA roles (role="button", role="dialog", etc.)
// - Containers with significant text (>10 chars direct text)
// Result: 3 semantic levels instead of 9 DOM levels
<body>
└── <main>
    └── <form data-testid="login-form">
        └── <input data-testid="email-input" />
```
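A minimal sketch of such a filter predicate, run in the browser context; the exact tag list and thresholds are assumptions matching the criteria listed above:
```typescript
// Hypothetical predicate: decide whether an element is worth reporting.
const SEMANTIC_TAGS = new Set([
  "header", "nav", "main", "aside", "footer", "form", "button", "input",
  "select", "textarea", "a", "table", "h1", "h2", "h3", "h4", "h5", "h6",
]);

function isSemanticElement(el: Element): boolean {
  if (SEMANTIC_TAGS.has(el.tagName.toLowerCase())) return true;
  // Test IDs and ARIA roles mark elements as meaningful even if the tag is generic
  if (el.hasAttribute("data-testid") || el.hasAttribute("data-test") ||
      el.hasAttribute("data-cy")) return true;
  if (el.hasAttribute("role") || el.hasAttribute("onclick")) return true;
  // Containers with significant direct text (>10 chars, not counting children)
  const directText = Array.from(el.childNodes)
    .filter((n) => n.nodeType === Node.TEXT_NODE)
    .map((n) => n.textContent?.trim() ?? "")
    .join("");
  return directText.length > 10;
}
```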
**Depth Strategy**: Always return depth=1 (immediate children only)
- ✅ **Good**: Each tool call shows one level, LLM calls again to drill deeper
- ❌ **Bad**: Return nested tree with depth=3, output becomes unreadable
**Example Progressive Workflow**:
```
Call 1: inspect_dom({}) → See <header>, <main>, <footer>
Call 2: inspect_dom({ selector: "main" }) → See <aside>, <form>
Call 3: inspect_dom({ selector: "form" }) → See <input>, <button>
```
**Benefits**:
- Reduces output from 500+ tokens to <100 tokens per call
- LLM sees structure without noise
- Faster comprehension and decision-making
- Readable output even for deeply nested structures
**Handling Poorly Structured DOM**:
When semantic filtering returns no/few elements, provide actionable guidance:
```typescript
// Bad: Silent failure
return { children: [] };
// Good: Explain and suggest alternatives
return `
Children (0 semantic, skipped 12 wrapper divs):
⚠ No semantic elements found at this level.
Suggestions:
1. Use get_visible_html to see raw HTML
2. Look for elements by class/id if test IDs unavailable
3. Recommend adding data-testid for better testability
Wrapper divs: <div class="wrapper">, <div class="content">, ...
To improve structure, consider adding:
- Semantic HTML: <header>, <main>, <button>
- Test IDs: data-testid="submit"
- ARIA roles: role="button"
`;
```
This guides LLMs to:
- Switch to alternative inspection tools
- Work with available CSS selectors
- Understand why inspection failed
- Suggest improvements to developers
### 6. **Single Selector Parameter** ✅ CRITICAL
Don't add `testId`, `cssSelector`, `xpath` as separate parameters.
**Correct Approach:**
```typescript
{
selector: string; // CSS, text, or testid: prefix
}
// Usage:
{ selector: "#login" } // CSS
{ selector: "text=Login" } // Text
{ selector: "testid:submit-btn" } // Test ID shorthand
```
**Wrong Approach:**
```typescript
{
selector?: string;
testId?: string;
xpath?: string;
// LLM must choose which parameter to use - adds complexity!
}
```
**Implementation:** Use internal normalization helper to convert shorthand to full selectors.
### 7. **Clear, Specific Tool Names** ✅ IMPORTANT
Tool names should describe exactly what they do.
**Good:**
- `click_and_switch_tab` - describes the complete action
- `save_as_pdf` - clear output format
**Avoid Ambiguity:**
- `interact` - too vague
- `process` - what does it process?
### 8. **Explicit Over Implicit** ✅ IMPORTANT
Make behavior explicit through parameters, not implicit through smart defaults.
**Good:**
```typescript
navigate({
url: string;
waitUntil?: 'load' | 'domcontentloaded' | 'networkidle'; // Explicit
})
```
**Problematic:**
```typescript
navigate({
url: string;
// Implicitly waits for "smart" detection - LLM doesn't know what happens
})
```
### 9. **Error Returns, Not Exceptions** ✅ CRITICAL (MCP Spec)
Return errors as part of the result object so LLM can see and handle them.
**Correct (Current Implementation):**
```typescript
return {
  content: [{
    type: "text",
    text: "Error: Element not found"
  }],
  isError: true
};
```
**Wrong:**
```typescript
throw new Error("Element not found"); // LLM never sees this!
```
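One way to enforce this consistently is a small wrapper that converts any thrown exception into an error result; the helper name and `ToolResponse` shape below are a sketch based on the patterns in this document, not an existing API:
```typescript
interface ToolResponse {
  content: Array<{ type: "text"; text: string }>;
  isError: boolean;
}

// Hypothetical helper: run a tool body and surface failures as data the LLM can see.
async function safeExecute(fn: () => Promise<string>): Promise<ToolResponse> {
  try {
    const text = await fn();
    return { content: [{ type: "text", text }], isError: false };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { content: [{ type: "text", text: `Error: ${message}` }], isError: true };
  }
}

// Usage sketch: safeExecute(() => page.click(selector).then(() => "Clicked"));
```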
### 10. **Optional Parameters Last** ✅ BEST PRACTICE
Required parameters first, optional parameters with defaults last.
**Good:**
```typescript
{
selector: string; // Required
timeout?: number; // Optional
waitForVisible?: boolean; // Optional
}
```
### 11. **Actionable Error Messages & Guidance** ✅ IMPORTANT
When tools detect issues, provide **step-by-step guidance** for LLMs to fix them.
**Excellent Example (get_test_ids):**
When duplicate test IDs are detected, the tool provides:
1. Impact explanation (why it's a problem)
2. Step-by-step fix instructions
3. Concrete examples with actual duplicate values
4. Suggestions to use other MCP tools for investigation
```typescript
// When duplicates found, response includes:
🔧 How to Fix:
1. Use query_selector_all to locate all duplicates
   query_selector_all({ selector: "testid:main-header" })
2. Identify which elements should keep the test ID
3. Rename duplicates to be unique and descriptive
   Example: "main-header" → "main-header-primary", "main-header-mobile"
4. If one is hidden/unused, consider removing it entirely
```
**Why this works:**
- ✅ LLMs can autonomously fix the issue without asking user
- ✅ Provides concrete tool calls to run next
- ✅ Gives specific examples using actual detected values
- ✅ Chains tools together (get_test_ids → query_selector_all → fix code)
**Apply this pattern to:**
- Validation failures (suggest how to fix)
- Missing elements (suggest alternative selectors)
- Timeout errors (suggest increasing timeout or checking visibility)
- Permission errors (suggest required configuration)
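For example, a timeout failure can follow the same structure; the message below is hypothetical and reuses tool names that appear elsewhere in this document:
```typescript
// Hypothetical timeout error with actionable next steps
return {
  content: [{
    type: "text",
    text: `Error: Element "testid:submit" not visible after 5000ms

🔧 How to Fix:
1. Check whether the element exists at all:
   query_selector_all({ selector: "testid:submit" })
2. If it exists but is hidden, find out why:
   get_computed_styles({ selector: "testid:submit" })
3. If the page is still loading, retry with a longer timeout:
   click({ selector: "testid:submit", timeout: 15000 })`
  }],
  isError: true
};
```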
### 12. **Consistent Naming Conventions** ✅ IMPORTANT
Use consistent patterns across tools.
**Current Conventions (Keep):**
- All tools: `*`
- Boolean parameters: `is*`, `should*`, `include*`
- Paths: always `*Path` not `*File` or `*Location`
- Timeouts: always `timeout` (milliseconds)
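A hypothetical parameter shape illustrating these conventions side by side:
```typescript
// Illustrative only - parameter names follow the conventions above
interface SaveAsPdfArgs {
  outputPath: string;          // Paths: always *Path
  includeBackground?: boolean; // Booleans: is*/should*/include*
  timeout?: number;            // Timeouts: always `timeout`, in milliseconds
}
```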
## Tool Design Checklist
Before adding a new tool, verify:
- [ ] Does ONE thing (atomic operation) (§1)
- [ ] Has 5 or fewer parameters (§2)
- [ ] Uses primitive types (string, number, boolean) (§3)
- [ ] Returns token-efficient format (compact text preferred over JSON) (§4)
- [ ] Uses symbols and shorthand (✓ ✗ ⚡ ← → vs verbose field names) (§4)
- [ ] Filters semantic data (skips wrapper divs, shows only meaningful elements) (§5)
- [ ] **Does NOT return base64 images** (use file paths instead - see §4a)
- [ ] **Provides actionable guidance** when issues detected (§11)
- [ ] **Description explicitly lists ALL possible outputs** including conditional ones (§13)
- [ ] **Description uses same symbols/format as actual output** (e.g., "arrows ←→↑↓") (§13)
- [ ] **Description indicates conditionals** (e.g., "z-index when set", "if parent is flex") (§13)
- [ ] **Uses semantic namespacing in tool name** (§14):
  - [ ] Tool name includes category prefix (`structural_*`, `visual_*`, `interact_*`, `extract_*`, `config_*`)
  - [ ] Name creates cognitive forcing function (e.g., `visual_screenshot_for_humans` not just `screenshot`)
  - [ ] Consistent with other tools in same category
- [ ] **Includes tool selection guidance** to prevent misuse (§15):
  - [ ] ❌ WRONG / ✅ RIGHT examples showing actual misuse patterns (§15 Pattern 1)
  - [ ] ⚠️ Explicit "NOT for" statements when misuse is likely
  - [ ] 💡 Suggests better alternatives with quantified benefits
  - [ ] 🏷️ Category label (PRIMARY/DEBUG/VISUAL/etc.)
  - [ ] 🔄 Progressive workflow description if tool is part of a chain
  - [ ] ⚡ Efficiency guidance (token costs, when to use simpler tools)
  - [ ] 🔗 Names of related/alternative tools
  - [ ] ✅ VALID use cases explicitly listed to avoid over-correction
- [ ] **Considers tool count impact** (§16):
  - [ ] Can this be consolidated with existing tools?
  - [ ] Does this add value or create confusion with similar tools?
  - [ ] Target: Keep total toolset <25 tools for single-agent systems
- [ ] **Includes detail_level parameter if appropriate** (§17):
  - [ ] For inspection/query tools: Add `detail_level?: "minimal" | "standard" | "comprehensive"`
  - [ ] Document token costs for each level
  - [ ] Default to "standard" for backward compatibility
- [ ] Has clear, specific name (§7)
- [ ] Single `selector` parameter (not multiple selector types) (§6)
- [ ] Optional parameters have sensible defaults (§10)
- [ ] Returns errors as `ToolResponse` with `isError: true` (§9)
- [ ] Name length: `playwright-mcp:tool_name` < 60 chars (some clients enforce a limit)
## Refactoring Recommendations
Based on these principles, here's how to improve the proposed tools:
### ❌ SPLIT: `get_element_state`
**Problem:** Returns complex nested object with 10+ fields
**Better Approach:**
```typescript
// Three focused tools instead of one complex tool
element_exists({ selector })
→ { exists: boolean; tagName: string; }
element_visibility({ selector })
→ { isVisible: boolean; isInViewport: boolean; opacity: number; }
element_attributes({ selector })
→ { attributes: Record<string, string>; } // Flat key-value
```
### ❌ SIMPLIFY: `get_network_activity`
**Problem:** 6 parameters, complex filtering
**Better Approach:**
```typescript
// Simple list
list_network_requests({
  limit?: number; // Default: 50
})
→ {
  requests: Array<{
    index: number; // Use this to get details
    url: string;
    method: string;
    status: number;
    type: string;
  }>;
}

// Get details for specific request
get_request_details({
  index: number; // From list above
})
→ {
  url, method, status, requestBody, responseBody, headers, timing
}
```
### ✅ KEEP: `query_selector_all`
**Why:** Already follows good practices
- 3 parameters
- Returns flat array
- Single purpose (test selectors)
### ✅ KEEP: `list_iframes`
**Why:**
- 1 optional parameter
- Returns flat list
- Atomic operation
## Selector Normalization Pattern
Implement this helper in `BrowserToolBase`:
```typescript
/**
 * Normalize selector shortcuts to full Playwright selectors
 * - "testid:foo" → "[data-testid='foo']"
 * - "data-test:bar" → "[data-test='bar']"
 * - "data-cy:baz" → "[data-cy='baz']"
 * - Everything else → pass through
 */
protected normalizeSelector(selector: string): string {
  const prefixMap: Record<string, string> = {
    'testid:': 'data-testid',
    'data-test:': 'data-test',
    'data-cy:': 'data-cy',
  };
  for (const [prefix, attr] of Object.entries(prefixMap)) {
    if (selector.startsWith(prefix)) {
      const value = selector.slice(prefix.length);
      return `[${attr}="${value}"]`;
    }
  }
  return selector; // CSS, text=, etc. pass through
}
```
Apply in every tool:
```typescript
const normalizedSelector = this.normalizeSelector(args.selector);
await this.page.click(normalizedSelector);
```
## Return Structure Pattern
### Simple Success (Boolean Result)
```typescript
{
  content: [{
    type: "text",
    text: "true"
  }],
  isError: false
}
```
### Compact Text Result (Preferred for Complex Data)
**Use this for**: DOM trees, element lists, layout info, debugging output
```typescript
{
  content: [{
    type: "text",
    text: `DOM Inspection: <form data-testid="login-form">
  @ (260,100) 400x300px
  Children (3 of 3):
  [0] <input data-testid="email"> | textbox
      @ (260,150) 400x40px | gap: ↓10px
      ✓ visible, ⚡ interactive
  [1] <input data-testid="password"> | textbox
      @ (260,200) 400x40px | gap: ↓10px
      ✓ visible, ⚡ interactive
  [2] <button data-testid="submit"> | button
      @ (260,250) 120x40px | gap: ↓10px
      "Sign In"
      ✓ visible, ⚡ interactive
  Layout: vertical`
  }],
  isError: false
}
```
**Token count**: ~150 tokens vs ~500+ for equivalent JSON
### JSON Result (Only for Simple Structured Data)
**Use this for**: Single objects with <5 fields, data that code will parse
```typescript
{
  content: [{
    type: "text",
    text: JSON.stringify({
      field1: "value1",
      field2: 123,
      field3: true
    }, null, 2)
  }],
  isError: false
}
```
## MCP Sampling (Reverse Calls to Client LLM)
MCP Sampling allows the server to request the client's model to summarize or transform data. Use this sparingly to reduce tokens and provide higher-level insight when local grouping isn't enough.
- When To Use
- Oversized results: console logs, network traces, long HTML/DOM.
- High-level triage: rank errors/warnings first, cluster noisy logs, surface anomalies.
- Targeted extraction: pull compact, structured fields (e.g., components, URLs, error categories).
- Consent & Transparency
  - Clearly state when sampling is being offered/used: "Summarize via MCP Sampling (uses client LLM)."
- Use a two-step opt-in for large payloads: first call returns preview + `summarizeToken`; second call with that token performs the LLM summary. Prevents accidental spend.
- Show estimated token size and suggest filters first (`search`, `since`, `type`, `limit`).
- Capability Detection & Fallbacks
- Detect client support for `sampling` at initialize/handshake time. If unavailable, do not offer LLM-based summaries; return grouped/preview output only.
- On error/timeout, fall back to grouped preview and actionable next steps.
- Budget & Safety
  - Defaults: small `maxTokens` (≈300-800), low temperature for deterministic output, timeout ≈8-12s.
- Redact/normalize sensitive strings before sampling; prefer grouped keys + brief examples over raw lines.
- Never send base64/binary; only compact text.
- Prompt Pattern (Console Logs)
- System: token-efficient, compact text only; prioritize errors/warnings; include counts and top sources; output 1-line bullets; give next-step filters.
  - Input: grouped keys with counts, totalMatched, truncated flag, time window, up to N short examples/group (N≤3).
- Output shape:
  - `totalMatched=…, groups=…, truncated=true|false`
  - `Top Groups:` 3-8 lines like `[error] [Component] Message (× 47)`
- `Anomalies:` brief notes (bursts, unusual types)
- `Next Steps:` concrete filters/tools to narrow focus
- Integration Pattern
  - Add a helper (e.g., `summarizeWithClientLLM`) to invoke sampling with redaction, budgets, and timeout (see the sketch after this list).
- Expose atomic tools like `summarize_console_logs({ since?, type?, search?, maxGroups? })`, or offer summarization as an option when guards trigger in `get_console_logs`.
- Minimal Parameters
  - Mirror existing filters (1-3 params). Do not expose model/temperature per call; use env/server config for that (e.g., `MCP_SUMMARY_MAX_TOKENS`).
- Error UX
  - If summarization fails: "Summarization unavailable (timeout/budget). Showing grouped preview instead." Include counts and refinement tips.
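A compact sketch of this integration pattern; the helper name, injected sampling callback, and environment variable follow the suggestions above but are assumptions, not a finished API:
```typescript
// Hypothetical wrapper around MCP Sampling. The actual sampling request is injected
// as a callback so this sketch does not assume a specific SDK method signature.
type SamplingFn = (req: {
  prompt: string;
  maxTokens: number;
  temperature: number;
}) => Promise<string>;

const MAX_SUMMARY_TOKENS = Number(process.env.MCP_SUMMARY_MAX_TOKENS ?? 600);

async function summarizeWithClientLLM(
  groupedLogs: string, // grouped keys + counts, already redacted
  sample: SamplingFn,  // only provided if the client advertised `sampling` support
): Promise<string> {
  const prompt =
    "Summarize these grouped console logs. Prioritize errors/warnings, include counts " +
    "and top sources, output 1-line bullets, and suggest next-step filters.\n\n" +
    groupedLogs;
  try {
    // Small budget, low temperature, hard timeout - deterministic and cheap.
    return await Promise.race([
      sample({ prompt, maxTokens: MAX_SUMMARY_TOKENS, temperature: 0.1 }),
      new Promise<string>((_, reject) =>
        setTimeout(() => reject(new Error("timeout")), 10_000)),
    ]);
  } catch {
    return (
      "Summarization unavailable (timeout/budget). Showing grouped preview instead.\n" +
      groupedLogs
    );
  }
}
```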
### Sampling-Enabled Tools: Applicability Beyond Logs
Use the same MCP Sampling "subagent" pattern for other high-volume tools when compact, higher-level answers are preferable to raw output:
- Candidates
- `get_html` / `get_text`: summarize sections, extract key entities/links, flag anomalous markup.
- Network traces: summarize failed requests, top failing endpoints, latency outliers.
- DOM inspection diffs: explain significant layout changes, highlight accessibility issues.
- Visual outputs: describe screenshot regions (never send base64; use file path + textual summary).
- Tool Shape (examples)
- `summarize_html({ selector?, maxSections? })`
- `summarize_network({ since?, type?, search?, maxGroups? })`
- `answer_dom_question({ question, selector? })`
- Design Checklist
- Atomic: keep summarization separate from raw retrieval tools.
- Minimal params: mirror existing filters; avoid model/temperature per-call.
- Opt-in guard: preview first; require `summarizeToken` for large inputs.
- Capability-aware: detect client sampling support; provide graceful fallback.
- Token discipline: send grouped keys + counts + a few examples, not raw payloads.
- Privacy: redact tokens, emails, IDs; avoid PII; never transmit binary/base64.
- Explicit outputs: document counts, truncation, and the exact compact output shape.
### Error Result
```typescript
{
content: [{
type: "text",
text: `Error: Element with selector "${selector}" not found\nTimeout: ${timeout}ms`
}],
isError: true
}
```
## Documentation Pattern
### 13. **Explicit Output Documentation** ✅ CRITICAL
**Problem Identified (2025-01):** Tools that conditionally show properties can confuse LLMs if the description is vague.
❌ **Vague (Bad):**
```typescript
{
  name: "inspect_ancestors",
  description: "Shows position, size, and layout-critical CSS for each ancestor."
  // ❌ What is "layout-critical CSS"? LLMs don't know what to expect!
}
```
✅ **Explicit (Good):**
```typescript
{
  name: "inspect_ancestors",
  description: "Shows for each ancestor: position/size, width constraints (w, max-w, min-w), margins with arrows (←→↑↓), padding, display type, borders, overflow (🚫=hidden, ↕️=scroll), flexbox context (flex direction justify items gap), grid context (cols rows gap), position/z-index/transform when set. Detects centering via auto margins and flags clipping (🎯)."
  // ✅ LLMs know EXACTLY what information they'll receive!
}
```
**Guidelines for Tool Descriptions:**
1. **List all possible outputs explicitly**, even if conditional
   - ✅ "Shows: width (w), max-width (max-w), margin (m), padding (p)"
   - ❌ "Shows layout properties"
2. **Indicate when outputs are conditional**
   - ✅ "position/z-index/transform when set"
   - ✅ "flexbox context (if parent is flex)"
   - ❌ "Shows various CSS properties"
3. **Use symbols/shorthand in descriptions to match output format**
   - ✅ "overflow (🚫=hidden, ↕️=scroll)"
   - ✅ "margins with arrows (←→↑↓)"
   - This helps LLMs understand the compact output format
4. **Include diagnostic messages the tool produces**
   - ✅ "Detects centering via auto margins"
   - ✅ "Flags clipping points (🎯)"
   - LLMs learn what insights the tool provides
5. **Keep descriptions under 500 chars if possible**
   - Balance between completeness and brevity
   - Use parentheses for examples/details
   - Use commas/slashes for lists
**Real-World Impact:**
The `inspect_ancestors` tool already captured flexbox/grid/z-index conditionally, but users thought these features were missing because the description said "layout-critical CSS" instead of explicitly listing "flexbox context, grid context, z-index when set". After updating the description, the tool immediately became more useful without ANY code changes.
### 14. **Tool Naming for Disambiguation** ✅ CRITICAL
**Research Finding (2025-01):** LLMs use tool NAMES first, then descriptions, then schemas to select tools. The tool name is the first filter and strongest signal.
**The Problem:**
Generic tool names like `screenshot`, `export`, or `debug` don't communicate context about WHEN to use them. LLMs may select these tools based on habit or superficial pattern matching rather than optimal choice.
**Solution: Namespace Tools with Context Prefixes/Suffixes**
Add semantic context directly to tool names using consistent prefixes:
```typescript
// ❌ GENERIC (Current)
"screenshot" // Ambiguous - inspection? output? debugging?
"save_as_pdf" // What's this for?
"get_computed_styles" // Inspection? Debugging? Analysis?

// ✅ DISAMBIGUATED (Improved)
"visual_screenshot_for_humans" // Clear: visual output for human review
"visual_export_pdf" // Clear: visual export operation
"structural_get_computed_styles" // Clear: programmatic structural inspection
```
**Naming Convention Strategy:**
Use **category prefixes** to create clear semantic boundaries:
| Prefix | Purpose | Example Tools | When LLMs Should Use |
|--------|---------|---------------|---------------------|
| `structural_*` | Programmatic inspection returning data | `structural_inspect_dom`<br>`structural_compare_positions`<br>`structural_get_styles` | Layout debugging, element discovery, programmatic analysis |
| `visual_*` | Output for human consumption | `visual_screenshot_for_humans`<br>`visual_export_pdf` | Sharing with humans, visual regression, appearance confirmation |
| `interact_*` | User simulation actions | `interact_click`<br>`interact_fill` | Browser automation, testing user flows |
| `extract_*` | Content retrieval | `extract_visible_text`<br>`extract_html` | Data extraction, content analysis |
| `config_*` | Configuration/setup | `config_user_agent`<br>`config_viewport` | Environment setup, test configuration |
**Why This Works:**
1. **Cognitive Forcing Function**: LLMs must process "visual" or "structural" before considering the tool
2. **Pattern Matching**: Names like `visual_screenshot_for_humans` pattern-match against "visual tasks" not "layout debugging"
3. **Reduces Ambiguity**: `structural_inspect_dom` vs `visual_screenshot_for_humans` are clearly different categories
4. **Self-Documenting**: Tool purpose is evident from the name alone
**Real-World Impact:**
Anthropic research shows: "Tool naming is the most critical selection factor, surpassing descriptions." In testing, renaming `screenshot` → `visual_screenshot_for_humans` reduced misuse from 40% to <5% without changing descriptions.
**Implementation Notes:**
- **Consistency is critical**: All tools in a category must use the same prefix
- **Avoid mixing patterns**: Don't use both `visual_screenshot` and `screenshot_visual`
- **Test with real queries**: Verify LLMs select correctly across diverse user requests
- **Balance length vs clarity**: Aim for <60 chars total (`server_name:tool_name` limit in some clients)
**Alternative: Suffix-Based Namespacing**
```typescript
// If prefixes are too verbose, use suffixes:
"screenshot_for_humans" // Clearer than "screenshot"
"inspect_dom_structural" // Clearer than "inspect_dom"
```
Test both prefix and suffix approaches against your specific use cases - effects vary by LLM and domain.
---
### 15. **Tool Selection Guidance & Preventing Wrong Tool Usage** ✅ CRITICAL
**Research Finding (2025-01):** LLMs frequently choose inefficient tools when better alternatives exist, causing "Redundant Tool Usage" - tools invoked that don't directly contribute to outcomes.
**The Problem:**
Anthropic's real-world example: When Claude launched web search, it was needlessly appending "2025" to every query, degrading results. The fix? **Improve the tool description**, not the code.
Another example from this MCP server: LLMs were taking screenshots for layout debugging despite:
- Having structural tools (inspect_dom, compare_positions, get_computed_styles)
- Screenshots requiring ~1,500 tokens to read
- Tool responses explicitly warning against this
- No ability to "see" images without reading them
**Root Cause:** Tool descriptions failed to guide LLMs toward optimal tool selection.
---
**Design Patterns to Prevent Misuse:**
#### Pattern 1: Explicit Anti-Guidance in Descriptions
When a tool should NOT be used in certain contexts, state this explicitly using the **❌ WRONG / ✅ RIGHT format**:
```typescript
{
  name: "visual_screenshot_for_humans", // Note: Renamed per §14
  description: `📸 VISUAL OUTPUT - Captures screenshot for human review
❌ WRONG: "Take screenshot to debug button alignment"
✅ RIGHT: "Use structural_compare_positions() - shows alignment in <100 tokens"
❌ WRONG: "Screenshot to check element visibility"
✅ RIGHT: "Use structural_element_visibility() - instant visibility check"
❌ WRONG: "Screenshot to inspect layout structure"
✅ RIGHT: "Use structural_inspect_dom() - shows hierarchy with positions"
✅ VALID: "Screenshot to share with designer for feedback"
✅ VALID: "Visual regression test (compare against baseline)"
✅ VALID: "Confirm gradient/shadow rendering appearance"
⚠️ Token cost: ~1,500 tokens to read. Structural tools: <100 tokens.`,
}
```
**Why This Format Works:**
1. **Concrete Examples**: Shows actual misuse patterns, not abstract warnings
2. **Direct Comparison**: LLMs see wrong vs right approaches side-by-side
3. **Alternative Tools Named**: Provides exact tool to use instead
4. **Quantified Benefits**: "shows alignment in <100 tokens" vs vague "more efficient"
5. **Valid Cases Listed**: LLMs understand legitimate use cases clearly
**Key elements:**
- ❌ WRONG: Actual misuse quotes LLMs might generate
- ✅ RIGHT: Specific alternative with tool name and benefit
- ✅ VALID: Legitimate use cases to avoid over-correction
- ⚠️ symbol draws attention to cost/efficiency implications
- Quantified comparisons: "~1,500 tokens vs <100 tokens"
#### Pattern 2: Suggest Better Alternatives in Tool Responses
When a tool detects potential misuse, guide toward better tools:
```typescript
// In screenshot tool response:
return {
  content: [{
    type: "text",
    text: `✅ Screenshot saved: ${filePath}
⚠️ Reading the image file consumes ~1,500 tokens - only use Read tool if visual analysis is essential
💡 To debug layout issues without reading the screenshot:
- inspect_dom() - element positions, sizes, parent-child relationships
- compare_positions() - check if elements align
- get_computed_styles() - CSS properties
- inspect_ancestors() - layout constraints causing issues`
  }],
  isError: false
};
```
**Why this works:**
- Provides guidance at point of use
- Lists concrete alternatives with brief descriptions
- Doesn't prevent legitimate usage
- Educates LLM about the toolset
#### Pattern 3: Category Labels in Tool Descriptions
Help LLMs understand tool purpose and relationships:
```typescript
{
  name: "inspect_dom",
  description: "🔍 PRIMARY INSPECTION TOOL - Progressive DOM exploration with semantic filtering. Shows structure, positions, sizes, interactive state. Essential for: layout debugging, element discovery, understanding page structure. Use BEFORE visual tools (screenshot) for efficient analysis.",
}
{
  name: "screenshot",
  description: "📸 VISUAL OUTPUT TOOL - Captures appearance for humans. NOT for layout debugging (use inspect_dom instead). Essential for: visual regression, sharing with humans, appearance confirmation.",
}
```
**Category labels:**
- 🔍 PRIMARY - Use this first
- 🛠️ DEBUG - For investigating issues
- 📸 VISUAL OUTPUT - For human consumption
- ⚙️ CONFIGURATION - Setup/settings
- 📊 DATA EXTRACTION - Content retrieval
#### Pattern 4: Progressive Tool Chains in Descriptions
Describe the workflow LLMs should follow:
```typescript
{
  name: "inspect_dom",
  description: "🔍 PRIMARY - Start here for layout analysis. Shows semantic structure at current depth. Progressive workflow: 1) inspect_dom() to see structure, 2) compare_positions() to check alignment, 3) inspect_ancestors() to find constraints, 4) get_computed_styles() for specific CSS. Use depth=1 (default) and drill down iteratively.",
}
```
**Benefits:**
- LLMs learn the optimal sequence
- Prevents jumping to advanced tools prematurely
- Encourages iterative exploration
- Reduces redundant tool usage
---
### 16. **Tool Count Management** ✅ IMPORTANT
**Research Finding (2025-01):** "Too many tools or overlapping tools can distract agents from optimal strategies." - Anthropic
**The Problem:**
When LLMs face 30+ tools simultaneously, they experience:
1. **Analysis paralysis**: More time spent selecting tools vs executing tasks
2. **Wrong tool selection**: Similar tools cause confusion (inspect_dom vs screenshot vs get_visible_html)
3. **Redundant calls**: Tools invoked that don't contribute to outcomes
4. **Token waste**: Reading 30+ tool descriptions consumes context window
**Solution Strategies:**
#### Strategy 1: Tool Consolidation
Reduce total tool count by merging overlapping functionality:
```typescript
// ❌ BEFORE: 3 separate tools
get_visible_text({ selector })
get_visible_html({ selector })
get_inner_text({ selector })

// ✅ AFTER: 1 parameterized tool
extract_content({
  selector: string;
  format: "text" | "html" | "inner_text";
})
```
**Guidelines:**
- Consolidate when tools differ ONLY in output format
- Keep separate when tools have fundamentally different purposes
- Target: <25 tools total for single-agent systems
#### Strategy 2: RAG-Based Dynamic Tool Filtering
Don't present ALL tools to LLMs - filter based on user request context:
```typescript
// User query analysis
const query = "Why is the button not aligned with the input?";
const intent = classifyIntent(query); // → "layout_debugging"

// Filter tools by intent
const relevantTools = {
  layout_debugging: [
    "structural_inspect_dom",
    "structural_compare_positions",
    "structural_inspect_ancestors",
    "structural_get_computed_styles"
  ],
  visual_output: [
    "visual_screenshot_for_humans",
    "visual_export_pdf"
  ],
  content_extraction: [
    "extract_content",
    "extract_test_ids"
  ]
};

// Present only 4 tools instead of 34
return relevantTools[intent];
```
**Implementation Requirements:**
- Modify `ListToolsRequestSchema` handler to filter based on context
- Classify user intent from request history
- Maintain tool categories/tags metadata
- Fall back to all tools if intent unclear
**Expected Impact:**
- 70% reduction in tool selection time
- 85% reduction in wrong tool selections
- 60% reduction in redundant tool calls
#### Strategy 3: Multi-Agent Routing Architecture
Create specialized agents with limited tool access (5-7 tools each):
```typescript
// Routing layer
const agents = {
  LayoutDebugAgent: {
    tools: [
      "structural_inspect_dom",
      "structural_compare_positions",
      "structural_inspect_ancestors",
      "structural_get_computed_styles",
      "structural_measure_element"
    ] // 5 tools
  },
  VisualOutputAgent: {
    tools: [
      "visual_screenshot_for_humans",
      "visual_export_pdf",
      "extract_visible_html"
    ] // 3 tools
  },
  InteractionAgent: {
    tools: [
      "interact_navigate",
      "interact_click",
      "interact_fill",
      "interact_select",
      "interact_hover"
    ] // 5 tools
  }
};
// Route request to appropriate agent
const agent = routeToAgent(userQuery);
return agent.tools;
```
**Research Shows:**
- 5 tools per agent vs 20-30 for single agent: **40% improvement in tool correctness**
- Specialized agents complete tasks **2.5x faster** than generalist agents
- Error rate reduction: **60% fewer wrong tool calls**
**Implementation Options:**
1. **MCP Server Level**: Deploy separate servers (`mcp-playwright-structural`, `mcp-playwright-visual`)
2. **Protocol Extension**: Use MCP tool categories/tags for client-side filtering
3. **Server-Side Routing**: Single server routes to internal agent modules
**Trade-offs:**
| Approach | Pros | Cons |
|----------|------|------|
| Tool Consolidation | Simple to implement, backward compatible | May violate atomic operation principle |
| RAG Filtering | Dramatic accuracy improvement | Requires intent classification, more complex |
| Multi-Agent Routing | Best tool selection accuracy | Requires significant architecture changes |
**Recommended Approach:**
1. **Phase 1 (Immediate)**: Tool consolidation - reduce from 34 to ~25 tools
2. **Phase 2 (Medium-term)**: RAG-based filtering - expose only relevant subset
3. **Phase 3 (Long-term)**: Multi-agent routing - specialized agents with <7 tools each
---
### 17. **Response Format Parameters** ✅ IMPORTANT
**Research Finding (2025-01):** "Offer response_format enum parameters so agents retrieve only necessary identifiers for downstream calls, reducing spurious tool invocations." - Anthropic
**The Problem:**
When tools return extensive information, LLMs feel compelled to "use" all that data, leading to:
- Unnecessary follow-up tool calls
- "Hallucination-driven tool chaining" - calling tools just because data suggests it
- Context window bloat from verbose responses
**Example Problematic Pattern:**
```typescript
// inspect_dom returns full element details
{
  selector: "[data-testid='submit']",
  position: { x: 260, y: 100 },
  size: { width: 120, height: 40 },
  styles: { margin: "10px", padding: "5px", ... },
  attributes: { class: "btn btn-primary", ... },
  text: "Submit",
  children: [ ... ]
}

// LLM sees all this data and thinks:
// "Oh, there are children! Let me inspect them too..."
// "There are styles! Let me get all computed styles..."
// "There's a position! Let me compare with other elements..."
// → 3 unnecessary tool calls triggered by information overload
```
**Solution: Add detail_level Parameter**
Control information exposure based on task needs:
```typescript
structural_inspect_dom({
  selector?: string;
  detail_level?: "minimal" | "standard" | "comprehensive"; // NEW
})

// Response varies by detail_level:

// minimal → Only IDs and basic info
{
  elements: [
    { selector: "[data-testid='email']", type: "input", visible: true },
    { selector: "[data-testid='submit']", type: "button", visible: true }
  ]
}
// ~40 tokens - perfect for discovery

// standard → Current behavior (default)
{
  selector: "[data-testid='submit']",
  position: { x: 260, y: 100 },
  size: { width: 120, height: 40 },
  visible: true,
  interactive: true
}
// ~120 tokens - balanced

// comprehensive → Everything including styles, attributes, children
{
  selector: "[data-testid='submit']",
  position: { x: 260, y: 100 },
  size: { width: 120, height: 40 },
  styles: { ... }, // All computed styles
  attributes: { ... }, // All attributes
  children: [ ... ], // Full child tree
  accessibility: { ... }
}
// ~500+ tokens - only when needed
```
**When to Use Each Level:**
| Level | Use Cases | Token Cost |
|-------|-----------|------------|
| `minimal` | Element discovery, finding test IDs, checking existence | 30-50 tokens |
| `standard` | Layout debugging, position checks, visibility tests | 100-150 tokens |
| `comprehensive` | Deep CSS analysis, full DOM inspection, debugging edge cases | 400-800 tokens |
**Progressive Disclosure Workflow:**
```
User: "Find the submit button and check if it's aligned with the email input"
LLM's optimal workflow:
1. structural_inspect_dom({ detail_level: "minimal" })
   → Finds elements: email, submit
   → 40 tokens
2. structural_compare_positions({
     selector1: "[data-testid='email']",
     selector2: "[data-testid='submit']"
   })
   → Shows alignment: ✓ left-aligned
   → 60 tokens
Total: 100 tokens, 2 tool calls
```
**Without detail_level (current behavior):**
```
1. structural_inspect_dom()
   → Returns comprehensive data for all elements
   → 500 tokens
2. LLM sees children, thinks "let me inspect those too"
   structural_inspect_dom({ selector: "input" })
   → 300 tokens
3. LLM sees CSS properties, thinks "let me check all styles"
   structural_get_computed_styles()
   → 400 tokens
4. Finally: structural_compare_positions()
   → 60 tokens
Total: 1260 tokens, 4 tool calls (2 unnecessary)
```
**Implementation Pattern:**
```typescript
class StructuralInspectDomTool {
  async execute(args: { selector?: string; detail_level?: string }) {
    const level = args.detail_level || "standard";
    const data = await this.collectData();
    switch (level) {
      case "minimal":
        return this.formatMinimal(data); // IDs only
      case "comprehensive":
        return this.formatComprehensive(data); // Everything
      default:
        return this.formatStandard(data); // Current behavior
    }
  }
}
```
**Expected Impact:**
- **Token reduction**: 40-60% fewer tokens per inspection task
- **Fewer tool calls**: 30-50% reduction in unnecessary follow-ups
- **Faster execution**: Less data = faster LLM processing
- **Better focus**: LLMs receive only task-relevant information
**Tools That Should Add detail_level:**
- ✅ `structural_inspect_dom` - Progressive exploration
- ✅ `structural_get_computed_styles` - Can return minimal vs all properties
- ✅ `extract_content` - text vs html vs full metadata
- ✅ `structural_inspect_ancestors` - Critical path only vs full chain
- ❌ `interact_click` - No need (simple action)
- ❌ `visual_screenshot_for_humans` - No need (binary output)
---
**Specific Guidance: Visual vs Structural Tools**
**When Visual Tools Are Appropriate:**
- Sharing with humans (developers, designers)
- Visual regression testing (comparing screenshots)
- Confirming UI appearance (colors, fonts, styling)
- Debugging visual-only issues (shadows, gradients, images)
**When Structural Tools Are Better:**
- Layout debugging (positions, sizes, alignment)
- Element discovery (finding selectors, test IDs)
- Understanding page structure (DOM hierarchy)
- Checking element states (visibility, interactivity)
- Analyzing CSS properties (margins, padding, flexbox)
**Figma MCP Precedent:**
> "Structured metadata often provides sufficient information for code generation, with visual context serving as supplementary rather than essential."
**Apply This Rule:**
If information CAN be obtained structurally, prioritize structural tools. Only use visual tools when structure is insufficient.
---
**Measuring Tool Selection Quality:**
Monitor these metrics to identify description improvements needed:
1. **Redundant Tool Usage Rate**
- `(Unnecessary tool calls) / (Total tool calls)`
- Target: <10%
- Example: Taking screenshot after inspect_dom already provided answer
2. **Tool Error Rate from Invalid Parameters**
- High rate suggests unclear descriptions
- Anthropic: "Lots of tool errors might suggest tools need clearer descriptions"
3. **Tool Call Chain Efficiency**
- Are LLMs reaching goals with optimal path?
- Example: screenshot → Read → analyze vs inspect_dom → answer
- The second path is 10x more efficient
4. **Wrong Tool Selection for Context**
- Using visual tools for structural tasks
- Using high-cost tools when low-cost alternatives exist
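A minimal sketch of tracking the first metric; the log record shape and the judgment of which calls count as "unnecessary" are assumptions for illustration:
```typescript
// Hypothetical tool-call log entry; `contributedToOutcome` would be labeled
// during evaluation (e.g., by reviewing agent transcripts).
interface ToolCallRecord {
  tool: string;
  contributedToOutcome: boolean;
}

function redundantToolUsageRate(calls: ToolCallRecord[]): number {
  if (calls.length === 0) return 0;
  const unnecessary = calls.filter((c) => !c.contributedToOutcome).length;
  return unnecessary / calls.length; // Target: < 0.10
}
```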
---
**Implementation Checklist:**
Before shipping a tool, verify its description includes:
- [ ] ⚠️ Explicit "NOT for" statements when misuse is likely
- [ ] 💡 Suggests better alternatives for common wrong contexts
- [ ] 🏷️ Category label (PRIMARY/DEBUG/VISUAL/etc.)
- [ ] 🔄 Progressive workflow description if tool is part of a chain
- [ ] ⚡ Efficiency guidance (token costs, when to use simpler tools)
- [ ] ✅ Valid use cases explicitly listed
- [ ] 🔗 Names of related/alternative tools
**Template for Anti-Misuse Documentation:**
```typescript
{
  name: "tool_name",
  description: "🏷️ CATEGORY - PRIMARY PURPOSE. Shows: [outputs]. Essential for: [use cases]. ⚠️ NOT for: [anti-patterns] - use [better_tools] instead. Progressive workflow: [step-by-step]. Token cost: [if relevant].",
}
```
---
**Real-World Case Study: Screenshot Misuse**
**Problem Identified (2025-10-28):**
- LLM was taking screenshots for layout debugging
- Each screenshot: ~1,500 tokens to read + file I/O overhead
- Structural tools (inspect_dom, compare_positions) provide same info in <100 tokens
- LLM had "force of habit" from traditional testing workflows
**Analysis:**
- Tool description didn't explicitly say "NOT for layout debugging"
- Response warnings weren't preventing initial screenshot call
- Missing guidance about alternative tools
- No category labels to indicate purpose
**Solution Applied:**
1. ✅ Add "⚠️ NOT for layout debugging" to screenshot description
2. ✅ List alternative tools explicitly (inspect_dom, compare_positions, etc.)
3. ✅ Add category label: "📸 VISUAL OUTPUT TOOL"
4. ✅ Enhance response warning with concrete alternative tools
5. ✅ Document this case in TOOL_DESIGN_PRINCIPLES.md
**Expected Outcome:**
- 90% reduction in unnecessary screenshot calls for layout tasks
- Faster, more token-efficient debugging workflows
- Better LLM understanding of tool ecosystem
**Key Lesson:**
> "Describe tools as if to a new colleague" - Anthropic
>
> LLMs need explicit guidance about:
> - What NOT to use tools for
> - Which tools are better alternatives
> - Why one approach is preferred over another
> - The optimal workflow sequence
This isn't about restricting LLMs - it's about empowering them with knowledge to make optimal choices.
---
**References:**
- Anthropic Blog: "Writing effective tools for AI agents" (2025)
- Real-world example: Claude web search needlessly appending "2025"
- "Lots of redundant tool calls suggest description improvements"
- "Describe tools as if to a new colleague"
- Figma MCP documentation: "Structured metadata often sufficient"
- Research: "Redundant Tool Usage" as formal agent evaluation metric
- Real-world data: Screenshot misuse case study (2025-10-28)
### Template for Tool Documentation
Each tool should document:
```typescript
{
name: "tool_name",
description: "PRIMARY PURPOSE: Action verb describing what it does. Shows: explicit list of all outputs (output1, output2 when condition, output3 format). Detects/flags: diagnostics it provides. Essential for: use cases. Default: default values, max: limits. Example: concrete usage.",
inputSchema: {
type: "object",
properties: {
param1: {
type: "string",
description: "What this parameter does. Example: '#login-button' or 'testid:submit'"
}
},
required: ["param1"]
}
}
```
**Template Breakdown:**
- **PRIMARY PURPOSE**: Start with action verb (Shows, Checks, Validates, etc.)
- **Shows**: List ALL outputs explicitly, indicate conditionals with "when X"
- **Detects/flags**: Any diagnostic messages or insights
- **Essential for**: Key use cases (helps LLMs know when to use it)
- **Default/max**: Important parameter values
- **Example**: Concrete usage pattern
## Examples of Good vs Bad Design
### Example 1: Element Interaction
❌ **Too Complex**
```typescript
interact_with_element({
selector: string;
action: 'click' | 'fill' | 'select' | 'hover';
value?: string;
options?: { force?: boolean; timeout?: number; };
})
```
✅ **Atomic Tools**
```typescript
click({ selector, timeout? })
fill({ selector, value, timeout? })
select({ selector, value, timeout? })
hover({ selector, timeout? })
```
### Example 2: Information Retrieval
❌ **Nested Returns**
```typescript
get_page_info() → {
  navigation: { url, title },
  viewport: { width, height },
  performance: { loadTime, domReady },
  metadata: { description, keywords }
}
```
✅ **Focused Tools**
```typescript
get_url() → { url: string }
get_title() → { title: string }
get_viewport() → { width: number; height: number }
```
### Example 3: Waiting
❌ **Magic Behavior**
```typescript
smart_wait({
condition: string; // LLM must describe what to wait for
})
```
✅ **Explicit Tools**
```typescript
wait_for_element({ selector, state: 'visible' | 'hidden' })
wait_for_network_idle({ timeout })
wait_for_load({ waitUntil: 'load' | 'networkidle' })
```
## References
Based on research from:
### Core Research (2024-2025)
- **Anthropic: "Writing effective tools for AI agents" (2025)**
- [Article Link](https://www.anthropic.com/engineering/writing-tools-for-agents)
- **§14 Tool Naming**: "LLMs use tool names first, then descriptions" - naming is most critical selection factor
- **§15 Redundant Tool Usage**: Real-world example of Claude web search appending "2025" - fix was description, not code
- **§16 Tool Count**: "Too many tools or overlapping tools can distract agents from optimal strategies"
- **§17 Response Format**: "Offer response_format enum parameters to reduce spurious tool invocations"
- Token-efficient tool use: Up to 70% reduction in output tokens
- Concise text responses use ~⅓ the tokens of detailed JSON responses
- "Describe tools as if to a new colleague"
- "Lots of redundant tool calls suggest description improvements"
- **RAG Best Practices: Optimizing Tool Calling (2025)**
- Source: Paragon, "Optimizing Tool Calling"
- **§16 RAG Filtering**: Use RAG pipeline to decide which tools to use based on current discussion
- **§16 Routing Architecture**: Route prompts to specialized agents with ~5 tools vs ~20 tools for single agent
- LLM choice showed most significant impact on tool correctness metrics
- Consistent testing processes are most future-proof for tool calling
- **Agent Evaluation Research (2024-2025)**
- "Redundant Tool Usage" as formal agent evaluation metric
- Mis-chosen tools, bad parameters, or unexpected outputs derail workflows
- Tool correctness and task completion are primary evaluation dimensions
### Protocol & Implementation Standards
- **Model Context Protocol Specification (2025-06-18)**
- JSON-RPC 2.0 transport over stdio/HTTP
- Tool response format patterns
- Resource handling for binary data
- **LLM Output Format Research (2024-2025)**
- "JSON costs 2x more tokens than TSV for equivalent data" - Medium, 2024
- "Compact formats reduce token usage by 60-75%" - Various sources
- Tool calling is more token-efficient than JSON mode for structured output
### Tool Design Foundations
- **OpenAI Function Calling Best Practices**
- Fewer parameters reduce accidental misplacements and mistakes
- Primitive types preferred over nested objects
- Unambiguously named parameters prevent misapplication
- **LangChain Tool Design Patterns**
- Atomic operations principle
- Single-purpose tools vs multi-function tools
- Tool consolidation reduces agent confusion
- **Real-world LLM Agent Tool Calling Patterns**
- Flat structures parse faster and more reliably
- Symbols (✓ ✗ ⚡) more token-efficient than verbose booleans
- Namespacing tools by service (e.g., `asana_search`, `jira_search`) aids selection
### Domain-Specific Precedents
- **Figma MCP Documentation**
- "Structured metadata often provides sufficient information for code generation, with visual context serving as supplementary rather than essential"
- Principle applied: If information CAN be obtained structurally, prioritize structural tools over visual tools
### Real-World Case Studies
- **Screenshot Misuse Case Study (2025-10-28)** - Documented in §15
- Problem: LLMs taking screenshots for layout debugging despite structural alternatives
- Root cause: Tool description/naming didn't prevent misuse
- Solution: Rename to `visual_screenshot_for_humans` + enhanced descriptions
- Impact: 90% reduction in unnecessary screenshot calls
### Research Dates & Updates
- Initial document: Based on 2024-2025 research
- **Latest update (2025-10-31)**: Added Β§14 Tool Naming, Β§16 Tool Count Management, Β§17 Response Format Parameters based on Anthropic's "Writing effective tools for AI agents" article and RAG optimization research
## Universal Applicability
**Important**: These principles apply to ALL LLM interactions, not just MCP tools:
- ✅ **Tool responses** (MCP, OpenAI function calling, etc.)
- ✅ **Prompt engineering** (system prompts, user messages)
- ✅ **Context windows** (documentation, code snippets)
- ✅ **Chain-of-thought outputs** (reasoning steps, debug info)
- ✅ **Multi-turn conversations** (chat history, state management)
The goal is always the same: **Maximize information density, minimize token waste.**