IA-QA — 130+ QA & Dev Tools for AI Agents
Server Details
130+ QA & dev tools for AI agents: prompt injection, RAG testing, VLM eval, guardrails. Free.
- Status
- Healthy
- Last Tested
- Transport
- Streamable HTTP
- URL
Glama MCP Gateway
Connect through Glama MCP Gateway for full control over tool access and complete visibility into every call.
Full call logging
Every tool call is logged with complete inputs and outputs, so you can debug issues and audit what your agents are doing.
Tool access control
Enable or disable individual tools per connector, so you decide what your agents can and cannot do.
Managed credentials
Glama handles OAuth flows, token storage, and automatic rotation, so credentials never expire on your clients.
Usage analytics
See which tools your agents call, how often, and when, so you can understand usage patterns and catch anomalies.
Tool Definition Quality
Average 4.2/5 across 139 of 139 tools scored. Lowest: 3.1/5.
With 139 tools covering overlapping domains (e.g., secret scanning with detect_secrets and secret_scan, multiple similarity functions, several CORS checkers), many tools have unclear boundaries. Descriptions help but the sheer volume causes confusion.
Tool names use a mix of conventions (mostly lowercase with underscores but some compound phrases). No strict verb_noun pattern is followed, and some names are vague (e.g., 'identify_caller'). Consistent within their categories but not across the set.
139 tools is justified by the server's promise of a comprehensive QA & dev toolkit. While large, each tool serves a niche purpose. A few tools could be consolidated, but the count fits the scope.
The tool surface covers a wide range: text, encoding, security, web, LLM evaluation, RAG, and more. Some minor gaps exist (e.g., no direct image processing), but the set is comprehensive for its stated QA and dev purpose.
Available Tools
148 toolsab_test_reportARead-onlyIdempotentInspect
Generate an A/B test report comparing two prompts or model configurations. Accepts arrays of scores and returns statistical comparison: mean, median, std deviation, winner, and improvement percentage.
| Name | Required | Description | Default |
|---|---|---|---|
| variant_a | Yes | First variant configuration with name and score array | |
| variant_b | Yes | Second variant configuration with name and score array |
Output Schema
| Name | Required | Description |
|---|---|---|
| max | No | |
| min | No | |
| mean | No | |
| count | No | |
| median | No | |
| winner | No | |
| std_dev | No | |
| variant_a | No | |
| variant_b | No | |
| recommendation | No | |
| improvement_percent | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already indicate readOnlyHint=true and destructiveHint=false. Description adds output details (statistical metrics) but no additional behavioral traits. No contradiction. With annotations, bar is lower and description is adequate.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two sentences, front-loaded with main action, no filler. Every word adds value.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
With output schema present, description needn't detail return values. Covers input (arrays of scores) and output (statistics) adequately for a statistical tool. No gaps.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100% with property descriptions. Description reiterates that scores are arrays and mentions statistical output, but does not add meaning beyond schema. Baseline 3 is appropriate.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
Description uses specific verb 'Generate' and resource 'A/B test report', clearly states it compares two prompts/model configurations, and lists outputs. Distinguishes from siblings like 'compare_models' by focusing on statistical report with specific metrics.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Description implies use case (when you need A/B test comparison), but provides no explicit guidelines on when to use or avoid, no mention of prerequisites or alternatives. Minimum viable.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
analyze_diff_bugsARead-onlyInspect
Detect potential bugs and code smells from a git diff or two code versions. Returns a list of issues with severity levels and test suggestions.
| Name | Required | Description | Default |
|---|---|---|---|
| context | No | Optional PR title or feature context for better analysis | |
| version1 | No | Original code (before changes). If omitted, only the new version is analysed. | |
| version2 | Yes | New/modified code (after changes) |
Output Schema
| Name | Required | Description |
|---|---|---|
| bugs | No | |
| summary | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already provide readOnlyHint=true, destructiveHint=false, so the description does not need to cover safety. It adds that the tool returns issues with severity levels and test suggestions, but does not disclose additional behavioral traits like error handling or limits, providing moderate transparency.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is a single, front-loaded sentence that efficiently conveys the tool's purpose and output without any extraneous text, earning a top score.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool's moderate complexity with 3 parameters and existing output schema, the description covers the core functionality and return value. It lacks details on supported languages or input formats, but is sufficiently complete for an agent with good schema and annotations.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema description coverage is 100%, so the schema already documents all parameters. The description adds context that the tool works with a git diff or two versions, but does not enhance parameter semantics beyond the schema, meeting the baseline.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states 'Detect potential bugs and code smells from a git diff or two code versions', providing a specific verb and resource. It distinguishes itself from sibling tools like secret_scan or bias_detect by focusing on code diffs and returning issues with severity and test suggestions.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description implies usage for code review but does not explicitly state when to use this over other tools or when not to use it. No alternatives or exclusions are mentioned, relying on the agent to infer context from the purpose.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
analyze_responsesARead-onlyIdempotentInspect
Semantically analyze N already-produced model outputs for the SAME task (the MCP counterpart to the LLM Sandbox). Without a reference: computes consensus — pairwise cosine agreement, the most-representative output, and the outlier. With a reference (ground truth): also ranks every output by closeness (token cosine + ROUGE-L composite) and names the closest. Deterministic, no LLM, no key — gate-able in CI. You bring the outputs (2+). For a 2-way head-to-head with structural JSON diff use compare_responses instead.
| Name | Required | Description | Default |
|---|---|---|---|
| reference | No | Optional ground-truth answer. If set, each output is also ranked by closeness to it and the closest one is named. | |
| responses | Yes | The outputs to analyze (same task, N models/prompts/versions). Each item is a plain string or { "label": "GPT-4o", "text": "..." }. At least 2 required. |
Output Schema
| Name | Required | Description |
|---|---|---|
| count | No | |
| summary | No | |
| consensus | No | |
| reference_ranking | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations declare readOnlyHint=true, idempotentHint=true, destructiveHint=false. Description adds 'Deterministic, no LLM, no key — gate-able in CI', providing additional behavioral context consistent with annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Description is front-loaded with purpose, then conditional logic, then usage guidance. Each sentence adds value, though slightly dense.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Covers main behavior: consensus, pairwise cosine agreement, most representative, outlier, and with reference ranking. Output schema exists, so return values are not required. Could mention number of outputs min but schema covers it.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100%, baseline 3. Description adds meaning by explaining the optional 'reference' as ground truth and the 'responses' structure, and how behavior changes with/without reference.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description states 'Semantically analyze N already-produced model outputs for the SAME task' with clear verb and resource. It distinguishes from sibling 'compare_responses' by specifying that tool is for 2-way head-to-head with structural JSON diff.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Explicitly tells when to use with/without a reference. States 'Deterministic, no LLM, no key — gate-able in CI' for context. Clearly directs to use 'compare_responses' instead for 2-way head-to-head JSON diff.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
base64_decodeARead-onlyIdempotentInspect
Decode a Base64 string back to UTF-8 text. Use for inspecting Base64-encoded API responses, JWT payload claims, config file values, or attachment data.
| Name | Required | Description | Default |
|---|---|---|---|
| input | Yes | Base64 string to decode |
Output Schema
| Name | Required | Description |
|---|---|---|
| decoded | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations declare readOnlyHint=true, destructiveHint=false, idempotentHint=true. The description adds that it decodes to UTF-8 text, which is consistent and adds output format context beyond annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is a single sentence with examples, front-loaded with the core action. Every part adds value; no wasted words.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the low complexity and the existence of an output schema, the description is complete. It covers what the tool does, when to use it, and the output format without needing extra details.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
The schema has 100% coverage and describes the single parameter clearly ('Base64 string to decode'). The description does not add further meaning beyond what the schema already provides, so baseline 3 applies.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states it decodes Base64 to UTF-8 text. It gives specific use cases (API responses, JWT payloads, config files, attachments) and distinguishes from sibling tool base64_encode by specifying decode vs encode.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description provides explicit use cases (inspecting API responses, JWT claims, config values, attachment data) but does not indicate when not to use it or mention alternatives beyond the sibling encode tool.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
base64_encodeARead-onlyIdempotentInspect
Encode a UTF-8 string to Base64. Use when you need to embed binary data, multi-line text, or special characters safely inside JSON fields, HTTP headers, or data URIs.
| Name | Required | Description | Default |
|---|---|---|---|
| input | Yes | Text to encode |
Output Schema
| Name | Required | Description |
|---|---|---|
| encoded | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already indicate readOnly, idempotent, non-destructive. The description adds that the input is UTF-8 and the output is Base64, which are useful behavioral traits. No contradiction with annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two concise, front-loaded sentences that cover purpose and usage without superfluous words. Every sentence adds value.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
For a simple tool with one parameter and clear annotations, the description is complete. It explains the input type, output format, and appropriate use cases, fully addressing the tool's purpose.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100%, and the description adds only that the input is UTF-8, which marginally improves understanding. Baseline 3 is appropriate as the schema already describes the parameter sufficiently.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the action ('Encode a UTF-8 string to Base64') and distinguishes it from the sibling 'base64_decode'. The verb 'encode' and resource 'UTF-8 string to Base64' are specific and unique.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description provides explicit use cases ('embed binary data, multi-line text, or special characters safely inside JSON fields, HTTP headers, or data URIs'), guiding when to use. It lacks explicit when-not-to-use or alternative mentions, but the context is clear.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
bias_detectARead-onlyIdempotentInspect
Analyse a set of LLM responses generated from the same prompt template but with different demographic variants (gender, origin, age, tone). Returns a bias score (0-100), sentiment analysis per variant, pairwise Jaccard similarity, and a human-readable verdict. No API key needed — runs entirely locally.
| Name | Required | Description | Default |
|---|---|---|---|
| responses | Yes | Array of variant responses to compare for bias |
Output Schema
| Name | Required | Description |
|---|---|---|
| ratio | No | |
| verdict | No | |
| lengthCV | No | |
| negative | No | |
| positive | No | |
| biasScore | No | |
| sentiments | No | |
| avgSimilarity | No | |
| minSimilarity | No | |
| sentimentVariance | No | |
| pairwiseSimilarities | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations indicate read-only, idempotent, non-destructive. Description adds input requirements (same prompt template, demographic variants) and output details (score, sentiment, similarity, verdict). No contradiction.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two sentences, no redundancy, front-loaded with action and context.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
With annotations and schema, the description adequately covers input, processing, output, and runtime behavior (local).
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Only one parameter with 100% schema coverage. Description adds context about demographic variants (gender, origin, age, tone) beyond schema's variantId description.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
Clearly states verb 'Analyse', resource 'set of LLM responses', and specifies demographic variants. Distinct from sibling tools like compare_responses or consistency_check.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Implicitly for bias detection, notes local execution (no API key). Lacks explicit when-not-to-use or alternatives.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
bm25_scoreARead-onlyIdempotentInspect
Compute BM25 relevance score between a query and one or more documents. BM25 is the industry-standard keyword-based ranking algorithm used in Elasticsearch, OpenSearch, and Weaviate hybrid search. Returns ranked results with normalized scores.
| Name | Required | Description | Default |
|---|---|---|---|
| b | No | Length normalization factor (default: 0.75) | |
| k1 | No | Term frequency saturation (default: 1.5) | |
| query | Yes | The search query | |
| top_k | No | Return top K results (default: all) | |
| documents | Yes | Array of documents to rank |
Output Schema
| Name | Required | Description |
|---|---|---|
| b | No | |
| k1 | No | |
| index | No | |
| query | No | |
| results | No | |
| bm25_score | No | |
| doc_length | No | |
| doc_preview | No | |
| avg_doc_length | No | |
| documents_count | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already indicate read-only, idempotent, and non-destructive behavior. The description adds that it returns ranked results with normalized scores, providing useful behavioral context beyond the annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description consists of two concise sentences that efficiently convey the tool's purpose and industry relevance without unnecessary information.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given full schema coverage and annotations, the description adequately explains the tool's function and return format. It covers what BM25 is and its typical use, making it complete for this compute-oriented tool.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100% with parameter descriptions. The description does not add specific parameter details but provides algorithm context that aids understanding of how query and documents are used.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool computes BM25 relevance scores between a query and documents, and distinguishes it from sibling tools by specifying it's a keyword-based algorithm used in Elasticsearch, OpenSearch, and Weaviate hybrid search.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description provides clear context by labeling BM25 as a keyword-based algorithm, implying it's for keyword matching rather than semantic similarity. However, it does not explicitly exclude alternative use cases or mention when not to use it.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
build_rag_promptARead-onlyIdempotentInspect
Assemble a complete RAG (Retrieval-Augmented Generation) prompt from retrieved context chunks and a user query. Handles token budgeting, citation numbering, system instruction injection, and source attribution.
| Name | Required | Description | Default |
|---|---|---|---|
| query | Yes | The user question to answer | |
| chunks | Yes | Retrieved context chunks with .text (required), .source (optional), .score (optional) | |
| language | No | Response language instruction (e.g. "French", "Spanish") | |
| cite_sources | No | Add [1], [2] citation numbers (default: true) | |
| max_context_tokens | No | Max tokens for context section (default: 2000) | |
| system_instruction | No | Custom system instruction (default: standard RAG grounding instruction) |
Output Schema
| Name | Required | Description |
|---|---|---|
| prompt | No | |
| system_prompt | No | |
| chunks_included | No | |
| included_chunks | No | |
| chunks_truncated | No | |
| total_tokens_estimate | No | |
| context_tokens_estimate | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations (readOnlyHint=true, idempotentHint=true, destructiveHint=false) already cover safety. The description adds meaningful behavioral details like token budgeting, citation numbering, and system instruction injection, enhancing transparency beyond annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is a single, well-structured sentence (20 words) that front-loads the main action and efficiently lists key features. Every part earns its place with no redundancy.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given 6 parameters, 2 required, and an output schema, the description covers the main aspects: token budgeting, citation numbering, system instruction, language, and source attribution. It doesn't explain return values (covered by output schema) and assumes query/chunks are self-explanatory, so it's nearly complete.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema description coverage is 100%, so baseline is 3. The description briefly mentions token budgeting, citations, and system instructions, but does not add specific parameter semantics beyond what the schema already provides (e.g., 'max_context_tokens' is already described).
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description uses a specific verb ('assemble') and resource ('RAG prompt'), and lists distinct features (token budgeting, citation numbering, system instruction injection, source attribution). This clearly distinguishes it from sibling tools like 'prompt_template_fill' or 'system_prompt_builder'.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description implies use after retrieval for building a final prompt, but does not explicitly state when to use or avoid this tool, nor does it mention alternatives. Guidance is somewhat implicit, so it scores as adequate but not explicit.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
calculate_readabilityARead-onlyIdempotentInspect
Calculate readability scores: Flesch Reading Ease, Flesch-Kincaid Grade Level, Coleman-Liau Index, and Automated Readability Index. Useful for evaluating LLM output quality.
| Name | Required | Description | Default |
|---|---|---|---|
| input | Yes | Text to analyze for readability |
Output Schema
| Name | Required | Description |
|---|---|---|
| level | No | |
| stats | No | |
| coleman_liau_index | No | |
| flesch_reading_ease | No | |
| flesch_kincaid_grade | No | |
| automated_readability_index | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint=true and idempotentHint=true, so the agent knows this is a safe, deterministic operation. The description adds no additional behavioral context beyond what the annotations provide, e.g., behavior on empty text or edge cases.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is two sentences, efficient and front-loaded. It states the tool's output in the first sentence and a primary use case in the second, with no redundant or irrelevant content.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool's simplicity (single parameter, read-only, output schema exists), the description adequately covers the necessary information. It could be improved by mentioning supported text characteristics (e.g., language), but it is sufficient for agent use.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
The input schema covers 100% of parameters with a description for the single 'input' field. The main description does not add any new meaning beyond the schema, so baseline score 3 is appropriate.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description explicitly states the verb 'Calculate' and the resource 'readability scores', listing four specific indexes. It clearly distinguishes this tool from siblings, as none of the 150+ sibling tools are related to readability.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description mentions 'useful for evaluating LLM output quality' as a usage context, but does not provide explicit when-to-use or when-not-to-use guidance, nor alternatives. The usage is implied rather than explicit.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
case_convertARead-onlyIdempotentInspect
Convert a string between naming conventions: camelCase, PascalCase, snake_case, kebab-case, UPPER_SNAKE_CASE, dot.case, Title Case. Essential for code generation and refactoring.
| Name | Required | Description | Default |
|---|---|---|---|
| to | Yes | Target case: "camel", "pascal", "snake", "kebab", "upper_snake", "dot", "title" | |
| input | Yes | String to convert (e.g., "myVariableName", "my-css-class") |
Output Schema
| Name | Required | Description |
|---|---|---|
| result | No | |
| from_words | No | |
| target_case | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already indicate readOnly, idempotent, non-destructive. Description adds that it converts strings, reinforcing stateless transformation. No contradictions.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two sentences, front-loaded with core functionality and list of cases. No extraneous information.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Complete for a simple conversion tool with output schema. No missing context like prerequisites or side effects.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100% with clear parameter descriptions. The tool description lists naming conventions already present in schema, adding no new semantic depth.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
Clearly states 'Convert a string between naming conventions' and lists all supported cases. Distinguishes from siblings (no other case converter in list).
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Mentions 'Essential for code generation and refactoring' providing context. Does not explicitly state when not to use or compare with alternatives.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
check_contrast_ratioARead-onlyIdempotentInspect
Calculate WCAG 2.1 contrast ratio between two colors. Returns ratio and compliance for AA/AAA normal and large text.
| Name | Required | Description | Default |
|---|---|---|---|
| background | Yes | Background color in hex (e.g., "#ffffff") | |
| foreground | Yes | Foreground color in hex (e.g., "#333333") |
Output Schema
| Name | Required | Description |
|---|---|---|
| ratio | No | |
| AA_large | No | |
| AAA_large | No | |
| AA_normal | No | |
| AAA_normal | No | |
| background | No | |
| foreground | No | |
| ratio_text | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare read-only, idempotent, non-destructive behavior. The description adds that it returns compliance levels but lacks further behavioral context such as edge cases or input validation.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
A single sentence that is front-loaded and contains all necessary information without waste. Every word earns its place.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given strong annotations and an output schema, the description adequately covers the tool's purpose and return value. It could mention output format explicitly but is sufficient for agent selection.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema description coverage is 100%, so the baseline is 3. The description does not add additional semantic meaning beyond referencing the two colors. It does not specify hex format details already present in the schema.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool calculates WCAG 2.1 contrast ratio between two colors and returns ratio with compliance for AA/AAA levels. It distinguishes itself from sibling tools like color_convert and calculate_readability.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description implies usage for contrast checking but does not explicitly state when to use this tool versus alternatives, nor does it provide when-not-to-use scenarios.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
color_convertARead-onlyIdempotentInspect
Convert a color between HEX, RGB, and HSL formats. Use when translating design tokens between CSS notations, verifying color accessibility, or normalizing color values from user input. Accepts #rrggbb, #rgb, rgb(r,g,b), or hsl(h,s%,l%).
| Name | Required | Description | Default |
|---|---|---|---|
| input | Yes | Color value to convert, e.g. "#ff6b6b", "rgb(255,107,107)", "hsl(0,100%,71%)" |
Output Schema
| Name | Required | Description |
|---|---|---|
| b | No | |
| g | No | |
| r | No | |
| hex | No | |
| hsl | No | |
| rgb | No | |
| input | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint=true and destructiveHint=false, so the description's 'Convert' is consistent. Adds accepted format details but doesn't disclose additional behavior like error handling or output structure beyond schema.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Four concise sentences with immediate action verb, front-loaded with purpose, then usage examples and format details. No filler.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
For a simple conversion tool with one parameter and an output schema (exists but not shown), the description covers input formats and use cases fully.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100% with a basic 'Color value to convert' description. The tool description adds specific format examples (#rrggbb, rgb(r,g,b), hsl(h,s%,l%)) that clarify valid input beyond the schema.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states 'Convert a color between HEX, RGB, and HSL formats' with specific verbs and resources, and distinguishes from sibling conversion tools (e.g., base64_decode, case_convert) by focusing on color formats.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Provides explicit when-to-use examples: 'translating design tokens', 'verifying color accessibility', 'normalizing color values'. Lacks explicit when-not-to-use but context is clear.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
compare_modelsARead-onlyIdempotentInspect
Compare 2-5 AI models side by side: context window, pricing, multimodal, reasoning capabilities, and provider. Returns a comparison table with a recommendation based on your use case.
| Name | Required | Description | Default |
|---|---|---|---|
| models | Yes | Array of 2-5 model names (e.g. ["gpt-4o","claude-3.5-sonnet","gemini-2.0-flash"]) | |
| use_case | No | Optimize recommendation for this criterion |
Output Schema
| Name | Required | Description |
|---|---|---|
| rows | No | |
| model | No | |
| use_case | No | |
| recommendation | No | |
| models_compared | No | |
| cost_per_1k_total | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint=true and destructiveHint=false, so the safety profile is clear. The description adds that it returns a comparison table with recommendation, providing useful output context without contradicting annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two sentences cover purpose, scope, and output. Front-loaded with key information. No unnecessary words.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the schema and annotations, the description is complete: it specifies the action, input constraints (2-5 models), comparison dimensions, and output format. The presence of an output schema means return values need not be detailed.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100% with descriptions for both parameters. The description adds the overall purpose but does not provide additional meaning beyond the schema (e.g., enum options of use_case are already listed). Baseline of 3 is appropriate.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
Description clearly states the verb 'Compare' and resource 'AI models side by side' with specific attributes (context window, pricing, multimodal, reasoning, provider). It distinguishes from siblings like 'compare_responses' and 'model_info' by focusing on cross-model comparison for selection.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description implies usage for comparing models but does not explicitly state when to use versus alternatives. Sibling tools like 'compare_responses' or 'model_info' exist, but no when-not guidance is provided.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
compare_responsesARead-onlyIdempotentInspect
Compare two ALREADY-PRODUCED outputs (e.g. model A vs model B on the same task) side by side. Returns deterministic metrics (token cosine, ROUGE-L, Jaccard, length/structure deltas, JSON diff) and a verdict. If a reference (ground truth) is given, scores each output against it and picks the closer one. If model + api_key are given, an LLM judge also picks a qualitative winner for the task. No re-execution — you bring the outputs.
| Name | Required | Description | Default |
|---|---|---|---|
| task | No | The task/prompt both outputs were answering — used by the LLM judge for context | |
| model | No | Optional judge model id (BYOK). When set with api_key, an LLM judge picks a qualitative winner. | |
| api_key | No | Optional API key for the judge model (BYOK). Used only for the judge call; never stored. | |
| label_a | No | Label for output A (e.g. "GPT-4o", "v1.0") | |
| label_b | No | Label for output B (e.g. "GPT-5-nano", "v1.1") | |
| reference | No | Optional ground-truth / expected answer. If set, each output is scored against it and the closer one wins (deterministic). | |
| check_json | No | Try to parse as JSON and compare structurally (keys, types, values) | |
| response_a | Yes | First output (e.g. model A's answer) | |
| response_b | Yes | Second output (e.g. model B's answer) |
Output Schema
| Name | Required | Description |
|---|---|---|
| judge | No | |
| labelA | No | |
| labelB | No | |
| metrics | No | |
| summary | No | |
| verdict | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
The description fully discloses behavioral traits: it is read-only, idempotent, non-destructive, and returns deterministic metrics. It also explains the optional LLM judge behavior. These align with annotations and add value beyond them.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is concise (4 sentences), front-loaded with the main purpose, and well-structured with clauses for optional features. Every sentence adds value without redundancy.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the complexity (9 parameters with optional features) and the existence of an output schema, the description covers all necessary information. It explains when and how to use each optional parameter, making the tool self-contained.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100%, so baseline is 3. The description adds meaning by explaining the role of each optional parameter: 'task' is for judge context, 'reference' for scoring against ground truth, 'model'/'api_key' for LLM judge. This goes beyond the schema's basic descriptions.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool compares two already-produced outputs side by side, specifying it returns deterministic metrics and a verdict. It distinguishes itself from siblings like diff_text, similarity_score, and json_diff by emphasizing 'no re-execution' and listing specific metrics (token cosine, ROUGE-L, Jaccard, etc.).
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description provides clear context on when to use the tool: for comparing two outputs with optional reference or LLM judge. It implicitly differentiates from raw text diff tools by focusing on outputs and metrics. However, it could explicitly state not to use this for simple text differencing or for re-executing tasks.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
consistency_checkARead-onlyIdempotentInspect
Compare multiple LLM responses to the same prompt and detect inconsistencies using Jaccard word-overlap similarity and fact drift (number comparison). Fast, deterministic, no API key needed. Limitations: relies on surface-level word matching — "Paris is the capital of France" vs "Paris is the French capital" may score low despite semantic equivalence. For true semantic consistency, use run_semantic_tests with embedding mode. Essential for determinism testing.
| Name | Required | Description | Default |
|---|---|---|---|
| responses | Yes | Array of 2+ LLM responses to compare (same prompt, different runs) | |
| check_facts | No | Check for contradictory numbers/facts across responses (default: true) |
Output Schema
| Name | Required | Description |
|---|---|---|
| verdict | No | |
| fact_drift | No | |
| avg_similarity | No | |
| response_count | No | |
| pairwise_scores | No | |
| fact_contradiction | No | |
| length_variance_percent | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already indicate idempotent, read-only, non-destructive behavior. Description adds value by stating 'Fast, deterministic, no API key needed' and discloses the limitation of surface-level matching, which is crucial for proper usage.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Three sentences with critical information front-loaded. No unnecessary words, and every sentence adds value: purpose, behavior, limitations, alternative.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool has an output schema and full parameter coverage, the description sufficiently covers behavior, limitations, and usage context. No gaps.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema covers 100% of parameters with descriptions. Description adds no additional detail beyond what the schema provides, so baseline score of 3 is appropriate.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool's purpose: comparing multiple LLM responses for inconsistencies using Jaccard similarity and fact drift. It distinguishes itself from sibling tool 'run_semantic_tests' by noting its focus on surface-level matching.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Explicitly advises when to use this tool (fast, deterministic, no API key) and when not to (for semantic consistency, use run_semantic_tests). Also clearly outlines limitations.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
context_window_checkARead-onlyIdempotentInspect
Given an array of message objects [{role, content}], estimate total token usage and check if it fits in the target model's context window. Warns about truncation risk.
| Name | Required | Description | Default |
|---|---|---|---|
| model | Yes | Target model name (e.g. gpt-4o, claude-3.5-sonnet) | |
| messages | Yes | Array of messages (system/user/assistant) | |
| max_output_tokens | No | Reserved tokens for output (default: 4096) |
Output Schema
| Name | Required | Description |
|---|---|---|
| fits | No | |
| role | No | |
| chars | No | |
| index | No | |
| model | No | |
| tokens | No | |
| warnings | No | |
| breakdown | No | |
| per_message | No | |
| total_tokens | No | |
| message_count | No | |
| context_window | No | |
| total_input_tokens | No | |
| utilization_percent | No | |
| reserved_output_tokens | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already indicate read-only, idempotent, non-destructive. Description adds behavioral detail about warning on truncation risk, which is useful beyond annotations. No contradiction.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two sentences, front-loaded with purpose. Every sentence adds value with no wasted words.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given moderate complexity (3 params, no enums, output schema exists), the description is complete enough. Could mention output structure but not required due to output schema.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100% and parameters are well-described. Description does not add significant extra meaning beyond what schema provides, so baseline of 3 is appropriate.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
Description clearly states it estimates token usage and checks context window fit, with a specific warning about truncation risk. Differentiates from sibling tools like count_tokens by adding the context window check.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Implies usage for token estimation and context check but does not explicitly state when to use vs alternatives (e.g., count_tokens) or provide exclusions. No guidance on when not to use.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
conversation_analyzeARead-onlyIdempotentInspect
Analyze a multi-turn conversation for context retention, topic drift, instruction following, and repetition. Accepts messages array [{role, content}]. Essential for chatbot QA.
| Name | Required | Description | Default |
|---|---|---|---|
| messages | Yes | Conversation messages in order |
Output Schema
| Name | Required | Description |
|---|---|---|
| turn_count | No | |
| repetitions | No | |
| topic_drift | No | |
| user_messages | No | |
| context_retention | No | |
| has_system_prompt | No | |
| assistant_messages | No | |
| avg_response_length | No | |
| repetition_detected | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint=true and destructiveHint=false, indicating safety. The description adds specific behavioral traits: the tool analyzes four distinct dimensions (context retention, topic drift, instruction following, repetition) beyond what annotations provide, enhancing transparency.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is two sentences: the first states the action and dimensions, the second specifies the input format and utility. Every sentence adds value, no redundancy, and information is front-loaded.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool's single required parameter and the presence of an output schema, the description covers all necessary context: what is analyzed (four aspects), input format, and common use case (chatbot QA). It is sufficiently complete for an agent to select and invoke correctly.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
The input schema already describes the 'messages' parameter structure with enum roles and order, and schema coverage is 100%. The description restates this as 'messages array [{role, content}]' but adds no new semantic detail beyond reinforcing the conversational context. Baseline 3 is appropriate.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly specifies the tool's purpose: analyzing multi-turn conversations for context retention, topic drift, instruction following, and repetition. It names the resource (conversation) and action (analyze), and distinguishes from siblings like 'hallucination_check' and 'consistency_check' by focusing on comprehensive conversation analysis.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description implies usage for chatbot QA but does not explicitly state when to use this tool over alternatives like 'response_quality_score' or 'context_window_check'. It lacks exclusion criteria or alternative recommendations, leaving the agent to infer context.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
cookie_security_auditARead-onlyInspect
Audit the security attributes of cookies set by any URL. Fetches the URL and inspects all Set-Cookie headers for: HttpOnly, Secure, SameSite, Domain scope, Path scope, Max-Age/Expires, __Host-/__Secure- prefixes. Flags insecure patterns: missing HttpOnly on session cookies, missing Secure flag, SameSite=None without Secure, overly broad Domain, and excessive TTL. Returns per-cookie grades and an overall security score (0–100).
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | Full URL to audit (e.g. https://example.com/login) |
Output Schema
| Name | Required | Description |
|---|---|---|
| url | No | |
| name | No | |
| path | No | |
| score | No | |
| domain | No | |
| issues | No | |
| secure | No | |
| cookies | No | |
| max_age | No | |
| message | No | |
| httpOnly | No | |
| sameSite | No | |
| host_prefix | No | |
| cookies_found | No | |
| secure_prefix | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint=true and destructiveHint=false. The description adds detailed behavioral context: fetches the URL, inspects Set-Cookie headers, flags insecure patterns, and returns grades/score. This goes beyond annotations without contradicting them.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is four sentences, front-loaded with the core purpose, then lists specific attributes and flags, and ends with output summary. No redundant information; every sentence is informative and earns its place.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
For a tool with one parameter, existing annotations, and output schema, the description thoroughly covers what the tool does, how it inspects, what it flags, and what it returns. It provides sufficient context for an AI agent to select and invoke the tool correctly.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
The schema has 100% coverage with a clear description for the url parameter. The tool description further explains how the URL will be used (fetched for cookie inspection), adding meaning beyond the schema alone.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool audits security attributes of cookies for any URL, specifies what it inspects (HttpOnly, Secure, SameSite, etc.), and what it returns (per-cookie grades and overall score). This distinguishes it from sibling tools like security_headers_check or ssl_certificate_check.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description implicitly tells when to use: when needing to audit cookie security for a URL. No explicit exclusions or alternatives are mentioned, but the context's sibling list shows no other cookie-specific tool, so the purpose is clear.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
cors_checkerARead-onlyInspect
Check the CORS configuration of a URL the same way a browser would. Returns the main response status, all Access-Control-* headers, the tested origin, and the preflight OPTIONS response. Use this for direct CORS debugging, not just security auditing.
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | Full URL to test, e.g. https://api.example.com/resource | |
| method | No | HTTP method to simulate (default: GET) | |
| origin | No | Origin header to simulate (default: https://yourdomain.com) |
Output Schema
| Name | Required | Description |
|---|---|---|
| url | No | |
| method | No | |
| status | No | |
| preflight | No | |
| allHeaders | No | |
| corsHeaders | No | |
| testedOrigin | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint=true and destructiveHint=false. The description adds behavioral context beyond annotations: it explains the tool simulates browser behavior and returns the preflight OPTIONS response, which is not indicated in the annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is two sentences long, with the first sentence stating the tool's functionality and the second providing usage context. No unnecessary words or repetition.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
The tool has an output schema (not shown), and the description explains the return values (status, headers, origin, preflight). Given the annotations and schema coverage, the description is adequate for understanding the tool's behavior.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema description coverage is 100%, and the schema provides sufficient descriptions for all three parameters. The description does not add additional semantic or formatting details about parameters beyond what the schema already provides.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool checks CORS configuration like a browser and returns specific response details. It distinguishes itself from the sibling 'cors_test' by specifying it is for direct CORS debugging, not just security auditing.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description provides a usage hint ('Use this for direct CORS debugging, not just security auditing') but does not explicitly state when not to use the tool or mention alternative tools like 'cors_test'. The guidance is implied rather than explicit.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
cors_testARead-onlyInspect
Test a URL for CORS misconfigurations. Sends preflight (OPTIONS) and cross-origin requests with various Origin headers to detect: wildcard origins with credentials, origin reflection (echoing any origin), null origin acceptance, subdomain wildcard bypass, and missing Vary headers. Returns risk level (safe/low/medium/high/critical), per-test results, and fix recommendations. Essential for API security audits.
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | Full URL to test (e.g. https://api.example.com/endpoint) | |
| origin | No | Custom Origin header to test (default: tests multiple origins automatically) |
Output Schema
| Name | Required | Description |
|---|---|---|
| url | No | |
| tests | No | |
| risk_level | No | |
| origins_tested | No | |
| total_findings | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations (readOnlyHint=true, openWorldHint=true) indicate safety and external requests. Description adds specifics about sending preflight and cross-origin requests and the types of misconfigurations detected, providing useful context beyond annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Three well-structured sentences that front-load purpose, list detection categories, and mention return value. No wasted words.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given complexity and presence of output schema, description covers what the tool returns (risk level, per-test results, fix recommendations) and its purpose in security audits. Complete for an MCP tool.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100%, and description adds context about how parameters are used (e.g., 'sends preflight and cross-origin requests with various Origin headers'). The optional origin parameter is explained.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states it tests CORS misconfigurations and lists specific checks (wildcard origins, origin reflection, etc.). It distinguishes from siblings by focusing on security audits and providing detailed detection categories.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description implies use for API security audits but does not explicitly state when not to use it or mention alternatives like cors_checker. However, context is clear.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
cot_analyzerARead-onlyIdempotentInspect
Analyze a Chain-of-Thought (CoT) or reasoning trace from an LLM. Detects step count, logical flow, conclusion presence, backtracking, and estimates reasoning depth. Useful for o1/o3/DeepSeek-R1 evaluation.
| Name | Required | Description | Default |
|---|---|---|---|
| reasoning | Yes | The CoT / reasoning trace text (e.g. from <think> tags or step-by-step output) | |
| expected_conclusion | No | Expected final answer to check against (optional) |
Output Schema
| Name | Required | Description |
|---|---|---|
| markers | No | |
| step_count | No | |
| total_chars | No | |
| total_lines | No | |
| has_conclusion | No | |
| reasoning_depth | No | |
| backtracking_signals | No | |
| reasoning_depth_label | No | |
| conclusion_matches_expected | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already indicate read-only and idempotent behavior. The description adds value by detailing what the tool extracts (step count, logical flow, conclusion presence, etc.), which are not apparent from annotations alone. It does not contradict annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two sentences: first defines the tool's purpose and capabilities, second gives concrete usage context. No redundant words, and critical information is front-loaded. This is an example of efficient, well-structured description.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool has only two required parameters and an output schema, the description covers inputs and analysis functions adequately. It does not describe the output format, but the presence of an output schema relieves that burden. Minor gap: no mention of handling long or malformed traces.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100%, so baseline is 3. The description does not enhance parameter explanations beyond what the schema provides (e.g., details on expected_conclusion format or behavior). The extra analysis capabilities mentioned are not tied to specific parameters.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description explicitly states the tool analyzes Chain-of-Thought reasoning traces, listing specific detection capabilities like step count, logical flow, and backtracking. This clearly distinguishes it from sibling tools such as bias_detect or hallucination_check, which focus on other aspects of LLM output.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description mentions it is 'useful for o1/o3/DeepSeek-R1 evaluation', providing clear context for when to use it. However, it does not advise against using it in other scenarios or name alternatives, such as conversation_analyze for full conversations.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
count_code_linesARead-onlyIdempotentInspect
Count lines of code: total, code lines, comment lines, blank lines, and comment density. Supports JS/TS, Python, Java/C/C++, Ruby, Go, Shell, HTML/XML, and CSS.
| Name | Required | Description | Default |
|---|---|---|---|
| input | Yes | Source code to analyze | |
| language | No | Language hint: "js", "ts", "py", "java", "c", "rb", "go", "sh", "html", "css" (auto-detect if omitted) |
Output Schema
| Name | Required | Description |
|---|---|---|
| language | No | |
| code_lines | No | |
| blank_lines | No | |
| total_lines | No | |
| comment_lines | No | |
| comment_density | No | |
| code_to_comment_ratio | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint=true, idempotentHint=true, destructiveHint=false, so the description adds no further behavioral context beyond listing supported languages. No contradictions. The bar is lowered by annotations, and the description meets it minimally.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is two sentences with no fluff. The first sentence enumerates the outputs, and the second lists supported languages. Every sentence is informative and earns its place.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
The tool has an output schema, so return values are covered. The description is complete for a simple counting tool, though it could mention if it handles file extensions or comment styles for each language. Still, it's adequate.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema description coverage is 100%: both 'input' and 'language' have descriptions. The description adds value by listing the exact language codes, but this is already implied by the schema's language hint. Baseline 3 is appropriate.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool counts lines of code with specific breakdowns (total, code, comment, blank lines, density) and lists supported languages. The name is self-explanatory, and it is distinct from sibling tools like text stats or analysis tools.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description does not explicitly state when to use this tool versus alternatives. It implies usage for code analysis but lacks guidance on context, exclusions, or when another tool might be better. Given the sibling set includes many text processing tools, this is a gap.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
count_tokensARead-onlyIdempotentInspect
Estimate the token count of a text string using the cl100k_base approximation (~4 chars/token). Call this BEFORE sending any text to an LLM API to check if it fits within the model context window and to estimate cost. Returns token estimate, character count, and word count.
| Name | Required | Description | Default |
|---|---|---|---|
| input | Yes | Text to count tokens for |
Output Schema
| Name | Required | Description |
|---|---|---|
| chars | No | |
| words | No | |
| tokens_estimate | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint, idempotentHint, destructiveHint, so safety profile is clear. The description adds valuable behavioral context: approximation method (cl100k_base, ~4 chars/token) and return values (token estimate, char count, word count). No contradictions.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is two sentences plus an introductory line, all relevant and without redundancy. It efficiently covers purpose, usage, and output, with no wasted words.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
For a simple tool with one string parameter and an output schema (present but not shown), the description sufficiently covers the estimation method, use case, and return values. No missing critical information.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema description coverage is 100%, so baseline 3. The parameter 'input' is described in schema as 'Text to count tokens for', and the main description adds the encoding detail but not directly in the parameter context. No significant additional parameter-specific meaning.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states it estimates token count of a text string using cl100k_base approximation, and distinguishes from siblings by specifying the method and output fields. The verb 'estimate' and resource 'token count' are specific.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description explicitly says to call this before sending text to an LLM API to check context window fit and estimate cost, providing clear when-to-use guidance. However, it does not explicitly exclude other tools or mention alternatives in the sibling set, slightly reducing differentiation.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
create_confluence_pageAInspect
Create a new Confluence page from the output of jira_to_test_suite. Formats Gherkin, E2E steps, API tests, and test data as a properly structured Confluence page with code blocks and tables. STATEFUL — creates a new page in the specified space.
| Name | Required | Description | Default |
|---|---|---|---|
| title | No | Page title. Defaults to "Test Plan: {issue_key}" | |
| issue_key | No | Source Jira issue key (for the page title and source link) | |
| issue_url | No | Source Jira issue URL (added as a link in the page) | |
| space_key | Yes | Confluence space key where the page will be created, e.g. "QA", "ENG" | |
| test_suite | Yes | The test_suite object from jira_to_test_suite result | |
| parent_page_id | No | Optional parent page ID — page will be created as a child of this page | |
| confluence_email | Yes | Atlassian account email | |
| confluence_token | Yes | Atlassian API token | |
| confluence_base_url | Yes | Atlassian base URL |
Output Schema
| Name | Required | Description |
|---|---|---|
| title | No | |
| page_id | No | |
| success | No | |
| page_url | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Adds value beyond annotations by describing the formatting behavior (code blocks, tables) and emphasizing statefulness. Annotations are all false, so no contradictions; description fills in behavioral details.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two focused sentences with no wasted words. Front-loaded with verb and resource, followed by specific details and statefulness warning.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Covers core purpose and integration point. Could benefit from mentioning prerequisite steps or page structure, but output schema likely handles return details. Adequate for a creation tool with rich schema.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100%, baseline 3. Description links the test_suite parameter to jira_to_test_suite, adding semantic context. Does not repeat schema descriptions but provides integration guidance.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
Clearly states 'Create a new Confluence page' with specific content formatting (Gherkin, E2E steps, API tests). Distinguishes from sibling tools like fetch_confluence_page (read) and jira_to_test_suite (input generation).
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Explicitly mentions it takes output from jira_to_test_suite as input, establishing a prerequisite. Also notes statefulness. Lacks explicit when-not-to-use or alternative tools, but context is clear.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
cron_parseARead-onlyIdempotentInspect
Parse a cron expression into a human-readable schedule description. Supports standard 5-field cron (minute hour day month weekday).
| Name | Required | Description | Default |
|---|---|---|---|
| expression | Yes | Cron expression (e.g., "0 9 * * 1-5", "*/15 * * * *") |
Output Schema
| Name | Required | Description |
|---|---|---|
| fields | No | |
| expression | No | |
| human_readable | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint=true, idempotentHint=true, and destructiveHint=false, so the tool's safety profile is clear. The description adds the context that the output is a 'human-readable schedule description', which is useful but does not go beyond the annotations. There is no contradiction.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is two sentences long, with no wasted words. It front-loads the primary action and resource, then adds the format constraint. Every sentence earns its place, and the structure is clear and efficient.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
For a simple single-parameter tool with an output schema, the description is adequately complete. It explains what the tool does, what format it accepts, and hints at the output. It does not mention error handling or unsupported cron variants, but the simplicity and the presence of an output schema make additional detail less critical.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
The input schema has 100% coverage for the required 'expression' parameter, with a description of example values. The description adds semantic value by explaining the supported format: 'standard 5-field cron (minute hour day month weekday)', which helps the agent understand valid expressions beyond the schema examples. This is meaningful context.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool's purpose: 'Parse a cron expression into a human-readable schedule description.' It specifies the verb 'parse' and the resource 'cron expression', and it distinguishes from sibling tools like cron_validator by not mentioning validation. The specification of 'standard 5-field cron' adds precision.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description does not provide explicit guidance on when to use this tool versus alternatives such as cron_validator. It states that it supports standard 5-field cron but does not mention use cases, exclusions, or prerequisites. The usage context is implied but not direct.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
cron_validatorARead-onlyIdempotentInspect
Validate a 5-field cron expression, explain the schedule, and preview the next execution times. Use this to debug cron jobs before they reach production. Returns parsed fields, a human-readable description, and upcoming ISO timestamps.
| Name | Required | Description | Default |
|---|---|---|---|
| expression | Yes | Cron expression with 5 fields, e.g. "*/15 9-18 * * 1-5" | |
| next_runs_count | No | How many upcoming runs to return (1-50, default: 10) |
Output Schema
| Name | Required | Description |
|---|---|---|
| valid | No | |
| fields | No | |
| next_runs | No | |
| expression | No | |
| human_readable | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint=true and destructiveHint=false. The description adds behavioral details: it returns parsed fields, a human-readable description, and upcoming ISO timestamps. No contradictions.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two sentences with no wasted words. The first sentence explains the core functionality, the second gives usage context and output summary.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given full schema coverage and an output schema, the description is complete. It provides usage context and output details, making it easy for an agent to decide when to use this tool.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100%, so the schema documents both parameters. The description includes an example cron expression but does not add significant semantic meaning beyond the schema. Baseline 3 is appropriate.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool validates a 5-field cron expression, explains the schedule, and previews next execution times. It distinguishes from the sibling tool `cron_parse` by focusing on validation and debugging.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description explicitly says 'Use this to debug cron jobs before they reach production,' which provides a clear when-to-use context. It does not mention when not to use or alternatives, but the specificity is sufficient.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
decode_jwtARead-onlyIdempotentInspect
Decode a JWT (JSON Web Token) and return its header and payload without verifying the signature. Also reports whether the token is expired and the exact expiry date. Use to inspect claims (sub, iss, exp, roles) during debugging or when integrating with an auth provider.
| Name | Required | Description | Default |
|---|---|---|---|
| token | Yes | The JWT string to decode (header.payload.signature) |
Output Schema
| Name | Required | Description |
|---|---|---|
| note | No | |
| header | No | |
| expired | No | |
| payload | No | |
| expiresAt | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
The description adds behavioral context beyond annotations: it explains that signature verification is not performed and that the tool reports expiration status and date. This complements the readOnlyHint and idempotentHint annotations without contradiction.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two precise sentences: first defines core functionality, second states use case. No extraneous content, front-loaded with key information.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the presence of an output schema (not shown but indicated), the description fully covers what the tool returns (header, payload, expiration info) and its use case, making it complete.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
With 100% schema coverage, the description adds minimal value over the schema's parameter description, only restating the token format. Baseline score of 3 is appropriate.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool decodes a JWT, returns header and payload without verification, and reports expiration. It uses specific verb 'decode' and resource 'JWT', distinguishing it from siblings like base64_decode or hash_text.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description advises using the tool to inspect claims during debugging or auth integration, providing clear context. However, it does not explicitly state when not to use it or mention alternatives, though no direct sibling exists.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
detect_languageARead-onlyIdempotentInspect
Detect the natural language of a text using n-gram frequency analysis and common word markers. Supports 15 languages: English, French, Spanish, German, Italian, Portuguese, Dutch, Russian, Chinese, Japanese, Korean, Arabic, Polish, Turkish, Swedish.
| Name | Required | Description | Default |
|---|---|---|---|
| input | Yes | Text to detect language from (min 20 chars for accuracy) |
Output Schema
| Name | Required | Description |
|---|---|---|
| lang | No | |
| name | No | |
| score | No | |
| method | No | |
| matched | No | |
| language | No | |
| confidence | No | |
| top_candidates | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
The description discloses the analysis method (n-gram frequency analysis and common word markers) and the minimum character requirement, adding value beyond the annotations (which only indicate read-only, idempotent, non-destructive). It does not contradict annotations. However, it does not detail handling of out-of-list languages or short inputs.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is two sentences: the first explains the core function and method, the second lists supported languages and a usage hint. It is front-loaded with the action, no redundancy, and every sentence serves a purpose.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool's simplicity (one parameter, good annotations, output schema exists), the description covers purpose, method, language range, and a usage hint. It is fairly complete, though it could mention expected output format or error cases. The output schema likely covers return values.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100% for the single parameter 'input'. The description adds the 'min 20 chars for accuracy' detail, which is not in the schema description. This enhances understanding beyond the schema alone. Baseline 3 increased to 4 for the added value.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the action ('Detect the natural language') and the resource ('a text'). It lists 15 supported languages, distinguishing it from sibling tools like 'calculate_readability' or 'text_stats'. The verb and resource are specific and unambiguous.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description mentions a minimum character length (20 chars) for accuracy, which helps in usage, but it does not provide explicit guidance on when to use this tool versus alternatives (e.g., other text analysis tools). No 'when not to use' or alternative tool references are given.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
detect_secretsARead-onlyIdempotentInspect
Scan code or config files for hardcoded secrets: AWS keys, GitHub tokens, OpenAI/Anthropic API keys, Stripe secrets, JWTs, database connection strings, and generic passwords. Returns findings with severity. Run before every commit.
| Name | Required | Description | Default |
|---|---|---|---|
| input | Yes | Code or config content to scan (max 500KB) | |
| filename | No | Optional filename for context (e.g. ".env", "config.js") |
Output Schema
| Name | Required | Description |
|---|---|---|
| filename | No | |
| findings | No | |
| risk_level | No | |
| recommendation | No | |
| total_findings | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare read-only and idempotent behavior. The description adds that it returns findings with severity, which complements the annotations without contradiction. It does not elaborate on all behaviors (e.g., output structure), but the output schema exists to cover that.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is extremely concise: two sentences total. The first sentence introduces the action and key examples, and the second sentence mentions return format and usage recommendation. No redundant words or unnecessary details.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
The description covers the core purpose, return format (findings with severity), and usage recommendation. Given the presence of an output schema, it does not need to detail return values. It is complete enough for a scanning tool with two well-documented parameters.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Both 'input' and 'filename' parameters are fully described in the input schema with 100% coverage. The tool description does not add new semantic information beyond the schema (e.g., 'input' is the content to scan, 'filename' provides context). Baseline score is appropriate.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool scans code/config files for hardcoded secrets, listing specific types like AWS keys, GitHub tokens, and API keys. This verb+resource combination is highly specific and distinguishes it from generic sibling tools like 'secret_scan'.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description includes the explicit recommendation 'Run before every commit,' which provides a clear usage context. However, it does not specify when not to use this tool or mention alternatives such as 'secret_scan,' leaving some ambiguity.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
diff_textARead-onlyIdempotentInspect
Compute a unified line-by-line diff between two text strings (LCS algorithm). Returns added/removed/unchanged line counts and formatted diff hunks with configurable context lines (0–20). Use to compare versions of prompts, configs, code snippets, or any text where you need to see exactly what changed.
| Name | Required | Description | Default |
|---|---|---|---|
| a | Yes | Original (before) text | |
| b | Yes | Modified (after) text | |
| context | No | Context lines around each change (0–20, default: 3) |
Output Schema
| Name | Required | Description |
|---|---|---|
| diff | No | |
| added | No | |
| removed | No | |
| unchanged | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint=true, idempotentHint=true, and destructiveHint=false. The description adds value by disclosing the algorithm (LCS), the output format (counts and diff hunks), and configurable context lines, providing useful behavioral context beyond the annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is two concise sentences with no wasted words. It front-loads the main action and outcomes, making it easy to grasp quickly.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the presence of an output schema that likely documents return values, the description sufficiently covers the tool's purpose, input parameters, output summary, and use cases. It is complete for a relatively simple diff tool.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100%, so the description adds no additional meaning beyond the schema. It mentions 'configurable context lines (0–20)' but does not elaborate on parameter usage or constraints beyond what the schema already provides.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool computes a unified line-by-line diff between two text strings using LCS algorithm. It lists specific use cases (prompts, configs, code snippets) but does not explicitly differentiate it from sibling comparison tools like compare_responses or json_diff.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description provides clear contexts for when to use the tool ('compare versions of prompts, configs, code snippets, or any text where you need to see exactly what changed') but does not specify when not to use it or mention alternative tools for similar tasks.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
embedding_similarityARead-onlyIdempotentInspect
Compute text similarity using local algorithms (Bag of Words, TF-IDF, Character N-grams). No API key needed — runs entirely in-process. NOT real embeddings: for true semantic similarity with vector embeddings, use run_semantic_tests with mode="embeddings" and your OpenAI API key. Supports single pair or batch mode with pipe-separated pairs. Useful for RAG retrieval testing, semantic search evaluation, and text deduplication.
| Name | Required | Description | Default |
|---|---|---|---|
| batch | No | Batch mode: array of { text_a, text_b } pairs. Overrides text_a/text_b if provided. | |
| text_a | No | First text to compare (single-pair mode) | |
| text_b | No | Second text to compare (single-pair mode) | |
| methods | No | Algorithms to use (default: all three). Options: "bow", "tfidf", "ngram" |
Output Schema
| Name | Required | Description |
|---|---|---|
| mode | No | |
| count | No | |
| scores | No | |
| text_a | No | |
| text_b | No | |
| results | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint=true, idempotentHint=true, and destructiveHint=false. The description adds valuable behavioral context: it runs entirely in-process, requires no API key, and explicitly states it is not real embeddings. This goes beyond the annotations by explaining the computational model and limitations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is concise at three sentences. The first sentence immediately states the core function and algorithms. The second adds the key distinctions (no API key, not real embeddings). The third lists use cases. Every sentence is purposeful with no redundancy or fluff.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool's simplicity (4 optional params, no required, output schema exists), the description covers all essential aspects: purpose, usage guidelines, behavioral transparency, and use cases. It does not need to describe return values because the output schema is present and handles that.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100%, so the baseline is 3. The description adds context about single pair vs batch mode and mentions 'pipe-separated pairs', which is slightly misleading as the schema defines batch as an array of objects. However, the schema is authoritative, so the description adds marginal value beyond the schema's parameter descriptions.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the verb 'Compute', the resource 'text similarity', and specifies the local algorithms (Bag of Words, TF-IDF, Character N-grams). It distinguishes from semantic embedding tools like run_semantic_tests, providing a specific and distinct purpose.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description provides explicit guidance on when to use (no API key needed, in-process) and when not to use (NOT real embeddings, alternative run_semantic_tests for true semantic similarity). It lists concrete use cases (RAG retrieval testing, semantic search evaluation, text deduplication), making the decision clear.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
env_parseARead-onlyIdempotentInspect
Parse a .env file content into a JSON object. Handles quoted values (single and double), inline comments, export prefix, and escaped sequences (\n, \t inside double quotes). Returns all key-value pairs. Use in CI/CD pipelines, agent config loaders, or when processing dotenv files programmatically.
| Name | Required | Description | Default |
|---|---|---|---|
| input | Yes | .env file content to parse (e.g. the output of `cat .env`) |
Output Schema
| Name | Required | Description |
|---|---|---|
| vars | No | |
| count | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint=true, idempotentHint=true, and destructiveHint=false, indicating a safe, non-mutating operation. The description adds value by detailing handling of quotes, comments, export, and escape sequences, which goes beyond the annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is concise, with a clear first sentence stating purpose, followed by details and use cases. Every sentence adds value, and the structure is well front-loaded.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool has a single parameter, high schema coverage, and an output schema, the description provides all necessary context: input format, parsing behavior, and expected output (JSON object). It is complete for the tool's complexity.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100%, and the parameter description is clear. The description adds extra context about parsing behavior and edge cases, improving parameter understanding beyond the schema.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description explicitly states the tool parses .env file content into a JSON object, listing supported features (quoted values, comments, export prefix, escapes). It clearly distinguishes from sibling tools, as no other tool in the list specifically parses .env files.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description provides explicit use cases: CI/CD pipelines, agent config loaders, or processing dotenv files programmatically. It does not state when not to use, but the tool is sufficiently specialized that this is acceptable.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
escape_htmlARead-onlyIdempotentInspect
Escape HTML special characters (&, <, >, ", ') to their safe HTML entities. ALWAYS call this before inserting any user-provided or LLM-generated content into an HTML template to prevent cross-site scripting (XSS) attacks.
| Name | Required | Description | Default |
|---|---|---|---|
| input | Yes | String to HTML-escape |
Output Schema
| Name | Required | Description |
|---|---|---|
| escaped | No | |
| original_length | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint=true, idempotentHint=true, and destructiveHint=false, so the safety profile is clear. The description adds that it escapes specific characters, which is consistent with annotations. No additional behavioral traits are needed beyond stating the transformation.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is two sentences long, front-loaded with the action and purpose. Every sentence adds value: first explains what it does, second gives explicit usage guidance. No wasted words.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool's simplicity (one parameter, output schema present), the description covers everything needed: the operation, the specific characters, and the critical security context. No gaps remain.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100% and the parameter 'input' has a clear description. The tool description does not add new information about the parameter beyond what the schema provides, so baseline of 3 is appropriate.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool escapes HTML special characters to safe entities, specifying the exact characters affected. It also differentiates by emphasizing security (XSS prevention) and implies a clear use case, distinguishing it from siblings like 'unescape_html' or other encoding tools.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description gives explicit when-to-use guidance: 'ALWAYS call this before inserting any user-provided or LLM-generated content into an HTML template to prevent XSS attacks.' It does not explicitly mention alternatives, but the context of sibling tools (e.g., unescape_html, strip_markdown) makes the exclusion clear.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
estimate_llm_costARead-onlyIdempotentInspect
Estimate the API cost in USD for a given model and token counts. Supports all major 2024–2026 models: GPT-4o, GPT-4.1, o3, o4-mini, Claude Opus 4, Claude Sonnet 4/4.5, Gemini 2.5 Pro/Flash, DeepSeek V3/R1, Grok 3, and legacy models.
| Name | Required | Description | Default |
|---|---|---|---|
| model | Yes | Model name, e.g. "gpt-4o", "claude-3.5-sonnet", "deepseek-v3" | |
| input_tokens | Yes | Number of input/prompt tokens | |
| output_tokens | No | Number of output/completion tokens (default: 0) |
Output Schema
| Name | Required | Description |
|---|---|---|
| model | No | |
| rates | No | |
| input_tokens | No | |
| output_tokens | No | |
| input_cost_usd | No | |
| total_cost_usd | No | |
| output_cost_usd | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
The description is consistent with annotations (readOnly, idempotent) and adds context about supported models and the nature of the computation. No contradictions or missing behavioral traits.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is a single sentence that efficiently conveys the purpose and scope, with no wasted words. It is front-loaded with the core action.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
For a simple cost estimation tool with three well-described parameters and an output schema (not shown), the description covers all necessary context without redundancy.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100%, so the schema already sufficiently describes each parameter. The description adds no additional semantic nuance beyond the schema definitions, earning a baseline score.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool estimates API cost in USD for a given model and token counts, with a specific verb and resource. It distinguishes from sibling tools like count_tokens or token_budget_calculator by focusing on cost estimation and listing supported model families.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description provides clear context for when to use the tool (cost estimation for major models) but does not explicitly state when not to use it or recommend alternatives. However, the name and limited description offer sufficient implicit guidance.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
extract_json_from_textARead-onlyIdempotentInspect
Extract the first valid JSON object or array embedded in chaotic LLM output (surrounded by markdown fences, prose, or explanatory text). Handles ```json blocks and inline JSON. Call this whenever an LLM returns structured data mixed with explanation text instead of raw JSON.
| Name | Required | Description | Default |
|---|---|---|---|
| input | Yes | Raw text (e.g., LLM output) that may contain a JSON object or array |
Output Schema
| Name | Required | Description |
|---|---|---|
| json | No | |
| source | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations indicate read-only and idempotent behavior. The description adds that it extracts the 'first valid' JSON, handles specific formats like ```json blocks, and is designed for chaotic output, adding useful behavioral context beyond the annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two sentences: first defines action and input type, second provides usage guidance. No unnecessary words, front-loaded with core purpose.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool's simplicity (one parameter, simple extraction), the description fully covers what it does, when to use it, and how input is handled. The presence of an output schema further reduces need for return value details.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
With 100% schema coverage, the baseline is 3. The description adds value by describing the input as 'chaotic LLM output' and detailing the types of surrounding text, which enriches the parameter meaning.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool extracts the first valid JSON object/array from chaotic text, including handling markdown fences and inline JSON. It distinguishes itself from siblings like 'extract_json_path' by focusing on embedded JSON in prose.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Explicitly says to call it 'whenever an LLM returns structured data mixed with explanation text instead of raw JSON'. Provides clear context but does not mention when not to use or alternatives explicitly.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
extract_json_pathARead-onlyIdempotentInspect
Extract a value from a JSON string using dot-notation path (e.g., "user.address.city", "items.0.name", "meta.tags"). Supports array index access via numeric path segments.
| Name | Required | Description | Default |
|---|---|---|---|
| path | Yes | Dot-notation path, e.g. "user.address.city" or "items.0.name" | |
| input | Yes | A valid JSON string to traverse |
Output Schema
| Name | Required | Description |
|---|---|---|
| path | No | |
| type | No | |
| value | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already indicate readOnly, idempotent, and non-destructive behavior. The description adds that array index access via numeric path segments is supported, which provides some additional behavioral context beyond annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is a single concise sentence with illustrative examples. It is front-loaded with the core purpose. However, it could be slightly more structured (e.g., listing supported features).
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool's simplicity (2 required params, output schema exists), the description covers the essential functionality. It does not detail error handling or edge cases, but for a basic JSON extraction tool, it is sufficiently complete.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
The input schema has 100% description coverage for both parameters. The description adds example paths for the 'path' parameter, but this is minimal added value since the schema already defines the format.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the action (extract), the resource (JSON string), and the method (dot-notation path). It effectively distinguishes from sibling tools like 'extract_json_from_text' which extracts an entire JSON object from text, and 'json_diff' which compares JSONs.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description provides no guidance on when to use this tool versus alternatives. It lacks explicit 'when to use' or 'when not to use' instructions, and does not mention any prerequisites or context for selection.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
extract_linksARead-onlyIdempotentInspect
Extract all URLs, email addresses, and domain names from text. Returns categorized and deduplicated results. Useful for content auditing, link checking, and web scraping validation.
| Name | Required | Description | Default |
|---|---|---|---|
| input | Yes | Text to extract links from | |
| types | No | Types to extract (default: all three) |
Output Schema
| Name | Required | Description |
|---|---|---|
| total | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint, idempotentHint, destructiveHint. Description adds useful behavioral info: 'categorized and deduplicated results'. No contradictions, but could detail output structure or limits.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Three sentences, no redundant words. Front-loaded with the core action, followed by output characteristics and use cases. Highly efficient.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the simple parameters (2, 1 required), complete annotations, and presence of output schema, the description provides all necessary context without gaps.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100%, so baseline is 3. Description adds default behavior for 'types' parameter ('default: all three'), which goes beyond schema. Provides meaningful extra context.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
States specific verb 'Extract' and resources 'URLs, email addresses, and domain names'. Clearly distinguishes from sibling tools (e.g., url_decode, domain-specific extractors) by specifying the exact types extracted and categorized output.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Provides use cases like 'content auditing, link checking, and web scraping validation', giving clear context. However, does not explicitly state when not to use or mention alternatives, missing some guidance.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
extract_todosARead-onlyIdempotentInspect
Extract TODO, FIXME, HACK, BUG, NOTE, OPTIMIZE, and custom tags from any source code or text. Returns line numbers, tag types, and message text. Essential for technical debt auditing.
| Name | Required | Description | Default |
|---|---|---|---|
| tags | No | Custom tags to add (default set: TODO, FIXME, HACK, NOTE, BUG, OPTIMIZE, XXX) | |
| input | Yes | Code or text to scan | |
| include_context | No | Include full line text (default: true) |
Output Schema
| Name | Required | Description |
|---|---|---|
| items | No | |
| total | No | |
| counts | No | |
| has_critical | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare the tool as read-only, idempotent, and non-destructive. The description complements this by stating the return data (line numbers, tag types, message text) and its broad applicability, adding value beyond the annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is two sentences long, front-loaded with the action, and contains no unnecessary words. Every sentence contributes to understanding the tool's purpose.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the simple input schema (3 parameters, 1 required) and the presence of an output schema (which explains return values), the description provides sufficient context. The mention of default tag set and return fields completes the picture.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
All three parameters have descriptions in the input schema (100% coverage), so the description adds no new information beyond what the schema provides. Baseline score of 3 is appropriate.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool's function: extract TODO, FIXME, HACK, etc., tags from source code or text, and specifies what it returns (line numbers, tag types, message text). It differentiates from sibling tools by focusing on technical debt auditing, which is unique among the many text processing tools.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description implies usage for technical debt auditing but does not explicitly specify when to use this tool over alternatives or when not to use it. No exclusions are mentioned, but the context is clear enough for an agent to infer its purpose.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
fetch_confluence_pageARead-onlyInspect
Fetch a Confluence page and return its content as clean Markdown. Accepts a numeric page_id or a full page URL. Optionally lists direct child pages. BYOK — credentials transit in-memory only, never stored.
| Name | Required | Description | Default |
|---|---|---|---|
| page_id | No | Confluence page ID (numeric string), e.g. "123456789" | |
| page_url | No | Full Confluence page URL (alternative to page_id), e.g. "https://mycompany.atlassian.net/wiki/spaces/ENG/pages/123456789" | |
| confluence_email | Yes | Atlassian account email (same credentials as Jira) | |
| confluence_token | Yes | Atlassian API token | |
| include_children | No | List direct child pages (id + title) (default: false) | |
| confluence_base_url | Yes | Atlassian base URL, e.g. "https://mycompany.atlassian.net" |
Output Schema
| Name | Required | Description |
|---|---|---|
| url | No | |
| title | No | |
| page_id | No | |
| children | No | |
| markdown | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint=true, so the agent knows it's safe. The description adds behavioral context: credentials are transient, and children listing is optional. No contradictions.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Three concise sentences with no wasted words. Front-loaded with purpose, then input options, then security note.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the output schema exists, the description does not need to detail returns. It covers all key aspects: input modes, optional behavior, and security. Complete for a fetch tool.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
With 100% schema coverage, the description adds value by clarifying the relationship between page_id and page_url ('accepts a numeric page_id or a full page URL') and explaining include_children ('optionally lists direct child pages').
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the verb 'fetch', the resource 'Confluence page', and the output 'clean Markdown'. It also specifies input options (page_id or URL) and optional behavior (list children). This distinguishes it from sibling tools like create_confluence_page.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description provides context for use (fetching, reading) and mentions security ('BYOK...never stored'). However, it does not explicitly state when not to use or name alternatives (e.g., create_confluence_page for writing).
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
fetch_jira_issueARead-onlyInspect
Fetch a complete Jira issue: summary, description converted to Markdown, status, assignee, priority, labels, custom fields, and optionally comments and attachment metadata. BYOK — credentials transit in-memory only, never stored on ia-qa.com.
| Name | Required | Description | Default |
|---|---|---|---|
| fields | No | Specific Jira field names to return. Omit for all standard fields. | |
| issue_key | Yes | Jira issue key, e.g. "PROJ-123" | |
| jira_email | Yes | Atlassian account email | |
| jira_token | Yes | Atlassian API token (from id.atlassian.com > Security > API tokens) | |
| jira_base_url | Yes | Atlassian base URL, e.g. "https://mycompany.atlassian.net" | |
| include_comments | No | Include issue comments, up to 20 (default: true) | |
| include_attachments | No | Include attachment metadata list (default: false) |
Output Schema
| Name | Required | Description |
|---|---|---|
| key | No | |
| url | No | |
| type | No | |
| labels | No | |
| status | No | |
| summary | No | |
| assignee | No | |
| priority | No | |
| reporter | No | |
| description | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
The description discloses important behavioral traits beyond annotations: credentials are handled in-memory only (BYOK), and descriptions are converted to Markdown. These details add significant transparency. No contradictions with annotations (readOnlyHint, destructiveHint).
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two sentences, front-loaded with essential information (returned fields) and a critical security note. Every word earns its place—no redundancy or filler.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Covers core functionality and security adequately. Lacks error handling information (e.g., invalid credentials, issue not found), but the presence of an output schema mitigates the need for full return-value descriptions. Enough for agent decision-making.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema description coverage is 100%, and the description does not add parameter-level meaning beyond the schema. The description focuses on output rather than input details, so it meets the baseline without further enhancement.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the verb 'Fetch' and resource 'Jira issue', listing specific fields returned (summary, status, assignee, etc.). It is distinct from sibling tools like search_jira_issues and post_jira_comment, making the purpose unmistakable.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
No explicit usage guidelines or comparisons with alternatives are provided. The purpose is implied for fetching a single issue by key, but there is no directive on when to use this tool versus search_jira_issues or post_jira_comment, leaving room for ambiguity.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
fetch_veille_feedARead-onlyInspect
Fetch the latest QA & AI/LLM articles aggregated from curated RSS sources (Google Testing Blog, DEV.to Testing/QA/AI/LLM/Agents, Hugging Face Blog, Simon Willison). Perfect for agents monitoring the QA & AI landscape.
| Name | Required | Description | Default |
|---|---|---|---|
| limit | No | Max articles to return (default: 20, max: 50) | |
| category | No | Filter: "qa" (testing/quality), "ai" (AI/LLM/agents), "all" (default — both) |
Output Schema
| Name | Required | Description |
|---|---|---|
| articles | No | |
| category | No | |
| total_found | No | |
| sources_queried | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already mark readOnlyHint=true and destructiveHint=false, so the agent knows it's a safe read. Description adds source list but no additional behavioral traits like rate limits or pagination.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two focused sentences with no wasted words. Front-loads the action and key sources.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Has output schema (not shown), so return format is documented. With only 2 optional parameters and clear context, the description is sufficient for an agent to understand and invoke the tool correctly.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Input schema covers both parameters (limit, category) with descriptions and defaults. Schema coverage is 100%, so description adds no extra meaning beyond what the schema provides.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
Description clearly states verb 'Fetch' and resource 'latest QA & AI/LLM articles' with specific RSS sources (Google Testing Blog, DEV.to, etc.). It distinguishes from sibling tools as a unique feed fetcher.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Description provides clear context: 'Perfect for agents monitoring the QA & AI landscape.' It implicitly suggests when to use, but lacks explicit when-not-to-use or alternatives.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
few_shot_formatterARead-onlyIdempotentInspect
Format few-shot examples for LLM prompts. Converts example pairs into formatted blocks. Supports chat format (User/Assistant), XML tags, Markdown, or plain text.
| Name | Required | Description | Default |
|---|---|---|---|
| format | No | Output format (default: chat) | |
| examples | Yes | Array of {input, output} pairs | |
| input_label | No | Label for input (default: User / <input>) | |
| output_label | No | Label for output (default: Assistant / <output>) |
Output Schema
| Name | Required | Description |
|---|---|---|
| format | No | |
| formatted | No | |
| example_count | No | |
| token_estimate | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations provide strong safety signals (readOnly, idempotent, non-destructive). Description adds context about output format options and conversion behavior, adding value beyond annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two concise sentences, front-loaded with key action and scope. Every sentence is informative with no redundancy.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Covers purpose and format variations adequately. With an output schema present, return value explanation is unnecessary. Lacks edge cases but sufficient for a simple formatter.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema has 100% coverage with descriptions for all parameters. Description summarizes the tool's action but does not significantly add meaning beyond the schema. Baseline score applies.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
Description clearly states the tool formats few-shot examples for LLM prompts, with specific verb 'Format' and 'Converts'. It also lists supported formats, distinguishing it from related siblings like build_rag_prompt.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description does not explicitly state when to use or avoid this tool versus alternatives. It implies usage for formatting few-shot examples, but lacks guidance on exclusions or comparisons.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
find_toolARead-onlyIdempotentInspect
Search available MCP tools by keyword or category before calling them. Returns matching tool names, descriptions, and optionally their inputSchemas. Call this when you are unsure which tool to use or want to explore the catalogue. Categories: data, encoding, text, llm, qa, rag, dev, security, web.
| Name | Required | Description | Default |
|---|---|---|---|
| query | Yes | Keyword(s) to search in tool name and description (e.g. "cors", "token", "vector", "json") | |
| category | No | Optional: filter by category — data | encoding | text | llm | qa | rag | dev | security | web | |
| with_schema | No | Set true to include inputSchema in results (default: false) |
Output Schema
| Name | Required | Description |
|---|---|---|
| hint | No | |
| tool | No | |
| count | No | |
| query | No | |
| score | No | |
| tools | No | |
| category | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint=true and destructiveHint=false, so the agent knows this is a safe read operation. The description adds the search and return behavior context, which is consistent and transparent.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two sentences plus a list of categories. Every sentence serves a purpose: first defines the tool, second gives usage advice, and the list clarifies categories. No wasted words.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool has 3 parameters with 100% schema coverage, annotations, and an output schema, the description covers all necessary context: what it does, when to use, how parameters work, and what results to expect.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100%, but the description adds meaning by explaining that 'query' is for keyword search, 'category' is a filter, and 'with_schema' controls whether input schemas are returned. This adds value beyond the schema alone.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool searches available MCP tools by keyword or category, returning matching names, descriptions, and optionally schemas. This function is unique among the sibling tools, which are all domain-specific operations.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description explicitly advises when to use it ('unsure which tool to use' or 'explore the catalogue') and lists categories. It does not explicitly say when not to use it, but the positive guidance is clear.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
fix_gherkinARead-onlyInspect
Fix Gherkin syntax warnings from a jira_to_test_suite result. Takes the current gherkin text and the _gherkin_warnings array, calls your LLM to fix ONLY the flagged issues (adds missing Given/When/Then steps, etc.), and returns the corrected Gherkin. Lightweight — uses ~300-500 tokens vs ~5k for a full regeneration. Requires BYOK LLM key.
| Name | Required | Description | Default |
|---|---|---|---|
| model | Yes | LLM model to use for the fix, e.g. "gpt-4o-mini". | |
| api_key | Yes | Your LLM provider API key. | |
| gherkin | Yes | The current Gherkin text from the jira_to_test_suite result (test_suite.gherkin). | |
| warnings | Yes | The _gherkin_warnings array from the jira_to_test_suite result. |
Output Schema
| Name | Required | Description |
|---|---|---|
| latency_ms | No | |
| model_used | No | |
| fixed_gherkin | No | |
| warnings_after | No | |
| warnings_before | No | |
| remaining_warnings | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
The description adds significant behavioral context beyond annotations: it discloses that the tool calls an LLM ('calls your LLM to fix'), specifies token cost ('uses ~300-500 tokens'), and notes the requirement for an external API key ('Requires BYOK LLM key'). Annotations already indicate readOnlyHint and other non-destructive properties, so no contradiction.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is four sentences, each earning its place: purpose, inputs/action, benefit, and requirement. It is front-loaded with the verb+resource and avoids redundancy.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the moderate complexity (4 parameters, LLM call) and the presence of an output schema, the description covers inputs, process, token cost, and prerequisite. It provides enough information for an agent to decide and invoke the tool correctly.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100%, so the description does not need to add parameter details. However, it adds value by explaining how the parameters map to the tool's workflow: 'Takes the current gherkin text and the _gherkin_warnings array' (matching gherkin and warnings) and notes that it 'calls your LLM' (implying api_key and model). This extra context justifies a 4.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool's purpose: 'Fix Gherkin syntax warnings from a jira_to_test_suite result.' It specifies the resource ('Gherkin syntax warnings'), the action ('fix'), and the scope ('ONLY the flagged issues'), distinguishing it from other tools like jira_to_test_suite or full regeneration.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description provides context for when to use this tool (after jira_to_test_suite) and highlights a key trade-off: 'Lightweight — uses ~300-500 tokens vs ~5k for a full regeneration.' It also mentions a prerequisite: 'Requires BYOK LLM key.' However, it does not explicitly state when not to use it or list alternative tools.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
flatten_jsonARead-onlyIdempotentInspect
Flatten a nested JSON object to single-level dot-notation keys (e.g. {"a":{"b":1}} → {"a.b":1}), or unflatten dot-notation keys back to a nested object. Supports custom separators.
| Name | Required | Description | Default |
|---|---|---|---|
| mode | No | "flatten" (default) or "unflatten" | |
| input | Yes | JSON string to flatten or unflatten | |
| separator | No | Key separator (default: ".") |
Output Schema
| Name | Required | Description |
|---|---|---|
| result | No | |
| key_count | No | |
| max_depth | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint=true and destructiveHint=false, so the safety profile is clear. The description adds value by noting custom separator support and the dual mode (flatten/unflatten), but does not contradict annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is two sentences, concise, and front-loaded with the core purpose and an example. Every sentence adds value without repetition or fluff.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool's simplicity, the description covers the essential behavior. The presence of an output schema (stated but not shown) means return values do not need explanation. The description is complete for an agent to use correctly.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100%, so the schema already describes each parameter. The description adds context by explaining the purpose of the separator and the two modes, enhancing understanding beyond the raw schema.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool's purpose: flatten nested JSON to dot-notation or unflatten back. It provides a concrete example and mentions custom separators, leaving no ambiguity. No sibling tool performs this exact function.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description implies usage by describing what the tool does, but it does not explicitly state when to use this tool over alternatives like format_json or merge_json. It lacks guidance on choosing between flatten and unflatten modes.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
format_bytesARead-onlyIdempotentInspect
Convert raw byte counts to human-readable sizes in SI (KB=1000) or IEC (KiB=1024) units, or parse size strings back to bytes. Covers B, KB/KiB, MB/MiB, GB/GiB, TB/TiB, PB/PiB.
| Name | Required | Description | Default |
|---|---|---|---|
| bytes | No | Number of bytes to format | |
| standard | No | Output standard (default: both) | |
| size_string | No | Size string to parse to bytes (e.g. "1.5 GB", "512 MiB") |
Output Schema
| Name | Required | Description |
|---|---|---|
| bytes | No | |
| original | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already indicate read-only and idempotent behavior. The description adds transparency by listing the unit coverage and the two conversion directions (formatting and parsing), which helps the agent understand the tool's scope without conflicting with annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is concise at two sentences, front-loading the key actions. No unnecessary words or repetition. Every sentence adds value.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
For a tool with two modes, the description covers the essential functionality and units. It does not explain what happens if both parameters are provided, but that is a minor gap. The presence of an output schema reduces the need to describe return values.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
The schema covers all parameters with descriptions, and the tool description adds meaning by explaining that 'bytes' is for formatting and 'size_string' is for parsing. It also mentions the 'standard' parameter implicitly by naming SI and IEC. This goes beyond the schema's enum description.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool's dual function: converting byte counts to human-readable sizes and parsing strings back to bytes. It specifies the units covered (B, KB/KiB, MB/MiB, etc.) and the standards (SI and IEC). This distinguishes it from sibling tools, which are mostly unrelated.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description explains what the tool does but does not provide explicit guidance on when to use it or when alternatives might be appropriate. There are no exclusion criteria or recommended contexts mentioned.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
format_jsonARead-onlyIdempotentInspect
Format, validate, and pretty-print a JSON string. Returns the formatted JSON or a detailed parse error.
| Name | Required | Description | Default |
|---|---|---|---|
| input | Yes | Raw JSON string to format | |
| indent | No | Indent size (default: 2) |
Output Schema
| Name | Required | Description |
|---|---|---|
| valid | No | |
| formatted | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint=true, idempotentHint=true, destructiveHint=false. The description adds that on failure it returns a detailed parse error, which is useful behavioral context beyond annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two sentences with no wasted words. The description is front-loaded with the core purpose and immediately follows with the return behavior.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the low complexity (2 parameters, simple operation) and the presence of an output schema, the description adequately covers purpose, error behavior, and return values. No gaps identified.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100%, so the schema already defines both parameters. The description does not add additional meaning beyond what the schema provides, meeting the baseline for this dimension.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the verb 'Format, validate, and pretty-print' and the resource 'a JSON string'. It distinguishes itself from sibling tools like json_schema_validate or json_diff by focusing on formatting and pretty-printing.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description does not provide any guidance on when to use this tool over alternatives (e.g., for validation vs. schema validation, or for formatting vs. json_to_yaml). No exclusions or context are given.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
format_tableARead-onlyIdempotentInspect
Convert a JSON array of objects into a Markdown table. Automatically detects columns, aligns headers, and fills missing keys with empty cells. Use when an agent needs to present structured data — tool results, model comparisons, test reports — as a readable table in a response or document.
| Name | Required | Description | Default |
|---|---|---|---|
| input | Yes | JSON array of objects to convert to a Markdown table | |
| columns | No | Column names and order (default: all keys from first row) |
Output Schema
| Name | Required | Description |
|---|---|---|
| rows | No | |
| table | No | |
| columns | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint=true, idempotentHint=true, and destructiveHint=false. The description adds behavioral details beyond these: 'Automatically detects columns, aligns headers, and fills missing keys with empty cells.' This enriches the agent's understanding without contradiction.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is concise with three well-structured sentences. It front-loads the core purpose, adds behavioral details, and closes with usage guidance. Every sentence contributes meaningfully without redundancy.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool's simplicity, the description covers all needed context: purpose, behavior, and usage scenarios. The presence of an output schema (not visible but given) reduces the need to describe returns. The description is complete for effective agent decision-making.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema description coverage is 100%, and both parameters (input and columns) are described with clear semantics. The columns parameter explains default behavior ('default: all keys from first row'), adding value beyond the schema type constraints.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states 'Convert a JSON array of objects into a Markdown table' which is a specific verb and resource. It further distinguishes itself from siblings like json_to_csv or format_json by specifying the output format and automatic column detection.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description provides explicit guidance: 'Use when an agent needs to present structured data — tool results, model comparisons, test reports — as a readable table in a response or document.' While it doesn't list exclusions, the context is clear. Sibling tools exist for other formats, but this one is specifically for Markdown tables.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
function_call_validateARead-onlyIdempotentInspect
Validate an LLM function call / tool_use output: check that function name is in allowed list, arguments match expected schema, no extra/missing args. For OpenAI function calling & MCP tool_use testing.
| Name | Required | Description | Default |
|---|---|---|---|
| function_call | Yes | The function call object from LLM (e.g. { "name": "get_weather", "arguments": {"city":"Paris"} }) | |
| allowed_functions | Yes | List of allowed function definitions |
Output Schema
| Name | Required | Description |
|---|---|---|
| valid | No | |
| errors | No | |
| error_count | No | |
| function_name | No | |
| provided_args | No | |
| required_args | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint=true and idempotentHint=true. The description adds specific validation steps (name check, args match, no extra/missing args), offering useful behavioral context beyond annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two sentences: first states purpose and primary checks, second adds context. No wasted words, information is front-loaded.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the annotations and schema, the description covers the tool's function and context adequately. It explains what validation is performed and for which use cases, leaving no significant gaps.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100% and detailed. Description adds meaning by explaining that the function call is validated against allowed_functions, clarifying the role of each parameter. A 4 is appropriate as the schema already does significant work.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
Description clearly states it validates LLM function call / tool_use output, checks function name against allowed list, arguments match schema, and no extra/missing args. It distinguishes from generic JSON schema validation tools and is specific to OpenAI/MCP contexts.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Provides context: 'For OpenAI function calling & MCP tool_use testing.' It implies the intended use case but does not explicitly state when not to use it or mention alternative tools like json_schema_validate.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
generate_curlARead-onlyIdempotentInspect
Generate a curl command from request parameters. Supports GET/POST/PUT/DELETE, custom headers, JSON body, and form data. Useful for documentation, sharing, and debugging API calls.
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | Request URL (must be http/https) | |
| body | No | Raw request body string | |
| method | No | HTTP method (default: GET) | |
| headers | No | Request headers as key-value object | |
| verbose | No | Add -v for verbose output (default: false) | |
| body_json | No | JSON body (auto-adds Content-Type: application/json) | |
| follow_redirects | No | Follow redirects with -L flag (default: true) |
Output Schema
| Name | Required | Description |
|---|---|---|
| url | No | |
| curl | No | |
| method | No | |
| header_count | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations declare readOnlyHint, idempotentHint, destructiveHint, ensuring safe operation. Description adds behavioral detail like support for GET/POST/PUT/DELETE, custom headers, JSON body, and form data, enhancing transparency beyond annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two sentences: first states core purpose, second lists capabilities and use cases. No redundancy, every sentence adds value.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
With an output schema present, the description adequately covers tool functionality. It describes key features (methods, headers, body types) but leaves output format to the schema. Complete for a simple generation tool.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema covers 100% of parameters. Description mentions 'custom headers, JSON body, and form data' which align with headers and body_json parameters but adds no new semantic meaning beyond schema descriptions.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states 'Generate a curl command from request parameters' with specific verbs and resource. It lists supported HTTP methods and body types, effectively distinguishing it from sibling tools which are diverse utilities.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Explicitly mentions use cases: 'documentation, sharing, and debugging API calls', providing clear context for when to apply this tool. No direct alternatives or exclusions given, but the context is sufficient for selection.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
generate_eval_yamlARead-onlyInspect
Generate a complete .ia-eval.yaml evaluation contract from a plain-language description of what your LLM should do. Uses Groq llama-3.3-70b (server-side, no API key needed). Returns ready-to-run YAML for the LLM Test Runner (run_eval_contract). Picks appropriate evaluators (cosine_similarity, contains_check, hallucination_check, etc.) based on the task type.
| Name | Required | Description | Default |
|---|---|---|---|
| task_type | No | Optional task type hint to guide evaluator selection. | |
| description | Yes | Plain-language description of what the LLM under test should do. Be specific: describe inputs, expected behaviour, and constraints. | |
| system_prompt | No | Optional system prompt of the LLM under test. Helps generate more accurate test cases. | |
| scenario_count | No | Number of scenarios to generate (default: 5). Covers happy path + edge cases + adversarial. |
Output Schema
| Name | Required | Description |
|---|---|---|
| yaml | No | |
| task_type | No | |
| model_used | No | |
| scenario_count | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
The description adds significant behavioral context beyond the annotations: it reveals that a server-side LLM (Groq llama-3.3-70b) is used, no API key is needed, and it selects appropriate evaluators based on task type. This informs the agent about external dependencies and processing, which the annotations (readOnlyHint, openWorldHint) only partially cover.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is two sentences, each providing critical information: the primary function and the operational details (model, evaluators, output compatibility). No extraneous words or redundancy.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the presence of an output schema (not shown but confirmed) and high schema coverage, the description adequately covers the tool's purpose, inputs, process, and output expectations. It lacks mention of network dependency or failure modes, but these are minor omissions for a tool with good annotations.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
The schema covers all parameters with descriptions (100% coverage). The tool description adds little new semantic information beyond the schema, mostly restating parameter purposes (e.g., 'Optional task type hint', 'Optional system prompt'). Thus, baseline score of 3 is appropriate.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly specifies the action ('Generate a complete .ia-eval.yaml evaluation contract'), the input ('plain-language description of what your LLM should do'), and the output ('ready-to-run YAML'). It distinguishes itself from sibling tools like 'run_eval_contract' by stating that it generates the contract, whereas 'run_eval_contract' runs it.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description implies that the tool is used to create evaluation contracts from plain-language descriptions. It mentions that the result is ready for 'run_eval_contract', providing a clear context of use. However, it does not explicitly state when not to use this tool versus alternatives like 'prompt_test_suite' or 'llm_generate', relying on sibling context.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
generate_hmacARead-onlyIdempotentInspect
Compute an HMAC signature for a message using a secret key. Supports SHA-256 (default), SHA-512, SHA-1, and MD5. Used for API request signing, webhook verification (GitHub, Stripe, Twilio), and JWT validation.
| Name | Required | Description | Default |
|---|---|---|---|
| secret | Yes | Secret key | |
| message | Yes | Message to sign | |
| encoding | No | Output encoding (default: hex) | |
| algorithm | No | Hash algorithm: sha256 (default), sha512, sha1, md5 |
Output Schema
| Name | Required | Description |
|---|---|---|
| hmac | No | |
| encoding | No | |
| algorithm | No | |
| message_length | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already indicate readOnlyHint=true, idempotentHint=true, destructiveHint=false. The description adds value by detailing that the tool computes HMAC, supports multiple algorithms and output encodings, and is intended for security-related tasks. No contradiction with annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is concise with two sentences: first defines the core function, second lists common use cases. No redundant information. Information is front-loaded.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
The tool is straightforward. Description covers purpose, supported algorithms, use cases, and output encoding. Output schema exists, so no need to explain return values. Complete for the tool's complexity.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100% and parameter descriptions are present. The description repeats default algorithm (SHA-256) and default encoding (hex), which are already in schema. It does not add new semantic meaning beyond what schema provides.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool computes an HMAC signature, specifies supported algorithms, and lists concrete use cases like API request signing and webhook verification. It distinguishes from siblings like hash_text (which hashes without a key) and decode_jwt (which verifies JWT).
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description explicitly mentions common use cases (API request signing, webhook verification, JWT validation). However, it does not provide when-not-to-use or alternative tools, though the context of siblings implies alternatives for other hashing tasks.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
generate_html_reportARead-onlyIdempotentInspect
Convert a run_eval_contract() LLM Test Runner JSON result into a fully self-contained dark-themed HTML report with Pass/Fail badges, side-by-side Input/Output/Ground-Truth panels, evaluator score bars, and a radar chart. Returns the HTML as a string.
| Name | Required | Description | Default |
|---|---|---|---|
| results | Yes | The JSON object returned by run_eval_contract() |
Output Schema
| Name | Required | Description |
|---|---|---|
| html | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint, idempotentHint, and destructiveHint false. The description adds that the output is an HTML string, but does not disclose potential resource usage, rate limits, or handling of malformed input.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is a single, well-structured sentence that immediately states the core purpose, then lists key visual features. No wasted words, front-loaded with the most important information.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
The description covers the input (run_eval_contract JSON) and output (HTML string). With an output schema present and annotations covering safety, the description is mostly complete, though it omits error handling or edge case details.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100% with clear description for the sole parameter 'results'. The tool description does not add parameter-specific details beyond what is in the schema, so baseline score of 3 applies.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool converts run_eval_contract() JSON into an HTML report, listing specific visual features (dark theme, Pass/Fail badges, panels, score bars, radar chart). It is distinct from sibling tools like ab_test_report or compare_models, which handle different data.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description implies usage after run_eval_contract(), but does not explicitly state when to use this vs. alternatives, nor does it provide exclusions or prerequisites.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
generate_json_ldARead-onlyIdempotentInspect
Generate a ready-to-paste snippet for GEO / structured data optimization. Supported types: WebSite, FAQPage, Article, Person, Organization, SoftwareApplication, HowTo.
| Name | Required | Description | Default |
|---|---|---|---|
| type | Yes | Schema @type: "WebSite", "FAQPage", "Article", "Person", "Organization", "SoftwareApplication", "HowTo" | |
| fields | No | Schema fields as key-value pairs (name, url, description, author, datePublished, etc.) | |
| faq_items | No | For FAQPage/HowTo: array of { question, answer } objects |
Output Schema
| Name | Required | Description |
|---|---|---|
| name | No | |
| schema | No | |
| snippet | No | |
| acceptedAnswer | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already indicate readOnly and idempotent. The description adds that it produces a 'ready-to-paste' script tag, but doesn't elaborate on behavior (e.g., validation, error handling). No contradictions.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two sentences efficiently communicate purpose and supported types with no redundancy. Every sentence serves a clear purpose.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
With three parameters, nested objects, and an output schema, the description is sufficient for a simple data generation tool. It covers the essence, though could mention that the output is raw script text. Not a major gap.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
The schema covers all parameters with descriptions. The description repeats the type options but does not add new semantic meaning beyond the schema. Baseline of 3 applies due to high schema coverage.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool generates a JSON-LD script snippet for structured data, listing seven supported schema types. This distinguishes it from any sibling tool that might generate other formats.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description indicates the tool is for generating JSON-LD snippets for specific schema types, implying its usage context. However, it lacks explicit guidance on when to use this tool over alternatives or when not to use it, which is acceptable given no competing sibling.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
generate_passwordARead-onlyInspect
Generate a cryptographically secure random password using crypto.randomBytes. Configurable length (4–128), uppercase letters, digits, and symbols. Use when resetting user passwords, seeding test accounts, or generating API secrets.
| Name | Required | Description | Default |
|---|---|---|---|
| length | No | Password length (4–128, default: 16) | |
| numbers | No | Include digits (default: true) | |
| symbols | No | Include symbols like !@#$ (default: false) | |
| uppercase | No | Include uppercase letters (default: true) |
Output Schema
| Name | Required | Description |
|---|---|---|
| length | No | |
| password | No | |
| charset_size | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations indicate readOnlyHint=true, so the tool is safe and non-destructive. The description adds value by specifying the method (crypto.randomBytes) and configurability of length/character sets. No contradictions. Slightly less than 5 because it doesn't detail output format or strength guarantees beyond cryptographic security.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two concise sentences: first states action and method, second states use cases. No wasted words. Front-loaded with essential information.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool's simplicity, schema coverage, annotations, and output schema existence, the description is complete. It covers purpose, method, parameters, and use cases. No obvious gaps.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100% with descriptions for all 4 parameters. The description mentions 'Configurable length (4–128), uppercase letters, digits, and symbols,' which aligns with schema fields but doesn't add new semantic meaning beyond what the schema already provides. Baseline of 3 is appropriate.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states 'Generate a cryptographically secure random password' - a specific verb+resource. It distinguishes itself from sibling tools (e.g., generate_uuid) by focusing on password generation with cryptographic security. The mention of 'crypto.randomBytes' adds specificity.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description explicitly provides use cases: 'Use when resetting user passwords, seeding test accounts, or generating API secrets.' This gives clear context for when to use. However, it does not explicitly state when not to use or mention alternatives, keeping it from a 5.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
generate_slugARead-onlyIdempotentInspect
Convert any string into a URL-friendly slug: lowercase, ASCII-normalized (é→e), special characters removed, spaces replaced with hyphens. Use for generating SEO-friendly URL paths, file names, or identifier keys from user-provided titles or labels.
| Name | Required | Description | Default |
|---|---|---|---|
| input | Yes | String to slugify | |
| separator | No | Separator character (default: "-") |
Output Schema
| Name | Required | Description |
|---|---|---|
| slug | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already provide readOnlyHint, idempotentHint, destructiveHint. The description adds specific behavioral details: lowercase conversion, ASCII normalization, removal of special characters, and hyphen replacement. No contradictions with annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two sentences: first describes the action and transformation details, second states use cases. No wasted words, front-loaded with key information.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
For a simple 2-parameter tool with full schema coverage and an output schema, the description covers the transformation algorithm and use cases completely. No gaps.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema description coverage is 100%, fully documenting both parameters (input string and separator with default). The description mentions that spaces are replaced with hyphens, which aligns with the separator default, but does not add new semantic meaning beyond the schema.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
Description clearly states the tool converts any string into a URL-friendly slug with specific transformation steps (lowercase, ASCII-normalized, special chars removed, spaces replaced with hyphens). It also lists use cases: SEO-friendly URLs, file names, identifier keys. This distinguishes it from all sibling tools.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Description explicitly says 'Use for generating SEO-friendly URL paths, file names, or identifier keys from user-provided titles or labels.' It gives clear when-to-use guidance but does not mention when not to use or alternatives. However, sibling tools do not include a similar slugify function, so no direct alternatives exist.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
generate_test_casesBRead-onlyInspect
Generate a set of test cases (valid, edge, invalid) for a given feature description. Returns test matrix with Gherkin scenarios ready to use.
| Name | Required | Description | Default |
|---|---|---|---|
| inputs | No | Optional: list of input parameters (one per line, e.g. "email: string [required]") | |
| feature | Yes | Feature or function to test. Be specific: describe inputs, expected behaviour, context. |
Output Schema
| Name | Required | Description |
|---|---|---|
| feature | No | |
| test_cases | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations (readOnlyHint, openWorldHint) already indicate it is read-only and may depend on external data. The description adds that it returns a test matrix with Gherkin scenarios, but does not elaborate on behavioral traits like determinism or rate limits.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is two sentences, front-loaded with the core action, and contains no wasted words. It is appropriately sized for the tool's simplicity.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
With an output schema, the description need not detail return values, but it omits the optional 'inputs' parameter and lacks usage guidelines. It is adequate but not fully complete for a tool with 2 parameters and sibling overlap.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100%, so parameters are already described. The description does not add extra meaning beyond the schema; the optional 'inputs' parameter is not mentioned. Baseline of 3 is appropriate.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states it generates test cases (valid, edge, invalid) for a feature description and returns Gherkin scenarios. It distinguishes from siblings by its specific output format, but does not explicitly contrast with other testing tools.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description provides no guidance on when to use this tool versus alternatives like get_testing_guidelines or jira_to_test_suite. There are no prerequisites or context about when it is appropriate.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
generate_uuidARead-onlyInspect
Generate one or more cryptographically random UUID v4 identifiers. Use this when you need unique IDs for test fixtures, database records, session tokens, or any scenario requiring a guaranteed-unique string. Returns up to 100 UUIDs in one call.
| Name | Required | Description | Default |
|---|---|---|---|
| count | No | Number of UUIDs to generate (1–100, default: 1) |
Output Schema
| Name | Required | Description |
|---|---|---|
| count | No | |
| uuids | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already indicate readOnly=true, destructive=false. Description adds 'cryptographically random' and 'up to 100' which are useful but not essential. No contradictions.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Three sentences: purpose, usage, capacity. No unnecessary words, very efficient.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given simple tool with one parameter and output schema present, description covers all needed context: what it generates, why, and constraints.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100% and already describes the 'count' parameter. Description does not add additional parameter meaning or examples beyond the schema.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states 'Generate one or more cryptographically random UUID v4 identifiers', specifying the verb, resource, and capability of generating multiple. It distinguishes itself clearly from sibling tools that have different purposes.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Provides explicit usage scenarios: unique IDs for test fixtures, database records, session tokens. Mentions maximum count (100). Lacks explicit when-not-to-use, but context is clear enough.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
get_testing_guidelinesARead-onlyIdempotentInspect
Query the IA-QA methodology knowledge base. Returns structured testing guidelines, assertion strategies, thresholds, best practices, and relevant MCP tools for a given topic. Call without a topic to list all available topics. Topics: llm-unit-testing, rag-pipeline, prompt-stability, prompt-ab-testing, embedding-quality, eval-framework, semantic-testing, auto-testing, security, api-testing, ci-cd, multimodal, llm-data-security, agent-observability, pro-tips, learning-paths, golden-dataset.
| Name | Required | Description | Default |
|---|---|---|---|
| topic | No | The testing topic to retrieve guidelines for. Omit to get the full list of available topics. |
Output Schema
| Name | Required | Description |
|---|---|---|
| tip | No | |
| topic | No | |
| usage | No | |
| keywords | No | |
| available_topics | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint=true, idempotentHint=true, destructiveHint=false. The description adds value by detailing the return content (structured guidelines, strategies, thresholds, best practices, MCP tools) and the behavior when no topic is provided (list all topics). This goes beyond what annotations provide, though the core safety profile is already covered.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is three sentences long, front-loaded with the core purpose, followed by return details and topic enumeration. Every sentence adds value with no redundancy or fluff. Perfectly sized for quick comprehension.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool's simplicity (one optional parameter, read-only, idempotent) and the presence of an output schema, the description provides all necessary context. It explains the return content, the default behavior without topic, and lists all possible topics. No gaps remain for selecting and invoking the tool correctly.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100% with one parameter having an enum and description. The description repeats the enum values in a list, adding minimal new meaning. It does provide context that the topics are 'all available topics', but essentially duplicates information already in the schema. Baseline 3 is appropriate.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool queries a knowledge base and returns structured guidelines, strategies, thresholds, and best practices for a given topic. It explicitly specifies the verb (query) and resource (IA-QA methodology knowledge base), distinguishing it from sibling testing tools that perform actions rather than retrieve knowledge. The enumeration of topics adds specificity.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description provides some usage guidance: calling without a topic lists all available topics. However, it fails to explicitly state when not to use this tool or mention alternatives among the many sibling testing tools. While the purpose is clear, guidance on tool selection relative to siblings is lacking.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
guardrail_testARead-onlyIdempotentInspect
Test an LLM response against a set of guardrail rules: must-include, must-not-include, max length, required format, language, forbidden patterns, and custom regex. Returns pass/fail per rule.
| Name | Required | Description | Default |
|---|---|---|---|
| rules | Yes | Array of guardrail rules to check | |
| response | Yes | The LLM response to test |
Output Schema
| Name | Required | Description |
|---|---|---|
| pass | No | |
| rule | No | |
| label | No | |
| value | No | |
| detail | No | |
| failed | No | |
| passed | No | |
| results | No | |
| all_passed | No | |
| total_rules | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint=true, idempotentHint=true, and destructiveHint=false, indicating a safe, non-mutating operation. The description adds behavioral context by describing the rule testing logic and the return format (pass/fail per rule). There is no contradiction with annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is a single, well-structured sentence that immediately states the tool's action, then lists the rule types. It is concise, front-loaded, and contains no unnecessary words or details.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given that there is an output schema (not provided but flagged as present), the description does not need to explain return values in depth. It mentions 'pass/fail per rule', which is sufficient for an agent to understand what to expect. The tool's complexity (multiple rule types) is adequately covered.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema description coverage is 100% with clear descriptions for both parameters (response and rules). The description mentions rule types and examples, but these are already enumerated in the schema's enum for the 'type' property. Thus, the description adds minimal extra meaning beyond what the schema provides.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool's purpose: 'Test an LLM response against a set of guardrail rules'. It enumerates specific rule types (must-include, must-not-include, max length, etc.) and mentions the output format (pass/fail per rule). This distinguishes it from sibling tools, many of which are general analysis or generation tools.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description does not provide explicit guidance on when to use this tool over alternatives or when not to use it. The purpose is implied by the name and description, but no exclusions or alternatives are mentioned. An agent would need to infer usage context from the sibling list.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
hallucination_checkARead-onlyIdempotentInspect
Word-overlap based hallucination check: verifies if an LLM answer's words and numbers appear in the provided source/context. Fast, deterministic, no API key needed. Limitations: not semantic — does not understand synonyms or paraphrases. For true semantic grounding, use run_semantic_tests with embedding mode. Essential for quick RAG accuracy testing.
| Name | Required | Description | Default |
|---|---|---|---|
| answer | Yes | The LLM-generated answer to verify | |
| strict | No | If true, every sentence in the answer must be supported (default: false) | |
| context | Yes | The source/reference text that should ground the answer |
Output Schema
| Name | Required | Description |
|---|---|---|
| detail | No | |
| message | No | |
| numbers | No | |
| overlap | No | |
| verdict | No | |
| analysis | No | |
| entities | No | |
| grounded | No | |
| sentence | No | |
| total_words | No | |
| matched_words | No | |
| grounded_count | No | |
| grounding_score | No | |
| total_sentences | No | |
| ungrounded_count | No | |
| unsupported_claims | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already indicate safety (readOnlyHint=true, idempotentHint=true, destructiveHint=false). The description adds behavioural context: fast, deterministic, no API key needed, and its word-overlap nature. It does not contradict annotations and provides useful operational details beyond the annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is three sentences, front-loaded with the key purpose, then limitations, alternative, and use case. Each sentence adds value without redundancy. Highly efficient.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool's simplicity and the presence of an output schema, the description adequately covers purpose, usage, and limitations. It provides sufficient context for an agent to correctly invoke the tool and interpret its output.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100%, so baseline is 3. The description does not add new parameter details beyond the schema's own descriptions (e.g., 'answer', 'context', 'strict'). It implies the role of parameters but provides no additional semantic depth or format constraints.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description explicitly states the tool performs a 'word-overlap based hallucination check', clearly identifying the verb ('check') and resource ('hallucination' via overlap). It distinguishes itself from siblings by contrasting with 'run_semantic_tests' and describing its deterministic, fast nature.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description provides explicit guidance: use for quick RAG accuracy testing, but not for semantic understanding ('does not understand synonyms or paraphrases'). It directs users to an alternative ('run_semantic_tests') for semantic grounding, offering clear context on when to choose this tool.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
hash_textARead-onlyIdempotentInspect
Compute a cryptographic hash of a text string. Use when you need to verify data integrity, generate content fingerprints, hash passwords (prefer SHA-256+), or produce a fixed-length digest of any input. Supports SHA-256 (default), SHA-512, SHA-1, and MD5.
| Name | Required | Description | Default |
|---|---|---|---|
| input | Yes | Text to hash | |
| algorithm | No | Hash algorithm: sha256 (default), sha512, sha1, md5 |
Output Schema
| Name | Required | Description |
|---|---|---|
| hash | No | |
| algorithm | No | |
| input_length | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint, idempotentHint, and non-destructive. Description adds context about supported algorithms and default (SHA-256), plus a caution for password hashing. No contradictions.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two sentences, front-loaded with action and resource. Every sentence contributes value: first states purpose, second gives use cases and algorithm details. No wasted words.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the presence of an output schema and annotations, the description covers purpose, algorithms, and use cases sufficiently. Could mention that it returns a hex-encoded string (though likely in output schema). Overall complete for a simple hashing tool.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema description coverage is 100% (both parameters have descriptions). The description repeats algorithm options but does not add significant new meaning beyond what the schema provides. Baseline of 3 is appropriate.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
Clearly states the verb 'compute' and resource 'cryptographic hash of a text string'. Lists specific use cases (data integrity, content fingerprints, password hashing) which distinguish it from sibling text tools like base64_encode or diff_text.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Explicitly describes when to use: for data integrity, fingerprints, password hashing (with algorithm recommendation), and fixed-length digest. No explicit when-not-to-use or alternatives, but the breadth of use cases provides clear context.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
html_to_markdownARead-onlyIdempotentInspect
Convert HTML to clean Markdown. Strips scripts, styles, nav, ads, and comments. Converts headings, lists, links, images, code blocks. Ideal for preparing web content as LLM context.
| Name | Required | Description | Default |
|---|---|---|---|
| input | Yes | HTML string to convert | |
| strip_links | No | Strip link URLs, keep text only (default: false) |
Output Schema
| Name | Required | Description |
|---|---|---|
| markdown | No | |
| markdown_length | No | |
| original_length | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare non-destructive, read-only, idempotent behavior. The description adds specific details on what is stripped (scripts, styles, nav, ads, comments) and converted, providing additional transparency beyond annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two sentences with no wasted words. The first sentence states the core purpose and key behaviors, the second adds a typical use case. Highly efficient and front-loaded.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool's simplicity, the description covers the main behavior and use case. It does not address edge cases like malformed HTML, but the presence of output schema makes return value description unnecessary.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Input schema has 100% description coverage for both parameters. The tool description does not add any semantic context beyond what the schema already provides for the parameters.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states 'Convert HTML to clean Markdown' and lists specific conversion details (headings, lists, etc.) and stripping behaviors. It distinguishes from any sibling tools like strip_markdown which does the opposite.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
No explicit alternatives or when-not-to-use is provided. The phrase 'Ideal for preparing web content as LLM context' implies a use case but does not guide against using other tools or warn of limitations.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
http_status_lookupARead-onlyIdempotentInspect
Look up detailed information about any HTTP status code: class, name, description, cacheability, typical causes, and handling best practices. Covers all standard 1xx-5xx codes.
| Name | Required | Description | Default |
|---|---|---|---|
| code | Yes | HTTP status code (e.g. 200, 404, 429, 503) |
Output Schema
| Name | Required | Description |
|---|---|---|
| code | No | |
| desc | No | |
| name | No | |
| class | No | |
| cacheable | No | |
| description | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations provide readOnlyHint=true and idempotentHint=true. The description adds value by detailing the type of information returned (e.g., best practices, cacheability), beyond what annotations convey. No contradictions.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description consists of two concise sentences. The first sentence front-loads the action and output details; the second sentence clarifies coverage. No extraneous information.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
The description covers the tool's purpose, parameter, and output fully. Given the simple one-parameter tool with an output schema, the description is complete and leaves no gaps.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100% with a single parameter 'code' described as 'HTTP status code (e.g. 200, 404, 429, 503)'. The description does not add additional parameter semantics beyond the schema, so baseline score of 3 is appropriate.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the action ('Look up detailed information') and the resource ('HTTP status code'). It lists specific data returned (class, name, description, cacheability, etc.) and covers all standard codes, distinguishing it from sibling tools.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description implies when to use the tool (when needing HTTP status code details). It lacks explicit when-not-to-use or alternatives, but given the unique purpose among siblings, it is sufficiently clear.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
identify_callerARead-onlyIdempotentInspect
Returns what the server knows about the current MCP client: clientInfo captured during initialize, User-Agent, and any _meta fields sent with this request. Useful for debugging caller identification.
| Name | Required | Description | Default |
|---|---|---|---|
| _meta | No | Optional self-identification. Keys: agent (string), model (string), version (string). |
Output Schema
| Name | Required | Description |
|---|---|---|
| note | No | |
| session | No | |
| meta_override | No | |
| effective_agent | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint=true, idempotentHint=true, and destructiveHint=false, so the description adds limited behavioral context. It mentions the return contents (clientInfo, User-Agent, _meta) which is useful but not essential beyond the annotations. No contradictions.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two sentences: first explains functionality, second states use case. No wasted words, front-loaded with key information. Excellent conciseness.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
For a simple read-only tool with one optional parameter and an output schema (not shown but present), the description sufficiently covers purpose, behavior, and usage context. No additional details are needed.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema description coverage is 100%, and the parameter _meta is fully described in the schema. The description merely mentions _meta fields without adding new detail, so it meets the baseline 3 but does not exceed it.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool returns server knowledge about the current MCP client, listing exact items (clientInfo, User-Agent, _meta). This specific verb-resource pairing distinguishes it from sibling tools, which are mostly about text processing or other unrelated tasks.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description provides a clear use case ('Useful for debugging caller identification') but does not explicitly mention when not to use it or suggest alternatives. Given the sibling tools are largely unrelated, the context is sufficient for an agent to infer appropriate usage.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
jira_to_test_suiteARead-onlyInspect
Transform a Jira ticket into a complete test suite: Gherkin scenarios, E2E steps, API test cases, test data matrix, and ambiguity detection. Accepts either Jira credentials (auto-fetch) or a pre-fetched issue object. The returned test_suite includes _gherkin_warnings (deterministic syntax validation — empty if clean). Requires BYOK LLM key (OpenAI, Anthropic, etc.).
| Name | Required | Description | Default |
|---|---|---|---|
| issue | No | Pre-fetched issue object from fetch_jira_issue, OR a mock object with fields: key, summary, description (plain text or Markdown), status, issue_type, priority, labels, comments. Use this for offline/CI testing without Jira credentials. | |
| model | Yes | LLM model to use, e.g. "gpt-4o-mini", "claude-3-5-haiku-20241022", "gemini-2.0-flash". | |
| api_key | Yes | Your LLM provider API key (OpenAI sk-, Anthropic sk-ant-, Google AIzaSy-, etc.). | |
| issue_key | No | Jira issue key to fetch automatically, e.g. "PROJ-123". Required if issue is not provided. | |
| jira_email | No | Atlassian account email. Required for auto-fetch mode. | |
| jira_token | No | Atlassian API token. Required for auto-fetch mode. | |
| max_tokens | No | Maximum tokens for the LLM response. Default: 8192. Increase for large tickets with many ACs; decrease to reduce cost on simple tickets. | |
| jira_base_url | No | Atlassian base URL. Required for auto-fetch mode. | |
| confluence_pages | No | Optional array of pre-fetched Confluence page objects from fetch_confluence_page, used as documentation context. |
Output Schema
| Name | Required | Description |
|---|---|---|
| summary | No | |
| issue_key | No | |
| issue_url | No | |
| latency_ms | No | |
| model_used | No | |
| test_suite | No | |
| tokens_used | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations declare readOnlyHint true, so no modification of external state. Description adds key context: requires BYOK LLM key (side effect), outputs include deterministic syntax warnings. No contradiction with annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Three sentences, front-loaded with purpose and outputs, then input modes, then warnings and key requirement. No redundancy; every sentence adds value.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given 9 parameters, nested objects, and existing output schema, the description covers input modes, output structure, and a crucial prerequisite (LLM key). Missing details on error handling or edge cases, but sufficient for typical usage.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100%, so baseline 3. The description provides high-level parameter rationale (e.g., two input modes) but does not add detailed semantics beyond the schema descriptions. Adequate but not exceptional.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool transforms a Jira ticket into a complete test suite, listing specific outputs (Gherkin scenarios, E2E steps, API test cases, test data matrix, ambiguity detection). It distinguishes itself from siblings like generate_test_cases and fix_gherkin by offering a comprehensive, integrated generation.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Description explicitly mentions two input modes (auto-fetch with Jira credentials vs. pre-fetched issue object) and the BYOK LLM requirement. While it doesn't explicitly state when not to use it or name specific alternatives, the context is clear enough for most agents.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
json_diffARead-onlyIdempotentInspect
Compute a deep structural diff between two JSON values. Returns added, removed, and changed keys with dot-notation paths. Like git diff but for JSON objects — perfect for API response regression testing.
| Name | Required | Description | Default |
|---|---|---|---|
| after | Yes | Modified JSON string (after) | |
| before | Yes | Original JSON string (before) | |
| max_depth | No | Max nesting depth to recurse (default: 10) |
Output Schema
| Name | Required | Description |
|---|---|---|
| added | No | |
| changes | No | |
| removed | No | |
| modified | No | |
| identical | No | |
| total_changes | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint=true, idempotentHint=true, destructiveHint=false. The description adds value by specifying the output format (added/removed/changed keys with dot-notation paths), which goes beyond annotation hints. No contradictions.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two sentences achieve maximum efficiency: first sentence states purpose and output, second provides analogy and use case. No redundant information; every word earns its place.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool's simplicity (3 parameters, no nested objects) and the presence of an output schema that explains return values, the description covers purpose, behavior, and use case adequately. No gaps remain.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
All three parameters (before, after, max_depth) are fully described in the input schema with clear descriptions. The description does not add additional parameter details, so the schema carries the burden. Baseline score of 3 is appropriate.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
Description clearly states the tool computes a deep structural diff between two JSON values, listing added/removed/changed keys with dot-notation paths. The analogy to 'git diff for JSON' and specific use case (API response regression testing) distinguishes it from sibling tools like diff_text or json_to_csv.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description gives context for when to use (API response regression testing) but does not explicitly exclude other scenarios or mention alternatives. The analogy helps, but a clear 'when not to use' or comparison to similar siblings would improve this dimension.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
json_schema_generateARead-onlyIdempotentInspect
Infer a JSON Schema (draft-07) from a sample JSON value. Detects types, required fields, array item shapes, nested objects, and common string formats (email, uri, date, date-time, uuid). Returns a ready-to-use schema compatible with json_schema_validate. Use when you have a sample API response or LLM output and want to auto-generate a validation schema for CI/CD testing.
| Name | Required | Description | Default |
|---|---|---|---|
| input | Yes | Sample JSON value (object, array, or scalar) to infer the schema from | |
| required_all | No | Mark all detected object properties as required (default: true) |
Output Schema
| Name | Required | Description |
|---|---|---|
| type | No | |
| items | No | |
| format | No | |
| schema | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint=true, idempotentHint=true, and destructiveHint=false. The description adds behavioral details beyond this: it mentions detection of types, required fields, array item shapes, nested objects, and common string formats (email, uri, date, date-time, uuid). It also states compatibility with json_schema_validate, which is useful.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is concise: two sentences, with the main action in the first sentence. Every word is necessary, and there is no repetition or fluff. It is front-loaded with the core functionality.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given that a output schema exists (context indicates yes), annotations cover safety, and the description explains use case and features, it is complete. There is no missing information for an agent to decide to use this tool.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema description coverage is 100%, so the schema already documents both parameters ('input' and 'required_all') adequately. The description does not add significant new meaning beyond the schema, so the baseline score of 3 is appropriate.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool's purpose: 'Infer a JSON Schema (draft-07) from a sample JSON value.' It specifies the action (infer), the output (JSON Schema), and the input (sample JSON value). This distinguishes it from sibling tools like json_schema_validate, which validates against a schema.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description provides explicit usage context: 'Use when you have a sample API response or LLM output and want to auto-generate a validation schema for CI/CD testing.' It does not explicitly state when not to use or name alternatives, but the sibling list includes json_schema_validate, implying the alternative.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
json_schema_validateARead-onlyIdempotentInspect
Validate a JSON value against a JSON Schema (draft-07 subset). Supports type, required, properties, items, enum, const, pattern, format (email/uri/date), minimum/maximum, minLength/maxLength, minItems/maxItems, uniqueItems, additionalProperties, anyOf, allOf, oneOf. Returns all validation errors with dot-notation paths.
| Name | Required | Description | Default |
|---|---|---|---|
| value | Yes | JSON string to validate | |
| schema | Yes | JSON Schema as a JSON string |
Output Schema
| Name | Required | Description |
|---|---|---|
| valid | No | |
| errors | No | |
| error_count | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already indicate read-only, idempotent, non-destructive behavior. The description adds that it returns 'all validation errors with dot-notation paths,' providing behavioral details beyond annotations. No contradictions.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is concise (three sentences), front-loaded with the core purpose, and efficiently lists supported features and output behavior. Every sentence adds value without redundancy.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the presence of an output schema (context signal), the description does not need to detail return values. It covers input format, supported schema features, and error output format. Minor gap: it does not clarify that validation is strict or mention unsupported features, but overall complete.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Both parameters have high schema description coverage (100%). The description restates that value and schema are JSON strings but does not add significant new semantics. The list of supported schema features indirectly informs the schema parameter, but no additional detail beyond the schema descriptions.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool's purpose: 'Validate a JSON value against a JSON Schema (draft-07 subset).' It specifies the verb (validate) and resource (JSON value against schema), and lists supported features, distinguishing it from siblings like json_schema_generate.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description implies usage for validating JSON against a schema, but does not explicitly state when not to use it or mention alternative tools. However, the specific listing of supported schema features helps guide appropriate usage.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
json_to_csvARead-onlyIdempotentInspect
Convert a JSON array of objects to CSV format. Automatically detects columns from all object keys. Handles quoting and escaping per RFC 4180.
| Name | Required | Description | Default |
|---|---|---|---|
| input | Yes | JSON string containing an array of objects | |
| headers | No | Include header row (default: true) | |
| delimiter | No | Column delimiter (default: ",") |
Output Schema
| Name | Required | Description |
|---|---|---|
| csv | No | |
| rows | No | |
| columns | No | |
| column_names | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already indicate idempotent, read-only behavior. The description adds valuable behavioral detail: automatic column detection from object keys, and RFC 4180 quoting/escaping. This goes beyond the annotations, though it could mention delimiter behavior more explicitly.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two sentences deliver all essential information: the conversion action, column detection, and RFC compliance. No redundant words or filler. Efficient and well-structured.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
For a simple transformation tool with 3 parameters and an existing output schema, the description covers the core functionality and behavioral nuances. It lacks details on error handling or edge cases but is sufficient for this complexity.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema description coverage is 100%, so the schema already documents each parameter (input, headers, delimiter) adequately. The description adds no additional parameter-level details beyond the schema, but the baseline is 3 due to high coverage.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the conversion action ('Convert a JSON array to CSV') and specifies key behaviors like automatic column detection and RFC 4180 compliance. It distinguishes this tool from sibling conversion tools (e.g., json_to_yaml) by focusing on the specific output format.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description provides implicit context for when to use the tool (converting JSON to CSV) but lacks explicit guidance on when not to use it, alternative tools, or handling of edge cases (e.g., non-array input). No reference to sibling tools or exclusions.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
json_to_yamlARead-onlyIdempotentInspect
Convert a JSON object to clean, human-readable YAML. Handles nested objects, arrays, multiline strings, and special characters. No external dependencies.
| Name | Required | Description | Default |
|---|---|---|---|
| input | Yes | JSON string to convert to YAML | |
| indent | No | Indentation size in spaces (default: 2) |
Output Schema
| Name | Required | Description |
|---|---|---|
| yaml | No | |
| lines | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already provide readOnly, idempotent, non-destructive. Description adds detail on handling nested objects, arrays, multiline strings, special characters, and no dependencies, enhancing transparency beyond annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two sentences with no fluff. Front-loaded with main action. Every sentence adds necessary information.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
For a simple conversion tool with full schema coverage, annotations, and output schema, the description covers key behaviors. Could mention error handling or input validation but not critical.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100% with descriptions for both parameters. Description does not add additional meaning beyond schema; defaults and format details are implicit.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
States specific verb 'Convert' and resource 'JSON object to YAML'. Clear purpose but does not differentiate from sibling tools like yaml_to_json.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
No guidance on when to use this tool versus alternatives (e.g., format_json, yaml_to_json). No context about prerequisites or limitations.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
latency_benchmarkARead-onlyInspect
Measure response time of one or more HTTP endpoints (GET/POST). Runs N iterations and returns min/max/avg/p95 latency. Useful for API and MCP server benchmarking.
| Name | Required | Description | Default |
|---|---|---|---|
| endpoints | Yes | Endpoints to benchmark | |
| iterations | No | Number of iterations per endpoint (default: 3, max: 10) |
Output Schema
| Name | Required | Description |
|---|---|---|
| results | No | |
| iterations | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint=true, destructiveHint=false, and openWorldHint=true. The description adds value by detailing the iteration behavior and the summary statistics returned. It does not contradict annotations and provides context beyond the structured fields.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is two concise sentences with no redundant information. It front-loads the verb and resource, and every sentence adds value—no wasted words.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the simple input schema (2 parameters) and the presence of an output schema (as indicated by context signals), the description adequately covers what the tool does and what it returns. No critical gaps are present.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema description coverage is 100% with detailed descriptions for each parameter. The tool description does not add any additional parameter meaning beyond the schema, so the baseline score of 3 is appropriate.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the action (measure response time), the resource (HTTP endpoints), and the specific verbs and outputs (GET/POST, min/max/avg/p95 latency). It effectively distinguishes itself from sibling tools focused on health checks or other evaluations.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description indicates that the tool is 'useful for API and MCP server benchmarking,' which implies a use case but does not explicitly state when to use it instead of siblings or any exclusions. This lacks the explicit when-not/alternatives guidance needed for a higher score.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
levenshtein_distanceARead-onlyIdempotentInspect
Compute the Levenshtein (edit) distance and normalized similarity ratio between two strings. Supports batch comparison. Useful for fuzzy string matching, deduplication, and test result comparison.
| Name | Required | Description | Default |
|---|---|---|---|
| a | No | First string (single-pair mode) | |
| b | No | Second string (single-pair mode) | |
| batch | No | Batch of {a,b} pairs (max 50) | |
| case_insensitive | No | Ignore case differences (default: false) |
Output Schema
| Name | Required | Description |
|---|---|---|
| a | No | |
| b | No | |
| mode | No | |
| count | No | |
| results | No | |
| distance | No | |
| similarity | No | |
| operations_needed | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already indicate safe, idempotent, non-destructive behavior. Description adds useful behavioral context: batch support and normalized similarity ratio output. No contradictions.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Three concise sentences: core action, batch capability, use cases. Front-loaded with essential info, no unnecessary words.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the output schema exists, the description adequately covers the tool's purpose and key behaviors. Could mention return format for batch vs single, but not essential. Overall sufficient.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema covers all 4 parameters with descriptions. The description adds meaning beyond schema by mentioning normalized similarity ratio and batch usage, which are not fully captured in parameter descriptions.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool computes Levenshtein distance and normalized similarity ratio for two strings, and highlights batch support. It distinguishes from sibling tools like similarity_score or embedding_similarity by specifying the exact algorithm (edit distance).
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description lists use cases (fuzzy matching, deduplication, test comparison) but does not explicitly state when not to use it or compare to alternatives. Guidance is present but not comprehensive.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
lint_commit_messageARead-onlyIdempotentInspect
Validate a git commit message against the Conventional Commits spec (feat, fix, docs, style, refactor, test, chore, ci, perf, build). Returns compliance score, breaking change detection, and actionable suggestions.
| Name | Required | Description | Default |
|---|---|---|---|
| strict | No | Enforce strict rules: max 72-char subject, imperative mood check (default: false) | |
| message | Yes | Git commit message to validate |
Output Schema
| Name | Required | Description |
|---|---|---|
| type | No | |
| scope | No | |
| score | No | |
| valid | No | |
| checks | No | |
| subject | No | |
| has_body | No | |
| is_breaking_change | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint=true, idempotentHint=true, destructiveHint=false. The description adds behavioral context by stating it 'Returns compliance score, breaking change detection, and actionable suggestions,' which goes beyond annotations. No contradiction detected.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is concise (one sentence) and front-loaded with the action. It is efficient, but could be slightly more structured (e.g., breaking into two sentences) without losing clarity.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool has 2 parameters, annotations, and an output schema, the description adequately covers the return values (compliance score, breaking change detection, suggestions). It is sufficiently complete for agent use.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100% with descriptions for both parameters (message and strict). The tool description does not add meaning beyond the schema; it only lists the conventional commit types. Baseline score of 3 is appropriate.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states 'Validate a git commit message against the Conventional Commits spec' and lists the allowed types (feat, fix, etc.), which is a specific verb and resource. It distinguishes from sibling tools since no other commit-lint tool exists in the sibling list.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description implies usage for conventional commits validation but does not explicitly specify when to use it versus alternatives or provide any exclusions. Guidance on when not to use it is absent.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
list_llm_modelsARead-onlyIdempotentInspect
List all LLM models available on ia-qa.com with their provider, API endpoint, and capabilities. Filter by provider name (e.g. "Groq", "HuggingFace", "OpenAI") or return the full catalog. Use this to discover which models are available before calling an LLM API, or to compare providers.
| Name | Required | Description | Default |
|---|---|---|---|
| provider | No | Filter by provider name (case-insensitive). E.g. "Groq", "HuggingFace", "OpenAI", "Anthropic", "Google", "DeepSeek", "xAI", "Ollama". Omit for full catalog. |
Output Schema
| Name | Required | Description |
|---|---|---|
| total | No | |
| filter | No | |
| models | No | |
| providers | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint=true, idempotentHint=true, destructiveHint=false, covering safety. The description adds that the tool returns provider, API endpoint, and capabilities, but this is partly covered by the output schema's existence. It does not contradict annotations and provides mild added context.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is three sentences: purpose, filtering behavior, and use cases. Every sentence is informative and earns its place. No redundancy or padding.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
For a simple tool with one optional parameter, full schema coverage, annotations, and an output schema, the description adequately covers purpose, usage, and behavior. Nothing essential is missing.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema description coverage is 100%; the parameter 'provider' already has a detailed description with examples. The tool's description merely restates the filtering capability without adding new semantic detail beyond what the schema provides.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description specifically uses the verb 'List' combined with the resource 'LLM models' and the scope 'available on ia-qa.com', clearly stating the tool's function. It also distinguishes itself from sibling tools like 'model_info' by indicating it returns a full catalog or filtered list.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description explicitly states when to use the tool: 'to discover which models are available before calling an LLM API, or to compare providers.' While it does not mention alternatives from siblings, the intended usage is clear and contextually appropriate.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
list_local_testsARead-onlyIdempotentInspect
Discover .ia-eval.yaml LLM test suite files in the project directory. Scans CWD and standard sub-directories (evals/, tests/, contracts/). Returns file paths ready to pass to run_eval_contract.
| Name | Required | Description | Default |
|---|---|---|---|
| dir | No | Directory to scan (defaults to server CWD) |
Output Schema
| Name | Required | Description |
|---|---|---|
| dir | No | |
| count | No | |
| files | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint=true, idempotentHint=true, and destructiveHint=false. The description adds scanning scope but does not provide further behavioral traits beyond what annotations convey. No contradiction.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two sentences, first stating purpose, second providing details. Front-loaded, no extraneous words, every sentence earns its place.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Simple tool with one optional parameter, has output schema. Description covers discovery scope, sub-directories, and integration with run_eval_contract. Sufficient for the task.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema description coverage is 100%, so baseline is 3. The description does not add additional meaning beyond the schema's parameter description (directory to scan, defaults to CWD). No extra value.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the verb 'Discover' and the resource '.ia-eval.yaml LLM test suite files' with scope 'project directory' and specific sub-directories. It effectively distinguishes from sibling tools like run_eval_contract which executes the found files.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description explicitly indicates the tool scans CWD and standard sub-directories and returns paths for run_eval_contract, providing clear usage context. It lacks explicit when-not-to-use guidance but is adequate for the simple use case.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
llm_fit_finderARead-onlyIdempotentInspect
Find the best LLM for a given use case. Compares 30+ cloud API models and 12+ local models by cost, speed, benchmarks, features and VRAM requirements. Returns ranked recommendations with cost simulation. No API key needed.
| Name | Required | Description | Default |
|---|---|---|---|
| mode | No | cloud (API models) or local (Ollama/self-hosted). Default: cloud | |
| top_n | No | Number of recommendations to return (default: 5) | |
| vram_gb | No | GPU VRAM in GB (only for mode=local). Default: 16 | |
| features | No | Required features: vision, function_calling, json_mode, streaming, reasoning | |
| use_case | No | Primary use case: chatbot | code | rag | summarization | classification | reasoning | agents | multilingual | |
| max_budget | No | Maximum monthly budget in USD (based on tokens_per_day) | |
| quantization | No | Quantization (only for mode=local): Q4_K_M | Q8_0 | FP16. Default: Q4_K_M | |
| tokens_per_day | No | Estimated daily token volume (default: 100000) |
Output Schema
| Name | Required | Description |
|---|---|---|
| mode | No | |
| score | No | |
| results | No | |
| vram_gb | No | |
| use_case | No | |
| quantization | No | |
| tokens_per_day | No | |
| total_matching | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint=true and idempotentHint=true, making the safety profile clear. The description adds valuable behavioral context: no API key required and includes cost simulation, which are not captured by annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is extremely concise: two sentences that cover purpose, comparison criteria, output type, and a key usage note. No superfluous information, and the most important points are front-loaded.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool has 8 optional parameters, high schema coverage, and an output schema, the description provides adequate context for agent understanding. It covers the core functionality and a crucial operational detail (no API key needed). Minor gap: does not explicitly mention the role of parameters, but schema covers that.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100% with each parameter having a description, so baseline is 3. The description lists comparison dimensions (cost, speed, etc.) but does not provide additional meaning beyond what the schema already conveys for each parameter.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool's purpose: finding the best LLM for a given use case by comparing 30+ cloud and 12+ local models on cost, speed, benchmarks, etc. It distinguishes itself from siblings like list_llm_models or compare_models by focusing on ranking recommendations for a specific use case.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description implies usage for model selection based on use case and notes no API key needed, but does not explicitly state when to use this tool versus alternatives or provide exclusion criteria. Context about model counts helps but lacks definitive guidance.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
llm_format_checkARead-onlyIdempotentInspect
Validate that an LLM output matches an expected format: JSON, Markdown, code block, bullet list, numbered list, table, YAML, XML, or custom regex. Essential for structured output testing.
| Name | Required | Description | Default |
|---|---|---|---|
| output | Yes | The LLM output to validate | |
| regex_pattern | No | Custom regex pattern (only when expected_format is "regex") | |
| expected_format | Yes | Expected format |
Output Schema
| Name | Required | Description |
|---|---|---|
| valid | No | |
| checks | No | |
| failed | No | |
| passed | No | |
| total_checks | No | |
| expected_format | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint=true, idempotentHint=true, and destructiveHint=false, covering safety and idempotency. The description adds no extra behavioral traits beyond 'Validate,' which is consistent. With annotations present, the description does not need to repeat, but it also does not add context about error handling or return values.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is two sentences, front-loaded with the core purpose and format list, followed by a contextual statement. Every sentence earns its place; no redundancy or fluff.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given that an output schema exists, the description does not need to explain return values. However, the tool is a validator, and additional context about the output (e.g., boolean vs. detailed report) would be helpful but is not critical due to the presence of the output schema. The description adequately covers the tool's role in structured output testing.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100% with parameter descriptions in the schema. The description lists the formats but does not add meaning beyond the enum values, such as what constitutes a valid Markdown heading or bullet list. The baseline is appropriate given the schema's thoroughness.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool's purpose as validating an LLM output against an expected format, listing 9 specific formats including JSON, Markdown, code block, etc. It uses a specific verb ('Validate') and resource ('LLM output'), and the listing of formats distinguishes it from sibling tools that handle single formats like json_schema_validate.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description provides context by stating 'Essential for structured output testing,' which implies when to use it. However, it does not explicitly mention when not to use it or alternate tools, such as json_schema_validate for JSON schema validation. The guidance is clear but lacks exclusions or alternatives.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
llm_generateARead-onlyInspect
Generate text using open-source LLM models hosted on Groq (ultra-fast) or HuggingFace Inference (serverless). No API key required — the server provides its own keys. Supported models: Qwen3 32B, Gemma 4 27B, Gemma 3 27B, Llama 3.3 70B, Llama 4 Scout, DeepSeek R1, Mistral Small 24B, and more. Use list_llm_models to see the full catalog. Rate-limited to prevent abuse.
| Name | Required | Description | Default |
|---|---|---|---|
| model | No | Model ID (default: "qwen/qwen3-32b"). Use list_llm_models tool with provider "Groq" or "HuggingFace" to see available models. | |
| prompt | Yes | The user prompt / instruction to send to the model | |
| system | No | Optional system prompt to set context or persona | |
| max_tokens | No | Maximum tokens to generate (default: 2048, max: 4096) | |
| temperature | No | Sampling temperature 0.0–1.5 (default: 0.7) |
Output Schema
| Name | Required | Description |
|---|---|---|
| model | No | |
| usage | No | |
| content | No | |
| provider | No | |
| latency_ms | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
The description adds value beyond annotations: it mentions rate-limiting, the use of external services (Groq/HuggingFace), and that no API key is required. These details inform the agent about non-obvious behaviors not captured by readOnlyHint or openWorldHint alone.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is concise with two sentences covering purpose and key details. It is front-loaded with the main action and efficiently communicates essential information without unnecessary words.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the presence of an output schema and full annotation coverage, the description adequately covers need-to-know aspects: supported models, service providers, rate limits, and keyless access. It is complete enough for typical usage.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100% with clear descriptions for all 5 parameters. The description lists some supported models but does not add significant new meaning beyond what the schema already provides. Baseline 3 is appropriate.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool generates text using open-source LLMs on Groq/HuggingFace. It specifies the verb 'generate' and the resource 'text', and the supported models are listed. However, it does not explicitly differentiate from sibling LLM tools like llm_fit_finder or llm_output_validator, though the name and context imply generation.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description provides context like 'No API key required' and 'rate-limited', but does not explicitly state when to use this tool versus alternatives. It implies usage for text generation but lacks guidance on when to choose this over other LLM-related sibling tools.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
llm_json_schema_checkARead-onlyIdempotentInspect
Validate that an LLM JSON output matches a JSON Schema definition. Tests required fields, types, enums, nested objects, and arrays. Critical for function-calling and structured output testing.
| Name | Required | Description | Default |
|---|---|---|---|
| output | Yes | The LLM JSON output (raw string, will be parsed) | |
| schema | Yes | JSON Schema (draft-07 subset) to validate against |
Output Schema
| Name | Required | Description |
|---|---|---|
| valid | No | |
| errors | No | |
| error_count | No | |
| parse_error | No | |
| parsed_type | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint=true, idempotentHint=true, destructiveHint=false, so the description's extra detail about parsing and validation specifics adds moderate value. It does not contradict annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two concise sentences: first states the purpose, second adds important details. Every word is meaningful and the description is front-loaded with the key verb and resource.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool has 2 required parameters, no enums, and an output schema, the description fully covers the behavior and use case. It is complete for an agent to understand what the tool does and how to use it.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100%, so baseline is 3. The description adds context by mentioning what the validation tests (required fields, types, etc.), which goes beyond the schema descriptions. This extra insight justifies a 4.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool validates LLM JSON output against a JSON Schema, specifying it tests required fields, types, enums, nested objects, and arrays. This definitively distinguishes it from sibling tools like json_schema_validate or function_call_validate.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description mentions it is 'Critical for function-calling and structured output testing', giving clear context for when to use. However, it does not explicitly state when not to use or name alternatives, which would fully satisfy usage guidance.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
llm_output_validatorARead-onlyIdempotentInspect
Validate an LLM response against QA criteria: format checks (JSON, code, markdown), content rules (must-include, must-not-include), length constraints, language detection, and safety patterns. Essential for QA testing LLM-powered features.
| Name | Required | Description | Default |
|---|---|---|---|
| output | Yes | The LLM output text to validate | |
| max_length | No | Maximum character length for the output | |
| min_length | No | Minimum character length for the output | |
| check_safety | No | Check for PII patterns (emails, phones, SSN), profanity signals, and prompt leakage | |
| must_include | No | Comma-separated strings that MUST appear in the output | |
| expected_format | No | Expected output format | |
| must_not_include | No | Comma-separated strings that must NOT appear (e.g. "TODO, FIXME, undefined, NaN") | |
| check_json_schema | No | If expected_format is JSON, provide required keys as comma-separated list to validate the structure | |
| expected_language | No | Expected language of the output (en, fr, es, de…). Checks for common words. |
Output Schema
| Name | Required | Description |
|---|---|---|
| total | No | |
| checks | No | |
| failed | No | |
| passed | No | |
| verdict | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint=true, idempotentHint=true, destructiveHint=false, establishing a safe, non-mutating profile. The description adds value by enumerating the categories of checks performed (format, content, length, language, safety), compensating for the lack of behavioral detail in annotations. No contradiction.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is two sentences: the first lists the types of checks, and the second states the use case. Every sentence is substantive, no fluff, and front-loaded with key functionality.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the complexity (9 parameters, 1 required) and the presence of an output schema, the description covers the tool's purpose and main capabilities. However, it omits details about the validation return format (e.g., pass/fail, error details) which the output schema likely covers. Slightly incomplete but adequate.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100%, so each parameter already has a description. The tool description provides a high-level summary of parameter categories but does not add new meaning beyond the schema. Baseline 3 is appropriate.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool's purpose: validate LLM responses against QA criteria. It lists specific checks (format, content, length, language, safety) and identifies its use case (QA testing LLM-powered features), distinguishing it from sibling tools like llm_format_check.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description identifies a primary use case ('Essential for QA testing LLM-powered features') but does not provide explicit guidance on when not to use it or how it compares to alternatives like llm_format_check or llm_json_schema_check. The context is implied but not exhaustive.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
lorem_ipsumARead-onlyInspect
Generate Lorem Ipsum placeholder text for UI mockups, design prototypes, or test data population. Configurable paragraphs (1–10), sentences per paragraph (1–20), and approximate words per sentence (3–30).
| Name | Required | Description | Default |
|---|---|---|---|
| paragraphs | No | Number of paragraphs to generate (1–10, default: 1) | |
| words_per_sentence | No | Approximate words per sentence (3–30, default: 10) | |
| sentences_per_paragraph | No | Sentences per paragraph (1–20, default: 5) |
Output Schema
| Name | Required | Description |
|---|---|---|
| paragraphs | No | |
| paragraph_count | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already indicate the tool is read-only and non-destructive (readOnlyHint=true, destructiveHint=false). The description adds the behavioral context of generating placeholder text with configurable ranges, but no further behavioral traits beyond what annotations provide.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is two sentences with no unnecessary words. It efficiently communicates purpose and configurability.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool's simplicity and existence of an output schema, the description adequately covers purpose, parameters, and use cases. It does not need to detail return values since the output schema handles that.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema description coverage is 100% with each parameter having a description. The description restates the parameter ranges (1-10 paragraphs, 1-20 sentences, 3-30 words per sentence) but does not add new meaning beyond the schema.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states that the tool generates Lorem Ipsum placeholder text for UI mockups, design prototypes, or test data population. This specific verb-resource combination distinguishes it from sibling tools, none of which generate lorem ipsum text.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description specifies the use cases (UI mockups, design prototypes, test data population), providing clear context. While it does not explicitly list when not to use or alternatives, the uniqueness of the tool makes this sufficient.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
mcp_schema_lintARead-onlyIdempotentInspect
Lint an MCP tool definition for best practices: naming conventions, description quality, schema completeness, required fields consistency, description length. Returns actionable warnings.
| Name | Required | Description | Default |
|---|---|---|---|
| tool_definition | Yes | MCP tool definition object with name, description, inputSchema |
Output Schema
| Name | Required | Description |
|---|---|---|
| grade | No | |
| errors | No | |
| warnings | No | |
| error_count | No | |
| quality_score | No | |
| warning_count | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations provide readOnlyHint and idempotentHint, indicating safe, read-only behavior. Description adds that it returns actionable warnings and checks specific traits, providing useful context beyond annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Single sentence efficiently conveys purpose and scope. No redundant phrases, though could be slightly more structured with bullet points for best practices.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
The description covers the tool's function and output (warnings) adequately. Output schema exists, so no need to detail return values. Missing guidance on typical use cases or limitations, but acceptable given simplicity.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100% with a single parameter described concisely. Description does not add additional meaning beyond the schema's description, so baseline score of 3 is appropriate.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool lints MCP tool definitions for best practices, listing specific aspects checked (naming, description, schema, etc.) and mentioning actionable warnings. This differentiates it from sibling tools.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description implies usage for checking tool definitions but does not explicitly state when to use or avoid it, nor does it reference alternative tools. Usage context is clear but lacks exclusionary guidance.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
mcp_server_evaluateARead-onlyInspect
Run a full compliance evaluation against a live MCP server URL. Tests: server reachability (ping), manifest discovery (GET /mcp), schema quality (snake_case names, descriptions, inputSchema), JSON-RPC 2.0 test call, and P50/P95 latency. Returns a PASS/FIX/BLOCK verdict with a 0-100 score and per-check details.
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | Base URL of the MCP server (e.g. https://ia-qa.com or http://localhost:3001) | |
| test_tool_name | No | Specific tool name to use in the JSON-RPC test call (defaults to the first tool in the manifest) |
Output Schema
| Name | Required | Description |
|---|---|---|
| url | No | |
| score | No | |
| checks | No | |
| latency | No | |
| verdict | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations declare readOnlyHint=true and destructiveHint=false, and the description consistently describes only read-only tests. The description adds valuable context beyond annotations by listing the exact tests performed and the output format (PASS/FIX/BLOCK verdict, 0-100 score, per-check details). No contradictions.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is extremely concise (two sentences, ~30 words). It front-loads the main action and uses a colon to efficiently list the tests. No wasted words.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
For a comprehensive evaluation tool, the description covers the purpose, tests performed, and output format. It does not mention error handling or prerequisites, but given the existence of an output schema and the clarity of the annotations, it is largely complete.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
All parameters are documented in the input schema (100% coverage). The description mentions the test_tool_name parameter's role in the JSON-RPC test but does not significantly expand on the schema's descriptions. Baseline of 3 is appropriate.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool's purpose: 'Run a full compliance evaluation against a live MCP server URL.' It lists specific tests (ping, manifest, schema, JSON-RPC, latency) and distinguishes itself from siblings like mcp_server_health_check and mcp_schema_lint by being a comprehensive evaluation that produces a verdict and score.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description implies when to use the tool (to evaluate an MCP server) but does not provide explicit guidance on when not to use it or how it compares to alternatives like mcp_server_health_check or mcp_schema_lint. No prerequisites or context are provided.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
mcp_server_health_checkARead-onlyIdempotentInspect
Generate a health check report for an MCP server's tool manifest. Validates tool definitions, schema quality, naming conventions, and documentation completeness. Paste the server manifest JSON to audit.
| Name | Required | Description | Default |
|---|---|---|---|
| strict | No | Enable strict mode: also check for optional best practices (examples, default values, descriptions > 20 chars) | |
| manifest | Yes | MCP server manifest JSON (the response from GET /mcp or tools/list) |
Output Schema
| Name | Required | Description |
|---|---|---|
| stats | No | |
| total | No | |
| checks | No | |
| failed | No | |
| passed | No | |
| verdict | No | |
| toolIssues | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint, idempotentHint, and destructiveHint false, indicating safety. Description adds that it validates specific aspects (tool definitions, schema quality, naming conventions, documentation completeness) and generates a report, providing context beyond annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two sentences, zero waste. First sentence states purpose, second gives instruction. Highly concise and front-loaded.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Has output schema, so return values not needed. Description covers what the tool does and how to use it. Could mention the report structure, but not essential given output schema. Appropriate for a 2-parameter tool.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100% with descriptions for both parameters. Description adds minimal extra meaning: it mentions 'Paste the server manifest JSON' matching the manifest parameter but doesn't elaborate on 'strict' beyond schema. Baseline 3 is appropriate.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
Clearly states it generates a health check report for an MCP server's tool manifest, validating definitions, schema quality, naming conventions, and completeness. Distinguishable from siblings like mcp_schema_lint and mcp_server_evaluate.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Implies usage by saying 'Paste the server manifest JSON to audit', but no explicit guidance on when to use this tool versus alternatives like mcp_schema_lint or mcp_server_evaluate. No when-not or exclusion criteria.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
merge_jsonARead-onlyIdempotentInspect
Deep merge two JSON objects. Supports three array strategies: replace (default), concat, or unique (dedup concat). Nested objects are recursively merged — override takes precedence for primitives.
| Name | Required | Description | Default |
|---|---|---|---|
| base | Yes | Base JSON object (will be merged into) | |
| override | Yes | Override JSON object (takes precedence) | |
| array_strategy | No | Array merge strategy: replace (default), concat, or unique |
Output Schema
| Name | Required | Description |
|---|---|---|
| merged | No | |
| new_keys | No | |
| total_keys | No | |
| overridden_keys | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations declare readOnlyHint=true, idempotentHint=true, and destructiveHint=false, covering safety. The description adds behavioral details on recursion and array strategies, enhancing transparency.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is two concise sentences, no fluff, front-loaded with the core action. Every sentence adds value.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool's complexity (simple merge with array strategies) and the presence of an output schema, the description fully covers behavior, prerequisites, and edge cases (nested objects, array strategies).
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema covers 100% of parameters with descriptions. The description adds significant meaning: 'recursively merged', 'override takes precedence for primitives', and details on array merging, which enriches understanding beyond schema.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool's function: 'Deep merge two JSON objects.' It specifies the array strategies and recursion behavior, distinguishing it from sibling JSON tools like json_diff.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description lacks explicit guidance on when to use this tool versus alternatives (e.g., when to choose merge over diff or validation). Usage is implied but not stated.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
minify_jsARead-onlyIdempotentInspect
Minify a JavaScript snippet, function, class, or module up to 50 KB using Terser. Returns minified code and byte savings. Use when embedding scripts in HTML templates, report payloads, or injecting inline code programmatically.
| Name | Required | Description | Default |
|---|---|---|---|
| code | Yes | JavaScript code to minify (max 50kb) |
Output Schema
| Name | Required | Description |
|---|---|---|
| minified | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint=true and destructiveHint=false. The description adds behavioral details: uses Terser, returns minified code and byte savings, and has a size limit. These go beyond annotations without contradiction.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two sentences: first defines the action and constraints, second provides use cases. No extraneous words; efficient and front-loaded.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the single parameter, presence of output schema, and clear annotations, the description is fully adequate. It covers purpose, constraints, use cases, and expected output.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
The input schema has 100% coverage (one parameter with description). The description adds context about the types of JS code (snippet, function, class, module) and the minifier used, but does not significantly elaborate beyond the schema's description.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the action (minify), the resource (JavaScript code), and the constraints (up to 50 KB using Terser). It also mentions the output (minified code and byte savings). There are no sibling tools with similar functionality, so it stands out.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description provides explicit when-to-use scenarios: 'embedding scripts in HTML templates, report payloads, or injecting inline code programmatically.' It does not mention when not to use or alternatives, but given the uniqueness of the tool, this is sufficient.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
mock_from_schemaARead-onlyInspect
Generate realistic mock data from a JSON Schema. Supports all common types (string, number, integer, boolean, array, object, null), format hints (email, date, date-time, uri, uuid), enum, const, and nested schemas. Perfect for testing MCP tools with realistic data.
| Name | Required | Description | Default |
|---|---|---|---|
| seed | No | Optional seed string for deterministic output (uses first char codes) | |
| count | No | Number of mock objects to generate (default: 1, max: 20) | |
| schema | Yes | JSON Schema as a JSON string |
Output Schema
| Name | Required | Description |
|---|---|---|
| count | No | |
| results | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
The description aligns with annotations (readOnlyHint=true, destructiveHint=false), indicating a safe read-only operation. It accurately describes the generation of mock data without side effects.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is concise and front-loaded, consisting of two sentences that efficiently convey the tool's purpose, supported features, and ideal use case without unnecessary details.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the existence of an output schema (context signals indicate 'Has output schema: true'), the description adequately covers the tool's behavior and output. It mentions generating realistic mock data and lists supported schema elements, which is sufficient for an agent.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100%, so the schema already documents parameters. The description adds value by explaining supported schema features (types, formats) that enrich the understanding of the 'schema' parameter.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool generates realistic mock data from a JSON Schema, listing supported types and format hints, distinguishing it from sibling tools like json_schema_validate or json_schema_generate.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description mentions 'Perfect for testing MCP tools with realistic data,' providing a clear use case. However, it does not explicitly state when not to use this tool or suggest alternatives.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
model_infoARead-onlyIdempotentInspect
Get detailed specs for an AI model: context window, pricing per 1K tokens, knowledge cutoff, provider, multimodal support, reasoning capabilities, and feature list. Covers 30+ models from OpenAI, Anthropic, Google, DeepSeek, Meta, Mistral, Cohere, xAI.
| Name | Required | Description | Default |
|---|---|---|---|
| model | Yes | Model name (e.g. "gpt-4o", "claude-3.5-sonnet", "gemini-2.5-pro") |
Output Schema
| Name | Required | Description |
|---|---|---|
| model | No | |
| pricing_per_1k | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already indicate readOnlyHint=true and destructiveHint=false. The description adds value by detailing what specs are returned (context window, pricing, etc.) and the breadth of models covered, without contradicting annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two sentences, no wasted words. The first sentence is action-oriented and lists specifics; the second adds context about the range of models. Perfectly concise.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool's simplicity (one required parameter, read-only), the description fully covers what the tool does and returns. The output schema further documents return values, so no additional detail is necessary.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Input schema provides 100% coverage for the single parameter 'model' with example values. The description does not add additional semantic meaning beyond what the schema already offers, so a baseline score of 3 is appropriate.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description starts with a specific verb 'Get' and clearly states the resource 'detailed specs for an AI model', listing specific attributes. It distinguishes from sibling tools like 'list_llm_models' or 'compare_models' by its focus on a single model's detailed specs.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description implies the tool is for retrieving detailed specs of a single model, which is clear context. However, it does not explicitly state when not to use it or mention alternative tools, so it loses some points.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
multimodal_eval_guideARead-onlyIdempotentInspect
Unified tool for multimodal AI evaluation: set action=guide for reference thresholds/interpretation (CLIP, FID, VQA), or set action=clip_score / fid_score / vqa_accuracy / pipeline to compute real metrics via HuggingFace Inference API and VLM BYOK calls. One tool for both reference and computation.
| Name | Required | Description | Default |
|---|---|---|---|
| fid | No | [pipeline] {real_images, generated_images} for FID. | |
| vqa | No | [pipeline] VQA config object (same inputs as vqa_accuracy). | |
| clip | No | [pipeline] {image_url, text} for CLIP. | |
| text | No | [clip_score only] Text description to compare against the image. | |
| model | No | [vqa_accuracy] VLM model ID (default: gpt-4o). | |
| score | No | [guide only] Optional score value to interpret. | |
| action | No | guide (default) = reference thresholds/interpretation. clip_score/fid_score/vqa_accuracy = compute that metric. pipeline = run all three. | |
| metric | No | [guide only] Metric to explain. | |
| api_key | No | [vqa_accuracy] Your API key for the provider (BYOK). | |
| image_url | No | [clip_score/vqa_accuracy] Public URL of the image. | |
| test_cases | No | [vqa_accuracy] Array of {question, accepted_answers} objects. | |
| real_images | No | [fid_score] Array of real image URLs. | |
| image_base64 | No | [clip_score/vqa_accuracy] Base64-encoded image data. | |
| system_prompt | No | [vqa_accuracy] Optional system prompt. | |
| image_mime_type | No | [clip_score/vqa_accuracy] MIME type for base64 image. | |
| generated_images | No | [fid_score] Array of generated image URLs. |
Output Schema
| Name | Required | Description |
|---|---|---|
| errors | No | |
| metrics | No | |
| results | No | |
| web_tool | No | |
| best_practices | No | |
| comparison_table | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations (readOnlyHint: true, idempotentHint: true, destructiveHint: false) declare the tool safe and non-destructive. The description adds that computations are done via 'HuggingFace Inference API and VLM BYOK calls', disclosing external dependencies. No contradiction with annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is a single, dense sentence that efficiently conveys the tool's purpose, actions, and usage. Every part earns its place, with no redundancy.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool's complexity (16 params, nested objects, multiple actions), the description adequately covers high-level functionality and action selection. The output schema exists but is not described, which is acceptable. Some details on parameter relationships could be beneficial, but overall it is sufficiently complete.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100%, but the description adds value by grouping parameters per action and explaining the role of the 'action' enum. For example, it clarifies that 'clip' object is for pipeline, and 'score' is for guide only. This goes beyond the schema's individual parameter descriptions.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states it is a 'unified tool for multimodal AI evaluation' and lists specific actions (guide, clip_score, fid_score, vqa_accuracy, pipeline) with their purposes. It distinguishes the tool's dual role (reference and computation) and provides enough specificity to differentiate from sibling tools, which are largely unrelated.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description explicitly tells when to use each action (e.g., 'set action=guide for reference thresholds/interpretation') and implies that computing metrics requires specific actions. However, it does not provide explicit exclusions or compare with alternative tools for similar tasks, though no close siblings exist in the list.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
needle_haystack_generateARead-onlyIdempotentInspect
Generate a "needle in a haystack" test: embeds a target fact into a large block of filler text at a specified position. Use this to test LLM context window retrieval accuracy. Returns the full haystack, the question to ask, and metadata. No API key needed.
| Name | Required | Description | Default |
|---|---|---|---|
| needle | Yes | The fact to hide (e.g. "The secret code is ALPHA-42") | |
| tokens | No | Target haystack size in tokens (default: 5000, max: 100000) | |
| position | No | Where to insert the needle: "start", "middle", "end", "random" (default: "middle") | middle |
| question | Yes | The question to ask the LLM (e.g. "What is the secret code?") |
Output Schema
| Name | Required | Description |
|---|---|---|
| needle | No | |
| haystack | No | |
| position | No | |
| question | No | |
| insert_block | No | |
| total_blocks | No | |
| estimated_tokens | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Aligns with annotations (read-only, idempotent) and adds context about no API key needed, but could elaborate on the generation process.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Three focused sentences with front-loaded purpose, no filler.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
With an output schema available, the description provides sufficient context; could mention output format briefly but not necessary.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema provides full parameter descriptions (100% coverage), so the description adds little beyond mentioning 'specified position'; baseline 3 is appropriate.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool generates a 'needle in a haystack' test for LLM context window retrieval accuracy, with a specific verb and resource, distinguishing it from siblings.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Explicitly says to use for testing LLM context window retrieval and notes no API key needed, but does not mention alternatives or when not to use.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
normalize_vectorARead-onlyIdempotentInspect
L2-normalize a float vector (produce a unit vector with norm=1). Required by many vector DBs (Pinecone, Qdrant cosine). Supports batch normalization of up to 1000 vectors.
| Name | Required | Description | Default |
|---|---|---|---|
| batch | No | Batch of vectors to normalize (overrides vector) | |
| vector | No | Single vector to normalize |
Output Schema
| Name | Required | Description |
|---|---|---|
| mode | No | |
| norm | No | |
| count | No | |
| index | No | |
| vector | No | |
| results | No | |
| dimension | No | |
| norm_after | No | |
| normalized | No | |
| norm_before | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
The description adds behavioral info beyond annotations: it specifies L2-normalization produces a unit vector with norm=1 and a batch limit of 1000. It does not mention handling of zero vectors, but the core behavior is clear. No contradiction with annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is concise: two sentences front-load the purpose and then provide contextual details. No unnecessary words or repetitions.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool's complexity (normalization with batch support), the description covers purpose, batch limit, and common use cases. An output schema exists, so return format is handled. It could mention edge cases like zero vectors, but overall it is complete.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
The input schema already has descriptions for both parameters (batch and vector), providing 100% coverage. The description adds the useful constraint that batch normalization supports up to 1000 vectors, which is not in the schema.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states 'L2-normalize a float vector (produce a unit vector with norm=1).' It uses a specific verb and resource, and the reference to vector DBs distinguishes it from sibling tools like vector_similarity and vector_quantize.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description mentions it is 'Required by many vector DBs (Pinecone, Qdrant cosine)' and supports batch normalization up to 1000 vectors, providing clear context for when to use it. However, it does not explicitly mention alternatives or when not to use it.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
normalize_whitespaceARead-onlyIdempotentInspect
Normalize whitespace: trim trailing spaces, collapse blank lines, normalize line endings (LF/CRLF), convert tabs to spaces. Useful for cleaning code, configs, and text before processing.
| Name | Required | Description | Default |
|---|---|---|---|
| input | Yes | Text to normalize | |
| trim_file | No | Trim leading/trailing blank lines (default: true) | |
| trim_lines | No | Trim trailing whitespace from each line (default: true) | |
| line_ending | No | "lf" (default), "crlf", or "cr" | |
| tab_to_spaces | No | Convert tabs to N spaces (omit to keep tabs) | |
| collapse_blanks | No | Collapse 3+ consecutive blank lines to 2 (default: true) |
Output Schema
| Name | Required | Description |
|---|---|---|
| result | No | |
| line_ending | No | |
| original_length | No | |
| normalized_length | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations mark it as read-only, idempotent, and non-destructive. Description expands on the exact transformations performed (trim, collapse line endings, tab conversion), adding behavioral detail beyond the annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is a single sentence with a colon followed by key actions. It is front-loaded and efficient, though slightly more structure (e.g., bullet lists) could improve scanability.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool's moderate complexity (6 params, 1 required) and the presence of an output schema, the description covers the core transformations and typical use cases. It does not explain return values, but the output schema fills that gap.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema has 100% description coverage for all 6 parameters. The description provides a high-level summary but doesn't add new meaning beyond what's in the parameter descriptions. Baseline 3 is appropriate.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the verb 'Normalize' and resource 'whitespace', listing specific actions (trim, collapse, convert). It distinguishes this whitespace-focused tool from siblings like 'format_json' or 'case_convert'.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Description says 'Useful for cleaning code, configs, and text before processing' which implies usage context but lacks explicit when-to-use vs alternatives or when-not-to-use. With many sibling text tools, more guidance would help.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
number_base_convertARead-onlyIdempotentInspect
Convert numbers between bases: decimal, binary, octal, hexadecimal, or any base 2–36. Auto-detects 0x, 0b, 0o prefixes.
| Name | Required | Description | Default |
|---|---|---|---|
| input | Yes | Number to convert (e.g., "255", "0xFF", "0b1010", "0o77") | |
| to_base | No | Target base 2–36 (omit to get all common bases) | |
| from_base | No | Source base 2–36 (auto-detects prefix if omitted) |
Output Schema
| Name | Required | Description |
|---|---|---|
| octal | No | |
| binary | No | |
| result | No | |
| decimal | No | |
| to_base | No | |
| from_base | No | |
| hexadecimal | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already indicate read-only, idempotent, and non-destructive behavior. The description adds value by disclosing auto-detection of prefixes and the base range, which is useful context beyond annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two sentences with no redundancy. The main action and key features are front-loaded. Every sentence earns its place.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool's simplicity and the presence of an output schema, the description is mostly complete. It covers purpose, base range, and auto-detection. However, it could mention error handling for invalid inputs, but this is minor.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100% and parameter descriptions are detailed with examples. The tool description adds minimal extra meaning (e.g., base range), but the schema already does the heavy lifting, so baseline 3 is appropriate.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
Clearly states the verb 'convert' and the resource 'numbers between bases'. Specifies supported bases (2-36) and mentions auto-detection of prefixes, which distinguishes it from other conversion tools.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description implies the tool is for number base conversion but does not explicitly state when to use it vs. alternatives like base64_decode or color_convert. No guidance on when not to use it.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
openapi_validateARead-onlyIdempotentInspect
Validate the structure of an OpenAPI 3.x specification (JSON or YAML). Checks required top-level fields (openapi, info.title, info.version, paths), validates each operation (responses, operationId uniqueness), detects undeclared $ref components, and flags missing 2xx responses. Returns a PASS/FAIL verdict, a 0–100 compliance score, and a list of errors and warnings with JSON-pointer locations. Use before publishing an API spec or generating SDK code.
| Name | Required | Description | Default |
|---|---|---|---|
| input | Yes | OpenAPI 3.x specification as a JSON or YAML string |
Output Schema
| Name | Required | Description |
|---|---|---|
| score | No | |
| stats | No | |
| errors | No | |
| verdict | No | |
| warnings | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already indicate readOnly and idempotent. Description adds specific validation behaviors (checks fields, operations, $ref, 2xx) beyond what annotations provide. No contradictions.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Concise description with five sentences, each providing unique information. Front-loaded with main action and structured logically.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Description covers checks performed and output format (verdict, score, errors). With an output schema presumably defined, the description adequately complements it. Could mention handling of invalid input but not necessary.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Single parameter 'input' with schema description 'OpenAPI 3.x specification as a JSON or YAML string'. Schema coverage is 100%, so description adds marginal value over schema. Baseline 3 is appropriate.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
Description clearly states the tool validates OpenAPI 3.x specifications, listing specific checks (required fields, operation validation, $ref detection, missing 2xx). This verb+resource combination distinguishes it from other validation tools in the sibling list.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Explicitly recommends use before publishing an API spec or generating SDK code. Provides clear context but does not mention when not to use or alternatives, which are not critical given the specificity.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
optimize_prompt_tokensARead-onlyIdempotentInspect
Compress an LLM prompt by removing filler words, verbose phrases, duplicate sentences, and unnecessary whitespace. Returns optimized text with token savings breakdown. 100% deterministic, no API key needed.
| Name | Required | Description | Default |
|---|---|---|---|
| text | Yes | The prompt text to optimize | |
| options | No | Toggle optimization steps (all true by default) |
Output Schema
| Name | Required | Description |
|---|---|---|
| steps | No | |
| optimized | No | |
| tokens_after | No | |
| tokens_saved | No | |
| percent_saved | No | |
| tokens_before | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint, idempotentHint, and destructiveHint. The description adds '100% deterministic' (consistent with idempotentHint) and 'no API key needed', providing useful behavioral context beyond annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two concise sentences, front-loaded with the main action. Every sentence provides essential information without waste.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
The tool is simple with high schema coverage and annotations. The description covers the core function, output (optimized text with token savings), and key properties (deterministic, no API key). No gaps for a tool of this complexity.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100%, so the baseline is 3. The description adds general context about what is removed (filler words, duplicates, whitespace, instructions) which maps to the options, but does not provide detailed parameter-level semantics beyond the schema.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool compresses an LLM prompt by removing specific elements. The verb 'compress' and resource 'LLM prompt' are specific. However, it does not explicitly differentiate from siblings like 'truncate_to_tokens' or 'count_tokens', though the purpose is clear.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
No guidance on when to use this tool versus alternatives. The description mentions it is deterministic and requires no API key, but does not provide explicit when/when-not scenarios or alternative tools.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
parse_csvARead-onlyIdempotentInspect
Parse a CSV string into a JSON array of objects (or raw arrays). Handles RFC 4180 quoted fields, escaped quotes, and custom delimiters. Use when processing spreadsheet exports, data imports, or structured text pipelines where the source is CSV. Supports up to 200 KB.
| Name | Required | Description | Default |
|---|---|---|---|
| input | Yes | CSV content to parse | |
| header | No | Treat the first row as headers (default: true) | |
| delimiter | No | Field delimiter character (default: ",") |
Output Schema
| Name | Required | Description |
|---|---|---|
| rows | No | |
| columns | No | |
| headers | No | |
| row_count | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint, idempotentHint, destructiveHint. Description adds handling of quoted fields, escaped quotes, custom delimiters, and a 200 KB size limit. No contradiction.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two dense sentences: first states purpose and output, second gives usage context, format handling, and size limit. Every sentence earns its place.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given output schema existence and annotations, description covers usage, edge cases, and size limit. Complete for a parsing tool.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100%, but description adds meaning by connecting header parameter to output type (objects vs raw arrays) and mentions custom delimiters. Adds value beyond schema.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description states the verb 'Parse', resource 'CSV string', and output 'JSON array of objects (or raw arrays)'. It specifies RFC 4180 and custom delimiters, making it distinct from sibling parsing tools.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Provides clear contexts: 'spreadsheet exports, data imports, or structured text pipelines where the source is CSV'. Lacks explicit exclusions or alternatives, but context is sufficient.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
parse_http_headersARead-onlyIdempotentInspect
Parse a raw HTTP headers block into a structured JSON object. Detects multi-value headers, masks Authorization values, and optionally audits for missing security headers (HSTS, CSP, X-Frame-Options, etc.).
| Name | Required | Description | Default |
|---|---|---|---|
| headers | Yes | Raw HTTP headers (one "Name: Value" per line) | |
| analyze_security | No | Audit for missing security headers (default: true) |
Output Schema
| Name | Required | Description |
|---|---|---|
| parsed | No | |
| security | No | |
| header_count | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint and idempotentHint true, and destructiveHint false. The description adds behavioral context: it masks Authorization values (a non-obvious transformation) and optionally audits security headers. No contradictions with annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two sentences, front-loaded with purpose, then key features. Every sentence adds value with no redundancy or fluff.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
With an output schema present and clear annotations, the description is sufficient. It covers the tool's core action, optional parameter behavior, and special handling. No gaps remain for an agent to understand invocation.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema description coverage is 100%, but the description adds value by explaining that the tool detects multi-value headers and masks Authorization values, which are not explicit in the parameter descriptions. This enhances understanding of the behavior.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the verb 'Parse', the resource 'raw HTTP headers block', and the output 'structured JSON object'. It also lists specific features like multi-value detection and Authorization masking, distinguishing it from similar tools like security_headers_check.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
No explicit guidance on when to use this tool versus alternatives. No mention of prerequisites or exclusions. The context signals show many sibling tools, but the description does not differentiate or provide selection criteria.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
post_jira_commentAInspect
Post the output of jira_to_test_suite as a formatted comment on the source Jira ticket. Converts Gherkin, E2E steps, API tests, and ambiguities into Atlassian Document Format (ADF). STATEFUL — creates a comment on the issue.
| Name | Required | Description | Default |
|---|---|---|---|
| issue_key | Yes | Jira issue key, e.g. "PROJ-123" | |
| jira_email | Yes | Atlassian account email | |
| jira_token | Yes | Atlassian API token | |
| test_suite | Yes | The test_suite object from jira_to_test_suite result | |
| jira_base_url | Yes | Atlassian base URL |
Output Schema
| Name | Required | Description |
|---|---|---|
| success | No | |
| comment_id | No | |
| comment_url | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
The description explicitly notes 'STATEFUL — creates a comment on the issue', adding behavioral context beyond annotations (readOnlyHint=false, idempotentHint=false). It communicates that each call adds a new comment and converts formats, providing valuable transparency.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is three sentences long, front-loaded with the key action, and contains no extraneous information. Every sentence serves a purpose: stating the operation, detailing the conversion, and indicating statefulness.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
The description covers the input source and conversion behavior. With an output schema present, it does not need to explain return values. However, it could mention error conditions (e.g., invalid issue key) or authentication prerequisites, but the schema covers required parameters. Slightly incomplete but still good.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
The input schema has 100% description coverage, so the schema already explains all parameters. The description adds minimal new semantic meaning beyond restating the connection to jira_to_test_suite for the test_suite parameter. Baseline 3 is appropriate.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the core function: posting the output of jira_to_test_suite as a formatted comment on a Jira ticket. It specifies the input source and the conversion details (Gherkin, E2E, API tests into ADF), making it distinct from sibling tools like jira_to_test_suite (which extracts data) and fetch_jira_issue (which retrieves issue details).
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description implies usage after obtaining test_suite from jira_to_test_suite, providing clear context. However, it does not explicitly state when not to use it or mention alternatives for posting comments, but the context is sufficient for an AI agent to infer the workflow.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
pr_gatekeeperARead-onlyIdempotentInspect
Compound quality gate for pull requests. Runs three sequential checks: (1) secret detection — scans diff for API keys, tokens, passwords matching 16 regex patterns; (2) bug analysis — heuristic scan for eval(), innerHTML, empty catch, console.log, TODO/FIXME; (3) commit message linting against Conventional Commits spec. Returns gate verdict (PASS/WARN/BLOCK), blockers, and actionable warnings. Use before merging any code change.
| Name | Required | Description | Default |
|---|---|---|---|
| diff | Yes | Unified git diff (output of `git diff HEAD`) | |
| context | No | Optional: PR title or description for richer bug analysis | |
| commit_message | Yes | The commit message to lint (e.g. "feat(auth): add OAuth2 login") |
Output Schema
| Name | Required | Description |
|---|---|---|
| flags | No | |
| score | No | |
| checks | No | |
| verdict | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already indicate readOnlyHint=true, idempotentHint=true, destructiveHint=false, confirming safe analysis. The description adds substantial behavioral detail: the specific patterns scanned (16 regex for secrets, heuristic patterns for bugs, Conventional Commits for commit messages) and the return format (verdict: PASS/WARN/BLOCK, blockers, warnings). No contradictions with annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is concise with four sentences: purpose, checks listed, return type, usage instruction. No redundant information, front-loaded with the core purpose, and every sentence adds value.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool's complexity (three separate checks, multiple return fields) and the existence of an output schema, the description provides a good overview. It mentions the verdict types and actionable warnings but could add details on ordering or severity levels. Still, it is sufficiently complete for an agent to understand its role.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
The input schema has 100% description coverage for all three parameters. The description adds context by linking parameters to checks (e.g., diff used for secret detection and bug analysis, commit_message for linting). While it doesn't elaborate beyond the schema, it reinforces parameter roles, which is valuable.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states it is a 'Compound quality gate for pull requests' and lists the three sequential checks (secret detection, bug analysis, commit message linting). It uses a specific verb ('runs') and resource ('quality gate for pull requests'), effectively distinguishing it from sibling tools like secret_scan or lint_commit_message by being a composite check.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description explicitly advises 'Use before merging any code change,' providing clear usage context. It does not, however, explicitly state when not to use this tool or mention alternatives (e.g., for individual checks use secret_scan or lint_commit_message). The guidance is clear but lacks exclusions.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
prompt_injection_scanARead-onlyIdempotentInspect
Scan user input or prompts for common prompt injection patterns. Detects system prompt overrides, jailbreak attempts, role manipulation, encoding tricks, delimiter attacks, template/interpolation injection ({{...}}, ${...}), and context-exfiltration attempts ("repeat everything above").
| Name | Required | Description | Default |
|---|---|---|---|
| input | Yes | The user input or prompt to scan for injection patterns | |
| sensitivity | No | Detection sensitivity (default: medium) |
Output Schema
| Name | Required | Description |
|---|---|---|
| detections | No | |
| risk_level | No | |
| sensitivity | No | |
| input_length | No | |
| detections_count | No | |
| injection_detected | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already indicate the tool is read-only and idempotent. The description adds value by detailing the patterns detected, but does not disclose behavioral traits like false positive rates, performance impact, or maximum input length. With annotations covering safety, a 3 is appropriate.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is a single concise sentence that front-loads the purpose and lists detection categories efficiently. No wasted words.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given that an output schema exists (mentioned in context signals), the description does not need to detail return values. It adequately explains the input and detection scope. Minor gap: no mention of output structure or sensitivity parameter behavior, but overall complete.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100% with descriptions for both parameters. The tool description repeats the purpose but adds no extra meaning to the parameters themselves. Baseline 3 is correct as schema does the heavy lifting.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool scans user input for prompt injection patterns and lists specific pattern types, making the purpose unambiguous. It distinguishes from sibling tools like toxicity_scan or secret_scan by focusing on injection attacks.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
While the description implies usage for scanning inputs, it does not provide explicit guidance on when to use this tool versus alternatives (e.g., guardrail_test) or when not to use it. There is no mention of prerequisites or limitations.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
prompt_template_fillARead-onlyIdempotentInspect
Fill a prompt template with variables. Supports {{variable}} syntax and {{#if key}}...{{/if}} conditional blocks. Returns the filled prompt and lists unfilled variables.
| Name | Required | Description | Default |
|---|---|---|---|
| strict | No | Throw error if any variable is not provided (default: false) | |
| template | Yes | Prompt template with {{variable}} placeholders | |
| variables | No | Key-value pairs to fill (e.g. {"name":"Alice","role":"engineer"}) |
Output Schema
| Name | Required | Description |
|---|---|---|
| result | No | |
| total_vars | No | |
| filled_variables | No | |
| unfilled_variables | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
The description adds value beyond annotations by detailing the supported syntax ({{variable}} and conditional blocks) and confirming that it returns unfilled variables. Annotations already indicate idempotent, read-only, non-destructive behavior, so the description's additional context on syntax and output enhances transparency without contradiction.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is concise (3 sentences), front-loaded with the core action, and every sentence adds information. No wasted words.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool's simplicity, an output schema exists, and annotations cover safety, the description covers the main functionality and return. However, it does not explain the 'strict' parameter behavior beyond what's in the schema, and missing edge cases like error behavior for missing variables in non-strict mode.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100%, so the baseline is 3. The description does not add new parameter-level details beyond what the schema provides (e.g., template, variables, strict). The mention of syntax and conditional blocks relates to the template value but not to parameter semantics themselves.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states that the tool fills a prompt template with variables, supporting {{variable}} syntax and conditional blocks. It specifies the return of the filled prompt and unfilled variables, which is specific and actionable. However, it does not explicitly distinguish itself from similar sibling tools like build_rag_prompt or few_shot_formatter.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
No guidance is provided on when to use this tool versus alternatives, such as when template filling is needed versus building prompts from scratch. There are no exclusions or prerequisites mentioned, leaving the agent to infer usage context.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
prompt_test_suiteARead-onlyIdempotentInspect
Define a test suite for a prompt: provide the system prompt, user prompt, and expected output criteria. Returns a test plan with scored rubric — use this as input for manual or automated LLM evaluation.
| Name | Required | Description | Default |
|---|---|---|---|
| max_tokens | No | Max token budget for the test | |
| temperature | No | Temperature to use | |
| user_prompt | Yes | The user prompt to send | |
| check_safety | No | Include safety/PII checks in the rubric | |
| must_include | No | Required content (comma-separated) | |
| system_prompt | Yes | The system prompt under test | |
| expected_format | No | Expected output format | |
| must_not_include | No | Forbidden content (comma-separated) | |
| expected_behavior | No | Description of what the LLM should do (free text) | |
| adversarial_prompts | No | Auto-generate adversarial test variants (jailbreak, injection, edge cases) |
Output Schema
| Name | Required | Description |
|---|---|---|
| rubric | No | |
| categories | No | |
| total_tests | No | |
| instructions | No | |
| test_suite_name | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint=true and idempotentHint=true, so the tool is safe and idempotent. The description adds that it returns a test plan with a scored rubric, which is useful but not essential beyond annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is concise with two sentences, front-loading the action and then describing the output. Every sentence adds value without redundancy.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the complexity (10 parameters, 2 required) and the presence of an output schema, the description adequately covers the main purpose and output. It could be slightly more detailed about the rubric, but the output schema likely handles that.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema description coverage is 100%, so the input schema already explains each parameter. The description does not add additional meaning beyond 'provide system prompt, user prompt, and expected output criteria,' which is a high-level overview.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool's purpose: defining a test suite for a prompt by providing system prompt, user prompt, and expected output criteria. It distinguishes itself from sibling tools that focus on running tests, not defining them.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description indicates that the output can be used for manual or automated LLM evaluation, providing clear context. However, it does not explicitly exclude scenarios or mention alternatives (e.g., using run_semantic_tests instead), which would improve guidance.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
rag_relevance_rankARead-onlyIdempotentInspect
Rank an array of text chunks by relevance to a query using TF-IDF scoring. Simulates retrieval ranking for RAG testing without needing embeddings or an API.
| Name | Required | Description | Default |
|---|---|---|---|
| query | Yes | The user query | |
| top_k | No | Return top K results (default: all) | |
| chunks | Yes | Array of text chunks to rank |
Output Schema
| Name | Required | Description |
|---|---|---|
| rank | No | |
| index | No | |
| query | No | |
| score | No | |
| results | No | |
| returned | No | |
| total_chunks | No | |
| chunk_preview | No | |
| keyword_overlap | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnly and idempotent hints. The description adds useful behavior context: uses TF-IDF, simulates retrieval ranking for RAG testing, no external dependencies. No contradictions.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is two short sentences with no wasted words, front-loaded with the action and key differentiator.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the presence of an output schema, the description adequately covers the tool's purpose, method, and use case. It could mention default behavior for top_k but that's in schema.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
All parameters have schema descriptions (100% coverage). The tool description does not add extra parameter semantics beyond the schema, so baseline 3 is appropriate.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool ranks text chunks by relevance using TF-IDF, distinguishing it from siblings like 'bm25_score' and 'embedding_similarity' by specifying the algorithm and use case for RAG testing without embeddings.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
It describes when to use (for lightweight RAG testing without embeddings/API) but does not explicitly mention when not to use or name alternative tools, though the context is clear.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
rate_toolAInspect
Give honest usage feedback on an IA-QA MCP tool. Provide a score (1-5) and a comment. Rate low (1-2) if the tool was wrong, irrelevant, or a poor fit; rate high (4-5) only if it genuinely solved your need. Ratings are aggregated on a public dashboard at /devtools/mcp-ratings. Skip rating routine successes — we want signal, not praise. Example: rate_tool({ tool_name: "format_json", score: 2, comment: "Tried to pretty-print a JSON5 file, it rejected trailing commas — not usable for my case." })
| Name | Required | Description | Default |
|---|---|---|---|
| score | Yes | Rating from 1 (poor) to 5 (excellent) | |
| comment | No | Strongly encouraged — explain what you were trying to do and whether the tool got you there. Be specific about what was missing, wrong, or a poor fit. This is the most valuable part of the rating (max 500 chars). | |
| tool_name | Yes | Name of the MCP tool to rate (e.g. "format_json", "shield_analyze") |
Output Schema
| Name | Required | Description |
|---|---|---|
| ok | No | |
| score | No | |
| comment | No | |
| message | No | |
| rated_at | No | |
| tool_name | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
The description states that ratings are aggregated on a public dashboard, which is a behavioral disclosure beyond what the annotations provide. Annotations are minimal (false for all hints), so the description carries the burden well. No contradiction.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is well-structured with a clear purpose, guidelines, and an example. It is slightly long but every sentence contributes meaning. It is front-loaded with the core purpose.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
The tool is simple with three parameters fully documented. The description covers usage context, scoring rationale, comment expectations, and mentions the public dashboard. It is complete for an agent to correctly invoke the tool.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100%, so baseline is 3. The description adds significant value by explaining the scoring rubric, the importance of comments, and providing an example. This enhances understanding beyond what the schema alone offers.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool's purpose: 'Give honest usage feedback on an IA-QA MCP tool.' It specifies that the user provides a score and a comment, and it distinguishes itself from sibling tools by being a rating tool, not a utility tool.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description provides explicit guidance on when to rate low (1-2) vs high (4-5), and it instructs to 'Skip rating routine successes — we want signal, not praise.' This clearly tells the agent when to use the tool and when not, with a concrete example.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
redact_piiARead-onlyIdempotentInspect
Automatically detect and redact Personally Identifiable Information (PII) from text. Replaces emails, phone numbers, SSNs, credit cards, IP addresses, and JWT tokens with [REDACTED_TYPE] placeholders. Safe to use before logging or sending to an LLM.
| Name | Required | Description | Default |
|---|---|---|---|
| input | Yes | Text to redact PII from | |
| types | No | Comma-separated types to redact (default: all). Options: email, phone, ssn, credit_card, ip_address, jwt | |
| marker | No | Custom replacement marker (default: "REDACTED"). Result: [REDACTED_EMAIL] |
Output Schema
| Name | Required | Description |
|---|---|---|
| clean | No | |
| pii_found | No | |
| replacements | No | |
| redacted_text | No | |
| total_redactions | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already indicate non-destructive, idempotent read-only operation. Description adds useful detail about the replacement format ([REDACTED_TYPE]) and safety context, without contradicting annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two concise sentences front-loaded with purpose and key details; no wasted words.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
With full schema and output schema present, description is sufficient for the tool's simplicity. Provides enough context for safe usage and output expectation.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema covers all params with descriptions (100%). Description adds practical detail like default values for types and marker, and the expected output format, surpassing baseline.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool detects and redacts PII, listing specific types like emails and phone numbers, and explains use before logging or LLM, distinguishing it from siblings like detect_secrets.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Provides explicit context for safe use ('before logging or sending to an LLM') but lacks explicit when-not-to-use or alternative tool names.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
regex_testARead-onlyIdempotentInspect
Test a regular expression pattern against an input string and return all matches with their index positions and named capture groups. Use for validating user inputs, extracting structured data from text, or debugging regex patterns. Supports flags g, i, m, s, u, y.
| Name | Required | Description | Default |
|---|---|---|---|
| flags | No | Regex flags: g (global), i (case-insensitive), m (multiline), s (dotAll) — default: "" | |
| input | Yes | The string to test against (max 50 KB) | |
| pattern | Yes | Regular expression pattern (without delimiters) |
Output Schema
| Name | Required | Description |
|---|---|---|
| note | No | |
| flags | No | |
| matched | No | |
| matches | No | |
| pattern | No | |
| match_count | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint, idempotentHint, and destructiveHint=false. Description adds behavioral context beyond annotations: return format details and supported flags. No contradictions.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two sentences, front-loaded with purpose and output, followed by use cases and flag support. No redundant information.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool's moderate complexity and full schema coverage plus output schema, the description provides all necessary context: purpose, parameters, flag support, and input constraints.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema already covers all 3 parameters with descriptions. Description adds value by specifying max input size (50 KB), that pattern is without delimiters, and listing supported flags (g,i,m,s,u,y).
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
Description clearly states the verb 'Test' and resource 'regular expression pattern against an input string', and specifies output 'all matches with their index positions and named capture groups'. This is distinct from sibling tools, none of which perform regex testing.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Explicitly lists use cases: 'validating user inputs, extracting structured data from text, or debugging regex patterns'. However, it does not provide when-not-to-use or compare with alternatives, but given the lack of sibling regex tools, this is sufficient.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
rerank_evaluateARead-onlyIdempotentInspect
Evaluate RAG retrieval quality using the NVIDIA neural reranker (nv-rerankqa-mistral-4b-v3). Ranks passages by semantic relevance to a query and computes Precision@k and Recall@k. Optionally accepts ground-truth relevance labels to produce a PASS/FAIL CI/CD verdict.
| Name | Required | Description | Default |
|---|---|---|---|
| query | Yes | The search query or question to rank against | |
| top_k | No | k for Precision@k evaluation (default 3) | |
| passages | Yes | Array of passage objects to rank (min 2, max 20) | |
| threshold | No | Minimum Precision@k to PASS (0-1, default 0.5) |
Output Schema
| Name | Required | Description |
|---|---|---|
| model | No | |
| query | No | |
| top_n | No | |
| results | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Discloses use of a specific model, semantic ranking, metric computation, and optional verdict. Annotations (readOnlyHint, idempotentHint) are consistent; description adds valuable behavioral context beyond annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two concise sentences that front-load purpose and key capabilities. Every sentence adds value with no redundancy.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given full schema coverage and an output schema existing, the description sufficiently covers the tool's purpose, inputs, and optional behavior. No major gaps.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100% with descriptions for all parameters. The description provides no additional parameter-level detail beyond what the schema already offers; baseline score is appropriate.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
Clearly states the tool evaluates RAG retrieval quality using a specific NVIDIA neural reranker, ranks passages, computes Precision@k and Recall@k, and optionally provides a PASS/FAIL verdict. Distinguishes from siblings like bm25_score by specifying the reranker approach.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description implies usage for evaluating retrieval with optional ground truth for CI/CD, but does not explicitly state when not to use it or mention alternative tools. Lacks explicit exclusions.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
response_quality_scoreARead-onlyIdempotentInspect
Score an LLM response on multiple quality dimensions: relevance, completeness, clarity, conciseness, formatting. Returns a weighted 0-100 score with detailed breakdown.
| Name | Required | Description | Default |
|---|---|---|---|
| question | Yes | The original question/prompt | |
| response | Yes | The LLM response to score | |
| max_length | No | Ideal max character length (penalize if exceeded) | |
| expected_keywords | No | Keywords that should appear in a good answer |
Output Schema
| Name | Required | Description |
|---|---|---|
| grade | No | |
| stats | No | |
| breakdown | No | |
| max_score | No | |
| total_score | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already indicate read-only and idempotent behavior. The description adds useful behavioral context by specifying the quality dimensions and the output (weighted score with detailed breakdown). However, it does not describe the weighting methodology or handle edge cases like empty responses.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is two sentences, front-loading the action and key details (dimensions, output). No redundant or extraneous information.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool's simplicity, full schema documentation, and existence of an output schema (which handles return value details), the description is complete. It conveys the essential purpose and what the tool returns, sufficient for an agent to decide to use it.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema description coverage is 100%, so baseline is 3. The description does not add any parameter-specific meaning beyond what the schema already provides (e.g., it does not explain the role of max_length or expected_keywords in scoring).
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool's purpose: scoring an LLM response on five specific quality dimensions (relevance, completeness, clarity, conciseness, formatting) and returning a weighted 0-100 score. This differentiates it from sibling tools like 'compare_responses' or 'hallucination_check' which have different focuses.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
No guidance is provided on when to use this tool versus alternatives. Given the large set of sibling tools (e.g., compare_responses, bias_detect), explicit context on appropriate usage scenarios would be beneficial.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
run_eval_contractARead-onlyInspect
Parse a .ia-eval.yaml LLM test suite, call the specified LLM model for each scenario, run all configured scorers, and return a structured JSON report with per-scenario Pass/Fail verdicts and a Markdown summary. Use list_local_tests to discover available test files.
| Name | Required | Description | Default |
|---|---|---|---|
| api_keys | No | API keys to use for LLM generation (all optional — falls back to server env vars) | |
| overrides | No | Override contract defaults | |
| contract_path | No | Absolute or relative path to a .ia-eval.yaml file (required unless inline_contract is provided) | |
| inline_contract | No | Raw contract object (alternative to contract_path) |
Output Schema
| Name | Required | Description |
|---|---|---|
| summary | No | |
| metadata | No | |
| warnings | No | |
| contract_path | No | |
| scenario_results | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
The description contradicts the annotations: readOnlyHint is true, but the tool calls an LLM and runs scorers, which are side-effecting operations (API calls, token usage). This is a clear contradiction.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is concise, with two sentences that effectively communicate the core workflow. It could be slightly more structured but is appropriately front-loaded.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the complexity (4 params, nested objects, output schema), the description covers the main workflow and output format. It mentions the report structure and the relationship with list_local_tests, but lacks details on scorers or the interaction between contract_path and inline_contract.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100%, so the description doesn't add much beyond the schema. It implicitly refers to the contract_path but doesn't elaborate on parameter usage or syntax.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool's purpose: parsing a .ia-eval.yaml file, calling an LLM, running scorers, and returning a report. It distinguishes from siblings by mentioning to use list_local_tests for discovery, which implies this tool is for running a specific contract.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description provides clear context on when to use this tool (when you have an eval contract) and a helpful hint to use list_local_tests for discovery. However, it doesn't explicitly state when not to use it or mention alternatives.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
run_pr_gate_pipelineARead-onlyInspect
Full automated QA pipeline for a pull request. Takes a unified git diff (output of git diff HEAD) and returns: bug hotspots, regression impact areas, risk score (0–100), generated test cases, severity assessment, and a merge recommendation (PASS / CONDITIONAL / BLOCK). This is the highest-value QA tool — use it when reviewing any code change.
| Name | Required | Description | Default |
|---|---|---|---|
| context | No | Optional PR title or description for richer analysis | |
| git_diff | Yes | Unified git diff (output of `git diff HEAD` or copied from GitHub diff view) |
Output Schema
| Name | Required | Description |
|---|---|---|
| sla | No | |
| high | No | |
| topBugs | No | |
| critical | No | |
| bugsFound | No | |
| riskLevel | No | |
| riskScore | No | |
| impactAreas | No | |
| changedFiles | No | |
| severityLevel | No | |
| testCasesGenerated | No | |
| mergeRecommendation | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations declare readOnlyHint=true, so description's emphasis on returning results is consistent. Adds list of outputs but no additional behavioral traits.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two sentences, front-loaded with purpose and outputs. No wasted words, effectively concise.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Output schema exists, reducing need for return value explanation. Description covers primary inputs, output list, and usage context. Slightly elevated by clear output listing and use case.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100%, so baseline is 3. Description adds value for git_diff by specifying source (e.g., `git diff HEAD`), but does not mention the optional context parameter.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
Description clearly states it's a full QA pipeline for PRs, listing what it takes and returns. High verb specificity, but does not explicitly differentiate from sibling tools like analyze_diff_bugs.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Advises use when reviewing any code change, providing clear context. However, lacks explicit when-not-to-use guidance or comparisons to alternatives.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
run_semantic_testsARead-onlyIdempotentInspect
Semantic assertion primitive: compare actual vs expected text pairs using cosine similarity + ROUGE-L. Two modes: tfidf (default, free, no API key) or embeddings (OpenAI text-embedding-3-small, BYOK, true semantic similarity). Returns per-case PASS/FAIL verdicts and an overall verdict. CI-ready: pipe the JSON verdict field to gate a build.
| Name | Required | Description | Default |
|---|---|---|---|
| mode | No | tfidf (default): fast, free, lexical. embeddings: OpenAI text-embedding-3-small, true semantic similarity, requires api_key. | |
| cases | Yes | Array of (actual, expected) pairs to evaluate. | |
| api_key | No | OpenAI API key — required only when mode is embeddings. | |
| thresholds | No | Pass/fail thresholds (defaults: cosine 0.75, rouge_l 0.5). | |
| require_all | No | If true (default), all cases must pass for overall PASS. If false, at least one case passing returns PASS. |
Output Schema
| Name | Required | Description |
|---|---|---|
| mode | No | |
| total | No | |
| failed | No | |
| passed | No | |
| results | No | |
| verdict | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already indicate read-only, idempotent, non-destructive behavior. The description adds context about CI usage (gate builds), mode requirements (API key for embeddings), and defaults. No contradictions. It could mention network calls for embeddings mode, but overall it enhances transparency.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is two concise sentences with no wasted words. It front-loads the core functionality and ends with a practical CI use case. Every sentence earns its place.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool's complexity (5 parameters, nested objects, multiple modes) and the presence of an output schema, the description is reasonably complete. It covers purpose, modes, verdicts, CI applicability, and default thresholds. Minor omission: no mention of error handling or max items limit (but schema covers maxItems).
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100%, so baseline is 3. The description adds value by clarifying mode behavior ('fast, free, lexical' vs 'true semantic similarity') and specifying defaults for thresholds (cosine 0.75, rouge_l 0.5). This context helps agents choose mode and understand parameter impact.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description explicitly states the tool's function: compare actual vs. expected text pairs using cosine similarity and ROUGE-L, returning PASS/FAIL verdicts. It distinguishes from sibling tools by emphasizing its role as a semantic assertion primitive for testing, with two distinct modes (tfidf and embeddings).
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description provides guidance on when to use each mode (tfidf for free/lexical, embeddings for true semantic similarity with an API key) and mentions CI readiness. While it does not explicitly compare to siblings, the context implies it is for pass/fail testing rather than generic similarity scoring.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
run_vlm_test_suiteARead-onlyInspect
Run a test suite against a Vision-Language Model (VLM) — send an image (URL or base64) + N test cases (each with a question + assertion) to GPT-4o, Claude 3.5, or Gemini. Returns per-case PASS/FAIL verdicts, a pass rate, an overall PASS/WARNING/FAIL verdict (customizable threshold), and latency stats. Assertion types: contains, not_contains, json_format, min_length, max_length, semantic_contains (TF-IDF cosine similarity ≥ 0.4). BYOK: requires your own API key for the target provider.
| Name | Required | Description | Default |
|---|---|---|---|
| model | Yes | VLM model to use. | |
| api_key | Yes | API key for the model provider (OpenAI sk-, Anthropic sk-ant-, or Google AIzaSy...). | |
| image_url | No | Public URL of the image to evaluate (required unless image_base64 is provided). | |
| threshold | No | Pass rate threshold for overall verdict (default: 80, 0–100). | |
| test_cases | Yes | Array of test cases to run. | |
| image_base64 | No | Base64-encoded image data (required unless image_url is provided). | |
| system_prompt | No | Optional system prompt sent to the VLM. | |
| image_mime_type | No | MIME type of the image if using image_base64 (default: image/jpeg). |
Output Schema
| Name | Required | Description |
|---|---|---|
| model | No | |
| total | No | |
| failed | No | |
| passed | No | |
| results | No | |
| verdict | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations mark the tool as readOnlyHint=true and openWorldHint=true, indicating safe but possibly external-dependent behavior. The description adds value by detailing the return format (PASS/FAIL verdicts, pass rate, overall verdict, latency stats) and noting that it uses the user's own API keys. Missing any mention of resource consumption (e.g., calls to external APIs) but that's implied by the API key requirement.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is a single sentence but packs significant information: purpose, models, input/output format, assertion types, and requirement. It is concise and front-loads the main action. Could be slightly more scannable with bullet points, but it is not verbose and earns its length.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool's complexity (8 parameters, output schema exists), the description covers all essential aspects: what the tool does, how to use it (image + test cases), expected returns, and prerequisites (API key). With 100% schema coverage and an output schema, the description is complete and leaves no major questions unanswered.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema description coverage is 100%, so the baseline is 3. The description adds context to parameters like 'model' (lists options), 'assertion_type' (explains each type), and 'threshold' (default 80). However, most parameter details are already in the schema, so the description provides marginal added semantic value beyond pointing out defaults.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool runs test suites against Vision-Language Models, specifies supported models (GPT-4o, Claude, Gemini), and details the input and output. It distinguishes from sibling tools like 'run_vlm_test_suite_batch' and 'multimodal_eval_guide' by focusing on single-suite execution with per-case verdicts.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description explicitly tells when to use this tool (to test VLM responses with assertions) and includes crucial usage guidance: 'BYOK: requires your own API key' and lists assertion types. While it doesn't list counterexamples, the documentation is sufficiently specific to guide correct invocation.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
run_vlm_test_suite_batchARead-onlyInspect
Compare multiple VLMs on the same test suite in parallel — send an image (URL or base64) + N test cases to all models simultaneously. Returns per-model PASS/FAIL verdicts, pass rates, latency stats, and a comparison table. Assertion types: contains, not_contains, json_format, min_length, max_length, semantic_contains. BYOK: requires API keys for each provider.
| Name | Required | Description | Default |
|---|---|---|---|
| models | Yes | Array of model IDs to compare (runs in parallel). | |
| api_keys | Yes | Map of model ID → API key. Example: { "gpt-4o": "sk-...", "claude-3-5-sonnet-20241022": "sk-ant-..." } | |
| image_url | No | Public URL of the image to evaluate (required unless image_base64 is provided). | |
| threshold | No | Pass rate threshold for overall verdict (default: 80, 0–100). | |
| test_cases | Yes | Array of test cases to run against every model. | |
| image_base64 | No | Base64-encoded image data (required unless image_url is provided). | |
| system_prompt | No | Optional system prompt sent to every VLM. | |
| image_mime_type | No | MIME type of the image if using image_base64 (default: image/jpeg). |
Output Schema
| Name | Required | Description |
|---|---|---|
| suites | No | |
| verdict | No | |
| total_failed | No | |
| total_passed | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already provide readOnlyHint=true and openWorldHint=true. The description adds specific behavioral details: parallel execution, per-model verdicts, latency stats, and the need for API keys (important for openWorld). No contradictions.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is two concise sentences plus a list of assertion types. It front-loads the main purpose and is free of fluff.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool's complexity (8 params, nested objects, parallel execution) and existence of output schema, the description covers core functionality: purpose, input (image + test cases), parallel execution, output (verdicts, stats, comparison), assertion types, and API keys. It misses edge cases like model failure or threshold behavior, but overall complete for main use.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema description coverage is 100%, so baseline is 3. The description adds value by clarifying the mutual exclusivity of image_url and image_base64 (both required unless one is provided), and listing assertion types which are not fully detailed in schema. This goes beyond just restating parameter names.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description starts with 'Compare multiple VLMs on the same test suite in parallel' which clearly states the verb (compare) and resource (VLMs on test suite). It distinguishes from siblings like run_vlm_test_suite by emphasizing multiple models and parallel execution.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description implies usage: when you want to compare multiple models on the same test cases. It mentions BYOK requirement. However, it does not explicitly state when not to use or alternatives like run_vlm_test_suite for single model.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
score_geo_signalsARead-onlyIdempotentInspect
Analyze a webpage HTML (or full HTML) for GEO (Generative Engine Optimization) signals. Returns a score /60 with per-check results and improvement tips. GEO = optimizing pages for AI-powered search engines (ChatGPT Search, Perplexity, etc.).
| Name | Required | Description | Default |
|---|---|---|---|
| head_html | Yes | Raw HTML of the <head> section (or full page HTML) to analyze |
Output Schema
| Name | Required | Description |
|---|---|---|
| grade | No | |
| score | No | |
| checks | No | |
| passed | No | |
| max_score | No | |
| total_checks | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations declare readOnly, idempotent, and non-destructive behavior. The description adds context by explaining the analysis produces a score with per-check results and improvement tips, and defines GEO. No contradictions. It could mention that no data is modified.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is two sentences, front-loaded with the core purpose, and every sentence adds value. No extraneous information.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the simple parameter and presence of an output schema (referenced), the description adequately covers the tool's purpose and output. It explains the GEO context, which helps agents understand the tool's domain. Slightly more detail on the return format could improve, but sufficient.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
The input schema has 100% coverage with a description for the single parameter. The tool description adds that the input can be <head> HTML or full page HTML, which is a slight clarification beyond the schema. Baseline 3 is appropriate.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool analyzes webpage HTML for GEO signals and returns a score out of 60 with per-check results and improvement tips. It uses a specific verb ('Analyze') and resource ('webpage <head> HTML'), and distinguishes from siblings by focusing on GEO for AI-powered search engines.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description explains what the tool does and provides background on GEO, but does not explicitly state when to use it versus alternatives. Among many sibling analysis tools, no guidance is given on selection criteria or exclusion cases.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
search_jira_issuesARead-onlyInspect
Search Jira using JQL (Jira Query Language). Returns matching issues with key fields. Ideal for finding open bugs, sprint tickets, or issues by label/assignee/component. BYOK — credentials transit in-memory only, never stored.
| Name | Required | Description | Default |
|---|---|---|---|
| jql | Yes | JQL query string, e.g. "project = PROJ AND status = Open AND assignee = currentUser() ORDER BY priority DESC" | |
| fields | No | Fields per issue. Default: summary, status, assignee, priority, issuetype, labels, created, updated | |
| jira_email | Yes | Atlassian account email | |
| jira_token | Yes | Atlassian API token | |
| max_results | No | Max issues to return (default: 10, max: 50) | |
| jira_base_url | Yes | Atlassian base URL, e.g. "https://mycompany.atlassian.net" |
Output Schema
| Name | Required | Description |
|---|---|---|
| jql | No | |
| total | No | |
| issues | No | |
| returned | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint=true and destructiveHint=false. The description adds value by highlighting security behavior (BYOK, credentials in-memory only, never stored) and noting the return of 'key fields'. No contradictions with annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Three concise sentences: first defines action, second provides usage examples, third adds critical security note. No wasted words, front-loaded with core purpose.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
For a medium-complexity search tool with an output schema and good annotations, the description covers purpose, usage, and security. It does not mention pagination or error handling, but these are partially covered by the output schema and annotations.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema description coverage is 100%, so the schema already fully documents parameters. The description does not add additional parameter meaning beyond what is in the schema (e.g., JQL example in description aligns with schema description).
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states it searches Jira using JQL and returns matching issues, with specific use-case examples like finding open bugs and sprint tickets. This distinguishes it from sibling tools like fetch_jira_issue (single issue fetch) via the verb 'Search' and JQL mention.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Provides explicit examples of when to use (e.g., finding open bugs, sprint tickets, issues by label/assignee/component). However, does not explicitly state when not to use or mention alternatives, though the context of JQL search vs. single-issue fetch is implicit.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
secret_scanARead-onlyIdempotentInspect
Scan text or code for leaked secrets: API keys (AWS, GCP, Azure, OpenAI, Anthropic, Stripe, GitHub, GitLab, Slack, Twilio, SendGrid, HuggingFace), private keys (RSA/EC/PGP), JWTs, database connection strings, Bearer tokens, and Basic auth headers. Returns a list of findings with type, severity, line number, and a redacted preview. Use before committing code, sharing logs, or sending text to an LLM. 100% regex-based, zero network calls.
| Name | Required | Description | Default |
|---|---|---|---|
| input | Yes | Text or code to scan for secrets | |
| types | No | Comma-separated types to scan (default: all). Options: aws, gcp, azure, openai, anthropic, stripe, github, gitlab, slack, twilio, sendgrid, huggingface, jwt, private_key, connection_string, bearer, basic_auth |
Output Schema
| Name | Required | Description |
|---|---|---|
| summary | No | |
| findings | No | |
| risk_level | No | |
| input_lines | No | |
| secrets_found | No | |
| findings_count | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already indicate read-only, idempotent, and non-destructive behavior. The description adds valuable operational details: '100% regex-based, zero network calls.' This goes beyond annotations by explaining the processing method and privacy guarantees, though it doesn't mention any limitations or failure modes.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is four sentences, front-loaded with the main purpose. All sentences add value, but the list of secret types is somewhat lengthy. Could be slightly more concise without losing specificity.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool's simplicity (2 parameters, output schema exists), the description is complete. It covers what it does, when to use it, behavioral traits, and output structure. The presence of an output schema means return values need not be detailed further.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100% with two parameters ('input' and 'types'). The description adds that 'types' is a comma-separated list and lists options, but this largely repeats the schema's enum-like description. No additional semantics about input format or length are provided, so it meets but does not exceed the baseline.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the verb 'Scan' and the resource 'text or code for leaked secrets.' It enumerates specific secret types (API keys, JWTs, etc.), making the tool's purpose highly specific and distinct from siblings. While sibling 'detect_secrets' exists, the description's detail effectively distinguishes it.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Explicitly states when to use the tool ('before committing code, sharing logs, or sending text to an LLM'), providing clear usage context. However, it does not explicitly mention when not to use it or name alternative tools for comparison, which would elevate to 5.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
security_headers_checkARead-onlyInspect
Analyse the HTTP security headers of a public URL OR of raw response headers you paste in. Grades each header (A–F) for: Strict-Transport-Security, Content-Security-Policy, X-Frame-Options, X-Content-Type-Options, Referrer-Policy, Permissions-Policy, X-XSS-Protection, Cross-Origin-Opener-Policy, Cross-Origin-Resource-Policy, and Cross-Origin-Embedder-Policy. Returns an overall score (0–100), per-header grades, missing headers, and fix snippets for Express, Nginx, and Apache. For localhost/private targets the remote server cannot reach, pass the headers parameter instead of url.
| Name | Required | Description | Default |
|---|---|---|---|
| url | No | Optional. Full public URL to check (e.g. https://example.com). Omit it entirely when using `headers`. The server cannot reach localhost/private IPs. | |
| headers | No | Optional, and sufficient on its own (no url needed). The response headers to grade, either as an object {"strict-transport-security": "max-age=...", ...} or as the raw header block pasted as a string (e.g. `curl -sI` output). Use this to audit a local server the remote MCP cannot reach. |
Output Schema
| Name | Required | Description |
|---|---|---|
| fix | No | |
| key | No | |
| url | No | |
| weak | No | |
| grade | No | |
| score | No | |
| value | No | |
| header | No | |
| source | No | |
| weight | No | |
| details | No | |
| missing | No | |
| weak_count | No | |
| missing_count | No | |
| overall_grade | No | |
| headers_checked | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already indicate read-only and nondestructive behavior. The description adds value by detailing the grading process (A-F per header), overall score, and return of missing headers and fix snippets, which goes beyond the annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is two sentences: a comprehensive first sentence explaining the core functionality and a second sentence addressing the critical edge case. Every sentence provides essential information without redundancy or fluff.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the output schema exists, the description need not detail return values, but it briefly mentions the outputs (overall score, grades, missing headers, fix snippets). It covers the main workflow and the localhost constraint, leaving minimal gaps.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema description coverage is 100%, so baseline is 3. The description adds meaning by explaining the dual modes, that both parameters are optional but one is sufficient, and that 'headers' can be an object or raw string, which is not fully captured in the schema.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states it analyzes HTTP security headers, lists the specific headers graded, and distinguishes between two input modes (URL or raw headers). This is specific, uses a strong verb, and differentiates from sibling tools like cors_checker or web_security_audit.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description provides guidance on when to use each parameter: 'url' for public URLs and 'headers' for localhost/private targets, mentioning the remote server's limitation. It lacks explicit comparison with sibling tools, but the context is clear for effective usage.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
shield_analyzeARead-onlyInspect
Run a comprehensive AI guardrail analysis on an LLM response. Orchestrates 6 deterministic safety checks plus an optional LLM-powered deep analysis in parallel: hallucination detection (grounding score), prompt injection scan, toxicity scan, output validation (PII/safety), guardrail rules, response quality scoring, and AI verdict (via Qwen, Gemma, Llama, etc.). Returns a unified PASS/FIX/BLOCK verdict with a 0-100 safety score, per-check results, and actionable fix recommendations. Use this as a single-call safety gate before surfacing any LLM output to users.
| Name | Required | Description | Default |
|---|---|---|---|
| model | No | LLM model for AI-powered deep analysis (default: "qwen/qwen3-32b"). Set to "none" to skip LLM check. Supports any model from list_llm_models. | |
| rules | No | Optional guardrail rules array (same format as guardrail_test tool) | |
| prompt | No | Optional original prompt (used for quality scoring and injection detection) | |
| source | No | Optional reference/source text for hallucination grounding check | |
| response | Yes | The LLM-generated response to analyze |
Output Schema
| Name | Required | Description |
|---|---|---|
| flags | No | |
| grade | No | |
| score | No | |
| checks | No | |
| verdict | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Description reveals that the tool runs 6 deterministic checks plus optional LLM analysis in parallel, returning a unified verdict with scores and recommendations. This adds value beyond annotations (which mark it as readOnly). No contradictions. Could mention idempotency or rate limits but not required given annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Three concise sentences that front-load the purpose, list specifics, and state return format. No redundancy or waste. Efficiently communicates what the tool does.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool's complexity (orchestrating multiple checks), the description adequately covers the checks performed, the verdict format, and the output components. Could mention error handling or prerequisites but largely complete for a read-only analysis tool.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100% so description carries less burden. Adds minor value by noting that setting model to 'none' skips LLM check and that the default model is qwen/qwen3-32b (already in schema). Does not significantly enhance understanding beyond schema.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
Clearly describes the tool as a comprehensive guardrail analysis orchestrating multiple checks. Distinguishes as a single-call safety gate for LLM output. Could explicitly differentiate from individual sibling tools like guardrail_test, hallucination_check, etc., but overall purpose is clear.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
States 'Use this as a single-call safety gate before surfacing any LLM output to users.' Provides context for when to use but does not explicitly mention when not to use or compare to alternative tools. Agent may need to infer when to choose this over individual check tools.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
similarity_scoreARead-onlyIdempotentInspect
Compute text similarity between reference and hypothesis using multiple metrics: Cosine (BoW, TF-IDF), Jaccard, ROUGE-1, ROUGE-2, ROUGE-L, and BLEU. No API key needed. Ideal for LLM eval (expected vs actual), RAG quality checks, and NLG benchmarking. Supports batch mode.
| Name | Required | Description | Default |
|---|---|---|---|
| batch | No | Batch mode: array of {reference, hypothesis} pairs. | |
| metrics | No | Metrics to compute (default: all). Options: "cosine_bow", "cosine_tfidf", "jaccard", "rouge1", "rouge2", "rougeL", "bleu" | |
| reference | No | Reference / expected text (ground truth) | |
| threshold | No | Optional pass/fail threshold (0-1). Applies to ROUGE-L F1 score. | |
| hypothesis | No | Hypothesis / actual text (LLM output) |
Output Schema
| Name | Required | Description |
|---|---|---|
| f1 | No | |
| mode | No | |
| count | No | |
| recall | No | |
| results | No | |
| precision | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already indicate safe, idempotent, read-only behavior. The description adds useful context: 'No API key needed' and 'Supports batch mode'. It does not contradict annotations, and adds value beyond them.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two sentences, no fluff. The first sentence immediately states the core action and lists metrics. The second adds key notes. Highly efficient.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool has 5 parameters and supports multiple modes (batch vs individual), the description covers essential uses and metrics. However, it does not explicitly state the relationship between parameters (e.g., batch vs reference+hypothesis). Still adequate with high schema coverage.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100%, so baseline is 3. The description lists metrics and mentions batch mode, but these are already detailed in schema descriptions. It adds minor value with 'No API key needed' and use case context.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states it computes text similarity using specific metrics (Cosine, Jaccard, ROUGE, BLEU). It lists multiple metrics, distinguishing it from single-metric tools like levenshtein_distance or embedding_similarity, but does not explicitly compare to siblings.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description suggests use cases (LLM eval, RAG quality checks, NLG benchmarking) but does not specify when not to use this tool or mention alternatives like embedding_similarity or bm25_score. The guidance is implied rather than explicit.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
sort_linesARead-onlyIdempotentInspect
Sort, deduplicate, reverse, or filter lines of text. Useful for cleaning import lists, dependencies, log files, and config entries.
| Name | Required | Description | Default |
|---|---|---|---|
| trim | No | Trim whitespace from each line (default: true) | |
| input | Yes | Multi-line text to process | |
| filter | No | For "filter": keep lines containing this substring (case-insensitive) | |
| operation | No | "sort" (default), "sort_desc", "reverse", "deduplicate", "unique_sort", "filter" | |
| remove_empty | No | Remove empty lines (default: true) |
Output Schema
| Name | Required | Description |
|---|---|---|
| result | No | |
| removed | No | |
| line_count | No | |
| original_count | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint, idempotentHint, and destructiveHint, so the description doesn't need to repeat safety info. It adds behavioral context by describing the types of text transformations, which is sufficient but does not go beyond what the schema implies.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two sentences: first states core operations, second gives use cases. No fluff. Front-loaded with the verb actions. Excellent conciseness.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool's simplicity (5 parameters, straightforward operations), comprehensive annotations, and presence of an output schema, the description provides enough context for an agent to understand and invoke the tool correctly.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
The input schema has full coverage with descriptions for all parameters, so the description's high-level summary adds minimal additional semantic value. Baseline 3 is appropriate.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool's purpose: sorting, deduplicating, reversing, or filtering lines of text. It also provides concrete use cases (cleaning import lists, dependencies, etc.), making it distinct from sibling tools that handle other text operations.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description suggests usage scenarios ('useful for cleaning import lists, dependencies, log files, and config entries'), but does not explicitly state when to avoid using this tool or mention alternative tools. Despite this, the context is clear enough for most cases.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
split_chunksARead-onlyIdempotentInspect
Split text into chunks of at most N tokens (cl100k_base: ~4 chars/token) with optional overlap. Designed for RAG ingestion pipelines.
| Name | Required | Description | Default |
|---|---|---|---|
| input | Yes | Text to split into chunks | |
| overlap | No | Token overlap between consecutive chunks (default: 0) | |
| chunk_tokens | Yes | Maximum tokens per chunk (10–8000) |
Output Schema
| Name | Required | Description |
|---|---|---|
| chunks | No | |
| chunk_count | No | |
| overlap_tokens | No | |
| tokens_per_chunk | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already indicate readOnlyHint=true, idempotentHint=true, destructiveHint=false, so the agent knows it's a safe, non-destructive operation. The description adds valuable context: it mentions the tokenization model (cl100k_base) and approximate character-per-token ratio (~4 chars/token), which helps the agent understand the behavior of the chunk_tokens parameter beyond the schema.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is a single sentence that immediately states the core function and key details (N tokens, cl100k_base, optional overlap, RAG purpose). Every word serves a purpose with no redundancy.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool's simplicity, the presence of an output schema, and the behavioral annotations, the description provides sufficient context for an agent to use the tool correctly. It covers the main purpose, parameters, and typical use case.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100% with descriptions for all three parameters. The description adds extra value by specifying the tokenizer model and approximate char/token ratio, which helps interpret the chunk_tokens range. It also clarifies that overlap is optional, reinforcing the schema's default of 0.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the verb 'split' and resource 'text into chunks', specifying the token limit and optional overlap. It also explicitly mentions its design for RAG ingestion pipelines, distinguishing it from sibling tools like truncate_to_tokens (which truncates) and count_tokens (which counts tokens).
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description notes the tool is 'Designed for RAG ingestion pipelines,' providing context for when to use it. However, it does not explicitly state when not to use it or mention alternatives among the many text-processing siblings, leaving the agent to infer usage boundaries.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
ssl_certificate_checkARead-onlyInspect
Analyse the SSL/TLS certificate of any HTTPS host. Returns certificate subject, issuer, validity dates, days until expiry, protocol version, cipher suite, key exchange info, and an overall grade (A+, A, B, C, F). Detects expired, self-signed, and weak certificates. Use this to audit TLS posture before production deployment or during security reviews.
| Name | Required | Description | Default |
|---|---|---|---|
| host | Yes | Hostname to check (e.g. example.com). Do not include https:// prefix. | |
| port | No | Port number (default: 443) |
Output Schema
| Name | Required | Description |
|---|---|---|
| host | No | |
| grade | No | |
| cipher | No | |
| issuer | No | |
| issues | No | |
| subject | No | |
| protocol | No | |
| valid_to | No | |
| is_expired | No | |
| valid_from | No | |
| is_self_signed | No | |
| days_until_expiry | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already indicate readOnlyHint=true and destructiveHint=false, signaling a safe read operation. The description adds detailed behavioral context: it returns certificate details, days until expiry, and detects expired/self-signed/weak certificates. It fully discloses what the tool does beyond annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is two sentences: first action and outputs, second detection features and usage. It is front-loaded, efficient, and every sentence adds significant value. No redundancy.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
With an output schema present, the description need not explain return values. It covers purpose, usage, and key features (detection of weak certs). It lacks limitations (e.g., only HTTPS, required network access), but overall it is complete for typical use. A minor gap prevents a 5.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100%, meaning both parameters are already well-described in the input schema (host with format instructions, port with default). The description adds no additional parameter meaning beyond what the schema provides. Baseline score of 3 is appropriate.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description uses a specific verb 'Analyse' and clearly identifies the resource as 'SSL/TLS certificate of any HTTPS host'. It lists concrete return fields (subject, issuer, validity, grade) and distinguishes from siblings like security_headers_check by its certificate focus.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description explicitly states when to use: 'audit TLS posture before production deployment or during security reviews.' This provides clear context. It does not mention when not to use or alternatives, but the usage scenario is well-defined.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
strip_markdownARead-onlyIdempotentInspect
Strip all Markdown formatting (headers, bold, italic, code fences, links, lists) from text and return clean plain text. Run this before injecting scraped documentation, README files, or user content into an LLM prompt to eliminate redundant markup tokens and reduce cost.
| Name | Required | Description | Default |
|---|---|---|---|
| input | Yes | Markdown text to convert to plain text |
Output Schema
| Name | Required | Description |
|---|---|---|
| text | No | |
| original_length | No | |
| stripped_length | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already indicate readOnlyHint=true, destructiveHint=false, and idempotentHint=true, making the tool's safety profile clear. The description adds the behavioral note about eliminating markup tokens to reduce cost, which is useful but not essential beyond annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two sentences: first defines function, second provides practical use case. No redundant words, perfectly front-loaded, and each sentence earns its place.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool's simplicity (one parameter, comprehensive annotations, output schema exists), the description is fully complete. It covers purpose, usage, and benefits without leaving gaps.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
With 100% schema coverage for the single parameter 'input' (described as 'Markdown text to convert to plain text'), the description does not need to add more. The main description already covers the parameter's purpose indirectly.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description explicitly states the tool strips Markdown formatting (headers, bold, italic, code fences, links, lists) and returns plain text, with a clear verb ('Strip') and resource ('Markdown formatting'). It distinctively specifies the scope and use case, differentiating it from sibling tools like html_to_markdown or normalize_whitespace.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description provides clear usage context: 'Run this before injecting scraped documentation, README files, or user content into an LLM prompt'. It implicitly advises against using it when Markdown is needed. It does not explicitly mention alternative tools, but the context is sufficient for most scenarios.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
system_prompt_builderARead-onlyIdempotentInspect
Build a structured system prompt from components: role, task, constraints, output format, tone, language, and examples. Generates a production-ready system prompt with token estimate.
| Name | Required | Description | Default |
|---|---|---|---|
| role | Yes | Role/persona (e.g. "Senior QA Engineer", "JSON extraction assistant") | |
| task | No | Main task or objective | |
| tone | No | Communication tone | |
| examples | No | Brief examples to include | |
| language | No | Response language (e.g. "French") | |
| constraints | No | Rules and constraints to follow | |
| output_format | No | Expected output format description |
Output Schema
| Name | Required | Description |
|---|---|---|
| sections | No | |
| system_prompt | No | |
| token_estimate | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint=true, idempotentHint=true, and no destructiveness. The description adds value by confirming the tool generates a production-ready prompt with a token estimate, providing behavioral context beyond the annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is exceptionally concise: two sentences that front-load the core functionality and output. Every word adds value, with no repetition or fluff.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the output schema exists, the description does not need to explain return values. It sufficiently covers the tool's operation, inputs, and output (prompt with token estimate) for an agent to understand its purpose and usage.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
With 100% schema description coverage, the description adds little beyond listing parameter names already present in the schema. It does not provide additional format, constraints, or usage nuances for individual parameters.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool's purpose: building a structured system prompt from explicit components such as role, task, constraints, etc. It also mentions the output includes a token estimate, distinguishing it from similar tools like build_rag_prompt.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description implies usage for constructing system prompts from components but offers no explicit guidance on when to use this tool versus alternatives (e.g., build_rag_prompt, few_shot_formatter). No when-not-to-use or exclusion criteria are mentioned.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
test_skillARead-onlyInspect
Validate a SKILL.md definition (Cursor / GitHub Copilot / Windsurf) by auto-generating trigger-positive and trigger-negative scenarios, running each through the model with the skill injected as a system prompt, and scoring trigger accuracy + step adherence. Returns a PASS/FIX/BLOCK verdict with per-scenario breakdown. Uses Groq llama-3.3-70b by default (server key, no api_key needed). Pass api_key + model to use your own provider.
| Name | Required | Description | Default |
|---|---|---|---|
| model | No | LLM model ID to use for both scenario generation and testing (e.g. gpt-4o-mini, claude-3-5-haiku-20241022). Defaults to llama-3.3-70b-versatile (Groq, server key). | |
| api_key | No | API key for the chosen model provider. Not required when using the default Groq model. | |
| skill_md | Yes | Full content of the SKILL.md file to test. Must include a name, a "Use when:" trigger description, and at least one step. | |
| scenario_count | No | Number of test scenarios to generate: half trigger-positive, half trigger-negative. Default: 6. |
Output Schema
| Name | Required | Description |
|---|---|---|
| score | No | |
| verdict | No | |
| scenarios | No | |
| step_adherence | No | |
| trigger_accuracy | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint=true and destructiveHint=false, indicating a safe read operation. The description adds value by disclosing the default model (Groq llama-3.3-70b) and the ability to use a custom provider, which affects latency and cost. No contradictions with annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is two sentences, front-loading the core purpose in the first sentence. Every sentence earns its place: first states the main function, second adds critical provider details. No redundant or vague wording.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool's moderate complexity (4 parameters, output schema exists), the description covers the high-level process (scenario generation, testing, scoring) and output type (PASS/FIX/BLOCK verdict). It doesn't detail edge cases, but the output schema likely handles that, making it sufficiently complete.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100%, so the baseline is 3. The description adds meaning beyond the schema by noting the default scenario count (6) and the required structure of skill_md (must include name, trigger, step). These details help the agent construct valid inputs.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool's purpose: 'Validate a SKILL.md definition' by auto-generating test scenarios, running them through a model, and scoring performance. It specifies the resource (SKILL.md) and the action (validation with automated testing), distinguishing it from other sibling tools like 'get_testing_guidelines' or 'run_semantic_tests' which are more generic.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description explicitly specifies when to use the tool ('Validate a SKILL.md definition') and provides context about the default provider and custom API key option. However, it does not state when not to use it or mention alternatives among siblings, so it's clear but lacks explicit exclusion criteria.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
text_statsARead-onlyIdempotentInspect
Compute comprehensive statistics for any text: character count (with and without spaces), word count, line count, sentence count, paragraph count, and estimated reading time in minutes. Use for validating form field lengths, evaluating LLM output verbosity, or content auditing.
| Name | Required | Description | Default |
|---|---|---|---|
| input | Yes | The text to analyse |
Output Schema
| Name | Required | Description |
|---|---|---|
| chars | No | |
| lines | No | |
| words | No | |
| sentences | No | |
| paragraphs | No | |
| chars_no_space | No | |
| reading_time_minutes | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already indicate readOnlyHint, idempotentHint, destructiveHint. Description adds specific output metrics but no additional behavioral traits like performance or error conditions. Adequate given annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two sentences, first lists main action and outputs, second provides use cases. No filler, front-loaded with key information.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
For a simple tool with one parameter and output schema, the description fully covers purpose, outputs, and usage context. Nothing missing.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100% with parameter description. Description does not add extra meaning beyond the schema, so baseline score of 3 is appropriate.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
Description clearly states the tool computes comprehensive statistics for text, listing specific metrics (character count, word count, etc.). It distinguishes from siblings like calculate_readability or count_tokens by offering a broader set of stats.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Explicitly gives use cases: validating form field lengths, evaluating LLM output verbosity, content auditing. Though it doesn't mention when not to use or alternative tools, the context is clear and helpful.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
timestamp_convertARead-onlyIdempotentInspect
Convert between Unix timestamps (seconds or milliseconds) and ISO-8601 / UTC date strings. Auto-detects epoch vs. millisecond format. Omit input to get the current time. Returns iso, unix_s, unix_ms, utc, date, and time fields.
| Name | Required | Description | Default |
|---|---|---|---|
| input | No | Unix timestamp (number, seconds or ms) or ISO date string. Omit to get the current time. |
Output Schema
| Name | Required | Description |
|---|---|---|
| iso | No | |
| utc | No | |
| date | No | |
| time | No | |
| unix_s | No | |
| unix_ms | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations indicate read-only, idempotent, non-destructive behavior. The description adds auto-detection and return fields, providing useful context beyond annotations. No contradictions.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is extremely concise with four short sentences, front-loading the purpose and covering all essential behavior and output without waste.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool's simplicity (one optional parameter, full schema coverage, output schema implied), the description is complete, covering input behavior, auto-detection, and return fields.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100% with a clear parameter description. The tool description reiterates but does not significantly enhance parameter understanding beyond the schema.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool converts between Unix timestamps and ISO-8601/UTC date strings, specifying the verb and resource. It distinguishes from siblings since no other tool in the list performs timestamp conversion.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
It explains auto-detection of epoch vs. millisecond format and that omitting input returns current time. Although it does not explicitly state when not to use it, the guidelines are sufficient for this narrow scope.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
token_budget_calculatorARead-onlyIdempotentInspect
Plan token allocation across system prompt, user input, context/RAG chunks, and expected output. Warns if budget exceeds model context window. Supports 25+ models.
| Name | Required | Description | Default |
|---|---|---|---|
| model | Yes | Model name (e.g. gpt-4o, claude-3.5-sonnet, gemini-2.0-flash) | |
| context | No | Actual context text (will estimate tokens) | |
| user_input | No | Actual user input text (will estimate tokens) | |
| system_prompt | No | Actual system prompt text (will estimate tokens) | |
| context_tokens | No | Token count for RAG context / documents | |
| user_input_tokens | No | Token count for user message | |
| system_prompt_tokens | No | Token count for system prompt | |
| expected_output_tokens | No | Expected max output tokens |
Output Schema
| Name | Required | Description |
|---|---|---|
| model | No | |
| warnings | No | |
| breakdown | No | |
| context_window | No | |
| fits_in_window | No | |
| remaining_tokens | No | |
| utilization_percent | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint=true and idempotentHint=true, so the tool is safe. The description adds behavioral context: it warns if budget exceeds the model context window and supports 25+ models. No contradictions.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is concise with two sentences, front-loading the key purpose and adding the warning and model support efficiently. Every sentence adds value.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the output schema exists, the description does not need to explain return values. It covers the main behaviors (planning, warning, model support). Minor missing detail: it doesn't mention that users can provide either text or token counts for some fields, but this is clear from the schema.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100%, so the schema already explains all 8 parameters. The description mentions the categories (system prompt, user input, context/RAG chunks, expected output) but does not add additional semantics beyond what is in the schema.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
Description clearly defines the tool as a token budget calculator for system prompt, user input, context/RAG chunks, and expected output. The verb 'Plan' and specific resources distinguish it from sibling tools like count_tokens and context_window_check.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description explains when to use this tool (planning token allocation and checking budget against context window). However, it does not explicitly state when not to use it or direct to alternatives for simpler token counting or context window checks.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
toxicity_scanARead-onlyIdempotentInspect
Scan text for toxic language, bias indicators, profanity, and harmful content categories. Returns risk scores per category. Useful for LLM safety guardrail testing.
| Name | Required | Description | Default |
|---|---|---|---|
| text | Yes | Text to scan | |
| categories | No | Categories to check (default: all) |
Output Schema
| Name | Required | Description |
|---|---|---|
| results | No | |
| text_length | No | |
| overall_risk | No | |
| categories_checked | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint=true, idempotentHint=true, destructiveHint=false. Description adds that it returns risk scores per category, which is useful but not extensive. No contradiction with annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two sentences, front-loaded with the main action, no unnecessary words. Every sentence earns its place.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Simple tool with two parameters, good annotations, and description mentions output format (risk scores per category). Complete for its complexity.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100% with descriptions for both parameters. The tool description lists categories (toxic language, bias, profanity, harmful content) that align with the enum, adding marginal value beyond the schema.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
Description clearly states the tool scans text for toxic language, bias, profanity, and harmful content categories, and returns risk scores. The verb 'scan' and resource 'text' are specific, and it distinguishes from sibling tools like 'bias_detect' which focus on a subset.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Provides explicit use case: 'LLM safety guardrail testing.' No explicit when-not-to-use or alternatives, but the context is clear given the sibling tools.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
transform_json_arrayARead-onlyIdempotentInspect
Transform a JSON array using common operations: pluck (extract specific fields), filter (by field value), sort_by (field), group_by (field), count_by (field), uniq_by (field). Useful for processing MCP tool results and LLM structured outputs.
| Name | Required | Description | Default |
|---|---|---|---|
| n | No | For first_n / last_n: number of items | |
| path | No | Optional dot-notation path to the array within the JSON object (e.g. "data.items") | |
| field | No | Field to operate on (for sort_by, group_by, count_by, uniq_by, filter) | |
| input | Yes | JSON string containing an array (or object with an array at path) | |
| fields | No | Comma-separated field list for "pluck" (e.g. "id,name,email") | |
| filter_op | No | For "filter": "==" | "!=" | ">" | ">=" | "<" | "<=" | "contains" | "exists" | "!exists" | |
| operation | Yes | Operation: "pluck", "filter", "sort_by", "group_by", "count_by", "uniq_by", "reverse", "first_n", "last_n", "flatten" | |
| sort_order | No | For sort_by: "asc" (default) or "desc" | |
| filter_value | No | For "filter": value to compare against |
Output Schema
| Name | Required | Description |
|---|---|---|
| count | No | |
| field | No | |
| order | No | |
| total | No | |
| fields | No | |
| result | No | |
| removed | No | |
| operation | No | |
| group_count | No | |
| unique_values | No | |
| removed_duplicates | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint=true, idempotentHint=true, destructiveHint=false, so the safety profile is clear. The description adds the operations list but doesn't detail edge cases, error handling, or performance. No contradictions with annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is a single sentence that efficiently conveys the purpose and lists operations. It is front-loaded and has no wasted words.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool's complexity (9 parameters, multiple operations) and the presence of an output schema, the description is adequate but could be more thorough about usage patterns and return value structure. It doesn't mention output schema existence but that's covered by schema.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema description coverage is 100%, so the schema already documents all parameters. The description provides a functional overview but adds limited new meaning beyond listing operations.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool transforms a JSON array using common operations, listing specific operations like pluck, filter, sort_by, etc. It distinguishes itself from sibling tools by focusing on array transformations, not general JSON manipulation.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description mentions it's useful for processing MCP tool results and LLM structured outputs, giving some context, but lacks explicit guidance on when to use this tool over alternatives, no exclusions or comparisons to sibling tools.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
truncate_to_tokensARead-onlyIdempotentInspect
Truncate text to at most N tokens (cl100k_base: ~4 chars/token) to avoid exceeding an LLM context window. Optionally keeps the end of the text instead of the start (useful for keeping recent conversation history). Reports whether truncation occurred and the estimated token count.
| Name | Required | Description | Default |
|---|---|---|---|
| input | Yes | Text to truncate | |
| from_end | No | Keep the end of the text instead of the start (default: false) | |
| max_tokens | Yes | Maximum number of tokens to keep |
Output Schema
| Name | Required | Description |
|---|---|---|
| text | No | |
| truncated | No | |
| tokens_estimate | No | |
| original_tokens_estimate | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations declare readOnlyHint=true, idempotentHint=true, destructiveHint=false. The description adds meaningful context beyond annotations: encoding (cl100k_base), approximate chars/token, option to keep start vs end, and reporting of truncation and estimated token count. No contradictions.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two sentences cover purpose, encoding, option, and reporting. No redundant information; every sentence earns its place. Highly concise and well-structured.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given an output schema exists, the description appropriately focuses on input and behavior. It covers main functionality, options, and results. Could mention tokenization boundary handling, but not necessary for typical use. Complete for a utility tool.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100%, and the description adds value beyond the schema by explaining the encoding detail (~4 chars/token) and the effect of the from_end parameter. This enhances understanding for an AI agent.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool truncates text to at most N tokens using cl100k_base encoding, with an option to keep the end. It distinguishes itself from siblings like count_tokens and token_budget_calculator by focusing on truncation, making its purpose unambiguous.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description explains when to use the tool: to avoid exceeding an LLM context window, and mentions a use case (keeping recent conversation history). It does not explicitly state when not to use it, but the context is clear enough for an AI agent to decide.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
unescape_htmlARead-onlyIdempotentInspect
Convert HTML entities (&, <, >, ", ', and numeric NNN;) back to plain characters. Use when processing HTML-encoded text from APIs, email content, or legacy database fields before passing to an LLM or displaying to users.
| Name | Required | Description | Default |
|---|---|---|---|
| input | Yes | HTML-encoded string to unescape |
Output Schema
| Name | Required | Description |
|---|---|---|
| unescaped | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint=true, idempotentHint=true, destructiveHint=false. The description adds value by specifying the exact entities handled and numeric forms, which is beyond the annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two efficient sentences: first states the function, second gives usage. No unnecessary words, well front-loaded.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
With one parameter fully described in schema, comprehensive annotations, and a likely clear output schema, the description is complete for effective agent use.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100% (one parameter described). The description only repeats 'HTML-encoded string' similar to schema, adding no new semantic detail.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the verb 'Convert' and the resource 'HTML entities', listing specific examples. It distinguishes itself from siblings like escape_html and html_to_markdown by focusing on unescaping.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description explicitly says 'Use when processing HTML-encoded text from APIs, email content, or legacy database fields', providing clear context. It does not include when not to use, but the positive guidance is strong.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
url_decodeARead-onlyIdempotentInspect
Decode a percent-encoded URL string back to plain text. Use when parsing query parameters from raw URLs or when displaying encoded values to users.
| Name | Required | Description | Default |
|---|---|---|---|
| input | Yes | URL-encoded string to decode |
Output Schema
| Name | Required | Description |
|---|---|---|
| decoded | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint, idempotentHint, and destructiveHint, so the description adds little beyond confirming the decoding operation. It does not discuss edge cases (e.g., invalid encoding) or response format, which would add value. The description is consistent with annotations, no contradiction.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two concise sentences deliver purpose, use cases, and context with zero fluff. The description is front-loaded and every word earns its place.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
For a simple, one-parameter utility with an output schema and rich annotations, the description is nearly complete. It could mention the output format (decoded string) but that is likely covered by the output schema. The description adequately informs the agent when and how to use the tool.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
The single parameter 'input' is fully described in the schema (100% coverage), and the description uses synonymous phrasing ('percent-encoded URL string'). No additional meaning or format guidance is provided, so the description adds no value beyond the schema.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the action ('Decode') and the resource ('percent-encoded URL string'), and provides two specific use cases (parsing query parameters, displaying encoded values). This is precise and leaves no ambiguity about the tool's purpose.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description explicitly states when to use the tool ('Use when parsing query parameters from raw URLs or when displaying encoded values to users'). However, it does not mention when not to use it or provide alternatives (e.g., base64_decode), missing an opportunity to guide the agent away from incorrect usage.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
url_encodeARead-onlyIdempotentInspect
Percent-encode a string for safe use in URLs. Call this before programmatically building query strings, path segments, or form-encoded bodies to prevent injection and malformed URLs.
| Name | Required | Description | Default |
|---|---|---|---|
| mode | No | "component" (default) or "full" for encodeURI behavior | |
| input | Yes | String to URL-encode |
Output Schema
| Name | Required | Description |
|---|---|---|
| encoded | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint=true and idempotentHint=true, so the description's mention of preventing injection adds purpose but not behavioral traits. No contradictions; baseline score of 3 is appropriate.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is two sentences with no wasted words, front-loaded with the core purpose, and highly efficient.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
For a simple encoding tool with comprehensive annotations and an output schema, the description is complete. It covers purpose, usage context, and is suitable for selection and invocation.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100% and both parameters (input, mode) are well-described in the schema. The description does not add additional meaning beyond the schema, so baseline 3.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states 'Percent-encode a string for safe use in URLs' using a specific verb and resource, and the context of building query strings distinguishes it from the sibling url_decode.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
It explicitly advises calling this before building query strings, path segments, or form-encoded bodies, which is clear guidance. It doesn't mention when not to use it or alternatives, but for a simple utility this is sufficient.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
validate_agent_trajectoryARead-onlyIdempotentInspect
Run declarative assertions on an agent trace (OpenAI tool-call messages, LangChain run trees, or plain text logs). No LLM call — deterministic. Assertion types: order (tool A before B), must_call, must_not_call, max_calls, min_calls, no_error, recovery (agent continues after error). Returns per-assertion PASS/FAIL, parsed steps, and an overall verdict. Use this to gate CI/CD on agent behavior correctness.
| Name | Required | Description | Default |
|---|---|---|---|
| trace | Yes | Agent execution trace as JSON (OpenAI messages array, LangChain run tree) or plain text log (Thought/Action/Observation format). | |
| format | No | Trace format. auto (default) detects automatically. | |
| assertions | Yes | List of assertions to validate against the trace. |
Output Schema
| Name | Required | Description |
|---|---|---|
| steps | No | |
| total | No | |
| failed | No | |
| passed | No | |
| verdict | No | |
| assertions | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint and idempotentHint, but the description adds context: deterministic, no LLM call, and describes return values (PASS/FAIL, parsed steps, verdict). This adds value beyond annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is four sentences, each serving a purpose: introduction, behavioral trait, assertion types, return values and use case. No redundant information; front-loaded with core purpose.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the output schema exists, the description covers all necessary context: input types, determinism, assertion types, and use case. It fully prepares the agent for correct invocation.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema description coverage is 100%, but the description clarifies the trace formats and assertion types, supplementing the schema's parameter descriptions. It adds meaning about the tool's behavior and output structure.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool's purpose: running declarative assertions on agent traces. It specifies the input types (OpenAI messages, LangChain run trees, plain text) and lists assertion types, distinguishing it from sibling tools like function_call_validate which focus on single function calls.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description explicitly mentions a use case: gating CI/CD on agent behavior correctness. However, it does not mention when not to use this tool or provide alternatives, leaving some ambiguity for the agent.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
validate_emailARead-onlyIdempotentInspect
Validate an email address against RFC 5322 syntax before storing it, sending a transactional email, or adding it to a mailing list. Returns { valid, email } — use this to avoid bounces and malformed data.
| Name | Required | Description | Default |
|---|---|---|---|
| Yes | Email address to validate |
Output Schema
| Name | Required | Description |
|---|---|---|
| No | ||
| valid | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations declare readOnlyHint=true, idempotentHint=true, destructiveHint=false. The description adds value by specifying the validation standard (RFC 5322) and the return format ({ valid, email }). This goes beyond annotations without contradicting them.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two sentences with no waste. The first sentence states purpose and use cases; the second states return value and benefit. Front-loaded and efficient.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
For a simple validation tool with one parameter and an output schema, the description is complete. It covers the validation standard, use cases, return format, and benefit.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100% (1 parameter documented). The description adds minimal additional meaning beyond the schema, though it mentions RFC 5322 which implies the parameter must be a string. Baseline 3 is appropriate.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool's purpose: 'Validate an email address against RFC 5322 syntax' and lists specific use cases (storing, sending, adding to list). It is distinct from sibling tools like validate_url.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description explicitly says when to use the tool ('before storing it, sending a transactional email, or adding it to a mailing list') and the benefit ('avoid bounces and malformed data'). It does not mention when not to use it or alternative tools, but the context is sufficient.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
validate_mcp_responseARead-onlyIdempotentInspect
Validate that an MCP tool response conforms to expected format, schema, and content rules. Use this to QA-test any MCP server tool. Supply the tool's actual JSON result and a set of checks to perform.
| Name | Required | Description | Default |
|---|---|---|---|
| response | Yes | The MCP tool result as a JSON string to validate | |
| min_items | No | If response is an array, minimum number of items expected | |
| expected_type | No | Expected top-level type: "object", "array", "string", "number" | |
| required_keys | No | Comma-separated list of keys that MUST exist in the response (dot-notation for nested: "data.id, data.name") | |
| actual_latency | No | Actual measured latency in ms (from the call) | |
| forbidden_keys | No | Comma-separated list of keys that MUST NOT exist (e.g. "password, secret, token") | |
| max_size_bytes | No | Maximum acceptable response size in bytes | |
| max_response_ms | No | Maximum acceptable latency in ms (will be compared if provided) |
Output Schema
| Name | Required | Description |
|---|---|---|
| total | No | |
| checks | No | |
| failed | No | |
| passed | No | |
| verdict | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
The description adds context to annotations by specifying that the tool validates format, schema, and content rules. Annotations already indicate readOnlyHint=true and destructiveHint=false, and the description reinforces this. It does not contradict annotations and provides extra detail about the validation scope.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is highly concise with two sentences that front-load the main purpose. Every sentence adds value without redundancy or clutter.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool has 8 parameters and an output schema (not shown but indicated), the description is complete enough for a validation tool. It covers the main function and usage. A slightly more detailed explanation of what the validation checks entail could be beneficial, but the output schema likely handles that.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
The input schema has 100% description coverage for all 8 parameters, so the schema already provides clear meaning. The description does not add additional details about parameters beyond mentioning 'a set of checks'. Therefore, the description adds minimal value beyond the schema, earning a baseline score of 3.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool's purpose: validating MCP tool responses. It uses a specific verb ('Validate') and resource ('MCP tool response'), and distinguishes from sibling tools by specifying it's for QA-testing any MCP server tool, making it unique among other validators like json_schema_validate.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description explicitly states when to use the tool: 'Use this to QA-test any MCP server tool.' It provides actionable guidance by telling the user to supply the JSON result and a set of checks. However, it does not mention when not to use it or provide explicit alternatives, though the sibling context implies other validators exist.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
validate_urlARead-onlyIdempotentInspect
Parse and validate a URL. Returns decomposed components: protocol, hostname, port, path, query parameters, hash, and origin.
| Name | Required | Description | Default |
|---|---|---|---|
| input | Yes | URL to validate and parse |
Output Schema
| Name | Required | Description |
|---|---|---|
| full | No | |
| hash | No | |
| port | No | |
| valid | No | |
| origin | No | |
| search | No | |
| hostname | No | |
| pathname | No | |
| protocol | No | |
| query_params | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already indicate readOnlyHint, idempotentHint, and non-destructive behavior. The description adds context about the return value but does not contradict annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
One sentence that front-loads the purpose and is efficient with no wasted words.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the presence of an output schema (mentioned in context signals) and the simple nature of the tool, the description sufficiently covers what the tool does and returns.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100% for the single parameter, and the description does not add additional format or constraints beyond the schema's description.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states 'Parse and validate a URL' and lists the decomposed components, distinguishing it from similar tools like url_encode or url_decode.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
No explicit when-to-use or when-not-to-use guidance is provided, though the purpose is implied. Alternatives are not mentioned.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
vector_quantizeARead-onlyIdempotentInspect
Simulate int8 or int4 quantization of float32 embedding vectors. Reduces storage by 4x (int8) or 8x (int4). Returns quantized values, scale factor, and precision loss (MSE). Useful for understanding vector DB compression trade-offs.
| Name | Required | Description | Default |
|---|---|---|---|
| bits | No | Quantization bits: 8 (int8, default) or 4 (int4) | |
| vector | Yes | Float32 vector to quantize |
Output Schema
| Name | Required | Description |
|---|---|---|
| mse | No | |
| bits | No | |
| offset | No | |
| dimension | No | |
| quantized | No | |
| scale_factor | No | |
| compression_ratio | No | |
| storage_bytes_float32 | No | |
| storage_bytes_quantized | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
The description goes beyond annotations by stating the tool simulates quantization (not actual), and explicitly lists the outputs: quantized values, scale factor, and precision loss (MSE). This adds valuable behavioral context not present in the annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is three concise sentences, front-loading the main action and key benefit. Every sentence adds value—purpose, reduction factors, outputs, and use case—with no redundancy.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool's moderate complexity (simulation, optional bits, multiple return values) and that an output schema exists, the description covers all necessary aspects: what it does, the storage reduction, the outputs, and the use case, making it sufficiently complete.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100%, baseline is 3. The description adds the context that vectors are 'embedding vectors' and the quantization is 'int8 or int4', slightly expanding on schema descriptions. It also clarifies that the bits parameter corresponds to these types.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool simulates int8 or int4 quantization of float32 embedding vectors, coupling a specific verb with a specific resource. It distinguishes itself from sibling tools like normalize_vector or vector_similarity by focusing on quantization for storage reduction.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description mentions the tool is 'useful for understanding vector DB compression trade-offs,' providing clear context for when to use it. However, it does not explicitly exclude alternative tools or provide when-not-to-use guidance.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
vector_similarityARead-onlyIdempotentInspect
Compute similarity/distance between two float vectors: cosine similarity, dot product, Euclidean and Manhattan distance. Essential for vector DB relevance scoring, embedding evaluation, and nearest-neighbor testing.
| Name | Required | Description | Default |
|---|---|---|---|
| metric | No | Distance metric (default: all) | |
| vector_a | Yes | First vector as array of floats | |
| vector_b | Yes | Second vector as array of floats |
Output Schema
| Name | Required | Description |
|---|---|---|
| norm_a | No | |
| norm_b | No | |
| dimension | No | |
| dot_product | No | |
| interpretation | No | |
| cosine_distance | No | |
| cosine_similarity | No | |
| euclidean_distance | No | |
| manhattan_distance | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint=true and idempotentHint=true, making the tool's safe, deterministic behavior clear. The description adds the list of supported metrics but does not disclose additional behavioral traits beyond what annotations cover.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two concise sentences with no fluff. Front-loaded with the core action and followed by use-case context. Every part adds value.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool's low complexity (2 vector inputs, 5 metrics), existing schema coverage, and annotations, the description provides sufficient context for correct usage.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema description coverage is 100%, so parameters are fully documented. The description mentions metrics but adds no additional meaning beyond the schema's enum values and parameter descriptions.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
Description clearly states the tool computes similarity/distance between two float vectors with specific metrics. However, it does not differentiate from sibling tool 'embedding_similarity', which likely has overlapping functionality.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Description provides use cases like vector DB scoring and embedding evaluation but no explicit guidance on when to use this tool versus alternatives or when not to use it.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
vector_statsARead-onlyIdempotentInspect
Compute statistics for a float vector or matrix of vectors: mean, std, L2 norm, min, max, sparsity, top-K indices. Useful for debugging embedding quality and analyzing vector distributions in a vector DB.
| Name | Required | Description | Default |
|---|---|---|---|
| top_k | No | Return indices of top K absolute values (default: 5) | |
| matrix | No | Matrix of vectors (overrides vector). Returns per-vector + matrix-level stats. | |
| vector | No | Single vector to analyze |
Output Schema
| Name | Required | Description |
|---|---|---|
| max | No | |
| min | No | |
| std | No | |
| mean | No | |
| l2_norm | No | |
| sparsity | No | |
| dimension | No | |
| per_vector | No | |
| matrix_shape | No | |
| matrix_stats | No | |
| top_k_indices | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already indicate read-only and idempotent behavior. The description adds context about the types of statistics computed, how matrix input overrides vector, and that matrix mode returns per-vector plus matrix-level stats. This goes beyond annotations without contradiction.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is extremely concise: two sentences that cover functionality, inputs, and use case. No unnecessary words, and it is front-loaded with the core action.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the output schema exists and the description covers all input scenarios (single vector vs matrix) and the list of computed statistics, there are no apparent gaps. The description is complete for a statistics computation tool.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100% with clear descriptions for each parameter (top_k, vector, matrix). The description reiterates the matrix override and per-vector/matrix-level stats but does not add significant new meaning beyond the schema. Baseline 3 is appropriate.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool computes statistics for a float vector or matrix of vectors, listing specific outputs (mean, std, L2 norm, min, max, sparsity, top-K indices) and the use case of debugging embedding quality. This distinguishes it from sibling tools like vector_similarity or normalize_vector.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description provides context for use: 'debugging embedding quality and analyzing vector distributions in a vector DB.' While it does not explicitly state when not to use it or list alternatives, the use case is clear and sufficient given the sibling list.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
webhook_endpoint_createAInspect
Create a temporary webhook endpoint that captures incoming HTTP requests for one hour. Returns the webhook id, public URL, expiration timestamp, and current request count. Use together with webhook_endpoint_requests to inspect captured payloads.
| Name | Required | Description | Default |
|---|---|---|---|
| base_url | No | Optional public base URL. Default: https://ia-qa.com/mcp/webhook |
Output Schema
| Name | Required | Description |
|---|---|---|
| id | No | |
| url | No | |
| expires_at | No | |
| request_count | No | |
| retention_minutes | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Discloses temporary nature (one hour), returned fields, and captures HTTP requests. Annotations already indicate write operation, so description adds value beyond default.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two concise sentences with front-loaded verb, resource, return values, and usage tip. No wasted words.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Complete for a simple creation tool: explains purpose, temporary nature, return values, and complementary tool. Output schema exists, so return values need not be detailed further.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100% (only optional base_url), and description does not add extra parameter details. Baseline 3 as per guidelines.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states 'Create a temporary webhook endpoint' with specific verb and resource, and distinguishes from sibling 'webhook_endpoint_requests'.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Explicitly states to use together with webhook_endpoint_requests for inspecting payloads, providing clear usage context, though no exclusions or when-not-to-use.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
webhook_endpoint_requestsARead-onlyInspect
Fetch the requests captured by a webhook created with webhook_endpoint_create. Returns the newest requests first with method, headers, query params, body payload, and timestamps.
| Name | Required | Description | Default |
|---|---|---|---|
| id | Yes | Webhook id returned by webhook_endpoint_create | |
| limit | No | Maximum number of requests to return (1-100, default: 20) |
Output Schema
| Name | Required | Description |
|---|---|---|
| id | No | |
| requests | No | |
| expires_at | No | |
| request_count | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Beyond annotations (readOnlyHint), the description adds that results are ordered newest first and includes specific fields (method, headers, query params, body, timestamps). No contradictions.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two sentences, front-loaded with the core action, no extraneous words. Every sentence adds value.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Parameters are well-covered, output schema exists, and the description mentions return fields. The tool is simple; no gaps identified.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100%, so baseline 3. The description adds context for the 'id' parameter by linking to webhook_endpoint_create, which exceeds baseline.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the verb 'Fetch', the resource 'requests captured by a webhook', and specifies return fields. It distinguishes itself from sibling tools like webhook_endpoint_create.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description implies usage after creating a webhook with webhook_endpoint_create. It provides clear context but does not explicitly state when not to use or mention alternatives.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
web_security_auditARead-onlyInspect
Run a comprehensive web security audit combining headers, SSL, CORS, and cookies checks — then use an LLM to produce a prioritised remediation plan. Orchestrates security_headers_check + ssl_certificate_check + cors_test + cookie_security_audit in parallel, merges all findings, then asks an AI model to: (1) rank vulnerabilities by real-world exploitability, (2) generate a remediation roadmap, (3) produce fix code snippets for the detected stack. Returns both raw audit data and the AI analysis. Use this as a one-click security posture assessment.
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | Full URL to audit (e.g. https://example.com) | |
| model | No | LLM model for AI analysis (default: "qwen/qwen3-32b"). Set to "none" to skip AI analysis. | |
| api_key | No | Your Groq or HuggingFace API key. Required to enable AI analysis. |
Output Schema
| Name | Required | Description |
|---|---|---|
| fix | No | |
| key | No | |
| url | No | |
| name | No | |
| weak | No | |
| grade | No | |
| score | No | |
| tests | No | |
| value | No | |
| header | No | |
| issues | No | |
| secure | No | |
| weight | No | |
| cookies | No | |
| details | No | |
| message | No | |
| missing | No | |
| httpOnly | No | |
| sameSite | No | |
| risk_level | No | |
| weak_count | No | |
| cookies_found | No | |
| missing_count | No | |
| overall_grade | No | |
| origins_tested | No | |
| total_findings | No | |
| headers_checked | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint=true, but the description adds significant behavioral detail: it runs multiple checks in parallel, merges findings, and uses an AI model to rank and generate remediation. This goes well beyond annotations and fully discloses the internal workflow.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is a single well-structured paragraph that starts with the main purpose, explains the orchestration steps, the AI analysis functions, and the return type. It is concise, front-loaded, and contains no unnecessary information.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool's complexity (composite of sub-tools), the description fully explains the workflow: parallel execution, merging, AI analysis, and output. It notes the return of both raw data and AI analysis. An output schema exists, so detailed return structure is handled there.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
All three parameters (url, model, api_key) have descriptions in the input schema (100% coverage). The description adds minimal extra value beyond the schema, e.g., clarifying that setting model to 'none' skips AI analysis, which is already in the schema. Baseline 3 is appropriate.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool runs a comprehensive web security audit combining headers, SSL, CORS, and cookies checks, and uses an LLM for prioritised remediation. It specifically names the sub-tools it orchestrates, distinguishing it from the individual sibling tools like security_headers_check or cookie_security_audit.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description says 'Use this as a one-click security posture assessment,' implying it's for a holistic audit. It does not explicitly state when not to use it or when to prefer individual tools, but the orchestration description implicitly suggests using sub-tools for single checks. The option to skip AI analysis is noted.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
word_frequencyARead-onlyIdempotentInspect
Analyze word frequency in text. Returns top N words with counts and percentages. Supports English stopword filtering. Useful for content analysis, keyword extraction, and LLM output analysis.
| Name | Required | Description | Default |
|---|---|---|---|
| input | Yes | Text to analyze | |
| top_n | No | Return top N words (default: 20, max: 200) | |
| min_length | No | Minimum word length to include (default: 3) | |
| remove_stopwords | No | Remove common English stopwords (default: true) |
Output Schema
| Name | Required | Description |
|---|---|---|
| top_words | No | |
| total_words | No | |
| unique_words | No | |
| stopwords_removed | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint=true, idempotentHint=true, and destructiveHint=false, so the safety profile is clear. The description adds context about returning percentages and supporting stopword filtering, but does not disclose any additional behavioral traits (e.g., language limitations, case sensitivity).
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Three sentences, each serving a purpose: purpose statement, output description, and use cases. Front-loaded with the main action. No unnecessary words. Slightly better than minimal viable.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool's low complexity, an output schema exists (per context signals), and annotations cover safety, the description adequately covers purpose, output, and use cases. It could mention that stopword support is English-only, but overall it is fairly complete.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100%, so all parameters are documented. The description summarizes the output but does not add meaning beyond the schema (e.g., it mentions 'top N words' but schema already details `top_n` with default and max). Baseline 3 is appropriate.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the verb 'Analyze' and resource 'word frequency', and specifies it returns top N words with counts and percentages. It distinguishes from siblings by focusing on word frequency analysis, which is distinct from general text stats or token counting.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description mentions use cases like 'content analysis, keyword extraction, and LLM output analysis', but does not explicitly say when to avoid this tool or suggest alternatives. No explicit exclusions or comparative guidance is provided.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
xml_to_jsonARead-onlyIdempotentInspect
Convert an XML string to a JSON object. Supports attributes, nested elements, arrays, CDATA, and namespaces. Options: parse numbers, parse booleans, ignore attributes.
| Name | Required | Description | Default |
|---|---|---|---|
| input | Yes | XML string to convert | |
| attr_prefix | No | Prefix for attribute keys (default: "@_") | |
| ignore_attrs | No | Ignore XML attributes (default: false) | |
| parse_values | No | Auto-parse numbers and booleans (default: true) |
Output Schema
| Name | Required | Description |
|---|---|---|
| result | No | |
| key_count | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already indicate the tool is read-only, idempotent, and non-destructive. The description adds behavioral detail about supported XML constructs (attributes, CDATA, namespaces, arrays) and mention of parsing options (numbers, booleans, ignore attributes). This provides useful context beyond annotations, though it doesn't describe edge cases or error handling.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two concise sentences that front-load the main purpose and follow up with supported features. No wasted words. Every sentence adds information.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the output schema exists and annotations are complete, the description covers the essential purpose and features. It does not describe return format or error scenarios, but output schema likely covers return value. Adequate for a straightforward conversion tool.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100%, so baseline is 3. The description adds value by mapping 'parse numbers' and 'parse booleans' to the parse_values parameter and 'ignore attributes' to ignore_attrs, but does not mention attr_prefix. Overall, it supplements schema information but does not significantly exceed baseline.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool converts XML string to JSON object, and lists supported XML features (attributes, nested elements, arrays, CDATA, namespaces). The tool name itself is descriptive, and the description reinforces the purpose, differentiating it from sibling conversion tools like yaml_to_json or base64_decode.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
No explicit guidance on when to use this tool versus alternatives. The description only states what it does, not when it is appropriate. For example, no mention that this is for XML-to-JSON conversion only, and other tools handle different formats. Lack of context for decision-making.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
yaml_to_jsonARead-onlyIdempotentInspect
Parse a YAML string and return the equivalent JSON value. The reverse of json_to_yaml. Supports nested objects, arrays, anchors, aliases, multi-document streams, and all scalar types. Use when processing config files, CI/CD pipeline definitions, or OpenAPI specs authored in YAML.
| Name | Required | Description | Default |
|---|---|---|---|
| input | Yes | YAML string to parse | |
| multi | No | If true, parse all documents in a multi-document stream and return an array (default: false) |
Output Schema
| Name | Required | Description |
|---|---|---|
| json | No | |
| count | No | |
| documents | No |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already indicate safety and idempotence. Description adds details on supported features like anchors and multi-document streams, contributing beyond annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two sentences, front-loaded with core function, then added context. Every sentence adds value with no waste.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the simple tool with 2 parameters and output schema present, the description fully covers use cases and features needed for correct invocation.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema covers all parameters with descriptions. The description references multi-document streams relating to the multi parameter, but adds limited additional meaning beyond schema.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool parses YAML and returns JSON, with specific verb and resource. It distinguishes itself from the sibling json_to_yaml.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Provides explicit use cases like config files, CI/CD, OpenAPI specs. Though no explicit when-not-to-use, the context is well-defined.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
Claim this connector by publishing a /.well-known/glama.json file on your server's domain with the following structure:
{
"$schema": "https://glama.ai/mcp/schemas/connector.json",
"maintainers": [{ "email": "your-email@example.com" }]
}The email address must match the email associated with your Glama account. Once published, Glama will automatically detect and verify the file within a few minutes.
Control your server's listing on Glama, including description and metadata
Access analytics and receive server usage reports
Get monitoring and health status updates for your server
Feature your server to boost visibility and reach more users
For users:
Full audit trail – every tool call is logged with inputs and outputs for compliance and debugging
Granular tool control – enable or disable individual tools per connector to limit what your AI agents can do
Centralized credential management – store and rotate API keys and OAuth tokens in one place
Change alerts – get notified when a connector changes its schema, adds or removes tools, or updates tool definitions, so nothing breaks silently
For server owners:
Proven adoption – public usage metrics on your listing show real-world traction and build trust with prospective users
Tool-level analytics – see which tools are being used most, helping you prioritize development and documentation
Direct user feedback – users can report issues and suggest improvements through the listing, giving you a channel you would not have otherwise
The connector status is unhealthy when Glama is unable to successfully connect to the server. This can happen for several reasons:
The server is experiencing an outage
The URL of the server is wrong
Credentials required to access the server are missing or invalid
If you are the owner of this MCP connector and would like to make modifications to the listing, including providing test credentials for accessing the server, please contact support@glama.ai.
Discussions
No comments yet. Be the first to start the discussion!