TestMyVibes

Server Details

MCP-native AI browser testing for coding agents. Submit a URL + goal, get back action trail, bugs, screenshots, and WebM video your agent patches from directly. 43 tools, 12 AI evaluation personalities, combo tiers with auto-pause-on-bugs, throwaway email + SMS inboxes.

Status: Healthy
Last Tested:
Transport: Streamable HTTP
URL:

Tool Descriptions: A

Average 4.3/5 across 43 of 43 tools scored. Lowest: 3.4/5.

Server Coherence: A
Disambiguation: 4/5

The tools cover a wide range of functionality, and each has a clearly distinct purpose. For example, submit_test, submit_test_batch, submit_combo, and submit_interaction_scene are different submission types with their own parameters. The sheer number of tools (43) may cause some initial confusion, but the descriptions resolve the ambiguity.

Naming Consistency: 5/5

All tool names follow a consistent verb_noun pattern in snake_case (e.g., list_projects, create_project, get_test_results). The only exception is 'whoami', which is a common idiom and does not break the pattern. Overall, naming is highly predictable.

Tool Count: 3/5

43 tools is on the high side for a single server. The domain is broad (testing, worker marketplace, credits, cards, feedback, video), so the count is justifiable. However, it borders on being overwhelming, and some tools could be consolidated (e.g., multiple submit_* variants).

Completeness: 3/5

The tool surface covers core workflows like project creation, test submission, result retrieval, worker management, and credit operations. However, there are gaps: no update or delete for projects, no delete for worker offerings, and no user-facing combo editing (though combos are predefined). These are minor but noticeable.

Available Tools

43 tools
assemble_demo_video: Assemble a multi-segment demo video (hands-off). Grade: A

Stitches video clips + voiceover narration into a single MP4 published to Spaces. Each segment is one of: (a) videoUrl + narrationText (voiceover replaces video's audio track), (b) narrationText only (generates a brand-color title card sized to narration length), (c) videoUrl + audioUrl (drops in a pre-baked audio track). Returns a 24h signed URL to the final MP4. Use this for marketplace catalog submissions, tutorial videos, or any time you'd otherwise screen-record + iMovie by hand. Charged on success only; failed runs are free.
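The three segment shapes listed above can be told apart by which fields are present. The following is a minimal sketch, assuming each segment is a plain dict keyed by the field names quoted in the description (videoUrl, narrationText, audioUrl). The names classify_segment and check_segments and the returned labels are hypothetical, and rejecting any other field combination is an assumption rather than documented server behavior.

```python
from typing import List, Mapping

MAX_SEGMENTS = 12  # limit quoted in the `segments` parameter description


def classify_segment(seg: Mapping[str, str]) -> str:
    """Return which of the three documented segment shapes `seg` matches."""
    has_video = bool(seg.get("videoUrl"))
    has_text = bool(seg.get("narrationText"))
    has_audio = bool(seg.get("audioUrl"))

    if has_video and has_text and not has_audio:
        return "voiceover"       # (a) narration replaces the clip's audio track
    if has_text and not has_video and not has_audio:
        return "title_card"      # (b) generated card sized to the narration
    if has_video and has_audio and not has_text:
        return "prebaked_audio"  # (c) supplied audio track is dropped in
    # Assumption: anything else is treated as invalid in this sketch.
    raise ValueError(f"segment matches none of the documented shapes: {dict(seg)}")


def check_segments(segments: List[Mapping[str, str]]) -> List[str]:
    """Classify every segment and enforce the 12-segment limit."""
    if not 1 <= len(segments) <= MAX_SEGMENTS:
        raise ValueError(f"expected 1 to {MAX_SEGMENTS} segments, got {len(segments)}")
    return [classify_segment(seg) for seg in segments]
```

Running a check like this on the client before calling the tool would catch a malformed segment list early, though the server's own validation remains the authority.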

Parameters
Name | Required | Description | Default
segments | Yes | Ordered segment list. Concatenated in order. Max 12 segments / ~3 minutes total for the catalog use-case. | -
publishAs | No | Optional. When set, ALSO writes the final MP4 to a stable Spaces path at demo-videos/promoted/<publishAs>.mp4 so a public page (e.g. /ai/demo.mp4) can keep embedding the same URL forever. Re-running with the same publishAs key overwrites. Common values: "ai-landing" (powers /ai page), "anthropic-submission" (catalog submission). | -
outputAspect | No | 16:9 for desktop / YouTube, 9:16 for mobile / TikTok / Shorts, 1:1 for square social. | 16:9

Output Schema

Name | Required | Description
result | No | Tool result payload (JSON object)

Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations (readOnlyHint=false, destructiveHint=false) are consistent; description adds details like returning a 24h signed URL, charging on success only, and optional stable publishing path.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Single paragraph of five sentences, front-loaded with core operation, efficient with no extraneous text.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Covers segment types, constraints (max 12 segments, ~3 minutes), output (signed URL), use cases, and pricing; sufficient given presence of output schema.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 5/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema coverage is 100%, but description adds rich context: explains three segment types, defaults for voiceId, common videoUrl source, minDurationSec for title cards, and publishAs usage for persistent URLs.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool stitches video clips and voiceover into an MP4, specifies three segment types, and distinguishes from siblings like synthesize_voiceover by focusing on video assembly.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

Explicitly lists use cases (marketplace catalog submissions, tutorial videos) and mentions charging on success, but does not explicitly state when not to use or compare with sibling tools.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

capture_screenshots: Capture screenshots of a URL across viewports. Grade: A

Drive a headless Chromium against a URL and return a screenshot for each requested viewport (mobile / tablet / desktop). Optional clickPaths lets you grab the state behind a sequence of clicks (e.g. ['Sign in', '#email', 'Continue']). Pricing: 1 credit per single viewport, 5 credits for the desktop+tablet+mobile triple (otherwise 1 × viewport count). Output: signed Spaces URLs valid for 7 days. Use this for marketing screenshots, design QA, regression-watch baselines — anything where you need pixels without a full AI test.
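The pricing rule above is compact enough to express as a small function. This is a sketch of the rule exactly as the description words it (1 credit per viewport, except that the full desktop+tablet+mobile triple is priced at 5). The function name capture_credits is hypothetical, and de-duplicating repeated viewport names is an assumption.

```python
from typing import Iterable

KNOWN_VIEWPORTS = frozenset({"mobile", "tablet", "desktop"})


def capture_credits(viewports: Iterable[str]) -> int:
    """Credits charged for one capture_screenshots call, per the stated rule."""
    requested = set(viewports)  # assumption: duplicates are billed once
    unknown = requested - KNOWN_VIEWPORTS
    if not requested or unknown:
        raise ValueError(f"invalid viewport selection: {sorted(requested)}")
    if requested == KNOWN_VIEWPORTS:
        return 5  # quoted price for the desktop+tablet+mobile triple
    return len(requested)  # otherwise 1 credit per viewport


print(capture_credits(["mobile"]))
print(capture_credits(["mobile", "desktop"]))
print(capture_credits(["mobile", "tablet", "desktop"]))
```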

Parameters
Name | Required | Description | Default
url | Yes | Public URL to capture. Must be reachable from TMV's outbound IP. | -
settleMs | No | Milliseconds to wait after navigation + each click before screenshotting. Default 1500ms covers most SPAs. | -
viewports | Yes | Which viewports to capture. 'mobile' = iPhone 14 (390×844), 'tablet' = iPad Air (820×1180), 'desktop' = 1440×900 laptop. | -
clickPaths | No | Optional sequence of selectors / visible text to click before screenshotting. Each entry applied in order; missing selectors are skipped (non-fatal). | -
projectLabel | No | Audit label naming which of your projects requested this capture. | -

Output Schema

Name | Required | Description
result | No | Tool result payload (JSON object)

Behavior: 5/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

The description adds valuable behavioral context beyond annotations: it mentions headless Chromium, pricing model (1 credit per viewport, 5 for triple), output as signed Spaces URLs valid for 7 days, and click path behavior (non-fatal missing selectors). No contradiction with annotations.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Five sentences, front-loaded with action and purpose, each adding value (what, how, pricing, output, use cases). No superfluous content.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given output schema exists, description need not detail return structure, but it does mention signed URLs with 7-day validity. All key aspects are covered: input url, viewports, optional clicks, timing, pricing, and intended use.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema coverage is 100%, baseline 3. Description adds meaning beyond schema by explaining pricing logic for viewports, that clickPaths are non-fatal for missing selectors, and provides default settleMs coverage for SPAs.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states that the tool drives headless Chromium to capture screenshots across the specified viewports, with optional click paths. It is easy to tell apart from its sibling tools, none of which are screenshot-related, and it uses specific verbs ('Drive', 'return a screenshot').

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

Explicitly recommends use cases: marketing screenshots, design QA, regression baselines. While it doesn't list when not to use or alternatives, the sibling tools are sufficiently diverse and unrelated, making the guidance clear.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

claim_job: Claim a job. Grade: A

Atomically take ownership of a pending job. Returns the full checklist the worker needs to walk through, plus the SLA deadline. After this call, the job is yours; submit results with submit_job_results when done, or it expires after the SLA and is returned to the queue.
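The claim, submit, and expire lifecycle described above can be modelled as a small state machine. This is an in-memory sketch only: the Job and JobQueue classes, their fields, and the use of plain numeric timestamps are all hypothetical, and a real server would need a transactional update to make the claim atomic across concurrent workers.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Job:
    job_id: str
    sla_seconds: float
    owner: Optional[str] = None       # None means the job is pending
    deadline: Optional[float] = None  # set when the job is claimed
    done: bool = False


class JobQueue:
    def __init__(self) -> None:
        self._jobs: Dict[str, Job] = {}

    def add(self, job: Job) -> None:
        self._jobs[job.job_id] = job

    def _expire(self, job: Job, now: float) -> None:
        # A claimed job whose SLA has passed goes back to the queue.
        if not job.done and job.owner is not None and now > job.deadline:
            job.owner = None
            job.deadline = None

    def claim(self, job_id: str, worker: str, now: float) -> float:
        """Take ownership of a pending job and return its SLA deadline."""
        job = self._jobs[job_id]
        self._expire(job, now)
        if job.done or job.owner is not None:
            raise RuntimeError("job is not pending")
        job.owner = worker
        job.deadline = now + job.sla_seconds
        return job.deadline

    def submit(self, job_id: str, worker: str, now: float) -> None:
        """Accept results only from the current owner, within the SLA."""
        job = self._jobs[job_id]
        self._expire(job, now)
        if job.owner != worker:
            raise RuntimeError("claim expired or job is owned by someone else")
        job.done = True
```

In this model a second claim on an owned job fails, and a claim that outlives its SLA silently returns to the pending pool, which matches the behavior the description promises.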

Parameters
Name | Required | Description | Default
jobId | Yes | The jobId from list_available_jobs. | -

Output Schema

Name | Required | Description
result | No | Tool result payload (JSON object)

Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

The description reveals atomicity, the return of checklist and SLA deadline, and the consequence of expiry—all valuable beyond annotations (which only show non-readonly and non-destructive). No contradictions with annotations.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Three sentences, front-loaded with the core action, followed by essential workflow details. Every sentence adds necessary information without redundancy.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the output schema exists (so return values are documented), the description covers the claim's effect, result contents, and post-claim expectations. For a simple one-parameter tool, this is complete.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema coverage is 100% for the single parameter 'jobId', which already includes a description. The description adds context by specifying that the jobId comes from 'list_available_jobs', helping the agent understand provenance. This lifts it above the baseline of 3.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description uses specific verbs ('take ownership') and names the resource ('pending job'), clearly distinguishing it from siblings like 'list_available_jobs' (listing) and 'submit_job_results' (submitting). It fully captures the tool's atomic claim action.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description contextualizes the tool by mentioning that the job comes from 'list_available_jobs' and that results must be submitted via 'submit_job_results', implying the workflow. It also warns about SLA expiry. However, it does not explicitly state when not to use the tool (e.g., if already claimed).

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

complete_checkout: Complete checkout (ChatGPT Instant Checkout). Grade: A

Called by ChatGPT (or any agent runtime supporting Stripe's Shared Payment Token flow) after the user clicks Pay in an inline payment widget. Receives the SPT, charges it via Stripe, and credits the user's TMV account synchronously. The checkout_session_id is the Stripe Checkout Session ID returned by top_up_credits.

Parameters
Name | Required | Description | Default
buyer | No | - | -
payment_data | Yes | - | -
checkout_session_id | Yes | Stripe Checkout Session ID minted by top_up_credits. | -

Output Schema

Name | Required | Description
result | No | Tool result payload (JSON object)

Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

The description reveals that the tool charges via Stripe and credits the user's TMV account synchronously, which adds behavioral context beyond the annotations. Annotations indicate non-readOnly and non-destructive, but the description clarifies the specific side effects. No contradictions with annotations.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is concise, consisting of three sentences that efficiently convey the purpose, trigger, and key parameter relationship. It is front-loaded with the core functionality and contains no unnecessary words.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 3/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's complexity and the presence of an output schema, the description covers the essential flow and references the prerequisite tool. However, it lacks details on error handling, idempotency, or rate limits, which would be helpful for an AI agent. It is minimally adequate.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 2/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is low (33%), and the tool description only adds context for the checkout_session_id parameter, mentioning it comes from top_up_credits. It does not provide additional meaning for buyer or payment_data beyond what the schema already includes. The description should compensate more for the low coverage.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's function: completing a checkout by charging a Stripe payment token and crediting a TMV account. It specifies the trigger ('after user clicks Pay') and the source of the checkout_session_id, making the purpose unambiguous and distinguishing it from sibling tools like top_up_credits.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description explicitly states when the tool is called ('after the user clicks Pay in an inline payment widget') and refers to the prerequisite call to top_up_credits. However, it does not explicitly mention when not to use it or alternative tools, leaving some room for improvement.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

create_project: Create project. Grade: A

Register a new site for testing. Returns the projectId you can attach to future submit_test calls.
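For readers wiring this up by hand, the sketch below builds the JSON-RPC request body an MCP client would send for this tool, using the standard tools/call method and the parameter names from the table in this entry. The helper name create_project_request and the example URL are hypothetical; most MCP client libraries construct this envelope for you, so treat it as an illustration of the wire shape rather than required code.

```python
import json
from typing import Any, Dict, Optional


def create_project_request(
    url: str,
    name: str,
    default_job_type: Optional[str] = None,
    request_id: int = 1,
) -> Dict[str, Any]:
    """Build an MCP tools/call request for create_project."""
    arguments: Dict[str, Any] = {"url": url, "name": name}
    if default_job_type is not None:
        # Optional; the listing shows "General QA" as the default value.
        arguments["defaultJobType"] = default_job_type
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "create_project", "arguments": arguments},
    }


request = create_project_request("https://example.com", "Example site")
print(json.dumps(request, indent=2))
```

The projectId in the response can then be passed to later submit_test calls, as the description notes.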

Parameters
Name | Required | Description | Default
url | Yes | The site's URL. | -
name | Yes | Display name for the project. | -
defaultJobType | No | Default test category. | General QA

Output Schema

Name | Required | Description
result | No | Tool result payload (JSON object)

Behavior: 3/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations already indicate it is a non-read-only, non-destructive, non-idempotent creation operation. The description adds the return value (projectId) but no further behavioral details like duplicate handling or validation.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is two short, front-loaded sentences with no wasted words, efficiently conveying purpose and output.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the simple parameters and presence of an output schema (implied by returning projectId), the description is adequate. It could mention which parameter is the site identifier but is sufficient.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

All parameters have descriptions in the schema (100% coverage), so the description adds no additional meaning. Baseline of 3 is appropriate.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool creates a new site for testing and returns a projectId for future use, distinguishing it from sibling tools like list_projects or submit_test.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description explains the tool should be used before submit_test to get a projectId, but does not explicitly state when not to use it or suggest alternatives like reusing existing projects.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

create_worker_offering: Publish a new service offering. Grade: A

Add a named, priced offering to your worker menu. Customers see name + description + creditsCharged + estDurationHr and pick directly. Worker earns 75% of credits charged (floor-rounded); TMV keeps 25%. Price must be a whole number of credits, ≥ 15. Until your account is uncapped (3 quality-scored jobs, OR 1 four-star+ customer review, OR $100 cleared earnings), the per-offering ceiling is 50 credits.
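The pricing and payout rules above reduce to a few lines of integer arithmetic. The sketch below encodes them as stated; the function names are hypothetical, treating the $100 and four-star thresholds as inclusive is an assumption, and so is giving TMV whatever remains after the worker's floor-rounded 75%.

```python
from typing import Tuple

MIN_PRICE = 15           # floor quoted in the description
NEW_WORKER_CEILING = 50  # ceiling until the account is uncapped


def is_uncapped(quality_scored_jobs: int, best_review_stars: int,
                cleared_earnings_usd: float) -> bool:
    """Any one of the three listed conditions lifts the ceiling."""
    return (quality_scored_jobs >= 3
            or best_review_stars >= 4
            or cleared_earnings_usd >= 100)


def validate_price(credits: int, uncapped: bool) -> None:
    """Raise if `credits` breaks the whole-number, floor, or ceiling rules."""
    if isinstance(credits, bool) or not isinstance(credits, int):
        raise TypeError("price must be a whole number of credits")
    if credits < MIN_PRICE:
        raise ValueError(f"price must be at least {MIN_PRICE} credits")
    if not uncapped and credits > NEW_WORKER_CEILING:
        raise ValueError(f"new-worker ceiling is {NEW_WORKER_CEILING} credits")


def split_payout(credits: int) -> Tuple[int, int]:
    """Return (worker_share, tmv_share) with the worker's 75% floor-rounded."""
    worker = credits * 3 // 4  # integer floor of 75%
    return worker, credits - worker  # assumption: TMV keeps the remainder


print(split_payout(15))
print(split_payout(50))
```

Because of the floor rounding, a 15-credit offering pays the worker 11 credits, not 11.25, so prices that are multiples of 4 avoid losing a fraction to rounding.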

Parameters
Name | Required | Description | Default
name | Yes | - | -
description | Yes | - | -
specialties | No | - | -
estDurationHr | No | - | -
creditsCharged | Yes | Whole-number credit price. Floor=15. New-worker ceiling=50. | -

Output Schema

Name | Required | Description
result | No | Tool result payload (JSON object)

Behavior: 5/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Beyond annotations (readOnlyHint=false, etc.), the description details economic behavior: worker earns 75%, TMV keeps 25%, floor-rounded credits, and per-offering ceiling conditions. It also explains the cap removal criteria, providing full behavioral context.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is five sentences, each delivering essential information: purpose, customer visibility, economics, and constraints. It is front-loaded and concise without unnecessary detail.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given 5 parameters and an output schema, the description covers all critical aspects: what the tool does, how customers see it, economic split, pricing limits, and conditions for lifting caps. It is sufficient for an agent to correctly invoke the tool.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema coverage is low (20%), but the description adds meaning by explaining how parameters (name, description, creditsCharged, estDurationHr) are used by customers. It also spells out the pricing rules for creditsCharged in more detail than the schema's brief note (floor 15, new-worker ceiling 50). However, optional parameters like specialties are not discussed.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool adds a named, priced offering to the worker menu, using specific verbs like 'Add' and 'publish'. It distinguishes itself from sibling creation tools (e.g., create_project, submit_job) by focusing on service offerings for workers.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description explains when to use the tool: to add an offering that customers see and pick. It provides pricing constraints (minimum 15 credits, ceiling 50 for new accounts) and notes the revenue split, but does not explicitly list when to avoid using it or alternatives.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

freeze_test_card: Cancel a previously-issued test card. Grade: A
Destructive

Cancels the card (no further authorizations). Idempotent. Auto-freeze runs daily for any card past its 24h expiry; call this explicitly to freeze immediately after a successful checkout test.
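To make the two behaviors in this description concrete, the sketch below models an idempotent freeze (calling it twice leaves the card in the same state) and a sweep that freezes cards past the 24h expiry. It is a local illustration only: the Card class, its fields, and both function names are hypothetical, and it says nothing about how the server implements either behavior.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, List

CARD_LIFETIME = timedelta(hours=24)  # expiry window quoted in the description


@dataclass
class Card:
    card_id: str
    issued_at: datetime
    frozen: bool = False


def freeze(card: Card) -> Card:
    """Idempotent: freezing an already-frozen card changes nothing."""
    card.frozen = True
    return card


def auto_freeze_expired(cards: Iterable[Card], now: datetime) -> List[str]:
    """Freeze every unfrozen card older than 24h; return the IDs frozen now."""
    frozen_now: List[str] = []
    for card in cards:
        if not card.frozen and now - card.issued_at > CARD_LIFETIME:
            freeze(card)
            frozen_now.append(card.card_id)
    return frozen_now
```

Since the sweep is described as daily, a card can stay usable for some time after its expiry, which is why the description recommends an explicit freeze right after a successful checkout test.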

Parameters
Name | Required | Description | Default
cardId | Yes | The test card ID returned by provision_test_card. | -

Output Schema

Name | Required | Description
result | No | Tool result payload (JSON object)

Behavior: 1/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Description claims 'Idempotent' but annotations set idempotentHint=false. This is a contradiction. Additionally, annotations have destructiveHint=true which matches the cancel action, but the idempotency claim is inconsistent and misleading.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Three sentences are efficient and front-loaded with the core purpose. Every sentence adds value.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given annotations and output schema existence, the description is largely complete for a simple mutation. It explains behavior, idempotency (though contradictory), and usage context. Lacks mention of return value but output schema likely covers that.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema covers 100% of parameters with adequate description for cardId. The tool description does not add extra parameter semantics beyond the schema, so baseline 3 is appropriate.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

Description clearly states the tool cancels a test card and prevents further authorizations. The verb 'Cancels' and resource 'card' are specific. It distinguishes from sibling tools like provision_test_card and list_test_cards.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 5/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

Explicit guidance on when to use: 'call this explicitly to freeze immediately after a successful checkout test.' Also mentions the alternative auto-freeze that runs daily for expired cards.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

get_combo_status: Rolled-up status for a submitted combo bundle. Grade: A
Read-only, Idempotent

Single-call combo status: per-leg breakdown + cumulative bug counts across legs + pause-on-bugs-threshold proximity + estimated time remaining. Use this instead of polling N individual jobIds for a combo. Free.
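As an illustration of what a rolled-up status involves, the sketch below aggregates per-leg records into cumulative bug counts and distance to a pause threshold. Every field name here (status, bugs, totalBugs, bugsUntilPause, and so on) is hypothetical, since the listing only documents the output as a generic result object. The point is the aggregation itself, which the tool performs server-side so the agent does not have to poll each leg.

```python
from typing import Any, Dict, List


def rollup_combo(legs: List[Dict[str, Any]], pause_threshold: int) -> Dict[str, Any]:
    """Summarise per-leg results the way a single status call might."""
    total_bugs = sum(leg["bugs"] for leg in legs)
    finished = sum(1 for leg in legs if leg["status"] == "done")
    return {
        "legsFinished": finished,
        "legsTotal": len(legs),
        "totalBugs": total_bugs,
        # How many more bugs until the pause-on-bugs threshold is reached.
        "bugsUntilPause": max(pause_threshold - total_bugs, 0),
        "wouldPause": total_bugs >= pause_threshold,
    }


legs = [
    {"status": "done", "bugs": 2},
    {"status": "running", "bugs": 1},
    {"status": "queued", "bugs": 0},
]
print(rollup_combo(legs, pause_threshold=5))
```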

Parameters
Name | Required | Description | Default
comboId | Yes | The comboId returned by submit_combo (one per combo run; distinct from each leg's jobId). | -

Output Schema

Name | Required | Description
result | No | Tool result payload (JSON object)

Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations already indicate read-only, idempotent, and non-destructive behavior. The description adds value by explaining the composite nature of the response, including cumulative bug counts and estimated time remaining. It also notes the tool is 'Free', providing cost transparency.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is three short sentences, front-loaded with the tool's purpose and key features. Every word adds value, with no redundant or filler content.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the simple input (one parameter) and the presence of an output schema, the description provides sufficient context. It outlines the major output categories (breakdown, bug counts, threshold, time remaining) and usage guidance, making the tool's functionality clear.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema coverage is 100% with a detailed description of the single parameter 'comboId'. The tool description does not add further semantic meaning beyond what the schema already provides, so a baseline score of 3 is appropriate.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool returns a composite status for a combo bundle, listing specific output elements (per-leg breakdown, bug counts, pause threshold, time remaining). It sets itself apart from its siblings by explicitly advising agents to use it instead of polling individual jobIds.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description directly tells when to use the tool: 'Use this instead of polling N individual jobIds for a combo.' While it doesn't specify when not to use, this explicit guidance is strong and clear.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

get_credit_balance: Get credit balance. Grade: A
Read-only, Idempotent

Get the current account's credit balance. Returns total valid credits, raw batches, and a warning flag if the balance is below the threshold a typical test costs.

Parameters

No parameters

Output Schema

Name | Required | Description
result | No | Tool result payload (JSON object)

Behavior: 5/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations already indicate the tool is read-only, idempotent, and non-destructive. The description adds valuable context about the return data (total valid credits, raw batches, warning flag), which goes beyond what annotations provide. There is no contradiction.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is two sentences long, concise, and front-loaded with the main purpose. Every sentence adds value and there is no extraneous information.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

For a simple read-only tool with no input parameters and an existing output schema (as indicated), the description covers the essential return fields and the threshold warning. It is complete enough for an AI agent to understand the tool's behavior.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters5/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

The tool has zero parameters, so no semantic information is needed. The description does not attempt to describe parameters, which is appropriate. The schema coverage is 100% by default, meeting the baseline expectation.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose: 'Get the current account's credit balance.' It specifies the resource (credit balance) and the action (get). This distinguishes it from sibling tools like top_up_credits (adds credits) and list_credit_packs (lists packs).

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines3/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description implies the tool is used to fetch the credit balance, but it does not explicitly state when to use this tool over alternatives or provide any usage restrictions. Since there are no direct alternatives for checking the balance, the lack of explicit guidance is acceptable but not ideal.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
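
A minimal sketch of how an agent might act on the get_credit_balance result before submitting a test. The payload keys `totalCredits` and `lowBalance` are hypothetical stand-ins for the "total valid credits" and "warning flag" that the description mentions; the real field names are not given on this page.

```python
from typing import Any, Mapping


def preflight(balance: Mapping[str, Any], estimated_cost: int) -> str:
    """Decide what to do before spending credits.

    `totalCredits` and `lowBalance` are assumed key names, not the
    documented output schema.
    """
    total = int(balance.get("totalCredits", 0))
    if total < estimated_cost:
        # Not enough credits for this run: top up first.
        return "top_up"
    if balance.get("lowBalance", False):
        # Enough for this run, but the server flagged a low balance.
        return "proceed_with_warning"
    return "proceed"
```

Keeping the numeric comparison separate from the warning flag means the agent still behaves sensibly if the flag is missing from the payload.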

get_integration_guide: Get TestMyVibes integration guide (grade A)
Read-only, Idempotent

Returns the canonical guide for using TMV from a coding-agent context. Covers the fix-test-retest loop, how to write a good test prompt, how to read the actionTrail / consoleErrors / failedRequests outputs, and common gotchas. Call this first if you're a new agent on a project — it'll save you a debug session. The same content is served at https://testmyvibes.com/docs/coding-agents.

Parameters (JSON Schema)
Name | Required | Description | Default

No parameters

Output Schema

Parameters (JSON Schema)
Name | Required | Description
result | No | Tool result payload (JSON object)
Behavior5/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations already declare readOnlyHint, idempotentHint, and destructiveHint, but the description adds valuable context: it covers the fix-test-retest loop, writing test prompts, reading outputs, and common gotchas. It also mentions the same content is at a URL, which helps the agent understand the nature of the response.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Four sentences, front-loaded with purpose, followed by a content summary, a usage hint, and a pointer to the web copy of the guide. Every sentence earns its place. No verbosity.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

With 0 parameters and a rich output schema (implied), combined with complete annotations, the description fully covers what the tool does, when to use it, and what to expect. No gaps.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

The input schema has zero parameters and schema coverage is 100%, so the description does not need to add parameter details. Baseline for 0 params is 4, and the description provides no additional parameter info, which is appropriate.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states 'Returns the canonical guide for using TMV from a coding-agent context.' It specifies the exact resource (guide) and action (returns), and distinguishes itself from sibling tools by positioning itself as a first stop for new agents.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines5/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

Explicitly says 'Call this first if you're a new agent on a project' – providing a clear when-to-use directive. It also references common gotchas and outputs, implying context where it's valuable, and distinguishes from other tools by focusing on integration guidance.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

get_scene_status: Rolled-up status for an interaction scene (grade A)
Read-only, Idempotent

Single-call view of every role in a scene: per-role job status, signals fired so far, shared state. Use instead of polling N individual jobIds.

Parameters (JSON Schema)
Name | Required | Description | Default
sceneId | Yes | The sceneId returned by submit_interaction_scene. |

Output Schema

Parameters (JSON Schema)
Name | Required | Description
result | No | Tool result payload (JSON object)
Behavior3/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations already declare readOnlyHint=true and idempotentHint=true, so the description does not need to restate safety. It adds value by describing the response content but lacks details on error handling, latency, or what happens with invalid sceneIds, which would be helpful.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is two sentences, each earning its place: first sentence defines the tool's purpose and output, second provides usage guidance. No wasted words, front-loaded with key information.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's simplicity, the presence of an output schema, and comprehensive annotations, the description covers all necessary context: what the tool does, when to use it, and what the input parameter is. No gaps remain for effective use.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Input schema coverage is 100% and the description only indirectly explains sceneId by referencing its origin (from submit_interaction_scene). The description does not add significant meaning beyond the schema's existing description, resulting in a baseline score.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose4/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states that the tool provides a single-call view of every role in a scene and lists the specific data returned (job status, signals, shared state). It distinguishes itself from polling individual jobs but does not explicitly differentiate itself from similar sibling tools such as get_combo_status or get_test_status, leaving some ambiguity.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description explicitly states to use this tool instead of polling N individual jobIds, providing clear positive guidance. It does not specify when not to use it or mention alternatives among siblings, but the context is sufficient for the intended use case.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
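
As an illustration of what "single-call view of every role" can buy an agent, the sketch below rolls a scene payload up into one done/not-done answer. The `roles` and `status` keys and the terminal status labels are assumptions made for the example; this page does not show the real output schema.

```python
from collections import Counter
from typing import Any, Mapping

# Hypothetical terminal status labels; the real values are not documented here.
TERMINAL_STATUSES = frozenset({"completed", "failed"})


def summarize_scene(scene: Mapping[str, Any]) -> dict:
    """Collapse per-role job statuses into a count and a single 'done' flag."""
    roles = scene.get("roles", [])
    statuses = [role.get("status", "unknown") for role in roles]
    return {
        # A scene with no roles is treated as not done rather than vacuously done.
        "done": bool(statuses) and all(s in TERMINAL_STATUSES for s in statuses),
        "counts": dict(Counter(statuses)),
    }
```

With a roll-up like this, the agent checks one payload per poll instead of tracking each role's jobId separately.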

get_test_results: Get test results (grade A)
Read-only, Idempotent

Fetch full results for a completed test: the checklist outcomes, the report summary, and any AI-generated analysis. Returns status='pending' if the test isn't done.

Parameters (JSON Schema)
Name | Required | Description | Default
jobId | Yes | The job ID. |

Output Schema

Parameters (JSON Schema)
Name | Required | Description
result | No | Tool result payload (JSON object)
Behavior4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations already declare readOnlyHint, idempotentHint, and destructiveHint, so the description adds value by disclosing the pending status behavior. No contradictions with annotations.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Two concise sentences that front-load the purpose and include a key behavioral detail. No wasted words.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

The description covers the core functionality and the pending state. With an output schema present, return values are not needed. It omits prerequisites like how to obtain jobId, but context is sufficient for a low-complexity tool.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema coverage is 100% with jobId described as 'The job ID.' The description does not add additional parameter semantics beyond what the schema provides, resulting in baseline score.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states that the tool fetches full results for a completed test and lists the specific contents (checklist outcomes, report summary, AI analysis). It distinguishes itself from sibling tools such as get_test_status by implying that it is for full results.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines3/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description implies usage after a test is complete but does not explicitly guide when to use this tool versus alternatives like get_test_status. It mentions the pending status, which offers some context, but lacks explicit exclusions or alternative recommendations.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
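
Because get_test_results returns status='pending' until the test is done, a caller needs some kind of wait loop. The following is a rough sketch only: `fetch` is a hypothetical callable that wraps the actual tool call, and the assumption that `status` sits at the top level of the payload is not confirmed by this page.

```python
import time
from typing import Any, Callable, Mapping


def wait_for_results(
    fetch: Callable[[str], Mapping[str, Any]],
    job_id: str,
    interval_s: float = 5.0,
    max_attempts: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> Mapping[str, Any]:
    """Call `fetch(job_id)` until the payload is no longer pending."""
    for attempt in range(max_attempts):
        result = fetch(job_id)
        if result.get("status") != "pending":
            return result
        # Skip the final sleep: there is no further attempt to wait for.
        if attempt < max_attempts - 1:
            sleep(interval_s)
    raise TimeoutError(f"job {job_id} still pending after {max_attempts} attempts")
```

The `sleep` parameter is injectable so the loop can be exercised in tests without real delays.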

get_test_status: Get test status (grade B)
Read-only, Idempotent

Look up the current status of a submitted test job.

Parameters (JSON Schema)
Name | Required | Description | Default
jobId | Yes | The job ID returned by submit_test. |

Output Schema

Parameters (JSON Schema)
Name | Required | Description
result | No | Tool result payload (JSON object)
Behavior3/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations already declare readOnlyHint=true, idempotentHint=true, and destructiveHint=false, so the safety profile is clear. The description adds a behavioral detail (returns current status) but does not elaborate on response format or performance characteristics. With annotations doing most of the work, the description provides marginal addition.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

A single, concise sentence with no extraneous information. It is appropriately sized for a simple lookup tool and front-loads the key information.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool is simple with one parameter and has rich annotations and an output schema, the description is complete enough. It could mention that this is solely for status and not results, but overall it adequately serves the agent's needs.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema coverage is 100%, and the schema already describes jobId as 'The job ID returned by submit_test.' The description does not add any further semantic context for the parameter, so it meets the baseline expectation for high schema coverage.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose4/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the action (look up) and the resource (the current status of a submitted test job). The verb and resource implicitly separate it from get_test_results (which likely provides full results) and get_scene_status (a different resource). However, it does not explicitly differentiate itself from other status lookups, leaving some ambiguity.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines2/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides no guidance on when to use this tool versus alternatives such as get_test_results or get_scene_status. There is no mention of context or prerequisites, leaving the agent to infer usage without support.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

get_worker_earnings: Get worker earnings (grade A)
Read-only, Idempotent

Show the calling worker's payout balance: lifetime earned, lifetime paid-out, currently pending. Includes Stripe Connect status and whether the pending balance meets the auto-payout threshold.

Parameters (JSON Schema)
Name | Required | Description | Default

No parameters

Output Schema

Parameters (JSON Schema)
Name | Required | Description
result | No | Tool result payload (JSON object)
Behavior3/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations already indicate readOnlyHint=true, destructiveHint=false, idempotentHint=true. The description adds no further behavioral traits such as authentication requirements or rate limits. It does not contradict annotations.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Two sentences that are concise and well structured, listing all key information without redundancy. Every phrase earns its place.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given zero parameters and the presence of an output schema, the description sufficiently covers the tool's purpose and the content of its response. No obvious gaps remain.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

There are no parameters, so the description does not add parameter-level meaning. However, it explains the output fields (lifetime earned, paid-out, pending, Stripe status, threshold), which is valuable beyond the empty schema. With 100% schema coverage (no params), a baseline of 3 applies, but the detail on output semantics elevates it.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool shows the calling worker's payout balance including lifetime earned, paid-out, pending, Stripe Connect status, and auto-payout threshold. It specifies the verb 'Show' and the resource 'worker earnings', effectively distinguishing it from sibling tools which involve other operations.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines3/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description implicitly indicates usage for checking the calling worker's personal earnings, but lacks explicit guidance on when to use this tool versus alternatives like get_credit_balance or get_combo_status. No when-not-to-use or alternative mentions are provided.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

list_available_jobs: List available jobs to claim (grade A)
Read-only, Idempotent

Return the queue of pending jobs the calling worker could pick up. Excludes jobs owned by the calling account (you can't test your own site) and jobs already claimed by another worker. Returns the freshest jobs first.

Parameters (JSON Schema)
Name | Required | Description | Default
jobType | No | Filter to a single job-type label (e.g. 'General QA'). Omit to see all types. |
maxResults | No | |

Output Schema

Parameters (JSON Schema)
Name | Required | Description
result | No | Tool result payload (JSON object)
Behavior5/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations already indicate readOnlyHint, idempotentHint, and destructiveHint. The description adds that it returns the freshest jobs first and excludes owned/claimed jobs, providing context beyond annotations. No contradiction.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Three sentences, each providing essential information without redundancy. Front-loaded with the core action. No wasteful words.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

The tool has an output schema (not provided but referenced), so return values need not be explained. It covers filtering and ordering. Minor omission: no mention of pagination, but the maxResults parameter implies a limit. Overall adequate for a simple list tool.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema coverage is 50% (one parameter described, one not). The description does not add meaning beyond the schema; the schema already explains 'jobType' filter and 'maxResults' default/max/min. Hence baseline 3.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states that the tool returns the queue of pending jobs available for the calling worker to pick up, and specifies the exclusions (jobs owned by the calling account or already claimed by another worker). It distinguishes itself from siblings such as 'list_my_claimed_jobs' by noting that the calling account's own jobs and already-claimed jobs are excluded.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description implies usage context for listing available jobs to claim, and implicitly contrasts with 'list_my_claimed_jobs' by excluding owned jobs. However, it does not explicitly state when not to use this tool or provide direct alternatives.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
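
A small sketch of how a caller might assemble the arguments for list_available_jobs. The `jobType` and `maxResults` names come from the parameter table; the helper itself and its lower-bound check are illustrative assumptions, since the allowed range of `maxResults` is not shown on this page.

```python
from typing import Any, Dict, Optional


def build_job_query(job_type: Optional[str] = None,
                    max_results: Optional[int] = None) -> Dict[str, Any]:
    """Build the argument object, omitting anything the caller did not set."""
    args: Dict[str, Any] = {}
    if job_type is not None:
        # Omitting jobType entirely means "all types", per the schema description.
        args["jobType"] = job_type
    if max_results is not None:
        if max_results < 1:
            # Assumed lower bound; the real limits live in the JSON Schema.
            raise ValueError("max_results must be at least 1")
        args["maxResults"] = max_results
    return args
```

Leaving unset options out of the object, rather than sending nulls, lets the server apply its own defaults.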

list_combos: List packaged AI-agent combo bundles (grade A)
Read-only, Idempotent

Returns the full combo catalog: cheap pre-launch smoke checks → the Whole Kit & Kaboodle (every agent, multiple viewports). Each combo lists its legs, estimated credit cost (recomputed from the live personality catalog), estimated duration, the bug-threshold that auto-pauses + refunds the remaining legs, and (where applicable) the cheaper combo we recommend running first. Read-only; charges nothing.

Parameters (JSON Schema)
Name | Required | Description | Default

No parameters

Output Schema

Parameters (JSON Schema)
Name | Required | Description
result | No | Tool result payload (JSON object)
Behavior4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations already declare readOnlyHint=true, idempotentHint=true, destructiveHint=false. The description adds 'Read-only; charges nothing', confirming no side effects or cost. It also describes output details (legs, cost, duration, threshold), providing context beyond annotations.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Three sentences, no wasted words. The first establishes purpose with specific examples, the second details each combo's contents, and the third confirms that the call is read-only and free. Front-loaded and efficient.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool has an output schema and no parameters, the description provides a comprehensive list of what the output contains (legs, cost, duration, threshold, recommendation). It is complete for a list tool.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Input schema has 0 parameters, so no parameter semantics are needed. Schema coverage is 100%. The description adds no parameter info, but none is required. Baseline score of 4 applies.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

Title 'List packaged AI-agent combo bundles' and description 'Returns the full combo catalog' clearly state the tool lists all combos. The description further details what each combo contains, distinguishing it from sibling tools like get_combo_status.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines3/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description implies usage for browsing the catalog but does not explicitly state when to use this tool vs alternatives like get_combo_status or list_available_jobs. Since it has no parameters and is read-only, the need for guidelines is lower, but explicit guidance would improve clarity.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
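
The catalog is described as naming, where applicable, a cheaper combo to run first. One way an agent could turn that into a run order is sketched below. The keys `id` and `recommendFirst` are hypothetical, as are the combo identifiers in the usage note; the real catalog shape is not shown on this page.

```python
from typing import Any, List, Mapping, Sequence


def run_order(catalog: Sequence[Mapping[str, Any]], target_id: str) -> List[str]:
    """Follow the 'run this cheaper combo first' links back from the target."""
    by_id = {combo["id"]: combo for combo in catalog}
    order: List[str] = []
    seen = set()
    current = target_id
    while current is not None and current not in seen:
        if current not in by_id:
            raise KeyError(f"unknown combo id: {current}")
        seen.add(current)  # guards against a cyclic recommendation chain
        order.append(current)
        current = by_id[current].get("recommendFirst")
    order.reverse()  # cheapest prerequisite first, target last
    return order
```

For a made-up catalog in which `full` recommends `standard` and `standard` recommends `smoke`, `run_order(catalog, "full")` yields `["smoke", "standard", "full"]`.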

list_credit_packs: List credit packs (grade A)
Read-only, Idempotent

List the credit packs available for purchase. Returns pack index, credit count, USD price, and per-credit cost. Use the returned packIndex with top_up_credits.

Parameters (JSON Schema)
Name | Required | Description | Default

No parameters

Output Schema

Parameters (JSON Schema)
Name | Required | Description
result | No | Tool result payload (JSON object)
Behavior4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations already declare readOnlyHint=true and destructiveHint=false, so the description adds value by specifying the exact fields returned (pack index, credit count, USD price, per-credit cost). No contradiction with annotations.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Three sentences pack all necessary information without redundancy. The critical action (list) and key return fields are front-loaded, and every sentence contributes value.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given no parameters and an output schema (implied), the description is complete: it states what the tool does, what it returns, and how to use the result. No missing context.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

There are no parameters and schema coverage is 100%. The description adds meaning by detailing the output structure, which compensates for the absence of parameters. Baseline for 0 parameters is 4.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states that the tool lists credit packs for purchase, with the verb 'List' and the specific resource 'credit packs'. It distinguishes itself from sibling tools such as top_up_credits by indicating that the returned packIndex is used with that tool.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

It explicitly advises using the returned packIndex with top_up_credits, providing clear usage context. While it doesn't list when not to use it, the tool's simplicity and zero parameters make this guidance sufficient.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
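
To show how the returned packIndex might feed top_up_credits, here is a hedged sketch that picks the cheapest pack covering a required number of credits. Only `packIndex` is named in the description; the `credits` and `priceUsd` keys are assumptions standing in for "credit count" and "USD price".

```python
from typing import Any, Mapping, Optional, Sequence


def choose_pack(packs: Sequence[Mapping[str, Any]],
                credits_needed: int) -> Optional[int]:
    """Return the packIndex of the cheapest pack that covers the need, if any."""
    eligible = [p for p in packs if p["credits"] >= credits_needed]
    if not eligible:
        # No single pack is large enough; the caller must decide what to do.
        return None
    return min(eligible, key=lambda p: p["priceUsd"])["packIndex"]
```

Selecting by total price rather than per-credit cost avoids buying a much larger pack than the run requires.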

list_device_presets: List available device emulation presets (grade A)
Read-only, Idempotent

Returns the device presets you can pass as devicePreset on submit_test / submit_test_batch / retest_job. Each entry includes viewport width/height, deviceScaleFactor, isMobile, and hasTouch so the AI agent (and you) can pick the right one. Free — emulation runs as part of the base test cost, no markup. Use featuredOnly=true for the 15 most common phones/tablets; pass featuredOnly=false to see all 131.

Parameters (JSON Schema)
Name | Required | Description | Default
featuredOnly | No | When true (default) returns the curated featured subset (~15 modern phones/tablets). When false returns all 131 Puppeteer-bundled devices. |

Output Schema

Parameters (JSON Schema)
Name | Required | Description
result | No | Tool result payload (JSON object)
Behavior4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations already declare readOnlyHint=true, idempotentHint=true, and destructiveHint=false, so the description adds value by noting the tool is free and that emulation runs as part of the base test cost with no markup. This provides behavioral context beyond safety profiles.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is four sentences long, front-loaded with purpose and immediate keywords like 'Returns the device presets'. Every sentence adds essential detail with no waste.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

The output schema exists, so return values are covered. The description still adds useful context about what each entry includes (viewport, deviceScaleFactor, etc.) and mentions cost implications. Given the simplicity of the tool, this is thoroughly complete.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema coverage is 100%, so baseline is 3. The description adds meaning by explaining the effect of the featuredOnly parameter: returning a curated subset vs. all devices. This goes beyond the schema description's generic phrasing.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

Description clearly states the tool returns device presets for use in submit_test, submit_test_batch, and retest_job. It specifies the resource (device presets) and the action (list), and distinguishes from sibling submission tools by being a preparatory listing action.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides explicit guidance on when to use each parameter value: use featuredOnly=true for the 15 most common devices and featuredOnly=false for all 131. It also implies the tool should be used before submitting tests, though it does not explicitly state alternatives or when not to use.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

list_feedback: List queued feedback (staff only)
Grade: A · Read-only · Idempotent

Staff-only triage view. Returns feedback items optionally filtered by status / category / since. Use status="new" at session start to see what came in unaddressed. Returns most recent first.

Parameters
Name | Required | Description | Default
limit | No | - | -
status | No | Filter by status. Omit to see all. | -
category | No | - | -
severity | No | - | -
sinceIso | No | ISO 8601 timestamp; only return items created since then. | -
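
A minimal sketch of the session-start triage call, assuming the server is reached through a standard MCP `tools/call` JSON-RPC request. The argument names `status`, `sinceIso`, and `limit` come from the parameter list; the integer type of `limit`, its value of 20, and the `build_list_feedback_call` helper are assumptions.

```python
import json
from datetime import datetime, timezone


def build_list_feedback_call(request_id: int, since: datetime, limit: int = 20) -> dict:
    """Build an MCP tools/call request for new feedback created since `since`."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "list_feedback",
            "arguments": {
                "status": "new",                # unaddressed items only
                "sinceIso": since.isoformat(),  # ISO 8601 timestamp
                "limit": limit,                 # assumed to be an integer cap
            },
        },
    }


since = datetime(2025, 1, 1, tzinfo=timezone.utc)
request = build_list_feedback_call(1, since)
print(json.dumps(request, indent=2))
```

Because the tool is staff-only, a non-staff caller should expect the request to be rejected; the error shape is not documented here.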

Output Schema

Name | Required | Description
result | No | Tool result payload (JSON object)
Behavior4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations already declare readOnlyHint, idempotentHint, and destructiveHint. Beyond that, the description adds behavioral context: staff-only access, triage view, and default ordering (most recent first). This provides sufficient transparency for safe invocation.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is four short sentences, each serving a distinct purpose: stating the tool's role, listing filtering options, giving a real-world usage example, and noting the sort order. No redundant or vague phrases.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given that an output schema exists (context signal: has output schema = true), the description is not required to detail return fields. It covers purpose, filtering, ordering, and usage tip; the only minor omission is the lack of mention of pagination via the limit parameter, but it is acceptable.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema coverage is only 40% (2 out of 5 parameters have descriptions). The description mentions status, category, and sinceIso, adding meaning to those, but does not cover limit or severity. It partially compensates for the low coverage but still leaves gaps.
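
The coverage figure can be reproduced with a short calculation. This is a sketch of how the percentage appears to be derived (described parameters divided by total parameters); the scorer's actual formula is not published on this page, and `schema_coverage` is a hypothetical helper.

```python
def schema_coverage(params: dict[str, str]) -> float:
    """Fraction of parameters whose schema description is non-empty."""
    if not params:
        return 1.0  # an empty schema is treated as fully covered
    described = sum(1 for desc in params.values() if desc.strip())
    return described / len(params)


# Parameter names from list_feedback; empty strings mark missing descriptions.
list_feedback_params = {
    "limit": "",
    "status": "Filter by status. Omit to see all.",
    "category": "",
    "severity": "",
    "sinceIso": "ISO 8601 timestamp; only return items created since then.",
}
coverage = schema_coverage(list_feedback_params)
print(f"{coverage:.0%}")  # prints 40%
```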

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states it is a staff-only triage view that returns feedback items, optionally filtered by status/category/since. It explicitly distinguishes itself from sibling mutating tools like submit_feedback or update_feedback by emphasizing read-only triage.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides a concrete usage guideline: 'Use status="new" at session start to see what came in unaddressed.' This tells the agent when to apply the filter, though it does not explicitly mention when not to use or alternative tools.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

list_kept_personas: List your kept-alive test personas
Grade: A · Read-only · Idempotent

Returns every persona this account has kept alive (created via submit_test with keepTestAccount=true and successfully signed back in). Each entry includes the personaId you'd pass as existingPersonaId to retest the same user, plus the originating customer site and credential expiry. Personas auto-expire 30 days after their last use; each successful retest bumps the expiry.
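
The expiry rule can be sketched as a sliding 30-day window. The helper names and dates below are hypothetical; in practice an agent would read the credential expiry from the returned entry rather than recompute it.

```python
from datetime import datetime, timedelta, timezone

PERSONA_TTL = timedelta(days=30)  # from the description: 30 days after last use


def persona_expiry(last_used: datetime) -> datetime:
    """Expiry moves forward each time the persona is successfully reused."""
    return last_used + PERSONA_TTL


def is_expired(last_used: datetime, now: datetime) -> bool:
    return now >= persona_expiry(last_used)


created = datetime(2025, 3, 1, tzinfo=timezone.utc)
retested = datetime(2025, 3, 20, tzinfo=timezone.utc)
check = datetime(2025, 4, 5, tzinfo=timezone.utc)

# Without a retest the persona lapses on 31 March; the retest on 20 March
# pushes the expiry to 19 April.
expired_without_retest = is_expired(created, check)
expired_with_retest = is_expired(retested, check)
```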

Parameters

No parameters

Output Schema

Name | Required | Description
result | No | Tool result payload (JSON object)
Behavior5/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

The description adds significant behavioral context beyond annotations (readOnlyHint, idempotentHint, destructiveHint): auto-expiry after 30 days, retest bumps expiry, and what fields are returned. No contradiction with annotations.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Three sentences, each earning its place: first states the action, second describes the return data structure, third explains auto-expiry behavior. No filler or redundancy.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

The description fully explains the purpose, output (personaId, customer site, credential expiry), and key behavior (auto-expiry, retest bump). Given 0 parameters and an existing output schema, this is complete.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

There are 0 parameters and schema coverage is 100% (empty object), so no parameter documentation is needed. The baseline for 0 parameters is 4, and the description does not add unnecessary parameter info.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states it returns all kept-alive personas, specifying their origin (submit_test with keepTestAccount=true) and what each entry includes (personaId, customer site, credential expiry). This distinguishes it from sibling list tools like list_test_cards or list_projects.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description explains when to use the tool (to retrieve kept-alive personas for retesting) and provides important context about auto-expiry and retest behavior. It does not explicitly list alternatives or when not to use, but the context is clear enough.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

list_my_claimed_jobs: List my claimed jobs
Grade: A · Read-only · Idempotent

Jobs this worker currently holds: claimed but not yet submitted, plus in-progress.

Parameters

No parameters

Output Schema

Name | Required | Description
result | No | Tool result payload (JSON object)
Behavior4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations already declare readOnlyHint, idempotentHint, and destructiveHint. The description adds behavioral context beyond annotations by specifying the included job states (claimed but not submitted, plus in-progress). No contradictions with annotations.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is a single concise sentence that front-loads the purpose and scope. Every part is essential, with no wasted words.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool has no parameters, annotations cover safety and idempotency, and an output schema exists, the description is complete. It clearly defines the set of jobs returned without missing critical information.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

The input schema has zero parameters, so the baseline is 4. The description does not need to provide parameter details as none exist.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description uses the specific verb 'list' and resource 'my claimed jobs', clarifying it returns jobs the worker currently holds. It distinguishes from sibling tools like 'list_available_jobs' and 'claim_job' by specifying the state: claimed but not submitted plus in-progress.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines3/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description implies when to use this tool (to see current workload) but does not explicitly contrast with alternatives like 'list_available_jobs' for unclaimed jobs. No explicit when-not-to-use or prerequisite information is provided.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

list_personality_offerings: Browse the AI personality menu
Grade: A · Read-only · Idempotent

List all packaged AI tests TMV publishes — each personality has 0..N service offerings (e.g. 'First-Time User • Signup gauntlet • 10 credits • ~5min'). Submit one back to submit_test as personalityOfferingId and the run's step budget, inbox provisioning, personality, and price are all locked to the offering preset. Use this when you want a deterministic, named test product rather than tuning maxSteps / useTestInbox by hand.

Parameters
Name | Required | Description | Default
tier | No | Optional filter by personality tier. | -
maxCredits | No | Optional ceiling on credit price (e.g. 10 → exclude offerings that cost more). | -
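
A sketch of the offering-to-test flow described above: pick an offering within budget, then hand its ID to submit_test. Only `personalityOfferingId` is taken from the description; the offering field names (`id`, `label`, `credits`), the sample data, the `url` and `goal` argument names, and both helpers are assumptions.

```python
from typing import Optional

# Hypothetical offerings; the real payload shape is not documented here.
SAMPLE_OFFERINGS = [
    {"id": "offering-a", "label": "Signup gauntlet", "credits": 10},
    {"id": "offering-b", "label": "Longer run", "credits": 25},
]


def cheapest_offering(offerings: list[dict], max_credits: int) -> Optional[dict]:
    """Client-side equivalent of the maxCredits ceiling, then pick the cheapest."""
    affordable = [o for o in offerings if o["credits"] <= max_credits]
    return min(affordable, key=lambda o: o["credits"]) if affordable else None


def build_submit_test_args(url: str, goal: str, offering_id: str) -> dict:
    # maxSteps / useTestInbox are deliberately omitted: the description says
    # the offering preset locks step budget, inbox provisioning, and price.
    return {"url": url, "goal": goal, "personalityOfferingId": offering_id}


offering = cheapest_offering(SAMPLE_OFFERINGS, max_credits=10)
args = build_submit_test_args("https://example.com", "Sign up for an account", offering["id"])
```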

Output Schema

Name | Required | Description
result | No | Tool result payload (JSON object)
Behavior4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations already declare readOnly, idempotent, non-destructive. Description adds that returned offerings lock step budget, inbox provisioning, personality, and price – useful behavioral context beyond annotations.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Three sentences, no fluff. Front-loaded with purpose, then explains usage flow, then gives recommendation. Every sentence earns its place.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Has output schema (mentioned but details not in description, which is fine). Description covers what the tool returns and how it integrates with submit_test. Adequate for a simple filtered list tool.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 100% (both parameters are described in the schema). The tool description itself does not mention the tier or maxCredits filters, so it adds no meaning beyond the schema's own descriptions ('Optional filter by personality tier', 'Optional ceiling on credit price').

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

Description specifies 'List all packaged AI tests TMV publishes' – clear verb+resource. Distinguishes from sibling 'submit_test' by explaining the role of the returned ID. Also distinct from other list tools by focusing on personality offerings with presets.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

Explicitly says 'Use this when you want a deterministic, named test product rather than tuning maxSteps / useTestInbox by hand.' Provides context on when to choose this tool over manual configuration. Doesn't state when not to use, but implication is clear.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

list_projects: List projects
Grade: A · Read-only · Idempotent

List the projects (sites under test) registered to this account.

Parameters

No parameters

Output Schema

Name | Required | Description
result | No | Tool result payload (JSON object)
Behavior3/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations already provide readOnlyHint=true and idempotentHint=true, so the description adds minimal behavioral context beyond clarifying that projects are 'sites under test'. No contradictions.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

A single, clear sentence with no unnecessary words. It is front-loaded and efficient.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

For a parameterless list tool with full annotation coverage and an output schema, the description is complete. It explains the resource being listed and serves its purpose.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

The input schema has no parameters, so baseline is 4. The description does not add parameter info, which is appropriate as there are none.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states it lists projects, specifying they are 'sites under test'. This distinguishes it from other list tools like list_test_cards or list_feedback, which list different resources.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines3/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

No explicit guidance on when to use this tool versus alternatives, but given its simplicity and parameterless nature, the context is implied. It does not mention exclusions or alternatives.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

list_test_cards: List test cards issued to this account
Grade: A · Read-only · Idempotent

Audit view of every test card this account has minted. PANs are NEVER returned (we don't persist them) — only last4 + funded amount + status + expiry. Useful for reconciling Stripe Issuing balance against TMV spend.

Parameters

No parameters

Output Schema

Name | Required | Description
result | No | Tool result payload (JSON object)
Behavior4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations already declare readOnlyHint, idempotentHint, and destructiveHint, so the safety profile is clear. The description adds value by explaining that PANs are never returned and that this is an audit view, which is beyond the annotations.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Three sentences, front-loaded with purpose, no wasted words. Every sentence earns its place.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given zero parameters and an output schema (presumably detailed), the description fully covers purpose, return values, and use case, leaving no gaps for the AI agent.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

There are no parameters, so the schema provides full coverage. The description adds context about what is returned, which aids understanding of the output schema.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states it lists test cards for the account, specifies exactly what fields are returned (last4, funded amount, status, expiry), and explicitly notes that PANs are never returned, distinguishing it from other tools.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides a concrete use case ('reconciling Stripe Issuing balance against TMV spend') and implicitly suggests when to use it for auditing. No alternative tools with overlapping functionality exist among siblings.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

list_worker_offerings: Browse the worker marketplace menu
Grade: A · Read-only · Idempotent

List active worker offerings. Filter by specialty to find workers fluent in a domain (e.g. 'payments', 'i18n-japanese', 'react-spa'). Each entry includes the worker's bio, specialty tags, employment type ('external' = marketplace, 'in_house' = TMV staff), and the credit price.

Parameters
Name | Required | Description | Default
limit | No | - | -
specialty | No | Optional specialty filter: case-insensitive substring match against the worker or offering specialty tags. | -
includeInHouse | No | Whether to include TMV-staffed in-house workers (premium tier). Default true. | -
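
To make the specialty filter's matching rule concrete, here is a client-side sketch of a case-insensitive substring match over tags. The server performs the real filtering; `matches_specialty` is a hypothetical helper that only illustrates the semantics stated in the parameter description.

```python
def matches_specialty(tags: list[str], query: str) -> bool:
    """True if `query` is a case-insensitive substring of any tag."""
    needle = query.casefold()
    return any(needle in tag.casefold() for tag in tags)


# Tag values taken from the examples in the tool description.
worker_tags = ["Payments", "i18n-japanese", "react-spa"]

hit_partial = matches_specialty(worker_tags, "pay")    # matches "Payments"
hit_case = matches_specialty(worker_tags, "I18N")      # matches "i18n-japanese"
miss = matches_specialty(worker_tags, "vue")           # no tag contains "vue"
```

One consequence of substring matching is that a short query such as 'react' also matches longer tags like 'react-spa', so broad queries may return more workers than expected.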

Output Schema

Name | Required | Description
result | No | Tool result payload (JSON object)
Behavior4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations already declare readOnlyHint=true and destructiveHint=false, so the agent knows it is safe. The description adds behavioral context by stating it lists only active offerings and explains the output structure (bio, tags, employment type, price), which is beyond what annotations provide.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is three sentences long with no extraneous information. It starts with the main purpose, provides filtering guidance, and lists key output fields. Every sentence is valuable and well-structured.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the presence of an output schema, the description does not need to explain return values. It already covers the main output fields and the filtering option. For a read-only listing tool with annotations indicating safety, this is complete.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema coverage is 67% (2 of 3 parameters have descriptions), and the description adds value for the 'specialty' parameter with examples. However, it does not mention 'limit' or 'includeInHouse'; the schema documents 'includeInHouse' but leaves 'limit' undescribed. Baseline 3 is appropriate.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool lists active worker offerings and provides filtering by specialty, with examples. It distinguishes from siblings like 'list_personality_offerings' and 'create_worker_offering' by focusing on workers and listing output fields.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description gives clear context on when to use the tool (to browse worker offerings) and how to filter by specialty. It does not explicitly state when not to use, but the sibling list provides implicit alternatives. It also does not cover the 'includeInHouse' parameter, but the schema covers it.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

provision_test_card: Mint a spendable test-payment card
Grade: A

Issues a single-use Stripe-Issuing virtual card hard-capped at fundedUsd, billed at funded + 25% markup + $2 service fee. PAN + CVC are returned ONCE in the response and TMV never persists them. Card auto-freezes 24h after creation. In sandbox mode (test key) cards auth only against Stripe test-mode merchants, perfect for verifying customer checkout flows without real money. Charged in credits at 1 credit = $0.10 (so a $10 funded card costs ~125 credits all-in). Provisioning fee absorbed into the markup.

Parameters
Name | Required | Description | Default
fundedUsd | Yes | USD amount to load onto the card (and the card's spending limit). | -
testJobId | No | Optional TMV job ID to associate the card with. Used by the AI worker to surface the card via runContext.testPaymentCard so the agent can type it at checkout. | -
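
The pricing in the description can be worked through as arithmetic. The sketch below applies the stated rates (25% markup, $2 service fee, 1 credit = $0.10) and the $1-$200 funding range given for quote_test_card. It reports two credit totals because the description is ambiguous: its '~125 credits' example for a $10 card matches funded plus markup only, while adding the $2 fee gives 145. Which figure the server actually charges is not confirmed here, so quote_test_card remains the authoritative source; `quote_card` and its output keys are hypothetical.

```python
from decimal import Decimal

MARKUP_RATE = Decimal("0.25")     # 25% markup, from the description
SERVICE_FEE_USD = Decimal("2")    # $2 service fee, from the description
USD_PER_CREDIT = Decimal("0.10")  # 1 credit = $0.10


def quote_card(funded_usd: str) -> dict:
    """Local estimate only; field names are illustrative, not the API's."""
    funded = Decimal(funded_usd)
    if not (Decimal("1") <= funded <= Decimal("200")):
        raise ValueError("fundedUsd must be between 1 and 200")
    markup = funded * MARKUP_RATE
    total = funded + markup + SERVICE_FEE_USD
    return {
        "funded": funded,
        "markup": markup,
        "serviceFee": SERVICE_FEE_USD,
        "totalUsd": total,
        "creditsWithFee": total / USD_PER_CREDIT,
        "creditsWithoutFee": (funded + markup) / USD_PER_CREDIT,
    }


quote = quote_card("10")
```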

Output Schema

Name | Required | Description
result | No | Tool result payload (JSON object)
Behavior5/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Discloses key behaviors beyond annotations: card auto-freezes 24h, PAN/CVC returned once and never persisted, credit charging, fees. No contradiction with annotations (readOnlyHint=false, destructiveHint=false).

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is concise (roughly 80 words) yet densely informative. Every sentence adds value: fee breakdown, credit cost, sandbox usage, data privacy. Front-loaded with main action.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the presence of an output schema, the description adequately covers cost, expiration, use case, and privacy. No missing critical behavioral context for a provisioning tool.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema descriptions already cover both parameters (fundedUsd, testJobId). The description adds extra context: cost calculation (25% markup + $2 fee), credit conversion rate, and purpose of testJobId for surfacing via runContext.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states it issues a single-use Stripe-Issuing virtual card, specifying the resource and action. It distinguishes from siblings like freeze_test_card and list_test_cards by detailing unique behaviors (auto-freeze, one-time PAN/CVC).

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides clear usage context: 'perfect for verifying customer checkout flows without real money' and mentions sandbox mode. However, it does not explicitly exclude alternatives or state when not to use this tool.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

quote_demo_video: Quote the credit cost of a multi-segment demo video
Grade: A · Read-only · Idempotent

Free preview of assemble_demo_video pricing. Sums the Hume Octave narration cost across segments and adds a flat 10-credit assembly fee. Useful before committing to a longer video.

Parameters
Name | Required | Description | Default
segments | Yes | - | -
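
The quote's structure (per-segment narration cost summed, plus a flat fee) can be sketched as follows. How the narration cost of a single segment is derived is not documented on this page, so the sketch takes per-segment costs as given inputs; the sample numbers and the `quote_demo_video_credits` helper are hypothetical.

```python
ASSEMBLY_FEE_CREDITS = 10  # flat assembly fee, from the description


def quote_demo_video_credits(narration_credits: list[int]) -> int:
    """Sum assumed per-segment narration costs and add the flat assembly fee."""
    return sum(narration_credits) + ASSEMBLY_FEE_CREDITS


# Three segments with made-up narration costs of 4, 6, and 5 credits.
total_credits = quote_demo_video_credits([4, 6, 5])
print(total_credits)  # prints 25
```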

Output Schema

Name | Required | Description
result | No | Tool result payload (JSON object)
Behavior3/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations already declare readOnlyHint=true, idempotentHint=true, and destructiveHint=false. The description adds that it is a 'free preview' and sums costs, which is consistent. However, it does not provide additional behavioral context (e.g., rate limits, response format) beyond what annotations convey.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is concise (three short sentences) and front-loaded with the key action. However, given the complexity of the parameter, it could be slightly more detailed without being verbose. It earns its place but misses some clarity.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness3/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

The description covers the main idea (cost preview for a multi-segment video) but lacks specifics on segment configuration, cost factors beyond narration, and the output schema. Given the existence of an output schema, the description does not need to detail return values, but it should still guide input usage.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters1/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

The input schema has a single complex parameter 'segments' with 8 properties, but schema description coverage is 0%. The description only mentions 'narration cost' without explaining which fields affect pricing (e.g., narrationText). It adds minimal value for parameter understanding.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's function: a free preview of assemble_demo_video pricing, summing Hume Octave narration costs and adding a 10-credit assembly fee. It distinguishes itself from the sibling tool 'assemble_demo_video' by being a quote-only operation.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description says 'Useful before committing to a longer video,' which implies using this tool before assembly to check cost. It indirectly references the sibling 'assemble_demo_video' as the alternative. However, it does not explicitly mention when not to use or other related siblings like 'quote_voiceover'.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

quote_test_card: Preview the cost of a spendable test card (Grade A)
Read-only, Idempotent

Pre-flight pricing for provision_test_card. Pass the USD amount you want loaded onto the card; returns funded + markup + service fee + total charged. Funded $1-$200. No credits deducted.
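
As an illustration of how a client might call this tool, the sketch below builds an MCP `tools/call` request for `quote_test_card` and checks the documented $1-$200 range before sending. The helper name is hypothetical, and treating both bounds as inclusive and `fundedUsd` as a plain number are assumptions; the schema shown on this page does not describe the parameter.

```python
import json


def build_quote_test_card_call(funded_usd, request_id=1):
    """Build a JSON-RPC 2.0 `tools/call` request for quote_test_card.

    Assumption: the $1-$200 range is inclusive at both ends.
    """
    if not 1 <= funded_usd <= 200:
        raise ValueError("fundedUsd must be between 1 and 200 USD")
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "quote_test_card",
            "arguments": {"fundedUsd": funded_usd},
        },
    }


request = build_quote_test_card_call(25)
print(json.dumps(request, indent=2))
```

Validating the range client-side only saves a round trip; the server remains the authority on what it accepts.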

Parameters (JSON Schema)
Name | Required | Description | Default
fundedUsd | Yes | - | -

Output Schema

Parameters (JSON Schema)
Name | Required | Description
result | No | Tool result payload (JSON object)

Behavior: 5/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

The description explicitly states 'No credits deducted,' which aligns with the readOnlyHint and destructiveHint annotations. It also details the return values (funded, markup, service fee, total charged), providing behavioral clarity beyond the annotations.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Four short sentences, each adding value: the first states the purpose, the second explains input and output, and the last two give the funding range and confirm that no credits are deducted. No fluff, front-loaded with key information.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

The description covers input, output, range, and side-effects. Given the tool's simplicity (1 param, no nested objects) and the presence of an output schema, the description is complete and self-sufficient.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 5/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

The description fully explains the single parameter 'fundedUsd' by stating its purpose ('amount you want loaded onto the card') and range ('$1-$200'), compensating for the 0% schema description coverage. It also clarifies what the tool returns, adding value beyond the schema.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states it provides pre-flight pricing for provision_test_card, specifying the input (USD amount) and the output components (funded, markup, service fee, total charged). It distinguishes itself from the sibling provision_test_card by being a read-only pricing check.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description implies usage as a precursor to provision_test_card ('Pre-flight pricing for provision_test_card'), giving clear context. However, it does not explicitly state when not to use it or provide alternatives among siblings, though the context is sufficient.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

quote_voiceover: Quote the credit cost of a Hume Octave voiceover (Grade A)
Read-only, Idempotent

Read-only cost preview for synthesize_voiceover. Returns the credit charge derived from script length at Hume's ~$0.05 / 1k char list rate, converted at TMV's 1 credit = $0.10. Free.
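
The conversion described above can be worked through as arithmetic: at roughly $0.05 per 1,000 characters and $0.10 per credit, a script costs about one credit per 2,000 characters. The sketch below is an estimate under those two quoted figures only; the function name is hypothetical and the server's rounding or minimum-charge rules are not documented here, so the result is left unrounded.

```python
from fractions import Fraction

USD_PER_1K_CHARS = Fraction(5, 100)  # Hume's approximate list rate from the description
USD_PER_CREDIT = Fraction(10, 100)   # TMV's stated conversion: 1 credit = $0.10


def estimate_voiceover_credits(text):
    """Return an unrounded credit estimate for a script (hypothetical helper)."""
    usd = Fraction(len(text), 1000) * USD_PER_1K_CHARS
    return usd / USD_PER_CREDIT


script = "x" * 3000  # stand-in for a 3,000-character script
print(float(estimate_voiceover_credits(script)))  # 1.5
```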

Parameters (JSON Schema)
Name | Required | Description | Default
text | Yes | Script text to be synthesized. | -

Output Schema

Parameters (JSON Schema)
Name | Required | Description
result | No | Tool result payload (JSON object)

Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations already declare readOnlyHint=true, idempotentHint=true, and destructiveHint=false. The description adds value by specifying the cost calculation details (rate conversion) and stating it's free, which goes beyond the annotations without contradicting them.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is extremely concise: two short sentences plus the one-word note "Free", front-loading the key purpose. Every word serves a purpose with no redundancy.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's simplicity (one parameter, output schema exists), the description fully covers all necessary context for an agent to use it correctly. It explains the cost calculation and that it's a preview.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

The single parameter 'text' is fully described in the schema (100% coverage), so the description adds minimal extra meaning. The baseline of 3 is appropriate as no additional parameter semantics are provided.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description explicitly states it is a 'read-only cost preview for synthesize_voiceover' and details the cost calculation, making the purpose unmistakably clear. It distinguishes itself from the sibling 'synthesize_voiceover' by indicating it only provides a cost estimate.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description implies usage is for previewing costs before calling 'synthesize_voiceover', and mentions it is 'Read-only' and 'Free'. However, it does not explicitly state when not to use it or suggest alternatives beyond the implied main tool.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

retest_job: Re-run a previously completed test (Grade A)
Idempotent

Re-run an existing test job against the latest deployment. Useful after pushing a fix surfaced by get_test_results — call this to verify whether the bug is gone. Keeps the original test's URL, custom goal, system prompt, and inbox configuration so the verification covers the same flow.

Parameters (JSON Schema)
Name | Required | Description | Default
jobId | Yes | Job ID returned by submit_test (or a prior retest_job). | -

Output Schema

Parameters (JSON Schema)
Name | Required | Description
result | No | Tool result payload (JSON object)

Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations already indicate idempotent and non-destructive behavior. Description adds that it keeps original test configuration, implying no side effects. Could further state that it does not alter original results, but still good.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Three sentences: the first states the purpose, the second gives the use case, and the third lists the retained configuration. No wasted words, front-loaded with key information.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given single parameter, output schema availability, and clear annotations, the description covers all needed context: purpose, usage trigger, preserved config, and parameter source.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Only one parameter (jobId) with schema coverage 100%. Description adds meaning by stating that the original configuration is preserved, reinforcing the parameter's role in identifying the test to re-run.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

Description clearly states 'Re-run an existing test job against the latest deployment' with specific verb and resource. It differentiates from sibling tools like submit_test and get_test_results by focusing on re-running completed tests.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 5/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

Explicitly says to use after pushing a fix surfaced by get_test_results to verify bug fix. Also notes that it keeps original configuration (URL, goal, prompt, inbox), providing clear context for when it is appropriate.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

submit_combo: Submit a packaged AI-agent combo bundle (Grade A)

Queue a named combo against a URL. Fans into N ordered jobs (cheap bug-finders first, expensive audits last) sharing one batchId. +15% parallel premium applies. If the combo has a pauseOnBugThreshold, the worker auto-cancels remaining pending legs + refunds their credits once cumulative bug count crosses the threshold — so a broken site never burns the full bundle. Use list_combos to browse the catalog first.
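
The auto-pause behaviour described above can be illustrated with a small simulation. This is a sketch under stated assumptions, not the worker's actual logic: the `Leg` type and `run_combo` function are hypothetical, legs are run one at a time for simplicity (the real legs may overlap), and "crosses the threshold" is read as "cumulative bugs >= threshold".

```python
from dataclasses import dataclass


@dataclass
class Leg:
    name: str
    credits: int      # credits reserved for this leg
    bugs_found: int   # bug count the leg reports once it has run


def run_combo(legs, pause_on_bug_threshold=None):
    """Run legs in order; cancel and refund the rest once bugs reach the threshold."""
    executed, refunded, bugs = [], 0, 0
    for index, leg in enumerate(legs):
        executed.append(leg.name)
        bugs += leg.bugs_found
        if pause_on_bug_threshold is not None and bugs >= pause_on_bug_threshold:
            # Remaining pending legs are cancelled and their credits refunded.
            refunded = sum(pending.credits for pending in legs[index + 1:])
            break
    return executed, refunded


legs = [Leg("smoke", 5, 3), Leg("accessibility", 10, 0), Leg("deep-audit", 40, 0)]
print(run_combo(legs, pause_on_bug_threshold=3))  # (['smoke'], 50)
```

Ordering cheap bug-finders first is what makes the refund meaningful: the expensive legs are the ones most likely to still be pending when the threshold trips.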

Parameters (JSON Schema)
Name | Required | Description | Default
url | No | Target URL all legs run against. Required unless `pages` or `stories` is provided. | -
pages | No | Multi-page mode (legacy): list 1-10 distinct page URLs. Each combo leg fans out × pages.length. Cost scales linearly. Prefer `stories[]` on Whole Kit tiers. | -
comboId | Yes | ID of a combo from list_combos (e.g. 'combo-smoke-stack', 'combo-whole-kit-core'). | -
stories | No | Story-based mode (Whole Kit tier preferred). Each story = one end-to-end user flow exercised by every leg in the combo. Hard-capped by the combo's maxStories (Solo=1, Core=3, Plus=6, Max=10). Pricing is the combo's flatCreditPrice (bulk-discounted at higher tiers), not derived from stories.length × per-leg cost. | -
projectId | No | - | -
description | Yes | Plain-English description applied to every leg as the job title. | -
projectLabel | No | - | -

Output Schema

Parameters (JSON Schema)
Name | Required | Description
result | No | Tool result payload (JSON object)

Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Discloses beyond annotations: fan-out ordering, +15% parallel premium, auto-cancellation with refunds on pauseOnBugThreshold. Annotations already indicate non-destructive, non-idempotent, open world; description adds valuable behavioral context.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Five short sentences, all front-loaded and essential. No fluff. Every sentence adds unique value.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's complexity (7 params, conditional logic), the description covers key aspects: ordering, pricing, cancellation, and prerequisite browsing. Output schema exists, so return values are not needed.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema has good coverage (71%), but description adds meaning around comboId, ordering, pricing, and cancellation logic that are not in schema. Some parameters (projectId, projectLabel) get no extra context, but overall adds value.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

Clearly states it queues a named combo against a URL, with specific ordering and batch ID. Distinguishes from sibling submit tools (e.g., submit_test) by focusing on combos and referencing list_combos.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

Explicitly advises to use list_combos to browse the catalog first. Implicitly tells when to use (for combos) but does not explicitly mention when not to use or alternatives.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

submit_conversation_test: Run a two-AI voice conversation through Paradise Comms (Grade A)

Spawns two AI personas as participants on a real LiveKit voice call (via Paradise's self-hosted comms cluster), each driven by its own LLM (Claude + GPT by default), and runs a structured conversation. Each persona talks aloud (TTS) and listens to the other (Whisper STT) — this isn't simulation, it's a real WebRTC call with real audio. Used to verify Paradise Comms end-to-end (publisher → SFU → subscriber → recording → outbound webhook) and to demo agent-to-agent voice. Returns the full transcript and a recording hint.

Parameters (JSON Schema)
Name | Required | Description | Default
turns | No | Total back-and-forth turns. 6 means A→B→A→B→A→B. Cap of 20 to bound LLM + TTS + Whisper spend per test. | -
personaA | No | Persona for agent A (speaks first). Defaults to skeptical-cto. See lib/personalities.ts for the full list of 8 personas. | skeptical-cto
personaB | No | Persona for agent B. Defaults to power-user. The pairing skeptical-cto + power-user is the canonical demo because their voices contrast strongly enough to prove the conversation is real (not echo). | power-user
scenario | No | Optional scenario nudge added to both personas' system prompts. Example: 'Topic: should the team switch from PostgreSQL to MongoDB? Have a real disagreement.' Leave empty to let the personas freestyle. | -
smokeUrl | No | URL of the smoke page on the LiveKit SFU droplet. Defaults to staging. | https://livekit-staging.comms.paradisemodern.com/smoke/
paradiseBase | No | Paradise Comms API base URL. Defaults to staging; pass production when ready. | https://comms.staging.paradisemodern.com
paradiseToken | Yes | Paradise Comms portfolio bearer token (e.g. paradise-staging_test_…). Get one by running scripts/seed-staging.ts in the paradisemodern repo, OR via POST /api/admin/comms/tokens as a super_admin. | -
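
The `turns` parameter above counts individual utterances that alternate between the two personas, starting with agent A, up to a cap of 20. A minimal sketch of that ordering follows; the function name is hypothetical, and rejecting out-of-range values (rather than clamping them) is an assumption, since the page does not say how the server handles them.

```python
MAX_TURNS = 20  # cap stated in the `turns` parameter description


def speaker_order(turns):
    """Return which agent speaks on each turn, with agent A speaking first."""
    if not 1 <= turns <= MAX_TURNS:
        raise ValueError(f"turns must be between 1 and {MAX_TURNS}")
    return ["A" if i % 2 == 0 else "B" for i in range(turns)]


print("→".join(speaker_order(6)))  # A→B→A→B→A→B
```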

Output Schema

Parameters (JSON Schema)
Name | Required | Description
result | No | Tool result payload (JSON object)

Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations provide openWorldHint=true and destructiveHint=false. The description adds that it is a real WebRTC call with real audio, TTS, and STT, which is significant behavioral context beyond annotations. No contradictions.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is concise (4 sentences) and front-loaded with the core action. Every sentence adds value without redundancy.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

All 7 parameters are described in schema, output schema exists, and the description mentions return values (transcript and recording hint). Complete for a complex tool.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Input schema coverage is 100% and descriptions in schema already explain defaults and purpose. The description adds no new parameter-level information beyond the schema.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states it spawns two AI personas on a real LiveKit voice call with TTS and STT, not simulation, and returns transcript and recording hint. This distinguishes it from sibling tools like submit_test or submit_test_batch.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description explicitly states it is used to verify Paradise Comms end-to-end and to demo agent-to-agent voice. It does not mention when not to use, but the purpose is clear enough.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

submit_feedback: File a bug, feature request, or UX nit for operator triage (Grade A)

Queues feedback for staff review. NOT acted on automatically — items sit in status="new" and are worked through with the operator. Good filing hygiene: one issue per submission, name the surface affected (e.g. "submit_test default step budget too low for OAuth flows"), include reproduction steps in the body. If you're filing while running another job, pass context.relatedJobId so the operator can pull the screenshots / report. Anonymous filers can include reporterEmail for follow-up.

Parameters (JSON Schema)
Name | Required | Description | Default
body | Yes | Full context, repro steps, what you expected vs. what happened. Markdown OK; the admin view renders it. | -
title | Yes | Short imperative summary: what should change. E.g. "submit_test should accept devicePreset by alias". | -
category | No | - | bug
severity | No | critical = blocking real work, major = wrong result / cost, minor = papercut, suggestion = enhancement. | minor
mcpClient | No | Your MCP client name so we can spot patterns by tool (Claude Code, Cursor, Codex, etc.). | -
relatedJobId | No | If this feedback is about a specific test result, pass the jobId so staff can pull the report / screenshots. | -
reporterEmail | No | Optional contact for follow-up. Authed accounts already have an email on file; this is for anonymous catalog browsers. | -

Output Schema

Parameters (JSON Schema)
Name | Required | Description
result | No | Tool result payload (JSON object)

Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Discloses that feedback queues for staff review and is not automated, items sit in status='new'. Adds value beyond annotations which show readOnlyHint=false, idempotentHint=false, destructiveHint=false.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Paragraph is dense but effectively conveys key information. Could be slightly more structured, but each sentence adds value.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Covers purpose, usage guidelines, parameter semantics, and behavioral notes comprehensively. Output schema exists, so no need to describe return values. Complete for a feedback submission tool.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Adds meaning beyond schema: explains relatedJobId usage, reporterEmail for anonymous filers, severity definitions, and title format examples. Schema coverage is high (86%), but description provides usage context.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

Title and description clearly state that the tool files feedback (bug, feature request, UX nit) for operator triage. It is distinguished from sibling tools like list_feedback and update_feedback by specifying that it queues new items for staff review.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

Explicit guidance: NOT acted on automatically, one issue per submission, name surface, include reproduction steps, pass relatedJobId for context, optional reporterEmail. Could be more explicit about when not to use, but clear context.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

submit_interaction_scene: Submit a multi-agent interaction scene (Grade A)

Queues 2-10 AI agents in parallel as roles in a coordinated scene. Each role gets its own browser, persona, and goal — and uses signal() + wait_for_signal() actions to communicate with sibling roles. Use this for publisher+viewer (livestream), buyer+seller (marketplace), multi-user chat, host+guest flows, anything where one agent must produce a value (URL / order id / stream id) that another agent needs. Returns sceneId + role-to-jobId mapping. Each role billed as a normal AI test + 15% parallel premium on top.
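
The signal() / wait_for_signal() handoff described above, where one role produces a value that another role needs, resembles a named rendezvous between concurrent workers. The sketch below models that pattern locally with threads. It is an analogy only: `SignalBus`, the signal name `stream_url`, and the example URL are all hypothetical, and the real actions run inside separate browser agents rather than in one process.

```python
import threading


class SignalBus:
    """Minimal in-process stand-in for cross-role signalling."""

    def __init__(self):
        self._values = {}
        self._cond = threading.Condition()

    def signal(self, name, value):
        # Publish a named value and wake any roles waiting on it.
        with self._cond:
            self._values[name] = value
            self._cond.notify_all()

    def wait_for_signal(self, name, timeout=5.0):
        # Block until the named value exists, or give up after `timeout` seconds.
        with self._cond:
            if not self._cond.wait_for(lambda: name in self._values, timeout=timeout):
                raise TimeoutError(f"no signal named {name!r}")
            return self._values[name]


bus = SignalBus()
seen = []


def viewer():
    # The viewer role cannot proceed until the publisher shares the stream URL.
    seen.append(bus.wait_for_signal("stream_url"))


def publisher():
    bus.signal("stream_url", "https://example.com/live/123")


threads = [threading.Thread(target=viewer), threading.Thread(target=publisher)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(seen)  # ['https://example.com/live/123']
```

Because the waiter checks for an already-published value before blocking, the handoff works regardless of which role starts first.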

Parameters (JSON Schema)
Name | Required | Description | Default
roles | Yes | - | -
projectId | No | - | -
description | Yes | Plain-English description of the scene (e.g. 'PM Comms publisher → viewer livestream verification'). | -
projectLabel | No | - | -

Output Schema

Parameters (JSON Schema)
Name | Required | Description
result | No | Tool result payload (JSON object)

Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

The description adds significant behavioral context beyond the annotations, including that roles run in parallel, each billed with a 15% premium, and the tool returns a sceneId and role-to-jobId mapping. There is no contradiction with annotations (readOnlyHint=false, etc.). The description does not mention any destructive or idempotent traits, which is consistent.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is a single paragraph of five sentences, densely packed with key information. It is front-loaded with the core action and provides examples. It could be slightly better structured (e.g., bullet points) but is concise and efficient with no wasted words.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 3/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

The description covers the essential purpose and usage examples, and since an output schema exists, it doesn't need to detail return values. However, given the complexity of the tool (multiple nested role options), it lacks details on parameters like projectId and projectLabel, and could better explain the roles array structure. It is adequate but not thorough.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 2/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

The schema has 4 top-level parameters with only 25% description coverage, but the description only lightly touches on the roles structure (browser, persona, goal) and does not clarify parameters like projectId or projectLabel. It mentions the output but not the input parameters meaningfully. The description should compensate for low schema coverage but falls short.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool queues multiple AI agents as roles in a coordinated scene, with specific examples of use cases (publisher+viewer, buyer+seller) and communication via signal/wait_for_signal. It distinguishes itself from sibling tools like submit_test and submit_conversation_test, which do not handle multi-agent interaction.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides explicit examples of when to use the tool (e.g., livestream, marketplace, multi-user chat) and explains that each role gets its own browser, persona, and goal. It lacks an explicit statement of when not to use it, but the context is clear enough for an AI agent to differentiate.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

submit_job_results: Submit job results (Grade A)

Submit the worker's outcomes for a claimed job. Triggers the same earnings + report + client notification pipeline a human checker submission triggers. Returns the report id and the final pass/fail status.

Parameters (JSON Schema)
Name | Required | Description | Default
items | Yes | - | -
jobId | Yes | The job ID from claim_job. | -
summary | Yes | Plain-English summary of what was tested + what worked / didn't. | -

Output Schema

Parameters (JSON Schema)
Name | Required | Description
result | No | Tool result payload (JSON object)

Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations indicate a write operation (readOnlyHint=false) that is not destructive or idempotent. Description adds beyond annotations by detailing that it triggers earnings, report, and client notification pipelines, and returns report id and pass/fail status. No contradiction.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Three concise sentences effectively convey purpose, trigger effects, and return values. Front-loaded with primary action, no superfluous text.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

With an output schema present, the description still adds value by naming specific return fields (report id, pass/fail status) and mentioning pipeline effects. For a 3-param tool with required fields, the description covers enough for an agent to understand invocation and outcomes.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 67% (most parameters have descriptions). The tool description does not add new meaning to input parameters, focusing instead on output. Baseline 3 is appropriate as schema already explains parameters sufficiently.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

Description clearly states the tool submits worker outcomes for a claimed job, using a specific verb and resource. It distinguishes itself from other submit_* siblings by mentioning the triggered pipeline (earnings, report, client notification) that matches the human checker submission process.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

Explicitly states 'for a claimed job', indicating prerequisite. Compares to human checker submission, providing context on when this tool is appropriate. No explicit when-not-to-use, but the context is clear enough for an agent.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

submit_test: Submit a test job (Grade A)

Queue a new TestMyVibes job for a given URL. You explicitly choose the runner: AI agent (headless Chromium + GPT-4o vision, fastest, deterministic for well-specified goals) or human checker (slower, better for visual/UX judgment calls). Returns a jobId you can poll with get_test_status.
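
To make the runner choice concrete, here is a minimal sketch of assembling `submit_test` arguments from the fields this page documents (`url`, `description`, `runner`, `goal`). The helper name is hypothetical, and silently dropping `goal` for human runs is a choice made by this sketch; the page only says the field applies to the AI runner.

```python
def build_submit_test_args(url, description, runner="ai", goal=None):
    """Assemble an arguments dict for submit_test (hypothetical helper)."""
    if runner not in ("ai", "human"):
        raise ValueError("runner must be 'ai' or 'human'")
    args = {"url": url, "description": description, "runner": runner}
    # `goal` is documented as AI-runner only, so it is only attached for AI runs.
    if goal and runner == "ai":
        args["goal"] = goal
    return args


ai_args = build_submit_test_args(
    "https://example.com/signup",
    "Verify the signup flow lands on the dashboard",
    goal="Reach a URL containing /dashboard",
)
print(ai_args)
```

Passing a concrete goal matters for AI runs: per the parameter notes, without one the agent spends its step budget exploring instead of completing a flow.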

Parameters
Name | Required | Description | Default
url | Yes | The URL to test. Must be publicly reachable. | -
goal | No | AI runner only. CONCRETE success criterion the agent stops on, e.g. 'Reach a URL containing /dashboard', 'See a Welcome banner on the page header', 'Receive an OTP email and submit the code'. Without a goal the AI runs out its step budget on exploration instead of completing a flow. | -
runner | No | Who runs the test. 'ai' = headless browser + GPT-4o vision agent (default; use for deterministic flows, signup/login, regression checks). 'human' = real human checker on TMV's panel (use for visual/UX judgment, complex flows the AI can't drive, accessibility passes). | ai
jobType | No | Test category; defaults to 'General QA'. Affects credit cost when billed. | General QA
priority | No | Job priority. | normal
targetOS | No | Human-runner advisory string for OS (e.g. 'iOS 17', 'Android 14', 'macOS 14'). Surfaced on the checker's claim card. | -
viewport | No | Explicit viewport for non-preset resolutions (e.g. {width: 2560, height: 1440} for a 27" desktop monitor). Wins over devicePreset only when devicePreset is NOT set. Use devicePreset for known phones/tablets and viewport for custom resolutions. | -
projectId | No | Existing project to attach this job to (optional). | -
offeringId | No | Marketplace offering id (browse via list_worker_offerings). When set, this job is priced + routed through the marketplace: the customer pays the offering's `creditsCharged` and the worker who fulfills the job earns the offering's pre-locked `workerPayoutCredits` (75% of charged). When omitted, the legacy personality/step-based pre-flight quote applies. | -
slaMinutes | No | Target turnaround in minutes (human runner only; AI runs finish in ~1-5 min regardless). | -
description | Yes | Plain-English description of what to test. The platform uses this to seed a checklist. | -
mcpEndpoint | No | MCP Auditor only. URL of the MCP server to audit (e.g. https://api.example.com/mcp). Pair with `personalityOfferingId: 'mcp-smoke'` or `'mcp-full-audit'`. The MCP Auditor runs JSON-RPC against this endpoint instead of opening a browser at `url`. | -
recordVideo | No | AI runner only. Opt-in WebM video recording of the entire browser session. When true, the worker captures a continuous screencast via Puppeteer and uploads it to Spaces; signed URL surfaced in get_test_results.aiReport.videoUrl. Free: no credit charge. 30-day retention same as step screenshots. Default off while we measure worker CPU impact; will flip to default true once production data justifies. | -
useSmsInbox | No | AI runner only. When true, TMV provisions a throwaway phone number from Paradise's SMS test-number pool (US/CA available) bound to this run. The agent uses it for any phone field, and `wait_for_sms` blocks until verification SMS arrive. Required for phone+OTP signup flows. Pool is finite: release reserves the number for ~15min then auto-releases. | -
devicePreset | No | Optional device emulation. Pass a Puppeteer KnownDevices name (e.g. 'iPhone 14 Pro', 'iPad Mini', 'Pixel 5', 'Galaxy S9+') and the AI agent runs the test as that device: proper viewport, touch events, user-agent, and DPR. No markup; this is the same Chromium with different emulation flags. Use list_device_presets to see the full 131-device catalog or the curated featured subset. For human runners this is advisory and surfaced on the checker's job card. | -
mcpTransport | No | MCP Auditor only. Transport protocol the customer's MCP server speaks. Most servers built with @modelcontextprotocol/sdk use streamable-http; older ones use sse. No stdio support (we don't run customer code in TMV's sandbox). | streamable-http
projectLabel | No | Audit label naming which of your projects submitted this test (e.g. 'pm-claude-code', 'shiftsee-claude-code'). Not used for auth. | -
targetDevice | No | Human-runner advisory string naming the device (e.g. 'iPhone 14 Pro', 'Pixel 7'). Surfaced on the checker's claim card so they know which device to test on. No effect for AI runners. | -
useTestInbox | No | AI runner only. When true, TMV provisions a per-job inbox at `<job-prefix>-<random>@inbox.testmyvibes.com` bound to this run. The agent uses it for any email field, and `wait_for_email` blocks until verification emails arrive. Required for OTP / email-verify flows; pointless for read-only tests. | -
mcpAuthHeader | No | MCP Auditor only. Optional auth header passed to the MCP endpoint (e.g. 'Bearer <token>', 'X-API-Key: <key>'). Format: 'HeaderName: value'. Used verbatim on every JSON-RPC request. | -
targetBrowser | No | Human-runner advisory string for browser (e.g. 'Safari', 'Chrome', 'Firefox'). Surfaced on the checker's claim card. | -
useFakeProfile | No | AI runner only. Adds depth to the test persona beyond default username/displayName/bio. 'basic' (+1 credit): generates a physicalProfile JSON (age, height, hair color, eye color, etc.) so any open-ended profile fields are filled with consistent realistic values. 'full' (+2 credits): basic + 2 photorealistic Flux Schnell photos uploaded to TMV Spaces and exposed to the agent as signed URLs for avatar / profile-image uploads. Skip this for read-only tests; use 'basic' for profile-completion tests; use 'full' for photo-required signup flows. | off
keepTestAccount | No | AI runner only. When false (default), signup tests end by deleting the account they created so customer user tables don't accumulate orphan rows. Set true to KEEP the account alive after the test; the persona's credentials are persisted so a later submit_test with `existingPersonaId` can sign in as a returning user (repeat-testing offering). Costs more (persona retention fee) but saves signup steps on every subsequent run. | -
smsInboxCountry | No | AI runner only. Used with useSmsInbox=true. Country code of the throwaway number to rent. US (default) covers most American/Canadian flows; CA needed for sites that gate by destination country. India is NOT available (Telnyx has no IN inventory). | US
targetScreenSize | No | Human-runner advisory string (e.g. '1920x1080', '390x844'). Stored on the job and surfaced on the checker's claim card. No effect for AI runners; use devicePreset or viewport instead. | -
agentInstructions | No | AI runner only. Verbal step-by-step the vision agent follows. Pin exact field values here (e.g. 'When asked for a name use "QA Tester"; when asked for a password use "TestPass!2026"'). Without this the agent invents values and tests become non-reproducible. By default these are advisory; set strictAgentInstructions=true to enforce them as hard rules. | -
existingPersonaId | No | AI runner only. Task #30 repeat-test. Set to the personaId of a previously-kept persona (from a job submitted with keepTestAccount=true). The worker skips provisioning + signup and instead reuses the persona's stored email + password to log straight in. Use this to exercise return-user flows (profile edits, dashboards, settings, follow-up actions) without paying for signup every time. Discounted -1 credit per run; persona retention itself costs 2 credits per 30-day window (first persona per project free). Call list_device_presets to see all device names. | -
personalityOfferingId | No | Personality menu offering id (browse via list_personality_offerings). Locks the step budget, inbox provisioning, personality, and price to the offering. Mutually exclusive with offeringId: offeringId routes to a worker; personalityOfferingId is an AI-only packaged test priced by TMV. | -
strictAgentInstructions | No | AI runner only. When true, agentInstructions are enforced with a stronger preamble + post-step self-check ("Did my last action violate any rule? If yes, reverse course before continuing"). Use for OTP / mid-form flows where one wrong click (extra OTP request, dropdown change after submit) invalidates state. Default false: instructions are advisory and the agent uses judgment. | -
expectedEmailFromContains | No | AI runner only. Pin the wait_for_email fromContains filter (substring of the sender address). Use when the sender domain isn't the obvious test target (e.g. delivered from sendgrid.net but the site is acme.com). | -
provisionTestCardFundedUsd | No | AI runner only. Mints a Stripe-Issuing test-payment card just-in-time when the worker picks up this job, funded to this USD amount. The PAN is held in-memory only: it never touches the Job record and is never returned to the caller. The AI agent receives it via the system prompt and types it at the customer's checkout. Card is frozen automatically at end of run (or 24h, whichever first). Billed at funded + 25% markup + $2 service fee. Currently sandbox-only: cards auth against Stripe test-mode merchants only until live activation lands. | -
expectedEmailSubjectContains | No | AI runner only. Pin a case-insensitive substring the AI agent MUST use as wait_for_email's subjectContains filter. Useful when your customer's verification email subject doesn't match the site name (e.g. site is 'newvibecity.com' but email subject is 'Newvibecityhotel sign-in code'). Without this, the agent guesses from the URL/brand and can timeout on wrong filters. Surfaced in the system prompt with strict instructions. | -
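The `provisionTestCardFundedUsd` entry states the card is "billed at funded + 25% markup + $2 service fee". As a worked illustration of that arithmetic, here is a small sketch. It assumes the markup applies to the funded amount only, the fee is flat, and no rounding is applied; the page does not confirm any of these details, and `card_charge_usd` is a hypothetical name.

```python
from decimal import Decimal

MARKUP_RATE = Decimal("0.25")   # "25% markup" from the parameter description
SERVICE_FEE_USD = Decimal("2")  # "$2 service fee" from the parameter description


def card_charge_usd(funded_usd: Decimal) -> Decimal:
    """Estimate the billed amount for a test card funded to funded_usd (assumed formula)."""
    if funded_usd <= 0:
        raise ValueError("funded amount must be positive")
    return funded_usd + funded_usd * MARKUP_RATE + SERVICE_FEE_USD


# A card funded to $20: 20 funded + 5 markup + 2 fee.
print(card_charge_usd(Decimal("20")))
```

Under these assumptions a $20 card is billed at $27. Whether the charge is taken in USD or converted to credits is not stated on this page.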

Output Schema

Parameters
Name | Required | Description
result | No | Tool result payload (JSON object)

Behavior3/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations indicate readOnlyHint=false, openWorldHint=true, and destructiveHint=false, which already convey mutation and potential side effects. The description adds 'Queue a new job' and polling, but does not elaborate on costs, provisioning side effects (e.g., test inbox, persona creation), or rate limits. Given the complexity and openWorldHint, more disclosure would improve transparency.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is three sentences long and front-loaded with the core purpose. It is concise, though bullet points for runner comparison could improve structure. Still, every sentence earns its place without waste.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness3/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool has 32 parameters, 2 required, nested objects, and an output schema, the description is brief and does not provide a high-level summary of all use cases or nuances like offeringId, devicePreset, etc. While it covers the core idea, it lacks completeness for such a complex tool.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 100%, so the schema already documents all 32 parameters in detail. The description does not add any parameter-specific information beyond referencing get_test_status. Baseline 3 is appropriate as no extra value is added.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool queues a new TestMyVibes job for a given URL, explicitly distinguishes between AI and human runners, and mentions the returned jobId for polling with get_test_status. This provides specific verb and resource, differentiating it from sibling tools.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides guidance on when to use AI vs human runners based on speed and judgment needs. However, it does not explicitly advise when not to use this tool or mention alternatives like submit_test_batch for batch submissions. The context is clear but lacks exclusions.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

submit_test_batch: Submit multiple AI tests as a parallel batch (grade A)

Queue up to 20 AI tests at once and run them in parallel instead of one-after-another. Each test in the batch costs 1.15× its base credits (the parallel premium). Returns the shared batchId and a per-test breakdown so you can poll each jobId individually. Use this when you have an independent set of tests to run (e.g. signup + login + dashboard + settings + delete across one customer site) and want them done in minutes rather than queued through a serial worker. AI runner only — human-runner batching ships separately.

Parameters
Name | Required | Description | Default
tests | Yes | Array of 2-20 test specs. Each item has the same shape as submit_test's inputs (AI runner). Tests run concurrently up to a worker concurrency limit of 3. | -
projectLabel | No | Audit label naming which of your projects submitted this batch (e.g. 'shiftsee-regression-suite'). | -
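To make the 1.15× parallel premium and the 2-20 batch-size limit concrete, here is a small sketch of a client-side estimate. The base credit values are made-up inputs, `batch_cost` is a hypothetical helper, and the page does not say how fractional credits are rounded, so no rounding is applied.

```python
from decimal import Decimal

PARALLEL_PREMIUM = Decimal("1.15")  # per-test multiplier from the tool description
MIN_BATCH, MAX_BATCH = 2, 20        # batch size limits from the tests parameter


def batch_cost(base_credits):
    """Return (per-test costs, total) for a batch, given each test's base credits."""
    if not MIN_BATCH <= len(base_credits) <= MAX_BATCH:
        raise ValueError(f"a batch needs {MIN_BATCH}-{MAX_BATCH} tests")
    per_test = [Decimal(str(c)) * PARALLEL_PREMIUM for c in base_credits]
    return per_test, sum(per_test, Decimal(0))


# Three tests with assumed base costs of 3, 5, and 2 credits.
per_test, total = batch_cost([3, 5, 2])
print(per_test, total)
```

With base costs summing to 10 credits, the estimated batch total is 11.5 credits, compared with 10 if the same tests were submitted one at a time.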

Output Schema

Parameters
Name | Required | Description
result | No | Tool result payload (JSON object)

Behavior4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Discloses cost premium (1.15× credits), return structure (batchId + per-test breakdown), and concurrency limit (3 workers). Annotations already indicate non-destructive write; description adds value with cost and parallelism details.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Compact single paragraph with key info front-loaded. No redundant sentences. Every sentence adds value.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Covers purpose, cost, return values, concurrency, and usage guidance. An output schema exists, and the description itself adequately describes what is returned. It could mention error handling or rate limits, but these are not essential.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema coverage is 100% and schema fully describes parameters. Description adds minimal value by linking to sibling tool's input shape. Baseline of 3 is appropriate.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

Description specifies verb 'submit', resource 'multiple AI tests', and distinguishes from serial submission. Mentions batch size limit (20) and AI runner exclusivity. Clearly separates from human-runner batching.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines5/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

Explicitly tells when to use it: for independent tests that should run in parallel for speed. It also states when not to: the tool is AI runner only, and human-runner batching ships separately. This gives an agent enough context to decide.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

synthesize_voiceover: Synthesize a script with Hume Octave, return audio URL (grade A)

Generates a voiceover from text using Hume Octave TTS. Audio uploaded to Spaces, signed URL returned (24h TTL by default). Charged in credits up-front based on script length (use quote_voiceover for a preview). Best for demo-video narration, tutorial audio, and any one-shot batch TTS. NOT a real-time conversational voice (use Hume EVI for that, different product). Voice options: pass voiceId for a specific Hume voice clone, or omit to use the deployment's default narrator (HUME_OCTAVE_VOICE_ID env var).

Parameters
Name | Required | Description | Default
text | Yes | Script text to read aloud. Max 5000 chars per call; split longer scripts. | -
voiceId | No | Hume voice id. Omit to use the deployment's default narrator. | -
description | No | Optional prosody steering, e.g. "warm and conversational, slight pause before the punchline". Biases delivery without changing the script. | -
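Because `text` is capped at 5000 characters per call, longer scripts have to be split before submission. The following is one possible sketch of a splitter that prefers sentence boundaries and falls back to spaces; the function name `split_script` and the splitting strategy are assumptions, not something TestMyVibes prescribes.

```python
MAX_CHARS = 5000  # per-call limit from the text parameter


def split_script(text, max_chars=MAX_CHARS):
    """Split text into chunks of at most max_chars, preferring sentence ends, then spaces."""
    chunks = []
    remaining = text.strip()
    while len(remaining) > max_chars:
        window = remaining[:max_chars]
        # Look for the last sentence-ending punctuation followed by a space.
        cut = max(window.rfind(". "), window.rfind("! "), window.rfind("? "))
        if cut != -1:
            cut += 1  # keep the punctuation mark with the current chunk
        else:
            cut = window.rfind(" ")  # fall back to the last word boundary
        if cut <= 0:
            cut = max_chars  # no break point at all: hard cut
        chunks.append(remaining[:cut].strip())
        remaining = remaining[cut:].strip()
    if remaining:
        chunks.append(remaining)
    return chunks


print(split_script("One. Two. Three.", max_chars=10))
```

Each chunk would then be sent as a separate synthesize_voiceover call. Since the tool charges by script length, quote_voiceover can be used on each chunk first to preview the cost.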

Output Schema

Parameters
Name | Required | Description
result | No | Tool result payload (JSON object)

Behavior5/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Beyond annotations (readOnlyHint=false, etc.), the description discloses key behaviors: audio is uploaded to Spaces, signed URL with 24h TTL, credits charged up-front, and voice options (voiceId or default). No contradictions with annotations.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is concise (six short sentences) with the most critical information front-loaded. Every sentence adds distinct value without redundancy or fluff.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's complexity (3 parameters, output schema exists), the description covers all necessary context: purpose, usage, behavioral details, and parameter nuances. It does not need to repeat the output schema as it is already declared.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters5/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Despite 100% schema coverage, the description adds valuable context: 'split longer scripts' for the text maxLength, 'omit to use default narrator' for voiceId, and 'biases delivery without changing script' for the description parameter. This enriches the schema-provided information.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states it generates a voiceover from text using Hume Octave TTS and returns an audio URL. It distinguishes itself from the sibling tool 'quote_voiceover' (cost preview) and from Hume EVI, a separate product for real-time conversation, making its purpose unambiguous and differentiated.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines5/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description explicitly states when to use this tool (demo-video narration, tutorial audio, one-shot batch TTS) and when not to (real-time conversation, pointing to Hume EVI as alternative). It also mentions using 'quote_voiceover' for a cost preview, providing clear usage boundaries.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

top_up_credits: Top up credits (Stripe Checkout) (grade A)

Buy more credits to fund test runs that TestMyVibes' agents will execute on your behalf. Returns a Stripe Checkout URL the user must open to complete payment (Stripe requires human payment completion per their agentic-commerce policy). Once the user pays, the credits are added automatically by the Stripe webhook — poll get_credit_balance to confirm.

Parameters
Name | Required | Description | Default
packIndex | Yes | Index of the credit pack from list_credit_packs. | -
cancelPath | No | Optional path on testmyvibes.com to redirect the user to if they cancel. Defaults to '/dashboard/billing?canceled=1'. | -
successPath | No | Optional path on testmyvibes.com to redirect the user to after a successful payment. Defaults to '/dashboard/billing?success=1'. | -
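The flow described above is asynchronous: the tool returns a Checkout URL, a human pays, and the credits arrive via webhook. A client might therefore poll the balance until it rises. The sketch below assumes a caller-supplied `get_balance` callable that wraps get_credit_balance and returns a number; the response shape of that tool is not documented here, so the wrapper, the timeout, and the interval are all hypothetical.

```python
import time


def wait_for_credits(get_balance, starting_balance, timeout_s=300.0, interval_s=5.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll get_balance() until it exceeds starting_balance or the timeout passes."""
    deadline = clock() + timeout_s
    while True:
        balance = get_balance()
        if balance > starting_balance:
            return balance  # the webhook has credited the account
        if clock() >= deadline:
            raise TimeoutError("credits were not added before the timeout")
        sleep(interval_s)


# Simulated balances standing in for successive get_credit_balance calls.
fake_balances = iter([10, 10, 60])
new_balance = wait_for_credits(lambda: next(fake_balances), starting_balance=10,
                               sleep=lambda seconds: None)
print(new_balance)
```

Injecting `sleep` and `clock` keeps the loop testable without real waiting; in real use the defaults apply.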

Output Schema

Parameters
Name | Required | Description
result | No | Tool result payload (JSON object)

Behavior5/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

The description adds significant behavioral context beyond annotations: it reveals that Stripe requires human payment completion due to their agentic-commerce policy, that credits are added automatically via webhook, and that the user must open the returned URL. This supplements the non-readOnly, non-destructive, non-idempotent annotations effectively.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description consists of three concise sentences, each providing essential information without redundancy. It is front-loaded with the purpose and efficiently communicates the workflow and behavioral details.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

The description covers everything needed: it states the return value (Stripe Checkout URL), explains the asynchronous credit addition, and directs the user to confirm via get_credit_balance. No critical information is missing, and the output schema existence is acknowledged.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

With 100% schema coverage, the baseline is 3. The description adds value by explaining that packIndex is an index from list_credit_packs and by detailing the defaults for cancelPath and successPath. This extra context justifies a score of 4.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose: 'Buy more credits to fund test runs.' It specifies the resource (credits), the action (top-up), and the return type (Stripe Checkout URL). This distinguishes it from siblings like get_credit_balance, which is for checking balance.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description explains when to use this tool: when the user needs more credits. It provides clear context by directing the user to poll get_credit_balance to confirm payment completion. While it does not explicitly state when not to use it, the instruction is clear.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

update_feedback: Update a feedback item's status / notes (staff only) (grade A)

Staff-only triage write. Move feedback through the state machine (new → triaged → planned/wontfix → in_progress → shipped), attach internal notes, or mark as a duplicate of another item. Returns the updated record.

Parameters
Name | Required | Description | Default
id | Yes | Feedback id from list_feedback. | -
status | No | - | -
duplicateOf | No | When status=duplicate, the id of the canonical feedback this collapses into. | -
internalNotes | No | Staff-only commentary. Appended to existing notes if any. | -
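The state machine in the description (new → triaged → planned/wontfix → in_progress → shipped) can be sketched as a transition table that a client checks before calling the tool. This is one reading of that arrow notation: it assumes `wontfix` and `shipped` are terminal, that only `planned` leads to `in_progress`, and that `duplicate` can be set from any state as long as `duplicateOf` is given. The server's actual rules may differ, and `build_update` is a hypothetical helper.

```python
# Assumed forward transitions, read from the tool description.
ALLOWED_TRANSITIONS = {
    "new": {"triaged"},
    "triaged": {"planned", "wontfix"},
    "planned": {"in_progress"},
    "in_progress": {"shipped"},
    "wontfix": set(),
    "shipped": set(),
}


def can_transition(current, target):
    """Return True if target is an assumed-valid next status after current."""
    return target in ALLOWED_TRANSITIONS.get(current, set())


def build_update(feedback_id, current, target, internal_notes=None, duplicate_of=None):
    """Build update_feedback arguments after a client-side sanity check."""
    if target == "duplicate":
        if not duplicate_of:
            raise ValueError("duplicateOf is needed when status is 'duplicate'")
    elif not can_transition(current, target):
        raise ValueError(f"unexpected transition: {current} -> {target}")
    arguments = {"id": feedback_id, "status": target}
    if duplicate_of:
        arguments["duplicateOf"] = duplicate_of
    if internal_notes:
        arguments["internalNotes"] = internal_notes
    return arguments


print(build_update("fb_1", "new", "triaged", internal_notes="Reproduced on staging."))
```

The ids and the note text are placeholders. A check like this only catches obvious mistakes early; the server remains the authority on which transitions are accepted.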

Output Schema

Parameters
Name | Required | Description
result | No | Tool result payload (JSON object)

Behavior5/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

The description adds significant behavioral context beyond annotations: details the state machine transitions, note appending behavior, and duplicate collapsing. It aligns with annotations (non-destructive write) and provides clear expectations.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Three concise sentences, each adding value: context (staff-only), operations (state machine, notes, duplicate), and output (returns record). No redundant or extraneous information.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's complexity (state machine, multiple operations, permissions), the description covers all key aspects. The presence of an output schema means return value is handled externally. No gaps identified.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters5/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

The description explains how each parameter is used in context (status for state machine, duplicateOf for marking, internalNotes for appending). With 75% schema coverage, the description complements the schema well, adding meaning beyond field names.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the verb (update, move, attach, mark), resource (feedback item), and scope (staff-only, state machine, duplicate). It distinguishes from sibling submit_feedback by emphasizing staff triage and state machine progression.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

It specifies the tool is for staff-only triage and lists operations (status change, notes, duplicates). While it doesn't explicitly say when not to use it, the context of 'staff-only' and the state machine progression implies appropriate use cases. An explicit alternative like 'submit_feedback' would improve clarity.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

upsert_worker_profile: Create or update your worker profile (grade A)

Idempotent create-or-update for the calling account's worker profile. Opt in to the marketplace by setting bio + specialties; opt out by setting isActive=false on every offering. External workers settle in credits/USD; in-house workers (TMV staff doing premium checks) are paid through ShiftSee payroll and require shiftseeUserId.

Parameters
Name | Required | Description | Default
bio | No | Worker bio shown on the marketplace menu. | -
languages | No | ISO-639 codes you can test in. Customers filter on this. | -
specialties | No | Tags like 'payments', 'i18n-spanish', 'react-spa' that route customer searches to you. | -
employmentType | No | 'external' = pay via credits/Stripe Connect. 'in_house' = TMV staff paid via ShiftSee payroll (requires shiftseeUserId). | external
qualifications | No | Verified badges, admin-curated; treated as free-form strings on input. | -
shiftseeUserId | No | Required when employmentType='in_house'. Maps to your ShiftSee user id for payroll routing. | -
defaultPayoutMode | No | External workers only. Default routing for cleared earnings: credit (TMV-internal) or usd (Stripe Connect). | -
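The parameters above carry two cross-field rules: `shiftseeUserId` is required for `in_house` workers, and `defaultPayoutMode` applies to external workers only. A client could check these before calling the tool, as in the sketch below. The helper `build_worker_profile` is hypothetical, and the payout values `credit` and `usd` are taken from the parameter description rather than a published enum.

```python
def build_worker_profile(bio=None, specialties=None, languages=None,
                         employment_type="external", shiftsee_user_id=None,
                         default_payout_mode=None):
    """Build upsert_worker_profile arguments with client-side cross-field checks."""
    if employment_type not in ("external", "in_house"):
        raise ValueError("employmentType must be 'external' or 'in_house'")
    if employment_type == "in_house" and not shiftsee_user_id:
        raise ValueError("shiftseeUserId is required for in_house workers")
    if default_payout_mode is not None:
        if employment_type != "external":
            raise ValueError("defaultPayoutMode applies to external workers only")
        if default_payout_mode not in ("credit", "usd"):
            raise ValueError("defaultPayoutMode must be 'credit' or 'usd'")
    arguments = {"employmentType": employment_type}
    # Only include optional fields the caller actually supplied.
    optional = {
        "bio": bio,
        "specialties": specialties,
        "languages": languages,
        "shiftseeUserId": shiftsee_user_id,
        "defaultPayoutMode": default_payout_mode,
    }
    arguments.update({key: value for key, value in optional.items() if value is not None})
    return arguments


profile = build_worker_profile(
    bio="Payments and checkout QA.",
    specialties=["payments", "react-spa"],
    languages=["en", "es"],
    default_payout_mode="usd",
)
print(profile)
```

Setting both bio and specialties, as here, matches the opt-in path the description mentions.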

Output Schema

Parameters
Name | Required | Description
result | No | Tool result payload (JSON object)

Behavior1/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

The description claims 'Idempotent' but the annotation idempotentHint=false contradicts this, a serious inconsistency. No other behavioral traits (e.g., side effects, required permissions) are disclosed beyond basic upsert behavior.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Three sentences covering purpose, opt-in/out, and employment types. Concise but the opt-out mention is slightly tangential. Mostly well-structured and front-loaded.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given 7 optional parameters, enums, and an output schema, the description covers main use cases and key constraints. Does not explain return values but output schema exists. Lacks details on limitations or rate limits, but enough for basic usage.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema coverage is 100% so baseline is 3. Description adds value by explaining the logic of employmentTypes, the need for shiftseeUserId when in_house, and the opt-in strategy (bio + specialties). Provides context beyond schema, such as settlement differences.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states it's an 'Idempotent create-or-update for the calling account's worker profile,' specifying the resource, action, and scope. It distinguishes from sibling tools like create_worker_offering by focusing on the profile itself, not offerings.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

Provides explicit guidance on when to use: opt-in by setting bio + specialties, opt-out via offerings (though indirectly). Explains the two employment types and their requirements (shiftseeUserId for in_house, defaultPayoutMode for external). Lacks a direct 'when not to use' statement but is clear in context.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

whoami: Identity + billing-mode self-check (grade A)
Read-only, Idempotent

Returns the calling account's id/email/role plus internal-use eligibility: whether the account is staff-flagged, which domains run free, and how a given target URL would be billed if you submitted a test now. Use this first when you bring TMV into a new project — it confirms the project's API key actually maps to the expected operator account.

Parameters
Name | Required | Description | Default
targetUrl | No | Optional URL to check billing-mode against (e.g. the project's homepage). When provided, the response includes the exact billing outcome for that target. | -

Output Schema

Parameters
Name | Required | Description
result | No | Tool result payload (JSON object)

Behavior5/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations already indicate readOnly, idempotent, and non-destructive behavior. The description adds specific information about what data is returned (role, staff flag, free domains, billing outcome) beyond the annotations, making the tool's behavior fully transparent.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is two sentences with no redundancy. The first sentence front-loads the return values, and the second sentence gives immediate usage guidance. Every sentence is necessary and well-placed.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

For a tool with one optional parameter and an existing output schema, the description covers all necessary aspects: what it returns, when to use it, and the effect of the parameter. No gaps remain.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters 4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema coverage is 100% with a full description for the optional 'targetUrl' parameter. The description reinforces that providing a URL gives the billing outcome, adding context that internal-use eligibility includes domain and billing checks, which goes slightly beyond the schema.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states that the tool returns the calling account's identity (id/email/role) and its internal-use eligibility (whether the account is staff-flagged, which domains run free, and the billing outcome for a target URL). This specific verb+resource combination distinguishes it from siblings like submit_test or list_projects.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description explicitly says 'Use this first when you bring TMV into a new project', providing a clear use case and context. It does not, however, discuss when not to use it or compare with alternatives, so it could be more detailed.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.


Resources