TestMyVibes

Server Details

MCP-native AI browser testing for coding agents. Submit a URL and a goal, and get back an action trail, bugs, screenshots, and a WebM video that your agent can patch from directly. 43 tools, 12 AI evaluation personalities, combo tiers with auto-pause-on-bugs, and throwaway email + SMS inboxes.

Status: Healthy
Last Tested: 2026-07-25 05:34
Transport: Streamable HTTP
Glama MCP Gateway

Connect through Glama MCP Gateway for full control over tool access and complete visibility into every call.

MCP client → Glama → MCP server

Full call logging

Every tool call is logged with complete inputs and outputs, so you can debug issues and audit what your agents are doing.

Tool access control

Enable or disable individual tools per connector, so you decide what your agents can and cannot do.

Managed credentials

Glama handles OAuth flows, token storage, and automatic rotation, so credentials never expire on your clients.

Usage analytics

See which tools your agents call, how often, and when, so you can understand usage patterns and catch anomalies.

100% free. Your data is private.

Tool Definition Quality

A (3.9/5.0)

Tool Descriptions: A

Average 4.3/5 across 43 of 43 tools scored. Lowest: 3.4/5.

Server Coherence: A

Disambiguation: 4/5

The tools cover a wide range of functionality, but each has a clearly distinct purpose. For example, submit_test, submit_test_batch, submit_combo, and submit_interaction_scene are all different types of submission with their own parameters. The sheer number of tools (43) might cause some initial confusion, but the descriptions resolve the ambiguity.

Naming Consistency: 5/5

All tool names follow a consistent verb_noun pattern in snake_case (e.g., list_projects, create_project, get_test_results). The only exception is 'whoami', which is a common idiom and does not break the pattern. Overall, naming is highly predictable.

Tool Count: 3/5

43 tools is on the high side for a single server. The domain is broad (testing, worker marketplace, credits, cards, feedback, video), so the count is justifiable. However, it borders on being overwhelming, and some tools could be consolidated (e.g., multiple submit_* variants).

Completeness: 3/5

The tool surface covers core workflows like project creation, test submission, result retrieval, worker management, and credit operations. However, there are gaps: no update or delete for projects, no delete for worker offerings, and no user-facing combo editing (though combos are predefined). These are minor but noticeable.

Available Tools

51 tools

assemble_demo_video | Assemble a multi-segment demo video (hands-off) | A

Stitches video clips + voiceover narration into a single MP4 published to Spaces. Each segment is one of: (a) videoUrl + narrationText (voiceover replaces video's audio track), (b) narrationText only (generates a brand-color title card sized to narration length), (c) videoUrl + audioUrl (drops in a pre-baked audio track). Returns a 24h signed URL to the final MP4. Use this for marketplace catalog submissions, tutorial videos, or any time you'd otherwise screen-record + iMovie by hand. Charged on success only; failed runs are free.

Parameters

Name	Required	Description	Default
`segments`	Yes	Ordered segment list. Concatenated in order. Max 12 segments / ~3 minutes total for the catalog use-case.
`publishAs`	No	Optional. When set, ALSO writes the final MP4 to a stable Spaces path at demo-videos/promoted/<publishAs>.mp4 so a public page (e.g. /ai/demo.mp4) can keep embedding the same URL forever. Re-running with the same publishAs key overwrites. Common values: "ai-landing" (powers /ai page), "anthropic-submission" (catalog submission).
`outputAspect`	No	16:9 for desktop / YouTube, 9:16 for mobile / TikTok / Shorts, 1:1 for square social.	16:9
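
As a rough illustration of the three segment shapes described above, the sketch below classifies each segment and assembles an arguments object for assemble_demo_video. The field names videoUrl, narrationText, audioUrl, publishAs, and outputAspect come from this listing. The helper names, the strict one-shape-per-segment check, and treating 12 segments as a hard limit are assumptions for illustration, not documented server behavior.

```python
from typing import Any, Optional

MAX_SEGMENTS = 12  # limit quoted for the catalog use-case; assumed to be a hard cap here
ASPECTS = {"16:9", "9:16", "1:1"}  # values listed for outputAspect


def classify_segment(segment: dict[str, Any]) -> str:
    """Map a segment dict to one of the three shapes in the tool description."""
    has_video = bool(segment.get("videoUrl"))
    has_narration = bool(segment.get("narrationText"))
    has_audio = bool(segment.get("audioUrl"))
    if has_video and has_narration and not has_audio:
        return "voiceover"       # (a) narration replaces the clip's audio track
    if has_narration and not has_video and not has_audio:
        return "title_card"      # (b) generated title card sized to the narration
    if has_video and has_audio and not has_narration:
        return "prebaked_audio"  # (c) pre-baked audio dropped onto the clip
    raise ValueError(f"segment matches none of the documented shapes: {segment!r}")


def build_arguments(
    segments: list[dict[str, Any]],
    output_aspect: str = "16:9",
    publish_as: Optional[str] = None,
) -> dict[str, Any]:
    """Validate locally, then return the arguments object for the tool call."""
    if len(segments) > MAX_SEGMENTS:
        raise ValueError(f"at most {MAX_SEGMENTS} segments, got {len(segments)}")
    if output_aspect not in ASPECTS:
        raise ValueError(f"unsupported outputAspect: {output_aspect}")
    for segment in segments:
        classify_segment(segment)  # raises on an unrecognized shape
    arguments: dict[str, Any] = {"segments": segments, "outputAspect": output_aspect}
    if publish_as is not None:
        arguments["publishAs"] = publish_as
    return arguments
```

Checking the shape client-side is a design choice: a malformed segment is rejected before any call is made, although the server remains the authority on what it accepts.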

Output Schema

Parameters

Name	Required	Description
`result`	No	Tool result payload (JSON object)

Tool Definition Quality

A (4.6/5.0)

Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations (readOnlyHint=false, destructiveHint=false) are consistent; description adds details like returning a 24h signed URL, charging on success only, and optional stable publishing path.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Single paragraph of five sentences, front-loaded with core operation, efficient with no extraneous text.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Covers segment types, constraints (max 12 segments, ~3 minutes), output (signed URL), use cases, and pricing; sufficient given presence of output schema.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 5/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema coverage is 100%, but description adds rich context: explains three segment types, defaults for voiceId, common videoUrl source, minDurationSec for title cards, and publishAs usage for persistent URLs.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool stitches video clips and voiceover into an MP4, specifies three segment types, and distinguishes from siblings like synthesize_voiceover by focusing on video assembly.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

Explicitly lists use cases (marketplace catalog submissions, tutorial videos) and mentions charging on success, but does not explicitly state when not to use or compare with sibling tools.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

cancel_test | Cancel a queued or running test | A

Cancel a test that is pending, claimed, in_progress, or expired. Paid credit jobs are refunded once; internal-use runs cancel without a credit refund. Use this when a site needs to be published or reconfigured before the test should continue.

Parameters

Name	Required	Description	Default
`jobId`	Yes	Job ID returned by submit_test, submit_combo, or the HTTP API.
`reason`	No	Optional short reason stored with the canceled job.
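
A minimal sketch of how a client might wrap this call, assuming it already knows the job's current status (for example from a prior status lookup). The four cancellable states and the jobId and reason parameters come from the description above. The JSON-RPC envelope follows the MCP tools/call convention; the helper name, the local status check, and the job ID "job_123" are illustrative assumptions.

```python
import json
from typing import Any, Optional

# States the description lists as cancellable.
CANCELLABLE_STATES = frozenset({"pending", "claimed", "in_progress", "expired"})


def build_cancel_request(
    job_id: str,
    status: str,
    reason: Optional[str] = None,
    request_id: int = 1,
) -> dict[str, Any]:
    """Build an MCP tools/call request for cancel_test, refusing other states locally."""
    if status not in CANCELLABLE_STATES:
        raise ValueError(f"job in state {status!r} is not listed as cancellable")
    arguments: dict[str, Any] = {"jobId": job_id}
    if reason:
        arguments["reason"] = reason  # optional short reason stored with the job
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "cancel_test", "arguments": arguments},
    }


if __name__ == "__main__":
    # "job_123" is a placeholder, not a real job ID.
    print(json.dumps(build_cancel_request("job_123", "pending", "site not published yet"), indent=2))
```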

Output Schema

Parameters

Name	Required	Description
`result`	No	Tool result payload (JSON object)

Tool Definition Quality

A (4.4/5.0)

Behavior: 4/5

Beyond annotations (readOnlyHint=false, destructiveHint=false), the description adds valuable context about refund behavior for paid vs internal runs. It does not contradict annotations.

Conciseness: 5/5

Two sentences with no wasted words: first states what the tool does, second provides usage context. Front-loaded with key action.

Completeness: 4/5

With an output schema present, the description need not detail return values. It covers applicable states and refund behavior, though it could mention whether cancellation is reversible.

Parameters: 4/5

Schema coverage is 100%, but the description adds meaning by specifying that jobId comes from submit_test, submit_combo, or the HTTP API, which clarifies the parameter's origin beyond the schema.

Purpose: 5/5

The description clearly states 'Cancel a test that is pending, claimed, in_progress, or expired,' specifying the exact resource and states. This distinguishes it from sibling tools like submit_test or claim_job.

Usage Guidelines: 4/5

Provides a clear use case: 'Use this when a site needs to be published or reconfigured before the test should continue.' However, it does not explicitly state when not to use it or mention alternative tools.

capture_screenshots | Capture screenshots of a URL across viewports | A

Drive a headless Chromium against a URL and return a screenshot for each requested viewport (mobile / tablet / desktop). Optional clickPaths lets you grab the state behind a sequence of clicks (e.g. ['Sign in', '#email', 'Continue']). Pricing: 1 credit per single viewport, 5 credits for the desktop+tablet+mobile triple (otherwise 1 × viewport count). Output: signed Spaces URLs valid for 7 days. Use this for marketing screenshots, design QA, regression-watch baselines — anything where you need pixels without a full AI test.

Parameters

Name	Required	Description
`url`	Yes	Public URL to capture. Must be reachable from TMV's outbound IP.
`settleMs`	No	Milliseconds to wait after navigation + each click before screenshotting. Default 1500ms covers most SPAs.
`viewports`	Yes	Which viewports to capture. 'mobile' = iPhone 14 (390×844), 'tablet' = iPad Air (820×1180), 'desktop' = 1440×900 laptop.
`clickPaths`	No	Optional sequence of selectors / visible text to click before screenshotting. Each entry applied in order; missing selectors are skipped (non-fatal).
`projectLabel`	No	Audit label naming which of your projects requested this capture.
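
The pricing rule in the description can be sketched as a small estimator. This reads the text literally: 5 credits when all three viewports are requested, otherwise 1 credit per viewport. The function name and the de-duplication of repeated viewport names are assumptions; the actual charge is whatever the server reports.

```python
from collections.abc import Iterable

# Viewport names documented for the `viewports` parameter.
FULL_TRIPLE = frozenset({"desktop", "tablet", "mobile"})


def estimate_credits(viewports: Iterable[str]) -> int:
    """Estimate the credit cost of one capture_screenshots call."""
    requested = set(viewports)  # assumption: duplicate names are charged once
    if not requested:
        raise ValueError("at least one viewport is required")
    unknown = requested - FULL_TRIPLE
    if unknown:
        raise ValueError(f"unknown viewports: {sorted(unknown)}")
    if requested == FULL_TRIPLE:
        return 5  # figure quoted for the desktop+tablet+mobile triple
    return len(requested)  # 1 credit per viewport otherwise
```

For example, estimate_credits(["mobile", "desktop"]) gives 2 under this reading, and the full triple gives 5.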

Output Schema

Parameters

Name	Required	Description
`result`	No	Tool result payload (JSON object)

Tool Definition Quality

A (4.7/5.0)

Behavior: 5/5

The description adds valuable behavioral context beyond annotations: it mentions headless Chromium, pricing model (1 credit per viewport, 5 for triple), output as signed Spaces URLs valid for 7 days, and click path behavior (non-fatal missing selectors). No contradiction with annotations.

Conciseness: 5/5

Four sentences, front-loaded with action and purpose, each sentence adds value (what, how, pricing, output, use cases). No superfluous content.

Completeness: 5/5

Given output schema exists, description need not detail return structure, but it does mention signed URLs with 7-day validity. All key aspects are covered: input url, viewports, optional clicks, timing, pricing, and intended use.

Parameters: 4/5

Schema coverage is 100%, baseline 3. Description adds meaning beyond schema by explaining pricing logic for viewports, that clickPaths are non-fatal for missing selectors, and provides default settleMs coverage for SPAs.

Purpose: 5/5

Description clearly states it drives headless Chromium to capture screenshots across specified viewports, with optional click paths. It distinctly differentiates from sibling tools (none are screenshot-related) and uses specific verbs ('Drive', 'return a screenshot').

Usage Guidelines: 4/5

Explicitly recommends use cases: marketing screenshots, design QA, regression baselines. While it doesn't list when not to use or alternatives, the sibling tools are sufficiently diverse and unrelated, making the guidance clear.

check_test_identity | Check a managed test identity | B

Read-only validation for a managed identity: ownership, expiry, target-site authorization, and whether credentials exist. Does not return the password.

Parameters

Name	Required	Description	Default
`targetUrl`	No
`testIdentityId`	Yes

Output Schema

Parameters

Name	Required	Description
`result`	No	Tool result payload (JSON object)

Tool Definition Quality

B (3.1/5.0)

Behavior: 1/5

The description claims the tool is 'read-only validation' but the annotations have readOnlyHint: false, a direct contradiction. No other behavioral traits are disclosed beyond the checks listed.

Conciseness: 5/5

Two short, focused sentences with no unnecessary words. The key information is front-loaded.

Completeness: 2/5

Despite having an output schema, the description omits parameter descriptions entirely, leaving a gap for a tool with 0% schema coverage. It also does not address when to use this vs. get_test_status or list_test_identities.

Parameters: 1/5

Schema description coverage is 0% and the description provides no explanation or additional meaning for the parameters (testIdentityId, targetUrl). The agent must guess their roles.

Purpose: 5/5

The description clearly states it is a read-only validation tool, listing specific checks (ownership, expiry, authorization, credentials existence) and explicitly says it does not return the password. This differentiates it from other tools like create_test_identity or get_test_status.

Usage Guidelines: 4/5

The description indicates usage context ('read-only validation') but does not explicitly state when not to use or mention alternative tools. However, the purpose is clear enough for an agent to infer appropriate use.

claim_job | Claim a job | A

Atomically take ownership of a pending job. Returns the full checklist the worker needs to walk through, plus the SLA deadline. After this call, the job is yours; submit results with submit_job_results when done, or it expires after the SLA and is returned to the queue.

Parameters

Name	Required	Description	Default
`jobId`	Yes	The jobId from list_available_jobs.
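
The worker flow implied by the description (list, claim, do the work, submit) might look like the sketch below. The tool names list_available_jobs, claim_job, and submit_job_results appear in this listing; everything else is assumed for illustration, including the injected call_tool function, the "jobs" key in the listing response, and the "results" argument to submit_job_results.

```python
from typing import Any, Callable, Optional

# Hypothetical transport: takes a tool name and arguments, returns the result payload.
ToolCaller = Callable[[str, dict[str, Any]], dict[str, Any]]


def claim_first_job(
    call_tool: ToolCaller,
    do_work: Callable[[dict[str, Any]], dict[str, Any]],
) -> Optional[dict[str, Any]]:
    """Claim the first available job, run do_work on the claim payload, submit results."""
    listing = call_tool("list_available_jobs", {})
    jobs = listing.get("jobs", [])  # assumed response key
    if not jobs:
        return None  # nothing to claim right now
    job_id = jobs[0]["jobId"]
    # After a successful claim the job is owned by this worker until the SLA deadline.
    claim = call_tool("claim_job", {"jobId": job_id})
    results = do_work(claim)
    # Assumed argument shape; consult the submit_job_results schema before relying on it.
    return call_tool("submit_job_results", {"jobId": job_id, "results": results})
```

Injecting call_tool keeps the sketch independent of any particular MCP client library and makes the sequence easy to exercise with a stub.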

Output Schema

Parameters

Name	Required	Description
`result`	No	Tool result payload (JSON object)

Tool Definition Quality

A (4.5/5.0)

Behavior: 4/5

The description reveals atomicity, the return of checklist and SLA deadline, and the consequence of expiry—all valuable beyond annotations (which only show non-readonly and non-destructive). No contradictions with annotations.

Conciseness: 5/5

Two sentences, front-loaded with the core action, followed by essential workflow details. Every sentence adds necessary information without redundancy.

Completeness: 5/5

Given the output schema exists (so return values are documented), the description covers the claim's effect, result contents, and post-claim expectations. For a simple one-parameter tool, this is complete.

Parameters: 4/5

Schema coverage is 100% for the single parameter 'jobId', which already includes a description. The description adds context by specifying that the jobId comes from 'list_available_jobs', helping the agent understand provenance. This lifts it above the baseline of 3.

Purpose: 5/5

The description uses specific verbs ('take ownership') and names the resource ('pending job'), clearly distinguishing it from siblings like 'list_available_jobs' (listing) and 'submit_job_results' (submitting). It fully captures the tool's atomic claim action.

Usage Guidelines: 4/5

The description contextualizes the tool by mentioning that the job comes from 'list_available_jobs' and that results must be submitted via 'submit_job_results', implying the workflow. It also warns about SLA expiry. However, it does not explicitly state when not to use the tool (e.g., if already claimed).

complete_checkout | Complete checkout (ChatGPT Instant Checkout) | A

Called by ChatGPT (or any agent runtime supporting Stripe's Shared Payment Token flow) after the user clicks Pay in an inline payment widget. Receives the SPT, charges it via Stripe, and credits the user's TMV account synchronously. The checkout_session_id is the Stripe Checkout Session ID returned by top_up_credits.

Parameters

Name	Required	Description
`buyer`	No
`payment_data`	Yes
`checkout_session_id`	Yes	Stripe Checkout Session ID minted by top_up_credits.

Output Schema

Parameters

Name	Required	Description
`result`	No	Tool result payload (JSON object)

Tool Definition Quality

A (4/5.0)

Behavior: 4/5

The description reveals that the tool charges via Stripe and credits the user's TMV account synchronously, which adds behavioral context beyond the annotations. Annotations indicate non-readOnly and non-destructive, but the description clarifies the specific side effects. No contradictions with annotations.

Conciseness: 5/5

The description is concise, consisting of three sentences that efficiently convey the purpose, trigger, and key parameter relationship. It is front-loaded with the core functionality and contains no unnecessary words.

Completeness: 3/5

Given the tool's complexity and the presence of an output schema, the description covers the essential flow and references the prerequisite tool. However, it lacks details on error handling, idempotency, or rate limits, which would be helpful for an AI agent. It is minimally adequate.

Parameters: 2/5

Schema description coverage is low (33%), and the tool description only adds context for the checkout_session_id parameter, mentioning it comes from top_up_credits. It does not provide additional meaning for buyer or payment_data beyond what the schema already includes. The description should compensate more for the low coverage.

Purpose: 5/5

The description clearly states the tool's function: completing a checkout by charging a Stripe payment token and crediting a TMV account. It specifies the trigger ('after user clicks Pay') and the source of the checkout_session_id, making the purpose unambiguous and distinguishing it from sibling tools like top_up_credits.

Usage Guidelines: 4/5

The description explicitly states when the tool is called ('after the user clicks Pay in an inline payment widget') and refers to the prerequisite call to top_up_credits. However, it does not explicitly mention when not to use it or alternative tools, leaving some room for improvement.

create_project | Create project | A

Parameters

Name	Required	Description	Default
`url`	Yes	The site's URL.
`name`	Yes	Display name for the project.
`defaultJobType`	No	Default test category.	General QA

Output Schema

Parameters

Name	Required	Description
`result`	No	Tool result payload (JSON object)

Tool Definition Quality

A (4/5.0)

Behavior: 3/5

Annotations already indicate it is a non-read-only, non-destructive, non-idempotent creation operation. The description adds the return value (projectId) but no further behavioral details like duplicate handling or validation.

Conciseness: 5/5

The description is a single, front-loaded sentence with no wasted words, efficiently conveying purpose and output.

Completeness: 4/5

Given the simple parameters and presence of an output schema (implied by returning projectId), the description is adequate. It could mention which parameter is the site identifier but is sufficient.

Parameters: 3/5

All parameters have descriptions in the schema (100% coverage), so the description adds no additional meaning. Baseline of 3 is appropriate.

Purpose: 5/5

The description clearly states the tool creates a new site for testing and returns a projectId for future use, distinguishing it from sibling tools like list_projects or submit_test.

Usage Guidelines: 4/5

The description explains the tool should be used before submit_test to get a projectId, but does not explicitly state when not to use it or suggest alternatives like reusing existing projects.

create_test_identity | Create a managed retained test identity | A

Creates a first-class managed test identity with a persistent TMV inbox and persona, scoped to this account and one customer-site origin. Use it before multi-run or multi-user tests when you need the same account history to survive. If username/password are omitted, submit_test with identityMode='reuse' and this testIdentityId will perform the first signup and save credentials on PASS.

Parameters

Name	Required	Description	Default
`label`	No
`password`	No	Optional existing password. Stored encrypted at rest and never returned by list tools.
`username`	No	Optional existing login username/email if the account already exists on the customer site.
`autoRenew`	No	When true, expiry is extended by use. Billing renewal enforcement is handled separately from this metadata flag.
`createdBy`	No	Who originally created this customer-site account. Use human for accounts created by a TMV human tester, ai for accounts created by an AI run, and external for credentials you already have.	external
`projectId`	No
`setupNotes`	No	Optional notes about the account setup, role, permissions, or fixture state.
`customerSite`	Yes	Customer site origin or URL this identity is allowed to test against.

Output Schema

Parameters

Name	Required	Description
`result`	No	Tool result payload (JSON object)

Tool Definition Quality

A (4.4/5.0)

Behavior: 4/5

Annotations indicate readOnlyHint=false (write), openWorldHint=true (side effects), destructiveHint=false. Description adds: 'persistent TMV inbox and persona', 'scoped to this account and one customer-site origin', and the deferred signup behavior when credentials omitted. It does not fully detail billing or concurrency but provides essential behavioral context beyond annotations. No contradictions.

Conciseness: 5/5

Description is two sentences: first sentence states core purpose and scope, second sentence gives usage guidelines with conditional workflow. It is front-loaded, efficient, and every sentence adds value with no waste.

Completeness: 4/5

Given 8 parameters, 1 required, annotations, and an output schema (exists), the description covers main purpose, usage context, and a key behavioral nuance (deferred signup). It does not explain all traits (e.g., behavior if identity already exists) but is fairly complete for a complex tool. The presence of an output schema reduces need for return value explanation.

Parameters: 3/5

Schema description coverage is 75% (6 of 8 parameters have descriptions). The two parameters without schema descriptions (label, projectId) are not explained in the tool description either, and the description adds no parameter semantics beyond the schema. A baseline of 3 is appropriate given the high coverage and the lack of additional parameter context.

Purpose: 5/5

Description clearly states 'Creates a first-class managed test identity with a persistent TMV inbox and persona, scoped to this account and one customer-site origin.' The verb 'creates' and specific resource 'managed test identity' with key attributes (persistent inbox, persona, scoped) distinguish it from siblings like check_test_identity, delete_test_identity, and list_test_identities. It also contrasts with submit_test via the usage note.

Usage Guidelines: 5/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

Explicitly states when to use: 'Use it before multi-run or multi-user tests when you need the same account history to survive.' It also provides an alternative workflow: if username/password omitted, submit_test with identityMode='reuse' will perform first signup. This is clear guidance on context and alternatives.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

create_worker_offering: Publish a new service offering (grade A)

Add a named, priced offering to your worker menu. Customers see name + description + creditsCharged + estDurationHr and pick directly. Worker earns 75% of credits charged (floor-rounded); TMV keeps 25%. Price must be a whole number of credits, ≥ 15. Until your account is uncapped (3 quality-scored jobs, OR 1 four-star+ customer review, OR $100 cleared earnings), the per-offering ceiling is 50 credits.

ParametersJSON Schema

Name	Required	Description
`name`	Yes
`description`	Yes
`specialties`	No
`estDurationHr`	No
`creditsCharged`	Yes	Whole-number credit price. Floor=15. New-worker ceiling=50.
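
The pricing rules stated above can be sketched as a pre-flight check an agent might run before calling create_worker_offering. This is a minimal illustration, not TMV's implementation: the function names are hypothetical, the sketch assumes the platform keeps whatever remains after the worker's floor-rounded 75%, and it assumes no ceiling applies once the account is uncapped (the description does not say).

```python
PRICE_FLOOR = 15           # minimum price in credits, per the description
NEW_WORKER_CEILING = 50    # per-offering cap until the account is uncapped
WORKER_SHARE_PERCENT = 75  # worker's share of credits charged


def validate_price(credits_charged, uncapped=False):
    """Raise ValueError if the price breaks the rules in the description."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(credits_charged, bool) or not isinstance(credits_charged, int):
        raise ValueError("creditsCharged must be a whole number of credits")
    if credits_charged < PRICE_FLOOR:
        raise ValueError(f"creditsCharged must be at least {PRICE_FLOOR}")
    # Assumption: no upper bound once the account is uncapped.
    if not uncapped and credits_charged > NEW_WORKER_CEILING:
        raise ValueError(f"capped accounts may charge at most {NEW_WORKER_CEILING}")


def split_credits(credits_charged):
    """Return (worker_credits, platform_credits), worker share floor-rounded."""
    worker = (credits_charged * WORKER_SHARE_PERCENT) // 100
    # Assumption: the rounding remainder stays with the platform.
    return worker, credits_charged - worker
```

For example, `split_credits(20)` gives the worker 15 credits and leaves 5, while a 15-credit offering floors the worker's 11.25 down to 11.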

Output Schema

ParametersJSON Schema

Name	Required	Description
`result`	No	Tool result payload (JSON object)

Tool Definition Quality

A4.7/5.0

Behavior5/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Beyond annotations (readOnlyHint=false, etc.), the description details economic behavior: worker earns 75%, TMV keeps 25%, floor-rounded credits, and per-offering ceiling conditions. It also explains the cap removal criteria, providing full behavioral context.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is five short sentences, each delivering essential information: purpose, customer visibility, revenue split, price floor, and the new-worker ceiling. It is front-loaded and concise without unnecessary detail.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given 5 parameters and an output schema, the description covers all critical aspects: what the tool does, how customers see it, economic split, pricing limits, and conditions for lifting caps. It is sufficient for an agent to correctly invoke the tool.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema coverage is low (20%: only creditsCharged is described), but the description adds meaning by explaining which parameters customers see (name, description, creditsCharged, estDurationHr). It restates the creditsCharged floor and new-worker ceiling from the schema and adds the revenue split and the uncap conditions, which the schema omits. However, the optional specialties parameter is not discussed.
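
As a rough illustration of how a coverage figure like the 20% above could be derived, the sketch below counts how many properties in a tool's input schema carry a non-empty description. The function name is hypothetical, the property types in the sample schema are assumptions (only the names and the one description come from the parameter table), and treating a parameterless schema as fully covered is a convention chosen here, not a documented rule.

```python
def description_coverage(schema):
    """Fraction of input-schema properties that have a non-empty description."""
    props = schema.get("properties", {})
    if not props:
        # Convention assumed here: nothing to describe counts as full coverage.
        return 1.0
    described = sum(
        1 for spec in props.values() if str(spec.get("description", "")).strip()
    )
    return described / len(props)


# Sample schema mirroring the create_worker_offering parameter table.
# The "type" values are guesses for illustration only.
OFFERING_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "description": {"type": "string"},
        "specialties": {"type": "array"},
        "estDurationHr": {"type": "number"},
        "creditsCharged": {
            "type": "integer",
            "description": "Whole-number credit price. Floor=15. New-worker ceiling=50.",
        },
    },
    "required": ["name", "description", "creditsCharged"],
}

coverage = description_coverage(OFFERING_SCHEMA)  # 1 of 5 properties described
```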

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool adds a named, priced offering to the worker menu, using specific verbs like 'Add' and 'publish'. It distinguishes from sibling creation tools (e.g., create_project, submit_job) by focusing on service offerings for workers.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description explains when to use the tool: to add an offering that customers see and pick. It provides pricing constraints (minimum 15 credits, ceiling 50 for new accounts) and notes the revenue split, but does not explicitly list when to avoid using it or alternatives.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

delete_test_identity: Delete a managed test identity (grade A)

Deletes TMV's retained credentials for a managed test identity. This does not guarantee deletion inside the customer app; run an account-deletion test first if you need customer-site cleanup.

ParametersJSON Schema

Name	Required	Description	Default
`testIdentityId`	Yes

Output Schema

ParametersJSON Schema

Name	Required	Description
`result`	No	Tool result payload (JSON object)

Tool Definition Quality

A4.6/5.0

Behavior5/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Discloses that the tool only removes TMV's credentials, not the identity itself in the customer app. This adds context beyond annotations (destructiveHint: false) by clarifying the scope of deletion, avoiding overstatement.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Two sentences only: the first states the primary action, the second adds critical nuance. No unnecessary words, front-loaded with purpose.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the single required parameter and the presence of an output schema (not shown), the description covers the essential nuance about scope. It is complete for a simple deletion tool.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters2/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

The sole parameter 'testIdentityId' has no description in the schema (0% coverage), and the tool description does not explain it. While the purpose makes it inferable, the description adds no direct parameter meaning.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

Description clearly states it deletes TMV's retained credentials for a managed test identity. This is a specific verb ('deletes') and resource ('retained credentials for managed test identity'), distinguishing it from siblings like 'check_test_identity' and 'create_test_identity'.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines5/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

Explicitly warns that it does not guarantee deletion inside the customer app and advises running an account-deletion test for full cleanup. This provides clear when-to-use and when-not-to-use guidance, with an alternative action.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

freeze_test_card: Cancel a previously-issued test card (grade A)

Destructive

Cancels the card (no further authorizations). Idempotent. Auto-freeze runs daily for any card past its 24h expiry; call this explicitly to freeze immediately after a successful checkout test.

ParametersJSON Schema

Name	Required	Description	Default
`cardId`	Yes	The test card ID returned by provision_test_card.
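
For orientation, a call to this tool over MCP is a JSON-RPC 2.0 `tools/call` request. The sketch below only builds the request body; the helper name and the `card_123` value are placeholders, and transport details (session headers, authentication) are left out because the listing does not describe them.

```python
import json


def tools_call_request(request_id, tool, arguments):
    """Serialize an MCP tools/call request as a JSON-RPC 2.0 message."""
    return json.dumps(
        {
            "jsonrpc": "2.0",
            "id": request_id,
            "method": "tools/call",
            "params": {"name": tool, "arguments": arguments},
        }
    )


# "card_123" is a placeholder for the ID returned by provision_test_card.
body = tools_call_request(1, "freeze_test_card", {"cardId": "card_123"})
```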

Output Schema

ParametersJSON Schema

Name	Required	Description
`result`	No	Tool result payload (JSON object)

Tool Definition Quality

A3.8/5.0

Behavior1/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Description claims 'Idempotent' but annotations set idempotentHint=false. This is a contradiction. Additionally, annotations have destructiveHint=true which matches the cancel action, but the idempotency claim is inconsistent and misleading.
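
A contradiction of this kind can be caught mechanically. The following is a simplified sketch of such a check, not the scorer's actual logic: the function name is hypothetical, the keyword match is deliberately naive, and a missing `idempotentHint` is treated as false, which is my reading of the MCP default.

```python
import re


def idempotency_conflict(description, annotations):
    """True if the description claims idempotency but the annotations do not."""
    claims = re.search(r"\bidempotent\b", description, re.IGNORECASE) is not None
    # Skip phrasings such as "not idempotent" or "non-idempotent".
    negated = (
        re.search(r"\b(?:not|non)[\s-]*idempotent\b", description, re.IGNORECASE)
        is not None
    )
    # Assumption: an absent hint means "not declared idempotent".
    hint = bool(annotations.get("idempotentHint", False))
    return claims and not negated and not hint


freeze_description = (
    "Cancels the card (no further authorizations). Idempotent. "
    "Auto-freeze runs daily for any card past its 24h expiry."
)
conflict = idempotency_conflict(
    freeze_description, {"destructiveHint": True, "idempotentHint": False}
)
```

Running a check like this at publish time would flag the mismatch before agents have to guess which source to trust.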

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The three short sentences are efficient and front-loaded with the core purpose. Each one adds value.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the annotations and the existence of an output schema, the description is largely complete for a simple mutation. It explains behavior, idempotency (though the claim contradicts the annotations), and usage context. It does not mention the return value, but the output schema likely covers that.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema covers 100% of parameters with adequate description for cardId. The tool description does not add extra parameter semantics beyond the schema, so baseline 3 is appropriate.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

Description clearly states the tool cancels a test card and prevents further authorizations. The verb 'Cancels' and resource 'card' are specific. It distinguishes from sibling tools like provision_test_card and list_test_cards.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines5/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

Explicit guidance on when to use: 'call this explicitly to freeze immediately after a successful checkout test.' Also mentions the alternative auto-freeze that runs daily for expired cards.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

get_combo_status: Rolled-up status for a submitted combo bundle (grade A)

Read-only, Idempotent

Single-call combo status: per-leg breakdown + cumulative bug counts across legs + pause-on-bugs-threshold proximity + estimated time remaining. Use this instead of polling N individual jobIds for a combo. Free.

ParametersJSON Schema

Name	Required	Description	Default
`comboId`	Yes	The comboId returned by submit_combo (one per combo run; distinct from each leg's jobId).

Output Schema

ParametersJSON Schema

Name	Required	Description
`result`	No	Tool result payload (JSON object)

Tool Definition Quality

A4.3/5.0

Behavior4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations already indicate read-only, idempotent, and non-destructive behavior. The description adds value by explaining the composite nature of the response, including cumulative bug counts and estimated time remaining. It also notes the tool is 'Free', providing cost transparency.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is two sentences plus a one-word cost note ('Free'), front-loaded with the tool's purpose and key features. Every word adds value, with no redundant or filler content.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the simple input (one parameter) and the presence of an output schema, the description provides sufficient context. It outlines the major output categories (breakdown, bug counts, threshold, time remaining) and usage guidance, making the tool's functionality clear.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema coverage is 100% with a detailed description of the single parameter 'comboId'. The tool description does not add further semantic meaning beyond what the schema already provides, so a baseline score of 3 is appropriate.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool returns a composite status for a combo bundle, listing specific output elements (per-leg breakdown, bug counts, pause threshold, time remaining). It distinguishes from siblings by explicitly advising to use this instead of polling individual jobIds.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description directly tells when to use the tool: 'Use this instead of polling N individual jobIds for a combo.' While it doesn't specify when not to use, this explicit guidance is strong and clear.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

get_credit_balance: Get credit balance (grade A)

Read-only, Idempotent

Get the current account's credit balance. Returns total valid credits, raw batches, and a warning flag if the balance is below the threshold a typical test costs.

ParametersJSON Schema

Name	Required	Description	Default
No parameters

Output Schema

ParametersJSON Schema

Name	Required	Description
`result`	No	Tool result payload (JSON object)

Tool Definition Quality

A4.6/5.0

Behavior5/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations already indicate the tool is read-only, idempotent, and non-destructive. The description adds valuable context about the return data (total valid credits, raw batches, warning flag), which goes beyond what annotations provide. There is no contradiction.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is two sentences long, concise, and front-loaded with the main purpose. Every sentence adds value and there is no extraneous information.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

For a simple read-only tool with no input parameters and an existing output schema (as indicated), the description covers the essential return fields and the threshold warning. It is complete enough for an AI agent to understand the tool's behavior.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters5/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

The tool has zero parameters, so no semantic information is needed. The description does not attempt to describe parameters, which is appropriate. The schema coverage is 100% by default, meeting the baseline expectation.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose: 'Get the current account's credit balance.' It specifies the resource (credit balance) and the action (get). This distinguishes it from sibling tools like top_up_credits (adds credits) and list_credit_packs (lists packs).

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines3/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description implies the tool is used to fetch the credit balance, but it does not explicitly state when to use this tool over alternatives or provide any usage restrictions. Since there are no direct alternatives for checking the balance, the lack of explicit guidance is acceptable but not ideal.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

get_integration_guide: Get TestMyVibes integration guide (grade A)

Read-only, Idempotent

Returns the canonical guide for using TMV from a coding-agent context. Covers the fix-test-retest loop, how to write a good test prompt, how to read the actionTrail / consoleErrors / failedRequests outputs, and common gotchas. Call this first if you're a new agent on a project — it'll save you a debug session. The same content is served at https://testmyvibes.com/docs/coding-agents.

ParametersJSON Schema

Name	Required	Description	Default
No parameters

Output Schema

ParametersJSON Schema

Name	Required	Description
`result`	No	Tool result payload (JSON object)

Tool Definition Quality

A4.9/5.0

Behavior5/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations already declare readOnlyHint, idempotentHint, and destructiveHint, but the description adds valuable context: it covers the fix-test-retest loop, writing test prompts, reading outputs, and common gotchas. It also mentions the same content is at a URL, which helps the agent understand the nature of the response.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Four sentences: purpose first, then a content summary, a usage hint, and a pointer to the web copy. Every sentence earns its place. No verbosity.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

With 0 parameters and a rich output schema (implied), combined with complete annotations, the description fully covers what the tool does, when to use it, and what to expect. No gaps.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

The input schema has zero parameters and schema coverage is 100%, so the description does not need to add parameter details. Baseline for 0 params is 4, and the description provides no additional parameter info, which is appropriate.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states 'Returns the canonical guide for using TMV from a coding-agent context.' It specifies the exact resource (guide) and action (returns), and distinguishes from sibling tools by positioning it as a first stop for new agents.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines5/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

Explicitly says 'Call this first if you're a new agent on a project' – providing a clear when-to-use directive. It also references common gotchas and outputs, implying context where it's valuable, and distinguishes from other tools by focusing on integration guidance.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

get_payout_setup_link: Get Stripe Connect onboarding link (payout setup) (grade A)

Read-only, Idempotent

Mint a Stripe Connect onboarding URL so the calling account can link a bank account and start receiving payouts. Opt-in to worker mode at the same time: pass workerKind='ai' if this account represents an AI agent operator (rather than a human checker). Returns a hosted URL the user opens to complete bank linking.

ParametersJSON Schema

Name	Required	Description	Default
`workerKind`	No	Worker classification. 'ai' for AI-agent operators (the common MCP case); 'human' for human checkers.	ai
`agentOperatorLabel`	No	Free-text label for the operator (e.g. 'Acme AI Agency'). Optional metadata only.
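
A small argument builder makes the two optional parameters concrete. This is only a sketch: the helper name is hypothetical, and rejecting values other than 'ai' and 'human' is an assumption based on the two values the schema description names.

```python
VALID_WORKER_KINDS = ("ai", "human")  # the two values named in the schema text


def payout_link_arguments(worker_kind="ai", agent_operator_label=None):
    """Build the arguments object for get_payout_setup_link."""
    if worker_kind not in VALID_WORKER_KINDS:
        raise ValueError(f"workerKind must be one of {VALID_WORKER_KINDS}")
    arguments = {"workerKind": worker_kind}
    # The label is optional metadata, so omit it when not provided.
    if agent_operator_label:
        arguments["agentOperatorLabel"] = agent_operator_label
    return arguments


default_args = payout_link_arguments()
labeled_args = payout_link_arguments("ai", "Acme AI Agency")
```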

Output Schema

ParametersJSON Schema

Name	Required	Description
`result`	No	Tool result payload (JSON object)

Tool Definition Quality

A4.5/5.0

Behavior4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations already declare readOnlyHint=true, idempotentHint=true, and destructiveHint=false. The description adds that the tool returns a hosted URL that the user must open to complete bank linking, providing behavioral context beyond the annotations. No contradictions noted.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is three sentences long, front-loads the primary purpose, and contains no extraneous information. Every word adds value, making it highly efficient and easy to parse.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's low complexity (2 optional parameters, no required fields, output schema present), the description covers the purpose, parameter usage, return behavior, and expected user action. No gaps remain for an agent to misuse the tool.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema coverage is 100%, so baseline is 3. The description adds meaningful guidance on when to pass workerKind='ai' (AI-agent operators rather than human checkers). It does not mention agentOperatorLabel, which the schema already marks as optional metadata. On balance this exceeds the schema's basic descriptions.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states that the tool generates a Stripe Connect onboarding URL for linking a bank account and starting payouts, with a specific verb ('Mint') and resource ('Stripe Connect onboarding link'). It also distinguishes the purpose of opting into worker mode via the workerKind parameter, making it highly specific.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides clear context for when to use the tool (to set up payouts and optionally opt into worker mode). Although no explicit alternative tools are mentioned, the sibling list does not contain a similar tool, so the absence of exclusions is not a drawback. The context is sufficient for correct selection.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

get_scene_status: Rolled-up status for an interaction scene (grade A)

Read-only, Idempotent

Single-call view of every role in a scene: per-role job status, signals fired so far, shared state. Use instead of polling N individual jobIds.

ParametersJSON Schema

Name	Required	Description	Default
`sceneId`	Yes	The sceneId returned by submit_interaction_scene.

Output Schema

ParametersJSON Schema

Name	Required	Description
`result`	No	Tool result payload (JSON object)

Tool Definition Quality

A3.9/5.0

Behavior3/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations already declare readOnlyHint=true and idempotentHint=true, so the description does not need to restate safety. It adds value by describing the response content but lacks details on error handling, latency, or what happens with invalid sceneIds, which would be helpful.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is two sentences, each earning its place: first sentence defines the tool's purpose and output, second provides usage guidance. No wasted words, front-loaded with key information.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's simplicity, the presence of an output schema, and comprehensive annotations, the description covers all necessary context: what the tool does, when to use it, and what the input parameter is. No gaps remain for effective use.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Input schema coverage is 100% and the description only indirectly explains sceneId by referencing its origin (from submit_interaction_scene). The description does not add significant meaning beyond the schema's existing description, resulting in a baseline score.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose4/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool provides a single-call view of every role in a scene, listing specific data (job status, signals, shared state). It distinguishes from polling individual jobs but does not explicitly differentiate from similar sibling tools like get_combo_status or get_test_status, leaving some ambiguity.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description explicitly states to use this tool instead of polling N individual jobIds, providing clear positive guidance. It does not specify when not to use it or mention alternatives among siblings, but the context is sufficient for the intended use case.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

get_test_results: Get test results (grade A)

Read-only, Idempotent

Fetch full results for a completed test: the checklist outcomes, the report summary, and any AI-generated analysis. Returns status='pending' if the test isn't done.

ParametersJSON Schema

Name	Required	Description	Default
`jobId`	Yes	The job ID.
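
Because the tool returns status='pending' until the test finishes, a caller typically wraps it in a bounded polling loop. The sketch below assumes a caller-supplied `call_tool(name, arguments)` function that returns the result payload as a dict, and assumes `status` sits at the top level of that payload; both are illustrative guesses rather than documented behavior.

```python
import time


def wait_for_results(call_tool, job_id, interval_s=5.0, max_attempts=60, sleep=time.sleep):
    """Poll get_test_results until the status is no longer 'pending'."""
    for attempt in range(max_attempts):
        result = call_tool("get_test_results", {"jobId": job_id})
        # Assumption: 'status' is a top-level key of the result payload.
        if result.get("status") != "pending":
            return result
        if attempt < max_attempts - 1:
            sleep(interval_s)  # wait before the next poll
    raise TimeoutError(f"job {job_id} still pending after {max_attempts} polls")
```

Passing `sleep` as a parameter keeps the loop easy to test without real delays; the interval and attempt limit are arbitrary defaults.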

Output Schema

ParametersJSON Schema

Name	Required	Description
`result`	No	Tool result payload (JSON object)

Tool Definition Quality

A4/5.0

Behavior4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations already declare readOnlyHint, idempotentHint, and destructiveHint, so the description adds value by disclosing the pending status behavior. No contradictions with annotations.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Two concise sentences that front-load the purpose and include a key behavioral detail. No wasted words.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

The description covers the core functionality and the pending state. With an output schema present, return values are not needed. It omits prerequisites like how to obtain jobId, but context is sufficient for a low-complexity tool.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema coverage is 100% with jobId described as 'The job ID.' The description does not add additional parameter semantics beyond what the schema provides, resulting in baseline score.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool fetches full results for a completed test, listing specific contents (checklist outcomes, report summary, AI analysis). It distinguishes from sibling tools like get_test_status by implying this is for full results.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines3/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description implies usage after a test is complete but does not explicitly guide when to use this tool versus alternatives like get_test_status. It mentions the pending status, which offers some context, but lacks explicit exclusions or alternative recommendations.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

get_test_status: Get test status (B)

Read-onlyIdempotent

Look up the current status of a submitted test job.

ParametersJSON Schema

Name	Required	Description	Default
`jobId`	Yes	The job ID returned by submit_test.
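
As a rough illustration of how a client might invoke this tool, the sketch below builds the JSON-RPC 2.0 `tools/call` request that MCP uses for tool invocation. The helper name `build_tool_call` and the job ID `job_123` are hypothetical; a real `jobId` comes back from submit_test.

```python
import json


def build_tool_call(name, arguments, request_id=1):
    """Build an MCP `tools/call` request (JSON-RPC 2.0 envelope)."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


# "job_123" is a placeholder; a real ID is returned by submit_test.
request = build_tool_call("get_test_status", {"jobId": "job_123"})
print(json.dumps(request, indent=2))
```

The same envelope works for any tool on the server; only `name` and `arguments` change.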

Output Schema

ParametersJSON Schema

Name	Required	Description
`result`	No	Tool result payload (JSON object)

Tool Definition Quality

B3.4/5.0

Behavior3/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations already declare readOnlyHint=true, idempotentHint=true, and destructiveHint=false, so the safety profile is clear. The description adds one behavioral detail (it returns the current status) but does not describe the response format or performance characteristics. With annotations doing most of the work, the description adds only marginal value.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

A single, concise sentence with no extraneous information. It is appropriately sized for a simple lookup tool and front-loads the key information.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool is simple with one parameter and has rich annotations and an output schema, the description is complete enough. It could mention that this is solely for status and not results, but overall it adequately serves the agent's needs.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema coverage is 100%, and the schema already describes jobId as 'The job ID returned by submit_test.' The description does not add any further semantic context for the parameter, so it meets the baseline expectation for high schema coverage.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose4/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the action (look up) and the resource (current status of a submitted test job). It distinguishes itself from other tools like get_test_results (which likely provides full results) and get_scene_status (different resource). However, it does not explicitly differentiate itself from other status lookups, leaving some ambiguity.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines2/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides no guidance on when to use this tool versus alternatives such as get_test_results or get_scene_status. There is no mention of context or prerequisites, leaving the agent to infer usage without support.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
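
One way an agent could combine get_test_status with get_test_results is a simple poll-then-fetch loop, sketched below. This is an assumption-heavy sketch, not the server's documented contract: the `status` key, the terminal values `completed` and `failed`, and the injected `call_tool` function are all hypothetical.

```python
import time

# Hypothetical terminal status values; the real ones are not documented here.
TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_results(call_tool, job_id, interval_s=5.0, max_polls=60, sleep=time.sleep):
    """Poll get_test_status until a terminal state, then fetch full results.

    `call_tool(name, arguments)` is an injected function that performs the
    actual MCP call and returns the tool's result payload as a dict.
    """
    for _ in range(max_polls):
        status = call_tool("get_test_status", {"jobId": job_id})
        if status.get("status") in TERMINAL_STATUSES:
            return call_tool("get_test_results", {"jobId": job_id})
        sleep(interval_s)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

Injecting `call_tool` and `sleep` keeps the loop independent of any particular MCP client library and makes it easy to exercise with a stub.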

get_worker_earnings: Get worker earnings (A)

Read-onlyIdempotent

Show the calling worker's payout balance: lifetime earned, lifetime paid-out, currently pending. Includes Stripe Connect status and whether the pending balance meets the auto-payout threshold.

ParametersJSON Schema

Name	Required	Description	Default
No parameters

Output Schema

ParametersJSON Schema

Name	Required	Description
`result`	No	Tool result payload (JSON object)

Tool Definition Quality

A4.1/5.0

Behavior3/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations already indicate readOnlyHint=true, destructiveHint=false, idempotentHint=true. The description adds no further behavioral traits such as authentication requirements or rate limits. It does not contradict annotations.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Two concise, well-structured sentences that list all key information without redundancy. Every phrase earns its place.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given zero parameters and the presence of an output schema, the description sufficiently covers the tool's purpose and the content of its response. No obvious gaps remain.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

There are no parameters, so the description does not add parameter-level meaning. However, it explains the output fields (lifetime earned, paid-out, pending, Stripe status, threshold), which is valuable beyond the empty schema. With 100% schema coverage (no params), a baseline of 3 applies, but the detail on output semantics elevates it.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool shows the calling worker's payout balance including lifetime earned, paid-out, pending, Stripe Connect status, and auto-payout threshold. It specifies the verb 'Show' and the resource 'worker earnings', effectively distinguishing it from sibling tools which involve other operations.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines3/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description implicitly indicates usage for checking the calling worker's personal earnings, but lacks explicit guidance on when to use this tool versus alternatives like get_credit_balance or get_combo_status. No when-not-to-use or alternative mentions are provided.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

list_available_jobs: List available jobs to claim (A)

Read-onlyIdempotent

Return the queue of pending jobs the calling worker could pick up. Excludes jobs owned by the calling account (you can't test your own site) and jobs already claimed by another worker. Returns the freshest jobs first.
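
The filtering and ordering described above can be modeled client-side as follows. This is only an illustrative sketch of the two stated exclusions and the freshest-first ordering; the field names `ownerAccountId`, `claimedBy`, and `createdAt` are hypothetical and not taken from the tool's output schema.

```python
def available_jobs(jobs, caller_account_id, caller_worker_id, max_results=None):
    """Model the two stated exclusions, then sort freshest first."""
    visible = [
        job
        for job in jobs
        # Exclusion 1: jobs owned by the calling account.
        if job["ownerAccountId"] != caller_account_id
        # Exclusion 2: jobs already claimed by another worker.
        and job.get("claimedBy") in (None, caller_worker_id)
    ]
    # createdAt is assumed to be a sortable timestamp (epoch seconds here).
    visible.sort(key=lambda job: job["createdAt"], reverse=True)
    return visible if max_results is None else visible[:max_results]


# Made-up sample data for illustration only.
jobs = [
    {"id": "a", "ownerAccountId": "acct_me", "claimedBy": None, "createdAt": 300},
    {"id": "b", "ownerAccountId": "acct_x", "claimedBy": "worker_other", "createdAt": 200},
    {"id": "c", "ownerAccountId": "acct_x", "claimedBy": None, "createdAt": 100},
    {"id": "d", "ownerAccountId": "acct_y", "claimedBy": None, "createdAt": 250},
]
queue = available_jobs(jobs, "acct_me", "worker_me")
print([job["id"] for job in queue])
```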

ParametersJSON Schema

Name	Required	Description	Default
`jobType`	No	Filter to a single job-type label (e.g. 'General QA'). Omit to see all types.
`maxResults`	No

Output Schema

ParametersJSON Schema

Name	Required	Description
`result`	No	Tool result payload (JSON object)

Tool Definition Quality

A4.4/5.0

Behavior5/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations already indicate readOnlyHint, idempotentHint, and destructiveHint. The description adds that it returns the freshest jobs first and excludes owned/claimed jobs, providing context beyond annotations. No contradiction.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Three sentences, each providing essential information without redundancy. Front-loaded with the core action. No wasted words.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

The tool has an output schema (not provided but referenced), so return values need not be explained. It covers filtering and ordering. Minor omission: no mention of pagination, but the maxResults parameter implies a limit. Overall adequate for a simple list tool.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema coverage is 50% (one parameter described, one not). The description does not add meaning beyond the schema; the schema already explains 'jobType' filter and 'maxResults' default/max/min. Hence baseline 3.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool returns the queue of pending jobs available for the calling worker to pick up, specifying exclusions (jobs owned by the calling account or already claimed). It distinguishes itself from siblings like 'list_my_claimed_jobs' by noting that owned jobs are excluded.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description implies usage context for listing available jobs to claim, and implicitly contrasts with 'list_my_claimed_jobs' by excluding owned jobs. However, it does not explicitly state when not to use this tool or provide direct alternatives.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

list_combos: List packaged AI-agent combo bundles (A)

Read-onlyIdempotent

Returns the full combo catalog: cheap pre-launch smoke checks → the Whole Kit & Kaboodle (named personalities, specialized auditors, and viewport passes). Each combo lists its legs, estimated credit cost (recomputed from the live personality catalog), estimated duration, the bug-threshold that auto-pauses + refunds the remaining legs, and (where applicable) the cheaper combo we recommend running first. Read-only; charges nothing.

ParametersJSON Schema

Name	Required	Description	Default
No parameters

Output Schema

ParametersJSON Schema

Name	Required	Description
`result`	No	Tool result payload (JSON object)

Tool Definition Quality

A4.2/5.0

Behavior4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations already declare readOnlyHint=true and destructiveHint=false. The description adds value by specifying that it 'charges nothing' and detailing the returned fields (cost, duration, thresholds), providing additional behavioral context.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is three sentences, the second of which is long and dense. That sentence could be split for better readability, but the description is not overly verbose.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given no parameters and the existence of an output schema, the description fully explains what the tool returns (legs, cost, duration, bug-threshold, cheaper recommendation), leaving no gaps for a list operation.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

With zero parameters and 100% schema coverage, the description compensates fully by explaining what the output contains, making the tool's semantics clear without needing parameter details.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states 'Returns the full combo catalog' and details the contents (cheap pre-launch smoke checks, Whole Kit & Kaboodle), distinguishing it from sibling tools like get_combo_status which targets a single combo.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines3/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description implies usage for catalog browsing but does not explicitly state when to use this tool vs alternatives or when not to use it. However, the purpose is clear enough given the sibling context.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

list_credit_packs: List credit packs (A)

Read-onlyIdempotent

List the credit packs available for purchase. Returns pack index, credit count, USD price, and per-credit cost. Use the returned packIndex with top_up_credits.

ParametersJSON Schema

Name	Required	Description	Default
No parameters
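
To show how the returned fields might be used before calling top_up_credits, here is a minimal sketch that picks the cheapest pack covering a required number of credits. Only `packIndex` is named in the description; the keys `credits` and `priceUsd`, and the sample prices, are made up for illustration.

```python
def per_credit_cost(pack):
    """USD price divided by credit count."""
    return pack["priceUsd"] / pack["credits"]


def pick_pack(packs, credits_needed):
    """Return the cheapest pack with at least `credits_needed` credits, or None."""
    eligible = [p for p in packs if p["credits"] >= credits_needed]
    if not eligible:
        return None
    return min(eligible, key=lambda p: p["priceUsd"])


# Hypothetical catalog; real values come from list_credit_packs.
packs = [
    {"packIndex": 0, "credits": 50, "priceUsd": 5.00},
    {"packIndex": 1, "credits": 200, "priceUsd": 16.00},
    {"packIndex": 2, "credits": 1000, "priceUsd": 60.00},
]
chosen = pick_pack(packs, credits_needed=120)
# The chosen packIndex is what would be passed on to top_up_credits.
print(chosen["packIndex"], round(per_credit_cost(chosen), 4))
```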

Output Schema

ParametersJSON Schema

Name	Required	Description
`result`	No	Tool result payload (JSON object)

Tool Definition Quality

A4.5/5.0

Behavior4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations already declare readOnlyHint=true and destructiveHint=false, so the description adds value by specifying the exact fields returned (pack index, credit count, USD price, per-credit cost). No contradiction with annotations.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Two sentences pack all necessary information without redundancy. The critical action (list) and key return fields are front-loaded, and every sentence contributes value.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given no parameters and an output schema (implied), the description is complete: it states what the tool does, what it returns, and how to use the result. No missing context.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

There are no parameters and schema coverage is 100%. The description adds meaning by detailing the output structure, which compensates for the absence of parameters. Baseline for 0 parameters is 4.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states it lists credit packs for purchase, with the verb 'List' and specific resource 'credit packs'. It distinguishes itself from sibling tools like top_up_credits by indicating the returned packIndex is used with that tool.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

It explicitly advises using the returned packIndex with top_up_credits, providing clear usage context. While it doesn't list when not to use it, the tool's simplicity and zero parameters make this guidance sufficient.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

list_device_presets: List available device emulation presets (A)

Read-onlyIdempotent

Returns the device presets you can pass as devicePreset on submit_test / submit_test_batch / retest_job. Each entry includes viewport width/height, deviceScaleFactor, isMobile, and hasTouch so the AI agent (and you) can pick the right one. Free — emulation runs as part of the base test cost, no markup. Use featuredOnly=true for the 15 most common phones/tablets; pass featuredOnly=false to see all 131.

ParametersJSON Schema

Name	Required	Description	Default
`featuredOnly`	No	When true (default) returns the curated featured subset (~15 modern phones/tablets). When false returns all 131 Puppeteer-bundled devices.
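
A client could use the per-preset fields to choose a device programmatically. The sketch below assumes each entry is a flat object with `name`, `width`, `height`, `deviceScaleFactor`, `isMobile`, and `hasTouch` keys; the actual payload shape, the device names, and which value is passed as devicePreset are not specified here, so treat all of them as hypothetical.

```python
def pick_mobile_preset(presets, max_width):
    """Return the widest touch-enabled mobile preset no wider than max_width."""
    candidates = [
        p
        for p in presets
        if p["isMobile"] and p["hasTouch"] and p["width"] <= max_width
    ]
    return max(candidates, key=lambda p: p["width"], default=None)


# Made-up entries standing in for list_device_presets output.
presets = [
    {"name": "Phone A", "width": 390, "height": 844,
     "deviceScaleFactor": 3, "isMobile": True, "hasTouch": True},
    {"name": "Tablet B", "width": 820, "height": 1180,
     "deviceScaleFactor": 2, "isMobile": True, "hasTouch": True},
    {"name": "Desktop C", "width": 1280, "height": 800,
     "deviceScaleFactor": 1, "isMobile": False, "hasTouch": False},
]
phone = pick_mobile_preset(presets, max_width=400)
print(phone["name"])
```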

Output Schema

ParametersJSON Schema

Name	Required	Description
`result`	No	Tool result payload (JSON object)

Tool Definition Quality

A4.5/5.0

Behavior4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations already declare readOnlyHint=true, idempotentHint=true, and destructiveHint=false, so the description adds value by noting the tool is free and that emulation runs as part of the base test cost with no markup. This provides behavioral context beyond safety profiles.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is four sentences long, front-loaded with its purpose ('Returns the device presets'). Every sentence adds essential detail with no waste.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

The output schema exists, so return values are covered. The description still adds useful context about what each entry includes (viewport, deviceScaleFactor, etc.) and mentions cost implications. Given the simplicity of the tool, this is thoroughly complete.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema coverage is 100%, so baseline is 3. The description adds meaning by explaining the effect of the featuredOnly parameter: returning a curated subset vs. all devices. This goes beyond the schema description's generic phrasing.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

Description clearly states the tool returns device presets for use in submit_test, submit_test_batch, and retest_job. It specifies the resource (device presets) and the action (list), and distinguishes from sibling submission tools by being a preparatory listing action.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides explicit guidance on when to use each parameter value: use featuredOnly=true for the 15 most common devices and featuredOnly=false for all 131. It also implies the tool should be used before submitting tests, though it does not explicitly state alternatives or when not to use.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

list_feedback: List queued feedback (staff only) (A)

Read-onlyIdempotent

Staff-only triage view. Returns feedback items optionally filtered by status / category / since. Use status="new" at session start to see what came in unaddressed. Returns most recent first.

ParametersJSON Schema

Name	Required	Description
`limit`	No
`status`	No	Filter by status. Omit to see all.
`category`	No
`severity`	No
`sinceIso`	No	ISO 8601 timestamp — only return items created since.
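
For the `sinceIso` filter, a caller needs an ISO 8601 timestamp. The sketch below builds the arguments for a "new items from the last N hours" query using Python's standard `datetime` module. The helper name is hypothetical, and whether the server prefers a `+00:00` offset or a `Z` suffix is not stated, so that detail is an assumption.

```python
from datetime import datetime, timedelta, timezone


def feedback_args(now, hours_back=24, status="new"):
    """Build list_feedback arguments for items created in the last N hours."""
    since = (now - timedelta(hours=hours_back)).astimezone(timezone.utc)
    return {
        "status": status,
        # ISO 8601 in UTC, e.g. 2026-07-24T05:34:00+00:00
        "sinceIso": since.isoformat(timespec="seconds"),
    }


args = feedback_args(datetime(2026, 7, 25, 5, 34, tzinfo=timezone.utc))
print(args)
```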

Output Schema

ParametersJSON Schema

Name	Required	Description
`result`	No	Tool result payload (JSON object)

Tool Definition Quality

A4.2/5.0

Behavior4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations already declare readOnlyHint, idempotentHint, and destructiveHint. Beyond that, the description adds behavioral context: staff-only access, triage view, and default ordering (most recent first). This provides sufficient transparency for safe invocation.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is four short sentences, each serving a distinct purpose: stating the tool's role, listing filtering options, giving a real-world usage example, and stating the result ordering. No redundant or vague phrases.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given that an output schema exists (context signal: has output schema = true), the description is not required to detail return fields. It covers purpose, filtering, ordering, and usage tip; the only minor omission is the lack of mention of pagination via the limit parameter, but it is acceptable.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema coverage is only 40% (2 out of 5 parameters have descriptions). The description mentions status, category, and sinceIso, adding meaning to those, but does not cover limit or severity. It partially compensates for the low coverage but still leaves gaps.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states it is a staff-only triage view that returns feedback items, optionally filtered by status/category/since. It explicitly distinguishes itself from sibling mutating tools like submit_feedback or update_feedback by emphasizing read-only triage.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides a concrete usage guideline: 'Use status="new" at session start to see what came in unaddressed.' This tells the agent when to apply the filter, though it does not explicitly mention when not to use or alternative tools.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

list_kept_personas: List your kept-alive test personas (A)

Read-onlyIdempotent

Returns every persona this account has kept alive (created via submit_test with keepTestAccount=true and successfully signed back in). Each entry includes the personaId you'd pass as existingPersonaId to retest the same user, plus the originating customer site and credential expiry. Personas auto-expire 30 days after their last use; each successful retest bumps the expiry.

ParametersJSON Schema

Name	Required	Description	Default
No parameters
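
The expiry rule in the description (30 days after last use, bumped by each successful retest) can be expressed as a small calculation. This is a sketch of the stated rule only; the function names are hypothetical, and the real payload reports credential expiry directly, so a client would normally not need to recompute it.

```python
from datetime import datetime, timedelta, timezone

PERSONA_TTL = timedelta(days=30)  # from the description: 30 days after last use


def expires_at(last_used):
    """Expiry moves forward whenever last_used is bumped by a successful retest."""
    return last_used + PERSONA_TTL


def is_expired(last_used, now):
    return now >= expires_at(last_used)


last_used = datetime(2026, 7, 1, tzinfo=timezone.utc)
print(expires_at(last_used).date())
print(is_expired(last_used, datetime(2026, 7, 25, tzinfo=timezone.utc)))
```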

Output Schema

ParametersJSON Schema

Name	Required	Description
`result`	No	Tool result payload (JSON object)

Tool Definition Quality

A4.7/5.0

Behavior5/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

The description adds significant behavioral context beyond annotations (readOnlyHint, idempotentHint, destructiveHint): auto-expiry after 30 days, retest bumps expiry, and what fields are returned. No contradiction with annotations.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Three sentences, each earning its place: first states the action, second describes the return data structure, third explains auto-expiry behavior. No filler or redundancy.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

The description fully explains the purpose, output (personaId, customer site, credential expiry), and key behavior (auto-expiry, retest bump). Given 0 parameters and an existing output schema, this is complete.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

There are 0 parameters and schema coverage is 100% (empty object), so no parameter documentation is needed. The baseline for 0 parameters is 4, and the description does not add unnecessary parameter info.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states it returns all kept-alive personas, specifying their origin (submit_test with keepTestAccount=true) and what each entry includes (personaId, customer site, credential expiry). This distinguishes it from sibling list tools like list_test_cards or list_projects.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description explains when to use the tool (to retrieve kept-alive personas for retesting) and provides important context about auto-expiry and retest behavior. It does not explicitly list alternatives or when not to use, but the context is clear enough.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

list_my_claimed_jobs: List my claimed jobs (A)

Read-onlyIdempotent

Jobs this worker currently holds: claimed but not yet submitted, plus in-progress.

ParametersJSON Schema

Name	Required	Description	Default
No parameters

Output Schema

ParametersJSON Schema

Name	Required	Description
`result`	No	Tool result payload (JSON object)

Tool Definition Quality

A4.3/5.0

Behavior4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations already declare readOnlyHint, idempotentHint, and destructiveHint. The description adds behavioral context beyond annotations by specifying the included job states (claimed but not submitted, plus in-progress). No contradictions with annotations.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is a single concise sentence that front-loads the purpose and scope. Every part is essential, with no wasted words.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool has no parameters, annotations cover safety and idempotency, and an output schema exists, the description is complete. It clearly defines the set of jobs returned without missing critical information.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

The input schema has zero parameters, so the baseline is 4. The description does not need to provide parameter details as none exist.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description uses the specific verb 'list' and resource 'my claimed jobs', clarifying it returns jobs the worker currently holds. It distinguishes from sibling tools like 'list_available_jobs' and 'claim_job' by specifying the state: claimed but not submitted plus in-progress.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines3/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description implies when to use this tool (to see current workload) but does not explicitly contrast with alternatives like 'list_available_jobs' for unclaimed jobs. No explicit when-not-to-use or prerequisite information is provided.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

list_personality_offerings: Browse the AI personality menu (A)

Read-onlyIdempotent

List all packaged AI tests TMV publishes — each personality has 0..N service offerings (e.g. 'First-Time User • Signup gauntlet • 10 credits • ~5min'). Submit one back to submit_test as personalityOfferingId and the run's step budget, inbox provisioning, personality, and price are all locked to the offering preset. Use this when you want a deterministic, named test product rather than tuning maxSteps / useTestInbox by hand.

ParametersJSON Schema

Name	Required	Description	Default
`tier`	No	Optional filter by personality tier.
`maxCredits`	No	Optional ceiling on credit price (e.g. 10 → exclude offerings that cost more).

Output Schema

ParametersJSON Schema

Name	Required	Description
`result`	No	Tool result payload (JSON object)

Tool Definition Quality

A4.2/5.0

Behavior4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations already declare readOnly, idempotent, non-destructive. Description adds that returned offerings lock step budget, inbox provisioning, personality, and price – useful behavioral context beyond annotations.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Three sentences, no fluff. Front-loaded with purpose, then explains usage flow, then gives recommendation. Every sentence earns its place.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

An output schema exists, so the description need not detail return values. It covers what the tool returns and how the result feeds into submit_test, which is adequate for a simple filtered list tool.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 100%: both parameters are described in the schema ('Optional filter by personality tier' and an optional 'ceiling on credit price'). The tool description itself mentions neither parameter, so it adds no meaning beyond the schema's own descriptions.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

Description specifies 'List all packaged AI tests TMV publishes' – clear verb+resource. Distinguishes from sibling 'submit_test' by explaining the role of the returned ID. Also distinct from other list tools by focusing on personality offerings with presets.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

Explicitly says 'Use this when you want a deterministic, named test product rather than tuning maxSteps / useTestInbox by hand.' Provides context on when to choose this tool over manual configuration. Doesn't state when not to use, but implication is clear.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

list_projectsList projectsA

Read-onlyIdempotent

List the projects (sites under test) registered to this account.

ParametersJSON Schema

Name	Required	Description	Default
No parameters

Output Schema

ParametersJSON Schema

Name	Required	Description
`result`	No	Tool result payload (JSON object)

Tool Definition Quality

A4.1/5.0

Behavior3/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations already provide readOnlyHint=true and idempotentHint=true, so the description adds minimal behavioral context beyond clarifying that projects are 'sites under test'. No contradictions.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

A single, clear sentence with no unnecessary words. It is front-loaded and efficient.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

For a parameterless list tool with full annotation coverage and an output schema, the description is complete. It explains the resource being listed and serves its purpose.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

The input schema has no parameters, so baseline is 4. The description does not add parameter info, which is appropriate as there are none.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states it lists projects, specifying they are 'sites under test'. This distinguishes it from other list tools like list_test_cards or list_feedback, which list different resources.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines3/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

No explicit guidance on when to use this tool versus alternatives, but given its simplicity and parameterless nature, the context is implied. It does not mention exclusions or alternatives.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

list_test_cardsList test cards issued to this accountA

Read-onlyIdempotent

Audit view of every test card this account has minted. PANs are NEVER returned (we don't persist them) — only last4 + funded amount + status + expiry. Useful for reconciling Stripe Issuing balance against TMV spend.

ParametersJSON Schema

Name	Required	Description	Default
No parameters

Output Schema

ParametersJSON Schema

Name	Required	Description
`result`	No	Tool result payload (JSON object)

Tool Definition Quality

A4.5/5.0

Behavior4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations already declare readOnlyHint, idempotentHint, and destructiveHint, so the safety profile is clear. The description adds value by explaining that PANs are never returned and that this is an audit view, which is beyond the annotations.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Three sentences, front-loaded with purpose, no wasted words. Every sentence earns its place.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given zero parameters and an output schema (presumably detailed), the description fully covers purpose, return values, and use case, leaving no gaps for the AI agent.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

There are no parameters, so the schema provides full coverage. The description adds context about what is returned, which aids understanding of the output schema.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states it lists test cards for the account, specifies exactly what fields are returned (last4, funded amount, status, expiry), and explicitly notes that PANs are never returned, distinguishing it from other tools.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides a concrete use case ('reconciling Stripe Issuing balance against TMV spend') and implicitly suggests when to use it for auditing. No alternative tools with overlapping functionality exist among siblings.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

list_test_identitiesList managed test identitiesA

Read-onlyIdempotent

Returns managed retained identities owned by this account. These are the first-class successor to kept personas and can be reused with submit_test.testIdentityId or in interaction-scene roles.

ParametersJSON Schema

Name	Required	Description	Default
`projectId`	No
`customerSite`	No
`includeExpired`	No

Output Schema

ParametersJSON Schema

Name	Required	Description
`result`	No	Tool result payload (JSON object)

Tool Definition Quality

A4.0/5.0

Behavior4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations already declare readOnlyHint, idempotentHint, and destructiveHint, so the description's burden is lower. It adds value by explaining that these identities are a successor to kept personas and can be reused, which is useful behavioral context beyond the annotations.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is two sentences long, front-loaded with the main action ('Returns managed retained identities'), and every word adds value. No redundancy or fluff.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness3/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool has 3 optional parameters and an output schema, the description omits parameter semantics but the existence of an output schema partially compensates for return value details. The description adequately covers the core purpose and relationship to sibling tools, but the missing parameter guidance reduces completeness.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters2/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 0%, meaning no parameter descriptions in the JSON schema. The description does not mention any parameters (projectId, customerSite, includeExpired) or explain their purpose, leaving the agent to infer semantics without guidance. This is a significant gap.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states it returns managed retained identities owned by the account. It distinguishes from legacy list_kept_personas by calling them 'first-class successor', and specifies their reusability in other contexts like submit_test or interaction-scene roles, making the purpose precise.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description implies this tool is preferred over list_kept_personas for managed identities, and notes they can be reused in submit_test and interaction-scene roles. It does not explicitly state when not to use it or list alternative tools, but the context is clear enough for selection.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

list_test_identity_plansList persistent persona subscription plansA

Read-onlyIdempotent

Shows monthly subscription plans for managed persistent personas. One active persona is included free; paid seats are the same whether AI or human testers create/use them. Test execution still bills as normal AI or human runs.

ParametersJSON Schema

Name	Required	Description	Default
No parameters

Output Schema

ParametersJSON Schema

Name	Required	Description
`result`	No	Tool result payload (JSON object)

Tool Definition Quality

A4.3/5.0

Behavior4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations already indicate read-only, idempotent, non-destructive behavior. The description adds valuable context: one free persona, same pricing for AI/human testers, and separate billing for test execution. This goes beyond what annotations provide.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Three sentences, each adding essential information: purpose, free tier and pricing, billing clarification. No redundant words, front-loaded with the core action.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool has no parameters and an output schema exists, the description covers all necessary aspects: what it shows (plans), key pricing details, and billing separation. It is fully adequate for an agent to understand behavior.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

The tool has zero parameters, so baseline is 4 per guidelines. No parameter semantics are needed, and the description does not add any param info, which is acceptable.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool shows monthly subscription plans for managed persistent personas. The verb 'Shows' is specific and the resource 'monthly subscription plans' is unambiguous. It naturally distinguishes from sibling tools like list_test_identities or subscribe_test_identities.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines3/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description implies usage for viewing subscription plans but does not explicitly state when to use this tool over alternatives or provide any exclusion criteria. It lacks guidance on context or prerequisites.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

list_worker_offeringsBrowse the worker marketplace menuA

Read-onlyIdempotent

List active worker offerings. Filter by specialty to find workers fluent in a domain (e.g. 'payments', 'i18n-japanese', 'react-spa'). Each entry includes the worker's bio, specialty tags, employment type ('external' = marketplace, 'in_house' = TMV staff), and the credit price.

ParametersJSON Schema

Name	Required	Description
`limit`	No
`specialty`	No	Optional specialty filter — case-insensitive substring match against the worker or offering specialty tags.
`includeInHouse`	No	Whether to include TMV-staffed in-house workers (premium tier). Default true.
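
To make the `specialty` filter semantics concrete, here is a minimal client-side sketch of a case-insensitive substring match against a list of tags. It only mirrors the behavior the schema describes. The function name `matches_specialty` is hypothetical, and the server's actual matching logic may differ in details such as Unicode handling.

```python
def matches_specialty(query: str, tags: list[str]) -> bool:
    """Return True if `query` is a case-insensitive substring of any tag."""
    q = query.casefold()
    return any(q in tag.casefold() for tag in tags)


# Tags taken from the examples in the tool description.
tags = ["payments", "i18n-japanese", "react-spa"]

print(matches_specialty("JAPAN", tags))  # matches "i18n-japanese"
print(matches_specialty("vue", tags))    # no tag contains "vue"
```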

Output Schema

ParametersJSON Schema

Name	Required	Description
`result`	No	Tool result payload (JSON object)

Tool Definition Quality

A4.3/5.0

Behavior4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations already declare readOnlyHint=true and destructiveHint=false, so the agent knows it is safe. The description adds behavioral context by stating it lists only active offerings and explains the output structure (bio, tags, employment type, price), which is beyond what annotations provide.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is three sentences long with no extraneous information. It starts with the main purpose, provides filtering guidance, and lists key output fields. Every sentence is valuable and well-structured.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the presence of an output schema, the description does not need to explain return values. It already covers the main output fields and the filtering option. For a read-only listing tool with annotations indicating safety, this is complete.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema coverage is 67% (two of three parameters described). The description adds example values for the 'specialty' parameter, while the substring-match semantics come from the schema. The description mentions neither 'limit' nor 'includeInHouse'; the schema describes 'includeInHouse' but leaves 'limit' undescribed. Baseline 3 is appropriate.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool lists active worker offerings and provides filtering by specialty, with examples. It distinguishes from siblings like 'list_personality_offerings' and 'create_worker_offering' by focusing on workers and listing output fields.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description gives clear context on when to use the tool (to browse worker offerings) and how to filter by specialty. It does not explicitly state when not to use, but the sibling list provides implicit alternatives. It also does not cover the 'includeInHouse' parameter, but the schema covers it.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

provision_test_cardMint a spendable test-payment cardA

Issues a single-use Stripe-Issuing virtual card hard-capped at fundedUsd, billed at funded + 25% markup + $2 service fee. PAN + CVC are returned ONCE in the response and TMV never persists them. Card auto-freezes 24h after creation. In sandbox mode (test key) cards auth only against Stripe test-mode merchants, perfect for verifying customer checkout flows without real money. Charged in credits at 1 credit = $0.10 (so a $10 funded card costs ~125 credits all-in). Provisioning fee absorbed into the markup.
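
The pricing arithmetic above can be sketched as follows. Read literally, funded + 25% markup + $2 service fee on a $10 card comes to $14.50, or 145 credits at $0.10 per credit, whereas the description's own "~125 credits" figure matches funded + markup alone. How the service fee is actually applied is not clear from the text, so this sketch exposes it as a flag. The function name `card_cost` and the round-up-to-whole-credits rule are assumptions.

```python
from decimal import Decimal, ROUND_CEILING

MARKUP_RATE = Decimal("0.25")      # 25% markup, per the description
SERVICE_FEE_USD = Decimal("2.00")  # $2 service fee, per the description
USD_PER_CREDIT = Decimal("0.10")   # 1 credit = $0.10, per the description


def card_cost(funded_usd: str, include_service_fee: bool = True) -> dict:
    """Break down the billed amount for a card funded with `funded_usd`."""
    funded = Decimal(funded_usd)
    markup = funded * MARKUP_RATE
    fee = SERVICE_FEE_USD if include_service_fee else Decimal("0")
    total = funded + markup + fee
    # Assumption: fractional credits are rounded up to a whole credit.
    credits = (total / USD_PER_CREDIT).to_integral_value(rounding=ROUND_CEILING)
    return {
        "funded": funded,
        "markup": markup,
        "fee": fee,
        "total_usd": total,
        "credits": int(credits),
    }


print(card_cost("10"))                             # literal formula
print(card_cost("10", include_service_fee=False))  # funded + markup only
```

Calling quote_test_card before provisioning remains the reliable way to get the actual charge; the local calculation is only a sanity check.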

ParametersJSON Schema

Name	Required	Description	Default
`fundedUsd`	Yes	USD amount to load onto the card (and the card's spending limit).
`testJobId`	No	Optional TMV job ID to associate the card with. Used by the AI worker to surface the card via runContext.testPaymentCard so the agent can type it at checkout.

Output Schema

ParametersJSON Schema

Name	Required	Description
`result`	No	Tool result payload (JSON object)

Tool Definition Quality

A4.7/5.0

Behavior5/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Discloses key behaviors beyond annotations: card auto-freezes 24h, PAN/CVC returned once and never persisted, credit charging, fees. No contradiction with annotations (readOnlyHint=false, destructiveHint=false).

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is compact (roughly 80 words) yet densely informative. Every sentence adds value: fee breakdown, credit cost, sandbox usage, data privacy. Front-loaded with main action.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the presence of an output schema, the description adequately covers cost, expiration, use case, and privacy. No missing critical behavioral context for a provisioning tool.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema descriptions already cover both parameters (fundedUsd, testJobId), including the purpose of testJobId for surfacing the card via runContext. The description adds extra context: cost calculation (25% markup + $2 fee) and the credit conversion rate.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states it issues a single-use Stripe-Issuing virtual card, specifying the resource and action. It distinguishes from siblings like freeze_test_card and list_test_cards by detailing unique behaviors (auto-freeze, one-time PAN/CVC).

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides clear usage context: 'perfect for verifying customer checkout flows without real money' and mentions sandbox mode. However, it does not explicitly exclude alternatives or state when not to use this tool.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

quote_demo_videoQuote the credit cost of a multi-segment demo videoA

Read-onlyIdempotent

Free preview of assemble_demo_video pricing. Sums the Hume Octave narration cost across segments and adds a flat 10-credit assembly fee. Useful before committing to a longer video.
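
As a small worked example of the pricing rule just described, the sketch below sums per-segment narration costs and adds the flat 10-credit assembly fee. It takes narration costs that have already been quoted in credits. The function name and the sample values are hypothetical, and any server-side rounding is not modeled.

```python
ASSEMBLY_FEE_CREDITS = 10  # flat assembly fee, per the description


def quote_demo_video(narration_credits: list[float]) -> float:
    """Sum per-segment narration costs and add the flat assembly fee."""
    if any(c < 0 for c in narration_credits):
        raise ValueError("narration cost cannot be negative")
    return sum(narration_credits) + ASSEMBLY_FEE_CREDITS


# Three segments with illustrative narration costs (in credits).
print(quote_demo_video([1.5, 2.0, 0.5]))
```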

ParametersJSON Schema

Name	Required	Description	Default
`segments`	Yes

Output Schema

ParametersJSON Schema

Name	Required	Description
`result`	No	Tool result payload (JSON object)

Tool Definition Quality

A3.5/5.0

Behavior3/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations already declare readOnlyHint=true, idempotentHint=true, and destructiveHint=false. The description adds that it is a 'free preview' and sums costs, which is consistent. However, it does not provide additional behavioral context (e.g., rate limits, response format) beyond what annotations convey.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is concise (three short sentences) and front-loaded with the key action. However, given the complexity of the parameter, it could be slightly more detailed without being verbose. It earns its place but misses some clarity.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness3/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

The description covers the main idea (cost preview for a multi-segment video) but lacks specifics on segment configuration and on cost factors beyond narration. Because an output schema exists, the description does not need to detail return values, but it should still guide input usage.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters1/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

The input schema has a single complex parameter 'segments' with 8 properties, but schema description coverage is 0%. The description only mentions 'narration cost' without explaining which fields affect pricing (e.g., narrationText). It adds minimal value for parameter understanding.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's function: a free preview of assemble_demo_video pricing, summing Hume Octave narration costs and adding a 10-credit assembly fee. It distinguishes itself from the sibling tool 'assemble_demo_video' by being a quote-only operation.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description says 'Useful before committing to a longer video,' which implies using this tool before assembly to check cost. It indirectly references the sibling 'assemble_demo_video' as the alternative. However, it does not explicitly mention when not to use or other related siblings like 'quote_voiceover'.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

quote_test_cardPreview the cost of a spendable test cardA

Read-onlyIdempotent

Pre-flight pricing for provision_test_card. Pass the USD amount you want loaded onto the card; returns funded + markup + service fee + total charged. Funded $1-$200. No credits deducted.

ParametersJSON Schema

Name	Required	Description	Default
`fundedUsd`	Yes

Output Schema

ParametersJSON Schema

Name	Required	Description
`result`	No	Tool result payload (JSON object)

Tool Definition Quality

A4.8/5.0

Behavior5/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

The description explicitly states 'No credits deducted,' which aligns with the readOnlyHint and destructiveHint annotations. It also details the return values (funded, markup, service fee, total charged), providing behavioral clarity beyond the annotations.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Four short sentences, each adding value: the first states purpose, the second explains input and output, the third gives the funded range, and the fourth confirms no credits are deducted. No fluff, front-loaded with key information.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

The description covers input, output, range, and side-effects. Given the tool's simplicity (1 param, no nested objects) and the presence of an output schema, the description is complete and self-sufficient.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters5/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

The description fully explains the single parameter 'fundedUsd' by stating its purpose ('amount you want loaded onto the card') and range ('$1-$200'), compensating for the 0% schema description coverage. It also clarifies what the tool returns, adding value beyond the schema.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states it provides pre-flight pricing for provision_test_card, specifying the input (USD amount) and the output components (funded, markup, service fee, total charged). It distinguishes itself from the sibling provision_test_card by being a read-only pricing check.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description implies usage as a precursor to provision_test_card ('Pre-flight pricing for provision_test_card'), giving clear context. It does not explicitly state when not to use it or name alternatives among siblings, but the context is sufficient.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

quote_voiceoverQuote the credit cost of a Hume Octave voiceoverA

Read-onlyIdempotent

Read-only cost preview for synthesize_voiceover. Returns the credit charge derived from script length at Hume's ~$0.05 / 1k char list rate, converted at TMV's 1 credit = $0.10. Free.
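
The conversion described above works out to about 0.5 credits per 1,000 characters ($0.05 per 1k characters divided by $0.10 per credit). A minimal sketch of that arithmetic follows. The function name is hypothetical, the rate is the approximate figure quoted in the description, and rounding or minimum-charge rules applied by the server are unknown and not modeled.

```python
from decimal import Decimal

HUME_USD_PER_1K_CHARS = Decimal("0.05")  # approximate list rate from the description
USD_PER_CREDIT = Decimal("0.10")         # 1 credit = $0.10


def voiceover_credits(text: str) -> Decimal:
    """Estimate the credit cost of synthesizing `text` (unrounded)."""
    # Assumption: cost scales with the character count, whitespace included.
    usd = Decimal(len(text)) / 1000 * HUME_USD_PER_1K_CHARS
    return usd / USD_PER_CREDIT


script = "a" * 2000  # a 2,000-character script
print(voiceover_credits(script))
```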

ParametersJSON Schema

Name	Required	Description	Default
`text`	Yes	Script text to be synthesized.

Output Schema

ParametersJSON Schema

Name	Required	Description
`result`	No	Tool result payload (JSON object)

Tool Definition Quality

A4.3/5.0

Behavior4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations already declare readOnlyHint=true, idempotentHint=true, and destructiveHint=false. The description adds value by specifying the cost calculation details (rate conversion) and stating it's free, which goes beyond the annotations without contradicting them.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is extremely concise: two short sentences plus a one-word note ("Free."), with the key purpose front-loaded. Every word serves a purpose with no redundancy.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's simplicity (one parameter, output schema exists), the description fully covers all necessary context for an agent to use it correctly. It explains the cost calculation and that it's a preview.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

The single parameter 'text' is fully described in the schema (100% coverage), so the description adds minimal extra meaning. The baseline of 3 is appropriate as no additional parameter semantics are provided.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description explicitly states it is a 'read-only cost preview for synthesize_voiceover' and details the cost calculation, making the purpose unmistakably clear. It distinguishes itself from the sibling 'synthesize_voiceover' by indicating it only provides a cost estimate.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description implies usage is for previewing costs before calling 'synthesize_voiceover', and mentions it is 'Read-only' and 'Free'. However, it does not explicitly state when not to use it or suggest alternatives beyond the implied main tool.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

renew_test_identity: Renew a managed test identity (B)

Extends a retained identity for another retention window and marks it active.

Parameters

Name	Required	Description	Default
`testIdentityId`	Yes

Output Schema

Parameters

Name	Required	Description
`result`	No	Tool result payload (JSON object)

Tool Definition Quality

B 3.2/5.0

Behavior 3/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations already indicate it is not read-only, not destructive, and not idempotent. The description adds that it extends retention and marks active, but lacks details on side effects, whether it can be called multiple times, or what 'retained identity' means exactly.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is a single, concise sentence with no unnecessary words. It front-loads the action and resource efficiently.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness 3/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

For a simple tool with one parameter and output schema present, the description covers the core purpose but lacks details about 'retained identity' and 'retention window', and does not specify what the output contains or any error states.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters 1/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

With 0% schema description coverage, the description must explain the parameters. It does not mention 'testIdentityId' at all, leaving the agent to infer its meaning from context.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's action: extending a retained identity for another retention window and marking it active. The verb 'renews' is implied, and the resource 'managed test identity' is specified, distinguishing it from siblings like create, delete, or check.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines 2/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides no guidance on when to use this tool versus alternatives (e.g., create_test_identity, subscribe_test_identities). It does not mention prerequisites, when not to use it, or context for renewal.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

retest_job: Re-run a previously completed test (A)

Idempotent

Re-run an existing test job against the latest deployment. Useful after pushing a fix surfaced by get_test_results — call this to verify whether the bug is gone. Keeps the original test's URL, custom goal, system prompt, and inbox configuration so the verification covers the same flow.

Parameters

Name	Required	Description	Default
`jobId`	Yes	Job ID returned by submit_test (or a prior retest_job).

Output Schema

Parameters

Name	Required	Description
`result`	No	Tool result payload (JSON object)

Tool Definition Quality

A 4.7/5.0

Behavior 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations already indicate idempotent and non-destructive behavior. Description adds that it keeps original test configuration, implying no side effects. Could further state that it does not alter original results, but still good.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Three sentences: the first states the purpose, the second gives the use case, and the third lists the retained configuration. No wasted words, front-loaded with key information.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given single parameter, output schema availability, and clear annotations, the description covers all needed context: purpose, usage trigger, preserved config, and parameter source.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters 4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Only one parameter (jobId) with schema coverage 100%. Description adds meaning by stating that the original configuration is preserved, reinforcing the parameter's role in identifying the test to re-run.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

Description clearly states 'Re-run an existing test job against the latest deployment' with specific verb and resource. It differentiates from sibling tools like submit_test and get_test_results by focusing on re-running completed tests.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines 5/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

Explicitly says to use after pushing a fix surfaced by get_test_results to verify bug fix. Also notes that it keeps original configuration (URL, goal, prompt, inbox), providing clear context for when it is appropriate.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

submit_combo: Submit a packaged AI-agent combo bundle (A)

Queue a named combo against a URL. Fans into N ordered jobs (cheap bug-finders first, expensive audits last) sharing one batchId. +15% parallel premium applies. If the combo has a pauseOnBugThreshold, the worker auto-cancels remaining pending legs + refunds their credits once cumulative bug count crosses the threshold — so a broken site never burns the full bundle. Use list_combos to browse the catalog first.

Parameters

Name	Required	Description
`url`	No	Target URL all legs run against. Required unless `pages` or `stories` is provided.
`pages`	No	Multi-page mode (legacy): list 1-10 distinct page URLs. Each combo leg fans out × pages.length. Cost scales linearly. Prefer `stories[]` on Whole Kit tiers.
`comboId`	Yes	ID of a combo from list_combos (e.g. 'combo-smoke-stack', 'combo-whole-kit-core').
`stories`	No	Story-based mode (Whole Kit tier preferred). Each story = one end-to-end user flow exercised by every leg in the combo. Hard-capped by the combo's maxStories (Solo=1, Core=3, Plus=6, Max=10). Pricing is the combo's flatCreditPrice (bulk-discounted at higher tiers), not derived from stories.length × per-leg cost.
`projectId`	No
`description`	Yes	Plain-English description applied to every leg as the job title.
`projectLabel`	No
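
Two of the rules above, the per-tier story cap and the auto-pause on bugs, can be sketched as follows. This is a simplified model under assumptions, not the worker's actual logic: the helper names are hypothetical, legs are treated as running strictly one after another, and "crosses the threshold" is read as "reaches or exceeds it", which the listing does not confirm.

```python
from typing import Optional

# Caps quoted in the `stories` row above.
MAX_STORIES = {"Solo": 1, "Core": 3, "Plus": 6, "Max": 10}


def check_story_count(tier: str, stories: list[str]) -> None:
    """Reject a stories[] list longer than the tier's maxStories."""
    cap = MAX_STORIES[tier]
    if len(stories) > cap:
        raise ValueError(f"{tier} allows at most {cap} stories, got {len(stories)}")


def simulate_legs(bugs_per_leg: list[int], pause_on_bug_threshold: Optional[int]) -> tuple[int, int]:
    """Return (legs_run, legs_cancelled) for an ordered list of per-leg bug counts."""
    total_bugs = 0
    for index, bugs in enumerate(bugs_per_leg, start=1):
        total_bugs += bugs
        # Assumption: the threshold trips at >=, and only pending legs are cancelled.
        if pause_on_bug_threshold is not None and total_bugs >= pause_on_bug_threshold:
            return index, len(bugs_per_leg) - index
    return len(bugs_per_leg), 0


check_story_count("Core", ["signup", "checkout"])  # within the Core cap of 3
print(simulate_legs([1, 2, 0, 4], 3))              # (2, 2): two legs run, two cancelled
```

In this model the cancelled legs are the ones whose credits would be refunded, which is why the cheap bug-finders are ordered first.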

Output Schema

Parameters

Name	Required	Description
`result`	No	Tool result payload (JSON object)

Tool Definition Quality

A 4.5/5.0

Behavior 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Discloses beyond annotations: fan-out ordering, +15% parallel premium, auto-cancellation with refunds on pauseOnBugThreshold. Annotations already indicate non-destructive, non-idempotent, open world; description adds valuable behavioral context.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Five short sentences, all front-loaded and essential. No fluff. Every sentence adds unique value.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's complexity (7 params, conditional logic), the description covers key aspects: ordering, pricing, cancellation, and prerequisite browsing. Output schema exists, so return values are not needed.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters 4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema has good coverage (71%), but description adds meaning around comboId, ordering, pricing, and cancellation logic that are not in schema. Some parameters (projectId, projectLabel) get no extra context, but overall adds value.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

Clearly states it queues a named combo against a URL, with specific ordering and batch ID. Distinguishes from sibling submit tools (e.g., submit_test) by focusing on combos and referencing list_combos.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

Explicitly advises to use list_combos to browse the catalog first. Implicitly tells when to use (for combos) but does not explicitly mention when not to use or alternatives.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

submit_conversation_test: Run a two-AI voice conversation through Paradise Comms (A)

Spawns two AI personas as participants on a real LiveKit voice call (via Paradise's self-hosted comms cluster), each driven by its own LLM (Claude + GPT by default), and runs a structured conversation. Each persona talks aloud (TTS) and listens to the other (Whisper STT) — this isn't simulation, it's a real WebRTC call with real audio. Used to verify Paradise Comms end-to-end (publisher → SFU → subscriber → recording → outbound webhook) and to demo agent-to-agent voice. Returns the full transcript and a recording hint.

Parameters

Name	Required	Description	Default
`turns`	No	Total back-and-forth turns. 6 means A→B→A→B→A→B. Cap of 20 to bound LLM + TTS + Whisper spend per test.
`personaA`	No	Persona for agent A (speaks first). Defaults to skeptical-cto. See lib/personalities.ts for the full list of 8 personas.	skeptical-cto
`personaB`	No	Persona for agent B. Defaults to power-user. The pairing skeptical-cto + power-user is the canonical demo because their voices contrast strongly enough to prove the conversation is real (not echo).	power-user
`scenario`	No	Optional scenario nudge added to both personas' system prompts. Example: 'Topic: should the team switch from PostgreSQL to MongoDB? Have a real disagreement.' Leave empty to let the personas freestyle.
`smokeUrl`	No	URL of the smoke page on the LiveKit SFU droplet. Defaults to staging.	https://livekit-staging.comms.paradisemodern.com/smoke/
`paradiseBase`	No	Paradise Comms API base URL. Defaults to staging; pass production when ready.	https://comms.staging.paradisemodern.com
`paradiseToken`	Yes	Paradise Comms portfolio bearer token (e.g. paradise-staging_test_…). Get one by running scripts/seed-staging.ts in the paradisemodern repo, OR via POST /api/admin/comms/tokens as a super_admin.
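
As a rough illustration of the `turns` semantics and the documented defaults, the sketch below builds an arguments object and prints the speaking order. The helper `speaker_order` is hypothetical, the token is a placeholder, and the lower bound of 1 turn is an assumption; only the cap of 20 comes from the table.

```python
MAX_TURNS = 20  # cap quoted in the `turns` row


def speaker_order(turns: int) -> list[str]:
    """Alternate speakers starting with agent A, as in the 6-turn example."""
    if not 1 <= turns <= MAX_TURNS:  # lower bound is an assumption
        raise ValueError(f"turns must be between 1 and {MAX_TURNS}, got {turns}")
    return ["A" if i % 2 == 0 else "B" for i in range(turns)]


arguments = {
    "paradiseToken": "<your-paradise-token>",  # required; placeholder value
    "turns": 6,
    "personaA": "skeptical-cto",  # documented default
    "personaB": "power-user",     # documented default
    "scenario": "Topic: should the team switch from PostgreSQL to MongoDB? Have a real disagreement.",
}

print("→".join(speaker_order(arguments["turns"])))  # A→B→A→B→A→B
```

Omitting `smokeUrl` and `paradiseBase` leaves both on their staging defaults, per the table.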

Output Schema

Parameters

Name	Required	Description
`result`	No	Tool result payload (JSON object)

Tool Definition Quality

A 4.3/5.0

Behavior 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations provide openWorldHint=true and destructiveHint=false. The description adds that it is a real WebRTC call with real audio, TTS, and STT, which is significant behavioral context beyond annotations. No contradictions.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is concise (4 sentences) and front-loaded with the core action. Every sentence adds value without redundancy.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

All 7 parameters are described in schema, output schema exists, and the description mentions return values (transcript and recording hint). Complete for a complex tool.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Input schema coverage is 100% and descriptions in schema already explain defaults and purpose. The description adds no new parameter-level information beyond the schema.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states it spawns two AI personas on a real LiveKit voice call with TTS and STT, not simulation, and returns transcript and recording hint. This distinguishes it from sibling tools like submit_test or submit_test_batch.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description explicitly states it is used to verify Paradise Comms end-to-end and to demo agent-to-agent voice. It does not mention when not to use, but the purpose is clear enough.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

submit_feedback: File a bug, feature request, or UX nit for operator triage (A)

Queues feedback for staff review. NOT acted on automatically — items sit in status="new" and are worked through with the operator. Good filing hygiene: one issue per submission, name the surface affected (e.g. "submit_test default step budget too low for OAuth flows"), include reproduction steps in the body. If you're filing while running another job, pass context.relatedJobId so the operator can pull the screenshots / report. Anonymous filers can include reporterEmail for follow-up.

Parameters

Name	Required	Description	Default
`body`	Yes	Full context, repro steps, what you expected vs. what happened. Markdown OK; the admin view renders it.
`title`	Yes	Short imperative summary — what should change. E.g. "submit_test should accept devicePreset by alias".
`category`	No		bug
`severity`	No	critical = blocking real work, major = wrong result / cost, minor = papercut, suggestion = enhancement.	minor
`mcpClient`	No	Your MCP client name so we can spot patterns by tool (Claude Code, Cursor, Codex, etc.).
`relatedJobId`	No	If this feedback is about a specific test result, pass the jobId so staff can pull the report / screenshots.
`reporterEmail`	No	Optional contact for follow-up. Authed accounts already have an email on file; this is for anonymous catalog browsers.
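
The filing hygiene described above could be applied with a small helper like this one. It is a sketch, not part of the server: `build_feedback` is a hypothetical name, the job ID is made up, and the helper covers only the fields whose semantics the table spells out (the allowed `category` values are not listed here, so it leaves that field to its default).

```python
from typing import Optional

SEVERITIES = ("critical", "major", "minor", "suggestion")  # from the `severity` row


def build_feedback(
    title: str,
    body: str,
    severity: str = "minor",  # documented default
    related_job_id: Optional[str] = None,
    reporter_email: Optional[str] = None,
) -> dict:
    """Assemble submit_feedback arguments, omitting optional fields left unset."""
    if not title.strip() or not body.strip():
        raise ValueError("title and body are required")
    if severity not in SEVERITIES:
        raise ValueError(f"severity must be one of {SEVERITIES}")
    args = {"title": title, "body": body, "severity": severity}
    if related_job_id:
        args["relatedJobId"] = related_job_id
    if reporter_email:
        args["reporterEmail"] = reporter_email
    return args


feedback = build_feedback(
    title="submit_test default step budget too low for OAuth flows",
    body="Steps to reproduce: ...\nExpected: ...\nActual: ...",
    severity="major",
    related_job_id="job-123",  # hypothetical job ID
)
print(sorted(feedback))  # ['body', 'relatedJobId', 'severity', 'title']
```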

Output Schema

Parameters

Name	Required	Description
`result`	No	Tool result payload (JSON object)

Tool Definition Quality

A 4.4/5.0

Behavior 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Discloses that feedback queues for staff review and is not automated, items sit in status='new'. Adds value beyond annotations which show readOnlyHint=false, idempotentHint=false, destructiveHint=false.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Paragraph is dense but effectively conveys key information. Could be slightly more structured, but each sentence adds value.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Covers purpose, usage guidelines, parameter semantics, and behavioral notes comprehensively. Output schema exists, so no need to describe return values. Complete for a feedback submission tool.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters 4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Adds meaning beyond schema: explains relatedJobId usage, reporterEmail for anonymous filers, severity definitions, and title format examples. Schema coverage is high (86%), but description provides usage context.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

Title and description clearly state it's for filing feedback (bug, feature request, UX nit) for operator triage. Distinguishes from sibling tools like list_feedback and update_feedback by specifying it queues staff review.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

Explicit guidance: NOT acted on automatically, one issue per submission, name surface, include reproduction steps, pass relatedJobId for context, optional reporterEmail. Could be more explicit about when not to use, but clear context.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

submit_interaction_scene: Submit a multi-agent interaction scene (A)

Queues 2-10 AI agents in parallel as roles in a coordinated scene. Each role gets its own browser, persona, and goal — and uses signal() + wait_for_signal() actions to communicate with sibling roles. Use this for publisher+viewer (livestream), buyer+seller (marketplace), multi-user chat, host+guest flows, anything where one agent must produce a value (URL / order id / stream id) that another agent needs. Returns sceneId + role-to-jobId mapping. Each role billed as a normal AI test + 15% parallel premium on top.

Parameters

Name	Required	Description
`roles`	Yes
`projectId`	No
`description`	Yes	Plain-English description of the scene (e.g. 'PM Comms publisher → viewer livestream verification').
`projectLabel`	No
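
The billing rule in the description (each role billed as a normal AI test plus a 15% parallel premium) can be worked through as below. This is a back-of-the-envelope sketch: `scene_credits` is a hypothetical helper, the per-role credit figures are invented for the example, and any rounding the platform applies is not documented here, so the result is left unrounded.

```python
from fractions import Fraction

PARALLEL_PREMIUM = Fraction(15, 100)  # "+ 15% parallel premium" from the description
MIN_ROLES, MAX_ROLES = 2, 10          # "Queues 2-10 AI agents"


def scene_credits(per_role_credits: list[int]) -> Fraction:
    """Sum the per-role test costs and add the parallel premium on top."""
    if not MIN_ROLES <= len(per_role_credits) <= MAX_ROLES:
        raise ValueError(f"a scene needs {MIN_ROLES}-{MAX_ROLES} roles, got {len(per_role_credits)}")
    base = sum(per_role_credits)
    return base * (1 + PARALLEL_PREMIUM)


# Hypothetical publisher + viewer scene costing 10 and 20 credits as standalone tests.
total = scene_credits([10, 20])
print(float(total))  # 34.5
```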

Output Schema

Parameters

Name	Required	Description
`result`	No	Tool result payload (JSON object)

Tool Definition Quality

A 4.2/5.0

Behavior 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations already indicate non-readOnly, non-idempotent, non-destructive, and openWorld. The description adds valuable details: parallel queuing, signal/wait_for_signal actions, per-role browser/persona/goal, response format (sceneId + mapping), and billing premium. No contradictions. Slightly less than perfect by not mentioning any cleanup or error handling.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is five sentences, front-loaded with the core action. Every sentence adds essential information without fluff. Highly concise and well-structured.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's complexity (multi-agent, many sub-parameters) and low schema coverage, the description effectively covers the high-level concept, use cases, and billing. It references the output shape (sceneId + mapping). Lacks full parameter details but is reasonably complete for a complex tool with an output schema.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is only 25%, so the description must compensate. It explains the conceptual role of the 'roles' array and its elements but does not detail each sub-property or the other top-level parameters (projectId, projectLabel). The tool description adds value but does not fully make up for the low schema coverage.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool submits a multi-agent interaction scene with 2-10 parallel agents, each with its own browser, persona, and goal. It provides specific use cases (publisher+viewer, buyer+seller, multi-user chat) that distinguish it from sibling tools like submit_test or submit_conversation_test.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description specifies when to use the tool: for coordinated scenes where agents communicate via signals and produce values that others need. It mentions billing premium but lacks explicit exclusions or warnings about situations where it should not be used, giving strong but not perfect guidance.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

submit_job_results: Submit job results (A)

Submit the worker's outcomes for a claimed job. Triggers the same earnings + report + client notification pipeline a human checker submission triggers. Returns the report id and the final pass/fail status.

Parameters

Name	Required	Description
`items`	Yes
`jobId`	Yes	The job ID from claim_job.
`summary`	Yes	Plain-English summary of what was tested + what worked / didn't.
`createdIdentityCredentials`	No	When the claimed job asked the human worker to create a persistent managed identity, return the login credentials here so TMV can encrypt and save the persona for later AI or human runs.

Output Schema

Parameters

Name	Required	Description
`result`	No	Tool result payload (JSON object)

Tool Definition Quality

A 4.2/5.0

Behavior 5/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

The description discloses key behavioral traits beyond annotations: it triggers earnings, report generation, and client notifications (same as human checker submission), and returns the report id and pass/fail status. This adds valuable context not covered by annotations (readOnlyHint=false, destructiveHint=false).

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is three sentences, efficient, and front-loaded with the primary action. Each sentence adds value: the first states the action and context, the second names the pipeline it triggers, and the third specifies return values. No unnecessary details or repetition.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool has an output schema (assumed rich), the description adequately covers return values and pipeline effects. It could be more complete by explicitly stating that the job must be claimed first, but this is inferred. It handles the complexity of 4 params and nested objects well.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 75%, so most parameters are already described in the schema. The tool description adds no specific parameter details beyond saying it submits 'outcomes', which is already implied. It does not enhance understanding of fields like 'createdIdentityCredentials' or 'summary' beyond what the schema provides.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the action ('Submit') and the resource ('worker's outcomes for a claimed job'). It also distinguishes from siblings by mentioning the pipeline it triggers (same as human checker submission), which sets it apart from other job-related tools like 'claim_job' or 'retest_job'.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines 3/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description implies usage after claiming a job ('for a claimed job') and explains the results pipeline, but it does not explicitly state when to use this tool versus alternatives like 'retest_job' or when not to use it. No direct comparison or exclusion criteria are provided.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

submit_test: Submit a test job (A)

Queue a new TestMyVibes job for a given URL. You explicitly choose the runner: AI agent (headless Chromium + GPT-4o vision, fastest, deterministic for well-specified goals) or human checker (slower, better for visual/UX judgment calls). Returns a jobId you can poll with get_test_status.
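
The submit-then-poll pattern mentioned above might look like the sketch below. Everything specific in it is an assumption: `wait_for_job` is a hypothetical helper, the status strings ("queued", "running", "completed", "failed") are invented because the get_test_status response shape is not shown in this listing, and the status lookup is passed in as a plain callable so the example runs without a real MCP client.

```python
import time
from typing import Callable, Iterable


def wait_for_job(
    get_status: Callable[[str], str],
    job_id: str,
    terminal: Iterable[str] = ("completed", "failed"),  # hypothetical status names
    interval_s: float = 5.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the job reaches a terminal status or the poll budget runs out."""
    terminal_set = set(terminal)
    for _ in range(max_polls):
        status = get_status(job_id)
        if status in terminal_set:
            return status
        sleep(interval_s)
    raise TimeoutError(f"job {job_id} still running after {max_polls} polls")


# Stand-in for a real get_test_status call, so the sketch runs offline.
fake_statuses = iter(["queued", "running", "completed"])
result = wait_for_job(lambda _job_id: next(fake_statuses), "job-123", sleep=lambda _s: None)
print(result)  # completed
```

Bounding the number of polls keeps a stuck job from blocking the calling agent indefinitely; the interval and budget here are arbitrary defaults.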

Parameters

Name	Required	Description	Default
`url`	Yes	The URL to test. Must be publicly reachable.
`goal`	No	AI runner only. CONCRETE success criterion the agent stops on — e.g. 'Reach a URL containing /dashboard', 'See a Welcome banner on the page header', 'Receive an OTP email and submit the code'. Without a goal the AI runs out its step budget on exploration instead of completing a flow.
`runner`	No	Who runs the test. 'ai' = headless browser + GPT-4o vision agent (default; use for deterministic flows, signup/login, regression checks). 'human' = real human checker on TMV's panel (use for visual/UX judgment, complex flows the AI can't drive, accessibility passes).	ai
`jobType`	No	Test category — defaults to 'General QA'. Affects credit cost when billed.	General QA
`priority`	No	Job priority.	normal
`targetOS`	No	Human-runner advisory string for OS (e.g. 'iOS 17', 'Android 14', 'macOS 14'). Surfaced on the checker's claim card.
`viewport`	No	Explicit viewport for non-preset resolutions (e.g. {width: 2560, height: 1440} for a 27" desktop monitor). Wins over devicePreset only when devicePreset is NOT set. Use devicePreset for known phones/tablets and viewport for custom resolutions.
`projectId`	No	Existing project to attach this job to (optional).
`offeringId`	No	Marketplace offering id (browse via list_worker_offerings). When set, this job is priced + routed through the marketplace: the customer pays the offering's `creditsCharged` and the worker who fulfills the job earns the offering's pre-locked `workerPayoutCredits` (75% of charged). When omitted, the legacy personality/step-based pre-flight quote applies.
`slaMinutes`	No	Target turnaround in minutes (human runner only; AI runs finish in ~1-5 min regardless).
`description`	Yes	Plain-English description of what to test. The platform uses this to seed a checklist.
`mcpEndpoint`	No	MCP Auditor only. URL of the MCP server to audit (e.g. https://api.example.com/mcp). Pair with `personalityOfferingId: 'mcp-smoke'` or `'mcp-full-audit'`. The MCP Auditor runs JSON-RPC against this endpoint instead of opening a browser at `url`.
`recordVideo`	No	AI runner only. WebM video recording of the entire browser session. Defaults on; set false to opt out. When true, the worker captures a continuous screencast via Puppeteer and uploads it to Spaces; signed URL surfaced in get_test_results.aiReport.videoUrl. Free — no credit charge. 30-day retention same as step screenshots. Ops can globally disable the default with AI_RECORD_VIDEO_DEFAULT=false.
`useSmsInbox`	No	AI runner only. When true, TMV provisions a throwaway phone number from Paradise's SMS test-number pool (US/CA available) bound to this run. The agent uses it for any phone field, and `wait_for_sms` blocks until verification SMS arrive. Required for phone+OTP signup flows. The pool is finite: each lease reserves the number for ~15 min, then auto-releases.
`devicePreset`	No	Optional device emulation. Pass a Puppeteer KnownDevices name (e.g. 'iPhone 14 Pro', 'iPad Mini', 'Pixel 5', 'Galaxy S9+') and the AI agent runs the test as that device — proper viewport, touch events, user-agent, and DPR. No markup; this is the same Chromium with different emulation flags. Use list_device_presets to see the full 131-device catalog or the curated featured subset. For human runners this is advisory and surfaced on the checker's job card.
`identityMode`	No	AI runner only. 'auto' infers when a signup/OTP flow needs a TMV inbox/persona; 'fresh' forces a new persona/inbox for this run; 'keep' creates a managed retained identity with a persistent inbox and saves credentials after a passing signup; 'reuse' signs in with testIdentityId/existingPersonaId; 'none' disables identity provisioning.	auto
`mcpTransport`	No	MCP Auditor only. Transport protocol the customer's MCP server speaks. Most servers built with @modelcontextprotocol/sdk use streamable-http; older ones use sse. No stdio support (we don't run customer code in TMV's sandbox).	streamable-http
`projectLabel`	No	Audit label naming which of your projects submitted this test (e.g. 'pm-claude-code', 'shiftsee-claude-code'). Not used for auth.
`targetDevice`	No	Human-runner advisory string naming the device (e.g. 'iPhone 14 Pro', 'Pixel 7'). Surfaced on the checker's claim card so they know which device to test on. No effect for AI runners.
`useTestInbox`	No	AI runner only. When true, TMV provisions a per-job inbox at `<job-prefix>-<random>@inbox.testmyvibes.com` bound to this run. The agent uses it for any email field, and `wait_for_email` blocks until verification emails arrive. Required for OTP / email-verify flows; pointless for read-only tests.
`mcpAuthHeader`	No	MCP Auditor only. Optional auth header passed to the MCP endpoint (e.g. 'Bearer <token>', 'X-API-Key: <key>'). Format: 'HeaderName: value'. Used verbatim on every JSON-RPC request.
`targetBrowser`	No	Human-runner advisory string for browser (e.g. 'Safari', 'Chrome', 'Firefox'). Surfaced on the checker's claim card.
`videoCallTest`	No	AI runner only. Video/voice call testing (human↔AI calls, WebRTC flows): Chrome launches with a fake camera+microphone (auto-granted; synthetic pattern/tone media the far side really receives) and every RTCPeerConnection on the page is instrumented. The agent gains the check_call_media action, returning hard metrics — ICE state, time-to-first-frame, fps, resolution, packet loss, freezes, and whether remote audio is actually AUDIBLE. The raw metric timeline is surfaced in get_test_results.aiReport.callStats. Screenshots cannot distinguish a live call from a frozen frame; instruct the agent to start the call, then use check_call_media (~10s settle), then re-check later to confirm the call is sustained. For human↔human two-browser calls use submit_interaction_scene with videoCallTest on each role.
`sessionCookies`	No	AI runner only. Session injection — pre-authenticated cookies planted on the browser BEFORE the first navigation, so the agent starts already signed in and skips the login/OTP gate. Purpose-built for gated flows (photoreal video calls, member dashboards) where driving an email-OTP login with the vision agent is slow and flaky. Obtain a real session however you like (server-to-server auth, a scripted OTP redeem) and pass the cookies here; they're domain-scoped to the test URL at inject time and never echoed back in results. Combine with videoCallTest to land a logged-in agent directly on a call surface.
`testIdentityId`	No	AI runner only. Managed retained identity id from list_test_identities/create_test_identity. If it already has credentials the worker signs in as that returning user; if not, the worker uses its persistent email/persona for a fresh signup and saves credentials on PASS.
`useFakeProfile`	No	AI runner only. Adds depth to the test persona beyond default username/displayName/bio. 'basic' (+1 credit): generates a physicalProfile JSON (age, height, hair color, eye color, etc.) so any open-ended profile fields are filled with consistent realistic values. 'full' (+2 credits): basic + 2 photorealistic Flux Schnell photos uploaded to TMV Spaces and exposed to the agent as signed URLs for avatar / profile-image uploads. Skip this for read-only tests; use 'basic' for profile-completion tests; use 'full' for photo-required signup flows.	off
`keepTestAccount`	No	AI runner only. When false (default), signup tests end by deleting the account they created so customer user tables don't accumulate orphan rows. Set true to KEEP the account alive after the test — the persona's credentials are persisted so a later submit_test with `existingPersonaId` can sign in as a returning user (repeat-testing offering). Costs more (persona retention fee) but saves signup steps on every subsequent run.
`smsInboxCountry`	No	AI runner only. Used with useSmsInbox=true. Country code of the throwaway number to rent. US (default) covers most American/Canadian flows; CA needed for sites that gate by destination country. India is NOT available (Telnyx has no IN inventory).	US
`syntheticVisitor`	No	Paradise Modern Growth Kit structured input. When present, TMV queues a Synthetic Visitor Test: an AI visitor simulation focused on CTA/A-B conversion completion rather than general QA.
`targetScreenSize`	No	Human-runner advisory string (e.g. '1920x1080', '390x844'). Stored on the job and surfaced on the checker's claim card. No effect for AI runners — use devicePreset or viewport instead.
`agentInstructions`	No	AI runner only. Verbal step-by-step the vision agent follows. Pin exact field values here (e.g. 'When asked for a name use "QA Tester"; when asked for a password use "TestPass!2026"'). Without this the agent invents values and tests become non-reproducible. By default these are advisory — set strictAgentInstructions=true to enforce them as hard rules.
`existingPersonaId`	No	AI runner only. Repeat-test mode. Set to the personaId of a previously-kept persona (from a job submitted with keepTestAccount=true). The worker skips provisioning + signup and instead reuses the persona's stored email + password to log straight in. Use this to exercise return-user flows (profile edits, dashboards, settings, follow-up actions) without paying for signup every time. Discounted -1 credit per run; persona retention itself costs 2 credits per 30-day window (first persona per project free).
`personalityOfferingId`	No	Personality menu offering id (browse via list_personality_offerings). Locks the step budget, inbox provisioning, personality, and price to the offering. Mutually exclusive with offeringId — offeringId routes to a worker; personalityOfferingId is an AI-only packaged test priced by TMV.
`strictAgentInstructions`	No	AI runner only. When true, agentInstructions are enforced with a stronger preamble + post-step self-check ("Did my last action violate any rule? If yes, reverse course before continuing"). Use for OTP / mid-form flows where one wrong click (extra OTP request, dropdown change after submit) invalidates state. Default false — instructions are advisory, the agent uses judgment.
`expectedEmailFromContains`	No	AI runner only. Pin the wait_for_email fromContains filter (substring of the sender address). Use when the sender domain isn't the obvious test target (e.g. delivered from sendgrid.net but the site is acme.com).
`provisionTestCardFundedUsd`	No	AI runner only. Mints a Stripe-Issuing test-payment card just-in-time when the worker picks up this job, funded to this USD amount. The PAN is held in-memory only — never touches the Job record, never returned to the caller. The AI agent receives it via the system prompt and types it at the customer's checkout. Card is frozen automatically at end of run (or 24h, whichever first). Billed at funded + 25% markup + $2 service fee. Currently sandbox-only — cards auth against Stripe test-mode merchants only until live activation lands.
`expectedEmailSubjectContains`	No	AI runner only. Pin a case-insensitive substring the AI agent MUST use as wait_for_email's subjectContains filter. Useful when your customer's verification email subject doesn't match the site name (e.g. site is 'newvibecity.com' but email subject is 'Newvibecityhotel sign-in code'). Without this, the agent guesses from the URL/brand and can timeout on wrong filters. Surfaced in the system prompt with strict instructions.
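
With 37 parameters, it can help to assemble the arguments in one place and catch obvious conflicts before the call. The sketch below is a client-side helper, not part of TestMyVibes: `build_submit_test_args` and `AI_ONLY_FIELDS` are hypothetical names, the field list is only a subset of the rows marked "AI runner only" above, and whether the server itself rejects these combinations is not stated in the listing.

```python
from typing import Any, Dict

# Subset of the fields the parameter table marks "AI runner only".
AI_ONLY_FIELDS = {
    "goal", "recordVideo", "useTestInbox", "useSmsInbox",
    "agentInstructions", "strictAgentInstructions", "sessionCookies",
}


def build_submit_test_args(
    url: str, description: str, runner: str = "ai", **options: Any
) -> Dict[str, Any]:
    """Assemble submit_test arguments with a few client-side sanity checks."""
    if not url or not description:
        raise ValueError("url and description are the two required parameters")
    if runner not in ("ai", "human"):
        raise ValueError("runner must be 'ai' or 'human'")
    if runner == "human":
        misplaced = sorted(AI_ONLY_FIELDS.intersection(options))
        if misplaced:
            raise ValueError(f"AI-runner-only fields set for a human run: {misplaced}")
    # The table calls these two mutually exclusive.
    if "offeringId" in options and "personalityOfferingId" in options:
        raise ValueError("offeringId and personalityOfferingId are mutually exclusive")
    return {"url": url, "description": description, "runner": runner, **options}


# Example: an email-OTP signup flow on the AI runner (example.com is a placeholder).
args = build_submit_test_args(
    "https://example.com/signup",
    "Sign up with email and confirm the one-time code",
    goal="Reach a URL containing /dashboard",
    useTestInbox=True,
)
```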

Output Schema

Parameters

Name	Required	Description
`result`	No	Tool result payload (JSON object)

Tool Definition Quality

A4.2/5.0

Behavior4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Description indicates a create operation (queues a new job) and returns a jobId for polling. Annotations already mark non-destructive, non-idempotent. No contradiction; adds value by explaining polling workflow. Could mention auth requirements.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Three sentences, front-loaded with purpose, zero waste. Efficient and well-structured.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given complexity (37 params, 2 required), description covers core concept and key choices. Output schema implied. Some details (e.g., credit costs) are in parameter descriptions, but main description is sufficient for agent understanding.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema coverage is 100%, so baseline is 3. Description adds high-level context (e.g., runner choice) but does not provide additional parameter meaning beyond the schema.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

Description clearly states the tool queues a new test job for a given URL, explicitly chooses between AI and human runner, and returns a jobId to poll. This distinguishes it from siblings like submit_combo or submit_conversation_test.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

Provides clear guidance on when to use AI vs human runner (deterministic vs visual/UX). However, does not mention alternatives like submit_test_batch for batch jobs or when not to use this tool.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

submit_test_batch: Submit multiple AI tests as a parallel batch (A)

Queue up to 20 AI tests at once and run them in parallel instead of one-after-another. Each test in the batch costs 1.15× its base credits (the parallel premium). Returns the shared batchId and a per-test breakdown so you can poll each jobId individually. Use this when you have an independent set of tests to run (e.g. signup + login + dashboard + settings + delete across one customer site) and want them done in minutes rather than queued through a serial worker. AI runner only — human-runner batching ships separately.
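
The parallel premium is simple arithmetic, sketched here with `Decimal` to avoid floating-point drift. The base credit values are made up for illustration, and the listing does not say how (or whether) TestMyVibes rounds the result, so no rounding is applied.

```python
from decimal import Decimal
from typing import Iterable

PARALLEL_PREMIUM = Decimal("1.15")  # multiplier quoted in the tool description


def batch_cost(base_credits: Iterable[int]) -> Decimal:
    """Total credits for a batch: every test pays its base cost times 1.15."""
    return sum((Decimal(c) * PARALLEL_PREMIUM for c in base_credits), Decimal(0))


# Three tests with hypothetical base costs of 2, 3 and 5 credits.
total = batch_cost([2, 3, 5])
```

For these example values the batch costs 11.5 credits, against 10 credits if the same tests were submitted one at a time.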

Parameters

Name	Required	Description	Default
`tests`	Yes	Array of 2-20 test specs. Each item has the same shape as submit_test's inputs (AI runner). Tests run concurrently up to a worker concurrency limit of 3.
`projectLabel`	No	Audit label naming which of your projects submitted this batch (e.g. 'shiftsee-regression-suite').
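
Because `tests` must hold between 2 and 20 specs, a larger suite has to be split across several calls. One possible client-side split is sketched below; `plan_batches` is a hypothetical helper, and routing a leftover single test to submit_test is a design choice of this sketch rather than documented behavior.

```python
from typing import Any, Dict, List, Tuple

MIN_BATCH, MAX_BATCH = 2, 20  # limits from the `tests` parameter description

TestSpec = Dict[str, Any]


def plan_batches(tests: List[TestSpec]) -> Tuple[List[List[TestSpec]], List[TestSpec]]:
    """Split specs into batches of at most 20; return (batches, leftovers).

    A trailing group smaller than 2 cannot be sent as a batch, so it is
    returned separately for a plain submit_test call.
    """
    batches = [tests[i:i + MAX_BATCH] for i in range(0, len(tests), MAX_BATCH)]
    leftovers: List[TestSpec] = []
    if batches and len(batches[-1]) < MIN_BATCH:
        leftovers = batches.pop()
    return batches, leftovers
```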

Output Schema

Parameters

Name	Required	Description
`result`	No	Tool result payload (JSON object)

Tool Definition Quality

A4.4/5.0

Behavior4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Discloses side effects (credit deduction with premium), parallel execution, return structure (batchId, per-test breakdown), and AI runner limitation. No contradiction with annotations. Could mention partial failure handling, but adequate.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Five sentences, each earning its place: main action, cost, return, usage guidance, runner type. Front-loaded with key information, no fluff.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given output schema exists, description covers return values adequately. Could mention batching limits more explicitly, but combined with schema it's complete for typical use.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema coverage is 100% with parameter descriptions. Description adds context about concurrency limit (3) and references submit_test inputs. Adds value beyond schema without redundancy.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool submits multiple AI tests as a parallel batch, specifying count limits (up to 20), parallelism vs serial, and distinguishes from sibling submit_test and human-runner batching.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

Explicitly advises use when tests are independent and speed is needed, provides an example, notes cost premium, and mentions AI runner exclusivity. Could be more explicit about when not to use (e.g., dependent tests), but overall strong.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

subscribe_test_identities: Subscribe to persistent persona seats (A)

Creates a Stripe Checkout subscription for managed persistent persona seats. MCP clients can start the subscription flow and return the checkout URL, but a human must approve payment in Stripe. This subscription covers persona storage/inbox/credential retention only; test runs still require credits or internal-use billing.

Parameters

Name	Required	Description
`planId`	Yes
`cancelPath`	No	Optional testmyvibes.com redirect path after canceled checkout.
`successPath`	No	Optional testmyvibes.com redirect path after successful checkout.

Output Schema

Parameters

Name	Required	Description
`result`	No	Tool result payload (JSON object)

Tool Definition Quality

A4.5/5.0

Behavior5/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Discloses behavior beyond annotations: a human must approve payment in Stripe, the subscription covers only persona storage/inbox/credential retention, and test runs still need credits. No contradiction with annotations.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Three sentences front-loading the key action, required human step, and coverage limitations. No redundant information.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the presence of an output schema, the description adequately covers the tool's purpose, flow, and limitations, including what is not included (test run billing). Appropriate for a subscription creation tool.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Description adds overall context but does not detail individual parameters. Schema coverage is 67% and includes enum descriptions, so the description provides some value but not substantial parameter-level meaning.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

Description clearly states it creates a Stripe Checkout subscription for managed persistent persona seats, distinguishing from sibling tools by specifying that test runs require separate credits.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

Provides clear context on when to use: for subscribing to persona seats, with explicit note that human approval is required and what the subscription covers versus test runs. Does not explicitly mention alternatives or when not to use, but context is sufficient.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

synthesize_voiceover: Synthesize a script with Hume Octave, return audio URL (A)

Generates a voiceover from text using Hume Octave TTS. Audio uploaded to Spaces, signed URL returned (24h TTL by default). Charged in credits up-front based on script length (use quote_voiceover for a preview). Best for demo-video narration, tutorial audio, and any one-shot batch TTS. NOT a real-time conversational voice (use Hume EVI for that, different product). Voice options: pass voiceId for a specific Hume voice clone, or omit to use the deployment's default narrator (HUME_OCTAVE_VOICE_ID env var).

Parameters

Name	Required	Description
`text`	Yes	Script text to read aloud. Max 5000 chars per call; split longer scripts.
`voiceId`	No	Hume voice id. Omit to use the deployment's default narrator.
`description`	No	Optional prosody steering, e.g. "warm and conversational, slight pause before the punchline". Biases delivery without changing the script.
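
Scripts over the 5000-character limit need to be split before calling the tool. A minimal sketch of one way to do that, breaking on sentence boundaries where possible, follows. `split_script` is a hypothetical helper, and it assumes the limit counts characters the way Python's `len` does.

```python
import re
from typing import List

MAX_CHARS = 5000  # per-call limit from the `text` parameter description


def split_script(text: str, limit: int = MAX_CHARS) -> List[str]:
    """Split a script into chunks of at most `limit` characters."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        # A single sentence longer than the limit is cut at the limit.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then go to its own synthesize_voiceover call, so each call is quoted and charged separately.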

Output Schema

Parameters

Name	Required	Description
`result`	No	Tool result payload (JSON object)

Tool Definition Quality

A5/5.0

Behavior5/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Beyond annotations (readOnlyHint=false, etc.), the description discloses key behaviors: audio is uploaded to Spaces, signed URL with 24h TTL, credits charged up-front, and voice options (voiceId or default). No contradictions with annotations.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is compact (six short sentences) with the most critical information front-loaded. Every sentence adds distinct value without redundancy or fluff.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's complexity (3 parameters, output schema exists), the description covers all necessary context: purpose, usage, behavioral details, and parameter nuances. It does not need to repeat the output schema as it is already declared.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters5/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Despite 100% schema coverage, the description adds valuable context: 'split longer scripts' for the text maxLength, 'omit to use default narrator' for voiceId, and 'biases delivery without changing script' for the description parameter. This enriches the schema-provided information.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states it generates a voiceover from text using Hume Octave TTS and returns an audio URL. It distinguishes itself from the sibling tool 'quote_voiceover' (cost preview) and from Hume EVI, a separate real-time conversation product, making its purpose unambiguous and differentiated.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines5/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description explicitly states when to use this tool (demo-video narration, tutorial audio, one-shot batch TTS) and when not to (real-time conversation, pointing to Hume EVI as alternative). It also mentions using 'quote_voiceover' for a cost preview, providing clear usage boundaries.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

top_up_credits: Top up credits (Stripe Checkout) (A)

Buy more credits to fund test runs that TestMyVibes' agents will execute on your behalf. Returns a Stripe Checkout URL the user must open to complete payment (Stripe requires human payment completion per their agentic-commerce policy). Once the user pays, the credits are added automatically by the Stripe webhook — poll get_credit_balance to confirm.

Parameters

Name	Required	Description
`packIndex`	Yes	Index of the credit pack from list_credit_packs.
`cancelPath`	No	Optional path on testmyvibes.com to redirect the user to if they cancel. Defaults to '/dashboard/billing?canceled=1'.
`successPath`	No	Optional path on testmyvibes.com to redirect the user to after a successful payment. Defaults to '/dashboard/billing?success=1'.

Output Schema

Parameters

Name	Required	Description
`result`	No	Tool result payload (JSON object)

Tool Definition Quality

A4.7/5.0

Behavior5/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

The description adds significant behavioral context beyond annotations: it reveals that Stripe requires human payment completion due to their agentic-commerce policy, that credits are added automatically via webhook, and that the user must open the returned URL. This supplements the non-readOnly, non-destructive, non-idempotent annotations effectively.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description consists of three concise sentences, each providing essential information without redundancy. It is front-loaded with the purpose and efficiently communicates the workflow and behavioral details.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

The description covers everything needed: it states the return value (Stripe Checkout URL), explains the asynchronous credit addition, and directs the user to confirm via get_credit_balance. No critical information is missing, and the output schema existence is acknowledged.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

With 100% schema coverage, the baseline is 3. The description adds value by explaining that packIndex is an index from list_credit_packs and by detailing the defaults for cancelPath and successPath. This extra context justifies a score of 4.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose: 'Buy more credits to fund test runs.' It specifies the resource (credits), the action (top-up), and the return type (Stripe Checkout URL). This distinguishes it from siblings like get_credit_balance, which is for checking balance.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description explains when to use this tool: when the user needs more credits. It provides clear context by directing the user to poll get_credit_balance to confirm payment completion. While it does not explicitly state when not to use it, the instruction is clear.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

update_feedback: Update a feedback item's status / notes (staff only) (A)

Staff-only triage write. Move feedback through the state machine (new → triaged → planned/wontfix → in_progress → shipped), attach internal notes, or mark as a duplicate of another item. Returns the updated record.
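
The state machine above can be written as a transition table. The sketch below encodes one reading of the arrow notation, in which `wontfix` is terminal and only `planned` items move on to `in_progress`; the server's actual rules may differ. The `duplicate` status is left out because the listing does not say from which states it is reachable.

```python
from typing import Dict, Set

# One interpretation of: new → triaged → planned/wontfix → in_progress → shipped
TRANSITIONS: Dict[str, Set[str]] = {
    "new": {"triaged"},
    "triaged": {"planned", "wontfix"},
    "planned": {"in_progress"},
    "in_progress": {"shipped"},
    "wontfix": set(),   # assumed terminal
    "shipped": set(),   # assumed terminal
}


def can_move(current: str, target: str) -> bool:
    """Return True if `target` is a direct successor of `current`."""
    return target in TRANSITIONS.get(current, set())
```

A client could run this check before calling update_feedback to avoid a round trip on an obviously invalid status change.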

Parameters

Name	Required	Description
`id`	Yes	Feedback id from list_feedback.
`status`	No
`duplicateOf`	No	When status=duplicate, the id of the canonical feedback this collapses into.
`internalNotes`	No	Staff-only commentary. Appended to existing notes if any.

Output Schema

Parameters

Name	Required	Description
`result`	No	Tool result payload (JSON object)

Tool Definition Quality

A4.8/5.0

Behavior5/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

The description adds significant behavioral context beyond annotations: details the state machine transitions, note appending behavior, and duplicate collapsing. It aligns with annotations (non-destructive write) and provides clear expectations.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Three concise sentences, each adding value: context (staff-only), operations (state machine, notes, duplicate), and output (returns record). No redundant or extraneous information.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's complexity (state machine, multiple operations, permissions), the description covers all key aspects. The presence of an output schema means return value is handled externally. No gaps identified.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters5/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

The description explains how each parameter is used in context (status for state machine, duplicateOf for marking, internalNotes for appending). With 75% schema coverage, the description complements the schema well, adding meaning beyond field names.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the verb (update, move, attach, mark), resource (feedback item), and scope (staff-only, state machine, duplicate). It distinguishes from sibling submit_feedback by emphasizing staff triage and state machine progression.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

It specifies the tool is for staff-only triage and lists operations (status change, notes, duplicates). While it doesn't explicitly say when not to use it, the context of 'staff-only' and the state machine progression implies appropriate use cases. An explicit alternative like 'submit_feedback' would improve clarity.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

upsert_worker_profile: Create or update your worker profile (A)

Idempotent create-or-update for the calling account's worker profile. Opt in to the marketplace by setting bio + specialties; opt out by setting isActive=false on every offering. External workers settle in credits/USD; in-house workers (TMV staff doing premium checks) are paid through ShiftSee payroll and require shiftseeUserId.

Parameters

Name	Required	Description	Default
`bio`	No	Worker bio shown on the marketplace menu.
`languages`	No	ISO-639 codes you can test in. Customers filter on this.
`specialties`	No	Tags like 'payments', 'i18n-spanish', 'react-spa' that route customer searches to you.
`employmentType`	No	'external' = pay via credits/Stripe Connect. 'in_house' = TMV staff paid via ShiftSee payroll (requires shiftseeUserId).	external
`qualifications`	No	Verified badges — admin-curated; treated as free-form strings on input.
`shiftseeUserId`	No	Required when employmentType='in_house'. Maps to your ShiftSee user id for payroll routing.
`defaultPayoutMode`	No	External workers only. Default routing for cleared earnings — credit (TMV-internal) or usd (Stripe Connect).
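
The parameter interactions in the table above can be checked on the client before the call is sent. The following is a minimal sketch, not the server's own validation: it assumes the standard MCP `tools/call` JSON-RPC envelope, the helper name `build_upsert_worker_profile_call` is hypothetical, and rejecting `defaultPayoutMode` for in-house workers is a conservative client-side reading of "External workers only" rather than confirmed server behavior.

```python
def build_upsert_worker_profile_call(arguments, request_id=1):
    """Wrap arguments in an MCP tools/call request after local sanity checks.

    Hypothetical helper: the checks mirror the parameter table, not the
    server's actual validation logic.
    """
    # The table lists 'external' as the default employmentType.
    employment_type = arguments.get("employmentType", "external")
    if employment_type not in ("external", "in_house"):
        raise ValueError(f"unknown employmentType: {employment_type!r}")

    # Documented constraint: in_house requires shiftseeUserId.
    if employment_type == "in_house" and not arguments.get("shiftseeUserId"):
        raise ValueError("shiftseeUserId is required when employmentType='in_house'")

    # Assumption: treat 'External workers only' as a hard client-side rule.
    if employment_type == "in_house" and "defaultPayoutMode" in arguments:
        raise ValueError("defaultPayoutMode applies to external workers only")

    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "upsert_worker_profile", "arguments": arguments},
    }


request = build_upsert_worker_profile_call(
    {"bio": "QA for checkout flows", "specialties": ["payments"], "languages": ["en", "es"]}
)
```

Setting bio and specialties, as in the usage line, is the documented way to opt in to the marketplace.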

Output Schema

Parameters

Name	Required	Description
`result`	No	Tool result payload (JSON object)

Tool Definition Quality

A 3.7/5.0

Behavior 1/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

The description claims 'Idempotent' but the annotation idempotentHint=false contradicts this, a serious inconsistency. No other behavioral traits (e.g., side effects, required permissions) are disclosed beyond basic upsert behavior.
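
A mismatch of this kind can be caught mechanically before publishing a tool definition. The sketch below is illustrative only: it assumes a tool definition shaped like an MCP tool object with `description` and `annotations` keys, the function name `find_idempotency_conflicts` is hypothetical, and the substring match is deliberately naive (it would also match "non-idempotent").

```python
def find_idempotency_conflicts(tool):
    """Return messages where the prose and the idempotentHint annotation disagree."""
    description = tool.get("description", "").lower()
    hint = tool.get("annotations", {}).get("idempotentHint")
    claims_idempotent = "idempotent" in description  # naive substring check

    conflicts = []
    if claims_idempotent and hint is False:
        conflicts.append(
            f"{tool.get('name', '<unnamed>')}: description says idempotent "
            "but idempotentHint=false"
        )
    return conflicts


# Example values taken from the review text above.
tool = {
    "name": "upsert_worker_profile",
    "description": "Idempotent create-or-update for the calling account's worker profile.",
    "annotations": {"idempotentHint": False},
}
conflicts = find_idempotency_conflicts(tool)
```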

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Three sentences covering purpose, opt-in/out, and employment types. Concise but the opt-out mention is slightly tangential. Mostly well-structured and front-loaded.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given 7 optional parameters, enums, and an output schema, the description covers main use cases and key constraints. Does not explain return values but output schema exists. Lacks details on limitations or rate limits, but enough for basic usage.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters 4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema coverage is 100%, so the baseline is 3. The description adds value by explaining the two employmentType values, the need for shiftseeUserId when employmentType is in_house, and the opt-in strategy (bio + specialties). It also provides context beyond the schema, such as the settlement differences between external and in-house workers.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states it's an 'Idempotent create-or-update for the calling account's worker profile,' specifying the resource, action, and scope. It distinguishes itself from sibling tools like create_worker_offering by focusing on the profile itself, not offerings.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

Provides explicit guidance on when to use: opt-in by setting bio + specialties, opt-out via offerings (though indirectly). Explains the two employment types and their requirements (shiftseeUserId for in_house, defaultPayoutMode for external). Lacks a direct 'when not to use' statement but is clear in context.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

whoami: Identity + billing-mode self-check (A)

Annotations: Read-only, Idempotent

Returns the calling account's id/email/role plus internal-use eligibility: whether the account is staff-flagged, which domains run free, and how a given target URL would be billed if you submitted a test now. Use this first when you bring TMV into a new project — it confirms the project's API key actually maps to the expected operator account.

ParametersJSON Schema

Name	Required	Description	Default
`targetUrl`	No	Optional URL to check billing-mode against (e.g. the project's homepage). When provided, the response includes the exact billing outcome for that target.
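
As a rough illustration of the single optional parameter, the sketch below builds the request with and without `targetUrl`. It assumes the standard MCP `tools/call` JSON-RPC envelope; the helper name `build_whoami_call` and the http/https check are assumptions added for the example, not documented server rules.

```python
from urllib.parse import urlparse


def build_whoami_call(target_url=None, request_id=1):
    """Build an MCP tools/call request for whoami (hypothetical helper)."""
    arguments = {}
    if target_url is not None:
        parsed = urlparse(target_url)
        # Assumption: only absolute http(s) URLs make sense as a billing target.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"targetUrl must be an absolute http(s) URL: {target_url!r}")
        arguments["targetUrl"] = target_url
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "whoami", "arguments": arguments},
    }


identity_only = build_whoami_call()
with_billing = build_whoami_call("https://example.com/")
```

Omitting `targetUrl` asks only for identity and eligibility; passing it additionally requests the billing outcome for that target.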

Output Schema

Parameters

Name	Required	Description
`result`	No	Tool result payload (JSON object)

Tool Definition Quality

A 4.7/5.0

Behavior 5/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations already indicate readOnly, idempotent, and non-destructive behavior. The description adds specific information about what data is returned (role, staff flag, free domains, billing outcome) beyond the annotations, making the tool's behavior fully transparent.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is two sentences with no redundancy. The first sentence front-loads the return values, and the second sentence gives immediate usage guidance. Every sentence is necessary and well-placed.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

For a tool with one optional parameter and an existing output schema, the description covers all necessary aspects: what it returns, when to use it, and the effect of the parameter. No gaps remain.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters 4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema coverage is 100% with a full description for the optional 'targetUrl' parameter. The description reinforces that providing a URL gives the billing outcome, adding context that internal-use eligibility includes domain and billing checks, which goes slightly beyond the schema.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states that the tool returns the calling account's identity (id/email/role) and internal-use eligibility (staff flag, free domains, billing outcome for a target URL). This specific verb+resource combination distinguishes it from siblings like submit_test or list_projects.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description explicitly says 'Use this first when you bring TMV into a new project', providing a clear use case and context. It does not, however, discuss when not to use it or compare with alternatives, so it could be more detailed.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

Claim this connector by publishing a /.well-known/glama.json file on your server's domain with the following structure:

{
  "$schema": "https://glama.ai/mcp/schemas/connector.json",
  "maintainers": [{ "email": "your-email@example.com" }]
}

The email address must match the email associated with your Glama account. Once published, Glama will automatically detect and verify the file within a few minutes.
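
If the site is generated by a build step, the claim file can be written programmatically. The following is a small sketch under stated assumptions: `site_root` is whatever directory your server publishes at its domain root, and the function name `write_glama_claim` is hypothetical. Only the `$schema` and `maintainers` fields shown above are used.

```python
import json
from pathlib import Path


def write_glama_claim(site_root, emails):
    """Write <site_root>/.well-known/glama.json and return its path."""
    if not emails:
        raise ValueError("at least one maintainer email is required")
    claim = {
        "$schema": "https://glama.ai/mcp/schemas/connector.json",
        "maintainers": [{"email": email} for email in emails],
    }
    path = Path(site_root) / ".well-known" / "glama.json"
    path.parent.mkdir(parents=True, exist_ok=True)  # create .well-known if missing
    path.write_text(json.dumps(claim, indent=2) + "\n", encoding="utf-8")
    return path
```

For example, `write_glama_claim("public", ["your-email@example.com"])` would produce `public/.well-known/glama.json`, which the web server then needs to serve at `/.well-known/glama.json`.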
