Keploy
Server Details
End-to-end API testing — generate and run tests from OpenAPI, curl, Postman, or real user traffic.
- Status: Healthy
- Last Tested:
- Transport: Streamable HTTP
- URL:
- Repository: keploy/keploy
- GitHub Stars: 17,094
Glama MCP Gateway
Connect through Glama MCP Gateway for full control over tool access and complete visibility into every call.
Full call logging
Every tool call is logged with complete inputs and outputs, so you can debug issues and audit what your agents are doing.
Tool access control
Enable or disable individual tools per connector, so you decide what your agents can and cannot do.
Managed credentials
Glama handles OAuth flows, token storage, and automatic rotation, so credentials never expire on your clients.
Usage analytics
See which tools your agents call, how often, and when, so you can understand usage patterns and catch anomalies.
Tool Definition Quality
Average 3.5/5 across 66 of 69 tools scored. Lowest: 2.4/5.
Many tools have overlapping purposes, e.g., createTestSuite vs create_test_suite, and multiple list-like tools. The long instructional descriptions add confusion rather than clarity, making it hard to distinguish tool intent.
Naming is inconsistent: mix of snake_case (create_branch, delete_test_suite) and camelCase (createAPIKey, createApp, bulkDeleteTestSuites). Some tools have very long names that include instructions, breaking convention.
With 69 tools, the server is far over-scoped. Many tools are redundant or overly specific (e.g., scaffold_pipeline_workflow, get_session_report as separate tools). A typical focused server should have 3–15 tools.
Covers a wide range of operations (CRUD, recordings, reports, generation, CI scaffolding), but has odd gaps and redundancies (e.g., two tools for creating a test suite with different mechanics). The surface is broad but not cleanly scoped.
Available Tools
71 tools
bulkDeleteTestSuites (B, Destructive)
POST /apps/{appId}/test-suites/bulk-delete — Bulk-delete test suites — Requires scope: write.
| Name | Required | Description | Default |
|---|---|---|---|
| appId | Yes | Path parameter: appId | |
| test_suite_ids | Yes | | |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already indicate destructive behavior (destructiveHint=true). Description adds the scope requirement, but omits details about reversibility, partial success, or error handling.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Single concise sentence with all key information: HTTP method, path, action, and scope requirement. No wasted words.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Provides basic action and scope, but lacks details about the array parameter, behavior on failure, and output. Given the tool's destructive nature and lack of output schema, the description is minimally adequate.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema description coverage is only 50% (appId described, test_suite_ids not). Description provides no additional meaning for test_suite_ids, failing to compensate for the gap.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
Description clearly states the tool does bulk deletion of test suites, with explicit HTTP method and path. Distinguishes from single-delete siblings like delete_test_suite and deleteTestSuite.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Mentions required scope ('write') but provides no guidance on when to use this tool versus alternatives, nor any prerequisites or exclusions.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
createAPIKey (A, Destructive)
POST /api-keys — Create an API key — Requires scope: admin. The raw key is returned only once in the response.
| Name | Required | Description | Default |
|---|---|---|---|
| name | Yes | | |
| scopes | Yes | | |
| ttl_days | No | | |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
The description adds value beyond annotations by noting that the raw key is returned only once, which is critical for agents to handle the response correctly. Annotations already indicate mutation and destructiveness, but the description provides specific behavioral context.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is a single sentence followed by a critical note, front-loading the purpose. Every piece of information earns its place without redundancy.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
While the description covers the unique behavioral aspect (key returned once) and required scope, it omits details about parameter meanings and error conditions. Given no output schema, agents are left guessing about the return format beyond the key.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 0%, yet the description provides no explanation for any of the three parameters (name, scopes, ttl_days). Agents lack understanding of what 'scopes' or 'ttl_days' mean, severely hindering correct invocation.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the HTTP method (POST), resource (/api-keys), and action (create an API key). It distinguishes itself from siblings like listAPIKeys and revokeAPIKey by specifying creation.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description mentions the required scope (admin) and a key behavior (raw key returned only once), providing clear context for usage. It does not explicitly state when not to use, but the purpose is self-evident.
createApp (B, Destructive)
POST /apps — Create an app — Requires scope: write.
| Name | Required | Description | Default |
|---|---|---|---|
| auth | No | Authentication configuration for test execution. The runner injects the matching headers on every step request. | |
| docs | No | Free-form developer docs the AI uses as additional context. | |
| name | Yes | App name. IMMUTABLE — cannot be changed via updateApp. | |
| schema | No | OpenAPI/Swagger doc the validators use to suggest test cases. | |
| endpoint | No | | |
| webhook_url | No | Optional webhook URL invoked at run lifecycle events. | |
| api_examples | No | Sample request/response pairs the AI consults when authoring suites. | |
| max_test_suites | No | Cap on how many suites generate-tests will mint at once. Server default applies if omitted. | |
| disable_schema_assertion | No | | |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already indicate destructiveHint=true and readOnlyHint=false. The description adds the scope requirement, which is not in annotations. However, it does not disclose any other behavioral traits (e.g., side effects, idempotency, or error conditions). With annotations present, this is adequate but not rich.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is extremely concise, consisting of one sentence that includes the HTTP method, resource, action, and scope. It is front-loaded and contains no unnecessary words. However, it could be slightly expanded to include parameter information without losing conciseness.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given no output schema, the description should explain what the response contains (e.g., the created app object). It does not. Additionally, it omits parameter descriptions and preconditions beyond scope. For a creation tool, this leaves significant gaps in understanding the full context.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema description coverage is 0%, meaning the input schema provides no descriptions. The tool description also provides no information about the parameters (`name`, `endpoint`), failing to compensate for the coverage gap. An agent cannot infer the meaning or format of these parameters from the description alone.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the HTTP method (POST), resource (/apps), and action (Create an app), along with a required scope. This provides a specific verb-resource pairing that distinguishes it from sibling tools like createTestSuite or createAPIKey.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description mentions the required scope `write`, giving a prerequisite, but fails to specify when to use this tool versus alternatives (e.g., when to create an app vs. update an app). No explicit guidance on preconditions or exclusions.
create_branch (A, Destructive)
Create a Keploy branch on an app — find-or-create on name conflict.
Pass the dev's CURRENT GIT BRANCH as the name. Detect it BEFORE calling this tool:
Bash: git rev-parse --abbrev-ref HEAD in the dev's app_dir.
Exit non-zero or output "HEAD" → not in a git repo / detached HEAD; ASK the dev for a name and re-call.
Find-or-create semantics: if a writable Keploy branch with that name already exists, it's returned (no error, no duplicate). Idempotent — safe to call on every retry.
Output: {branch_id, name, status, created} where created=true means a fresh branch was minted, created=false means an existing one was reused.
Pass the returned branch_id to subsequent write tools (create_test_suite, replay_sandbox_test, record_sandbox_test, etc.). Branch_id is sticky for the conversation — don't re-call create_branch unless the dev's git branch changed.
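The detection step described above can be sketched in Python. This is only an illustration of the "exit non-zero or output HEAD means ask the dev" rule; the function names `usable_branch_name` and `detect_git_branch` are hypothetical and are not part of Keploy.

```python
import subprocess
from typing import Optional


def usable_branch_name(returncode: int, stdout: str) -> Optional[str]:
    """Apply the rule from the description to one git invocation's result.

    A non-zero exit (not a git repo) or the literal output "HEAD"
    (detached HEAD) yields None, meaning: ask the dev for a name.
    """
    branch = stdout.strip()
    if returncode != 0 or branch in ("", "HEAD"):
        return None
    return branch


def detect_git_branch(app_dir: str) -> Optional[str]:
    """Run `git rev-parse --abbrev-ref HEAD` inside app_dir."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "--abbrev-ref", "HEAD"],
            cwd=app_dir,
            capture_output=True,
            text=True,
            check=False,
        )
    except OSError:
        # git is not installed, or app_dir does not exist.
        return None
    return usable_branch_name(result.returncode, result.stdout)
```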
| Name | Required | Description | Default |
|---|---|---|---|
| name | Yes | REQUIRED. Pass the dev's current git branch as the name. Detect via Bash `git rev-parse --abbrev-ref HEAD` in app_dir before calling. Don't invent a name — ASK the dev when not in a git repo / detached HEAD. | |
| app_id | Yes | Keploy app ID | |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
The description claims idempotency ('safe to call on every retry'), but annotations set idempotentHint=false, a direct contradiction. Per the rubric, this warrants a score of 1.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is longer than necessary but every sentence adds value, and it is well-structured with clear sections. Could be slightly more concise.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Despite no output schema, the description fully explains the output fields, usage context, and dependencies (sticky branch_id). Covers all needed information for correct invocation.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100%, and the description adds meaningful context for the 'name' parameter with detection logic and find-or-create behavior, though the schema already includes detailed description.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool creates a Keploy branch with find-or-create semantics, distinguishing it from siblings like list_branches or createApp.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Explicitly instructs to pass the dev's current git branch, provides detection commands, and warns not to re-call unless the branch changed. Clearly differentiates from other tools.
createCIBranch (A, Destructive)
POST /apps/{appId}/branches/ci — Find-or-create a CI branch — Creates a new Keploy branch (or returns the existing one) for a CI pipeline run. Idempotent on the (appId, name) pair. Requires scope: write.
| Name | Required | Description | Default |
|---|---|---|---|
| name | Yes | Branch name (e.g., `pr-123`) | |
| appId | Yes | Path parameter: appId | |
| git_ref | No | Optional Git provider context (PR, repo, etc.) | |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
The description claims idempotency (find-or-create) which directly contradicts the annotation 'idempotentHint: false'. While it provides useful scope and destructive context, this contradiction significantly undermines reliability.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is concise with front-loaded endpoint and purpose. Each sentence adds value, though the structure could be slightly improved for readability.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given no output schema and three parameters, the description covers purpose, idempotency, permissions, and key pair. Missing explicit return value info but still fairly complete for a creation tool.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100%, so the description adds limited meaning beyond what the schema already provides. The mention of idempotency on (appId, name) pair reinforces their role but adds little new information.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool creates or finds a CI branch for a pipeline run, specifying the HTTP endpoint and distinguishing it from sibling tools like 'create_branch' by focusing on CI pipeline context.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description indicates usage for CI pipeline runs and mentions idempotency and required scope, but does not explicitly exclude alternatives or state when not to use this tool versus 'create_branch'.
create_test_suite (A, Destructive)
Create a new API test suite with test steps. Each step defines an HTTP request and assertions to validate the response. Steps can extract values from responses into variables for chaining requests.
═══════════════════════════════════════════════════════════════════ STEP 0 — read the canonical schema BEFORE drafting: ═══════════════════════════════════════════════════════════════════
If you've already called get_app_testing_context, the canonical step schema is in its response under the step_schema field — read it from there. Otherwise run keploy test-suite-format once before writing any suite JSON. The schema describes the MANDATORY rules below in detail plus the two-step prelude+POST skeleton you must follow. Authors who skip this and draft from training-data priors burn ~50s per validator rejection on iter 1.
═══════════════════════════════════════════════════════════════════ MANDATORY FOR EVERY STEP — the validator rejects on iter 1 if any of these are violated: ═══════════════════════════════════════════════════════════════════
R10 — every step MUST carry a captured "response": {status, body, headers} block. Hit the endpoint locally before authoring (curl) and paste the real response. Steps with no response block are rejected outright; four downstream rules (R4 / R11 / R15-R16 / R27) silently no-op until R10 is satisfied, so missing a response also hides every assertion / extract problem in that step.
R9 — every POST / PUT / PATCH body MUST reference at least one {{var}} whose generator is declared on an EARLIER step's "extract" (typically a /health prelude as step 0). Without this, the second run collides on the first run's database state. App-level appLevelCustomVariables DO NOT qualify for R9 — the validator only credits step-level extracts.
R2 — pre-request fields ("body", "url", "headers") CANNOT reference the CURRENT step's own "extract" outputs. Extract runs AFTER the response comes back; pre-request substitution sees nothing yet.
Together R9 + R2 force the prelude pattern: declare generators on step 0, use them from step 1+. The STEP SHAPE example below shows the canonical two-step layout.
R15 — every assertion's path / status / header MUST resolve against the AUTHORED response block. JSONPath uses gjson dot-array syntax: $.orders.0.id — NOT $.orders[0].id (the bracket form does not resolve in gjson; the assertion is rejected as "key not present in recorded body"). For status_code / header_* assertions, the values must match what's in response.status / response.headers verbatim — capture the real response via curl before authoring.
R32 — every step-level extract key MUST NOT collide with the app's appLevelCustomVariables (enumerate them via get_app_testing_context or getApp before authoring). The runtime's variable lookup resolves app-level first, so a colliding key means the suite's function silently never runs. Suite-suffix when in doubt: userNonceForSuite, not genUserId. Don't invent a parallel generator with the same name as an existing app-level one.
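As a small aid for the R15 path rule, the bracket form can be rewritten mechanically into the dot-array form. The helper below is a hypothetical sketch (`to_gjson_dot_array` is not a Keploy function) and only handles plain numeric indices such as `[0]`.

```python
import re

# Matches a numeric array index written in bracket form, e.g. "[0]".
_BRACKET_INDEX = re.compile(r"\[(\d+)\]")


def to_gjson_dot_array(path: str) -> str:
    """Rewrite numeric bracket indices to dot form: $.orders[0].id -> $.orders.0.id."""
    return _BRACKET_INDEX.sub(r".\1", path)
```

Paths that are already in dot-array form pass through unchanged, so the helper is safe to apply to every assertion key before submitting a suite.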
═══════════════════════════════════════════════════════════════════ APP CONFIG FIRST — read the app before authoring: ═══════════════════════════════════════════════════════════════════
Before any other step, call getApp({app_id}) and read these fields:
appLevelCustomVariables — dynamic generators and static fixtures pre-configured by the dev, shared across every suite for this app. Common shapes:
- genUserId, genProductName (JS functions returning fresh entropy per run, e.g. alice_<rand1-10000>)
- staticUser (a fixed user the dev wants tests to use)
- zeroQuantity, negativePrice, invalidUser (static fixtures for validation tests)
PREFER these over inventing your own JS-function in extract. They're the dev's authoritative dynamic-input set — using them in POST/PUT/PATCH bodies via {{varName}} means each replay hits a fresh row, sidestepping duplicate-key errors. Inventing a parallel generator with the same intent risks name-collision rejection (see Name-collision check below).
auth — the auth shape suites must satisfy (header / cookie / oauth / none).
ignoreEndpoints, rateLimit, timeout — runtime knobs that shape what assertions can hold.
If a relevant gen* already exists in appLevelCustomVariables, ALWAYS reference it via {{name}} rather than authoring a parallel one. The dev configured it for a reason.
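The name-collision concern can be checked locally before submitting a suite. The sketch below assumes steps are plain dicts with an optional "extract" mapping and that the app-level variable names have already been read from the app config; `colliding_extract_keys` is a hypothetical helper, not part of Keploy.

```python
from typing import Iterable, List


def colliding_extract_keys(steps: List[dict], app_level_vars: Iterable[str]) -> List[str]:
    """Return step-level extract keys that shadow an app-level custom variable."""
    app_names = set(app_level_vars)
    collisions = set()
    for step in steps:
        # Steps without an "extract" block contribute nothing.
        collisions |= set(step.get("extract", {})) & app_names
    return sorted(collisions)
```

An empty result means no step-level key shares a name with an app-level variable; any returned name should be renamed (for example with a suite-specific suffix) or replaced by a {{name}} reference to the existing app-level variable.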
═══════════════════════════════════════════════════════════════════ BEFORE CREATING — check for duplicates AND for existing recordings: ═══════════════════════════════════════════════════════════════════
A (app_id, branch_id) tuple holds at most one suite per scenario, AND if the dev has already captured the relevant traffic via keploy record, you should seed from that recording instead of curling the app fresh. Two bounded checks before create_test_suite:
(1) Duplicate-suite check — call listTestSuites({app_id, branch_id, q: "<keyword>"}) where <keyword> is a substring of the name you're about to author (e.g. "checkout", "auth"). The server filters by name regex, so the response is bounded to relevant matches regardless of how many suites the app has. If you can't pick a keyword (the dev's intent is vague), call with page_size=20 and NO q, then scan the first page only — DON'T paginate further. Match by name (case-insensitive) AND by intent. If any existing suite covers the same scenario:
- Same scenario, refresh wanted → call update_test_suite (preserves history) or delete_test_suite + create_test_suite (loses history).
- Adjacent but distinct scenario (e.g. "checkout with discount" vs "checkout without discount") → create with a name that distinguishes them clearly.
(2) Recording-reuse check — call listRecordings({app_id, limit: 10}) to fetch the 10 most recent keploy record sessions. Recordings cluster by scenario; the top 10 cover what's likely relevant — DON'T paginate the full history. For any recording whose name/timestamp suggests it covers the scenario you're authoring, call download_recording({app_id, test_set_id}) to pull its captured test cases (real request/response pairs from the live app). Seed your steps_json from those test cases — convert each one into a step (method/url/headers/body → request fields; recorded response → step's response field). This is more faithful than re-curling and saves the dev's time.
If no recent recording covers the scenario, fall through to the normal validate-locally-before-inserting flow (curl each endpoint yourself).
(3) Only after both checks → proceed with create_test_suite.
Skipping (1) leaves the dev with two suites covering the same flow — confusing reports, double rerecord cost, and orphaned sandbox tests on whichever suite they stop using. Skipping (2) re-curls endpoints whose traffic the dev already captured.
═══════════════════════════════════════════════════════════════════ ONE SCENARIO PER SUITE — load-bearing constraint: ═══════════════════════════════════════════════════════════════════
A suite represents EXACTLY ONE user-facing scenario / use-case (e.g. "user registers and creates their first order", "admin promotes a user role", "checkout with discount applied"). Do NOT pack multiple unrelated scenarios into a single suite — every step in a suite shares state and ordering with every other step. Mixing scenarios breaks idempotency (cleanup for one scenario can wipe state another scenario assumed), makes failures harder to diagnose, and inflates rerecord cost.
Tests for "auth + payments + cleanup" → THREE suites, not one. Related steps that share extracted vars and a state assumption belong in the same suite; unrelated flows don't.
When in doubt: if you can't write a single sentence describing what the suite tests in user-facing terms, split it.
═══════════════════════════════════════════════════════════════════ IDEMPOTENCY CONTRACT — the load-bearing rule for every suite: ═══════════════════════════════════════════════════════════════════
Every suite MUST be replayable indefinitely without state drift. The same suite run twice in a row, or 100 times back-to-back, must produce the same per-step outcomes. Failing this makes the suite useless for sandbox replay (the captured mocks freeze a single point-in-time response, so any state-dependent step diverges on rerun).
How to design for it:
Duplicate-key 500 on POST / PUT / PATCH replay ("duplicate key" / "already exists" / "unique constraint violated") is ALWAYS a SUITE design problem, NEVER an app problem. Fix order:
(1) Reference an app-level gen* var via {{name}} in the body — works if one exists (you read appLevelCustomVariables in APP CONFIG FIRST).
(2) If no fitting app-level generator, declare your own JS-function in a PRIOR step's extract (see PRELUDE PATTERN); reference it via {{name}} in the failing step's body.
(3) Add a DELETE cleanup step earlier in the suite to clear the conflicting row.
NEVER propose modifying app code (e.g. adding ON CONFLICT to the INSERT, retry loops, transactional wrappers). The app's dedup is correct; the suite is what's missing entropy. See DO NOT MODIFY APP SOURCE CODE below.
If a step CREATES a resource, a later step in the same suite MUST clean it up (DELETE the row, revert the state) — OR the create must be idempotent on the server side (PUT-by-key, upsert). A naked POST that always allocates a new ID will diverge on every replay.
If a step depends on a resource, EXTRACT its identity from a prior step's response into a {{var}} — never hard-code an ID that "happens to exist right now". Hard-coded IDs rot.
Reject "natural-language idempotency" reasoning ("the dev will reset the DB before each run"). The suite must work without external setup. If you can't guarantee it, you've packed two scenarios into one suite — split them.
Do not assume time-of-day, ordering relative to other suites, or random-but-stable values. Each suite is its own universe.
Pagination / list endpoints: extract the count or a known item, don't assert on absolute indices ("the third item is X") — index drifts as the dataset grows.
Auth tokens: pull from app-level custom variables or extract from a login step IN THE SUITE. Never inline a token that expires.
If the dev's request implies non-idempotent behaviour (e.g. "create user, then test that creating the same user fails"), capture both states explicitly inside the suite — first step creates, second step asserts the conflict response, third step deletes — so the suite as a whole is still replayable. Don't push the cleanup outside the suite.
A suite that fails idempotency is rejected at create_test_suite time by the dynamic validator (2 live runs check). When that fails, do NOT retry by tweaking syntax — restructure the scenario.
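The "same outcomes on every replay" requirement can be approximated locally by running a suite twice and comparing per-step results. The sketch below is not the validator's actual implementation; `run_suite` is a hypothetical callable that executes the suite once and returns (step name, status code) pairs.

```python
from typing import Callable, List, Sequence, Tuple

Outcome = Tuple[str, int]  # (step name, HTTP status code)


def diverging_steps(run_suite: Callable[[], Sequence[Outcome]], runs: int = 2) -> List[str]:
    """Run the suite `runs` times and list steps whose outcome differs from run 1."""
    baseline = list(run_suite())
    diverged: List[str] = []
    for _ in range(runs - 1):
        current = list(run_suite())
        for index, expected in enumerate(baseline):
            actual = current[index] if index < len(current) else None
            if actual != expected and expected[0] not in diverged:
                diverged.append(expected[0])
    return diverged
```

A non-empty result points at the steps that depend on state left behind by the previous run, which is where a generator variable or a cleanup step is needed.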
═══════════════════════════════════════════════════════════════════ DO NOT MODIFY APP SOURCE CODE during suite authoring: ═══════════════════════════════════════════════════════════════════
At create_test_suite time, your job is to author a suite that fits the app AS IT IS. You may patch the app's CONFIG (auth, appLevelCustomVariables, ignoreEndpoints, rateLimit) via updateApp({app_id, ...}) — those are runtime knobs the dev expects to tune. You may NOT modify the app's SOURCE CODE.
If a step is failing because of how the app behaves (500s, contract mismatches, missing endpoints, validation errors), the response is ONE of:
Adjust the suite to match observed behavior (steps_json edits before insert).
Use an app-level dynamic var (see APP CONFIG FIRST) or a JS-function generator to avoid the failure (see IDEMPOTENCY CONTRACT's duplicate-key fix order).
Patch the app's CONFIG via updateApp if the cause is auth / vars / rate limit.
If the dev confirms the app is broken AND the suite is correct, ASK the dev to fix the app — do NOT propose code changes yourself during authoring.
NEVER propose ON CONFLICT clauses, retry loops, transactional wrappers, or any code-level change to the dev's application as a way to make the suite work. The suite must accommodate the app, not the other way around.
═══════════════════════════════════════════════════════════════════ NEVER-MISS-THESE — the validator HARD-REJECTS suites missing any of these: ═══════════════════════════════════════════════════════════════════
- response on EVERY step — { status: , headers: {…}, body: "" }. Captured from a real curl against the dev's app. body MUST be a JSON-encoded STRING (the raw body bytes), NOT a parsed object. Wrap with json.dumps / JSON.stringify if your tool gave you a dict.
- extract is the ONLY authoring slot — never extract_variables. extract_variables is a post-run runtime SNAPSHOT field; the extract_variables-input rejection rule hard-rejects it on input. If you read an existing suite via getTestSuite / get_app_testing_context / download_recording and see extract_variables populated there — IGNORE IT, that's the runtime's display state, not the suite's input. Always author with extract.
- POST/PUT/PATCH bodies need a per-run dynamic {{var}} (mutating-step dynamism check). Declare a JS-function generator on a PRIOR step's extract (typically a /health prelude as step 0). The declare-and-use-same-step check forbids declaring-and-using on the same step. See PRELUDE PATTERN below.
- JSONPath uses gjson dot-array syntax: $.orders.0.user_id — NOT $.orders[0].user_id.
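The "body must be a JSON-encoded string" requirement is easy to get wrong when an HTTP client has already parsed the response. A minimal sketch, assuming a hypothetical helper `make_response_block`: a string body is kept as captured, and a parsed object is re-serialized with json.dumps. Note that re-serializing may not reproduce the server's exact bytes (whitespace, key order), so pasting the raw body remains the more faithful option.

```python
import json
from typing import Any, Dict


def make_response_block(status: int, headers: Dict[str, str], body: Any) -> dict:
    """Build a step's "response" block with body stored as a string."""
    if not isinstance(body, str):
        # A parsed dict or list is serialized back to compact JSON text.
        body = json.dumps(body, separators=(",", ":"))
    return {"status": status, "headers": dict(headers), "body": body}
```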
═══════════════════════════════════════════════════════════════════
STEP SHAPE (steps_json is an ARRAY — copy this two-step skeleton verbatim, preserve the prelude pattern):

    [
      {
        // Step 0: cheap read prelude. Its sole job is to declare JS-function generators that
        // later POST/PUT/PATCH bodies reference. Required by R9 (mutating bodies need a per-run
        // dynamic var) + R2 (same-step extract isn't usable in pre-request fields). If your
        // suite already has a natural read step (/health, /me, version), reuse it as the prelude.
        "name": "health prelude (declares generators)",
        "method": "GET",
        "url": "/health",
        "headers": { "Accept": "application/json" },
        "extract": {
          "genUserId": "function genUserId(){return 'u_'+Date.now()+'_'+Math.random().toString(36).slice(2,8);}"
        },
        "assert": [
          { "type": "status_code", "expected": "200" },
          { "type": "json_equal", "key": "$.status", "expected": "healthy" }
        ],
        "response": {
          "status": 200,
          "headers": { "Content-Type": "application/json" },
          "body": "{\"status\":\"healthy\"}"
        }
      },
      {
        // Step 1: the actual mutation. Body references {{genUserId}} from the PRIOR step's
        // extract — satisfies R9 (per-run dynamic var) and R2 (not same-step). This step's
        // own "extract" captures the SERVER's response value (JSONPath) so a later step can
        // chain to {{user_id}} — JSONPath captures on the same step ARE legal because they
        // resolve post-response and only matter for subsequent steps.
        "name": "create user",
        "method": "POST",
        "url": "/api/users",
        "headers": { "Content-Type": "application/json" },
        "body": "{\"name\":\"{{genUserId}}\"}",
        "extract": { "user_id": "$.data.id" },
        "assert": [
          { "type": "status_code", "expected": "201" },
          // assert a STATIC field of the response, not a dynamic one. R30
          // forbids {{genUserId}} in assert.expected (the runtime would
          // re-evaluate the function at assertion time and the value
          // wouldn't match the body's earlier call). Pick something the
          // server always returns the same — here the literal "status"
          // field. To assert against the dynamic id the server minted,
          // capture it via extract (above) and reference {{user_id}} in
          // a LATER step's assertion or url, not this step's.
          { "type": "json_equal", "key": "$.status", "expected": "created" }
        ],
        "response": {
          "status": 201,
          "headers": { "Content-Type": "application/json" },
          "body": "{\"data\":{\"id\":\"abc-123\",\"name\":\"u_1700000000_xyz\"},\"status\":\"created\"}"
        }
      }
    ]
VALID assertion types (ONLY use these — anything else fails the step at runtime with "invalid assertion type"):
status_code — exact HTTP status match. {type, expected:"201"}
status_code_class — match by class 2xx/3xx/… {type, expected:"2xx"}
status_code_in — any of a set. DELETE STEPS ONLY (status_code_in-scope check). For POST/GET/PUT/PATCH this is rejected. If you reach for it to absorb a duplicate-key 500 on re-runs, the right fix is a JS-function {{var}} in the body (see extract rules below) so each run hits a fresh row and only 201 is ever returned. {type, expected:"200,201,204"}
header_equal — response header exact match. {type, key:"Content-Type", expected:"application/json"}
header_contains — header value substring. {type, key:"Location", expected:"/orders/"}
header_exists — header is present. {type, key:"X-Request-Id"}
header_matches — header regex. {type, key:"Etag", expected:"^W/\".+\"$"}
json_equal — response body JSON path exact match. {type, key:"$.order.status", expected:"created"}
json_contains — response body JSON path substring/partial. {type, key:"$.message", expected:"success"}
custom_functions — inline JS function: (request, response, variables, steps) => boolean. {type, expected:"function f(request,response){return response.status===201;}"}
DO NOT use any assertion type not in the closed list above. The set is fixed at exactly 10 entries — there are no wildcards. These are types AIs commonly invent that DO NOT EXIST in keploy and will fail the assertion-type closed-list check:
✗ json_type — there is no type-of check; assert against the literal value via json_equal, or use custom_functions with a typeof predicate.
✗ json_path — paths are passed via the "key" field of json_equal / json_contains; there is no separate path-only type.
✗ json_schema — no schema validation; closest is custom_functions with an inline schema check.
✗ json_array_length — no length-only assertion; capture .length via extract, or use custom_functions.
✗ header_starts / header_ends — only header_equal, header_contains, header_exists, header_matches (regex) exist.
✗ status_in / status_range — the real names are status_code_in / status_code_class.
✗ body / body_equal — no body-level type; assert against parsed paths via json_equal / json_contains, or use custom_functions.
Anything not literally in the bulleted list above will get rejected by the validator — don't extrapolate from prefixes.
expected values must be STRINGS (put numbers like 201 in quotes). expected_string is auto-populated; you can omit it.
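The closed list and the string-only rule for expected can be checked before a suite is submitted. The following is a minimal sketch, not Keploy's validator: the helper name lintAssertions and its message wording are hypothetical, and the rules it encodes are only the ones stated in the text above.

```javascript
// Closed list copied from the description above; this helper is illustrative only.
const VALID_ASSERTION_TYPES = new Set([
  "status_code", "status_code_class", "status_code_in",
  "header_equal", "header_contains", "header_exists", "header_matches",
  "json_equal", "json_contains", "custom_functions",
]);

// Returns human-readable problems for one step's "assert" array.
function lintAssertions(step) {
  const problems = [];
  (step.assert || []).forEach((a, i) => {
    if (!VALID_ASSERTION_TYPES.has(a.type)) {
      problems.push(`assert[${i}]: unknown type "${a.type}"`);
    }
    // header_exists carries no expected value; every other type must use a string.
    if (a.type !== "header_exists" && typeof a.expected !== "string") {
      problems.push(`assert[${i}]: expected must be a string`);
    }
    // Per the list above, status_code_in is accepted on DELETE steps only.
    if (a.type === "status_code_in" && step.method !== "DELETE") {
      problems.push(`assert[${i}]: status_code_in is allowed on DELETE steps only`);
    }
  });
  return problems;
}
```

Running it over every step of a draft suite gives a quick local signal before the real checks run; an empty array means none of these three rules was violated.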
VARIABLES — purpose-first: a suite is a SCENARIO CHAIN; variables carry continuity between steps. Step N creates or fetches a resource → extracts its identity into a named var → step N+M uses {{var}} to reference that identity. If you find yourself extracting a value that NO LATER STEP references, DROP the extract — it's noise that hides which fields actually drive the scenario. Mechanically: extract values from one step's response with extract: {varname: "$.path"} (JSONPath). Reference later with {{varname}} in headers, body, url, or assertion expected values.
STEP IDS & TRACKING HEADERS are auto-injected — don't provide them. The server assigns a UUID per step and adds X-Keploy-Test-Step-ID / X-Keploy-Test-Suite-ID / Keploy-Test-Name so the sandbox runner can correlate responses to steps.
VARIABLE RULES (the runner follows these exactly — see pkg/service/atg/customFeatures.go ResolveCustomVariables):
Syntax: {{name}} — regex matched: {{(\w+)}} (letters, digits, and underscore only; NO whitespace, hyphens, or dots inside the braces). Names like {{gen-user}} or {{gen.user}} will NOT be substituted — use {{gen_user}} instead.
Substitution happens in: url, body, headers values, AND assertion "expected" values (so an assertion expecting {{genUserId}} gets the SAME resolved value the body used).
Resolution sources (looked up in this order):
1. The step's own extract map (seeded into the vars pool at step entry — pkg/service/atg/core.go:3698).
2. Variables produced by EARLIER steps' extract maps (post-response JSONPath captures).
3. App-level custom variables (stored on the app record, shared across all suites).
THE extract FIELD IS THE ONLY AUTHORING SLOT — use it for BOTH static values and JS-function generators.
extract_variables IS NOT AN AUTHORING SLOT. It's a post-run runtime SNAPSHOT — the runner writes resolved {{var}} values there after each step executes so the UI can show what landed at runtime. The validator now HARD-REJECTS any step with extract_variables populated (extract_variables-input rejection). If you see extract_variables while reading an existing suite via getTestSuite / download_recording / get_app_testing_context, IGNORE IT — that's the runtime's display state, not the suite's input. To author the equivalent, put every entry into extract instead (same keys, same values: JSONPath strings stay JSONPath, JS-function strings stay JS).
TWO SHAPES THE extract FIELD ACCEPTS — pick the right one:
(a) JSONPath capture — "order_id": "$.order.id"
Evaluated against the step's recorded response.body after the request returns. The captured value is staged into vars for LATER steps to reference via {{order_id}}. Use this when the value you need is in the server's response.
(b) Inline JS-function generator — "genUserId": "function genUserId(){ return 'alice_' + Date.now() + '_' + Math.random().toString(36).slice(2,8); }"
The value string must contain the keyword function — that is how the runner (core.go:4107 isInlineJs branch) distinguishes JS from a JSONPath. Signature: function <name>(steps) { ... return <string>; }, returning a string. The steps arg is a map of prior-step {request,response} snapshots; ignore it if unused. Use this for inputs that must be unique per run (user_ids, timestamps, uuids) so the suite stays idempotent on re-runs against the same DB.
Examples:
{"genUserId": "function genUserId() { return 'alice_' + Date.now() + '_' + Math.random().toString(36).slice(2,8); }"}
{"genTs": "function genTs() { return String(Date.now() * 1e6 + Math.floor(Math.random()*1e6)); }"}
{"genOrderId": "function genOrderId(steps) { return 'ord-' + Math.random().toString(36).slice(2,10); }"}
Put the JS-function entry on the FIRST step that needs it (often a health-check step whose own body doesn't reference the var — that's fine, the seed fires on step entry regardless). Later steps reference {{genUserId}} in body/url/headers/assertion-expected and see the same resolved value within one run, a fresh value on the next run.
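To make the two extract shapes concrete, here is a small sketch of how a value could be classified and how a generator string could be evaluated while validating locally. It assumes only what the text states (a value containing the keyword function is JS, anything else is a JSONPath); classifyExtract and runGenerator are hypothetical names, and the runner's real implementation may differ.

```javascript
// Assumption: a value containing the keyword "function" is an inline JS generator;
// everything else is treated as a JSONPath string.
function classifyExtract(value) {
  return /\bfunction\b/.test(value) ? "js" : "jsonpath";
}

// Evaluates a generator source string once and returns its result as a string.
// new Function is used for illustration only; never feed it untrusted input.
function runGenerator(source, steps = {}) {
  const fn = new Function("steps", `return (${source})(steps);`);
  return String(fn(steps));
}

const generatorSource =
  "function genUserId(){ return 'alice_' + Date.now() + '_' + Math.random().toString(36).slice(2,8); }";

console.log(classifyExtract(generatorSource)); // "js"
console.log(classifyExtract("$.order.id"));    // "jsonpath"
console.log(runGenerator(generatorSource));    // value differs on every call
```

Evaluating the generator yourself is one way to obtain the concrete value needed for a single local curl, while the extract entry keeps the function source verbatim.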
WHY THIS MATTERS (the mistake to avoid): if you pin a static user_id like "alice_1776638347063146000" into extract AND your validation curl ALREADY inserted that row, the very next record/sandbox replay run will fire the same POST body, the producer (with deterministic ids or unique-constraint indexes) will reject the duplicate with a 500, and every downstream $.order.* / $.shard.* assertion will fail. Fix: use a JS-function entry so every run gets a fresh user_id.
VARIABLE CHAINING (JS-function generator on step 1, JSONPath capture for step-2 chaining):
step 1:
  body: {"user_id":"alice_{{genUserId}}","product_name":"Keyboard_{{genUserId}}"}
  extract: { "genUserId": "function genUserId(){ return Date.now() + '_' + Math.random().toString(36).slice(2,8); }", "order_id": "$.order.id" }
  assertions: [ {type:"status_code", expected:"201"}, {type:"json_equal", key:"$.order.user_id", expected:"alice_{{genUserId}}"} -- SAME resolved value as the body used ]
step 2:
  url: /api/orders/{{order_id}} -- resolves from step 1's JSONPath extract at run time
  assertions: [ {type:"json_equal", key:"$.id", expected:"{{order_id}}"} ]
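The {{name}} syntax rule is easy to get wrong, so a short sketch of the documented pattern may help. It uses the regex given in the variable rules; the function name substitute and the choice to leave unknown names untouched are assumptions of this example, not confirmed runner behaviour.

```javascript
// Mirrors the documented pattern {{(\w+)}}; this is not the runner's actual code.
const PLACEHOLDER = /\{\{(\w+)\}\}/g;

// Replaces every {{name}} found in vars; names that are not in vars are left as-is.
function substitute(template, vars) {
  return template.replace(PLACEHOLDER, (match, name) =>
    Object.prototype.hasOwnProperty.call(vars, name) ? String(vars[name]) : match
  );
}

const resolvedVars = { gen_user: "u_42", order_id: "abc-123" };

console.log(substitute("/api/orders/{{order_id}}", resolvedVars));
// /api/orders/abc-123
console.log(substitute('{"user":"{{gen_user}}","alt":"{{gen-user}}"}', resolvedVars));
// {"user":"u_42","alt":"{{gen-user}}"}
```

The second call shows why hyphenated names fail: {{gen-user}} never matches \w+, so it survives into the request unchanged.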
PRELUDE PATTERN — when MULTIPLE POST steps each need their OWN per-run dynamic var:
A common mistake is to put the JS-function generator on the SAME step that uses it in the request body. The declare-and-use-same-step check rejects this — same-step extract is post-response, so its values aren't in scope when the request fires. Pattern that works: declare the generator on an EARLIER step's extract (typically a cheap /health GET as a "prelude"). The runner seeds extract values at STEP ENTRY, so a generator on step 0 is in scope from step 0 onwards — every later POST can reference it.
step 0 — prelude (the extract entry is what matters; the step itself can be anything cheap):
method: GET, url: /health
extract: { "uniq": "function uniq(){ return 'p_'+Date.now()+'_'+Math.random().toString(36).slice(2,8); }" }
assertions: [ {type:"status_code", expected:"200"} ]
step 1, 2, 3 — POSTs that all reference {{uniq}}:
method: POST, url: /api/orders
body: '{"user_id":"alice","product_name":"{{uniq}}",...}'
// (no `extract` needed — the generator is in scope from step 0)
The prelude itself doesn't need to USE the var; declaring it is enough. This is the right shape for "create N orders with different unique keys" — without the prelude, you'd hit the declare-and-use-same-step check on every POST that tries to declare-and-use a generator on the same step.
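A draft suite can be scanned for the declare-and-use-same-step mistake before submission. The sketch below is a hypothetical lint (findSameStepGeneratorUse is not a Keploy function) that only looks at request bodies, which is where the text says the mistake usually happens.

```javascript
// Flags steps whose body references a generator declared in that same step's extract.
function findSameStepGeneratorUse(steps) {
  const offenders = [];
  steps.forEach((step, index) => {
    const body = step.body || "";
    for (const [name, value] of Object.entries(step.extract || {})) {
      const isGenerator = /\bfunction\b/.test(value);
      if (isGenerator && body.includes(`{{${name}}}`)) {
        offenders.push({ index, name });
      }
    }
  });
  return offenders;
}

// Prelude shape: generator declared on step 0, used by the POST on step 1.
const withPrelude = [
  { method: "GET", url: "/health", extract: { uniq: "function uniq(){ return 'p_' + Date.now(); }" } },
  { method: "POST", url: "/api/orders", body: '{"user_id":"alice","product_name":"{{uniq}}"}' },
];
console.log(findSameStepGeneratorUse(withPrelude)); // []
```

Moving a flagged generator to an earlier step's extract is the fix the prelude pattern describes.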
VALIDATE-LOCALLY-BEFORE-INSERTING (CRITICAL for a usable suite): DO NOT call this tool with raw un-tested steps. EVERY step you send MUST have its "response" and "extract" fields populated from a live run. These are NOT optional. Without them:
The UI cannot render the step (shows an empty panel).
The rerecord runs blind and fails.
The step's {{variables}} won't resolve.
Required per-step fields when calling this tool (in steps_json):
• name, method, url, headers, assert — obviously
• body — for POST/PUT/PATCH; MUST reference random inputs as {{varname}} placeholders, NOT inline timestamps. Inline timestamps get baked into the suite and collide on re-run.
• extract — MUST contain a resolvable entry for every {{varname}} you reference in body/url/headers. JS-function entries are fine (they're the canonical "dynamic input" shape); JSONPath entries chain values from one step's response into later steps. Example: body has {"user_id":"alice_{{genUserId}}"} → step's extract must have {"genUserId":"function genUserId(){ ... }"}.
• response — the raw captured response from your local curl. Shape: {"body":"<raw response body>","status":201,"headers":{"Content-Type":"application/json",...}}.
MANDATORY validate-locally flow (do this BEFORE calling create_test_suite):
1. Bring the dev's app up locally (Bash: docker compose up -d, or instruct the dev). Wait for /health readiness.
2. For EACH step in order (simulating what the runner will do):
a. For dynamic inputs (user_id, timestamps, uuids): DON'T inline a value — write a JS function into the step's extract map, e.g. {"genUserId":"function genUserId(){return 'alice_'+Date.now()+'_'+Math.random().toString(36).slice(2,8);}"}, and reference it in body/url/headers/assertion-expected as {{genUserId}}. For the local curl, you still need a CONCRETE value for that run — so as you're building the step, JS-eval the function yourself (or just pick a value consistent with the function's output shape) to do ONE concrete local curl for capturing the response. That concrete value goes ONLY into the captured "response" body — NOT into extract (which keeps the JS function verbatim).
b. Substitute {{name}} everywhere in the step's url/body/headers using accumulated variables (this step's extract + earlier steps' extract results).
c. curl the SUBSTITUTED request against the live app. Capture the response.
d. Check each "assert" against the captured response. If any fails → regenerate (different inputs / loosen assertion / change body shape) and retry this step. DO NOT move on with a failing step.
e. Save the captured response into the step's "response" field as {"body":"<raw response body>","status":<status code>,"headers":{...}}.
f. If the step has JSONPath entries in extract, evaluate each path against the response and note those values so later steps can use them in their {{var}} substitutions.
3. AFTER the first pass of all steps, run the WHOLE SEQUENCE a SECOND TIME against the same live app — no DB wipe in between. Because you're using JS-function generators, each run should pick fresh random inputs → no unique-constraint collision. If ANY step that passed the 1st run fails the 2nd run (common symptom: 500 "failed to save order" / "duplicate key" / "already exists"), the suite is NOT idempotent. Go back to step 2a and either (i) increase the entropy of the JS function, (ii) restructure the step to be READ-AFTER-WRITE instead of POST-then-POST, or (iii) drop the step if it genuinely can't be made idempotent.
4. Only once every step passes BOTH validation runs → call create_test_suite. Pass each step's response + extract (with JS functions verbatim) through steps_json.
If you call create_test_suite without response + extract on every step, you are creating a suite that is broken by construction. The UI + rerecord WILL fail.
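The "check each assert against the captured response" part of the flow can be sketched in a few lines. This example covers only status_code and json_equal and compares values as strings; both choices, along with the names evalDotPath and checkAssertion, are assumptions made for illustration rather than a description of how the runner compares values.

```javascript
// Resolves dot-style paths such as "$.orders.0.user_id" against parsed JSON.
function evalDotPath(root, path) {
  return path
    .replace(/^\$\.?/, "")
    .split(".")
    .filter(Boolean)
    .reduce((node, key) => (node == null ? undefined : node[key]), root);
}

// Returns true/false for the two supported types, or null for types this sketch skips.
function checkAssertion(assertion, response) {
  if (assertion.type === "status_code") {
    return String(response.status) === assertion.expected;
  }
  if (assertion.type === "json_equal") {
    const actual = evalDotPath(JSON.parse(response.body), assertion.key);
    return String(actual) === assertion.expected;
  }
  return null;
}

// A captured response in the shape the "response" field expects.
const captured = {
  status: 201,
  headers: { "Content-Type": "application/json" },
  body: '{"data":{"id":"abc-123"},"status":"created","orders":[{"user_id":"u1"}]}',
};

console.log(checkAssertion({ type: "status_code", expected: "201" }, captured));                            // true
console.log(checkAssertion({ type: "json_equal", key: "$.orders.0.user_id", expected: "u1" }, captured));   // true
```

The same evalDotPath helper can also produce the values for JSONPath extract entries that later steps substitute.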
SELF-CONTAINED TESTS (required for repeat runs):
The suite will be re-run many times (local validation + record + sandbox replay + ad-hoc UI runs). It must not depend on prior state.
Put random inputs in extract JS functions with high entropy (timestamps, uuid). Plain "alice" will collide on re-run against producers with dedupe.
Prefer READ-AFTER-WRITE chaining: POST creates resource → extract id → GET uses id. Validates without depending on PRE-EXISTING ambient seed data (rows that "happen to exist" in the dev's DB).
SEED → TESTED → CLEANUP roles within a suite:
When the scenario is "user can read X" or "user can list X", you can't assert against ambient state — there's no guarantee the X exists when the suite runs. Pattern:
step seed: POST /X (with dynamic {{var}} body) → extract id
step tested: GET /X (or list) → assert against the seeded id
step cleanup: DELETE /X/{{id}} → restore baseline
Only the "tested" step is what the suite is FOR; the seed and cleanup steps are scaffolding so the test can run any number of times against any starting state. Without an explicit seed step, you're either testing nothing (empty list) or relying on pre-existing data the next replay won't have.
DO NOT assert "the user has 3 orders" — that's ambient state. Seed N orders inside the suite first, then assert the count. Same applies to any "list / search / count" scenario: seed the data the test depends on, never assume it's there.
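As an illustration of the seed, tested, and cleanup roles, the sketch below builds a four-step array with a prelude that declares the generator. Everything specific in it is hypothetical: the /api/widgets endpoint, the $.data.id path, and the field names. The required response captures are left out to keep the example short, so this is not a submittable suite as written.

```javascript
// Hypothetical suite: /api/widgets, $.data.id and the field names are illustrative only.
// The required "response" captures are omitted here for brevity.
const widgetSuiteSteps = [
  {
    name: "prelude: declare generator",
    method: "GET",
    url: "/health",
    extract: {
      genName: "function genName(){ return 'w_' + Date.now() + '_' + Math.random().toString(36).slice(2,8); }",
    },
    assert: [{ type: "status_code", expected: "200" }],
  },
  {
    name: "seed: create widget",
    method: "POST",
    url: "/api/widgets",
    headers: { "Content-Type": "application/json" },
    body: '{"name":"{{genName}}"}',
    extract: { widget_id: "$.data.id" }, // JSONPath capture, stable for later steps
    assert: [{ type: "status_code", expected: "201" }],
  },
  {
    name: "tested: read widget back",
    method: "GET",
    url: "/api/widgets/{{widget_id}}",
    assert: [
      { type: "status_code", expected: "200" },
      { type: "json_equal", key: "$.data.id", expected: "{{widget_id}}" },
    ],
  },
  {
    name: "cleanup: delete widget",
    method: "DELETE",
    url: "/api/widgets/{{widget_id}}",
    assert: [{ type: "status_code_in", expected: "200,204" }],
  },
];

// steps_json is passed as a JSON array string.
const stepsJson = JSON.stringify(widgetSuiteSteps, null, 2);
console.log(stepsJson);
```

Only the third step expresses what the suite is for; the others exist so the read has something deterministic to find and the database returns to its baseline afterwards.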
═══════════════════════════════════════════════════════════════════
SUBSTITUTION RULES — where {{var}} is and isn't allowed
═══════════════════════════════════════════════════════════════════
extract values come in two flavours and they substitute very differently:
TYPE A — JS-function generator (extract: { genUserId: "function genUserId(){return 'apple_'+Date.now()}" }):
The runtime STORES the source string and RE-RUNS the function on EVERY {{genUserId}} substitution site. Each call returns a fresh value. ONLY safe in POST / PUT / PATCH request BODIES — that's the one place you actually want a fresh value (uniqueness for inserts).
TYPE B — JSONPath extract (extract: { createdId: "$.order.user_id" }):
Evaluated ONCE against the step's recorded response, the resulting STRING is stored. Subsequent {{createdId}} substitutions resolve to that same fixed string. Safe everywhere — URLs, assertions, downstream bodies.
ALLOWED placements for {{generatorFn}} (TYPE A):
✓ POST / PUT / PATCH request body — the canonical "give me a fresh value to insert" use case.
FORBIDDEN placements for {{generatorFn}} (TYPE A) — the validator REJECTS these (generator-placement checks):
✗ assert[*].expected — the assertion's expected value will be a fresh function call, NOT the value the body sent. Static literals or {{TYPE_B_extract}} only.
✗ GET / DELETE / HEAD / PUT / PATCH URL (path or query) — those target an existing resource; the URL must encode the SAME id the creating POST used. Use a TYPE B extract from the creating step.
✗ Path/query of a downstream step's URL when the value should match what the upstream step inserted — same reason.
CANONICAL PATTERN — read-after-write with a stable id:
Step 0 (POST creates resource):
  body: {"user_id":"{{genUserId}}","name":"widget"} // TYPE A in body — fresh per run, OK
  extract: {"genUserId":"function genUserId(){...}", // TYPE A — body source
            "createdId":"$.order.user_id"} // TYPE B — capture server's stored id
  assert: [{type:"status_code", expected:"201"},
           {type:"json_equal", key:"$.order.id", expected:"{{createdId}}"}] // TYPE B in assertion — stable
Step 1 (GET reads it back):
  url: "/api/orders?user_id={{createdId}}" // TYPE B in GET URL — stable
  assert: [{type:"json_equal", key:"$.orders.0.user_id", expected:"{{createdId}}"}] // TYPE B — stable
DO NOT do this (the validator will reject it):
Step 0:
  body: {"user_id":"{{genUserId}}"}
  assert: [{type:"json_equal", key:"$.order.user_id", expected:"{{genUserId}}"}] // generator-placement check — TYPE A in assertion
Step 1:
  url: "/api/orders?user_id={{genUserId}}" // generator-placement check — TYPE A in GET URL
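The forbidden placements listed above lend themselves to a simple pre-flight scan. This is a rough sketch under the stated rules only; findGeneratorPlacementViolations is a hypothetical helper, and the validator's actual generator-placement checks may cover more cases.

```javascript
// Finds TYPE A (generator) names used where the rules above forbid them.
function findGeneratorPlacementViolations(steps) {
  // TYPE A names: extract values that contain the keyword "function".
  const generators = new Set();
  for (const step of steps) {
    for (const [name, value] of Object.entries(step.extract || {})) {
      if (/\bfunction\b/.test(value)) generators.add(name);
    }
  }
  // Generator names referenced as {{name}} inside a piece of text.
  const uses = (text) =>
    [...String(text || "").matchAll(/\{\{(\w+)\}\}/g)]
      .map((m) => m[1])
      .filter((name) => generators.has(name));

  const violations = [];
  steps.forEach((step, index) => {
    for (const a of step.assert || []) {
      for (const name of uses(a.expected)) {
        violations.push(`step ${index}: {{${name}}} in assert.expected`);
      }
    }
    if (["GET", "DELETE", "HEAD", "PUT", "PATCH"].includes(step.method)) {
      for (const name of uses(step.url)) {
        violations.push(`step ${index}: {{${name}}} in ${step.method} URL`);
      }
    }
  });
  return violations;
}
```

Applied to the rejected example above it reports two violations, one for the assertion and one for the GET URL; the canonical pattern yields none.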
Name-collision check — do NOT pick an extract key that already exists on the app's appLevelCustomVariables. Use get_app_testing_context (or check the app payload) to enumerate them first; if a collision is unavoidable, scope-suffix your key (e.g. genUserId_smokeTest). Otherwise the runtime resolves the app-level variable first and silently shadows the suite's extract.
You can also construct steps from data fetched via download_recording or get_app_testing_context, but the validate-locally-before-inserting rule still applies.
CRITICAL — READING EXISTING SUITES: when the data you fetch via getTestSuite / get_app_testing_context / download_recording shows steps with extract_variables populated, that's the runtime's POST-EXECUTION SNAPSHOT (resolved values the runner wrote back for UI display). It is NOT what was authored. Treating that field as a copy-paste template makes the validator's extract_variables-input rejection reject every suite you produce. To replicate the authored behavior: copy every entry into extract instead, preserving keys and values verbatim (JS-function strings stay JS, JSONPath strings stay JSONPath). When in doubt: extract_variables is read-only output state; extract is input.
===== FROM-SCRATCH SCOPE RULE =====
When the dev asks to "generate / create / add / build keploy tests" without narrowing the scope, DEFAULT = ALL ENDPOINTS. Enumerate every non-trivial endpoint the app exposes (OpenAPI spec, router code, handler files) and author ONE suite per logical grouping — e.g. "user-crud", "auth-flow", "order-happy-path", "order-validation-errors". A single-endpoint app might produce one suite; a typical microservice produces 3-8.
Groupings should be READ-AFTER-WRITE coherent (each suite's steps chain via extract variables rather than depending on outside state). TELL THE DEV up-front how many suites you're about to create and what each covers, then proceed — do NOT ask for confirmation mid-flow. If the dev explicitly narrows it ("just the happy path for orders", "only the auth flow"), honor that.
===== HARD RULE — NO DB-STATE-DEPENDENT STEPS =====
Do NOT include any step whose response body depends on total DB / queue / file-system state. Concretely: GET /items with no filter, GET /orders with no user_id filter, any "list-all" / "count" / "search" that returns more rows the longer the app has been running, any endpoint returning the CURRENT time / UUID / request-id.
The auto-replay after record byte-compares the recorded response with what the live app returns under mocks, and non-deterministic bodies make gate 2 skip → the suite is never linked → sandbox replay fails with "no sandboxed tests". If you find yourself reasoning "this list might vary slightly but the mock should handle it" — STOP and drop the step.
The suite should contain ONLY steps whose response body is fully determined by that step's own request: health checks, create-with-fresh-ids, read-back-by-id for the ids you just minted, validation-error 400s on bad payloads. Filtered reads using a freshly-extracted id are fine; unfiltered reads are not.
===== SERVER-SIDE IDEMPOTENCY ENFORCEMENT =====
This tool REJECTS (with a typed error, before creating anything) any suite whose mutating steps aren't idempotent on re-run. The rule:
For each step with method ∈ {POST, PUT, PATCH} and a non-empty body, the body MUST contain at least one {{name}} placeholder that resolves to a per-run dynamic value. "Dynamic" means one of:
(a) a JS-function entry in any step's extract map (e.g. {"genUserId": "function genUserId(){ return 'u_'+Date.now()+'_'+Math.random().toString(36).slice(2,8); }"}), OR
(b) a JSONPath extract output from an EARLIER step (transitively dynamic if that step's own body was idempotent).
If the endpoint is GENUINELY idempotent (e.g. POST /auth/refresh, PUT /tags/apply-same-input — repeat calls don't hit unique constraints) set "idempotent": true on the step to waive the check. Use this sparingly — the default is "assume it'll collide" because that's the common case.
On rejection the error names the offending step and lists the dynamic variable names already in scope so you can see what's wireable. Fix the suite and retry — do NOT just flip idempotent: true to bypass.
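The idempotency rule can be approximated locally before the tool rejects a suite. The sketch below is a simplification: it treats any JSONPath capture from an earlier step as dynamic and does not verify the transitive condition in (b). The name findNonIdempotentSteps is hypothetical and this is not the server-side check itself.

```javascript
// Returns the indexes of mutating steps whose body has no per-run dynamic placeholder.
function findNonIdempotentSteps(steps) {
  // (a) generator names declared in ANY step's extract map.
  const generators = new Set();
  for (const step of steps) {
    for (const [name, value] of Object.entries(step.extract || {})) {
      if (/\bfunction\b/.test(value)) generators.add(name);
    }
  }

  const earlierCaptures = new Set(); // (b) JSONPath outputs from EARLIER steps
  const offenders = [];
  steps.forEach((step, index) => {
    const mutating = ["POST", "PUT", "PATCH"].includes(step.method);
    if (mutating && step.body && !step.idempotent) {
      const names = [...step.body.matchAll(/\{\{(\w+)\}\}/g)].map((m) => m[1]);
      const dynamic = names.some((n) => generators.has(n) || earlierCaptures.has(n));
      if (!dynamic) offenders.push(index);
    }
    // This step's JSONPath captures become available to the steps after it.
    for (const [name, value] of Object.entries(step.extract || {})) {
      if (!/\bfunction\b/.test(value)) earlierCaptures.add(name);
    }
  });
  return offenders;
}

console.log(findNonIdempotentSteps([{ method: "POST", body: '{"name":"alice"}' }])); // [ 0 ]
```

A non-empty result points at the steps that need a {{name}} placeholder wired to a generator or an earlier capture.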
===== MANDATORY OUTPUT — Phase 1 section =====
After all create_test_suite calls in a FROM-SCRATCH flow succeed, your final message to the dev MUST contain a section with this exact heading (do NOT collapse into prose; emit even for a single suite):
### Phase 1 — Inserted suites
| Suite name | suite_id | Step count |
| --- | --- | --- |
| <name> | <suite_id> | <N> |

One row per suite created in this flow. Next step after Phase 1 is record_sandbox_test (see its description for Phase 2).
===== HOW THIS TOOL ACTUALLY INSERTS THE SUITE =====
This tool DOES NOT POST the suite to api-server itself. It returns a "playbook" — a small array of shell steps for you (Claude) to walk via Bash. The playbook spawns the enterprise CLI keploy create-test-suite which:
Reads the suite JSON the playbook wrote to disk.
Runs every static structural check — exits 1 with violations on stdout if anything fails.
Fires the suite against the dev's local app TWICE (idempotency check) — exits 1 if the second run diverges from the first.
Runs dynamic checks (generator-dynamism + GET-coupling) — exits 1 with violation messages on failure.
POSTs the validated suite to api-server (HTTP 201 → success; HTTP 426 → CLI is older than api-server's rule set, dev needs to upgrade keploy).
Walk the playbook in order. If step 2 (the CLI run) exits non-zero, surface its stdout to the dev — it lists the offending step / check / fix-it hint and includes a canonical step skeleton on structural failures. ITERATE LOCALLY: revise the JSON in your draft, REWRITE the same suite file via Bash, and RE-RUN step 2 directly. DO NOT call create_test_suite again per iteration — that mints a fresh playbook and a new nonce-path for no reason; the existing one is reusable. The CLI ALSO requires every step to have response and extract populated (step completeness check plus the validate-locally rules above), so the validate-locally curl flow described earlier is still required BEFORE calling this tool.
PREREQUISITES the playbook assumes:
The dev's app is up and reachable at app_url.
keploy binary is on PATH. If missing, install before calling this tool: curl --silent -O -L https://keploy.io/install.sh && source install.sh.
Either ~/.keploy/cred.yaml exists (API key) or KEPLOY_API_KEY is exported. The CLI uses the API key for the api-server POST (different from the OAuth-JWT path the sandbox tools use).
| Name | Required | Description | Default |
|---|---|---|---|
| name | Yes | Suite name | |
| app_id | Yes | Keploy app ID | |
| labels | No | Comma-separated labels for filtering | |
| app_dir | No | Absolute path to the dev's repo root (where the app was started). Defaults to '.' (cwd). The CLI invocation cd's here. | |
| app_url | Yes | Base URL the dev's local app is listening on, e.g. http://localhost:8080. The enterprise CLI hits this when running the suite twice for the idempotency, generator-dynamism, and GET-coupling checks. | |
| branch_id | Yes | REQUIRED. Keploy branch ID (uuid). Resolve via the explicit two-step flow BEFORE calling: (1) Bash `git rev-parse --abbrev-ref HEAD` in app_dir to detect the dev's git branch; (2) call the create_branch MCP tool with {app_id, name: <git branch>} — find-or-create returns {branch_id, ...}; pass that branch_id here. Direct writes to main are blocked. | |
| steps_json | Yes | JSON array of test steps | |
| description | No | Suite description | |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Reveals key behavioral trait not in annotations: the tool returns a playbook instead of directly posting to the API server. Also covers internal validation rules (R1-R27, D3-D7) and prereqs. No contradiction with annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Highly verbose and detailed, lacking conciseness. While well-structured with headings and bullets, the volume of text is excessive for a tool description, hindering quick scanning.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool's complexity (many validation rules, variable substitution, idempotency checks), the description covers all necessary aspects: validate-locally-flow, step shape, assertion types, substitution rules, and output format. Complete for its context.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100% with descriptions. The description adds significant meaning beyond schema: explains step shape, assertion types, variable rules, and how to resolve branch_id. Adds value despite high coverage.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool creates a new API test suite with test steps, detailing step structure, assertions, and extraction. It distinguishes from alternatives like updateTestSuite by focusing on creation of new suites from scratch.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Provides extensive when-to-use and when-not-to-use guidelines: explicitly states forbidden assertion types, forbidden variable placements, required fields, and validation flows. Includes hard rules like no DB-state-dependent steps and idempotency enforcement.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
createTestSuite (A, Destructive)
POST /apps/{appId}/test-suites — Create a test suite — Requires scope: write.
| Name | Required | Description | Default |
|---|---|---|---|
| appId | Yes | Path parameter: appId | |
| branch_id | No | Query parameter: branch_id | |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already mark destructive=true. Description adds the scope requirement 'write', which is useful beyond annotations but does not explain the destructiveness or side effects.
Is the description appropriately sized, front-loaded, and free of redundancy?
Single sentence with high information density: HTTP method, endpoint, action, and scope requirement. No wasted words.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Missing details: no request body or output schema described. For a creation operation, the agent needs to know what is created and what response to expect.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema covers 100% of the single required parameter. Description uses the parameter in the path, reinforcing its role as a path parameter.
Does the description clearly state what the tool does and how it differs from similar tools?
Description clearly states the action: 'Create a test suite' with explicit HTTP method and endpoint, distinguishing it from sibling mutation tools like update or delete.
Does the description explain when to use this tool, when not to, or what alternatives exist?
No guidance on when to use this vs. the similar sibling 'create_test_suite' or other tools. Only notes scope requirement.
deleteApp (B, Destructive)
DELETE /apps/{appId} — Delete an app — Requires scope: admin.
| Name | Required | Description | Default |
|---|---|---|---|
| appId | Yes | Path parameter: appId | |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already indicate destructive hint; description adds admin scope requirement but does not disclose cascading effects or irreversibility beyond what annotations imply.
Is the description appropriately sized, front-loaded, and free of redundancy?
Single sentence is efficient and front-loaded; no superfluous content, though structure could be slightly expanded.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Adequate for a simple one-param destructive tool, but lacks completeness regarding consequences and related operations.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100%, and the description adds no additional meaning beyond the schema's parameter description.
Does the description clearly state what the tool does and how it differs from similar tools?
Description clearly states 'Delete an app' with the specific HTTP method and resource, and distinguishes from sibling tools like createApp and updateApp.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
No guidance on when to use this tool versus alternatives; mentions scope requirement but no context for decision-making.
delete_test_suite · A · Destructive
Delete a test suite on a Keploy branch — synchronous, no playbook to walk.
USE THIS when:
The dev's update_test_suite call was rejected with "preserves no steps from the existing suite — that's a full rewrite, not an edit". Delete the existing suite and re-author from scratch via create_test_suite. The error message itself routes here.
The dev explicitly says "delete the suite", "remove suite X", "wipe my orderflow suite".
A genuine wholesale redesign — every step changed in shape — that the audit trail shouldn't try to reconcile as edits.
DO NOT USE THIS when:
The dev wants a real edit (one assertion, one step's body). Use update_test_suite + preserve existing step IDs instead — keeps audit history intact.
The dev wants to "redo" a single failed run. Test runs are independent of suite state; just rerun via replay_test_suite.
INPUT
app_id (required) — Keploy app id
suite_id (required) — UUID of the suite to delete
branch_id (required) — Keploy branch UUID. The delete creates a branch-scoped DeleteTestSuite audit event so reads on the same branch see the suite as gone. Direct main writes are blocked.
OUTPUT
On success: {"deleted": true} — suite is tombstoned at the branch overlay; subsequent reads (getTestSuite / listTestSuites) on this branch return 404 / exclude it.
404 if the suite_id doesn't exist on this app/branch (verify via getTestSuite or listTestSuites first if you're unsure).
After delete, the standard re-create flow is to call create_test_suite with a freshly authored steps_json. The new suite gets a fresh suite_id; the old id is tombstoned, not reusable.
DISCOVERY — when the dev hands you a bare suite_id with no app_id / branch_id:
Suites live on a (app_id, branch_id) tuple. A bare suite_id has no on-disk hint about which app or branch holds it; you have to RESOLVE both before calling this tool. Walk these steps in order — STOP as soon as getTestSuite returns 200:
1. Detect the dev's git branch: Bash `git rev-parse --abbrev-ref HEAD` in app_dir. If exit non-zero / output is "HEAD" → not a git repo / detached HEAD; ASK the dev for the Keploy branch name (don't invent one).
2. Resolve candidate apps via the cwd basename: Bash `basename $(pwd)` → call listApps with q=. Usually 1–2 candidates. If 0 → ASK; if >1 → walk every candidate in step 4.
3. For each candidate app, call list_branches({app_id}) and find the branch whose `name` matches the git branch from step 1. That gives you {branch_id}. If no match → not this app, try next.
4. Verify with getTestSuite({app_id, suite_id, branch_id=<from step 3>}). 200 → resolved; 404 → wrong app/branch, try next.
If steps 2–4 exhaust, walk every OPEN branch on each candidate app, then try main (branch_id omitted). If still nothing → ASK the dev for the {app_id, branch_id} pair.
After resolving once in a session, REUSE the {app_id, branch_id} for subsequent suite-targeted calls; don't re-walk discovery for every action.
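The discovery walk above can be sketched as a small resolver. This is a minimal sketch under stated assumptions, not Keploy's implementation: `list_apps`, `list_branches`, and `get_test_suite` are hypothetical callables standing in for the listApps, list_branches, and getTestSuite tools, and the `id` / `name` keys and the boolean result of `get_test_suite` are assumed response shapes. The step 5 fallback (open branches, then main) is left to the caller.

```python
import subprocess
from pathlib import Path
from typing import Callable, Optional, Tuple


def current_git_branch(app_dir: str) -> Optional[str]:
    """Step 1: return the git branch name, or None when the dev must be asked."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "--abbrev-ref", "HEAD"],
            cwd=app_dir, capture_output=True, text=True,
        )
    except OSError:
        return None  # git is missing or app_dir is unusable
    name = result.stdout.strip()
    if result.returncode != 0 or name in ("", "HEAD"):
        return None  # not a git repo, or detached HEAD
    return name


def resolve_suite(
    suite_id: str,
    app_dir: str,
    git_branch: str,
    list_apps: Callable[[str], list],                 # hypothetical stand-in for listApps
    list_branches: Callable[[str], list],             # hypothetical stand-in for list_branches
    get_test_suite: Callable[[str, str, str], bool],  # hypothetical: True on 200, False on 404
) -> Optional[Tuple[str, str]]:
    """Steps 2-4: return (app_id, branch_id), or None if nothing matched."""
    query = Path(app_dir).resolve().name  # step 2: basename of the working directory
    for app in list_apps(query):
        for branch in list_branches(app["id"]):  # step 3: match branch by name
            if branch["name"] != git_branch:
                continue
            if get_test_suite(app["id"], suite_id, branch["id"]):  # step 4: verify
                return app["id"], branch["id"]
    return None  # caller continues with step 5, then asks the dev
```

Caching the returned pair for the rest of the session matches the reuse advice above.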
| Name | Required | Description | Default |
|---|---|---|---|
| app_id | Yes | Keploy app ID | |
| reason | No | Optional human-readable reason recorded on the audit event ("AI rewrite", "deprecated endpoint", etc). | |
| suite_id | Yes | UUID of the test suite to delete | |
| branch_id | Yes | REQUIRED. Keploy branch UUID. Resolve via the two-step flow: `git rev-parse --abbrev-ref HEAD` → create_branch tool. Direct main writes are blocked. | |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations indicate destructiveHint: true and readOnlyHint: false. The description aligns, noting synchronous deletion, tombstoning, branch-scoped audit events, and 404 behavior. It adds context beyond annotations, such as the tombstone mechanism and re-create flow.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is long but well-structured with clear sections (USE, DO NOT USE, INPUT, OUTPUT, DISCOVERY). Each section adds value; the discovery section, while extensive, is necessary for the use case. Organization is effective, though slightly verbose.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given no output schema, the description fully covers output (success response, 404) and edge cases (bare suite_id resolution). It also explains post-delete flow and prerequisites, making it complete for a destructive, branch-scoped tool.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100%. The description adds meaning beyond schema by explaining why branch_id is required (branch-scoped audit, main writes blocked) and the optional reason parameter's purpose. This compensates for the high schema coverage, providing extra context.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states 'Delete a test suite on a Keploy branch—synchronous, no playbook to walk.' It specifies the action (delete), resource (test suite), and branch context, distinguishing it from siblings like update_test_suite and create_test_suite.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Provides explicit 'USE THIS when' and 'DO NOT USE THIS when' sections with specific conditions and alternatives, such as using update_test_suite for edits or run_test_suite for reruns. Also includes a detailed discovery flow for resolving bare suite_id.
deleteTestSuite · A · Destructive
DELETE /apps/{appId}/test-suites/{suiteId} — Delete a test suite — Requires scope: write.
| Name | Required | Description | Default |
|---|---|---|---|
| appId | Yes | Path parameter: appId | |
| suiteId | Yes | Path parameter: suiteId | |
| branch_id | No | Query parameter: branch_id | |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Description matches destructiveHint annotation, adds required scope 'write'. No contradiction. Provides clear language about the operation's effect.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two concise sentences, front-loaded with method and path. No redundant information.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
For a simple delete operation with no output schema, the description covers purpose, path, and authorization requirement. No gaps.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100% with descriptions for both parameters. Description adds scope requirement, but no extra parameter semantics beyond schema.
Does the description clearly state what the tool does and how it differs from similar tools?
Description explicitly states 'Delete a test suite', which is the exact purpose. It includes the HTTP method and path, clearly identifying the resource and action.
Does the description explain when to use this tool, when not to, or what alternatives exist?
No guidance on when to use this tool versus sibling tools like bulkDeleteTestSuites or deleteApp. Does not mention prerequisites or context for deletion.
download_recording · A · Destructive
Download a recording — a session of captured API traffic (request/response pairs + outbound mocks) stored as a test_set. Recordings are INPUT artifacts captured by keploy record: they're raw traffic that AI generation (generate_and_wait) and manual create_test_suite flows turn into test suites. Use this to inspect what was captured before deciding how to turn it into suites.
NOT a sandbox-test export. A "sandbox test" (the suite + its captured mocks, produced by record_sandbox_test) lives behind a suite's test_set_id link — to inspect a suite, use getTestSuite for the step shape or replay_sandbox_test to see behavior. For the suite's mock bundle, do this two-step: (1) call getTestSuite to read the suite's test_set_id; (2) call listMocks({app_id, test_set_id}) (use ?include_specs=true to also fetch parsed mock YAML). Or for raw mock files on disk, point the dev at the artifact directory printed by record_sandbox_test (data.artifact_dir on the phase=done event).
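The two-step lookup for a suite's mock bundle can be sketched as follows. This is an illustrative sketch, not the server's client code: `get_test_suite` and `list_mocks` are hypothetical callables standing in for the getTestSuite and listMocks tools, and the assumption that the suite response is a mapping carrying a `test_set_id` key comes only from the description above.

```python
from typing import Any, Callable, Dict


def fetch_suite_mocks(
    app_id: str,
    suite_id: str,
    get_test_suite: Callable[..., Dict[str, Any]],  # hypothetical stand-in for getTestSuite
    list_mocks: Callable[..., Any],                 # hypothetical stand-in for listMocks
    include_specs: bool = False,
) -> Any:
    """Resolve suite -> test_set_id -> mocks in two calls."""
    # Step 1: read the suite to find the recording it links to.
    suite = get_test_suite(app_id=app_id, suite_id=suite_id)
    test_set_id = suite.get("test_set_id")
    if not test_set_id:
        raise LookupError(f"suite {suite_id} has no test_set_id link")
    # Step 2: list the mocks for that test set, optionally with parsed specs.
    return list_mocks(app_id=app_id, test_set_id=test_set_id, include_specs=include_specs)
```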
| Name | Required | Description | Default |
|---|---|---|---|
| app_id | Yes | Keploy app ID | |
| test_set_id | Yes | ID of the recording session to download | |
| include_mocks | No | Include dependency mocks in the download (default true) | |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already provide destructiveHint=true and readOnlyHint=false. The description adds context about what data is included in the download but does not elaborate on side effects, permissions, or limitations. Since annotations carry the main burden, the description adds moderate value.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two sentences, no wasted words. First sentence states purpose; second gives usage guidance. Efficient and well-structured.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
No output schema is present, so the description should clarify the output format or behavior. It lacks details about response structure, file type, or streaming. However, the tool's purpose is clear and the schema covers parameters, so it is moderately complete but not fully.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Input schema has 100% description coverage for all 3 parameters. The description does not add any additional meaning beyond what is already in the schema (e.g., default behavior of include_mocks). Baseline 3 is appropriate.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states 'Download a complete recording of captured API traffic' and specifies what it includes (all request/response pairs, dependency mocks, test-to-mock mappings). This distinguishes it from siblings like exportRecording or getRecording.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description explicitly says 'Use this to understand what was recorded, then decide how to turn it into test suites using create_test_suite, or use the data to inform AI-powered generation via generate_and_wait.' It provides clear downstream alternatives but does not mention when to avoid using this tool.
exportRecording · A · Destructive
GET /apps/{appId}/recordings/{testSetId}/export — Export a recording bundle — Export a complete recording bundle: test set metadata, all test cases, mocks, and test-to-mock mappings as a single JSON response. Use ?include_mocks=false to exclude mocks. Requires scope: read.
| Name | Required | Description | Default |
|---|---|---|---|
| appId | Yes | Path parameter: appId | |
| testSetId | Yes | Path parameter: testSetId | |
| include_mocks | No | Query parameter: include_mocks | |
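As a rough illustration of the path and query parameter described above, the export URL could be assembled like this. The base URL is a placeholder and `export_recording_url` is a hypothetical helper; only the path template and the `include_mocks=false` query come from the tool description.

```python
from urllib.parse import quote, urlencode


def export_recording_url(base_url: str, app_id: str, test_set_id: str,
                         include_mocks: bool = True) -> str:
    """Build the export URL; the query is added only when mocks are excluded."""
    path = (
        f"/apps/{quote(app_id, safe='')}"
        f"/recordings/{quote(test_set_id, safe='')}/export"
    )
    query = "" if include_mocks else "?" + urlencode({"include_mocks": "false"})
    return base_url.rstrip("/") + path + query
```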
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
The description states this is an export operation, which implies a non-destructive read. However, annotations set 'destructiveHint' to true, creating a contradiction. The description does not clarify any destructive side effects.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is concise with three sentences: URL and purpose, content of export, and usage of optional parameter. No unnecessary words.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
No output schema, but the description explains the return format (single JSON) and contents. It also states scope requirement. However, the contradiction with annotations reduces completeness.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100%, so baseline is 3. The description adds functional meaning by explaining that 'include_mocks=false' excludes mocks, going beyond the schema's simple boolean type.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the verb 'Export', the resource 'recording bundle', and specifies what it includes (test set metadata, test cases, mocks, mappings). It distinguishes from siblings like 'download_recording' by noting the output is a single JSON response.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description provides advice on using the 'include_mocks' parameter to exclude mocks and mentions the required scope 'read'. However, it does not explicitly contrast this tool with alternatives such as 'download_recording' or 'importRecording'.
generate_and_wait · B · Destructive
Generate test suites from an OpenAPI spec and wait for completion.
| Name | Required | Description | Default |
|---|---|---|---|
| app_id | Yes | Keploy app ID | |
| schema | No | OpenAPI spec (YAML or JSON) | |
| base_url | Yes | Target API base URL | |
| user_prompt | No | Instructions for the AI generator | |
| max_test_suites | No | Max suites to generate (default 30) | |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Only mentions 'wait for completion' but does not disclose side effects, duration, or error behavior. Annotations indicate destructive=true but description adds little beyond that.
Is the description appropriately sized, front-loaded, and free of redundancy?
Single sentence, no unnecessary words, front-loaded with core purpose.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Lacks details on return value, blocking behavior, and error scenarios. With no output schema, more context is needed.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100%, so baseline is 3. The description adds no additional parameter meaning beyond what the schema provides.
Does the description clearly state what the tool does and how it differs from similar tools?
Clearly states 'Generate test suites from an OpenAPI spec' and adds 'wait for completion', distinguishing from sibling tools like generateTestSuites.
Does the description explain when to use this tool, when not to, or what alternatives exist?
No guidance on when to use this tool vs alternatives (e.g., generateTestSuites). No mention of prerequisites or context.
generateTestSuites · C · Destructive
POST /apps/{appId}/test-suites/generate — Generate test suites via AI — Requires scope: write.
| Name | Required | Description | Default |
|---|---|---|---|
| auth | No | Authentication configuration for test execution. The runner injects the matching headers on every step request. | |
| docs | No | API documentation text | |
| appId | Yes | Path parameter: appId | |
| schema | No | OpenAPI spec (YAML or JSON) | |
| timeout | No | | |
| base_url | Yes | | |
| examples | No | Example curls or request/response pairs | |
| rate_limit | No | | |
| user_prompt | No | Additional instructions for AI generation | |
| webhook_url | No | | |
| code_snippet | No | Relevant source code for context | |
| max_test_suites | No | | |
| ignore_endpoints | No | | |
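A request for this endpoint might be assembled as in the sketch below. It uses only parameters the table documents; the undocumented ones (timeout, rate_limit, webhook_url, max_test_suites, ignore_endpoints) are left out because their types and semantics are not stated. `build_generate_request` is a hypothetical helper, and sending the non-path fields as a JSON body with string values is an assumption.

```python
from typing import Any, Dict, Optional


def build_generate_request(
    app_id: str,
    base_url: str,
    schema: Optional[str] = None,       # OpenAPI spec text (YAML or JSON)
    user_prompt: Optional[str] = None,  # extra instructions for AI generation
    examples: Optional[str] = None,     # example curls or request/response pairs
) -> Dict[str, Any]:
    """Assemble method, path, and body; unset optional fields are omitted."""
    if not app_id or not base_url:
        raise ValueError("appId and base_url are required")
    body: Dict[str, Any] = {"base_url": base_url}
    optional = {"schema": schema, "user_prompt": user_prompt, "examples": examples}
    body.update({key: value for key, value in optional.items() if value is not None})
    return {
        "method": "POST",
        "path": f"/apps/{app_id}/test-suites/generate",  # path from the tool description
        "body": body,                                     # assumed to be sent as JSON
    }
```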
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already indicate destructiveHint=true. The description adds that the tool requires write scope, which is helpful, but does not disclose other behavioral traits like whether existing suites are overwritten, what the AI generation entails, or any side effects.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is extremely concise with zero wasted words. However, it could be expanded with a brief sentence on tool purpose or usage without losing conciseness.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given 13 parameters, no output schema, and many sibling tools, this description is insufficient. It does not explain the AI generation process, return value, or how to interpret results. The agent would lack critical context for correct invocation.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema description coverage is 54%, but the description does not elaborate on any parameters. It only provides the endpoint context. The description adds no meaning beyond what the input schema already provides, leaving many parameters (e.g., timeout, rate_limit, webhook_url) unexplained.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the HTTP method, endpoint path, the action (generate test suites via AI), and required scope. It distinguishes from sibling tools like create_test_suite or createTestSuite that likely create suites manually or via non-AI methods.
Does the description explain when to use this tool, when not to, or what alternatives exist?
No guidance on when to use this tool versus alternatives. The description mentions scope requirement but does not explain scenarios where AI generation is appropriate or when to prefer manual creation tools like create_test_suite.
getApp · B · Destructive
GET /apps/{appId} — Get an app — Requires scope: read.
| Name | Required | Description | Default |
|---|---|---|---|
| appId | Yes | Path parameter: appId | |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
The description indicates a safe GET operation with 'read' scope, but annotations contradict this by setting destructiveHint=true and readOnlyHint=false. This contradiction severely undermines transparency.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is a single, front-loaded sentence conveying the HTTP method, path, and authorization requirement with zero redundancy.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
For a simple single-parameter tool without output schema, the description adequately covers the basic action. However, it lacks explanation of what an 'app' is and typical return data, and the annotation contradiction leaves behavioral context incomplete.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Input schema coverage is 100%, with a single parameter described as 'Path parameter: appId'. The description adds no further meaning beyond the schema, meeting the baseline for full coverage.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states 'Get an app', which is a specific verb and resource. The tool name 'getApp' reinforces this purpose, and it distinguishes itself from sibling tools like listApps and deleteApp by focusing on a single app retrieval.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description provides no guidance on when to use this tool versus alternatives. It only states the endpoint and required scope, without discussing context or when not to use it.
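The contradiction flagged in the getApp review (a GET endpoint annotated with destructiveHint=true and readOnlyHint=false) is the kind of mismatch that can be caught mechanically. The sketch below is one possible check, not part of any Keploy or Glama tooling: `annotation_conflict` is a hypothetical helper, and it assumes the description starts with the HTTP method, as the REST-style descriptions on this page do.

```python
from typing import Any, Dict

# HTTP methods that are conventionally read-only.
SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}


def annotation_conflict(description: str, annotations: Dict[str, Any]) -> bool:
    """True when a safe HTTP method is paired with destructive or non-read-only hints."""
    method = description.split(maxsplit=1)[0].upper() if description.strip() else ""
    if method not in SAFE_METHODS:
        return False  # no safe method to contradict
    destructive = bool(annotations.get("destructiveHint"))
    not_read_only = annotations.get("readOnlyHint") is False
    return destructive or not_read_only
```

Running such a check over every tool definition before publishing would surface the same mismatch the reviews report for the read-only tools in this listing.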
get_app_testing_context · A · Destructive
Fetch comprehensive context about an app's API plus the canonical test-suite authoring schema. Returns:
app — config, auth shape, appLevelCustomVariables (READ THIS for R32 — your extract keys must not collide with these names)
coverage — API coverage report (which endpoints have tests, which don't)
recordings — summaries of captured traffic sessions
test_suites — existing suites (check before authoring to avoid duplicates)
generated_schema — AI-extracted OpenAPI for the app
step_schema — THE CANONICAL TEST SUITE STEP SCHEMA. Same content as `keploy test-suite-format`, shipped inline so you don't need a separate tool-call hop. Read this BEFORE authoring or curling endpoints — it contains the MANDATORY rule block (R10 / R9 / R2 / R15 / R32) the validator enforces on iter 1, plus the canonical two-step prelude+POST skeleton.
authoring_directive — one-line reminder pointing at step_schema.
Call this FIRST when authoring suites. The step_schema field eliminates the most common iter-1 failure (AI authors based on training-data priors before reading the validator's rules).
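The R32 note above (extract keys must not collide with appLevelCustomVariables) lends itself to a pre-flight check before submitting a suite. The following is a sketch under assumptions, since the response shape is not documented here: it assumes `appLevelCustomVariables` is a mapping or list of variable names under `app`, and that each step may carry an `extract` mapping. `colliding_extract_keys` is a hypothetical helper.

```python
from typing import Any, Dict, Iterable, List


def colliding_extract_keys(
    context: Dict[str, Any], steps: Iterable[Dict[str, Any]]
) -> List[str]:
    """Return extract keys that reuse an app-level custom variable name."""
    # Assumed shape: context["app"]["appLevelCustomVariables"] holds the reserved names.
    reserved = set(context.get("app", {}).get("appLevelCustomVariables") or [])
    clashes = set()
    for step in steps:
        # Assumed shape: step["extract"] maps an extract key to an expression.
        clashes.update(key for key in (step.get("extract") or {}) if key in reserved)
    return sorted(clashes)
```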
| Name | Required | Description | Default |
|---|---|---|---|
| app_id | Yes | Keploy app ID | |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Description describes a read-only fetch operation but annotations have destructiveHint=true and readOnlyHint=false, creating a contradiction. No additional behavioral context is provided to resolve this inconsistency.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two efficient sentences covering action, return values, and usage context. No unnecessary words.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Lists all major return components (schema, coverage, traffic, test suites) despite no output schema. Lacks detail on structure but sufficient for understanding the tool's purpose.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Single required parameter app_id with adequate schema description. Description adds no extra meaning beyond the schema, which covers 100% of parameters.
Does the description clearly state what the tool does and how it differs from similar tools?
Description clearly states it fetches comprehensive context about an app's API, listing specific components (OpenAPI schema, coverage, traffic summaries, test suites). Differentiates from siblings like getApp or getSchemaCoverage by aggregating multiple data sources.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Directly states usefulness for understanding current state before taking action, providing clear context for when to use. Does not explicitly exclude alternatives or mention when not to use, but the guidance is sufficient.
get_coverage_gaps · B · Destructive
Get API coverage and prioritized suggestions for uncovered endpoints.
| Name | Required | Description | Default |
|---|---|---|---|
| app_id | Yes | Keploy app ID | |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
The description says 'Get' implying a read operation, but annotations set destructiveHint=true, contradicting that. No additional behavioral disclosure beyond annotations.
Is the description appropriately sized, front-loaded, and free of redundancy?
Single sentence with no wasted words, directly stating purpose.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Missing details about output format, what 'prioritized suggestions' entails, and behavioral implications despite the contradiction. Does not compensate for lack of output schema.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema description coverage is 100% (app_id described as 'Keploy app ID'), so the description adds little value beyond what schema already provides.
Does the description clearly state what the tool does and how it differs from similar tools?
The description 'Get API coverage and prioritized suggestions for uncovered endpoints' clearly states the verb (get) and the resource (coverage gaps/suggestions), distinguishing it from siblings like getSchemaCoverage.
Does the description explain when to use this tool, when not to, or what alternatives exist?
No guidance on when to use this tool versus alternatives such as getSchemaCoverage or get_app_testing_context.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
getCurrentUser · A · Destructive
GET /users/me — Get current user — Requires scope: read. Returns the user associated with the API key.
| Name | Required | Description | Default |
|---|---|---|---|
| No parameters | | | |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
The description describes a read-only operation (GET), yet annotations set destructiveHint=true and readOnlyHint=false, creating a contradiction. The description does not disclose any additional behavioral traits beyond the contradictory annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is a single sentence that covers the purpose and authorization, but it includes the HTTP method and path which are somewhat redundant with the tool name.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
With no output schema, the description minimally states 'Returns the user associated with the API key', but does not detail what user fields are returned, leaving room for ambiguity.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
No parameters are defined in the input schema, so the description adds value by noting the required scope 'read', which provides authorization context.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states 'Get current user' and specifies the HTTP method and resource path, distinguishing it from sibling tools that modify or create resources.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description mentions the required scope 'read', implying it's a read operation, but does not explicitly state when to use this versus other 'get' tools or when not to use it.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
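The contradiction flagged in the review above (a GET description paired with destructiveHint=true) could be caught mechanically. The following is a minimal sketch, not part of Keploy or Glama: the function name and the dictionary layout (`description` and `annotations` keys) are assumptions made for illustration, and only the `destructiveHint` field name comes from the review text.

```python
import re
from typing import Any, Dict

# Assumption: descriptions that start with GET or HEAD describe read-only calls.
READ_METHOD = re.compile(r"^\s*(GET|HEAD)\b")


def has_annotation_contradiction(tool: Dict[str, Any]) -> bool:
    """Flag a tool whose description reads as read-only but is annotated destructive."""
    description = tool.get("description", "")
    annotations = tool.get("annotations", {})
    looks_read_only = bool(READ_METHOD.match(description))
    return looks_read_only and annotations.get("destructiveHint") is True


# Hypothetical tool record mirroring the getCurrentUser entry.
example_tool = {
    "name": "getCurrentUser",
    "description": "GET /users/me. Get current user. Requires scope: read.",
    "annotations": {"destructiveHint": True, "readOnlyHint": False},
}
print(has_annotation_contradiction(example_tool))
```

A check like this only detects the mismatch; deciding whether the annotation or the description is wrong still requires knowing what the endpoint actually does.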
getGeneratedSchema · B · Destructive
GET /apps/{appId}/generated-schema — Get auto-generated OpenAPI schema — Returns the OpenAPI schema auto-generated from recorded traffic. Requires scope: read.
| Name | Required | Description | Default |
|---|---|---|---|
| appId | Yes | Path parameter: appId | |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
The description does not address the behavioral traits implied by the annotations (destructiveHint: true, readOnlyHint: false). A GET request is typically read-only, but the annotations suggest the tool may have side effects, and none are disclosed. The mention of 'Requires scope: read' adds to the confusion.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is concise (single line) and front-loaded with the HTTP method and path, making it efficient. However, it could be more structured by separating purpose from details.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the simple schema and contradictory annotations, the description should provide more context about when and why to use this tool, and clarify the potentially destructive nature. It lacks this completeness.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100%, so the baseline is 3. The description adds no additional meaning beyond what the schema already provides (it only repeats that appId is a path parameter).
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the action ('Get') and the resource ('auto-generated OpenAPI schema'), and distinguishes this tool from other list/get tools on the server by specifying the specific schema type.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
No guidance is provided on when to use this tool versus alternatives. It only mentions a required scope, but does not explain the context or trade-offs compared to sibling tools.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
getGenerationHistoryDetails · C · Destructive
GET /apps/{appId}/generation-history/{jobId} — Get generation history details — Requires scope: read.
| Name | Required | Description | Default |
|---|---|---|---|
| appId | Yes | Path parameter: appId | |
| jobId | Yes | Path parameter: jobId | |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
The description states 'GET' and 'Requires scope: read', implying a read-only operation. However, annotations indicate readOnlyHint=false and destructiveHint=true, creating a contradiction. The description does not disclose any behavioral traits beyond the HTTP method.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is a single sentence, concise and front-loaded. It includes the HTTP method and scope requirement, but the path format is slightly cryptic. Overall, it is efficient with minimal redundancy.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
With no output schema, the description should explain the return values or structure of the generation history details. It does not. It also fails to mention error handling, pagination, or any additional contextual cues needed for a tool with contradictions.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema description coverage is 100%, and the schema already defines appId and jobId as path parameters. The description adds no additional meaning or context about parameter values, formats, or constraints.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description specifies the HTTP method and resource path, clearly stating it retrieves generation history details for a specific app and job. It distinguishes from listing tools like listGenerationHistory, but does not explicitly differentiate from other detail retrieval tools like getJob.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description includes a scope requirement ('read') but provides no guidance on when to use this tool versus alternatives such as listGenerationHistory or getJob. There is no explicit when-to-use or when-not-to-use instruction.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
getJob · C · Destructive
GET /jobs/{jobId} — Get a job — Requires scope: read.
| Name | Required | Description | Default |
|---|---|---|---|
| jobId | Yes | Path parameter: jobId | |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
The description implies a read operation ('Get'), but annotations indicate destructiveHint=true and readOnlyHint=false, creating a contradiction. The description does not clarify the actual behavior or side effects, which is misleading.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is a single concise sentence that front-loads the method and path. However, given the annotation contradiction and the missing details, the brevity comes at the cost of necessary information.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
For a tool with no output schema and contradictory annotations, the description fails to explain what is returned, potential side effects, or resolve the destructive hint. It is incomplete and may confuse the agent.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
The input schema covers 100% of parameters with a description for jobId. The tool description adds no additional meaning beyond the schema, so it meets the baseline for high schema coverage.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the verb 'Get' and resource 'job', along with the HTTP method and path. It does not explicitly differentiate from sibling tools like listJobs or getTestRun, but the tool name and description make the purpose clear.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description mentions required scope ('read') but provides no guidance on when to use this tool versus alternatives such as listJobs or getTestRun. No exclusions or context for selection are given.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
getLoadTestReport · C · Destructive
GET /apps/{appId}/load-tests/{runId} — Get a load test report — Requires scope: read.
| Name | Required | Description | Default |
|---|---|---|---|
| appId | Yes | Path parameter: appId | |
| runId | Yes | Path parameter: runId | |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
The annotation destructiveHint: true contradicts the description's implication of a read operation ('GET /...get...'). This is a serious inconsistency that misleads the agent.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is very short and front-loaded, but it omits necessary details. It is efficient but not fully adequate.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
The tool has no output schema and the description does not explain what the report contains. Combined with the annotation contradiction, completeness is poor.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100%, so the description adds no value beyond the schema. Parameters are clearly defined as path parameters in the schema.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states 'Get a load test report' and includes the HTTP GET method and path. However, it does not differentiate from sibling tools like getTestRun or getTestReport.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
No guidance is given on when to use this tool versus alternatives such as listLoadTestRuns or getTestRun. The description lacks situational context.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
getRecording · C · Destructive
GET /apps/{appId}/recordings/{testSetId} — Get recorded test cases — Returns individual recorded test cases within a test set, including HTTP request/response data. Requires scope: read.
| Name | Required | Description | Default |
|---|---|---|---|
| appId | Yes | Path parameter: appId | |
| limit | No | Query parameter: limit | |
| offset | No | Query parameter: offset | |
| testSetId | Yes | Path parameter: testSetId | |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Description says 'GET' and 'Get recorded test cases', implying a read-only operation, but annotations have destructiveHint: true and readOnlyHint: false, creating a direct contradiction. No disclosure of destructive behavior.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Concise single sentence with path, purpose, and scope. Front-loaded with method and resource.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
No output schema; description gives high-level return info but fails to clarify destructive behavior or provide thorough usage context. Contradiction further reduces completeness.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema description coverage is 100% so baseline is 3. Description adds return value context but not parameter semantics beyond schema.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
Clearly states the HTTP method, resource (test case within a test set), and that it returns individual recorded test cases with HTTP request/response data. Differentiates from listRecordings by specifying individual test case retrieval.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Only mentions required scope 'read' but provides no guidance on when to use this tool vs siblings like listRecordings or download_recording.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
getSchemaCoverage · B · Destructive
GET /apps/{appId}/schema-coverage — Get schema coverage — Requires scope: read.
| Name | Required | Description | Default |
|---|---|---|---|
| appId | Yes | Path parameter: appId | |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
The description states 'Requires scope: `read`' which contradicts the annotations: readOnlyHint=false, destructiveHint=true. This implies the operation may modify data, yet it claims only read scope is needed. No other behavioral traits (e.g., rate limits, side effects) are disclosed.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is extremely concise, using a single sentence that front-loads the HTTP method and endpoint. Every part is necessary for basic identification, with no waste.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the absence of an output schema, the description should hint at what 'schema coverage' means or what the response contains. It does not. Combined with the annotation contradiction, the description leaves significant gaps for an agent.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
The schema already covers the single parameter with 100% description coverage (though the description is redundant). The tool description adds no additional meaning or constraints beyond what the schema provides, meeting the baseline for high schema coverage.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the resource ('schema coverage' for a given app) and the action ('Get'). The inclusion of the HTTP method and endpoint provides unambiguous identification. It distinguishes itself from siblings like 'get_coverage_gaps' and 'getGeneratedSchema' by name and context.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description offers no guidance on when to use this tool versus alternatives, nor does it mention prerequisites or exclusions. The agent is left to infer its applicability from the name alone.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
get_session_report · A · Destructive
Fetch the report for a completed run. ONE tool, THREE report kinds — the response's top-level kind field discriminates which kind it is (rerecord / sandbox_run / test_suite_run) and which question the report answers (see core glossary's "three reports"). Read kind first, then pick the matching reading rules below; do NOT assume the kind from how you got here.
Call this as the final step of the playbook, AFTER you read the terminal NDJSON event (phase=done) and confirmed data.ok=true. Pass app_id and test_run_id — extract test_run_id from data.test_run_id on the phase=done line of the progress_file returned by record_sandbox_test or replay_sandbox_test (for replay_test_suite, the CLI prints test_run_id to stdout instead).
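As an illustration of the extraction step described above, here is a minimal sketch that scans NDJSON progress lines for the terminal phase=done event and returns data.test_run_id only when data.ok is true. The function name is hypothetical, and the event layout (a top-level `phase` key and a nested `data` object) is an assumption inferred from the description, not a documented schema.

```python
import json
from typing import Iterable, Optional


def find_test_run_id(ndjson_lines: Iterable[str]) -> Optional[str]:
    """Return data.test_run_id from the phase=done event, or None.

    None is returned when no phase=done line exists or when data.ok is not true.
    """
    for raw in ndjson_lines:
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines between events
        event = json.loads(raw)
        if event.get("phase") != "done":
            continue
        data = event.get("data") or {}
        if data.get("ok") is True:
            return data.get("test_run_id")
        return None  # the run finished but did not report ok=true
    return None


# Hypothetical progress lines; real files may carry more fields.
sample_lines = [
    '{"phase": "record", "data": {}}',
    "",
    '{"phase": "done", "data": {"ok": true, "test_run_id": "run-123"}}',
]
print(find_test_run_id(sample_lines))
```

In practice the lines would come from the progress_file on disk, for example by iterating over an open file handle.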
===== OUTPUT SHAPE =====
(Conditional verbosity so the dev isn't drowned in noise on a green run.)
Always includes totals at the SUITE level only (total_suites / passed_suites / failed_suites) and a per_suite array where each entry carries suite_id, suite_name, total_steps, passed_steps, failed_steps. Aggregate step counts across suites are intentionally omitted — they hide where damage actually is.
PER-KIND READING of passed_steps / failed_steps — same column names, different meaning per kind:
RERECORD (kind=rerecord): passed_steps = steps whose auto-replay byte-comparison matched the live capture. failed_steps = steps that diverged on auto-replay. EVEN IF every suite shows passed_steps == total_steps, the rerecord is only successful when every suite is also linked=true (a sandbox test got produced). Always check `linked`; the step counts alone do not indicate "did the rerecord work".
SANDBOX_RUN (kind=sandbox_run): passed_steps = steps whose assertions held under captured-mock replay. failed_steps = assertion failures or response diffs against the captured baseline.
TEST_SUITE_RUN (kind=test_suite_run): passed_steps = steps whose assertions held against the live app. failed_steps = same against live, no mocks involved. No linkage to report.
Top-level `kind` discriminates the report: "rerecord" for record_sandbox_test runs (rerecord report; answers "did the sandbox test get created and linked?"), "sandbox_run" for replay_sandbox_test runs (sandbox run report; answers "does the suite still hold up against its captured baseline?"), "test_suite_run" for replay_test_suite runs (test suite report; live execution, no mocks; answers "does the suite hold up against the actual current system?"). Use kind to pick the right reading; do NOT mix them in one response.
RERECORD runs (kind="rerecord") carry a `linked` bool + `test_set_id` string on every per_suite[] entry. linked=true means the rerecord produced a sandbox test for the suite (replay-ready). linked=false means rerecord did NOT produce a sandbox test for the suite; it cannot be replayed until rerecord succeeds. ALWAYS surface this on rerecord output: even when every step's capture passed at the wire level, a suite without a sandbox test is a real failure. For the per-suite table, add a "Linked" column (yes/no from per_suite[].linked). For the one-line all-green reply, report "N/N suites passed, L/N have a sandbox test (test_run_id=)".
When any suite has failures (or verbose=true), also includes failed_steps[] with per-step diagnostics (suite, step name, method+url, diff excerpt, error, mock_mismatches, assertion_failures, mock_mismatch_failure, authored_assertions, authored_response_body) PLUS mock_mismatch_failed_steps (count) and mock_mismatch_dominant (bool; true when the majority of failed steps have unconsumed recorded mocks, which points at a keploy-side egress-hook issue rather than dev app breakage). On RERECORD, failed_steps[] also carries `linked` (whether the owning suite has a sandbox test after this rerecord) and the mock_mismatch_* fields are suppressed (irrelevant in rerecord context).
authored_assertions / authored_response_body: the SUITE's authored contract for the failing step (the assert array and response.body as defined when the suite was created/updated). Surfaced inline so route B vs route C can be decided without a second getTestSuite round-trip. KEY DECISION POINT: if any authored_assertions entry is pinned to the value the diff shows as "expected" (e.g. assert {path: "$.order.status", expected: "created order"} and the diff says "expected 'created order', got 'created'"), route C is MANDATORY: re-record alone leaves that assertion stuck on the old contract and the next rerecord/replay will gate-1-fail on the same step. If authored_assertions is empty/absent (suite asserts nothing structural on that field), route B or route-C-without-assertion-edit may suffice.
When everything passes and verbose is false, failed_steps is omitted.
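The rerecord rule in this output shape (step counts alone are not enough; every suite must also be linked) can be sketched as a small check. This is an illustrative sketch only: the function name is hypothetical, the report is assumed to be a plain dictionary with `kind` and `per_suite` at the top level, and treating a report with zero suites as unsuccessful is a conservative assumption rather than documented behavior.

```python
from typing import Any, Dict


def rerecord_succeeded(report: Dict[str, Any]) -> bool:
    """True only when every suite passed all of its steps AND is linked."""
    if report.get("kind") != "rerecord":
        raise ValueError("linkage only applies to kind == 'rerecord'")
    suites = report.get("per_suite", [])
    if not suites:
        return False  # assumption: an empty report is not a success
    return all(
        s.get("passed_steps") == s.get("total_steps") and s.get("linked") is True
        for s in suites
    )


# Hypothetical report: all steps green, but one suite has no sandbox test.
green_but_unlinked = {
    "kind": "rerecord",
    "per_suite": [
        {"suite_name": "orders", "total_steps": 3, "passed_steps": 3, "linked": True},
        {"suite_name": "users", "total_steps": 2, "passed_steps": 2, "linked": False},
    ],
}
print(rerecord_succeeded(green_but_unlinked))
```

The example report shows the case the description warns about: every step matched, yet the rerecord as a whole did not succeed because one suite is unlinked.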
===== HOW TO RESPOND TO THE DEV =====
status == "all_passed" AND kind == "sandbox_run" → ONE-LINER: "<passed_suites>/<total_suites> suites passed (test_run_id=)". Do not dump the JSON, do not list per-suite rows unless asked.
status == "all_passed" AND kind == "test_suite_run" → ONE-LINER: "<passed_suites>/<total_suites> suites passed live (test_run_id=)". No mocks involved, no linkage to report.
status == "all_passed" AND kind == "rerecord" → ONE-LINER including linkage: "<passed_suites>/<total_suites> suites passed, / linked (test_run_id=)" where = count of per_suite[] entries with linked=true. If linked < total, ALSO list the unlinked suite names so the dev knows which ones are silently broken (skip sandbox replay on them, or investigate the linking failure). Never drop linkage reporting on rerecord even when it's all green.
status == "has_failures" → response MUST contain (in order, no collapsing rows even when failures look homogeneous — the dev needs the full inventory):
per-suite table — one row per suite in per_suite (passing suites included), columns = Suite name | passed/total steps.
failed-steps table — ONE ROW per entry in failed_steps[], columns = Suite | Step name | Method + URL | Expected → Actual status | mock_mismatch y/n.
Diagnosis + Recommendation (rules below). Do NOT print aggregate step totals across suites.
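The all-passed replies described above differ only by report kind, which a short formatter can make concrete. This is a hedged sketch: the function name is hypothetical, the suite totals are assumed to sit at the top level of the report, and because the placeholders in the rerecord template are partly missing from the text, the sketch assumes the linked count is reported over the total suite count.

```python
from typing import Any, Dict


def all_passed_one_liner(report: Dict[str, Any], test_run_id: str) -> str:
    """Build the one-line reply for a report whose status is all_passed."""
    if report.get("status") != "all_passed":
        raise ValueError("one-liner replies apply only when status == 'all_passed'")
    passed, total = report["passed_suites"], report["total_suites"]
    kind = report["kind"]
    if kind == "sandbox_run":
        return f"{passed}/{total} suites passed (test_run_id={test_run_id})"
    if kind == "test_suite_run":
        return f"{passed}/{total} suites passed live (test_run_id={test_run_id})"
    if kind == "rerecord":
        suites = report.get("per_suite", [])
        linked = sum(1 for s in suites if s.get("linked") is True)
        line = (
            f"{passed}/{total} suites passed, {linked}/{total} linked "
            f"(test_run_id={test_run_id})"
        )
        # Name the unlinked suites so they are not silently dropped.
        unlinked = [s["suite_name"] for s in suites if s.get("linked") is not True]
        if unlinked:
            line += "; unlinked suites: " + ", ".join(unlinked)
        return line
    raise ValueError(f"unknown report kind: {kind!r}")


rerecord_report = {
    "status": "all_passed",
    "kind": "rerecord",
    "passed_suites": 2,
    "total_suites": 2,
    "per_suite": [
        {"suite_name": "orders", "linked": True},
        {"suite_name": "users", "linked": False},
    ],
}
print(all_passed_one_liner(rerecord_report, "run-1"))
```

Raising on an unknown kind, instead of guessing a format, follows the instruction to read kind first and not assume it.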
Frame the diagnosis from the glossary: a mock mismatch IS the signal that the sandbox test has drifted from current app behavior. The three routes below (SKIP / FIX-CODE / FIX-TEST-RERECORD) are not separate buckets — they're three possible SOURCES of that drift:
keploy proxy didn't replay correctly → drift is artificial, no real change → route A (SKIP).
app regressed → drift is unintended, fix the code → route B.
contract changed on purpose → drift is intentional, refresh the sandbox test → route C.
Your repo inspection picks which source applies; the routes are the prescription for that source.
DIAGNOSE WITH THE REPO, NOT THE DEV. Before recommending anything on a failing run, inspect the source tree yourself (git log / git diff against the last green run or main, read the failing handler + its downstream call sites). DO NOT ask the dev "did you change X since the last green run" — you have the repo, find the answer. Only come back with a concrete conclusion.
mock_mismatch_dominant == true → failure signature is "keploy didn't intercept the app's egress traffic". Use git to check whether the failing endpoints or their dependency wiring have been modified recently: (a) NO relevant changes → tell the dev this is almost certainly a KEPLOY-SIDE issue and ask them to file a keploy issue with test_run_id. Do NOT ask them to re-record. (b) Relevant changes EXIST → name them (file:line or commit hash), explain how each plausibly caused the failure, say whether the change looks intended or accidental, and tell the dev exactly what to fix.
status == "has_failures" AND mock_mismatch_dominant == false → same discipline: identify the commit(s) / diff hunks that most likely caused each failure, state whether they look intended, and prescribe a fix (rerecord, revert, patch the handler). Don't hand the investigation back to the dev.
===== HANDLING "FIX IT" FOLLOW-UPS =====
(After the dev has seen the analysis and asks you to fix.)
═══════════════════════════════════════════════════════════════════ DO NOT JUMP TO RECORD — diagnose FIRST. ═══════════════════════════════════════════════════════════════════
A sandbox-replay failure is NOT a signal to rerecord. Re-recording without diagnosis silently captures the broken behavior as the new "expected" — masking a real app regression and erasing the evidence the dev needs.
When sandbox replay fails, your FIRST move is ALWAYS the diagnosis below (B vs C vs SKIP). You only call record_sandbox_test as part of route C, AND only AFTER update_test_suite has updated the suite to match the new intentional contract. If the contract hasn't changed (route B), DO NOT record — the captured mocks are still valid; only the app needs fixing.
If you find yourself thinking "let me just rerecord to fix this", STOP. Read failed_steps, inspect the repo for what changed, decide which route applies. Re-recording is a tool for capturing a NEW intentional contract, not a remedy for a failed run.
You have exactly THREE options for each failing step. Pick one per step based on your repo inspection; do not ask the dev which branch to take, decide:
A. SKIP — do nothing code-side. Pick this when mock_mismatch_dominant=true AND your repo inspection found no relevant changes in the failing handler or its dependencies. Rationale: this is a keploy egress-hook / proxy issue; editing the app or the test won't help. Tell the dev "flagged for keploy support, no app or test change needed" and move on to the next step (if any) or close.
B. FIX THE CODE — edit the handler / dependency wiring. Pick this when your repo inspection shows a recent change that broke the endpoint's contract AND the ORIGINAL test intent still matches what the endpoint SHOULD do (the test is correct, the code regressed). Make the minimal edit to restore expected behavior, tell the dev exactly which file:line you changed and why, then re-run: call replay_sandbox_test for the suite(s) whose steps you just un-broke. DO NOT record — the captured mocks are still valid if the contract hasn't changed intentionally.
C. UPDATE-FIRST, THEN RECORD — order matters: (1) update_test_suite first, (2) record_sandbox_test second, (3) replay_sandbox_test to verify. Calling record before update means you'd capture mocks against the OLD suite shape — defeats the purpose. Pick this when the endpoint's contract LEGITIMATELY changed (a deliberate new field, renamed response key, different status code, new required header) AND your repo inspection confirms the change is intended (commit message, surrounding diff, or obvious product direction). The update_test_suite call should edit the step's body / expected response / assertions / extract to match the new contract. Tell the dev which assertions you updated and why the contract change is considered intentional.
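The three options above can be summarized as a decision function plus the tool order each route implies. This is a simplified sketch under stated assumptions: the `Route` enum, the function name, and the three boolean inputs are hypothetical stand-ins for the findings of a repo inspection, and the case the text does not cover (no dominant mock mismatch and no relevant changes) is returned as None instead of being guessed.

```python
from enum import Enum
from typing import Dict, List, Optional


class Route(Enum):
    SKIP = "A"
    FIX_CODE = "B"
    UPDATE_THEN_RECORD = "C"


# Tool calls each route implies, in order. Route C must update before recording.
ROUTE_TOOL_SEQUENCE: Dict[Route, List[str]] = {
    Route.SKIP: [],
    Route.FIX_CODE: ["replay_sandbox_test"],
    Route.UPDATE_THEN_RECORD: [
        "update_test_suite",
        "record_sandbox_test",
        "replay_sandbox_test",
    ],
}


def choose_route(
    mock_mismatch_dominant: bool,
    relevant_changes_found: bool,
    change_is_intended: bool,
) -> Optional[Route]:
    """Pick a route for one failing step from the repo-inspection findings."""
    if not relevant_changes_found:
        # Only a dominant mock mismatch with an untouched repo maps to SKIP.
        return Route.SKIP if mock_mismatch_dominant else None
    return Route.UPDATE_THEN_RECORD if change_is_intended else Route.FIX_CODE


route = choose_route(
    mock_mismatch_dominant=False,
    relevant_changes_found=True,
    change_is_intended=True,
)
print(route, ROUTE_TOOL_SEQUENCE[route])
```

Route B lists only replay_sandbox_test because the code edit itself is not a tool call, and no recording happens on that route.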
╔═══ ROUTE C: DECISION + RECOMMENDATION TEMPLATE (use verbatim) ═══╗
Decision input: read failed_steps[].authored_assertions and authored_response_body INLINE in this report. Do NOT call getTestSuite again unless those fields are absent (older runs).
* If an authored assertion's expected value matches the diff's "expected" side → route C is MANDATORY. The suite's contract pins the old value; you MUST update_test_suite before record_sandbox_test, otherwise the next rerecord gate-1-fails on the same assertion and the suite comes back unlinked.
* If authored_response_body has the old value but no assert is pinned to it → route C is still recommended (the captured response baseline drifts), but record_sandbox_test alone CAN succeed; choosing update_test_suite first keeps the suite source-of-truth aligned with the new contract.
* If neither pins the diverging value → route C without assertion edits is sufficient (or route B if the change is unintentional).
Mandatory recommendation phrasing for the dev (one bullet per failing step that routes to C): "(1) update_test_suite for suite '<suite_name>' (id=<suite_id>) — change step '<step_name>' (id=<step_id>): set <field_path> from '' to '' and update assertion <assert_index> on the same path; (2) record_sandbox_test on that suite to refresh the captured baseline; (3) replay_sandbox_test to verify."
BANNED wording: never write any of these on a route-C recommendation:
× "re-record the sandbox tests so the baseline picks up the new value"
× "just rerecord to refresh the captured response"
× "re-record and the new value will become the expected"
× "re-record OR update assertions" (or any phrasing that joins update_test_suite and record_sandbox_test with "or" / "either … or" / "one of these two")
× "you can either update the assertions or re-record"
× "options: (a) update assertions, (b) re-record the suite"
All of these drop step (1) or present the two steps as interchangeable. They are NOT alternatives; they are sequential steps in a single route-C flow: (1) update_test_suite, (2) record_sandbox_test, (3) replay_sandbox_test. Skipping (1) leaves the suite's authored assertion pinned on the old value; the next replay gate-1-fails on the same diff. If you catch yourself reaching for "or" between these two tools on a route-C recommendation, restate using the mandatory template.
╚════════════════════════════════════════════════════════════════════╝
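The three decision rules in the route-C box can be sketched as a small function. This is only an illustration: the `FailedStep` class and its two flattened value lists are hypothetical simplifications, since the real shape of `failed_steps[].authored_assertions` and `authored_response_body` is not shown in this listing.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class FailedStep:
    # Hypothetical, flattened view of one failed_steps[] entry.
    # The real report fields may be nested differently.
    authored_assertion_values: List[Any] = field(default_factory=list)
    authored_response_values: List[Any] = field(default_factory=list)


def route_c_decision(step: FailedStep, diff_expected: Any) -> str:
    """Classify how strongly route C applies to one failing step."""
    # Rule 1: an authored assertion pins the old ("expected") value.
    if diff_expected in step.authored_assertion_values:
        return "mandatory"
    # Rule 2: only the authored response body carries the old value.
    if diff_expected in step.authored_response_values:
        return "recommended"
    # Rule 3: nothing pins the diverging value.
    return "no_assertion_edits"
```

A result of "mandatory" means update_test_suite has to come before record_sandbox_test; the other two results still follow the same update, record, replay order when route C is chosen.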
Multiple failing steps can land in DIFFERENT branches — e.g. one step is a real app regression (B), another is a contract change (C). In that case, explain the split up-front, apply each fix, and run sandbox replay once at the end covering every affected suite.
After any B or C branch completes, the final message uses the same 3-subsection format (per-suite table → failed-steps table → diagnosis + recommendation) on the follow-up sandbox replay, PLUS a short "Fix applied" preamble naming the file:line edits (for B) or update_test_suite calls (for C). For A-only responses (all failures route to keploy), no follow-up run is needed — just restate the keploy-issue recommendation.
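As a rough sketch of the mixed-branch flow described above, the following plans the tool calls for several suites and schedules a single replay at the end. It assumes one route per suite for simplicity (the text allows routes per step), and the `plan_tool_calls` helper and its return shape are hypothetical; only the tool names come from the description.

```python
from typing import Dict, List, Tuple

Plan = List[Tuple[str, Tuple[str, ...]]]


def plan_tool_calls(routes: Dict[str, str]) -> Plan:
    """Map suite_id -> route ('A', 'B' or 'C') to an ordered call plan."""
    plan: Plan = []
    for suite_id, route in routes.items():
        if route == "C":
            # Route C: update first, then record. Never the reverse.
            plan.append(("update_test_suite", (suite_id,)))
            plan.append(("record_sandbox_test", (suite_id,)))
        # Route B is a code edit in the repo, so it adds no tool call here.
        # Route A (keploy issue) needs no follow-up run at all.
    affected = tuple(s for s, r in routes.items() if r in ("B", "C"))
    if affected:
        # One replay at the end covering every affected suite.
        plan.append(("replay_sandbox_test", affected))
    return plan
```

For example, `plan_tool_calls({"s1": "B", "s2": "C", "s3": "A"})` yields an update and a record for `s2`, followed by one replay over `s1` and `s2`.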
===== REPLAY / "EXPLAIN MY LATEST SANDBOX REPORT" =====
When the dev asks "explain my latest sandbox report" / "analyse the last run" / "why did it fail" — call this tool again with the SAME app_id + test_run_id and verbose=true so the full diagnostics come back even if nothing failed. Use that detail to answer their question. If you don't have the test_run_id to hand, list the app's most recent runs OF THE RIGHT KIND via /client/v1/apps/{app_id}/test-runs?kind=<rerecord|sandbox_run|test_suite_run> and pick the top one. NEVER list /test-runs without the kind filter and pick the latest blindly — different kinds are co-mingled in that collection, and an unfiltered list will surface a rerecord run when the dev asked for the latest sandbox replay (or vice versa). Match the kind to what the dev asked: "explain my latest record" → kind=rerecord; "explain my latest sandbox replay" / "integration test report" → kind=sandbox_run; "explain my latest live run" → kind=test_suite_run. If the dev's verb is ambiguous, ASK which kind first (per the verb-routing's explain-branch rule).
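The kind-matching rule above could be sketched as a lookup that refuses to build an unfiltered listing path. The intent strings and the `list_runs_path` helper are illustrative assumptions; the path and the three kind values are taken from the description.

```python
from urllib.parse import urlencode

# Intent phrases are simplified stand-ins for what the dev actually typed.
KIND_BY_INTENT = {
    "record": "rerecord",
    "sandbox replay": "sandbox_run",
    "integration test report": "sandbox_run",
    "live run": "test_suite_run",
}


def list_runs_path(app_id: str, intent: str) -> str:
    """Build the test-runs listing path, always with a kind filter."""
    kind = KIND_BY_INTENT.get(intent)
    if kind is None:
        # Ambiguous verb: ask the dev instead of listing unfiltered.
        raise ValueError("ambiguous intent: ask which kind of run is meant")
    return f"/client/v1/apps/{app_id}/test-runs?{urlencode({'kind': kind})}"
```

Raising on an unknown intent mirrors the instruction to ask the dev which kind they mean rather than picking the latest run blindly.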
| Name | Required | Description | Default |
|---|---|---|---|
| app_id | Yes | Keploy app ID | |
| verbose | No | Force detailed per-step diagnostics even when all steps passed. Set true for explain/analyse intents so the dev sees the full picture, not just a one-line summary. | false |
| test_run_id | Yes | Test run ID. Extracted from data.test_run_id on the phase=done NDJSON event written to the progress_file by the headless sandbox CLI. | |
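A minimal sketch of pulling test_run_id out of the progress file's NDJSON lines, under the assumption that each event is a JSON object with a top-level `phase` key and a `data` object. Only `phase=done`, `data.ok` and `data.test_run_id` are named in this listing; the rest of the event shape is a guess.

```python
import json
from typing import Iterable, Optional


def extract_test_run_id(ndjson_lines: Iterable[str]) -> Optional[str]:
    """Return data.test_run_id from the terminal phase=done event, if ok."""
    for line in ndjson_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between events
        event = json.loads(line)
        if event.get("phase") != "done":
            continue
        data = event.get("data") or {}
        # Only hand back an ID when the run reported data.ok = true.
        if data.get("ok") is True:
            return data.get("test_run_id")
        return None
    return None
```

In practice the lines would come from reading the progress_file, for example by iterating over an open file handle.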
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Despite annotations already providing a destructive hint, the description adds extensive behavioral context: conditional output verbosity, linkage reporting for rerecord, step-level diagnostics, mock_mismatch detection, and detailed diagnosis rules. This goes far beyond what annotations convey, giving the agent a complete mental model of the tool's behavior.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is very long and contains redundant sections (e.g., repeated extraction instructions). While it is well-structured with ===== headings and front-loaded purpose, many sentences are procedural instructions for the agent rather than concise tool definition. It earns a 3 because it is clear but not minimal.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool's complexity (conditional output, no output schema), the description fully covers return value shape, error handling, diagnosis steps, and follow-up actions. It leaves no gaps for an agent to guess behavior, making it contextually complete.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
The schema already covers all three parameters with descriptions (100% coverage). The description adds modest extra meaning: for verbose, it clarifies 'Set true for explain/analyse intents' and for test_run_id it reiterates extraction guidance. However, these additions are minor and the baseline of 3 is appropriate since the schema carries the semantic load.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description opens with a specific verb (Fetch) and resource (per-suite / per-step pass/fail for a completed sandbox session), clearly defining the tool's scope. It distinguishes from sibling tools like getSuiteReport or getTestReport by explicitly framing it as the final step in a sandbox playbook.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description provides explicit when-to-use instructions: 'after you read the terminal NDJSON event (phase=done) and confirmed data.ok=true'. It also details how to extract the required test_run_id, and includes guidance for re-use (e.g., 'Call this again with verbose=true for explain/analyse intents'). This fully covers usage context without conflicting with siblings.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
getSubscription · B · Destructive
GET /company/subscription — Get company subscription — Requires scope: read.
| Name | Required | Description | Default |
|---|---|---|---|
| No parameters | | | |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
The description describes a read operation ('Get'), but annotations indicate destructiveHint=true and readOnlyHint=false, which contradicts the description. This is a serious inconsistency that misleads the agent about the tool's behavior.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
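The contradiction flagged here (a GET description alongside destructiveHint=true and readOnlyHint=false) can be checked mechanically. The sketch below assumes each tool is a dict with `name`, `description` and `annotations` keys, which is a simplification of a tool listing and not something this page confirms.

```python
from typing import Any, Dict, List


def find_annotation_conflicts(tools: List[Dict[str, Any]]) -> List[str]:
    """Names of tools described as GET but annotated as mutating."""
    conflicts = []
    for tool in tools:
        description = tool.get("description", "")
        annotations = tool.get("annotations", {})
        # The descriptions on this page lead with the HTTP method.
        is_get = description.lstrip().upper().startswith("GET ")
        marked_mutating = (
            annotations.get("destructiveHint") is True
            or annotations.get("readOnlyHint") is False
        )
        if is_get and marked_mutating:
            conflicts.append(tool["name"])
    return conflicts
```

A check like this would flag every read endpoint in this listing that carries the Destructive badge, while leaving write tools such as importRecording alone.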
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is extremely concise, containing only one sentence with all necessary information (endpoint, action, scope). No wasted words.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool's simplicity (no parameters, no output schema), the description provides sufficient context including the required scope. The only shortcoming is the contradiction with annotations, but this dimension does not penalize for that here.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
The input schema has no parameters, so the description does not need to add parameter details. The baseline of 4 is appropriate as the description adds no unnecessary param info.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the HTTP method and resource ('GET /company/subscription') and the action ('Get company subscription'), making the tool's purpose immediately understandable. However, it does not distinguish it from sibling tools.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description implies the tool is for reading subscription data and requires a 'read' scope, providing basic usage context. No alternatives or when-not-to-use guidance is given.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
getSuiteReport · C · Destructive
GET /apps/{appId}/test-runs/{runId}/suite-reports/{reportId} — Get a suite report — Requires scope: read.
| Name | Required | Description | Default |
|---|---|---|---|
| appId | Yes | Path parameter: appId | |
| runId | Yes | Path parameter: runId | |
| reportId | Yes | Path parameter: reportId | |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
The description indicates it is a GET operation requiring read scope, but annotations report readOnlyHint: false and destructiveHint: true, creating a direct contradiction. No additional behavioral traits are disclosed beyond the path.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is extremely concise and front-loaded with the HTTP method and path. However, it omits critical behavioral context, sacrificing completeness for brevity.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the contradictory annotations and lack of output schema, the description fails to provide sufficient context about what the report contains or side effects. It leaves the agent with significant ambiguity.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
All three parameters have descriptions in the schema (100% coverage), but they are trivial ('Path parameter: appId'). The description repeats the path structure but adds no semantic meaning beyond what the schema provides.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states it is a GET request to retrieve a suite report, with the path structure indicating specific parameters. However, it does not differentiate from sibling tools like listSuiteReports, leaving room for confusion.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
No guidance is provided on when to use this tool versus alternatives such as listSuiteReports. The description only states the HTTP method and required scope, which is insufficient for decision-making.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
getTestCase · B · Destructive
GET /apps/{appId}/recordings/{testSetId}/test-cases/{testCaseId} — Get a single test case — Returns a single recorded test case by ID within a recording session. Requires scope: read.
| Name | Required | Description | Default |
|---|---|---|---|
| appId | Yes | Path parameter: appId | |
| testSetId | Yes | Path parameter: testSetId | |
| testCaseId | Yes | Path parameter: testCaseId | |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
The description describes a read-only GET operation, but annotations indicate destructiveHint: true and readOnlyHint: false, contradicting the description. No disclosure of potential side effects or required authorization beyond scope mention.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two compact sentences with HTTP method, URL template, purpose, and authorization scope. Front-loaded with key information, no superfluous text.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
For a simple GET with 3 required params and no output schema, the description provides essential info. However, it lacks details on response format, error handling, or differentiation from numerous sibling tools. Contradiction with annotations also undermines completeness.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema covers 100% of parameters with basic descriptions. Description adds mild context via the URL pattern and 'within a recording session', but does not explain semantic meaning (e.g., what appId, testSetId represent) beyond the schema.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description explicitly states it is a GET to retrieve a single test case by ID within a recording session. The verb 'get' and resource 'single test case' provide clear purpose, distinguishing it from list or delete siblings.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
No guidance on when to use this tool vs alternatives. No explicit conditions, exclusions, or references to sibling tools like listTestCaseReports or getTestSuite. Usage must be inferred.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
getTestReport · C · Destructive
GET /apps/{appId}/test-reports/{reportId} — Get a test run report — Returns a single test run report by ID. Requires scope: read.
| Name | Required | Description | Default |
|---|---|---|---|
| appId | Yes | Path parameter: appId | |
| reportId | Yes | Path parameter: reportId | |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
The description states that the tool retrieves a test run report, implying a read-only operation. However, the annotations include destructiveHint: true, which explicitly marks it as destructive—a direct contradiction. The description fails to reconcile or explain this conflict, and provides no additional behavioral context beyond the basic verb.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is quite short (two sentences) and conveys the essential information. It is efficiently written, though the inclusion of the full HTTP path could be considered redundant. Overall, it earns its length without unnecessary verbosity.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
The description does not describe the return value structure, pagination, error handling, or any side effects. With no output schema, the agent is left with minimal context about what the report contains. Additionally, the destructiveHint annotation contradicts the described behavior, leaving a critical gap in understanding the tool's full effect.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
The input schema has 100% description coverage, specifying both parameters as path parameters. The description adds no further semantic meaning beyond what the schema already provides, so a baseline score of 3 is appropriate.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the HTTP method, resource path, and action: 'GET ... Get a test run report'. It specifies the unique identifier parameters and required scope, leaving no ambiguity about what the tool does. This distinguishes it from sibling tools like getLoadTestReport or getSuiteReport by the specific resource type.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
No guidance is provided on when to use this tool versus alternatives such as getSuiteReport or getLoadTestReport. Given the large number of sibling report tools, the lack of contextual hints or use-case differentiation is a significant gap.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
getTestRun · C · Destructive
GET /apps/{appId}/test-runs/{runId} — Get a test run — Requires scope: read.
| Name | Required | Description | Default |
|---|---|---|---|
| appId | Yes | Path parameter: appId | |
| runId | Yes | Path parameter: runId | |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
The description indicates a GET operation, implying a safe, non-destructive read. However, the annotations set destructiveHint=true and readOnlyHint=false, contradicting the description. This inconsistency undermines trust, and the description adds no further behavioral context beyond the HTTP method.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is a single, efficient sentence that conveys the endpoint and scope requirement. Every word serves a purpose, and there is no redundant or extraneous information.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the number of sibling tools and the lack of output schema, the description is insufficient. It does not explain what a test run is, what the response contains, or how this tool differs from similar get* tools. The agent would lack context for proper selection and invocation.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100%, so baseline is 3. The description does not add any additional meaning beyond the schema, which already lists the path parameters with minimal descriptions (e.g., 'Path parameter: appId'). No parameter defaults, constraints, or usage hints are provided.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the HTTP method, path template, and that it retrieves a test run. It explicitly mentions the required scope 'read'. However, it does not differentiate this tool from sibling get* tools like getTestSuite or getTestReport, which also perform reads on similar resources.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
No guidance is provided on when to use this tool versus alternatives (e.g., listTestRuns for multiple runs, getTestReport for a report). There are no usage conditions or exclusions beyond the implied scope requirement.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
getTestSuite · C · Destructive
GET /apps/{appId}/test-suites/{suiteId} — Get a test suite — Requires scope: read.
| Name | Required | Description | Default |
|---|---|---|---|
| appId | Yes | Path parameter: appId | |
| suiteId | Yes | Path parameter: suiteId | |
| branch_id | No | Query parameter: branch_id | |
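To illustrate how the two path parameters and the optional query parameter fit together, here is a small sketch that builds the request path. The helper name is hypothetical and no base URL is assumed; only the path template and the `branch_id` name come from the listing.

```python
from typing import Optional
from urllib.parse import quote, urlencode


def get_test_suite_path(
    app_id: str, suite_id: str, branch_id: Optional[str] = None
) -> str:
    """Fill in /apps/{appId}/test-suites/{suiteId} and add branch_id if set."""
    # Percent-encode path segments so IDs with special characters stay intact.
    path = f"/apps/{quote(app_id, safe='')}/test-suites/{quote(suite_id, safe='')}"
    if branch_id is not None:
        path += "?" + urlencode({"branch_id": branch_id})
    return path
```

Omitting `branch_id` leaves the query string off entirely, which matches its optional status in the table.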
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
The description says 'Get a test suite', implying a read-only operation, but annotations indicate destructiveHint: true and readOnlyHint: false, constituting a contradiction. No additional behavioral context is provided to resolve this.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is a single efficient sentence that includes the endpoint, action, and authentication requirement. It could be improved by leading with the action rather than the full URL.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
No output schema exists, and the description does not hint at the return value (e.g., test suite details). The contradictory annotations also reduce completeness.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Input schema has 100% coverage with basic descriptions ('Path parameter: appId'). The tool description adds minimal value (the URL pattern) beyond the schema, meeting the baseline.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states it is a GET operation to retrieve a test suite, with specific path parameters. However, it does not explicitly differentiate from sibling tools like getTestCase or listTestSuites.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description mentions the required scope ('read') but provides no guidance on when to use this tool versus alternatives like listTestSuites or other get tools.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
getUsage · B · Destructive
GET /company/usage — Get company usage — Requires scope: read.
| Name | Required | Description | Default |
|---|---|---|---|
| No parameters | | | |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
The description claims it is a GET request (implying read-only), but annotations indicate readOnlyHint=false and destructiveHint=true, creating a direct contradiction. No behavioral details beyond the method are provided.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is extremely concise, containing only essential information: HTTP method, path, resource, and scope requirement. Every word earns its place with no fluff.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
The description is minimal but sufficient for a simple, parameterless tool. However, it lacks explanation of what 'usage' means or the return format, especially important given no output schema.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
There are no parameters, and the input schema is empty with 100% coverage. The description adds no parameter info, but the baseline for zero parameters is 4, as the schema fully handles it.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states 'Get company usage', specifying the HTTP method and resource. However, it doesn't differentiate from siblings or clarify what 'usage' encompasses, making it somewhat vague.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
No guidance is provided on when to use this tool versus alternatives. The only additional info is the required scope, which does not aid in tool selection.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
getValidationResult · B · Destructive
GET /jobs/{jobId}/validation-result — Get job validation result — Requires scope: read.
| Name | Required | Description | Default |
|---|---|---|---|
| jobId | Yes | Path parameter: jobId | |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
The description states it is a GET request, implying idempotent read-only behavior, but annotations include destructiveHint: true, which is a direct contradiction. No additional behavioral context is provided beyond the HTTP method.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is a single sentence that front-loads the HTTP method and resource, with no wasted words.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
With only one parameter and no output schema, the description lacks details on what the validation result contains or how to interpret it, leaving the agent with incomplete context for using the response.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
The only parameter, jobId, is covered by the schema (coverage 100%), but only as 'Path parameter: jobId'. The description repeats it in the path template and adds no new meaning.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the HTTP method and resource: 'GET /jobs/{jobId}/validation-result — Get job validation result'. This is a specific verb+resource combination that distinguishes it from siblings like getJob or validateTestSuite.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description mentions a prerequisite ('Requires scope: read') but does not indicate when to use this tool versus alternatives like getJob or validateTestSuite, nor when not to use it.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
importRecording · A · Destructive
POST /apps/{appId}/recordings/{testSetId}/import — Import test case changes into a recording — Bulk import test case changes: update existing test cases (by ID), insert new ones (without ID), and delete specified test cases. Requires scope: write.
| Name | Required | Description | Default |
|---|---|---|---|
| appId | Yes | Path parameter: appId | |
| testSetId | Yes | Path parameter: testSetId | |
| test_cases | No | | |
| delete_test_case_ids | No | IDs of test cases to delete | |
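A hedged sketch of assembling a request body for this endpoint, following the rule that test cases with an ID are updates and those without are inserts. The `id` and `name` keys inside each test case are assumptions, because the listing does not show the test case schema; only `test_cases` and `delete_test_case_ids` are named here.

```python
from typing import Any, Dict, List


def build_import_body(
    updates: List[Dict[str, Any]],
    inserts: List[Dict[str, Any]],
    delete_ids: List[str],
) -> Dict[str, Any]:
    """Combine updates, inserts and deletions into one import payload."""
    for case in updates:
        if "id" not in case:  # "id" is an assumed key name
            raise ValueError("an update must carry the test case ID")
    for case in inserts:
        if "id" in case:
            raise ValueError("an insert must not carry an ID")
    body: Dict[str, Any] = {}
    if updates or inserts:
        body["test_cases"] = [*updates, *inserts]
    if delete_ids:
        body["delete_test_case_ids"] = list(delete_ids)
    return body
```

Separating updates from inserts at the call site makes it harder to overwrite an existing test case by accident, which matters for a tool marked Destructive.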
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
The description discloses that the tool is destructive (can delete test cases) and requires write scope, adding context beyond the annotations that already indicate destructiveness.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two efficient sentences that front-load the endpoint and operations, with no wasted words.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the parameters and annotations, the description adequately covers the tool's behavior and requirements, though a more explicit format for test_cases would improve completeness.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
The description adds meaning by explaining that test_cases array contains objects (with or without IDs for update/insert) and that delete_test_case_ids specifies deletions, going beyond the bare schema.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool imports test case changes into a recording and details the three operations (update, insert, delete). It distinguishes itself from siblings like updateTestCase by indicating bulk import.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description explains the bulk operations and required scope, but does not explicitly state when to use this tool over alternatives like updateTestCase.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
listAPIKeys (grade A, Destructive)
GET /api-keys — List API keys — Requires scope: admin.
| Name | Required | Description | Default |
|---|---|---|---|
| No parameters | | | |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Description claims read-only listing but annotations have destructiveHint=true, creating a contradiction. No additional behavioral context beyond listing.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
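The contradiction flagged above (a GET listing whose annotations carry destructiveHint=true) is mechanical enough to lint for. The following is a rough sketch under stated assumptions: tool definitions are plain dicts with `name`, `description`, and `annotations` keys, and a description that starts with `GET ` is treated as read-only. The helper name `contradicts_annotations` is hypothetical, and the POST description in the sample data is abbreviated.

```python
def contradicts_annotations(tool):
    """Return True when a GET-style description is paired with destructiveHint=true.

    Assumption: the description leads with the HTTP method, as the
    descriptions in this listing do.
    """
    description = tool.get("description", "").lstrip()
    annotations = tool.get("annotations", {})
    return description.startswith("GET ") and bool(annotations.get("destructiveHint", False))


# Sample definitions shaped like the entries in this listing.
tools = [
    {
        "name": "listAPIKeys",
        "description": "GET /api-keys — List API keys — Requires scope: admin.",
        "annotations": {"destructiveHint": True},
    },
    {
        "name": "importRecording",
        "description": "POST /apps/{appId}/recordings/{testSetId}/import",
        "annotations": {"destructiveHint": True},
    },
]

flagged = [tool["name"] for tool in tools if contradicts_annotations(tool)]
print(flagged)
```

A check like this only surfaces candidates; whether the annotation or the description is wrong still needs a human decision.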
Is the description appropriately sized, front-loaded, and free of redundancy?
Single sentence, efficient, covers essential information.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Minimal but adequate for a list endpoint; lacks details like return format or pagination, and siblings are similar.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
No parameters in schema, so description adds no param info; baseline 4 applies.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
Clearly states 'List API keys' with HTTP method and required scope, distinguishing it from create/revoke siblings.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Mentions required admin scope but provides no guidance on when to use this tool versus alternatives like createAPIKey.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
listApps (grade A, Destructive)
GET /apps — List apps — Returns the tenant's apps. Use the optional q query parameter to name-filter (case-insensitive substring, e.g. ?q=orderflow → apps whose name contains 'orderflow'); without it the full paginated list is returned. Callers that know the app's folder / repo name should pass it as q to avoid paginating through hundreds of apps. Requires scope: read.
| Name | Required | Description | Default |
|---|---|---|---|
| q | No | Query parameter: q | |
| limit | No | Query parameter: limit | |
| offset | No | Query parameter: offset | |
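As an illustration of how the three query parameters combine, here is a small sketch that builds the request path with the standard library. The helper name `list_apps_path` is hypothetical, and the limit and offset values are arbitrary; the listing does not state defaults or maximums.

```python
from urllib.parse import urlencode


def list_apps_path(q=None, limit=None, offset=None):
    """Build the GET /apps path, adding only the parameters that were supplied."""
    params = {}
    if q:
        params["q"] = q  # case-insensitive substring filter on the app name
    if limit is not None:
        params["limit"] = limit
    if offset is not None:
        params["offset"] = offset
    query = urlencode(params)
    return "/apps" + ("?" + query if query else "")


print(list_apps_path(q="orderflow"))          # name-filtered lookup
print(list_apps_path(limit=50, offset=100))   # one page of the unfiltered list
```

Passing the known folder or repo name as `q` follows the description's own advice and avoids walking the paginated list.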
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
The description indicates a read-only GET operation ('Returns'), but annotations mark destructiveHint=true. This is a direct contradiction. Beyond this, the description does not disclose other behavioral traits like idempotency or side effects.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Three sentences front-loaded with the main purpose (GET /apps — List apps). The second sentence explains q with a concrete example, and the third gives a practical usage tip. No redundant words.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Adequate but incomplete: the description covers the main functionality and q parameter, but it lacks explanation of pagination details (e.g., default limit, response structure) and does not mention output format. The annotation contradiction further reduces completeness.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100%, so baseline is 3. The description adds substantial value for the q parameter (explaining case-insensitive substring filter with an example). Limit and offset are left implicit in 'paginated list,' but the schema already defines them.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states 'List apps' and 'Returns the tenant's apps,' with a specific verb and resource. It also explains the optional q parameter for filtering. While it doesn't explicitly differentiate from siblings like listAppsByCluster, the purpose is unambiguous.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Provides clear guidance: (a) use q when you know the app's name to avoid pagination; (b) without q, full paginated list is returned. It also notes the required scope 'read.' However, it does not compare with alternative sibling tools like listAppsByCluster.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
listAppsByCluster (grade A, Destructive)
GET /apps/by-cluster/{clusterId} — List apps in a cluster — Returns apps belonging to a specific cluster. More efficient than iterating all apps. Requires scope: read.
| Name | Required | Description | Default |
|---|---|---|---|
| clusterId | Yes | Path parameter: clusterId | |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
The description claims a read-only listing operation, yet annotations set destructiveHint: true. This is a serious contradiction; also readOnlyHint: false is inconsistent. The description fails to clarify the tool's side effects.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Description is three brief segments but includes redundant HTTP path. Not overly verbose, but could be more front-loaded without the path repetition.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Despite incomplete annotations and no output schema, the description covers purpose, efficiency, and scope. However, it omits return format and doesn't resolve the annotation contradiction, leaving gaps for a simple tool.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema has 100% coverage with 'Path parameter: clusterId'. Description adds 'belonging to a specific cluster', linking parameter to usage context, though only slightly beyond schema.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the verb 'List' and resource 'apps in a cluster', explicitly differentiating it from 'iterating all apps' to distinguish from sibling tool listApps.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Explicitly mentions efficiency advantage over iterating all apps (implying use when clusterId is known) and required scope 'read'. No explicit when-not-to-use, but context with sibling tools makes it clear.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
listAppsWithRecordings (grade A, Destructive)
GET /apps/with-recordings — List proxy apps with network recordings — Returns all k8s-proxy apps (origin.type=PROXY). These apps are auto-created by the Keploy k8s-proxy agent on first recording and contain network recordings of ingress HTTP traffic (as Keploy test cases) and egress dependency calls — database queries, external API calls, message queues — captured as Keploy mocks. Use listRecordings and getRecording to access the recorded request/response pairs and dependency mocks from live environments. Requires scope: read.
| Name | Required | Description | Default |
|---|---|---|---|
| No parameters | | | |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
The description indicates this is a read-only GET operation, but annotations include destructiveHint=true and readOnlyHint=false, which contradict the description. This is a serious inconsistency that undermines transparency. The description does not disclose any destructive behavior, so it fails to align with annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is concise, front-loads the key action (GET /apps/with-recordings), and packs essential information (return type, purpose, related tools, scope) without wasted words.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given no parameters and no output schema, the description provides adequate context: what the tool returns, the nature of the apps, and next steps. It could mention pagination or ordering, but this is not critical for a simple list endpoint.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
The input schema has no parameters (100% coverage), so the description does not need to add parameter details. Baseline score for zero parameters is 4, and the description adds no unnecessary param information.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the verb (List/GET), the specific resource (k8s-proxy apps with network recordings), and distinguishes from siblings like listApps by specifying 'proxy apps with network recordings'. It also explains the origin and purpose of these apps.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description provides context that this tool lists proxy apps, and directs users to listRecordings and getRecording for accessing the actual recordings. It also mentions required scope 'read'. While it doesn't explicitly state when not to use it, the context is sufficient for an agent to differentiate from siblings.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
list_branches (grade A, Destructive)
List Keploy branches on an app. Use this BEFORE any write tool (create_test_suite, update_test_suite, sandbox flows) to see if a branch already exists for the dev's current work. If one matches the dev's git branch, pass its branch_id (or branch_name) to the write tool. Otherwise call create_branch.
Output: JSON array of {id, name, status, created_at}. Status one of: open, review_requested, approved, changes_requested, merged, closed, conflict.
Optional status filter narrows the list (e.g., "open" hides merged/closed branches).
Output: {"branches": [{id, name, status, created_at, updated_at}, ...]}. Synchronous — no playbook to walk.
| Name | Required | Description | Default |
|---|---|---|---|
| app_id | Yes | Keploy app ID | |
| status | No | optional status filter — open / review_requested / approved / changes_requested / merged / closed / conflict | |
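The branch-selection step in the description can be sketched as follows. This is an illustrative sketch, not Keploy's client code: the helper `pick_branch_id` is hypothetical, it accepts both output shapes the description mentions (a bare array and a `{"branches": [...]}` wrapper) because the description is ambiguous on that point, and restricting matches to `open` branches is an added assumption.

```python
def pick_branch_id(response, git_branch):
    """Return the id of an open Keploy branch named like the git branch, or None.

    None means no reusable branch was found, so the caller would go on to
    call create_branch.
    """
    # The description shows two output shapes; handle both.
    branches = response if isinstance(response, list) else response.get("branches", [])
    for branch in branches:
        # Assumption: only "open" branches are reusable for new writes.
        if branch["name"] == git_branch and branch["status"] == "open":
            return branch["id"]
    return None


# Hypothetical response data for illustration.
sample = {
    "branches": [
        {"id": "b-1", "name": "feature/login", "status": "merged"},
        {"id": "b-2", "name": "feature/cart", "status": "open"},
    ]
}
print(pick_branch_id(sample, "feature/cart"))
print(pick_branch_id(sample, "feature/login"))
```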
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Description claims a read-only list operation, but annotations set destructiveHint=true, creating a direct contradiction. Per rules, score 1 when description contradicts annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Description is well-structured with purpose, usage, and output format front-loaded. Slightly redundant output mention, but overall efficient and easy to parse.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
For a list tool with no output schema, description fully specifies return format, status values, and synchronous behavior. Provides complete guidance for agent decision-making.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100%, but description adds value with an example of how the status filter works ('open hides merged/closed branches') and explains the effect, going beyond schema definitions.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool lists Keploy branches on an app, with specific verb and resource. It distinguishes from sibling tools like create_branch and update_test_suite by recommending its use before write operations.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Explicitly states when to use the tool (before any write tool) and what to do with the result: pass the matching branch's ID if one exists, otherwise call create_branch. Also gives context for filtering by status.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
listBranches (grade C, Destructive)
GET /apps/{appId}/branches — List Keploy branches on an app — Returns every Keploy branch on an app, optionally filtered by status. Used by MCP write tools and CI integrations to enumerate branches without shelling out to the CLI. Requires scope: read.
| Name | Required | Description | Default |
|---|---|---|---|
| appId | Yes | Path parameter: appId | |
| status | No | Query parameter: status | |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
The description says it's a GET list operation (read-only), but annotations set destructiveHint=true and readOnlyHint=false, directly contradicting the described behavior. This is a serious inconsistency.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is two sentences, front-loads the operation and resource, and efficiently conveys usage and authorization. No unnecessary words.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the annotation contradiction, the description fails to clarify the true behavioral nature. It also omits details like allowed status values or output format, making it incomplete for a tool with contradictory annotations.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema description coverage is 100%, so the baseline is 3. The description mentions 'optionally filtered by status', which merely restates the schema without adding new meaning or constraints.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool lists Keploy branches on an app, with a specific verb and resource. However, it does not differentiate from the sibling tool 'list_branches', which appears to serve the same purpose, preventing a score of 5.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description indicates it is used by MCP write tools and CI integrations, providing usage context. But it lacks explicit guidance on when not to use this tool or alternatives, and does not compare to the sibling 'list_branches'.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
listClusters (grade B, Destructive)
GET /clusters — List clusters — Returns all clusters for the authenticated company. Requires scope: read.
| Name | Required | Description | Default |
|---|---|---|---|
| No parameters | | | |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations indicate destructiveHint=true, yet description describes a GET operation with 'read' scope, implying non-destructive read. This contradiction is not addressed, severely undermining transparency.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Extremely concise two sentences, front-loaded with essential information. No superfluous content.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Covers basic purpose and auth requirement, but lacks details on response format, pagination, or reconciliation of contradictory annotation. Adequate for a simple tool but incomplete given annotation conflict.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
No parameters in schema; baseline score of 4 per rubric. Description adds no parameter info as none exist.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
Description clearly states 'List clusters' returning all clusters for the authenticated company. However, it does not differentiate from siblings like listAppsByCluster, leaving potential ambiguity.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Implies use for retrieving all clusters, but provides no guidance on when to use alternatives or exclusion criteria. No explicit when-not-to-use.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
listGenerationHistory (grade B, Destructive)
GET /apps/{appId}/generation-history — List generation history — Requires scope: read. Returns all entries (no pagination).
| Name | Required | Description | Default |
|---|---|---|---|
| appId | Yes | Path parameter: appId | |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
The description describes a read operation ('List generation history'), but annotations indicate destructiveHint: true, contradicting the description. This is a serious inconsistency.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is a single concise sentence front-loaded with essential information (HTTP method, path, purpose, scope, pagination). No redundant words.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
The description covers purpose, scope, and pagination, which is adequate for a simple 1-param listing tool. However, the annotation contradiction undermines completeness regarding side effects.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
With 100% schema description coverage, the schema already documents 'appId' as a path parameter. The description adds no new semantic meaning beyond restating the path template.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool lists generation history for an app, specifying the HTTP method and path. It distinguishes from sibling list tools by focusing on generation history.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description mentions a required scope ('read') but does not explicitly state when to use this tool vs alternatives or provide when-not guidance. Usage is implied by the name.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
listJobs (grade C, Destructive)
GET /jobs — List jobs — Requires scope: read.
| Name | Required | Description | Default |
|---|---|---|---|
| limit | No | Query parameter: limit | |
| offset | No | Query parameter: offset | |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
The description says 'GET /jobs' implying a read-only, idempotent operation, but annotations set destructiveHint=true and readOnlyHint=false, contradicting the described behavior. This creates confusion about whether listing jobs has side effects.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is very short (two items) and front-loaded with the key action. No wasted words, but could benefit from a slightly more structured presentation.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
The description lacks information about the return value (e.g., list of job objects), pagination behavior, or any additional context beyond the scope requirement. Given no output schema, this is a significant gap.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Input schema covers 100% of parameters (limit, offset) with basic descriptions. The description adds no additional meaning beyond the schema, so a baseline score of 3 is appropriate.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the HTTP method and resource ('GET /jobs — List jobs'), specifying what the tool does. However, it does not distinguish this listing tool from other similar listing tools like listTestRuns or listRecordings, which slightly reduces clarity.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description provides no guidance on when to use this tool versus alternatives (e.g., getJob, stopJob). It only mentions the required scope 'read', but lacks explicit context for selection among siblings.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
listLoadTestRuns (grade B, Destructive)
GET /apps/{appId}/load-tests — List load test runs — Requires scope: read. Returns all runs (no pagination).
| Name | Required | Description | Default |
|---|---|---|---|
| appId | Yes | Path parameter: appId | |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
The description states it is a GET operation that lists runs, suggesting read-only behavior. However, annotations have destructiveHint: true and readOnlyHint: false, contradicting this. The description does not resolve this inconsistency and adds no further behavioral details beyond scope and pagination.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is brief and includes the endpoint, action, required scope, and behavior (no pagination). It is concise, with no unnecessary words, and every element adds value.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
For a list operation with one parameter, the description covers the main behavior: it lists all runs without pagination and requires read scope. The absence of an output schema is partly compensated by mentioning the return nature, but no format details are provided. The contradiction with annotations reduces completeness slightly.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
The input schema has 100% coverage with a single required parameter 'appId' described as 'Path parameter: appId'. The description does not add additional meaning beyond what the schema provides, so baseline score of 3 is appropriate.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states 'List load test runs' with the endpoint path, indicating a specific verb and resource. It distinguishes from the sibling 'listTestRuns' by specifying 'load tests' in the path and name, though not explicitly differentiating.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description provides the required scope ('read') and notes that it returns all runs with no pagination, which implies usage context. However, it does not explicitly state when to use this tool versus alternatives or when not to use it.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
listMocks (grade A, Destructive)
GET /apps/{appId}/recordings/{testSetId}/mocks — List mocks for a recording — Returns mock reference metadata and optionally parsed mock specs for a test set. Use ?include_specs=true to download and parse the actual mock YAML from object storage. Requires scope: read.
| Name | Required | Description | Default |
|---|---|---|---|
| appId | Yes | Path parameter: appId | |
| testSetId | Yes | Path parameter: testSetId | |
| include_specs | No | Query parameter: include_specs | |
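As a rough illustration of the `include_specs` flag described above, the sketch below builds the request URL with Python's standard library. The base URL `https://api.example.invalid` and the helper name `list_mocks_url` are placeholders invented for this example; the listing does not state the real host or any client library.

```python
from urllib.parse import quote, urlencode

# Hypothetical base URL; the listing does not give the real host.
BASE_URL = "https://api.example.invalid"


def list_mocks_url(app_id: str, test_set_id: str, include_specs: bool = False) -> str:
    """Build the GET URL for listing the mocks of one recording (test set)."""
    path = f"/apps/{quote(app_id, safe='')}/recordings/{quote(test_set_id, safe='')}/mocks"
    # Send the flag only when parsed mock specs are wanted. Omitting it is
    # assumed (not documented) to return reference metadata only.
    query = urlencode({"include_specs": "true"}) if include_specs else ""
    return BASE_URL + path + (f"?{query}" if query else "")


print(list_mocks_url("app-1", "set-9", include_specs=True))
# https://api.example.invalid/apps/app-1/recordings/set-9/mocks?include_specs=true
```

Path segments are percent-encoded so that an ID containing `/` cannot change the route.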
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
The description states it is a GET request listing mocks, implying idempotent and non-destructive behavior. However, annotations set destructiveHint=true, contradicting the description. This inconsistency undermines trust and requires resolution.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is concise, conveying the endpoint, purpose, parameters, and scope in a few short sentences. Information is front-loaded with the HTTP method and endpoint path.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
The description adequately explains the tool's behavior for its complexity level, including return types (metadata, optionally parsed specs). It does not provide a full return structure but is sufficient for selection. The main gap is the annotation contradiction.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100%, and the description adds significant value beyond the schema by explaining that include_specs downloads and parses mock YAML from object storage. It clarifies the parameter's purpose and impact.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the verb (list) and resource (mocks for a recording), specifies the endpoint path, and distinguishes it from other list tools by focusing on mocks. It provides precise scope and return details.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description explains when to use the include_specs query parameter, but does not provide explicit guidance on alternatives or when not to use this tool. It implies usage context (listing mocks) but lacks comparisons to sibling tools like listRecordings or listTestSuites.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
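The annotation contradiction flagged in the review above (a GET endpoint carrying destructiveHint=true) is the kind of mismatch that could be caught mechanically. The sketch below is one possible check, not part of the listing: `readOnlyHint` and `destructiveHint` are the hint names cited in the reviews, while the dictionary shape and the function name `annotation_conflicts` are assumptions made for this example.

```python
def annotation_conflicts(tool: dict) -> list[str]:
    """Return readable conflicts between a tool's description and its hints."""
    description = tool.get("description", "")
    annotations = tool.get("annotations", {})
    problems = []
    # A description that opens with "GET " documents a read operation.
    if description.lstrip().upper().startswith("GET "):
        if annotations.get("destructiveHint") is True:
            problems.append("GET endpoint marked destructiveHint=true")
        if annotations.get("readOnlyHint") is False:
            problems.append("GET endpoint marked readOnlyHint=false")
    return problems


# Illustrative input modelled on the listMocks entry.
tool = {
    "name": "listMocks",
    "description": "GET /apps/{appId}/recordings/{testSetId}/mocks",
    "annotations": {"destructiveHint": True},
}
print(annotation_conflicts(tool))
# ['GET endpoint marked destructiveHint=true']
```

A missing hint is not reported, since the check only fires on values that are explicitly set.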
listRecordings (grade B, Destructive)
GET /apps/{appId}/recordings — List recording sessions — Returns test sets (recording sessions) for an app. Requires scope: read.
| Name | Required | Description | Default |
|---|---|---|---|
| appId | Yes | Path parameter: appId | |
| limit | No | Query parameter: limit | |
| offset | No | Query parameter: offset | |
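The listing does not document how `limit` and `offset` behave, so the following is only a sketch of the conventional offset-pagination loop these parameter names suggest. The page fetcher is injected as a callable, the response is assumed to be a plain list, and a page shorter than `limit` is assumed to mark the end; none of this is confirmed by the tool description.

```python
from typing import Callable, Iterator


def iter_offset_pages(
    fetch_page: Callable[[int, int], list],
    limit: int = 50,
) -> Iterator[dict]:
    """Yield every item by advancing `offset` until a short page appears."""
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        yield from page
        # A page shorter than `limit` is treated as the last one (assumption).
        if len(page) < limit:
            return
        offset += limit


# Stand-in for a real listRecordings call: slices a local list.
recordings = [{"id": f"rec-{i}"} for i in range(5)]


def fake_fetch(limit: int, offset: int) -> list:
    return recordings[offset:offset + limit]


print([r["id"] for r in iter_offset_pages(fake_fetch, limit=2)])
# ['rec-0', 'rec-1', 'rec-2', 'rec-3', 'rec-4']
```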
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
The description claims a read operation ('List recording sessions' and 'Returns test sets'), yet the annotations set destructiveHint=true, indicating potential side effects. This contradiction undermines transparency and misleads the agent.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is a single concise sentence front-loading the endpoint and purpose. Every part contributes, though the structure could be slightly improved with clearer separation of sections.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
With 3 parameters and no output schema, the description is too brief. It omits pagination (limit/offset), response format, and any detail beyond the basic purpose, leaving the agent underinformed.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema description coverage is 100%, so the schema already documents the three parameters. The description adds no additional meaning beyond the schema, meeting the baseline but not exceeding it.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the HTTP method, endpoint, and resource ('List recording sessions – Returns test sets (recording sessions) for an app'). It precisely identifies the action and resource, distinguishing it from siblings like 'getRecording' or 'importRecording'.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description does not provide explicit guidance on when to use this tool versus alternatives. It only implies usage by stating the resource and scope requirement, but no when-not or alternative recommendations are given.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
listSuiteReports (grade A, Destructive)
GET /apps/{appId}/test-runs/{runId}/suite-reports — List suite reports for a test run — Requires scope: read. Supports cursor-based pagination.
| Name | Required | Description | Default |
|---|---|---|---|
| after | No | Query parameter: after | |
| appId | Yes | Path parameter: appId | |
| runId | Yes | Path parameter: runId | |
| before | No | Query parameter: before | |
| page_size | No | Query parameter: page_size | |
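To make the cursor-based pagination mentioned above more concrete, here is a minimal sketch of a forward walk using `after` and `page_size`. The response shape is not documented in the listing: the field names `items` and `next_cursor` are hypothetical, as is the injected `fetch_page` callable.

```python
from typing import Callable, Iterator, Optional


def iter_cursor_pages(
    fetch_page: Callable[[Optional[str], int], dict],
    page_size: int = 20,
) -> Iterator[dict]:
    """Yield items page by page, passing the previous cursor as `after`."""
    after = None
    while True:
        response = fetch_page(after, page_size)
        yield from response["items"]         # "items" is an assumed field name
        after = response.get("next_cursor")  # "next_cursor" is assumed too
        if not after:
            return


# Stand-in server: cursors are just stringified list indexes.
reports = [{"id": f"sr-{i}"} for i in range(5)]


def fake_fetch(after: Optional[str], page_size: int) -> dict:
    start = int(after) if after else 0
    end = start + page_size
    return {
        "items": reports[start:end],
        "next_cursor": str(end) if end < len(reports) else None,
    }


print([r["id"] for r in iter_cursor_pages(fake_fetch, page_size=2)])
# ['sr-0', 'sr-1', 'sr-2', 'sr-3', 'sr-4']
```

Unlike an offset loop, the client never computes positions itself; it only echoes back whatever cursor the server returned.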
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
The description presents the tool as a read-only GET operation requiring 'read' scope, but annotations mark readOnlyHint as false and destructiveHint as true, creating an annotation contradiction. The description does not address this discrepancy or disclose any potential side effects.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is two sentences long, front-loads the endpoint and action, and includes essential details (scope, pagination) without any wasted words.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
For a list operation with no output schema, the description adequately covers the action, scope, and pagination. However, it does not mention sorting, filtering options (beyond cursor), or output format. Annotations fill some gaps but the description could be more complete.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Input schema has 100% description coverage, but the parameter descriptions are minimal (e.g., 'Query parameter: after'). The description adds value by stating 'Supports cursor-based pagination', which clarifies the purpose of the 'after', 'before', and 'page_size' parameters beyond the schema.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the action ('List suite reports'), the resource ('for a test run'), and includes the REST endpoint and required scope. It distinguishes itself from sibling tools like 'getSuiteReport' (singular) and 'listTestSuites' (different resource).
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description provides no guidance on when to use this tool versus alternatives (e.g., getSuiteReport for a single report). No exclusions or context for selection among siblings.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
listTestCaseReports (grade C, Destructive)
GET /apps/{appId}/test-reports/{reportId}/test-set-reports/{testSetReportId}/test-cases — List test case reports — Returns individual test case results with expected/actual diffs within a test set report. Requires scope: read.
| Name | Required | Description | Default |
|---|---|---|---|
| appId | Yes | Path parameter: appId | |
| limit | No | Query parameter: limit | |
| offset | No | Query parameter: offset | |
| reportId | Yes | Path parameter: reportId | |
| testSetReportId | Yes | Path parameter: testSetReportId | |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
The description states a GET operation that lists data, but annotations include destructiveHint=true and readOnlyHint=false, contradicting the read nature. The description fails to clarify this inconsistency or disclose other behaviors like pagination or error states.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is a single sentence with an HTTP path, which is concise but lacks structure. It includes redundant method/path info. It could be more efficiently front-loaded with key information.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
With no output schema, the description partially explains return values (test case results with diffs) but omits pagination details implied by limit/offset parameters. The annotation contradiction further undermines completeness.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100%, so the description is not required to detail parameters. However, it adds no additional meaning beyond the schema, such as explaining the limit and offset for pagination. Baseline score of 3 is appropriate.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool lists test case reports and returns individual test case results with expected/actual diffs within a test set report. It also specifies the HTTP method and path. However, it does not differentiate from sibling list tools like listTestReports or listTestSetReports, lacking explicit distinction.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description mentions a required scope ('read') but provides no guidance on when to use this tool versus alternatives. Given many sibling list tools, explicit context or exclusions are absent.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
listTestReports (grade C, Destructive)
GET /apps/{appId}/test-reports — List test run reports — Returns test run report summaries for an app with pass/fail counts. Requires scope: read.
| Name | Required | Description | Default |
|---|---|---|---|
| appId | Yes | Path parameter: appId | |
| limit | No | Query parameter: limit | |
| offset | No | Query parameter: offset | |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Description indicates a read operation (list), but annotations set readOnlyHint=false and destructiveHint=true, creating a direct contradiction. No additional behavioral context is provided.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Single sentence plus required scope. Efficient and front-loaded with HTTP method and path. No unnecessary details.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
No output schema, but description gives minimal return info (summaries with pass/fail). Lacks details on response format, pagination, or error cases. Annotation contradiction undermines completeness.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema covers all parameters (100% coverage). Description adds no extra meaning beyond the path mention; limit and offset are not explained beyond schema.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
Clearly states the tool lists test run reports for an app with pass/fail counts. Identifies the HTTP method and path, distinguishing it from sibling list tools.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Only mentions required scope ('read'). No guidance on when to use this vs alternatives like getTestReport or listSuiteReports. Missing exclusion criteria.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
listTestRuns (grade B, Destructive)
GET /apps/{appId}/test-runs — List test runs — List test runs for an app. Optional kind query param filters by run kind: rerecord (record_sandbox_test runs), sandbox_run (replay_sandbox_test runs), or test_suite_run (replay_test_suite live runs). Omit to return runs of every kind. Requires scope: read.
| Name | Required | Description | Default |
|---|---|---|---|
| kind | No | Query parameter: kind | |
| appId | Yes | Path parameter: appId | |
| limit | No | Query parameter: limit | |
| offset | No | Query parameter: offset | |
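The `kind` filter described above accepts three named values and is omitted to return every kind. A small sketch of how a caller might build the query string follows; the client-side validation and the function name `list_test_runs_query` are assumptions for illustration, and the server may well accept or reject values differently.

```python
from urllib.parse import urlencode

# The three run kinds named in the tool description.
RUN_KINDS = {"rerecord", "sandbox_run", "test_suite_run"}


def list_test_runs_query(kind=None, limit=None, offset=None) -> str:
    """Build the query string for GET /apps/{appId}/test-runs."""
    if kind is not None and kind not in RUN_KINDS:
        raise ValueError(f"unknown run kind: {kind!r}")
    params = {"kind": kind, "limit": limit, "offset": offset}
    # Drop unset parameters; omitting `kind` returns runs of every kind.
    return urlencode({k: v for k, v in params.items() if v is not None})


print(list_test_runs_query(kind="sandbox_run", limit=10))
# kind=sandbox_run&limit=10
```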
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
The description says 'GET' implying read-only, but annotations indicate destructiveHint=true and readOnlyHint=false, creating a contradiction. No additional behavioral traits are disclosed beyond that.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description front-loads the endpoint and purpose, then covers the kind filter and the scope requirement in a few short sentences. The only redundancy is that 'List test runs' appears twice.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
The description lacks details about pagination behavior, default values for limit/offset, and response format, which are important for a list tool with no output schema. It is incomplete.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100% with minimal descriptions like 'Path parameter: appId'. The tool description explains the optional kind filter and its three accepted values but says nothing about limit or offset, so a baseline score of 3 applies.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the action 'List test runs' and the resource, using a specific verb and noun. It distinguishes from siblings like getTestRun by being a list operation.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description provides no guidance on when to use this tool vs alternatives such as getTestRun or listTestSuites. It only mentions a scope requirement, which is a prerequisite, not usage context.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
listTestSetReports (grade B, Destructive)
GET /apps/{appId}/test-reports/{reportId}/test-set-reports — List test set reports within a run — Returns per-test-set results within a test run report. Requires scope: read.
| Name | Required | Description | Default |
|---|---|---|---|
| appId | Yes | Path parameter: appId | |
| limit | No | Query parameter: limit | |
| offset | No | Query parameter: offset | |
| reportId | Yes | Path parameter: reportId | |
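Judging by their paths, listTestReports, listTestSetReports, and listTestCaseReports nest one inside the next, with each level adding one ID. The sketch below builds the three paths from the templates shown in this listing; the helper names are invented for the example and no base URL is assumed.

```python
from urllib.parse import quote


def _seg(value: str) -> str:
    """Percent-encode one path segment."""
    return quote(value, safe="")


def report_paths(app_id: str, report_id: str, test_set_report_id: str) -> dict:
    """Return the three nested list paths, keyed by tool name."""
    reports = f"/apps/{_seg(app_id)}/test-reports"
    set_reports = f"{reports}/{_seg(report_id)}/test-set-reports"
    cases = f"{set_reports}/{_seg(test_set_report_id)}/test-cases"
    return {
        "listTestReports": reports,
        "listTestSetReports": set_reports,
        "listTestCaseReports": cases,
    }


for tool_name, path in report_paths("app-1", "rep-7", "tsr-3").items():
    print(tool_name, path)
```

Read top to bottom, this suggests a drill-down order: pick a report, then a test set report within it, then list its test cases.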
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
The description describes a read-only GET operation, but annotations set destructiveHint=true and readOnlyHint=false, creating a direct contradiction. The description does not address or resolve this inconsistency.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is two sentences long with no redundant information. It efficiently conveys the HTTP method, resource path, and brief return value.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
The description is vague about return values and does not mention pagination (limit/offset). Given the contradictory annotations and no output schema, it fails to provide sufficient context for an agent to use the tool correctly.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
The input schema already covers all 4 parameters with descriptions. The description adds context by explaining that the list is 'within a run', clarifying the role of reportId. This adds value beyond the schema.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool lists test set reports within a run, using the verb 'List' and specifying the resource. It distinguishes from siblings like listTestReports and listSuiteReports by focusing on per-test-set results within a test run.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description implies usage context (within a run) but does not explicitly state when to use this tool vs alternatives like listTestReports or listSuiteReports. No exclusions or conditions are provided.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
listTestSuites (grade B, Destructive)
GET /apps/{appId}/test-suites — List test suites — List test suites for an app. Optional has_sandbox_test query param filters by sandbox-test linkage: true returns only suites that have a sandbox test (linked=true / test_set_id populated); false returns only suites without one. Omit to return every suite. Requires scope: read. Supports cursor-based pagination.
| Name | Required | Description | Default |
|---|---|---|---|
| q | No | Query parameter: q | |
| after | No | Query parameter: after | |
| appId | Yes | Path parameter: appId | |
| before | No | Query parameter: before | |
| branch_id | No | Query parameter: branch_id | |
| page_size | No | Query parameter: page_size | |
| has_sandbox_test | No | Query parameter: has_sandbox_test | |
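Because `has_sandbox_test` has three meaningful states (true, false, omitted), a caller needs to keep "false" distinct from "not set". The following sketch shows one way to do that; the function name is hypothetical, and only `has_sandbox_test`, `page_size`, and `after` from the table above are covered.

```python
from typing import Optional
from urllib.parse import urlencode


def list_test_suites_query(
    has_sandbox_test: Optional[bool] = None,
    page_size: Optional[int] = None,
    after: Optional[str] = None,
) -> str:
    """Build the query string for GET /apps/{appId}/test-suites."""
    params = {}
    # Compare against None, not truthiness, so that False is still sent.
    if has_sandbox_test is not None:
        params["has_sandbox_test"] = "true" if has_sandbox_test else "false"
    if page_size is not None:
        params["page_size"] = page_size
    if after is not None:
        params["after"] = after
    return urlencode(params)


print(repr(list_test_suites_query()))                        # ''
print(list_test_suites_query(has_sandbox_test=False))        # has_sandbox_test=false
print(list_test_suites_query(has_sandbox_test=True, page_size=25))
# has_sandbox_test=true&page_size=25
```

A plain `if has_sandbox_test:` check would silently drop the `false` case and return every suite instead of only the unlinked ones.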
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
The description says 'List test suites' implying a read-only operation, but annotations set destructiveHint=true and readOnlyHint=false, creating a contradiction. No additional behavioral traits disclosed.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
A few short sentences with essential information: HTTP method, path, the has_sandbox_test filter, scope, and pagination. The only redundancy is that 'List test suites' appears twice.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Lacks output format details despite no output schema. Pagination mention is helpful, but contradictions and missing return structure reduce completeness.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100%, so baseline 3 applies. The description explains has_sandbox_test in detail but adds no meaning for q, branch_id, or the individual pagination parameters beyond the schema's property descriptions.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the HTTP method, path, and action: list test suites. It distinguishes from siblings like createTestSuite and deleteTestSuite by indicating a read operation.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Mentions required scope `read` and cursor-based pagination, providing context for when to use. However, no explicit comparison to other list tools or when-not conditions.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
normalizeSuiteReport (grade B, Destructive)
POST /apps/{appId}/test-runs/{runId}/suite-reports/{reportId}/normalize — Normalize a suite report — Requires scope: write.
| Name | Required | Description | Default |
|---|---|---|---|
| appId | Yes | Path parameter: appId | |
| runId | Yes | Path parameter: runId | |
| reportId | Yes | Path parameter: reportId | |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already indicate destructiveHint=true and readOnlyHint=false. The description adds the scope requirement ('Requires scope: write') and the HTTP method (POST), reinforcing mutation. However, it does not explain the nature of the destruction or what happens to the report during normalization.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is a single concise sentence containing the HTTP method, path, purpose, and required scope. However, the path is redundant and could be omitted for tighter text.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
For a destructive tool with no output schema, the description should explain the outcome or side effects of normalization. It lacks details on what changes occur, reversibility, or return values, making it incomplete for proper usage.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Input schema coverage is 100%, and description does not add meaning beyond the schema's 'Path parameter' labels. The parameter names (appId, runId, reportId) are self-explanatory from context, but no further guidance on values or format is provided.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states 'Normalize a suite report', a specific verb+resource combination, and the tool name distinguishes it from sibling 'normalizeTestRun' for test runs. However, the term 'normalize' is vague without elaboration on what the operation entails.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
No guidance is provided on when to use this tool versus alternatives. The description only mentions a required scope (write) but does not indicate prerequisites, side effects, or situations where normalization is appropriate.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
normalizeTestRun (grade C, Destructive)
POST /apps/{appId}/test-runs/{runId}/normalize — Normalize a test run — Requires scope: write.
| Name | Required | Description | Default |
|---|---|---|---|
| appId | Yes | Path parameter: appId | |
| runId | Yes | Path parameter: runId | |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already indicate destructive and non-read-only behavior. The description adds the required auth scope ('write') and HTTP method, which provides useful context beyond annotations. However, it does not explain what normalization entails or what data is affected.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is very concise with a single sentence and relevant details (HTTP method, scope). No words are wasted, though more detail could be beneficial.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
For a destructive tool, the description lacks explanation of what 'normalize' does, what is destroyed or modified, and what the return value is. Without output schema, this is a significant gap.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100% with both parameters described simply as path parameters. The description does not add any additional meaning or constraints beyond the schema, meeting the baseline expectation.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description states the action ('Normalize a test run') and includes HTTP method and path, which provides a basic purpose. However, the term 'normalize' is vague without further explanation, and it does not distinguish this tool from similar siblings like 'normalizeSuiteReport'.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description offers no guidance on when to use this tool versus alternatives. There are no conditions, prerequisites, or exclusions mentioned, leaving the agent without context for selection.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
record_sandbox_test (grade A, Destructive)
Record (or refresh) the sandbox test for one or more existing test suites — captures the request/response per step plus the outbound mocks (DB, downstream HTTP, etc.) against the dev's locally-running app, then links the captures onto the suite. Use this when the dev says "record", "rerecord", "re-record", "refresh the recordings", "capture mocks", or as the RECORD step in FROM-SCRATCH (after create_test_suite).
This tool resolves the app (if only a hint is given), resolves ONE OR MORE suites to record (by exact ids OR case-insensitive name substring match), and delegates to a headless playbook. Output produces a RERECORD REPORT — it answers "did the sandbox test get created and linked successfully?".
╔═══ PRE-CHECK — DID YOU ARRIVE HERE FROM A FAILED REPLAY? ═══╗
This tool refreshes the CAPTURED BASELINE (mocks + recorded request/response per step). It does NOT modify the suite's authored assert array or response.body; those are the contract as defined when the suite was created/updated. If the contract changed and you re-record without updating the suite first, the new rerecord fires the suite's stale assertions against the live app, gate-1-fails on the same diff, and the suite comes back unlinked.
Before calling THIS tool in response to a failed replay_sandbox_test or replay_test_suite, walk these checks:
1. Read failed_steps[].authored_assertions and authored_response_body in the most recent get_session_report (kind=sandbox_run / test_suite_run). The fields are inlined; no second tool call is needed unless the report predates the inlined fields.
2. For each failing step: does any authored assertion pin the diverging value? (e.g. assert {path: "$.order.status", expected: "created order"} where the diff says "expected 'created order', got 'created'".)
   * YES → call update_test_suite FIRST to update that assertion + the response.body field, THEN call this tool.
   * NO → safe to call this tool directly; the captured baseline drifts but no authored assertion blocks the rerecord.
3. If you can't find authored_assertions in the report (older format) AND don't already know the suite's shape, call getTestSuite({app_id, suite_id, branch_id}) to inspect the assert array before deciding. Don't guess.
REFUSE-RULE: if the dev confirms a contract change is intentional and the failing step has a pinned authored assertion on the diverging value, you MUST run update_test_suite before this tool. Calling record_sandbox_test FIRST in that case is the bug this pre-check exists to prevent; don't justify it as "let's just refresh the baseline first". The order is update → record → replay; never record → update.
╚═══════════════════════════════════════════════════════════════╝
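The pre-check above can be sketched as a small ordering function. This is only an illustration: `failed_steps` and `authored_assertions` are named in the description, but the `diverging_paths` field and the function name are assumptions, since the report's diff shape is not shown here. The sketch also assumes the dev has already confirmed the contract change is intentional.

```python
from typing import Dict, List


def tool_order_after_failed_replay(failed_steps: List[Dict]) -> List[str]:
    """Return the tool order implied by the pre-check.

    Each step is assumed to look like:
      {"authored_assertions": [{"path": "$.order.status", "expected": "..."}],
       "diverging_paths": ["$.order.status"]}   # hypothetical field
    """
    for step in failed_steps:
        pinned = {a["path"] for a in step.get("authored_assertions", [])}
        if pinned & set(step.get("diverging_paths", [])):
            # An authored assertion pins a diverging value: update the suite first.
            return ["update_test_suite", "record_sandbox_test", "replay_sandbox_test"]
    # No authored assertion blocks the rerecord.
    return ["record_sandbox_test", "replay_sandbox_test"]
```

A step whose assertion on `$.order.status` matches a diverging path yields update → record → replay; a step with no pinned assertion yields record → replay.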
===== BEFORE CALLING — one-time setup =====
(a) APP_ID RESOLUTION (skip if app_id is already known):
* Derive a likely app name from the cwd's basename (e.g. cwd=/home/dev/orderflow → "orderflow"). Lowercase it.
* Call listApps({q: ""}). The server does a case-insensitive server-side substring match, so you don't paginate the full tenant list (can be hundreds of apps on shared accounts).
* Exactly one match → use its id. Multiple → list them and ASK the dev which one (a wrong app_id silently routes traffic + suite creates into the wrong app). Exception: if the compose file / repo layout unambiguously pins one candidate (e.g. compose has service "producer" and one candidate is ".producer" while others are unrelated siblings), you may pick it AND tell the dev up-front so they can correct.
* Zero matches → ASK permission to create a new Keploy app with the derived name; on yes, call createApp({name, endpoint}) and use the returned id.
* Alternatively pass app_name_hint to THIS tool and the server resolves it (same rules; multiple/zero → typed error).
(b) KEPLOY BINARY VERIFICATION:
* Bash: "keploy --version" (or "~/.keploy/bin/keploy --version"). If it exits non-zero the binary is missing.
* If missing OR older than this MCP server was built against, install/upgrade: curl --silent -O -L https://keploy.io/ent/install.sh && source install.sh
* Re-verify with "keploy --version"; fail loudly if still absent (tell the dev where keploy put the binary so they can add it to PATH).
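The match-count rule in (a) can be sketched as follows. The function names and the `{"id", "name"}` shape of each match are assumptions for illustration; the real listApps response is not shown in this description.

```python
import posixpath
from typing import Dict, List, Optional, Tuple


def derive_app_name(cwd: str) -> str:
    """Lowercased basename of the working directory, e.g. /home/dev/orderflow -> orderflow."""
    return posixpath.basename(cwd.rstrip("/")).lower()


def resolve_app(matches: List[Dict[str, str]]) -> Tuple[str, Optional[object]]:
    """Map the number of listApps matches to the action described in (a)."""
    if len(matches) == 1:
        return ("use", matches[0]["id"])
    if len(matches) > 1:
        # Never guess between several apps: hand the names back so the dev can choose.
        return ("ask_which", [m["name"] for m in matches])
    # Zero matches: ask permission before creating a new app.
    return ("ask_create", None)
```

The compose-file exception (one candidate unambiguously pinned by the repo layout) is deliberately left out of this sketch.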
===== DOCKER-COMPOSE NETWORK RULE (absolute) =====
Use the SAME compose file + service that was used in the validate-curl phase. Do NOT point keploy at a second "keploy-only" compose file: docker-compose isolates each file into its own project + network, so the app container spawned by keploy cannot reach the DB/Kafka containers that validate brought up (and the network-name collision blocks keploy from starting). Correct flow:
(i) Validate phase: "docker compose up -d" (brings up app + deps on network _default).
(ii) Before calling record_sandbox_test, Bash: "docker compose stop <app_service> && docker compose rm -f <app_service>". Stop ONLY the app service; leave deps running so keploy's new app container can reach them on the existing network.
(iii) Pass app_command = "docker compose up <app_service>" (same compose file, same project → same network). container_name = the actual name set by compose (e.g. "orderflow-producer", not "producer").
===== RESOLUTION RULES (server-side, no guessing) =====
App: caller provides app_id OR app_name_hint. With a hint, the server does listApps({q: hint}). Zero matches → typed error; multiple → typed error listing them so Claude asks the dev.
Suites: DEFAULT IS "ALL LINKED". When the dev says "record my sandbox tests" / "rerecord everything" / "refresh my recordings" with no specific suite named, LEAVE BOTH suite_ids AND suite_name_hint UNSET. Do NOT list suites first and pass a comma-joined UUID list back — the CLI resolves "every linked suite for the app" itself, cleaner and less brittle. Only pass a narrower selector when the dev explicitly names suites:
suite_ids (comma-separated, exact) — when you already have the IDs.
suite_name_hint (case-insensitive substring match) — when the dev names suites by human phrasing like "the auth suite" or "deterministic". Every suite whose name contains the substring is recorded. If the dev asks to record suites that don't exist yet (zero match) → typed error. Any ≥1 match is fine. DO NOT prompt the dev for which suites to record — default to all linked if they didn't name any.
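As a rough sketch of the suite selection rules above (the description says the server and CLI apply them; this function only mirrors the stated behavior), assuming each linked suite is a dict with hypothetical `id` and `name` keys:

```python
from typing import Dict, List, Optional


def select_suites(linked_suites: List[Dict[str, str]],
                  suite_ids: Optional[str] = None,
                  suite_name_hint: Optional[str] = None) -> List[Dict[str, str]]:
    """Pick the suites to record from the app's linked suites."""
    if suite_ids:
        # Comma-separated, exact ids.
        wanted = {part.strip() for part in suite_ids.split(",") if part.strip()}
        chosen = [s for s in linked_suites if s["id"] in wanted]
    elif suite_name_hint:
        # Case-insensitive substring match on the suite name.
        needle = suite_name_hint.lower()
        chosen = [s for s in linked_suites if needle in s["name"].lower()]
    else:
        # Default: no selector means every linked suite.
        return list(linked_suites)
    if not chosen:
        raise ValueError("no suite matched the selector")
    return chosen
```

Calling `select_suites(suites)` with no selector returns everything, which matches the "ALL LINKED" default; a hint such as `"auth"` returns every suite whose name contains it.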
===== DISCOVERY — when the dev hands you a bare suite_id with no app_id / branch_id =====
Suites live on a (app_id, branch_id) tuple. A bare suite_id has NO on-disk hint about which app or branch holds it; you have to RESOLVE both before calling this tool. Walk these steps in order — STOP as soon as getTestSuite returns 200:
1. Detect the dev's git branch: Bash `git rev-parse --abbrev-ref HEAD` in app_dir. If exit non-zero / output is "HEAD" → not a git repo / detached HEAD; ASK the dev for the Keploy branch name.
2. Resolve candidate apps via the cwd basename: Bash `basename $(pwd)` → call listApps with q set to that basename. Usually 1–2 candidates. If 0 → ASK; if >1 → walk every candidate in step 4.
3. For each candidate app, call list_branches({app_id}) and find the branch whose `name` matches the git branch from step 1. That gives you {branch_id}. If no match → not this app, try next.
4. Verify with getTestSuite({app_id, suite_id, branch_id=<from step 3>}). 200 → resolved; 404 → wrong app/branch, try next.
5. If steps 2–4 exhaust without a hit, walk every OPEN branch on each candidate app via list_branches → getTestSuite. Then try main (branch_id omitted). If still nothing → ASK the dev for the {app_id, branch_id} pair.
The standard pattern when "search the suite by id" returns nothing is NOT "give up and ask the dev which app" — it's "the suite exists on a BRANCH, walk discovery". Suites created via create_test_suite + rerecord on a Keploy branch are INVISIBLE to a main-view listTestSuites; you have to scope each call to a branch.
After resolving once in a session, REUSE the {app_id, branch_id} for any subsequent suite-targeted call (replay_sandbox_test, update_test_suite, replay_test_suite); don't re-walk discovery for every action.
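The discovery walk can be sketched with the three tool calls injected as plain callables. Everything about their signatures is an assumption: `list_apps(hint)` and `list_branches(app_id)` are taken to return lists of dicts, `suite_exists(app_id, suite_id, branch_id)` stands in for "getTestSuite returned 200", and the `open` key on a branch is hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple


def discover_suite(suite_id: str, git_branch: str, app_hint: str,
                   list_apps: Callable[[str], List[Dict]],
                   list_branches: Callable[[str], List[Dict]],
                   suite_exists: Callable[[str, str, Optional[str]], bool],
                   ) -> Optional[Tuple[str, Optional[str]]]:
    """Return (app_id, branch_id) for a bare suite_id, or None if the dev must be asked."""
    apps = list_apps(app_hint)
    # Steps 3-4: prefer the Keploy branch named after the current git branch.
    for app in apps:
        for branch in list_branches(app["id"]):
            if branch["name"] == git_branch and suite_exists(app["id"], suite_id, branch["id"]):
                return app["id"], branch["id"]
    # Step 5a: walk every open branch on each candidate app.
    for app in apps:
        for branch in list_branches(app["id"]):
            if branch.get("open", True) and suite_exists(app["id"], suite_id, branch["id"]):
                return app["id"], branch["id"]
    # Step 5b: try main (branch_id omitted).
    for app in apps:
        if suite_exists(app["id"], suite_id, None):
            return app["id"], None
    return None
```

The returned pair is what the description asks to cache and reuse for later suite-targeted calls in the same session.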
===== PREREQUISITES =====
app_command: shell command that starts the dev's app (e.g. "docker compose up producer").
app_url: base URL the app listens on, e.g. http://localhost:8080.
app_dir: absolute path to repo root.
container_name if app_command is docker-compose.
keploy binary on PATH. If `which keploy` returns nothing, install it before calling this tool with: `curl --silent -O -L https://keploy.io/install.sh && source install.sh`.
===== AFTER CALLING — walk the playbook =====
The response includes a "playbook" array; execute its steps in order. The flow is HEADLESS — one background process, NDJSON progress events on a local file, no separate HTTP surface to bind. THERE IS NO SEPARATE CLEANUP STEP — the CLI exits on its own once phase=done is written.
1. Spawn the `keploy record sandbox --cloud-app-id …` process via Bash (run_in_background). Capture its PID into $KEPLOY_PID.
2. Poll progress by repeatedly calling Bash with `tail -n 1 $PROGRESS_FILE`. Each call returns instantly; the MCP round-trip between calls paces the loop. DO NOT wrap in a sleep loop: Claude Code's Bash rejects standalone `sleep N` and chained-sleep patterns. Read .phase off each line; stop when phase=done. The wait_for_done step's built-in `kill -0 $KEPLOY_PID` check is the safety net for silent early-exit (CLI died before writing the terminal event); it lets the loop exit instead of spinning forever on a dead process.
3. Read the terminal event (last line of $PROGRESS_FILE). It carries data.ok, data.error (on failure), data.test_run_id (on success).
4. On data.ok=true: call get_session_report(app_id, test_run_id) with verbose=true to surface the rerecord report. On data.ok=false: show data.error to the dev directly (optionally tail the log_file for stderr context) and SKIP get_session_report (there's no run to fetch).
Auto-replay + linkTestSetToSuite run INSIDE the CLI process before it writes phase=done — if the terminal event says ok=true, linkage already happened. You do NOT need to wait for a separate post-success window; the CLI doesn't exit until it's fully done.
INTERRUPTED FLOWS: if your conversation dies between step 1 and step 2 (Claude crashes, connection drops, dev cancels), the CLI keeps running in the background. It's not orphaned — it'll finish its run and write phase=done. To abort early, the dev can pkill -f "keploy.*sandbox" manually; otherwise just let it complete and resume by re-reading the progress file on the next turn.
===== NDJSON SCHEMA — the contract =====
Every line in the progress_file is one JSON object with this envelope:
{
  "ts": "",
  "command": "record" | "test",
  "phase": "",
  "message": "",
  "data": { ... phase-specific ... }   // optional
}
The phase vocabulary is intentionally extensible — new lifecycle phases get added over time as the CLI grows (started, agent_up, app_starting, suites_running_start, record_done, auto_replay_skipped, upload_done, linking_done, etc.). There are only TWO phases the AI must handle programmatically; everything else is informational and you should NOT switch on phase names you don't recognize:
phase != "done" → keep polling. Optional: surface message/data to the dev as ambient progress ("agent is starting...", "suites uploading..."), but never branch on a specific intermediate phase name.
phase == "done" → terminal event. Stop polling. The data envelope carries:
| Field | Type | Present | Meaning |
| --- | --- | --- | --- |
| data.ok | bool | | true on success, false on failure |
| data.error | string | only on ok=false | one-line failure summary |
| data.test_run_id | string | only on ok=true | pass to get_session_report |
| data.app_id | string | | echo of the app_id passed to the tool |
| data.artifact_dir | string | | local path to captured/replayed artifacts |
| data.dashboard_url | string | | UI link to drill into the run |
If you observe a phase you don't recognize, IGNORE it and keep polling. If "done" itself is renamed by a future CLI version, the wait_for_done step's PID-alive guard is your safety net (the poll loop exits when the CLI dies); surface log_file contents to the dev.
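A minimal sketch of the "only branch on done" rule, assuming the progress file content has already been read into a string (for example from the `tail` output). The function name is hypothetical; the `phase` and `data` keys come from the envelope described above.

```python
import json
from typing import Dict, Optional


def terminal_event(progress_text: str) -> Optional[Dict]:
    """Return the data of the terminal event, or None if polling should continue."""
    lines = [line for line in progress_text.splitlines() if line.strip()]
    if not lines:
        return None
    try:
        event = json.loads(lines[-1])
    except json.JSONDecodeError:
        # The last line may be partially written; poll again.
        return None
    if event.get("phase") != "done":
        # Intermediate or unrecognized phase: informational only.
        return None
    return event.get("data", {})
```

Callers should compare the result against `None` rather than testing truthiness, because a done event without a `data` object yields an empty dict.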
===== "ALL SUITES FAILED CAPTURE" — special signal =====
If you see a phase: "auto_replay_skipped" event with message: "all suites failed during rerecord; skipping replay + linking" ahead of the terminal done event, every suite failed at the CAPTURE phase (before auto-replay even ran). The CLI fails closed in this case — auto-replay and suite linking are SKIPPED, so every per_suite entry comes back linked=false.
Watch for this trap: the terminal data.ok=true because the CLI itself completed cleanly (it didn't crash; it just had nothing to record successfully). DO NOT read data.ok=true as "rerecord succeeded" — read <linked>/<total>. If linked == 0, this is a HARD failure that needs diagnosis, not a partial-linkage case.
ALWAYS surface the dashboard URL on this case. The terminal done event still carries data.dashboard_url and data.test_run_id (atg's TestSuiteRun was created during the capture phase); emit them verbatim so the dev can drill into per-step failures in the UI:
"0/N suites have a sandbox test — every suite failed during the capture phase, so auto-replay and linking were skipped. Dashboard: <data.dashboard_url> (test_run_id=)"
EDGE CASE: if data.test_run_id is empty, atg never inserted a TestSuiteRun (typically a pre-flight validation failure — branch-id rejection, app unreachable, etc.). The dashboard URL won't resolve. Skip the URL, surface the log_file contents instead so the dev can read the early-stage failure.
Recovery is the same as WHEN linked=false below — read failed_steps for each suite and pick route B (fix code) / C (update suite + record again) / SKIP. Don't infra-retry; capture-phase failures across every suite usually mean the app is broken, the suite shapes are stale, or the dev's local app isn't reachable.
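The trap described above (data.ok=true with nothing linked) can be sketched as a summary function. It assumes `per_suite` is a list of dicts carrying a boolean `linked` key, as the text implies; the function name and the exact wording of the returned strings are illustrative only.

```python
from typing import Dict, List


def summarize_record(data: Dict, per_suite: List[Dict]) -> str:
    """Summarize a record run from linked/total instead of trusting data.ok alone."""
    if not data.get("ok"):
        return f"record failed: {data.get('error', 'unknown error')}"
    total = len(per_suite)
    linked = sum(1 for suite in per_suite if suite.get("linked"))
    if total and linked == 0:
        summary = (f"0/{total} suites have a sandbox test: every suite failed during "
                   "the capture phase, so auto-replay and linking were skipped.")
        if data.get("test_run_id"):
            # The run exists, so the dashboard link resolves.
            return (f"{summary} Dashboard: {data.get('dashboard_url')} "
                    f"(test_run_id={data['test_run_id']})")
        # No run was inserted: point the dev at the log file instead.
        return f"{summary} No test_run_id; see the log_file for the early-stage failure."
    return f"{linked}/{total} suites have a sandbox test"
```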
===== LINKAGE VERIFICATION =====
After get_session_report returns, for EVERY suite that went into this record, call getTestSuite({suite_id}) and check whether the suite has a sandbox test (linked=true / non-empty test_set_id). A suite without a sandbox test cannot be replayed — replay_sandbox_test will 400 on it with "no sandboxed tests" until a successful record produces one.
===== WHEN linked=false — recovery rules =====
A suite with linked=false after record_sandbox_test means the record process couldn't produce a sandbox test for that suite. The SUITE ITSELF still exists; it just has no sandbox test. Diagnose WHY by reading the rerecord report's failed_steps for that suite:
No failed_steps OR pure infra error (link-commit / upload failed, no step diverged) → call record_sandbox_test AGAIN scoped to just the unlinked suite_ids. The tool is idempotent on the suite; safe to re-run.
failed_steps with assertion diffs (response shape, body fields, status code shifted from what the suite expected) → the suite is stale relative to current app behavior. The CONTRACT changed:
Change is INTENTIONAL (new field, renamed key, different status code is the new normal) → call update_test_suite to update the affected step's response / assertions to match the new contract, THEN call record_sandbox_test on the updated suite.
Change is UNINTENTIONAL (app regressed) → fix the app code first, then call record_sandbox_test. No suite update needed; the original test was correct.
failed_steps with 500s / handler crashes / connection refused → the app is broken at the wire level. Fix the app, then call record_sandbox_test. Don't update_test_suite to absorb a real failure.
NEVER:
Don't call create_test_suite to "redo" the suite — it already exists; re-creating authors a duplicate (see BEFORE CREATING in create_test_suite).
Don't blindly loop record_sandbox_test without diagnosing failed_steps first; if the cause is suite-vs-app mismatch, retries won't help.
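The recovery rules for an unlinked suite reduce to a short routing function. The `kind` key on each failed step (`"assertion_diff"` or `"server_error"`) is a hypothetical classification introduced for this sketch; the real report does not necessarily label steps this way, and deciding whether a change is intentional remains the dev's call.

```python
from typing import Dict, List


def recovery_route(failed_steps: List[Dict], change_is_intentional: bool = False) -> List[str]:
    """Return the ordered actions for a suite that came back linked=false."""
    kinds = {step.get("kind") for step in failed_steps}
    if "server_error" in kinds:
        # 500s, handler crashes, connection refused: the app is broken.
        return ["fix app", "record_sandbox_test"]
    if "assertion_diff" in kinds:
        if change_is_intentional:
            return ["update_test_suite", "record_sandbox_test"]
        # Regression: the original suite was correct.
        return ["fix app", "record_sandbox_test"]
    # No failed steps, or a pure infra error: re-running is safe.
    return ["record_sandbox_test"]
```

Note that create_test_suite never appears in any route, in line with the NEVER list above.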
===== MANDATORY OUTPUT — Phase 2 section =====
Your final message to the dev MUST contain a section with this exact heading (do NOT collapse into a single pass/fail table with the rerecord report; do NOT merge with Phase 1 or Phase 3):
### Phase 2 — Sandbox-test linkage
**<linked>/<total> suites have a sandbox test**
_Suites with a sandbox test_
| Suite name | suite_id | test_set_id | Capture pass/total |
| --- | --- | --- | --- |
| <name> | <suite_id> | <test_set_id> | <p>/<t> |
(emit even if zero — one row per linked suite, or "_(none)_" in place of rows)
_Suites without a sandbox test_ (omit ONLY if every suite linked)
| Suite name | suite_id | Likely cause |
| --- | --- | --- |
| <name> | <suite_id> | gate1 / gate2 / infra |
Likely-cause decoding: assertion diffs → gate 1 upstream-replay failure; upstream-passing + mock-replay-diff → gate 2 mock-determinism mismatch; zero failures + still unlinked → infra link-commit issue.
Then proceed to replay_sandbox_test ONLY for the suites that DID link; the unlinked ones will 400 on replay.
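One way to produce the mandated Phase 2 section is sketched below. The heading and column names are copied from the template above; the per-suite dict keys (`name`, `suite_id`, `test_set_id`, `passed`, `total`, `cause`) are assumptions for illustration.

```python
from typing import Dict, List


def render_phase2(suites: List[Dict]) -> str:
    """Render the Phase 2 linkage section as markdown."""
    linked = [s for s in suites if s.get("test_set_id")]
    unlinked = [s for s in suites if not s.get("test_set_id")]
    out = [
        "### Phase 2 — Sandbox-test linkage",
        f"**{len(linked)}/{len(suites)} suites have a sandbox test**",
        "_Suites with a sandbox test_",
        "| Suite name | suite_id | test_set_id | Capture pass/total |",
        "| --- | --- | --- | --- |",
    ]
    if linked:
        for s in linked:
            out.append(f"| {s['name']} | {s['suite_id']} | {s['test_set_id']} | {s['passed']}/{s['total']} |")
    else:
        out.append("_(none)_")  # emitted even when nothing linked
    if unlinked:  # omitted only when every suite linked
        out += [
            "_Suites without a sandbox test_",
            "| Suite name | suite_id | Likely cause |",
            "| --- | --- | --- |",
        ]
        for s in unlinked:
            out.append(f"| {s['name']} | {s['suite_id']} | {s.get('cause', 'infra')} |")
    return "\n".join(out)
```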
===== DO NOT =====
DO NOT fall back to raw keploy CLI (`keploy rerecord -t …`) if the MCP tool drops mid-flow; the CLI subcommand runs test-sets directly and does NOT update the suite's test_set_id. See MCP DISCONNECT RECOVERY in the top-level instructions.
| Name | Required | Description | Default |
|---|---|---|---|
| app_id | No | Keploy app ID. Provide this OR app_name_hint. | |
| app_dir | No | Absolute path to the repo root | |
| app_url | Yes | Base URL the app listens on | |
| branch_id | Yes | REQUIRED. Keploy branch ID (uuid). Resolve BEFORE calling: (1) `git rev-parse --abbrev-ref HEAD` in app_dir; (2) call create_branch tool with {app_id, name: <git branch>} → use returned branch_id. Direct writes to main are blocked. | |
| suite_ids | No | Comma-separated exact suite IDs to record. Provide this OR suite_name_hint. | |
| app_command | Yes | Shell command that starts the user's app | |
| network_name | No | Optional docker network name for compose setups. | |
| app_name_hint | No | Case-insensitive substring of the app name (typically cwd basename, e.g. "orderflow"). Used when app_id isn't known. | |
| container_name | No | REQUIRED when app_command is docker-compose. EXACT container_name from compose file. | |
| suite_name_hint | No | Case-insensitive substring of suite name (e.g. "happy-path" or "deterministic"). Every matching suite is recorded in one playbook run. | |
| timeout_seconds | No | Per-run timeout in seconds | 300 |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already indicate destructiveHint=true and idempotentHint=false. The description adds significant behavioral context: it explains what is captured (requests, responses, mocks), that it links captures to suites, that it does NOT modify assertions, and details the NDJSON progress file schema. It also covers error signals like 'all suites failed capture' and the implications for auto-replay and linking.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is very long, covering many edge cases and procedures. While well-structured with headings and bullet points, it could be more concise. The front-loaded purpose is clear, but the length may overwhelm an AI agent. Middle-range score reflects adequate structure but verbosity.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool's complexity (11 parameters, no output schema, destructive nature, open-world), the description is exceptionally complete. It covers every major aspect: prerequisites, discovery, docker-compose networking, progress file contract, recovery from failures, and linkage verification. No gaps remain for an AI agent to infer incorrectly.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema covers 100% of parameters with descriptions. The description adds further context, such as the relationship between app_id and app_name_hint, suite_ids vs suite_name_hint, and branch_id resolution steps. It clarifies default behavior (all linked suites if no suite selector) and provides setup details for app_command and container_name.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool's purpose: 'Record (or refresh) the sandbox test for one or more existing test suites' and explicitly lists use cases like 'record', 'rerecord', 'refresh the recordings'. It distinguishes itself from sibling tools like create_test_suite and replay_sandbox_test by positioning itself as the recording step after suite creation.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description provides extensive, explicit guidance on when to use this tool vs alternatives. It includes a pre-check for failed replays, instructions on resolving app IDs and suite IDs, and a recovery section for linked=false scenarios. It explicitly states when to call update_test_suite first, preventing incorrect sequencing.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
replay_sandbox_test (grade A, Destructive)
Replay the sandbox test for one or more suites against captured mocks — re-runs the suite's steps against the dev's locally-running app while keploy serves outbound calls (DB, downstream HTTP, etc.) from the captured mocks. Use this when the dev says "replay", "run my sandbox tests", "integration-test", "check if mocks still match" — keywords "sandbox" / "replay" / "mocks" / "integration-test" all map here. Also the REPLAY STEP of FROM-SCRATCH: call this LAST (after create_test_suite + record_sandbox_test) to give the dev the whole-app regression picture against the freshly captured mocks. Output produces a SANDBOX RUN REPORT — it answers "does the suite still hold up against its captured baseline?".
═══════════════════════════════════════════════════════════════════ DISAMBIGUATION — pick this tool vs. replay_test_suite: ═══════════════════════════════════════════════════════════════════
USE replay_sandbox_test (THIS TOOL) when the dev says:
* "run my sandbox tests" / "replay my sandbox tests"
* "integration-test my app" / "run the integration tests"
* "check if my mocks still match" / "replay against the captured mocks"
* "rerun my sandbox suite" (with the word "sandbox")
Trigger keyword: an explicit "sandbox" / "replay" / "mocks" / "integration-test" is a silent signal that the dev wants captured-mock replay, NOT live-app execution.
USE replay_test_suite INSTEAD when the dev says:
* "run the test suite" / "run my test suites" (bare, no "sandbox")
* "execute test suite X" / "run suite 810d3ebe…"
* "test the suite again" / "smoke test against the live app"
Bare verbs ("run / test / execute") applied to "the suite" without the word "sandbox" mean LIVE-APP execution, NOT captured-mock replay. replay_test_suite hits the dev's running localhost app directly via HTTP: no docker spin-up, no mocks.
After a record_sandbox_test run, the natural next step is THIS tool (replay against the just-captured mocks). After create_test_suite / update_test_suite, the natural next step is replay_test_suite (validate against the live app). When the dev's verb is bare and the prior turn doesn't make the intent obvious, ASK rather than picking sandbox-replay silently — code-change regressions can hide under "mock didn't match" failures.
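The keyword rule in the disambiguation section can be approximated with a simple text match. This is a deliberately crude sketch with a hypothetical function name: it only looks at the request text, so it cannot apply the "ASK when the verb is bare and the prior turn is ambiguous" rule, which needs conversation context.

```python
SANDBOX_KEYWORDS = ("sandbox", "replay", "mocks", "integration test")


def pick_replay_tool(request: str) -> str:
    """Route a dev request to a tool name based on the trigger keywords."""
    # Normalize hyphens so "integration-test" and "integration tests" both match.
    text = request.lower().replace("-", " ")
    if any(keyword in text for keyword in SANDBOX_KEYWORDS):
        return "replay_sandbox_test"
    # Bare "run / test / execute" on a suite means live-app execution.
    return "replay_test_suite"
```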
═══════════════════════════════════════════════════════════════════ DISCOVERY — when the dev hands you a bare suite_id with no app_id / branch_id: ═══════════════════════════════════════════════════════════════════
Suites live on a (app_id, branch_id) tuple. A bare suite_id has NO on-disk hint about which app or branch holds it; you have to RESOLVE both before calling this tool. Walk these steps in order — STOP as soon as getTestSuite returns 200:
1. Detect the dev's git branch: Bash `git rev-parse --abbrev-ref HEAD` in app_dir. If exit non-zero / output is "HEAD" → not a git repo / detached HEAD; ASK the dev for the Keploy branch name.
2. Resolve candidate apps via the cwd basename: Bash `basename $(pwd)` → call listApps with q set to that basename. Usually 1–2 candidates. If 0 → ASK; if >1 → walk every candidate in step 4.
3. For each candidate app, call list_branches({app_id}) and find the branch whose `name` matches the git branch from step 1. That gives you {branch_id}. If no match → not this app, try next.
4. Verify with getTestSuite({app_id, suite_id, branch_id=<from step 3>}). 200 → resolved; 404 → wrong app/branch, try next.
5. If steps 2–4 exhaust, walk every OPEN branch on each candidate app via list_branches → getTestSuite. Then try main (branch_id omitted). If still nothing → ASK the dev for the {app_id, branch_id} pair.
After resolving once in a session, REUSE the {app_id, branch_id} for subsequent suite-targeted calls; don't re-walk discovery for every action.
SCOPE — whole-app vs single-suite:
Default: LEAVE suite_ids UNSET → the tool resolves "every suite for the app that has a sandbox test (test_set_id populated)" and replays them all. Use this for "run my sandbox tests" / "check if my tests still pass" — whole-app regression. New suites auto-pick up.
Single / subset: PASS suite_ids when the dev names specific suites — "replay sandbox test for suite 810d3ebe-…", "replay only the auth suite", "run suite X and Y". The tool validates each requested id is actually a suite with a sandbox test (has test_set_id); an unlinked id gets a precise "record first" error instead of an opaque downstream CLI failure.
This tool resolves the app, picks the suite set per the rule above, and returns a single playbook that drives the replay for them. It does NOT record.
WHAT THIS TOOL DOES INTERNALLY (so you don't have to):
Resolves app_id — use the explicit app_id if the caller has one; otherwise pass app_name_hint (usually the cwd basename) and the server does listApps with a substring match. Multiple matches → error listing them; zero matches → error suggesting the dev generate a suite first.
Lists test suites for the app, keeps only those with a non-empty test_set_id. Zero linked → typed "no linked sandbox tests" error.
If suite_ids was passed, validates every requested id is in the linked-suites set; unlinked ids → typed error pointing to record_sandbox_test.
Returns the headless playbook — walk it exactly: spawn CLI in background, tail the progress file (PID-alive guard built in), read the terminal event, fetch the report. No separate cleanup step — the CLI exits on its own.
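The suite filtering and validation steps listed above can be sketched like this. The description attributes this logic to the tool itself; the sketch only mirrors it, with assumed `id` and `test_set_id` keys on each suite and `LookupError` standing in for the typed errors.

```python
from typing import Dict, List, Optional


def pick_replay_set(suites: List[Dict], suite_ids: Optional[str] = None) -> List[str]:
    """Return the suite ids to replay, enforcing the 'must be linked' rule."""
    linked = [s["id"] for s in suites if s.get("test_set_id")]
    if not linked:
        raise LookupError("no linked sandbox tests")
    if not suite_ids:
        # Whole-app regression: every suite that has a sandbox test.
        return linked
    requested = [part.strip() for part in suite_ids.split(",") if part.strip()]
    unlinked = [sid for sid in requested if sid not in linked]
    if unlinked:
        raise LookupError("record first with record_sandbox_test: " + ", ".join(unlinked))
    return requested
```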
===== PREREQUISITES ===== (Same as record_sandbox_test — if you just recorded, you already have them. Same docker-compose network rule applies: use the same compose file + service, stop the app service before calling, leave deps running.)
app_command: shell command that starts the dev's app (e.g. "docker compose up producer").
app_url: base URL the app listens on, e.g. http://localhost:8080.
app_dir: absolute path to repo root.
container_name if app_command is docker-compose.
keploy binary on PATH. If `which keploy` returns nothing, install it before calling this tool with: `curl --silent -O -L https://keploy.io/install.sh && source install.sh`.
===== AFTER CALLING — walk the playbook =====
Same headless playbook shape as record_sandbox_test: spawn keploy test sandbox --cloud-app-id … in the background via Bash, poll tail -n 1 $PROGRESS_FILE repeatedly (no sleep loops; the wait_for_done step has a built-in kill -0 $KEPLOY_PID guard so the loop exits if the CLI dies silently), read the terminal NDJSON event (phase=done, data.ok, data.test_run_id), and — if ok=true — call get_session_report(app_id, test_run_id) with verbose=true at the end. No separate cleanup step needed; the CLI exits cleanly once phase=done is written.
===== MANDATORY OUTPUT — Phase 3 section =====
Your final message to the dev MUST contain a section with this exact heading (do NOT merge with Phase 2; do NOT compress the failed-steps table even when failures are homogeneous):
### Phase 3 — Sandbox run report
Under it, emit the uniform three-subsection format owned by get_session_report:
(i) per-suite table: one row per suite in per_suite, passing suites included, columns = Suite name | passed/total steps.
(ii) failed-steps table: ONE ROW per entry in failed_steps[], columns = Suite | Step name | Method + URL | Expected → Actual status | mock_mismatch y/n. Never collapse rows.
(iii) Diagnosis + Recommendation (see get_session_report description for case-specific rules around mock_mismatch_dominant, repo-diff inspection, and the SKIP / FIX-CODE / FIX-TEST branching for fix-it follow-ups).
Do NOT print aggregate step totals across suites — they mix unrelated suites and hide where damage actually is.
===== ROLLUP LINE =====
Close the message with a final one-line rollup paragraph (no heading), in addition to the three phase sections. Mention the TOTAL number of suites replayed (which may exceed the count created in this session, because replay_sandbox_test covers every linked suite the app has). Example: "Rollup: inserted 4 suites, 4/4 with sandbox tests after record, 3/4 suites passed sandbox replay across the app's 6 linked suites — 1 failure is likely keploy egress-hook, file an issue with the IDs above."
===== DO NOT =====
DO NOT call update_test_suite or record_sandbox_test after this. The dev said RUN, not REFRESH.
DO NOT fall back to raw keploy CLI (`keploy test …`) if the MCP tool drops mid-flow; the CLI runs test-sets directly and does NOT write results back to the MCP-visible TestSuiteRun. See MCP DISCONNECT RECOVERY in the top-level instructions.
| Name | Required | Description | Default |
|---|---|---|---|
| app_id | No | Keploy app ID. Provide this OR app_name_hint. | |
| app_dir | No | Absolute path to the repo root | |
| app_url | Yes | Base URL the app listens on | |
| suite_id | No | OPTIONAL legacy single-suite alias for suite_ids. Prefer suite_ids when passing one or more — keeps the input shape uniform across record_sandbox_test / replay_sandbox_test. | |
| branch_id | Yes | REQUIRED. Keploy branch ID (uuid). Resolve BEFORE calling: (1) `git rev-parse --abbrev-ref HEAD` in app_dir; (2) call create_branch tool with {app_id, name: <git branch>} → use returned branch_id. Direct writes to main are blocked. | |
| suite_ids | No | OPTIONAL comma-separated suite IDs to narrow the run to a specific subset. LEAVE UNSET when the dev says "run my sandbox tests" / "replay everything" — the CLI then resolves "every suite with a sandbox test for the app" itself, which is the right default for whole-app regression and means new suites auto-pick up. Only pass an explicit list when the dev names suites ("replay just the auth suite", "run only suite 810d3ebe…"). | |
| app_command | Yes | Shell command that starts the user's app | |
| network_name | No | Optional docker network name for compose setups. | |
| app_name_hint | No | Case-insensitive substring of the app name (typically the cwd basename, e.g. "orderflow"). Used when app_id isn't known. Server does listApps({q: hint}) and requires exactly one match; 0 or >1 matches return an error. | |
| container_name | No | REQUIRED when app_command is docker-compose. EXACT container_name from compose file. | |
| timeout_seconds | No | Per-run timeout in seconds | 300 |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
The description details internal behavior (resolves app, lists suites, returns playbook), what the tool does not do (does not record), and prerequisites. While annotations indicate destructiveHint=true, the description adds context beyond annotations without contradiction.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is lengthy but well-structured with labeled sections (DISAMBIGUATION, DISCOVERY, SCOPE, etc.), front-loading the core purpose. Every section serves a clear purpose, earning its place given the tool's complexity.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool's complexity, no output schema, and 11 parameters, the description is remarkably complete. It covers discovery, prerequisites, after-calling playbook steps, mandatory output format, rollup line, and DO NOT items. No significant gaps.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Though schema coverage is 100%, the description adds significant usage context for parameters (e.g., how to resolve branch_id, when to leave suite_ids unset, treating app_name_hint). This goes beyond the schema's own descriptions.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool replays sandbox tests against captured mocks, with explicit verb+resource. It distinguishes itself from the sibling replay_test_suite via a dedicated disambiguation section, listing keywords and usage scenarios.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
An extensive disambiguation section explicitly tells when to use this tool vs. replay_test_suite, including specific keywords and natural next steps after other tools. It also provides prerequisites and step-by-step discovery guidance.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
replay_test_suite · A · Destructive
Replay an existing test suite live against the dev's LOCAL APP (no mocks, no docker spin-up). Returns a playbook that delegates to the enterprise CLI keploy test-suite, which walks each suite's steps, fires HTTP requests at base_path, evaluates assertions, and uploads per-suite results to api-server. The CLI prints a final pass/fail summary table plus a "Report:" URL to stdout. Output produces a TEST SUITE REPORT — it answers "does the suite hold up against the actual current system?".
═══════════════════════════════════════════════════════════════════ DISAMBIGUATION — pick this tool vs. replay_sandbox_test: ═══════════════════════════════════════════════════════════════════
USE replay_test_suite (THIS TOOL) when the dev says:
- "run the test suite" / "run my test suites"
- "execute test suite X" / "run suite 810d3ebe…"
- "test the suite again" / "rerun the suite"
- "validate the suite changes" (after editing a suite)
- "smoke test against the live app"
Default reading: bare verbs "run" / "execute" / "test" applied to "the suite" mean LIVE-APP execution, NOT replay against captured mocks.
USE replay_sandbox_test INSTEAD when the dev says:
- "run my sandbox tests" / "replay my sandbox tests"
- "integration-test my app" / "check if my mocks still match"
- "replay the captured tests" / "run against the recorded mocks"
Trigger keyword: "sandbox" / "replay" / "mocks" / "integration-test" — explicit signal that the dev wants captured-mock replay, not live-app.
After a record_sandbox_test run, the natural next step is replay_sandbox_test (replay against the freshly captured mocks). After create_test_suite / update_test_suite, the natural next step is replay_test_suite (validate the new/edited suite against the live app). When the dev's verb is bare ("run the suite") and the prior turn was create/update, prefer THIS tool. When the prior turn was record, ASK the dev if unsure — the verbs overlap and silently picking sandbox-replay can mask code-change failures with mock-replay noise.
USE THIS for: re-running previously-created suites against a running local app — verifying a regression after a code change, smoke-testing a branch, re-validating after editing a suite.
DO NOT USE this for: validating a NEW suite that hasn't been inserted yet (use create_test_suite — it runs the suite twice as part of validation), or for running suites against the captured-mock copy of the app (use replay_sandbox_test — captured-mock replay flow).
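The disambiguation rules above can be read as a small decision function. The sketch below is only an illustration of that reading, not the server's logic: the function name `pick_replay_tool` and the `"ask_dev"` sentinel are hypothetical, and real phrasing from a dev will be messier than a substring match.

```python
# Trigger keywords quoted from the description above.
SANDBOX_KEYWORDS = ("sandbox", "replay", "mocks", "integration-test")


def pick_replay_tool(utterance, prior_tool=None):
    """Return the tool name suggested by the dev's wording and the prior turn.

    "ask_dev" is a hypothetical sentinel meaning "the verbs overlap, ask first".
    """
    text = utterance.lower()
    # An explicit sandbox/mock keyword signals captured-mock replay.
    if any(keyword in text for keyword in SANDBOX_KEYWORDS):
        return "replay_sandbox_test"
    # A bare verb right after a record run is ambiguous per the description.
    if prior_tool == "record_sandbox_test":
        return "ask_dev"
    # Bare "run" / "execute" / "test" defaults to live-app execution.
    return "replay_test_suite"
```

For example, `pick_replay_tool("run the suite", prior_tool="update_test_suite")` selects `replay_test_suite`, matching the "prefer THIS tool" rule after create/update.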
═══════════════════════════════════════════════════════════════════ DISCOVERY — when the dev hands you a bare suite_id with no app_id / branch_id: ═══════════════════════════════════════════════════════════════════
Suites live on a (app_id, branch_id) tuple. A bare suite_id has no on-disk hint about which app or branch holds it; you have to RESOLVE both before calling this tool. Walk these steps in order — STOP as soon as getTestSuite returns 200:
1. Detect the dev's git branch: Bash `git rev-parse --abbrev-ref HEAD` in app_dir. If exit non-zero / output is "HEAD" → not a git repo / detached HEAD; ASK the dev for the Keploy branch name (don't invent one).
2. Resolve candidate apps via the cwd basename: Bash `basename $(pwd)` → call listApps with q= (case-insensitive substring match). Usually 1–2 candidates (e.g. "orderflow" → matches "orderflow" and "orderflow.producer"). If 0 → ASK the dev for the app_id; if >1 → walk every candidate in step 4.
3. For each candidate app, call list_branches({app_id}) and find the branch whose `name` matches the git branch from step 1. That gives you {branch_id, status}. If no match → that app's not the owner; try the next candidate. If status is closed/merged → ask the dev whether to use this branch anyway.
4. Verify with getTestSuite({app_id, suite_id, branch_id=<from step 3>}). 200 → resolved; 404 → wrong app, try next candidate.
If steps 2–4 exhaust without a hit, the suite is on a branch whose name doesn't match the git branch (the dev created it with a custom name, or it's on main). Then: call list_branches on each candidate app and try every OPEN branch's branch_id with getTestSuite, then try main (branch_id omitted). If still nothing → ASK the dev for the {app_id, branch_id} pair.
The reverse "look up suite_id globally" path doesn't exist — auditing is branch-scoped, so resolution starts from a branch context. After resolving once in a session, REUSE the {app_id, branch_id} for any subsequent suite-targeting call (delete_test_suite / update_test_suite / replay_test_suite); don't re-walk discovery for every action.
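The discovery walk above can be sketched as a single function. This is a minimal illustration under stated assumptions: `list_apps`, `list_branches` and `get_test_suite` are hypothetical callables standing in for the listApps, list_branches and getTestSuite tools, the record shapes (`id`, `name`, `status`) and the `"open"` status string are guesses, and the closed/merged "ask the dev" branch of step 3 is left out for brevity.

```python
def resolve_suite_owner(suite_id, git_branch, app_hint,
                        list_apps, list_branches, get_test_suite):
    """Return (app_id, branch_id), or None when the dev has to be asked.

    branch_id is None when the suite was found on main (branch_id omitted).
    get_test_suite is assumed to return an HTTP-style status code.
    """
    candidates = list_apps(app_hint)  # step 2: substring match on app name
    if not candidates:
        return None
    branches_by_app = {app["id"]: list_branches(app["id"]) for app in candidates}

    # Steps 3-4: try the branch whose name matches the git branch.
    for app_id, branches in branches_by_app.items():
        for branch in branches:
            if (branch["name"] == git_branch
                    and get_test_suite(app_id, suite_id, branch["id"]) == 200):
                return app_id, branch["id"]

    # Fallback: every open branch on each candidate app, then main.
    for app_id, branches in branches_by_app.items():
        for branch in branches:
            if (branch.get("status") == "open"
                    and get_test_suite(app_id, suite_id, branch["id"]) == 200):
                return app_id, branch["id"]
        if get_test_suite(app_id, suite_id, None) == 200:
            return app_id, None
    return None
```

Caching the returned pair for the rest of the session corresponds to the REUSE advice above.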
═══════════════════════════════════════════════════════════════════ INPUTS ═══════════════════════════════════════════════════════════════════
app_id (required) — Keploy app ID. Same value used for create_test_suite / list_branches.
branch_id (required) — Keploy branch UUID. Resolve via the explicit two-step flow BEFORE calling: (1) Bash `git rev-parse --abbrev-ref HEAD` in app_dir; (2) call create_branch tool with {app_id, name: } — find-or-create returns {branch_id, ...}; pass it here. Direct main writes are blocked.
base_path (required) — base URL of the dev's local app, e.g. http://localhost:8080. Each suite step's relative path is appended to this.
suite_ids (optional) — list of suite IDs to run. Omit / empty = run every suite registered for app_id on the branch.
header (optional) — single header to inject into every request, e.g. "Cookie: session=…". Same shape as the CLI's -H flag.
app_dir (optional) — absolute path to the dev's repo root (where the app is running). Defaults to '.' (cwd). The CLI invocation cd's here.
═══════════════════════════════════════════════════════════════════ HOW THIS TOOL WORKS ═══════════════════════════════════════════════════════════════════
This tool DOES NOT execute the suite itself. It returns a "playbook" — a small array of shell steps for you (Claude) to walk via Bash. The playbook spawns the enterprise CLI keploy test-suite in the foreground; the CLI:
Validates the branch exists + is writable (fails fast with a clear message if not).
Loads suites from api-server (filtered by --suite-id when supplied; otherwise every suite on the branch).
For each suite: fires step requests at base_path, evaluates assertions, records per-step results.
Uploads a TestSuiteRun + TestSuiteReport entry to api-server (?branch_id=).
Prints a summary table to stdout, exits 0 on all-pass / 1 on any failure.
Walk the playbook in order. Surface the CLI's stdout to the dev — the table shows which suites passed / failed / were "buggy" (suite-level verdict separate from individual step failures).
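Walking a playbook in order and stopping on the first failure might look like the sketch below. The playbook's actual shape is not shown in the description, so this assumes it is a plain list of shell command strings; `walk_playbook` is a hypothetical helper name.

```python
import subprocess


def walk_playbook(steps, app_dir="."):
    """Run each shell step in order from app_dir; stop at the first failure.

    Returns (exit_code, stdout_per_step) so the caller can surface the
    output to the dev. Assumes each step is a shell command string.
    """
    outputs = []
    for step in steps:
        proc = subprocess.run(step, shell=True, cwd=app_dir,
                              capture_output=True, text=True)
        outputs.append(proc.stdout)
        if proc.returncode != 0:
            # A non-zero exit ends the walk; later steps are not attempted.
            return proc.returncode, outputs
    return 0, outputs
```

With the exit-code convention described above, a return code of 0 would mean every suite passed and 1 that at least one failed.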
PREREQUISITES the playbook assumes:
The dev's app is up and reachable at base_path.
`keploy` binary is on PATH. If missing, install before calling this tool: `curl --silent -O -L https://keploy.io/install.sh && source install.sh`.
Either ~/.keploy/cred.yaml exists (API key) or KEPLOY_API_KEY is exported.
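Two of these prerequisites can be checked locally before calling the tool. The following is a rough pre-flight sketch, not part of Keploy: `missing_prerequisites` is a hypothetical helper, its `env`, `home` and `which` parameters exist only to make it testable, and it does not probe whether the app is reachable at base_path.

```python
import os
import shutil
from pathlib import Path


def missing_prerequisites(env=None, home=None, which=shutil.which):
    """Return a list of human-readable problems; empty means both checks pass."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    problems = []
    # The playbook shells out to the keploy CLI, so it must be on PATH.
    if which("keploy") is None:
        problems.append("keploy binary not on PATH")
    # Either the credentials file or the exported API key is enough.
    has_cred_file = (home / ".keploy" / "cred.yaml").is_file()
    if not has_cred_file and not env.get("KEPLOY_API_KEY"):
        problems.append("no ~/.keploy/cred.yaml and KEPLOY_API_KEY is not exported")
    return problems
```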
| Name | Required | Description | Default |
|---|---|---|---|
| app_id | Yes | Keploy app ID | |
| header | No | Optional single request header, e.g. "Cookie: session=…". Injected on every step. | |
| app_dir | No | Absolute path to the dev's repo root. Defaults to '.' (cwd). The CLI cd's here. | |
| base_path | Yes | Base URL of the dev's local app, e.g. http://localhost:8080 | |
| branch_id | Yes | REQUIRED. Keploy branch UUID. Resolve via two-step flow: (1) `git rev-parse --abbrev-ref HEAD` in app_dir; (2) call create_branch tool. Direct main writes are blocked. | |
| suite_ids | No | Optional comma-separated suite IDs to run. Omit to run every suite registered for app_id on the branch. |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
The description discloses that the tool does NOT execute the suite but returns a playbook, details CLI behavior, prerequisites, and output format. This adds significant value beyond the annotations, which already include destructiveHint=true and other hints.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is well-structured with clear sections (disambiguation, discovery, inputs, how it works). It is somewhat verbose but every section serves a purpose; minor redundancy in discovery steps.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool's complexity (6 params, no output schema), the description is exceptionally thorough: it covers prerequisites, CLI behavior, error handling, and even includes a discovery flow for resolving ambiguous suite IDs. No gaps.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100%, but the description adds rich context such as the branch_id resolution flow, default for app_dir, and shape of header. It clearly explains optional vs required semantics.
Does the description clearly state what the tool does and how it differs from similar tools?
The description explicitly states 'Replay an existing test suite live against the dev's LOCAL APP' and contrasts with replay_sandbox_test, making the purpose crystal clear and distinct from siblings.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The disambiguation section provides explicit when to use this tool versus replay_sandbox_test, including keyword triggers and prior tool context. It also states when NOT to use (e.g., new suite validation).
revokeAPIKey · A · Destructive
DELETE /api-keys/{keyId} — Revoke an API key — Requires scope: admin.
| Name | Required | Description | Default |
|---|---|---|---|
| keyId | Yes | Path parameter: keyId |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already indicate destructiveHint=true, and the description adds the scope requirement and HTTP method, providing extra context beyond annotations without contradiction.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is a single, well-structured sentence that includes method, resource, action, and scope, with no unnecessary words.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
For a simple tool with one parameter and no output schema, the description covers the essential information: what it does and the authentication requirement.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100%, and the description does not add extra meaning beyond the schema's parameter description ('Path parameter: keyId'). Baseline score of 3 is appropriate.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the HTTP method (DELETE), resource (API key), and action (revoke), distinguishing it from sibling tools like createAPIKey and listAPIKeys.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description includes the required scope (`admin`), which informs usage context. It does not explicitly mention when not to use it or alternatives, but the purpose is clear.
run_and_report · A · Destructive
Run test suites and return results with failures and coverage.
!! DO NOT USE for local-app "tests for my changes" flows !! This tool sends the run to the SaaS backend which REJECTS private/localhost URLs ("IPv6 address is private / reserved"). It only works when base_url points at a PUBLIC, non-loopback address (a staging/prod deployment).
For local-app testing, use record_sandbox_test / replay_sandbox_test instead — they drive the keploy local agent which happily records against http://localhost.
| Name | Required | Description | Default |
|---|---|---|---|
| app_id | Yes | Keploy app ID | |
| timeout | No | Per-request timeout in seconds | |
| base_url | Yes | PUBLIC target API base URL (not localhost). For localhost, use record_sandbox_test. |
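Before choosing between this tool and the sandbox tools, a client can make a rough local guess at whether a base_url would be rejected. The sketch below uses Python's standard `ipaddress` module; it is an approximation of the rule described above, not the backend's actual validation, and `looks_local` is a hypothetical name. DNS names are not resolved, so a public-looking hostname that points at a private address would slip through.

```python
import ipaddress
from urllib.parse import urlsplit


def looks_local(base_url):
    """Return True when base_url clearly targets a loopback or private host."""
    host = urlsplit(base_url).hostname
    if host is None:
        return True  # no parseable host: treat as unusable for a SaaS run
    if host == "localhost" or host.endswith(".localhost"):
        return True
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return False  # a DNS name; this sketch does not resolve it
    return ip.is_loopback or ip.is_private or ip.is_reserved or ip.is_link_local
```

When `looks_local(base_url)` is true, the description points to record_sandbox_test / replay_sandbox_test instead of this tool.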
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations include destructiveHint: true, indicating side effects. The description adds context that the tool sends runs to a SaaS backend and constraints on base_url. It does not contradict annotations. However, it could elaborate on specific side effects (e.g., creation of test runs, impact on coverage data).
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is three sentences long, with the purpose stated first, followed by a clear warning and explanation. Every sentence serves a distinct purpose: stating functionality, highlighting restriction, and offering alternatives. No redundant or extraneous information.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given there is no output schema, the description could better specify the structure of the returned results (e.g., format of failures and coverage). For a tool that runs tests, the description is missing details on whether results are returned synchronously or require polling. However, the guidance on usage and constraints is thorough.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema description coverage is 100%, so the schema already documents all three parameters. The description adds value by emphasizing the base_url constraint and pointing to alternatives, but does not provide additional semantic details for app_id or timeout beyond what is in the schema.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states 'Run test suites and return results with failures and coverage', which specifies the verb (run), resource (test suites), and output (results with failures and coverage). It differentiates from sibling tools like start_rerecord_session by explicitly stating the localhost restriction, making the tool's specific use case clear.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description provides explicit guidance on when not to use the tool ('DO NOT USE for local-app') and explains the reason (SaaS backend rejects private/localhost URLs). It clearly recommends alternative tools (record_sandbox_test / replay_sandbox_test) for localhost testing, leaving no ambiguity about usage context.
runTestSuites · A · Destructive
POST /apps/{appId}/test-suites/run — Run test suites — Run test suites against a PUBLIC target URL. DO NOT use for local-app / localhost runs — base_url must be reachable from the SaaS backend (rejects loopback / private IPs as 400 'invalid baseURL'). For localhost runs use the MCP tool record_sandbox_test (keploy agent). Optional sandbox_mode field: ""|"rerecord"|"integration_test" — the sandbox modes are primarily used through MCP's record_sandbox_test / replay_sandbox_test tools. Requires scope: write.
| Name | Required | Description | Default |
|---|---|---|---|
| auth | No | Optional auth bundle the runner injects into every step. Carries an `authtype` discriminator (BearerToken / BasicAuth / APIKeyAuth / CookieAuth / LoginCurl / None) plus the matching variant block. See models.Auth in pkg/models/e2e.go for the full shape. | |
| appId | Yes | Path parameter: appId | |
| timeout | No | Per-request timeout in seconds; 0 means use the runner default. | |
| base_url | Yes | PUBLIC target URL the SaaS backend will hit. Loopback / private IPs are rejected with 400. For localhost runs use the MCP record_sandbox_test tool, not this endpoint. | |
| rate_limit | No | Per-second cap on outgoing requests; 0 means unbounded. | |
| sandbox_mode | No | Empty for normal in-backend runs. `rerecord` and `integration_test` switch to sandbox flow where the local keploy agent or k8s-proxy drives the run. Surfaced for completeness; MCP tools (record_sandbox_test / replay_sandbox_test) are the supported entry points. | |
| test_suite_ids | No | Suite IDs to include in the run. Empty/omitted means "run all suites for the app" — same default the GraphQL surface applies. |
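As a rough illustration of how the parameters in this table might combine into a request, the sketch below builds the path and a JSON body. Only the path template and field names come from the listing above; that the fields travel in a JSON body, that test_suite_ids is a list, and the helper name `build_run_request` are all assumptions. The `auth` and `sandbox_mode` fields are omitted because their exact shapes are not shown here.

```python
import json


def build_run_request(app_id, base_url, test_suite_ids=None,
                      timeout=0, rate_limit=0):
    """Return (method, path, json_body) for a run against a PUBLIC base_url.

    timeout=0 and rate_limit=0 follow the table: runner default / unbounded.
    """
    path = f"/apps/{app_id}/test-suites/run"
    body = {"base_url": base_url, "timeout": timeout, "rate_limit": rate_limit}
    # Leaving test_suite_ids out means "run all suites for the app".
    if test_suite_ids:
        body["test_suite_ids"] = list(test_suite_ids)
    return "POST", path, json.dumps(body)
```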
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already indicate destructiveHint=true and readOnlyHint=false. Description adds that the tool rejects loopback/private IPs with 400 error and mentions scope requirement, providing useful behavioral context beyond annotations.
Is the description appropriately sized, front-loaded, and free of redundancy?
Three sentences covering endpoint, rule, and optional field. Efficient but the sandbox_mode reference is out of place since it's not in schema.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Adequately covers purpose and usage restrictions but does not explain output format or differentiate from similar siblings like run_test_suite or run_and_report. Also lacks description of what 'run test suites' entails in terms of results.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
The schema has one parameter (appId) with description, but the description mentions a non-existent sandbox_mode field not in the schema, causing confusion. Adds no additional meaning for appId.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool runs test suites against a public target URL and explicitly warns against using for localhost, distinguishing it from sibling tools.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Explicitly says 'DO NOT use for local-app / localhost runs' and directs to record_sandbox_test for localhost, providing clear when-to-use and alternative.
scaffold_pipeline_workflow · A · Destructive
Generate the exact CI workflow YAML to add keploy sandbox tests to a pull-request pipeline, and tell you where to write it. Use this when the dev asks to "add keploy sandbox tests to my pipeline" / "wire keploy into CI" / "run keploy on PR" / "add a CI job for keploy" — the server emits the file contents verbatim so you don't have to compose the flag list yourself.
===== GOAL =====
Write a CI workflow file that runs keploy test sandbox --cloud-app-id <uuid> --app-url <url> on pull requests and gates the PR on the result. NEVER kick off an actual test run in this flow — it is pure file authoring, ends with the file on disk. DO NOT fire replay_sandbox_test, record_sandbox_test, replay_test_suite, or any other run-starting MCP tool here.
===== HOW (absolute) =====
Call this tool. It returns { file_path, content, summary }. Write the "content" to "file_path" VERBATIM via your Write tool — NO flag renames, NO flag removals, NO step reordering, NO synthesis. The server owns the YAML template; your job is only to (1) resolve the inputs from the repo and api-server and (2) Write the returned content. Do NOT compose the YAML yourself from general knowledge — flag drift (missing --cloud-app-id, inventing --app) is the most common bug when Claude improvises.
DO NOT ASK the dev for confirmation before writing. Resolve everything from the repo + api-server, pick the GitHub Actions default, call this tool, Write the file. The dev's prompt is already the go-ahead.
===== STEPS =====
DETECT THE CI SYSTEM:
Default = GitHub Actions (biggest share). File = .github/workflows/keploy-sandbox.yml.
If .gitlab-ci.yml exists → GitLab (not yet supported by this tool; tell the dev and stop).
If .circleci/config.yml exists → Circle (not yet supported; tell the dev and stop).
Otherwise → GitHub Actions.
RESOLVE VALUES by calling MCP tools + reading the repo:
app_id: call listApps({q: ""}). Exactly one → use its id. Multiple → pick the one whose name most specifically matches the repo's primary service (e.g. "orderflow.producer" wins over "orderflow" when there's a ./producer directory); mention which you picked in the final message. Zero → stop and tell the dev to create the app + rerecord first.
suite_ids: DO NOT pass this arg by default. An empty suite_ids means the CLI resolves "every linked sandbox suite for the app" at CI run time — which is what you want (new suites auto-pick up without workflow edits). The tool still verifies there's ≥1 linked suite at scaffold time so the first PR run doesn't fail empty-handed. Only pass suite_ids when the dev explicitly narrows ("run only the auth suite in CI"); don't pin "all current suites" — that's staleness waiting to happen.
compose_file: READ THE REPO. Default is docker-compose.yml. AVOID passing a docker-compose-keploy.yaml variant that has `networks: default: external: true` — those variants only work locally, where another compose run has already created the external network. In CI the runner starts clean and `external: true` fails with "network not found". If the primary docker-compose.yml brings up the full app (deps + app service), use it end-to-end.
app_service, container_name, app_port: read from the SAME compose_file you picked above. app_service = the service key (e.g. "producer"); container_name = that service's container_name: field in that same compose file (e.g. "orderflow-producer" if compose_file=docker-compose.yml, but "producer" if compose_file=docker-compose-keploy.yaml — THESE DIFFER, pick consistently); app_port = the host-side of its ports: mapping.
app_url = http://localhost:<app_port>. The tool derives this; you don't pass it separately.
CALL THIS TOOL with app_id, app_service, container_name, app_port, compose_file (and suite_ids only if the dev explicitly narrowed scope). It returns { file_path, content, summary }. Write the "content" to the "file_path" VERBATIM.
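Reading app_service, container_name and app_port from one compose file, as the steps above require, could be sketched like this. The function works on an already-parsed compose mapping (for instance the result of a YAML loader) so it stays dependency-free; `resolve_compose_values` is a hypothetical helper, and it only handles the first `ports` entry with an explicit host port, not port ranges or container-only mappings.

```python
def resolve_compose_values(compose, app_service):
    """Derive the scaffold arguments for one service from a parsed compose file."""
    service = compose["services"][app_service]
    container_name = service["container_name"]
    entry = service["ports"][0]
    if isinstance(entry, dict):
        # Long syntax: {"target": 80, "published": 8080}
        host_port = int(entry["published"])
    else:
        # Short syntax: "HOST:CONTAINER" or "IP:HOST:CONTAINER", optional "/proto"
        parts = str(entry).split("/")[0].split(":")
        if len(parts) < 2:
            raise ValueError(f"no host-side port in mapping {entry!r}")
        host_port = int(parts[-2])
    return {
        "app_service": app_service,
        "container_name": container_name,
        "app_port": host_port,
        # The tool derives app_url itself; shown here only for reference.
        "app_url": f"http://localhost:{host_port}",
    }
```

Because both values come from the same mapping, the container_name cannot drift between compose variants, which is the consistency the description insists on.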
===== FLAG NAME RULES (absolute, do not drift when reviewing the output) =====
`--cloud-app-id` ← NOT `--app-id`. The OSS config has an `appId` uint64 field that viper maps `--app-id` into; passing a UUID there fails with "invalid syntax" before RunE runs.
`keploy test sandbox --cloud-app-id <uuid> --app-url <url>` ← the CI form. NOT `keploy test --cloud-app-id` (must be `test sandbox` — the headless flags live on the sandbox subcommand only), NOT `keploy test-suite run` (that command doesn't exist). There is NO `--pipeline` flag.
Install URL = `https://keploy.io/ent/install.sh` ← NOT `https://keploy.io/install.sh` (OSS; no sandbox subcommand at all), NOT a github.com/keploy/keploy release tarball.
If the server-emitted content ever disagrees with these rules, trust the server output and file a bug — don't edit the YAML.
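The flag rules above lend themselves to a read-only lint over a command line, useful for spotting drift when reviewing output. This is a sketch only: `flag_drift` is a hypothetical helper, it checks just the rules quoted in this section, and per the rule above its findings are something to report, not a reason to edit the YAML.

```python
import shlex

REQUIRED_FLAGS = ("--cloud-app-id", "--app-url")
FORBIDDEN_FLAGS = ("--app-id", "--app", "--pipeline")


def flag_drift(command):
    """Return a list of rule violations for one keploy command line."""
    tokens = shlex.split(command)
    problems = []
    # The headless flags live on the sandbox subcommand only.
    if tokens[:3] != ["keploy", "test", "sandbox"]:
        problems.append("command must start with 'keploy test sandbox'")
    # Compare whole flag names so --app-id never matches --cloud-app-id.
    flags = {token.split("=", 1)[0] for token in tokens if token.startswith("--")}
    for flag in REQUIRED_FLAGS:
        if flag not in flags:
            problems.append(f"missing {flag}")
    for flag in FORBIDDEN_FLAGS:
        if flag in flags:
            problems.append(f"unexpected {flag}")
    return problems
```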
===== RESOLUTION ARGS =====
Pass either app_id (explicit UUID) or app_name_hint (substring; server does listApps and requires exactly one match).
Pass app_service (docker-compose service name), container_name (from compose container_name: field read from the SAME compose_file arg), and app_port (HTTP port the service exposes).
compose_file is optional, defaults to "docker-compose.yml". If the repo has a -keploy.yaml variant with `external: true` networks, do NOT point compose_file at it — it won't work in CI.
suite_ids is optional and should be LEFT BLANK by default — the CLI resolves every linked suite at run time. Only pin an explicit list when the dev narrows scope.
===== FINAL RESPONSE — three short sections, no questions =====
### Created
| File | Lines |
| --- | --- |
| .github/workflows/keploy-sandbox.yml | N |
### Summary
- App: <name> (<app_id>), <N> linked suites replayed on every PR
- Trigger: pull_request → main, + manual workflow_dispatch
- Failure on any suite gates the PR (non-zero exit from the CLI)
### Before the first run, add this GitHub secret
- `KEPLOY_API_KEY` — at https://github.com/<owner>/<repo>/settings/secrets/actions/new
(self-hosted users — point at your own api-server by building the
enterprise binary with -X main.api_server_uri=<url>; there is no
runtime env override on the released binary.)
This tool does NOT run anything. It only generates file contents.
| Name | Required | Description | Default |
|---|---|---|---|
| app_id | No | Keploy app UUID. Provide this OR app_name_hint. | |
| app_port | No | HTTP port the service exposes (default 8080) | |
| ci_system | No | CI system. Currently only "github-actions" supported (default). | |
| suite_ids | No | Optional comma-separated suite IDs to pin in the workflow. LEAVE BLANK by default — the CLI resolves every linked suite for the app at run time, so new suites auto-pick up in CI without editing the YAML. Only pass an explicit list when the dev narrows scope. | |
| app_service | Yes | Docker-compose service name for the user's app (e.g. 'producer'). Must be a service KEY in the compose_file you pass. | |
| branch_name | No | OPTIONAL Keploy branch NAME (the human-readable string the dev knows — e.g. "update-order-response-fields" or whatever the dev's git branch is) to scope the "are there any linked suites yet?" pre-flight count. Resolved to a branch UUID internally via list_branches. The generated YAML still uses --create-branch ${{ github.head_ref }} at CI run time (per-PR branch), so this arg ONLY influences the pre-flight smoke count + the summary line — it doesn't get baked into the workflow. Pass when your linked suites live on a branch (not main) so the scaffold's count reflects them. If unset, the scaffold counts main-view suites only. | |
| compose_file | No | Path to the docker-compose file CI should use (default "docker-compose.yml"). Avoid pointing this at a variant with `networks: default: external: true` — those only work locally where the external network already exists; in CI (clean runner) they fail with "network not found". | |
| app_name_hint | No | Case-insensitive substring of the app name (typically cwd basename, e.g. "orderflow"). | |
| container_name | Yes | Container name from the compose_file's container_name: field for app_service. Different compose files can set different container names for the same service — e.g. docker-compose.yml might say 'orderflow-producer' while docker-compose-keploy.yaml says 'producer'. Must match whichever file you pass as compose_file. |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
The description adds significant context beyond annotations: it states the tool returns file_path, content, summary and that the agent must write the content verbatim. It warns against editing server output and clarifies the tool does not run tests. This complements the destructiveHint and openWorldHint annotations without contradiction.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is thorough with clear sections (GOAL, HOW, STEPS, FLAG NAME RULES). However, it is quite verbose, with some repetition (e.g., flag name rules are stated again in the body). Despite this, every sentence adds value given the tool's complexity; a slightly tighter structure could improve conciseness.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Despite no output schema, the description fully explains the return value (file_path, content, summary) and the required follow-up action (write the content). It covers all aspects: parameter resolution, CI detection, flag correctness, and even post-creation instructions (adding GitHub secrets). No gaps identified.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100%, but the description adds extensive meaning for each parameter: e.g., suite_ids should be left blank by default, compose_file should avoid external: true, branch_name only influences pre-flight counts. It provides concrete resolution strategies (e.g., reading docker-compose, calling listApps) that go beyond the schema.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states it generates CI workflow YAML for keploy sandbox tests. It specifies the exact purpose ('add keploy sandbox tests to a pull-request pipeline') and distinguishes from sibling tools like run_sandbox_tests by explicitly saying not to use them. The verb 'generate' matches the tool's action.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description provides explicit trigger phrases ('add keploy sandbox tests to my pipeline'), when-not-to-use instructions ('NEVER kick off an actual test run'), and alternatives (e.g., 'use listApps' for resolving app_id). It also includes step-by-step resolution logic and fallbacks for CI detection.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
startLoadTest (B, Destructive)
POST /apps/{appId}/load-tests — Start a load test — Requires scope: write.
| Name | Required | Description | Default |
|---|---|---|---|
| appId | Yes | Path parameter: appId |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already indicate destructiveHint=true and idempotentHint=false. The description adds the scope requirement but no further behavioral details beyond what annotations provide.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is a single concise sentence that front-loads the action, with no wasted words.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
No output schema exists, and the description does not explain return values, asynchronous behavior, or side effects. For a mutation tool, this is insufficient.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100% and the description adds no extra meaning to the single parameter (appId) beyond what the schema already says. Baseline 3 is appropriate.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the action (Start a load test) and the resource (app identified by appId), including HTTP method and required scope. It distinguishes from siblings like stopLoadTest and listLoadTestRuns.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description provides no guidance on when to use this tool versus alternatives, such as when to start vs. stop a load test, or any prerequisites beyond the scope requirement.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
stopJob (A, Destructive)
POST /jobs/{jobId}/stop — Stop a running job — Requires scope: write.
| Name | Required | Description | Default |
|---|---|---|---|
| jobId | Yes | Path parameter: jobId |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already indicate destructiveHint=true. The description adds the required scope ('write'), which is critical for authorization. It also provides the HTTP method and path, though this is more REST-specific. Overall, it adds useful behavioral context beyond annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is a single sentence that includes the action, resource, and scope requirement. No wasted words, and it is front-loaded with the key information.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the single parameter, no output schema, and the presence of annotations, the description is nearly complete. It explains what the tool does and the required scope. Minor omissions like whether the job must be running are not critical.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
The input schema has 100% description coverage, and the description does not add additional meaning beyond the schema. The parameter 'jobId' is already explained as a path parameter. The description meets the baseline but provides no extra semantic value.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states 'Stop a running job' which specifies the action (stop) and the resource (a running job). It distinguishes itself from sibling tools like stopLoadTest and startLoadTest by focusing on generic jobs.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
No guidance on when to use this tool versus alternatives. It does not mention when not to use it or provide context about prerequisites or side effects.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
stopLoadTest (A, Destructive)
POST /apps/{appId}/load-tests/{runId}/stop — Stop a running load test — Requires scope: write.
| Name | Required | Description | Default |
|---|---|---|---|
| appId | Yes | Path parameter: appId | |
| runId | Yes | Path parameter: runId |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare destructiveHint=true and readOnlyHint=false. The description adds the HTTP method and URL pattern but no extra behavioral detail beyond 'Stop a running load test'. It does not contradict annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Single sentence with all essential information front-loaded (verb, resource, HTTP method, required scope). No wasted words.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
For a stop action with no output schema and low complexity, the description covers purpose, required scope, and mechanics (HTTP POST). It is fully adequate.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Input schema has 100% description coverage for both parameters (appId, runId) as 'Path parameter'. Description adds no additional semantics beyond what the schema and URL pattern provide.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the action ('Stop a running load test') and the specific resource, distinguishing it from sibling tools like startLoadTest or getLoadTestReport.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
It mentions the required scope 'write' which provides a precondition, but does not explicitly state when to use or when to avoid using this tool compared to alternatives. However, the context is clear enough.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
streamJobEvents (B, Destructive)
GET /jobs/{jobId}/events — Stream job events (SSE) — Requires scope: read. Returns a text/event-stream.
| Name | Required | Description | Default |
|---|---|---|---|
| jobId | Yes | Path parameter: jobId |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations declare destructiveHint=true and readOnlyHint=false, contradicting the implication of a read-only stream in the description. The description does not address this destructive behavior, nor does it explain the side effects of streaming events.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is a single, well-structured sentence front-loading the HTTP method and path. Every element serves a purpose without redundant words.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Despite good conciseness, the description lacks key behavioral context such as connection longevity, event types, and handling of destructive side effects implied by annotations. No output schema exacerbates the incomplete picture for an AI agent.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
With 100% schema coverage, the baseline is 3. The description includes the parameter in the path pattern but adds no further semantics beyond what the schema already provides.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool streams job events via SSE, with the HTTP method and path. It effectively distinguishes from sibling 'streamLoadTestEvents' by specifying the resource type (job vs load test).
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description mentions the required scope 'read' but provides no explicit guidance on when to use this tool versus alternatives like 'getJob' or 'streamLoadTestEvents'. Usage context is only implied by the name and streaming verb.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
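The streamJobEvents description says the endpoint returns a text/event-stream but does not document event names or payloads. As a rough illustration only, the sketch below parses the generic server-sent-events wire format (`event:`, `data:`, `id:` fields, with a blank line ending each event). The function name `parse_sse` and the sample lines are assumptions made for the example, not Keploy's actual events.

```python
from typing import Dict, Iterable, Iterator, List


def parse_sse(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per server-sent event from an iterable of text lines."""
    event: Dict[str, str] = {}
    data: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the event collected so far.
            if data:
                event["data"] = "\n".join(data)
                yield event
            event, data = {}, []
            continue
        if line.startswith(":"):
            continue  # comment / keep-alive line
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space is stripped
        if field == "data":
            data.append(value)
        elif field in ("event", "id"):
            event[field] = value


# Hypothetical sample stream; real event names and payloads may differ.
sample = [
    ": keep-alive\n",
    "event: status\n",
    'data: {"state": "running"}\n',
    "\n",
    "data: line one\n",
    "data: line two\n",
    "\n",
]
events = list(parse_sse(sample))
```

In practice the lines would come from a long-lived HTTP response rather than a list, so a caller would iterate lazily instead of collecting everything with `list(...)`.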
streamLoadTestEvents (B, Destructive)
GET /apps/{appId}/load-tests/{runId}/events — Stream load test events (SSE) — Requires scope: read. Returns a text/event-stream.
| Name | Required | Description | Default |
|---|---|---|---|
| appId | Yes | Path parameter: appId | |
| runId | Yes | Path parameter: runId |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Description says 'Returns a text/event-stream', suggesting a read operation, but annotation destructiveHint=true contradicts this. No details on connection longevity or event format.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Single line with all essential info: method, path, action, scope, response type. No wasted text.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Lacks output schema and does not describe event structure or format, which is critical for an SSE stream. Annotations conflict with description.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100%; description adds no additional meaning beyond naming the parameters as path parameters.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the HTTP method (GET), path, action (stream load test events using SSE), and distinguishes from siblings like streamJobEvents.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Implies usage for streaming events from a load test run, but no explicit guidance on when to use vs alternatives or what not to use it for.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
updateApp (B, Destructive)
PUT /apps/{appId} — Update an app — Requires scope: write.
| Name | Required | Description | Default |
|---|---|---|---|
| brd | No | Business requirements document content the AI uses for context. | |
| prd | No | Product requirements document content the AI uses for context. | |
| auth | No | Authentication configuration for test execution. The runner injects the matching headers on every step request. | |
| docs | No | Free-form developer docs. Used by AI as additional context when authoring suites. | |
| appId | Yes | Path parameter: appId | |
| labels | No | Add or update labels on the app. New labels (no `id`) require both `name` and `color`; updates to existing labels (with `id`) require at least one of `name`/`color`. | |
| schema | No | ||
| country | No | Two-letter country code controlling data-residency-affected behavior. Rarely set. | |
| postman | No | Postman collection JSON the AI parses for endpoint shapes / examples. | |
| endpoint | No | ||
| main_curl | No | Reference curl that drives generation when no schema is available. | |
| rate_limit | No | Requests-per-second cap the runner applies to outbound calls during runs. | |
| webhook_url | No | ||
| api_examples | No | Sample request/response pairs the AI consults when authoring suites. | |
| code_snippet | No | Server code snippet the AI uses for endpoint context. | |
| private_mode | No | Restrict app visibility to the authenticated user only. | |
| graphql_schema | No | GraphQL schema (SDL) the AI uses when generating GraphQL suites. | |
| enable_pre_hook | No | Run the pre-step hook before each test step. | |
| max_test_suites | No | Cap on how many suites generate-tests will mint at once. Server default applies if omitted. | |
| enable_post_hook | No | Run the post-step hook after each test step. | |
| ignore_endpoints | No | Endpoint patterns the runner skips when generating / running suites. | |
| disable_schema_assertion | No | ||
| app_level_custom_function | No | Register a JS function devs can reference from suite step templates. Key uniquely identifies the function; CustomFunction is the JS source. | |
| app_level_custom_variables | No | Add / update / delete a SINGLE global variable. The Action enum on the embedded ExtractInput controls the operation. To set multiple variables, call updateApp once per variable. |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already indicate destructiveHint=true and readOnlyHint=false. The description adds the need for 'write' scope, which is useful. However, it does not describe the extent of updates (e.g., partial vs full replacement) or other side effects like rate limiting or webhook behavior.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is a single concise sentence with the method, resource, and scope requirement. It is well-structured and front-loaded, but given the tool's complexity (24 parameters), it could include more context without losing conciseness.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
With no output schema, 24 parameters, nested objects, and destructive semantics, the description is too minimal. It lacks information on return values, error handling, idempotency, or differential behavior from siblings, leaving significant gaps for the agent.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema description coverage is only 30%, yet the description adds no parameter-specific semantics beyond mentioning appId as a path parameter. For the 7 undocumented parameters, the agent gains no additional meaning from the description, failing to compensate for the low coverage.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool updates an app via PUT, specifies the path parameter {appId}, and distinguishes it from siblings like createApp or deleteApp. The verb 'Update' and resource 'app' are specific and unambiguous.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
No guidance on when to use this tool versus alternatives like createApp or deleteApp. It only mentions scope requirement, but does not explain prerequisites, conditions, or scenarios where this tool is preferred.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
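The `labels` parameter of updateApp carries a rule that is easy to get wrong: new labels (no `id`) need both `name` and `color`, while updates to existing labels (with `id`) need at least one of the two. A minimal sketch of a client-side pre-check follows, assuming the payload is a list of objects with `id`, `name`, and `color` keys. The helper name `label_errors` and the sample values are hypothetical, and the server remains the authority on what it accepts.

```python
from typing import Dict, List


def label_errors(labels: List[Dict[str, str]]) -> List[str]:
    """Return human-readable problems for a labels payload; empty means OK."""
    errors = []
    for i, label in enumerate(labels):
        has_name = bool(label.get("name"))
        has_color = bool(label.get("color"))
        if label.get("id"):
            # Update to an existing label: at least one of name/color.
            if not (has_name or has_color):
                errors.append(f"labels[{i}]: update needs name or color")
        else:
            # New label: both name and color are required.
            if not (has_name and has_color):
                errors.append(f"labels[{i}]: new label needs name and color")
    return errors


# Hypothetical payload: two valid entries followed by two invalid ones.
payload = [
    {"name": "smoke", "color": "#00ff00"},
    {"id": "lbl-1", "color": "#ff0000"},
    {"name": "orphan"},
    {"id": "lbl-2"},
]
problems = label_errors(payload)
```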
updateTestCase (A, Destructive)
PUT /apps/{appId}/recordings/{testSetId}/test-cases/{testCaseId} — Update a test case — Update mutable fields of a recorded test case (name, http_req, http_resp). Requires scope: write.
| Name | Required | Description | Default |
|---|---|---|---|
| name | No | ||
| appId | Yes | Path parameter: appId | |
| http_req | No | ||
| http_resp | No | ||
| testSetId | Yes | Path parameter: testSetId | |
| testCaseId | Yes | Path parameter: testCaseId |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already set destructiveHint=true, idempotentHint=false, and readOnlyHint=false, so the description adds the need for write scope. No contradictions. It does not detail side effects beyond mutation, but the annotations cover the behavioral hints adequately.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is a single sentence that includes the HTTP method, URL, purpose, and scope requirement. No redundant information; every part is useful.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
For a mutation tool with no output schema, the description explains what fields are updated and the required scope. It does not specify return values, but that is acceptable. It covers the essential aspects, though it could mention potential side effects on the recorded test case.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema description coverage is 50% (only path params described). The description names the updatable fields (name, http_req, http_resp) but does not explain their types or constraints. With moderate schema coverage, the description partially compensates but could provide more detail.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states it updates a test case, specifies the HTTP method and URL, and lists the mutable fields (name, http_req, http_resp). This distinguishes it from other tools like getTestCase or deleteTestSuite.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description indicates it is for updating mutable fields and requires write scope, but lacks explicit guidance on when not to use this tool versus alternatives. The context is clear enough but not fully elaborated.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
update_test_suite (A, Destructive)
Edit an existing test suite — change one or more step bodies, assertions, headers, or remove/add steps. Returns a playbook that delegates to keploy update-test-suite, which validates the new state (static structural checks + 2 live runs for idempotency + GET-coupling check) and snapshot-replaces the suite via api-server.
POST-EDIT BEHAVIOUR: any structural change here (step method/url/body/headers/extract/assert, or add/delete steps) AUTOMATICALLY clears the suite's sandbox test server-side — the suite comes back as linked=false. Call record_sandbox_test on the updated suite before any sandbox replay; otherwise replay_sandbox_test will 400 with "no sandboxed tests". Cosmetic-only edits (name, description, labels) preserve the sandbox test.
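To make the structural-versus-cosmetic distinction concrete, here is a small sketch that predicts whether an edit would clear the sandbox test. It assumes each step is a dict whose keys match the field names listed above (`method`, `url`, `body`, `headers`, `extract`, `assert`) plus an `id`; those key names and the helper `clears_sandbox_test` are illustrative assumptions, and the server's own check may differ (for example, this sketch ignores step reordering).

```python
from typing import Any, Dict, List

# Step fields the description lists as structural (key names assumed).
STRUCTURAL_STEP_FIELDS = ("method", "url", "body", "headers", "extract", "assert")


def clears_sandbox_test(old_steps: List[Dict[str, Any]],
                        new_steps: List[Dict[str, Any]]) -> bool:
    """True if the edit looks structural, i.e. the suite would come back linked=false."""
    old_by_id = {s["id"]: s for s in old_steps}
    new_ids = [s.get("id") for s in new_steps]
    # Added steps carry no id; deleted steps are missing from the new list.
    if None in new_ids or set(new_ids) != set(old_by_id):
        return True
    for step in new_steps:
        old = old_by_id[step["id"]]
        if any(step.get(f) != old.get(f) for f in STRUCTURAL_STEP_FIELDS):
            return True
    return False


old = [{"id": "s1", "method": "GET", "url": "/orders"}]
unchanged = clears_sandbox_test(old, [{"id": "s1", "method": "GET", "url": "/orders"}])
method_changed = clears_sandbox_test(old, [{"id": "s1", "method": "POST", "url": "/orders"}])
```

When the result is true, the text above implies a record_sandbox_test call is needed before any replay.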
═══════════════════════════════════════════════════════════════════ FETCH-FIRST RULE — required for the edit to be accepted: ═══════════════════════════════════════════════════════════════════
The api-server's replace handler rejects updates that preserve ZERO step IDs from the existing suite ("full rewrite, not an edit"). To make a real edit:
1. Call getTestSuite first (or use download_recording / get_app_testing_context if you already have the suite). Capture each existing step's "id" field.
2. Compose your new steps_json INCLUDING the existing "id" on every step you want to KEEP or EDIT. Omit "id" only on steps you're ADDING. Drop a step entirely from steps_json to DELETE it.
3. Call this tool with that merged steps_json.
If you author a fresh JSON without the existing step IDs, the server rejects it with "preserves no steps from the existing suite". When that happens, your two options are: (a) re-author with IDs preserved (preferred — keeps history), or (b) call delete_test_suite then create_test_suite (loses history, fresh suite_id).
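The fetch-first rule can be checked locally before calling the tool. The sketch below is one way to do it, assuming steps are plain dicts with an `id` key; the helper names and the `name` field on each step are hypothetical, and only the rejection message is quoted from the text above.

```python
import json
from typing import Any, Dict, List


def preserved_ids(existing: List[Dict[str, Any]],
                  desired: List[Dict[str, Any]]) -> List[str]:
    """IDs from the existing suite that the desired step list still carries."""
    existing_ids = {s["id"] for s in existing}
    return [s["id"] for s in desired if s.get("id") in existing_ids]


def build_steps_json(existing: List[Dict[str, Any]],
                     desired: List[Dict[str, Any]]) -> str:
    """Serialize desired steps, refusing a list that keeps no existing step ID."""
    if not preserved_ids(existing, desired):
        raise ValueError("preserves no steps from the existing suite")
    return json.dumps(desired)


# Existing suite as fetched via getTestSuite (shape assumed for illustration).
existing = [{"id": "a1", "name": "create order"}, {"id": "b2", "name": "get order"}]
desired = [
    {"id": "a1", "name": "create order v2"},  # edited: keeps its id
    {"name": "cancel order"},                 # added: no id
]                                             # b2 is dropped, i.e. deleted
steps_json = build_steps_json(existing, desired)
```

Raising locally mirrors the server-side rejection, so the mistake surfaces before any CLI run.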
═══════════════════════════════════════════════════════════════════ DISCOVERY — when the dev hands you a bare suite_id with no app_id / branch_id: ═══════════════════════════════════════════════════════════════════
Suites live on a (app_id, branch_id) tuple. A bare suite_id has no on-disk hint about which app or branch holds it; you have to RESOLVE both before calling this tool. Walk these steps in order — STOP as soon as getTestSuite returns 200:
1. Detect the dev's git branch: Bash `git rev-parse --abbrev-ref HEAD` in app_dir. If exit non-zero / output is "HEAD" → not a git repo / detached HEAD; ASK the dev for the Keploy branch name.
2. Resolve candidate apps via the cwd basename: Bash `basename $(pwd)` → call listApps with q=. Usually 1–2 candidates. If 0 → ASK; if >1 → walk every candidate in step 4.
3. For each candidate app, call list_branches({app_id}) and find the branch whose `name` matches the git branch from step 1. That gives you {branch_id}. If no match → not this app, try next.
4. Verify with getTestSuite({app_id, suite_id, branch_id=<from step 3>}). 200 → resolved; 404 → wrong app/branch, try next.
5. If steps 2–4 exhaust, walk every OPEN branch on each candidate app, then try main (branch_id omitted). If still nothing → ASK the dev for the {app_id, branch_id} pair.
The getTestSuite call in step 4 is the one whose response you also use to capture every step's existing "id" for the FETCH-FIRST RULE above — so step 4 is actually a 2-for-1: discovery AND fetch-first happen on the same call.
After resolving once in a session, REUSE the {app_id, branch_id} for subsequent suite-targeted calls; don't re-walk discovery for every action.
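The discovery walk can be sketched as a small resolver. Everything here is illustrative: `list_branches` and `get_test_suite` are hypothetical callables standing in for the corresponding tool calls, the branch dict keys (`id`, `name`, `status`) are assumed, and `get_test_suite` is assumed to return an HTTP status code.

```python
from typing import Callable, Dict, List, Optional, Tuple


def resolve_suite(
    suite_id: str,
    git_branch: str,
    candidate_apps: List[str],
    list_branches: Callable[[str], List[Dict[str, str]]],
    get_test_suite: Callable[[str, str, Optional[str]], int],
) -> Optional[Tuple[str, Optional[str]]]:
    """Return (app_id, branch_id) for suite_id, or None if the dev must be asked."""
    # Match the git branch by name on each candidate app, then verify.
    for app_id in candidate_apps:
        for branch in list_branches(app_id):
            if branch["name"] == git_branch:
                if get_test_suite(app_id, suite_id, branch["id"]) == 200:
                    return app_id, branch["id"]
    # Fallback: every open branch, then main (branch_id omitted).
    for app_id in candidate_apps:
        for branch in list_branches(app_id):
            if branch.get("status") == "open":
                if get_test_suite(app_id, suite_id, branch["id"]) == 200:
                    return app_id, branch["id"]
        if get_test_suite(app_id, suite_id, None) == 200:
            return app_id, None
    return None


# Fake backends for illustration only.
branches = {"app-1": [{"id": "br-9", "name": "feature-x", "status": "open"}]}
found = resolve_suite(
    "suite-1", "feature-x", ["app-1"],
    lambda app_id: branches[app_id],
    lambda app_id, suite_id, branch_id: 200 if branch_id == "br-9" else 404,
)
```

A `None` result corresponds to the final "ASK the dev" branch of the walk; a successful result is what should be cached and reused for the rest of the session.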
═══════════════════════════════════════════════════════════════════ INPUTS ═══════════════════════════════════════════════════════════════════
app_id (required) — Keploy app id
suite_id (required) — UUID of the suite to update
branch_id (required) — Keploy branch UUID (resolve via the two-step flow before calling)
steps_json (required) — JSON array of the FULL desired step list. Each kept step MUST carry the existing "id". Same step shape as create_test_suite (response, extract, assert, etc — all static structural checks apply).
name / description / labels (optional) — overrides for top-level suite metadata
app_url (required) — base URL of the dev's running local app, e.g. http://localhost:8080. The CLI fires the new state TWICE against this for the idempotency check + GET-coupling check.
app_dir (optional) — repo root the CLI cd's into; defaults to "."
═══════════════════════════════════════════════════════════════════ HOW THIS TOOL WORKS ═══════════════════════════════════════════════════════════════════
This tool DOES NOT call api-server itself. It returns a 3-step playbook for you (Claude) to walk via Bash — same shape as create_test_suite:
1. Write merged JSON to a temp file.
2. Run `keploy update-test-suite --suite-id <id> --file <path> --branch-id <uuid> --base-url <url>`. This runs every static structural check, fires the new state twice locally, applies the GET-coupling check, then POSTs the snapshot-replace.
3. Clean up the temp file.
Walk the playbook in order. If step 2 exits non-zero, surface stdout to the dev — it has the rule violation / failure detail.
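For orientation, the three playbook steps might look like the following when driven from Python instead of Bash. This is a sketch under assumptions: the flags are the ones quoted in the description, but the playbook actually returned by the tool is authoritative, and the temp-file contents (here just the step list) and the helper names are hypothetical.

```python
import json
import os
import subprocess
import tempfile
from typing import Any, Dict, List


def build_command(suite_id: str, path: str, branch_id: str, base_url: str) -> List[str]:
    """Argument vector for the CLI call quoted in the playbook."""
    return [
        "keploy", "update-test-suite",
        "--suite-id", suite_id,
        "--file", path,
        "--branch-id", branch_id,
        "--base-url", base_url,
    ]


def run_playbook(steps: List[Dict[str, Any]], suite_id: str, branch_id: str,
                 base_url: str, app_dir: str = ".") -> subprocess.CompletedProcess:
    """Write the merged JSON, run the CLI, and always remove the temp file."""
    fd, path = tempfile.mkstemp(suffix=".json")
    try:
        with os.fdopen(fd, "w") as fh:
            json.dump(steps, fh)  # write merged JSON to a temp file
        return subprocess.run(    # run the CLI from the repo root
            build_command(suite_id, path, branch_id, base_url),
            cwd=app_dir, capture_output=True, text=True,
        )
    finally:
        os.remove(path)           # clean up the temp file


cmd = build_command("suite-1", "/tmp/suite.json", "br-9", "http://localhost:8080")
```

The `finally` block keeps the cleanup step from being skipped when the CLI exits non-zero; the caller can then surface `stdout` from the returned result.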
OUTCOMES the AI should recognize:
Exit 0 + stdout has "✓ suite updated:" + "View:" line → success. Surface the View URL to the dev.
Exit 1 + "preserves no steps from the existing suite" → fetch-first rule was missed. Re-author with step IDs preserved (or call delete_test_suite + create_test_suite as the documented escape hatch).
Exit 1 + structural-check violations → fix the suite per the violation messages, then REWRITE the suite file via Bash and RE-RUN this CLI command directly. DO NOT call update_test_suite again to retry — the playbook + file path are already valid; only the JSON content needs revision. The validator output includes a canonical step skeleton on structural failures.
Exit 2 + "couldn't reach the dev's app" → ensure the app is up at app_url and retry.
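The outcome list maps naturally onto a small classifier. The sketch below matches on the exit codes and stdout fragments quoted above; the returned labels are invented for the example, and the exact CLI wording may differ from these quoted fragments.

```python
def classify_outcome(exit_code: int, stdout: str) -> str:
    """Map a CLI result onto the outcomes listed above (labels are illustrative)."""
    if exit_code == 0 and "✓ suite updated:" in stdout and "View:" in stdout:
        return "success"
    if exit_code == 1 and "preserves no steps from the existing suite" in stdout:
        return "fetch_first_missed"
    if exit_code == 1:
        # Remaining exit-1 cases are treated as structural-check violations.
        return "structural_violation"
    if exit_code == 2 and "couldn't reach the dev's app" in stdout:
        return "app_unreachable"
    return "unknown"


ok = classify_outcome(0, "✓ suite updated: suite-1\nView: https://example.invalid/suite-1")
missed = classify_outcome(1, "error: preserves no steps from the existing suite")
```

On `structural_violation` the guidance above applies: rewrite the suite file and re-run the same CLI command rather than calling update_test_suite again.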
PREREQUISITES the playbook assumes:
The dev's app is up and reachable at app_url.
`keploy` binary is on PATH. If missing, install before calling this tool: `curl --silent -O -L https://keploy.io/install.sh && source install.sh`.
Either ~/.keploy/cred.yaml exists or KEPLOY_API_KEY is exported.
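Two of these prerequisites can be checked mechanically before walking the playbook. A minimal sketch follows, assuming the binary name, credential path, and environment variable named above; the helper `missing_prerequisites` is hypothetical, and reachability of the app at app_url is deliberately not checked here.

```python
import os
import shutil
from typing import List, Mapping, Optional


def missing_prerequisites(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """List unmet prerequisites; an empty list means the local checks pass."""
    env = os.environ if env is None else env
    missing = []
    # The keploy binary must be discoverable on PATH.
    if shutil.which("keploy", path=env.get("PATH")) is None:
        missing.append("keploy binary not on PATH")
    # Either the credentials file or the API key variable must be present.
    cred = os.path.expanduser("~/.keploy/cred.yaml")
    if not os.path.isfile(cred) and not env.get("KEPLOY_API_KEY"):
        missing.append("no ~/.keploy/cred.yaml and KEPLOY_API_KEY not set")
    return missing


# Illustrative environment: empty PATH, API key exported.
result = missing_prerequisites({"PATH": "", "KEPLOY_API_KEY": "example-key"})
```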
| Name | Required | Description | Default |
|---|---|---|---|
| name | No | Optional override for the suite's name. Defaults to the existing name if omitted. | |
| app_id | Yes | Keploy app ID | |
| labels | No | Optional comma-separated labels to set (replaces the existing list). | |
| app_dir | No | Absolute path to the dev's repo root. Defaults to '.' (cwd). | |
| app_url | Yes | Base URL of the dev's local running app, e.g. http://localhost:8080. The CLI fires the new state twice against this for the idempotency + GET-coupling checks. | |
| suite_id | Yes | UUID of the test suite to update | |
| branch_id | Yes | REQUIRED. Keploy branch UUID. | |
| steps_json | Yes | JSON array of the full desired step list. Each step you want to KEEP must carry the existing "id" — omit "id" only on new steps. Drop steps entirely to delete them. | |
| description | No | Optional override for the suite's description. | |
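The steps_json rule in the table (kept steps carry their existing "id", new steps omit it, dropped steps are deleted) could be applied along these lines. This is a sketch only: the helper name and its arguments are hypothetical, and apart from "id" no step field names are given in the listing, so any other keys passed in are placeholders.

```python
import json


def build_steps_json(existing_steps, edits, removed_ids=(), new_steps=()):
    """Build the steps_json string from previously fetched steps.

    existing_steps: list of step dicts, each carrying its existing "id".
    edits: dict mapping a step id to the fields to override on that step.
    removed_ids: ids of steps to delete by leaving them out of the list.
    new_steps: step dicts to append; any "id" on them is stripped.
    """
    removed = set(removed_ids)
    desired = []
    for step in existing_steps:
        if step["id"] in removed:
            continue  # dropping a step from the list deletes it
        merged = dict(step)
        merged.update(edits.get(step["id"], {}))
        merged["id"] = step["id"]  # kept steps must carry the existing id
        desired.append(merged)
    for step in new_steps:
        # New steps are sent without an "id".
        desired.append({key: value for key, value in step.items() if key != "id"})
    return json.dumps(desired)
```

Building the list from the fetched steps, rather than from scratch, is what keeps the existing IDs intact.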
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
The description fully discloses behavioral traits beyond annotations: it explains that the tool returns a playbook for the AI to execute via Bash, details the 3-step playbook, describes outcomes and error scenarios, and lists prerequisites (app up, keploy binary, credentials). This adds significant context to the destructiveHint and openWorldHint annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is long but well-structured with clear headings (FETCH-FIRST RULE, DISCOVERY, INPUTS, etc.). It uses bullet points and numbered steps for readability. While not concise, every section earns its place given the tool's complexity and the need to guide the AI through a multi-step process.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
The description covers all necessary aspects: purpose, prerequisites, discovery flow, input semantics, how the tool works (playbook), expected outcomes, error handling, and fallback strategies. It leaves no ambiguity for the AI agent, making it fully complete for this complex tool.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100%, so the baseline is 3. The description adds value by explaining the crucial constraint on steps_json (must include existing step IDs) and clarifying app_url's purpose ('CLI fires the new state twice for D3/D7 validation'). This goes beyond the schema descriptions.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description opens with a clear verb+resource statement: 'Edit an existing test suite — change one or more step bodies, assertions, headers, or remove/add steps.' It distinguishes itself from sibling tools like create_test_suite and delete_test_suite by referencing them as alternatives in the FETCH-FIRST RULE section.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Explicit when-to-use guidance is provided through the FETCH-FIRST RULE, which requires calling getTestSuite first to capture step IDs. The description also specifies when not to use this tool (e.g., when the server rejects with 'preserves no steps') and offers alternatives: re-author with IDs or call delete_test_suite + create_test_suite.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
updateTestSuite (grade B, Destructive)
PUT /apps/{appId}/test-suites/{suiteId} — Update a test suite — Requires scope: write.
| Name | Required | Description | Default |
|---|---|---|---|
| appId | Yes | Path parameter: appId | |
| suiteId | Yes | Path parameter: suiteId | |
| branch_id | No | Query parameter: branch_id | |
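As an illustration of how the two path parameters and the optional query parameter fit together, the request URL could be assembled as below. The `api_base` argument is an assumption, since the listing does not state the API's base URL; only the path template and the branch_id parameter name come from the entry above.

```python
from urllib.parse import quote, urlencode


def update_suite_url(api_base, app_id, suite_id, branch_id=None):
    """Build the URL for PUT /apps/{appId}/test-suites/{suiteId}.

    `api_base` is a hypothetical base URL supplied by the caller.
    """
    # Percent-encode the path parameters so they cannot alter the path.
    path = f"/apps/{quote(app_id, safe='')}/test-suites/{quote(suite_id, safe='')}"
    url = api_base.rstrip("/") + path
    if branch_id is not None:
        # branch_id is optional and travels as a query parameter.
        url += "?" + urlencode({"branch_id": branch_id})
    return url
```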
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations indicate destructiveHint=true and readOnlyHint=false, consistent with 'update'. Description adds HTTP method (PUT) and scope requirement. However, no details on behavior like partial vs. full update, idempotency (annotations: idempotentHint=false), or what happens if suite doesn't exist.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Very concise, two fragments. Front-loads HTTP method and path. Could be slightly more structured, but overall efficient with no wasted words.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
For a destructive update tool with no output schema, the description is too minimal. It lacks information on return values, success/error behaviors, and prerequisites beyond scope. Given many sibling tools, more context is needed for safe invocation.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Input schema has 100% coverage with descriptions for both parameters. Description does not add additional meaning beyond the schema, so baseline of 3 is appropriate.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
Description clearly states 'Update a test suite', indicating the resource and action. However, it does not differentiate from sibling tools like updateTestCase or updateApp, and the resource is inferred from the path rather than explicitly stated.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Only mentions required scope 'write', but provides no guidance on when to use this tool versus alternatives like createTestSuite, deleteTestSuite, or other update tools. Lacks context for choosing this over siblings.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
validateTestSuite (grade A, Destructive)
POST /apps/{appId}/test-suites/{suiteId}/validate — Validate a test suite — Run the suite against a public, non-loopback base URL to capture responses and run assertions. DO NOT use for local-app / localhost validation — the SaaS backend rejects private IPs with 500. For local apps, curl endpoints yourself (Bash) and pass the captured responses into create_test_suite directly. Requires scope: write.
| Name | Required | Description | Default |
|---|---|---|---|
| appId | Yes | Path parameter: appId | |
| suiteId | Yes | Path parameter: suiteId | |
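Because the tool is described as rejecting loopback and private addresses, a caller might pre-check the base URL before invoking it. The sketch below uses Python's standard ipaddress module; the function name is hypothetical, the exact rules the backend applies are not documented here, and hostnames are not resolved, so a DNS name that points at a private address would pass this check.

```python
import ipaddress
from urllib.parse import urlsplit


def is_public_base_url(base_url):
    """Return False for obviously local targets, True otherwise.

    An approximation only: the server's own rejection rules may differ.
    """
    host = urlsplit(base_url).hostname
    if host is None:
        return False
    if host.lower() == "localhost" or host.lower().endswith(".localhost"):
        return False
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        # A DNS name: this sketch does not resolve it and assumes it is public.
        return True
    return not (address.is_loopback or address.is_private or address.is_link_local)
```

When this returns False, the description's alternative applies: curl the endpoints locally and pass the captured responses to create_test_suite.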
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations indicate destructiveHint=true and readOnlyHint=false, which the description supports by mentioning capturing responses and running assertions. However, it does not fully detail what state changes occur (e.g., does it update the suite?). The private IP rejection is a useful behavioral note.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Three sentences pack all key information: purpose, warning, alternative, and scope. No redundancy or fluff. The HTTP method and path provide precise identification.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the destructive nature and lack of output schema, the description covers scope requirements, error conditions (private IP rejection), and alternative workflows. Minor omission: no mention of what validation results look like, but the tool likely returns a validation report (implied by sibling listValidationResults).
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Input schema covers both parameters with 100% description, so the baseline is 3. The description adds no extra semantic meaning beyond confirming they are path parameters via the HTTP path.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool validates a test suite by running it against a public base URL. It distinguishes from siblings like run_test_suite and validate_and_run_test_suite by specifying the non-loopback requirement and providing an alternative for local apps.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description explicitly warns against using for localhost/local-app validation and gives a concrete alternative (curl and create_test_suite). It also specifies the required scope `write`, guiding correct invocation.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
Claim this connector by publishing a /.well-known/glama.json file on your server's domain with the following structure:
{
  "$schema": "https://glama.ai/mcp/schemas/connector.json",
  "maintainers": [{ "email": "your-email@example.com" }]
}
The email address must match the email associated with your Glama account. Once published, Glama will automatically detect and verify the file within a few minutes.
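The claim file can also be generated with a short script. This is a minimal sketch: the function name and the `web_root` argument (the directory your server publishes at the root of its domain) are assumptions for the example, while the file path and JSON structure follow the instructions above.

```python
import json
from pathlib import Path


def write_glama_json(web_root, maintainer_email):
    """Write <web_root>/.well-known/glama.json and return its path."""
    target = Path(web_root) / ".well-known" / "glama.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    document = {
        "$schema": "https://glama.ai/mcp/schemas/connector.json",
        # Use the email associated with your Glama account.
        "maintainers": [{"email": maintainer_email}],
    }
    target.write_text(json.dumps(document, indent=2) + "\n", encoding="utf-8")
    return target
```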
Control your server's listing on Glama, including description and metadata
Access analytics and receive server usage reports
Get monitoring and health status updates for your server
Feature your server to boost visibility and reach more users
For users:
Full audit trail – every tool call is logged with inputs and outputs for compliance and debugging
Granular tool control – enable or disable individual tools per connector to limit what your AI agents can do
Centralized credential management – store and rotate API keys and OAuth tokens in one place
Change alerts – get notified when a connector changes its schema, adds or removes tools, or updates tool definitions, so nothing breaks silently
For server owners:
Proven adoption – public usage metrics on your listing show real-world traction and build trust with prospective users
Tool-level analytics – see which tools are being used most, helping you prioritize development and documentation
Direct user feedback – users can report issues and suggest improvements through the listing, giving you a channel you would not have otherwise
The connector status is unhealthy when Glama is unable to successfully connect to the server. This can happen for several reasons:
The server is experiencing an outage
The URL of the server is wrong
Credentials required to access the server are missing or invalid
If you are the owner of this MCP connector and would like to make modifications to the listing, including providing test credentials for accessing the server, please contact support@glama.ai.