AI Design Blueprint Doctrine
Server Details
The industry standard reference for safe, observable, and steerable AI agent UX. Browse and search the 10 Blueprint principles, principle clusters, curated implementation examples, and application guides. 13 public tools require no credentials. Tools for learning path, coaching context, and handoffs require a Firebase Bearer token. Validation and usage summary tools require a Pro or Teams membership.
- Status: Healthy
- Last Tested
- Transport: Streamable HTTP
- URL
Glama MCP Gateway
Connect through Glama MCP Gateway for full control over tool access and complete visibility into every call.
Full call logging
Every tool call is logged with complete inputs and outputs, so you can debug issues and audit what your agents are doing.
Tool access control
Enable or disable individual tools per connector, so you decide what your agents can and cannot do.
Managed credentials
Glama handles OAuth flows, token storage, and automatic rotation, so credentials never expire on your clients.
Usage analytics
See which tools your agents call, how often, and when, so you can understand usage patterns and catch anomalies.
Tool Definition Quality
Average 4.7/5 across 24 of 24 tools scored.
Each tool targets a clearly distinct purpose. For example, architect.validate vs architect.validate_consensus differ in single-shot vs consensus; handoffs.agency, handoffs.operator, and handoffs.partnership are separated by engagement type. No significant overlap.
Tools follow a consistent dot-notation grouping (architect.*, clusters.*, examples.*, guides.*, handoffs.*, me.*, principles.*, signals.*, team.*) with predictable verbs (validate, list, get, search, add, etc.). Minor deviation: some names contain underscores (e.g., me.add_evidence), but the overall pattern holds.
24 tools is on the higher side but justified given the broad domain spanning validation, certification, learning, handoffs, and feedback. Each tool has a clear role, and the count reflects the platform's comprehensive scope without feeling bloated.
The tool surface covers the full workflow (validate → consensus → certify), discovery (principles, clusters, examples, guides), personal progress (learning path, coaching, evidence), handoffs (support, partnership, agency), feedback, and team summaries. No obvious gaps for the stated purpose.
Available Tools
24 tools
architect.certify: Certify Production-Ready Architecture (A)
Pro/Teams — second-pass adversarial certification of an architect.validate run that scored production_ready (A or B first-pass tier). ON CLIENT TIMEOUT — DO NOT RETRY THIS TOOL. RECOVERY FIRST: the run_id is emitted in the FIRST notifications/progress event at t=0s (BEFORE the LLM call begins). Capture it. On timeout, call me.validation_history(run_id='<that-id>') to fetch the persisted cert verdict; the server-side run completes independently within a 20-minute budget. This is the canonical recovery path. Use it before considering any retry. Long-running LLM call (60-180s typical; exceeds Claude Code's ~60s idle budget); MCP clients commonly close the call before the server returns. Retrying re-runs the LLM call AND burns one of your 3 cert retry-budget attempts. Mints the certified production_ready badge when both reviewers sign off; caps the run to C/emerging when the second pass surfaces a missed production_blocker. MANDATORY DOCTRINE RULE (load-bearing): the badge certifies the EXACT code that produced the validate run_id, NOT 'this codebase' in general. If you modify, fix, or iterate the code between architect.validate and architect.certify — even a single character — cert rejects with code_fingerprint_mismatch. Fixing the code voids the run. The recovery path is always: edit code → architect.validate → fresh run_id → architect.certify on the fresh run. Do NOT cert from a stale run_id after iteration; ask the user to re-validate first. WHEN TO CALL: only after architect.validate returned tier=production_ready AND the user wants the certified badge AND the code has not been touched since the validate run. NOT for tier=draft/emerging/not_applicable runs (typed rejections fire — see below). NOT idempotent across attempts: each call is one of the 3 attempts in the retry budget. BEHAVIOR: atomic one-shot single LLM call, ~60-180s server-side at high reasoning effort (small payloads finish faster; observed p99 ~250s; server-side budget is 20 min, ~5× observed max). 
Exceeds typical MCP-client tool-call idle budget (~60s in Claude Code), so the FIRST notifications/progress event fires at t=0 carrying the run_id. The run is atomic by contract — no in_progress lifecycle, no cancellation, no resume. Updates the persisted run's result_json (public review URL + me.validation_history(run_id=...) reflect the cert outcome). ELIGIBILITY GATE (typed rejection enum on failure): caller must own the run, tier=production_ready, less than 24h old, not already certified, within cert retry budget (max 3 attempts), no other cert call in flight for the same run_id, code fingerprint must match the validated code, AND the submitted payload must be cert-payload-complete (see Payload Completeness below — cert rejects pre-LLM with payload_incomplete when an imported module's surface isn't visible in the validate payload that produced this run_id). Rejection reasons (typed Literal): auth_required, paid_plan_required, run_not_found, not_run_owner, not_eligible_tier, not_agentic_component (tier=not_applicable runs), already_certified, certification_age_exceeded, retry_budget_exhausted, code_fingerprint_mismatch, code_fingerprint_missing, code_not_on_file (caller omitted code argument AND the 24h cert-retry hold for this run has expired or was never written. Recovery: re-run architect.certify from the same MCP session that ran architect.validate, passing the code explicitly — the server never persists code by design), payload_incomplete (submitted/validated payload imports modules whose contents aren't visible — cert refuses pre-LLM to prevent a false-precision downgrade. Recovery: re-validate with verbatim public-surface stubs for every imported module, then re-cert on the fresh run_id. 
Empirically validated: PR #157 iter8/iter9 cert rejections were exactly this class — code on disk was correct, the submitted payload merely omitted module visibility), cert_consensus_score_below_threshold (consensus_median<75 — consensus runs only), cert_consensus_unstable_blocker (any principle mode_stability<80% — consensus runs only), run_state_corrupt, cert_persistence_failed, cert_in_flight (a prior architect.certify call on this run_id is still running. Poll me.validation_history for the verdict; do not retry until it resolves). PAYLOAD COMPLETENESS (load-bearing for cert eligibility): the cert reviewer reads the EXACT payload that produced the validate run_id. Imported modules whose surface isn't present in the payload cause pre-LLM payload_incomplete refusal. Avoidance — when validating with intent to cert, bundle public-surface stubs for every imported module: from sqlalchemy.exc import SQLAlchemyError → include a stub class; from app.db import models → include a class models: namespace stub with the columns/methods you reference; module-level imports of dataclass, Literal, json, datetime, timezone MUST also be in the payload (cert correctly catches when they're omitted — code would NameError on import). 'Submit Like Production': the payload should be the code as it would actually run, not a compressed sketch. The stubs cover IMPORTED dependencies only; the certified code's own enforcement branches (approval gates, policy checks, recovery paths) must be present in full. A # ... placeholder reads as an ABSENT control and is graded against you, not as shorthand for one that exists. PRE-LLM REJECTION AUDIT TRAIL: when cert rejects before the LLM call (payload_incomplete, code_fingerprint_mismatch, etc.), certification_attempts=[] on the response — no attempt landed in the retry budget, no LLM hop occurred. The rejection envelope's rejection_reason + guidance are the actionable surface. 
(Audit-trail UI surfacing of pre-LLM rejections is tracked in the platform self-audit set as anomaly #5; out of scope for the cert tool itself.) INPUTS: re-send the SAME code that produced the run_id (the architect persists findings + recommendations, never code, by design — privacy-preserving). Server compares the submitted code's SHA-256 fingerprint to the stored fingerprint and rejects mismatches. Auth: Bearer , Pro or Teams plan required. UK/EU data residency (Cloud Run europe-west2). Code processed transiently by OpenAI (no-training-on-API-data) and dropped; payloads JSON-escaped + delimited as inert untrusted data — prompt-injection inside code is ignored. If the cert call fails outright (provider error, persistence error), a fresh architect.certify is the recovery path; the eligibility gate enforces the 3-attempt retry budget. For long-running cert workflows the answer is to re-validate, not to make this tool stateful. OUTCOMES: certification_status ∈ {confirmed_production_ready (badge mints), downgraded_to_emerging (cert review surfaced a missed production_blocker, tier capped at C/emerging), unavailable_provider_error (LLM call failed, retry within budget)}. Cert findings + summary + attempt history surfaced on the persisted run for full inspectability.
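The timeout recovery path described above might be sketched as follows. This is a minimal illustration, not the platform's client: `call_tool`, `fetch_history`, and the `on_progress` callback are hypothetical stand-ins for whatever MCP client is in use. Only the tool names and the idea that the first progress event carries the `run_id` come from the description.

```python
from typing import Callable, Optional


def certify_with_recovery(
    call_tool: Callable[..., dict],
    fetch_history: Callable[[str], dict],
    run_id: str,
    code: str,
) -> dict:
    """Call architect.certify once; on client timeout, read the persisted verdict.

    `call_tool` and `fetch_history` are hypothetical client wrappers.
    """
    seen_run_id: Optional[str] = None

    def on_progress(event: dict) -> None:
        # The first progress event is described as carrying the run_id at t=0.
        nonlocal seen_run_id
        if seen_run_id is None and "run_id" in event:
            seen_run_id = event["run_id"]

    try:
        return call_tool(
            "architect.certify",
            {"run_id": run_id, "code": code},
            on_progress=on_progress,
        )
    except TimeoutError:
        # No retry here: a retry would spend another of the 3 cert attempts.
        # Instead, look up the persisted result (me.validation_history).
        return fetch_history(seen_run_id or run_id)
```

The design point is that the timeout branch never re-invokes the tool; it only reads what the server-side run has persisted.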
| Name | Required | Description | Default |
|---|---|---|---|
| code | No | The same code that was sent to architect.validate to produce this run_id. Sent verbatim — the cert reviewer needs the actual code to surface production_blockers the first pass missed. May be omitted (empty string) when the prior validate stored the code under the 24h cert-retry hold; in that case the server reuses the stored code automatically. Sent under the same enterprise-safety envelope as architect.validate (transient processing, no training, JSON-escaped + delimited). | |
| run_id | Yes | The run_id from a prior architect.validate call. Returned in the validate response when persistence_status='saved'. Must be owned by the caller (per-user authorisation, same gate as me.validation_history). | |
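To see why even a single-character edit voids a run, here is a small sketch of a SHA-256 fingerprint comparison. It assumes the fingerprint is taken over the UTF-8 bytes of the submitted code; the server's actual normalisation (encoding, line endings) is not stated in the description, so treat `code_fingerprint` as a hypothetical helper.

```python
import hashlib


def code_fingerprint(code: str) -> str:
    """SHA-256 hex digest of the code's UTF-8 bytes (the encoding is an assumption)."""
    return hashlib.sha256(code.encode("utf-8")).hexdigest()


validated = "def approve(action):\n    return action.approved\n"
# One extra trailing space: still valid Python, but different bytes.
edited = validated.replace("approved\n", "approved \n")

same = code_fingerprint(validated) == code_fingerprint(validated)
changed = code_fingerprint(validated) != code_fingerprint(edited)
print(same, changed)
```

Under this assumption, any change to the bytes produces a different digest, which is consistent with the `code_fingerprint_mismatch` rejection described above.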
Output Schema
| Name | Required | Description |
|---|---|---|
| No output parameters | | |
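A client could check some of the eligibility conditions locally before spending a cert attempt. The sketch below is illustrative only: the `RunRecord` fields and the order of the checks are assumptions, and the server remains the authority. The rejection strings are taken from the typed list in the description.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class RunRecord:  # hypothetical shape, for illustration only
    owner_id: str
    tier: str
    created_at: datetime
    certified: bool = False
    attempts: int = 0
    in_flight: bool = False


def preflight_rejection(run: RunRecord, caller_id: str, now: datetime) -> Optional[str]:
    """Return the first rejection reason that would apply, or None if none is found."""
    if run.owner_id != caller_id:
        return "not_run_owner"
    if run.tier == "not_applicable":
        return "not_agentic_component"
    if run.tier != "production_ready":
        return "not_eligible_tier"
    if run.certified:
        return "already_certified"
    if now - run.created_at >= timedelta(hours=24):
        return "certification_age_exceeded"
    if run.attempts >= 3:
        return "retry_budget_exhausted"
    if run.in_flight:
        return "cert_in_flight"
    return None


now = datetime(2025, 1, 2, tzinfo=timezone.utc)
fresh = RunRecord("u1", "production_ready", now - timedelta(hours=1))
print(preflight_rejection(fresh, "u1", now))
```

Fingerprint and payload-completeness checks are left out here because they depend on the submitted code rather than on run metadata.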
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Description discloses extensive behavioral traits beyond annotations: atomicity, timeout handling, retry budget, eligibility gates, typed rejections, payload completeness, pre-LLM rejections, and outcomes. Annotations are consistent (readOnlyHint=false, destructiveHint=false).
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is very long (over 1500 words). While comprehensive and front-loaded with core purpose and recovery, it could be more concise. Structure includes lists and bold text but is verbose.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool's complexity and output schema, the description covers all aspects: purpose, recovery, eligibility, rejections, payload completeness, outcomes. It is fully self-contained.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema covers both parameters (code and run_id) with detailed descriptions. Description adds context like when to omit code (if stored) and how run_id is obtained, significantly enhancing meaning.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool's purpose as 'second-pass adversarial certification of an architect.validate run that scored production_ready (A or B first-pass tier).' It distinguishes from sibling tools like architect.validate (first pass) and mentions specific conditions and outcomes.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Provides explicit guidance on when to use (only after tier=production_ready, code not touched) and when not to use (tier=draft/emerging/not_applicable). Includes recovery path on timeout and alternatives like re-validation.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
architect.validate: Validate Agent Architecture (A)
Pro/Teams — first-pass doctrine review of agentic code/workflow against the 10-principle Agentic AI Blueprint. ON CLIENT TIMEOUT — DO NOT RETRY THIS TOOL. Long-running LLM call (60-180s typical); MCP clients commonly close the call before the server returns. Retrying re-runs the 60-180s LLM call from scratch and burns compute. RECOVERY: the run_id is emitted in the FIRST notifications/progress event at t=0s (before the LLM call begins) — capture it. On timeout, call me.validation_history(run_id='<that-id>') to fetch the persisted result; the server-side run completes independently within a 20-minute budget. Edge case: if the transport dropped before the first progress notification (very rare; sub-second window), call me.validation_history(repository='<same value you passed here>') to find your most recent run. TASK-AUGMENTED INVOCATION (MCP 2025-11-25, SEP-1686): clients that advertise the tasks capability can task-augment this call by including task: {ttl: <ms>} inside the JSON-RPC request's params (NOT as a tool argument; alongside arguments, _meta, etc.). The server returns a CreateTaskResult immediately (taskId equals the run_id above) and runs the validation in the background. Spec-correct long-running pattern: poll via tasks/get for state, fetch the terminal payload via tasks/result, listen for notifications/tasks/status for push updates, and cancel via tasks/cancel. _meta.progressToken from the original request stays valid for the entire task lifetime. Sync (non-augmented) calls behave exactly as before, backwards-compatible by construction. The me.validation_history(run_id=...) recovery path remains the canonical recovery handle for clients that don't yet advertise the tasks capability. 
Returns code_classification (autonomous_agentic_workflow vs non_agentic_component), per-principle findings (verdict, severity_score 0-100, severity_class, code-cited evidence, recommendation), severity-weighted readiness (score|null, grade|null, tier ∈ {production_ready, emerging, draft, not_applicable}), recommended examples, reproducibility envelope (model, seed, doctrine_fingerprint, prompt_template_fingerprint), persistence_status with shareable run_id/badge_url/review_url. WHEN TO CALL: the user wants a governance audit, readiness score, or production_ready badge on an agent/workflow they just built or changed. WHEN NOT TO CALL: non-agentic plumbing (math utilities, type aliases, event-loop helpers, single-shot request/response handlers) returns tier=not_applicable with score=null/grade=null — that's not a failure, the doctrine simply doesn't grade non-agentic code, and architect.certify will refuse with not_agentic_component. Submit the OWNING agentic workflow instead. BEHAVIOR: long-running LLM call (~60-180s typical at high reasoning effort, single-pass; server-side budget 20 min). Mints run_id at t=0; first notifications/progress event carries run_id as recovery handle; keepalive every 30s. Persists ValidationRun + UserValidationRun + AIValidationRunLog + LLMUsageLog atomically; on rollback, badge/review URLs are stripped. Auth: Bearer , Pro/Teams plan. UK/EU residency; transient OpenAI processing (no-training); prompt-injection in code is inert. INPUTS: send FULL file contents verbatim as implementation_context (NO truncation, NO ... placeholders, NO comment removal — the architect treats your ... as literal code and hallucinates bugs that don't exist). If too large, split into MULTIPLE calls scoped by file/module; never truncate one call. Pass repository="" to group runs into a project trend. Pass private_session=true to bypass server-side logging (persistence + recovery disabled). 
focus_area narrows scope; unmatched focus_area fails explicitly rather than silently widening. PAYLOAD COMPLETENESS (load-bearing if you intend to architect.certify this run): the validate first-pass is permissive — it scores on doctrine alignment + structural patterns visible in the submitted code. Cert's adversarial second-pass is rigorous — it scores on cert-payload-completeness as well as code correctness. A run that scores 100/A at validate can cert-reject pre-LLM with payload_incomplete when imported modules' surfaces aren't visible. To validate with INTENT TO CERT, also bundle verbatim public-surface stubs for every imported module: from sqlalchemy.exc import SQLAlchemyError → include a stub class; from app.db import models → include a class models: namespace stub with the columns/methods the code references; module-level imports of dataclass, Literal, json, datetime, timezone MUST also be in the payload (cert correctly catches when they're omitted — the module would NameError on import as submitted). 'Submit Like Production': the payload should be the code as it would actually run. TWO COMPLETENESS AXES. (1) IMPORTS: stub the public surface of every dependency (above). (2) ENFORCEMENT BRANCHES: the code under cert itself (approval gates, policy checks, recovery paths) must be the REAL logic, fully written. A placeholder body (# ... execute approved action ..., pass # TODO, a bare ...) is graded as a MISSING control, not shorthand; cert scores what would actually run. Never sketch the agent you are certifying. Empirically reconfirmed PR #157 iter8 → iter9 cert downgrades. SCORE VARIANCE DISCLOSURE (anomaly #10 — empirically documented): validate scores are POINT ESTIMATES with an observed empirical variance band of ~20-67 pts on BYTE-IDENTICAL input. 
Runs against the same repository, same code, same deterministic seed (the seed is derived from input — same input → same seed) can produce materially different scores AND different top-blocker rankings, because OpenAI's reasoning models at reasoning_effort=high are not strictly deterministic even with the seed parameter pinned. The reproducibility_mode='best_effort' field on every response is the platform's honest disclosure of this property. For decisions where stability matters more than speed, call architect.validate_consensus (N=3-5 aggregated, median verdict + per-principle stability metrics) instead — collapses the variance, surfaces unstable principles explicitly. A single validate run is a single roll; consensus is the right tool when one score isn't enough. ITERATION LOOP — repository keying. Pass the SAME repository value across calls to chain iteration rounds; the validator auto-resolves the most recent prior run on (user, repository, scope) as prior_run_baseline and the LLM grades the new submission with iteration context (per-principle severity deltas surface in the response). Changing the repository string between calls — even subtly with an iter-2 suffix — silently severs the chain and yields a fresh blind first-shot. Round numbering belongs in task or commit messages, never in repository. See the architect-validation-orchestration skill in the agent-asset pack for the full validate → consensus → certify sequence. VERIFICATION LAYERS (the two-layer doctrine this platform practices on itself): validate verifies DOCTRINE ALIGNMENT against the 10-principle Blueprint — design patterns, hand-off explicitness, operational-state inspectability, race/blocker handling at the architectural level. validate does NOT guarantee runtime correctness. cert verifies PAYLOAD COMPLETENESS and runs an adversarial second pass over the submitted code — catches production_blockers the first pass missed, name-errors on import, missing module surfaces, etc. 
cert does NOT verify runtime correctness either. Passing validate is a NECESSARY condition for production_ready, not a sufficient one. Runtime correctness (does this actually execute and behave?) is verified at the THIRD layer — your tests, types, walks. The platform's own recursive-integrity practice: every PR runs validate against its own primitives, then cert. Real bugs surfaced via this practice in PR #157 — NULL-UUID false-positive (iter3) and tie-breaker mismatch (iter5) — that 25 unit tests had missed. Two-layer verification is the discipline, not 'either/or'. TYPED FAILURES: timed_out, rate_limited, dependency_unavailable, schema_mismatch (each carries retryable + next_action). NEXT STEP: if tier=production_ready (A or B grade), the response carries certification_status='not_evaluated' — call architect.certify(run_id, code) to mint the certified production_ready badge (separate ~60-150s adversarial review, eligibility-gated). See Payload Completeness above for the common pre-cert pitfall.
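The task-augmented invocation places `task` next to `arguments` inside `params`, not inside the tool arguments. A minimal sketch of building such a JSON-RPC request is shown below; the request id, the `repository` value, and the 20-minute TTL are illustrative assumptions, and `build_task_augmented_call` is a hypothetical helper.

```python
import json


def build_task_augmented_call(request_id: int, arguments: dict, ttl_ms: int) -> dict:
    """Build a tools/call request with task augmentation as a sibling of arguments."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "architect.validate",
            "arguments": arguments,
            "task": {"ttl": ttl_ms},  # sibling of `arguments`, not a tool argument
        },
    }


request = build_task_augmented_call(
    1,
    {"implementation_context": "<full file contents>", "repository": "my-agent"},
    ttl_ms=20 * 60 * 1000,
)
print(json.dumps(request, indent=2))
```

Clients that do not advertise the tasks capability would omit the `task` key and rely on the `me.validation_history` recovery path instead.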
| Name | Required | Description | Default |
|---|---|---|---|
| task | No | What the agent or workflow is trying to accomplish. Adds evaluation context. | |
| files | No | List of file paths relevant to the implementation context. | |
| goals | No | Specific safety or quality goals to evaluate against (e.g. 'prevent irreversible actions', 'explicit approvals'). | |
| language | No | Programming language of the code being evaluated (e.g. 'python', 'typescript'). | |
| focus_area | No | Narrow the evaluation to a specific principle cluster or slug (e.g. 'delegation', 'visibility', 'establish-trust-through-inspectability'). | |
| repository | No | Iteration key. SAME value across calls auto-resolves the most recent prior run as `prior_run_baseline` for iteration-aware grading (per-principle severity deltas, regressions/improvements). CHANGING the value (even subtly with an `iter-2` suffix) silently severs the chain and yields a fresh blind first-shot. Round numbering belongs in `task`, not here. Empirical evidence of why anchoring matters: PR #157 iter1 33/F vs iter2 100/A on byte-identical baseline-race primitives (+67 spread); invoice-payment-manager #158 38/F vs #159 74/C (+36 spread) — same code, score variance from non-deterministic LLM at reasoning_effort=high; the baseline anchor collapses this onto a stable arc. | |
| example_limit | No | Maximum number of curated examples to include in recommendations. | |
| private_session | No | Set to true to disable logging AND prior-run anchoring AND run_id recovery for this call. Use for private one-shots that don't participate in the iteration arc. Default false. | |
| implementation_context | Yes | The artifact under review. SEND FULL FILE CONTENTS VERBATIM — the architect cites per-line evidence (identifiers, branch ordering, structural choices); any compression destroys evidence and produces hallucinated findings on code that isn't there. CONCRETE DON'TS: do NOT replace docstrings/comments with `...`; do NOT condense multi-line statements; do NOT replace dict/set comprehensions with `{...}`; do NOT remove explanatory comments to save tokens. If the file is large, split into MULTIPLE architect.validate calls scoped by file/module — never truncate one call. Architecture summaries (high-level prose) accepted ONLY for greenfield (no code yet); never as a substitute for code that already exists. | |
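The "Submit Like Production" guidance could look roughly like this for a small Python payload. The stubs stand in for imported dependencies, while the approval gate is written out in full. `Invoice`, `Decision`, `pay_invoice`, and the `commit` callable are hypothetical names invented for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Literal


class SQLAlchemyError(Exception):
    """Stub for sqlalchemy.exc.SQLAlchemyError (public surface only)."""


class models:
    """Namespace stub for a hypothetical app.db.models module."""

    class Invoice:
        def __init__(self, invoice_id: int, approved: bool) -> None:
            self.id = invoice_id
            self.approved = approved


@dataclass
class Decision:
    status: Literal["executed", "blocked", "failed"]
    invoice_id: int


def pay_invoice(invoice: "models.Invoice", commit: Callable[[object], None]) -> Decision:
    # The approval gate is real logic, not a "# ..." placeholder.
    if not invoice.approved:
        return Decision(status="blocked", invoice_id=invoice.id)
    try:
        commit(invoice)
    except SQLAlchemyError:
        # Recovery path is also spelled out rather than elided.
        return Decision(status="failed", invoice_id=invoice.id)
    return Decision(status="executed", invoice_id=invoice.id)
```

The stubs cover only the names the code references; the enforcement branches in `pay_invoice` are what a reviewer would actually grade.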
Output Schema
| Name | Required | Description |
|---|---|---|
| No output parameters | | |
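Because placeholder bodies are graded as missing controls, a quick local scan before submitting may help. The sketch below is a heuristic only: the patterns are assumptions chosen to match the examples in the description (`# ...`, `pass  # TODO`, a bare `...`), not the server's actual checks, and a bare `...` can be legitimate in stub or protocol bodies.

```python
import re

# Heuristic patterns only; they mirror the placeholder examples in the description.
_PLACEHOLDER_PATTERNS = [
    re.compile(r"^\s*\.\.\.\s*$"),              # a bare ... line
    re.compile(r"^\s*#\s*\.\.\."),              # a "# ..." comment
    re.compile(r"^\s*pass\s*#\s*TODO", re.I),   # pass  # TODO
]


def find_placeholders(source: str) -> list[int]:
    """Return 1-based line numbers that look like elided code."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        if any(p.search(line) for p in _PLACEHOLDER_PATTERNS):
            hits.append(number)
    return hits


sample = (
    "def gate(action):\n"
    "    # ... execute approved action ...\n"
    "    pass  # TODO\n"
    "    ...\n"
    "x = 1\n"
)
print(find_placeholders(sample))
```

Any reported line is a prompt to write the real logic or confirm the ellipsis is intentional before calling the tool.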
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Discloses long-running LLM call (60-180s), timeout recovery with run_id, task-augmented invocation, persistence, auth requirements, data residency, score variance, and non-determinism. Annotations already indicate write operation (readOnlyHint=false) but description adds critical operational context.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is very long (over 1200 words) and covers many topics, which harms conciseness. While structured with clear sections and front-loaded with the core purpose, it would benefit from tighter editing to remove redundancy.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool's complexity (9 parameters, multiple behaviors, output schema present), the description is extremely thorough, covering recovery, task-augmentation, iteration, verification layers, typed failures, and next steps. It leaves no significant gaps.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100% for all 9 parameters, but description adds significant extra context: warnings against truncation for implementation_context, empirical evidence for repository keying, and clarification for private_session. Goes well beyond the schema descriptions.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool's purpose: 'first-pass doctrine review of agentic code/workflow against the 10-principle Agentic AI Blueprint.' It uses specific verbs ('review', 'validate') and distinguishes from siblings like architect.certify and architect.validate_consensus.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Provides explicit 'WHEN TO CALL' and 'WHEN NOT TO CALL' guidance, including alternatives like architect.validate_consensus for stability. Also covers iteration loop, recovery paths, and conditions for certification intent.
architect.validate_consensus: Validate Agent Architecture (Consensus Mode) (A)
Pro/Teams — N-shot CONSENSUS doctrine review of agentic code. ON CLIENT TIMEOUT — DO NOT RETRY THIS TOOL. Long-running (~80-120s for N=3 parallel LLM calls); MCP clients often close the call before the server returns. Retrying re-runs N × 60-180s LLM calls from scratch and burns N× compute. RECOVERY: same heartbeat pattern as architect.validate — the run_id is emitted in the FIRST progress event at t=0s (before LLM children fire); on timeout, call me.validation_history(run_id='<that-id>') to fetch the persisted consensus envelope. Runs N parallel architect.validate calls with private_session=True, then aggregates them to a per-principle MODE verdict + median severity + per-principle stability + score range/stdev. Returns one ConsensusValidationResponse with the headline median score, the honest variance band, and a representative full ValidationResponse (the child whose score is closest to the median). WHEN TO CALL: the user wants an HONEST first-pass score on agentic code, with the architect's variance surfaced. The single-shot architect.validate re-asserts the prior persisted run's verdict via baseline-anchor injection — same code can score 60/C anchored vs 98/A unanchored. Consensus mode is the unanchored honest read. WHEN NOT TO CALL: when you NEED the iteration delta against a prior run (regressions/improvements panel) — for that, call architect.validate which keeps baseline injection on. CHAIN RESUME: each child runs with private_session=True (no anchor) on purpose, but the CONSOLIDATED outer row IS persisted with lifecycle_status='completed' — the next single-shot architect.validate on the same repository auto-resolves it as prior_run_baseline. Consensus checkpoint becomes the new anchor. See the architect-validation-orchestration skill in the agent-asset pack for the full validate → consensus → certify sequence. BEHAVIOR: N (default 3, max 5) parallel LLM calls run concurrently; wallclock ~80-120s for N=3 (max child latency, not sum). Cost = N × LLM bill. 
Each child runs with private_session=True so the doctrine prompt's prior-run baseline injection is suppressed (no anchor bias). One CONSOLIDATED UserValidationRun row is written carrying the consensus envelope; the N children themselves do NOT persist (private_session contract). AUTH: Bearer , Pro/Teams plan. Same paid-plan gate as architect.validate. INPUTS: same shape as architect.validate. n is the only extra arg (range 2..5). private_session is implicit (always true for children); the OUTER consolidated row IS persisted unless the tool itself is called inside another private context — but no such wrapper exists today. OUTPUT: response carries score_consensus_median (headline), score_stdev (honest uncertainty), score_range (min, max), mode_stability_min_pct (the cert-eligibility gate's input — ≥ 80% means the consensus is stable), per_principle (mode + distribution + severity median per principle), and representative_response (the closest-to-median child's full ValidationResponse so existing UI components render unchanged). TYPED FAILURES: same as architect.validate (timed_out, rate_limited, dependency_unavailable). Plus consensus-specific: consensus_quorum_failed when fewer than 2 child runs succeeded (≥ 2 required to compute a meaningful median).
| Name | Required | Description | Default |
|---|---|---|---|
| n | No | Number of parallel child runs. Default 3 (the variance signal is visible at N=3; cost = 3× LLM bill). Capped server-side by Settings.consensus_n_max (default 5). | |
| task | No | What the agent or workflow is trying to accomplish. | |
| files | No | List of file paths relevant to the implementation. | |
| goals | No | Specific safety or quality goals to evaluate against. | |
| language | No | Programming language of the code (e.g. 'python'). | |
| focus_area | No | Optional: narrow the review to a principle cluster or slug. | |
| repository | No | Iteration key. Consensus children all run unanchored (`private_session=True`), but the consolidated row IS persisted under this key — discoverable as prior baseline for the next single-shot `architect.validate`. Same value across calls keeps the iteration arc inspectable. | |
| example_limit | No | Max curated examples per child run. | |
| implementation_context | Yes | The artifact under review. SEND FULL FILE CONTENTS VERBATIM — same constraint as architect.validate. Truncation produces hallucinated findings on code that isn't there. | |
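The aggregation described above (per-principle mode, stability, median score, range, and a representative child) can be sketched in a few lines. The child shape (`score`, `verdicts`) is a hypothetical simplification, per-principle severity medians are omitted, and the use of population standard deviation is an assumption since the description does not say which estimator is used.

```python
from collections import Counter
from statistics import median, pstdev


def aggregate(children: list[dict]) -> dict:
    """Aggregate child runs shaped like {"score": number, "verdicts": {principle: verdict}}."""
    if len(children) < 2:
        # Mirrors the quorum rule: at least 2 successful children are required.
        raise ValueError("consensus_quorum_failed")
    scores = [c["score"] for c in children]
    mid = median(scores)
    per_principle = {}
    for principle in children[0]["verdicts"]:
        votes = Counter(c["verdicts"][principle] for c in children)
        mode, count = votes.most_common(1)[0]
        per_principle[principle] = {
            "mode": mode,
            "stability_pct": 100.0 * count / len(children),
        }
    # Representative child: the one whose score is closest to the median.
    representative = min(children, key=lambda c: abs(c["score"] - mid))
    return {
        "score_consensus_median": mid,
        "score_stdev": pstdev(scores),
        "score_range": (min(scores), max(scores)),
        "mode_stability_min_pct": min(p["stability_pct"] for p in per_principle.values()),
        "per_principle": per_principle,
        "representative_response": representative,
    }
```

With N=3, a single dissenting child already drops a principle's stability to about 67%, which is why the minimum stability is surfaced alongside the median.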
Output Schema
| Name | Required | Description |
|---|---|---|
| No output parameters | | |
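The two consensus-specific cert rejections follow directly from the consensus output fields. A minimal sketch of that gate, assuming the thresholds quoted in the descriptions (median below 75, stability below 80%) and an arbitrary check order, might look like this; `consensus_cert_rejection` is a hypothetical helper.

```python
from typing import Optional


def consensus_cert_rejection(
    score_consensus_median: float,
    mode_stability_min_pct: float,
) -> Optional[str]:
    """Return the consensus-specific rejection reason, or None if both gates pass."""
    if score_consensus_median < 75:
        return "cert_consensus_score_below_threshold"
    if mode_stability_min_pct < 80:
        return "cert_consensus_unstable_blocker"
    return None


print(consensus_cert_rejection(82, 100.0))
```

A result of `None` only means these two gates pass; the other eligibility conditions for architect.certify still apply.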
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
The description discloses all critical behavioral traits: long-running (~80-120s), concurrent N parallel LLM calls, cost, private session for children, persistence of consolidated row, quorum failure, and details of output structure. Annotations (readOnlyHint=false, etc.) are not contradicted; the description enriches them.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is long due to complexity, but well-structured: critical timeout info first, then usage guidelines, behavior, inputs, outputs, failures. Every sentence adds value. Slightly more conciseness could be achieved, but it remains clear and front-loaded.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool's complexity, the description covers all necessary context: timeout recovery, prerequisite (Pro/Teams plan), lifecycle (chain resume), typed failures, quorum condition, and output schema explanation. It also references the orchestration skill. An agent has enough info to use the tool correctly.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema description coverage is 100%, so baseline is 3. The description adds significant value beyond the schema: it explains the purpose of 'n' (cost, variance signal), 'repository' (iteration key), and 'implementation_context' (send full contents). It also clarifies the default and max for 'n'.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool's purpose: 'N-shot CONSENSUS doctrine review of agentic code.' It distinguishes itself from the sibling tool 'architect.validate' by explaining that consensus mode provides an unanchored honest read, while the single-shot version anchors to prior runs. The title 'Validate Agent Architecture (Consensus Mode)' reinforces this.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description explicitly states when to call (user wants an honest first-pass score with variance) and when not to call (when iteration delta against a prior run is needed, in which case use 'architect.validate'). It also provides timeout recovery instructions and references the orchestration skill for the full sequence.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
assets.list · List Agent Assets · A · Read-only · Idempotent
Public — list downloadable doctrine and agent asset artifacts (skill packs, rule packs, MCP setup snippets) the user can drop into their AI coding tool to import the Blueprint as native skill/rule files. Returns a list of assets with name, format (one of: zip / md / markdown / mdc / json / toml / text — the full vocabulary), pack_version, download_url, and platform target (Claude Code, Cursor, Codex, Gemini, Qwen). The response also carries count (length of assets) for symmetry with principles.list / clusters.list / guides.list. WHEN TO CALL: the user asks how to bring the Blueprint into their coding agent, or wants to install it as a local skill/rule file. WHEN NOT TO CALL: for the live MCP tools themselves — those are already available through this server. For doctrine content, prefer principles.list/get and guides.list/get. BEHAVIOR: read-only, idempotent, no auth required. Asset artefacts are regenerated on every deploy from the canonical doctrine.
| Name | Required | Description | Default |
|---|---|---|---|
| No parameters | | | |
Output Schema
| Name | Required | Description |
|---|---|---|
| No output parameters | | |
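A client might filter the returned list by platform before offering a download. The sketch below assumes a response shaped like the description (an `assets` list plus `count`); the key name `platform`, the sample values, and the helper `assets_for_platform` are hypothetical, since the page does not publish the output schema.

```python
# Format vocabulary as listed in the tool description.
KNOWN_FORMATS = {"zip", "md", "markdown", "mdc", "json", "toml", "text"}


def assets_for_platform(response, platform):
    """Return assets whose platform target matches `platform`, ignoring case."""
    wanted = platform.casefold()
    return [
        asset
        for asset in response.get("assets", [])
        if str(asset.get("platform", "")).casefold() == wanted
    ]


# Hypothetical payload; names, versions, and URLs are placeholders.
sample = {
    "count": 2,
    "assets": [
        {"name": "example-skill-pack", "format": "zip", "pack_version": "0.0.0",
         "download_url": "https://example.invalid/skill-pack.zip",
         "platform": "Claude Code"},
        {"name": "example-rule-pack", "format": "mdc", "pack_version": "0.0.0",
         "download_url": "https://example.invalid/rules.mdc",
         "platform": "Cursor"},
    ],
}

if __name__ == "__main__":
    for asset in assets_for_platform(sample, "cursor"):
        print(asset["name"], asset["download_url"])
```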
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnly, idempotent, and non-destructive hints. The description adds valuable behavioral context: it is public, requires no auth, and assets are regenerated on every deploy from canonical doctrine. No contradictions.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is comprehensive and well-structured with clear sections (public, returns, when to call/not call, behavior). While slightly verbose, every sentence adds value. Front-loading key info aids quick comprehension.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given no parameters and an output schema likely covering basic structure, the description thoroughly explains return format, platform targets, symmetric count field, and regeneration behavior. It leaves no ambiguity for agent selection and invocation.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
With 0 parameters, schema coverage is 100%. The description adds meaning by detailing the response structure (list of assets with name, format, pack_version, download_url, platform target, and count) beyond what the empty schema provides.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool lists downloadable doctrine and agent asset artifacts for import into AI coding tools. It uses specific verbs ('list downloadable') and distinguishes from sibling tools like principles.list and guides.list by noting they are preferred for doctrine content.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description explicitly provides 'WHEN TO CALL' (user asks about bringing Blueprint into coding agent) and 'WHEN NOT TO CALL' (for live MCP tools or doctrine content, preferring principles.list/get and guides.list/get), giving clear context and alternatives.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
clusters.get · Get Cluster · A · Read-only · Idempotent
Get one principle cluster by stable slug. Returns the cluster definition, shared rationale, and the full set of member principles (slug + title) so the caller can pivot into principles.get without a second list call. WHEN TO CALL: the user has already named a specific cluster (e.g. 'delegation', 'visibility', 'trust', 'orchestration') OR you have a slug from a prior clusters.list / principles.list response and need its full definition + member principles. The response embeds member principle slugs + titles already, so DO NOT loop principles.get over each member to get a cluster overview — read the response. WHEN NOT TO CALL: the user is describing a topic, failure mode, or keyword in natural language (call principles.search instead); the user wants to discover which clusters exist (call clusters.list); the user wants the definition of one specific principle (call principles.get directly). Idempotent + cacheable per slug. Returns 404-shaped error_payload on unknown slug — the slug must match exactly the value emitted by clusters.list, with no normalization.
| Name | Required | Description | Default |
|---|---|---|---|
| slug | Yes | Stable slug of the principle cluster (e.g. 'delegation', 'visibility', 'trust', 'orchestration'). | |
Output Schema
| Name | Required | Description |
|---|---|---|
| No output parameters | | |
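For readers wiring this up by hand, a call to clusters.get travels as an MCP `tools/call` JSON-RPC request. The sketch below only builds the message body; the helper name `tools_call` is hypothetical, and transport details (Streamable HTTP headers, session initialization) are deliberately left out.

```python
import json


def tools_call(name, arguments, request_id=1):
    """Build an MCP tools/call JSON-RPC request body (transport not included)."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


# The slug is passed through untouched: the description says it must match the
# value emitted by clusters.list exactly, with no normalization.
req = tools_call("clusters.get", {"slug": "delegation"})

if __name__ == "__main__":
    print(json.dumps(req, indent=2))
```

Because the server does no normalization, a client should avoid lowercasing or trimming a slug it received from clusters.list.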
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already provide readOnlyHint, idempotentHint, destructiveHint. Description adds that it returns 404 for unknown slugs, slug must match exactly, and response embeds member slugs. This complements annotations without contradicting them.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Description is moderately long but well-structured: purpose first, then usage guidelines. Every sentence serves a purpose. Could be slightly more concise, but front-loading is effective.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
With output schema present, low parameter count, and rich annotations, the description covers return values (cluster definition, rationale, member principles), error behavior (404), and exact matching rules. Fully complete for this tool's complexity.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Only one parameter with 100% schema coverage. Schema already provides examples. Description adds critical nuance: slug must match exactly the value from clusters.list, no normalization. This adds meaning beyond the schema.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
Description uses specific verb 'Get' with resource 'principle cluster' identified by slug. Clearly differentiates from siblings: clusters.list (list all) and principles.get (get a principle). The response includes member principles, avoiding redundant calls.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Explicit WHEN TO CALL (the user named a cluster, or you already have a slug) and WHEN NOT TO CALL (natural language -> principles.search, discovery -> clusters.list, single principle definition -> principles.get). Also warns against looping principles.get over each member once the response is in hand, which is a concrete anti-pattern.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
clusters.list · List Clusters · A · Read-only · Idempotent
List all principle clusters with their stable slugs and linked principle titles. Use this to discover which clusters exist before drilling in with clusters.get or filtering principles.list by cluster. Prefer clusters.get when you already know the cluster slug and need full detail.
| Name | Required | Description | Default |
|---|---|---|---|
| No parameters | | | |
Output Schema
| Name | Required | Description |
|---|---|---|
| No output parameters | | |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already provide readOnlyHint, idempotentHint, and destructiveHint. The description adds value by specifying return content (stable slugs, linked principle titles), going beyond the structured data. No contradiction.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Three sentences, each serving a distinct purpose: purpose, usage guidance, alternative recommendation. No redundancy or unnecessary words.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
With no parameters, a clear description, comprehensive annotations, and an existing output schema, the description covers all needed context for an agent to use the tool correctly.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
There are no parameters, so schema coverage is 100% by default. The description does not need to add parameter-level detail; baseline score of 4 applies as there are no param information gaps.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description states a specific verb 'List' and resource 'principle clusters', specifies return contents (stable slugs and linked principle titles), and distinguishes from sibling tools like clusters.get and principles.list.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Explicitly states when to use this tool (discover clusters before drilling in) and when to prefer alternatives (clusters.get when slug known, principles.list for filtering), providing clear usage boundaries.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
examples.get · Get Example · A · Read-only · Idempotent
Get one curated example by stable slug. Returns title, summary, source-code links, principle coverage (the principle slugs the example demonstrates), difficulty, library/framework, and implementation notes. Use this when you already have the slug from examples.search, a principles.get response, or a guide cross-link; prefer examples.search when filtering by topic / principle / difficulty / library; prefer guides.get when the caller wants a full walkthrough rather than a single reference example. Returns error_payload on unknown slug.
| Name | Required | Description | Default |
|---|---|---|---|
| slug | Yes | Stable slug of the curated example (e.g. 'agents-building-blocks-5-control'). | |
Output Schema
| Name | Required | Description |
|---|---|---|
| No output parameters | | |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint, destructiveHint, and idempotentHint. The description adds return fields and error behavior on unknown slug, which is useful but not essential given the annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Four concise sentences. The first two cover purpose and outputs; the last two cover usage guidelines and error behavior. No wasted words, front-loaded.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the simple input (one param), presence of output schema, and low complexity, the description is fully complete, including error handling and reference to sibling tools.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100% with a description for 'slug'. The description adds 'stable slug' and a concrete example, plus context on how the slug is obtained (from other tools), enhancing meaning.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states 'Get one curated example by stable slug', specifying the verb, resource, and key parameter. It distinguishes from siblings by explicitly mentioning examples.search and guides.get.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description provides explicit when-to-use and when-not-to-use guidance: 'Use this when you already have the slug...prefer examples.search when filtering...prefer guides.get when...'. This leaves no ambiguity.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
examples.search · Search Examples · A · Read-only · Idempotent
Search curated examples by free-text query, ranked by relevance, with optional filters: principle_ids (only examples covering those principles), difficulty (beginner/intermediate/advanced), library (e.g. 'langgraph', 'openai'). Returns each match's slug, title, summary, principle coverage, difficulty, library, and source-code link — slug is the handle examples.get hydrates. Default limit 5, capped server-side. Use this when the user describes a use case, technique, or library and wants matching examples; prefer examples.get when you already have the slug; prefer guides.search when the user wants a full walkthrough; prefer principles.search when the user wants doctrine guidance, not an implementation.
| Name | Required | Description | Default |
|---|---|---|---|
| limit | No | Maximum number of results to return. Capped at server maximum. | |
| query | Yes | Free-text search query matched against example title, summary, and metadata. | |
| library | No | Filter by library or framework name (e.g. 'langgraph', 'openai', 'anthropic'). | |
| difficulty | No | Filter by difficulty level. | |
| principle_ids | No | Filter to examples that cover these principle IDs. | |
Output Schema
| Name | Required | Description |
|---|---|---|
| No output parameters | | |
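The filter rules above can be checked client-side before the call goes out. This is a sketch under assumptions: `build_search_args` is a hypothetical helper, the difficulty values are taken from the description, and `principle_ids` is passed through as a plain list because the page does not state the element type.

```python
DIFFICULTIES = {"beginner", "intermediate", "advanced"}


def build_search_args(query, *, limit=None, library=None,
                      difficulty=None, principle_ids=None):
    """Assemble examples.search arguments, omitting filters that were not set."""
    if not query or not query.strip():
        raise ValueError("query is required")
    if difficulty is not None and difficulty not in DIFFICULTIES:
        raise ValueError(f"difficulty must be one of {sorted(DIFFICULTIES)}")
    args = {"query": query}
    if limit is not None:
        args["limit"] = limit  # the server applies its own cap regardless
    if library is not None:
        args["library"] = library
    if difficulty is not None:
        args["difficulty"] = difficulty
    if principle_ids:
        args["principle_ids"] = list(principle_ids)
    return args


if __name__ == "__main__":
    print(build_search_args("human approval before tool calls",
                            library="langgraph", difficulty="beginner"))
```

Leaving `limit` unset lets the documented default of 5 apply.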
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint=true, so the description adds value by detailing return fields, default limit (5), server-side capping, and relevance ranking, without contradicting annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Description is well-structured and front-loaded, but slightly verbose. Every sentence adds value, though some redundancy exists with schema descriptions.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool's complexity and available output schema, the description is complete: it explains return fields, limit behavior, and usage guidance, with no gaps.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema description coverage is 100%, so baseline is 3. The description adds context (e.g., meaning of principle_ids) but does not significantly enhance beyond schema descriptions.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool searches curated examples by free-text query with optional filters, and distinguishes from sibling tools like examples.get, guides.search, and principles.search.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Explicit usage guidelines are provided: when to use this tool versus alternatives (e.g., prefer examples.get when slug is known, guides.search for walkthroughs, principles.search for doctrine).
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
guides.get · Get Application Guide · A · Read-only · Idempotent
Get a full application guide by its stable slug (e.g. 'security-application', 'observable-evaluation'). Returns sections, action items, and linked principles. Use this when you already have the guide slug from guides.list or guides.search. Prefer guides.search when the user describes a topic in natural language; prefer guides.list when you need the full inventory.
| Name | Required | Description | Default |
|---|---|---|---|
| slug | Yes | Stable slug of the application guide (e.g. 'security-application', 'observable-evaluation'). | |
Output Schema
| Name | Required | Description |
|---|---|---|
| No output parameters | | |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint=true, idempotentHint=true, destructiveHint=false. The description adds the behavioral context of what the response returns (sections, action items, linked principles), which is useful but does not disclose anything beyond what annotations already cover. No contradictions.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Four sentences, front-loaded with the core purpose, then return contents and usage guidelines. Every sentence contributes meaningfully; no redundancy.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
The tool has a single required parameter with full schema documentation, an output schema (so return format is covered), and clear sibling differentiation. The description is complete given the complexity.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100% with a clear description of the 'slug' parameter. The description adds value by noting that the slug comes from guides.list or guides.search and provides example slugs, which is helpful beyond the schema alone.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the verb ('Get'), the resource ('full application guide'), and the required input ('by its stable slug'). It also lists what the response contains (sections, action items, linked principles), and distinguishes from siblings guides.list and guides.search by naming them explicitly.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Explicitly says when to use this tool ('when you already have the guide slug from guides.list or guides.search') and when to prefer alternatives ('Prefer guides.search when the user describes a topic in natural language; prefer guides.list when you need the full inventory'). Provides clear decision criteria.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
guides.list · List Application Guides · A · Read-only · Idempotent
List application guides that show how Blueprint principles apply to engineering challenges (security, evaluation, observability, etc.). Use this to discover which guides exist before drilling in. Prefer guides.search when the user describes a topic or failure mode in natural language. Prefer guides.get when you already know the guide slug and need full detail.
| Name | Required | Description | Default |
|---|---|---|---|
| No parameters | | | |
Output Schema
| Name | Required | Description |
|---|---|---|
| No output parameters | | |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already provide readOnlyHint=true, idempotentHint=true, destructiveHint=false. Description confirms the listing nature but adds no further behavioral context beyond the annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Four sentences that front-load purpose, then give usage guidance. Every sentence is essential, with no wasted words.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given no parameters and annotations covering safety, the description completely informs the agent about tool purpose and when to use alternatives.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
No parameters exist; schema description coverage is 100% trivially. Baseline 4 is appropriate as description does not need to add parameter information.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description uses specific verb 'List' and resource 'application guides' with clear scope 'how Blueprint principles apply to engineering challenges'. It distinguishes from siblings by referencing guides.search and guides.get.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Explicitly states when to use this tool ('discover which guides exist before drilling in') and when to prefer alternatives: guides.search for natural language topics, guides.get when slug is known.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
guides.search · Search Application Guides · A · Read-only · Idempotent
Search application guides by free-text query, matched against section answers and action items. Use this when the user describes an engineering challenge (security review, evaluation harness, observability) and wants matching guides. Prefer guides.get when you already have the guide slug; prefer guides.list when you need the full inventory.
| Name | Required | Description | Default |
|---|---|---|---|
| limit | No | Maximum number of results to return. Capped at server maximum. | |
| query | Yes | Free-text search query matched against all guide content including section answers and action items. | |
Output Schema
| Name | Required | Description |
|---|---|---|
| No output parameters | | |
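The three guides tools share one selection rule: a known slug goes to guides.get, a natural-language topic goes to guides.search, and anything else falls back to the full inventory. A minimal sketch of that rule follows; `pick_guides_tool` is a hypothetical helper, not part of the server.

```python
def pick_guides_tool(slug=None, query=None):
    """Choose a guides.* tool and its arguments from what the caller already knows."""
    if slug:
        # A slug obtained from guides.list or guides.search: fetch the full guide.
        return "guides.get", {"slug": slug}
    if query:
        # A topic or engineering challenge described in natural language.
        return "guides.search", {"query": query}
    # Nothing specific yet: list the inventory.
    return "guides.list", {}


if __name__ == "__main__":
    print(pick_guides_tool(slug="security-application"))
    print(pick_guides_tool(query="evaluation harness for tool-calling agents"))
    print(pick_guides_tool())
```

The slug takes precedence when both are supplied, matching the guidance to prefer guides.get once the slug is known.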
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint=true, idempotentHint=true, destructiveHint=false, indicating a safe read operation. The description adds value by specifying that the search matches 'section answers and action items,' which provides important behavioral context beyond what annotations convey.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is concise and well-structured: three sentences front-loaded with the purpose, followed by usage guidelines. Every sentence adds value and there is no redundancy.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool has only 2 parameters (one required), a clear schema, an output schema, and annotations fully covering safety, the description provides enough context to understand when and how to use the tool. It explains the search scope and distinguishes from siblings, making it complete for an AI agent.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100% (both parameters fully described in the input schema). The description does not add new information about the parameters beyond what the schema provides, so it meets the baseline of 3.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool's purpose: 'Search application guides by free-text query, matched against section answers and action items.' It also distinguishes from sibling tools by specifying when to use guides.get or guides.list instead.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Explicit usage guidance: 'Use this when the user describes an engineering challenge...' and 'Prefer guides.get when you already have the guide slug; prefer guides.list when you need the full inventory.' This clearly tells the agent when to choose this tool over alternatives.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
handoffs.agency · Request Agency Handoff · A
Authenticated — submit an agency engagement enquiry on behalf of the caller for a founder-led discovery call. Persists an AgencyHandoff row routed to the agency inbox; the user is contacted by the team for a scoped proposal. Engagement scopes: workflow sprint (rapid agentic workflow implementation), proof-of-concept (validate a specific agent design in a bounded timeframe), pilot support (co-design and validate a production-ready pilot), advisory (ongoing architectural guidance across a product team). WHEN TO CALL: the user has identified a paid hands-on expert engagement need beyond self-service learning, and explicitly asks to talk to the team or book a discovery call. ALWAYS confirm with the user before firing — this creates a sales-visible record. WHEN NOT TO CALL: for free training / partnerships discussion (use handoffs.partnership); for support / billing / access (use handoffs.operator); proactively or as a sales push. BEHAVIOR: write-only, single insert, side-effecting. Auth: Bearer (Firebase ID token, any plan). UK/EU residency. Response confirms the ticket id + scope so the user can reference it.
| Name | Required | Description | Default |
|---|---|---|---|
| role | No | Role or title of the person submitting the agency inquiry. | |
| locale | No | Response locale for the acknowledgment. | en |
| reason | Yes | Description of the engagement need: workflow sprint, proof-of-concept, pilot support, or advisory. | |
| company | No | Company or team name submitting the agency inquiry. | |
| website | No | Website or relevant URL for the team or project. | |
| agent_name | No | Name of the agent or client triggering the handoff. | mcp-client |
| support_type | No | Type of support needed. | |
| trace_summary | No | Optional agent trace summary for operator context. | |
| agent_platform | No | Platform or runtime the agent is running on. | |
| workflow_stage | No | Current workflow stage. | |
Output Schema
| Name | Required | Description |
|---|---|---|
| No output parameters | | |
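As a minimal sketch of the confirm-before-firing rule, the helper below builds an MCP JSON-RPC `tools/call` request for handoffs.agency only after the user has confirmed. The function name `build_agency_handoff_call` and the `user_confirmed` flag are hypothetical, and the Bearer token would travel in the HTTP transport headers, which are not shown here.

```python
import json
from typing import Optional


def build_agency_handoff_call(reason: str, user_confirmed: bool,
                              request_id: int = 1, **optional: str) -> Optional[str]:
    """Return a JSON-RPC `tools/call` payload, or None if the user has not confirmed."""
    if not user_confirmed:
        # The description says to ALWAYS confirm with the user before firing.
        return None
    if not reason.strip():
        raise ValueError("`reason` is the only required argument and must be non-empty")
    # Drop unset optional fields so the documented defaults (e.g. locale "en") apply.
    arguments = {"reason": reason, **{k: v for k, v in optional.items() if v}}
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "handoffs.agency", "arguments": arguments},
    })
```

Returning `None` rather than raising on a missing confirmation keeps the "ask the user first" branch an ordinary control-flow path for the calling agent.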
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Describes write-only, single insert, side-effecting behavior, auth requirements (Bearer token, any plan), UK/EU residency, and response details. No contradiction with annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Well-structured with sections and bullet points. Front-loaded with main purpose. Could be slightly trimmed but every sentence adds value.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Comprehensive for a complex tool with 10 parameters and output schema. Covers behavior, auth, scope, and usage guidance thoroughly.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100%. Description adds list of engagement scopes and enum values for support_type, but these are already in schema. Minimal added value beyond schema.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
Clearly states the tool submits an agency engagement enquiry for a founder-led discovery call and persists an AgencyHandoff row. Distinguishes from siblings like handoffs.partnership and handoffs.operator.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Explicitly provides WHEN TO CALL and WHEN NOT TO CALL sections with specific conditions and alternative tools. Also instructs to confirm with user before firing.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
handoffs.operator: Request Operator Handoff (A)
Authenticated — creates a support handoff record when an agent needs human review, account-specific escalation, or operator follow-up that cannot be resolved with the read-only doctrine tools. Persists a SupportHandoff row (reason, topic, page_url, agent_name, agent_platform, trace_summary, user_email) routed to the support inbox; user is contacted by the team. WHEN TO CALL: user explicitly asks for human help, hits a billing/access issue, or the agent has tried the doctrine tools and the user still needs a human. ALWAYS confirm with the user before firing — this creates a human-visible ticket. WHEN NOT TO CALL: proactively, silently, or to log debugging traces (use diagnostic logs instead); for partnerships/agency enquiries (use handoffs.partnership / handoffs.agency); for content questions answerable by principles.search / guides.search. BEHAVIOR: write-only, single insert, side-effecting (creates a ticket the team will see). Auth: Bearer (any plan). UK/EU residency. Response confirms ticket id + topic so the user can reference it.
| Name | Required | Description | Default |
|---|---|---|---|
| topic | No | Topic category for routing (e.g. 'agent', 'billing', 'access', 'general'). | agent |
| locale | No | Response locale for the handoff acknowledgment. | en |
| reason | Yes | Clear description of why a human operator review is needed. | |
| page_url | No | URL of the page or context where the handoff was triggered. | |
| agent_name | No | Name of the agent or client triggering the handoff. | mcp-client |
| trace_summary | No | Optional summary of the agent's recent actions or trace for operator context. | |
| agent_platform | No | Platform or runtime the agent is running on (e.g. 'claude-code', 'cursor', 'copilot'). | |
Output Schema
| Name | Required | Description |
|---|---|---|
| No output parameters | | |
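The WHEN TO CALL / WHEN NOT TO CALL guidance for the three handoff tools can be read as a small routing table. The sketch below is one possible encoding; the need labels (`"billing"`, `"paid_engagement"`, and so on) and the function name are hypothetical and are not part of the server's schema.

```python
from typing import Optional

# Hypothetical need labels mapped to the tool each description points to.
HANDOFF_ROUTES = {
    "support": "handoffs.operator",
    "billing": "handoffs.operator",
    "access": "handoffs.operator",
    "paid_engagement": "handoffs.agency",
    "partnership": "handoffs.partnership",
    "training": "handoffs.partnership",
}


def pick_handoff_tool(need: str, user_asked_explicitly: bool) -> Optional[str]:
    """Return the handoff tool for a need, or None when no handoff should fire."""
    if not user_asked_explicitly:
        # All three descriptions rule out proactive or silent handoffs.
        return None
    # Unknown needs return None so the agent falls back to the read-only tools.
    return HANDOFF_ROUTES.get(need)
```

For example, `pick_handoff_tool("billing", True)` selects handoffs.operator, while the same need without an explicit user request selects nothing.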
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Description discloses authentication requirement, write-only nature, side effect (creates a visible ticket), residency constraint, and response format. Annotations only provide basic hints; description adds substantial behavioral context beyond annotation.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Well-structured with purpose, field list, usage guidelines, behavior, auth, and response. Front-loaded with key info. Slightly long but each sentence adds value. Minor redundancy due to field listing.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
For a 7-parameter tool with output schema, description covers purpose, usage, behavior, auth, residency, and response. It does not explain the effect of the locale parameter, and it lists user_email, which is not a schema parameter. Otherwise complete for the complexity.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100%, so baseline is 3. Description lists fields persisted (reason, topic, etc.) adding context, but includes 'user_email' which is not in the schema, and omits the 'locale' parameter. This inconsistency may confuse, but overall adds some meaning beyond schema.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states it 'creates a support handoff record' and specifies when it is needed (human review, escalation, operator follow-up). It distinguishes from sibling tools handoffs.partnership and handoffs.agency, and from diagnostic logs, providing precise resource and verb.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Explicit 'WHEN TO CALL' and 'WHEN NOT TO CALL' sections with specific scenarios. Includes critical instruction to confirm with user before firing. Covers alternatives for partnerships, agency, and content questions.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
handoffs.partnership: Request Partnership Handoff (A)
Authenticated — creates a partnerships handoff record for design-partner, ecosystem, training, or advisory conversations needing human review. Persists a PartnershipHandoff row routed to the partnerships inbox; the user is contacted by the team. WHEN TO CALL: user explicitly wants to engage as a design partner, co-marketing/training partner, or evaluate the Blueprint for their org's training programme. ALWAYS confirm with the user before firing — this creates a human-visible partnerships ticket. WHEN NOT TO CALL: for general support / billing / access issues (use handoffs.operator); for paid-engagement enquiries (use handoffs.agency); proactively or as a sales prompt — only when the user has explicitly asked. BEHAVIOR: write-only, single insert, side-effecting (creates a ticket). Auth: Bearer (any plan). UK/EU residency. Response confirms the ticket id + audience so the user can reference it.
| Name | Required | Description | Default |
|---|---|---|---|
| role | No | Role or title of the person submitting the partnership inquiry. | |
| topic | No | Partnership topic category. | ecosystem |
| locale | No | Response locale for the handoff acknowledgment. | en |
| reason | Yes | Clear description of the partnership opportunity or inquiry. | |
| website | No | Website of the organization for additional context. | |
| agent_name | No | Name of the agent or client triggering the handoff. | mcp-client |
| organization | No | Name of the organization or company making the partnership inquiry. | |
| trace_summary | No | Optional agent trace summary for operator context. | |
| agent_platform | No | Platform or runtime the agent is running on. | |
Output Schema
| Name | Required | Description |
|---|---|---|
| No output parameters | | |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations indicate a write operation (readOnlyHint=false), but the description adds specifics: write-only, single insert, side-effecting, auth requirements (Bearer token, any plan), UK/EU residency, and that the user is contacted by the team. This adds value beyond annotations, though with some redundancy.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Description is moderately long but well-structured with clear sections (Authenticated, WHEN TO CALL, WHEN NOT TO CALL, BEHAVIOR). Every sentence provides value, though slightly verbose. Could be trimmed slightly.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the complexity (9 parameters, 1 required, enums), the description covers purpose, usage, exclusions, behavior, and response (ticket id + audience). An output schema exists, so the description does not need to document return values. Complete for an agent to use correctly.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema description coverage is 100%, so parameters are fully documented there. The description lists topic categories (design-partner, ecosystem, training, advisory) which matches the enum, but doesn't add new meaning. Baseline 3 is appropriate.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states it creates a partnership handoff record for design-partner, ecosystem, training, or advisory conversations needing human review. It distinguishes from sibling tools (handoffs.operator, handoffs.agency) by specifying its unique use cases.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Explicit WHEN TO CALL and WHEN NOT TO CALL sections with named alternatives (handoffs.operator, handoffs.agency). Also instructs to confirm with user before firing, providing clear guidance for correct usage.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
me.add_evidence: Add Evidence Note (A)
Authenticated — append a free-text evidence note to a specific stage in the caller's active course. Notes record concrete implementation observations, decisions, or artefacts that demonstrate progress through a Blueprint principle (e.g. how a delegation boundary was implemented, what approval flow was chosen and why). Persisted as UserStageEvidence rows scoped to (user_id, course_slug, stage_slug). WHEN TO CALL: AFTER the user has articulated something concrete they have built, observed, or decided — not to capture intent or speculation. Pair with me.coaching_context to close evidence gaps. WHEN NOT TO CALL: to log every conversation turn; to record planning, ideas, or todos; on behalf of another user; without the user's awareness (they should know their progress is being recorded). BEHAVIOR: write-only, single insert. Auth: Bearer (Firebase ID token, any plan). UK/EU residency. Notes are visible only to the owning user and are surfaced on me.learning_path / me.coaching_context. Confirms the stage_slug + course_slug pair in the response so the user can see which stage was credited.
| Name | Required | Description | Default |
|---|---|---|---|
| note | Yes | Evidence note to append to the delegation boundary notes for this stage. | |
| stage_id | Yes | ID of the stage to append the evidence note to. | |
| course_slug | Yes | Slug of the course the stage belongs to (e.g. 'agentic-fundamentals'). | |
Output Schema
| Name | Required | Description |
|---|---|---|
| No output parameters | | |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
The description discloses write-only behavior, single insert, authentication requirements (Bearer token, any plan), residency (UK/EU), and data visibility (only owning user, surfaced on other tools). It also mentions response confirmation. This adds value beyond annotations, which only indicate non-readOnly, non-destructive nature.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is a single paragraph but well-organized with clear sections. While it is fairly detailed, every sentence serves a purpose. Minor conciseness improvements could be made, but it remains effective.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
For a tool with 3 simple required parameters and an output schema (not shown), the description covers purpose, usage, behavior, auth, and response details. The agent has all necessary context to use the tool correctly without ambiguity.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100%, so baseline is 3. The description adds context about what notes should contain (concrete implementation observations) and that the response confirms the stage_slug and course_slug. This provides meaningful guidance beyond the schema's parameter descriptions.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the action ('append a free-text evidence note'), the resource ('to a specific stage in the caller's active course'), and the purpose (recording concrete observations). It also distinguishes itself from siblings like me.coaching_context by noting it pairs with that tool.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Explicit WHEN TO CALL and WHEN NOT TO CALL sections provide concrete usage conditions (e.g., after concrete actions, not for speculation or planning, not on behalf of others). It also references a related tool (me.coaching_context), giving clear alternatives.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
me.coaching_context: Get My Coaching Context (A, Read-only, Idempotent)
Authenticated — returns stages in the caller's active course where recorded evidence is thin relative to the stage's principle requirements. Each thin stage carries the missing principle slugs + a short diagnostic so the caller can suggest the user record concrete evidence. WHEN TO CALL: when the user asks 'what should I work on next' or 'what's weak in my Blueprint progress'; before suggesting which guide/example to consult. Pair with me.add_evidence to close gaps. WHEN NOT TO CALL: to lecture the user on principles they have already satisfied; on every conversation turn (state changes only when evidence is added). BEHAVIOR: read-only, idempotent. Auth: Bearer (any plan). Returns thin_stages list with stage slug, course slug, missing principles, evidence_count, and a coaching_note.
| Name | Required | Description | Default |
|---|---|---|---|
| No parameters | | | |
Output Schema
| Name | Required | Description |
|---|---|---|
| No output parameters | | |
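To illustrate how an agent might use the result before pairing with me.add_evidence, here is a hedged sketch that picks the thinnest stage from a response. The description names `thin_stages`, `evidence_count`, and `coaching_note`; the key spellings `stage_slug`, `course_slug`, and `missing_principles` are assumptions, as is the whole response shape.

```python
from typing import Any, Dict, Optional


def pick_stage_to_coach(response: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return the thin stage with the least recorded evidence, or None if there is none."""
    thin_stages = response.get("thin_stages", [])
    if not thin_stages:
        return None
    # min() keeps the first entry on ties, so the server's ordering is respected.
    return min(thin_stages, key=lambda stage: stage.get("evidence_count", 0))


def coaching_prompt(stage: Dict[str, Any]) -> str:
    """Turn one thin stage into a short suggestion for the user."""
    missing = ", ".join(stage.get("missing_principles", [])) or "unspecified principles"
    return f"Stage '{stage.get('stage_slug')}' needs evidence for: {missing}."
```

The agent would show the prompt to the user and call me.add_evidence only once the user describes something concrete they have built, observed, or decided.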
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint=true, idempotentHint=true, destructiveHint=false. Description adds authentication requirements, return structure (thin_stages list with fields), and idempotency behavior. No contradictions.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Single paragraph front-loaded with main purpose, followed by clear WHEN TO CALL/WHEN NOT TO CALL sections. Every sentence adds value without redundancy.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given zero parameters and comprehensive annotations, description covers all necessary context: purpose, usage conditions, behavior, auth, output structure. Output schema exists but description provides a useful summary.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
No parameters in input schema, so baseline is 4. Description does not need to add parameter info but explains the output, which is beneficial.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
Describes a specific action (returns stages with thin evidence) on a specific resource (caller's active course). Distinguishes from siblings by mentioning pairing with me.add_evidence and giving distinct use cases.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Explicitly states when to call ('what should I work on next' or 'what's weak in my Blueprint progress') and when not to call (lecturing on satisfied principles, on every turn). Suggests alternative tool (me.add_evidence) for closing gaps.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
me.learning_path: Get My Learning Path (A, Read-only, Idempotent)
Authenticated — returns the caller's Blueprint learning-path state: current course slug, stage progress, certification status (Foundation, Practitioner, Capstone), Capstone track eligibility flags, and the next recommended stage. WHEN TO CALL: the user asks 'where am I', 'what's next', or 'am I Capstone-eligible'; before suggesting next-step coaching content. WHEN NOT TO CALL: as a heartbeat (state changes only when the user completes a stage); to read another user's progress. BEHAVIOR: read-only, idempotent. Auth: Bearer (any plan, including basic). Returns user_email, course_slug, stages list with completion timestamps, certification block, and a next_stage hint.
| Name | Required | Description | Default |
|---|---|---|---|
| No parameters | | | |
Output Schema
| Name | Required | Description |
|---|---|---|
| No output parameters | | |
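Because the description says state changes only when the user completes a stage, a client can cache the result instead of polling. The sketch below shows one way to do that; `LearningPathCache` and its `fetch` callable are hypothetical stand-ins for whatever actually issues the me.learning_path call.

```python
from typing import Any, Callable, Dict, Optional


class LearningPathCache:
    """Cache one me.learning_path result until the caller marks it stale."""

    def __init__(self, fetch: Callable[[], Dict[str, Any]]) -> None:
        self._fetch = fetch  # hypothetical: performs the real tool call
        self._state: Optional[Dict[str, Any]] = None

    def get(self) -> Dict[str, Any]:
        # Call the tool only when nothing is cached, never as a heartbeat.
        if self._state is None:
            self._state = self._fetch()
        return self._state

    def invalidate(self) -> None:
        # Call this after the user completes a stage, the only documented state change.
        self._state = None
```

Repeated `get()` calls between invalidations reuse the cached state, so the tool is called once per stage completion rather than once per conversation turn.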
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Beyond annotations (readOnlyHint, idempotentHint, destructiveHint), description adds auth requirements (Bearer token, any plan), stresses read-only and idempotent nature, and lists return fields. No contradiction.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Single paragraph, front-loaded with key info ('Authenticated — returns...'), no wasted words, logically organized: purpose, usage, behavior, auth, return fields.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
With zero parameters, full annotations, and output schema present, the description still covers purpose, usage guidelines, behavioral traits, auth details, and return values, making the tool fully understandable.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
No parameters exist (0), so adding param info is unnecessary. The description compensates by detailing output fields not fully captured in output schema, adding value beyond the empty schema.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
Description specifies the tool returns the caller's learning-path state with exact fields (course slug, stage progress, certification status, etc.), clearly distinguishing from sibling tools like me.coaching_context or me.validation_history by focusing on personal progress.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Explicitly lists when to call (user asks 'where am I', 'what's next', 'am I Capstone-eligible') and when not to call (as a heartbeat, for another user's progress), providing clear decision criteria.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
me.validation_history: My Architect Agent Validation History (A, Read-only, Idempotent)
Pro/Teams — return the authenticated user's architect.validate run history with the Blueprint Readiness Score (0-100), letter grade (A-F), and tier (draft, emerging, production_ready). Three lookup modes: (1) run_id=<id> returns a SINGLE run with the full persisted result_json — use this to RECOVER a result when your MCP client tool-call timed out before architect.validate returned. The run completes server-side and persists; the run_id is surfaced in the first progress notification of every architect.validate call so you have the recovery handle even when your client gives up early. (2) repository=<name> returns the full per-run trend for that repository plus a regression diff between the latest two runs. (3) No arguments returns one summary per repository the user has validated, sorted by most recent. Use modes (2) or (3) BEFORE calling architect.validate again on the same repository — they tell you which principles regressed since the last run, so you can focus the new review on what is actually changing. Auth: Bearer . Pro or Teams plan required.
| Name | Required | Description | Default |
|---|---|---|---|
| limit | No | Maximum number of runs to return when scoped to a single repository. Capped at 50. Ignored when `run_id` is provided. | |
| run_id | No | Single-run lookup by run_id (UUID). Returns the persisted result_json verbatim — the same payload architect.validate would have returned if your client hadn't timed out. Use this to recover a result when your MCP tool-call closed before the server returned. Per-run authorisation: returns only runs owned by the calling user. | |
| repository | No | Repository name or path to scope the history to. Pass the same value you would pass to architect.validate. Omit to get one summary per repository. Mutually exclusive with `run_id` — if both are passed, `run_id` wins. | |
Output Schema
| Name | Required | Description |
|---|---|---|
| No output parameters | | |
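The three lookup modes and their precedence rules can be sketched as a small argument builder. This mirrors on the client side what the parameter table describes (`run_id` wins over `repository`, `limit` is capped at 50 and ignored with `run_id`); the function name `history_arguments` is hypothetical, and the server remains the authority on these rules.

```python
from typing import Any, Dict, Optional

MAX_LIMIT = 50  # the parameter table says `limit` is capped at 50


def history_arguments(run_id: Optional[str] = None,
                      repository: Optional[str] = None,
                      limit: Optional[int] = None) -> Dict[str, Any]:
    """Build me.validation_history arguments for one of the three lookup modes."""
    if run_id:
        # Mode 1: single-run recovery. run_id wins and limit is ignored.
        return {"run_id": run_id}
    if repository:
        # Mode 2: per-repository trend, with an optional capped limit.
        arguments: Dict[str, Any] = {"repository": repository}
        if limit is not None:
            arguments["limit"] = min(limit, MAX_LIMIT)
        return arguments
    # Mode 3: no arguments, one summary per validated repository.
    return {}
```

Sending only the arguments that matter for the chosen mode makes the intended lookup explicit in call logs.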
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint, idempotentHint, destructiveHint. The description adds crucial behavioral context: the run completes server-side and persists, run_id is in first progress notification for recovery, per-run authorization, and plan requirements. No contradictions.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Well-structured with numbered modes and clear use cases. Every sentence adds value, but slightly verbose with embedded explanations. Front-loaded with main purpose. Could trim some redundant phrasing without losing clarity.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool's complexity (three modes, recovery mechanism) and rich annotations/output schema, the description is fully complete. It covers auth, plan requirement, return values for each mode, and use case guidance. No gaps remain.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100% with descriptions for all three parameters. The description adds significant meaning beyond the schema: run_id recovery use case, repository regression diff, and no-args summary. This compensates fully and enhances understanding.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool returns the authenticated user's architect.validate run history with score, grade, and tier. It defines three distinct lookup modes, distinguishing it from siblings like architect.validate which performs the validation. The specific resources (run_id, repository) and actions (recover, trend, summary) leave no ambiguity.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Explicitly guides when to use each mode: run_id for recovering timed-out results, repository for per-repo trend before re-validating, and no arguments for one summary per repo. Advises using modes 2 or 3 before calling architect.validate again to see regressions. Provides clear context and exclusions.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
principles.get: Get Principle (A, Read-only, Idempotent)
Get one Blueprint principle by stable slug. Returns id, title, cluster, definition, rationale, risk-if-violated, implementation heuristics, and linked example slugs (which examples.get can hydrate). Use this when you already have the exact slug from principles.list or principles.search; prefer principles.search when the user describes a topic or failure mode in natural language; prefer principles.list when you need every principle or every principle within a cluster. Returns error_payload on unknown slug.
| Name | Required | Description | Default |
|---|---|---|---|
| slug | Yes | Stable slug of the principle (e.g. 'establish-trust-through-inspectability'). | |
Output Schema
| Name | Required | Description |
|---|---|---|
| No output parameters | | |
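The get / search / list guidance amounts to a decision on what the agent already has in hand. A minimal sketch follows; the function name is hypothetical, and the `query` argument name for principles.search is an assumption, since that tool's schema is not shown in this section.

```python
from typing import Dict, Optional, Tuple


def pick_principles_call(slug: Optional[str] = None,
                         query: Optional[str] = None,
                         cluster: Optional[str] = None) -> Tuple[str, Dict[str, str]]:
    """Choose among principles.get, principles.search, and principles.list."""
    if slug:
        # An exact slug is known: fetch the full detail directly.
        return "principles.get", {"slug": slug}
    if query:
        # Natural-language topic or failure mode: search instead.
        # NOTE: the "query" argument name is assumed, not confirmed by the source.
        return "principles.search", {"query": query}
    # Otherwise list everything, optionally filtered to one cluster.
    return "principles.list", ({"cluster": cluster} if cluster else {})
```

A slug takes priority over a query here because the description prefers principles.get whenever the exact slug is already known.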
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already indicate readOnlyHint, idempotentHint, destructiveHint. Description adds that it returns error_payload on unknown slug, providing additional behavioral context.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two sentences: first states action and return fields, second provides usage guidance and error behavior. No redundancy or fluff.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given output schema exists, description covers purpose, return fields, usage, and error case. Complete for a single-parameter read tool.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100%, so description adds limited value beyond schema. It provides an example slug and mentions stability, but schema already describes the slug. Baseline 3.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
Clearly states it gets one blueprint principle by stable slug, lists returned fields, and distinguishes from sibling tools by specifying when to use each (principles.search for natural language, principles.list for all).
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Explicitly states when to use this tool (exact slug from principles.list or principles.search) and when to prefer alternatives (principles.search for topics, principles.list for all).
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
principles.list · List Principles · A · Read-only · Idempotent
List all 10 Blueprint principles with stable slugs, titles, and clusters. Use this when you need the full inventory or want every principle in one cluster (pass cluster slug to filter). Prefer principles.search when the user describes a topic, failure mode, or keyword in natural language. Prefer principles.get when you already know the exact slug and need full detail.
| Name | Required | Description | Default |
|---|---|---|---|
| cluster | No | Cluster slug to filter by (e.g. 'delegation', 'visibility', 'trust', 'orchestration'). Omit to return all principles. | |
Output Schema
| Name | Required | Description |
|---|---|---|
| No output parameters | | |
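The routing guidance shared by the three `principles.*` descriptions can be sketched as a small decision function. This is an illustration of the stated preference order only; `choose_principles_tool` and its keyword names are hypothetical and not part of the server.

```python
def choose_principles_tool(slug=None, topic=None, cluster=None):
    """Pick a principles.* tool following the guidance in the descriptions.

    Hypothetical helper: returns (tool_name, arguments).
    """
    if slug:
        # Exact slug already known: fetch full detail.
        return "principles.get", {"slug": slug}
    if topic:
        # Natural-language topic, failure mode, or keyword: search.
        return "principles.search", {"query": topic}
    # Otherwise list everything, optionally filtered by cluster slug.
    arguments = {"cluster": cluster} if cluster else {}
    return "principles.list", arguments


print(choose_principles_tool(topic="approval flow"))
print(choose_principles_tool(cluster="trust"))
```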
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint, idempotentHint, and destructiveHint. Description adds that it returns stable slugs, titles, and clusters, and supports filtering, which is useful context beyond the annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Three concise sentences with no unnecessary words. The core action is front-loaded, and every sentence adds value.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the simple task (list with optional filter) and the presence of an output schema, the description fully covers what is returned and how to filter. Annotations cover safety, so no gaps remain.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
The description explains the single optional parameter 'cluster' with examples ('delegation', 'visibility', etc.) and clarifies that omitting returns all. Since schema coverage is 100%, the description adds meaningful examples and context.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states it lists all 10 Blueprint principles with stable slugs, titles, and clusters. It differentiates from siblings by specifying when to use this tool versus principles.search and principles.get.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Explicitly says when to use this tool: for full inventory or filtering by cluster. Explicitly advises against using it when the user describes a topic (use search) or when the exact slug is known (use get).
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
principles.search · Search Principles · A · Read-only · Idempotent
Search Blueprint principles by free-text query and return the closest matches ranked by relevance. Use this to find principles related to a specific design challenge, failure mode, or keyword (e.g. 'reversibility', 'approval flow', 'delegation boundary'). Returns principle title, cluster, definition, rationale, and implementation heuristics. Prefer this over principles.list when you have a specific topic in mind rather than wanting all principles.
| Name | Required | Description | Default |
|---|---|---|---|
| limit | No | Maximum number of results to return. Capped at server maximum. | |
| query | Yes | Free-text search query matched against principle title, definition, rationale, and cluster. | |
Output Schema
| Name | Required | Description |
|---|---|---|
| No output parameters | | |
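A minimal sketch of assembling the arguments for this tool, assuming the client wants to reject obviously invalid input before sending. The helper `build_search_args` is hypothetical; the server-side cap on `limit` is not published here, so the sketch does not try to enforce it.

```python
def build_search_args(query, limit=None):
    """Assemble arguments for principles.search (hypothetical helper)."""
    cleaned = query.strip()
    if not cleaned:
        raise ValueError("query is required and must not be empty")
    arguments = {"query": cleaned}
    if limit is not None:
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(limit, bool) or not isinstance(limit, int) or limit < 1:
            raise ValueError("limit must be a positive integer")
        # The server applies its own maximum; the value is passed through as-is.
        arguments["limit"] = limit
    return arguments


print(build_search_args("delegation boundary", limit=3))
```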
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already indicate readOnlyHint, idempotentHint, and non-destructive nature. The description adds value by listing returned fields (title, cluster, definition, rationale, implementation heuristics). No contradictions present.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Four sentences that front-load the purpose, provide usage guidance, and list output fields. Every sentence is necessary and there is no redundancy.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool's simplicity, full annotation coverage, and presence of an output schema, the description adequately covers what an agent needs: purpose, usage context, and return content. It is complete for its complexity.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema description coverage is 100%, but the description adds context that the query is matched against title, definition, rationale, and cluster, and that limit is capped at server maximum. This provides additional meaning beyond the schema alone.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description uses the specific verb 'search' and resource 'Blueprint principles', clearly stating it returns closest matches ranked by relevance with examples like 'reversibility' or 'approval flow'. It distinguishes itself from 'principles.list' by specifying when to use each.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Explicitly states 'Prefer this over principles.list when you have a specific topic in mind rather than wanting all principles', providing clear when-to-use and when-not-to-use guidance with a named alternative.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
signals.feedback · Submit Feedback · A
Public — records explicit free-text user feedback about the Blueprint, this tool surface, or a specific principle/example. Captures category (bug, doctrine_critique, missing_example, ergonomics, other), free-text body, and optional contact_email when permission_to_follow_up is true. WHEN TO CALL: ONLY when the user explicitly says they want to give feedback (e.g. 'can you log this as feedback', 'file this critique', 'send a bug report'). Use signals.report instead for value-moment metrics (rating validate's output 1-5). WHEN NOT TO CALL: proactively, silently, or to substitute for signals.report. Never harvest contact info without explicit permission_to_follow_up=true. BEHAVIOR: write-only, no auth required (open to all callers), single insert into UserFeedback. UK/EU residency. contact_email is stored ONLY when permission_to_follow_up=true, and that fact is confirmed back in the response so the user can see the privacy boundary.
| Name | Required | Description | Default |
|---|---|---|---|
| surface | No | Which Blueprint surface the feedback is about. Use 'mcp' if the session was via Claude Code or another MCP client. Use 'principles', 'examples', 'guides', 'coaching', or 'validation' based on what the user interacted with. | |
| task_type | No | What the user was doing when they decided to give feedback. Use plain English — e.g. 'code-review', 'architecture-design', 'agent-setup', 'onboarding', 'validation'. Infer from context. | |
| what_helped | No | Ask the user: 'What was most helpful?' Record their answer verbatim or paraphrased in plain English. Max 1000 chars. No code snippets, no proprietary content. | |
| what_missing | No | Ask the user: 'What was missing or could be improved?' Record their answer verbatim or paraphrased. Max 1000 chars. | |
| contact_email | No | Only ask for this if the user explicitly says they want a follow-up response. Never prompt for email unprompted. Only stored when permission_to_follow_up=true. | |
| rating_clarity | No | Ask the user: 'How clear was the Blueprint guidance? Rate 1–5.' 1 = very unclear, 5 = very clear. Only set if the user gives an explicit number. | |
| would_use_again | No | Ask the user: 'Would you use the Blueprint again for a similar task?' Set true/false based on their answer. Only set if they answer explicitly. | |
| rating_usefulness | No | Ask the user: 'How useful was the Blueprint for this task? Rate 1–5.' 1 = not useful, 5 = very useful. Only set if the user gives an explicit number. | |
| permission_to_follow_up | No | Set to true only if the user explicitly said they want a follow-up. Must be confirmed before storing contact_email. | |
Output Schema
| Name | Required | Description |
|---|---|---|
| No output parameters | | |
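The privacy boundary around `contact_email` can also be enforced on the client before the call is made. The sketch below is one way to do that under the constraints listed in the parameter table (ratings 1 to 5, 1000-character text limit); `build_feedback_args` is a hypothetical helper, and raising on over-long text rather than truncating is a design assumption.

```python
MAX_TEXT_CHARS = 1000  # limit stated for what_helped and what_missing


def build_feedback_args(what_helped=None, what_missing=None,
                        rating_usefulness=None, rating_clarity=None,
                        contact_email=None, permission_to_follow_up=False,
                        surface="mcp"):
    """Assemble signals.feedback arguments (hypothetical helper)."""
    arguments = {"surface": surface}
    for key, text in (("what_helped", what_helped),
                      ("what_missing", what_missing)):
        if text is not None:
            if len(text) > MAX_TEXT_CHARS:
                raise ValueError(f"{key} exceeds {MAX_TEXT_CHARS} characters")
            arguments[key] = text
    for key, rating in (("rating_usefulness", rating_usefulness),
                        ("rating_clarity", rating_clarity)):
        if rating is not None:
            if rating not in (1, 2, 3, 4, 5):
                raise ValueError(f"{key} must be an explicit score from 1 to 5")
            arguments[key] = rating
    # contact_email is only forwarded when the user explicitly opted in.
    if permission_to_follow_up is True:
        arguments["permission_to_follow_up"] = True
        if contact_email:
            arguments["contact_email"] = contact_email
    return arguments


print(build_feedback_args(what_missing="More examples", contact_email="a@example.com"))
```

In the usage line above the email is dropped, because `permission_to_follow_up` was not set to true.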
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
The description discloses that the tool is write-only, requires no auth, performs a single insert into UserFeedback, and follows UK/EU residency. It also explains the privacy boundary around contact_email. The annotations do not contradict any of this, and the description adds valuable behavioral context beyond the annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is a single well-structured paragraph that front-loads the core purpose, followed by usage guidance and behavior. While it is clear and comprehensive, it is slightly verbose and could be tightened without losing clarity.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the complexity of 9 optional parameters and the presence of an output schema, the description covers the tool's purpose, usage context, privacy implications, and data location thoroughly. It leaves little ambiguity for an AI agent to misuse the tool.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
The description adds significant usage guidance for each parameter, such as when to set 'surface' to 'mcp', how to infer 'task_type', and when to ask for 'contact_email'. This enhances the schema descriptions, which already cover 100% of the parameters.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states it is for recording explicit free-text user feedback about the Blueprint, tool surface, or specific principle/example. It specifies categories and optional fields, and distinguishes from the sibling tool 'signals.report'.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description includes explicit WHEN TO CALL and WHEN NOT TO CALL sections, stating to call only when the user explicitly requests feedback and not to use it proactively or as a substitute for signals.report. It also warns against harvesting contact info without permission.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
signals.report · Report Value Event · A
Pro/Teams: records a value moment (review_confidence, runtime_risk_found, regression_caught, recommendation_taken) after a successful architect.validate or design session. Each event captures event_type, surface_used (mcp/web/cli), perceived_value (1-5), and an optional brief_context. Structured fields only, NO prompts or code stored. WHEN TO CALL: after architect.validate returns a clearly useful result AND the user has acknowledged the value (or you ask them "would you rate this 1-5?"). Validate's response carries an explicit next_step instruction telling the agent to OFFER this call; surface that offer to the user. WHEN NOT TO CALL: silently or without the user's awareness; on every validate (only after a clear value moment); to capture intent or speculative value. If the user declines, do not retry within the same session. BEHAVIOR: write-only, single insert into ValueEvent. Auth: Bearer token, Pro or Teams plan required. UK/EU residency. Do NOT include proprietary code, prompt content, or PII in brief_context, because it surfaces in admin AI-visibility dashboards. Expect a 1-line acknowledgment in the response; the structured feedback is then aggregated server-side.
| Name | Required | Description | Default |
|---|---|---|---|
| team_size | No | If the user mentions their team size during the session, record it here. Do not ask for it explicitly — only capture if volunteered. | |
| event_type | Yes | Pick the type that best matches what just happened: 'review_confidence' — architect.validate returned aligned; 'runtime_risk_found' — architect.validate found violations; 'workflow_clarity' — principles/examples clarified a design decision; 'agent_setup_success' — user successfully wired up an agent or MCP tool; 'onboarding_helped' — user understood how to start using the Blueprint; 'research_time_saved' — user found relevant doctrine faster than expected; 'team_alignment' — Blueprint helped align a team on agentic design; 'other' — use only if none of the above fit. | |
| surface_used | No | Where the value was experienced. Use 'mcp' when called from Claude Code, Cursor, Windsurf, or any MCP client. Use 'principles' if the user was browsing or searching principles. Use 'examples' if the user was reading implementation examples. Use 'for-agents' if the user came via the /for-agents page. Use 'learn' or 'certification' for course-related sessions. | |
| brief_context | No | 1–2 plain-English sentences summarising what was helpful. Example: 'Validation identified a missing approval gate before email send.' No code snippets, no proprietary content, no user PII. Max 500 chars. | |
| workflow_stage | No | Infer from what the user was doing: 'exploring' — reading doctrine, browsing principles; 'designing' — planning architecture or agent flows; 'implementing' — writing or refactoring code; 'reviewing' — running architect.validate on existing code; 'shipping' — preparing for production or deployment. | |
| perceived_value | No | Ask the user: 'On a scale of 1–5, how valuable was this session?' Map their answer directly: 1=low, 5=high. Do not guess — only set this if the user gave an explicit score. | |
| would_recommend | No | Ask the user: 'Would you recommend the Blueprint to a colleague?' Set true/false based on their answer. Only set if asked — do not assume. | |
Output Schema
| Name | Required | Description |
|---|---|---|
| No output parameters | | |
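A client could check a value event against the documented constraints before sending it. The sketch below assumes the `event_type` values listed in the parameter table are the accepted set (the prose description names a slightly different list, so treat this set as an assumption) and uses the 500-character limit stated for `brief_context`. `build_report_args` is a hypothetical helper.

```python
# Assumed from the event_type row of the parameter table above.
EVENT_TYPES = {
    "review_confidence", "runtime_risk_found", "workflow_clarity",
    "agent_setup_success", "onboarding_helped", "research_time_saved",
    "team_alignment", "other",
}
MAX_CONTEXT_CHARS = 500  # limit stated for brief_context


def build_report_args(event_type, perceived_value=None,
                      brief_context=None, surface_used="mcp"):
    """Assemble signals.report arguments (hypothetical helper)."""
    if event_type not in EVENT_TYPES:
        raise ValueError(f"unknown event_type: {event_type!r}")
    arguments = {"event_type": event_type, "surface_used": surface_used}
    if perceived_value is not None:
        # Only an explicit user score from 1 to 5 should be recorded.
        if perceived_value not in range(1, 6):
            raise ValueError("perceived_value must be between 1 and 5")
        arguments["perceived_value"] = perceived_value
    if brief_context is not None:
        if len(brief_context) > MAX_CONTEXT_CHARS:
            raise ValueError(f"brief_context exceeds {MAX_CONTEXT_CHARS} characters")
        arguments["brief_context"] = brief_context
    return arguments


print(build_report_args(
    "runtime_risk_found",
    perceived_value=4,
    brief_context="Validation identified a missing approval gate before email send.",
))
```

The checks cover structure only; keeping code, prompt content, and PII out of `brief_context` still depends on the caller.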
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations show readOnlyHint=false, etc. Description adds write-only, single insert, auth Bearer token, Pro/Teams plan, UK/EU residency, and constraints on data (no PII, code). This goes well beyond annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Description is somewhat long but well-structured with sections like WHEN TO CALL and BEHAVIOR. Every sentence adds value; no excessive fluff.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool's complexity (7 parameters, output schema exists), the description covers usage, behavior, auth, and constraints comprehensively. Output schema handles response details; description adds sufficient context.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100% with parameter descriptions. Description adds extra guidance on usage (e.g., team_size only if volunteered, perceived_value must be asked, brief_context no code). This adds meaningful context beyond schema.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool records a 'value moment' after a successful validate or design session, with specific event types listed. It distinguishes from siblings like signals.feedback by focusing on value capture after specific actions.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Explicitly states when to call (after architect.validate returns useful result and user acknowledges value), when not to call (silently, on every validate, for intent/speculative value), and provides specific scenarios. Also mentions not to retry if declined.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
team.summarize · Summarize Team Usage · A · Read-only · Idempotent
Pro/Teams: summarises the caller's tool-usage patterns and value signals over a configurable window (default 30 days). Returns tool_call_counts, top principles cited in validate runs, value_event_counts by event_type, and an aggregate readiness trend. WHEN TO CALL: the user asks 'how is the Blueprint helping me/my team', 'what should I explore next', or 'show me my Blueprint usage'. WHEN NOT TO CALL: proactively or on every conversation turn (the summary is an explicit retrospective, not telemetry); to compare users (returns only the caller's own data). BEHAVIOR: read-only, idempotent over the same window. Aggregates from AIToolCallLog + ValueEvent + AIValidationRunLog. Pass private_session=true to bypass server-side logging for this summary call (the underlying historical data still exists; only this read is untracked). Auth: Bearer token, Pro or Teams plan. UK/EU residency.
| Name | Required | Description | Default |
|---|---|---|---|
| days_back | No | Number of days of usage history to include in the summary. | |
| private_session | No | Set to true to skip logging this summary call. | |
Output Schema
| Name | Required | Description |
|---|---|---|
| No output parameters | | |
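Because this tool requires a Bearer token, a client has to pair the arguments with an `Authorization` header. The following is a rough sketch of that pairing, assuming the standard `Authorization: Bearer <token>` HTTP header form; `build_summarize_call` is a hypothetical helper and the token value is a placeholder.

```python
def build_summarize_call(token, days_back=None, private_session=False):
    """Return (headers, arguments) for team.summarize (hypothetical helper)."""
    if not token:
        raise ValueError("a Bearer token is required for team.summarize")
    headers = {"Authorization": f"Bearer {token}"}
    arguments = {}
    if days_back is not None:
        if days_back < 1:
            raise ValueError("days_back must be at least 1")
        arguments["days_back"] = days_back
    # Omitting days_back leaves the server default window (30 days) in effect.
    if private_session:
        # Skips logging of this summary read only; historical data is unaffected.
        arguments["private_session"] = True
    return headers, arguments


headers, arguments = build_summarize_call("PLACEHOLDER_TOKEN", days_back=7,
                                          private_session=True)
print(arguments)
```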
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already provide readOnlyHint=true, idempotentHint=true, destructiveHint=false. The description adds significant behavioral context: read-only and idempotent over same window, data sources (AIToolCallLog etc.), private_session behavior (bypasses logging but underlying data exists), auth requirements, plan restrictions, and residency. No contradictions with annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is well-structured with headers (WHEN TO CALL, WHEN NOT TO CALL, BEHAVIOR) and front-loaded with purpose. While somewhat long, each sentence adds value and the structure aids readability. Could be slightly trimmed but remains efficient.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool's complexity (2 params, output schema present), the description covers all necessary aspects: purpose, usage context, behavior, parameter details, data sources, output structure, auth, and residency. No gaps detected.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100% with clear descriptions for both parameters. The tool description adds nuance for private_session, explaining that it only affects logging for this read. This adds value beyond the schema's basic description.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool's purpose: 'summarises the caller's tool-usage patterns and value signals over a configurable window' and lists specific return fields (tool_call_counts, etc.). It distinguishes from siblings by noting it only returns the caller's own data and provides explicit exclusions (e.g., not for comparing users).
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description includes explicit WHEN TO CALL and WHEN NOT TO CALL sections, with example user queries. It clearly states when not to use the tool (proactively, or to compare users) but does not name specific alternative tools.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
Claim this connector by publishing a /.well-known/glama.json file on your server's domain with the following structure:
{
  "$schema": "https://glama.ai/mcp/schemas/connector.json",
  "maintainers": [{ "email": "your-email@example.com" }]
}
The email address must match the email associated with your Glama account. Once published, Glama will automatically detect and verify the file within a few minutes.
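For servers whose static files are generated by a build step, the claim file could be written with a short script. This is a sketch only: `write_glama_claim` and the site-root argument are hypothetical, and how the `.well-known` directory gets served depends on your hosting setup.

```python
import json
import tempfile
from pathlib import Path


def write_glama_claim(site_root, email):
    """Write .well-known/glama.json under site_root (hypothetical helper)."""
    target = Path(site_root) / ".well-known" / "glama.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    claim = {
        "$schema": "https://glama.ai/mcp/schemas/connector.json",
        "maintainers": [{"email": email}],
    }
    target.write_text(json.dumps(claim, indent=2) + "\n", encoding="utf-8")
    return target


# Demonstration against a temporary directory instead of a real web root.
with tempfile.TemporaryDirectory() as root:
    path = write_glama_claim(root, "your-email@example.com")
    print(path.read_text(encoding="utf-8"))
```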
Control your server's listing on Glama, including description and metadata
Access analytics and receive server usage reports
Get monitoring and health status updates for your server
Feature your server to boost visibility and reach more users
For users:
Full audit trail – every tool call is logged with inputs and outputs for compliance and debugging
Granular tool control – enable or disable individual tools per connector to limit what your AI agents can do
Centralized credential management – store and rotate API keys and OAuth tokens in one place
Change alerts – get notified when a connector changes its schema, adds or removes tools, or updates tool definitions, so nothing breaks silently
For server owners:
Proven adoption – public usage metrics on your listing show real-world traction and build trust with prospective users
Tool-level analytics – see which tools are being used most, helping you prioritize development and documentation
Direct user feedback – users can report issues and suggest improvements through the listing, giving you a channel you would not have otherwise
The connector status is unhealthy when Glama is unable to successfully connect to the server. This can happen for several reasons:
The server is experiencing an outage
The URL of the server is wrong
Credentials required to access the server are missing or invalid
If you are the owner of this MCP connector and would like to make modifications to the listing, including providing test credentials for accessing the server, please contact support@glama.ai.