architect.validate_consensus
Run multiple parallel reviews of agentic code to obtain a consensus score with uncertainty estimates, removing anchor bias from prior runs. Provides median score, standard deviation, and per-principle stability.
Instructions
Pro/Teams — N-shot CONSENSUS doctrine review of agentic code. Runs N parallel architect.validate calls with private_session=True, then aggregates them to a per-principle MODE verdict + median severity + per-principle stability + score range/stdev. Returns one ConsensusValidationResponse with the headline median score, the honest variance band, and a representative full ValidationResponse (the child whose score is closest to the median). WHEN TO CALL: the user wants an HONEST first-pass score on agentic code, with the architect's variance surfaced. The single-shot architect.validate re-asserts the prior persisted run's verdict via baseline-anchor injection — same code can score 60/C anchored vs 98/A unanchored. Consensus mode is the unanchored honest read. WHEN NOT TO CALL: when you NEED the iteration delta against a prior run (regressions/improvements panel) — for that, call architect.validate which keeps baseline injection on. BEHAVIOR: N (default 3, max 5) parallel LLM calls run concurrently; wallclock ~80-120s for N=3 (max child latency, not sum). Cost = N × LLM bill. Each child runs with private_session=True so the doctrine prompt's prior-run baseline injection is suppressed (no anchor bias). One CONSOLIDATED UserValidationRun row is written carrying the consensus envelope; the N children themselves do NOT persist (private_session contract). AUTH: Bearer , Pro/Teams plan. Same paid-plan gate as architect.validate. INPUTS: same shape as architect.validate. n is the only extra arg (range 2..5). private_session is implicit (always true for children); the OUTER consolidated row IS persisted unless the tool itself is called inside another private context — but no such wrapper exists today. OUTPUT: response carries score_consensus_median (headline), score_stdev (honest uncertainty), score_range (min, max), mode_stability_min_pct (the cert-eligibility gate's input — ≥ 80% means the consensus is stable), per_principle (mode + distribution + severity median per principle), and representative_response (the closest-to-median child's full ValidationResponse so existing UI components render unchanged). RECOVERY: same heartbeat pattern as architect.validate — first notification at t=0 carries the run_id; recover via me.validation_history(run_id=) if your client times out. TYPED FAILURES: same as architect.validate (timed_out, rate_limited, dependency_unavailable). Plus consensus-specific: consensus_quorum_failed when fewer than 2 child runs succeeded (≥ 2 required to compute a meaningful median).
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| n | No | Number of parallel child runs. Default 3 (the variance signal is visible at N=3; cost = 3× LLM bill). Capped server-side by Settings.consensus_n_max (default 5). | |
| task | No | What the agent or workflow is trying to accomplish. | |
| files | No | List of file paths relevant to the implementation. | |
| goals | No | Specific safety or quality goals to evaluate against. | |
| language | No | Programming language of the code (e.g. 'python'). | |
| focus_area | No | Optional: narrow the review to a principle cluster or slug. | |
| repository | No | Repository name or path. Lets the consensus row group into the per-project history view alongside single-shot validate runs. | |
| example_limit | No | Max curated examples per child run. | |
| implementation_context | Yes | The artifact under review. SEND FULL FILE CONTENTS VERBATIM — same constraint as architect.validate. Truncation produces hallucinated findings on code that isn't there. |