# Chapter 6: Success Criteria
[← Ch. 5](05_battery_summary.md) | [README](README.md) | [Ch. 7 →](07_risk_register.md)
---
## 6.1 Stage 1 (Response Generation)
| Criterion | Threshold | Measurement |
|---|---|---|
| Battery completion | 39/39 queries completed under both conditions | Count in JSONL |
| Treatment tool usage | ≥95% of treatment responses include ≥1 tool call | `tool_calls` field |
| Control tool absence | 0% of control responses include tool calls | `tool_calls` field |
| No crashes | 0 unhandled exceptions | Harness log |
| Latency | Treatment median <30s, control median <10s | `total_latency_ms` |
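The Stage 1 gates above can be checked mechanically from the harness JSONL. A minimal sketch, assuming each record carries `tool_calls` and `total_latency_ms` (from the table) plus a `condition` field naming the arm — the `condition` field name and record shape are assumptions, not fixed by this document:

```python
import json
import statistics

def check_stage1(jsonl_lines):
    """Evaluate the Stage 1 success criteria over harness output records.

    Assumed record shape (illustrative): {"condition": "treatment"|"control",
    "tool_calls": [...], "total_latency_ms": int}.
    """
    records = [json.loads(line) for line in jsonl_lines]
    treat = [r for r in records if r["condition"] == "treatment"]
    ctrl = [r for r in records if r["condition"] == "control"]
    results = {
        "battery_complete": len(treat) == 39 and len(ctrl) == 39,
        # Fraction of responses containing at least one tool call
        "treatment_tool_rate": sum(bool(r["tool_calls"]) for r in treat) / max(len(treat), 1),
        "control_tool_rate": sum(bool(r["tool_calls"]) for r in ctrl) / max(len(ctrl), 1),
        "treatment_median_s": statistics.median(r["total_latency_ms"] for r in treat) / 1000 if treat else None,
        "control_median_s": statistics.median(r["total_latency_ms"] for r in ctrl) / 1000 if ctrl else None,
    }
    results["pass"] = (
        results["battery_complete"]
        and results["treatment_tool_rate"] >= 0.95
        and results["control_tool_rate"] == 0.0
        and results["treatment_median_s"] < 30
        and results["control_median_s"] < 10
    )
    return results
```

The crash criterion is not checked here: unhandled exceptions surface in the harness log, not in the JSONL.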
## 6.2 Stage 2 (Judge Scoring)
| Criterion | Threshold | Measurement |
|---|---|---|
| Judge completion | 117/117 judgments (39 × 3) | Count in scores JSONL |
| Inter-rater agreement | Krippendorff's α ≥ 0.4 (moderate) per dimension | α ordinal calculation |
| No dimension-level floor/ceiling | No dimension where all scores are 0 or all are 2 | Score distribution |
| Position bias | A/B assignment effect < 0.5 CQS points | Mean difference by position |
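The position-bias criterion reduces to a mean difference over judgments grouped by presentation slot. A sketch, assuming each judgment record exposes a `position` ("A"/"B") and a `cqs` score — both field names are illustrative assumptions:

```python
from statistics import mean

def position_bias(judgments, threshold=0.5):
    """Mean CQS difference by A/B presentation position.

    `judgments` is a list of dicts; the `position` and `cqs` field
    names are assumed for illustration, not fixed by the rubric.
    Returns (absolute difference, whether it clears the threshold).
    """
    a = [j["cqs"] for j in judgments if j["position"] == "A"]
    b = [j["cqs"] for j in judgments if j["position"] == "B"]
    diff = abs(mean(a) - mean(b))
    return diff, diff < threshold  # criterion: effect < 0.5 CQS points
```

Inter-rater agreement is a heavier computation; the `krippendorff` package's `alpha(..., level_of_measurement="ordinal")` is the standard route rather than hand-rolling it.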
## 6.3 Stage 3 (Treatment Effect)
| Criterion | Threshold | Measurement |
|---|---|---|
| Edge case superiority | Treatment > Control on edge cases, p < 0.05 (Wilcoxon) | Paired signed-rank test |
| Normal equivalence | \|Treatment - Control\| < 1 CQS point on normal queries (TOST) | Two one-sided tests |
| D6 gate | Control D6=0 rate > Treatment D6=0 rate | Gate failure frequency |
| Dimension-specific effect | ≥2 dimensions show significant (p<0.05) treatment advantage | Per-dimension Wilcoxon |
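For the paired tests above, `scipy.stats.wilcoxon` is the usual tool. As a dependency-free illustration of what the edge-case criterion computes, here is a normal-approximation sketch of the paired signed-rank test (adequate for n = 39 pairs, but not for small samples, and without tie correction in the variance):

```python
from math import sqrt
from statistics import NormalDist

def wilcoxon_signed_rank(treatment, control):
    """Paired Wilcoxon signed-rank test via the normal approximation.

    Zero differences are dropped; tied |differences| receive average
    ranks. Use scipy.stats.wilcoxon for exact small-sample p-values.
    """
    diffs = [t - c for t, c in zip(treatment, control) if t != c]
    n = len(diffs)
    # Rank |d| ascending, averaging ranks within tie groups
    ordered = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[ordered[j + 1]]) == abs(diffs[ordered[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[ordered[k]] = avg
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    mu = n * (n + 1) / 4
    sigma = sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mu) / sigma
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return w_plus, p_two_sided
```

The TOST equivalence check on normal queries is separate: it runs two one-sided tests against the ±1 CQS bounds and requires both to reject.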
## 6.4 What Constitutes Failure
These outcomes would invalidate or undermine the evaluation:
- If treatment scores **lower** than control on normal queries → pragmatics cause harm
- If inter-rater α < 0.2 → rubric is unreliable, cannot draw conclusions
- If judge panel shows >1 CQS point vendor bias → scoring contaminated
- If >20% of treatment responses fail to use tools → agent loop broken
- If treatment and control are statistically indistinguishable on edge cases → pragmatics don't help where they should
Each failure mode carries a different implication. "Pragmatics cause harm" kills the thesis outright. "Rubric unreliable" means redesigning the rubric and re-running. "Agent loop broken" means fixing the harness and re-running. The distinction matters for how we respond to a negative result.