Open Census MCP Server

VERIFICATION_SUMMARY.md•4.17 KiB

# Judge Pipeline Fixes - Verification Summary

**Date:** 2026-02-13

**Task:** Full re-run preparation (temporal fix + Google truncation fix)

## Changes Implemented

### Fix 1: Temporal Grounding in Judge Prompt

**File:** `src/eval/judge_prompts.py`

**Change:** Added temporal grounding line at start of judge prompt:

```
Today's date is February 13, 2026. Score based on data availability as of this date.
```

**Rationale:** Addresses systematic D1 score reversal where all judges penalized treatment responses for citing 2020-2024 ACS 5-year data (released Jan 29, 2026). Minimal intervention - no enumeration of specific releases, domain-agnostic.

### Fix 2: Google max_output_tokens Increase

**File:** `src/eval/judge_config.yaml`

**Change:** Added `max_output_tokens: 8192` to Google judge config

**File:** `src/eval/judge_pipeline.py` (line 387-389)

**Change:** Updated config merging to respect per-vendor max_output_tokens

**Rationale:** 92/234 Google responses (39.3%) truncated due to insufficient token budget

### Fix 3: Google Truncation Retry

**File:** `src/eval/judge_pipeline.py` (call_google function)

**Change:** Added JSON validation before returning response:

```python
try:
    json.loads(content)
except json.JSONDecodeError:
    raise ValueError(f"Truncated JSON response ({len(content)} chars)")
```

**Rationale:** Triggers retry on malformed JSON responses

### Fix 4: Analysis Deduplication

**File:** `src/eval/judge_analysis.py` (load_judge_scores function)

**Change:** Added deduplication logic to keep latest record per unique key:

```python
key = (r['query_id'], r['judge_vendor'], r['presentation_order'], r['pass_number'])
```

**Rationale:** Handles re-runs where new records append to existing JSONL

### Fix 5: Checkpoint Clearing

**Action:** Cleared all checkpoint entries (all vendors)

**Script:** `clear_all_checkpoints.py`

**Result:** Checkpoint now at 0 entries, forcing full re-run

### Fix 6: Decision Log

**File:** `docs/verification/phase4b_decision_log_DEC022.md`

**Content:** Documents temporal evaluation confound, minimal intervention rationale, and finding that this generalizes to all time-sensitive domains

## Verification Results

### ✓ Check 1: Temporal grounding in judge prompt

```
PASS: Judge prompt starts with temporal grounding
First line: Today's date is February 13, 2026. Score based on data availability as of this date.
```

### ✓ Check 2: Google max_output_tokens = 8192

```
PASS: Google max_output_tokens = 8192
```

### ✓ Check 3: call_google has JSON truncation check

```
PASS: call_google has JSON truncation check
```

### ✓ Check 4: Analysis script has deduplication

```
PASS: Analysis script has deduplication logic
- Creates seen dictionary
- Uses correct key tuple
- Reports deduplication stats
```

### ✓ Check 5: Checkpoint cleared (0 entries)

```
PASS: Checkpoint cleared (0 entries)
Original: 468 entries (Google was previously cleared)
Final: 0 entries
```

### ✓ Check 6: Decision log created

```
PASS: Decision log created
Path: docs/verification/phase4b_decision_log_DEC022.md
- Contains correct title
- Contains date
- Contains intervention text
```

### ✓ Check 7: pytest test suite

```
PASS: 47/47 tests passed in 10.03s
```

## Ready for Pipeline Execution

All fixes implemented and verified. The pipeline is ready to run:

- All checkpoint entries cleared → full re-run of 702 judge calls
- Temporal grounding added → addresses D1 score reversal
- Google token limit increased → prevents truncation
- Deduplication in place → handles JSONL append model
- All tests passing → no regressions

**Next step:** User executes pipeline manually

```bash
/opt/anaconda3/envs/census-mcp/bin/python -m eval.judge_pipeline src/eval/judge_config.yaml
```

## Expected Behavior

1. Pipeline reads 39 query pairs from Stage 1 results
2. Skips 0 tasks (checkpoint empty)
3. Runs all 702 judge calls (3 vendors × 6 passes × 39 queries)
4. Appends new records to existing JSONL
5. Analysis script auto-deduplicates on load

## Artifacts

- `cleanup_google_checkpoint.py` - Script to clear Google entries only (used in earlier iteration)
- `clear_all_checkpoints.py` - Script to clear all checkpoint entries (used for full re-run)
- Both scripts preserved for reproducibility
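
## Illustrative Sketch: Deduplication on Load

The deduplication from Fix 4 can be sketched as follows. This is a minimal illustration, not the code in `src/eval/judge_analysis.py`: the helper name `dedupe_latest` and the `score` field in the sample records are hypothetical, and it assumes that later lines in the JSONL are the more recent records, which matches the append model described above.

```python
import json


def dedupe_latest(jsonl_lines):
    """Keep the last record seen for each unique key (hypothetical helper)."""
    seen = {}
    total = 0
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        r = json.loads(line)
        total += 1
        key = (r['query_id'], r['judge_vendor'], r['presentation_order'], r['pass_number'])
        seen[key] = r  # a later record overwrites an earlier one with the same key
    # Return the surviving records and the number of duplicates dropped.
    return list(seen.values()), total - len(seen)


# Illustrative records only; field values and the `score` field are assumptions.
sample = [
    '{"query_id": "q1", "judge_vendor": "google", "presentation_order": "AB", "pass_number": 1, "score": 2}',
    '{"query_id": "q1", "judge_vendor": "google", "presentation_order": "AB", "pass_number": 1, "score": 4}',
    '{"query_id": "q1", "judge_vendor": "google", "presentation_order": "BA", "pass_number": 1, "score": 3}',
]
records, dropped = dedupe_latest(sample)
print(f"kept {len(records)} records, dropped {dropped} duplicates")
```

Keying on a tuple and letting later writes win keeps the JSONL append-only, so a re-run never has to rewrite or truncate earlier results.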
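
## Illustrative Sketch: Truncation Check With Retry

Fix 3 raises a `ValueError` on unparseable JSON so that the response is retried. The summary does not show the retry mechanism itself, so the loop below is an assumption: `call_with_retry`, `validate_json`, and `max_attempts` are hypothetical names, and the lambda stands in for a real judge call.

```python
import json


def validate_json(content):
    """Raise ValueError when content is not parseable JSON, as in Fix 3."""
    try:
        json.loads(content)
    except json.JSONDecodeError:
        raise ValueError(f"Truncated JSON response ({len(content)} chars)")
    return content


def call_with_retry(fetch, max_attempts=3):
    """Hypothetical retry loop: call fetch() until it returns valid JSON."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return validate_json(fetch())
        except ValueError as err:
            last_error = err  # remember the failure and try again
    raise RuntimeError(f"No valid JSON after {max_attempts} attempts") from last_error


# Stand-in for a judge call: the first response is cut off, the second is complete.
responses = iter(['{"score": 4, "reason": "cit', '{"score": 4, "reason": "cites ACS"}'])
result = call_with_retry(lambda: next(responses))
print(result)
```

Validating before returning means a truncated response never reaches the JSONL, at the cost of one extra parse per call.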



