# Claude Code Task: CQS Test Harness Implementation
## Context
You are building the test harness for Phase 4B of the Census MCP Server evaluation.
The smoke test (`src/eval/smoke_test_mcp.py`) already proves the MCP client layer works.
Now build the full harness that generates paired control/treatment responses.
## Architecture Reference
Read FIRST: `/Users/brock/Documents/GitHub/census-mcp-server/docs/verification/cqs_harness_architecture.md`
Decision log: `/Users/brock/Documents/GitHub/census-mcp-server/docs/verification/phase4b_decision_log.md`
Test battery: `/Users/brock/Documents/GitHub/census-mcp-server/docs/verification/cqs_test_battery.md`
## Working Environment
- Python env: `/opt/anaconda3/envs/census-mcp/bin/python`
- Project root: `/Users/brock/Documents/GitHub/census-mcp-server`
- Source dir: `src/eval/`
- .env file at project root has CENSUS_API_KEY and ANTHROPIC_API_KEY
- MCP server module: `census_mcp.server` (launched via `-m census_mcp.server`)
- Smoke test pattern in `src/eval/smoke_test_mcp.py` — reuse the MCP client approach
## Files to Create
### 1. `src/eval/__init__.py` (empty)
### 2. `src/eval/models.py` — Pydantic data models
```python
"""Pydantic models for CQS evaluation data structures."""
from pydantic import BaseModel
from typing import Literal, Optional
from datetime import datetime
class ToolCall(BaseModel):
"""Record of a single tool call during agent loop."""
tool_name: str
arguments: dict
result: dict # parsed JSON from tool response
latency_ms: float
class ResponseRecord(BaseModel):
"""Complete record of one response (control or treatment)."""
query_id: str
condition: Literal["control", "treatment"]
model: str
system_prompt: str
response_text: str
tool_calls: list[ToolCall] = []
pragmatics_returned: list[str] = [] # context_ids extracted from tool results
total_latency_ms: float
input_tokens: int
output_tokens: int
timestamp: datetime
class QueryPair(BaseModel):
"""Paired control + treatment for one query."""
query_id: str
query_text: str
category: str
difficulty: str
control: ResponseRecord
treatment: ResponseRecord
```
### 3. `src/eval/mcp_client.py` — MCP subprocess client
Adapt the smoke test pattern. Key requirements:
- `MCPClient` class with `start()`, `call_tool()`, `list_tools()`, `stop()`
- Uses `mcp.ClientSession` + `mcp.client.stdio.stdio_client`
- Env vars loaded from .env via dotenv
- MCP server params match claude_desktop_config.json:
- command: `/opt/anaconda3/envs/census-mcp/bin/python`
- args: `["-m", "census_mcp.server"]`
- env: PYTHONPATH, PACKS_DIR, CENSUS_API_KEY, LOG_LEVEL, PYTHONUNBUFFERED, PATH
- Health check: list tools, confirm the 3 expected tools (`get_methodology_guidance`, `get_census_data`, `explore_variables`) are present
- NOTE: The `stdio_client` is an async context manager. The session must stay open
for the lifetime of the harness run. Design accordingly (context manager or explicit
lifecycle management).
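The "session must stay open" requirement rules out a simple `async with` inside `call_tool()`. One way to handle it is `contextlib.AsyncExitStack`, which enters the context managers in `start()` and closes them in `stop()`. A minimal sketch of that lifecycle pattern, where `fake_stdio_client` is a runnable stand-in for the real `stdio_client(server_params)` (and the nested `ClientSession`), neither of which is imported here:

```python
"""Sketch: keeping an async context open across start()/stop().

fake_stdio_client is a placeholder; the real client would enter
stdio_client(server_params) and then ClientSession(read, write),
followed by session.initialize().
"""
import asyncio
from contextlib import AsyncExitStack, asynccontextmanager


@asynccontextmanager
async def fake_stdio_client():
    # Stand-in for mcp.client.stdio.stdio_client(server_params)
    yield ("read_stream", "write_stream")


class MCPClient:
    """Holds the transport open for the whole harness run."""

    def __init__(self):
        self._stack = AsyncExitStack()
        self.streams = None

    async def start(self):
        # enter_async_context keeps the context alive after this method
        # returns, instead of closing it at the end of an `async with`
        self.streams = await self._stack.enter_async_context(fake_stdio_client())

    async def stop(self):
        # Unwinds every context entered on the stack, in reverse order
        await self._stack.aclose()


async def demo():
    client = MCPClient()
    await client.start()
    streams = client.streams  # transport still open here; tool calls are safe
    await client.stop()
    return streams
```

The same stack can hold both the stdio transport and the `ClientSession`, so `stop()` tears them down in the right order.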
### 4. `src/eval/agent_loop.py` — Claude API agent loop
Key requirements:
- Uses `anthropic` Python SDK (AsyncAnthropic)
- Two methods: `run_control(query)` and `run_treatment(query, mcp_client)`
- Model: `claude-sonnet-4-5-20250514` (pinned per DEC-4B-013)
- max_tokens: 2048
- max_tool_rounds: 5 (safety limit)
**Control path:**
- System prompt: "You are a helpful assistant answering questions about U.S. Census data. Provide accurate, well-sourced information."
- No tools parameter
- Single API call, record response text + token usage + latency
**Treatment path:**
- System prompt (from architecture doc):
```
You are a statistical consultant helping users access and understand U.S. Census data.
You have access to Census data tools. For every query:
1. FIRST call get_methodology_guidance with relevant topics to ground your response
2. Use get_census_data to retrieve actual data with margins of error
3. Use explore_variables if you need to find the right variable codes
Always provide:
- Specific table/variable codes and geography identifiers
- Margins of error and reliability context
- Appropriate caveats about fitness-for-use
If the data is unavailable or unreliable for the stated purpose, say so and explain why.
Recommend alternatives when possible.
```
- Tools parameter: built from `mcp_client.list_tools()` — convert MCP tool schemas to Anthropic tool format
- Agent loop:
1. Send message
2. If response has `tool_use` content blocks:
a. For each tool_use block, call `mcp_client.call_tool(name, input)`
b. Build tool_result content blocks
c. Append assistant response + tool_results to messages
d. Send again
3. Repeat until response has no tool_use blocks OR max_tool_rounds exceeded
4. Collect final text from all text content blocks
5. Extract pragmatics context_ids from tool results (scan serialized results for identifiers like "ACS-MOE-001")
6. Return ResponseRecord with all metadata
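Steps 4-5 above reduce to two small helpers. A sketch under two assumptions: context_ids follow a LETTERS-LETTERS-DIGITS shape like the "ACS-MOE-001" example (adjust the regex if the real pragmatics pack ids differ), and content blocks are shown as plain dicts (the Anthropic SDK returns objects with `.type`/`.text` attributes):

```python
"""Sketch of steps 4-5: final-text collection and pragmatics extraction."""
import json
import re

# Assumed id shape, generalized from the "ACS-MOE-001" example in the spec
CONTEXT_ID_RE = re.compile(r"\b[A-Z]{2,}-[A-Z0-9]{2,}-\d{3}\b")


def collect_text(content_blocks: list[dict]) -> str:
    """Join the text of every `text` content block in the final message."""
    return "\n".join(b["text"] for b in content_blocks if b.get("type") == "text")


def extract_context_ids(tool_results: list[dict]) -> list[str]:
    """Scan serialized tool results for context ids, deduped in first-seen order."""
    seen: dict[str, None] = {}  # dict preserves insertion order
    for result in tool_results:
        for match in CONTEXT_ID_RE.findall(json.dumps(result)):
            seen[match] = None
    return list(seen)
```

Serializing each result with `json.dumps` before matching means the ids are found regardless of how deeply the server nests them.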
**Tool schema conversion:**
MCP tools have `inputSchema` (JSON Schema). Anthropic tools need `input_schema`.
The conversion is: `{"name": tool.name, "description": tool.description, "input_schema": tool.inputSchema}`
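Wrapped over the whole tool list, with a guard for tools whose description is `None` (an assumption; the Anthropic API expects a string), that conversion looks like:

```python
"""Sketch of MCP -> Anthropic tool schema conversion."""


def mcp_tools_to_anthropic(mcp_tools) -> list[dict]:
    """MCP exposes camelCase `inputSchema`; Anthropic wants snake_case `input_schema`."""
    return [
        {
            "name": t.name,
            "description": t.description or "",  # guard against a None description
            "input_schema": t.inputSchema,
        }
        for t in mcp_tools
    ]
```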
### 5. `src/eval/harness.py` — Main test runner
Key requirements:
- Loads test battery from `src/eval/battery/queries.yaml`
- For each query: runs control, then treatment, writes QueryPair to JSONL
- JSONL output to `results/cqs_responses_{timestamp}.jsonl`
- Resume capability: on start, check output file for existing query_ids, skip them
- CLI interface: `python -m eval.harness [--query-ids NORM-001 GEO-006] [--output path]`
- On MCP or API failure for a single query: log error, skip, continue battery
- Summary stats at end: queries completed, failed, total time
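The resume check can be a single pass over the existing JSONL, assuming each line is one serialized QueryPair with a top-level `query_id` field. A sketch that also tolerates a truncated final line from an interrupted run:

```python
"""Sketch of the JSONL resume check for the harness."""
import json
from pathlib import Path


def completed_query_ids(output_path: Path) -> set[str]:
    """Return query_ids already written; skip unparseable (partial) lines."""
    done: set[str] = set()
    if not output_path.exists():
        return done
    for line in output_path.read_text().splitlines():
        try:
            done.add(json.loads(line)["query_id"])
        except (json.JSONDecodeError, KeyError):
            continue  # partial line from an interrupted run: that query re-runs
    return done
```

The harness then filters the battery with `if q["id"] not in done` before running each pair.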
### 6. `src/eval/battery/queries.yaml` — Machine-readable test battery
Convert the test battery markdown into YAML. Structure:
```yaml
queries:
- id: "NORM-001"
text: "What is the total population of Harris County, Texas?"
category: "normal"
difficulty: "normal"
pragmatics_exercised: []
cqs_dimensions_tested: ["D1", "D5", "D6"]
- id: "NORM-002"
...
```
Extract ALL 39 queries from `docs/verification/cqs_test_battery.md`.
Use the `query` field as `text`. Include all metadata fields.
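Loading that structure is a few lines, assuming PyYAML is available in the env; a sketch that also sanity-checks the required fields so a botched markdown-to-YAML conversion fails loudly instead of producing empty records:

```python
"""Sketch of loading and validating the battery file (assumes PyYAML)."""
import yaml

REQUIRED_FIELDS = {"id", "text", "category", "difficulty"}


def load_battery(yaml_text: str) -> list[dict]:
    """Parse queries.yaml content and verify each query has the required fields."""
    queries = yaml.safe_load(yaml_text)["queries"]
    for q in queries:
        missing = REQUIRED_FIELDS - q.keys()
        if missing:
            raise ValueError(f"query {q.get('id', '?')} missing fields: {missing}")
    return queries
```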
## Key Constraints
- Do NOT modify any existing files outside `src/eval/`
- Do NOT modify server.py or any census_mcp code
- Load API keys from .env only, never hardcode
- All new code goes in `src/eval/`
- Create `results/` directory if it doesn't exist (add .gitkeep)
- Create `src/eval/battery/` directory for queries.yaml
## Testing
After building, run a single-query test:
```bash
cd /Users/brock/Documents/GitHub/census-mcp-server
/opt/anaconda3/envs/census-mcp/bin/python -m eval.harness --query-ids NORM-001
```
This should produce one QueryPair in the JSONL output with both control and treatment responses.
## Do NOT build yet (deferred to later tasks):
- judge_pipeline.py (H.18-H.20)
- analysis.py (H.24)
- lib/stats.py (copy from harmonization project)
These depend on the response data this harness generates.