### 8.3 Stage 1: Response Generation Pipeline
| ID | Requirement | Priority |
|----|------------|----------|
| VR-020 | Response generation SHALL produce responses for three conditions per test query: control (data tools, no methodology), RAG (data tools, methodology via retrieved chunks), and pragmatics (data tools, methodology via curated MCP tool). All three conditions SHALL have equal access to `get_census_data` and `explore_variables`. The only experimental variable is the form of methodology support | Must |
| VR-021 | Response generation SHALL use a single caller model for all conditions within an evaluation round, controlled by `judge_config.yaml` | Must |
| VR-022 | Response generation SHALL record complete provenance: model string, system prompt (full text), tool call transcripts (including full unsanitized tool returns), pragmatics context IDs returned (pragmatics condition), retrieved chunk metadata (RAG condition), token counts, and latency | Must |
| VR-023 | All three conditions SHALL use the same agent loop with configurable `max_tool_rounds` (default: 20). If the loop exhausts without the model issuing a final response, the system SHALL perform forced synthesis and flag `tool_rounds_exhausted=True` | Must |
| VR-024 | Response generation SHALL output individual ResponseRecord objects in JSONL, one file per condition: `{condition}_responses_{timestamp}.jsonl`. Files SHALL be written to `results/v2_redo/stage1/` | Must |
| VR-025 | Tool filtering SHALL exclude `get_methodology_guidance` from the tool list passed to the Anthropic API for control and RAG conditions. The pragmatics condition SHALL receive the full tool list including `get_methodology_guidance` | Must |
| VR-026 | System prompts SHALL be minimal and equivalent across conditions. Control and RAG SHALL use an identical base prompt. RAG augments the base prompt with retrieved chunks only. The pragmatics prompt adds only the instruction to call `get_methodology_guidance` first. No condition's prompt SHALL contain quality coaching (e.g., "always provide margins of error") | Must |
| VR-027 | Response generation SHALL perform runtime contamination verification: an assertion SHALL confirm that `get_methodology_guidance` is absent from the tool set before every control and RAG query, and present before every pragmatics query. Assertion failure SHALL halt the run | Must |
| VR-028 | Response generation SHALL perform post-run contamination verification: scan all output files and report the count of `get_methodology_guidance` calls per condition. Any such call in control or RAG output SHALL be flagged as contaminated | Must |
| VR-029 | All three conditions SHALL be generated in a single harness run with a shared timestamp to eliminate temporal confounds (API behavior changes, model version drift). The `--condition all` flag SHALL execute control, RAG, and pragmatics sequentially within one MCP server session | Must |
| VR-030 | The agent loop SHALL sanitize `get_census_data` tool results before passing them to the model for control and RAG conditions: the `pragmatics` field SHALL be stripped from the result dict. The full unsanitized result SHALL be preserved in the `ToolCall` log record for fidelity verification. The pragmatics condition SHALL receive unsanitized tool results | Must |
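The tool-filtering and runtime-assertion requirements above (VR-025, VR-027) can be sketched as follows. This is a minimal illustration, not the harness implementation: the helper names (`filter_tools`, `assert_no_contamination`) and the list-of-dicts tool representation are assumptions for the example.

```python
# Hypothetical sketch of per-condition tool filtering (VR-025) and the
# runtime contamination assertion (VR-027). Names are illustrative.
METHODOLOGY_TOOL = "get_methodology_guidance"

def filter_tools(all_tools: list[dict], condition: str) -> list[dict]:
    """Exclude the methodology tool for control and RAG; pass everything to pragmatics."""
    if condition in ("control", "rag"):
        return [t for t in all_tools if t["name"] != METHODOLOGY_TOOL]
    return list(all_tools)  # pragmatics condition: full tool list

def assert_no_contamination(tools: list[dict], condition: str) -> None:
    """Halt the run (via AssertionError) if the tool set violates the condition's contract."""
    names = {t["name"] for t in tools}
    if condition in ("control", "rag"):
        assert METHODOLOGY_TOOL not in names, f"contaminated tool set for {condition}"
    else:
        assert METHODOLOGY_TOOL in names, f"methodology tool missing for {condition}"
```

Running the assertion before every query, rather than once at startup, guards against any code path that mutates the tool list mid-run.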
**Rationale:** V1 confounded tool access with knowledge representation — control and RAG had no data tools while pragmatics had full tool access. 33 of 39 RAG responses directed users to data.census.gov because the model had no way to retrieve data. V2 equalizes tool access so the only variable is the form of methodology support: none (control), retrieved document chunks (RAG), or curated expert judgment via MCP tool (pragmatics). The contamination checks (VR-027, VR-028) are defense-in-depth against the class of confound that invalidated V1.

VR-030 addresses a second contamination vector discovered during spot-checking: `get_census_data` bundles curated pragmatics content (context IDs, guidance text, thread edges) in every response via `retriever.get_guidance_by_parameters()`. Without sanitization, all three conditions receive curated expert judgment through the data tool response payload, defeating the experimental design. See `talks/fcsm_2026/2026-02-16_pragmatics_leakage.md`.
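The VR-030 sanitization step might look like the sketch below. The function name, the two-value return convention, and the assumption that tool results are plain dicts are all illustrative; the actual agent loop may structure this differently.

```python
import copy

def sanitize_tool_result(result: dict, condition: str) -> tuple[dict, dict]:
    """Hypothetical VR-030 sketch: return (model_visible, full_for_log).

    The full unsanitized result is deep-copied first so the ToolCall log
    record preserves it for fidelity verification even if the visible
    copy is later mutated.
    """
    full = copy.deepcopy(result)
    if condition in ("control", "rag"):
        # Strip the curated pragmatics payload before the model sees it.
        visible = {k: v for k, v in result.items() if k != "pragmatics"}
    else:
        visible = result  # pragmatics condition receives the unsanitized result
    return visible, full
```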
**Location:** `src/eval/agent_loop.py`, `src/eval/harness.py`, config in `src/eval/judge_config.yaml`.
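The VR-028 post-run scan could be implemented along these lines. The record schema (`tool_calls` list with `tool_name` keys) and the helper name are assumptions for the example, not the actual ResponseRecord layout.

```python
import json
from collections import Counter
from pathlib import Path

def scan_contamination(stage1_dir: str) -> Counter:
    """Hypothetical VR-028 sketch: count get_methodology_guidance calls
    per condition across the {condition}_responses_{timestamp}.jsonl files."""
    counts: Counter = Counter()
    for path in Path(stage1_dir).glob("*_responses_*.jsonl"):
        condition = path.name.split("_responses_")[0]
        with path.open() as f:
            for line in f:
                record = json.loads(line)
                for call in record.get("tool_calls", []):
                    if call.get("tool_name") == "get_methodology_guidance":
                        counts[condition] += 1
    return counts
```

Any nonzero count for `control` or `rag` marks that condition's output as contaminated per VR-028.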