# Lesson Learned: Tool Non-Use in Treatment Path
**Date:** 2026-02-12
**Phase:** 4B Stage 1
**Severity:** Moderate — affects 5/39 queries (13%)
## Observation
Five treatment-path responses contained zero tool calls despite tools being available and the system prompt instructing "FIRST call get_methodology_guidance." All five were ambiguity, mismatch, or underspecified queries where the model chose to request clarification rather than call tools.
| Query ID | Category | Model Decision |
|---|---|---|
| GEO-004 | geographic_edge/trap | "Which Portland?" — asked for disambiguation |
| SML-004 | small_area/tricky | "What variables?" — query underspecified |
| AMB-001 | ambiguity/trap | "Which Springfield?" — asked for disambiguation |
| AMB-002 | ambiguity/trap | "What area?" — no geography provided |
| MIS-003 | product_mismatch/tricky | Redirected away from ACS for monthly data |
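The flagged responses were identified by filtering the harness output for treatment-path records with empty tool-call lists. A minimal sketch, assuming a JSON-lines output file with `query_id`, `path`, and `tool_calls` fields (field names are illustrative, not the harness's actual schema):

```python
import json

def find_zero_tool_responses(path: str) -> list[str]:
    """Return query IDs of treatment responses that made no tool calls.

    Assumes a JSON-lines file where each record has `query_id`,
    `path` ("control"/"treatment"), and a `tool_calls` list.
    Field names are illustrative.
    """
    flagged = []
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            if rec.get("path") == "treatment" and not rec.get("tool_calls"):
                flagged.append(rec["query_id"])
    return flagged
```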
## Root Cause
Stochastic model behavior. Sonnet evaluated each query, determined that clarification was more appropriate than tool use, and responded directly without calling any tools. The system prompt said "FIRST call get_methodology_guidance" but did not say "ALWAYS, regardless of whether you need clarification."
This is an expected cost of working with stochastic models: behavioral variance is a feature to manage, not a bug to eliminate.
## Why This Matters
1. **Diagnostic ambiguity:** The harness records tool calls that happened, not tool calls that were considered and rejected. We cannot distinguish "model chose not to call tools" from "tools were unavailable" in the output data.
2. **Missed pragmatics-informed clarification:** A model that grounds first could provide *better* clarification. Example: "Which Springfield? Note that for Springfields under 65,000 population, only ACS 5-year estimates are available, which affects what we can tell you." This is strictly superior to generic clarification.
3. **Experimental confound:** These 5 queries will score similarly in control and treatment because the treatment didn't actually *use* the treatment. This dilutes the measured treatment effect.
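The dilution in point 3 can be quantified with back-of-envelope arithmetic (a simple linear-attenuation assumption, not a fitted model):

```python
# If 5 of 39 treatment responses never used the treatment, those queries
# behave like controls, attenuating the measured effect toward the null.
n_queries = 39
n_no_tools = 5
compliance_rate = (n_queries - n_no_tools) / n_queries
# Under a simple linear model, measured_effect ≈ true_effect * compliance_rate,
# so a true effect of 10 points would surface as roughly 8.7 points.
print(round(compliance_rate, 3))
```

This is the same non-compliance correction used in intent-to-treat analyses: the raw difference understates the effect among queries that actually received the treatment.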
## Corrective Actions
### Fix 1: Strengthen treatment system prompt
Add an explicit instruction to always ground first, even when clarification is needed:
```
IMPORTANT: ALWAYS call get_methodology_guidance first, even when you need
to ask for clarification. Use the guidance to provide informed clarification
that helps the user understand what data is available and what limitations
apply to their request.
```
### Fix 2: Add diagnostic logging to harness
Add a `tools_offered: bool` field to `ResponseRecord`. Set it to `True` whenever tools were passed to the API, regardless of whether the model used them. This distinguishes "tools available but unused" from "tools unavailable."
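A minimal sketch of the change, assuming `ResponseRecord` is a dataclass (its real fields are unknown; names here are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class ResponseRecord:
    query_id: str
    response_text: str
    tool_calls: list[str] = field(default_factory=list)
    # New diagnostic field: True whenever tools were passed to the API,
    # regardless of whether the model invoked any of them.
    tools_offered: bool = False

    @property
    def tools_unused(self) -> bool:
        """Tools were available but the model chose not to call them."""
        return self.tools_offered and not self.tool_calls
```

With this field, the ambiguous case in "Why This Matters" becomes directly queryable: `tools_offered and not tool_calls` is the "model chose not to call tools" signature.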
### Fix 3: Re-run battery
With both fixes applied, re-run all 39 queries and compare tool usage rates against this run. Archive the previous (now-stale) results.
## Principle
Non-determinism in LLM behavior is a feature, not a bug. The correct response is:
- **Account for it** in experimental design (statistical aggregation, not exact reproducibility)
- **Log it** for diagnostic transparency (tools_offered vs tools_used)
- **Strengthen contracts** where behavior matters (prompt engineering for tool compliance)
- **Document it** as a known source of variance
This is analogous to measurement error in survey methodology — you don't eliminate it, you quantify it and account for it in your analysis.
> "Model stochasticity is a parameter to manage; system nondeterminism is a failure mode to control." — B. Webb