# ADR-002: Deterministic Narratives Over LLM-Generated Interpretation

## Status: Accepted
## Date: 2025-02-21

## Context
Our MCP tools return explainability results to an LLM (Sonnet 4.5 in Copilot).
The question is: who generates the English explanation — the LLM or our code?

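SHAP values are precise numerical quantities. A SHAP value of +0.23 for
"water_cut" has a specific meaning: this feature shifted the model output by
0.23 toward the positive class, relative to the base value (the model's average
output over the background data). The units are those of the model output,
which may be probability or log-odds depending on the explainer. Allowing an
LLM to interpret these values introduces risk of hallucination, inconsistency,
or loss of precision.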
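
As a worked illustration of why these values leave little room for interpretation, SHAP contributions are additive: the base value plus the per-feature contributions reconstructs the model output for that row. The sketch below uses made-up numbers; apart from `water_cut`, the feature names, the base value, and the helper `reconstruct_prediction` are assumptions for illustration and are not taken from the project.

```python
# Hypothetical values for illustration only; not project data.
base_value = 0.35  # assumed average model output over the background data
shap_values = {
    "water_cut": 0.23,     # the example value from the text above
    "pressure": -0.08,     # hypothetical feature
    "temperature": 0.05,   # hypothetical feature
}


def reconstruct_prediction(base: float, contributions: dict) -> float:
    """Return the base value plus the sum of per-feature SHAP contributions."""
    return base + sum(contributions.values())


prediction = reconstruct_prediction(base_value, shap_values)
print(f"model output for this row: {prediction:.2f}")
```

Any English sentence about `water_cut` therefore has to agree with this arithmetic, which is the precision the narrative must not lose.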

## Decision
All English narratives are generated **deterministically by Python code** in
`narrators.py`. The LLM's role is limited to:
1. Understanding the user's question and choosing the right tool
2. Presenting the pre-computed narrative conversationally

The LLM does NOT interpret, rephrase, or recompute the narrative content.
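
A minimal sketch of what a deterministic narrator could look like, assuming a simple fixed-template approach. The function name `narrate_contribution`, its signature, and the sentence wording are hypothetical and are not the actual contents of `narrators.py`.

```python
def narrate_contribution(feature: str, shap_value: float) -> str:
    """Render one SHAP contribution as a fixed-template English sentence.

    Hypothetical helper: the same inputs always produce the same string.
    """
    if shap_value == 0:
        return f'"{feature}" had no effect on the prediction.'
    direction = "toward" if shap_value > 0 else "away from"
    # Format the magnitude with a fixed precision so output never drifts.
    return (
        f'"{feature}" pushed the prediction {abs(shap_value):.2f} '
        f"{direction} the positive class."
    )


print(narrate_contribution("water_cut", 0.23))
```

Because the sentence is assembled from a template and a fixed number format, the sign of the value alone decides the direction wording, and no model call is involved.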

## Consequences

### Positive
- ✅ **Reproducible:** same data + same model = same English explanation, every time
- ✅ **Auditable:** narratives can be snapshot-tested and version-controlled
- ✅ **Trustworthy:** the narrative text is produced by code, so it cannot contain hallucinated feature importance or direction
- ✅ **Testable:** `pytest` can assert exact narrative content
- ✅ **Regulatory-friendly:** consistent outputs support compliance requirements
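
The reproducibility and testability points above can be sketched as `pytest`-style tests with exact string assertions. The narrator `narrate_top_feature` and the input values are hypothetical stand-ins, not the project's real functions or data.

```python
def narrate_top_feature(contributions: dict) -> str:
    """Hypothetical narrator: describe the largest-magnitude contribution."""
    # Sort by magnitude, then by name, so ties resolve identically on every run.
    feature, value = sorted(
        contributions.items(), key=lambda item: (-abs(item[1]), item[0])
    )[0]
    direction = "toward" if value > 0 else "away from"
    return (
        f'The strongest driver is "{feature}", pushing the prediction '
        f"{abs(value):.2f} {direction} the positive class."
    )


def test_narrative_is_exact():
    contributions = {"pressure": -0.08, "water_cut": 0.23}
    expected = (
        'The strongest driver is "water_cut", pushing the prediction '
        "0.23 toward the positive class."
    )
    assert narrate_top_feature(contributions) == expected


def test_narrative_ignores_input_order():
    forward = {"pressure": -0.08, "water_cut": 0.23}
    backward = {"water_cut": 0.23, "pressure": -0.08}
    assert narrate_top_feature(forward) == narrate_top_feature(backward)
```

An exact-match assertion like this only works because the output is deterministic; the same test against LLM-generated prose would be flaky by construction.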

### Negative
- ⚠️ Narratives may sound less "natural" than LLM-generated prose
- ⚠️ Template maintenance required when adding new explanation types
- ⚠️ Less flexible for unexpected question formats (LLM could adapt, templates can't)

### Mitigations
- Invest in well-crafted templates that read naturally
- The LLM still adds conversational polish when presenting (greeting, context, follow-up)
- Copilot instructions tell the LLM to present narratives verbatim (see copilot-instructions.md)

## Alternatives Considered
1. **LLM generates explanations from raw SHAP values** — Rejected: non-reproducible,
   risks hallucination, untestable output, fails audit requirements.
2. **Hybrid: Python generates data, LLM generates prose** — Rejected: still
   non-reproducible. Even small LLM rephrasing can change meaning of quantitative
   statements. Gains in naturalness don't justify loss of determinism.
3. **LLM post-processes with strict constraints** — Rejected: constraints are
   brittle; "don't change the numbers" is hard to enforce reliably across LLM versions.
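
To illustrate why a "don't change the numbers" constraint is brittle, here is a naive checker that only verifies numeric literals survive in the LLM output. The function `numbers_preserved`, the regular expression, and the sample sentences are all assumptions made for this sketch; the project does not necessarily contain such a check.

```python
import re

# Matches simple integer or decimal literals such as 0.23 or -5.
NUMBER = re.compile(r"[-+]?\d+(?:\.\d+)?")


def numbers_preserved(narrative: str, llm_output: str) -> bool:
    """Naive check: every numeric literal in the narrative appears in the output."""
    produced = set(NUMBER.findall(llm_output))
    return all(number in produced for number in NUMBER.findall(narrative))


narrative = '"water_cut" pushed the prediction 0.23 toward the positive class.'

# Verbatim presentation with a conversational wrapper: passes, as intended.
faithful = f"Here is the explanation: {narrative}"

# Direction flipped, number untouched: the check still passes.
flipped = '"water_cut" pushed the prediction 0.23 away from the positive class.'

# Number paraphrased into words: the check fails even if the intent was harmless.
reworded = '"water_cut" pushed the prediction by about a quarter toward the positive class.'

print(numbers_preserved(narrative, faithful))
print(numbers_preserved(narrative, flipped))
print(numbers_preserved(narrative, reworded))
```

The flipped sentence keeps every number yet reverses the meaning, which is the kind of failure a number-only constraint cannot catch.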
