mcp-confidence
A language model can tell you how surprised it was by every token it just produced — that's what logprobs=True returns. mcp-confidence reads those numbers and turns them into a routing decision your agent can act on:
HIGH → auto-accept the model's output.
MID → verify (hand off to a stronger model or a supervisor check).
LOW → ask a human.
It's local- and open-model-first (vLLM, Ollama, llama.cpp, TGI, or any OpenAI-compatible endpoint), the core has zero third-party runtime dependencies, and it ships a one-command MCP server so a cloud "director" (e.g. Claude) can delegate generation to a local worker and get a confidence band back with every answer.
Why
The naive approach ("average the token probabilities, accept above 0.95") is a weak, uncalibrated signal for open-domain free text, and quietly wrong on most models. High-frequency function words and MoE routing pull the per-token mean logprob well below zero, so exp(mean_logprob) sits at 0.05–0.20 even for perfectly fluent output. A fixed probability threshold tuned on short structured outputs will reject good answers and accept bad ones once you point it at real prose.
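To make the compression concrete, here is a small standard-library sketch. The mean logprob values are illustrative placeholders chosen to span the range quoted above, not measurements from any model.

```python
import math

# Illustrative mean logprobs in nats (made-up values, not measurements).
mean_logprobs = [-0.5, -1.6, -2.3, -3.0]

for m in mean_logprobs:
    # exp() squeezes the whole range into a narrow strip near zero.
    print(f"mean_logprob={m:+.1f}  exp={math.exp(m):.3f}")

# The mean logprob that a fixed 0.95 probability cutoff corresponds to.
cutoff_nats = math.log(0.95)
print(f"0.95 cutoff = {cutoff_nats:.3f} nats")
```

A mean logprob of -1.6 maps to roughly 0.20 and -3.0 to roughly 0.05, while a 0.95 cutoff sits at about -0.051 nats. Fluent prose in that range would never clear such a cutoff.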
mcp-confidence takes four deliberate positions to make the signal usable:
Band in mean-logprob (nats) space, not probability. Thresholds live in log space where the score actually separates, instead of an exp()-compressed [0, 1] that bunches everything near zero.
A weakest-link term with a floor. The score mixes the mean with the worst per-token logprob (min_weight=0.3 by default), clamped to a floor (-10.0) so one ultra-rare token (a numeric ID, an unusual name) can't collapse the score of an otherwise solid answer.
Per-model calibration tooling. Bundled risk-coverage analysis turns a log of real traffic into the thresholds your model needs, instead of you guessing again.
A 3-band decision, not a number. The output is a routing band you can branch on — accept, verify, or ask a human — with a conservative default: when logprobs are missing, the band is MID, never blind auto-accept.
The defaults are provisional guesses. HIGH=-1.5 and LOW=-3.5 are starting points, not validated thresholds. Do not turn on auto-accept routing until you've recalibrated on your own model; see "The honest part".
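The scoring positions above can be sketched in a few lines. This is a guess at the shape of the computation, not the library's code. It makes three assumptions: the blend is a plain weighted sum, the floor clamps only the weakest-link term, and the band boundaries use `>=` and `<`. The names `sketch_score` and `sketch_band` are hypothetical.

```python
from statistics import fmean


def sketch_score(logprobs, min_weight=0.3, floor=-10.0):
    """Blend the mean logprob with the floored weakest token (assumed formula)."""
    mean_lp = fmean(logprobs)
    worst = max(min(logprobs), floor)  # clamp so one rare token cannot dominate
    return (1.0 - min_weight) * mean_lp + min_weight * worst


def sketch_band(score, high=-1.5, low=-3.5):
    """Map a score to a band; the boundary comparisons are an assumption."""
    if score >= high:
        return "high"
    if score < low:
        return "low"
    return "mid"


solid = [-0.2, -0.5, -0.1]
print(round(sketch_score(solid), 3), sketch_band(sketch_score(solid)))

# 100 ordinary tokens plus one ultra-rare token (e.g. a numeric ID).
long_answer = [-0.3] * 100 + [-25.0]
with_floor = sketch_score(long_answer)
without_floor = sketch_score(long_answer, floor=float("-inf"))
print(sketch_band(with_floor), sketch_band(without_floor))
```

Under these assumptions the first call reproduces the -0.337 score shown in the Quickstart. In the second case the floor keeps the long answer in the MID band, where the unfloored variant would drop it to LOW.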
Install
pip install mcp-confidence

The core (gate, config, calibration, CLI, adapters) is pure stdlib; nothing else gets pulled in. Install extras only for what you use:

pip install "mcp-confidence[openai]"   # the OpenAI SDK, for live calls
pip install "mcp-confidence[mcp]"      # the MCP server (mcp + openai)
pip install "mcp-confidence[dev]"      # pytest + ruff

With uv:

uv pip install mcp-confidence
uv pip install "mcp-confidence[mcp]"

You don't need any extra to score logprobs you already have, only to fetch them from a live model or to run the server.
Quickstart
a) From a raw logprob list
If you already have a list of per-token logprobs, you don't need any extras or a network call:
from mcp_confidence import Gate, ConfidenceBand
gate = Gate() # default thresholds
result = gate.from_logprobs([-0.2, -0.5, -0.1])
print(result.band) # ConfidenceBand.HIGH
print(result.score) # -0.337 (mean/min mix, in nats — not a probability)
print(result.mean_logprob) # -0.267
if result.band is ConfidenceBand.LOW:
    ...  # ask a human
elif result.band is ConfidenceBand.MID:
    ...  # verify with a stronger model
else:
    ...  # HIGH: accept

ConfidenceResult is a frozen dataclass with the fields score, band, mean_logprob, min_logprob, token_count, and logprobs_available.
b) From an OpenAI chat completion
Call the API with logprobs=True, then hand the whole response to from_openai. It accepts both an SDK object and the raw JSON dict, and never raises on a malformed response — missing logprobs degrade to band MID.
from openai import OpenAI
from mcp_confidence import Gate, ConfidenceBand
client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Capital of Australia?"}],
    logprobs=True,
    top_logprobs=5,
)
result = Gate().from_openai(resp)
# escalate_to_human, verify_with_stronger_model and accept are placeholders
# for your own handlers; they are not part of mcp-confidence.
if result.band is ConfidenceBand.LOW:
    escalate_to_human(resp)
elif result.band is ConfidenceBand.MID:
    verify_with_stronger_model(resp)
else:
    accept(resp)

c) Local model (vLLM / Ollama / llama.cpp / TGI)
Any OpenAI-compatible server works — it's the same from_openai path, you just swap base_url. This is the intended sweet spot: cheap local generation, with a confidence signal so you know when to escalate.
from openai import OpenAI
from mcp_confidence import Gate
# vLLM: http://localhost:8000/v1 Ollama: http://localhost:11434/v1
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="Qwen/Qwen3-30B-A3B",
    messages=[{"role": "user", "content": "Summarize this ticket."}],
    logprobs=True,
    top_logprobs=5,
)
result = Gate().from_openai(resp)
print(result.band.value, round(result.score, 3))

Tune the gate per model by passing a GateConfig to Gate(...), or load one from the environment:
from mcp_confidence import Gate, GateConfig
gate = Gate(GateConfig(high_threshold=-1.2, low_threshold=-3.0, min_weight=0.4))
# or, from MCP_CONFIDENCE_* env vars:
gate = Gate(GateConfig.from_env())

There's also a CLI for a quick sanity check, with no extras required:
$ mcp-confidence score --logprobs="-0.2,-0.5,-0.1"
band high
score -0.3370
mean_logprob -0.2667
min_logprob -0.5000
token_count 3

Add --config-from-env to apply MCP_CONFIDENCE_* thresholds instead of the defaults.
The honest part: calibrate before you trust it
Confidence thresholds are model-specific. A band that means "almost always right" on one model means "coin flip" on another. The shipped defaults are guesses; the only way to trust auto-accept routing is to calibrate on your own model's live output.
The workflow: log a JSONL row per generation carrying at least mean_logprob, score, band, and logprobs_available (an event field too, if you want to filter). When you can, add a human_label of "good" or "bad". Then point the calibrator at it:
mcp-confidence calibrate --audit confidence.jsonl --metric mean_logprob

That always prints two things. First, the distribution (where your traffic actually sits), plus the rate of calls where logprobs were unavailable, with a loud warning if it's over 20%, because the gate is blind on those:
worker_delegated events: 1240 | logprobs unavailable: 31 (3%)
-- mean_logprob distribution (n=1209) --
min -6.412 p10 -3.880 p25 -2.910
med -1.740 mean -1.902
p75 -0.980 p90 -0.520 max -0.090
band split (current thresholds): high=505 (42%) low=121 (10%) mid=583 (48%)

Second, when your rows carry good/bad labels, a risk-coverage recommendation. It sweeps every observed value as a candidate (high, low) pair and picks the cell that minimizes "bad answers auto-accepted + good answers needlessly escalated," subject to your coverage constraints:
-- risk-coverage (metric=mean_logprob) --
RECOMMENDED: high_threshold=-1.1 low_threshold=-3.4
auto-accept coverage 58% (bad slipped through: 4%)
human-ask coverage 11% (good needlessly asked: 6%)
Set MCP_CONFIDENCE_HIGH_THRESHOLD / MCP_CONFIDENCE_LOW_THRESHOLD to these, then enable routing.

Tune the constraints with --min-accept-coverage (default 0.70) and --max-ask-coverage (default 0.15). To calibrate on the combined score instead of mean_logprob, pass --metric score. Use --event-name '*' to accept any row regardless of its event field. Without labels the recommendation is skipped and you pick conservative bands by eye from the distribution.
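For intuition, the sweep described above can be sketched as a brute-force search. This is not the bundled calibrator. The function name `sketch_sweep`, the raw-count cost, the `>=` / `<` boundary comparisons, and the first-found tie rule are all assumptions, and the ten labeled rows are toy data.

```python
from itertools import product


def sketch_sweep(rows, min_accept_cov=0.70, max_ask_cov=0.15):
    """Brute-force (high, low) search over observed values (assumed rules)."""
    values = sorted({r["mean_logprob"] for r in rows})
    n = len(rows)
    best = None
    for low, high in product(values, values):
        if low > high:
            continue
        accepted = [r for r in rows if r["mean_logprob"] >= high]
        asked = [r for r in rows if r["mean_logprob"] < low]
        if len(accepted) / n < min_accept_cov or len(asked) / n > max_ask_cov:
            continue  # violates a coverage constraint
        bad_accepted = sum(r["human_label"] == "bad" for r in accepted)
        good_asked = sum(r["human_label"] == "good" for r in asked)
        cost = bad_accepted + good_asked
        if best is None or cost < best[0]:
            best = (cost, high, low)
    return best  # None when no pair satisfies the constraints


# Toy audit rows using the field names from the JSONL workflow.
toy = [(-0.3, "good"), (-0.5, "good"), (-0.7, "good"), (-0.9, "good"),
       (-1.1, "good"), (-1.4, "good"), (-1.8, "good"), (-2.2, "bad"),
       (-3.0, "good"), (-4.5, "bad")]
rows = [{"mean_logprob": v, "human_label": label} for v, label in toy]

print(sketch_sweep(rows))
```

A quadratic scan like this is fine for a few thousand rows. How the real tool breaks ties or weights the two error types is not stated here, so treat the selected pair as illustrative only.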
The same functions are importable and pure (no I/O) if you'd rather calibrate in a notebook:
from mcp_confidence.calibrate import load_events, distribution, risk_coverage
rows, total, unavailable = load_events("confidence.jsonl", event_name=None)
print(distribution(rows, "mean_logprob"))
print(risk_coverage(rows, "mean_logprob", min_accept_cov=0.70, max_ask_cov=0.15))

MCP server: the manager–worker pattern
A cloud director (Claude, say) is great at planning and owning the tool-call loop, but it has no logprobs of its own and is expensive for token-heavy grunt work. mcp-confidence ships an MCP server that lets the director delegate generation to a local worker over an OpenAI-compatible endpoint and get the text back with a confidence band attached — so it can decide to accept, verify, or ask a human.
The server runs the worker non-streaming with logprobs=True and top_logprobs=config.top_k, then computes the gate. Configure it entirely through the environment:
export MCP_CONFIDENCE_BASE_URL="http://localhost:8000/v1" # your vLLM/Ollama/TGI worker
export MCP_CONFIDENCE_API_KEY="not-needed"
export MCP_CONFIDENCE_MODEL="Qwen/Qwen3-30B-A3B"
# optional gate overrides (after calibration):
export MCP_CONFIDENCE_HIGH_THRESHOLD="-1.1"
export MCP_CONFIDENCE_LOW_THRESHOLD="-3.4"
mcp-confidence serve   # stdio transport

Register it with Claude Code / Claude Desktop:
{
  "mcpServers": {
    "confidence": {
      "command": "mcp-confidence",
      "args": ["serve"],
      "env": {
        "MCP_CONFIDENCE_BASE_URL": "http://localhost:8000/v1",
        "MCP_CONFIDENCE_API_KEY": "not-needed",
        "MCP_CONFIDENCE_MODEL": "Qwen/Qwen3-30B-A3B"
      }
    }
  }
}

The server exposes one tool:
generate_with_confidence(prompt: str, source: str = "") -> dict

Generate text with the local worker and return it with a confidence band.
The returned payload is JSON-serializable and decision-ready:
{
  "text": "...the worker's answer...",
  "band": "high",
  "score": -0.94,
  "mean_logprob": -0.71,
  "min_logprob": -1.48,
  "token_count": 128,
  "logprobs_available": true,
  "should_verify": false,
  "should_ask_human": false
}

should_verify is true when the band is MID or logprobs were unavailable; should_ask_human is true when the band is LOW. A worker that doesn't emit logprobs degrades gracefully to MID with should_verify=true, so the director is never told to blindly trust a blind signal.
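The flag rule stated above is small enough to write out. A minimal sketch, assuming bands are passed as lowercase strings as in the payload; the function name `sketch_flags` is hypothetical and not part of the package.

```python
def sketch_flags(band, logprobs_available):
    """Derive the two routing flags from a band, following the rule in the text."""
    return {
        "should_verify": band == "mid" or not logprobs_available,
        "should_ask_human": band == "low",
    }


print(sketch_flags("high", True))
print(sketch_flags("mid", False))  # a worker that returned no logprobs
print(sketch_flags("low", True))
```

At most one flag is set for any band, so a director can branch on them without re-deriving the thresholds.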
The gate logic for this lives in a pure, dependency-free helper, mcp_confidence.mcp_server.build_confidence_payload(text, openai_response, config), so you can unit-test the whole decision with a dict fixture and no mcp/openai installed. import mcp_confidence.mcp_server never requires the extras — they're imported lazily inside run(), only when you actually serve.
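A dict fixture for such a test only needs the standard chat-completion shape, where per-token values live under choices[0].logprobs.content. The sketch below shows a fixture and a stand-in extractor. It deliberately does not call build_confidence_payload, whose return value is not shown here, and the helper name `sketch_token_logprobs` is hypothetical.

```python
def sketch_token_logprobs(response):
    """Return per-token logprobs from a chat-completion dict, or None if absent."""
    try:
        content = response["choices"][0]["logprobs"]["content"]
        values = [float(tok["logprob"]) for tok in content]
    except (KeyError, IndexError, TypeError, ValueError):
        return None  # malformed or missing: treat as "logprobs unavailable"
    return values or None  # an empty list is also treated as unavailable


fixture = {
    "choices": [{
        "message": {"role": "assistant", "content": "Canberra"},
        "logprobs": {"content": [
            {"token": "Can", "logprob": -0.2},
            {"token": "berra", "logprob": -0.1},
        ]},
    }]
}

print(sketch_token_logprobs(fixture))
print(sketch_token_logprobs({"choices": [{"logprobs": None}]}))
```

Returning None rather than raising mirrors the degrade-to-MID behavior described above. The caller decides what a missing signal means.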
How it compares
The closest existing package is llm-confidence (VATBox), which is good at what it does: deriving a confidence score per key in an OpenAI structured JSON output. mcp-confidence is aimed at a different problem — routing free-text generation, with calibration and a human-in-the-loop decision.
| | mcp-confidence | llm-confidence |
| --- | --- | --- |
| Primary target | Free-text / open-domain generation | OpenAI structured JSON outputs |
| Output | 3-band routing decision (accept / verify / ask-a-human) | Per-key confidence scores |
| Scoring | Mean + weakest-link in logprob (nats) space, with a floor | Per-field probability aggregation |
| Calibration tooling | Yes: distribution + labeled risk-coverage sweep | Not included |
| Missing-logprobs handling | Conservative MID + should_verify | n/a |
| Local / open models | First-class (vLLM, Ollama, llama.cpp, TGI) | OpenAI-oriented |
| MCP server | Yes: manager–worker delegation | No |
| Runtime dependencies | Zero for the core | - |
Different tools for different jobs: reach for llm-confidence when you're scoring fields of a JSON extraction; reach for mcp-confidence when you're deciding whether to trust, double-check, or escalate a generated answer.
Roadmap
Clearly future, not yet implemented:
Supervisor / verify MCP tool: a semantic groundedness check for the MID band (does the answer actually follow from the source?), so "verify" becomes an automated second opinion rather than just a flag.
Self-consistency sampling: use the top_logprobs already captured per token (currently untouched) to estimate agreement across alternatives.
Verbalized confidence: blend the model's own stated confidence with the logprob signal.
Learned calibrator: fit a small model on labeled audit logs instead of a single threshold pair.
Contributing
Most contributions will be adapters — teaching the gate to read logprobs from one more inference stack — or calibration corrections. A wrong confidence signal is worse than none, so corrections (a band that's miscalibrated, an adapter that misreads a real response, a docstring that overstates what a logprob can tell you) are the most valuable PRs here. Keep the core stdlib-only and every claim backed by a test. See CONTRIBUTING.md.
uv venv && uv pip install -e ".[dev]"
ruff check . && ruff format --check .
pytest -q

License