Mythos Reasoning MCP
A deterministic verification gate for Codex / Claude Code / any MCP client — independent of the model under check. Mythos does not make the model smarter. It checks the model's output against evidence, contradictions, calibration, and provenance — because a model's self-assessment is exactly what should not be trusted blindly.
TL;DR
Before: model -> answer -> ship ❓ (self-graded, no independent check)
After: model -> answer -> Mythos gate -> ready | revise | gather_evidence | formalize | block

19 tools, single MCP server, zero LLM and zero network by default.
Tier 0 (default): fully local heuristics — claim verification, contradiction detection, calibration (Brier + ECE), semantic entropy, provenance binding.
Tier 1 (opt-in): optional mythos_judge for fresh-context semantic verdicts. Disabled unless you set an API key.
Measurable: npm run eval:gate evaluates a labeled dataset and writes precision / recall / F1 / claim-level recall to reports/gate_eval.*.
Quick start
git clone https://github.com/sorryorc/mythos-mcp.git
cd mythos-mcp
npm install
npm run build
npm test

Wire it into Codex (or any MCP client) by pointing at the built entry:
[mcp_servers.mythos_reasoning]
command = "node"
args = ["<absolute path to>/mythos-mcp/dist/index.js"]
cwd = "<absolute path to>/mythos-mcp"
enabled = true
startup_timeout_sec = 20
tool_timeout_sec = 120

When your client exposes tools named like mcp__mythos_reasoning__mythos_optimize, the MCP is connected.
What it actually does
Mythos Reasoning MCP turns the Prelude → recurrent depth → Coda idea into a practical workflow:
plan -> recurrent depth (to convergence) -> verify + calibrate -> detect contradictions -> critique + probe -> gate -> verified coda

It runs no LLM and makes no network calls on the default path. It exposes auditable summaries (confidence scores, evidence coverage, calibration, contradiction reports, risk flags, and token-budget estimates), never private chain-of-thought.
Positioning: "Mythos" here is a deterministic verification layer, not a specific model brand. Modern reasoning models already supply depth and self-validation natively; this layer does not replace that — it independently checks the model's output, because a model's own (potentially grader-aware) self-assessment is exactly what should not be trusted blindly.
What is new in 0.6
0.6 builds on the v0.5 refocus with measurable gate evaluation, deterministic provenance binding, and claim routing. The public tool surface is now 19 tools: the quality pipeline, deeper verification helpers, provenance/route checks, the optional judge, and progress authenticity audit.
Refocused tool surface: removed context tiering, tool orchestration, memory, scaffolding, agent coordination, protocol batching, security lattice, and observability tuning tools.
Verification-first shape: retained budget, plan/pass/run, verify/critique/finalize, gate/optimize, compare, formalize, consistency, assumptions, probe, verify-chain, judge, provenance, route, and audit-progress.
Gate effectiveness is measurable: npm run eval:gate evaluates datasets/labeled.sample.json and writes precision/recall/false-block/F1 plus claim-level recall to reports/gate_eval.*.
Provenance and routing: mythos_check_provenance flags factual claims without verifiable pointers; mythos_route decides which claims stay on Tier 0, need the Tier 1 judge, or need more evidence.
Docs and tests match the runtime: unit tests no longer cover deleted runtime primitives, and the compliance audit scans the remaining 3 source files that contain public tool descriptions or judge prompts.
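The gate metrics named above follow the usual confusion-matrix definitions. The following is a minimal sketch, not the project's evaluation script: the `LabeledExample` shape, its field names, and the choice of "the gate blocked" as the positive class are assumptions made for illustration and do not reflect the schema of datasets/labeled.sample.json.

```typescript
// Hypothetical labeled example; NOT the schema of datasets/labeled.sample.json.
type LabeledExample = { shouldBlock: boolean; gateBlocked: boolean };

type GateMetrics = {
  precision: number;
  recall: number;
  f1: number;
  falseBlockRate: number;
};

function evaluateGate(examples: LabeledExample[]): GateMetrics {
  let tp = 0;
  let fp = 0;
  let fn = 0;
  let tn = 0;
  for (const ex of examples) {
    if (ex.gateBlocked && ex.shouldBlock) tp++;
    else if (ex.gateBlocked && !ex.shouldBlock) fp++; // a false block
    else if (!ex.gateBlocked && ex.shouldBlock) fn++; // a missed problem
    else tn++;
  }
  // Avoid NaN when a denominator is zero.
  const safeDiv = (a: number, b: number): number => (b === 0 ? 0 : a / b);
  const precision = safeDiv(tp, tp + fp);
  const recall = safeDiv(tp, tp + fn);
  const f1 = safeDiv(2 * precision * recall, precision + recall);
  const falseBlockRate = safeDiv(fp, fp + tn);
  return { precision, recall, f1, falseBlockRate };
}

const gateMetrics = evaluateGate([
  { shouldBlock: true, gateBlocked: true },
  { shouldBlock: true, gateBlocked: false },
  { shouldBlock: false, gateBlocked: true },
  { shouldBlock: false, gateBlocked: false },
]);
console.log(gateMetrics);
```

Tracking the false-block rate next to recall keeps both failure modes visible: a gate that blocks everything has perfect recall but is useless in practice.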
What was new in 0.4
0.4 retargets the layer for Mythos 5 / Fable 5-class models, whose native effort dial supplies the reasoning depth this MCP used to simulate. Value moves to what the model's own self-assessment cannot be trusted to do — independent verification, calibration, and gating.
Budget → effort mapping (mythos_budget): each active mode maps onto the host model's test-time-compute effort dial (lite → low, standard → high, deep → xhigh, max → max; off skips the host-model call), with an effort_rationale. Depth comes from the model's effort; the MCP governs when that spend is worth it. The recurrent-pass loop is kept only as a budget/convergence governor (stalled ⇒ gather evidence, don't spend more effort).
Multi-agent merge gate (mythos_gate merge mode): pass candidates (≥2 sub-agent outputs) and the gate runs cross-agent consistency + contradiction detection and gates the merge (merge_ready / reconcile / block_merge). This is the deterministic check where async sub-agents (e.g. Git-shared worktrees) converge. A high-severity cross-agent contradiction forces block_merge.
Response provenance awareness: mythos_run, mythos_gate, and mythos_optimize accept response_source (fable5 | opus_fallback | unknown) plus stop_reason. Opus fallback paths are reported in provenance; high-risk gates tighten the quality bar and require fallback-aware release notes.
Reasoning-extraction compliance audit: npm test runs dist/tests/compliance_audit.js, which scans tool descriptions and prompt templates for unsafe requests to reveal, repeat, transcribe, or extract private reasoning.
Tier 1 judge tool: mythos_judge exposes the JudgeClient adapter with disabled, mock, and Anthropic HTTP implementations. It supports multi-sample self-checks, semantic-entropy stability, optional cross-model second pass, and per-session cache keys. Without ANTHROPIC_API_KEY, it returns a disabled result and Tier 0 remains the default zero-network path.
Progress authenticity audit: mythos_audit_progress checks a model's progress report against real tool-result logs and flags unsupported or unverifiable progress claims.
What is new in 0.3
0.3 mines established techniques from the literature and folds them into the deterministic, no-LLM core:
Chain-of-Verification (mythos_verify_chain): generate independent verification questions per claim / risk-tag / specific, to answer before finalizing (Dhuliawala et al., 2023).
Semantic entropy in mythos_consistency: cluster candidate answers by meaning and report normalized entropy + cluster stats, a stronger confabulation signal than lexical overlap (Farquhar et al., Nature 2024).
Atomic-fact decomposition (FActScore, Min et al., 2023): split sentences into atomic claims so one false sub-clause is not hidden inside a true sentence; surfaced as factual_precision.
Proper calibration: the calibration report now adds a Brier score and Expected Calibration Error (ECE) over claims.
Configurable adaptive exit (Huginn, Geiping et al., 2025): mythos_run accepts exit_criterion (latent_diff default, or ema) and exit_threshold, mapping their latent/KL exit idea onto the belief iteration.
Front-loaded verification: mythos_optimize attaches the CoVe questions (verification_questions) and folds the top ones into codex_instruction, so the guardrail shapes the answer before it is written, not only after.
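To make the normalized-entropy signal concrete, here is a small sketch that assumes the meaning-based clustering has already happened and each candidate answer carries a cluster label. The function name and the choice to normalize by the log of the candidate count are illustrative assumptions; the tool's actual clustering and normalization may differ.

```typescript
// Assumes candidates were already clustered by meaning; each entry is a cluster label.
// Returns 0 when every candidate lands in one cluster and 1 when each candidate
// forms its own cluster (normalizing by log(candidate count) is an assumption).
function normalizedEntropy(clusterLabels: string[]): number {
  const n = clusterLabels.length;
  if (n <= 1) return 0;

  const counts: Record<string, number> = {};
  for (const label of clusterLabels) {
    counts[label] = (counts[label] ?? 0) + 1;
  }

  let entropy = 0;
  for (const label of Object.keys(counts)) {
    const p = counts[label] / n;
    entropy -= p * Math.log(p);
  }
  return entropy / Math.log(n);
}

const agree = normalizedEntropy(["a", "a", "a", "a"]);
const split = normalizedEntropy(["a", "a", "b", "b"]);
const scattered = normalizedEntropy(["a", "b", "c", "d"]);
console.log({ agree, split, scattered });
```

A high value means the candidates disagree in meaning rather than merely in wording, which is the case the entropy signal is meant to surface.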
What is new in 0.2
Version 0.1 had a placeholder "recurrent" loop that just added a fixed pass * 0.04 to a confidence number every iteration — monotonic by construction and carrying no information. 0.2 replaces the core and adds three analysis tools:
True recurrent depth. The reasoning state relaxes toward a fixed point: belief_t = belief_{t-1} + lr * (target(state) - belief_{t-1}), where target is derived from evidence strength, open unknowns, and detected contradictions. Iteration halts on convergence, oscillation, or stall.
Adaptive compute. Harder inputs (weak evidence, more unknowns/contradictions) get a lower learning rate and a larger pass budget, so the engine spends more passes where it matters and fewer on easy inputs.
Contradiction detection. Negation, modal (always vs sometimes), numeric/version, and claim-vs-evidence mismatches.
Calibration. Each claim's stated confidence is compared against its evidence-derived confidence; systematic overconfidence is penalized.
New tools: mythos_consistency (self-consistency / ensemble), mythos_assumptions (assumption mining), mythos_probe (edge-case / counterfactual probing).
Codex Status
This project is installed at:
<your local path>/mythos-mcp

Codex config:
[mcp_servers.mythos_reasoning]
command = "node"
args = ["<your local path>/mythos-mcp/dist/index.js"]
cwd = "<your local path>/mythos-mcp"
enabled = true
startup_timeout_sec = 20
tool_timeout_sec = 120

When Codex exposes tools named like mcp__mythos_reasoning__mythos_optimize, the MCP is connected.
Tools
The v0.6 public surface is 19 tools.
Quality pipeline:
mythos_budget: estimate overhead, recommend off / lite / standard / deep / max, and map active modes to the host model effort dial (low / medium / high / xhigh / max).
mythos_plan: classify task mode, risk level, risk factors, and evidence needs.
mythos_pass: run one recurrent relaxation step (carries state across turns).
mythos_verify: score claims by evidence support; report per-claim status, risk tags, and calibration.
mythos_critique: find unsupported claims, contradictions, overconfidence, miscalibration, missing verification, and edge-case gaps.
mythos_finalize: produce a concise answer with evidence summary, contradictions, and uncertainty notes.
mythos_run: run the full workflow (plan -> depth -> verify -> contradictions -> critique -> finalize -> score).
mythos_gate: decide ready / revise / gather_evidence / formalize / block. Merge mode (candidates >= 2): gate a multi-agent merge with merge_ready / reconcile / block_merge.
mythos_formalize: turn an unresolved problem into variables, constraints, an objective, and next actions ranked by expected value.
mythos_optimize: single-entry orchestration for Codex (budget, gate, formalize, assumptions, open edge cases, evidence requests, output contract).
mythos_compare: compare baseline output against a Mythos-checked answer.
Analysis tools (deeper thinking):
mythos_consistency: self-consistency / ensemble over multiple candidate answers: agreement score, consensus claims, and cross-candidate contradictions.
mythos_assumptions: mine the hidden assumptions a draft depends on and rank them by risk_if_false / cost.
mythos_probe: generate the edge cases / counterfactuals an answer must survive, flagging which are already addressed.
mythos_verify_chain: Chain-of-Verification: independent verification questions for a draft's claims, prioritized by weakness and risk.
mythos_judge: optional Tier 1 fresh-context semantic judge for claim/evidence entailment (supported | contradicted | unsupported). Defaults to disabled without an API key; use provider: "mock" for local deterministic testing. Supports sample_count, entropy_threshold, cross_model, secondary_model, and use_cache.
mythos_check_provenance: deterministically checks whether claims are bound to verifiable pointers (file_line, test_id, url, or quote) and counts unbound specific claims.
mythos_route: routes claims to tier0_only, tier1_judge, or gather_evidence and estimates judge cost before spending it.
mythos_audit_progress: compare progress statements to real tool results and return grounded / false / unverifiable labels plus floating claims.
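As an illustration of what a deterministic provenance check can look like, the sketch below counts claims that carry one of the four pointer kinds and claims that do not. The `ProvenanceClaim` shape, the report fields, and the digit-based notion of a "specific" claim are all assumptions for this example, not the tool's real schema or heuristic.

```typescript
// Pointer kinds named in the mythos_check_provenance description.
type PointerKind = "file_line" | "test_id" | "url" | "quote";

// Hypothetical claim shape; the real input schema may differ.
interface ProvenanceClaim {
  text: string;
  pointer?: { kind: PointerKind; value: string };
}

interface ProvenanceReport {
  bound: number;
  unbound: number;
  unboundSpecific: number;
}

// Crude stand-in for "specific": the claim mentions a number or version.
function looksSpecific(text: string): boolean {
  return /\d/.test(text);
}

function checkProvenance(claims: ProvenanceClaim[]): ProvenanceReport {
  const report: ProvenanceReport = { bound: 0, unbound: 0, unboundSpecific: 0 };
  for (const claim of claims) {
    const hasPointer =
      claim.pointer !== undefined && claim.pointer.value.trim() !== "";
    if (hasPointer) {
      report.bound++;
    } else {
      report.unbound++;
      if (looksSpecific(claim.text)) report.unboundSpecific++;
    }
  }
  return report;
}

const provenanceReport = checkProvenance([
  { text: "The unit tests pass.", pointer: { kind: "test_id", value: "npm test" } },
  { text: "Latency dropped by 40%." },
  { text: "The design is simpler." },
]);
console.log(provenanceReport);
```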
Resources: mythos://protocol (the workflow) and mythos://depth (the recurrent-depth math).
Recurrent depth and adaptive compute
belief_t = belief_{t-1} + lr * (target(state) - belief_{t-1})
target(state) = base(evidence_strength) - unknown_penalty - contradiction_load
lr = 0.85 - 0.45 * difficulty # harder inputs creep, using more passes

Halting (halted_reason):
converged: |delta| < epsilon with belief high enough.
stalled: belief stopped moving but unknowns/contradictions remain (needs new evidence, not more passes).
oscillation: the target keeps flipping across passes (conflicting evidence).
max_passes: budget exhausted; recommended_additional_passes estimates what remains.
The depth result is auditable: each pass reports belief, target, delta, contradiction count, and open checks.
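The update rule and halting conditions above can be sketched in a few lines. This is an illustration under simplifying assumptions, not the engine's code: the target is held constant (so oscillation cannot occur and is omitted), and the starting belief of 0.5, `epsilon`, and `readyThreshold` are made-up values.

```typescript
type HaltReason = "converged" | "stalled" | "max_passes";

interface DepthResult {
  belief: number;
  passes: number;
  haltedReason: HaltReason;
}

// epsilon, readyThreshold, the 0.5 start, and the constant target are assumptions.
function relax(
  target: number,
  difficulty: number, // 0 (easy) .. 1 (hard)
  maxPasses: number,
  epsilon = 0.01,
  readyThreshold = 0.7
): DepthResult {
  const lr = 0.85 - 0.45 * difficulty; // learning rate from the formula above
  let belief = 0.5;
  for (let pass = 1; pass <= maxPasses; pass++) {
    const delta = lr * (target - belief);
    belief += delta;
    if (Math.abs(delta) < epsilon) {
      // Belief stopped moving: converged only if it settled high enough.
      const haltedReason: HaltReason =
        belief >= readyThreshold ? "converged" : "stalled";
      return { belief, passes: pass, haltedReason };
    }
  }
  return { belief, passes: maxPasses, haltedReason: "max_passes" };
}

const easy = relax(0.9, 0, 10); // strong evidence, easy input
const hard = relax(0.9, 1, 10); // same target, hard input
const weak = relax(0.4, 0, 10); // weak evidence: settles at a low belief
console.log(easy, hard, weak);
```

With these assumed constants the easy input halts after 3 passes and the hard one after 7, which is the adaptive-compute behavior described above: a lower learning rate spends more passes on harder inputs.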
Verification, contradictions, and calibration
mythos_verify assigns each claim an evidence-derived confidence from a single documented weight table (test_result > file_line > docs > web > user_input > inference > unknown), adjusted for a cited source, absolute language, and bare specifics. Status bands:
supported: strong, sourced, typed evidence.
needs_more_evidence: plausible but not proven (e.g. a cited inference).
unsupported: no usable evidence.
Calibration compares stated vs evidence-derived confidence and reports overconfident_claims and a calibration_score. Note that negated absolutes (cannot guarantee, not always) are treated as hedging, not overconfidence.
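For readers unfamiliar with the two calibration metrics reported over claims, the sketch below computes a Brier score and an Expected Calibration Error from stated confidences. Pairing each claim with a boolean `correct` outcome, and the 10 equal-width bins, are assumptions for illustration; how the server derives outcomes from evidence is not shown here.

```typescript
// Hypothetical input: stated confidence in [0, 1] plus whether the claim held up.
type ScoredClaim = { confidence: number; correct: boolean };

// Brier score: mean squared gap between confidence and the 0/1 outcome (lower is better).
function brierScore(claims: ScoredClaim[]): number {
  if (claims.length === 0) return 0;
  let sum = 0;
  for (const c of claims) {
    const diff = c.confidence - (c.correct ? 1 : 0);
    sum += diff * diff;
  }
  return sum / claims.length;
}

// ECE: size-weighted average of |accuracy - mean confidence| over confidence bins.
function expectedCalibrationError(claims: ScoredClaim[], bins = 10): number {
  if (claims.length === 0) return 0;
  const count: number[] = [];
  const confSum: number[] = [];
  const hitSum: number[] = [];
  for (let b = 0; b < bins; b++) {
    count.push(0);
    confSum.push(0);
    hitSum.push(0);
  }
  for (const c of claims) {
    const raw = Math.floor(c.confidence * bins);
    const b = Math.max(0, Math.min(bins - 1, raw)); // clamp so 1.0 falls in the last bin
    count[b] += 1;
    confSum[b] += c.confidence;
    hitSum[b] += c.correct ? 1 : 0;
  }
  let ece = 0;
  for (let b = 0; b < bins; b++) {
    if (count[b] === 0) continue;
    const gap = Math.abs(hitSum[b] / count[b] - confSum[b] / count[b]);
    ece += (count[b] / claims.length) * gap;
  }
  return ece;
}

const scored: ScoredClaim[] = [
  { confidence: 0.9, correct: true },
  { confidence: 0.9, correct: false },
  { confidence: 0.1, correct: false },
  { confidence: 0.1, correct: false },
];
const brier = brierScore(scored);
const ece = expectedCalibrationError(scored);
console.log({ brier, ece });
```

In this sample the 0.9-confidence claims are right only half the time, so most of the ECE comes from that overconfident bin.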
Budget Modes
| Mode | Model effort | Typical use | Passes | Estimated overhead |
| off | skip call | Casual chat, creative drafting, low-risk preference questions | 0 | 0 |
| lite | low | Short factual answers, quick final check | 1 | about 180 tokens + 15% of checked text |
| standard | high | Normal coding, debugging, current information, design tradeoffs | 2 | about 420 tokens + 35% of checked text |
| deep | xhigh | Security, legal, finance, auth, destructive actions, costly decisions | 4 | about 900 tokens + 80% of checked text |
| max | max | Highest-capability sensitive tasks where the extra cost/latency is justified | 5 | about 1300 tokens + 110% of checked text |
For best value, start with mythos_budget. Use mythos_run / mythos_optimize when the recommendation is standard or deep, or when the answer contains important factual claims.
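The overhead column in the budget table is a base cost plus a fraction of the checked text, which makes a rough estimate easy to compute. The sketch below encodes the table's figures; the names `BUDGET_TABLE` and `estimateOverhead` are hypothetical, and the real mythos_budget output may be shaped differently.

```typescript
type BudgetMode = "off" | "lite" | "standard" | "deep" | "max";

interface BudgetRow {
  effort: string; // host-model effort dial setting
  passes: number;
  baseTokens: number;
  textFraction: number; // share of the checked text added as overhead
}

// Figures copied from the budget table; the structure itself is illustrative.
const BUDGET_TABLE: Record<BudgetMode, BudgetRow> = {
  off: { effort: "skip call", passes: 0, baseTokens: 0, textFraction: 0 },
  lite: { effort: "low", passes: 1, baseTokens: 180, textFraction: 0.15 },
  standard: { effort: "high", passes: 2, baseTokens: 420, textFraction: 0.35 },
  deep: { effort: "xhigh", passes: 4, baseTokens: 900, textFraction: 0.8 },
  max: { effort: "max", passes: 5, baseTokens: 1300, textFraction: 1.1 },
};

// Approximate token overhead for checking a draft of the given size.
function estimateOverhead(mode: BudgetMode, checkedTextTokens: number): number {
  const row = BUDGET_TABLE[mode];
  return Math.round(row.baseTokens + row.textFraction * checkedTextTokens);
}

const modes: BudgetMode[] = ["off", "lite", "standard", "deep", "max"];
for (const mode of modes) {
  console.log(mode, BUDGET_TABLE[mode].effort, estimateOverhead(mode, 1000));
}
```

For a 1000-token draft this gives roughly 330 tokens of overhead in lite mode and 2400 in max mode, which is why compact drafts matter most at the higher tiers.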
mythos_run, mythos_gate, and mythos_optimize can also receive:
{
"response_source": "fable5 | opus_fallback | unknown",
"stop_reason": "optional provider stop/refusal reason"
}

Use opus_fallback when the host fell back from the requested Fable 5 / Mythos 5-class path. The gate does not blindly reject fallback output, but it reports a downgraded provenance posture and is stricter on high-risk tasks.
Optional Tier 1 Judge
The judge layer is exposed through mythos_judge and routed by mythos_route. The default mythos_gate path remains deterministic and zero-network; mythos_optimize reports which claims justify Tier 1 and the estimated spend.
Environment contract:
ANTHROPIC_API_KEY enables the Anthropic judge client
MYTHOS_JUDGE_PROVIDER=mock uses the local deterministic mock client
MYTHOS_JUDGE_PROVIDER=disabled forces Tier 1 off
MYTHOS_JUDGE_MODEL defaults to claude-fable-5
MYTHOS_JUDGE_EFFORT low | medium | high | xhigh | max, defaults to high
MYTHOS_JUDGE_MAX_OUTPUT_TOKENS defaults to 1200
MYTHOS_JUDGE_TIMEOUT_MS defaults to 30000
MYTHOS_JUDGE_ENDPOINT defaults to https://api.anthropic.com/v1/messages

The Anthropic request uses the Messages API with model, max_tokens, messages, and output_config.effort; it sends the required anthropic-version: 2023-06-01 header. The judge prompt asks only for structured JSON verdicts and explicitly avoids private reasoning.
Example local call:
{
"task": "Judge whether claims are supported.",
"provider": "mock",
"sample_count": 2,
"use_cache": true,
"claims": [
{
"claim": "The unit tests pass.",
"evidence_type": "test_result",
"source": "npm test"
}
]
}

For progress reports, pass either statements or progress_report with compact tool_results:
{
"statements": ["I ran npm test.", "I deployed to production."],
"tool_results": [
{ "id": "t1", "tool": "shell", "output": "npm test passed all checks" }
]
}

Recommended Codex workflow
1. Use normal local tools first: read files, run tests, inspect docs.
2. Call mythos_optimize with task + draft + compact claims/evidence.
3. ready -> answer under output_contract.
4. gather_evidence -> collect the requested evidence first.
5. formalize -> follow the ranked next step.
6. revise -> rewrite under the output rules.
7. block -> do not answer; resolve the contradiction first.

Use the lower-level tools (mythos_verify, mythos_critique, mythos_consistency, mythos_assumptions, mythos_probe, mythos_pass) when debugging the pipeline itself or when you want one specific check.
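A client that automates the recommended workflow mostly needs a mapping from the gate decision to its next step. The sketch below shows one way to encode steps 3 to 7; the function names are hypothetical and the action strings simply restate the workflow, so treat it as a reading aid, not as part of the server.

```typescript
type GateDecision = "ready" | "revise" | "gather_evidence" | "formalize" | "block";

// One entry per decision, mirroring steps 3-7 of the workflow.
const NEXT_ACTION: Record<GateDecision, string> = {
  ready: "answer under output_contract",
  gather_evidence: "collect the requested evidence first",
  formalize: "follow the ranked next step",
  revise: "rewrite under the output rules",
  block: "do not answer; resolve the contradiction first",
};

function nextAction(decision: GateDecision): string {
  return NEXT_ACTION[decision];
}

// Only "ready" allows the draft to go out unchanged.
function mayShipDraft(decision: GateDecision): boolean {
  return decision === "ready";
}

const decisions: GateDecision[] = ["ready", "revise", "gather_evidence", "formalize", "block"];
for (const decision of decisions) {
  console.log(`${decision} -> ${nextAction(decision)} (ship: ${mayShipDraft(decision)})`);
}
```

Using a `Record` keyed by the decision union means the compiler flags any decision left without an action.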
Maximum benefit with controlled cost
off (skip): casual/creative/no factual stakes
lite: short answers with 1-3 claims
standard: coding, debugging, docs, current info, architecture decisions
deep: security, credentials, payments, destructive operations, legal/medical/financial risk
max: highest-capability sensitive tasks after the deterministic gate justifies the spend

Send compact drafts, claims, and evidence summaries, not entire files, logs, or transcripts. The MCP cannot retrieve sources or run tests; it makes evidence gaps, contradictions, and overconfidence visible before the final answer.
Replay evaluation
Record real Codex tasks after an answer attempt:
npm run replay:record -- replay/sample_record.json replay/replay.jsonl
npm run replay:analyze -- replay/replay.jsonl
npm run replay:report -- replay/replay.jsonl reports/replay.md
npm run replay:reset -- replay/replay.jsonl

Records include task, draft, claims, evidence, outcome, and notes. Use outcomes consistently: accepted | reworked | failed | blocked | unknown. The strongest tuning signal is ready decisions that later become failed or reworked: that means the gate is too permissive for that task class. (reset truncates the log even on filesystems that block file deletion.)
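The "ready but later failed or reworked" signal can be computed directly from a JSONL log. The sketch below works on an in-memory string to stay self-contained; the `decision` field on each record is an assumption (the record fields listed above do not name it), and `permissiveRate` is a hypothetical helper, not one of the replay scripts.

```typescript
type Outcome = "accepted" | "reworked" | "failed" | "blocked" | "unknown";

// `decision` is a hypothetical field holding the gate decision at answer time.
interface ReplayRecord {
  task: string;
  outcome: Outcome;
  decision?: string;
}

// Share of "ready" decisions that later turned out failed or reworked.
function permissiveRate(jsonl: string): number {
  const records = jsonl
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as ReplayRecord);
  const ready = records.filter((r) => r.decision === "ready");
  if (ready.length === 0) return 0;
  const bad = ready.filter(
    (r) => r.outcome === "failed" || r.outcome === "reworked"
  );
  return bad.length / ready.length;
}

const sampleLog = [
  JSON.stringify({ task: "fix login bug", decision: "ready", outcome: "accepted" }),
  JSON.stringify({ task: "bump dependency", decision: "ready", outcome: "failed" }),
  JSON.stringify({ task: "update docs", decision: "ready", outcome: "reworked" }),
  JSON.stringify({ task: "rotate keys", decision: "revise", outcome: "failed" }),
].join("\n");

const rate = permissiveRate(sampleLog);
console.log(rate);
```

Grouping this rate by task class would show where the gate thresholds need tightening.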
References
The deterministic heuristics are inspired by — not implementations of — these works:
Geiping et al., "Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach" (arXiv:2502.05171, 2025).
Dhuliawala et al., "Chain-of-Verification Reduces Hallucination in Large Language Models" (arXiv:2309.11495, 2023).
Farquhar, Kossen, Kuhn & Gal, "Detecting hallucinations in large language models using semantic entropy", Nature 630 (2024).
Min et al., "FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation" (EMNLP 2023).
Wang et al., "Self-Consistency Improves Chain-of-Thought Reasoning in Language Models" (arXiv:2203.11171, 2022).
Test
npm test

npm test builds, runs the MCP smoke test, the v0.2 unit suite (recurrent convergence, contradiction detection, calibration, gate decisions, and the new tools), the comparison and dataset evaluations, and a replay round-trip. Reports are written to:
reports/comparison.md
reports/real_tasks.sample.eval.md
reports/gate_eval.md
reports/test_replay.md