Mythos Reasoning MCP

A deterministic verification gate for Codex / Claude Code / any MCP client — independent of the model under check. Mythos does not make the model smarter. It checks the model's output against evidence, contradictions, calibration, and provenance — because a model's self-assessment is exactly what should not be trusted blindly.


TL;DR

Before:  model -> answer -> ship  ❓  (self-graded, no independent check)
After:   model -> answer -> Mythos gate -> ready | revise | gather_evidence | formalize | block
  • 19 tools, single MCP server, zero LLM and zero network by default.

  • Tier 0 (default): fully local heuristics — claim verification, contradiction detection, calibration (Brier + ECE), semantic entropy, provenance binding.

  • Tier 1 (opt-in): optional mythos_judge for fresh-context semantic verdicts. Disabled unless you set an API key.

  • Measurable: npm run eval:gate evaluates a labeled dataset and writes precision / recall / F1 / claim-level recall to reports/gate_eval.*.


Quick start

git clone https://github.com/sorryorc/mythos-mcp.git
cd mythos-mcp
npm install
npm run build
npm test

Wire it into Codex (or any MCP client) by pointing at the built entry:

[mcp_servers.mythos_reasoning]
command = "node"
args = ["<absolute path to>/mythos-mcp/dist/index.js"]
cwd    = "<absolute path to>/mythos-mcp"
enabled = true
startup_timeout_sec = 20
tool_timeout_sec    = 120

When your client exposes tools named like mcp__mythos_reasoning__mythos_optimize, the MCP is connected.


What it actually does

Mythos Reasoning MCP turns the Prelude → recurrent depth → Coda idea into a practical workflow:

plan -> recurrent depth (to convergence) -> verify + calibrate -> detect contradictions -> critique + probe -> gate -> verified coda

It runs no LLM and makes no network calls on the default path. It exposes auditable summaries — confidence scores, evidence coverage, calibration, contradiction reports, risk flags, and token-budget estimates — never private chain-of-thought.

Positioning: "Mythos" here is a deterministic verification layer, not a specific model brand. Modern reasoning models already supply depth and self-validation natively; this layer does not replace that — it independently checks the model's output, because a model's own (potentially grader-aware) self-assessment is exactly what should not be trusted blindly.

What is new in 0.6

0.6 builds on the v0.5 refocus with measurable gate evaluation, deterministic provenance binding, and claim routing. The public tool surface is now 19 tools: the quality pipeline, deeper verification helpers, provenance/route checks, the optional judge, and progress authenticity audit.

  • Refocused tool surface: removed context tiering, tool orchestration, memory, scaffolding, agent coordination, protocol batching, security lattice, and observability tuning tools.

  • Verification-first shape: retained budget, plan/pass/run, verify/critique/finalize, gate/optimize, compare, formalize, consistency, assumptions, probe, verify-chain, judge, provenance, route, and audit-progress.

  • Gate effectiveness is measurable: npm run eval:gate evaluates datasets/labeled.sample.json and writes precision/recall/false-block/F1 plus claim-level recall to reports/gate_eval.*.

  • Provenance and routing: mythos_check_provenance flags factual claims without verifiable pointers; mythos_route decides which claims stay on Tier 0, need Tier 1 judge, or need more evidence.

  • Docs and tests match the runtime: unit tests no longer cover deleted runtime primitives, and compliance audit scans the remaining 3 source files that contain public tool descriptions or judge prompts.
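
The gate metrics named in this list can be computed from labeled examples roughly as follows. This is a minimal sketch, not the code behind npm run eval:gate: the (flagged, truly_bad) pair layout, the function name, and the definition of false-block rate as the share of good answers the gate flagged are all assumptions, since the layout of datasets/labeled.sample.json is not shown here.

```python
def gate_metrics(examples):
    """Score gate decisions against labels.

    examples: list of (flagged, truly_bad) boolean pairs. Both the pair layout
    and the false-block definition are assumptions made for this sketch.
    """
    tp = sum(1 for flagged, bad in examples if flagged and bad)
    fp = sum(1 for flagged, bad in examples if flagged and not bad)
    fn = sum(1 for flagged, bad in examples if not flagged and bad)
    tn = sum(1 for flagged, bad in examples if not flagged and not bad)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    false_block = fp / (fp + tn) if fp + tn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "false_block": false_block}


sample = [(True, True), (True, False), (False, True), (False, False), (True, True)]
metrics = gate_metrics(sample)
print(metrics)
```

A high false-block rate with good recall suggests the gate is too strict for that dataset; the reverse suggests it is too permissive.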

What was new in 0.4

0.4 retargets the layer for Mythos 5 / Fable 5-class models, whose native effort dial supplies the reasoning depth this MCP used to simulate. Value moves to what the model's own self-assessment cannot be trusted to do — independent verification, calibration, and gating.

  • Budget → effort mapping (mythos_budget): each active mode maps onto the host model's test-time-compute effort dial — lite → low, standard → high, deep → xhigh, max → max; off skips the host-model call — with an effort_rationale. Depth comes from the model's effort; the MCP governs when that spend is worth it. The recurrent-pass loop is kept only as a budget/convergence governor (stalled ⇒ gather evidence, don't spend more effort).

  • Multi-agent merge gate (mythos_gate merge mode): pass candidates (≥2 sub-agent outputs) and the gate runs cross-agent consistency + contradiction detection and gates the merge (merge_ready / reconcile / block_merge) — the deterministic check where async sub-agents (e.g. Git-shared worktrees) converge. A high-severity cross-agent contradiction forces block_merge.

  • Response provenance awareness: mythos_run, mythos_gate, and mythos_optimize accept response_source (fable5 | opus_fallback | unknown) plus stop_reason. Opus fallback paths are reported in provenance; high-risk gates tighten the quality bar and require fallback-aware release notes.

  • Reasoning-extraction compliance audit: npm test runs dist/tests/compliance_audit.js, which scans tool descriptions and prompt templates for unsafe requests to reveal, repeat, transcribe, or extract private reasoning.

  • Tier 1 judge tool: mythos_judge exposes the JudgeClient adapter with disabled, mock, and Anthropic HTTP implementations. It supports multi-sample self-checks, semantic-entropy stability, optional cross-model second pass, and per-session cache keys. Without ANTHROPIC_API_KEY, it returns a disabled result and Tier 0 remains the default zero-network path.

  • Progress authenticity audit: mythos_audit_progress checks a model's progress report against real tool-result logs and flags unsupported or unverifiable progress claims.
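
The merge-gate outcome described above can be pictured as a small decision function. Only the rule that a high-severity cross-agent contradiction forces block_merge comes from this README; the severity field name and the rule that any remaining contradiction yields reconcile are assumptions for illustration.

```python
def merge_decision(contradictions):
    """Pick a merge outcome from cross-agent contradiction reports.

    contradictions: list of dicts with a hypothetical "severity" key
    ("low", "medium", or "high").
    """
    if any(c["severity"] == "high" for c in contradictions):
        return "block_merge"  # stated in the README
    if contradictions:
        return "reconcile"  # assumption: lower-severity conflicts need reconciling
    return "merge_ready"  # assumption: no conflicts means the merge may proceed


print(merge_decision([{"severity": "low"}, {"severity": "high"}]))
```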

What was new in 0.3

0.3 mines established techniques from the literature and folds them into the deterministic, no-LLM core:

  • Chain-of-Verification (mythos_verify_chain): generate independent verification questions for each claim, risk tag, and specific, to be answered before finalizing (Dhuliawala et al., 2023).

  • Semantic entropy in mythos_consistency: cluster candidate answers by meaning and report normalized entropy + cluster stats — a stronger confabulation signal than lexical overlap (Farquhar et al., Nature 2024).

  • Atomic-fact decomposition (FActScore, Min et al., 2023): split sentences into atomic claims so one false sub-clause is not hidden inside a true sentence; surfaced as factual_precision.

  • Proper calibration: the calibration report now adds a Brier score and Expected Calibration Error (ECE) over claims.

  • Configurable adaptive exit (Huginn, Geiping et al., 2025): mythos_run accepts exit_criterion (latent_diff default, or ema) and exit_threshold, mapping their latent/KL exit idea onto the belief iteration.

  • Front-loaded verification: mythos_optimize attaches the CoVe questions (verification_questions) and folds the top ones into codex_instruction, so the guardrail shapes the answer before it is written — not only after.
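
To make the semantic-entropy signal concrete, here is a minimal sketch of normalized entropy over meaning clusters. It assumes the clustering has already happened and each candidate answer carries a cluster label; normalizing by the log of the number of candidates is also an assumption, not necessarily what mythos_consistency does.

```python
import math
from collections import Counter


def normalized_semantic_entropy(cluster_labels):
    """Entropy of the cluster distribution, scaled to [0, 1].

    cluster_labels: one label per candidate answer (clustering is assumed done).
    0 means every candidate landed in the same cluster; 1 means every
    candidate landed in its own cluster.
    """
    n = len(cluster_labels)
    if n <= 1:
        return 0.0
    counts = Counter(cluster_labels)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(n)


print(normalized_semantic_entropy(["a", "a", "b", "b"]))
```

Higher values mean the candidates disagree in meaning, which is the confabulation signal the bullet above refers to.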

What was new in 0.2

Version 0.1 had a placeholder "recurrent" loop that just added a fixed pass * 0.04 to a confidence number every iteration — monotonic by construction and carrying no information. 0.2 replaces the core and adds three analysis tools:

  • True recurrent depth. The reasoning state relaxes toward a fixed point: belief_t = belief_{t-1} + lr * (target(state) - belief_{t-1}), where target is derived from evidence strength, open unknowns, and detected contradictions. Iteration halts on convergence, oscillation, or stall.

  • Adaptive compute. Harder inputs (weak evidence, more unknowns/contradictions) get a lower learning rate and a larger pass budget, so the engine spends more passes where it matters and fewer on easy inputs.

  • Contradiction detection. Negation, modal (always vs sometimes), numeric/version, and claim-vs-evidence mismatches.

  • Calibration. Each claim's stated confidence is compared against its evidence-derived confidence; systematic overconfidence is penalized.

  • New tools: mythos_consistency (self-consistency / ensemble), mythos_assumptions (assumption mining), mythos_probe (edge-case / counterfactual probing).
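
Two of the contradiction types listed above (modal and numeric/version) can be illustrated with a toy detector. This is a sketch under strong assumptions: the word lists, the token pattern, and the rule that two statements are comparable only when they match after removing modal words and masking numbers are all hypothetical. Negation and claim-vs-evidence checks are not covered.

```python
import re

ABSOLUTE = {"always", "never"}
PARTIAL = {"sometimes", "occasionally", "rarely"}


def _tokens(text):
    # Version-like numbers ("18", "1.2.3") or lowercase words.
    return re.findall(r"\d+(?:\.\d+)*|[a-z_]+", text.lower())


def detect_contradiction(a, b):
    """Return "modal", "numeric", or None for a pair of statements."""
    ta, tb = _tokens(a), _tokens(b)
    modals = ABSOLUTE | PARTIAL

    def skeleton(tokens):
        # Drop modal words and mask numbers so only the shared frame remains.
        return ["#" if t[0].isdigit() else t for t in tokens if t not in modals]

    if skeleton(ta) != skeleton(tb):
        return None  # different statements; nothing to compare
    numbers_a = [t for t in ta if t[0].isdigit()]
    numbers_b = [t for t in tb if t[0].isdigit()]
    if numbers_a != numbers_b:
        return "numeric"
    sa, sb = set(ta), set(tb)
    if (sa & ABSOLUTE and sb & PARTIAL) or (sa & PARTIAL and sb & ABSOLUTE):
        return "modal"
    return None


print(detect_contradiction("The cache always hits.", "The cache sometimes hits."))
print(detect_contradiction("Requires Node 18.", "Requires Node 20."))
```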

Codex Status

This project is installed at:

<your local path>/mythos-mcp

Codex config:

[mcp_servers.mythos_reasoning]
command = "node"
args = ["<your local path>/mythos-mcp/dist/index.js"]
cwd = "<your local path>/mythos-mcp"
enabled = true
startup_timeout_sec = 20
tool_timeout_sec = 120

When Codex exposes tools named like mcp__mythos_reasoning__mythos_optimize, the MCP is connected.

Tools

The v0.6 public surface is 19 tools.

Quality pipeline:

  • mythos_budget: estimate overhead, recommend off/lite/standard/deep/max, and map active modes to the host model effort dial (low/medium/high/xhigh/max).

  • mythos_plan: classify task mode, risk level, risk factors, and evidence needs.

  • mythos_pass: run one recurrent relaxation step (carries state across turns).

  • mythos_verify: score claims by evidence support; report per-claim status, risk tags, and calibration.

  • mythos_critique: find unsupported claims, contradictions, overconfidence, miscalibration, missing verification, and edge-case gaps.

  • mythos_finalize: produce a concise answer with evidence summary, contradictions, and uncertainty notes.

  • mythos_run: run the full workflow (plan -> depth -> verify -> contradictions -> critique -> finalize -> score).

  • mythos_gate: decide ready / revise / gather_evidence / formalize / block. Merge mode (candidates >= 2): gate a multi-agent merge with merge_ready / reconcile / block_merge.

  • mythos_formalize: turn an unresolved problem into variables, constraints, an objective, and next actions ranked by expected value.

  • mythos_optimize: single-entry orchestration for Codex (budget, gate, formalize, assumptions, open edge cases, evidence requests, output contract).

  • mythos_compare: compare baseline output against a Mythos-checked answer.

Analysis tools (deeper thinking):

  • mythos_consistency: self-consistency / ensemble over multiple candidate answers — agreement score, consensus claims, and cross-candidate contradictions.

  • mythos_assumptions: mine the hidden assumptions a draft depends on and rank them by risk_if_false / cost.

  • mythos_probe: generate the edge cases / counterfactuals an answer must survive, flagging which are already addressed.

  • mythos_verify_chain: Chain-of-Verification — independent verification questions for a draft's claims, prioritized by weakness and risk.

  • mythos_judge: optional Tier 1 fresh-context semantic judge for claim/evidence entailment (supported | contradicted | unsupported). Defaults to disabled without an API key; use provider: "mock" for local deterministic testing. Supports sample_count, entropy_threshold, cross_model, secondary_model, and use_cache.

  • mythos_check_provenance: deterministically checks whether claims are bound to verifiable pointers (file_line, test_id, url, or quote) and counts unbound specific claims.

  • mythos_route: routes claims to tier0_only, tier1_judge, or gather_evidence and estimates judge cost before spending it.

  • mythos_audit_progress: compare progress statements to real tool results and return grounded / false / unverifiable labels plus floating claims.
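
As an illustration of the provenance check in this list, the sketch below counts claims that lack a verifiable pointer. The four pointer types come from the mythos_check_provenance description; the claim layout (a "pointer" dict with "type" and "value") and the return shape are hypothetical.

```python
POINTER_TYPES = {"file_line", "test_id", "url", "quote"}


def check_provenance(claims):
    """Split claims into bound and unbound by pointer presence.

    claims: list of dicts with "claim" text and an optional "pointer" dict
    holding "type" and "value" (a hypothetical layout for this sketch).
    """
    unbound = []
    for c in claims:
        pointer = c.get("pointer") or {}
        bound = pointer.get("type") in POINTER_TYPES and bool(pointer.get("value"))
        if not bound:
            unbound.append(c["claim"])
    return {"bound": len(claims) - len(unbound), "unbound": unbound}


report = check_provenance([
    {"claim": "The handler retries three times.",
     "pointer": {"type": "file_line", "value": "src/handler.ts:42"}},
    {"claim": "Latency dropped by 30%."},
])
print(report)
```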

Resources: mythos://protocol (the workflow) and mythos://depth (the recurrent-depth math).

Recurrent depth and adaptive compute

belief_t = belief_{t-1} + lr * (target(state) - belief_{t-1})
target(state) = base(evidence_strength) - unknown_penalty - contradiction_load
lr           = 0.85 - 0.45 * difficulty      # harder inputs creep, using more passes

Halting (halted_reason):

  • converged: |delta| < epsilon with belief high enough.

  • stalled: belief stopped moving but unknowns/contradictions remain (needs new evidence, not more passes).

  • oscillation: the target keeps flipping across passes (conflicting evidence).

  • max_passes: budget exhausted; recommended_additional_passes estimates what remains.

The depth result is auditable: each pass reports belief, target, delta, contradiction count, and open checks.
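
The update rule and halting conditions above can be traced with a short loop. This is a minimal sketch with a fixed target per run: the starting belief, epsilon, the ready_belief threshold, and the open_issues parameter are assumptions, and oscillation detection is left out because it needs a target that changes between passes.

```python
def run_depth(target, difficulty, open_issues, belief=0.5, max_passes=5,
              epsilon=0.01, ready_belief=0.7):
    """Relax belief toward target and report why iteration halted."""
    lr = 0.85 - 0.45 * difficulty  # learning-rate formula from the README
    trace = []
    for n in range(1, max_passes + 1):
        delta = lr * (target - belief)
        belief += delta
        trace.append((n, round(belief, 4), round(delta, 4)))
        if abs(delta) < epsilon:
            if belief >= ready_belief and open_issues == 0:
                return "converged", trace
            return "stalled", trace  # more passes will not help; gather evidence
    return "max_passes", trace


easy = run_depth(target=0.9, difficulty=0.0, open_issues=0)
hard = run_depth(target=0.9, difficulty=1.0, open_issues=0)
stuck = run_depth(target=0.4, difficulty=0.0, open_issues=2)
print(easy[0], len(easy[1]))
print(hard[0], len(hard[1]))
print(stuck[0], len(stuck[1]))
```

With the same target, the easy input settles in three passes while the hard one exhausts its five-pass budget, which is the adaptive-compute behaviour described above.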

Verification, contradictions, and calibration

mythos_verify assigns each claim an evidence-derived confidence from a single documented weight table (test_result > file_line > docs > web > user_input > inference > unknown), adjusted for a cited source, absolute language, and bare specifics. Status bands:

  • supported: strong, sourced, typed evidence.

  • needs_more_evidence: plausible but not proven (e.g. a cited inference).

  • unsupported: no usable evidence.
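
The weight table and status bands can be sketched as a lookup plus thresholds. Only the ordering of evidence types comes from this README; every numeric weight, the source bonus, and the band thresholds below are hypothetical, and the adjustments for absolute language and bare specifics are omitted.

```python
# Hypothetical weights that preserve the documented ordering:
# test_result > file_line > docs > web > user_input > inference > unknown
WEIGHTS = {
    "test_result": 0.9, "file_line": 0.8, "docs": 0.7, "web": 0.6,
    "user_input": 0.5, "inference": 0.35, "unknown": 0.1,
}


def evidence_confidence(evidence_type, source=None):
    """Evidence-derived confidence: base weight plus a small bonus for a cited source."""
    score = WEIGHTS.get(evidence_type, WEIGHTS["unknown"])
    if source:
        score += 0.05  # hypothetical bonus
    return min(1.0, score)


def status(score):
    """Map a confidence score onto the three status bands (thresholds assumed)."""
    if score >= 0.75:
        return "supported"
    if score >= 0.3:
        return "needs_more_evidence"
    return "unsupported"


print(status(evidence_confidence("test_result", "npm test")))
print(status(evidence_confidence("inference", "design doc")))
print(status(evidence_confidence("unknown")))
```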

Calibration compares stated vs evidence-derived confidence and reports overconfident_claims and a calibration_score. Note that negated absolutes (cannot guarantee, not always) are treated as hedging, not overconfidence.
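
For reference, the two calibration metrics mentioned in this README (Brier score and ECE) can be computed as below. This is a sketch of the standard definitions, assuming each claim is reduced to a stated confidence and a binary outcome (1 if the claim held, 0 otherwise); how Mythos derives the outcome and how many bins it uses are not specified here.

```python
def brier_score(pairs):
    """Mean squared gap between stated confidence and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in pairs) / len(pairs)


def expected_calibration_error(pairs, bins=5):
    """Weighted average of |mean confidence - accuracy| over equal-width bins."""
    buckets = [[] for _ in range(bins)]
    for p, y in pairs:
        buckets[min(int(p * bins), bins - 1)].append((p, y))
    ece = 0.0
    for bucket in buckets:
        if not bucket:
            continue
        avg_conf = sum(p for p, _ in bucket) / len(bucket)
        accuracy = sum(y for _, y in bucket) / len(bucket)
        ece += len(bucket) / len(pairs) * abs(avg_conf - accuracy)
    return ece


claims = [(0.9, 1), (0.9, 0), (0.3, 0), (0.3, 1)]
print(brier_score(claims), expected_calibration_error(claims))
```

Lower is better for both; a confident claim that turns out wrong raises the Brier score the most.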

Budget Modes

| Mode | Model effort | Typical use | Passes | Estimated overhead |
| --- | --- | --- | --- | --- |
| off | skip call | Casual chat, creative drafting, low-risk preference questions | 0 | 0 |
| lite | low | Short factual answers, quick final check | 1 | about 180 tokens + 15% of checked text |
| standard | high | Normal coding, debugging, current information, design tradeoffs | 2 | about 420 tokens + 35% of checked text |
| deep | xhigh | Security, legal, finance, auth, destructive actions, costly decisions | 4 | about 900 tokens + 80% of checked text |
| max | max | Highest-capability sensitive tasks where the extra cost/latency is justified | 5 | about 1300 tokens + 110% of checked text |

For best value, start with mythos_budget. Use mythos_run / mythos_optimize when the recommendation is standard or deep, or when the answer contains important factual claims.
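
The budget table translates into a simple overhead estimate. The sketch below copies the effort, pass, and overhead figures from the table; it assumes "checked text" is measured in tokens, and the function name and return shape are hypothetical rather than the output of mythos_budget.

```python
# mode: (model effort, passes, base tokens, fraction of checked text)
BUDGET_MODES = {
    "off": (None, 0, 0, 0.0),
    "lite": ("low", 1, 180, 0.15),
    "standard": ("high", 2, 420, 0.35),
    "deep": ("xhigh", 4, 900, 0.80),
    "max": ("max", 5, 1300, 1.10),
}


def estimate_overhead(mode, checked_tokens):
    """Approximate token overhead for one mode (checked text assumed to be in tokens)."""
    effort, passes, base, fraction = BUDGET_MODES[mode]
    return {
        "effort": effort,
        "passes": passes,
        "overhead_tokens": round(base + fraction * checked_tokens),
    }


print(estimate_overhead("standard", 1000))
print(estimate_overhead("deep", 1000))
```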

mythos_run, mythos_gate, and mythos_optimize can also receive:

{
  "response_source": "fable5 | opus_fallback | unknown",
  "stop_reason": "optional provider stop/refusal reason"
}

Use opus_fallback when the host fell back from the requested Fable 5 / Mythos 5-class path. The gate does not blindly reject fallback output, but it reports a downgraded provenance posture and is stricter on high-risk tasks.

Optional Tier 1 Judge

The judge layer is exposed through mythos_judge and routed by mythos_route. The default mythos_gate path remains deterministic and zero-network; mythos_optimize reports which claims justify Tier 1 and the estimated spend.

Environment contract:

ANTHROPIC_API_KEY                 enables the Anthropic judge client
MYTHOS_JUDGE_PROVIDER=mock        uses the local deterministic mock client
MYTHOS_JUDGE_PROVIDER=disabled    forces Tier 1 off
MYTHOS_JUDGE_MODEL                defaults to claude-fable-5
MYTHOS_JUDGE_EFFORT               low | medium | high | xhigh | max, defaults to high
MYTHOS_JUDGE_MAX_OUTPUT_TOKENS    defaults to 1200
MYTHOS_JUDGE_TIMEOUT_MS           defaults to 30000
MYTHOS_JUDGE_ENDPOINT             defaults to https://api.anthropic.com/v1/messages

The Anthropic request uses the Messages API with model, max_tokens, messages, and output_config.effort; it sends the required anthropic-version: 2023-06-01 header. The judge prompt asks only for structured JSON verdicts and explicitly avoids private reasoning.
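
One way to read the environment contract is as a resolver with defaults. The variable names and default values below are taken from the list above; the precedence (an explicit MYTHOS_JUDGE_PROVIDER wins over the presence of ANTHROPIC_API_KEY) and the returned dict shape are assumptions for illustration, not the server's actual logic.

```python
def resolve_judge_config(env):
    """Resolve Tier 1 judge settings from an environment mapping."""
    provider = env.get("MYTHOS_JUDGE_PROVIDER")
    if provider in ("disabled", "mock"):
        mode = provider  # assumption: explicit provider setting takes precedence
    elif env.get("ANTHROPIC_API_KEY"):
        mode = "anthropic"
    else:
        mode = "disabled"  # no key: Tier 0 stays the zero-network default
    return {
        "provider": mode,
        "model": env.get("MYTHOS_JUDGE_MODEL", "claude-fable-5"),
        "effort": env.get("MYTHOS_JUDGE_EFFORT", "high"),
        "max_output_tokens": int(env.get("MYTHOS_JUDGE_MAX_OUTPUT_TOKENS", "1200")),
        "timeout_ms": int(env.get("MYTHOS_JUDGE_TIMEOUT_MS", "30000")),
        "endpoint": env.get("MYTHOS_JUDGE_ENDPOINT", "https://api.anthropic.com/v1/messages"),
    }


print(resolve_judge_config({})["provider"])
print(resolve_judge_config({"ANTHROPIC_API_KEY": "example-key"})["provider"])
```

Passing a plain dict keeps the sketch testable; os.environ can be passed in the same way.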

Example local call:

{
  "task": "Judge whether claims are supported.",
  "provider": "mock",
  "sample_count": 2,
  "use_cache": true,
  "claims": [
    {
      "claim": "The unit tests pass.",
      "evidence_type": "test_result",
      "source": "npm test"
    }
  ]
}

For progress reports, pass either statements or progress_report with compact tool_results:

{
  "statements": ["I ran npm test.", "I deployed to production."],
  "tool_results": [
    { "id": "t1", "tool": "shell", "output": "npm test passed all checks" }
  ]
}
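
To show the idea behind the progress audit on the example above, here is a toy matcher. It is only a sketch: the stopword list and the rule that a statement is grounded when all of its content words appear in one tool output are hypothetical, and the false label (a statement contradicted by a tool result) is not attempted.

```python
import re

STOPWORDS = {"i", "we", "ran", "have", "has", "the", "a", "an", "to", "and"}


def _words(text):
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def audit_progress(statements, tool_results):
    """Label each statement grounded or unverifiable against tool outputs."""
    labels = []
    for statement in statements:
        content = _words(statement) - STOPWORDS
        match = next(
            (r["id"] for r in tool_results if content and content <= _words(r["output"])),
            None,
        )
        labels.append({
            "statement": statement,
            "label": "grounded" if match else "unverifiable",
            "evidence": match,
        })
    return labels


audit = audit_progress(
    ["I ran npm test.", "I deployed to production."],
    [{"id": "t1", "tool": "shell", "output": "npm test passed all checks"}],
)
print(audit)
```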

Recommended workflow:

1. Use normal local tools first: read files, run tests, inspect docs.
2. Call mythos_optimize with task + draft + compact claims/evidence.
3. ready           -> answer under output_contract.
4. gather_evidence -> collect the requested evidence first.
5. formalize       -> follow the ranked next step.
6. revise          -> rewrite under the output rules.
7. block           -> do not answer; resolve the contradiction first.

Use the lower-level tools (mythos_verify, mythos_critique, mythos_consistency, mythos_assumptions, mythos_probe, mythos_pass) when debugging the pipeline itself or when you want one specific check.

Maximum benefit with controlled cost

skip: casual/creative/no factual stakes
lite: short answers with 1-3 claims
standard: coding, debugging, docs, current info, architecture decisions
deep: security, credentials, payments, destructive operations, legal/medical/financial risk
max: highest-capability sensitive tasks after the deterministic gate justifies the spend

Send compact drafts, claims, and evidence summaries — not entire files, logs, or transcripts. The MCP cannot retrieve sources or run tests; it makes evidence gaps, contradictions, and overconfidence visible before the final answer.

Replay evaluation

Record real Codex tasks after an answer attempt:

npm run replay:record -- replay/sample_record.json replay/replay.jsonl
npm run replay:analyze -- replay/replay.jsonl
npm run replay:report  -- replay/replay.jsonl reports/replay.md
npm run replay:reset   -- replay/replay.jsonl

Records include task, draft, claims, evidence, outcome, and notes. Use outcomes consistently: accepted | reworked | failed | blocked | unknown. The strongest tuning signal is ready decisions that later become failed or reworked — that means the gate is too permissive for that task class. (reset truncates the log even on filesystems that block file deletion.)
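
The tuning signal described above can be measured directly from the replay log. This sketch assumes one JSON object per line and a hypothetical decision field holding the gate decision next to the documented outcome field; it does not reproduce what npm run replay:analyze reports.

```python
import json


def permissive_rate(jsonl_text):
    """Share of ready decisions whose outcome was later failed or reworked."""
    records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    ready = [r for r in records if r.get("decision") == "ready"]  # hypothetical field
    bad = [r for r in ready if r.get("outcome") in {"failed", "reworked"}]
    return len(bad) / len(ready) if ready else 0.0


log = "\n".join([
    json.dumps({"task": "fix login bug", "decision": "ready", "outcome": "accepted"}),
    json.dumps({"task": "update docs", "decision": "ready", "outcome": "reworked"}),
    json.dumps({"task": "rotate keys", "decision": "block", "outcome": "blocked"}),
])
print(permissive_rate(log))
```

Grouping the same ratio by task class would show where the gate is too permissive.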

References

The deterministic heuristics are inspired by — not implementations of — these works:

  • Geiping et al., "Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach" (arXiv:2502.05171, 2025).

  • Dhuliawala et al., "Chain-of-Verification Reduces Hallucination in Large Language Models" (arXiv:2309.11495, 2023).

  • Farquhar, Kossen, Kuhn & Gal, "Detecting hallucinations in large language models using semantic entropy", Nature 630 (2024).

  • Min et al., "FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation" (EMNLP 2023).

  • Wang et al., "Self-Consistency Improves Chain-of-Thought Reasoning in Language Models" (arXiv:2203.11171, 2022).

Test

npm test

npm test builds, runs the MCP smoke test, the v0.2 unit suite (recurrent convergence, contradiction detection, calibration, gate decisions, and the new tools), the comparison and dataset evaluations, and a replay round-trip. Reports are written to:

reports/comparison.md
reports/real_tasks.sample.eval.md
reports/gate_eval.md
reports/test_replay.md