cuad-audit

An MCP server that audits a contract liability clause against a derived company standard. It produces a verdict only when it can point at the exact retrieved evidence it relied on and that verdict has passed a faithfulness check against that evidence. When the evidence is too weak, it says so: insufficient-grounding is a first-class result, not an error.

This is a study in when an agent should abstain, built as a small, fully-tested MCP server over a real legal-contracts dataset (CUAD, CC BY 4.0).

Quickstart

uv sync
make demo

make demo runs two pinned fixtures through audit_clause and prints the raw tool JSON — no MCP wiring, no full dataset download (the Chroma index auto-builds in seconds from a committed slice of the data). Without ANTHROPIC_API_KEY set, it runs the abstain case only (gate 1 needs no LLM) and tells you so:

{
  "verdict": "insufficient-grounding",
  "reason": "retrieval evidence too weak to ground a verdict (top similarity 0.423 < threshold 0.64) — escalate to a human reviewer",
  "citation": null,
  "gate1_score": 0.4228,
  "gate1_threshold": 0.64,
  "faithfulness": null,
  "failure_cause": "gate1"
}

With a key set, it also runs a cited-verdict case (a mutual 3x work-order liability cap) and prints one of acceptable, risky, or off-standard, together with a chunk_id citation that the server resolved to an exact precedent span.

Architecture

            ┌──────────────────┐
   agent ──▶│   MCP server     │  stdio; logging to stderr only
            │   (server.py)    │  (stdout is reserved for the protocol)
            └──┬──────┬─────┬──┘
   search_clauses  get_standard  audit_clause (reuses both)
        │              │              │
        ▼              ▼              ▼
   ┌──────────┐  ┌───────────┐  ┌─────────────────────────────┐
   │BM25 +    │  │standard,   │  │ GATE 1 (pre-LLM): cosine    │
   │cosine    │  │derived from│  │   evidence score < 0.64     │
   │(Chroma)  │  │15 read     │  │   → insufficient-grounding  │
   │          │  │clauses     │  │ verdict LLM (Sonnet, t=0)   │
   └────┬─────┘  └───────────┘  │ GATE 2 (post-LLM): Haiku     │
        │                       │   faithfulness judge         │
        ▼                       └─────────────────────────────┘
   CUAD liability spans (Cap On Liability / Uncapped Liability)
   → chunked with stable chunk_ids

The three tools

  • search_clauses(query, clause_type="liability", k=5) — BM25-ranked precedent chunks (each with a stable chunk_id, source contract, char span, and score), gated by a cosine evidence-confidence score. Below threshold returns status: "below_threshold" with the scores — an abstention, not an error.

  • get_standard(clause_type="liability") — the liability "playbook": six positions (P1–P6, e.g. mutuality, cap basis, carve-outs) derived from 15 hand-read CUAD clauses, with provenance. Explicitly scoped — not legal advice, not corpus-wide extraction.

  • audit_clause(incoming_clause, clause_type="liability") — reuses both tools above, then runs the two-gate grounding contract below.

The grounding contract (two gates, both in code)

  1. Gate 1 — pre-LLM evidence gate. A leave-one-out cosine similarity score is checked before any LLM call. If the best match is below 0.64 (calibrated against a 13-query negative set — gibberish, out-of-domain clauses, cross-referenced caps), the tool returns insufficient-grounding immediately. No API key needed for this path.

  2. Gate 2 — post-LLM faithfulness judge. A single cheap Haiku call decomposes the verdict's reasoning into claims and checks each is supported by the cited chunk and the standard. An unfaithful verdict is downgraded to insufficient-grounding and counted as a hallucination — never silently shipped.
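
The two gates above can be sketched as a short control flow. This is a minimal illustration, not the code in audit.py: the names audit, AuditResult, draft_verdict, and is_faithful are hypothetical, the two LLM steps are passed in as plain callables, and the "gate2" value for failure_cause is an assumption (only "gate1" appears in the sample output above).

```python
from dataclasses import dataclass
from typing import Callable, Optional

GATE1_THRESHOLD = 0.64  # threshold quoted in this README


@dataclass
class AuditResult:
    verdict: str
    citation: Optional[str] = None
    failure_cause: Optional[str] = None


def audit(
    clause: str,
    evidence_score: float,
    draft_verdict: Callable[[str], tuple[str, str]],  # hypothetical: returns (verdict, chunk_id)
    is_faithful: Callable[[str, str], bool],          # hypothetical: judge over (verdict, chunk_id)
) -> AuditResult:
    # Gate 1 runs before any LLM call, so the abstain path needs no API key.
    if evidence_score < GATE1_THRESHOLD:
        return AuditResult("insufficient-grounding", failure_cause="gate1")

    verdict, chunk_id = draft_verdict(clause)

    # Gate 2 downgrades an unfaithful verdict instead of shipping it.
    # "gate2" is an assumed label, not taken from the project.
    if not is_faithful(verdict, chunk_id):
        return AuditResult("insufficient-grounding", failure_cause="gate2")

    return AuditResult(verdict, citation=chunk_id)
```

Passing the LLM steps in as callables keeps the sketch testable without a key, which mirrors the mocked LLM seam the project layout describes.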

Citations are chunk_id lookups, not string matching: the verdict LLM picks an id from the evidence it was shown, and the server resolves it to the exact span. If the LLM names an invalid verdict or a chunk_id outside the retrieved evidence, that's also caught and downgraded to insufficient-grounding.
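
A lookup-based citation check might look like the sketch below. It assumes the retrieved evidence is held as a dict keyed by chunk_id; the function name resolve_citation and the record fields (source, span) are hypothetical, while the three verdict labels are the ones this README lists.

```python
from typing import Optional

# Verdict labels named in this README.
VALID_VERDICTS = {"acceptable", "risky", "off-standard"}


def resolve_citation(
    verdict: str,
    chunk_id: str,
    evidence: dict[str, dict],
) -> Optional[dict]:
    """Return the cited chunk record, or None to signal a downgrade.

    `evidence` maps chunk_id -> record for the chunks the LLM was shown.
    The lookup is by id only; no text matching against the LLM output.
    """
    if verdict not in VALID_VERDICTS:
        return None
    return evidence.get(chunk_id)  # None if the id was never retrieved


# Toy evidence set with made-up ids and spans.
evidence = {
    "c-001": {"source": "contract_a", "span": (120, 340)},
    "c-002": {"source": "contract_b", "span": (15, 210)},
}
print(resolve_citation("risky", "c-002", evidence))
print(resolve_citation("risky", "c-999", evidence))   # id outside the evidence
print(resolve_citation("probably fine", "c-001", evidence))  # invalid verdict
```

A None result here corresponds to the downgrade to insufficient-grounding described above.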

escalate-infra (API timeout/rate-limit/malformed output/refusal) is a separate verdict from insufficient-grounding — "the system is honest" and "the API is flaky" are never conflated.

Connect to Claude Code

claude mcp add cuad-audit -- uv run --directory /path/to/luminance python -m cuad_audit.server
claude mcp list   # health check — should show cuad-audit as connected

Then ask Claude Code to call search_clauses, get_standard, or audit_clause. Tool descriptions document the abstain semantics — agents should relay insufficient-grounding / below_threshold / escalate-infra verbatim, not retry until they get a verdict.
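
For a client-side harness, verbatim relay could be sketched as follows. The function name relay and the summary format for non-abstain results are hypothetical; the three abstain states and the verdict/status/citation keys are the ones named in this README.

```python
import json

# Abstain states named in this README.
ABSTAIN_STATES = {"insufficient-grounding", "below_threshold", "escalate-infra"}


def relay(tool_result: dict) -> str:
    """Pass abstentions through unchanged; summarize everything else."""
    state = tool_result.get("verdict") or tool_result.get("status")
    if state in ABSTAIN_STATES:
        # No retry and no paraphrase: show the raw tool JSON.
        return json.dumps(tool_result, indent=2)
    return f"{state} (citation: {tool_result.get('citation')})"


print(relay({"verdict": "insufficient-grounding", "failure_cause": "gate1"}))
print(relay({"verdict": "acceptable", "citation": "c-001"}))
```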

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Server doesn't appear in claude mcp list | wrong --directory or uv not on PATH | run uv run python -m cuad_audit.server directly from the repo root and check stderr |
| audit_clause always returns escalate-infra | ANTHROPIC_API_KEY not set | export ANTHROPIC_API_KEY=... (search_clauses/get_standard still work without it) |
| First call is slow / looks hung | first-run embedding model download (~90 MB) | progress prints to stderr; subsequent runs are cached |

Setup

  • Python 3.11+, dependency management via uv with a committed lockfile (uv sync).

  • Embedding model: sentence-transformers/all-MiniLM-L6-v2 (~90 MB, downloaded once and cached).

  • Tested on macOS and Linux.

  • First make demo / make ingest: ~1–2 minutes (model download + index build from the committed slice of 680 chunks). Subsequent runs: seconds. Measured smoke test (uv sync && make demo && make eval-retrieval && make test, warm model cache): 97s total, 33 passed + 1 skipped (the skipped test needs ANTHROPIC_API_KEY).

  • ANTHROPIC_API_KEY — only required for audit_clause's verdict + judge calls and make eval-verdicts. Not required for search_clauses, get_standard, the abstain half of make demo, make eval-retrieval, or make calibrate.

Reproducing the measurements

make eval-retrieval   # keyless, deterministic — retrieval vs CUAD spans + kill criteria
make calibrate        # keyless — gate-1 threshold calibration distributions
make eval-verdicts    # needs ANTHROPIC_API_KEY — ~70 LLM calls, a few dollars

Retrieval vs CUAD expert spans (held-out split, docs/DAY2_RESULTS.md)

166 held-out queries against 680 library chunks (80% Cap On Liability / 20% Uncapped Liability):

| Metric | BM25 | MiniLM (cosine) |
| --- | --- | --- |
| precision@1 (overall) | 0.741 | 0.705 |
| precision@1 (Uncapped Liability, n=34) | 0.559 | 0.500 |
| success@3 (≥1 relevant in top 3) | 0.970 | 0.970 |

Embeddings did not beat the keyword baseline, so retrieval ranking and citations are BM25-first (pre-registered Day-2 rule); the semantic index stays in the repo, tested, as the measured comparison and feeds gate 1.
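
The two metrics in the table can be computed as in the sketch below. It assumes each query has a ranked list of chunk ids and a set of relevant ids; how eval_retrieval.py actually defines relevance against CUAD spans is not shown here, and the function names and toy data are illustrative only.

```python
def precision_at_1(ranked: list[list[str]], relevant: list[set[str]]) -> float:
    """Fraction of queries whose top-ranked chunk is relevant."""
    hits = sum(1 for r, rel in zip(ranked, relevant) if r and r[0] in rel)
    return hits / len(ranked)


def success_at_k(ranked: list[list[str]], relevant: list[set[str]], k: int = 3) -> float:
    """Fraction of queries with at least one relevant chunk in the top k."""
    hits = sum(1 for r, rel in zip(ranked, relevant) if any(c in rel for c in r[:k]))
    return hits / len(ranked)


# Toy data: four queries, three ranked chunk ids each (made-up ids).
ranked = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"], ["j", "k", "l"]]
relevant = [{"a"}, {"f"}, {"z"}, {"j", "k"}]

print(precision_at_1(ranked, relevant))   # 0.5
print(success_at_k(ranked, relevant, 3))  # 0.75
```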

Gate-1 threshold (docs/DAY3_CALIBRATION.md)

Calibrated on leave-one-out cosine top-scores: library positives (n=669, p10=0.64) vs a 13-query negative set (gibberish, out-of-domain clauses, cross-referenced caps). Threshold 0.64 catches 8/13 negatives outright; the remaining 5 (cross-referenced caps, near-domain insurance/audit-rights text) are real liability-adjacent text with no auditable content — owned by the verdict path (standard position P5), not by gate 1. See docs/gate1_calibration.png.
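
The calibration idea (pick the threshold at the 10th percentile of positive top-scores, then count how many negatives fall below it) can be sketched as follows. The nearest-rank percentile is an assumption; calibrate.py may use a different percentile method, and the scores below are toy values, not the project's measurements.

```python
import math


def percentile(values: list[float], p: int) -> float:
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def negatives_caught(negative_scores: list[float], threshold: float) -> int:
    """Count negatives that gate 1 would reject (score below threshold)."""
    return sum(1 for s in negative_scores if s < threshold)


# Toy top-similarity scores (illustrative only).
positives = [0.61, 0.64, 0.66, 0.68, 0.70, 0.71, 0.72, 0.74, 0.75, 0.76,
             0.78, 0.79, 0.80, 0.82, 0.83, 0.85, 0.86, 0.88, 0.90, 0.93]
negatives = [0.12, 0.42, 0.55, 0.66, 0.71]

threshold = percentile(positives, 10)
print(threshold)                                # 0.64
print(negatives_caught(negatives, threshold))   # 3
```

Setting the threshold at p10 of the positives accepts that roughly one in ten genuine queries will abstain at gate 1, in exchange for rejecting the weakest negatives before any LLM call.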

Verdicts vs hand-labeled standard (docs/EVAL.md, results: docs/DAY5_RESULTS.md)

35-item eval set (28 hand labels + 3 cross-reference cases + 3 capped/uncapped confusion pairs + 1 prompt-injection probe), run via make eval-verdicts. Reported as counts and a failure taxonomy, never a headline accuracy — n is too small for that, and the report says so. Columns distinguish grounding abstentions (justified vs unjustified) from infra abstentions (API failures).

One live run over the 28 hand-labeled items (2026-06-10; counts may shift by ±1–2 on re-run):

| Outcome | n | % of 28 |
| --- | --- | --- |
| Non-abstained verdict | 11 | 39% |
| Grounding abstain, justified (R9) | 1 | 4% |
| Grounding abstain, unjustified | 16 | 57% |
| Infra abstain | 0 | 0% |

6/11 non-abstained verdicts matched the hand label exactly. Of the 16 unjustified abstentions, 14 came from the gate-2 faithfulness judge — a hand-verified sample found the judge, not the verdict LLM, was usually the weak link (rejecting reasonable inferential claims as "unsupported"). Citations were valid on 100% of non-abstained verdicts; adversarial defenses (cross-referenced caps + prompt-injection probe) held 4/4. Full taxonomy, root-cause analysis, and "what I'd do next": docs/DAY5_RESULTS.md. One-page project writeup: docs/WRITEUP.md.

Known limitations (named on purpose)

  • Polarity risk: "Cap on Liability" and "Uncapped Liability" are a negation pair that embedding similarity can confuse. Gate 1 measures evidence strength, not correctness — the verdict LLM owns the capped/uncapped call, and confusion pairs are in the adversarial eval set.

  • Tool-level vs agent-level grounding: the server cannot stop a client agent from speculating after an insufficient-grounding result. The demo harness instructs verbatim relay and shows raw tool output.

  • Single lane (liability), single segmenter (CUAD's expert spans, not a production clause segmenter), standard derived from 15 read clauses — all scoped claims, not corpus-wide extraction. See PLAN.md for the full "Not in Scope" list and rationale.

  • The faithfulness judge (gate 2) is itself an ungated LLM call; a hand-verified sample of judge outputs is reported alongside the eval.

Project layout

src/cuad_audit/
  download.py       CUAD v1 download (pinned sha256)
  derive_slice.py   reproduces the committed data slice byte-identically
  ingest.py         chunking, token-length checks, Chroma index build
  retrieval.py      BM25 (KeywordIndex) + cosine (SemanticIndex)
  llm.py            CompleteFn seam — typed failures, no silent fallbacks
  audit.py          the three tools + both gates
  server.py         MCP stdio entrypoint
  demo.py           make demo
  calibrate.py      gate-1 threshold calibration
  eval_retrieval.py make eval-retrieval
  eval_verdicts.py  make eval-verdicts (resumable JSONL)
data/               committed slice (split, standard, labels, chunks)
docs/               split methodology, rubric, eval definitions, results
tests/              34 tests, LLM seam fully mocked — CI is free

Data & attribution

Built on the Contract Understanding Atticus Dataset (CUAD) v1, © The Atticus Project, licensed under CC BY 4.0. This repo commits a small derived slice (liability-clause spans, data/liability_spans_all.json and data/split.json) for reproducibility; make ingest can re-derive the index from a fresh download via src/cuad_audit/download.py.

Code is MIT licensed.
