cuad-audit
An MCP server that audits a contract liability clause against a derived
company standard, and only produces a verdict when it can point at the
exact retrieved evidence it relied on, and that verdict has passed a
faithfulness check against that evidence. When the evidence is too weak,
it says so — insufficient-grounding is a first-class result, not an error.
This is a study in when an agent should abstain, built as a small, fully-tested MCP server over a real legal-contracts dataset (CUAD, CC BY 4.0).
Quickstart
uv sync
make demo

make demo runs two pinned fixtures through audit_clause and prints the
raw tool JSON — no MCP wiring, no full dataset download (the Chroma index
auto-builds in seconds from a committed slice of the data). Without
ANTHROPIC_API_KEY set, it runs the abstain case only (gate 1 needs no
LLM) and tells you so:
{
"verdict": "insufficient-grounding",
"reason": "retrieval evidence too weak to ground a verdict (top similarity 0.423 < threshold 0.64) — escalate to a human reviewer",
"citation": null,
"gate1_score": 0.4228,
"gate1_threshold": 0.64,
"faithfulness": null,
"failure_cause": "gate1"
}

With a key set, it also runs a cited-verdict case (a mutual 3x
work-order liability cap) and prints one of acceptable, risky, or
off-standard, with a chunk_id citation that the server resolved to an
exact precedent span.
Architecture
            ┌──────────────────┐
   agent ──▶│    MCP server    │  stdio; logging to stderr only
            │   (server.py)    │  (stdout is reserved for the protocol)
            └──┬──────┬─────┬──┘
  search_clauses  get_standard  audit_clause (reuses both)
       │               │              │
       ▼               ▼              ▼
  ┌──────────┐  ┌────────────┐  ┌─────────────────────────────┐
  │BM25 +    │  │standard,   │  │ GATE 1 (pre-LLM): cosine    │
  │cosine    │  │derived from│  │ evidence score < 0.64       │
  │(Chroma)  │  │15 read     │  │  → insufficient-grounding   │
  │          │  │clauses     │  │ verdict LLM (Sonnet, t=0)   │
  └────┬─────┘  └────────────┘  │ GATE 2 (post-LLM): Haiku    │
       │                        │ faithfulness judge          │
       ▼                        └─────────────────────────────┘
  CUAD liability spans (Cap On Liability / Uncapped Liability)
    → chunked with stable chunk_ids

The three tools
search_clauses(query, clause_type="liability", k=5): BM25-ranked precedent chunks (each with a stable chunk_id, source contract, char span, and score), gated by a cosine evidence-confidence score. A top score below the threshold returns status: "below_threshold" with the scores, which is an abstention, not an error.
get_standard(clause_type="liability"): returns the liability "playbook", six positions (P1–P6, e.g. mutuality, cap basis, carve-outs) derived from 15 hand-read CUAD clauses, with provenance. Explicitly scoped: not legal advice, not corpus-wide extraction.
audit_clause(incoming_clause, clause_type="liability"): reuses both tools above, then runs the two-gate grounding contract below.
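For readers unfamiliar with the ranking side of search_clauses, the following is a minimal sketch of Okapi BM25 scoring over a few toy chunks. It is not the repo's KeywordIndex: the function name bm25_scores, the whitespace tokenizer, the sample chunks, and the k1/b defaults are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each doc against the query with Okapi BM25 (illustrative only)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many docs each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Length normalisation: longer-than-average docs are penalised.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy chunks (hypothetical text, not taken from CUAD).
chunks = [
    "liability shall not exceed the fees paid under this agreement",
    "each party shall maintain insurance coverage during the term",
    "nothing in this agreement limits liability for fraud",
]
scores = bm25_scores("liability exceed fees", chunks)
best = max(range(len(chunks)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

The chunk that shares the most query terms ranks first, and a chunk with no overlapping terms scores exactly zero, which is one reason a separate cosine score is useful as an evidence-strength signal.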
The grounding contract (two gates, both in code)
Gate 1: pre-LLM evidence gate. A leave-one-out cosine similarity score is checked before any LLM call. If the best match is below 0.64 (calibrated against a 13-query negative set: gibberish, out-of-domain clauses, cross-referenced caps), the tool returns insufficient-grounding immediately. No API key is needed for this path.

Gate 2: post-LLM faithfulness judge. A single cheap Haiku call decomposes the verdict's reasoning into claims and checks that each is supported by the cited chunk and the standard. An unfaithful verdict is downgraded to insufficient-grounding and counted as a hallucination; it is never silently shipped.
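The pre-LLM gate can be pictured with a small sketch like the one below. It assumes toy 3-dimensional vectors in place of real MiniLM embeddings, omits the leave-one-out detail, and uses hypothetical names (cosine, gate1); only the 0.64 threshold and the result field names come from this README.

```python
import math

GATE1_THRESHOLD = 0.64  # value quoted in this README


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def gate1(query_vec, chunk_vecs, threshold=GATE1_THRESHOLD):
    """Return an abstention dict when evidence is too weak, else None."""
    top = max(cosine(query_vec, v) for v in chunk_vecs)
    if top < threshold:
        return {
            "verdict": "insufficient-grounding",
            "gate1_score": round(top, 4),
            "gate1_threshold": threshold,
            "failure_cause": "gate1",
        }
    return None  # evidence is strong enough; continue to the verdict LLM


# Toy "embeddings" (assumption: real vectors would come from the embedding model).
library = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0]]
weak = gate1([0.0, 0.0, 1.0], library)    # orthogonal to every chunk
strong = gate1([0.9, 0.1, 0.0], library)  # close to the first chunk
print(weak["verdict"], strong)
```

Because the check runs before any model call, the abstention path costs nothing and needs no API key.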
Citations are chunk_id lookups, not string matching: the verdict LLM
picks an id from the evidence it was shown, and the server resolves it to
the exact span. If the LLM names an invalid verdict or a chunk_id outside
the retrieved evidence, that's also caught and downgraded to
insufficient-grounding.
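A sketch of the citation-by-lookup idea described above, under stated assumptions: the function name resolve_citation, the dict shapes, and the sample chunk ids are hypothetical. The three verdict labels are the ones this README lists for the cited-verdict case.

```python
ALLOWED_VERDICTS = {"acceptable", "risky", "off-standard"}


def resolve_citation(llm_output, evidence):
    """Validate the LLM's verdict and chunk_id against the retrieved evidence."""
    by_id = {chunk["chunk_id"]: chunk for chunk in evidence}
    verdict = llm_output.get("verdict")
    chunk_id = llm_output.get("chunk_id")
    # An unknown verdict label or an id the LLM was never shown is downgraded.
    if verdict not in ALLOWED_VERDICTS or chunk_id not in by_id:
        return {"verdict": "insufficient-grounding", "citation": None}
    chunk = by_id[chunk_id]
    return {
        "verdict": verdict,
        "citation": {"chunk_id": chunk_id, "span": chunk["text"]},
    }


# Hypothetical evidence, standing in for retrieved precedent chunks.
evidence = [
    {"chunk_id": "c-001", "text": "Liability is capped at fees paid."},
    {"chunk_id": "c-002", "text": "The cap applies to both parties."},
]
ok = resolve_citation({"verdict": "acceptable", "chunk_id": "c-002"}, evidence)
bad_id = resolve_citation({"verdict": "risky", "chunk_id": "c-999"}, evidence)
bad_verdict = resolve_citation({"verdict": "fine", "chunk_id": "c-001"}, evidence)
print(ok["citation"]["span"], bad_id["verdict"], bad_verdict["verdict"])
```

Resolving by id means the quoted span always comes from the server's own store, so the LLM cannot paraphrase or invent the cited text.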
escalate-infra (API timeout/rate-limit/malformed output/refusal) is a
separate verdict from insufficient-grounding — "the system is honest"
and "the API is flaky" are never conflated.
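One way to keep the two abstention kinds apart is to catch infrastructure failures at the LLM call boundary and map them to their own verdict. The sketch below assumes an injected completion function and a hypothetical InfraFailure exception; it is not the repo's llm.py.

```python
class InfraFailure(Exception):
    """Hypothetical typed failure for rate limits, malformed output, refusals."""


def audit_with_llm(complete, prompt):
    """Call an injected LLM function and keep infra failures distinct."""
    try:
        raw = complete(prompt)
    except (TimeoutError, InfraFailure) as exc:
        # The API misbehaved: report that, rather than claiming weak evidence.
        return {"verdict": "escalate-infra", "reason": f"{type(exc).__name__}: {exc}"}
    return {"verdict": raw, "reason": None}


def flaky(prompt):
    """Stand-in for an API call that times out."""
    raise TimeoutError("upstream timed out")


def healthy(prompt):
    """Stand-in for an API call that returns a verdict label."""
    return "acceptable"


infra = audit_with_llm(flaky, "audit this clause")
fine = audit_with_llm(healthy, "audit this clause")
print(infra["verdict"], fine["verdict"])
```

Passing the completion function in as an argument also makes the failure paths easy to exercise in tests without network access.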
Connect to Claude Code
claude mcp add cuad-audit -- uv run --directory /path/to/luminance python -m cuad_audit.server
claude mcp list # health check: should show cuad-audit as connected

Then ask Claude Code to call search_clauses, get_standard, or
audit_clause. Tool descriptions document the abstain semantics — agents
should relay insufficient-grounding / below_threshold / escalate-infra
verbatim, not retry until they get a verdict.
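On the client side, verbatim relay might look like the sketch below. The relay function and the message format are assumptions; the field names follow the demo JSON shown earlier, and the sketch covers audit_clause results only.

```python
import json

ABSTAIN = {"insufficient-grounding", "escalate-infra"}


def relay(tool_json):
    """Turn raw audit_clause JSON into a message without retrying or guessing."""
    result = json.loads(tool_json)
    verdict = result["verdict"]
    if verdict in ABSTAIN:
        # Pass the abstention through verbatim, including the server's reason.
        return f"{verdict}: {result.get('reason')}"
    return f"{verdict} (cited chunk: {result['citation']})"


raw = json.dumps({
    "verdict": "insufficient-grounding",
    "reason": "retrieval evidence too weak to ground a verdict",
    "citation": None,
})
message = relay(raw)
print(message)
```

The point of the sketch is what it leaves out: there is no retry loop and no fallback answer when the server abstains.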
Symptom | Likely cause | Fix |
Server doesn't appear in | wrong | run |
First call is slow / looks hung | first-run embedding model download (~90 MB) | progress prints to stderr; subsequent runs are cached |
Setup
Python 3.11+, dependency management via uv with a committed lockfile (uv sync).
Embedding model: sentence-transformers/all-MiniLM-L6-v2 (~90 MB, downloaded once and cached).
Tested on macOS and Linux.
First make demo / make ingest: ~1–2 minutes (model download + index build from the committed slice of 680 chunks). Subsequent runs: seconds. Measured smoke test (uv sync && make demo && make eval-retrieval && make test, warm model cache): 97s total, 33 passed + 1 skipped (the skipped test needs ANTHROPIC_API_KEY).
ANTHROPIC_API_KEY: only required for audit_clause's verdict + judge calls and make eval-verdicts. Not required for search_clauses, get_standard, the abstain half of make demo, make eval-retrieval, or make calibrate.
Reproducing the measurements
make eval-retrieval # keyless, deterministic — retrieval vs CUAD spans + kill criteria
make calibrate # keyless — gate-1 threshold calibration distributions
make eval-verdicts # needs ANTHROPIC_API_KEY; ~70 LLM calls, a few dollars

Retrieval vs CUAD expert spans (held-out split, docs/DAY2_RESULTS.md)
166 held-out queries against 680 library chunks (80% Cap On Liability / 20% Uncapped Liability):
Metric | BM25 | MiniLM (cosine) |
precision@1 (overall) | 0.741 | 0.705 |
precision@1 (Uncapped Liability, n=34) | 0.559 | 0.500 |
success@3 (≥1 relevant in top 3) | 0.970 | 0.970 |
Embeddings did not beat the keyword baseline, so retrieval ranking and citations are BM25-first (pre-registered Day-2 rule); the semantic index stays in the repo, tested, as the measured comparison and feeds gate 1.
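For reference, the two metrics in the table can be computed as in this sketch. The function names, the toy rankings, and the representation of relevance as a set of chunk ids per query are assumptions; the repo's eval script may define relevance differently.

```python
def precision_at_1(ranked, relevant):
    """Fraction of queries whose top-ranked chunk is relevant."""
    hits = sum(1 for q, ids in ranked.items() if ids and ids[0] in relevant[q])
    return hits / len(ranked)


def success_at_k(ranked, relevant, k=3):
    """Fraction of queries with at least one relevant chunk in the top k."""
    hits = sum(1 for q, ids in ranked.items() if set(ids[:k]) & relevant[q])
    return hits / len(ranked)


# Toy rankings (hypothetical ids, not the held-out split).
ranked = {
    "q1": ["a", "b", "c"],
    "q2": ["d", "e", "f"],
    "q3": ["g", "h", "i"],
    "q4": ["j", "k", "l"],
}
relevant = {"q1": {"a"}, "q2": {"f"}, "q3": {"x"}, "q4": {"j", "k"}}
p1 = precision_at_1(ranked, relevant)
s3 = success_at_k(ranked, relevant, k=3)
print(p1, s3)
```

In the toy data, q2 misses at rank 1 but succeeds within the top 3, which is the same pattern behind a success@3 well above precision@1 in the table.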
Gate-1 threshold (docs/DAY3_CALIBRATION.md)
Calibrated on leave-one-out cosine top-scores: library positives (n=669, p10=0.64) vs a 13-query negative set (gibberish, out-of-domain clauses, cross-referenced caps). Threshold 0.64 catches 8/13 negatives outright; the remaining 5 (cross-referenced caps, near-domain insurance/audit-rights text) are real liability-adjacent text with no auditable content — owned by the verdict path (standard position P5), not by gate 1. See docs/gate1_calibration.png.
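The calibration step can be sketched as follows, assuming the threshold is taken as the 10th percentile of positive top-scores. The p10=0.64 figure suggests this, but the README does not state it outright. The helper names and all scores below are made up for illustration and are not the repo's measured distributions.

```python
def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    ordered = sorted(values)
    # Integer ceiling division avoids floating-point rank errors.
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]


def calibrate(positive_scores, negative_scores, pct=10):
    """Set the threshold at a low percentile of positives; count negatives caught."""
    threshold = percentile(positive_scores, pct)
    caught = sum(1 for s in negative_scores if s < threshold)
    return threshold, caught


# Toy scores (assumption), chosen so one near-domain negative slips past the gate.
positives = [0.55, 0.61, 0.68, 0.72, 0.75, 0.78, 0.80, 0.83, 0.86, 0.90]
negatives = [0.20, 0.35, 0.42, 0.60, 0.71]
threshold, caught = calibrate(positives, negatives)
print(threshold, caught, len(negatives))
```

Negatives that score above the threshold are the analogue of the liability-adjacent cases described above: the gate passes them, so a later stage has to handle them.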
Verdicts vs hand-labeled standard (docs/EVAL.md, results: docs/DAY5_RESULTS.md)
35-item eval set (28 hand labels + 3 cross-reference cases + 3
capped/uncapped confusion pairs + 1 prompt-injection probe), run via
make eval-verdicts. Reported as counts and a failure taxonomy, never a
headline accuracy — n is too small for that, and the report says so. Columns
distinguish grounding abstentions (justified vs unjustified) from infra
abstentions (API failures).
One live run (2026-06-10; ±1–2 expected on re-run):
Outcome | n | % of 28 |
Non-abstained verdict | 11 | 39% |
Grounding abstain — justified (R9) | 1 | 4% |
Grounding abstain — unjustified | 16 | 57% |
Infra abstain | 0 | 0% |
6/11 non-abstained verdicts matched the hand label exactly. Of the 16 unjustified abstentions, 14 came from the gate-2 faithfulness judge — a hand-verified sample found the judge, not the verdict LLM, was usually the weak link (rejecting reasonable inferential claims as "unsupported"). Citations were valid on 100% of non-abstained verdicts; adversarial defenses (cross-referenced caps + prompt-injection probe) held 4/4. Full taxonomy, root-cause analysis, and "what I'd do next": docs/DAY5_RESULTS.md. One-page project writeup: docs/WRITEUP.md.
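The percentages in the table follow directly from the reported counts, as this short check shows. The dictionary keys are paraphrased row labels; the counts are the ones reported for the single live run.

```python
# Counts reported for the single live run described in this README.
outcomes = {
    "non-abstained verdict": 11,
    "grounding abstain, justified": 1,
    "grounding abstain, unjustified": 16,
    "infra abstain": 0,
}
total = sum(outcomes.values())
shares = {name: round(100 * n / total) for name, n in outcomes.items()}
for name, n in outcomes.items():
    print(f"{name:32s} {n:3d} {shares[name]:3d}%")
```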
Known limitations (named on purpose)
Polarity risk: "Cap on Liability" and "Uncapped Liability" are a negation pair that embedding similarity can confuse. Gate 1 measures evidence strength, not correctness — the verdict LLM owns the capped/uncapped call, and confusion pairs are in the adversarial eval set.
Tool-level vs agent-level grounding: the server cannot stop a client agent from speculating after an insufficient-grounding result. The demo harness instructs verbatim relay and shows raw tool output.
Single lane (liability), single segmenter (CUAD's expert spans, not a production clause segmenter), standard derived from 15 read clauses: all scoped claims, not corpus-wide extraction. See PLAN.md for the full "Not in Scope" list and rationale.
The faithfulness judge (gate 2) is itself an ungated LLM call; a hand-verified sample of judge outputs is reported alongside the eval.
Project layout
src/cuad_audit/
  download.py        CUAD v1 download (pinned sha256)
  derive_slice.py    reproduces the committed data slice byte-identically
  ingest.py          chunking, token-length checks, Chroma index build
  retrieval.py       BM25 (KeywordIndex) + cosine (SemanticIndex)
  llm.py             CompleteFn seam: typed failures, no silent fallbacks
  audit.py           the three tools + both gates
  server.py          MCP stdio entrypoint
  demo.py            make demo
  calibrate.py       gate-1 threshold calibration
  eval_retrieval.py  make eval-retrieval
  eval_verdicts.py   make eval-verdicts (resumable JSONL)
data/                committed slice (split, standard, labels, chunks)
docs/                split methodology, rubric, eval definitions, results
tests/               34 tests, LLM seam fully mocked; CI is free

Data & attribution
Built on the Contract Understanding Atticus Dataset (CUAD) v1,
© The Atticus Project, licensed under
CC BY 4.0. This repo commits
a small derived slice (liability-clause spans, data/liability_spans_all.json
and data/split.json) for reproducibility; make ingest can re-derive the
index from a fresh download via src/cuad_audit/download.py.
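A pinned-digest check of the kind a download step might perform can be sketched with the standard library alone. The function names and the demo bytes are assumptions, and no real CUAD digest is shown here.

```python
import hashlib
import os
import tempfile


def sha256_of(path, block_size=1 << 20):
    """Stream a file through SHA-256 so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify(path, pinned_hex):
    """Raise if the file on disk does not match the pinned digest."""
    actual = sha256_of(path)
    if actual != pinned_hex:
        raise ValueError(f"sha256 mismatch for {path}: got {actual}")
    return True


# Demo with a throwaway file standing in for the downloaded archive.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"demo archive bytes")
pinned = hashlib.sha256(b"demo archive bytes").hexdigest()
checked = verify(tmp.name, pinned)
os.remove(tmp.name)
print(checked)
```

Failing loudly on a mismatch means a changed upstream archive cannot silently alter the derived slice.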
Code is MIT licensed.