fhir-synthetic-mcp
Integrates with NVIDIA NIM (Nemotron Parse) for parsing clinical PDF documents with OCR and table extraction.
Click on "Install Server".
Wait a few minutes for the server to deploy. Once ready, it will show a "Started" state.
In the chat, type
@followed by the MCP server name and your instructions, e.g., "@fhir-synthetic-mcpList all patients"
That's it! The server will respond to your query, and you can continue using it as needed.
Here is a step-by-step guide with screenshots.
Clinical AI Governance Platform
Agents propose. A deterministic layer validates. A human approves. Every action is audited.
A production-grade reference implementation for deploying LLM agents over clinical data with deterministic safety guardrails. Built as a reusable framework for healthcare operators — deploy once, apply across a portfolio.
Live demo: https://clinical-ai-governance-platform-production.up.railway.app
End-to-end pipeline
Raw PDF / prior auth letter / treatment plan
↓
parse_clinical_document (Nemotron Parse — NVIDIA NIM)
Multi-column OCR, table extraction, reading-order reconstruction
↓
de-identification layer (deidentify.py)
Strip name/MRN, hash patient ID, bucket age — before any external API
↓
extract_entities (ClinicalNLP — Anthropic structured output, temp=0)
ICD-10-CM · LOINC · NPI · RxNorm · calibrated confidence
↓
search_guidelines (RAG — BM25 + ChromaDB hybrid, RRF fusion)
Evidence-based thresholds from 8 clinical guidelines
↓
search_clinical_trials (ClinicalTrials.gov v2 API — on flagged observations)
Recruiting trials the patient may qualify for
↓
propose_observation (LOINC deterministic gate — 14 codes)
Hard reject on impossible values · warning on clinical flags
↓
══ HUMAN-IN-THE-LOOP GATE ══
approve_write / reject_write (verified approver only, DUA-gated)
↓
SQLite commit (WAL mode, FK enforcement, field-level encryption)
↓
SHA-256 audit chain (tamper-evident JSONL, verify_chain())Related MCP server: Smart EHR MCP Server
Architecture
┌─────────────────────────────────────────────────────────────────────┐
│ Clinical AI Governance Platform │
│ │
│ [IN] Nemotron Parse (NVIDIA NIM / self-hosted for PHI) │
│ Raw PDF → structured markdown (prior auth, EOB, plan) │
│ │ │
│ ▼ │
│ De-identification layer (deidentify.py) │
│ Hash patient ID · strip name/MRN · bucket age │
│ │ │
│ ▼ │
│ Agent SDK Orchestration (src/clinical_agent/) │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐ │
│ │ Reader │→ │ RAG │→ │ Proposal │ │
│ │ Subagent │ │ Subagent │ │ Subagent │ │
│ └──────────────┘ └──────────────┘ └──────────────────┘ │
│ PostToolUse hooks: audit logging + cost/latency tracking │
│ │ │
│ MCP Server (FastMCP 3.x) 10 tools · 2 resources │
│ │ │
│ Deterministic Validation (validator.py) │
│ LOINC registry · value ranges · unit enforcement │
│ │ │
│ Auth + DUA layer (auth.py) │
│ Principal · Approver · Data Use Agreement verification │
│ │ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐ │
│ │ SQLite Store │ │ ChromaDB RAG │ │ Audit Chain │ │
│ │ WAL · FK │ │ BM25+Semantic│ │ SHA-256 JSONL │ │
│ │ Fernet enc. │ │ RRF fusion │ │ verify_chain() │ │
│ └──────────────┘ └──────────────┘ └──────────────────┘ │
│ │
│ ClinicalTrials.gov v2 · Eval harness (25 cases, LLM-judge) │
└─────────────────────────────────────────────────────────────────────┘Build status
Component | Status |
SQLite persistence (WAL, FK) | ✓ Day 1 |
Tamper-evident audit (SHA-256 chain) | ✓ Day 1 |
Auth (principal + approver verification) | ✓ Day 1 |
LOINC deterministic validation (14 codes) | ✓ Day 2 |
Clinical data (8 guidelines, 4 notes) | ✓ Day 2 |
MCP resources + prompts + prompt caching | ✓ Day 2 |
RAG — BM25 + ChromaDB hybrid (RRF) | ✓ Day 3 |
Agent SDK orchestration (3 subagents, hooks) | ✓ Day 4 |
Extended thinking routing (flagged proposals) | ✓ Day 4 |
Clinical NLP entity extraction (structured output) | ✓ Day 5 |
Calibrated confidence scoring (Brier score) | ✓ Day 5 |
Eval harness (25 golden cases, LLM-as-judge) | ✓ Day 6 |
GitHub Actions CI (pytest + eval regression gate) | ✓ Day 6 |
HTTP server (FastAPI SSE, claude.ai connector) | ✓ Day 7 |
Dockerfile + Railway deploy | ✓ Day 7 |
ClinicalTrials.gov integration | ✓ Day 8 |
Nemotron Parse — raw PDF → structured text → NLP → audit | ✓ Day 9 |
DUA enforcement ( | ✓ Day 10 |
Field-level encryption at rest (Fernet, PHI fields) | ✓ Day 10 |
De-identification layer (hash ID, strip name/MRN, age bucket) | ✓ Day 10 |
Performance (eval harness, smoke suite)
Metric | Value |
Accuracy (accept/reject correct) | 100% |
False-negative rate | 0% |
Regression threshold | 80% |
Brier score | 0.3174 |
Mean validation latency | 0.32 ms |
Eval suite size | 25 golden cases |
Setup
python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
python3 scripts/seed_db.py
pytest -qConnect to Claude Code (local stdio)
claude mcp add clinical-governance -- \
/path/to/.venv/bin/python -m fhir_mcp.serverConnect to claude.ai (remote SSE)
Settings → Connectors → Add → https://clinical-ai-governance-platform-production.up.railway.app/sse
Environment variables
Variable | Default | Purpose |
| — | Required for Agent SDK + NLP |
| — | Required for Nemotron Parse (NVIDIA NIM) |
| NIM cloud | Override with self-hosted NIM URL for PHI docs |
|
| Set |
| — | Fernet key for PHI field encryption at rest |
| — | Comma-separated actor IDs with signed DUA |
|
| SQLite database path |
|
| Agent audit identity |
| stderr | Audit JSONL path |
|
| LOINC validation rules |
| (unset = dev mode) | Allowed agent actor IDs |
| (unset = dev mode) | Allowed human approver IDs |
|
| Set |
|
| HTTP server port |
Generate an encryption key
python3 -c "from fhir_mcp.store import generate_encryption_key; print(generate_encryption_key())"Store the output in your secrets manager as FHIR_MCP_ENCRYPTION_KEY.
Verify audit chain
python3 scripts/audit_verify.py data/audit.jsonlRun evals
python3 scripts/run_evals.py --suite smoke
python3 scripts/run_evals.py --suite full --judgeRepository structure
src/
fhir_mcp/
server.py FastMCP 10 tools + resources + prompts
store.py SQLite store + field-level encryption (only PHI touchpoint)
models.py Pydantic v2 FHIR models
audit.py SHA-256 hash-chain audit
auth.py Principal + approver + DUA verification
validator.py LOINC deterministic gate
deidentify.py De-identification layer (hash ID, strip PHI, age bucket)
rag.py BM25 + ChromaDB hybrid RAG
nlp.py Clinical NLP entity extraction
confidence.py Calibrated confidence scoring
trials.py ClinicalTrials.gov v2 API client
parse.py Nemotron Parse (NVIDIA NIM) client
http_server.py FastAPI SSE transport
clinical_agent/
orchestrator.py ClinicalOrchestrator (3-subagent workflow)
subagents.py Reader / RAG / Proposal subagent configs
hooks.py PostToolUse audit + cost hook
evals/
golden_dataset.json 25 test cases
runner.py Code-based + LLM-as-judge grading
judge_prompt.py LLM-as-judge prompt template
mimic_cdm_eval.py MIMIC-CDM 4-axis governance agent eval
data/
synthetic_patients.json Seed data
loinc_rules.json 14 LOINC validation rules
clinical_guidelines.json 8 evidence-based guidelines
clinical_notes.json 4 synthetic notes
scripts/
seed_db.py JSON → SQLite
audit_verify.py Chain integrity verifier
run_agent.py Agent SDK CLI
run_evals.py Eval harness CLI
docs/
architecture.md Full system design + diagrams
adr/ 4 Architecture Decision Records
scale.md Portfolio deployment playbook
ci.md GitHub Actions setupDay-by-day build log
Day | Milestone |
1 | SQLite store, tamper-evident audit, auth layer |
2 | LOINC validator + clinical data (guidelines, notes) |
3 | RAG: BM25 + ChromaDB hybrid over clinical guidelines |
4 | Agent SDK orchestration (Reader/RAG/Proposal subagents, hooks) |
5 | Clinical NLP entity extraction + calibrated confidence scoring |
6 | Eval harness: golden dataset, LLM-as-judge, GitHub Actions CI |
7 | HTTP server (FastAPI SSE), Dockerfile, Railway deploy |
8 | ClinicalTrials.gov integration: surface recruiting trials on flagged observations |
9 | Nemotron Parse: raw PDF → structured text → NLP → validation → audit |
10 | PHI infrastructure: DUA enforcement, field-level encryption, de-identification layer |
Datasets & Evaluation Architecture
Four datasets ground the system across two sub-projects. Each is academically sourced, operates on de-identified or synthetic data, and has a dedicated evaluation methodology drawn from peer-reviewed literature.
Dataset 1 — MedQuAD
Academic source
Ben Abacha, A., & Demner-Fushman, D. (2019). A question-entailment approach to question answering. BMC Bioinformatics, 20(1), 511. https://doi.org/10.1186/s12859-019-3119-4
47,457 question–answer pairs sourced from 12 NIH websites (MedlinePlus, CancerGov, NIDDK, NINDS, GARD, and others). Covers 37 question types across common and rare diseases. License: CC BY 4.0. No PHI — all content is public NIH patient education material.
Location: evidence_pipeline/datasets/medquad.py
LLM architecture — entity linking via deterministic crosswalk
Each QA pair carries a focus (condition name) and optional UMLS CUI gold label. The pipeline maps focus → CUI via ontology/cui_mapper.py — fully deterministic, no LLM in the mapping step. The LLM role is upstream: clinical question generation and metatag refinement.
Test suite — evidence_pipeline/tests/test_datasets.py
Dataset structure, field validation, is_answered / has_gold_cui / is_rare_disease properties, CSV and XML format compatibility.
LLM reasoning framework — BioEL entity linking
Sung, M., Jeon, H., Lee, J., & Kang, J. (2020). Biomedical Entity Representations with Synonym Marginalization. arXiv:2005.00239. https://arxiv.org/abs/2005.00239
Implemented in evidence_pipeline/evals/entity_linking.py and runner.py. Top-k accuracy and Mean Reciprocal Rank (MRR) over the full corpus.
Metric | Smoke target | Full corpus |
Top-1 accuracy | 100% | graded |
Top-5 accuracy | 100% | graded |
MRR | 1.0 | graded |
Coverage (gold CUI present) | 100% | graded |
Dataset 2 — MIMIC-IV Discharge Summaries
Academic source
Johnson, A.E.W., Bulgarelli, L., Shen, L., Gayles, A., Shammout, A., Horng, S., Pollard, T.J., Hao, S., Moody, B., Gow, B., Lehman, L.H., Celi, L.A., & Mark, R.G. (2023). MIMIC-IV, a freely accessible electronic health record dataset. Scientific Data, 10, 1. https://doi.org/10.1038/s41597-022-01899-x
Goldberger, A.L., Amaral, L.A.N., Glass, L., Hausdorff, J.M., Ivanov, P.Ch., Mark, R.G., Mietus, J.E., Moody, G.B., Peng, C.K., & Stanley, H.E. (2000). PhysioBank, PhysioToolkit, and PhysioNet. Circulation, 101(23), e215–e220. https://doi.org/10.1161/01.CIR.101.23.e215
De-identified ICU discharge summaries from Beth Israel Deaconess Medical Center. Demo subset (100 patients): physionet.org/content/mimic-iv-demo/ — free PhysioNet account, no CITI training. Full dataset requires CITI training + signed DUA. PHI note: loader logs note_id only, never raw text.
Location: evidence_pipeline/datasets/mimic.py, evidence_pipeline/extraction/loinc_extractor.py, evidence_pipeline/pipeline/end_to_end.py
LLM architecture — deterministic extraction + governance gate
24 regex patterns extract LOINC-coded observations from discharge text (zero LLM in extraction). The HumanGate class enforces the core governance invariant: every proposed observation is queued with a full audit entry (who/what/when/why) and committed = 0 in automated mode. Human .approve() is required to commit — wiring to src/fhir_mcp/store.py in production.
Test suite — evidence_pipeline/tests/test_mimic.py, test_loinc_extractor.py, test_end_to_end.py
5 dataset tests, 14 LOINC extraction tests, 7 end-to-end tests including core governance invariant (committed == 0).
LLM reasoning framework — FACTS Grounding
Jacovi, A., Caciularu, A., Goldman, O., & Goldberg, Y. (2025). FACTS Grounding: A New Benchmark for Evaluating the Factuality of Large Language Models. arXiv:2501.03200. https://arxiv.org/abs/2501.03200
Implemented in evidence_pipeline/evals/grounding.py. Every ICD-10, RxNorm, LOINC, CUI, and NCT-ID in pipeline output is checked against its canonical source. Grounding score = attributable_claims / total_claims. Score of 1.0 = zero unattributed claims.
Outcome metric (measured, 10 synthetic notes):
Extracted 62 LOINC-coded observations from 10 synthetic discharge notes, 100% validated, 0% rejected by deterministic gate, 0 committed without human approval.
python evidence_pipeline/demo_mimic.py # synthetic
python evidence_pipeline/demo_mimic.py --notes-dir /path/to/mimic # real MIMIC-IV DemoDataset 3 — MIMIC-CDM (Clinical Decision Making)
Academic source
Hager, P., Jungmann, F., Holland, R., Bhagat, K., Hubrecht, I., Knauer, M., Vielhauer, J., Makowski, M., Braren, R., Kaissis, G., & Rueckert, D. (2024). Evaluation and mitigation of the limitations of large language models in clinical decision-making. Nature Medicine. https://doi.org/10.1038/s41591-024-03097-1
Hager, P., Jungmann, F., & Rueckert, D. (2024). MIMIC-IV-Ext Clinical Decision Making (version 1.0). PhysioNet. https://doi.org/10.13026/2pfq-5b68
Derived from MIMIC-IV. Evaluates LLMs on 4-axis clinical decision making given a patient presentation. Available at physionet.org/content/mimic-iv-ext-cdm/. Leaderboard: huggingface.co/spaces/MIMIC-CDM/leaderboard.
Location: evidence_pipeline/datasets/mimic_cdm.py, evidence_pipeline/evals/clinical_decision.py, evals/mimic_cdm_eval.py
LLM architecture — dual-layer CDM eval
Two separate eval targets share the same CDMCase schema and CDMScore rubric:
evidence_pipeline/evals/clinical_decision.py— grades the evidence layer: does the ontology pipeline support correct decisions?evals/mimic_cdm_eval.py— grades the governance agent (src/clinical_agent/orchestrator.py): does the LLM itself make correct decisions? CI uses a deterministic crosswalk-backed mock; production wires to liveClinicalOrchestrator.
Test suite — evidence_pipeline/tests/test_mimic_cdm.py
4 dataset structure tests, 4 F1 scoring unit tests, 3 CDM eval layer tests (composite ≥ 0.75 CI gate).
LLM reasoning framework — AMIE multi-axis auto-rater
Tu, T., Palepu, A., Schaekermann, M., Saab, K., Freyberg, J., Tanno, R., Wang, A., Li, B., Amin, M., Tomasev, N., Ghassemi, M., Azizi, S., Kannan, A., Chou, K., Hassidim, A., Matias, Y., Xu, Y., Singhal, K., Gottweis, J., & Natarajan, V. (2024). Towards conversational diagnostic AI. arXiv:2401.05654. https://arxiv.org/abs/2401.05654
Token-level F1 per axis against gold ICD-10 / RxNorm / LOINC / CPT labels. Composite = mean across 4 axes. CI gate: composite ≥ 0.75.
Axis | Gold standard | CI target |
Diagnosis accuracy | ICD-10 F1 | ≥ 0.75 |
Treatment accuracy | RxNorm F1 | ≥ 0.75 |
Lab ordering accuracy | LOINC F1 | ≥ 0.75 |
Procedure accuracy | CPT F1 | ≥ 0.75 |
Composite | mean | ≥ 0.75 |
Dataset 4 — Governance Agent Eval Harness (25 golden cases)
Source: Internal synthetic dataset, no PHI. Designed against the LOINC validation rules in data/loinc_rules.json and 8 clinical guidelines in data/clinical_guidelines.json.
Location: evals/golden_dataset.json, evals/runner.py, evals/judge_prompt.py
LLM architecture — code-based + LLM-as-judge 25 cases covering accept / reject / borderline observations across 14 LOINC codes. Deterministic code-based grading (exact accept/reject match) plus LLM-as-judge for reasoning quality. Calibrated confidence scoring uses the Brier score:
Brier, G.W. (1950). Verification of Forecasts Expressed in Terms of Probability. Monthly Weather Review, 78(1), 1–3. https://doi.org/10.1175/1520-0493(1950)078<0001:VOFEIT>2.0.CO;2
Test suite — evals/runner.py
Code-based accuracy + false-negative rate, LLM-as-judge reasoning quality, calibrated Brier score.
LLM reasoning framework — LLM-as-judge
Zheng, L., Chiang, W.L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E.P., Zhang, H., Gonzalez, J.E., & Stoica, I. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv:2306.05685. https://arxiv.org/abs/2306.05685
Metric | Value |
Accuracy (accept/reject) | 100% |
False-negative rate | 0% |
Brier score | 0.3174 |
Regression threshold | 80% |
Ontology foundation — UMLS CUI crosswalk
All four datasets share a common ontological foundation: the UMLS Concept Unique Identifier (CUI) as the canonical hub linking ICD-10-CM, RxNorm, LOINC, SNOMED CT, and CPT-4.
Bodenreider, O. (2004). The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research, 32(suppl_1), D267–D270. https://doi.org/10.1093/nar/gkh061
Vocabulary | Authority | Citation |
ICD-10-CM | WHO / CMS | World Health Organization. (2019). International Statistical Classification of Diseases (10th ed.). |
RxNorm | NLM | Nelson, S.J., Zeng, K., Kilbourne, J., Powell, T., & Moore, R. (2011). Normalized names for clinical drugs: RxNorm at 6 years. JAMIA, 18(4), 441–448. https://doi.org/10.1136/amiajnl-2011-000116 |
LOINC | Regenstrief Institute | McDonald, C.J., et al. (2003). LOINC, a universal standard for identifying laboratory observations. Clinical Chemistry, 49(4), 624–633. https://doi.org/10.1373/49.4.624 |
SNOMED CT | SNOMED International | Donnelly, K. (2006). SNOMED-CT: The advanced terminology and coding system for eHealth. Studies in Health Technology and Informatics, 121, 279–290. |
CPT-4 | AMA | American Medical Association. (2023). Current Procedural Terminology: CPT 2024. AMA Press. |
Implementation: evidence_pipeline/ontology/cui_mapper.py — 13 conditions, deterministic lookup, zero hallucination. Grounding validated by evidence_pipeline/evals/grounding.py (FACTS Grounding, Jacovi et al. 2025).
Clinical Evidence Intelligence Pipeline
A sub-project built on top of this governance platform — zero changes to any
src/file.
The governance platform answers how to safely deploy clinical agents. This companion demonstrates what those agents generate: structured, evidence-backed clinical content at scale — directly aligned with real-world evidence (RWE) generation workflows.
Clinical question (e.g. "paroxysmal nocturnal hemoglobinuria")
↓
evidence_pipeline/datasets/medquad.py 47,457 NIH QA pairs (CC BY 4.0)
question types · UMLS CUI labels
common conditions + GARD rare diseases
↓
evidence_pipeline/ontology/cui_mapper.py UMLS CUI → ICD-10-CM / RxNorm / LOINC
SNOMED CT / CPT-4 crosswalk
deterministic lookup — no hallucination
↓
Live APIs ClinicalTrials.gov v2 (recruiting trials)
CMS Medicare Coverage Database (NCDs + LCDs)
↓
evidence_pipeline/demo.py Structured, metatagged JSON output
optimised for search and retrieval indexing
↓
evidence_pipeline/demo_mimic.py End-to-end outcome metric
62 LOINC observations · 100% validated
0 committed without human approvalKey design principle: the cui_mapper.py crosswalk is the deterministic validation gate for ontology codes — the same agent-proposes / deterministic-validates pattern as validator.py in the main platform.
Quick demo (no API key required):
# Condition evidence brief
python evidence_pipeline/demo.py "paroxysmal nocturnal hemoglobinuria"
# → ICD-10 D59.5 CUI C0028344 RxNorm 727910 (eculizumab)
# → 5 recruiting trials CMS coverage queried metatagged JSON output
# End-to-end MIMIC-IV pipeline (synthetic notes, zero PHI)
python evidence_pipeline/demo_mimic.py
# → Extracted 62 LOINC observations from 10 notes
# → 100% validated, 0% rejected, 0 committed without human approvalThis server cannot be installed
Maintenance
Resources
Unclaimed servers have limited discoverability.
Looking for Admin?
If you are the server author, to access and configure the admin panel.
Latest Blog Posts
- Your AI Chatbot Just Exposed Your CEO's Salary to an InternBy Om-Shree-0709 on .Agent IdentityMCP SecurityOAuth Delegation
- Why MCP Servers Need Execution Sandboxing (And Why Your Current Stack Isn't Enough)By Om-Shree-0709 on .Agentic AiPrompt InjectionWebAssembly
MCP directory API
We provide all the information about MCP servers via our MCP API.
curl -X GET 'https://glama.ai/api/mcp/v1/servers/KrishnaKakani-GitHub/clinical-ai-governance-platform'
If you have feedback or need assistance with the MCP directory API, please join our Discord server