healthsec-mcp
MCP connector for clinical-AI security evaluation: adversarial robustness, privacy leakage, and standards-compliance tools composed into a Security Posture Score, callable directly by AI agents.
Full paper structure, research contribution, and milestones: ../STRUCTURE.md.
Practical tool-by-tool usage reference: docs/TOOLS.md.
Illustrative worked example: examples/end_to_end_security_evaluation.md.
Two ways to run this: locally via Claude Desktop (below), or via the
agent-usefulness study harness (reproduce/agent_usefulness_study/).
Status: M1-M5 implemented, with two exceptions noted below. All 10
tools (the 9 planned + get_audit_log, added to close a gap found during
M4) are registered on the server, lint-clean (ruff), type-clean
(mypy), and covered by CI (.github/workflows/ci.yml).

One documented open limitation remains: boundary attack's
flip_rate/auroc_drop don't reproduce the reference numbers even though
auroc_clean matches exactly (see TECHNICAL_DESIGN.md section 12; golden
test marked xfail(strict=True)). Everything else that has a reference
target is golden-verified -- exactly for the deterministic tools (M3
standards, M4 compute_sps), within tolerance for the ML-based ones (M1
run_fgsm). run_membership_inference (M2) has no published golden target
(see TECHNICAL_DESIGN.md section 4.3) so it's covered by property tests
only.

The agent-usefulness study has been run once (6 tasks, real transcripts),
but its rating data does not match the protocol's design -- see
reproduce/agent_usefulness_study/PROTOCOL.md's "Actual execution status"
section for the full, honest account: the "2 raters" were, in turn, one
person filling both columns, then one person plus ChatGPT as an
LLM-judge cross-check. Neither is two independent human raters. A real
second human rater still needs to score transcripts/blinded/ from
scratch before this data can support the paper's agent-usefulness claim.

Everything else in M5 (docs/TOOLS.md, CI, MCP contract tests,
examples/) is done. See Milestones in STRUCTURE.md.
Layout
healthsec-mcp/
├── src/healthsec_mcp/
│ ├── adversarial/ # fgsm.py, boundary.py, plausibility.py -- implemented (M1)
│ ├── privacy/ # membership_inference.py -- implemented (M2)
│ ├── standards/ # attack_coverage.py, rbac.py, audit.py, compliance.py -- implemented (M3)
│ ├── tools/ # adversarial_tools.py, privacy_tools.py, standards_tools.py, sps_tools.py
│ ├── io/ # schemas.py -- FeatureBatch (n<=100), FeaturePool (n<=5000)
│ ├── registry.py # in-session model-handle registry: handle -> model object map only
│ ├── authz.py # authorization gate (M4) -- wraps registry.resolve(), records every
│ │ # attempt (success or denial) to audit.py, regardless of outcome
│ ├── audit.py # append-only log of this connector's own tool calls (M4)
│ ├── sps.py # Security Posture Score composer (M4) -- weights are a parameter,
│ │ # not hardcoded, per the open design question this resolved
│ ├── report.py # generate_security_report's logic (M5) -- never states a deployment
│ │ # recommendation unless compute_sps's output was actually supplied
│ ├── server.py # FastMCP server, registers all 10 implemented tools
│ └── local_datasets.py # shared loader/registrar for the 4 local models -- the one place
│ # CONNECTOR_DATA_ROOT is resolved, used by study_server.py and
│ # scripts/run_local_server.py so path logic isn't duplicated per script
├── tests/
│ ├── unit/ # per-module unit tests (synthetic models + pure-function cases, fast) --
│ │ # includes test_sps.py (exact SPS=78.9 golden match), test_authz.py/
│ │ # test_audit.py (gate + audit-trail behavior), test_report.py
│ ├── golden/ # regression tests against reference/ (validated result tables) --
│ │ # standards tools + compute_sps match exactly; ML-attack tools match
│ │ # within tolerance (boundary attack currently xfails, see Status)
│ ├── contract/ # MCP tool-schema contract tests (M5) -- every tool has a real
│ │ # description + valid JSON Schema; spot-checks required params
│ └── fault_injection/ # bad models/inputs, cap enforcement (FGSM n≤100, MI pool≤5k)
├── reproduce/
│ ├── diagnose_boundary_discrepancy.py # fast standalone RNG/AUROC diagnostic
│ ├── diagnose_nearest_vs_first.py # confirms nearest vs. first-found pick different points
│ ├── diagnose_nearest_full_compare.py # confirms they still give identical attack outcomes
│ ├── run_attacks.py # run FGSM + boundary attack against any of the 4 local models
│ ├── results/ # tracked: JSON output of run_attacks.py (aggregate metrics, no PHI)
│ └── agent_usefulness_study/ # (M5) PROTOCOL.md + full execution harness (study_server.py,
│ # run_study.py, blind_transcripts.py, analyze_ratings.py) --
│ # has been run once, see Status and PROTOCOL.md
├── scripts/
│ └── run_local_server.py # stdio entry point for Claude Desktop -- pre-registers one of the
│ # 4 local models under a fixed handle, see "Running with Claude Desktop"
├── reference/ # tracked: validated result tables (aggregate metrics, no PHI)
├── examples/ # (M5) end_to_end_security_evaluation.md -- illustrative, hand-authored
│ # transcript with real reference numbers, not a live-recorded session
├── docs/ # (M5) TOOLS.md -- practical tool-by-tool usage reference
└── .github/workflows/ # (M5) ci.yml -- ruff + mypy + pytest, path-scoped to this connector
../data/ # sibling to this package, gitignored (see ../.gitignore)
├── models/ # icu_mortality_rf.pkl, ed_admission_rf.pkl, ckd_rf.pkl, wdbc_rf.pkl, *_meta.json
├── mimic/ # icu_cohort/, ed_cohort/ train+test splits
├── ckd/processed/ # train+test splits
└── breast_cancer/processed/ # train+test splits
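The size caps noted in the layout (FeatureBatch n<=100, FeaturePool n<=5000) can be illustrated with a minimal sketch. The `CappedRows` class and the two helper functions below are hypothetical and use a plain dataclass; the real io/schemas.py may be built differently, so treat the names and the choice of `ValueError` as assumptions.

```python
from dataclasses import dataclass

# Caps taken from the layout comments above; everything else is illustrative.
MAX_BATCH_ROWS = 100   # FeatureBatch cap
MAX_POOL_ROWS = 5000   # FeaturePool cap


@dataclass(frozen=True)
class CappedRows:
    """Hypothetical stand-in for FeatureBatch / FeaturePool."""

    rows: list[list[float]]
    max_rows: int

    def __post_init__(self) -> None:
        # Reject empty, oversized, or ragged inputs before any attack runs.
        if not self.rows:
            raise ValueError("at least one row is required")
        if len(self.rows) > self.max_rows:
            raise ValueError(
                f"{len(self.rows)} rows exceeds the cap of {self.max_rows}"
            )
        width = len(self.rows[0])
        if any(len(row) != width for row in self.rows):
            raise ValueError("all rows must have the same number of features")


def feature_batch(rows: list[list[float]]) -> CappedRows:
    """Build a batch limited to MAX_BATCH_ROWS rows."""
    return CappedRows(rows, MAX_BATCH_ROWS)


def feature_pool(rows: list[list[float]]) -> CappedRows:
    """Build a pool limited to MAX_POOL_ROWS rows."""
    return CappedRows(rows, MAX_POOL_ROWS)
```

Validating at construction time means an oversized request fails before any model is touched, which is the behavior the fault-injection suite is described as checking.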
Running the tests
This project sits deep inside the thesis directory tree, so a venv created
inside it pushes the paths of scikit-learn's compiled binaries past
Windows' 260-character path limit. The venv must therefore live at a short
path outside the project. Replace <you> below with your actual Windows
username (e.g. C:\Users\wisdo\...) -- it's a placeholder, not literal text
to paste:
uv venv --python 3.11 "C:\Users\<you>\.venvs\healthsec-mcp"
uv pip install -e ".[dev]" --python "C:\Users\<you>\.venvs\healthsec-mcp\Scripts\python.exe"
# fast suites (~25s) -- unit, fault-injection, and MCP contract tests
& "C:\Users\<you>\.venvs\healthsec-mcp\Scripts\python.exe" -m pytest tests/unit tests/fault_injection tests/contract -v
# golden regression against ../data/ (~7 min -- LIME explains each sample individually)
& "C:\Users\<you>\.venvs\healthsec-mcp\Scripts\python.exe" -m pytest tests/golden -v -s
# lint + type-check (instant, what CI runs)
& "C:\Users\<you>\.venvs\healthsec-mcp\Scripts\python.exe" -m ruff check src/ tests/ reproduce/ scripts/
& "C:\Users\<you>\.venvs\healthsec-mcp\Scripts\python.exe" -m mypy src/ scripts/run_local_server.py
../data/models/ and ../data/mimic/ must be populated first (see Data below).
CI (.github/workflows/ci.yml) runs pytest tests/ directly, letting the
MIMIC-IV-dependent golden tests skip automatically via their own
skipif — ../data/ is gitignored and never present in CI.
Running attacks against a model
reproduce/run_attacks.py runs FGSM + boundary attack against any of the
four locally available models. icu_mortality/ed_admission are the
regression baseline (compared against reference/ in the golden tests);
ckd/wdbc have no published ground truth -- this is new evaluation used
to demonstrate the tools generalize beyond the two validated cohorts.
# one dataset at a time
& "C:\Users\<you>\.venvs\healthsec-mcp\Scripts\python.exe" reproduce\run_attacks.py --dataset ckd
& "C:\Users\<you>\.venvs\healthsec-mcp\Scripts\python.exe" reproduce\run_attacks.py --dataset wdbc
& "C:\Users\<you>\.venvs\healthsec-mcp\Scripts\python.exe" reproduce\run_attacks.py --dataset icu_mortality
& "C:\Users\<you>\.venvs\healthsec-mcp\Scripts\python.exe" reproduce\run_attacks.py --dataset ed_admission
# all four in one run (slowest -- icu_mortality/ed_admission each have up to
# 100 samples for boundary attack, similar runtime to the golden tests)
& "C:\Users\<you>\.venvs\healthsec-mcp\Scripts\python.exe" reproduce\run_attacks.py --dataset all
Each run writes its results to reproduce/results/<dataset>_attack_results.json
(tracked in git, overwritten on each run) so nothing is lost once the
terminal scrolls past it.
Running with Claude Desktop
This connects healthsec-mcp to Claude Desktop as a local MCP server --
Claude Desktop spawns it as a subprocess and talks to it over stdio. This
is the intended way to actually use the connector day to day; the other
supported path is the agent-usefulness study harness
(reproduce/agent_usefulness_study/), which spawns the same server
programmatically to script Condition A/B comparisons instead.
Two categories of tools, two setup paths
The 7 standards/SPS/report tools (assess_attack_coverage, check_rbac,
score_audit_completeness, score_compliance, compute_sps,
generate_security_report, get_audit_log) don't touch a registered
model at all -- they score or compose evidence you pass directly in the
tool call. These work with zero setup the moment Claude Desktop can
launch the server at all.
The 3 model-touching tools (run_fgsm, run_boundary_attack,
run_membership_inference) need a model registered under a model_handle
before Claude can call them. This is the part that trips people up:
Claude Desktop spawns healthsec-mcp as a brand-new subprocess with an
empty registry every time -- there's no interactive Python session inside
that subprocess for you to call registry.register() from after the fact.
scripts/run_local_server.py solves this: it's a small wrapper that
loads one of your 4 local models, registers it under a fixed, predictable
handle, and then starts the same server -- point Claude Desktop at this
script instead of the bare healthsec-mcp command if you need the
model-touching tools.
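To make the empty-registry problem concrete, here is a minimal sketch of a handle-to-model map and a wrapper step that pre-registers a model before the server starts. `ModelRegistry` and `preload` are illustrative names, and the `load_model` callback is a stand-in for whatever loads the pickled model. This README only mentions `registry.register()` and `registry.resolve()`; their real signatures may differ.

```python
from typing import Callable


class ModelRegistry:
    """Hypothetical in-process map of handle -> model object."""

    def __init__(self) -> None:
        # Starts empty in every new process, which is why a freshly
        # spawned server subprocess has no models until something registers one.
        self._models: dict[str, object] = {}

    def register(self, handle: str, model: object) -> None:
        if handle in self._models:
            raise ValueError(f"handle {handle!r} is already registered")
        self._models[handle] = model

    def resolve(self, handle: str) -> object:
        try:
            return self._models[handle]
        except KeyError:
            raise KeyError(f"model_handle {handle!r} is not registered") from None


def preload(
    registry: ModelRegistry,
    dataset: str,
    load_model: Callable[[str], object],
) -> str:
    """Register one model under a handle equal to the dataset name."""
    registry.register(dataset, load_model(dataset))
    return dataset
```

Because the map lives in process memory, the registration has to happen in the same process that then serves tool calls, which is the role the wrapper script plays.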
Setup
1. Make sure the venv is set up (see "Running the tests" above if not).
2. Decide which path you need:
- Only need the standards/SPS/report tools? Skip to step 4 and point Claude Desktop at healthsec-mcp directly (or the equivalent python -m healthsec_mcp.server) -- no dataset flag needed.
- Need the model-touching tools too? Use scripts/run_local_server.py with one of --dataset icu_mortality|ed_admission|ckd|wdbc. This requires ../data/models/ and the matching processed dataset to already be populated (see "Data" below) -- icu_mortality/ed_admission need PhysioNet-credentialed MIMIC-IV data; ckd/wdbc are public and work out of the box if you've run the setup in "Running the tests."
3. Find (or create) Claude Desktop's config file:
%APPDATA%\Claude\claude_desktop_config.json
On Windows that's typically
C:\Users\<you>\AppData\Roaming\Claude\claude_desktop_config.json. If the
file doesn't exist yet, create it with just {"mcpServers": {}} and add
your entry inside.
4. Add an entry under mcpServers. Replace <you> with your actual
Windows username and adjust the repo path to match where you've cloned
this project. Model-touching setup (recommended default -- gives you all
10 tools):
{
"mcpServers": {
"healthsec": {
"command": "C:\\Users\\<you>\\.venvs\\healthsec-mcp\\Scripts\\python.exe",
"args": [
"C:\\Users\\<you>\\OneDrive\\Desktop\\University_of_the_Cumberlands\\Courses\\Thesis_2026_Proposing\\Papers_and_Code\\ai-agents-connectors\\01-healthcare-ai-security-connector\\healthsec-mcp\\scripts\\run_local_server.py",
"--dataset",
"icu_mortality"
]
}
}
}
Standards/report-tools-only setup (no model registration, works without
../data/ at all):
{
"mcpServers": {
"healthsec": {
"command": "C:\\Users\\<you>\\.venvs\\healthsec-mcp\\Scripts\\healthsec-mcp.exe"
}
}
}
JSON requires double backslashes in Windows paths (\\, not \) --
copy the pattern above exactly, don't use single backslashes.
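One way to avoid hand-escaping is to let a JSON serializer write the entry. The sketch below uses only Python's standard `json` module. The `build_entry` helper and the shortened script path are illustrative, not part of this project; substitute your own paths.

```python
import json


def build_entry(python_exe: str, script: str, dataset: str) -> str:
    """Return the text of a claude_desktop_config.json with one server entry."""
    config = {
        "mcpServers": {
            "healthsec": {
                "command": python_exe,
                "args": [script, "--dataset", dataset],
            }
        }
    }
    # json.dumps writes each backslash in a string as \\ automatically.
    return json.dumps(config, indent=2)


# Raw strings keep the Windows paths readable; both paths are placeholders.
text = build_entry(
    r"C:\Users\<you>\.venvs\healthsec-mcp\Scripts\python.exe",
    r"C:\path\to\healthsec-mcp\scripts\run_local_server.py",
    "icu_mortality",
)
print(text)
```

Reading the generated text back with `json.loads` returns the original single-backslash paths, which is a quick way to confirm the escaping round-trips.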
5. Restart Claude Desktop completely (quit from the system tray, not just close the window) so it picks up the config change.
6. Verify it connected. In a new Claude Desktop chat, look for a
tools/connector icon indicating healthsec is available, or just ask
Claude something that requires a tool, e.g. "What MCP tools do you have
available from healthsec?" If nothing shows up, check Claude Desktop's
logs (Help menu, or %APPDATA%\Claude\logs\) for a subprocess spawn error
-- the most common cause is a typo'd path or JSON syntax error in the
config file.
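Both failure modes can be checked before restarting Claude Desktop. The following is a rough sketch using only the standard library; `check_config` is a hypothetical helper, and it assumes the entry shape shown in step 4 (a `command` string plus an optional `args` list).

```python
import json
import shutil
from pathlib import Path
from typing import Callable


def _default_exists(path: str) -> bool:
    # Accept either a real file path or a command resolvable on PATH.
    return Path(path).exists() or shutil.which(path) is not None


def check_config(
    text: str, path_exists: Callable[[str], bool] = _default_exists
) -> list[str]:
    """Return human-readable problems found in the config text."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        return [
            f"JSON syntax error at line {exc.lineno}, column {exc.colno}: {exc.msg}"
        ]

    servers = config.get("mcpServers") if isinstance(config, dict) else None
    if not isinstance(servers, dict) or not servers:
        return ["no entries under mcpServers"]

    problems: list[str] = []
    for name, entry in servers.items():
        if not isinstance(entry, dict):
            problems.append(f"{name}: entry must be an object")
            continue
        command = entry.get("command", "")
        if not command or not path_exists(command):
            problems.append(f"{name}: command not found: {command!r}")
        for arg in entry.get("args", []):
            # Only arguments that look like script paths are checked.
            if isinstance(arg, str) and arg.endswith(".py") and not path_exists(arg):
                problems.append(f"{name}: script not found: {arg!r}")
    return problems
```

A single-backslash Windows path such as `"C:\Users\..."` is reported as a syntax error here, because `\U` is not a valid JSON escape sequence.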
Using it
If you registered a model (step 2's model-touching path), reference the handle directly in your prompt -- it equals the dataset name you chose, e.g.:
Using model_handle="icu_mortality", check whether this model is vulnerable to small adversarial perturbations.
For the standards/report tools, just supply the evidence directly (Claude
will ask for it, or you can paste it inline) -- see
docs/TOOLS.md for a worked example of every tool,
including the exact input shapes each one expects.
Troubleshooting
- Claude says it has no tools from healthsec -- almost always a config path typo, or Claude Desktop wasn't fully restarted. Check the logs mentioned in step 6.
- Model-touching tools fail with "model_handle is not registered" -- you're pointed at the bare healthsec-mcp command instead of scripts/run_local_server.py, or the --dataset you chose doesn't match the handle you referenced in your prompt (they're the same string, e.g. --dataset ckd gives you model_handle="ckd", not anything else).
- run_local_server.py crashes on startup -- almost always a missing file under ../data/. icu_mortality/ed_admission specifically require the PhysioNet-credentialed MIMIC-IV data (see "Data" below); ckd/wdbc should work if you've completed the venv setup in "Running the tests," since those two are the ones with no such restriction.
MCP tools
| Tool | Status | Authz-gated? | Input | Output |
| --- | --- | --- | --- | --- |
| run_fgsm | implemented, golden-verified | yes | model, batch (n≤100), ε | flip rate, AUROC drop, plausibility rate |
| run_boundary_attack | implemented (known limitation, see Status) | yes | model, batch | flip rate, AUROC drop, mean steps |
| run_membership_inference | implemented, no published golden target | yes | model, member_pool + nonmember_pool (≤5k each) | MI accuracy/AUROC, privacy risk, patients-at-risk (direct count, not extrapolated) |
| assess_attack_coverage | implemented, golden-verified exactly | no | control set | PASS/PARTIAL/FAIL + mitigated/tested counts + coverage % |
| check_rbac | implemented, golden-verified exactly | no | already-executed probe results | pass count + enforcement rate |
| score_audit_completeness | implemented, golden-verified exactly | no | audit log entries | completeness rate |
| score_compliance | implemented, golden-verified exactly | no | HIPAA/FHIR checklist | per-standard % + overall % |
| compute_sps | implemented, golden-verified exactly (SPS=78.9 on reference inputs) | no | the 9 subscore inputs above | composite SPS 0–100, deployment tier, per-dimension breakdown |
| generate_security_report | implemented | no | any subset of the above tools' outputs | Markdown + structured report; only states a deployment recommendation if compute_sps's output was supplied |
| get_audit_log | implemented | no | (none) | this session's full audit trail |
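As a rough illustration of how a composite score with caller-supplied weights might be computed, the sketch below takes a weighted mean of subscores on a 0-100 scale. The `compose_score` function, the dimension names, and the numbers in the usage example are all hypothetical. The actual weighting scheme, the subscore inputs, and the deployment-tier thresholds are defined by compute_sps and are not reproduced here.

```python
def compose_score(subscores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of 0-100 subscores; weights need not sum to 1."""
    if set(subscores) != set(weights):
        raise ValueError("subscores and weights must cover the same dimensions")
    if any(not 0.0 <= value <= 100.0 for value in subscores.values()):
        raise ValueError("each subscore must be in [0, 100]")
    if any(weight < 0 for weight in weights.values()):
        raise ValueError("weights must be non-negative")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    # Dividing by the total normalizes the weights.
    return sum(subscores[key] * weights[key] for key in subscores) / total


# Made-up dimensions and numbers, purely for illustration.
score = compose_score(
    {"robustness": 80.0, "privacy": 60.0, "compliance": 100.0},
    {"robustness": 2.0, "privacy": 1.0, "compliance": 1.0},
)
print(round(score, 1))  # 80.0
```

Passing the weights in as an argument, rather than fixing them in the function, lets a caller rerun the same evidence under a different weighting without touching the composer.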
"Authz-gated" tools resolve model_handle through authz.authorize(), which
records every attempt to the audit trail whether it succeeds or is denied.
Standards, compute_sps, generate_security_report, and get_audit_log
don't touch a model at all -- they score/compose evidence the caller
already has -- so they aren't gated.
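The gating behavior described above can be sketched as follows. This is an illustration only: the `authorize` signature, the plain-dict log entries, and the choice of `PermissionError` are assumptions, not the actual authz.py implementation.

```python
from datetime import datetime, timezone


def authorize(models: dict, audit_log: list, tool: str, model_handle: str):
    """Resolve a handle, recording the attempt whether or not it succeeds."""
    authorized = model_handle in models
    # The record is appended before any exception is raised, so denied
    # attempts leave a trace just like successful ones.
    audit_log.append(
        {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "tool": tool,
            "model_handle": model_handle,
            "authorized": authorized,
        }
    )
    if not authorized:
        raise PermissionError(f"model_handle {model_handle!r} is not registered")
    return models[model_handle]
```

Writing the log entry ahead of the authorization decision's side effects is what makes "regardless of outcome" hold even when the call fails.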
See docs/TOOLS.md for a worked example of every tool
plus a full end-to-end workflow.
Audit trail
Every authz-gated tool call is recorded to audit.default_audit_log
in-memory, for the life of the server process: {timestamp, tool, model_handle, authorized, input_hash, detail}. This is the connector's own
non-repudiation record; retrieve it via the get_audit_log tool. You can
audit the auditor: standards.audit.score_audit_completeness will happily
score default_audit_log.entries() against the same completeness check it
runs on any other log, though the field names differ (this log's schema is
authz.py's own, not the validated methodology's REQUIRED_FIELDS).
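Here is a minimal sketch of what one such record and a field-completeness check could look like. The field names come from the schema above. The SHA-256 hash over canonical JSON, the `make_entry` helper, and the `completeness_rate` function are assumptions for illustration and do not reproduce audit.py or score_audit_completeness.

```python
import hashlib
import json
from datetime import datetime, timezone

FIELDS = ("timestamp", "tool", "model_handle", "authorized", "input_hash", "detail")


def make_entry(tool, model_handle, authorized, tool_input, detail=""):
    """Build one audit record with the six fields listed above."""
    # Assumption: hash a canonical (key-sorted) JSON encoding of the input,
    # so equal inputs always produce the same digest.
    canonical = json.dumps(tool_input, sort_keys=True).encode("utf-8")
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "tool": tool,
        "model_handle": model_handle,
        "authorized": authorized,
        "input_hash": hashlib.sha256(canonical).hexdigest(),
        "detail": detail,
    }


def completeness_rate(entries, required=FIELDS):
    """Fraction of entries in which every required field is present and not None."""
    if not entries:
        return 0.0
    complete = sum(
        all(entry.get(field) is not None for field in required) for entry in entries
    )
    return complete / len(entries)
```

Passing a different `required` tuple is how the same check could be pointed at a log with other field names, which is the mismatch the paragraph above warns about.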
Tech stack
Python 3.11 (not 3.13 — passlib/bcrypt incompatibility; use import bcrypt
directly), mcp (FastMCP), scikit-learn, numpy/pandas, FastAPI, pytest.
License: Apache 2.0.
Data
MIMIC-IV ICU + ED cohorts and trained RF models. reference/ (this
directory) holds the validated result tables — small, aggregate, no PHI,
safe to commit. ../data/ (one level up, sibling to this package) holds
the actual models and MIMIC-IV-derived data used to run the golden tests
locally — this requires PhysioNet credentialing and is gitignored at the
connector root (../.gitignore); it must never be committed or published.
Override its location with CONNECTOR_DATA_ROOT if needed.
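The override can be pictured with a small sketch. The `resolve_data_root` helper and its default (a `data` directory next to the package directory) are inferred from the description above; the real logic lives in local_datasets.py and may differ.

```python
import os
from pathlib import Path
from typing import Mapping


def resolve_data_root(
    package_dir: Path, env: Mapping[str, str] = os.environ
) -> Path:
    """Return CONNECTOR_DATA_ROOT if set, else the sibling ../data directory."""
    override = env.get("CONNECTOR_DATA_ROOT")
    if override:
        return Path(override).expanduser()
    # Default: one level up from the package, in a directory named "data".
    return package_dir.parent / "data"
```

Keeping this lookup in one function means every script that needs the data directory agrees on where it is.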
Safety
Adversarial and privacy-attack tools are scoped to a user's own models and
gated behind an authorization check (authz.py) — a model only becomes
usable by being registered directly through registry.py's Python API in
the user's own script; there is no MCP tool that lets an agent register or
guess a handle. Every authorized and denied attempt is recorded to the
audit trail (audit.py) — see "Risks / limitations / ethics" in
../STRUCTURE.md.