CDISC SDTM Validator MCP

by mleary

A Model Context Protocol (MCP) server for validating CDISC SDTM datasets against SDTMIG 3.4 specifications.

Overview

This MCP server provides AI agents and clinical programmers with tools to validate Study Data Tabulation Model (SDTM) datasets. It demonstrates a complete end-to-end pipeline: raw pharmaceutical data → SDTM transformation → validation.

Tools

1. check_required_variables(columns, domain="DM")

Validates that a dataset contains the three universal SDTM identifier variables required in every domain:

  • STUDYID — Study identifier

  • DOMAIN — Domain abbreviation (e.g., "DM")

  • USUBJID — Unique subject identifier

Returns: {"domain", "required", "missing", "ok"}
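A minimal sketch of what such a check could look like, assuming column names are matched exactly (case-sensitive). The constant name and the function body are illustrative; the actual implementation in cdisc-mcp.py may differ.

```python
# Hypothetical sketch, not the code from cdisc-mcp.py.
UNIVERSAL_IDENTIFIERS = ["STUDYID", "DOMAIN", "USUBJID"]


def check_required_variables(columns, domain="DM"):
    """Report which universal identifier variables are absent from `columns`."""
    present = set(columns)
    # Assumption: names must match exactly, including case.
    missing = [name for name in UNIVERSAL_IDENTIFIERS if name not in present]
    return {
        "domain": domain,
        "required": list(UNIVERSAL_IDENTIFIERS),
        "missing": missing,
        "ok": not missing,
    }


report = check_required_variables(["DOMAIN", "USUBJID", "AGE"])
print(report["missing"], report["ok"])  # ['STUDYID'] False
```

The same pattern extends to check_dm_required_variables by swapping in the longer list of DM variables.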

2. check_dm_required_variables(columns)

Validates that the Demographics (DM) domain contains all required variables per SDTMIG 3.4 Table 3-1:

  • Universal (3): STUDYID, DOMAIN, USUBJID

  • DM-specific (12): SUBJID, AGE, AGEU, SEX, RACE, ETHNIC, COUNTRY, ARMCD, ARM, ACTARMCD, ACTARM, RFSTDTC

Returns: {"required", "missing", "ok"}

3. check_controlled_terminology(column, values)

Validates that values in a column conform to CDISC Controlled Terminology (CT) codelists.

Supported variables:

  • SEX (C66731): F, M, U, UNDIFFERENTIATED

  • ETHNIC (C66790): HISPANIC OR LATINO, NOT HISPANIC OR LATINO, NOT REPORTED, UNKNOWN

  • RACE (C74457): WHITE, BLACK OR AFRICAN AMERICAN, ASIAN, AMERICAN INDIAN OR ALASKA NATIVE, NATIVE HAWAIIAN OR OTHER PACIFIC ISLANDER, MULTIPLE, NOT REPORTED, UNKNOWN

  • AGEU (C66781): YEARS, MONTHS, WEEKS, DAYS, HOURS

  • DTHFL: Y (only valid value for death flag; null/absent means no death)

Returns: {"column", "codelist_id", "valid_values", "invalid", "ok"}
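The lookup described above can be sketched as a dictionary of hardcoded codelists. The codelist IDs and values are copied from the list above; the dictionary name, the error for unsupported columns, and the decision to skip null or empty values are assumptions made for this illustration.

```python
# Hypothetical sketch, not the code from cdisc-mcp.py.
# DTHFL is listed above without a codelist ID, so None is a placeholder.
CODELISTS = {
    "SEX": ("C66731", ["F", "M", "U", "UNDIFFERENTIATED"]),
    "ETHNIC": ("C66790", ["HISPANIC OR LATINO", "NOT HISPANIC OR LATINO",
                          "NOT REPORTED", "UNKNOWN"]),
    "RACE": ("C74457", ["WHITE", "BLACK OR AFRICAN AMERICAN", "ASIAN",
                        "AMERICAN INDIAN OR ALASKA NATIVE",
                        "NATIVE HAWAIIAN OR OTHER PACIFIC ISLANDER",
                        "MULTIPLE", "NOT REPORTED", "UNKNOWN"]),
    "AGEU": ("C66781", ["YEARS", "MONTHS", "WEEKS", "DAYS", "HOURS"]),
    "DTHFL": (None, ["Y"]),
}


def check_controlled_terminology(column, values):
    """Return the distinct values in `values` that are not in the codelist."""
    if column not in CODELISTS:
        raise ValueError(f"No codelist registered for {column!r}")
    codelist_id, valid_values = CODELISTS[column]
    # Assumption: null or empty values mean "not populated" and are skipped.
    populated = [v for v in values if v not in (None, "")]
    invalid = sorted({v for v in populated if v not in valid_values}, key=str)
    return {
        "column": column,
        "codelist_id": codelist_id,
        "valid_values": list(valid_values),
        "invalid": invalid,
        "ok": not invalid,
    }


result = check_controlled_terminology("SEX", ["M", "F", "X", None])
print(result["invalid"], result["ok"])  # ['X'] False
```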

4. validate_dataset(dataset_name=None, dataset=None)

Runs the full validation suite in a single call and returns a combined report. Provide either a bundled sample name (see list_sample_datasets) or an inline Dataset JSON object. This is the recommended entry point for agents — it loads the data, runs all three checks (controlled terminology only for the codelist columns present), and summarizes pass/fail.

Returns: {"dataset", "checks": [...], "summary": {"ok", "passed", "failed"}}
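To illustrate the orchestration, here is a rough sketch of two steps: choosing which checks apply to a dataset, and folding the per-check results into a summary. The helper names plan_checks and summarize are hypothetical, and treating "passed" and "failed" as counts is an assumption about the report shape.

```python
# Hypothetical sketch, not the code from cdisc-mcp.py.
CT_COLUMNS = ["SEX", "ETHNIC", "RACE", "AGEU", "DTHFL"]


def plan_checks(column_names):
    """List (tool name, column) pairs to run for a dataset with these columns."""
    plan = [
        ("check_required_variables", None),
        ("check_dm_required_variables", None),
    ]
    # Controlled terminology is only checked for codelist columns present.
    plan += [
        ("check_controlled_terminology", column)
        for column in CT_COLUMNS
        if column in column_names
    ]
    return plan


def summarize(check_results):
    """Combine per-check results; assumes each result carries an "ok" flag."""
    passed = sum(1 for result in check_results if result["ok"])
    failed = len(check_results) - passed
    return {"ok": failed == 0, "passed": passed, "failed": failed}


plan = plan_checks(["STUDYID", "DOMAIN", "USUBJID", "SEX", "AGEU"])
print([column for _, column in plan if column])  # ['SEX', 'AGEU']

summary = summarize([{"ok": True}, {"ok": False}, {"ok": True}])
print(summary)  # {'ok': False, 'passed': 2, 'failed': 1}
```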

5. list_sample_datasets()

Lists the sample datasets bundled with the server (read live from samples/). Each entry is {"name", "label", "study", "records", "columns", "description"}; use the name with validate_dataset or the sample://<name> resource.
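As an illustration of reading the listing live from disk, the sketch below rebuilds the entries from the directory on every call. How each field is derived (for example, "study" from studyOID and "columns" as a count) is an assumption, and the "description" field is left out because it comes from a separate mapping in the server.

```python
import json
import tempfile
from pathlib import Path


def list_sample_datasets(samples_dir):
    """Build one entry per *.json file, re-reading the directory each call."""
    entries = []
    for path in sorted(Path(samples_dir).glob("*.json")):
        data = json.loads(path.read_text(encoding="utf-8"))
        entries.append({
            "name": path.stem,
            "label": data.get("label"),
            "study": data.get("studyOID"),         # assumed field mapping
            "records": len(data.get("rows", [])),
            "columns": len(data.get("columns", [])),  # assumed to be a count
        })
    return entries


# Demo against a throwaway directory standing in for samples/.
with tempfile.TemporaryDirectory() as tmp:
    sample = {
        "studyOID": "CDISCPILOT01",
        "name": "DM",
        "label": "Demographics",
        "columns": [{"name": "STUDYID"}, {"name": "DOMAIN"}, {"name": "USUBJID"}],
        "rows": [["CDISCPILOT01", "DM", "01-701-1015"]],
    }
    (Path(tmp) / "demo_dm.json").write_text(json.dumps(sample), encoding="utf-8")
    listing = list_sample_datasets(tmp)

print(listing[0]["name"], listing[0]["records"])  # demo_dm 1
```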

Resources

Each bundled sample is also exposed as an MCP resource at sample://<name> (e.g. sample://pharmaverse_dm), so MCP clients can discover and load the raw Dataset JSON through the native resource primitive.

End-to-End Demo: pharmaverseraw → sdtm.oak → MCP Validation

Background

The pharmaverse ecosystem provides industry-standard tools and data for learning and teaching SDTM. The pharmaverseraw R package contains the CDISCPILOT01 study — a realistic clinical trial dataset in pre-SDTM "raw EDC" format. The sdtm.oak package transforms this raw data into a valid SDTM dataset.

This MCP server completes the pipeline by validating the output:

pharmaverseraw (raw EDC data)
    ↓  sdtm.oak transformation
SDTM DM domain
    ↓  MCP validation
Validation report

Sample Data: CDISCPILOT01

samples/pharmaverse_dm.json contains 5 subjects from the CDISCPILOT01 study in Dataset JSON format (CDISC's standard JSON data representation). This is realistic data used by clinical programmers to learn SDTM transformation workflows.

Subject demographics:

  • Age range: 63–77 years

  • Treatment arms: PLACEBO, Xanomeline High Dose, Xanomeline Low Dose

  • Countries: USA, Japan, Germany

  • One subject with a recorded death (DTHFL = "Y")

Running the Demo

The server owns the whole pipeline. With the server running (see below), you can list the bundled samples, fetch one, and validate it with a single MCP call:

# List the available samples
curl -s localhost:8000/samples

# Fetch one sample's raw Dataset JSON
curl -s localhost:8000/samples/pharmaverse_dm

# Run the full validation suite via the validate_dataset tool
curl -s localhost:8000/mcp \
  -H "Content-Type: application/json" -H "Accept: application/json" \
  -d '{"jsonrpc":"2.0","id":1,"method":"tools/call",
       "params":{"name":"validate_dataset","arguments":{"dataset_name":"pharmaverse_dm"}}}'

The validate_dataset report includes a summary ({"ok", "passed", "failed"}) plus a per-check breakdown. Try dataset_name: "dm_missing_studyid" to see check_required_variables fail on the missing STUDYID identifier.
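For clients that prefer Python over curl, the same JSON-RPC request can be built with the standard library. This is a sketch under assumptions: the helper names are hypothetical, and the response is returned as raw text because its exact framing (plain JSON or an event stream) depends on how the server is configured.

```python
import json
import urllib.request


def build_tool_call(name, arguments, request_id=1):
    """Build the JSON-RPC 2.0 body for an MCP tools/call request."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


def call_tool(url, name, arguments, timeout=10):
    """POST the request and return the raw response body as text."""
    body = json.dumps(build_tool_call(name, arguments)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


payload = build_tool_call("validate_dataset", {"dataset_name": "pharmaverse_dm"})
print(json.dumps(payload, indent=2))

# With the server running locally, the failing sample can be tried with:
# print(call_tool("http://localhost:8000/mcp", "validate_dataset",
#                 {"dataset_name": "dm_missing_studyid"}))
```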

The interactive landing page at / does the same thing visually: it lists samples from /samples and renders the validate_dataset report.

Running the Server

Local Development

# Install dependencies
pip install -r requirements.txt

# Start the development server
uvicorn cdisc-mcp:app --reload --port 8000

# Landing page with interactive tool testing
open http://localhost:8000/

# MCP endpoint (for AI agents / clients): http://localhost:8000/mcp

Configuration

Configuration is optional. One environment variable is recognized (set it in your shell or a local .env file):

  • CONNECT_SERVER — Restrict incoming connections to a specific Posit Connect hostname (via TrustedHostMiddleware). Leave unset for local development.
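One way the variable could be turned into a host allow-list is sketched below. The helper name is hypothetical, and accepting either a bare hostname or a full URL is an assumption about the value's format; the server's own handling may differ.

```python
import os
from urllib.parse import urlsplit


def allowed_hosts_from_env(environ=None):
    """Derive an allow-list of hostnames from CONNECT_SERVER."""
    environ = os.environ if environ is None else environ
    value = environ.get("CONNECT_SERVER", "").strip()
    if not value:
        # Unset (local development): "*" allows any Host header.
        return ["*"]
    # Assumption: the value may be a bare hostname or a full URL.
    host = (urlsplit(value).hostname if "://" in value else value) or value
    return [host]


print(allowed_hosts_from_env({}))  # ['*']
print(allowed_hosts_from_env({"CONNECT_SERVER": "https://connect.example.com/"}))
# ['connect.example.com']

# The list would then be handed to Starlette's middleware, for example:
# from starlette.middleware.trustedhost import TrustedHostMiddleware
# app.add_middleware(TrustedHostMiddleware, allowed_hosts=allowed_hosts_from_env())
```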

Sample Datasets

The server reads sample datasets from samples/ on disk and serves them via /samples, the sample:// resources, and validate_dataset. Edit a file in samples/ and the change is reflected immediately — there is no copy baked into the front-end.

  • pharmaverse_dm.json — Realistic CDISCPILOT01 data (all 24 DM columns)

  • dm.json — Hand-crafted valid DM domain (5 subjects, common variables)

  • dm_missing_studyid.json — Test case: missing STUDYID (validates error detection)

Add another sample by dropping a Dataset JSON file into samples/; it appears automatically in the listing, the dropdown, and as a sample://<name> resource. (Optionally add a one-line description to SAMPLE_DESCRIPTIONS in cdisc-mcp.py.)

Dataset JSON Format

Datasets are represented in CDISC Dataset JSON 1.1.0 format. Example structure:

{
  "studyOID": "CDISCPILOT01",
  "name": "DM",
  "label": "Demographics",
  "columns": [
    {"name": "STUDYID", "label": "Study Identifier", ...},
    {"name": "DOMAIN", "label": "Domain Abbreviation", ...},
    ...
  ],
  "rows": [
    ["CDISCPILOT01", "DM", "01-701-1015", ...],
    ...
  ]
}

See the CDISC Dataset JSON specification for full details.
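Because rows are positional arrays, a consumer has to pair each row with the column definitions before it can look values up by name. The helpers below are a small illustrative sketch (the function names are not from the server), using only the columns and row values shown in the example above.

```python
def column_names(dataset):
    """Return the variable names in column order."""
    return [column["name"] for column in dataset["columns"]]


def records(dataset):
    """Turn each positional row into a {variable: value} dictionary."""
    names = column_names(dataset)
    return [dict(zip(names, row)) for row in dataset["rows"]]


def column_values(dataset, name):
    """Return every value of one variable, e.g. for a terminology check."""
    index = column_names(dataset).index(name)
    return [row[index] for row in dataset["rows"]]


dataset = {
    "studyOID": "CDISCPILOT01",
    "name": "DM",
    "label": "Demographics",
    "columns": [
        {"name": "STUDYID", "label": "Study Identifier"},
        {"name": "DOMAIN", "label": "Domain Abbreviation"},
        {"name": "USUBJID", "label": "Unique Subject Identifier"},
    ],
    "rows": [["CDISCPILOT01", "DM", "01-701-1015"]],
}

print(column_names(dataset))              # ['STUDYID', 'DOMAIN', 'USUBJID']
print(records(dataset)[0]["USUBJID"])     # 01-701-1015
print(column_values(dataset, "DOMAIN"))   # ['DM']
```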

Architecture

All work happens in the server. cdisc-mcp.py reads the sample data from disk, owns the validation orchestration (validate_dataset), and serves both the MCP endpoint and the landing page. The front-end is a thin viewer with no data of its own.

Runtime files (everything the deployed app needs):

  • cdisc-mcp.py — FastMCP server: validation tools, validate_dataset orchestration, /samples routes, and sample:// resources; the deployment entrypoint

  • landing.html — Interactive landing page; fetches samples and results from the server at runtime

  • samples/ — Sample datasets in Dataset JSON format; read live by the server (a single source of truth)

  • requirements.txt — Python dependencies, used by Connect to build the environment

Key Design:

  • Single source of truth for data: samples/*.json on disk, read on every request — editing a sample is reflected everywhere immediately

  • Stateless HTTP service (scales on Posit Connect)

  • Hardcoded CT codelists (no external API calls — checks run fully offline)

  • Deployment-agnostic front-end (MCP and /samples URLs derived client-side from the page location)

  • Validation tools registered as plain callables, so validate_dataset and other Python code can call them directly (not over HTTP)
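The last point can be illustrated with a registration decorator that records a function and hands it back unchanged, so the orchestrator can call it in-process. This is a standard-library sketch with a hypothetical TOOLS registry; it is not FastMCP's decorator, and the check shown is a simplified stand-in.

```python
# Hypothetical registry; stands in for whatever the MCP framework keeps.
TOOLS = {}


def tool(func):
    """Register `func` under its name and return it unchanged."""
    TOOLS[func.__name__] = func
    return func


@tool
def check_required_variables(columns, domain="DM"):
    # Simplified stand-in for the real check.
    missing = [n for n in ("STUDYID", "DOMAIN", "USUBJID") if n not in columns]
    return {"domain": domain, "missing": missing, "ok": not missing}


@tool
def validate_dataset(columns):
    # Direct in-process call: no HTTP round trip to the server's own endpoint.
    result = check_required_variables(columns)
    return {"checks": [result], "summary": {"ok": result["ok"]}}


print(sorted(TOOLS))  # ['check_required_variables', 'validate_dataset']
print(validate_dataset(["DOMAIN", "USUBJID"])["summary"])  # {'ok': False}
```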

Deployment to Posit Connect

The server can be deployed to Posit Connect with Posit Publisher, which generates the .posit/ deployment metadata for your environment. Once deployed:

  • Set CONNECT_SERVER to the Connect hostname so TrustedHostMiddleware scopes incoming connections.

  • The server is stateless, so it scales horizontally without shared state.

The deployed bundle needs all runtime files — cdisc-mcp.py, landing.html, requirements.txt, and the samples/ directory (the server reads it at runtime). Make sure the files list in the Posit Publisher config includes samples/.

Future Expansion

Possible directions for richer validation:

  • Full CDISC Conformance Rules (CORE) validation

  • Completeness checks beyond required variables

  • Domain-specific business rule validation

  • Version-specific SDTMIG checking
