# Adding Custom Datasets
M4 supports any PhysioNet dataset. This guide shows how to add your own.
## Quick Start: JSON Definition
Create a JSON file in `m4_data/datasets/`:
**Example: `m4_data/datasets/mimic-iv-ed.json`**
```json
{
  "name": "mimic-iv-ed",
  "description": "MIMIC-IV Emergency Department Module",
  "file_listing_url": "https://physionet.org/files/mimic-iv-ed/2.2/",
  "subdirectories_to_scan": ["ed"],
  "primary_verification_table": "mimiciv_ed.edstays",
  "requires_authentication": true,
  "bigquery_project_id": "physionet-data",
  "bigquery_dataset_ids": ["mimiciv_ed"],
  "modalities": ["TABULAR"],
  "schema_mapping": {"ed": "mimiciv_ed"},
  "bigquery_schema_mapping": {"mimiciv_ed": "mimiciv_ed"}
}
```
Then initialize:
```bash
m4 init mimic-iv-ed --src /path/to/your/csv/files
```
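Before running `m4 init`, you can sanity-check a definition against the required fields listed in the reference below. This is a minimal standalone sketch; the `check_definition` helper and its `REQUIRED` set are illustrative, not part of M4:
```python
import json
from pathlib import Path

# Fields marked "Required: Yes" in the reference table below.
REQUIRED = {"name", "description", "primary_verification_table"}

def check_definition(path: str) -> None:
    """Illustrative helper: flag missing required keys in a dataset JSON."""
    definition = json.loads(Path(path).read_text())
    missing = REQUIRED - definition.keys()
    if missing:
        raise ValueError(f"{path} is missing required fields: {sorted(missing)}")
    print(f"{definition['name']}: definition looks complete")

check_definition("m4_data/datasets/mimic-iv-ed.json")
```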
## JSON Fields Reference
| Field | Required | Description |
|-------|----------|-------------|
| `name` | Yes | Unique identifier (used in `m4 use <name>`) |
| `description` | Yes | Human-readable description |
| `file_listing_url` | No | PhysioNet URL for auto-download (demo datasets only) |
| `subdirectories_to_scan` | No | Subdirs containing CSV files (e.g., `["hosp", "icu"]`) |
| `primary_verification_table` | Yes | Table to verify initialization succeeded |
| `requires_authentication` | No | `true` if PhysioNet credentialing required |
| `bigquery_project_id` | No | GCP project for BigQuery access |
| `bigquery_dataset_ids` | No | BigQuery dataset IDs |
| `modalities` | No | Data types in this dataset (see below). Defaults to `["TABULAR"]` |
| `schema_mapping` | No | Maps filesystem subdirectories to canonical schema names (see below) |
| `bigquery_schema_mapping` | No | Maps canonical schema names to BigQuery dataset IDs (see below) |
### Available Modalities
| Modality | Description | Available Tools |
|----------|-------------|-----------------|
| `TABULAR` | Structured tables (labs, demographics, vitals, etc.) | `get_database_schema`, `get_table_info`, `execute_query` |
| `NOTES` | Clinical notes and discharge summaries | `search_notes`, `get_note`, `list_patient_notes` |
Tools are filtered based on the dataset's declared modalities; if `modalities` is omitted, it defaults to `["TABULAR"]`.
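For example, a dataset that ships both structured tables and clinical notes would declare both modalities so that the notes tools are exposed. This reuses the programmatic API shown later in this guide; `Modality.NOTES` is assumed to mirror the `NOTES` entry above:
```python
from m4.core.datasets import DatasetDefinition, Modality

# Declaring both modalities exposes the tabular tools *and* the notes tools.
notes_dataset = DatasetDefinition(
    name="my-notes-dataset",
    description="Tables plus discharge summaries",
    primary_verification_table="patients",
    modalities=frozenset({Modality.TABULAR, Modality.NOTES}),
)
```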
### Schema Mapping (Canonical Table Names)
M4 uses canonical `schema.table` names (e.g., `mimiciv_hosp.patients`) that work identically on both DuckDB and BigQuery backends. The `schema_mapping` and `bigquery_schema_mapping` fields control how these canonical names are constructed.
**`schema_mapping`** maps filesystem subdirectories to canonical schema names. When DuckDB creates views, files from each subdirectory are placed into the corresponding schema:
```json
{
  "schema_mapping": {
    "hosp": "mimiciv_hosp",
    "icu": "mimiciv_icu"
  }
}
```
With this mapping, a file at `hosp/patients.csv` becomes queryable as `mimiciv_hosp.patients`.
For datasets where all files are in the root directory (no subdirectories), use an empty string key:
```json
{
  "schema_mapping": {
    "": "eicu_crd"
  }
}
```
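Conceptually, the lookup works like this. The sketch below is standalone and illustrative, not M4's internal code:
```python
from pathlib import Path

def canonical_name(relative_path: str, schema_mapping: dict[str, str]) -> str:
    """Illustrative: map a file path to its canonical schema.table name."""
    p = Path(relative_path)
    subdir = "" if p.parent == Path(".") else p.parent.as_posix()
    return f"{schema_mapping[subdir]}.{p.stem}"  # the "" key handles root-level files

print(canonical_name("hosp/patients.csv", {"hosp": "mimiciv_hosp"}))  # mimiciv_hosp.patients
print(canonical_name("patient.csv", {"": "eicu_crd"}))                # eicu_crd.patient
```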
**`bigquery_schema_mapping`** maps canonical schema names to BigQuery dataset IDs. This allows the BigQuery backend to translate canonical names to the actual GCP dataset names:
```json
{
  "bigquery_schema_mapping": {
    "mimiciv_hosp": "mimiciv_hosp",
    "mimiciv_icu": "mimiciv_icu"
  }
}
```
With this, a query for `mimiciv_hosp.patients` is rewritten to `physionet-data.mimiciv_hosp.patients` on BigQuery.
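A rough sketch of that rewrite (an illustrative helper; the real backend rewrites full SQL statements, not bare table names):
```python
def to_bigquery_name(canonical: str, project_id: str,
                     bigquery_schema_mapping: dict[str, str]) -> str:
    """Illustrative: canonical schema.table -> fully qualified BigQuery name."""
    schema, table = canonical.split(".", 1)
    return f"{project_id}.{bigquery_schema_mapping[schema]}.{table}"

print(to_bigquery_name("mimiciv_hosp.patients", "physionet-data",
                       {"mimiciv_hosp": "mimiciv_hosp"}))
# physionet-data.mimiciv_hosp.patients
```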
Custom datasets without `schema_mapping` still work — tables will be created with flat names in the `main` schema (backward-compatible behavior).
## Initialization Process
When you run `m4 init <dataset>`:
1. **Download** (if `file_listing_url` is set and files are missing)
2. **Convert** CSV.gz files to Parquet format
3. **Create** DuckDB views over the Parquet files
4. **Verify** by querying `primary_verification_table`
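Steps 2 through 4 can be approximated by hand with DuckDB, which is useful when debugging a single file. The paths and schema name below are illustrative and follow the directory layout shown in the next section:
```python
from pathlib import Path
import duckdb

Path("m4_data/parquet/my-dataset").mkdir(parents=True, exist_ok=True)
Path("m4_data/databases").mkdir(parents=True, exist_ok=True)
con = duckdb.connect("m4_data/databases/my_dataset.duckdb")

# Step 2: convert one CSV.gz file to Parquet (read_csv_auto handles gzip).
con.execute("""
    COPY (SELECT * FROM read_csv_auto('m4_data/raw_files/my-dataset/hosp/patients.csv.gz'))
    TO 'm4_data/parquet/my-dataset/patients.parquet' (FORMAT PARQUET)
""")

# Step 3: create a view over the Parquet file under its canonical schema.
con.execute("CREATE SCHEMA IF NOT EXISTS mimiciv_hosp")
con.execute("""
    CREATE OR REPLACE VIEW mimiciv_hosp.patients AS
    SELECT * FROM read_parquet('m4_data/parquet/my-dataset/patients.parquet')
""")

# Step 4: verify, as `m4 init` does with primary_verification_table.
print(con.execute("SELECT count(*) FROM mimiciv_hosp.patients").fetchone())
```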
## Directory Structure
M4 organizes data like this:
```
m4_data/
├── datasets/ # Custom JSON definitions
│ └── my-dataset.json
├── raw_files/ # Downloaded CSV.gz files
│ └── my-dataset/
│ └── *.csv.gz
├── parquet/ # Converted Parquet files
│ └── my-dataset/
│ └── *.parquet
└── databases/ # DuckDB databases
└── my_dataset.duckdb
```
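Given that layout, here is an illustrative standalone check of which stages are present for a dataset (`m4 status` is the supported way to do this):
```python
from pathlib import Path

root, name = Path("m4_data"), "my-dataset"

# Per the layout above, the database filename swaps dashes for underscores.
stages = {
    "raw files": any((root / "raw_files" / name).glob("**/*.csv*")),
    "parquet":   any((root / "parquet" / name).glob("**/*.parquet")),
    "database":  (root / "databases" / f"{name.replace('-', '_')}.duckdb").exists(),
}
for stage, ready in stages.items():
    print(f"{stage}: {'ready' if ready else 'missing'}")
```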
## Using Existing CSV Files
If you already have CSV files (either `.csv` or `.csv.gz`), point to them with `--src`:
```bash
m4 init my-dataset --src /path/to/csvs
```
M4 will:
1. Convert CSV/CSV.gz files to Parquet format
2. Create DuckDB views
3. Set the dataset as active
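Once initialization finishes, you can confirm the views exist by opening the database directly (read-only; the filename follows the directory layout above):
```python
import duckdb

con = duckdb.connect("m4_data/databases/my_dataset.duckdb", read_only=True)
rows = con.execute(
    "SELECT table_schema, table_name FROM information_schema.tables ORDER BY 1, 2"
).fetchall()
for schema, table in rows:
    print(f"{schema}.{table}")
```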
## Credentialed Datasets
For datasets requiring PhysioNet credentials (most full datasets):
1. Get credentialed access on PhysioNet
2. Download manually using wget:
```bash
wget -r -N -c -np --user YOUR_USERNAME --ask-password \
  https://physionet.org/files/dataset-name/version/ \
  -P m4_data/raw_files/dataset-name
```
3. Initialize:
```bash
m4 init dataset-name
```
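If step 3 fails because no files are found, confirm the compressed CSVs actually landed under `raw_files`. Note that `wget -r` mirrors the remote path into a nested `physionet.org/...` subtree; the recursive glob in this illustrative check tolerates either layout:
```python
from pathlib import Path

raw = Path("m4_data/raw_files/dataset-name")
files = sorted(raw.glob("**/*.csv.gz"))
print(f"{len(files)} compressed CSV files found")
for f in files[:5]:
    print(" ", f.relative_to(raw))
```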
## Programmatic Registration
For more control, register datasets in Python:
```python
from m4.core.datasets import DatasetDefinition, DatasetRegistry, Modality
my_dataset = DatasetDefinition(
    name="my-custom-dataset",
    description="My custom clinical dataset",
    primary_verification_table="patients",
    modalities=frozenset({Modality.TABULAR}),
)
DatasetRegistry.register(my_dataset)
```
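Registration only makes the definition known to M4; as with JSON-defined datasets, the data itself still needs to be initialized (e.g. `m4 init my-custom-dataset --src /path/to/csvs`) before the dataset can be selected with `m4 use my-custom-dataset`.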
## Tips
- **Start with demo data:** Test your setup with `mimic-iv-demo` first
- **Check table names:** Use the `get_database_schema` tool to see available tables
- **Verify initialization:** `m4 status` shows if Parquet and DuckDB are ready
- **Force reinitialize:** `m4 init <dataset> --force` recreates the database