edamcp
A general-purpose Exploratory Data Analysis MCP server. Point it at an unknown dataset and it will tell you what's really in there, what to fix first, and how to make it model-ready — quality issues, leakage, bias, PII, missingness patterns, modeling blockers, and a clean export. The hard analytical reasoning lives in the MCP, not in the agent driving it.
Works with Claude Code, Claude Desktop, Cursor, LM Studio, Open WebUI, and any other MCP-capable client. Designed to be fully usable from a local 30B-class model (Qwen 27B / Mistral Small 3.2 / GLM-4 32B) and validated end-to-end on 11M-row production data.
Quick start
# clone + run from a checkout
git clone https://github.com/charliecpeterson/edamcp.git
cd edamcp
uv run edamcp # full surface (70 tools)
uv run edamcp --mode local # thin surface (35 tools) for local models
Wire it into your MCP client. The exact path to mcp.json varies: for Claude
Desktop on macOS it is ~/Library/Application Support/Claude/claude_desktop_config.json,
for LM Studio it is ~/.lmstudio/mcp.json, and for Cursor it is under Settings → MCP.
The entry itself has the same shape everywhere:
{
"mcpServers": {
"edamcp": {
"command": "/path/to/uv",
"args": [
"run", "--directory", "/path/to/edamcp/checkout",
"edamcp"
]
}
}
}
For local-model setups (LM Studio + Qwen, Ollama with the bridge, etc.) add
"--mode", "local" to the args; see Local mode.
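The client entry above can also be generated instead of hand-edited. This is a minimal sketch, assuming a hypothetical helper named `build_server_entry`; it is not part of edamcp. Only the JSON shape and the `--mode local` flag are taken from this README.

```python
import json


def build_server_entry(uv_path: str, checkout: str, local_mode: bool = False) -> dict:
    """Return an mcpServers config fragment for edamcp (hypothetical helper)."""
    args = ["run", "--directory", checkout, "edamcp"]
    if local_mode:
        # Thin 35-tool surface intended for local models.
        args += ["--mode", "local"]
    return {"mcpServers": {"edamcp": {"command": uv_path, "args": args}}}


if __name__ == "__main__":
    entry = build_server_entry("/path/to/uv", "/path/to/edamcp/checkout", local_mode=True)
    print(json.dumps(entry, indent=2))
```

Merge the printed fragment into whichever config file your client reads; the placeholder paths are left for you to fill in.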
What you can do with it
A typical agent session: one user message, one or two tool calls per question, paste-ready Markdown back.
"Use edamcp on /data/orders.csv. Tell me the three things I should fix first and whether I can train a model predicting is_returned."
→ Agent calls auto_explore then auto_modeling_audit. edamcp returns
ranked critical issues, suggested cleaning steps, modeling blockers (class
imbalance, leakage, multicollinearity), and pre-filled SQL drill-downs.
"This is taxi data, 11 million rows of dirty parquet. Find the worst problems, propose a cleaning plan, apply it, save the result, and tell me what would leak if I trained a tip-prediction model."
→ One agent run, ~15 tool calls, ends with a 7.8M-row cleaned parquet on
disk and a flagged target-leakage warning (total_amount = fare + tip
with VIF 1,164). Both the macro work and the heavy modeling audit auto-sample
to ~500K rows so the 30-second MCP transport never times out.
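The auto-sampling mentioned above relies on drawing a fixed-size uniform sample from a much larger source. One standard single-pass technique is reservoir sampling (Algorithm R). The sketch below is illustrative only: the function name and the in-Python loop are assumptions, and edamcp's own sampling code is not shown in this README.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Uniformly sample up to k items from a stream in one pass (Algorithm R)."""
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(row)
        else:
            # Row i is kept with probability k / (i + 1), replacing a random slot.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = row
    return reservoir


if __name__ == "__main__":
    print(reservoir_sample(range(100_000), k=5))
```

Memory stays proportional to `k` regardless of source size, which is why a fixed sample keeps heavy checks within a transport timeout.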
What edamcp surfaces
Everything below is detected automatically, with no specific instruction from the agent:
Schema quality: constants, all-null columns, sentinel values (-999, "NA", etc.) masquerading as data, type drift across rows.
Missingness patterns, including structurally coupled nulls (e.g. five columns all NULL in the exact same rows).
Distribution shape — skew, multimodality, zero-inflation, heavy tails; with suggested transforms (log, Yeo-Johnson, reflect+log).
Correlations, multicollinearity (VIF), correlated-feature groups.
Duplicates — exact + by-key, with dedupe SQL.
Temporal — gaps, monotonicity, drift, daily-count seasonality.
Target leakage — name heuristics + correlation + temporal-constant detection.
Bias & fairness — class imbalance, sampling bias, 80%-rule disparate impact.
PII — regex + Luhn-validated card numbers + column-name heuristics; samples returned redacted.
Bootstrap stability + signal-to-noise per feature.
PCA dimensionality + Hopkins clustering tendency.
Text columns — length, vocab, near-duplicate %, mojibake.
Geospatial — lat/lon validation (out-of-range, null island, bbox).
Nested JSON — flattens deep structures, leaf-presence %, type drift.
HDF5 scientific arrays — leaf-name aggregation, fill values, NaN/Inf, dtype drift, valid_range violations. Validated on real ANI-1 dataset (3.5 GB, 47K molecule groups).
Cross-source schema diff + KS-test / PSI distribution drift.
Plus a full cleaning + export workflow (auto_clean → clean_pipeline →
export_source), eight plot tools with inline image rendering, an interactive
HTML report, and YAML plugins for domain-specific checks (e.g. compliance
rules, chemistry valence) that load on startup.
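To make the "Luhn-validated card numbers" and "samples returned redacted" items concrete, here is a minimal sketch of the Luhn checksum plus a simple masking helper. Both function names are hypothetical and the masking format is an assumption; edamcp's actual PII code is not reproduced here.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the ASCII digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if "0" <= ch <= "9"]
    if len(digits) < 2:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def redact(number: str, keep: int = 4) -> str:
    """Mask all but the last `keep` digits (assumed redaction format)."""
    digits = [ch for ch in number if "0" <= ch <= "9"]
    return "*" * (len(digits) - keep) + "".join(digits[-keep:])


if __name__ == "__main__":
    candidate = "4111 1111 1111 1111"  # well-known test number, not a real card
    print(luhn_valid(candidate), redact(candidate))
```

A regex match alone flags any 16-digit string; adding the checksum filters out most order IDs and other numeric noise before a column is reported as card data.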
Tool surface (70 tools, organized into nine groups)
Source management · load_source · list_sources · describe_source · detect_pattern · detect_metadata · fingerprint_source · infer_recipes · unload_source · list_files
Query · run_sql · sample_rows · peek_array (HDF5 slice reader)
EDA checks · profile · check_quality · check_distributions · check_correlations · check_duplicates · check_multicollinearity · check_temporal · check_leakage · check_bias · check_pii · check_dimensionality · check_stability · check_text_columns · check_outliers · check_feature_signal · check_arrays · check_geospatial · check_nested_structure · check_custom (plugin runner) · suggest_plots · recommend_tasks
Orchestration · run_eda (presets: quick / standard / deep / exhaustive) · get_eda_findings
Visualization · plot_distribution · plot_correlation_heatmap · plot_scatter · plot_timeseries · plot_boxplot · plot_missingness · plot_pair · plot_qq · plot_violin · plot_facet · eda_storyboard (7-plot guided tour) · generate_report (interactive HTML)
Comparison · compare_sources · compare_groups · detect_schema_evolution
Synthesis · data_card · summarize_run · recommend_next
Cleaning + export · auto_clean (proposes plan) · clean_pipeline (executes + materializes) · clean_drop_columns · clean_rename_columns · clean_cast · clean_replace · clean_filter · clean_drop_duplicates · clean_impute · clean_transform · export_source (parquet / csv / json / ndjson)
Macros (one-call workflows) · auto_explore · auto_quality · auto_modeling_audit · auto_share_check · auto_compare
Design highlights
DuckDB views as the substrate. Every loaded source is a DuckDB view in a single in-memory connection. Cross-format JOINs (CSV ↔ Parquet ↔ HDF5 metadata) work for free; large data is out-of-core; no copies until export.
Materialized clean output. clean_pipeline collapses its final view into a real BASE TABLE (CTAS) so downstream queries scan it once instead of re-running the whole filter chain: 10× speedup on every follow-up query.
Scale-aware sampling. Heavy tools (auto_modeling_audit, suggest_plots, eda_storyboard, check_temporal, check_dimensionality) auto-build a reservoir sample (default 500K rows) when the source is larger. Statistics converge well before that threshold; without sampling, a 7.8M-row source takes ~3.5 minutes per macro call and risks OOM-killing the MCP. With sampling: ~9 seconds, same verdict.
Thick tools, thin agent. Macros chain 4–8 granular checks into one call and return Markdown-first output. A 30B local model picking from 35 tools is reliable; picking from 70 is not. See Local mode.
Findings are severity-sorted, top-K by default. Every check returns top_findings[≤10] + an artifact_path to a JSON with the full results. Context budget is the scarcest resource.
Reproducibility recipe in every result. Each non-trivial tool returns the equivalent SQL (and where useful, Polars/Pandas/Python) so the user can replicate without the MCP.
SQL injection-safe. All identifier interpolations go through sqlsafe.quote_ident() (doubles embedded "), so column names with awkward characters work and adversarial inputs don't break out of quoting.
Thread-safe DuckDB access. FastMCP dispatches synchronous tool handlers on a thread pool; Session._lock serializes execute/sql so concurrent tool calls can't corrupt session state.
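The identifier-quoting rule in the design highlights (wrap in double quotes, double any embedded `"`) is standard SQL and small enough to sketch. The function below only mirrors the behavior described for `sqlsafe.quote_ident()`; the body, the NUL check, and the `null_count_sql` helper are assumptions, not edamcp's source.

```python
def quote_ident(name: str) -> str:
    """Quote a SQL identifier, doubling any embedded double quotes."""
    if "\x00" in name:
        # Assumed guard: reject NUL bytes rather than pass them to the engine.
        raise ValueError("identifier contains a NUL byte")
    return '"' + name.replace('"', '""') + '"'


def null_count_sql(table: str, column: str) -> str:
    """Build a query with caller-supplied identifiers safely quoted."""
    return (
        f"SELECT COUNT(*) FROM {quote_ident(table)} "
        f"WHERE {quote_ident(column)} IS NULL"
    )


if __name__ == "__main__":
    # An adversarial column name stays inside the quoted identifier.
    print(null_count_sql("orders", 'x"; DROP TABLE t; --'))
```

Quoting covers identifiers only; values would still need to be passed as bound parameters.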
Local mode
For 30B-class models, the 70-tool surface is too wide. Run with
--mode local to expose 35 high-level tools:
Source/query basics: load_source, list_sources, list_files, run_sql, sample_rows, unload_source
5 macros: auto_explore, auto_quality, auto_modeling_audit, auto_share_check, auto_compare
Modality-specific entry points: check_arrays, peek_array, check_geospatial, check_nested_structure, check_custom
Analyst-question tools: suggest_plots, check_outliers, compare_groups, check_feature_signal, recommend_tasks, eda_storyboard, generate_report
Cleaning + export: auto_clean, clean_pipeline, clean_drop_columns, clean_replace, clean_cast, clean_drop_duplicates, clean_impute, clean_filter, export_source
Synthesis: data_card, summarize_run, recommend_next
The macros internally chain the granular tools and return Markdown. Validated against Qwen 27B in LM Studio — completes a full 5-question EDA prompt on 11M-row data without timeouts.
Prompts (slash-commands)
Hosts that surface prompts as a menu (Claude Desktop, Cursor) get eight pre-canned templates that nudge the model toward the right workflow:
| Prompt | Args | What it does |
| | | "I have no idea what this is." Calls |
| | | Standard EDA walkthrough with plain-English explanations. |
| | | Critical-issue-only audit. Skips info-severity noise. |
| | | Modeling-readiness verdict + hard blockers + soft mitigations. |
| | | Hypothesizes about origin, population, missing pieces. |
| | | Drop-in compatibility check. |
| | | Pattern-detect → load → data_card. |
| | | PII + fairness gate before sharing externally. |
Resources (read-only handles)
eda://sources - list of loaded sources
eda://sources/{source_id} - schema + metadata for one source
eda://runs/{run_id} - full JSON artifact from a prior run_eda
eda://plots/{plot_id} - Vega-Lite spec + paths to rendered PNG/SVG
eda://plugins - list of registered YAML plugins
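A client that consumes these handles needs to split a URI into a resource kind and an optional identifier. The following is a small sketch of that parsing, assuming the five URI shapes listed above are the complete set; `parse_eda_uri` is a hypothetical name, not an edamcp function.

```python
from typing import Optional, Tuple

KNOWN_KINDS = {"sources", "runs", "plots", "plugins"}


def parse_eda_uri(uri: str) -> Tuple[str, Optional[str]]:
    """Split an eda:// URI into (kind, identifier or None)."""
    prefix = "eda://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an eda:// URI: {uri!r}")
    parts = uri[len(prefix):].strip("/").split("/")
    kind = parts[0]
    if kind not in KNOWN_KINDS or len(parts) > 2:
        raise ValueError(f"unrecognized resource: {uri!r}")
    identifier = parts[1] if len(parts) == 2 else None
    return kind, identifier


if __name__ == "__main__":
    print(parse_eda_uri("eda://sources"))
    print(parse_eda_uri("eda://runs/abc123"))
```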
Plugins (YAML + SQL)
Drop a YAML file into a directory and pass --plugins /path/to/dir
(repeatable). Bundled defaults under plugins_builtin/ load automatically.
Example — age_sanity.yaml:
name: age_sanity
description: Detects rows with biologically implausible ages
category: domain_violation
severity: critical
applies_to:
  has_columns: [age]
  modality: tabular
sql: |
  SELECT
    'age out of range' AS title,
    'age' AS column,
    COUNT(*) AS rows_affected,
    MIN(CAST(age AS BIGINT)) AS min_age_seen,
    MAX(CAST(age AS BIGINT)) AS max_age_seen
  FROM {source}
  WHERE age IS NOT NULL AND (CAST(age AS BIGINT) < 0 OR CAST(age AS BIGINT) > 130)
  HAVING COUNT(*) > 0
interpretation: Ages outside [0, 130] are biologically implausible; usually they're sentinel values.
suggested_action: "Cast to NULL: UPDATE source SET age = NULL WHERE age < 0 OR age > 130."
The SQL must reference the source via the literal {source} placeholder
(substituted with the quoted alias at runtime). Each row the SQL returns
becomes a Finding. Plugins self-skip when applies_to.has_columns doesn't
match the source schema, so you can ship many and only the relevant ones
run.
Plugins run automatically as part of run_eda (any preset) and via the
standalone check_custom tool. They appear in data_card output alongside
built-in checks.
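The two plugin mechanics described above, self-skipping on `applies_to.has_columns` and substituting the `{source}` placeholder, can be sketched in a few lines. This is an illustration under assumptions: the plugin is written as a Python dict standing in for the parsed YAML, and all three function names are hypothetical rather than edamcp's loader API.

```python
from typing import Iterable, Mapping


def quote_alias(alias: str) -> str:
    """Quote a source alias as a SQL identifier (doubles embedded quotes)."""
    return '"' + alias.replace('"', '""') + '"'


def plugin_applies(plugin: Mapping, columns: Iterable[str]) -> bool:
    """True when every column in applies_to.has_columns exists in the source."""
    required = plugin.get("applies_to", {}).get("has_columns", [])
    return set(required) <= set(columns)


def render_sql(plugin: Mapping, alias: str) -> str:
    """Substitute the literal {source} placeholder with the quoted alias."""
    sql = plugin["sql"]
    if "{source}" not in sql:
        raise ValueError("plugin SQL must reference {source}")
    # str.replace rather than str.format, so other braces in the SQL are untouched.
    return sql.replace("{source}", quote_alias(alias))


# Stand-in for a parsed YAML plugin (trimmed to the fields used here).
plugin = {
    "name": "age_sanity",
    "applies_to": {"has_columns": ["age"]},
    "sql": "SELECT COUNT(*) AS rows_affected FROM {source} WHERE age < 0 OR age > 130",
}

if __name__ == "__main__":
    if plugin_applies(plugin, ["id", "age"]):
        print(render_sql(plugin, "patients"))
```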
Architecture
┌────────────────────────────────────────────────────────────────┐
│  MCP client (Claude Code, Cursor, Claude Desktop, LM Studio…)  │
└─────────────────────────┬──────────────────────────────────────┘
                          │ stdio JSON-RPC (MCP)
            ┌─────────────▼─────────────┐
            │  edamcp (Python)          │
            │  ─ tool dispatch          │
            │  ─ session (DuckDB views) │
            │  ─ thread-safe lock       │
            │  ─ artifact cache (./_eda)│
            └─────────────┬─────────────┘
                          │
   ┌──────────────────────┼────────────────────────────┐
   │                      │                            │
   ▼                      ▼                            ▼
┌─────────┐       ┌────────────────┐        ┌──────────────────┐
│ ingest  │       │ query engine   │        │ eda checks       │
│ csv/pq/ │◄─────►│ DuckDB views   │◄──────►│ + viz + synth    │
│ json/h5 │       │ + sample tables│        │ + cleaning ops   │
└─────────┘       └────────────────┘        │ + plugins        │
                                            └──────────────────┘
DuckDB is the engine. Every loaded source becomes a DuckDB view in a single connection; cleaning pipelines materialize their final result as a real table.
Polars for in-memory ops where convenient.
Altair + vl-convert for plots — returns Vega-Lite JSON (so the agent can reason about / modify) AND a rendered PNG (for chat UIs).
scipy for statistical tests (chi-square, KS, Mann-Whitney, ANOVA); numpy for PCA via SVD.
h5py for HDF5 inspection with chunked streaming reduces.
Everything is statelessly orchestratable: tools return either a small
summary card with top_findings[≤10] and an artifact_path, or paste-ready
Markdown.
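The summary-card shape described above (a short severity-sorted list plus a pointer to the full JSON) can be sketched as follows. The field names `top_findings` and `artifact_path` come from this README; the function name, the `total_findings` field, the file name, and the "warning" severity level are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path

# "critical" and "info" appear in this README; "warning" is an assumed middle level.
SEVERITY_ORDER = {"critical": 0, "warning": 1, "info": 2}


def summary_card(findings: list, artifact_dir: str, top_k: int = 10) -> dict:
    """Write all findings to JSON and return only the top_k most severe."""
    ranked = sorted(findings, key=lambda f: SEVERITY_ORDER.get(f["severity"], 99))
    artifact_path = Path(artifact_dir) / "findings.json"
    artifact_path.write_text(json.dumps(ranked, indent=2))
    return {
        "top_findings": ranked[:top_k],
        "total_findings": len(ranked),
        "artifact_path": str(artifact_path),
    }


if __name__ == "__main__":
    demo = [
        {"title": "constant column", "severity": "info"},
        {"title": "target leakage", "severity": "critical"},
    ]
    with tempfile.TemporaryDirectory() as tmp:
        print(summary_card(demo, tmp, top_k=1)["top_findings"])
```

Returning a capped list keeps the response small for the model's context, while the artifact on disk preserves every finding for follow-up queries.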
Development
# clone + install
git clone https://github.com/charliecpeterson/edamcp.git
cd edamcp
uv sync
# generate the synthetic test datasets
uv run python scripts/generate_test_data.py ~/edamcp-testdata
# run the full smoke suite
EDAMCP_TEST_DATA=~/edamcp-testdata uv run python scripts/smoke_test.py
# verify local-mode tool surface
uv run python scripts/smoke_local_mode.py
The smoke tests exercise every tool end-to-end through an stdio MCP
handshake. Set EDAMCP_ANI1_PATH=/path/to/ANI-1_release/ to additionally
exercise the HDF5 scientific-array path against real data.
Status
Production-validated:
Three months of NYC Yellow Taxi parquet (11M rows, 20 cols, 190 MB) through the full pipeline: load → audit → clean → export → modeling audit → storyboard. Completes in ~15 tool calls under 32K context with Qwen 27B in LM Studio.
ANI-1 scientific dataset (3.5 GB HDF5, 47,934 molecule groups, ~288K leaf datasets). Full analysis in ~50 seconds; peek_array reads slices by leaf name with |S1 SMILES auto-joined to readable strings.
Synthetic dirty e-commerce with curated quality issues: catches constants, all-nulls, sentinels, type drift, duplicates, multicollinearity, bias, PII, leakage, all in one data_card call.
Smoke coverage: 100+ assertions across 70 tools, both modes, plugins, prompts, resources, and the end-to-end cleaning pipeline.
License
MIT — see LICENSE.