Disco
The Disco server enables automated discovery of statistically validated patterns in tabular data through an MCP interface, covering everything from account creation to running analyses and retrieving results.
Account & Authentication
Sign up / verify: Create a new account via email and 6-digit verification code — no password or credit card required.
Log in / verify: Retrieve a new API key for an existing account via email verification.
Check account status: View your current plan, available credits, and payment method status.
Pricing, Plans & Payments
List plans: Browse available subscription tiers (Free, Researcher, Team) with pricing and credit allowances.
Subscribe: Switch to or enroll in a plan.
Add payment method: Attach a Stripe-tokenized payment method.
Purchase credits: Buy credit packs ($10/pack, 100 credits each) for private analyses.
Data Analysis Workflow
Estimate costs: Before running, get a cost estimate (credits, duration, sufficiency) for a given file size, column count, depth, and visibility setting.
Upload data: Upload datasets via URL, local file path, or base64-encoded content — supports CSV, TSV, Excel, JSON, Parquet, ARFF, and Feather (up to 5 GB).
Run analysis: Launch a discovery pipeline on uploaded data to find feature interactions, subgroup effects, and conditional relationships — with FDR-corrected p-values and optional academic literature novelty checks. Choose between public runs (free, but results are published) or private runs (credit cost, results kept confidential). Optionally leverage LLMs for smarter pre-processing, richer summaries, and more accurate novelty assessments.
Check status: Poll a running analysis (typically 3–15 minutes) for its status, queue position, active pipeline step, and time estimates.
Get results: Retrieve discovered patterns (conditions, effect sizes, p-values, novelty classifications, citations), feature importances, summary insights, and interactive dashboard links.
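Uploads accept base64-encoded content, and the encoding step needs only the standard library. A minimal sketch, where the payload field names (filename, content_base64) are illustrative assumptions rather than the actual API schema:

```python
import base64
import pathlib
import tempfile

def encode_for_upload(path):
    """Base64-encode a local file for a JSON upload payload.
    Field names here are illustrative, not the real API schema."""
    p = pathlib.Path(path)
    return {
        "filename": p.name,
        "content_base64": base64.b64encode(p.read_bytes()).decode("ascii"),
    }

# Demo with a throwaway CSV
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as f:
    f.write("age,outcome\n52,1\n34,0\n")
    tmp_path = f.name

payload = encode_for_upload(tmp_path)
# Decoding recovers the original bytes exactly
original = base64.b64decode(payload["content_base64"])
```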
Provides integration with Jupyter notebooks through the discovery-engine-api[jupyter] package, enabling interactive pattern discovery and analysis within notebook environments.
Provides integration with pandas for data preparation, letting users shape tabular data for pattern discovery; pandas-style operations such as summary statistics, visualization, and filtering remain outside the discovery process itself.
Provides integration with PyPI for installing the discovery-engine-api Python package, enabling users to access the Disco pattern discovery service through the Python SDK.
Provides integration with Python through the discovery-engine-api SDK, enabling programmatic access to Disco's pattern discovery capabilities including data analysis, account management, and result retrieval.
Disco
Find novel, statistically validated patterns in tabular data — feature interactions, subgroup effects, and conditional relationships that correlation analysis and LLMs miss.
Made by Leap Laboratories.
What it actually does
Most data analysis starts with a question. Disco starts with the data.
Without biases or assumptions, it finds combinations of feature conditions that significantly shift your target column — things like "patients aged 45–65 with low HDL and high CRP have 3× the readmission rate" — without you needing to hypothesise that interaction first.
Each pattern is:
Validated on a hold-out set — increases the chance of generalisation
FDR-corrected — p-values included, adjusted for multiple testing
Checked against academic literature — to help you understand what you've found and identify whether it is novel
The output is structured: conditions, effect sizes, p-values, citations, and a novelty classification for every pattern found.
Use it when: "which variables are most important with respect to X", "are there patterns we're missing?", "I don't know where to start with this data", "I need to understand how A and B affect C".
Not for: summary statistics, visualisation, filtering, SQL queries — use pandas for those
Related MCP server: Data Analysis MCP Server
Quickstart
```shell
pip install discovery-engine-api
```
Get an API key:
```shell
# Step 1: request verification code (no password, no card)
curl -X POST https://disco.leap-labs.com/api/signup \
  -H "Content-Type: application/json" \
  -d '{"email": "you@example.com"}'

# Step 2: submit code from email → get key
curl -X POST https://disco.leap-labs.com/api/signup/verify \
  -H "Content-Type: application/json" \
  -d '{"email": "you@example.com", "code": "123456"}'
# → {"key": "disco_...", "credits": 10, "tier": "free_tier"}
```
Or create a key at disco.leap-labs.com/developers.
Run your first analysis:
```python
from discovery import Engine

engine = Engine(api_key="disco_...")

result = await engine.discover(
    file="data.csv",
    target_column="outcome",
)

for pattern in result.patterns:
    if pattern.p_value < 0.05 and pattern.novelty_type == "novel":
        print(f"{pattern.description} (p={pattern.p_value:.4f})")

print(f"Explore: {result.report_url}")
```
Runs take a few minutes. discover() polls automatically and logs progress — queue position, estimated wait, current pipeline step, and ETA. For background runs, see Running asynchronously.
→ Full Python SDK reference · Example notebook
What you get back
Each Pattern in result.patterns looks like this (real output from a crop yield dataset):
```python
Pattern(
    description="When humidity is between 72–89% AND wind speed is below 12 km/h, "
                "crop yield increases by 34% above the dataset average",
    conditions=[
        {"type": "continuous", "feature": "humidity_pct",
         "min_value": 72.0, "max_value": 89.0},
        {"type": "continuous", "feature": "wind_speed_kmh",
         "min_value": 0.0, "max_value": 12.0},
    ],
    p_value=0.003,            # FDR-corrected
    novelty_type="novel",
    novelty_explanation="Published studies examine humidity and wind speed as independent "
                        "predictors, but this interaction effect — where low wind amplifies "
                        "the benefit of high humidity within a specific range — has not been "
                        "reported in the literature.",
    citations=[
        {"title": "Effects of relative humidity on cereal crop productivity",
         "authors": ["Zhang, L.", "Wang, H."], "year": "2021",
         "journal": "Journal of Agricultural Science"},
    ],
    target_change_direction="max",
    abs_target_change=0.34,   # 34% increase
    support_count=847,        # rows matching this pattern
    support_percentage=16.9,
)
```
Key things to notice:
Patterns are combinations of conditions — humidity AND wind speed together, not just "more humidity is better"
Specific thresholds — 72–89%, not a vague correlation
Novel vs confirmatory — every pattern is classified; confirmatory ones validate known science, novel ones are what you came for
Citations — shows what IS known, so you can see what's genuinely new
report_url links to an interactive web report with all patterns visualised
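Since every pattern carries a p-value and a novelty classification, triage is a simple filter. A sketch using plain dicts as stand-ins for the Pattern objects above (the values are invented for illustration):

```python
# Plain dicts stand in for Pattern objects; values are invented for illustration.
patterns = [
    {"description": "humidity 72-89% AND wind < 12 km/h", "p_value": 0.003, "novelty_type": "novel"},
    {"description": "nitrogen > 45 mg/kg", "p_value": 0.02, "novelty_type": "confirmatory"},
    {"description": "sandy soil AND low rainfall", "p_value": 0.11, "novelty_type": "novel"},
]

def triage(patterns, alpha=0.05):
    """Keep significant patterns, split by novelty classification."""
    buckets = {"novel": [], "confirmatory": []}
    for p in patterns:
        if p["p_value"] < alpha:
            buckets[p["novelty_type"]].append(p["description"])
    return buckets

buckets = triage(patterns)
# Only the two significant patterns survive; the third is filtered out
```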
The result.summary gives an LLM-generated narrative overview:
```python
result.summary.overview
# "Disco identified 14 statistically significant patterns. 5 are novel.
#  The strongest driver is a previously unreported interaction between humidity
#  and wind speed at specific thresholds."

result.summary.key_insights
# ["Humidity × low wind speed at 72–89% humidity produces a 34% yield increase — novel.",
#  "Soil nitrogen above 45 mg/kg shows diminishing returns when phosphorus is below 12 mg/kg.",
#  ...]
```
How it works
Disco is a pipeline, not prompt engineering over data. It:
1. Trains machine learning models on a subset of your data
2. Uses interpretability techniques to extract learned patterns
3. Validates every pattern on the held-out data with FDR correction (Benjamini–Hochberg)
4. Checks surviving patterns against academic literature via semantic search
You cannot replicate this by writing pandas code or asking an LLM to look at a CSV. It finds structure that hypothesis-driven analysis misses because it doesn't start with hypotheses.
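To see what the FDR step does, here is a toy Benjamini–Hochberg step-up implementation. This is a sketch of the standard procedure, not Disco's internal code:

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up: return indices of hypotheses rejected
    at false discovery rate alpha. Toy version of the standard procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0  # largest rank whose p-value clears its threshold
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k = rank
    return sorted(order[:k])

# The two small p-values survive; the 0.9 does not drag them down
benjamini_hochberg([0.01, 0.02, 0.9])   # → [0, 1]
```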
Preparing your data
Before running, exclude columns that would produce meaningless findings. Disco finds statistically real patterns — but if the input includes columns that are definitionally related to the target, the patterns will be tautological.
Exclude:
Identifiers — row IDs, UUIDs, patient IDs, sample codes
Data leakage — the target renamed or reformatted (e.g., diagnosis_text when the target is diagnosis_code)
Tautological columns — alternative encodings of the same construct as the target. If the target is serious, then serious_outcome, not_serious, and death are all part of the same classification. If the target is profit, then revenue and cost together compose it. If the target is a survey index, the sub-items are tautological.
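A quick pre-flight heuristic can catch identifier-like columns before a run. This is a rough sketch (nearly-all-unique values suggest an ID), not what Disco does internally:

```python
def identifier_candidates(rows, threshold=0.95):
    """Flag columns whose values are (nearly) all unique - likely IDs to
    pass to excluded_columns. Heuristic sketch, not Disco's own logic."""
    if not rows:
        return []
    flagged = []
    for col in rows[0]:
        values = [r[col] for r in rows]
        if len(set(values)) / len(values) >= threshold:
            flagged.append(col)
    return flagged

rows = [
    {"patient_id": "001", "age": 52, "outcome": 1},
    {"patient_id": "002", "age": 34, "outcome": 0},
    {"patient_id": "003", "age": 52, "outcome": 1},
]
identifier_candidates(rows)  # → ["patient_id"]
```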
Full guidance with examples: SKILL.md
Parameters
```python
await engine.discover(
    file="data.csv",              # path, Path, or pd.DataFrame
    target_column="outcome",      # column to predict/explain
    analysis_depth=2,             # 2 = default; higher = deeper analysis, lower = faster and cheaper
    visibility="public",          # "public" (always free; data and report are published) or "private" (costs credits)
    column_descriptions={         # improves pattern explanations and literature context
        "bmi": "Body mass index",
        "hdl": "HDL cholesterol in mg/dL",
    },
    excluded_columns=["id", "timestamp"],  # see "Preparing your data" above
    use_llms=False,               # defaults to False; True is slower and more expensive but gives smarter
                                  # pre-processing, summary page, literature context, and novelty assessment.
                                  # Public runs always use LLMs.
    title="My dataset",
    description="...",            # improves pattern explanations and literature context
)
```
Public runs are free but results are published. Set visibility="private" for private data — this costs credits.
Running asynchronously
Runs take a few minutes. For agent workflows or scripts that do other work in parallel:
```python
# Submit without waiting
run = await engine.run_async(file="data.csv", target_column="outcome", wait=False)
print(f"Submitted {run.run_id}, continuing...")

# ... do other things ...

result = await engine.wait_for_completion(run.run_id, timeout=1800)
```
For synchronous scripts and Jupyter notebooks:
```python
result = engine.run(file="data.csv", target_column="outcome", wait=True)
# or: pip install discovery-engine-api[jupyter] for notebook compatibility
```
MCP server
Disco is available as an MCP server — no local install required.
```json
{
  "mcpServers": {
    "discovery-engine": {
      "url": "https://disco.leap-labs.com/mcp",
      "env": { "DISCOVERY_API_KEY": "disco_..." }
    }
  }
}
```
Tools: discovery_list_plans, discovery_estimate, discovery_upload, discovery_analyze, discovery_status, discovery_get_results, discovery_account, discovery_signup, discovery_signup_verify, discovery_login, discovery_login_verify, discovery_add_payment_method, discovery_subscribe, discovery_purchase_credits.
Pricing
| | Cost |
|---|---|
| Public runs | Free — results and data are published |
| Private runs | Credits vary by file size and configuration — use estimate() before running |
| Free tier | 10 credits/month, no card required |
| Researcher | $49/month — 50 credits |
| Team | $199/month — 200 credits |
| Credits | $0.10 per credit |
Estimate before running:
```python
estimate = await engine.estimate(file_size_mb=10.5, num_columns=25, analysis_depth=2, visibility="private")
# estimate["cost"]["credits"] → 55
# estimate["account"]["sufficient"] → True/False
```
Account management is fully programmatic — attach payment methods, subscribe to plans, and purchase credits via the SDK or REST API. See Python SDK reference or SKILL.md.
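With an estimate in hand, topping up is plain arithmetic on the published pricing ($10 packs of 100 credits). A sketch, not an SDK call:

```python
import math

PACK_CREDITS = 100     # credits per pack, from the pricing above
PACK_PRICE_USD = 10.0  # $0.10/credit x 100 credits

def top_up_needed(estimated_credits, available_credits):
    """How many credit packs cover the shortfall for a private run,
    and what they cost. Plain arithmetic on the published pricing."""
    shortfall = max(0, estimated_credits - available_credits)
    packs = math.ceil(shortfall / PACK_CREDITS)
    return packs, packs * PACK_PRICE_USD

packs, cost = top_up_needed(estimated_credits=55, available_credits=10)
# 45-credit shortfall → one 100-credit pack for $10
```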
Expected data format
Disco expects a flat table — columns for features, rows for samples.
| patient_id | age | bmi | smoker | outcome |
|------------|-----|------|--------|---------|
| 001 | 52 | 28.3 | yes | 1 |
| 002 | 34 | 22.1 | no | 0 |
| ... | ... | ... | ... | ... |

One row per observation — a patient, a sample, a transaction, a measurement, etc.
One column per feature — numeric, categorical, datetime, or free text are all fine
One target column — the outcome you want to understand. Must have at least 2 distinct values.
Missing values are OK — Disco handles them automatically. Don't drop rows or impute beforehand.
No pivoting needed — if your data is already in a flat table, it's ready to go
Supported formats: CSV, TSV, Excel (.xlsx), JSON, Parquet, ARFF, Feather. Max 5 GB.
Not supported: images, raw text documents, nested/hierarchical JSON, multi-sheet Excel (use the first sheet or export to CSV)
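The format rules above can be checked locally before uploading. A sketch using only the standard library; these checks mirror the rules but are not the service's validation:

```python
import csv
import io

def preflight(csv_text, target_column):
    """Check the flat-table rules: data present, target column exists,
    target has at least 2 distinct values. Local sketch only."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    problems = []
    if not rows:
        problems.append("no data rows")
    elif target_column not in rows[0]:
        problems.append(f"missing target column {target_column!r}")
    elif len({r[target_column] for r in rows}) < 2:
        problems.append("target needs at least 2 distinct values")
    return problems

preflight("age,outcome\n52,1\n34,0\n", "outcome")  # → []
preflight("age,outcome\n52,1\n34,1\n", "outcome")  # → ["target needs at least 2 distinct values"]
```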
Compared to other tools
| Goal | Tool |
|---|---|
| Summary statistics, data quality | ydata-profiling, sweetviz |
| Predictive model | AutoML (auto-sklearn, TPOT, H2O) |
| Quick correlations | pandas, seaborn |
| Answer a specific question about data | ChatGPT, Claude |
| Find what you don't know to look for | Disco |
Disco isn't a replacement for EDA or AutoML — it finds the patterns those tools miss. We tested 18 data analysis tools on a dataset with known ground-truth patterns. Most confidently reported wrong results. Disco was the only one that found every pattern.