data-aggregator-mcp
Server Configuration
Environment variables the server reads. Both are optional.
| Name | Required | Description | Default |
|---|---|---|---|
| NCBI_API_KEY | No | Raises the NCBI E-utilities rate limit (3 → 10 req/s) used by the omics, literature, and taxonomy lookups. | |
| UNPAYWALL_EMAIL | No | Enables the Unpaywall fallback leg of literature full-text retrieval. | |
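The effect of the two variables can be summarised in a few lines. This is a minimal sketch of the rule stated in the table, not the server's own code; the function name `describe_config` and the returned keys are hypothetical.

```python
import os


def describe_config(env=None):
    """Summarise what the two optional variables switch on (illustrative only)."""
    env = os.environ if env is None else env
    has_key = bool(env.get("NCBI_API_KEY", "").strip())
    has_email = bool(env.get("UNPAYWALL_EMAIL", "").strip())
    return {
        # The table gives 3 req/s without a key and 10 req/s with one.
        "ncbi_requests_per_second": 10 if has_key else 3,
        # Unpaywall is only used as a full-text fallback when an email is set.
        "unpaywall_fallback": has_email,
    }


print(describe_config({}))
print(describe_config({"NCBI_API_KEY": "example-key",
                       "UNPAYWALL_EMAIL": "me@example.org"}))
```

Passing an explicit dictionary instead of reading `os.environ` keeps the check easy to try without touching the real environment.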
Capabilities
Features and capabilities supported by this server
| Capability | Details |
|---|---|
| tools | {"listChanged": false} |
| prompts | {"listChanged": false} |
| resources | {"subscribe": false, "listChanged": false} |
| experimental | {} |
Tools
Functions exposed to the LLM to take actions
| Name | Description |
|---|---|
| search | Search public research-data archives, omics registries, and the literature for datasets, software, publications, and sequencing data. Fans out across Zenodo, DataCite (Dryad, Figshare, Dataverse, OSF, Mendeley, OpenNeuro), NCBI omics (GEO, SRA, BioProject), literature (PubMed + OpenAIRE), HuggingFace Hub (datasets), DataONE (eco/environmental federation), OmicsDI (proteomics/metabolomics), RCSB PDB (macromolecular structures), GWAS Catalog (genotype-phenotype studies), OpenML (ML datasets), DANDI (neurophysiology dandisets), and CZ CELLxGENE (single-cell datasets). Returns compact DataResource records; per-source failures are reported in errors{}. Use resolve for the full record (SRA resolve attaches the ENA FASTQ manifest; publication resolve attaches links[] to datasets/accessions, normalized identifiers (pmid/pmcid/doi), and, when open access, a full-text file), then fetch to download files. Pass organism= to expand the query with NCBI-Taxonomy synonyms; results carry normalized taxa[] + plant cross-links. Pass disease= to expand the query with MeSH descriptor synonyms (e.g. 'breast cancer' also matches 'Breast Neoplasms'); the expansion is echoed in mesh_expansion. Pass tissue= to expand the query with UBERON synonyms (e.g. 'liver' also matches 'iecur'/'jecur'); the expansion is echoed in tissue_expansion. Pass chemical= to expand the query with ChEBI compound synonyms (e.g. 'caffeine' also matches '1,3,7-trimethylxanthine'); the expansion is echoed in chemical_expansion. Pass assay= to expand the query with EDAM assay/method synonyms (e.g. 'ChIP-seq' also matches 'ChIP-sequencing'); the expansion is echoed in assay_expansion. Pass collapse_mirrors=true to opt into conservative cross-repo mirror collapse: same-dataset copies under different/no DOIs are folded into one record, with the folded copies annotated under mirrors[]. |
| resolve | Fetch the full DataResource for a known id (e.g. 'zenodo:7654321', 'datacite:10.5061/dryad.x', 'hf:owner/name', a bare Zenodo record id, or a DOI), including the complete files[] manifest. Publication resolve also attaches normalized identifiers (pmid/pmcid/doi) and, when open access, a full-text file. Pass cite= to render a citation onto the result (citation field); omitted means no citation. Pass trust=true to attach retraction status (via Crossref) under trust{}. Pass fair=true to attach an RDA-grounded FAIRness score (0–100 + F/A/I/R sub-scores + actionable gaps) computed from the record under fair{}. Pass use= (commercial/redistribute/modify/ml-training) to attach a licence-compatibility advisory (ALLOW/REVIEW/DENY, not legal advice) under license_compat{}. Pass format=provenance for a one-call RO-Crate 1.1 data-availability dossier (under provenance{}) composing version-currency, licence+SPDX, FAIR score, retraction status, and the source/DOI/ID chain; it auto-attaches fair + trust. |
| fetch | Download a resource's files to local disk and return the PATHS (never the file contents). Fetchable backends: Zenodo (md5-verified); SRA via ENA FASTQ (md5-verified); GEO supplementary files (unverified); DataCite sub-repos: Figshare/Dataverse/OSF (md5-verified), OpenNeuro (snapshot manifest, unverified), Dryad is manifest-only (resolve lists files, fetch fails loud), Mendeley + other DataCite repos fail loud; PubMed/OpenAIRE open-access full text (EuropePMC XML / Unpaywall PDF, unverified); HuggingFace Hub (unverified); DataONE Member-Node objects (md5/SHA-256-verified); OmicsDI: PRIDE + MetaboLights only (unverified), MassIVE/GNPS/PeptideAtlas/Metabolomics Workbench fail loud; DANDI dandisets (302→S3, unverified); CZ CELLxGENE H5AD/RDS assets (unverified); OpenML ARFF (md5-verified); RCSB PDB .cif/.pdb structure files (unverified). Fails loud if selected files exceed max_bytes unless force=true. Verifies checksums; writes a .dataresource.json sidecar. |
| list_sources | List wired data sources and their capabilities (layer, kinds, supported filters, auth requirement, rate limit, status). |
| operate | Inspect or query a remote tabular file (Parquet/CSV/TSV) WITHOUT downloading it. op='schema' returns columns+types; 'preview' a small sample; 'head' the first n rows; 'sql' a read-only SELECT against the file (exposed as the view 'data', e.g. "SELECT * FROM data WHERE x > 1"). op='peek' profiles every column WITHOUT downloading: type, null-rate, approximate distinct count, min/max, and numeric quartiles (a DuckDB SUMMARIZE; like head/sql it reads the whole file, so it honors the source-size ceiling). Addresses a file by catalog id + file name (resolve the id first to see files[] and access_modes). Requires the [operate] extra; fails loud if the file is not an operable tabular file. |
| relate | Given 2-10 resource ids, return metadata-level join/harmonization HINTS: how the datasets relate and on what key they could be joined. Detects shared accessions (BioProject/SRA/GEO), shared cross-identifiers (doi/pmid/pmcid), explicit links between the inputs, and version lineage. HINTS ONLY: it does not read file columns, fetch files, or execute any join/merge/conversion; each hint names the shared value as evidence. Resolve ids first if you only have a search result. Per-id resolve failures are reported, not fatal. |
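The tools are meant to be chained: search, then resolve, then fetch, with operate as a way to query a tabular file in place. The sketch below builds the JSON-RPC 2.0 `tools/call` requests an MCP client would send for that flow. The envelope follows the MCP specification. The argument names `query`, `id`, `file`, and `sql` are assumptions, since the listing does not give the exact input schema, while `organism`, `collapse_mirrors`, `fair`, `trust`, `max_bytes`, and `op` come from the descriptions above. The query text, file name, and byte limit are illustrative values.

```python
import itertools
import json

_request_ids = itertools.count(1)


def tool_call(name, **arguments):
    """Build an MCP JSON-RPC 2.0 `tools/call` request for the named tool."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


# Step 1: broad search. `query` is an assumed argument name.
search = tool_call("search", query="drought stress RNA-seq",
                   organism="Arabidopsis thaliana", collapse_mirrors=True)

# Step 2: full record for one hit. `id` is an assumed argument name;
# the id value is the example id format from the resolve description.
resolve = tool_call("resolve", id="zenodo:7654321", fair=True, trust=True)

# Step 3: download, capped by max_bytes (fetch fails loud above the cap).
fetch = tool_call("fetch", id="zenodo:7654321", max_bytes=500_000_000)

# Alternative to step 3: query a tabular file without downloading it.
# `file` and `sql` are assumed argument names; the view is called `data`.
operate = tool_call("operate", id="hf:owner/name", file="train.parquet",
                    op="sql", sql="SELECT * FROM data WHERE x > 1")

for request in (search, resolve, fetch, operate):
    print(json.dumps(request))
```

Before relying on any of the assumed names, check the input schema each tool reports through `tools/list`.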
Prompts
Interactive templates invoked by user choice
| Name | Description |
|---|---|
| find_data | Find datasets/data for a topic, optionally scoped to an organism. |
| data_behind_paper | Find the datasets / accessions behind a paper (by DOI, PMID, or title). |
| search_resolve_fetch | Walk the search → resolve → fetch flow for a data need. |
Resources
Contextual data attached and managed by the client
| Name | Description |
|---|---|
| sources | The wired data sources and their capabilities (same payload as the list_sources tool), as JSON. |
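Because the listing gives the resource's name but not its URI, a client would normally look the URI up from a `resources/list` result before reading it. The sketch below shows that lookup and builds the follow-up `resources/read` request. The helper name and the URI `data-aggregator://sources` are hypothetical placeholders, not values published by the server.

```python
def read_request_for(list_result, name, request_id=1):
    """Build a `resources/read` request for the resource called `name`."""
    for resource in list_result.get("resources", []):
        if resource.get("name") == name:
            return {
                "jsonrpc": "2.0",
                "id": request_id,
                "method": "resources/read",
                "params": {"uri": resource["uri"]},
            }
    raise LookupError(f"no resource named {name!r}")


# Hypothetical `resources/list` result; the URI is a placeholder.
example_list_result = {
    "resources": [
        {"uri": "data-aggregator://sources", "name": "sources",
         "mimeType": "application/json"},
    ]
}

print(read_request_for(example_list_result, "sources"))
```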
MCP directory API
We provide all the information about MCP servers via our MCP API.
curl -X GET 'https://glama.ai/api/mcp/v1/servers/musharna/data-aggregator-mcp'