musharna

data-aggregator-mcp

🔎 data-aggregator-mcp

One MCP server to find and fetch research data across archives, omics registries, and literature, behind a single normalized model.


search one query across Zenodo, DataCite (Dryad / Figshare / Dataverse / OSF / Mendeley), NCBI omics (GEO / SRA / BioProject), and literature (PubMed / OpenAIRE), with results deduplicated, normalized, and cross-linked. resolve any hit to its file manifest, citation, and the data it points at. fetch it to disk with checksum verification.

mcp-name: io.github.musharna/data-aggregator-mcp

✨ Why this

Most data MCPs wrap a single source. This one unifies them behind four tools and one DataResource model, so an agent searches once and gets back comparable records:

  • Multi-domain, one model: generalist archives + raw omics + literature, deduplicated by DOI (the fetchable record wins over bare metadata).

  • Taxonomy synonym expansion: organism="Orobanche aegyptiaca" also matches Phelipanche aegyptiaca (NCBI Taxonomy), so a species rename doesn't cost you results.

  • Paper → data bridge: resolve a paper and get links to the GEO / SRA / BioProject / DataCite records it produced.

  • Verified fetch: streams to disk with md5 verification where the source exposes a checksum, optional archive unpacking, and a fail-loud integrity sniff that rejects an HTML paywall page served as a "PDF".

  • Citations, access & full text: render a citation in any CSL style, get normalized access/license, and pull open-access full text, all in one resolve.
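
The "fetchable record wins" rule in the first bullet can be illustrated with a small sketch. This is not the project's implementation: the `Record` class, its field names, and the DOI normalization rules below are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Record:
    """Hypothetical stand-in for a DataResource; field names are assumptions."""
    id: str
    doi: Optional[str] = None
    files: list = field(default_factory=list)


def normalize_doi(doi: str) -> str:
    """Lower-case a DOI and strip common resolver prefixes so equal DOIs compare equal."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


def dedupe_by_doi(records: list) -> list:
    """Keep one record per DOI, preferring a record that lists files."""
    kept = {}
    for rec in records:
        # Records without a DOI cannot collide, so key them by their own id.
        key = normalize_doi(rec.doi) if rec.doi else f"id:{rec.id}"
        current = kept.get(key)
        # Replace a metadata-only record when a fetchable one shows up.
        if current is None or (rec.files and not current.files):
            kept[key] = rec
    return list(kept.values())
```

With this rule, a bare DataCite metadata hit and a Zenodo hit carrying a file manifest for the same DOI collapse into the Zenodo record.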

⚡ Quickstart

Run with no install:

uvx data-aggregator-mcp

Register with Claude Code:

claude mcp add data-aggregator -- uvx data-aggregator-mcp

A typical agent flow:

search("drought stress RNA-seq", organism="Sorghum bicolor")
  → [ geo:GSE..., sra:SRX..., zenodo:..., pubmed:... ]   # deduped, taxa-normalized

resolve("sra:SRX079566")
  → DataResource{ files: [ENA FASTQ urls…], access: "open", taxa: [...] }

fetch("sra:SRX079566", dest="./data")
  → ["./data/SRX079566_1.fastq.gz", …]                   # md5-verified

Or install from PyPI and run the server directly:

pip install data-aggregator-mcp
data-aggregator-mcp        # or: python -m data_aggregator_mcp

Add to a client's MCP config (e.g. Claude Desktop claude_desktop_config.json):

{
  "mcpServers": {
    "data-aggregator": {
      "command": "uvx",
      "args": ["data-aggregator-mcp"],
      "env": { "NCBI_API_KEY": "your-optional-key" }
    }
  }
}

🗂️ Sources

| Source | Discover | Fetch | Checksum |
| --- | --- | --- | --- |
| Zenodo | ✅ | ✅ | md5 |
| DataCite → Figshare | ✅ | ✅ | md5 |
| DataCite → Dataverse | ✅ | ✅ | md5 |
| DataCite → OSF | ✅ | ✅ | md5 |
| DataCite → Dryad | ✅ | manifest only¹ | sha-256 (listed) |
| DataCite → Mendeley & others | ✅ | - | - |
| NCBI SRA | ✅ | ✅ (ENA FASTQ) | md5 |
| NCBI GEO | ✅ | ✅ (suppl/) | none² |
| NCBI BioProject | ✅ | → SRA links | - |
| PubMed / OpenAIRE | ✅ | ✅ (OA full text) | none² |

¹ Dryad downloads are token / bot-challenge gated, so fetch fails loud; resolve still lists the files.

² No upstream checksum: fetch verifies content-type instead (rejects an HTML page served in place of a binary).
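
To make the content-type check in the second footnote concrete, here is a minimal sketch of one way such a sniff could work: inspect the first bytes of the payload and refuse it if a file declared as binary starts like an HTML document. The function names, the `IntegrityError` type, and the markers checked are assumptions for illustration, not the server's actual logic.

```python
class IntegrityError(Exception):
    """Hypothetical error raised when a payload fails the sniff."""


def looks_like_html(head: bytes) -> bool:
    """Return True if the first bytes of a payload look like an HTML document."""
    # Skip a UTF-8 byte-order mark and leading whitespace before comparing.
    probe = head.lstrip(b"\xef\xbb\xbf \t\r\n").lower()
    return probe.startswith((b"<!doctype html", b"<html"))


def sniff(head: bytes, declared: str) -> None:
    """Fail loudly if a file declared as non-HTML starts like an HTML page."""
    if declared != "text/html" and looks_like_html(head):
        raise IntegrityError(f"expected {declared}, got an HTML page")
```

A paywall or login page returned with a 200 status would trip this check, while a real PDF (which starts with `%PDF`) or gzip stream would pass.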

🛠️ Tools

search(query, size?, sources?, organism?)

Fan out across all wired sources in parallel and return compact DataResource records, deduped by DOI. Per-source failures land in errors{} and are never silently dropped.

  • organism: expand the query with NCBI-Taxonomy synonyms; the expansion is echoed in taxon_expansion, and results carry normalized taxa[] ({taxid, name}) plus a described_in link to plant-genomics-mcp for plant taxa.

  • sources: restrict the fan-out, e.g. ["omics"].

  • size: max results (1–50).
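
The parallel fan-out with per-source error capture can be sketched with `asyncio.gather`. This is an illustration under assumptions: the adapter signature (an async function taking the query string) and the shape of the returned dict are hypothetical, chosen only to mirror the errors{} behavior described above.

```python
import asyncio


async def fan_out(query: str, sources: dict) -> dict:
    """Query every source concurrently; collect failures instead of raising."""
    names = list(sources)
    # return_exceptions=True makes gather hand back exceptions as values,
    # so one failing source does not cancel or hide the others.
    outcomes = await asyncio.gather(
        *(sources[name](query) for name in names),
        return_exceptions=True,
    )
    results, errors = [], {}
    for name, outcome in zip(names, outcomes):
        if isinstance(outcome, BaseException):
            errors[name] = f"{type(outcome).__name__}: {outcome}"
        else:
            results.extend(outcome)
    return {"results": results, "errors": errors}
```

If one adapter times out, its name and error message appear under `errors` while the results from the remaining sources are still returned.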

resolve(id)

Full record + files manifest. Routes by id shape: zenodo:7654321, a bare DOI, datacite:10.5061/dryad.x, an omics id (sra:SRX079566, geo:GSE332789, bioproject:PRJNA1468572), or a literature id (pubmed:34320281, openaire:<id>). Attaches, where available:

  • files[]: ENA FASTQ manifest (SRA), GEO suppl/, or the host repo's native manifest (Figshare / Dataverse / OSF / Dryad).

  • links[]: paper → data links. pubmed: → sra: / geo: / bioproject: (NCBI elink); openaire: → datacite: (ScholeXplorer Scholix).

  • access / license: normalized status (open / embargoed / restricted / closed / unknown) and license where the source exposes it.

  • identifiers: normalized {pmid, pmcid, doi}, plus an open-access full-text FileEntry (EuropePMC XML, or an Unpaywall PDF fallback) for papers.

  • citation: pass cite=<format>, where the format is bibtex, ris, csl-json, or any CSL style name (apa, mla, vancouver, …). DOI records use content negotiation; others render CSL-JSON from metadata. Off by default; failures degrade quietly.
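
Routing by id shape might look roughly like the sketch below. The prefix set is taken from the id examples above; the `route` function, the DOI regular expression, and the choice to label a bare DOI as `"doi"` are assumptions for illustration, since the source does not say how a bare DOI is dispatched internally.

```python
import re

# Prefixes from the id examples above; how each is handled is not shown here.
PREFIXES = {"zenodo", "datacite", "sra", "geo", "bioproject", "pubmed", "openaire"}
# Loose DOI shape: "10.", a registrant code, a slash, then a suffix.
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")


def route(identifier: str) -> tuple:
    """Split an id into (source, local id); a bare DOI is tagged as "doi"."""
    identifier = identifier.strip()
    if DOI_RE.match(identifier):
        return ("doi", identifier)
    prefix, sep, local = identifier.partition(":")
    if sep and local and prefix.lower() in PREFIXES:
        return (prefix.lower(), local)
    raise ValueError(f"unrecognized id shape: {identifier!r}")
```

An unprefixed accession such as `SRX079566` raises `ValueError` in this sketch rather than being guessed at, which keeps a mistyped id from resolving against the wrong source.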

fetch(id, dest?, files?, max_bytes?, force?, extract?)

Download files to disk and return their paths. Streams under a max_bytes guard (force to override) with md5 verification wherever a checksum exists.

  • files: restrict to a subset of the resolved manifest.

  • extract: unpack downloaded zip / tar archives in place, guarded against path traversal and runaway extracted size. Off by default.

  • Unverified fetches (GEO suppl/, literature full text) get a content-type sniff that fails loud if a declared binary is actually an HTML page.

  • Fetchable: Zenodo, SRA, GEO, DataCite-hosted Figshare / Dataverse / OSF, and literature open-access full text. Dryad and other DataCite repos are discovery-only and raise FetchNotSupportedError.
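
The combination of a size guard and md5 verification while streaming can be sketched as follows. This is a simplified illustration, not the server's code: `stream_to_file`, `TooLargeError`, and `ChecksumError` are hypothetical names, and the chunks are passed in as an iterable so the example stays independent of any HTTP library.

```python
import hashlib
from typing import BinaryIO, Iterable, Optional


class TooLargeError(Exception):
    """Hypothetical error: the download exceeded max_bytes."""


class ChecksumError(Exception):
    """Hypothetical error: the md5 digest did not match the expected value."""


def stream_to_file(
    chunks: Iterable[bytes],
    out: BinaryIO,
    *,
    expected_md5: Optional[str] = None,
    max_bytes: Optional[int] = None,
) -> int:
    """Write chunks to out, hashing as we go; return the number of bytes written."""
    digest = hashlib.md5()
    written = 0
    for chunk in chunks:
        written += len(chunk)
        # Stop before writing the chunk that would cross the size limit.
        if max_bytes is not None and written > max_bytes:
            raise TooLargeError(f"download exceeds {max_bytes} bytes")
        digest.update(chunk)
        out.write(chunk)
    # Verify only when the source published a checksum.
    if expected_md5 is not None and digest.hexdigest() != expected_md5.lower():
        raise ChecksumError("md5 mismatch")
    return written
```

Hashing chunk by chunk avoids a second pass over the file, and passing `max_bytes=None` corresponds to overriding the guard.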

list_sources()

Wired sources with their capabilities: layer, kinds, supported filters, fetchability, id examples, auth, and rate limits.

⚙️ Configuration

Both optional, set via environment variables:

  • NCBI_API_KEY: raises the NCBI E-utilities rate limit (3 → 10 req/s) used by the omics, literature, and taxonomy lookups.

  • UNPAYWALL_EMAIL: enables the Unpaywall fallback leg of literature full-text retrieval (the EuropePMC leg works without it).
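
As a rough illustration of how these two variables could translate into runtime settings, the sketch below reads them from an environment mapping. The `Settings` class and `load_settings` function are hypothetical; only the variable names and the 3 and 10 req/s figures come from the text above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the two optional settings."""
    ncbi_api_key: Optional[str]
    unpaywall_email: Optional[str]

    @property
    def ncbi_requests_per_second(self) -> int:
        # 3 req/s without a key, 10 req/s with one, per the figures above.
        return 10 if self.ncbi_api_key else 3

    @property
    def unpaywall_enabled(self) -> bool:
        return self.unpaywall_email is not None


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read both variables, treating empty or whitespace-only values as unset."""
    def read(name: str) -> Optional[str]:
        value = env.get(name, "").strip()
        return value or None

    return Settings(read("NCBI_API_KEY"), read("UNPAYWALL_EMAIL"))
```

Passing the mapping as a parameter keeps the function easy to exercise in tests without touching the real environment.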

🧪 Develop

uv venv && uv pip install -e ".[dev]"
uv run pytest -q
uv run ruff check src tests
DATA_AGGREGATOR_MCP_LIVE=1 uv run pytest -k live -q   # real-API probes

The README demo (examples/assets/demo.svg) is recorded network-free from examples/_demo_stdio.py; see the header of that file to re-record.

License

MIT; see LICENSE.
