Glama
musharna

data-aggregator-mcp

search

Read-only

Search public research-data archives, omics registries, and literature for datasets, software, publications, and sequencing data, with ontology-based query expansion.

Instructions

Search public research-data archives, omics registries, and the literature for datasets, software, publications, and sequencing data. Fans out across Zenodo, DataCite (Dryad, Figshare, Dataverse, OSF, Mendeley, OpenNeuro), NCBI omics (GEO, SRA, BioProject), literature (PubMed + OpenAIRE), HuggingFace Hub (datasets), DataONE (eco/environmental federation), OmicsDI (proteomics/metabolomics), RCSB PDB (macromolecular structures), GWAS Catalog (genotype-phenotype studies), OpenML (ML datasets), DANDI (neurophysiology dandisets), and CZ CELLxGENE (single-cell datasets). Returns compact DataResource records; per-source failures are reported in errors{}.

Use resolve for the full record (SRA resolve attaches the ENA FASTQ manifest; publication resolve attaches links[] to datasets/accessions, normalized identifiers (pmid/pmcid/doi), and, when open access, a full-text file), then fetch to download files.

Pass organism= to expand the query with NCBI-Taxonomy synonyms; results carry normalized taxa[] + plant cross-links.

Pass disease= to expand the query with MeSH descriptor synonyms (e.g. 'breast cancer' also matches 'Breast Neoplasms'); the expansion is echoed in mesh_expansion.

Pass tissue= to expand the query with UBERON synonyms (e.g. 'liver' also matches 'iecur'/'jecur'); the expansion is echoed in tissue_expansion.

Pass chemical= to expand the query with ChEBI compound synonyms (e.g. 'caffeine' also matches '1,3,7-trimethylxanthine'); the expansion is echoed in chemical_expansion.

Pass assay= to expand the query with EDAM assay/method synonyms (e.g. 'ChIP-seq' also matches 'ChIP-sequencing'); the expansion is echoed in assay_expansion.

Pass collapse_mirrors=true to opt into conservative cross-repo mirror collapse: same-dataset copies under different/no DOIs are folded into one record, with the folded copies annotated under mirrors[].
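The mirror collapse described above can be pictured with a small sketch. It follows the merge rule stated for collapse_mirrors (a shared file checksum, or an identical normalized title, first-author surname, and year), but it is only an illustration under assumptions: the record field names (id, title, first_author, year, checksums) and the title normalization are hypothetical, not the server's actual implementation.

```python
import re

def _norm_title(title):
    # Assumed normalization: lowercase, collapse non-alphanumerics to a space.
    return re.sub(r"[^a-z0-9]+", " ", (title or "").lower()).strip()

def is_mirror(a, b):
    """True if records a and b look like the same dataset (conservative)."""
    # Rule 1: any shared file checksum.
    if set(a.get("checksums") or []) & set(b.get("checksums") or []):
        return True
    # Rule 2: identical (normalized title, first-author surname, year),
    # with all three fields present; a title-only match never merges.
    key_a = (_norm_title(a.get("title")), (a.get("first_author") or "").lower(), a.get("year"))
    key_b = (_norm_title(b.get("title")), (b.get("first_author") or "").lower(), b.get("year"))
    return all(key_a) and key_a == key_b

def collapse_mirrors(records):
    """Fold mirrors into the first-seen record, listing copies under mirrors[]."""
    survivors = []
    for rec in records:
        for kept in survivors:
            if is_mirror(kept, rec):
                kept.setdefault("mirrors", []).append(rec["id"])
                break
        else:
            # No survivor matched, so this record is kept (as a copy).
            survivors.append(dict(rec))
    return survivors

# Hypothetical page of three hits: the second mirrors the first,
# the third shares only title and year, so it stays separate.
page = [
    {"id": "zenodo:1", "title": "Soil Microbiome Survey", "first_author": "Lee", "year": 2021},
    {"id": "figshare:9", "title": "Soil microbiome survey.", "first_author": "lee", "year": 2021},
    {"id": "dryad:3", "title": "Soil microbiome survey", "first_author": "Kim", "year": 2021},
]
collapsed = collapse_mirrors(page)
```

Because the fold happens within one page, the collapsed page can hold fewer than size items, which matches the intra-page behavior noted for this option.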

Input Schema

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| query | No | Free-text search query |  |
| size | No | Max results (1-50, default 10) |  |
| sources | No | Restrict fan-out to these sources (default: all). Available: zenodo, dataone, cellxgene, datacite, dandi, omics, literature, huggingface, omicsdi, openml, pdb, gwas |  |
| organism | No | Optional organism name. Resolved via NCBI Taxonomy; the query is expanded with the canonical name + synonyms (e.g. 'Orobanche aegyptiaca' also matches 'Phelipanche aegyptiaca'). The expansion is echoed in taxon_expansion. |  |
| disease | No | Optional disease/phenotype name. Resolved via MeSH (NCBI E-utilities); the query is expanded with the canonical descriptor + entry-term synonyms (e.g. 'breast cancer' also matches 'Breast Neoplasms'). The expansion is echoed in mesh_expansion. |  |
| tissue | No | Optional tissue/anatomy name. Resolved via UBERON (EBI OLS); the query is expanded with the canonical term + exact synonyms (e.g. 'liver' also matches 'iecur'/'jecur'). The expansion is echoed in tissue_expansion. |  |
| chemical | No | Optional chemical/compound name. Resolved via ChEBI (EBI OLS); the query is expanded with the canonical name + exact synonyms (e.g. 'caffeine' also matches '1,3,7-trimethylxanthine'), capped to a bounded number of synonyms. An unknown term yields no expansion; an OLS failure surfaces in errors. The expansion is echoed in chemical_expansion. |  |
| assay | No | Optional assay/method name. Resolved via EDAM topics (EBI OLS); the query is expanded with the canonical name + exact synonyms (e.g. 'ChIP-seq' also matches 'ChIP-sequencing'/'ChIP-exo'). An unknown term yields no expansion; an OLS failure surfaces in errors. The expansion is echoed in assay_expansion. |  |
| cursor | No | Opaque pagination token from a prior search's next_cursor. When set, all other search params are read from the cursor. |  |
| published_after | No | Keep results with year >= this. |  |
| published_before | No | Keep results with year <= this. |  |
| kind | No | Keep only results of this kind. |  |
| rank | No | Result ordering. 'relevance' (default) = upstream/merged order. 'semantic' re-ranks the fetched page by embedding similarity to the query (needs EMBEDDING_API_BASE; degrades to relevance order with an errors['semantic'] note if unconfigured). In semantic mode pagination is window-based (each page consumes its full fetched window). | relevance |
| collapse_mirrors | No | Opt into conservative cross-repo content dedup (default false). On top of the always-on exact-DOI dedup, folds records that are the SAME dataset deposited under different (or no) DOIs (e.g. a Zenodo mirror of a figshare deposit, GEO<->ArrayExpress) into one record, annotating the survivor with the folded copies under mirrors[]. Conservative: a merge needs a shared file checksum OR identical (normalized-title, first-author-surname, year); title-only or partial matches never merge. Intra-page / best-effort only (a mirror on a different page is not collapsed), so a page may return fewer than size items; pagination is unaffected. |  |
| understand | No | Opt into LLM query understanding: a free-text query is rewritten into a keyword core + structured params (organism/disease/tissue/chemical/assay, kind, year) before fan-out; extracted entities are validated by the same ontology resolvers (a hallucinated entity that doesn't resolve is simply dropped), explicit params you pass always win, and the interpretation is echoed in query_understanding. Requires an LLM endpoint (LLM_API_BASE); with none configured the search runs unchanged and notes it in errors['understand']. |  |
| multi_query | No | Opt into diverse multi-query recall expansion: an LLM generates up to a few deliberately-diverse reformulations of your query, each is fanned out across all sources, and the deduped union is re-ranked against your original query, surfacing relevant records a single keyword query would miss. Costs N× the upstream calls (bounded). Requires an LLM endpoint (LLM_API_BASE); with none configured the search runs as a normal single query and notes it in errors['multi_query']. The variants used are echoed in query_expansion. Composes with understand=. NOTE: multi_query=true ALWAYS applies semantic re-ranking of the window internally regardless of rank=; the rank= param has no effect in this mode. |  |
| provenance | No | Opt into a whole-search RO-Crate 1.1 Run Crate (default false). Attaches provenance_crate{}, a machine-readable manifest documenting this search: the query, the sources queried, the ontology expansions that fired, the per-source errors (a partial search is disclosed), and per-hit provenance for every result (version-currency, licence + normalized SPDX, FAIR score). Per-hit RETRACTION is omitted because it would need one Crossref call per hit; use per-record resolve(format=provenance) for that. Covers THIS search page only (intra-page; each page of a paginated search gets its own crate). |  |
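A client may want to check arguments against the constraints in this table before calling the tool. The sketch below is one way to do that, under assumptions: the helper names check_search_args and build_search_call are hypothetical, rank is assumed to accept only the two values the table names, and the request envelope follows the generic MCP JSON-RPC `tools/call` shape rather than anything specific to this server.

```python
KNOWN_SOURCES = {
    "zenodo", "dataone", "cellxgene", "datacite", "dandi", "omics",
    "literature", "huggingface", "omicsdi", "openml", "pdb", "gwas",
}
RANK_MODES = {"relevance", "semantic"}  # assumed to be the only valid values

def check_search_args(args):
    """Return a list of notes about the arguments; an empty list means OK."""
    notes = []
    size = args.get("size", 10)
    if isinstance(size, bool) or not isinstance(size, int) or not 1 <= size <= 50:
        notes.append("size must be an integer from 1 to 50")
    unknown = sorted(set(args.get("sources") or []) - KNOWN_SOURCES)
    if unknown:
        notes.append("unknown sources: " + ", ".join(unknown))
    if args.get("rank", "relevance") not in RANK_MODES:
        notes.append("rank must be 'relevance' or 'semantic'")
    after, before = args.get("published_after"), args.get("published_before")
    if after is not None and before is not None and after > before:
        notes.append("published_after is later than published_before")
    # Interactions called out in the table.
    if args.get("multi_query") and "rank" in args:
        notes.append("rank has no effect when multi_query=true")
    if "cursor" in args and len(args) > 1:
        notes.append("other params are read from the cursor when cursor is set")
    return notes

def build_search_call(request_id, args):
    """Wrap the arguments in a JSON-RPC 2.0 tools/call message for `search`."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "search", "arguments": dict(args)},
    }

args = {"query": "ChIP-seq liver", "tissue": "liver", "sources": ["omics"], "size": 5}
problems = check_search_args(args)
message = build_search_call(1, args) if not problems else None
```

The notes for multi_query and cursor are warnings rather than hard errors, since the table describes those combinations as ignored rather than rejected.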

Output Schema

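| Name | Required | Description | Default |
| --- | --- | --- | --- |
| query | Yes |  |  |
| total | Yes |  |  |
| count | Yes |  |  |
| results | No |  |  |
| errors | No |  |  |
| next_cursor | No |  |  |
| taxon_expansion | No |  |  |
| mesh_expansion | No |  |  |
| tissue_expansion | No |  |  |
| chemical_expansion | No |  |  |
| assay_expansion | No |  |  |
| query_understanding | No |  |  |
| query_expansion | No |  |  |
| provenance_crate | No |  |  |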
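Since results, errors, and next_cursor are all optional in the output, a caller has to treat each as possibly absent. The following is a minimal sketch of paging through a search while accumulating per-source errors. The call_search callable is a hypothetical stand-in for whatever MCP client is in use, and the shape of the errors values (plain strings here) is an assumption.

```python
def iter_search_pages(call_search, arguments, max_pages=5):
    """Yield response dicts, following next_cursor until it is absent."""
    args = dict(arguments)
    for _ in range(max_pages):
        page = call_search(args)
        yield page
        cursor = page.get("next_cursor")
        if not cursor:
            return
        # All other params are read from the cursor, so send it alone.
        args = {"cursor": cursor}

def collect_results(call_search, arguments, max_pages=5):
    """Gather results across pages, plus the latest error seen per source."""
    results, errors = [], {}
    for page in iter_search_pages(call_search, arguments, max_pages):
        results.extend(page.get("results") or [])
        errors.update(page.get("errors") or {})
    return results, errors

# Stand-in for a real client, used only to exercise the loop.
_FAKE_PAGES = {
    None: {
        "query": "liver", "total": 3, "count": 2,
        "results": [{"id": "a"}, {"id": "b"}],
        "errors": {"dataone": "timeout"},
        "next_cursor": "page-2",
    },
    "page-2": {"query": "liver", "total": 3, "count": 1, "results": [{"id": "c"}]},
}

def fake_call_search(args):
    return _FAKE_PAGES[args.get("cursor")]

results, errors = collect_results(fake_call_search, {"query": "liver", "size": 2})
print(len(results), errors)  # prints: 3 {'dataone': 'timeout'}
```

A non-empty errors dict signals a partial search, so it is worth surfacing to the user even when results came back.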
Behavior: 5/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

The description extensively details behavior beyond the readOnlyHint annotation: fan-out across sources, per-source failure reporting in errors{}, conservative mirror collapse, pagination via cursor, and semantic reranking. No contradiction with annotations.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is lengthy but every sentence adds value, covering all features clearly. It is well-structured with logical flow from main purpose to parameter details. Could be slightly more concise, but the depth justifies the length.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the high complexity (17 parameters, advanced features like multi_query, understand, collapse_mirrors, provenance) and no output schema shown, the description is remarkably complete. It explains return format, error handling, pagination, and each parameter's effect. No gaps identified.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 5/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 100%, but the description adds substantial meaning for every parameter: e.g., organism expansion via NCBI Taxonomy with synonym examples, disease via MeSH, tissue via UBERON. It also explains complex interactions like cursor overriding other params, multi_query always using semantic ranking, and provenance option generating a Run Crate.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description explicitly states it searches public research-data archives, omics registries, and literature for datasets, software, publications, and sequencing data. It names specific sources (Zenodo, DataCite, NCBI omics, etc.) and distinguishes from siblings by noting that resolve fetches full records and fetch downloads files.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides clear guidance on when to use this tool (for broad, multi-source searching) and when to use alternatives (resolve for full record, fetch for downloads). It also explains optional query expansions (organism, disease, etc.) but lacks explicit when-not-to-use scenarios.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/musharna/data-aggregator-mcp'
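The same request can be made from Python with the standard library. This is a rough equivalent of the curl call above, assuming the endpoint returns a JSON body (the response format is not shown here); the helper names server_url and fetch_server are hypothetical.

```python
import json
import urllib.parse
import urllib.request

API_ROOT = "https://glama.ai/api/mcp/v1/servers"

def server_url(owner, name):
    """Build the directory URL for one server, percent-encoding each part."""
    owner_part = urllib.parse.quote(owner, safe="")
    name_part = urllib.parse.quote(name, safe="")
    return f"{API_ROOT}/{owner_part}/{name_part}"

def fetch_server(owner, name, timeout=10.0):
    """GET the server entry and parse it, assuming a JSON response body."""
    with urllib.request.urlopen(server_url(owner, name), timeout=timeout) as resp:
        return json.load(resp)

url = server_url("musharna", "data-aggregator-mcp")
```

Calling fetch_server("musharna", "data-aggregator-mcp") performs the network request; it raises urllib.error.URLError or HTTPError on failure, which a caller would normally catch.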

If you have feedback or need assistance with the MCP directory API, please join our Discord server.