search

Read-only

Search multiple public research archives and omics registries for datasets, publications, and sequencing data with ontology-based query expansion and cross-repo deduplication.

Instructions

Search public research-data archives, omics registries, and the literature for datasets, software, publications, and sequencing data.

Fans out across Zenodo, DataCite (Dryad, Figshare, Dataverse, OSF, Mendeley, OpenNeuro), NCBI omics (GEO, SRA, BioProject), literature (PubMed + OpenAIRE), HuggingFace Hub (datasets), DataONE (eco/environmental federation), OmicsDI (proteomics/metabolomics), RCSB PDB (macromolecular structures), GWAS Catalog (genotype-phenotype studies), OpenML (ML datasets), DANDI (neurophysiology dandisets), and CZ CELLxGENE (single-cell datasets). Returns compact DataResource records; per-source failures are reported in errors{}.

Use resolve for the full record (SRA resolve attaches the ENA FASTQ manifest; publication resolve attaches links[] to datasets/accessions, normalized identifiers (pmid/pmcid/doi), and, when open access, a full-text file), then fetch to download files.

Pass organism= to expand the query with NCBI-Taxonomy synonyms; results carry normalized taxa[] + plant cross-links. Pass disease= to expand the query with MeSH descriptor synonyms (e.g. 'breast cancer' also matches 'Breast Neoplasms'); the expansion is echoed in mesh_expansion. Pass tissue= to expand the query with UBERON synonyms (e.g. 'liver' also matches 'iecur'/'jecur'); the expansion is echoed in tissue_expansion. Pass chemical= to expand the query with ChEBI compound synonyms (e.g. 'caffeine' also matches '1,3,7-trimethylxanthine'); the expansion is echoed in chemical_expansion. Pass assay= to expand the query with EDAM assay/method synonyms (e.g. 'ChIP-seq' also matches 'ChIP-sequencing'); the expansion is echoed in assay_expansion.

Pass collapse_mirrors=true to opt into conservative cross-repo mirror collapse: same-dataset copies under different/no DOIs are folded into one record, with the folded copies annotated under mirrors[].

An ontology param that matches no term in its registry (e.g. organism='yeast'; NCBI Taxonomy indexes no such common name) is reported in unresolved[] and the search runs WITHOUT that expansion, so a dropped filter is never silent. Clients that support form elicitation are asked for a replacement term before the search runs.
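
As a rough illustration of the options described above, the following sketch builds an MCP `tools/call` request for this tool. The helper name `build_search_request`, the request id, and the example query are hypothetical; the argument names come from the input schema, and how the request is sent depends on your client.

```python
import json


def build_search_request(query, request_id=1, **params):
    """Build a JSON-RPC `tools/call` payload for the `search` tool.

    Hypothetical helper for illustration; it is not part of the server.
    """
    arguments = {"query": query}
    # Drop unset options so the server applies its own defaults.
    arguments.update({k: v for k, v in params.items() if v is not None})
    size = arguments.get("size")
    if size is not None and not 1 <= size <= 50:
        raise ValueError("size must be between 1 and 50")
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "search", "arguments": arguments},
    }


request = build_search_request(
    "tumor RNA-seq",
    disease="breast cancer",  # expanded with MeSH synonyms
    tissue="liver",           # expanded with UBERON synonyms
    collapse_mirrors=True,    # opt into cross-repo mirror collapse
    size=20,
)
print(json.dumps(request, indent=2))
```

After the call, checking `unresolved` and `errors` in the response shows whether any expansion was dropped or any source failed.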

Input Schema


Name	Required	Description	Default
`kind`	No	Keep only results of this kind.
`rank`	No	Result ordering. 'relevance' (default) = upstream/merged order. 'semantic' re-ranks the fetched page by embedding similarity to the query (needs EMBEDDING_API_BASE; degrades to relevance order with an errors['semantic'] note if unconfigured). In semantic mode pagination is window-based (each page consumes its full fetched window).	relevance
`size`	No	Max results (1-50).	10
`assay`	No	Optional assay/method name. Resolved via EDAM topics (EBI OLS); the query is expanded with the canonical name + exact synonyms (e.g. 'ChIP-seq' also matches 'ChIP-sequencing'/'ChIP-exo'). An unknown term yields no expansion; an OLS failure surfaces in errors. The expansion is echoed in assay_expansion.
`query`	No	Free-text search query
`cursor`	No	Opaque pagination token from a prior search's next_cursor. When set, all other search params are read from the cursor.
`tissue`	No	Optional tissue/anatomy name. Resolved via UBERON (EBI OLS); the query is expanded with the canonical term + exact synonyms (e.g. 'liver' also matches 'iecur'/'jecur'). The expansion is echoed in tissue_expansion.
`disease`	No	Optional disease/phenotype name. Resolved via MeSH (NCBI E-utilities); the query is expanded with the canonical descriptor + entry-term synonyms (e.g. 'breast cancer' also matches 'Breast Neoplasms'). The expansion is echoed in mesh_expansion.
`sources`	No	Restrict fan-out to these sources (default: all). Available: zenodo, dataone, gbif, cellxgene, datacite, dandi, omics, literature, huggingface, datagov, nasacmr, omicsdi, openml, pdb, uniprot, gwas, biostudies
`chemical`	No	Optional chemical/compound name. Resolved via ChEBI (EBI OLS); the query is expanded with the canonical name + exact synonyms (e.g. 'caffeine' also matches '1,3,7-trimethylxanthine'), capped to a bounded number of synonyms. An unknown term yields no expansion; an OLS failure surfaces in errors. The expansion is echoed in chemical_expansion.
`organism`	No	Optional organism name. Resolved via NCBI Taxonomy; the query is expanded with the canonical name + synonyms (e.g. 'Orobanche aegyptiaca' also matches 'Phelipanche aegyptiaca'). The expansion is echoed in taxon_expansion.
`provenance`	No	Opt into a whole-search RO-Crate 1.1 Run Crate (default false). Attaches provenance_crate{} — a machine-readable manifest documenting this search: the query, the sources queried, the ontology expansions that fired, the per-source errors (a partial search is disclosed), and per-hit provenance for every result (version-currency, licence + normalized SPDX, FAIR score). Per-hit RETRACTION is omitted — it would need one Crossref call per hit; use per-record resolve(format=provenance) for that. Covers THIS search page only (intra-page; each page of a paginated search gets its own crate).
`understand`	No	Opt into LLM query understanding: a free-text query is rewritten into a keyword core + structured params (organism/disease/tissue/chemical/assay, kind, year) before fan-out; extracted entities are validated by the same ontology resolvers (a hallucinated entity that doesn't resolve is simply dropped), explicit params you pass always win, and the interpretation is echoed in query_understanding. Requires an LLM endpoint (LLM_API_BASE); with none configured the search runs unchanged and notes it in errors['understand'].
`multi_query`	No	Opt into diverse multi-query recall expansion: an LLM generates up to a few deliberately-diverse reformulations of your query, each is fanned out across all sources, and the deduped union is re-ranked against your original query — surfacing relevant records a single keyword query would miss. Costs N× the upstream calls (bounded). Requires an LLM endpoint (LLM_API_BASE); with none configured the search runs as a normal single query and notes it in errors['multi_query']. The variants used are echoed in query_expansion. Composes with understand=. NOTE: multi_query=true ALWAYS applies semantic re-ranking of the window internally regardless of rank=; the rank= param has no effect in this mode.
`published_after`	No	Keep results with year >= this.
`collapse_mirrors`	No	Opt into conservative cross-repo content dedup (default false). On top of the always-on exact-DOI dedup, folds records that are the SAME dataset deposited under different (or no) DOIs — e.g. a Zenodo mirror of a figshare deposit, GEO<->ArrayExpress — into one record, annotating the survivor with the folded copies under mirrors[]. Conservative: a merge needs a shared file checksum OR identical (normalized-title, first-author-surname, year); title-only or partial matches never merge. Intra-page / best-effort only (a mirror on a different page is not collapsed), so a page may return fewer than size items; pagination is unaffected.
`published_before`	No	Keep results with year <= this.
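
The merge rule described for `collapse_mirrors` can be sketched as follows. This is only an illustration of the stated rule, not the server's implementation: the record fields (`title`, `first_author`, `year`, `checksums`), the title normalization, and the sample records are all assumptions made for the example.

```python
import re


def normalize_title(title):
    # Assumed normalization: lowercase, strip punctuation, collapse whitespace.
    return " ".join(re.sub(r"[^a-z0-9 ]+", " ", title.lower()).split())


def match_key(record):
    # (normalized title, first-author surname, year); field names are hypothetical.
    return (
        normalize_title(record.get("title", "")),
        record.get("first_author", "").strip().lower(),
        record.get("year"),
    )


def is_mirror(a, b):
    """Return True if a and b look like copies of the same dataset."""
    # Rule 1: any shared file checksum is sufficient.
    if set(a.get("checksums", [])) & set(b.get("checksums", [])):
        return True
    # Rule 2: all three key parts must be present and identical,
    # so title-only or partial matches never merge.
    key_a, key_b = match_key(a), match_key(b)
    return all(key_a) and key_a == key_b


# Made-up records for illustration only.
zenodo_copy = {"title": "Soil Microbiome: 16S Survey", "first_author": "Nguyen", "year": 2021}
figshare_copy = {"title": "soil microbiome 16S survey", "first_author": "nguyen", "year": 2021}
title_only = {"title": "Soil microbiome 16S survey", "first_author": "Okafor", "year": 2021}

print(is_mirror(zenodo_copy, figshare_copy))  # True
print(is_mirror(zenodo_copy, title_only))     # False
```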

Output Schema


Name	Required	Description	Default
`count`	Yes
`query`	Yes
`total`	Yes
`errors`	No
`results`	No
`unresolved`	No
`next_cursor`	No
`mesh_expansion`	No
`assay_expansion`	No
`query_expansion`	No
`taxon_expansion`	No
`provenance_crate`	No
`tissue_expansion`	No
`chemical_expansion`	No
`query_understanding`	No
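
A minimal sketch of consuming these fields across pages is shown below. It assumes `errors` is a mapping and `unresolved` is a list, as the `errors{}` and `unresolved[]` notation in the instructions suggests. `call_search` is a hypothetical callable that sends arguments to the tool and returns the response as a dict; `fake_call_search` is a stand-in with made-up pages so the example runs on its own.

```python
def search_all(call_search, max_pages=5, **params):
    """Follow next_cursor across pages, collecting results and diagnostics."""
    results, errors, unresolved = [], {}, []
    args = dict(params)
    for _ in range(max_pages):
        page = call_search(**args)
        results.extend(page.get("results", []))
        # Per-source failures: a non-empty dict means the page was partial.
        errors.update(page.get("errors", {}))
        # Ontology terms that matched nothing and were searched without expansion.
        for term in page.get("unresolved", []):
            if term not in unresolved:
                unresolved.append(term)
        cursor = page.get("next_cursor")
        if not cursor:
            break
        # All other params are read from the cursor, so send it alone.
        args = {"cursor": cursor}
    return results, errors, unresolved


# Stand-in for a real client call; the pages are invented for the example.
FAKE_PAGES = {
    None: {
        "results": [{"id": "a"}],
        "errors": {"zenodo": "timeout"},
        "unresolved": ["yeast"],
        "next_cursor": "c1",
    },
    "c1": {"results": [{"id": "b"}]},
}


def fake_call_search(**args):
    return FAKE_PAGES[args.get("cursor")]


results, errors, unresolved = search_all(fake_call_search, query="fermentation", organism="yeast")
print(len(results), errors, unresolved)
```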

data-aggregator-mcp
