search_dataset
Run a full-text search inside a single split of a Hugging Face Hub dataset, ranked with BM25, to find specific content for analysis and exploration.
Instructions
Full-text search within a dataset split using BM25 ranking
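To give a feel for what "BM25 ranking" means here, the following is a minimal sketch of the textbook Okapi BM25 formula over a tiny in-memory list of documents. It is illustrative only: the tokenizer, the `k1` and `b` values, and the function names `tokenize` and `bm25Scores` are assumptions for the example, not the search backend's actual implementation or settings.

```typescript
// Illustrative tokenizer (an assumption): lowercase alphanumeric runs.
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

// Textbook Okapi BM25: returns one score per document for the given query.
// k1 and b are common default values, not values taken from the tool.
function bm25Scores(docs: string[], query: string, k1 = 1.2, b = 0.75): number[] {
  const tokenized = docs.map(tokenize);
  const totalLen = tokenized.reduce((sum, d) => sum + d.length, 0);
  const avgLen = totalLen / Math.max(docs.length, 1);
  // Deduplicate query terms so each term contributes once.
  const terms = tokenize(query).filter((t, i, all) => all.indexOf(t) === i);

  return tokenized.map((doc) => {
    let score = 0;
    for (const term of terms) {
      // Term frequency in this document.
      const tf = doc.filter((t) => t === term).length;
      if (tf === 0) continue;
      // Document frequency: how many documents contain the term.
      const df = tokenized.filter((d) => d.indexOf(term) !== -1).length;
      // Non-negative IDF variant.
      const idf = Math.log(1 + (docs.length - df + 0.5) / (df + 0.5));
      // Length normalization: longer-than-average documents are penalized.
      const norm = 1 - b + b * (doc.length / avgLen);
      score += (idf * (tf * (k1 + 1))) / (tf + k1 * norm);
    }
    return score;
  });
}
```

With equal term counts, a shorter document scores higher than a longer one, and a document that contains none of the query terms scores 0.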
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| dataset | Yes | Dataset ID (e.g., 'stanfordnlp/imdb') | |
| config | Yes | Configuration name | |
| split | Yes | Split name (train, test, validation) | |
| query | Yes | Text to search for | |
| offset | No | Result offset for pagination | 0 |
| length | No | Number of results (max: 100) | 100 |
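The defaults and bounds in the table can be sketched without any dependency, as below. This is only an illustration of the documented constraints: `normalizeSearchArgs` and `pageWindows` are hypothetical helpers, the `plain_text` config value is an assumed example, and the tool itself validates its input with a Zod schema.

```typescript
// Argument shape taken from the Input Schema table.
interface SearchArgs {
  dataset: string;
  config: string;
  split: string;
  query: string;
  offset?: number;
  length?: number;
}

// Hypothetical helper: applies the documented defaults (offset 0, length 100)
// and enforces the documented bounds (offset >= 0, 1 <= length <= 100).
function normalizeSearchArgs(args: SearchArgs): Required<SearchArgs> {
  const offset = args.offset ?? 0;
  const length = args.length ?? 100;
  if (!Number.isInteger(offset) || offset < 0) {
    throw new RangeError("offset must be an integer >= 0");
  }
  if (!Number.isInteger(length) || length < 1 || length > 100) {
    throw new RangeError("length must be an integer between 1 and 100");
  }
  return { ...args, offset, length };
}

// Hypothetical helper: the offset/length pairs needed to read the first
// `wanted` results, given that one call returns at most 100 of them.
function pageWindows(wanted: number, pageSize = 100): Array<{ offset: number; length: number }> {
  const windows: Array<{ offset: number; length: number }> = [];
  for (let offset = 0; offset < wanted; offset += pageSize) {
    windows.push({ offset, length: Math.min(pageSize, wanted - offset) });
  }
  return windows;
}

// Example: only the required fields are given, so the defaults apply.
const firstPage = normalizeSearchArgs({
  dataset: "stanfordnlp/imdb",
  config: "plain_text", // assumed config name for this example
  split: "train",
  query: "masterpiece",
});
console.log(firstPage.offset, firstPage.length); // 0 100
```

Because `length` is capped at 100, reading more results than that takes several calls with increasing `offset` values, which is what `pageWindows` enumerates.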
Implementation Reference
- src/tools/search-dataset.ts:44-62 (handler) The handler function for the 'search_dataset' tool, which calls the search endpoint.

      async ({ dataset, config, split, query, offset, length }) => {
        const data = await fetchDatasetViewer<SearchResponse>("/search", {
          dataset,
          config,
          split,
          query,
          offset: offset ?? 0,
          length: length ?? 100,
        });
        return {
          content: [
            {
              type: "text" as const,
              text: JSON.stringify(data, null, 2),
            },
          ],
        };
      }

- src/tools/search-dataset.ts:25-43 (schema) Zod schema for validating the input arguments of the 'search_dataset' tool.

      {
        dataset: z.string().describe("Dataset ID (e.g., 'stanfordnlp/imdb')"),
        config: z.string().describe("Configuration name"),
        split: z.string().describe("Split name (train, test, validation)"),
        query: z.string().describe("Text to search for"),
        offset: z
          .number()
          .int()
          .min(0)
          .optional()
          .describe("Result offset for pagination (default: 0)"),
        length: z
          .number()
          .int()
          .min(1)
          .max(100)
          .optional()
          .describe("Number of results (default: 100, max: 100)"),
      },

- src/tools/search-dataset.ts:21-64 (registration) Registration function for the 'search_dataset' tool within the MCP server.

      export function registerSearchDataset(server: McpServer) {
        server.tool(
          "search_dataset",
          "Full-text search within a dataset split using BM25 ranking",
          {
            dataset: z.string().describe("Dataset ID (e.g., 'stanfordnlp/imdb')"),
            config: z.string().describe("Configuration name"),
            split: z.string().describe("Split name (train, test, validation)"),
            query: z.string().describe("Text to search for"),
            offset: z
              .number()
              .int()
              .min(0)
              .optional()
              .describe("Result offset for pagination (default: 0)"),
            length: z
              .number()
              .int()
              .min(1)
              .max(100)
              .optional()
              .describe("Number of results (default: 100, max: 100)"),
          },
          async ({ dataset, config, split, query, offset, length }) => {
            const data = await fetchDatasetViewer<SearchResponse>("/search", {
              dataset,
              config,
              split,
              query,
              offset: offset ?? 0,
              length: length ?? 100,
            });
            return {
              content: [
                {
                  type: "text" as const,
                  text: JSON.stringify(data, null, 2),
                },
              ],
            };
          }
        );
      }
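The `fetchDatasetViewer` helper called by the handler is not shown in this reference. A minimal sketch of what such a helper might do is given below, assuming it issues a GET request against the public Hugging Face dataset viewer API. The base URL constant and the names `buildDatasetViewerUrl` and `fetchDatasetViewerSketch` are assumptions for illustration; the real helper may differ, for example in headers, authentication, or error handling.

```typescript
// Assumed base URL of the Hugging Face dataset viewer API; the real
// fetchDatasetViewer helper is not shown in this reference and may differ.
const DATASET_VIEWER_BASE = "https://datasets-server.huggingface.co";

type QueryParams = Record<string, string | number | undefined>;

// Hypothetical helper: builds the GET URL for an endpoint such as "/search",
// skipping parameters that are undefined.
function buildDatasetViewerUrl(endpoint: string, params: QueryParams): string {
  const url = new URL(endpoint, DATASET_VIEWER_BASE);
  Object.keys(params).forEach((key) => {
    const value = params[key];
    if (value !== undefined) {
      url.searchParams.set(key, String(value));
    }
  });
  return url.toString();
}

// Hypothetical stand-in for fetchDatasetViewer<T>: performs the request and
// parses the JSON body, throwing on a non-2xx status.
async function fetchDatasetViewerSketch<T>(endpoint: string, params: QueryParams): Promise<T> {
  const response = await fetch(buildDatasetViewerUrl(endpoint, params));
  if (!response.ok) {
    throw new Error(`Dataset viewer request failed: ${response.status} ${response.statusText}`);
  }
  return (await response.json()) as T;
}
```

Keeping URL construction in its own function makes the query-string encoding easy to check without touching the network.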