HF Dataset MCP

by cfahlgren1

search_dataset

Search within a split of a Hugging Face Hub dataset using BM25 ranking to find specific content for analysis and exploration.

Instructions

Full-text search within a dataset split using BM25 ranking
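
To give a sense of what BM25 ranking means, here is a minimal sketch of the standard Okapi BM25 score over a tiny in-memory corpus. It is an illustration only, not the search backend's code: the function name `bm25Scores`, the whitespace-free pre-tokenized input, and the parameter values `k1 = 1.2` and `b = 0.75` are assumptions chosen for the example.

```typescript
// Illustrative Okapi BM25 scorer (hypothetical helper, not part of this server).
type Doc = string[]; // a pre-tokenized document

function bm25Scores(docs: Doc[], query: string[], k1 = 1.2, b = 0.75): number[] {
  const n = docs.length;
  // Average document length, used to normalize term frequency by length.
  const avgLen = docs.reduce((sum, d) => sum + d.length, 0) / n;

  return docs.map((doc) => {
    let score = 0;
    for (const term of query) {
      const tf = doc.filter((t) => t === term).length; // term frequency in this doc
      if (tf === 0) continue;
      const df = docs.filter((d) => d.includes(term)).length; // docs containing the term
      // Rarer terms get a higher inverse document frequency weight.
      const idf = Math.log(1 + (n - df + 0.5) / (df + 0.5));
      score += (idf * tf * (k1 + 1)) / (tf + k1 * (1 - b + (b * doc.length) / avgLen));
    }
    return score;
  });
}

// Usage: documents that do not contain any query term score 0.
const scores = bm25Scores(
  [["great", "movie"], ["bad", "movie"], ["great", "great", "acting"]],
  ["great"]
);
```

Rows that mention the query terms more often, relative to their length, rank higher; rows without any query term score zero.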

Input Schema

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| dataset | Yes | Dataset ID (e.g., 'stanfordnlp/imdb') | |
| config | Yes | Configuration name | |
| split | Yes | Split name (train, test, validation) | |
| query | Yes | Text to search for | |
| offset | No | Result offset for pagination | 0 |
| length | No | Number of results (max: 100) | 100 |
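
The constraints in the table can be sketched as a dependency-free helper that applies the documented defaults and bounds. This is a hedged illustration: `SearchArgs` and `normalizeSearchArgs` are hypothetical names that are not part of the server, which performs the equivalent validation with a Zod schema.

```typescript
// Hypothetical argument shape mirroring the input schema table.
interface SearchArgs {
  dataset: string;
  config: string;
  split: string;
  query: string;
  offset?: number;
  length?: number;
}

// Apply the documented defaults (offset 0, length 100) and enforce the bounds.
function normalizeSearchArgs(args: SearchArgs): Required<SearchArgs> {
  const offset = args.offset ?? 0;
  const length = args.length ?? 100;

  if (!Number.isInteger(offset) || offset < 0) {
    throw new RangeError("offset must be an integer >= 0");
  }
  if (!Number.isInteger(length) || length < 1 || length > 100) {
    throw new RangeError("length must be an integer between 1 and 100");
  }
  return { ...args, offset, length };
}

// Usage: omitted optional fields fall back to their defaults.
const normalized = normalizeSearchArgs({
  dataset: "stanfordnlp/imdb",
  config: "plain_text", // illustrative configuration name
  split: "train",
  query: "great movie",
});
```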

Implementation Reference

  • The handler function for the 'search_dataset' tool which calls the search endpoint.
    async ({ dataset, config, split, query, offset, length }) => {
      const data = await fetchDatasetViewer<SearchResponse>("/search", {
        dataset,
        config,
        split,
        query,
        offset: offset ?? 0,
        length: length ?? 100,
      });
    
      return {
        content: [
          {
            type: "text" as const,
            text: JSON.stringify(data, null, 2),
          },
        ],
      };
    }
  • Zod schema for validating the input arguments of the 'search_dataset' tool.
    {
      dataset: z.string().describe("Dataset ID (e.g., 'stanfordnlp/imdb')"),
      config: z.string().describe("Configuration name"),
      split: z.string().describe("Split name (train, test, validation)"),
      query: z.string().describe("Text to search for"),
      offset: z
        .number()
        .int()
        .min(0)
        .optional()
        .describe("Result offset for pagination (default: 0)"),
      length: z
        .number()
        .int()
        .min(1)
        .max(100)
        .optional()
        .describe("Number of results (default: 100, max: 100)"),
    },
  • Registration function for the 'search_dataset' tool within the MCP server.
    export function registerSearchDataset(server: McpServer) {
      server.tool(
        "search_dataset",
        "Full-text search within a dataset split using BM25 ranking",
        {
          dataset: z.string().describe("Dataset ID (e.g., 'stanfordnlp/imdb')"),
          config: z.string().describe("Configuration name"),
          split: z.string().describe("Split name (train, test, validation)"),
          query: z.string().describe("Text to search for"),
          offset: z
            .number()
            .int()
            .min(0)
            .optional()
            .describe("Result offset for pagination (default: 0)"),
          length: z
            .number()
            .int()
            .min(1)
            .max(100)
            .optional()
            .describe("Number of results (default: 100, max: 100)"),
        },
        async ({ dataset, config, split, query, offset, length }) => {
          const data = await fetchDatasetViewer<SearchResponse>("/search", {
            dataset,
            config,
            split,
            query,
            offset: offset ?? 0,
            length: length ?? 100,
          });
    
          return {
            content: [
              {
                type: "text" as const,
                text: JSON.stringify(data, null, 2),
              },
            ],
          };
        }
      );
    }
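
The listing does not show `fetchDatasetViewer`, so the following is only a sketch of how the `/search` request URL might be assembled. It assumes the helper targets the public Hugging Face dataset viewer API at `https://datasets-server.huggingface.co`; the base URL constant and the `buildSearchUrl` function are hypothetical and not taken from the server's source.

```typescript
// Assumed base URL of the dataset viewer API (not confirmed by the listing above).
const DATASET_VIEWER_BASE = "https://datasets-server.huggingface.co";

// Hypothetical parameter shape matching what the handler passes to "/search".
interface SearchParams {
  dataset: string;
  config: string;
  split: string;
  query: string;
  offset: number;
  length: number;
}

// Build the request URL, letting URLSearchParams handle percent-encoding.
function buildSearchUrl(params: SearchParams): string {
  const url = new URL("/search", DATASET_VIEWER_BASE);
  for (const [key, value] of Object.entries(params)) {
    url.searchParams.set(key, String(value));
  }
  return url.toString();
}

// Usage: request the second page of 100 results for a query.
const secondPage = buildSearchUrl({
  dataset: "stanfordnlp/imdb",
  config: "plain_text", // illustrative configuration name
  split: "train",
  query: "great movie",
  offset: 100,
  length: 100,
});
```

To page through results, a caller would keep `length` fixed and advance `offset` by `length` on each request.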

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/cfahlgren1/hf-dataset-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.