
HF Dataset MCP

by cfahlgren1

search_dataset

Search datasets from the Hugging Face Hub using BM25 ranking to find specific content within dataset splits for analysis and exploration.

Instructions

Full-text search within a dataset split using BM25 ranking

Input Schema

Name     Required  Description                              Default
dataset  Yes       Dataset ID (e.g., 'stanfordnlp/imdb')    —
config   Yes       Configuration name                       —
split    Yes       Split name (train, test, validation)     —
query    Yes       Text to search for                       —
offset   No        Result offset for pagination             0
length   No        Number of results (max: 100)             100
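Read together, the schema implies a minimal invocation like the sketch below. The values are hypothetical ('plain_text' is an assumed config name for the imdb dataset), and the optional fields fall back to the defaults the handler applies with the `??` operator:

```typescript
// Hypothetical arguments for a search_dataset call.
// Only offset and length are optional; the handler fills in
// their defaults (0 and 100) when they are omitted.
interface SearchArgs {
  dataset: string;
  config: string;
  split: string;
  query: string;
  offset?: number;
  length?: number;
}

const args: SearchArgs = {
  dataset: "stanfordnlp/imdb", // assumed example dataset
  config: "plain_text",        // assumed config name
  split: "train",
  query: "great movie",
};

// Mirror of the handler's default handling:
const resolvedOffset = args.offset ?? 0;
const resolvedLength = args.length ?? 100;
console.log(resolvedOffset, resolvedLength); // 0 100
```

Passing an explicit `offset` in later calls pages through results in fixed-size windows of at most 100 rows.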

Implementation Reference

  • The handler function for the 'search_dataset' tool which calls the search endpoint.
    async ({ dataset, config, split, query, offset, length }) => {
      const data = await fetchDatasetViewer<SearchResponse>("/search", {
        dataset,
        config,
        split,
        query,
        offset: offset ?? 0,
        length: length ?? 100,
      });
    
      return {
        content: [
          {
            type: "text" as const,
            text: JSON.stringify(data, null, 2),
          },
        ],
      };
    }
  • Zod schema for validating the input arguments of the 'search_dataset' tool.
    {
      dataset: z.string().describe("Dataset ID (e.g., 'stanfordnlp/imdb')"),
      config: z.string().describe("Configuration name"),
      split: z.string().describe("Split name (train, test, validation)"),
      query: z.string().describe("Text to search for"),
      offset: z
        .number()
        .int()
        .min(0)
        .optional()
        .describe("Result offset for pagination (default: 0)"),
      length: z
        .number()
        .int()
        .min(1)
        .max(100)
        .optional()
        .describe("Number of results (default: 100, max: 100)"),
    },
  • Registration function for the 'search_dataset' tool within the MCP server.
    export function registerSearchDataset(server: McpServer) {
      server.tool(
        "search_dataset",
        "Full-text search within a dataset split using BM25 ranking",
        {
          dataset: z.string().describe("Dataset ID (e.g., 'stanfordnlp/imdb')"),
          config: z.string().describe("Configuration name"),
          split: z.string().describe("Split name (train, test, validation)"),
          query: z.string().describe("Text to search for"),
          offset: z
            .number()
            .int()
            .min(0)
            .optional()
            .describe("Result offset for pagination (default: 0)"),
          length: z
            .number()
            .int()
            .min(1)
            .max(100)
            .optional()
            .describe("Number of results (default: 100, max: 100)"),
        },
        async ({ dataset, config, split, query, offset, length }) => {
          const data = await fetchDatasetViewer<SearchResponse>("/search", {
            dataset,
            config,
            split,
            query,
            offset: offset ?? 0,
            length: length ?? 100,
          });
    
          return {
            content: [
              {
                type: "text" as const,
                text: JSON.stringify(data, null, 2),
              },
            ],
          };
        }
      );
    }
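The `fetchDatasetViewer` helper is referenced above but not shown on this page. Below is a minimal sketch of what such a helper might look like, assuming the public Hugging Face dataset viewer REST API (`https://datasets-server.huggingface.co`) as the backend; the server's actual implementation may differ:

```typescript
// Hypothetical sketch of the fetchDatasetViewer helper referenced above.
const BASE_URL = "https://datasets-server.huggingface.co"; // assumed base URL

// Build the query URL; kept as a separate pure function so it can be
// exercised without a network call.
function buildViewerUrl(
  endpoint: string,
  params: Record<string, string | number>
): string {
  const url = new URL(BASE_URL + endpoint);
  for (const [key, value] of Object.entries(params)) {
    url.searchParams.set(key, String(value));
  }
  return url.toString();
}

async function fetchDatasetViewer<T>(
  endpoint: string,
  params: Record<string, string | number>
): Promise<T> {
  const res = await fetch(buildViewerUrl(endpoint, params));
  if (!res.ok) {
    throw new Error(`Dataset viewer request failed: ${res.status}`);
  }
  return (await res.json()) as T;
}

// Example of the URL the handler's "/search" call would produce:
const exampleUrl = buildViewerUrl("/search", {
  dataset: "stanfordnlp/imdb",
  offset: 0,
});
console.log(exampleUrl);
```

Keeping URL construction separate from the fetch itself also makes it easy to log or cache requests per endpoint.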
Behavior 2/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

The tool has no annotations, and the description discloses only the ranking algorithm (BM25). It omits critical behavioral details: the return format (scores? excerpts? row IDs?), the pagination model (cursor vs. offset), rate limits, and what happens when a query matches no documents.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

A single dense sentence front-loads the operation type and scope. However, with zero annotations and six parameters, this extreme brevity becomes under-specification rather than efficient communication.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness 2/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

For a six-parameter search tool with no output schema and multiple related siblings, one sentence is insufficient. Missing: the return value structure, an explanation of the pagination strategy, and differentiation from 'search_datasets' (plural), which appears in the same toolkit.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

With 100% schema description coverage, the parameters are self-documenting. The description adds semantic grouping by stating 'dataset split' (binding the dataset, config, and split trio) and 'Full-text search' (relating to query), meeting the baseline.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description specifies the exact action ('Full-text search'), target resource ('dataset split'), and ranking algorithm ('BM25'), clearly distinguishing it from the sibling 'search_datasets' (global vs. split-scoped) and 'filter_rows' (BM25 text ranking vs. filtering).

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines 2/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

There is no guidance on when to use full-text search versus 'filter_rows' (structured filtering) or 'get_rows' (direct access), nor on prerequisites such as obtaining valid dataset/config/split combinations from 'list_splits' first.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
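The prerequisite workflow described above can be sketched as a simple call plan. This is a hypothetical illustration: `callTool` names and the 'default' config are assumptions, and a real agent would pick the config and split from the `list_splits` response rather than hard-coding them:

```typescript
// Hypothetical agent workflow: discover valid config/split
// combinations before running the BM25 search.
type ToolCall = { name: string; arguments: Record<string, unknown> };

function planSearch(dataset: string, query: string): ToolCall[] {
  return [
    // 1. Find valid config/split combinations first.
    { name: "list_splits", arguments: { dataset } },
    // 2. Then search a concrete split ('default'/'train' are placeholders;
    //    a real agent would take them from the list_splits result).
    {
      name: "search_dataset",
      arguments: { dataset, config: "default", split: "train", query },
    },
  ];
}

const plan = planSearch("stanfordnlp/imdb", "great movie");
console.log(plan.map((c) => c.name)); // list_splits, then search_dataset
```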


MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/cfahlgren1/hf-dataset-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.