
HF Dataset MCP

by cfahlgren1

search_dataset

Search datasets from the Hugging Face Hub using BM25 ranking to find specific content within dataset splits for analysis and exploration.

Instructions

Full-text search within a dataset split using BM25 ranking

Input Schema

Name     Required  Description                              Default
dataset  Yes       Dataset ID (e.g., 'stanfordnlp/imdb')    —
config   Yes       Configuration name                       —
split    Yes       Split name (train, test, validation)     —
query    Yes       Text to search for                       —
offset   No        Result offset for pagination             0
length   No        Number of results (max: 100)             100
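Read together, the schema implies a minimal invocation like the sketch below. The values are hypothetical ('plain_text' is an assumed config name for the imdb dataset), and the optional fields fall back to the defaults the handler applies with the `??` operator:

```typescript
// Hypothetical arguments for a search_dataset call.
// Only offset and length are optional; the handler fills in
// their defaults (0 and 100) when they are omitted.
interface SearchArgs {
  dataset: string;
  config: string;
  split: string;
  query: string;
  offset?: number;
  length?: number;
}

const args: SearchArgs = {
  dataset: "stanfordnlp/imdb", // assumed example dataset
  config: "plain_text",        // assumed config name
  split: "train",
  query: "great movie",
};

// Mirror of the handler's default handling:
const resolvedOffset = args.offset ?? 0;
const resolvedLength = args.length ?? 100;
console.log(resolvedOffset, resolvedLength); // 0 100
```

Passing an explicit `offset` in later calls pages through results in fixed-size windows of at most 100 rows.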

Implementation Reference

  • The handler function for the 'search_dataset' tool which calls the search endpoint.
    async ({ dataset, config, split, query, offset, length }) => {
      const data = await fetchDatasetViewer<SearchResponse>("/search", {
        dataset,
        config,
        split,
        query,
        offset: offset ?? 0,
        length: length ?? 100,
      });
    
      return {
        content: [
          {
            type: "text" as const,
            text: JSON.stringify(data, null, 2),
          },
        ],
      };
    }
  • Zod schema for validating the input arguments of the 'search_dataset' tool.
    {
      dataset: z.string().describe("Dataset ID (e.g., 'stanfordnlp/imdb')"),
      config: z.string().describe("Configuration name"),
      split: z.string().describe("Split name (train, test, validation)"),
      query: z.string().describe("Text to search for"),
      offset: z
        .number()
        .int()
        .min(0)
        .optional()
        .describe("Result offset for pagination (default: 0)"),
      length: z
        .number()
        .int()
        .min(1)
        .max(100)
        .optional()
        .describe("Number of results (default: 100, max: 100)"),
    },
  • Registration function for the 'search_dataset' tool within the MCP server.
    export function registerSearchDataset(server: McpServer) {
      server.tool(
        "search_dataset",
        "Full-text search within a dataset split using BM25 ranking",
        {
          dataset: z.string().describe("Dataset ID (e.g., 'stanfordnlp/imdb')"),
          config: z.string().describe("Configuration name"),
          split: z.string().describe("Split name (train, test, validation)"),
          query: z.string().describe("Text to search for"),
          offset: z
            .number()
            .int()
            .min(0)
            .optional()
            .describe("Result offset for pagination (default: 0)"),
          length: z
            .number()
            .int()
            .min(1)
            .max(100)
            .optional()
            .describe("Number of results (default: 100, max: 100)"),
        },
        async ({ dataset, config, split, query, offset, length }) => {
          const data = await fetchDatasetViewer<SearchResponse>("/search", {
            dataset,
            config,
            split,
            query,
            offset: offset ?? 0,
            length: length ?? 100,
          });
    
          return {
            content: [
              {
                type: "text" as const,
                text: JSON.stringify(data, null, 2),
              },
            ],
          };
        }
      );
    }
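The `fetchDatasetViewer` helper is referenced above but not shown on this page. Below is a minimal sketch of what such a helper might look like, assuming the public Hugging Face dataset viewer REST API (`https://datasets-server.huggingface.co`) as the backend; the server's actual implementation may differ:

```typescript
// Hypothetical sketch of the fetchDatasetViewer helper referenced above.
const BASE_URL = "https://datasets-server.huggingface.co"; // assumed base URL

// Build the query URL; kept as a separate pure function so it can be
// exercised without a network call.
function buildViewerUrl(
  endpoint: string,
  params: Record<string, string | number>
): string {
  const url = new URL(BASE_URL + endpoint);
  for (const [key, value] of Object.entries(params)) {
    url.searchParams.set(key, String(value));
  }
  return url.toString();
}

async function fetchDatasetViewer<T>(
  endpoint: string,
  params: Record<string, string | number>
): Promise<T> {
  const res = await fetch(buildViewerUrl(endpoint, params));
  if (!res.ok) {
    throw new Error(`Dataset viewer request failed: ${res.status}`);
  }
  return (await res.json()) as T;
}

// Example of the URL the handler's "/search" call would produce:
const exampleUrl = buildViewerUrl("/search", {
  dataset: "stanfordnlp/imdb",
  offset: 0,
});
console.log(exampleUrl);
```

Keeping URL construction separate from the fetch itself also makes it easy to log or cache requests per endpoint.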
Behavior 2/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

The tool has no annotations, and the description discloses only the ranking algorithm (BM25). It omits critical behavioral details: the return format (scores? excerpts? row IDs?), the pagination model (cursor vs. offset), rate limits, and what happens when a query matches no documents.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

A single dense sentence front-loads the operation type and scope. However, with zero annotations and six parameters, this extreme brevity becomes under-specification rather than efficient communication.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness 2/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

For a six-parameter search tool with no output schema and multiple related siblings, one sentence is insufficient. Missing: the return value structure, an explanation of the pagination strategy, and differentiation from 'search_datasets' (plural), which appears in the same toolkit.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

With 100% schema description coverage, the parameters are self-documenting. The description adds semantic grouping by stating 'dataset split' (binding the dataset, config, and split trio) and 'Full-text search' (relating to query), meeting the baseline.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description specifies the exact action ('Full-text search'), target resource ('dataset split'), and ranking algorithm ('BM25'), clearly distinguishing it from the sibling 'search_datasets' (global vs. split-scoped) and 'filter_rows' (BM25 text ranking vs. filtering).

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines 2/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

There is no guidance on when to use full-text search versus 'filter_rows' (structured filtering) or 'get_rows' (direct access), nor on prerequisites such as obtaining valid dataset/config/split combinations from 'list_splits' first.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
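The prerequisite workflow described above can be sketched as a simple call plan. This is a hypothetical illustration: `callTool` names and the 'default' config are assumptions, and a real agent would pick the config and split from the `list_splits` response rather than hard-coding them:

```typescript
// Hypothetical agent workflow: discover valid config/split
// combinations before running the BM25 search.
type ToolCall = { name: string; arguments: Record<string, unknown> };

function planSearch(dataset: string, query: string): ToolCall[] {
  return [
    // 1. Find valid config/split combinations first.
    { name: "list_splits", arguments: { dataset } },
    // 2. Then search a concrete split ('default'/'train' are placeholders;
    //    a real agent would take them from the list_splits result).
    {
      name: "search_dataset",
      arguments: { dataset, config: "default", split: "train", query },
    },
  ];
}

const plan = planSearch("stanfordnlp/imdb", "great movie");
console.log(plan.map((c) => c.name)); // list_splits, then search_dataset
```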


MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/cfahlgren1/hf-dataset-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.