HF Dataset MCP

by cfahlgren1

filter_rows

Filter Hugging Face dataset rows using SQL-like WHERE conditions to extract specific data based on criteria like age, location, or other column values.

Instructions

Filter dataset rows using SQL-like WHERE conditions

Input Schema

Name | Required | Description | Default
dataset | Yes | Dataset ID (e.g., 'stanfordnlp/imdb') |
config | Yes | Configuration name |
split | Yes | Split name (train, test, validation) |
where | Yes | Filter condition (e.g., "age">30 AND "city"='Paris'). Column names in double quotes, strings in single quotes. |
orderby | No | Sort column and direction (e.g., "score" DESC) |
offset | No | Result offset (default: 0) |
length | No | Number of results (default: 100, max: 100) |
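
The quoting rules for `where` are easy to get wrong when a condition is assembled by hand. The following is a minimal sketch of how a caller might compose the string. The helpers `quoteColumn`, `quoteValue`, and `buildWhere` are hypothetical and not part of the server, and escaping embedded quotes by doubling them is standard SQL practice that the backend is assumed, not confirmed, to accept.

```typescript
// Hypothetical helpers for composing a `where` argument; none of these exist in the server.

/** Wrap a column name in double quotes, doubling any embedded double quote (assumed escaping). */
function quoteColumn(name: string): string {
  return `"${name.replace(/"/g, '""')}"`;
}

/** Render a value: numbers as-is, strings in single quotes with embedded quotes doubled. */
function quoteValue(value: string | number): string {
  if (typeof value === "number") {
    if (!Number.isFinite(value)) throw new RangeError("value must be a finite number");
    return String(value);
  }
  return `'${value.replace(/'/g, "''")}'`;
}

type Comparison = {
  column: string;
  op: "=" | "<>" | "<" | "<=" | ">" | ">=";
  value: string | number;
};

/** Join comparisons with AND; `where` is a required parameter, so an empty list is rejected. */
function buildWhere(conditions: Comparison[]): string {
  if (conditions.length === 0) throw new Error("at least one condition is required");
  return conditions
    .map((c) => `${quoteColumn(c.column)}${c.op}${quoteValue(c.value)}`)
    .join(" AND ");
}

// Reproduces the example from the schema description: "age">30 AND "city"='Paris'
const where = buildWhere([
  { column: "age", op: ">", value: 30 },
  { column: "city", op: "=", value: "Paris" },
]);
```

Building the string from typed parts keeps column names and string literals in the correct quote style, which is the main source of malformed conditions.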

Implementation Reference

  • The tool handler that calls fetchDatasetViewer with the provided parameters to fetch and return filtered dataset rows.
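    async ({ dataset, config, split, where, orderby, offset, length }) => {
      // Forward the tool arguments to the "/filter" endpoint,
      // applying the documented defaults when offset or length is omitted.
      const data = await fetchDatasetViewer<FilterResponse>("/filter", {
        dataset,
        config,
        split,
        where,
        orderby,
        offset: offset ?? 0,
        length: length ?? 100,
      });

      // Return the response as pretty-printed JSON in a single text content item.
      return {
        content: [
          {
            type: "text" as const,
            text: JSON.stringify(data, null, 2),
          },
        ],
      };
    }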
  • Registration of the 'filter_rows' tool within the McpServer, defining input parameters using Zod.
    server.tool(
      "filter_rows",
      "Filter dataset rows using SQL-like WHERE conditions",
      {
        dataset: z.string().describe("Dataset ID (e.g., 'stanfordnlp/imdb')"),
        config: z.string().describe("Configuration name"),
        split: z.string().describe("Split name (train, test, validation)"),
        where: z
          .string()
          .describe(
            'Filter condition (e.g., "age">30 AND "city"=\'Paris\'). Column names in double quotes, strings in single quotes.'
          ),
        orderby: z
          .string()
          .optional()
          .describe('Sort column and direction (e.g., "score" DESC)'),
        offset: z
          .number()
          .int()
          .min(0)
          .optional()
          .describe("Result offset (default: 0)"),
        length: z
          .number()
          .int()
          .min(1)
          .max(100)
          .optional()
          .describe("Number of results (default: 100, max: 100)"),
      },
      async ({ dataset, config, split, where, orderby, offset, length }) => {
        const data = await fetchDatasetViewer<FilterResponse>("/filter", {
          dataset,
          config,
          split,
          where,
          orderby,
          offset: offset ?? 0,
          length: length ?? 100,
        });
    
        return {
          content: [
            {
              type: "text" as const,
              text: JSON.stringify(data, null, 2),
            },
          ],
        };
      }
    );
  • Schema definition for the response data returned by the filter_rows tool.
    interface FilterResponse {
      features: Array<{
        feature_idx: number;
        name: string;
        type: Record<string, unknown>;
      }>;
      rows: Array<{
        row_idx: number;
        row: Record<string, unknown>;
        truncated_cells: string[];
      }>;
      num_rows_total: number;
      num_rows_per_page: number;
      partial: boolean;
    }
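
The excerpts above do not show `fetchDatasetViewer` itself. As a rough sketch of what the request construction might look like, the example below builds a query URL for the `/filter` path. The base URL is an assumption (the public Hugging Face dataset viewer API), `buildViewerUrl` is a hypothetical name, and the dataset, config, and column values are illustrative only.

```typescript
// Assumed base URL of the Hugging Face dataset viewer API; the source does not show it.
const BASE_URL = "https://datasets-server.huggingface.co";

type ViewerParams = Record<string, string | number | undefined>;

/** Hypothetical helper: build a request URL, skipping parameters that are undefined. */
function buildViewerUrl(path: string, params: ViewerParams): string {
  const url = new URL(path, BASE_URL);
  for (const [key, value] of Object.entries(params)) {
    if (value !== undefined) {
      // searchParams.set percent-encodes quotes, slashes, and operators in the value.
      url.searchParams.set(key, String(value));
    }
  }
  return url.toString();
}

// Illustrative arguments; the config and column names are assumptions about this dataset.
const exampleUrl = buildViewerUrl("/filter", {
  dataset: "stanfordnlp/imdb",
  config: "plain_text",
  split: "train",
  where: '"label"=1',
  orderby: undefined, // optional parameter, omitted from the query string
  offset: 0,
  length: 100,
});
```

Letting `URLSearchParams` handle the encoding avoids hand-escaping the double and single quotes that the `where` syntax requires.
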
Behavior: 2/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

No annotations provided, so description carries full burden. Omits critical behavioral details: whether operation is read-only (implied but not confirmed), pagination behavior (offset/length exist in schema but result handling isn't described), maximum result limits, or error behavior for invalid SQL syntax.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
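
The pagination behavior noted as missing can be illustrated with a small sketch. It assumes that `num_rows_total` in the response reports the number of rows matching the filter, which the source does not confirm, and it uses the schema's `length` maximum of 100. The helper `planPages` is hypothetical.

```typescript
// Upper bound for `length`, taken from the input schema.
const MAX_LENGTH = 100;

interface PageRequest {
  offset: number;
  length: number;
}

/**
 * Hypothetical helper: list the offset/length pairs needed to read every row,
 * given a total row count (assumed to come from `num_rows_total`).
 */
function planPages(numRowsTotal: number, pageSize: number = MAX_LENGTH): PageRequest[] {
  if (!Number.isInteger(numRowsTotal) || numRowsTotal < 0) {
    throw new RangeError("numRowsTotal must be a non-negative integer");
  }
  if (!Number.isInteger(pageSize) || pageSize < 1 || pageSize > MAX_LENGTH) {
    throw new RangeError(`pageSize must be an integer between 1 and ${MAX_LENGTH}`);
  }
  const pages: PageRequest[] = [];
  for (let offset = 0; offset < numRowsTotal; offset += pageSize) {
    // The final page may be shorter than pageSize.
    pages.push({ offset, length: Math.min(pageSize, numRowsTotal - offset) });
  }
  return pages;
}

// 250 matching rows need three calls: offsets 0, 100, 200 with lengths 100, 100, 50.
const pages = planPages(250);
```

An agent would issue one `filter_rows` call per entry, reusing the same `dataset`, `config`, `split`, and `where` values.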

Conciseness: 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Extremely concise (7 words) and front-loaded with the core action. No wasted sentences. However, brevity sacrifices structural elements like prerequisite context or usage conditions that would help an agent navigate the 7-parameter interface.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 2/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Inadequate for a 7-parameter filtering tool with no output schema. Missing: return value structure, relationship between required parameters (dataset/config/split), error handling for malformed SQL, and guidance on result cardinality. Schema documents parameters but description doesn't integrate them into operational context.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema coverage is 100%, so baseline is 3. Description adds 'SQL-like' framing which reinforces the syntax requirements for the 'where' parameter, but adds no context for the dataset/config/split relationship or pagination semantics beyond what the schema explicitly documents.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 4/5

Does the description clearly state what the tool does and how it differs from similar tools?

States specific verb ('Filter'), resource ('dataset rows'), and method ('SQL-like WHERE conditions'). The 'SQL-like' hint helps distinguish from sibling 'get_rows' (likely raw retrieval) and 'search_dataset' (likely text search), though it doesn't explicitly articulate these distinctions.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 2/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

Provides no guidance on when to use this versus siblings like 'get_rows' or 'search_dataset'. Does not mention prerequisites (e.g., needing dataset/config/split identifiers from other tools) or when filtering is preferable to retrieving full datasets.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/cfahlgren1/hf-dataset-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.