
HF Dataset MCP

by cfahlgren1

get_dataset_size

Retrieve row counts and byte sizes for all configurations and splits of a Hugging Face dataset to analyze its structure and storage requirements.

Instructions

Get row counts and byte sizes for all configs and splits

Input Schema

Name | Required | Description | Default
dataset | Yes | Dataset ID (e.g., 'stanfordnlp/imdb') |
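
To make the input concrete, here is a minimal sketch of how a `dataset` value might be turned into a `/size` request URL. It assumes the public Hugging Face dataset viewer API at `https://datasets-server.huggingface.co`; the base URL constant and the `buildViewerUrl` helper are hypothetical names for illustration, and the project's own `fetchDatasetViewer` may be implemented differently.

```typescript
// Assumption: base URL of the public Hugging Face dataset viewer API.
const DATASET_VIEWER_BASE = "https://datasets-server.huggingface.co";

// Hypothetical helper (not the project's fetchDatasetViewer):
// joins an endpoint path with query parameters into a full URL.
function buildViewerUrl(path: string, params: Record<string, string>): string {
  const url = new URL(path, DATASET_VIEWER_BASE);
  for (const [key, value] of Object.entries(params)) {
    // searchParams.set percent-encodes the value, including the "/" in a dataset ID.
    url.searchParams.set(key, value);
  }
  return url.toString();
}

const sizeUrl = buildViewerUrl("/size", { dataset: "stanfordnlp/imdb" });
console.log(sizeUrl);
```

The slash in the dataset ID is percent-encoded by `URLSearchParams`, so the ID travels as a single query value.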

Implementation Reference

  • The handler function that executes the "get_dataset_size" tool logic.
    async ({ dataset }) => {
      const data = await fetchDatasetViewer<SizeResponse>("/size", {
        dataset,
      });
    
      return {
        content: [
          {
            type: "text" as const,
            text: JSON.stringify(data, null, 2),
          },
        ],
      };
    }
  • Interface defining the structure of the dataset size response.
    interface SizeResponse {
      size: {
        dataset: {
          num_bytes_original_files?: number;
          num_bytes_parquet_files?: number;
          num_bytes_memory?: number;
          num_rows?: number;
        };
        configs: Array<{
          dataset: string;
          config: string;
          num_bytes_original_files?: number;
          num_bytes_parquet_files?: number;
          num_bytes_memory?: number;
          num_rows?: number;
        }>;
        splits: Array<{
          dataset: string;
          config: string;
          split: string;
          num_bytes_original_files?: number;
          num_bytes_parquet_files?: number;
          num_bytes_memory?: number;
          num_rows?: number;
        }>;
      };
      pending: unknown[];
      failed: unknown[];
      partial: boolean;
    }
  • Registration function for the "get_dataset_size" tool.
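    // Imports this snippet relies on (MCP TypeScript SDK and zod).
    // fetchDatasetViewer and SizeResponse are defined elsewhere in the project.
    import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
    import { z } from "zod";

    export function registerGetDatasetSize(server: McpServer) {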
      server.tool(
        "get_dataset_size",
        "Get row counts and byte sizes for all configs and splits",
        {
          dataset: z.string().describe("Dataset ID (e.g., 'stanfordnlp/imdb')"),
        },
        async ({ dataset }) => {
          const data = await fetchDatasetViewer<SizeResponse>("/size", {
            dataset,
          });
    
          return {
            content: [
              {
                type: "text" as const,
                text: JSON.stringify(data, null, 2),
              },
            ],
          };
        }
      );
    }
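
The handler returns the whole response as pretty-printed JSON text, so a caller has to parse it before using the numbers. The sketch below shows one way that might look: it parses the text and prints one line per split, falling back to "unknown" for the optional fields. The `summarizeSizeText` helper, the trimmed-down interfaces, and the sample payload are all hypothetical, and the numbers are made up rather than real dataset statistics.

```typescript
// Trimmed-down local copy of the per-split shape from SizeResponse.
interface SplitSize {
  dataset: string;
  config: string;
  split: string;
  num_bytes_parquet_files?: number;
  num_rows?: number;
}

interface SizeSummaryInput {
  size: { splits: SplitSize[] };
  partial: boolean;
}

// Hypothetical helper: parse the tool's text payload and describe each split.
function summarizeSizeText(text: string): string[] {
  const data = JSON.parse(text) as SizeSummaryInput;
  const lines = data.size.splits.map((s) => {
    // Both fields are optional in the interface, so fall back to a placeholder.
    const rows = s.num_rows ?? "unknown";
    const bytes = s.num_bytes_parquet_files ?? "unknown";
    return `${s.config}/${s.split}: ${rows} rows, ${bytes} parquet bytes`;
  });
  if (data.partial) {
    lines.push("note: response is marked partial");
  }
  return lines;
}

// Illustrative payload with made-up numbers, shaped like the tool's text output.
const sampleText = JSON.stringify({
  size: {
    splits: [
      { dataset: "example/demo", config: "default", split: "train", num_rows: 100, num_bytes_parquet_files: 2048 },
      { dataset: "example/demo", config: "default", split: "test" },
    ],
  },
  partial: false,
});

console.log(summarizeSizeText(sampleText).join("\n"));
```

Treating the optional fields defensively matters here because every count and byte size in `SizeResponse` is declared with `?`.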

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/cfahlgren1/hf-dataset-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.