
Hugging Face Hub MCP Server

by michaelwaves

hf_get_dataset_parquet

Retrieve auto-converted parquet files for a specific dataset, subset, or split from the Hugging Face Hub. Access structured data files efficiently for machine learning workflows.

Instructions

Get the list of auto-converted parquet files for a dataset. Can specify subset (config) and split to get specific files.

Input Schema

Name     Required  Description
repo_id  Yes       Dataset repository ID
subset   No        Optional dataset subset/config name
split    No        Optional dataset split (train, test, validation, etc.)
n        No        Optional shard number to get the nth parquet file
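As an illustrative sketch, the following TypeScript shows argument payloads that satisfy this schema. The dataset name "squad" and config name "plain_text" are example values chosen for illustration, not part of the tool's documentation.

```typescript
// Shape implied by the input schema: only repo_id is required.
interface DatasetParquetArgs {
  repo_id: string;
  subset?: string;
  split?: string;
  n?: number;
}

// Minimal call: list every auto-converted parquet file for the dataset.
const listAll: DatasetParquetArgs = { repo_id: "squad" };

// Narrowed call: fetch the first parquet shard of the train split.
const firstTrainShard: DatasetParquetArgs = {
  repo_id: "squad",
  subset: "plain_text",
  split: "train",
  n: 0,
};
```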

Implementation Reference

  • Handler function that validates arguments with isDatasetParquetArgs, delegates to HuggingFaceClient.getDatasetParquet to fetch parquet file information, and formats the result as an MCP CallToolResult.
    export async function handleGetDatasetParquet(client: HuggingFaceClient, args: unknown): Promise<CallToolResult> {
        try {
            if (!isDatasetParquetArgs(args)) {
                throw new Error("Invalid arguments for hf_get_dataset_parquet");
            }
    
            const { repo_id, subset, split, n } = args;
            const results = await client.getDatasetParquet(repo_id, subset, split, n);
            
            return {
                content: [{ type: "text", text: results }],
                isError: false,
            };
        } catch (error) {
            return {
                content: [
                    {
                        type: "text",
                        text: `Error: ${error instanceof Error ? error.message : String(error)}`,
                    },
                ],
                isError: true,
            };
        }
    }
  • The Tool definition including name, description, and inputSchema for validating tool arguments.
    export const getDatasetParquetToolDefinition: Tool = {
        name: "hf_get_dataset_parquet", 
        description:
            "Get the list of auto-converted parquet files for a dataset. Can specify subset (config) and split to get specific files.",
        inputSchema: {
            type: "object",
            properties: {
                repo_id: {
                    type: "string",
                    description: "Dataset repository ID"
                },
                subset: {
                    type: "string",
                    description: "Optional dataset subset/config name"
                },
                split: {
                    type: "string",
                    description: "Optional dataset split (train, test, validation, etc.)"
                },
                n: {
                    type: "number",
                    description: "Optional shard number to get the nth parquet file"
                }
            },
            required: ["repo_id"]
        }
    };
  • Core utility method in HuggingFaceClient that constructs the HF API endpoint for parquet files and performs the HTTP GET request using axios.
    async getDatasetParquet(repoId: string, subset?: string, split?: string, n?: number): Promise<string> {
        try {
            let endpoint = `/api/datasets/${repoId}/parquet`;
            if (subset) {
                endpoint += `/${subset}`;
                if (split) {
                    endpoint += `/${split}`;
                    if (n !== undefined) {
                        endpoint += `/${n}.parquet`;
                    }
                }
            }
            const response: AxiosResponse = await this.httpClient.get(endpoint);
            return JSON.stringify(response.data, null, 2);
        } catch (error) {
            throw new Error(`Failed to fetch dataset parquet: ${error instanceof Error ? error.message : String(error)}`);
        }
    }
  • src/server.ts:87-88 (registration)
    Registration of the tool handler in the MCP server's CallToolRequestSchema switch statement.
    case 'hf_get_dataset_parquet':
        return handleGetDatasetParquet(this.client, args);
  • Type guard helper function for validating input arguments match DatasetParquetArgs.
    function isDatasetParquetArgs(args: unknown): args is DatasetParquetArgs {
        return (
            typeof args === "object" &&
            args !== null &&
            "repo_id" in args &&
            typeof (args as { repo_id: string }).repo_id === "string"
        );
    }
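Note that the optional parameters in getDatasetParquet only take effect cumulatively: the nested conditionals mean subset must be present for split to be honored, and split must be present for n to be honored. The standalone sketch below mirrors that path-building logic (the function name is ours, for illustration):

```typescript
// Mirrors the path construction in getDatasetParquet: each optional
// segment is appended only when all preceding segments are present.
function buildParquetEndpoint(
  repoId: string,
  subset?: string,
  split?: string,
  n?: number
): string {
  let endpoint = `/api/datasets/${repoId}/parquet`;
  if (subset) {
    endpoint += `/${subset}`;
    if (split) {
      endpoint += `/${split}`;
      if (n !== undefined) {
        endpoint += `/${n}.parquet`;
      }
    }
  }
  return endpoint;
}

// A split or shard number without its parent segment is silently ignored:
const full = buildParquetEndpoint("squad", "plain_text", "train", 0);
// "/api/datasets/squad/parquet/plain_text/train/0.parquet"
const ignored = buildParquetEndpoint("squad", undefined, "train");
// "/api/datasets/squad/parquet"
```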
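The type guard only verifies that repo_id exists and is a string; the optional fields pass through unvalidated. The snippet below restates the guard so the examples are self-contained (the argument values are made up for illustration):

```typescript
interface DatasetParquetArgs {
  repo_id: string;
  subset?: string;
  split?: string;
  n?: number;
}

// Restated from the implementation above: only repo_id's presence and
// type are checked.
function isDatasetParquetArgs(args: unknown): args is DatasetParquetArgs {
  return (
    typeof args === "object" &&
    args !== null &&
    "repo_id" in args &&
    typeof (args as { repo_id: string }).repo_id === "string"
  );
}

const valid = isDatasetParquetArgs({ repo_id: "squad" });       // true
const missingId = isDatasetParquetArgs({ split: "train" });     // false
// true: the guard does not type-check optional fields such as n
const wrongOptional = isDatasetParquetArgs({ repo_id: "squad", n: "zero" });
```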
Behavior: 2/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries the full burden of behavioral disclosure. It mentions retrieving 'auto-converted parquet files' and optional filtering, but fails to disclose critical behaviors such as whether this is a read-only operation, potential rate limits, authentication requirements, or the format and structure of the returned list. This leaves significant gaps for an AI agent to understand how to use the tool effectively.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is extremely concise with two sentences that directly convey the tool's function and optional capabilities. Every word earns its place, and it's front-loaded with the core purpose, making it efficient and easy to parse.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 2/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool has no annotations and no output schema, the description is incomplete. It doesn't explain what the returned 'list of auto-converted parquet files' looks like (e.g., format, structure, or example output), nor does it cover behavioral aspects like error handling or performance characteristics. For a tool with 4 parameters and no structured output documentation, this leaves too much undefined.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 100%, so the input schema already documents all parameters thoroughly. The description adds minimal value by mentioning 'subset (config) and split' as optional filters, but doesn't provide additional semantic context beyond what's in the schema descriptions. This meets the baseline for high schema coverage.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 4/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the action ('Get the list of auto-converted parquet files') and resource ('for a dataset'), making the purpose understandable. However, it doesn't explicitly differentiate from sibling tools like 'hf_get_dataset_info' or 'hf_get_croissant' that might also retrieve dataset-related information, preventing a perfect score.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 3/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description implies usage by mentioning optional parameters ('Can specify subset (config) and split to get specific files'), suggesting when to use these features. However, it lacks explicit guidance on when to choose this tool over alternatives like 'hf_get_dataset_info' or 'hf_list_datasets', which might provide different types of dataset metadata or listings.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.


MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/michaelwaves/hf-mcp'
