Hugging Face Hub MCP Server

by michaelwaves

hf_list_datasets

Search, filter, and retrieve detailed metadata for datasets on the Hugging Face Hub, including downloads, likes, and tags. Refine results by author, search terms, or tags for targeted exploration.

Instructions

Get information from all datasets in the Hub. Supports filtering by search terms, authors, tags, and more. Returns paginated results with dataset metadata including downloads, likes, and tags.

Input Schema

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| author | No | Filter datasets by author or organization (e.g., 'huggingface', 'microsoft') | |
| config | No | Whether to also fetch the repo config | |
| direction | No | Sort direction: '-1' for descending, anything else for ascending | |
| filter | No | Filter based on tags (e.g., 'task_categories:text-classification', 'languages:en') | |
| full | No | Whether to fetch most dataset data including all tags and files | |
| limit | No | Limit the number of datasets fetched | |
| search | No | Filter based on substrings for repos and their usernames (e.g., 'pets', 'microsoft') | |
| sort | No | Property to use when sorting (e.g., 'downloads', 'author') | |

Implementation Reference

  • The MCP tool handler for 'hf_list_datasets': validates arguments using isDatasetSearchArgs, calls client.getDatasets(), and formats the CallToolResult.
    export async function handleListDatasets(client: HuggingFaceClient, args: unknown): Promise<CallToolResult> {
        try {
            if (!isDatasetSearchArgs(args)) {
                throw new Error("Invalid arguments for hf_list_datasets");
            }
    
            const results = await client.getDatasets(args as Record<string, any>);
            
            return {
                content: [{ type: "text", text: results }],
                isError: false,
            };
        } catch (error) {
            return {
                content: [
                    {
                        type: "text",
                        text: `Error: ${error instanceof Error ? error.message : String(error)}`,
                    },
                ],
                isError: true,
            };
        }
    }
  • The tool definition for 'hf_list_datasets' including name, description, and detailed inputSchema for filtering and pagination parameters.
    export const listDatasetsToolDefinition: Tool = {
        name: "hf_list_datasets",
        description:
            "Get information from all datasets in the Hub. Supports filtering by search terms, authors, tags, and more. " +
            "Returns paginated results with dataset metadata including downloads, likes, and tags.",
        inputSchema: {
            type: "object", 
            properties: {
                search: {
                    type: "string",
                    description: "Filter based on substrings for repos and their usernames (e.g., 'pets', 'microsoft')"
                },
                author: {
                    type: "string",
                    description: "Filter datasets by author or organization (e.g., 'huggingface', 'microsoft')" 
                },
                filter: {
                    type: "string",
                    description: "Filter based on tags (e.g., 'task_categories:text-classification', 'languages:en')"
                },
                sort: {
                    type: "string",
                    description: "Property to use when sorting (e.g., 'downloads', 'author')"
                },
                direction: {
                    type: "string",
                    description: "Sort direction: '-1' for descending, anything else for ascending"
                },
                limit: {
                    type: "number", 
                    description: "Limit the number of datasets fetched"
                },
                full: {
                    type: "boolean",
                    description: "Whether to fetch most dataset data including all tags and files"
                },
                config: {
                    type: "boolean",
                    description: "Whether to also fetch the repo config"
                }
            },
            required: []
        }
    };
  • Core helper method in HuggingFaceClient: performs an HTTP GET against the Hugging Face Hub API '/api/datasets' endpoint with the given query params and returns the response data as a pretty-printed JSON string.
    async getDatasets(params: Record<string, any> = {}): Promise<string> {
        try {
            const response: AxiosResponse = await this.httpClient.get('/api/datasets', { params });
            return JSON.stringify(response.data, null, 2);
        } catch (error) {
            throw new Error(`Failed to fetch datasets: ${error instanceof Error ? error.message : String(error)}`);
        }
    }
  • src/server.ts:81-82 (registration)
    Registration and dispatch of 'hf_list_datasets' handler in the main HuggingFaceServer's CallToolRequestHandler switch statement.
    case 'hf_list_datasets':
        return handleListDatasets(this.client, args);
  • src/server.ts:55-66 (registration)
    Registration of 'hf_list_datasets' tool definition (listDatasetsToolDefinition) in the ListToolsRequestHandler for tool discovery.
    this.server.setRequestHandler(ListToolsRequestSchema, async () => ({
        tools: [
            listModelsToolDefinition,
            getModelInfoToolDefinition,
            getModelTagsToolDefinition,
            listDatasetsToolDefinition,
            getDatasetInfoToolDefinition,
            getDatasetParquetToolDefinition,
            getCroissantToolDefinition,
            getDatasetTagsToolDefinition
        ],
    }));
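
The handler listed above relies on isDatasetSearchArgs, whose body is not shown on this page. The following is a minimal sketch of what such a type guard could look like, assuming it checks each property against the types declared in the inputSchema and rejects unknown keys. Both behaviors are assumptions; the real guard may be stricter or more lenient.

```typescript
// Hypothetical reconstruction; the server's actual type and guard are not shown.
type DatasetSearchArgs = {
  search?: string;
  author?: string;
  filter?: string;
  sort?: string;
  direction?: string;
  limit?: number;
  full?: boolean;
  config?: boolean;
};

// Expected primitive type per property, taken from the tool's inputSchema.
const expectedTypes: Record<string, "string" | "number" | "boolean"> = {
  search: "string",
  author: "string",
  filter: "string",
  sort: "string",
  direction: "string",
  limit: "number",
  full: "boolean",
  config: "boolean",
};

function isDatasetSearchArgs(args: unknown): args is DatasetSearchArgs {
  // Only plain, non-null, non-array objects can be argument records.
  if (typeof args !== "object" || args === null || Array.isArray(args)) {
    return false;
  }
  const record = args as Record<string, unknown>;
  for (const key of Object.keys(record)) {
    // Assumption: keys outside the schema are rejected rather than ignored.
    if (!Object.prototype.hasOwnProperty.call(expectedTypes, key)) {
      return false;
    }
    const value = record[key];
    // Every property is optional, so undefined is allowed.
    if (value !== undefined && typeof value !== expectedTypes[key]) {
      return false;
    }
  }
  return true;
}

const acceptsValid = isDatasetSearchArgs({ search: "pets", limit: 10, full: true });
const rejectsWrongType = !isDatasetSearchArgs({ limit: "10" });
```

Because every property is optional, an empty object passes the guard, which matches `required: []` in the tool definition.
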
Behavior: 3/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries the full burden of behavioral disclosure. It usefully mentions pagination and the types of metadata returned (downloads, likes, tags), which goes beyond the input schema. However, it doesn't address important behavioral aspects like rate limits, authentication requirements, error conditions, or what happens when no filters are applied (does it return all datasets?).

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is efficiently structured in two sentences: the first states the core purpose and filtering capabilities, the second describes the return format. Every element serves a purpose with no wasted words, though it could be slightly more front-loaded by mentioning pagination earlier given its importance.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 3/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

For a tool with 8 parameters, no annotations, and no output schema, the description provides adequate but incomplete context. It covers the basic purpose and return format but lacks details about authentication, rate limits, error handling, and how results are structured beyond 'paginated results with dataset metadata.' The absence of an output schema increases the need for more behavioral transparency.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

The description mentions filtering by 'search terms, authors, tags, and more' which aligns with parameters like search, author, and filter. However, with 100% schema description coverage, the input schema already documents all 8 parameters thoroughly. The description adds minimal value beyond what's in the schema, meeting the baseline expectation for high schema coverage.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 4/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose: 'Get information from all datasets in the Hub' with specific filtering capabilities. It distinguishes itself from siblings like hf_get_dataset_info (single dataset) and hf_list_models (different resource type), though it doesn't explicitly contrast with hf_get_dataset_tags or hf_get_dataset_parquet, which serve more specialized purposes.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 3/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description implies usage for listing datasets with filtering options, but provides no explicit guidance on when to choose this tool over alternatives like hf_get_dataset_info (for details on a single dataset) or hf_list_models (for models instead of datasets). It mentions filtering capabilities but doesn't clarify trade-offs or specific use cases.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/michaelwaves/hf-mcp'