get_statistics

Retrieve statistical insights from Hugging Face datasets to analyze data distribution, identify patterns, and assess dataset quality for machine learning projects.

Instructions

Get statistics about a Hugging Face dataset

Input Schema

Name	Required	Description
`dataset`	Yes	Hugging Face dataset identifier in the format owner/dataset
`config`	Yes	Dataset configuration/subset name. Use get_info to list available configs
`split`	Yes	Dataset split name. Splits partition the data for training/evaluation
`auth_token`	No	Hugging Face auth token for private/gated datasets
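
As an illustration of the schema above, the following is a minimal sketch of a client-side check that could be run before calling the tool. The `check_arguments` helper is hypothetical and not part of the server; the pattern and the list of required keys are copied from the input schema, and the argument values are illustrative only.

```python
import re

# Pattern and required keys copied from the tool's input schema.
DATASET_PATTERN = re.compile(r"^[^/]+/[^/]+$")
REQUIRED = ("dataset", "config", "split")


def check_arguments(arguments: dict) -> list[str]:
    """Hypothetical helper: return a list of problems (empty means valid)."""
    problems = [
        f"missing required argument: {key}"
        for key in REQUIRED
        if key not in arguments
    ]
    dataset = arguments.get("dataset")
    if isinstance(dataset, str) and not DATASET_PATTERN.match(dataset):
        problems.append("dataset must look like owner/dataset")
    return problems


# Illustrative values taken from the schema's own examples.
print(check_arguments({"dataset": "ylecun/mnist", "config": "default", "split": "train"}))
print(check_arguments({"dataset": "imdb"}))
```

The first call prints an empty list; the second reports the two missing arguments and the malformed dataset identifier. `auth_token` is left unchecked because it is optional.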

Implementation Reference

src/dataset_viewer/server.py:547-557 (handler)
MCP tool handler in @server.call_tool() that extracts arguments (dataset, config, split), instantiates DatasetViewerAPI with auth_token, calls its get_statistics method, formats the result as JSON text content, and returns it.
    elif name == "get_statistics":
        dataset = arguments["dataset"]
        config = arguments["config"]
        split = arguments["split"]
        stats = await DatasetViewerAPI(auth_token=auth_token).get_statistics(dataset, config=config, split=split)
        return [
            types.TextContent(
                type="text",
                text=json.dumps(stats, indent=2)
            )
        ]
src/dataset_viewer/server.py:82-91 (helper)
Core implementation in the DatasetViewerAPI class. It builds the query parameters, makes an asynchronous HTTP GET request to the Hugging Face dataset viewer /statistics endpoint, raises on a non-success status, and returns the JSON response.
    async def get_statistics(self, dataset: str, config: str, split: str) -> dict:
        """Get statistics about a dataset"""
        params = {
            "dataset": dataset,
            "config": config,
            "split": split
        }
        response = await self.client.get("/statistics", params=params)
        response.raise_for_status()
        return response.json()
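
For readers who want to see the request the helper produces, here is a standalone sketch using only the Python standard library. It rests on two assumptions not shown in the source: that the client's base URL is the public dataset viewer API at `https://datasets-server.huggingface.co`, and that the token is sent as a `Bearer` authorization header. The function names `build_statistics_url` and `fetch_statistics` are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumed base URL; the source does not show how self.client is configured.
BASE_URL = "https://datasets-server.huggingface.co"


def build_statistics_url(dataset: str, config: str, split: str) -> str:
    """Build the /statistics URL with the same three query parameters."""
    query = urllib.parse.urlencode(
        {"dataset": dataset, "config": config, "split": split}
    )
    return f"{BASE_URL}/statistics?{query}"


def fetch_statistics(dataset: str, config: str, split: str, auth_token=None) -> dict:
    """Synchronous stand-in for the async helper (performs a network call)."""
    request = urllib.request.Request(build_statistics_url(dataset, config, split))
    if auth_token:
        # Assumed header format for private or gated datasets.
        request.add_header("Authorization", f"Bearer {auth_token}")
    # urlopen raises HTTPError on 4xx/5xx, similar to raise_for_status().
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


# Only builds the URL; no request is sent here.
print(build_statistics_url("ylecun/mnist", "mnist", "train"))
```

Note that the slash in the dataset identifier is percent-encoded in the query string. Calling `fetch_statistics` requires network access and a dataset for which the viewer has computed statistics.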
src/dataset_viewer/server.py:388-414 (schema)
Input schema definition for the get_statistics tool, specifying required parameters (dataset, config, split) with types, descriptions, patterns, examples, and optional auth_token.
    inputSchema={
        "type": "object",
        "properties": {
            "dataset": {
                "type": "string",
                "description": "Hugging Face dataset identifier in the format owner/dataset",
                "pattern": "^[^/]+/[^/]+$",
                "examples": ["ylecun/mnist", "stanfordnlp/imdb"]
            },
            "config": {
                "type": "string",
                "description": "Dataset configuration/subset name. Use get_info to list available configs",
                "examples": ["default", "en", "es"]
            },
            "split": {
                "type": "string",
                "description": "Dataset split name. Splits partition the data for training/evaluation",
                "examples": ["train", "validation", "test"]
            },
            "auth_token": {
                "type": "string",
                "description": "Hugging Face auth token for private/gated datasets",
                "optional": True
            }
        },
        "required": ["dataset", "config", "split"],
    }
src/dataset_viewer/server.py:385-415 (registration)
Registration of the get_statistics tool in the @server.list_tools() handler, including name, description, and full input schema.
    types.Tool(
        name="get_statistics",
        description="Get statistics about a Hugging Face dataset",
        inputSchema={
            "type": "object",
            "properties": {
                "dataset": {
                    "type": "string",
                    "description": "Hugging Face dataset identifier in the format owner/dataset",
                    "pattern": "^[^/]+/[^/]+$",
                    "examples": ["ylecun/mnist", "stanfordnlp/imdb"]
                },
                "config": {
                    "type": "string",
                    "description": "Dataset configuration/subset name. Use get_info to list available configs",
                    "examples": ["default", "en", "es"]
                },
                "split": {
                    "type": "string",
                    "description": "Dataset split name. Splits partition the data for training/evaluation",
                    "examples": ["train", "validation", "test"]
                },
                "auth_token": {
                    "type": "string",
                    "description": "Hugging Face auth token for private/gated datasets",
                    "optional": True
                }
            },
            "required": ["dataset", "config", "split"],
        }
    ),
