get_parquet
Export Hugging Face dataset splits as Parquet files for data analysis and processing workflows.
Instructions
Export Hugging Face dataset split as Parquet file
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| dataset | Yes | Hugging Face dataset identifier in the format owner/dataset | |
| auth_token | No | Hugging Face auth token for private/gated datasets | |
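The registration code listed under Implementation Reference constrains `dataset` with the pattern `^[^/]+/[^/]+$`. A caller might check arguments against that pattern before invoking the tool. The following is only a minimal sketch: `validate_arguments` is a hypothetical helper and is not part of the server.

```python
import re

# Pattern copied from the tool's registered input schema for "dataset".
DATASET_PATTERN = re.compile(r"^[^/]+/[^/]+$")


def validate_arguments(arguments: dict) -> dict:
    """Hypothetical helper: check get_parquet arguments before calling the tool."""
    dataset = arguments.get("dataset")
    if not isinstance(dataset, str) or not DATASET_PATTERN.match(dataset):
        raise ValueError("dataset must be a string of the form owner/dataset")

    # auth_token is optional; when present it must be a string.
    auth_token = arguments.get("auth_token")
    if auth_token is not None and not isinstance(auth_token, str):
        raise ValueError("auth_token must be a string when provided")

    return {"dataset": dataset, "auth_token": auth_token}
```

For example, `validate_arguments({"dataset": "ylecun/mnist"})` passes, while `{"dataset": "mnist"}` is rejected because it lacks the owner segment.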
Implementation Reference
- src/dataset_viewer/server.py:559-574 (handler) Executes the get_parquet tool: fetches Parquet data from the HF Dataset Viewer API using DatasetViewerAPI, saves it to a local .parquet file in the current working directory, and returns the file path.

  ```python
  elif name == "get_parquet":
      dataset = arguments["dataset"]
      parquet_data = await DatasetViewerAPI(auth_token=auth_token).get_parquet(dataset)

      # Save to a temporary file with .parquet extension
      filename = f"{dataset.replace('/', '_')}.parquet"
      filepath = os.path.join(os.getcwd(), filename)
      with open(filepath, "wb") as f:
          f.write(parquet_data)

      return [
          types.TextContent(
              type="text",
              text=f"Dataset exported to: {filepath}"
          )
      ]
  ```
- src/dataset_viewer/server.py:416-436 (registration) Registers the get_parquet tool in the MCP server's list_tools() handler, defining its name, description, and input schema.

  ```python
  types.Tool(
      name="get_parquet",
      description="Export Hugging Face dataset split as Parquet file",
      inputSchema={
          "type": "object",
          "properties": {
              "dataset": {
                  "type": "string",
                  "description": "Hugging Face dataset identifier in the format owner/dataset",
                  "pattern": "^[^/]+/[^/]+$",
                  "examples": ["ylecun/mnist", "stanfordnlp/imdb"]
              },
              "auth_token": {
                  "type": "string",
                  "description": "Hugging Face auth token for private/gated datasets",
                  "optional": True
              }
          },
          "required": ["dataset"],
      }
  ),
  ```
- src/dataset_viewer/server.py:153-157 (helper) DatasetViewerAPI helper method that performs the HTTP request to retrieve the full dataset as Parquet bytes from the HF datasets-server.

  ```python
  async def get_parquet(self, dataset: str) -> bytes:
      """Get entire dataset in Parquet format"""
      response = await self.client.get("/parquet", params={"dataset": dataset})
      response.raise_for_status()
      return response.content
  ```
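The handler writes whatever bytes the helper returns straight to disk. The sketch below reproduces the handler's file naming and adds a loose sanity check on the payload before saving. The check relies on the Parquet file layout, in which a file begins and ends with the four-byte marker `PAR1`. The names `export_path`, `looks_like_parquet`, and `save_export` are hypothetical and do not exist in the server. The `directory` parameter is also an assumption, since the original always writes to the current working directory.

```python
import os

PARQUET_MAGIC = b"PAR1"  # Parquet files begin and end with these four bytes.


def export_path(dataset: str, directory: str) -> str:
    """Mirror the handler's naming: 'owner/dataset' -> '<directory>/owner_dataset.parquet'."""
    filename = f"{dataset.replace('/', '_')}.parquet"
    return os.path.join(directory, filename)


def looks_like_parquet(data: bytes) -> bool:
    """Hypothetical check, not in the original handler.

    Requires at least 12 bytes: two magic markers plus the 4-byte footer length.
    """
    return (
        len(data) >= 12
        and data[:4] == PARQUET_MAGIC
        and data[-4:] == PARQUET_MAGIC
    )


def save_export(dataset: str, data: bytes, directory: str) -> str:
    """Write the payload to disk only if it carries the Parquet magic markers."""
    if not looks_like_parquet(data):
        raise ValueError("response body does not look like a Parquet file")
    path = export_path(dataset, directory)
    with open(path, "wb") as f:
        f.write(data)
    return path
```

A check like this would surface a non-Parquet response body (for instance an error page or a JSON document) at export time rather than later, when a reader fails to open the file.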