# validate
Verifies that a Hugging Face dataset exists and is accessible by checking its existence and permissions, with optional authentication for private or gated datasets.
## Instructions
Check if a Hugging Face dataset exists and is accessible
## Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| dataset | Yes | Hugging Face dataset identifier in the format owner/dataset | |
| auth_token | No | Hugging Face auth token for private/gated datasets | |
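Before sending a request, a client can pre-check the `dataset` argument against the schema's `pattern`. A minimal sketch using only the regex from the schema (the `build_arguments` helper is hypothetical; the dataset name is one of the schema's own examples):

```python
import json
import re

# Pattern taken from the tool's inputSchema
DATASET_PATTERN = r"^[^/]+/[^/]+$"

def build_arguments(dataset: str) -> str:
    """Return the JSON arguments for a 'validate' tool call,
    rejecting identifiers that would fail the schema's pattern."""
    if not re.match(DATASET_PATTERN, dataset):
        raise ValueError("Dataset must be in the format 'owner/dataset'")
    return json.dumps({"dataset": dataset})

print(build_arguments("ylecun/mnist"))
```

Rejecting malformed identifiers client-side mirrors the server's first check and avoids a round trip for inputs that would fail anyway.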
## Implementation Reference
- `src/dataset_viewer/server.py:576-627` (handler) — Handles the `validate` tool in the `call_tool` method. Validates the dataset identifier with a regex, then checks existence via a GET to the `/is-valid` endpoint on `datasets-server.huggingface.co`, returning the JSON result or an error message as `TextContent`.

  ```python
  elif name == "validate":
      dataset = arguments["dataset"]
      try:
          # First check format
          if not re.match(r"^[^/]+/[^/]+$", dataset):
              return [
                  types.TextContent(
                      type="text",
                      text="Dataset must be in the format 'owner/dataset'"
                  )
              ]
          # Then check if dataset exists and is accessible
          response = await DatasetViewerAPI(auth_token=auth_token).client.get(
              "/is-valid", params={"dataset": dataset}
          )
          response.raise_for_status()
          result = response.json()
          return [
              types.TextContent(type="text", text=json.dumps(result, indent=2))
          ]
      except httpx.NetworkError as e:
          return [types.TextContent(type="text", text=str(e))]
      except httpx.HTTPStatusError as e:
          if e.response.status_code == 404:
              return [
                  types.TextContent(
                      type="text",
                      text=f"Dataset '{dataset}' not found"
                  )
              ]
          elif e.response.status_code == 403:
              return [
                  types.TextContent(
                      type="text",
                      text=f"Dataset '{dataset}' requires authentication"
                  )
              ]
          else:
              return [types.TextContent(type="text", text=str(e))]
  ```
- `src/dataset_viewer/server.py:437-457` (registration) — Registers the `validate` MCP tool in `list_tools()`, including its name, description, and an `inputSchema` defining `dataset` as a required string with a pattern, plus an optional `auth_token`.

  ```python
  types.Tool(
      name="validate",
      description="Check if a Hugging Face dataset exists and is accessible",
      inputSchema={
          "type": "object",
          "properties": {
              "dataset": {
                  "type": "string",
                  "description": "Hugging Face dataset identifier in the format owner/dataset",
                  "pattern": "^[^/]+/[^/]+$",
                  "examples": ["ylecun/mnist", "stanfordnlp/imdb"]
              },
              "auth_token": {
                  "type": "string",
                  "description": "Hugging Face auth token for private/gated datasets",
                  "optional": True
              }
          },
          "required": ["dataset"],
      }
  ),
  ```
- `src/dataset_viewer/server.py:440-456` (schema) — Input schema for the `validate` tool: an object with a required `dataset` string (pattern `^[^/]+/[^/]+$`) and an optional `auth_token`. Reproduced in full within the registration entry above.
- `src/dataset_viewer/server.py:37-56` (helper) — `validate_dataset` method on the `DatasetViewerAPI` class. Validates the dataset ID format and checks existence with a HEAD request to `/is-valid`, raising specific exceptions. Mirrors the tool handler's logic, but raises instead of returning content.

  ```python
  async def validate_dataset(self, dataset: str) -> None:
      """Validate dataset ID format and check if it exists"""
      # Validate format (username/dataset-name)
      if not re.match(r"^[^/]+/[^/]+$", dataset):
          raise ValueError("Dataset ID must be in the format 'owner/dataset'")
      # Check if dataset exists and is accessible
      try:
          response = await self.client.head(f"/is-valid?dataset={dataset}")
          response.raise_for_status()
      except httpx.NetworkError as e:
          raise ConnectionError(f"Network error while validating dataset: {e}")
      except httpx.HTTPStatusError as e:
          if e.response.status_code == 404:
              raise ValueError(f"Dataset '{dataset}' not found")
          elif e.response.status_code == 403:
              raise ValueError(f"Dataset '{dataset}' exists but requires authentication")
          else:
              raise RuntimeError(f"Error validating dataset: {e}")
  ```
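The helper raises exceptions while the tool handler returns user-facing text, so a thin adapter sits between the two. A simplified, self-contained sketch of that split (all names here are hypothetical stand-ins; no network request is made, and a status code is injected for illustration):

```python
import asyncio
import re

def status_message(dataset: str, status_code: int) -> str:
    """Map an /is-valid HTTP status to the handler's error text."""
    if status_code == 404:
        return f"Dataset '{dataset}' not found"
    if status_code == 403:
        return f"Dataset '{dataset}' requires authentication"
    return f"HTTP error {status_code} while validating dataset"

async def validate_dataset(dataset: str, status_code: int = 200) -> None:
    """Stand-in for the helper: raises on any problem."""
    # Same format rule as the helper: 'owner/dataset'
    if not re.match(r"^[^/]+/[^/]+$", dataset):
        raise ValueError("Dataset ID must be in the format 'owner/dataset'")
    if status_code != 200:
        raise ValueError(status_message(dataset, status_code))

async def validate_tool(dataset: str, status_code: int = 200) -> str:
    """Stand-in for the handler: converts exceptions to text replies."""
    try:
        await validate_dataset(dataset, status_code)
        return f"Dataset '{dataset}' is valid"
    except ValueError as e:
        return str(e)

print(asyncio.run(validate_tool("bad-id")))
print(asyncio.run(validate_tool("ylecun/mnist", status_code=403)))
```

Keeping the raising helper separate from the text-returning handler lets other code paths in the server reuse validation without inheriting the MCP response format.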