search_dataset
Search for text within a single split of a Hugging Face dataset by specifying the dataset identifier, configuration, split, and query. Results are returned as JSON-formatted text for analysis or exploration.
Instructions
Search for text within a Hugging Face dataset
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| auth_token | No | Hugging Face auth token for private/gated datasets | |
| config | Yes | Dataset configuration/subset name. Use get_info to list available configs | |
| dataset | Yes | Hugging Face dataset identifier in the format owner/dataset | |
| query | Yes | Text to search for in the dataset | |
| split | Yes | Dataset split name. Splits partition the data for training/evaluation | |
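
As a rough illustration of the table above, the following sketch checks a set of arguments against the four required parameters and the `owner/dataset` pattern from the schema. The helper name `check_search_arguments` and the example values are hypothetical and are not part of the server; the config and query values are illustrative only and have not been verified against the named dataset.

```python
import re

# Mirrors the "required" list and the dataset "pattern" in the tool's input schema.
REQUIRED = ("dataset", "config", "split", "query")
DATASET_PATTERN = re.compile(r"^[^/]+/[^/]+$")


def check_search_arguments(arguments: dict) -> list[str]:
    """Return a list of problems; an empty list means the arguments look valid.

    Hypothetical helper for illustration, not part of the server code.
    """
    problems = [
        f"missing required argument: {name}"
        for name in REQUIRED
        if name not in arguments
    ]
    dataset = arguments.get("dataset")
    if isinstance(dataset, str) and not DATASET_PATTERN.match(dataset):
        problems.append("dataset must look like owner/dataset")
    return problems


# Illustrative arguments; auth_token is omitted because it is optional.
example_arguments = {
    "dataset": "stanfordnlp/imdb",
    "config": "default",
    "split": "train",
    "query": "great movie",
}

print(check_search_arguments(example_arguments))
print(check_search_arguments({"dataset": "mnist"}))
```

A check like this only guards the shape of the arguments. Whether a given config or split actually exists for a dataset still has to be confirmed with `get_info`.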
Implementation Reference
- src/dataset_viewer/server.py:519-530 (handler) Handler in @server.call_tool() that extracts arguments and calls DatasetViewerAPI.search(), returning JSON-formatted results.

  ```python
  elif name == "search_dataset":
      dataset = arguments["dataset"]
      config = arguments["config"]
      split = arguments["split"]
      query = arguments["query"]
      search_result = await DatasetViewerAPI(auth_token=auth_token).search(dataset, config=config, split=split, query=query)
      return [
          types.TextContent(
              type="text",
              text=json.dumps(search_result, indent=2)
          )
      ]
  ```
- src/dataset_viewer/server.py:306-336 (schema) Input schema definition for the search_dataset tool, specifying parameters like dataset, config, split, query.

  ```python
      name="search_dataset",
      description="Search for text within a Hugging Face dataset",
      inputSchema={
          "type": "object",
          "properties": {
              "dataset": {
                  "type": "string",
                  "description": "Hugging Face dataset identifier in the format owner/dataset",
                  "pattern": "^[^/]+/[^/]+$",
                  "examples": ["ylecun/mnist", "stanfordnlp/imdb"]
              },
              "config": {
                  "type": "string",
                  "description": "Dataset configuration/subset name. Use get_info to list available configs",
                  "examples": ["default", "en", "es"]
              },
              "split": {
                  "type": "string",
                  "description": "Dataset split name. Splits partition the data for training/evaluation",
                  "examples": ["train", "validation", "test"]
              },
              "query": {"type": "string", "description": "Text to search for in the dataset"},
              "auth_token": {
                  "type": "string",
                  "description": "Hugging Face auth token for private/gated datasets",
                  "optional": True
              }
          },
          "required": ["dataset", "config", "split", "query"],
      }
  ),
  ```
- src/dataset_viewer/server.py:305-337 (registration) Registration of the search_dataset tool in the @server.list_tools() return list.

  ```python
  types.Tool(
      name="search_dataset",
      description="Search for text within a Hugging Face dataset",
      inputSchema={
          "type": "object",
          "properties": {
              "dataset": {
                  "type": "string",
                  "description": "Hugging Face dataset identifier in the format owner/dataset",
                  "pattern": "^[^/]+/[^/]+$",
                  "examples": ["ylecun/mnist", "stanfordnlp/imdb"]
              },
              "config": {
                  "type": "string",
                  "description": "Dataset configuration/subset name. Use get_info to list available configs",
                  "examples": ["default", "en", "es"]
              },
              "split": {
                  "type": "string",
                  "description": "Dataset split name. Splits partition the data for training/evaluation",
                  "examples": ["train", "validation", "test"]
              },
              "query": {"type": "string", "description": "Text to search for in the dataset"},
              "auth_token": {
                  "type": "string",
                  "description": "Hugging Face auth token for private/gated datasets",
                  "optional": True
              }
          },
          "required": ["dataset", "config", "split", "query"],
      }
  ),
  types.Tool(  # opening of the next tool in the list; the excerpt ends here
  ```
- src/dataset_viewer/server.py:104-114 (helper) Core search method in DatasetViewerAPI class that queries the Hugging Face dataset viewer /search endpoint.

  ```python
  async def search(self, dataset: str, config: str, split: str, query: str) -> dict:
      """Search for text within a dataset split"""
      params = {
          "dataset": dataset,
          "config": config,
          "split": split,
          "query": query
      }
      response = await self.client.get("/search", params=params)
      response.raise_for_status()
      return response.json()
  ```
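
To make the request issued by the helper more concrete, here is a minimal sketch of the URL and headers that such a `/search` call might produce. It makes no network call. The base URL and the bearer-token header are assumptions (the excerpts above do not show how `self.client` is configured), and `build_search_url` and `build_headers` are hypothetical names used only for illustration.

```python
from urllib.parse import urlencode

# Assumption: the public base URL of the Hugging Face dataset viewer API.
# The excerpts above do not show the client's configured base URL.
BASE_URL = "https://datasets-server.huggingface.co"


def build_search_url(dataset: str, config: str, split: str, query: str,
                     base_url: str = BASE_URL) -> str:
    """Build the GET URL for a /search request (hypothetical helper)."""
    # Same four parameters that DatasetViewerAPI.search() sends.
    params = {"dataset": dataset, "config": config, "split": split, "query": query}
    return f"{base_url}/search?{urlencode(params)}"


def build_headers(auth_token=None) -> dict:
    """Assumption: private/gated datasets are accessed with a bearer token."""
    return {"Authorization": f"Bearer {auth_token}"} if auth_token else {}


# Illustrative values only; not verified against the named dataset.
url = build_search_url("stanfordnlp/imdb", "default", "train", "great movie")
print(url)
print(build_headers())
```

The printed URL shows how the slash in the dataset identifier and the space in the query are percent-encoded before the request is sent.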