Dataset Viewer MCP Server

by privetin

search_dataset

Search for specific text within Hugging Face datasets to find relevant data entries for analysis or training.

Instructions

Search for text within a Hugging Face dataset

Input Schema

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| dataset | Yes | Hugging Face dataset identifier in the format owner/dataset | |
| config | Yes | Dataset configuration/subset name. Use get_info to list available configs | |
| split | Yes | Dataset split name. Splits partition the data for training/evaluation | |
| query | Yes | Text to search for in the dataset | |
| auth_token | No | Hugging Face auth token for private/gated datasets | |
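
The constraints in the table can be checked before a call is made. The following is a minimal sketch, assuming the client validates arguments itself; `check_search_arguments` is a hypothetical helper, not part of the server, and the owner/dataset pattern is the one given in the tool's input schema. The `"default"` config value is illustrative only.

```python
import re

# Four required string fields plus an optional auth_token, as in the table.
REQUIRED = ("dataset", "config", "split", "query")
# Pattern taken from the tool's input schema for "dataset".
DATASET_PATTERN = re.compile(r"^[^/]+/[^/]+$")


def check_search_arguments(arguments: dict) -> list[str]:
    """Return a list of problems; an empty list means the arguments look valid."""
    problems = []
    for name in REQUIRED:
        if name not in arguments:
            problems.append(f"missing required field: {name}")
        elif not isinstance(arguments[name], str):
            problems.append(f"{name} must be a string")
    dataset = arguments.get("dataset")
    if isinstance(dataset, str) and not DATASET_PATTERN.search(dataset):
        problems.append("dataset must look like owner/dataset")
    token = arguments.get("auth_token")
    if token is not None and not isinstance(token, str):
        problems.append("auth_token must be a string when provided")
    return problems


example = {
    "dataset": "stanfordnlp/imdb",
    "config": "default",  # illustrative; use get_info to list real configs
    "split": "train",
    "query": "great movie",
}
print(check_search_arguments(example))  # []
print(check_search_arguments({"dataset": "imdb"}))
```

The second call reports the missing fields and the malformed dataset identifier rather than raising, which makes it easy to surface all problems to the caller at once.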

Implementation Reference

  • The handler branch for the 'search_dataset' tool inside the @server.call_tool() function. It extracts the arguments (dataset, config, split, query), creates a DatasetViewerAPI instance with the auth token, awaits its search method, and returns the JSON-formatted result as TextContent.
    elif name == "search_dataset":
        dataset = arguments["dataset"]
        config = arguments["config"]
        split = arguments["split"]
        query = arguments["query"]
        search_result = await DatasetViewerAPI(auth_token=auth_token).search(dataset, config=config, split=split, query=query)
        return [
            types.TextContent(
                type="text",
                text=json.dumps(search_result, indent=2)
            )
        ]
  • The input schema definition for the 'search_dataset' tool, registered in @server.list_tools(). It lists dataset, config, split, and query as required parameters. auth_token is optional because it is absent from "required"; the "optional" key is not a standard JSON Schema keyword.
    types.Tool(
        name="search_dataset",
        description="Search for text within a Hugging Face dataset",
        inputSchema={
            "type": "object",
            "properties": {
                "dataset": {
                    "type": "string",
                    "description": "Hugging Face dataset identifier in the format owner/dataset",
                    "pattern": "^[^/]+/[^/]+$",
                    "examples": ["ylecun/mnist", "stanfordnlp/imdb"]
                },
                "config": {
                    "type": "string",
                    "description": "Dataset configuration/subset name. Use get_info to list available configs",
                    "examples": ["default", "en", "es"]
                },
                "split": {
                    "type": "string",
                    "description": "Dataset split name. Splits partition the data for training/evaluation",
                    "examples": ["train", "validation", "test"]
                },
                "query": {"type": "string", "description": "Text to search for in the dataset"},
                "auth_token": {
                    "type": "string",
                    "description": "Hugging Face auth token for private/gated datasets",
                    "optional": True
                }
            },
            "required": ["dataset", "config", "split", "query"],
        }
    ),
  • Helper method in DatasetViewerAPI class that performs the core search functionality by making an HTTP GET request to the Hugging Face dataset viewer API's /search endpoint with the provided parameters.
    async def search(self, dataset: str, config: str, split: str, query: str) -> dict:
        """Search for text within a dataset split"""
        params = {
            "dataset": dataset,
            "config": config,
            "split": split,
            "query": query
        }
        response = await self.client.get("/search", params=params)
        response.raise_for_status()
        return response.json()
  • The @server.list_tools() handler that registers all tools, including 'search_dataset', by returning a list of Tool objects with their schemas and descriptions.
    @server.list_tools()
    async def handle_list_tools() -> list[types.Tool]:
        """List available dataset tools for Hugging Face datasets"""
        return [
            types.Tool(
                name="get_info",
                description="Get detailed information about a Hugging Face dataset including description, features, splits, and statistics. Run validate first to check if the dataset exists and is accessible.",
                inputSchema={
                    "type": "object",
                    "properties": {
                        "dataset": {
                            "type": "string",
                            "description": "Hugging Face dataset identifier in the format owner/dataset",
                            "pattern": "^[^/]+/[^/]+$",
                            "examples": ["ylecun/mnist", "stanfordnlp/imdb"]
                        },
                        "auth_token": {
                            "type": "string",
                            "description": "Hugging Face auth token for private/gated datasets",
                            "optional": True
                        }
                    },
                    "required": ["dataset"],
                }
            ),
            types.Tool(
                name="get_rows",
                description="Get paginated rows from a Hugging Face dataset",
                inputSchema={
                    "type": "object",
                    "properties": {
                        "dataset": {
                            "type": "string",
                            "description": "Hugging Face dataset identifier in the format owner/dataset",
                            "pattern": "^[^/]+/[^/]+$",
                            "examples": ["ylecun/mnist", "stanfordnlp/imdb"]
                        },
                        "config": {
                            "type": "string",
                            "description": "Dataset configuration/subset name. Use get_info to list available configs",
                            "examples": ["default", "en", "es"]
                        },
                        "split": {
                            "type": "string",
                            "description": "Dataset split name. Splits partition the data for training/evaluation",
                            "examples": ["train", "validation", "test"]
                        },
                        "page": {"type": "integer", "description": "Page number (0-based), returns 100 rows per page", "default": 0},
                        "auth_token": {
                            "type": "string",
                            "description": "Hugging Face auth token for private/gated datasets",
                            "optional": True
                        }
                    },
                    "required": ["dataset", "config", "split"],
                }
            ),
            types.Tool(
                name="get_first_rows",
                description="Get first rows from a Hugging Face dataset split",
                inputSchema={
                    "type": "object",
                    "properties": {
                        "dataset": {
                            "type": "string",
                            "description": "Hugging Face dataset identifier in the format owner/dataset",
                            "pattern": "^[^/]+/[^/]+$",
                            "examples": ["ylecun/mnist", "stanfordnlp/imdb"]
                        },
                        "config": {
                            "type": "string",
                            "description": "Dataset configuration/subset name. Use get_info to list available configs",
                            "examples": ["default", "en", "es"]
                        },
                        "split": {
                            "type": "string",
                            "description": "Dataset split name. Splits partition the data for training/evaluation",
                            "examples": ["train", "validation", "test"]
                        },
                        "auth_token": {
                            "type": "string",
                            "description": "Hugging Face auth token for private/gated datasets",
                            "optional": True
                        }
                    },
                    "required": ["dataset", "config", "split"],
                }
            ),
            types.Tool(
                name="search_dataset",
                description="Search for text within a Hugging Face dataset",
                inputSchema={
                    "type": "object",
                    "properties": {
                        "dataset": {
                            "type": "string",
                            "description": "Hugging Face dataset identifier in the format owner/dataset",
                            "pattern": "^[^/]+/[^/]+$",
                            "examples": ["ylecun/mnist", "stanfordnlp/imdb"]
                        },
                        "config": {
                            "type": "string",
                            "description": "Dataset configuration/subset name. Use get_info to list available configs",
                            "examples": ["default", "en", "es"]
                        },
                        "split": {
                            "type": "string",
                            "description": "Dataset split name. Splits partition the data for training/evaluation",
                            "examples": ["train", "validation", "test"]
                        },
                        "query": {"type": "string", "description": "Text to search for in the dataset"},
                        "auth_token": {
                            "type": "string",
                            "description": "Hugging Face auth token for private/gated datasets",
                            "optional": True
                        }
                    },
                    "required": ["dataset", "config", "split", "query"],
                }
            ),
            types.Tool(
                name="filter",
                description="Filter rows in a Hugging Face dataset using SQL-like conditions",
                inputSchema={
                    "type": "object",
                    "properties": {
                        "dataset": {
                            "type": "string",
                            "description": "Hugging Face dataset identifier in the format owner/dataset",
                            "pattern": "^[^/]+/[^/]+$",
                            "examples": ["ylecun/mnist", "stanfordnlp/imdb"]
                        },
                        "config": {
                            "type": "string",
                            "description": "Dataset configuration/subset name. Use get_info to list available configs",
                            "examples": ["default", "en", "es"]
                        },
                        "split": {
                            "type": "string",
                            "description": "Dataset split name. Splits partition the data for training/evaluation",
                            "examples": ["train", "validation", "test"]
                        },
                        "where": {
                            "type": "string",
                            "description": "SQL-like WHERE clause to filter rows",
                            "examples": ["column = \"value\"", "score > 0.5", "text LIKE \"%query%\""]
                        },
                        "orderby": {
                            "type": "string",
                            "description": "SQL-like ORDER BY clause to sort results",
                            "optional": True,
                            "examples": ["column ASC", "score DESC", "name ASC, id DESC"]
                        },
                        "page": {
                            "type": "integer",
                            "description": "Page number for paginated results (100 rows per page)",
                            "default": 0,
                            "minimum": 0
                        },
                        "auth_token": {
                            "type": "string",
                            "description": "Hugging Face auth token for private/gated datasets",
                            "optional": True
                        }
                    },
                    "required": ["dataset", "config", "split", "where"],
                }
            ),
            types.Tool(
                name="get_statistics",
                description="Get statistics about a Hugging Face dataset",
                inputSchema={
                    "type": "object",
                    "properties": {
                        "dataset": {
                            "type": "string",
                            "description": "Hugging Face dataset identifier in the format owner/dataset",
                            "pattern": "^[^/]+/[^/]+$",
                            "examples": ["ylecun/mnist", "stanfordnlp/imdb"]
                        },
                        "config": {
                            "type": "string",
                            "description": "Dataset configuration/subset name. Use get_info to list available configs",
                            "examples": ["default", "en", "es"]
                        },
                        "split": {
                            "type": "string",
                            "description": "Dataset split name. Splits partition the data for training/evaluation",
                            "examples": ["train", "validation", "test"]
                        },
                        "auth_token": {
                            "type": "string",
                            "description": "Hugging Face auth token for private/gated datasets",
                            "optional": True
                        }
                    },
                    "required": ["dataset", "config", "split"],
                }
            ),
            types.Tool(
                name="get_parquet",
                description="Export Hugging Face dataset split as Parquet file",
                inputSchema={
                    "type": "object",
                    "properties": {
                        "dataset": {
                            "type": "string",
                            "description": "Hugging Face dataset identifier in the format owner/dataset",
                            "pattern": "^[^/]+/[^/]+$",
                            "examples": ["ylecun/mnist", "stanfordnlp/imdb"]
                        },
                        "auth_token": {
                            "type": "string",
                            "description": "Hugging Face auth token for private/gated datasets",
                            "optional": True
                        }
                    },
                    "required": ["dataset"],
                }
            ),
            types.Tool(
                name="validate",
                description="Check if a Hugging Face dataset exists and is accessible",
                inputSchema={
                    "type": "object",
                    "properties": {
                        "dataset": {
                            "type": "string", 
                            "description": "Hugging Face dataset identifier in the format owner/dataset",
                            "pattern": "^[^/]+/[^/]+$",
                            "examples": ["ylecun/mnist", "stanfordnlp/imdb"]
                        },
                        "auth_token": {
                            "type": "string",
                            "description": "Hugging Face auth token for private/gated datasets",
                            "optional": True
                        }
                    },
                    "required": ["dataset"],
                }
            ),
        ]
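
The handler flow described above can be exercised without a network connection. This is a minimal sketch under stated assumptions: `StubDatasetViewerAPI` is a hypothetical stand-in for DatasetViewerAPI that returns canned data, a plain dict stands in for types.TextContent, and reading the token with `arguments.get("auth_token")` is an assumption, since the excerpt does not show where auth_token is set.

```python
import asyncio
import json


class StubDatasetViewerAPI:
    """Hypothetical stand-in for DatasetViewerAPI; returns canned data, no HTTP."""

    def __init__(self, auth_token=None):
        self.auth_token = auth_token

    async def search(self, dataset, config, split, query):
        # Canned payload for illustration only; not the real response format.
        return {"dataset": dataset, "config": config, "split": split,
                "query": query, "rows": []}


async def handle_search(arguments: dict) -> list[dict]:
    # Required fields are indexed directly, so a missing one raises KeyError.
    dataset = arguments["dataset"]
    config = arguments["config"]
    split = arguments["split"]
    query = arguments["query"]
    # Assumption: the optional token is read from the arguments with .get().
    auth_token = arguments.get("auth_token")

    api = StubDatasetViewerAPI(auth_token=auth_token)
    result = await api.search(dataset, config=config, split=split, query=query)
    # A plain dict stands in for types.TextContent here.
    return [{"type": "text", "text": json.dumps(result, indent=2)}]


contents = asyncio.run(handle_search({
    "dataset": "stanfordnlp/imdb",
    "config": "default",  # illustrative config name
    "split": "train",
    "query": "great movie",
}))
print(contents[0]["text"])
```

Swapping the stub for the real class is the only change needed to turn the sketch into a live call, which is also what makes the flow easy to unit test.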

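The search helper's GET request can be reproduced with the standard library to see exactly what is sent. This is a rough sketch, not the server's code: the base URL is an assumption (the public dataset viewer host; the excerpt does not show how self.client is configured), and `build_search_url` and `build_headers` are hypothetical names. Sending the token as a Bearer header is likewise an assumption about this server.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumption: public dataset viewer host; not shown in the excerpt.
BASE_URL = "https://datasets-server.huggingface.co"


def build_search_url(dataset: str, config: str, split: str, query: str) -> str:
    """Build the /search URL with the same four parameters the helper sends."""
    params = {"dataset": dataset, "config": config, "split": split, "query": query}
    # urlencode percent-encodes the slash in owner/dataset and spaces in the query.
    return f"{BASE_URL}/search?{urlencode(params)}"


def build_headers(auth_token=None) -> dict:
    """Return request headers; the Bearer scheme is an assumption for this server."""
    return {"Authorization": f"Bearer {auth_token}"} if auth_token else {}


url = build_search_url("stanfordnlp/imdb", "default", "train", "great movie")
print(url)
# Parsing the query string back recovers the original values.
print(parse_qs(urlsplit(url).query))
```

Building the URL separately from sending it keeps the encoding rules visible and testable without touching the network.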
MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/privetin/dataset-viewer'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.