Dataset Viewer MCP Server

by privetin

search_dataset

Search for specific text within Hugging Face datasets to find relevant data entries for analysis or training.

Instructions

Search for text within a Hugging Face dataset

Input Schema

| Name | Required | Description | Default |
| dataset | Yes | Hugging Face dataset identifier in the format owner/dataset | |
| config | Yes | Dataset configuration/subset name. Use get_info to list available configs | |
| split | Yes | Dataset split name. Splits partition the data for training/evaluation | |
| query | Yes | Text to search for in the dataset | |
| auth_token | No | Hugging Face auth token for private/gated datasets | |
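
As a rough illustration of how a client might prepare a call to this tool, the sketch below checks a set of arguments against the required fields and the owner/dataset pattern from the schema, then wraps them in an MCP `tools/call` JSON-RPC request. The helper names `check_search_arguments` and `build_tool_call` are hypothetical and not part of the server; the argument values are illustrative only.

```python
import json
import re

# Required fields and the dataset pattern are copied from the input schema above.
REQUIRED_FIELDS = ("dataset", "config", "split", "query")
DATASET_PATTERN = re.compile(r"^[^/]+/[^/]+$")


def check_search_arguments(arguments: dict) -> list[str]:
    """Hypothetical helper: return a list of problems (empty means the arguments look valid)."""
    problems = [
        f"missing required field: {name}"
        for name in REQUIRED_FIELDS
        if name not in arguments
    ]
    # Every property in the schema is declared as a string.
    for name, value in arguments.items():
        if not isinstance(value, str):
            problems.append(f"{name} must be a string")
    dataset = arguments.get("dataset")
    if isinstance(dataset, str) and not DATASET_PATTERN.match(dataset):
        problems.append("dataset must look like owner/dataset")
    return problems


def build_tool_call(arguments: dict, request_id: int = 1) -> str:
    """Hypothetical helper: wrap the arguments in an MCP tools/call JSON-RPC request."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "search_dataset", "arguments": arguments},
    })


# Illustrative values; use get_info to find the real config names for a dataset.
example_arguments = {
    "dataset": "stanfordnlp/imdb",
    "config": "plain_text",
    "split": "train",
    "query": "masterpiece",
}
if not check_search_arguments(example_arguments):
    print(build_tool_call(example_arguments))
```

Because auth_token is absent from the schema's "required" list, it can simply be left out for public datasets.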

Implementation Reference

  • The handler branch for the 'search_dataset' tool inside the @server.call_tool() function. It reads the dataset, config, split, and query arguments, creates a DatasetViewerAPI with the auth token, awaits its search method, and returns the JSON-formatted result as TextContent.
    elif name == "search_dataset":
        dataset = arguments["dataset"]
        config = arguments["config"]
        split = arguments["split"]
        query = arguments["query"]
        search_result = await DatasetViewerAPI(auth_token=auth_token).search(dataset, config=config, split=split, query=query)
        return [
            types.TextContent(
                type="text",
                text=json.dumps(search_result, indent=2)
            )
        ]
  • The input schema definition for the 'search_dataset' tool, registered in @server.list_tools(). Specifies required parameters: dataset, config, split, query, and optional auth_token.
    types.Tool(
        name="search_dataset",
        description="Search for text within a Hugging Face dataset",
        inputSchema={
            "type": "object",
            "properties": {
                "dataset": {
                    "type": "string",
                    "description": "Hugging Face dataset identifier in the format owner/dataset",
                    "pattern": "^[^/]+/[^/]+$",
                    "examples": ["ylecun/mnist", "stanfordnlp/imdb"]
                },
                "config": {
                    "type": "string",
                    "description": "Dataset configuration/subset name. Use get_info to list available configs",
                    "examples": ["default", "en", "es"]
                },
                "split": {
                    "type": "string",
                    "description": "Dataset split name. Splits partition the data for training/evaluation",
                    "examples": ["train", "validation", "test"]
                },
                "query": {"type": "string", "description": "Text to search for in the dataset"},
                "auth_token": {
                    "type": "string",
                    "description": "Hugging Face auth token for private/gated datasets",
                    "optional": True
                }
            },
            "required": ["dataset", "config", "split", "query"],
        }
    ),
  • Helper method in DatasetViewerAPI class that performs the core search functionality by making an HTTP GET request to the Hugging Face dataset viewer API's /search endpoint with the provided parameters.
    async def search(self, dataset: str, config: str, split: str, query: str) -> dict:
        """Search for text within a dataset split"""
        params = {
            "dataset": dataset,
            "config": config,
            "split": split,
            "query": query
        }
        response = await self.client.get("/search", params=params)
        response.raise_for_status()
        return response.json()
  • The @server.list_tools() handler that registers all tools, including 'search_dataset', by returning a list of Tool objects with their schemas and descriptions.
    @server.list_tools()
    async def handle_list_tools() -> list[types.Tool]:
        """List available dataset tools for Hugging Face datasets"""
        return [
            types.Tool(
                name="get_info",
                description="Get detailed information about a Hugging Face dataset including description, features, splits, and statistics. Run validate first to check if the dataset exists and is accessible.",
                inputSchema={
                    "type": "object",
                    "properties": {
                        "dataset": {
                            "type": "string",
                            "description": "Hugging Face dataset identifier in the format owner/dataset",
                            "pattern": "^[^/]+/[^/]+$",
                            "examples": ["ylecun/mnist", "stanfordnlp/imdb"]
                        },
                        "auth_token": {
                            "type": "string",
                            "description": "Hugging Face auth token for private/gated datasets",
                            "optional": True
                        }
                    },
                    "required": ["dataset"],
                }
            ),
            types.Tool(
                name="get_rows",
                description="Get paginated rows from a Hugging Face dataset",
                inputSchema={
                    "type": "object",
                    "properties": {
                        "dataset": {
                            "type": "string",
                            "description": "Hugging Face dataset identifier in the format owner/dataset",
                            "pattern": "^[^/]+/[^/]+$",
                            "examples": ["ylecun/mnist", "stanfordnlp/imdb"]
                        },
                        "config": {
                            "type": "string",
                            "description": "Dataset configuration/subset name. Use get_info to list available configs",
                            "examples": ["default", "en", "es"]
                        },
                        "split": {
                            "type": "string",
                            "description": "Dataset split name. Splits partition the data for training/evaluation",
                            "examples": ["train", "validation", "test"]
                        },
                        "page": {"type": "integer", "description": "Page number (0-based), returns 100 rows per page", "default": 0},
                        "auth_token": {
                            "type": "string",
                            "description": "Hugging Face auth token for private/gated datasets",
                            "optional": True
                        }
                    },
                    "required": ["dataset", "config", "split"],
                }
            ),
            types.Tool(
                name="get_first_rows",
                description="Get first rows from a Hugging Face dataset split",
                inputSchema={
                    "type": "object",
                    "properties": {
                        "dataset": {
                            "type": "string",
                            "description": "Hugging Face dataset identifier in the format owner/dataset",
                            "pattern": "^[^/]+/[^/]+$",
                            "examples": ["ylecun/mnist", "stanfordnlp/imdb"]
                        },
                        "config": {
                            "type": "string",
                            "description": "Dataset configuration/subset name. Use get_info to list available configs",
                            "examples": ["default", "en", "es"]
                        },
                        "split": {
                            "type": "string",
                            "description": "Dataset split name. Splits partition the data for training/evaluation",
                            "examples": ["train", "validation", "test"]
                        },
                        "auth_token": {
                            "type": "string",
                            "description": "Hugging Face auth token for private/gated datasets",
                            "optional": True
                        }
                    },
                    "required": ["dataset", "config", "split"],
                }
            ),
            types.Tool(
                name="search_dataset",
                description="Search for text within a Hugging Face dataset",
                inputSchema={
                    "type": "object",
                    "properties": {
                        "dataset": {
                            "type": "string",
                            "description": "Hugging Face dataset identifier in the format owner/dataset",
                            "pattern": "^[^/]+/[^/]+$",
                            "examples": ["ylecun/mnist", "stanfordnlp/imdb"]
                        },
                        "config": {
                            "type": "string",
                            "description": "Dataset configuration/subset name. Use get_info to list available configs",
                            "examples": ["default", "en", "es"]
                        },
                        "split": {
                            "type": "string",
                            "description": "Dataset split name. Splits partition the data for training/evaluation",
                            "examples": ["train", "validation", "test"]
                        },
                        "query": {"type": "string", "description": "Text to search for in the dataset"},
                        "auth_token": {
                            "type": "string",
                            "description": "Hugging Face auth token for private/gated datasets",
                            "optional": True
                        }
                    },
                    "required": ["dataset", "config", "split", "query"],
                }
            ),
            types.Tool(
                name="filter",
                description="Filter rows in a Hugging Face dataset using SQL-like conditions",
                inputSchema={
                    "type": "object",
                    "properties": {
                        "dataset": {
                            "type": "string",
                            "description": "Hugging Face dataset identifier in the format owner/dataset",
                            "pattern": "^[^/]+/[^/]+$",
                            "examples": ["ylecun/mnist", "stanfordnlp/imdb"]
                        },
                        "config": {
                            "type": "string",
                            "description": "Dataset configuration/subset name. Use get_info to list available configs",
                            "examples": ["default", "en", "es"]
                        },
                        "split": {
                            "type": "string",
                            "description": "Dataset split name. Splits partition the data for training/evaluation",
                            "examples": ["train", "validation", "test"]
                        },
                        "where": {
                            "type": "string",
                            "description": "SQL-like WHERE clause to filter rows",
                            "examples": ["column = \"value\"", "score > 0.5", "text LIKE \"%query%\""]
                        },
                        "orderby": {
                            "type": "string",
                            "description": "SQL-like ORDER BY clause to sort results",
                            "optional": True,
                            "examples": ["column ASC", "score DESC", "name ASC, id DESC"]
                        },
                        "page": {
                            "type": "integer",
                            "description": "Page number for paginated results (100 rows per page)",
                            "default": 0,
                            "minimum": 0
                        },
                        "auth_token": {
                            "type": "string",
                            "description": "Hugging Face auth token for private/gated datasets",
                            "optional": True
                        }
                    },
                    "required": ["dataset", "config", "split", "where"],
                }
            ),
            types.Tool(
                name="get_statistics",
                description="Get statistics about a Hugging Face dataset",
                inputSchema={
                    "type": "object",
                    "properties": {
                        "dataset": {
                            "type": "string",
                            "description": "Hugging Face dataset identifier in the format owner/dataset",
                            "pattern": "^[^/]+/[^/]+$",
                            "examples": ["ylecun/mnist", "stanfordnlp/imdb"]
                        },
                        "config": {
                            "type": "string",
                            "description": "Dataset configuration/subset name. Use get_info to list available configs",
                            "examples": ["default", "en", "es"]
                        },
                        "split": {
                            "type": "string",
                            "description": "Dataset split name. Splits partition the data for training/evaluation",
                            "examples": ["train", "validation", "test"]
                        },
                        "auth_token": {
                            "type": "string",
                            "description": "Hugging Face auth token for private/gated datasets",
                            "optional": True
                        }
                    },
                    "required": ["dataset", "config", "split"],
                }
            ),
            types.Tool(
                name="get_parquet",
                description="Export Hugging Face dataset split as Parquet file",
                inputSchema={
                    "type": "object",
                    "properties": {
                        "dataset": {
                            "type": "string",
                            "description": "Hugging Face dataset identifier in the format owner/dataset",
                            "pattern": "^[^/]+/[^/]+$",
                            "examples": ["ylecun/mnist", "stanfordnlp/imdb"]
                        },
                        "auth_token": {
                            "type": "string",
                            "description": "Hugging Face auth token for private/gated datasets",
                            "optional": True
                        }
                    },
                    "required": ["dataset"],
                }
            ),
            types.Tool(
                name="validate",
                description="Check if a Hugging Face dataset exists and is accessible",
                inputSchema={
                    "type": "object",
                    "properties": {
                        "dataset": {
                            "type": "string", 
                            "description": "Hugging Face dataset identifier in the format owner/dataset",
                            "pattern": "^[^/]+/[^/]+$",
                            "examples": ["ylecun/mnist", "stanfordnlp/imdb"]
                        },
                        "auth_token": {
                            "type": "string",
                            "description": "Hugging Face auth token for private/gated datasets",
                            "optional": True
                        }
                    },
                    "required": ["dataset"],
                }
            ),
        ]

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/privetin/dataset-viewer'
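
The same request can be issued from Python. This is a small sketch using only the standard library; it assumes the endpoint returns a JSON body, and the function names `server_url` and `fetch_server` are hypothetical.

```python
import json
import urllib.request

# Root of the directory API, taken from the curl example above.
API_ROOT = "https://glama.ai/api/mcp/v1/servers"


def server_url(owner: str, name: str) -> str:
    """Build the directory API URL for one server."""
    return f"{API_ROOT}/{owner}/{name}"


def fetch_server(owner: str, name: str) -> dict:
    """GET the server record; assumes the response body is JSON."""
    with urllib.request.urlopen(server_url(owner, name), timeout=30) as response:
        return json.load(response)
```

For this server the call would be `fetch_server("privetin", "dataset-viewer")`.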

If you have feedback or need assistance with the MCP directory API, please join our Discord server.