Dataset Viewer MCP Server

by privetin

filter

Filter rows in Hugging Face datasets using SQL-like conditions to extract specific data subsets based on column values and criteria.

Instructions

Filter rows in a Hugging Face dataset using SQL-like conditions

Input Schema

Name        Required  Description                                                                 Default
dataset     Yes       Hugging Face dataset identifier in the format owner/dataset
config      Yes       Dataset configuration/subset name. Use get_info to list available configs
split       Yes       Dataset split name. Splits partition the data for training/evaluation
where       Yes       SQL-like WHERE clause to filter rows
orderby     No        SQL-like ORDER BY clause to sort results
page        No        Page number for paginated results (100 rows per page)                       0
auth_token  No        Hugging Face auth token for private/gated datasets

Implementation Reference

  • Core handler function in DatasetViewerAPI that performs the actual filtering by calling the Hugging Face dataset viewer API /filter endpoint with validated parameters.
    async def filter(self, dataset: str, config: str, split: str, where: str, orderby: str | None = None, page: int = 0) -> dict:
        """Filter dataset rows based on conditions"""
        # Validate page number
        if page < 0:
            raise ValueError("Page number must be non-negative")
            
        # Basic SQL clause validation
        if not where.strip():
            raise ValueError("WHERE clause cannot be empty")
        if orderby and not orderby.strip():
            raise ValueError("ORDER BY clause cannot be empty")
            
        params = {
            "dataset": dataset,
            "config": config,
            "split": split,
            "where": where,
            "offset": page * 100,  # 100 rows per page
            "length": 100
        }
        if orderby:
            params["orderby"] = orderby
            
        try:
            response = await self.client.get("/filter", params=params)
            response.raise_for_status()
            return response.json()
        except httpx.NetworkError as e:
            raise ConnectionError(f"Network error while filtering dataset: {e}") from e
        except httpx.HTTPStatusError as e:
            if e.response.status_code == 400:
                raise ValueError(f"Invalid filter query: {e.response.text}") from e
            elif e.response.status_code == 404:
                raise ValueError(f"Dataset, config or split not found: {dataset}/{config}/{split}") from e
            else:
                raise RuntimeError(f"Error filtering dataset: {e}") from e
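
The handler maps HTTP failures onto three distinct Python exception types, so callers can branch on them. A minimal sketch of that consumption pattern, assuming an `api` object exposing the `filter` coroutine above (`safe_filter` is a hypothetical wrapper):

```python
# Sketch: distinguish the error classes raised by the filter handler.
# `api` stands in for a DatasetViewerAPI instance; safe_filter is hypothetical.
async def safe_filter(api, **kwargs):
    try:
        return await api.filter(**kwargs)
    except ValueError as exc:        # bad query, or unknown dataset/config/split
        return {"error": "bad_request", "detail": str(exc)}
    except ConnectionError as exc:   # network failure reaching the viewer API
        return {"error": "network", "detail": str(exc)}
    except RuntimeError as exc:      # any other HTTP error from the endpoint
        return {"error": "server", "detail": str(exc)}
```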
  • MCP tool registration including name, description, and detailed input schema for the 'filter' tool.
    types.Tool(
        name="filter",
        description="Filter rows in a Hugging Face dataset using SQL-like conditions",
        inputSchema={
            "type": "object",
            "properties": {
                "dataset": {
                    "type": "string",
                    "description": "Hugging Face dataset identifier in the format owner/dataset",
                    "pattern": "^[^/]+/[^/]+$",
                    "examples": ["ylecun/mnist", "stanfordnlp/imdb"]
                },
                "config": {
                    "type": "string",
                    "description": "Dataset configuration/subset name. Use get_info to list available configs",
                    "examples": ["default", "en", "es"]
                },
                "split": {
                    "type": "string",
                    "description": "Dataset split name. Splits partition the data for training/evaluation",
                    "examples": ["train", "validation", "test"]
                },
                "where": {
                    "type": "string",
                    "description": "SQL-like WHERE clause to filter rows",
                    "examples": ["column = 'value'", "score > 0.5", "text LIKE '%query%'"]
                },
                "orderby": {
                    "type": "string",
                    "description": "SQL-like ORDER BY clause to sort results",
                    "examples": ["column ASC", "score DESC", "name ASC, id DESC"]
                },
                "page": {
                    "type": "integer",
                    "description": "Page number for paginated results (100 rows per page)",
                    "default": 0,
                    "minimum": 0
                },
                "auth_token": {
                    "type": "string",
                    "description": "Hugging Face auth token for private/gated datasets",
                }
            },
            "required": ["dataset", "config", "split", "where"],
        }
    ),
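
The schema's key constraints (required keys, the `owner/dataset` pattern, the non-negative page) can be enforced with the standard library alone. A stdlib-only sketch; `validate_filter_args` is a hypothetical helper, and in practice an MCP client or a JSON Schema validator would do this check:

```python
# Sketch: stdlib-only enforcement of the inputSchema's main constraints.
# validate_filter_args is a hypothetical helper written for illustration.
import re

REQUIRED = ("dataset", "config", "split", "where")
DATASET_RE = re.compile(r"^[^/]+/[^/]+$")  # same pattern as the schema

def validate_filter_args(args: dict) -> None:
    missing = [k for k in REQUIRED if k not in args]
    if missing:
        raise ValueError(f"Missing required arguments: {missing}")
    if not DATASET_RE.match(args["dataset"]):
        raise ValueError("dataset must look like owner/dataset")
    if args.get("page", 0) < 0:
        raise ValueError("page must be >= 0")

validate_filter_args({
    "dataset": "ylecun/mnist", "config": "mnist", "split": "train",
    "where": "label = 3",
})
```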
  • MCP server @server.call_tool() dispatch branch that parses arguments and invokes the DatasetViewerAPI.filter method, returning JSON-formatted results.
    elif name == "filter":
        dataset = arguments["dataset"]
        config = arguments["config"]
        split = arguments["split"]
        where = arguments["where"]
        orderby = arguments.get("orderby")
        page = arguments.get("page", 0)
        auth_token = arguments.get("auth_token")
        filtered = await DatasetViewerAPI(auth_token=auth_token).filter(dataset, config=config, split=split, where=where, orderby=orderby, page=page)
        return [
            types.TextContent(
                type="text",
                text=json.dumps(filtered, indent=2)
            )
        ]
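
On the client side, the result arrives as JSON text inside a TextContent item. A sketch of unpacking it back into row dicts; the `rows`/`row` field names follow the dataset viewer API's response shape and are an assumption here, not taken from this page:

```python
# Sketch: recover plain row dicts from the tool's TextContent result.
# The "rows"/"row" payload shape is assumed from the dataset viewer API.
import json

def rows_from_result(contents) -> list[dict]:
    """Extract row dicts from a list of TextContent-like objects."""
    payload = json.loads(contents[0].text)
    return [item["row"] for item in payload.get("rows", [])]
```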

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/privetin/dataset-viewer'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.