Glama
bolinocroustibat

datagouv-mcp

search_datasets

Search for datasets on France's national open data platform using keywords to find relevant public data with metadata like title, organization, and tags.

Instructions

Search for datasets on data.gouv.fr by keywords.

This is typically the first step in exploring data.gouv.fr. Returns a list of datasets matching the search query with their metadata, including title, description, organization, tags, and resource count.

After finding relevant datasets, use get_dataset_info to get more details, or list_dataset_resources to see what files are available in a dataset.

Args:
    query: Search query string (searches in title, description, tags)
    page: Page number (default: 1)
    page_size: Number of results per page (default: 20, max: 100)

Returns: Formatted text with dataset information, including dataset IDs for further queries

Input Schema

Name       Required  Description                                                  Default
query      Yes       Search query string (searches in title, description, tags)  -
page       No        Page number                                                  1
page_size  No        Number of results per page (max: 100)                        20
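
An example tool call might pass arguments like the following (values are illustrative):

```json
{
  "query": "qualité de l'air",
  "page": 1,
  "page_size": 20
}
```

Only `query` is required; `page` and `page_size` fall back to their defaults when omitted.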

Implementation Reference

  • The @mcp.tool() decorated handler function implementing the core logic for the search_datasets tool: cleans the query, calls the API client, handles the fallback retry, and formats results as text.
    async def search_datasets(query: str, page: int = 1, page_size: int = 20) -> str:
        """
        Search for datasets on data.gouv.fr by keywords.
    
        This is typically the first step in exploring data.gouv.fr. Returns a list of
        datasets matching the search query with their metadata, including title,
        description, organization, tags, and resource count.
    
        The upstream API uses strict AND logic: too many generic words can lead to
        zero results. The tool automatically strips generic stop words (e.g. "données",
        "fichier", "csv") and, if needed, retries with the original query. Prefer short,
        specific queries; if you get no results, try again with fewer generic terms.
    
        After finding relevant datasets, use get_dataset_info to get more details, or
        list_dataset_resources to see what files are available in a dataset.
    
        Args:
            query: Search query string (searches in title, description, tags)
            page: Page number (default: 1)
            page_size: Number of results per page (default: 20, max: 100)
    
        Returns:
            Formatted text with dataset information, including dataset IDs for further queries
        """
        # Clean the query to remove generic stop words that break AND-based searches
        cleaned_query = clean_search_query(query)
    
        # Try with cleaned query first
        result = await datagouv_api_client.search_datasets(
            query=cleaned_query, page=page, page_size=page_size
        )
    
        # Format the result as text content
        datasets = result.get("data", [])
    
        # Fallback: if cleaned query returns no results and it differs from original,
        # try with the original query
        if not datasets and cleaned_query != query:
            logger.debug(
                "No results with cleaned query '%s', trying original query '%s'",
                cleaned_query,
                query,
            )
            result = await datagouv_api_client.search_datasets(
                query=query, page=page, page_size=page_size
            )
            datasets = result.get("data", [])
    
        if not datasets:
            return f"No datasets found for query: '{query}'"
    
        content_parts = [
            f"Found {result.get('total', len(datasets))} dataset(s) for query: '{query}'",
            f"Page {result.get('page', 1)} of results:\n",
        ]
        for i, ds in enumerate(datasets, 1):
            content_parts.append(f"{i}. {ds.get('title', 'Untitled')}")
            content_parts.append(f"   ID: {ds.get('id')}")
            if ds.get("description_short"):
                desc = ds.get("description_short", "")[:200]
                content_parts.append(f"   Description: {desc}...")
            if ds.get("organization"):
                content_parts.append(f"   Organization: {ds.get('organization')}")
            if ds.get("tags"):
                tags = ", ".join(ds.get("tags", [])[:5])
                content_parts.append(f"   Tags: {tags}")
            content_parts.append(f"   Resources: {ds.get('resources_count', 0)}")
            content_parts.append(f"   URL: {ds.get('url')}")
            content_parts.append("")
    
        return "\n".join(content_parts)
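  • Illustrative sketch of the clean-then-fallback pattern used above, with a stubbed search. `fake_search`, the abbreviated stop-word set, and the indexed title are hypothetical stand-ins, not the real client:

```python
import asyncio

# Abbreviated stop-word list; the real tool uses a longer set.
STOP_WORDS = {"données", "fichier", "csv"}

def clean(query: str) -> str:
    return " ".join(w for w in query.split() if w.lower() not in STOP_WORDS)

async def fake_search(query: str) -> list[dict]:
    # Stand-in for datagouv_api_client.search_datasets: a strict AND
    # index that only knows the specific terms, not the generic ones.
    index = {"qualité air": [{"title": "Qualité de l'air"}]}
    return index.get(query, [])

async def search_with_fallback(query: str) -> list[dict]:
    cleaned = clean(query)
    results = await fake_search(cleaned)
    if not results and cleaned != query:
        # Fallback: retry with the user's original query.
        results = await fake_search(query)
    return results

results = asyncio.run(search_with_fallback("données qualité air csv"))
print(results)  # → [{'title': "Qualité de l'air"}]
```

Stripping "données" and "csv" is what lets the strict AND search match here; the raw query would return nothing.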
  • Registration of the search_datasets tool: imports register_search_datasets_tool and calls it in register_tools(mcp).
    from tools.search_datasets import register_search_datasets_tool
    
    
    def register_tools(mcp: FastMCP) -> None:
        """Register all MCP tools with the provided FastMCP instance."""
        register_search_datasets_tool(mcp)
  • Helper function to clean search queries by removing common French stop words (e.g., 'données', 'fichier') that interfere with the API's strict AND search logic.
    def clean_search_query(query: str) -> str:
        """
        Clean search query by removing generic stop words that are not typically
        present in dataset metadata but are often added by users.
    
        The API uses strict AND logic, so adding generic words like "données"
        that don't appear in metadata causes searches to return zero results.
    
        Args:
            query: Original search query
    
        Returns:
            Cleaned query with stop words removed
        """
        # Stop words that are generic and often not in dataset metadata
        # These are words users commonly add but that break AND-based searches
        stop_words = {
            "données",
            "donnee",
            "donnees",
            "fichier",
            "fichiers",
            "fichier de",   # note: multi-word entries can never match,
            "fichiers de",  # since the filter compares one word at a time
            "tableau",
            "tableaux",
            "csv",
            "excel",
            "xlsx",
            "json",
            "xml",
        }
    
        # Split query on whitespace (this collapses runs of spaces)
        words = query.split()
        # Filter out stop words (case-insensitive)
        cleaned_words = [word for word in words if word.lower().strip() not in stop_words]

        # Rejoin with single spaces
        cleaned_query = " ".join(cleaned_words)
    
        if cleaned_query != query:
            logger.debug("Cleaned search query: '%s' -> '%s'", query, cleaned_query)
    
        return cleaned_query
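  • The cleaning logic can be exercised in isolation; this minimal standalone sketch replicates the word-by-word, case-insensitive filtering with an abbreviated stop-word set (not the full list above):

```python
# Abbreviated, lower-cased stop-word set; the real list also covers
# accent variants such as "donnee"/"donnees" and formats like "xlsx".
STOP_WORDS = {"données", "fichier", "csv"}

def clean_search_query(query: str) -> str:
    # Filter word by word, case-insensitively; join() collapses spacing.
    return " ".join(w for w in query.split() if w.lower() not in STOP_WORDS)

print(clean_search_query("Fichier CSV qualité de l'air"))  # prints "qualité de l'air"
```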
  • Core helper method in DataGouvAPIClient that performs the HTTP request to the data.gouv.fr API v1 /datasets/ endpoint and restructures the response into a compact dataset list.
    async def search_datasets(
        query: str,
        page: int = 1,
        page_size: int = 20,
        session: httpx.AsyncClient | None = None,
    ) -> dict[str, Any]:
        """
        Search for datasets on data.gouv.fr.
    
        Args:
            query: Search query string (searches in title, description, tags)
            page: Page number (default: 1)
            page_size: Number of results per page (default: 20, max: 100)
    
        Returns:
            dict with 'data' (list of datasets), 'page', 'page_size', and 'total'
        """
        own = session is None
        if own:
            session = httpx.AsyncClient()
        assert session is not None
        try:
            base_url: str = env_config.get_base_url("datagouv_api")
            # Use API v1 for dataset search
            url = f"{base_url}1/datasets/"
            params = {
                "q": query,
                "page": page,
                "page_size": min(page_size, 100),  # API limit
            }
            resp = await session.get(url, params=params, timeout=15.0)
            resp.raise_for_status()
            data = resp.json()
    
            datasets: list[dict[str, Any]] = data.get("data", [])
            # Extract relevant fields for each dataset
            results: list[dict[str, Any]] = []
            for ds in datasets:
                # Handle tags - can be strings or objects with "name" field
                tags: list[str] = []
                for tag in ds.get("tags", []):
                    if isinstance(tag, str):
                        tags.append(tag)
                    elif isinstance(tag, dict):
                        tags.append(tag.get("name", ""))
    
                results.append(
                    {
                        "id": ds.get("id"),
                        "title": ds.get("title") or ds.get("name", ""),
                        "description": ds.get("description", ""),
                        "description_short": ds.get("description_short", ""),
                        "slug": ds.get("slug", ""),
                        "organization": ds.get("organization", {}).get("name")
                        if ds.get("organization")
                        else None,
                        "tags": tags,
                        "resources_count": len(ds.get("resources", [])),
                        "url": f"{env_config.get_base_url('site')}datasets/{ds.get('slug', ds.get('id', ''))}",
                    }
                )
    
            return {
                "data": results,
                "page": page,
                "page_size": len(results),
                "total": data.get("total", len(results)),
            }
        finally:
            if own:
                await session.aclose()
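  • The tag-handling branch above, extracted into a standalone helper for illustration (`normalize_tags` is a hypothetical name, not part of the client):

```python
from typing import Any

def normalize_tags(raw_tags: list[Any]) -> list[str]:
    # The API may return tags as plain strings or as objects carrying
    # a "name" field; normalize both shapes to a flat list of strings.
    tags: list[str] = []
    for tag in raw_tags:
        if isinstance(tag, str):
            tags.append(tag)
        elif isinstance(tag, dict):
            tags.append(tag.get("name", ""))
    return tags

print(normalize_tags(["paris", {"name": "air"}]))  # → ['paris', 'air']
```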
