Glama
bolinocroustibat

datagouv-mcp

search_datasets

Search for datasets on France's national open data platform using keywords to find relevant public data with metadata like title, organization, and tags.

Instructions

Search for datasets on data.gouv.fr by keywords.

This is typically the first step in exploring data.gouv.fr. Returns a list of datasets matching the search query with their metadata, including title, description, organization, tags, and resource count.

After finding relevant datasets, use get_dataset_info to get more details, or list_dataset_resources to see what files are available in a dataset.

Args:
    query: Search query string (searches in title, description, tags)
    page: Page number (default: 1)
    page_size: Number of results per page (default: 20, max: 100)

Returns: Formatted text with dataset information, including dataset IDs for further queries

Input Schema

Name       Required  Description                                                  Default
query      Yes       Search query string (searches in title, description, tags)  -
page       No        Page number                                                  1
page_size  No        Number of results per page (max: 100)                        20
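
An example tool call might pass arguments like the following (values are illustrative):

```json
{
  "query": "qualité de l'air",
  "page": 1,
  "page_size": 20
}
```

Only `query` is required; `page` and `page_size` fall back to their defaults when omitted.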

Implementation Reference

  • The @mcp.tool() decorated handler function implementing the core logic for the search_datasets tool: cleans the query, calls the API client, handles the fallback retry, and formats results as text.
    async def search_datasets(query: str, page: int = 1, page_size: int = 20) -> str:
        """
        Search for datasets on data.gouv.fr by keywords.
    
        This is typically the first step in exploring data.gouv.fr. Returns a list of
        datasets matching the search query with their metadata, including title,
        description, organization, tags, and resource count.
    
        The upstream API uses strict AND logic: too many generic words can lead to
        zero results. The tool automatically strips generic stop words (e.g. "données",
        "fichier", "csv") and, if needed, retries with the original query. Prefer short,
        specific queries; if you get no results, try again with fewer generic terms.
    
        After finding relevant datasets, use get_dataset_info to get more details, or
        list_dataset_resources to see what files are available in a dataset.
    
        Args:
            query: Search query string (searches in title, description, tags)
            page: Page number (default: 1)
            page_size: Number of results per page (default: 20, max: 100)
    
        Returns:
            Formatted text with dataset information, including dataset IDs for further queries
        """
        # Clean the query to remove generic stop words that break AND-based searches
        cleaned_query = clean_search_query(query)
    
        # Try with cleaned query first
        result = await datagouv_api_client.search_datasets(
            query=cleaned_query, page=page, page_size=page_size
        )
    
        # Format the result as text content
        datasets = result.get("data", [])
    
        # Fallback: if cleaned query returns no results and it differs from original,
        # try with the original query
        if not datasets and cleaned_query != query:
            logger.debug(
                "No results with cleaned query '%s', trying original query '%s'",
                cleaned_query,
                query,
            )
            result = await datagouv_api_client.search_datasets(
                query=query, page=page, page_size=page_size
            )
            datasets = result.get("data", [])
    
        if not datasets:
            return f"No datasets found for query: '{query}'"
    
        content_parts = [
            f"Found {result.get('total', len(datasets))} dataset(s) for query: '{query}'",
            f"Page {result.get('page', 1)} of results:\n",
        ]
        for i, ds in enumerate(datasets, 1):
            content_parts.append(f"{i}. {ds.get('title', 'Untitled')}")
            content_parts.append(f"   ID: {ds.get('id')}")
            if ds.get("description_short"):
                desc = ds.get("description_short", "")[:200]
                content_parts.append(f"   Description: {desc}...")
            if ds.get("organization"):
                content_parts.append(f"   Organization: {ds.get('organization')}")
            if ds.get("tags"):
                tags = ", ".join(ds.get("tags", [])[:5])
                content_parts.append(f"   Tags: {tags}")
            content_parts.append(f"   Resources: {ds.get('resources_count', 0)}")
            content_parts.append(f"   URL: {ds.get('url')}")
            content_parts.append("")
    
        return "\n".join(content_parts)
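  • Illustrative sketch of the clean-then-fallback pattern used above, with a stubbed search. `fake_search`, the abbreviated stop-word set, and the indexed title are hypothetical stand-ins, not the real client:

```python
import asyncio

# Abbreviated stop-word list; the real tool uses a longer set.
STOP_WORDS = {"données", "fichier", "csv"}

def clean(query: str) -> str:
    return " ".join(w for w in query.split() if w.lower() not in STOP_WORDS)

async def fake_search(query: str) -> list[dict]:
    # Stand-in for datagouv_api_client.search_datasets: a strict AND
    # index that only knows the specific terms, not the generic ones.
    index = {"qualité air": [{"title": "Qualité de l'air"}]}
    return index.get(query, [])

async def search_with_fallback(query: str) -> list[dict]:
    cleaned = clean(query)
    results = await fake_search(cleaned)
    if not results and cleaned != query:
        # Fallback: retry with the user's original query.
        results = await fake_search(query)
    return results

results = asyncio.run(search_with_fallback("données qualité air csv"))
print(results)  # → [{'title': "Qualité de l'air"}]
```

Stripping "données" and "csv" is what lets the strict AND search match here; the raw query would return nothing.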
  • Registration of the search_datasets tool: imports register_search_datasets_tool and calls it in register_tools(mcp).
    from tools.search_datasets import register_search_datasets_tool
    
    
    def register_tools(mcp: FastMCP) -> None:
        """Register all MCP tools with the provided FastMCP instance."""
        register_search_datasets_tool(mcp)
  • Helper function to clean search queries by removing common French stop words (e.g., 'données', 'fichier') that interfere with the API's strict AND search logic.
    def clean_search_query(query: str) -> str:
        """
        Clean search query by removing generic stop words that are not typically
        present in dataset metadata but are often added by users.
    
        The API uses strict AND logic, so adding generic words like "données"
        that don't appear in metadata causes searches to return zero results.
    
        Args:
            query: Original search query
    
        Returns:
            Cleaned query with stop words removed
        """
        # Stop words that are generic and often not in dataset metadata
        # These are words users commonly add but that break AND-based searches
        stop_words = {
            "données",
            "donnee",
            "donnees",
            "fichier",
            "fichiers",
            "fichier de",   # note: multi-word entries can never match,
            "fichiers de",  # since the filter compares one word at a time
            "tableau",
            "tableaux",
            "csv",
            "excel",
            "xlsx",
            "json",
            "xml",
        }
    
        # Split query on whitespace (this collapses runs of spaces)
        words = query.split()
        # Filter out stop words (case-insensitive)
        cleaned_words = [word for word in words if word.lower().strip() not in stop_words]

        # Rejoin with single spaces
        cleaned_query = " ".join(cleaned_words)
    
        if cleaned_query != query:
            logger.debug("Cleaned search query: '%s' -> '%s'", query, cleaned_query)
    
        return cleaned_query
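  • The cleaning logic can be exercised in isolation; this minimal standalone sketch replicates the word-by-word, case-insensitive filtering with an abbreviated stop-word set (not the full list above):

```python
# Abbreviated, lower-cased stop-word set; the real list also covers
# accent variants such as "donnee"/"donnees" and formats like "xlsx".
STOP_WORDS = {"données", "fichier", "csv"}

def clean_search_query(query: str) -> str:
    # Filter word by word, case-insensitively; join() collapses spacing.
    return " ".join(w for w in query.split() if w.lower() not in STOP_WORDS)

print(clean_search_query("Fichier CSV qualité de l'air"))  # prints "qualité de l'air"
```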
  • Core helper method in DataGouvAPIClient that performs the HTTP request to the data.gouv.fr API v1 /datasets/ endpoint and restructures the response into a compact dataset list.
    async def search_datasets(
        query: str,
        page: int = 1,
        page_size: int = 20,
        session: httpx.AsyncClient | None = None,
    ) -> dict[str, Any]:
        """
        Search for datasets on data.gouv.fr.
    
        Args:
            query: Search query string (searches in title, description, tags)
            page: Page number (default: 1)
            page_size: Number of results per page (default: 20, max: 100)
    
        Returns:
            dict with 'data' (list of datasets), 'page', 'page_size', and 'total'
        """
        own = session is None
        if own:
            session = httpx.AsyncClient()
        assert session is not None
        try:
            base_url: str = env_config.get_base_url("datagouv_api")
            # Use API v1 for dataset search
            url = f"{base_url}1/datasets/"
            params = {
                "q": query,
                "page": page,
                "page_size": min(page_size, 100),  # API limit
            }
            resp = await session.get(url, params=params, timeout=15.0)
            resp.raise_for_status()
            data = resp.json()
    
            datasets: list[dict[str, Any]] = data.get("data", [])
            # Extract relevant fields for each dataset
            results: list[dict[str, Any]] = []
            for ds in datasets:
                # Handle tags - can be strings or objects with "name" field
                tags: list[str] = []
                for tag in ds.get("tags", []):
                    if isinstance(tag, str):
                        tags.append(tag)
                    elif isinstance(tag, dict):
                        tags.append(tag.get("name", ""))
    
                results.append(
                    {
                        "id": ds.get("id"),
                        "title": ds.get("title") or ds.get("name", ""),
                        "description": ds.get("description", ""),
                        "description_short": ds.get("description_short", ""),
                        "slug": ds.get("slug", ""),
                        "organization": ds.get("organization", {}).get("name")
                        if ds.get("organization")
                        else None,
                        "tags": tags,
                        "resources_count": len(ds.get("resources", [])),
                        "url": f"{env_config.get_base_url('site')}datasets/{ds.get('slug', ds.get('id', ''))}",
                    }
                )
    
            return {
                "data": results,
                "page": page,
                "page_size": len(results),
                "total": data.get("total", len(results)),
            }
        finally:
            if own:
                await session.aclose()
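  • The tag-handling branch above, extracted into a standalone helper for illustration (`normalize_tags` is a hypothetical name, not part of the client):

```python
from typing import Any

def normalize_tags(raw_tags: list[Any]) -> list[str]:
    # The API may return tags as plain strings or as objects carrying
    # a "name" field; normalize both shapes to a flat list of strings.
    tags: list[str] = []
    for tag in raw_tags:
        if isinstance(tag, str):
            tags.append(tag)
        elif isinstance(tag, dict):
            tags.append(tag.get("name", ""))
    return tags

print(normalize_tags(["paris", {"name": "air"}]))  # → ['paris', 'air']
```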
