search_datasets
Search for datasets on data.gouv.fr, France's national open data platform, by keyword. Returns relevant public datasets with metadata such as title, organization, and tags.
Instructions
Search for datasets on data.gouv.fr by keywords.
This is typically the first step in exploring data.gouv.fr. Returns a list of datasets matching the search query with their metadata, including title, description, organization, tags, and resource count.
After finding relevant datasets, use get_dataset_info to get more details, or list_dataset_resources to see what files are available in a dataset.
Args:
- query: Search query string (searches in title, description, tags)
- page: Page number (default: 1)
- page_size: Number of results per page (default: 20, max: 100)
Returns: Formatted text with dataset information, including dataset IDs for further queries
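As a minimal invocation sketch, assuming the server is run over stdio and called with the official MCP Python SDK: the entry-point path (server.py) and the example query below are placeholders, not part of this project.

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Hypothetical entry point for the data.gouv.fr MCP server; adjust to your setup.
server_params = StdioServerParameters(command="python", args=["server.py"])


async def main() -> None:
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Arguments mirror the input schema: query is required,
            # page and page_size fall back to 1 and 20 when omitted.
            result = await session.call_tool(
                "search_datasets",
                {"query": "qualité de l'air", "page": 1, "page_size": 5},
            )
            # The tool returns formatted text; print the text content blocks.
            for block in result.content:
                if block.type == "text":
                    print(block.text)


asyncio.run(main())
```

The dataset IDs in the returned text can then be passed to get_dataset_info or list_dataset_resources.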
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| query | Yes | Search query string (searches in title, description, tags) | |
| page | No | Page number | 1 |
| page_size | No | Number of results per page (max: 100) | 20 |
Implementation Reference
- tools/search_datasets.py:61-132 (handler) — the @mcp.tool()-decorated handler function implementing the core logic of the search_datasets tool: it cleans the query, calls the API client, handles the fallback, and formats the results as text.

```python
async def search_datasets(query: str, page: int = 1, page_size: int = 20) -> str:
    """
    Search for datasets on data.gouv.fr by keywords.

    This is typically the first step in exploring data.gouv.fr. Returns a list of
    datasets matching the search query with their metadata, including title,
    description, organization, tags, and resource count.

    The upstream API uses strict AND logic: too many generic words can lead to
    zero results. The tool automatically strips generic stop words (e.g. "données",
    "fichier", "csv") and, if needed, retries with the original query. Prefer short,
    specific queries; if you get no results, try again with fewer generic terms.

    After finding relevant datasets, use get_dataset_info to get more details, or
    list_dataset_resources to see what files are available in a dataset.

    Args:
        query: Search query string (searches in title, description, tags)
        page: Page number (default: 1)
        page_size: Number of results per page (default: 20, max: 100)

    Returns:
        Formatted text with dataset information, including dataset IDs for further queries
    """
    # Clean the query to remove generic stop words that break AND-based searches
    cleaned_query = clean_search_query(query)

    # Try with cleaned query first
    result = await datagouv_api_client.search_datasets(
        query=cleaned_query, page=page, page_size=page_size
    )

    # Format the result as text content
    datasets = result.get("data", [])

    # Fallback: if cleaned query returns no results and it differs from original,
    # try with the original query
    if not datasets and cleaned_query != query:
        logger.debug(
            "No results with cleaned query '%s', trying original query '%s'",
            cleaned_query,
            query,
        )
        result = await datagouv_api_client.search_datasets(
            query=query, page=page, page_size=page_size
        )
        datasets = result.get("data", [])

    if not datasets:
        return f"No datasets found for query: '{query}'"

    content_parts = [
        f"Found {result.get('total', len(datasets))} dataset(s) for query: '{query}'",
        f"Page {result.get('page', 1)} of results:\n",
    ]

    for i, ds in enumerate(datasets, 1):
        content_parts.append(f"{i}. {ds.get('title', 'Untitled')}")
        content_parts.append(f" ID: {ds.get('id')}")
        if ds.get("description_short"):
            desc = ds.get("description_short", "")[:200]
            content_parts.append(f" Description: {desc}...")
        if ds.get("organization"):
            content_parts.append(f" Organization: {ds.get('organization')}")
        if ds.get("tags"):
            tags = ", ".join(ds.get("tags", [])[:5])
            content_parts.append(f" Tags: {tags}")
        content_parts.append(f" Resources: {ds.get('resources_count', 0)}")
        content_parts.append(f" URL: {ds.get('url')}")
        content_parts.append("")

    return "\n".join(content_parts)
```
- tools/__init__.py:11-16 (registration) — registration of the search_datasets tool: imports register_search_datasets_tool and calls it in register_tools(mcp); see the server wiring sketch after this list.

```python
from tools.search_datasets import register_search_datasets_tool


def register_tools(mcp: FastMCP) -> None:
    """Register all MCP tools with the provided FastMCP instance."""
    register_search_datasets_tool(mcp)
```
- tools/search_datasets.py:10-56 (helper) — helper function that cleans search queries by removing common French stop words (e.g., "données", "fichier") that interfere with the API's strict AND search logic.

```python
def clean_search_query(query: str) -> str:
    """
    Clean search query by removing generic stop words that are not typically
    present in dataset metadata but are often added by users.

    The API uses strict AND logic, so adding generic words like "données" that
    don't appear in metadata causes searches to return zero results.

    Args:
        query: Original search query

    Returns:
        Cleaned query with stop words removed
    """
    # Stop words that are generic and often not in dataset metadata
    # These are words users commonly add but that break AND-based searches
    stop_words = {
        "données",
        "donnee",
        "donnees",
        "fichier",
        "fichiers",
        "fichier de",
        "fichiers de",
        "tableau",
        "tableaux",
        "csv",
        "excel",
        "xlsx",
        "json",
        "xml",
    }

    # Split query into words, preserving spacing
    words = query.split()

    # Filter out stop words (case-insensitive)
    cleaned_words = [word for word in words if word.lower().strip() not in stop_words]

    # Rejoin words, preserving original spacing pattern
    cleaned_query = " ".join(cleaned_words)

    # Clean up multiple spaces
    cleaned_query = " ".join(cleaned_query.split())

    if cleaned_query != query:
        logger.debug("Cleaned search query: '%s' -> '%s'", query, cleaned_query)

    return cleaned_query
```
- Core helper method in DataGouvAPIClient that makes the HTTP request to the data.gouv.fr API (v1, /datasets/), then processes and structures the response data for datasets.

```python
async def search_datasets(
    query: str,
    page: int = 1,
    page_size: int = 20,
    session: httpx.AsyncClient | None = None,
) -> dict[str, Any]:
    """
    Search for datasets on data.gouv.fr.

    Args:
        query: Search query string (searches in title, description, tags)
        page: Page number (default: 1)
        page_size: Number of results per page (default: 20, max: 100)

    Returns:
        dict with 'data' (list of datasets), 'page', 'page_size', and 'total'
    """
    own = session is None
    if own:
        session = httpx.AsyncClient()
    assert session is not None
    try:
        base_url: str = env_config.get_base_url("datagouv_api")
        # Use API v1 for dataset search
        url = f"{base_url}1/datasets/"
        params = {
            "q": query,
            "page": page,
            "page_size": min(page_size, 100),  # API limit
        }
        resp = await session.get(url, params=params, timeout=15.0)
        resp.raise_for_status()
        data = resp.json()

        datasets: list[dict[str, Any]] = data.get("data", [])

        # Extract relevant fields for each dataset
        results: list[dict[str, Any]] = []
        for ds in datasets:
            # Handle tags - can be strings or objects with "name" field
            tags: list[str] = []
            for tag in ds.get("tags", []):
                if isinstance(tag, str):
                    tags.append(tag)
                elif isinstance(tag, dict):
                    tags.append(tag.get("name", ""))

            results.append(
                {
                    "id": ds.get("id"),
                    "title": ds.get("title") or ds.get("name", ""),
                    "description": ds.get("description", ""),
                    "description_short": ds.get("description_short", ""),
                    "slug": ds.get("slug", ""),
                    "organization": ds.get("organization", {}).get("name")
                    if ds.get("organization")
                    else None,
                    "tags": tags,
                    "resources_count": len(ds.get("resources", [])),
                    "url": f"{env_config.get_base_url('site')}datasets/{ds.get('slug', ds.get('id', ''))}",
                }
            )

        return {
            "data": results,
            "page": page,
            "page_size": len(results),
            "total": data.get("total", len(results)),
        }
    finally:
        if own:
            await session.aclose()
```
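For context, here is a minimal sketch of how register_tools might be wired into a FastMCP server. The server name ("datagouv") and the bootstrap module are assumptions for illustration, not taken from the repository.

```python
from mcp.server.fastmcp import FastMCP

from tools import register_tools

# Hypothetical server bootstrap; the project's actual entry point may differ.
mcp = FastMCP("datagouv")
register_tools(mcp)

if __name__ == "__main__":
    mcp.run()
```

Once registered, the handler applies clean_search_query before hitting the API, so a query like "données qualité de l'air csv" is searched as "qualité de l'air" first, with the original query used only as a fallback.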
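To see the underlying request outside the client class, a standalone sketch against the public endpoint: the hard-coded base URL stands in for env_config.get_base_url("datagouv_api"), and the example query is only illustrative.

```python
import asyncio
from typing import Any

import httpx


async def raw_search(query: str, page: int = 1, page_size: int = 20) -> dict[str, Any]:
    # Same endpoint and parameters the API client uses (API v1, /datasets/).
    url = "https://www.data.gouv.fr/api/1/datasets/"
    params = {"q": query, "page": page, "page_size": min(page_size, 100)}
    async with httpx.AsyncClient() as session:
        resp = await session.get(url, params=params, timeout=15.0)
        resp.raise_for_status()
        return resp.json()


data = asyncio.run(raw_search("qualité de l'air", page_size=5))
# Each entry carries the fields the client extracts: id, title, slug,
# organization, tags, and the resources list.
for ds in data.get("data", []):
    print(ds.get("id"), "-", ds.get("title"))
```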