MCP Data Server
Provides SQL querying capabilities over cloud-native geospatial parquet datasets using DuckDB, with H3 spatial indexing for efficient operations.
Documentation · The bigger picture
An open Model Context Protocol (MCP) server that connects AI agents to cloud-native data: it grounds the agent in STAC metadata so it finds the right dataset and reads its schema, and confines it to validated cloud-native engines so it queries terabyte-scale data over S3 without downloading it, misreading it, or silently failing at scale. Today it serves SQL over Parquet via DuckDB with H3 spatial indexing; see the roadmap for array (Zarr) and hardware-accelerated engines.
It is one of three open-source components — with data-workflows (which produces the AI-ready data and metadata) and jupyter-geoagent — that together make the cloud-native stack reachable by the AI tools researchers already use. Runs locally for sensitive data or on autoscaling Kubernetes for scale.
Quick Start
Add the hosted MCP endpoint to your LLM client, like so:
Using VSCode
Create a .vscode/mcp.json like this (as in this repo):
{
"servers": {
"duckdb-geo": {
"url": "https://duckdb-mcp.nrp-nautilus.io/mcp"
}
}
}
Now simply ask your chat client a question about the datasets and it should answer by querying the database in SQL:
Examples:
What fraction of Australia is protected area?

Using Claude Code (CLI)
Run this command once in your terminal:
claude mcp add --transport http duckdb-geo https://duckdb-mcp.nrp-nautilus.io/mcp
To make it available across all your projects, add --scope user:
claude mcp add --transport http --scope user duckdb-geo https://duckdb-mcp.nrp-nautilus.io/mcp
Using Claude Desktop
Add to your Claude Desktop configuration file:
macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
Windows: %APPDATA%\Claude\claude_desktop_config.json
Linux: ~/.config/Claude/claude_desktop_config.json
{
"mcpServers": {
"duckdb-geo": {
"url": "https://duckdb-mcp.nrp-nautilus.io/mcp"
}
}
}
After adding the configuration, restart Claude Desktop.
Features
Zero-Configuration SQL Access: Query petabytes of geospatial data without database setup
H3 Geospatial Indexing: Efficient spatial operations using Uber's H3 hexagonal grid system
Isolated Execution: Each query runs in a fresh DuckDB instance for security
Stateless HTTP Mode: Fully horizontally scalable for cloud deployment
Rich Dataset Catalog: Access to 10+ curated environmental and biodiversity datasets
MCP Resources & Prompts: Browse datasets and get query guidance through MCP protocol
Available Datasets
The example configuration provides access to the following datasets via S3:
GLWD - Global Lakes and Wetlands Database
Vulnerable Carbon - Conservation International carbon vulnerability data
NCP - Nature Contributions to People biodiversity scores
Countries & Regions - Global administrative boundaries (Overture Maps)
WDPA - World Database on Protected Areas
Ramsar Sites - Wetlands of International Importance
HydroBASINS - Global watershed boundaries (levels 3-6)
iNaturalist - Species occurrence range maps
Corruption Index 2024 - Transparency International data
Datasets are discovered dynamically from the STAC catalog via the list_datasets and get_dataset tools.
Local Development
You can also run the server locally. To install dependencies and run it directly:
pip install -r requirements.txt
python server.py
You can now connect to the server over localhost (note http, not https, here), e.g. in VSCode:
{
"servers": {
"duckdb-geo": {
"url": "http://localhost:8000/mcp"
}
}
}
You can adjust the instructions to the LLM in the corresponding .md files (e.g. query-optimization.md, h3-guide.md). You will need to adjust query-setup.md to run the server locally, as it uses an endpoint and a thread count that only work from inside our k8s cluster.
Running locally means your local CPU+network resources will be used for the computation, which will likely be much slower than the hosted k8s endpoint.
Architecture
We have a fully-hosted version
Core Components
server.py - Main MCP server with FastMCP framework
stac.py - STAC catalog integration for dynamic dataset discovery
Runtime Prompt Files
The .md files in this repo are not documentation: they are curated prompt content loaded by server.py at startup and injected directly into MCP tool descriptions and prompts at runtime. Their reader is the agent (LLM), which treats them as instructions, not a human.
File | How it is used |
| SQL parsed and executed in every fresh DuckDB connection before a query runs |
| Injected verbatim into the |
| Injected verbatim into the |
| Served as the |
Editing these files changes what the agent is told to do. They must be written for a stateless LLM — short, concrete, and unambiguous. See AGENTS.md for editing rules.
Key Design Patterns
Stateless transport: FastMCP runs in stateless streamable-HTTP mode (stateless_http=True in server.py). Every POST /mcp is a complete, independent request/response: no Mcp-Session-Id, no per-pod session cache, no in-memory state that survives across requests. Replicas behind the load balancer are interchangeable on a per-request basis. (The protocol's stateful SSE mode is not used.)
Isolation Engine: Each query runs in a fresh duckdb.connect(":memory:"); no DuckDB connection, credential, or query state survives between requests.
Context Injection: Prompt files are embedded into tool descriptions so even MCP clients that don't support prompts/list receive the guidance.
Partition Pruning: H3 resolution columns (h0) enable DuckDB to skip S3 partitions, giving 5–20× speedups on large datasets.
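The isolation pattern described above can be sketched in a few lines. This is only an illustration, not the code in server.py: it uses Python's built-in sqlite3 as a stand-in for DuckDB so that it runs without extra dependencies, and the helper names run_isolated and table_visible are hypothetical.

```python
import sqlite3


def run_isolated(sql: str) -> list:
    """Run one statement in a brand-new in-memory database, then discard it."""
    # Stand-in for duckdb.connect(":memory:"); each call gets its own database.
    conn = sqlite3.connect(":memory:")
    try:
        return conn.execute(sql).fetchall()
    finally:
        # Closing the connection drops every table, setting, and secret with it.
        conn.close()


# The first "request" creates a table that lives only inside its own connection.
run_isolated("CREATE TABLE scratch AS SELECT 42 AS answer")


def table_visible() -> bool:
    """Check whether a later request can see state left by an earlier one."""
    try:
        run_isolated("SELECT answer FROM scratch")
        return True
    except sqlite3.OperationalError:
        # "no such table": the earlier connection's state is gone.
        return False
```

Because nothing is shared between calls, any replica can serve any request, which is what makes the stateless transport workable.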
Kubernetes Deployment
Deploy to Kubernetes using the provided manifests:
kubectl apply -f k8s/deployment.yaml
kubectl apply -f k8s/service.yaml
kubectl apply -f k8s/ingress.yaml
The deployment:
Runs multiple replicas for high availability (prod: 6, dev: 2)
Allocates up to 160 Gi memory / 16 CPU per pod for large queries
Bakes the application code and dependencies into the image (no runtime clone)
Includes /healthz readiness + liveness probes for safe rollouts
Releases and production rollouts
Application code is baked into the image (COPY . /app in the Dockerfile); pods no longer git clone at startup. CI (.github/workflows/docker.yml) builds on every push to main and on vX.Y.Z tags:
dev pins the moving :main tag (imagePullPolicy: Always) and tracks the latest main.
prod pins an immutable vX.Y.Z@sha256:<digest> (imagePullPolicy: IfNotPresent), so every replica is identical by construction.
By convention, every release tag is a GitHub Release: when you cut a version, push the tag (CI builds :vX.Y.Z), publish a release with notes (gh release create vX.Y.Z --generate-notes), then pin prod to that build's digest. The full step-by-step (including reading the digest) lives in AGENTS.md → Rollout workflow. The latest GitHub Release is the source of truth for "what should be running in prod."
To confirm prod is on the intended release:
# Digest the prod manifest pins:
kubectl -n biodiversity get deploy duckdb-mcp \
-o jsonpath='{.spec.template.spec.containers[0].image}'
# Every running pod should report that same digest:
kubectl -n biodiversity get pods -l app=duckdb-mcp \
-o 'custom-columns=NAME:.metadata.name,IMAGE:.status.containerStatuses[0].imageID'
If every pod's imageID matches the pinned digest, prod is current and consistent.
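The comparison can also be scripted. The sketch below assumes you have already captured the pinned image string and the pods' imageID values from the two kubectl commands; the function names and the example registry path are hypothetical, and the only assumption about the strings is that each contains a sha256:<64 hex chars> digest.

```python
import re
from typing import List, Optional

# A content digest as it appears in image references and pod imageIDs.
DIGEST_RE = re.compile(r"sha256:[0-9a-f]{64}")


def extract_digest(image_ref: str) -> Optional[str]:
    """Return the sha256 digest embedded in an image reference, if any."""
    match = DIGEST_RE.search(image_ref)
    return match.group(0) if match else None


def pods_match_pin(pinned_image: str, pod_image_ids: List[str]) -> bool:
    """True only if there is at least one pod and every pod runs the pinned digest."""
    pinned = extract_digest(pinned_image)
    if pinned is None:
        raise ValueError("pinned image has no sha256 digest")
    return bool(pod_image_ids) and all(
        extract_digest(image_id) == pinned for image_id in pod_image_ids
    )


# Example with made-up values:
digest = "sha256:" + "a" * 64
pinned = f"registry.example/duckdb-mcp:v1.2.3@{digest}"
pods = [f"registry.example/duckdb-mcp@{digest}"] * 3
print(pods_match_pin(pinned, pods))
```

Comparing digests rather than tags matters here because a tag can move, while a digest identifies one exact image build.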
MCP Protocol Features
Tools
browse_stac_catalog(catalog_url?, catalog_token?) - List available datasets from the STAC catalog
get_stac_details(dataset_id, catalog_url?, catalog_token?) - Get S3 paths and schema for a dataset
query(sql_query, s3_key?, s3_secret?, s3_endpoint?, s3_scope?) - Execute DuckDB SQL against S3 parquet files
Resources
NOTE: Some MCP clients, such as the one in VSCode, do not recognize "resources" and "prompts". Newer clients (Claude Code, Continue.dev, Antigravity) do.
catalog://list - List all available datasets
catalog://{name} - Get detailed schema for a specific dataset
Prompts
geospatial-analyst- Load complete context for geospatial analysis persona
Query Optimization Tips
Always include h0 in joins - Enables partition pruning for 5-20x speedup
Use APPROX_COUNT_DISTINCT(h8) - Fast area calculations with H3 hexagons
Filter small tables first - Create CTEs to reduce join cardinality
Set THREADS=100 - Parallel S3 reads are I/O bound, not CPU bound
Enable object cache - Reduces redundant S3 requests
See query-optimization.md for detailed guidance.
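As a rough illustration of how the first three tips combine, the sketch below assembles a query that filters the small table in a CTE, joins on h0 as well as h8, and uses APPROX_COUNT_DISTINCT for the area. The S3 paths, the country column name, and the function name are assumptions made up for this example; check the real schema via the catalog tools before adapting it.

```python
import re

H8_AREA_KM2 = 0.737327598  # km² per resolution-8 hex, as used elsewhere in this README


def protected_area_sql(countries_path: str, protected_path: str, country_code: str) -> str:
    """Build a DuckDB query following the tips above (hypothetical schema)."""
    # Guard the interpolated value; a two-letter code is an assumption of this sketch.
    if not re.fullmatch(r"[A-Z]{2}", country_code):
        raise ValueError("expected a two-letter uppercase country code")
    return f"""
-- Tip: filter the small table first, in a CTE
WITH country AS (
    SELECT h0, h8
    FROM read_parquet('{countries_path}')
    WHERE country = '{country_code}'
)
-- Tip: approximate distinct count of h8 cells, scaled to km²
SELECT APPROX_COUNT_DISTINCT(p.h8) * {H8_AREA_KM2} AS protected_km2
FROM country AS c
JOIN read_parquet('{protected_path}') AS p
  -- Tip: include h0 in the join so partitions can be pruned
  ON c.h0 = p.h0 AND c.h8 = p.h8
""".strip()


print(protected_area_sql(
    "s3://example-bucket/countries/**",   # hypothetical path
    "s3://example-bucket/wdpa/**",        # hypothetical path
    "AU",
))
```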
H3 Spatial Operations
All datasets use Uber's H3 hexagonal grid system for spatial indexing:
Resolution 8 (h8): ~0.737 km² per hex
Resolution 0-4 (h0-h4): Coarser resolutions for global analysis
Use h3_cell_to_parent() to join datasets at different resolutions
Use APPROX_COUNT_DISTINCT(h8) * 0.737327598 to calculate areas in km²
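The area arithmetic is simple enough to check by hand. The sketch below mirrors it in plain Python with an exact distinct count (the SQL version uses an approximate count, so results can differ slightly); the cell IDs are made-up integers and the function names are hypothetical.

```python
H8_AREA_KM2 = 0.737327598  # km² per resolution-8 hex, from the text above


def area_km2(h8_cells) -> float:
    """Area covered by a collection of h8 cell IDs; duplicates count once."""
    return len(set(h8_cells)) * H8_AREA_KM2


def fraction_covered(subset_cells, all_cells) -> float:
    """Share of `all_cells` that also appears in `subset_cells`."""
    total = set(all_cells)
    if not total:
        raise ValueError("all_cells must not be empty")
    return len(set(subset_cells) & total) / len(total)


# Made-up cell IDs: a "country" of four cells, two of them "protected".
country = [101, 102, 103, 104]
protected = [102, 102, 104, 999]  # duplicate and out-of-country cells are ignored
print(area_km2(country))
print(fraction_covered(protected, country))
```

A question such as "what fraction of a country is protected" reduces to this ratio of distinct cell counts, since every h8 cell is assigned the same area.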
Testing
# Run all tests
pytest tests/
# Run specific test file
pytest tests/test_server.py
# Run with coverage
pytest --cov=. tests/
Configuration
Environment Variables
THREADS - DuckDB thread count (default: 100 for S3 workloads)
PORT - HTTP server port (default: 8000)
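One plausible way to read these variables with their documented defaults is sketched below. How server.py actually parses them is not shown in this README, so treat load_settings as a hypothetical helper.

```python
import os
from typing import Mapping, Optional


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read THREADS and PORT, falling back to the defaults documented above."""
    env = os.environ if env is None else env
    return {
        "threads": int(env.get("THREADS", "100")),  # DuckDB thread count
        "port": int(env.get("PORT", "8000")),       # HTTP server port
    }


# With nothing set, the documented defaults apply.
print(load_settings({}))
# A local run would typically lower the thread count.
print(load_settings({"THREADS": "8", "PORT": "9000"}))
```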
DuckDB Settings
Required settings are documented in query-setup.md and automatically injected into query tool descriptions.
Private Data Access
The server supports private STAC catalogs and private S3 buckets. Credentials are supplied per-call by the client and are scoped to that request only — they are never logged, cached, or shared between clients.
Private STAC catalog
If your STAC catalog requires authentication, pass a bearer token alongside the catalog URL:
{ "tool": "list_datasets", "arguments": {
"catalog_url": "https://your-app.example.org/stac/catalog.json",
"catalog_token": "YOUR_BEARER_TOKEN"
}}
The token is forwarded as Authorization: Bearer <token> when fetching catalog JSON. Pass the same catalog_url and catalog_token to get_dataset as well.
Serving a private catalog: The catalog endpoint needs to accept bearer token authentication for machine-to-machine access. If you are using oauth2-proxy for human (browser) access, add a parallel nginx auth_request bypass for the /stac/ path that accepts a static shared token via the Authorization header. This allows the MCP server to fetch catalog metadata without requiring a browser OAuth session.
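To make the header forwarding concrete, here is a minimal sketch of a catalog fetch request built with Python's standard library. It only constructs the request and does not send it; catalog_request is a hypothetical helper, not a function from stac.py.

```python
import urllib.request
from typing import Optional


def catalog_request(catalog_url: str, catalog_token: Optional[str] = None) -> urllib.request.Request:
    """Build a GET request for catalog JSON, adding a bearer token when one is given."""
    req = urllib.request.Request(catalog_url, headers={"Accept": "application/json"})
    if catalog_token:
        # Same header shape as described above: Authorization: Bearer <token>
        req.add_header("Authorization", f"Bearer {catalog_token}")
    return req


req = catalog_request(
    "https://your-app.example.org/stac/catalog.json",
    "YOUR_BEARER_TOKEN",
)
print(req.get_header("Authorization"))
```

Passing the result to urllib.request.urlopen would perform the fetch; a public catalog needs no token, in which case no Authorization header is added.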
Private S3 data
Pass S3 credentials directly to the query tool. The server injects them as a scoped DuckDB secret for the duration of that query, then destroys the connection:
{ "tool": "query", "arguments": {
"sql_query": "SELECT * FROM read_parquet('s3://my-private-bucket/data/**') LIMIT 10",
"s3_key": "YOUR_ACCESS_KEY_ID",
"s3_secret": "YOUR_SECRET_ACCESS_KEY",
"s3_endpoint": "minio.example.org"
}}
s3_endpoint defaults to s3-west.nrp-nautilus.io if omitted. SSL is enabled automatically for non-Ceph endpoints.
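The { "tool": ..., "arguments": ... } form above is shorthand. On the wire, MCP clients wrap a tool call in a JSON-RPC 2.0 tools/call request. The sketch below builds that envelope for the query tool without sending it; build_query_call is a hypothetical helper, and optional credentials are simply left out when not supplied so the server-side defaults apply.

```python
import json
from typing import Optional


def build_query_call(
    sql_query: str,
    s3_key: Optional[str] = None,
    s3_secret: Optional[str] = None,
    s3_endpoint: Optional[str] = None,
    s3_scope: Optional[str] = None,
    request_id: int = 1,
) -> dict:
    """Build a JSON-RPC 2.0 tools/call payload for the query tool."""
    arguments = {"sql_query": sql_query}
    optional = {
        "s3_key": s3_key,
        "s3_secret": s3_secret,
        "s3_endpoint": s3_endpoint,
        "s3_scope": s3_scope,
    }
    # Only include the credentials the caller actually provided.
    arguments.update({k: v for k, v in optional.items() if v is not None})
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "query", "arguments": arguments},
    }


payload = build_query_call(
    "SELECT * FROM read_parquet('s3://my-private-bucket/data/**') LIMIT 10",
    s3_key="YOUR_ACCESS_KEY_ID",
    s3_secret="YOUR_SECRET_ACCESS_KEY",
    s3_endpoint="minio.example.org",
)
print(json.dumps(payload, indent=2))
```

In practice an MCP client library builds and sends this for you; the point is that the credentials travel inside a single request and nowhere else.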
Security properties
Concern | How it is handled |
Credential bleed between clients | Each request uses a separate |
Credentials in server logs |
|
Credentials in transit | All traffic is TLS-terminated at the ingress |
Credential persistence |
|
Deploying private apps without a separate server
Rather than maintaining a forked server deployment per app, private geo-agent apps can share the public MCP server endpoint and pass their credentials per-call. This reduces idle deployments and ensures all apps benefit from server improvements automatically.
Security
Stateless Design: No persistent database or user data
Query Isolation: Each request gets a fresh DuckDB instance; client credentials cannot bleed across requests
DNS Rebinding Protection: Disabled for MCP HTTP mode
License
BSD-3-Clause License — see LICENSE.
Contributing
Contributions welcome! Key areas:
Additional dataset integrations
Query optimization patterns
STAC catalog enhancements
Documentation improvements
Support
For issues and questions:
GitHub Issues: boettiger-lab/mcp-data-server
Dataset questions: Use the browse_stac_catalog tool or browse the public STAC catalog