# ADR-0019: Parallel Processing Strategy for Large Geospatial Datasets
**Status**: Proposed (Future Implementation)
**Date**: 2025-09-30
**Deciders**: Jordan Godau
**Tags**: #performance #dask #rioxarray #scalability #future
## Context
As GDAL MCP expands beyond MVP metadata operations into complex processing workflows, **processing time becomes a critical UX challenge**. MCP interactions feel frustrating when operations take minutes instead of seconds.
**Current constraints**:
- MCP clients expect sub-minute response times for good UX
- Geospatial processing is inherently compute/IO intensive
- Single-threaded Python is insufficient for large datasets (>1GB rasters, >1M vector features)
- Users will expect cloud-native operations on COGs, S3 data, etc.
**Problem statement**: How do we enable fast processing of large geospatial datasets within MCP's interactive time constraints?
## Decision
**Adopt Dask for parallel geospatial processing** when datasets exceed interactive thresholds:
1. **Raster operations** → `dask` + `rioxarray` (xarray with rasterio backend)
2. **Vector operations** → `dask-geopandas` (partitioned GeoDataFrames)
3. **Threshold-based activation** → Use parallel path only for large datasets
4. **Local Dask client** → Start in-process for simplicity, scale to distributed later
## Strategy
### Raster Processing: dask + rioxarray
**Current (MVP)**: Rasterio loads entire raster into memory
```python
import rasterio

with rasterio.open(uri) as src:
    data = src.read()  # Entire array in RAM

stats = compute_stats(data)
```
**Future (Dask)**: Lazy evaluation with chunked processing
```python
import rioxarray as rxr
import dask.array as da
# Open lazily (no read yet)
raster = rxr.open_rasterio(uri, chunks={'x': 2048, 'y': 2048})
# Compute on chunks in parallel
stats = raster.mean().compute() # Dask scheduler parallelizes
# COG/Cloud-native support
raster = rxr.open_rasterio('s3://bucket/cog.tif', chunks='auto')
```
**Benefits**:
- ✅ Out-of-core processing (doesn't require full dataset in RAM)
- ✅ Parallel computation across chunks
- ✅ Native COG/cloud support (reads only needed tiles)
- ✅ Lazy evaluation (only compute what's needed)
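As a concrete sketch of how a future `raster.stats`-style tool could use this path, the snippet below builds one lazy task graph for several statistics and materializes it in a single pass; the function name is illustrative, not an existing API:
```python
import dask
import rioxarray as rxr

def raster_stats_dask(uri: str, chunk: int = 2048) -> dict[str, float]:
    """Whole-raster statistics computed chunk-by-chunk, never fully in RAM."""
    raster = rxr.open_rasterio(uri, chunks={"x": chunk, "y": chunk})
    # One lazy graph for all four statistics; shared read tasks mean each
    # chunk is pulled from disk (or object storage) only once
    lazy = {
        "min": raster.min(),
        "max": raster.max(),
        "mean": raster.mean(),
        "std": raster.std(),
    }
    (computed,) = dask.compute(lazy)
    return {name: float(value) for name, value in computed.items()}
```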
### Vector Processing: dask-geopandas
**Current (MVP)**: GeoPandas loads entire dataset
```python
import geopandas as gpd

gdf = gpd.read_file(uri)  # Entire dataset in RAM
gdf_buffered = gdf.buffer(100)
```
**Future (Dask)**: Partitioned parallel processing
```python
import dask_geopandas as dgpd
# Read in partitions
dgdf = dgpd.read_file(uri, npartitions=8)
# Parallel operations
dgdf_buffered = dgdf.buffer(100) # Computed per partition
result = dgdf_buffered.compute() # Gather results
```
**Benefits**:
- ✅ Handles datasets too large for RAM
- ✅ Parallel spatial operations
- ✅ Compatible with existing geopandas code
- ✅ Automatic task graph optimization
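The partitioned path also applies to data that is already in memory, which keeps the MVP geopandas code unchanged and parallelizes only the expensive step. A minimal sketch (the file name is hypothetical):
```python
import dask_geopandas as dgpd
import geopandas as gpd

gdf = gpd.read_file("parcels.gpkg")             # existing MVP code path
dgdf = dgpd.from_geopandas(gdf, npartitions=8)  # split rows into 8 partitions
buffered = dgdf.geometry.buffer(100).compute()  # buffered per partition, gathered as a GeoSeries
```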
### Threshold-Based Activation
**Not all operations need Dask** - overhead isn't worth it for small data:
| Dataset Size | Approach | Rationale |
|--------------|----------|-----------|
| **< 100 MB raster** | Pure rasterio | Fast enough, less overhead |
| **< 10K vector features** | Pure geopandas | In-memory is faster |
| **> 100 MB raster** | dask + rioxarray | Parallel chunks, out-of-core |
| **> 10K vector features** | dask-geopandas | Partitioned processing |
| **Cloud/COG** | Always dask | Lazy loading, tile-based reads |
**Implementation pattern**:
```python
async def process_raster_smart(uri: str, operation: str) -> Result:
    """Automatically choose single-threaded or parallel based on size."""
    size_mb = get_raster_size_mb(uri)
    if size_mb < 100 and not is_cloud_uri(uri):
        # Fast path: pure rasterio
        return await process_with_rasterio(uri, operation)
    else:
        # Parallel path: dask + rioxarray
        return await process_with_dask(uri, operation)
```
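`get_raster_size_mb` and `is_cloud_uri` above are placeholders rather than library functions. One possible implementation estimates uncompressed size from rasterio header metadata, so no pixel data is read just to pick a path:
```python
from urllib.parse import urlparse

import numpy as np
import rasterio

CLOUD_SCHEMES = {"s3", "gs", "az", "http", "https"}

def is_cloud_uri(uri: str) -> bool:
    """True if the URI points at remote/cloud storage rather than a local file."""
    return urlparse(uri).scheme.lower() in CLOUD_SCHEMES

def get_raster_size_mb(uri: str) -> float:
    """Estimate uncompressed in-memory size from header metadata only."""
    with rasterio.open(uri) as src:
        itemsize = np.dtype(src.dtypes[0]).itemsize  # bytes per pixel (first band)
        return src.width * src.height * src.count * itemsize / 1e6
```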
### Local Dask Client Strategy
**Start simple**: Local cluster started and managed by the server process
```python
from dask.distributed import Client, LocalCluster
# Start local cluster (per-request or singleton)
client = Client(LocalCluster(n_workers=4, threads_per_worker=2))
# Subsequent .compute() calls use this client as the default scheduler
result = dask_computation.compute()
```
**Scale later**: Distributed cluster when needed
- External Dask scheduler for multi-machine
- Kubernetes-based Dask cluster
- Cloud-native Dask (AWS Fargate, GCP Cloud Run)
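One way to manage the client inside the server process is a lazily created singleton, so scheduler startup cost is only paid when a large dataset actually arrives. A sketch (the accessor is illustrative, not an existing API):
```python
from functools import lru_cache

from dask.distributed import Client, LocalCluster

@lru_cache(maxsize=1)
def get_dask_client() -> Client:
    """Start one local cluster on first use and reuse it across tool calls."""
    cluster = LocalCluster(n_workers=4, threads_per_worker=2)
    return Client(cluster)
```
Scaling out later then becomes a configuration change: point `Client` at an external scheduler address instead of constructing a `LocalCluster`.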
## Rationale
**Why Dask over alternatives**:
| Alternative | Pros | Cons | Verdict |
|-------------|------|------|---------|
| **Multiprocessing** | ✅ Python stdlib<br>✅ Simple | ❌ Pickling overhead<br>❌ No lazy evaluation<br>❌ Hard to scale | ❌ Insufficient |
| **Ray** | ✅ Powerful<br>✅ ML/AI ecosystem | ❌ Heavy dependency<br>❌ No native geospatial<br>❌ Overkill for MCP | ❌ Too complex |
| **Dask (chosen)** | ✅ Lazy evaluation<br>✅ Native xarray/pandas<br>✅ Out-of-core<br>✅ Scales to cluster<br>✅ Geospatial ecosystem | ⚠️ Learning curve<br>⚠️ Scheduler overhead | ✅ **Best fit** |
**Why rioxarray specifically**:
- Built on xarray (N-dimensional arrays with labels)
- Uses rasterio backend (GDAL under the hood)
- Native COG and cloud support
- Integrates seamlessly with Dask
- Used by NASA, USGS, climate science community
**UX impact**:
```
Without Dask:
- Process 10GB raster → 5-10 minutes → Frustrating ❌
With Dask (8 workers):
- Process 10GB raster → 30-60 seconds → Acceptable ✅
```
## Implementation Timeline
**Phase 1 (Post-MVP)**: Proof of concept
- Add dask, rioxarray, dask-geopandas as optional dependencies
- Implement one raster tool with Dask variant (e.g., `raster.stats_large`)
- Benchmark performance gains
- Document patterns
**Phase 2**: Threshold-based smart routing
- Implement size/type detection
- Auto-select serial vs parallel path
- Add progress reporting for long operations
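What the Phase 2 progress reporting could look like for a chunked statistic, assuming a local distributed client; `report(done, total)` stands in for whatever mechanism the tool uses to forward progress to the MCP client, and nodata handling is omitted:
```python
import dask
from dask.distributed import Client, as_completed

def mean_with_progress(raster, client: Client, report) -> float:
    """Compute a chunked mean, reporting progress as each chunk finishes."""
    # One delayed task per chunk of the dask-backed DataArray
    blocks = raster.data.to_delayed().ravel()
    tasks = [dask.delayed(lambda b: (float(b.sum()), int(b.size)))(blk) for blk in blocks]
    futures = client.compute(tasks)

    total_sum, total_count = 0.0, 0
    for done, future in enumerate(as_completed(futures), start=1):
        chunk_sum, chunk_count = future.result()
        total_sum += chunk_sum
        total_count += chunk_count
        report(done, len(futures))  # placeholder: bridge to MCP progress notifications

    return total_sum / total_count
```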
**Phase 3**: Scale to distributed
- Support external Dask scheduler
- Kubernetes deployment guide
- Cloud-native examples (S3, GCS)
## Consequences
**Positive**:
- ✅ **Sub-minute processing** for multi-GB datasets
- ✅ **Cloud-native support** (COG, S3) without full downloads
- ✅ **Out-of-core** operations for datasets larger than RAM
- ✅ **Future-proof** for distributed/cluster scaling
- ✅ **Good citizen** in MCP ecosystem (responsive tools)
**Negative**:
- ⚠️ **Added complexity** - Dask has learning curve
- ⚠️ **Dependency weight** - ~50MB for dask + rioxarray
- ⚠️ **Scheduler overhead** - Not worth it for small data
- ⚠️ **Memory management** - Need to tune chunk sizes
**Neutral**:
- Optional dependencies (don't force on all users)
- Transparent to MCP clients (same tool interface)
- Can start with local client, scale to distributed later
## Alternatives Considered
**1. Stream processing (no Dask)**
- Process data in chunks manually
- ❌ Rejected: Reinventing Dask, more code
**2. External workers (separate processes)**
- Tools submit jobs to worker pool
- ❌ Rejected: More infrastructure, harder deployment
**3. Always use CLI tools (gdalwarp, ogr2ogr)**
- GDAL CLI has built-in threading
- ❌ Rejected: Loss of Python-native benefits (ADR-0017)
## Dependencies
**Add to pyproject.toml**:
```toml
[project.optional-dependencies]
parallel = [
"dask[complete]>=2024.1", # Parallel processing framework
"rioxarray>=0.15", # Xarray with rasterio backend
"dask-geopandas>=0.3", # Partitioned GeoDataFrames
"distributed>=2024.1", # Dask distributed scheduler
]
```
**Installation**:
```bash
# MVP (no parallel)
uv pip install gdal-mcp
# With parallel processing
uv pip install "gdal-mcp[parallel]"
```
## Performance Targets
| Operation | Dataset Size | Serial Time | Dask Time (8 workers) | Target |
|-----------|--------------|-------------|-----------------------|--------|
| Raster stats | 1 GB | 60s | 15s | < 30s ✅ |
| Raster reproject | 5 GB | 300s | 60s | < 90s ✅ |
| Vector buffer | 1M features | 120s | 30s | < 60s ✅ |
| COG stats (S3) | 10 GB | N/A (OOM) | 45s | < 60s ✅ |
## Related
- **ADR-0017**: Python-native over CLI - Enables this approach
- **ADR-0018**: Hybrid vector stack - GeoPandas → Dask-GeoPandas path
- **ADR-0015**: Benchmark suite - Will validate performance gains
## References
- [Dask Documentation](https://docs.dask.org)
- [rioxarray Documentation](https://corteva.github.io/rioxarray/)
- [Dask-GeoPandas Documentation](https://dask-geopandas.readthedocs.io)
- [Pangeo (Dask + Geospatial)](https://pangeo.io)