Dataproc MCP Server
A Model Context Protocol (MCP) server that provides tools for managing Google Cloud Dataproc clusters and jobs. This server enables AI assistants to interact with Dataproc resources through a standardized interface.
Features
Cluster Management
List Clusters: View all clusters in a project and region
Create Cluster: Provision new Dataproc clusters with custom configurations
Delete Cluster: Remove existing clusters
Get Cluster: Retrieve detailed information about specific clusters
Job Management
Submit Jobs: Run Spark, PySpark, Spark SQL, Hive, Pig, and Hadoop jobs
List Jobs: View jobs across clusters with filtering options
Get Job: Retrieve detailed job information and status
Cancel Job: Stop running jobs
Batch Operations
Create Batch Jobs: Submit serverless Dataproc batch jobs
List Batch Jobs: View all batch jobs in a region
Get Batch Job: Retrieve detailed batch job information
Delete Batch Job: Remove batch jobs
Installation
Prerequisites
Python 3.11 or higher (Python 3.13+ recommended)
Google Cloud SDK configured with appropriate permissions
Dataproc API enabled in your Google Cloud project
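If the Dataproc API is not yet enabled, you can turn it on with the gcloud CLI (the project ID below is a placeholder):

```bash
# Enable the Dataproc API for your project (replace YOUR_PROJECT_ID)
gcloud services enable dataproc.googleapis.com --project=YOUR_PROJECT_ID
```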
Install from Source
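A typical source install looks like the following; the clone URL and directory name are placeholders, so substitute the actual repository details:

```bash
# Clone the repository (replace <repository-url> with the actual URL)
git clone <repository-url>
cd dataproc-mcp-server

# Create and activate a virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install the server and its dependencies
pip install -e .
```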
Alternative Installation Methods
Troubleshooting Installation
If you encounter issues:
Python version errors: Ensure you have Python 3.11+ installed
python --version  # Should be 3.11 or higher
Externally managed environment errors: Use a virtual environment
python3 -m venv .venv
source .venv/bin/activate
Missing module errors: Make sure dependencies are installed
pip install -e .
Configuration
Authentication
The server supports multiple authentication methods:
Service Account Key (Recommended for production):
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"
Application Default Credentials:
gcloud auth application-default login
Compute Engine Service Account (when running on GCE)
Required Permissions
Ensure your service account or user has the following IAM roles:
roles/dataproc.editor - For cluster and job management
roles/storage.objectViewer - For accessing job files in Cloud Storage
roles/compute.networkUser - For VPC network access (if using custom networks)
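For example, a role can be granted to a service account with gcloud; the project ID and service account name below are placeholders:

```bash
# Grant the Dataproc editor role to a service account (placeholder values)
gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
  --member="serviceAccount:dataproc-mcp@YOUR_PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/dataproc.editor"
```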
Usage
Running the Server
First, activate your virtual environment (if using one):
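For example, if you created the virtual environment from the installation steps above:

```bash
source .venv/bin/activate
```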
The server supports multiple transport protocols:
Transport Configuration
STDIO (default): Standard input/output communication for command-line tools and MCP clients
HTTP: REST API over HTTP using streamable-http transport
Server URL: http://localhost:8000/mcp
Accessible via web clients and HTTP-based MCP clients
SSE: Server-Sent Events for real-time bidirectional communication
Server URL: http://localhost:8000/sse
Supports streaming responses and live updates
Environment Variables
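At minimum, the standard Google Cloud credentials variable from the Authentication section applies; other variables may be relevant depending on your setup:

```bash
# Path to a service account key file (standard Google Cloud variable)
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"
```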
MCP Client Configuration
Add to your MCP client configuration:
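A minimal sketch for a STDIO-based client entry; the server module name (dataproc_mcp_server) and launch command are assumptions, so adjust them to match your installation:

```json
{
  "mcpServers": {
    "dataproc": {
      "command": "python",
      "args": ["-m", "dataproc_mcp_server"],
      "env": {
        "GOOGLE_APPLICATION_CREDENTIALS": "/path/to/service-account.json"
      }
    }
  }
}
```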
Testing with MCP Inspector
You can test the server using the official MCP Inspector:
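One way to launch it is via npx, pointing the Inspector at the server's launch command (the module name here is an assumption):

```bash
# Launch MCP Inspector against the server over STDIO
npx @modelcontextprotocol/inspector python -m dataproc_mcp_server
```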
The MCP Inspector provides a web interface to:
Browse available tools and resources
Test tool calls with custom parameters
View real-time protocol messages
Debug server responses
Example Tool Usage
Create a Cluster
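A sketch of the arguments a create_cluster call might take, based on the parameters listed in the API Reference; values are placeholders, and additional options such as instance count and machine type can be supplied under their own parameter names:

```json
{
  "project_id": "my-project",
  "region": "us-central1",
  "cluster_name": "example-cluster"
}
```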
Submit a PySpark Job
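A sketch of a submit_job call using the parameters from the API Reference; the job_type value and Cloud Storage path are placeholders:

```json
{
  "project_id": "my-project",
  "region": "us-central1",
  "cluster_name": "example-cluster",
  "job_type": "pyspark",
  "main_file": "gs://my-bucket/jobs/word_count.py"
}
```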
Create a Batch Job
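A sketch of a create_batch_job call using the parameters from the API Reference; values are placeholders:

```json
{
  "project_id": "my-project",
  "region": "us-central1",
  "batch_id": "example-batch",
  "job_type": "pyspark",
  "main_file": "gs://my-bucket/jobs/word_count.py"
}
```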
Development
Setup Development Environment
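A minimal sketch, assuming development dependencies are published as a "dev" extra (the extra name is an assumption):

```bash
# Create an isolated environment and install in editable mode with dev tools
python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
```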
Running Tests
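Assuming the test suite uses pytest (the test runner is not confirmed here):

```bash
# Run the full test suite
python -m pytest
```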
Code Quality
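Assuming standard Python tooling such as ruff for linting and format checks (the specific tools used by this project are an assumption):

```bash
# Lint and formatting checks (tool choice is an assumption)
ruff check .
ruff format --check .
```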
Project Structure
Troubleshooting
Common Issues
Authentication Errors:
Verify GOOGLE_APPLICATION_CREDENTIALS is set correctly
Ensure service account has required permissions
Check that Dataproc API is enabled
Network Errors:
Verify VPC/subnet configurations for custom networks
Check firewall rules for cluster communication
Ensure clusters are in the correct region
Job Submission Failures:
Verify file paths in Cloud Storage are accessible
Check cluster has sufficient resources
Validate job configuration parameters
Debug Mode
Enable debug logging:
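A common pattern is a log-level environment variable; the variable name below is hypothetical, so adjust it to whatever the server actually reads:

```bash
# Hypothetical log-level variable for verbose output
export LOG_LEVEL=DEBUG
```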
API Reference
Tools
Cluster Management
list_clusters(project_id, region) - List all clusters
create_cluster(project_id, region, cluster_name, ...) - Create cluster
delete_cluster(project_id, region, cluster_name) - Delete cluster
get_cluster(project_id, region, cluster_name) - Get cluster details
Job Management
submit_job(project_id, region, cluster_name, job_type, main_file, ...) - Submit job
list_jobs(project_id, region, cluster_name?, job_states?) - List jobs
get_job(project_id, region, job_id) - Get job details
cancel_job(project_id, region, job_id) - Cancel job
Batch Operations
create_batch_job(project_id, region, batch_id, job_type, main_file, ...) - Create batch job
list_batch_jobs(project_id, region, page_size?) - List batch jobs
get_batch_job(project_id, region, batch_id) - Get batch job details
delete_batch_job(project_id, region, batch_id) - Delete batch job
Resources
dataproc://clusters - Access cluster information
dataproc://jobs - Access job information
Contributing
Fork the repository
Create a feature branch
Make your changes
Add tests for new functionality
Run the test suite and linting
Submit a pull request
License
MIT License - see LICENSE file for details.
Support
For issues and questions:
Check the troubleshooting section
Review Google Cloud Dataproc documentation
Open an issue in the repository