Dataproc MCP Server

by warrenzhu25

A Model Context Protocol (MCP) server that provides tools for managing Google Cloud Dataproc clusters and jobs. This server enables AI assistants to interact with Dataproc resources through a standardized interface.

Features

Cluster Management

  • List Clusters: View all clusters in a project and region
  • Create Cluster: Provision new Dataproc clusters with custom configurations
  • Delete Cluster: Remove existing clusters
  • Get Cluster: Retrieve detailed information about specific clusters

Job Management

  • Submit Jobs: Run Spark, PySpark, Spark SQL, Hive, Pig, and Hadoop jobs
  • List Jobs: View jobs across clusters with filtering options
  • Get Job: Retrieve detailed job information and status
  • Cancel Job: Stop running jobs

Batch Operations

  • Create Batch Jobs: Submit serverless Dataproc batch jobs
  • List Batch Jobs: View all batch jobs in a region
  • Get Batch Job: Retrieve detailed batch job information
  • Delete Batch Job: Remove batch jobs

Installation

Prerequisites

  • Python 3.11 or higher
  • Google Cloud SDK configured with appropriate permissions
  • Dataproc API enabled in your Google Cloud project

Install from Source

# Clone the repository
git clone <repository-url>
cd dataproc-mcp

# Install with uv (recommended)
uv pip install --system -e .

# Or install with pip
pip install -e .

# Install development dependencies
uv pip install --system -e ".[dev]"

Configuration

Authentication

The server supports multiple authentication methods:

  1. Service Account Key (Recommended for production):
    export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"
  2. Application Default Credentials:
    gcloud auth application-default login
  3. Compute Engine Service Account (when running on GCE)
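To check which method is active, the short sketch below (plain google-auth usage, independent of this server) prints what google.auth.default() resolves to; it follows the same resolution order as the list above:

import google.auth

# Resolves credentials in the standard order: GOOGLE_APPLICATION_CREDENTIALS,
# application default credentials, then the GCE metadata server.
credentials, project_id = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
print(f"Resolved project: {project_id}")
print(f"Credential type: {type(credentials).__name__}")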

Required Permissions

Ensure your service account or user has the following IAM roles:

  • roles/dataproc.editor - For cluster and job management
  • roles/storage.objectViewer - For accessing job files in Cloud Storage
  • roles/compute.networkUser - For VPC network access (if using custom networks)
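To confirm the roles took effect, a sketch like the one below can probe key Dataproc permissions with the google-cloud-resource-manager client (an extra dependency used only for this check; the permission names are standard Dataproc IAM permissions):

from google.cloud import resourcemanager_v3

client = resourcemanager_v3.ProjectsClient()
response = client.test_iam_permissions(
    resource="projects/my-project",
    permissions=["dataproc.clusters.list", "dataproc.jobs.create"],
)
# Only the permissions actually granted are echoed back.
print("Granted:", list(response.permissions))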

Usage

Running the Server

# Run as a Python module
python -m dataproc_mcp_server

# Run with the entry point script
dataproc-mcp-server

# Run with a custom transport (if implemented)
DATAPROC_MCP_TRANSPORT=sse python -m dataproc_mcp_server

MCP Client Configuration

Add to your MCP client configuration:

{ "mcpServers": { "dataproc": { "command": "python", "args": ["-m", "dataproc_mcp_server"], "env": { "GOOGLE_APPLICATION_CREDENTIALS": "/path/to/service-account.json" } } } }

Example Tool Usage

Create a Cluster
{ "name": "create_cluster", "arguments": { "project_id": "my-project", "region": "us-central1", "cluster_name": "my-cluster", "num_instances": 3, "machine_type": "n1-standard-4", "disk_size_gb": 100, "image_version": "2.1-debian11" } }
Submit a PySpark Job
{ "name": "submit_job", "arguments": { "project_id": "my-project", "region": "us-central1", "cluster_name": "my-cluster", "job_type": "pyspark", "main_file": "gs://my-bucket/my-script.py", "args": ["--input", "gs://my-bucket/input", "--output", "gs://my-bucket/output"], "properties": { "spark.executor.memory": "4g", "spark.executor.instances": "3" } } }
Create a Batch Job
{ "name": "create_batch_job", "arguments": { "project_id": "my-project", "region": "us-central1", "batch_id": "my-batch-job", "job_type": "pyspark", "main_file": "gs://my-bucket/batch-script.py", "service_account": "my-service-account@my-project.iam.gserviceaccount.com" } }

Development

Setup Development Environment

# Install development dependencies
uv pip install --system -e ".[dev]"

# Or with pip
pip install -e ".[dev]"

Running Tests

# Run all tests
pytest

# Run with coverage
python -m pytest --cov=src/dataproc_mcp_server tests/

# Run a specific test file
pytest tests/test_dataproc_client.py -v
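Tests should not require live GCP credentials; stubbing the Google Cloud clients keeps the suite hermetic. A minimal, hypothetical sketch of that pattern follows (the real tests in tests/ may structure this differently):

from unittest.mock import MagicMock, patch

@patch("google.cloud.dataproc_v1.ClusterControllerClient")
def test_list_clusters_returns_names(mock_client_cls):
    # Stub the API response so no network call is made.
    mock_cluster = MagicMock()
    mock_cluster.cluster_name = "my-cluster"
    mock_client_cls.return_value.list_clusters.return_value = [mock_cluster]

    client = mock_client_cls()
    clusters = client.list_clusters(
        request={"project_id": "my-project", "region": "us-central1"}
    )
    assert [c.cluster_name for c in clusters] == ["my-cluster"]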

Code Quality

# Format code
ruff format src/ tests/

# Lint code
ruff check src/ tests/

# Type checking (with VS Code + Pylance or mypy)
mypy src/

Project Structure

dataproc-mcp/
├── src/dataproc_mcp_server/
│   ├── __init__.py
│   ├── __main__.py          # Entry point
│   ├── server.py            # MCP server implementation
│   ├── dataproc_client.py   # Dataproc cluster/job operations
│   └── batch_client.py      # Dataproc batch operations
├── tests/
│   ├── __init__.py
│   ├── test_server.py
│   └── test_dataproc_client.py
├── examples/
│   ├── mcp_server_config.json
│   └── example_usage.py
├── pyproject.toml
├── CLAUDE.md                # Development guide
└── README.md

Troubleshooting

Common Issues

  1. Authentication Errors:
    • Verify GOOGLE_APPLICATION_CREDENTIALS is set correctly
    • Ensure service account has required permissions
    • Check that Dataproc API is enabled
  2. Network Errors:
    • Verify VPC/subnet configurations for custom networks
    • Check firewall rules for cluster communication
    • Ensure clusters are in the correct region
  3. Job Submission Failures:
    • Verify file paths in Cloud Storage are accessible (a quick check is sketched below)
    • Check that the cluster has sufficient resources
    • Validate job configuration parameters
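For the first point, the sketch below uses the google-cloud-storage client to confirm that the active credentials can actually read a gs:// object; the bucket and object names are placeholders:

from google.cloud import storage

# Placeholders: substitute the bucket/object from the failing job.
client = storage.Client()
blob = client.bucket("my-bucket").blob("my-script.py")

if blob.exists():
    print("Object is readable with the current credentials.")
else:
    print("Object is missing or the credentials lack storage.objects.get.")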

Debug Mode

Enable debug logging:

export PYTHONPATH=/path/to/dataproc-mcp/src
python -c "
import logging
logging.basicConfig(level=logging.DEBUG)
from dataproc_mcp_server import __main__
import asyncio
asyncio.run(__main__.main())
"

API Reference

Tools

Cluster Management
  • list_clusters(project_id, region) - List all clusters
  • create_cluster(project_id, region, cluster_name, ...) - Create cluster
  • delete_cluster(project_id, region, cluster_name) - Delete cluster
  • get_cluster(project_id, region, cluster_name) - Get cluster details
Job Management
  • submit_job(project_id, region, cluster_name, job_type, main_file, ...) - Submit job
  • list_jobs(project_id, region, cluster_name?, job_states?) - List jobs
  • get_job(project_id, region, job_id) - Get job details
  • cancel_job(project_id, region, job_id) - Cancel job
Batch Operations
  • create_batch_job(project_id, region, batch_id, job_type, main_file, ...) - Create batch job
  • list_batch_jobs(project_id, region, page_size?) - List batch jobs
  • get_batch_job(project_id, region, batch_id) - Get batch job details
  • delete_batch_job(project_id, region, batch_id) - Delete batch job

Resources

  • dataproc://clusters - Access cluster information
  • dataproc://jobs - Access job information
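Resources can be read with any MCP client. The sketch below uses the official mcp Python SDK over stdio, mirroring the client configuration shown earlier (assumes the mcp package is installed):

import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    # Launch the server as a subprocess, as an MCP client would.
    params = StdioServerParameters(command="python", args=["-m", "dataproc_mcp_server"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print("Tools:", [tool.name for tool in tools.tools])
            # Read the cluster resource exposed by this server.
            contents = await session.read_resource("dataproc://clusters")
            print(contents)

asyncio.run(main())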

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests for new functionality
  5. Run the test suite and linting
  6. Submit a pull request

License

MIT License - see LICENSE file for details.

Support

For issues and questions:

  1. Check the troubleshooting section
  2. Review Google Cloud Dataproc documentation
  3. Open an issue in the repository