# 🚀 Quick Start Guide
Get your Dataproc MCP Server up and running in under 5 minutes!
## Prerequisites
- **Node.js 18.0.0 or higher** ([Download](https://nodejs.org/))
- **Google Cloud Project** with Dataproc API enabled
- **Service account** with appropriate permissions
- **MCP Client** (Claude Desktop, Roo, or other MCP-compatible client)
### 📋 Required GCP APIs
Enable these APIs in your Google Cloud Project:
```bash
gcloud services enable dataproc.googleapis.com
gcloud services enable compute.googleapis.com
gcloud services enable storage.googleapis.com
gcloud services enable iam.googleapis.com
```
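Since `gcloud services enable` accepts multiple service names, all four can also be enabled in a single command:
```bash
# Enable all four required APIs at once
gcloud services enable dataproc.googleapis.com \
  compute.googleapis.com \
  storage.googleapis.com \
  iam.googleapis.com
```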
### 🔐 Service Account Permissions
Your service account needs these roles:
- `roles/dataproc.editor` - For cluster management
- `roles/storage.objectViewer` - For accessing job outputs
- `roles/iam.serviceAccountUser` - For impersonation (if used)
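If these roles are not yet granted, here is a sketch of the gcloud commands (substitute your own project ID and service account email for the placeholders):
```bash
# Placeholders: replace with your actual project ID and service account
PROJECT_ID=your-project-id
SA_EMAIL=dataproc-sa@${PROJECT_ID}.iam.gserviceaccount.com

# Grant each required role to the service account
for ROLE in roles/dataproc.editor roles/storage.objectViewer roles/iam.serviceAccountUser; do
  gcloud projects add-iam-policy-binding "$PROJECT_ID" \
    --member="serviceAccount:${SA_EMAIL}" \
    --role="$ROLE"
done
```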
## Installation
### Option 1: Clone and Build (Current)
```bash
git clone https://github.com/dipseth/dataproc-mcp.git
cd dataproc-mcp
npm install
npm run build
```
### Option 2: NPM Install (Future)
```bash
# When published to npm
npm install -g @dataproc/mcp-server
```
## 🛠️ Setup
### 1. Interactive Setup (Recommended)
```bash
npm run setup
```
**What this does:**
- ✅ Creates necessary directories (`config/`, `state/`, `output/`)
- ✅ Guides you through project configuration
- ✅ Sets up authentication with your service account
- ✅ Creates MCP client configuration template
- ✅ Validates your setup
**Example interaction:**
```
🚀 Dataproc MCP Server Setup
=============================

📁 Creating necessary directories...
✅ Created config/
✅ Created state/
✅ Created output/

🔧 Setting up default parameters...
Enter your GCP Project ID: my-dataproc-project
Enter your preferred region (default: us-central1): us-central1
Enter your environment name (default: production): production

🔐 Setting up authentication...
Do you want to use service account impersonation? (y/n): y
Enter the service account email to impersonate: dataproc-sa@my-project.iam.gserviceaccount.com
Enter the path to your source service account key file: /path/to/source-key.json
```
### 2. Manual Setup (Alternative)
```bash
# Create directories
mkdir -p config profiles state output
# Copy configuration templates
cp templates/default-params.json.template config/default-params.json
cp templates/server.json.template config/server.json
cp templates/mcp-settings.json.template mcp-settings.json
# Edit configurations with your details
nano config/default-params.json
nano config/server.json
```
### 3. Validate Setup
```bash
npm run validate
```
This checks:
- ✅ Directory structure
- ✅ Configuration files
- ✅ Service account credentials
- ✅ Build status
- ✅ Profile availability
## Configuration
### Default Parameters (`config/default-params.json`)
```json
{
  "defaultEnvironment": "production",
  "parameters": [
    { "name": "projectId", "type": "string", "required": true },
    { "name": "region", "type": "string", "required": true, "defaultValue": "us-central1" }
  ],
  "environments": [
    {
      "environment": "production",
      "parameters": {
        "projectId": "your-project-id",
        "region": "us-central1"
      }
    }
  ]
}
```
### Authentication (`config/server.json`)
```json
{
  "authentication": {
    "impersonateServiceAccount": "your-sa@your-project.iam.gserviceaccount.com",
    "fallbackKeyPath": "/path/to/your/service-account-key.json",
    "preferImpersonation": true,
    "useApplicationDefaultFallback": false
  }
}
```
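To sanity-check impersonation outside the server, you can mint a token with gcloud. Note that this particular check requires the `roles/iam.serviceAccountTokenCreator` role on the target account, which is separate from the roles listed above:
```bash
# Prints an access token if impersonation is correctly configured
gcloud auth print-access-token \
  --impersonate-service-account=your-sa@your-project.iam.gserviceaccount.com
```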
## Add to MCP Client
Add this configuration to your MCP client settings:
```json
{
  "dataproc-server": {
    "command": "node",
    "args": ["/path/to/dataproc-mcp-server/build/index.js"],
    "disabled": false,
    "timeout": 60,
    "alwaysAllow": ["*"],
    "env": {
      "LOG_LEVEL": "error"
    }
  }
}
```
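Where this snippet goes depends on your MCP client. For Claude Desktop, the settings file typically lives at one of these paths (verify for your client version):
```
macOS:   ~/Library/Application Support/Claude/claude_desktop_config.json
Windows: %APPDATA%\Claude\claude_desktop_config.json
```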
## Validation
Verify your setup:
```bash
npm run validate
```
## Test with MCP Inspector
```bash
npm run inspector
```
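If the `inspector` script is not available in your build, the standard MCP Inspector can usually be launched directly against the built server (a sketch; adjust the path to `build/index.js` if yours differs):
```bash
# Launch the MCP Inspector against the built server entry point
npx @modelcontextprotocol/inspector node build/index.js
```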
## First Cluster
Once configured, create your first cluster from the MCP client or Inspector:
```json
{
  "tool": "start_dataproc_cluster",
  "arguments": {
    "clusterName": "my-first-cluster"
  }
}
```
The server will automatically use your configured project ID and region!
## 🎯 Common Use Cases
### 1. Quick Data Analysis Cluster
Create a small cluster for data exploration:
```json
{
  "tool": "create_cluster_from_profile",
  "arguments": {
    "profileName": "development/small",
    "clusterName": "analysis-cluster-001"
  }
}
```
**What this creates:**
- 1 master node (n1-standard-2)
- 2 worker nodes (n1-standard-2)
- Preemptible instances for cost savings
- Standard Spark/Hadoop configuration
### 2. Production ETL Pipeline
For production workloads with high memory requirements:
```json
{
  "tool": "create_cluster_from_profile",
  "arguments": {
    "profileName": "production/high-memory/analysis",
    "clusterName": "etl-production-cluster"
  }
}
```
**Features:**
- High-memory instances
- Persistent disks
- Auto-scaling enabled
- Production-grade networking
### 3. Run Hive Query
Execute SQL queries on your data:
```json
{
  "tool": "submit_hive_query",
  "arguments": {
    "clusterName": "analysis-cluster-001",
    "query": "SELECT COUNT(*) FROM my_table WHERE date >= '2024-01-01'"
  }
}
```
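For comparison, the same query submitted directly with gcloud (a sketch; assumes the cluster name and region from the examples above):
```bash
# Submit the same Hive query directly, bypassing the MCP server
gcloud dataproc jobs submit hive \
  --cluster=analysis-cluster-001 \
  --region=us-central1 \
  -e "SELECT COUNT(*) FROM my_table WHERE date >= '2024-01-01'"
```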
### 4. Monitor Job Progress
Check the status of running jobs:
```json
{
  "tool": "get_job_status",
  "arguments": {
    "jobId": "your-job-id-here"
  }
}
```
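Any job ID can also be cross-checked against gcloud directly (assumes the region from the examples above):
```bash
# Shows full job details, including status.state
gcloud dataproc jobs describe your-job-id-here --region=us-central1
```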
### 5. Get Query Results
Retrieve results from completed queries:
```json
{
  "tool": "get_job_results",
  "arguments": {
    "jobId": "your-job-id-here",
    "maxResults": 100
  }
}
```
## Available Tools
The server provides 16 comprehensive tools:
### Cluster Management
- `start_dataproc_cluster` - Create a new cluster
- `list_clusters` - List all clusters
- `get_cluster` - Get cluster details
- `delete_cluster` - Delete a cluster
### Job Execution
- `submit_hive_query` - Run Hive queries
- `submit_dataproc_job` - Submit any Dataproc job
- `get_job_status` - Check job status
- `get_job_results` - Get job results
### Profile Management
- `create_cluster_from_profile` - Use predefined profiles
- `list_profiles` - See available profiles
- `get_profile` - Get profile details
### And more!
## 🔧 Troubleshooting
### Common Issues & Solutions
#### 1. Authentication Problems
**Error:** `Authentication failed` or `Permission denied`
**Solutions:**
```bash
# Check service account permissions
gcloud projects get-iam-policy YOUR_PROJECT_ID
# Verify API is enabled
gcloud services list --enabled | grep dataproc
# Test authentication
gcloud auth application-default login
```
**Required permissions:**
- `dataproc.clusters.create`
- `dataproc.clusters.delete`
- `dataproc.jobs.create`
- `compute.instances.create`
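To see which roles your service account actually holds, a sketch (substitute your project ID and service account email):
```bash
# List the roles bound to the service account in this project
gcloud projects get-iam-policy YOUR_PROJECT_ID \
  --flatten="bindings[].members" \
  --filter="bindings.members:dataproc-sa@YOUR_PROJECT_ID.iam.gserviceaccount.com" \
  --format="value(bindings.role)"
```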
#### 2. Profile Not Found
**Error:** `Profile 'development/small' not found`
**Solutions:**
```bash
# List available profiles
npm run validate
# Check profile directory
ls -la profiles/
# Verify profile syntax
cat profiles/development/small.yaml
```
#### 3. Cluster Creation Fails
**Error:** `Cluster creation failed` or `Quota exceeded`
**Solutions:**
```bash
# Check quotas
gcloud compute project-info describe --project=YOUR_PROJECT
# Verify region availability
gcloud compute zones list --filter="region:us-central1"
# Check firewall rules
gcloud compute firewall-rules list
```
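Per-region quota usage and limits are also visible directly:
```bash
# Quota metrics, usage, and limits appear under the "quotas" field
gcloud compute regions describe us-central1
```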
#### 4. Build Issues
**Error:** TypeScript compilation errors
**Solutions:**
```bash
# Clean and rebuild
rm -rf build/ node_modules/
npm install
npm run build
# Check Node.js version
node --version # Should be >= 18.0.0
# Update dependencies
npm update
```
#### 5. Rate Limiting
**Error:** `Rate limit exceeded`
**Solutions:**
```bash
# Wait for rate limit reset (1 minute)
# Or adjust rate limits in configuration
# Check current limits
grep -r "rate" config/
```
#### 6. Network Connectivity
**Error:** `Connection timeout` or `Network unreachable`
**Solutions:**
```bash
# Test connectivity
curl -I https://dataproc.googleapis.com/
# Check proxy settings
echo $HTTP_PROXY $HTTPS_PROXY
# Verify DNS resolution
nslookup dataproc.googleapis.com
```
### Debug Mode
Enable detailed logging for troubleshooting:
```bash
# Set debug log level
export LOG_LEVEL=debug
# Run with verbose output
npm start 2>&1 | tee debug.log
```
### Configuration Validation
Run comprehensive validation:
```bash
npm run validate
```
**What it checks:**
- ✅ Node.js version compatibility
- ✅ Required dependencies
- ✅ Directory structure
- ✅ Configuration file syntax
- ✅ Service account credentials
- ✅ Profile availability
- ✅ Build status
### Emergency Procedures
#### Stop All Clusters
List all clusters:
```json
{
  "tool": "list_clusters",
  "arguments": {}
}
```
Then delete each cluster by name:
```json
{
  "tool": "delete_cluster",
  "arguments": {
    "clusterName": "your-cluster-name"
  }
}
```
To emergency-stop all server instances:
```bash
npm run stop
```
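If the server itself is unresponsive, clusters can be deleted directly with gcloud. A sketch follows; note that it deletes every cluster in the region, so use with care:
```bash
REGION=us-central1

# Delete every Dataproc cluster in the region, without prompting
for CLUSTER in $(gcloud dataproc clusters list --region="$REGION" --format="value(clusterName)"); do
  gcloud dataproc clusters delete "$CLUSTER" --region="$REGION" --quiet
done
```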
#### Reset Configuration
```bash
# Backup current config
cp -r config/ config.backup/
# Reset to defaults
rm -rf config/
npm run setup
```
### Get Help
- 📖 [Full Documentation](README.md)
- 🔧 [Configuration Guide](docs/CONFIGURATION_GUIDE.md)
- 🔒 [Security Guide](docs/SECURITY_GUIDE.md)
- 🐛 [Report Issues](https://github.com/dipseth/dataproc-mcp/issues)
- 💬 [Community Support](docs/COMMUNITY_SUPPORT.md)
## Next Steps
- Explore the [example profiles](profiles/)
- Read the [Configuration Guide](docs/CONFIGURATION_GUIDE.md)
- Check out the Production Readiness Plan document in the project root
Happy clustering! 🎉