StreamSets MCP Server

A comprehensive Model Context Protocol (MCP) server that provides seamless integration with StreamSets Control Hub APIs, enabling complete data pipeline management and creation through conversational AI.

🚀 Features

Pipeline Management (Read Operations)

Job Management: List, start, stop, and monitor job execution
Pipeline Operations: Browse, search, and analyze pipeline configurations
Connection Management: Manage data connections and integrations
Metrics & Analytics: Comprehensive performance and usage analytics
Enterprise Integration: Deployment management, security audits, and alerts

Pipeline Building (Write Operations) 🆕

Interactive Pipeline Creation: Build pipelines through conversation
Stage Library: Access to 25+ StreamSets stages (Origins, Processors, Destinations, Executors)
Visual Flow Management: Connect stages with data and event streams
Persistent Sessions: Pipeline builders persist across conversations
Smart Validation: Automatic validation of pipeline logic and connections

📊 API Coverage

44 Tools covering 9 StreamSets Services:

Job Runner API (11 tools) - Job lifecycle management
Pipeline Repository API (7 tools) - Pipeline CRUD operations
Connection API (4 tools) - Data connection management
Provisioning API (5 tools) - Infrastructure and deployment
Notification API (2 tools) - Alert and notification management
Topology API (1 tool) - System topology information
Metrics APIs (7 tools) - Performance and usage analytics
Security API (1 tool) - Security audit trails
Pipeline Builder (6 tools) - Interactive pipeline creation

🏗️ Pipeline Builder Capabilities

Create Complete Data Pipelines

# 1. Initialize a new pipeline builder sdc_create_pipeline_builder title="My ETL Pipeline" engine_type="data_collector" # 2. Browse available stages sdc_list_available_stages category="origins" # 3. Add stages to your pipeline sdc_add_pipeline_stage pipeline_id="pipeline_builder_1" stage_label="Dev Raw Data Source" sdc_add_pipeline_stage pipeline_id="pipeline_builder_1" stage_label="Expression Evaluator" sdc_add_pipeline_stage pipeline_id="pipeline_builder_1" stage_label="Trash" # 4. Connect stages with data flows sdc_connect_pipeline_stages pipeline_id="pipeline_builder_1" source_stage_id="stage_1" target_stage_id="stage_2" sdc_connect_pipeline_stages pipeline_id="pipeline_builder_1" source_stage_id="stage_2" target_stage_id="stage_3" # 5. Visualize your pipeline flow sdc_get_pipeline_flow pipeline_id="pipeline_builder_1" # 6. Build and publish (coming soon) # sdc_build_pipeline pipeline_id="pipeline_builder_1" # sdc_publish_pipeline pipeline_id="pipeline_builder_1"

Persistent Pipeline Sessions

Cross-Conversation: Continue building pipelines across multiple conversations
Auto-Save: All changes automatically saved to disk
Session Management: List, view, and delete pipeline builder sessions
Storage Location: ~/.streamsets_mcp/pipeline_builders/

🛠️ Installation

Prerequisites

Python 3.8+
StreamSets Control Hub account with API credentials
Claude Desktop (for MCP integration)

Setup

Clone the repository
git clone https://github.com/yourusername/streamsets-mcp-server.git cd streamsets-mcp-server
Install dependencies
pip install -r requirements.txt
Configure environment variables
export STREAMSETS_HOST_PREFIX="https://your-instance.streamsets.com" export STREAMSETS_CRED_ID="your-credential-id" export STREAMSETS_CRED_TOKEN="your-auth-token"
Test the server
python streamsets_server.py

Docker Deployment

Setup for MCP Integration

# Build the image docker build -t streamsets-mcp-server . # Create persistent volume for pipeline builders docker volume create streamsets-pipeline-data

Manual Testing

# Test run with volume persistence docker run --rm -it \ -e STREAMSETS_HOST_PREFIX="https://your-instance.streamsets.com" \ -e STREAMSETS_CRED_ID="your-credential-id" \ -e STREAMSETS_CRED_TOKEN="your-auth-token" \ -v streamsets-pipeline-data:/data \ streamsets-mcp-server

Claude Desktop Integration

Option 1: Direct Python (Local Development)

{ "mcpServers": { "streamsets": { "command": "python", "args": ["/path/to/streamsets_server.py"], "env": { "STREAMSETS_HOST_PREFIX": "https://your-instance.streamsets.com", "STREAMSETS_CRED_ID": "your-credential-id", "STREAMSETS_CRED_TOKEN": "your-auth-token" } } } }

Option 2: Docker with Persistence (Production)

{ "mcpServers": { "streamsets": { "command": "docker", "args": [ "run", "--rm", "-i", "-v", "streamsets-pipeline-data:/data", "-e", "STREAMSETS_HOST_PREFIX=https://your-instance.streamsets.com", "-e", "STREAMSETS_CRED_ID=your-credential-id", "-e", "STREAMSETS_CRED_TOKEN=your-auth-token", "streamsets-mcp-server" ] } } }

📖 Usage Examples

Job Management

# List all jobs sdc_list_jobs organization="your-org" status="ACTIVE" # Get detailed job information sdc_get_job_details job_id="your-job-id" # Start/stop jobs sdc_start_job job_id="your-job-id" sdc_stop_job job_id="your-job-id" # Bulk operations sdc_start_multiple_jobs job_ids="job1,job2,job3"

Pipeline Operations

# Search pipelines sdc_search_pipelines search_query="name==ETL*" # Get pipeline details sdc_get_pipeline_details pipeline_id="your-pipeline-id" # Export/import pipelines sdc_export_pipelines commit_ids="commit1,commit2"

Metrics & Analytics

# Job performance metrics sdc_get_job_metrics job_id="your-job-id" # System health overview sdc_get_job_count_by_status # Executor infrastructure metrics sdc_get_executor_metrics executor_type="COLLECTOR" label="prod" # Security audit trails sdc_get_security_audit_metrics org_id="your-org" audit_type="login"

🔧 Configuration

Environment Variables

Required (StreamSets Authentication)

STREAMSETS_HOST_PREFIX - StreamSets Control Hub URL
STREAMSETS_CRED_ID - API Credential ID
STREAMSETS_CRED_TOKEN - Authentication Token

Optional (Pipeline Builder Persistence)

PIPELINE_STORAGE_PATH - Custom storage directory for pipeline builders

Pipeline Builder Storage

Pipeline builders are automatically persisted across conversations and container restarts:

Storage Locations (Priority Order)

Custom Path: PIPELINE_STORAGE_PATH environment variable
Docker Volume: /data/pipeline_builders (when running in Docker)
Default Path: ~/.streamsets_mcp/pipeline_builders/

Configuration Options

Format: Pickle files for session persistence
Management: Automatic file management with error handling
Fallback: Memory-only mode if no writable storage available

Docker Persistence

When using Docker, pipeline builders persist in named volumes:

# Data persists in Docker volume 'streamsets-pipeline-data' docker volume create streamsets-pipeline-data # Run with persistent volume docker run --rm -it -v streamsets-pipeline-data:/data streamsets-mcp-server

Troubleshooting

No Persistence: Check storage directory permissions
Docker Issues: Ensure volume mounts are configured correctly
Memory Mode: Server logs will indicate if persistence is disabled

📚 Documentation

API Reference: See CLAUDE.md for detailed tool documentation
Stage Library: Built-in documentation for 25+ StreamSets stages
Configuration: custom.yaml for MCP server registry
Swagger Specs: API specifications in /swagger/ directory

🧪 Development

Project Structure

streamsets-mcp-server/ ├── streamsets_server.py # Main MCP server implementation ├── custom.yaml # MCP server configuration ├── CLAUDE.md # Comprehensive documentation ├── requirements.txt # Python dependencies ├── Dockerfile # Container deployment ├── swagger/ # API specifications └── README.md # This file

Adding New Tools

Define tool function with @mcp.tool() decorator
Add comprehensive error handling and logging
Update custom.yaml with tool metadata
Document in CLAUDE.md

Testing

# Syntax validation python -m py_compile streamsets_server.py # Tool count verification grep -c "@mcp.tool()" streamsets_server.py

🤝 Contributing

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

StreamSets for the comprehensive Control Hub APIs
Anthropic for the Model Context Protocol framework
FastMCP for the Python MCP server implementation

📧 Support

For issues and questions:

Create an issue on GitHub
Check the documentation in CLAUDE.md
Review the API specifications in /swagger/

Transform your data pipeline workflows with conversational AI! 🚀