Click on "Install Server".
Wait a few minutes for the server to deploy. Once ready, it will show a "Started" state.
In the chat, type
@followed by the MCP server name and your instructions, e.g., "@Spark History MCP ServerAnalyze application_1234567890_0001 and suggest optimizations"
That's it! The server will respond to your query, and you can continue using it as needed.
# Agentic Spark Job Optimization System (Production-Ready)

## Overview

A fully agentic system that uses LLM-powered agents to analyze Spark applications and provide configuration and code optimization recommendations, driven by LLM reasoning over the collected metrics rather than hard-coded calculations.

## Features
- ✅ **MCP Server**: Exposes Spark History Server data as MCP tools
- ✅ **Specialized Agents**: 6 expert agents for different analysis domains
- ✅ **Production-Ready**: Comprehensive error handling, logging, and configuration
- ✅ **Extensive Testing**: 11+ unit tests, 7 edge-case tests, integration tests
- ✅ **Golden Test Cases**: 7 test jobs covering common performance issues
## Quick Start
### Installation
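A representative setup, assuming a standard Python checkout (the repo URL placeholder and `requirements.txt` are assumptions):

```bash
# Hypothetical setup steps; adjust the repo URL and dependency file to your checkout.
git clone <your-repo-url>
cd <repo>
pip install -r requirements.txt
cp .env.example .env   # then add your Gemini API key (see Configuration below)
```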
### Run Analysis
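A hypothetical invocation; the entry point is not documented here, so the module path and flag are assumptions (see the API Reference below):

```bash
# Hypothetical CLI call; the real entry point and flags may differ.
python -m src.optimizer --app-id application_1234567890_0001
```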
## Architecture

### Components
1. **Spark History MCP Server** (`src/server.py`)
   - Exposes 10 MCP tools for Spark metrics
   - Stateless, read-only design
2. **Optimization Engine** (`src/optimizer/engine.py`)
   - Orchestrates agent analysis
   - Gathers context from the Spark History Server
   - Builds structured optimization reports
3. **Specialized Agents** (`src/optimizer/agents.py`)
   - `ExecutionAnalysisAgent`: long-running stages, executor imbalance
   - `ShuffleSpillAgent`: shuffle efficiency, memory pressure
   - `SkewDetectionAgent`: data skew detection
   - `SQLPlanAgent`: SQL plan inefficiencies
   - `ConfigRecommendationAgent`: Spark configuration tuning
   - `CodeRecommendationAgent`: code-level optimizations
## Data Flow
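At a high level, data flows through the components above roughly as follows (a sketch inferred from the component descriptions, not generated from the code):

```text
Spark History Server (REST API)
            │
            ▼
Spark History MCP Server (10 read-only tools)
            │
            ▼
Optimization Engine (gathers context, orchestrates agents)
            │
            ▼
6 Specialized Agents (one LLM call each)
            │
            ▼
Structured optimization report
```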
## Test Coverage
### Unit Tests (11 tests)

`tests/test_optimizer.py` covers core functionality:

- Client methods with mocked HTTP
- Agent orchestration
- Metric extraction
- Spill detection
### Edge Case Tests (7 tests)

`tests/test_edge_cases.py` covers error handling:

- Empty stages
- Malformed LLM responses
- Network errors
- Missing data
- High GC overhead
### Golden Test Jobs (7 jobs)

- `examples/job_skew.py`: Data skew (90% of keys to one partition; sketched below)
- `examples/job_spill.py`: Memory spill (low memory config)
- `examples/job_cartesian.py`: Cartesian join (1M rows)
- `examples/job_broadcast_misuse.py`: Broadcast on a large table
- `examples/job_small_files.py`: Small-file explosion
- `examples/job_missing_predicate.py`: Missing predicate pushdown
- `examples/job_gc_pressure.py`: GC pressure from many objects
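For illustration, a minimal sketch of the skew job referenced above; the real `examples/job_skew.py` may differ in details, but per the description ~90% of keys land in one partition:

```python
# Hypothetical sketch of a data-skew job; the actual examples/job_skew.py may differ.
# ~90% of rows share one hot key, so a single shuffle partition does most of the work.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("job_skew_sketch").getOrCreate()

df = (
    spark.range(1_000_000)
    # ~90% of rows get key 0; the rest spread across 1000 keys.
    .withColumn(
        "key",
        F.when(F.rand() < 0.9, F.lit(0)).otherwise((F.rand() * 1000).cast("int")),
    )
)

# The groupBy shuffle routes every hot-key row to one partition, creating the skew.
df.groupBy("key").count().show()
spark.stop()
```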
### Run All Tests
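Assuming `pytest` is the test runner (the repo may also provide its own script):

```bash
# Runs the unit, edge-case, and integration suites; assumes pytest is installed.
pytest tests/ -v
```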
## Configuration

### Environment Variables

See `.env.example` for the full set of configuration options.
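The variable names below are illustrative assumptions inferred from the features above; treat `.env.example` as authoritative:

```bash
# Illustrative values only; see .env.example for the real variable names.
GEMINI_API_KEY=your-api-key                      # LLM access (required; see Limitations)
SPARK_HISTORY_SERVER_URL=http://localhost:18080  # History Server REST endpoint
LOG_LEVEL=INFO                                   # structured-logging verbosity
```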
## Production Deployment

See `docs/DEPLOYMENT.md` for a comprehensive deployment guide covering:

- Resource requirements
- Rate-limiting configuration
- Monitoring and health checks
- Scaling strategies
- Troubleshooting
## API Reference
### CLI
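A hypothetical invocation with the options spelled out; the flag names are assumptions, not a documented interface:

```bash
# Hypothetical CLI; flag names are assumptions.
python -m src.optimizer \
  --app-id application_1234567890_0001 \
  --history-server http://localhost:18080 \
  --output report.json
```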
### MCP Server
Exposes tools:
- `get_application_summary(app_id)`
- `get_jobs(app_id)`
- `get_stages(app_id)`
- `get_stage_details(app_id, stage_id)`
- `get_executors(app_id)`
- `get_sql_executions(app_id)`
- `get_sql_plan(app_id, execution_id)`
- `get_rdd_storage(app_id)`
- `get_environment(app_id)`
- `get_event_timeline(app_id)`
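As a sketch of how such a tool might be registered (using the MCP Python SDK's `FastMCP` helper; the actual `src/server.py` may be structured differently):

```python
# Hypothetical registration of one tool; the real src/server.py may differ.
import requests
from mcp.server.fastmcp import FastMCP

SPARK_HISTORY_URL = "http://localhost:18080"  # assumed History Server address
mcp = FastMCP("Spark History MCP Server")

@mcp.tool()
def get_application_summary(app_id: str) -> dict:
    """Return the top-level summary for one Spark application (read-only)."""
    resp = requests.get(f"{SPARK_HISTORY_URL}/api/v1/applications/{app_id}")
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    mcp.run()
```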
### Python API
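A hypothetical usage sketch; only the module path `src/optimizer/engine.py` is given above, so the class and method names here are assumptions:

```python
# Hypothetical API usage; the actual names in src/optimizer/engine.py may differ.
from src.optimizer.engine import OptimizationEngine  # assumed entry point

engine = OptimizationEngine(history_server_url="http://localhost:18080")
report = engine.analyze("application_1234567890_0001")  # runs all six agents
print(report)
```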
## Report Structure
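One plausible shape for the report, following the findings-plus-recommendations design described above; the field names are illustrative assumptions, not the actual schema:

```json
{
  "application_id": "application_1234567890_0001",
  "findings": [
    {
      "agent": "SkewDetectionAgent",
      "severity": "high",
      "detail": "One partition in stage 3 processes most of the shuffle data",
      "recommendation": "Salt the hot key or enable AQE skew-join handling"
    }
  ],
  "config_recommendations": {
    "spark.sql.adaptive.enabled": "true"
  },
  "code_recommendations": []
}
```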
## Key Improvements from the Initial Version

- **Metric Extraction**: Added a `quantiles` parameter to fetch task distributions (see the endpoint sketch after this list)
- **Spill Detection**: Extracts actual disk/memory spill from stage data
- **Error Handling**: Comprehensive error handling throughout
- **Logging**: Structured logging with configurable levels
- **Configuration**: Centralized config with environment-variable support
- **Testing**: 18 tests covering unit, edge-case, and integration scenarios
- **Documentation**: Deployment guide, API docs, troubleshooting
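For reference, the Spark History Server's REST API exposes per-stage task distributions through the `taskSummary` endpoint, which accepts a `quantiles` parameter; a minimal fetch sketch (the client wrapper in this repo may expose it differently):

```python
# Fetch per-task metric quantiles for one stage from the Spark History Server
# REST API; the endpoint and quantiles parameter are standard Spark, but the
# URL and IDs below are example values.
import requests

base = "http://localhost:18080/api/v1"  # assumed History Server address
app_id, stage_id, attempt = "application_1234567890_0001", 3, 0

resp = requests.get(
    f"{base}/applications/{app_id}/stages/{stage_id}/{attempt}/taskSummary",
    params={"quantiles": "0.05,0.25,0.5,0.75,0.95"},
)
resp.raise_for_status()
summary = resp.json()  # includes executorRunTime, shuffle, and spill quantiles
```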
## Performance

- **Analysis Time**: ~30-60 seconds per application
- **Memory Usage**: ~500 MB typical, ~2 GB for large apps
- **API Calls**: 6 LLM calls per analysis (one per agent)
- **Rate Limiting**: Auto-retry with exponential backoff (see the sketch below)
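A minimal sketch of the retry pattern described above (names and limits are illustrative, not the repo's actual helper):

```python
# Generic exponential-backoff retry; illustrative, not the repo's implementation.
import random
import time

def with_backoff(call, max_retries=5, base_delay=1.0):
    """Invoke call(); on failure, retry with exponentially growing, jittered waits."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise
            # Waits ~1s, 2s, 4s, ... plus up to 1s of jitter.
            time.sleep(base_delay * 2 ** attempt + random.random())
```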
## Limitations

- **Skew Detection**: Requires quantile data from the Spark History Server
- **Code Analysis**: Limited to static analysis (no execution)
- **API Dependency**: Requires Gemini API access
- **Historical Data**: Only analyzes completed applications
## Future Enhancements

- Real-time analysis of running applications
- Cost-optimization recommendations
- Regression-risk detection
- Support for Iceberg/Delta-specific rules
- Web UI for interactive analysis
- Scheduled batch analysis
## License

[Your License Here]

## Contributors

[Your Team Here]
## Support

For issues and questions:

- GitHub Issues: [Your Repo]
- Documentation: `docs/`
- Examples: `examples/`