# Golden Test Cases - Results Summary
## Test Jobs Executed
I created and ran three test jobs on the local Spark cluster to validate the Agentic Spark Optimization System (a sketch of the skew job follows the list):
1. **Skew Job** (`application_1768320005356_0008`): Intentional data skew with 90% of keys mapped to a single partition
2. **Spill Job** (`application_1768320005356_0009`): Low memory configuration to force shuffle spill
3. **Cartesian Job** (`application_1768320005356_0010`): Cross join producing 1M rows
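For reference, here is a minimal PySpark sketch of the shape of the skew job; the key distribution and row count are illustrative, not the exact code that was submitted:

```python
# Hypothetical sketch of the skew test job; column names and row counts
# are illustrative, not the exact job that was submitted.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("skew_job").getOrCreate()

# ~90% of rows share a single key, so one shuffle partition does most of the work.
df = spark.range(0, 1_000_000).withColumn(
    "key", F.when(F.rand() < 0.9, F.lit(0)).otherwise((F.rand() * 100).cast("int"))
)

# The groupBy forces a shuffle whose task-duration distribution exposes the skew;
# collect() pulls results to the driver (one of the code issues flagged below).
df.groupBy("key").agg(F.count("*").alias("cnt")).collect()
```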
## Analysis Results
### ✅ What Works Well
1. **Code Recommendations**: The system successfully identified code-level issues:
- Hardcoded partitioning in skew job
- Use of `collect()`, which can cause out-of-memory errors on the driver
- Specific line-level suggestions with rationale
2. **Configuration Recommendations**: LLM agents provided detailed Spark config tuning suggestions based on the environment and metrics
3. **End-to-End Flow**: The complete pipeline works from job submission → Spark History Server → MCP Client → Agents → Report generation
### ⚠️ Issues Identified
1. **Skew Detection Metrics**: The skew_ratio, max_duration, and median_duration are returning 0.0 instead of actual values
- **Root Cause**: The `get_stage_details` API returns summary-level data; task-level distributions (quantiles) are not included in the default Spark History Server stage response and must be requested explicitly (see the sketch after this list)
- **Impact**: Agents can still detect skew conceptually but lack precise numeric evidence
2. **Spill Metrics**: Similarly, `total_disk_spill` and `total_memory_spill` show 0 even when spill occurred
   - **Root Cause**: The stage-level metrics from SHS may be aggregated differently than expected; task-level data would confirm whether spill is recorded per task (see the sketch below)
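To illustrate the root cause, here is a hedged sketch of pulling task-level distributions directly from the Spark History Server REST API. The host, application ID, and stage ID are placeholders; the `taskSummary` endpoint takes `quantiles` as a comma-separated list of probabilities and is where per-task duration and spill distributions come from:

```python
# Sketch: fetch task-level metric distributions from the Spark History Server.
# Host, app ID, stage ID, and attempt are placeholders for this environment.
import requests

SHS = "http://localhost:18080/api/v1"
app_id, stage_id, attempt = "application_1768320005356_0008", 3, 0

# Per-task quantiles (durations, spill bytes, etc.) for one stage attempt.
resp = requests.get(
    f"{SHS}/applications/{app_id}/stages/{stage_id}/{attempt}/taskSummary",
    params={"quantiles": "0.05,0.25,0.5,0.75,0.95"},
    timeout=30,
)
resp.raise_for_status()
summary = resp.json()

# executorRunTime and diskBytesSpilled hold one value per requested quantile;
# skew_ratio could then be derived as, e.g., p95 over median run time.
print(summary.get("executorRunTime"), summary.get("diskBytesSpilled"))
```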
## Recommendations for Improvement
### High Priority
1. **Fix Metric Extraction**: Update `get_stage_details()` in `src/client.py` to request task-level metric distributions; the SHS `taskSummary` endpoint takes `quantiles` as a comma-separated list of probabilities (e.g. `?quantiles=0.05,0.25,0.5,0.75,0.95`), and on Spark 3.1+ the stage endpoint accepts `withSummaries=true`
2. **Add Metric Validation**: Log the raw data fetched from SHS so that zeroed metrics can be traced back to their source (a combined sketch for items 1 and 2 follows this list)
3. **Enhance Agent Prompts**: Update agent prompts to work with both quantitative and qualitative signals
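A possible shape for items 1 and 2, assuming the client is built on `requests`; the function name comes from this report, everything else is illustrative:

```python
# Sketch of a patched get_stage_details() that requests task-level quantiles
# and logs the raw response; assumes a requests-based client, names illustrative.
import logging
import requests

log = logging.getLogger(__name__)

def get_stage_details(base_url: str, app_id: str, stage_id: int,
                      quantiles: str = "0.05,0.25,0.5,0.75,0.95") -> dict:
    """Fetch stage data plus per-task metric distributions from the SHS."""
    url = f"{base_url}/api/v1/applications/{app_id}/stages/{stage_id}"
    # withSummaries=true (Spark 3.1+) attaches task metric distributions
    # at the requested quantiles to each stage attempt.
    resp = requests.get(
        url, params={"withSummaries": "true", "quantiles": quantiles}, timeout=30
    )
    resp.raise_for_status()
    data = resp.json()
    # Log the raw payload so zeroed metrics can be traced to the source.
    log.debug("SHS raw stage payload for %s/%s: %s", app_id, stage_id, data)
    return data
```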
### Medium Priority
4. **Add More Test Cases**: Create jobs for:
- Small file explosion
- Broadcast join misuse
- Missing AQE opportunities
5. **Improve JSON Parsing**: Some agents failed to produce valid JSON (ConfigRecommendationAgent); add schema validation on agent output (a sketch follows this list)
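For item 5, a minimal validation sketch using the `jsonschema` library; the expected shape of a config recommendation below is an assumption, not the agents' actual contract:

```python
# Sketch: validate agent output against a schema before accepting it.
# The schema below is an assumed shape, not the actual agent contract.
import json
import jsonschema

CONFIG_RECOMMENDATION_SCHEMA = {
    "type": "object",
    "required": ["parameter", "recommended_value", "rationale"],
    "properties": {
        "parameter": {"type": "string"},
        "recommended_value": {"type": "string"},
        "rationale": {"type": "string"},
    },
}

def parse_agent_output(raw: str) -> dict:
    """Parse and validate one recommendation; raise on malformed output."""
    obj = json.loads(raw)  # fails fast on invalid JSON
    jsonschema.validate(obj, CONFIG_RECOMMENDATION_SCHEMA)
    return obj
```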
### Low Priority
6. **API Key Management**: The cartesian job analysis failed because `GEMINI_API_KEY` wasn't found; ensure environment variables persist across commands (a sketch follows)
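One hedged option for item 6 is to fail fast at startup; using `python-dotenv` to load a `.env` file is an assumption about the project's setup:

```python
# Sketch: load and verify GEMINI_API_KEY before running any analysis.
# Using python-dotenv is an assumption about how the project manages env vars.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the working directory, if present

if not os.environ.get("GEMINI_API_KEY"):
    raise SystemExit("GEMINI_API_KEY is not set; aborting analysis run.")
```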
## Generated Reports
All reports are saved in `/Users/user/Documents/spark_mcp_optimizer/reports/`:
- `skew_report.json`
- `spill_report.json`
- `cartesian_report.json`
## Next Steps
1. Debug the metric extraction by adding verbose logging to `src/client.py`
2. Test task-level quantiles on the stage details endpoint (`withSummaries=true` plus a `quantiles` list, or the `taskSummary` endpoint)
3. Run the optimizer against a real production job to validate recommendation quality