# Golden Test Cases - Results Summary
## Test Jobs Executed
I created and ran three test jobs on the local Spark cluster to validate the Agentic Spark Optimization System (a sketch of the skew job follows the list):
1. **Skew Job** (`application_1768320005356_0008`): Intentional data skew with 90% of keys mapped to a single partition
2. **Spill Job** (`application_1768320005356_0009`): Low memory configuration to force shuffle spill
3. **Cartesian Job** (`application_1768320005356_0010`): Cross join producing 1M rows
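For reference, here is a minimal PySpark sketch of the shape of the skew job; the key distribution and row count are illustrative, not the exact code that was submitted:

```python
# Hypothetical sketch of the skew test job; column names and row counts
# are illustrative, not the exact job that was submitted.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("skew_job").getOrCreate()

# ~90% of rows share a single key, so one shuffle partition does most of the work.
df = spark.range(0, 1_000_000).withColumn(
    "key", F.when(F.rand() < 0.9, F.lit(0)).otherwise((F.rand() * 100).cast("int"))
)

# The groupBy forces a shuffle whose task-duration distribution exposes the skew;
# collect() pulls results to the driver (one of the code issues flagged below).
df.groupBy("key").agg(F.count("*").alias("cnt")).collect()
```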
## Analysis Results
### ✅ What Works Well
1. **Code Recommendations**: The system successfully identified code-level issues:
- Hardcoded partitioning in skew job
- Use of `collect()`, which can cause out-of-memory errors on the driver
- Specific line-level suggestions with rationale
2. **Configuration Recommendations**: LLM agents provided detailed Spark config tuning suggestions based on the environment and metrics
3. **End-to-End Flow**: The complete pipeline works from job submission → Spark History Server → MCP Client → Agents → Report generation
### ⚠️ Issues Identified
1. **Skew Detection Metrics**: The skew_ratio, max_duration, and median_duration are returning 0.0 instead of actual values
- **Root Cause**: The `get_stage_details` API returns summary-level data; task-level distributions (quantiles) are not included in the default Spark History Server stage response and must be requested explicitly (see the sketch after this list)
- **Impact**: Agents can still detect skew conceptually but lack precise numeric evidence
2. **Spill Metrics**: Similarly, `total_disk_spill` and `total_memory_spill` show 0 even when spill occurred
   - **Root Cause**: The stage-level metrics from SHS may be aggregated differently than expected; task-level data would confirm whether spill is recorded per task (see the sketch below)
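To illustrate the root cause, here is a hedged sketch of pulling task-level distributions directly from the Spark History Server REST API. The host, application ID, and stage ID are placeholders; the `taskSummary` endpoint takes `quantiles` as a comma-separated list of probabilities and is where per-task duration and spill distributions come from:

```python
# Sketch: fetch task-level metric distributions from the Spark History Server.
# Host, app ID, stage ID, and attempt are placeholders for this environment.
import requests

SHS = "http://localhost:18080/api/v1"
app_id, stage_id, attempt = "application_1768320005356_0008", 3, 0

# Per-task quantiles (durations, spill bytes, etc.) for one stage attempt.
resp = requests.get(
    f"{SHS}/applications/{app_id}/stages/{stage_id}/{attempt}/taskSummary",
    params={"quantiles": "0.05,0.25,0.5,0.75,0.95"},
    timeout=30,
)
resp.raise_for_status()
summary = resp.json()

# executorRunTime and diskBytesSpilled hold one value per requested quantile;
# skew_ratio could then be derived as, e.g., p95 over median run time.
print(summary.get("executorRunTime"), summary.get("diskBytesSpilled"))
```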
## Recommendations for Improvement
### High Priority
1. **Fix Metric Extraction**: Update `get_stage_details()` in `src/client.py` to request task-level metric distributions; the SHS `taskSummary` endpoint takes `quantiles` as a comma-separated list of probabilities (e.g. `?quantiles=0.05,0.25,0.5,0.75,0.95`), and on Spark 3.1+ the stage endpoint accepts `withSummaries=true`
2. **Add Metric Validation**: Log the raw data fetched from SHS so that zeroed metrics can be traced back to their source (a combined sketch for items 1 and 2 follows this list)
3. **Enhance Agent Prompts**: Update agent prompts to work with both quantitative and qualitative signals
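A possible shape for items 1 and 2, assuming the client is built on `requests`; the function name comes from this report, everything else is illustrative:

```python
# Sketch of a patched get_stage_details() that requests task-level quantiles
# and logs the raw response; assumes a requests-based client, names illustrative.
import logging
import requests

log = logging.getLogger(__name__)

def get_stage_details(base_url: str, app_id: str, stage_id: int,
                      quantiles: str = "0.05,0.25,0.5,0.75,0.95") -> dict:
    """Fetch stage data plus per-task metric distributions from the SHS."""
    url = f"{base_url}/api/v1/applications/{app_id}/stages/{stage_id}"
    # withSummaries=true (Spark 3.1+) attaches task metric distributions
    # at the requested quantiles to each stage attempt.
    resp = requests.get(
        url, params={"withSummaries": "true", "quantiles": quantiles}, timeout=30
    )
    resp.raise_for_status()
    data = resp.json()
    # Log the raw payload so zeroed metrics can be traced to the source.
    log.debug("SHS raw stage payload for %s/%s: %s", app_id, stage_id, data)
    return data
```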
### Medium Priority
4. **Add More Test Cases**: Create jobs for:
- Small file explosion
- Broadcast join misuse
- Missing AQE opportunities
5. **Improve JSON Parsing**: Some agents failed to produce valid JSON (ConfigRecommendationAgent); add schema validation on agent output (a sketch follows this list)
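For item 5, a minimal validation sketch using the `jsonschema` library; the expected shape of a config recommendation below is an assumption, not the agents' actual contract:

```python
# Sketch: validate agent output against a schema before accepting it.
# The schema below is an assumed shape, not the actual agent contract.
import json
import jsonschema

CONFIG_RECOMMENDATION_SCHEMA = {
    "type": "object",
    "required": ["parameter", "recommended_value", "rationale"],
    "properties": {
        "parameter": {"type": "string"},
        "recommended_value": {"type": "string"},
        "rationale": {"type": "string"},
    },
}

def parse_agent_output(raw: str) -> dict:
    """Parse and validate one recommendation; raise on malformed output."""
    obj = json.loads(raw)  # fails fast on invalid JSON
    jsonschema.validate(obj, CONFIG_RECOMMENDATION_SCHEMA)
    return obj
```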
### Low Priority
6. **API Key Management**: The cartesian job analysis failed because `GEMINI_API_KEY` wasn't found; ensure environment variables persist across commands (a sketch follows)
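One hedged option for item 6 is to fail fast at startup; using `python-dotenv` to load a `.env` file is an assumption about the project's setup:

```python
# Sketch: load and verify GEMINI_API_KEY before running any analysis.
# Using python-dotenv is an assumption about how the project manages env vars.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the working directory, if present

if not os.environ.get("GEMINI_API_KEY"):
    raise SystemExit("GEMINI_API_KEY is not set; aborting analysis run.")
```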
## Generated Reports
All reports are saved in `/Users/user/Documents/spark_mcp_optimizer/reports/`:
- `skew_report.json`
- `spill_report.json`
- `cartesian_report.json`
## Next Steps
1. Debug the metric extraction by adding verbose logging to `src/client.py`
2. Test task-level quantiles on the stage details endpoint (`withSummaries=true` plus a `quantiles` list, or the `taskSummary` endpoint)
3. Run the optimizer against a real production job to validate recommendation quality