# ToGMAL Next Steps: Adaptive Scoring & Nested CV
## Updated: 2025-10-21
This document outlines the immediate next steps to improve ToGMAL's difficulty assessment accuracy and establish a rigorous evaluation framework.
---
## 🎯 Immediate Goals (This Week)
### 1. **Implement Adaptive Uncertainty-Aware Scoring**
- **Problem**: The current naive similarity-weighted average of neighbor difficulties fails on low-similarity matches
- **Example Failure**: "Prove universe is 10,000 years old" → matched to factual recall (similarity ~0.57) → incorrectly rated LOW risk
- **Solution**: Add uncertainty penalties (see the sketch after this list) when:
- Max similarity < 0.7 (weak best match)
- High variance in k-NN similarities (diverse, unreliable matches)
- Low average similarity (all matches are weak)
- **File to modify**: `benchmark_vector_db.py::query_similar_questions()`
- **Expected improvement**: 5-15% AUROC gain on low-similarity cases
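A minimal sketch of what the penalty logic could look like, assuming the method receives the k-NN similarities and their difficulty labels. The parameter names mirror the hyperparameter grid below; the exact weighting is a starting point for tuning, not the final implementation:

```python
import numpy as np

def _compute_adaptive_difficulty(similarities: np.ndarray,
                                 difficulties: np.ndarray,
                                 similarity_threshold: float = 0.7,
                                 low_sim_penalty: float = 0.5,
                                 variance_penalty: float = 2.0,
                                 low_avg_penalty: float = 0.4) -> float:
    """Similarity-weighted difficulty with uncertainty pushed toward HIGH risk.

    similarities: cosine similarities of the k nearest benchmark questions.
    difficulties: their difficulty scores in [0, 1] (1 = hard).
    """
    weights = similarities / similarities.sum()
    score = float(np.dot(weights, difficulties))

    uncertainty = 0.0
    # Penalty 1: weak best match -- even the closest neighbor isn't close.
    if similarities.max() < similarity_threshold:
        uncertainty += low_sim_penalty * (similarity_threshold - similarities.max())
    # Penalty 2: high variance -- the neighbors disagree about this prompt.
    uncertainty += variance_penalty * float(np.var(similarities))
    # Penalty 3: low average similarity -- all matches are weak.
    if similarities.mean() < similarity_threshold:
        uncertainty += low_avg_penalty * (similarity_threshold - similarities.mean())

    # Unfamiliar territory raises the risk score instead of defaulting to LOW.
    return float(np.clip(score + uncertainty, 0.0, 1.0))
```

On the failure case above, a max similarity of ~0.57 trips both the weak-best-match and low-average penalties, pushing the score out of the LOW band instead of trusting an unreliable match.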
### 2. **Export Database for Evaluation**
- Add `get_all_questions_as_dataframe()` method (sketched below) to export all 32K questions
- Prepare for train/val/test splitting and nested CV
- **File to modify**: `benchmark_vector_db.py`
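A sketch of the export, assuming each stored record carries its text plus domain/source/difficulty metadata; `iter_questions()` and the field names are placeholders for whatever the DB actually stores:

```python
import pandas as pd

def get_all_questions_as_dataframe(self) -> pd.DataFrame:
    """Export every indexed question with the metadata needed for
    stratified splitting (domain x difficulty) and offline evaluation."""
    records = [
        {
            "question": q.text,                      # prompt text
            "domain": q.metadata["domain"],          # one of the 20 domains
            "source": q.metadata["source"],          # one of the 7 benchmarks
            "difficulty": q.metadata["difficulty"],  # label used as ground truth
        }
        for q in self.iter_questions()  # hypothetical accessor over the index
    ]
    return pd.DataFrame(records)
```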
### 3. **Test Adaptive Scoring**
- Create test script with edge cases (a sketch follows)
- Compare baseline vs. adaptive on known failure modes
- **New file**: `test_adaptive_scoring.py`
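A possible shape for that script; the `adaptive=` flag, the `risk` field, and the `BenchmarkVectorDB` class name are assumptions about the eventual interface:

```python
"""test_adaptive_scoring.py -- baseline vs. adaptive on known failure modes."""
from benchmark_vector_db import BenchmarkVectorDB  # class name assumed

EDGE_CASES = [
    # (prompt, risk band we expect adaptive scoring to reach)
    ("Prove the universe is 10,000 years old", "HIGH"),    # false premise, weak matches
    ("What is the capital of France?", "LOW"),             # easy factual recall
    ("Derive a proof of the Collatz conjecture", "HIGH"),  # unsolved problem
]

def main():
    db = BenchmarkVectorDB()
    for prompt, expected in EDGE_CASES:
        base = db.query_similar_questions(prompt, adaptive=False)
        adap = db.query_similar_questions(prompt, adaptive=True)
        print(f"{prompt!r}: baseline={base['risk']} adaptive={adap['risk']} "
              f"expected={expected}")

if __name__ == "__main__":
    main()
```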
---
## 📊 Evaluation Framework (Next 2-3 Weeks)
### Why Nested Cross-Validation?
**Problem with simple train/val/test split:**
- Single validation set can be lucky/unlucky (unrepresentative)
- Repeated "peeking" at validation during hyperparameter search causes data leakage
- Test set gives only ONE performance estimate (high variance)
**Nested CV advantages** (a loop sketch follows the hyperparameter grid below):
- **Outer loop**: 5-fold CV for unbiased generalization estimate
- **Inner loop**: 3-fold grid search for hyperparameter tuning
- **No leakage**: Test folds never seen during tuning
- **Robust**: Multiple performance estimates across 5 different test sets
### Hyperparameters to Tune
```python
param_grid = {
    'k_neighbors': [3, 5, 7, 10],
    'similarity_threshold': [0.6, 0.7, 0.8],
    'low_sim_penalty': [0.3, 0.5, 0.7],
    'variance_penalty': [1.0, 2.0, 3.0],
    'low_avg_penalty': [0.2, 0.4, 0.6]
}
```
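The nested loop itself is mechanical; a sketch using scikit-learn's splitters, where `evaluate()` stands in for building a temporary per-fold vector DB and returning AUROC on the held-out indices:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, ParameterGrid

def nested_cv(X, strata, param_grid, evaluate):
    """evaluate(train_idx, test_idx, params) -> AUROC on test_idx after
    indexing a temporary vector DB on train_idx with the given params.
    strata: combined (domain x difficulty) labels for stratification."""
    outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
    outer_scores, chosen_params = [], []

    for train_idx, test_idx in outer.split(X, strata):
        # Inner loop: grid search on the training fold only -- the outer
        # test fold is never touched during tuning, so no leakage.
        best_params, best_score = None, -np.inf
        for params in ParameterGrid(param_grid):
            scores = [evaluate(train_idx[tr], train_idx[va], params)
                      for tr, va in inner.split(X[train_idx], strata[train_idx])]
            if np.mean(scores) > best_score:
                best_score, best_params = np.mean(scores), params
        # Outer loop: one unbiased estimate per fold with the tuned params.
        outer_scores.append(evaluate(train_idx, test_idx, best_params))
        chosen_params.append(best_params)

    return np.mean(outer_scores), np.std(outer_scores), chosen_params
```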
### Evaluation Metrics
1. **AUROC** (primary): Discriminative ability (0.5=random, 1.0=perfect)
2. **FPR@TPR95**: False positive rate when catching 95% of risky prompts
3. **AUPR**: Area under precision-recall curve (good for imbalanced data)
4. **Expected Calibration Error (ECE)**: Are predicted probabilities accurate?
5. **Brier Score**: Overall probabilistic prediction accuracy
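All five reduce to scikit-learn primitives plus a small ECE helper; a sketch, assuming binary labels (risky vs. not) and predicted risk probabilities:

```python
import numpy as np
from sklearn.metrics import (average_precision_score, brier_score_loss,
                             roc_auc_score, roc_curve)

def fpr_at_tpr(y_true, y_score, target_tpr=0.95):
    """FPR at the lowest threshold that reaches the target TPR."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    return float(fpr[np.searchsorted(tpr, target_tpr)])

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Mean |observed rate - mean predicted prob| over equal-width bins,
    weighted by bin occupancy."""
    y_true, y_prob = np.asarray(y_true, float), np.asarray(y_prob, float)
    bin_ids = np.minimum((y_prob * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            ece += mask.mean() * abs(y_true[mask].mean() - y_prob[mask].mean())
    return float(ece)

def all_metrics(y_true, y_prob):
    return {
        "auroc": roc_auc_score(y_true, y_prob),
        "fpr_at_tpr95": fpr_at_tpr(y_true, y_prob),
        "aupr": average_precision_score(y_true, y_prob),
        "ece": expected_calibration_error(y_true, y_prob),
        "brier": brier_score_loss(y_true, y_prob),
    }
```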
---
## 🗂️ Implementation Phases
### Phase 1: Adaptive Scoring (This Week)
- [x] 32K vector database with 20 domains, 7 benchmark sources
- [ ] Add `_compute_adaptive_difficulty()` method
- [ ] Integrate uncertainty penalties into scoring
- [ ] Test on known failure cases
- [ ] Update `togmal_mcp.py` to use adaptive scoring
### Phase 2: Data Export & Baseline (Week 2)
- [ ] Add `get_all_questions_as_dataframe()` export method
- [ ] Create simple 70/15/15 train/val/test split (sketched after this list)
- [ ] Run current ToGMAL (baseline) on test set
- [ ] Compute baseline metrics:
- AUROC
- FPR@TPR95
- Expected Calibration Error
- Brier Score
- [ ] Document failure modes (low similarity, cross-domain, etc.)
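The split itself is two chained stratified splits; a sketch, assuming the `domain` and `difficulty` columns from the Phase 2 export and the assumed `BenchmarkVectorDB` class name:

```python
from sklearn.model_selection import train_test_split

from benchmark_vector_db import BenchmarkVectorDB  # class name assumed

db = BenchmarkVectorDB()
df = db.get_all_questions_as_dataframe()

# Stratify on (domain x difficulty) so every split sees every cell.
strata = df["domain"] + "|" + df["difficulty"].astype(str)

# 70/15/15: peel off 30%, then split that holdout in half.
train_df, holdout = train_test_split(df, test_size=0.30,
                                     stratify=strata, random_state=42)
val_df, test_df = train_test_split(holdout, test_size=0.50,
                                   stratify=strata.loc[holdout.index],
                                   random_state=42)
```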
### Phase 3: Nested CV Implementation (Week 3)
- [ ] Implement `NestedCVEvaluator` class
- [ ] Outer CV: 5-fold stratified by (domain × difficulty)
- [ ] Inner CV: 3-fold grid search over hyperparameters
- [ ] Temporary vector DB creation per fold
- [ ] Metrics computation on each outer fold
### Phase 4: Hyperparameter Tuning (Week 4)
- [ ] Run full nested CV (5 outer folds, each with a 3-fold inner grid search over the full parameter grid)
- [ ] Collect best hyperparameters per fold
- [ ] Identify most common optimal parameters
- [ ] Compute mean ± std generalization performance
- [ ] Compare to baseline
### Phase 5: Final Model & Deployment (Week 5)
- [ ] Train final model on ALL 32K questions with best hyperparameters
- [ ] Re-index full vector database
- [ ] Deploy to MCP server and HTTP facade
- [ ] Test with Claude Desktop
### Phase 6: OOD Testing (Week 6)
- [ ] Create OOD test sets:
- **Adversarial**: "Prove false premises", jailbreaks
- **Domain Shift**: Creative writing, coding, real user queries
- **Temporal**: New benchmarks (2024+)
- [ ] Evaluate on each OOD set
- [ ] Analyze performance degradation vs. in-distribution
### Phase 7: Iteration & Documentation (Week 7)
- [ ] Analyze failures on OOD sets
- [ ] Add new heuristics for missed patterns
- [ ] Re-run nested CV with updated features
- [ ] Generate calibration plots (reliability diagrams; sketch below)
- [ ] Write technical report
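The reliability diagram is a few lines of matplotlib; a sketch binning predicted risk against the observed hard-question rate:

```python
import numpy as np
import matplotlib.pyplot as plt

def reliability_diagram(y_true, y_prob, n_bins=10, path="reliability.png"):
    """Observed difficulty rate vs. mean predicted risk per probability bin;
    a well-calibrated model hugs the diagonal."""
    y_true, y_prob = np.asarray(y_true, float), np.asarray(y_prob, float)
    bin_ids = np.minimum((y_prob * n_bins).astype(int), n_bins - 1)
    xs, ys = [], []
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            xs.append(y_prob[mask].mean())
            ys.append(y_true[mask].mean())
    plt.plot([0, 1], [0, 1], "k--", label="perfect calibration")
    plt.plot(xs, ys, "o-", label="ToGMAL")
    plt.xlabel("Mean predicted risk")
    plt.ylabel("Observed hard-question rate")
    plt.legend()
    plt.savefig(path)
```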
---
## 📈 Expected Improvements
Based on OOD detection literature and nested CV best practices:
1. **Adaptive scoring**: +5-15% AUROC on low-similarity cases
- Baseline: ~0.75 AUROC (naive weighted average)
- Target: ~0.85+ AUROC (adaptive with uncertainty)
2. **Nested CV**: Honest, robust performance estimates
- Simple split: Single point estimate (could be lucky/unlucky)
- Nested CV: Mean ± std across 5 folds
3. **Domain calibration**: 10-20% absolute reduction in false positives
- Expected: FPR@TPR95 drops from ~0.25 to ~0.15
4. **Multi-signal fusion**: Better edge case detection
- Combine vector similarity + rule-based heuristics (see the fusion sketch after this list)
- Improved recall on adversarial examples
5. **Calibration**: ECE < 0.05
- Better alignment between predicted risk and actual difficulty
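One plausible fusion rule, additive so that heuristics can only raise risk; the flag names here are illustrative, not the actual heuristics in `togmal_mcp.py`:

```python
def fused_risk(vector_score: float, heuristic_flags: dict,
               flag_weight: float = 0.15) -> float:
    """Additive fusion: each fired rule raises the vector-based risk score,
    so adversarial prompts that fool the k-NN lookup still get flagged."""
    return min(1.0, vector_score + flag_weight * sum(heuristic_flags.values()))

# Illustrative flags only -- the real heuristics live in togmal_mcp.py.
flags = {"false_premise": True, "jailbreak_pattern": False, "unsolved_problem": False}
print(fused_risk(0.35, flags))  # 0.5 -- the rule lifts an otherwise-LOW score
```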
---
## ✅ Validation Checklist (Before Production Deploy)
- [ ] Nested CV completed with no data leakage
- [ ] Hyperparameters tuned on inner CV folds only
- [ ] Generalization performance estimated on outer CV folds
- [ ] OOD sets tested (adversarial, domain-shift, temporal)
- [ ] Calibration error within acceptable range (ECE < 0.1)
- [ ] Failure modes documented with specific examples
- [ ] Ablation studies show each component contributes
- [ ] Performance: adaptive > baseline on all metrics
- [ ] Real-world testing with user queries
---
## 🚀 Quick Start
See `togmal_improvement_plan.md` for full implementation details including:
- Complete code for `NestedCVEvaluator` class
- Adaptive scoring implementation
- All evaluation metrics with examples
- Detailed roadmap with weekly milestones
**Next Action**: Implement adaptive scoring in `benchmark_vector_db.py` and test with edge cases.