# ToGMAL Improvement Plan: Adaptive Scoring & Evaluation Framework

## Executive Summary

This plan addresses two critical gaps in ToGMAL's current implementation:

1. **Naive weighted averaging fails when retrieved questions have low similarity** to the prompt
2. **Lack of rigorous evaluation methodology** to measure OOD detection performance

---

## Problem 1: Low-Similarity Scoring Issues

### Current Limitation

Your system uses a simple weighted average of difficulty scores from the k-nearest neighbors, which produces unreliable risk assessments when:

- Maximum similarity < 0.6 (semantically distant matches)
- Retrieved questions span multiple unrelated domains
- The query is truly novel / out-of-distribution

**Example:** "Prove universe is 10,000 years old" matched to factual recall questions about Earth's age (similarity ~0.57), resulting in LOW risk despite being a "prove false premise" pattern.

### Solution: Adaptive Uncertainty-Aware Scoring

#### 1. Similarity-Based Confidence Adjustment

Implement a **confidence decay function** that increases risk when similarity is low:

```python
import numpy as np

def compute_adaptive_risk(similarities, difficulties, k=5):
    """Adjust the risk score based on retrieval confidence; returns a numeric score."""
    # Base weighted score
    weights = np.array(similarities) / sum(similarities)
    base_score = np.dot(weights, difficulties)

    # Confidence metrics
    max_sim = max(similarities)
    avg_sim = np.mean(similarities)
    sim_variance = np.var(similarities)

    # Uncertainty penalty - increase risk when:
    #   - Max similarity is low (< 0.7)
    #   - There is high variance in similarities (diverse matches)
    #   - Average similarity is low
    uncertainty_penalty = 0.0

    # Low maximum similarity threshold
    if max_sim < 0.7:
        uncertainty_penalty += (0.7 - max_sim) * 0.5

    # High variance (retrieved questions are dissimilar to each other)
    if sim_variance > 0.05:
        uncertainty_penalty += min(sim_variance * 2, 0.3)

    # Low average similarity
    if avg_sim < 0.5:
        uncertainty_penalty += (0.5 - avg_sim) * 0.4

    # Adjusted score (higher = more risky)
    adjusted_score = base_score + uncertainty_penalty
    return adjusted_score


def map_score_to_risk_level(adjusted_score):
    """Map the numeric risk score to a categorical risk level."""
    if adjusted_score < 0.2:
        return "MINIMAL"
    elif adjusted_score < 0.4:
        return "LOW"
    elif adjusted_score < 0.6:
        return "MODERATE"
    elif adjusted_score < 0.8:
        return "HIGH"
    else:
        return "CRITICAL"
```

**Key Insight:** Research shows that cosine similarity thresholds vary by domain and task. Values of 0.7-0.8 are commonly recommended starting points for "relevant" matches. Below 0.6, matches become increasingly unreliable.

#### 2. Multi-Signal Fusion

Combine multiple indicators beyond just k-NN similarity:

```python
def compute_risk_with_fusion(prompt, knn_results, heuristics):
    """Fuse vector similarity with rule-based heuristics."""
    # Vector-based score (from k-NN), kept numeric so it can be combined below
    vector_score = compute_adaptive_risk(
        knn_results['similarities'],
        knn_results['difficulties']
    )

    # Rule-based heuristics (existing togmal patterns)
    heuristic_score = heuristics.evaluate(prompt)

    # Domain classifier (is this math/physics/medical?)
    # classify_domain and domain_uncertainty are placeholder helpers (sketched below)
    domain_confidence = classify_domain(prompt)

    # Combine scores with learned weights
    final_score = (
        0.4 * vector_score +
        0.4 * heuristic_score +
        0.2 * domain_uncertainty(domain_confidence)
    )
    return final_score
```
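The fusion example above assumes `classify_domain` and `domain_uncertainty` helpers that are not defined anywhere in this plan. A minimal keyword-based sketch of the intended interface follows; the `DOMAIN_KEYWORDS` table and both helper implementations are illustrative assumptions (a production version would more likely reuse the embedding model):

```python
# Hypothetical helpers assumed by compute_risk_with_fusion() above.
DOMAIN_KEYWORDS = {
    'math': ['prove', 'theorem', 'integral', 'equation'],
    'physics': ['quantum', 'thermodynamics', 'particle', 'relativity'],
    'medical': ['diagnosis', 'symptom', 'dosage', 'patient'],
}

def classify_domain(prompt: str):
    """Return (best_domain, confidence in [0, 1]) for the prompt."""
    text = prompt.lower()
    best_domain, best_conf = 'general', 0.0
    for domain, keywords in DOMAIN_KEYWORDS.items():
        hits = sum(1 for kw in keywords if kw in text)
        conf = hits / len(keywords)
        if conf > best_conf:
            best_domain, best_conf = domain, conf
    return best_domain, best_conf

def domain_uncertainty(domain_confidence) -> float:
    """Low classifier confidence should push the fused risk score up."""
    _, confidence = domain_confidence
    return 1.0 - confidence
```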
#### 3. Threshold Calibration per Domain

Different domains need different thresholds. Implement **domain-specific calibration**:

```python
# Learned from validation data
DOMAIN_THRESHOLDS = {
    'math':    {'low': 0.65, 'moderate': 0.75, 'high': 0.85},
    'physics': {'low': 0.60, 'moderate': 0.70, 'high': 0.80},
    'medical': {'low': 0.70, 'moderate': 0.80, 'high': 0.90},
    'general': {'low': 0.60, 'moderate': 0.70, 'high': 0.80}
}

def get_calibrated_threshold(domain, risk_level):
    return DOMAIN_THRESHOLDS.get(domain, DOMAIN_THRESHOLDS['general'])[risk_level]
```
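One plausible way to consume this table is to swap the fixed cutoffs in `map_score_to_risk_level()` for domain-specific ones; the helper below is hypothetical and the exact semantics of the thresholds are left for the calibration work to pin down:

```python
def map_score_with_domain_calibration(adjusted_score, domain='general'):
    """Hypothetical helper: apply per-domain cutoffs instead of fixed ones."""
    if adjusted_score >= get_calibrated_threshold(domain, 'high'):
        return "CRITICAL"
    elif adjusted_score >= get_calibrated_threshold(domain, 'moderate'):
        return "HIGH"
    elif adjusted_score >= get_calibrated_threshold(domain, 'low'):
        return "MODERATE"
    else:
        return "LOW"
```

Under this reading, an adjusted score of 0.78 maps to HIGH for `math` (above its 0.75 'moderate' cutoff) but only MODERATE for `medical` (below its 0.80 cutoff).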
---

## Problem 2: Evaluation & Generalization

### Proposed Evaluation Framework: Nested Cross-Validation (Gold Standard)

#### Why Nested CV > Simple Train/Val/Test Split

**Problems with simple splits:**

- A single validation set can be unrepresentative (lucky/unlucky split)
- Repeated "peeking" at validation during hyperparameter search causes leakage
- The test set provides only ONE estimate of generalization (high variance)

**Nested CV advantages:**

- **Outer loop**: K-fold CV for an unbiased generalization estimate
- **Inner loop**: Hyperparameter search on each training fold
- **No leakage**: Test folds are never seen during tuning
- **Multiple estimates**: Robust performance across K different test sets

#### Implementation: Nested Cross-Validation

```python
from sklearn.model_selection import StratifiedKFold
import numpy as np
from typing import Dict, List, Any


class NestedCVEvaluator:
    """
    Nested cross-validation for ToGMAL hyperparameter tuning and evaluation.

    Outer CV: 5-fold stratified CV for the generalization estimate
    Inner CV: 3-fold stratified CV for hyperparameter search

    This prevents data leakage from "peeking" at validation during tuning.
    """

    def __init__(
        self,
        benchmark_data,
        outer_folds: int = 5,
        inner_folds: int = 3,
        random_state: int = 42
    ):
        self.data = benchmark_data
        self.outer_folds = outer_folds
        self.inner_folds = inner_folds
        self.random_state = random_state

        # Stratify by (domain, difficulty) to ensure balanced folds
        self.stratify_labels = (
            benchmark_data['domain'].astype(str) + '_' +
            benchmark_data['difficulty_label'].astype(str)
        )

    def run_nested_cv(
        self,
        param_grid: Dict[str, List[Any]],
        scoring_metric: str = 'roc_auc'
    ) -> Dict[str, Any]:
        """
        Run nested cross-validation.

        Args:
            param_grid: Hyperparameters to search
                (e.g., {'k': [3, 5, 7], 'threshold': [0.6, 0.7]})
            scoring_metric: Metric for optimization (roc_auc, f1, etc.)

        Returns:
            Dictionary with:
            - outer_scores: Generalization performance on each outer fold
            - best_params_per_fold: Optimal hyperparameters found in each inner CV
            - mean_test_score: Average performance across outer folds
            - std_test_score: Standard deviation (uncertainty estimate)
        """
        # Outer CV: for the generalization estimate
        outer_cv = StratifiedKFold(
            n_splits=self.outer_folds,
            shuffle=True,
            random_state=self.random_state
        )

        outer_scores = []
        best_params_per_fold = []

        print("Starting Nested Cross-Validation...")
        print(f"Outer CV: {self.outer_folds} folds")
        print(f"Inner CV: {self.inner_folds} folds")
        print(f"Param grid: {param_grid}")
        print("=" * 80)

        for fold_idx, (train_idx, test_idx) in enumerate(
            outer_cv.split(self.data, self.stratify_labels)
        ):
            print(f"\nOuter Fold {fold_idx + 1}/{self.outer_folds}")

            # Split data for this outer fold
            train_data = self.data.iloc[train_idx]
            test_data = self.data.iloc[test_idx]

            # Inner CV: hyperparameter search on training data ONLY
            inner_cv = StratifiedKFold(
                n_splits=self.inner_folds,
                shuffle=True,
                random_state=self.random_state
            )

            # Run grid search on inner folds
            best_params, best_inner_score = self._inner_grid_search(
                train_data, param_grid, inner_cv, scoring_metric
            )

            print(f"  Inner CV best params: {best_params}")
            print(f"  Inner CV best score: {best_inner_score:.4f}")

            # Build the ToGMAL vector DB with ONLY training data
            vector_db = self._build_vector_db(train_data)

            # Evaluate on the held-out test fold with the best hyperparameters
            test_score = self._evaluate_on_test_fold(
                vector_db, test_data, best_params, scoring_metric
            )

            print(f"  Outer test score: {test_score:.4f}")

            outer_scores.append(test_score)
            best_params_per_fold.append(best_params)

        # Aggregate results
        mean_score = np.mean(outer_scores)
        std_score = np.std(outer_scores)

        print("\n" + "=" * 80)
        print("Nested CV Results:")
        print(f"  Outer scores: {[f'{s:.4f}' for s in outer_scores]}")
        print(f"  Mean ± Std: {mean_score:.4f} ± {std_score:.4f}")
        print("=" * 80)

        return {
            'outer_scores': outer_scores,
            'mean_test_score': mean_score,
            'std_test_score': std_score,
            'best_params_per_fold': best_params_per_fold,
            'most_common_params': self._find_most_common_params(best_params_per_fold)
        }

    def _inner_grid_search(
        self,
        train_data,
        param_grid: Dict[str, List[Any]],
        inner_cv,
        scoring_metric: str
    ) -> tuple:
        """
        Grid search over hyperparameters using inner CV folds.
        Returns (best_params, best_score).
        """
        stratify = (
            train_data['domain'].astype(str) + '_' +
            train_data['difficulty_label'].astype(str)
        )

        best_score = -np.inf
        best_params = {}

        # Generate all parameter combinations
        from itertools import product
        param_names = list(param_grid.keys())
        param_values = list(param_grid.values())

        for param_combo in product(*param_values):
            params = dict(zip(param_names, param_combo))

            # Evaluate this parameter combination on inner folds
            fold_scores = []

            for inner_train_idx, inner_val_idx in inner_cv.split(train_data, stratify):
                inner_train = train_data.iloc[inner_train_idx]
                inner_val = train_data.iloc[inner_val_idx]

                # Build a vector DB with the inner training data
                inner_db = self._build_vector_db(inner_train)

                # Evaluate on the inner validation fold
                score = self._evaluate_on_test_fold(
                    inner_db, inner_val, params, scoring_metric
                )
                fold_scores.append(score)

            avg_score = np.mean(fold_scores)

            if avg_score > best_score:
                best_score = avg_score
                best_params = params

        return best_params, best_score

    def _build_vector_db(self, train_data):
        """Build a vector database from training data."""
        from benchmark_vector_db import BenchmarkVectorDB, BenchmarkQuestion
        from pathlib import Path
        import tempfile

        # Create a temporary DB for this fold
        temp_dir = tempfile.mkdtemp()
        db = BenchmarkVectorDB(
            db_path=Path(temp_dir) / "fold_db",
            embedding_model="all-MiniLM-L6-v2"
        )

        # Convert the dataframe to BenchmarkQuestion objects
        questions = [
            BenchmarkQuestion(
                question_id=row['question_id'],
                source_benchmark=row['source_benchmark'],
                domain=row['domain'],
                question_text=row['question_text'],
                correct_answer=row['correct_answer'],
                success_rate=row['success_rate'],
                difficulty_score=row['difficulty_score'],
                difficulty_label=row['difficulty_label']
            )
            for _, row in train_data.iterrows()
        ]

        db.index_questions(questions)
        return db

    def _evaluate_on_test_fold(
        self,
        vector_db,
        test_data,
        params: Dict[str, Any],
        metric: str
    ) -> float:
        """
        Evaluate ToGMAL on a test fold with the given hyperparameters.

        Args:
            vector_db: Vector database built from training data
            test_data: Held-out test fold
            params: Hyperparameters (e.g., k, similarity_threshold, weights)
            metric: Scoring metric (roc_auc, f1, etc.)
        """
        from sklearn.metrics import roc_auc_score, f1_score

        predictions = []
        ground_truth = []

        for _, row in test_data.iterrows():
            # Query the vector DB with the test question
            result = vector_db.query_similar_questions(
                prompt=row['question_text'],
                k=params.get('k_neighbors', 5)
            )

            # Apply adaptive scoring with the hyperparameters
            risk_score = self._compute_adaptive_risk(result, params)
            predictions.append(risk_score)

            # Ground truth: is this question hard? (success_rate < 0.5)
            ground_truth.append(1 if row['success_rate'] < 0.5 else 0)

        # Compute the metric
        if metric == 'roc_auc':
            return roc_auc_score(ground_truth, predictions)
        elif metric == 'f1':
            # Binarize predictions at a 0.5 threshold
            binary_preds = [1 if p > 0.5 else 0 for p in predictions]
            return f1_score(ground_truth, binary_preds)
        else:
            raise ValueError(f"Unknown metric: {metric}")

    def _compute_adaptive_risk(
        self,
        query_result: Dict[str, Any],
        params: Dict[str, Any]
    ) -> float:
        """
        Compute a risk score with adaptive uncertainty penalties.

        Uses hyperparameters from the inner CV search.
""" similarities = [q['similarity'] for q in query_result['similar_questions']] difficulties = [q['difficulty_score'] for q in query_result['similar_questions']] # Base weighted average weights = np.array(similarities) / sum(similarities) base_score = np.dot(weights, difficulties) # Adaptive uncertainty penalties max_sim = max(similarities) avg_sim = np.mean(similarities) sim_variance = np.var(similarities) uncertainty_penalty = 0.0 # Low similarity threshold (configurable) sim_threshold = params.get('similarity_threshold', 0.7) if max_sim < sim_threshold: uncertainty_penalty += (sim_threshold - max_sim) * params.get('low_sim_penalty', 0.5) # High variance penalty if sim_variance > 0.05: uncertainty_penalty += min(sim_variance * params.get('variance_penalty', 2.0), 0.3) # Low average similarity if avg_sim < 0.5: uncertainty_penalty += (0.5 - avg_sim) * params.get('low_avg_penalty', 0.4) # Final score adjusted_score = base_score + uncertainty_penalty return np.clip(adjusted_score, 0.0, 1.0) def _find_most_common_params(self, params_list: List[Dict]) -> Dict: """Find the most frequently selected hyperparameters across folds.""" from collections import Counter # For each parameter, find the most common value all_param_names = params_list[0].keys() most_common = {} for param_name in all_param_names: values = [p[param_name] for p in params_list] most_common[param_name] = Counter(values).most_common(1)[0][0] return most_common # Example usage if __name__ == "__main__": import pandas as pd from benchmark_vector_db import BenchmarkVectorDB # Load all benchmark questions db = BenchmarkVectorDB(db_path=Path("/Users/hetalksinmaths/togmal/data/benchmark_vector_db")) stats = db.get_statistics() # Get all questions as dataframe (you'll need to implement this) all_questions_df = db.get_all_questions_as_dataframe() # Define hyperparameter search grid param_grid = { 'k_neighbors': [3, 5, 7, 10], 'similarity_threshold': [0.6, 0.7, 0.8], 'low_sim_penalty': [0.3, 0.5, 0.7], 'variance_penalty': [1.0, 2.0, 3.0], 'low_avg_penalty': [0.2, 0.4, 0.6] } # Run nested CV evaluator = NestedCVEvaluator( benchmark_data=all_questions_df, outer_folds=5, # 5-fold outer CV inner_folds=3 # 3-fold inner CV for hyperparameter search ) results = evaluator.run_nested_cv( param_grid=param_grid, scoring_metric='roc_auc' ) print("\nFinal Results:") print(f"Generalization Performance: {results['mean_test_score']:.4f} ± {results['std_test_score']:.4f}") print(f"Most Common Best Params: {results['most_common_params']}") ``` **Key Advantages:** - **No leakage**: Each outer test fold is never seen during hyperparameter tuning - **Robust estimates**: 5 different generalization scores (not just 1) - **Automatic tuning**: Inner CV finds best hyperparameters for each fold - **Confidence intervals**: Standard deviation tells you uncertainty in performance #### Phase 2: Define Evaluation Metrics Use standard **OOD detection metrics** + **calibration metrics**: 1. **AUROC** (Area Under ROC Curve) - Threshold-independent - Measures overall discriminative ability - Gold standard for OOD detection - Interpretation: Probability that a random risky prompt is ranked higher than a random safe prompt 2. **FPR@TPR95** (False Positive Rate at 95% True Positive Rate) - How many safe prompts are incorrectly flagged when catching 95% of risky ones - Common in safety-critical applications - Lower is better (want to minimize false alarms) 3. 
3. **AUPR** (Area Under the Precision-Recall Curve)
   - Better for imbalanced datasets
   - Useful when risky prompts are rare
   - Focuses on the positive class (risky prompts)
4. **Expected Calibration Error (ECE)**
   - Are your risk probabilities accurate?
   - If you say 70% risky, is it actually risky 70% of the time?
   - Measures the gap between predicted probabilities and observed frequencies
5. **Brier Score**
   - Measures the accuracy of probabilistic predictions
   - Lower is better
   - Combines discrimination and calibration

```python
from sklearn.metrics import roc_auc_score, precision_recall_curve, auc, brier_score_loss
import numpy as np


def compute_fpr_at_tpr(y_true, y_pred_proba, tpr_threshold=0.95):
    """Compute FPR when TPR is at the specified threshold."""
    from sklearn.metrics import roc_curve

    fpr, tpr, thresholds = roc_curve(y_true, y_pred_proba)

    # Find the first index where TPR >= threshold
    idx = np.argmax(tpr >= tpr_threshold)
    return fpr[idx]


def expected_calibration_error(y_true, y_pred_proba, n_bins=10):
    """
    Compute the Expected Calibration Error (ECE).

    Bins predictions into n_bins buckets and measures the gap between
    predicted probability and observed frequency in each bin.
    """
    # Ensure arrays so boolean masking works even for plain lists
    y_true = np.asarray(y_true)
    y_pred_proba = np.asarray(y_pred_proba)

    bin_boundaries = np.linspace(0, 1, n_bins + 1)
    bin_lowers = bin_boundaries[:-1]
    bin_uppers = bin_boundaries[1:]

    ece = 0.0
    for bin_lower, bin_upper in zip(bin_lowers, bin_uppers):
        # Find predictions in this bin
        in_bin = (y_pred_proba > bin_lower) & (y_pred_proba <= bin_upper)
        prop_in_bin = in_bin.mean()

        if prop_in_bin > 0:
            # Observed frequency in this bin
            accuracy_in_bin = y_true[in_bin].mean()
            # Average predicted probability in this bin
            avg_confidence_in_bin = y_pred_proba[in_bin].mean()
            # Contribution to ECE
            ece += np.abs(avg_confidence_in_bin - accuracy_in_bin) * prop_in_bin

    return ece


def evaluate_togmal(predictions, ground_truth):
    """
    Comprehensive evaluation of ToGMAL performance.
    Args:
        predictions: Dict with 'risk_score' (continuous 0-1) and 'risk_level' (categorical)
        ground_truth: Array of difficulty scores or binary labels (0=easy, 1=hard)

    Returns:
        Dictionary with all evaluation metrics
    """
    # Convert ground truth to binary if needed (HIGH/CRITICAL = 1, else = 0)
    if hasattr(ground_truth, 'success_rate'):
        y_true = (ground_truth['success_rate'] < 0.5).astype(int)
    else:
        y_true = ground_truth

    y_pred_proba = predictions['risk_score']           # Continuous 0-1
    y_pred_binary = (y_pred_proba > 0.5).astype(int)   # Binarized

    # AUROC
    auroc = roc_auc_score(y_true, y_pred_proba)

    # FPR@TPR95
    fpr_at_95_tpr = compute_fpr_at_tpr(y_true, y_pred_proba, tpr_threshold=0.95)

    # AUPR
    precision, recall, _ = precision_recall_curve(y_true, y_pred_proba)
    aupr = auc(recall, precision)

    # Calibration error
    ece = expected_calibration_error(y_true, y_pred_proba, n_bins=10)

    # Brier score (lower is better)
    brier = brier_score_loss(y_true, y_pred_proba)

    # Standard classification metrics (for reference)
    from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

    accuracy = accuracy_score(y_true, y_pred_binary)
    f1 = f1_score(y_true, y_pred_binary)
    precision = precision_score(y_true, y_pred_binary)
    recall = recall_score(y_true, y_pred_binary)

    return {
        # Primary OOD detection metrics
        'AUROC': auroc,
        'FPR@TPR95': fpr_at_95_tpr,
        'AUPR': aupr,
        # Calibration metrics
        'ECE': ece,
        'Brier_Score': brier,
        # Standard classification (for reference)
        'Accuracy': accuracy,
        'F1': f1,
        'Precision': precision,
        'Recall': recall
    }


def print_evaluation_report(metrics: dict):
    """Pretty print evaluation metrics."""
    print("\n" + "=" * 80)
    print("ToGMAL Evaluation Report")
    print("=" * 80)

    print("\nOOD Detection Performance:")
    print(f"  AUROC:     {metrics['AUROC']:.4f} (higher is better, 0.5=random, 1.0=perfect)")
    print(f"  FPR@TPR95: {metrics['FPR@TPR95']:.4f} (lower is better, false alarm rate)")
    print(f"  AUPR:      {metrics['AUPR']:.4f} (higher is better)")

    print("\nCalibration:")
    print(f"  ECE:         {metrics['ECE']:.4f} (lower is better, 0=perfect calibration)")
    print(f"  Brier Score: {metrics['Brier_Score']:.4f} (lower is better)")

    print("\nClassification Metrics (for reference):")
    print(f"  Accuracy:  {metrics['Accuracy']:.4f}")
    print(f"  F1 Score:  {metrics['F1']:.4f}")
    print(f"  Precision: {metrics['Precision']:.4f}")
    print(f"  Recall:    {metrics['Recall']:.4f}")

    print("\n" + "=" * 80)
```

#### Phase 3: Out-of-Distribution Testing

**Critical:** Test on data that is truly OOD relative to your training benchmarks.

**OOD Test Sets to Create:**

1. **Temporal OOD**: New benchmark questions released after your training data cutoff
2. **Domain Shift**: Categories not in MMLU (e.g., creative writing prompts, coding challenges)
3. **Adversarial**: Hand-crafted examples designed to fool the system
   - "Prove [false scientific claim]"
   - Jailbreak attempts disguised as innocent questions
   - Edge cases from your taxonomy submissions

```python
ood_test_sets = {
    'adversarial_false_premises': load_false_premise_examples(),
    'jailbreaks': load_jailbreak_attempts(),
    'creative_writing': load_writing_prompts(),
    'recent_benchmarks': load_benchmarks_after('2024-01'),
    'user_submissions': load_taxonomy_entries()
}

# Evaluate on each OOD set
for name, test_data in ood_test_sets.items():
    metrics = evaluate_togmal(model.predict(test_data), test_data.labels)
    print(f"{name}: AUROC={metrics['AUROC']:.3f}, FPR@95={metrics['FPR@TPR95']:.3f}")
```
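The loaders above (`load_false_premise_examples()` and friends) are placeholders. A minimal sketch of one of them is shown below; the `OODTestSet` container, its fields, and all prompts except the 10,000-year-old-universe example (taken from earlier in this plan) are illustrative assumptions:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class OODTestSet:
    """Illustrative container matching the test_data.labels usage above."""
    prompts: List[str]
    labels: List[int]   # 1 = risky/hard, 0 = safe/easy

def load_false_premise_examples() -> OODTestSet:
    """Hand-crafted 'prove a false premise' prompts, all labeled risky."""
    prompts = [
        "Prove that the universe is exactly 10,000 years old using thermodynamics",
        "Prove that the Earth is flat using airline flight times",
        "Prove that perpetual motion machines are possible using classical mechanics",
    ]
    return OODTestSet(prompts=prompts, labels=[1] * len(prompts))
```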
#### Phase 4: Hyperparameter Tuning Protocol

**Use the validation set ONLY** - never touch the test set until the final evaluation.

```python
# Illustrative protocol: grid_search_cv, train_togmal, and the dataset objects
# are stand-ins for the project's own tuning and training entry points.

# Parameters to tune
param_grid = {
    'similarity_threshold': [0.5, 0.6, 0.7, 0.8],
    'k_neighbors': [3, 5, 7, 10],
    'uncertainty_penalty_weight': [0.2, 0.4, 0.6],
    'heuristic_weight': [0.3, 0.4, 0.5],
    'vector_weight': [0.3, 0.4, 0.5]
}

# Cross-validation on the validation set
best_params = grid_search_cv(
    togmal_model,
    param_grid,
    val_set,
    metric='AUROC',
    cv=5  # 5-fold CV within the validation set
)

# Train the final model with the best params on train + val
final_model = train_togmal(
    train_set + val_set,
    params=best_params
)

# Evaluate ONCE on the test set
final_metrics = evaluate_togmal(
    final_model.predict(test_set),
    test_set.labels
)
```

---

## Implementation Roadmap

### Phase 1: Adaptive Scoring Implementation (Week 1-2)

- [x] Implement basic vector database with 32K questions
- [ ] Add adaptive uncertainty-aware scoring function
  - [ ] Similarity threshold penalties
  - [ ] Variance penalties for diverse matches
  - [ ] Low average similarity penalties
- [ ] Implement domain-specific threshold calibration
- [ ] Add multi-signal fusion (vector + heuristics)
- [ ] Integrate into `benchmark_vector_db.py::query_similar_questions()`

### Phase 2: Data Export & Preparation (Week 2)

- [ ] Export all 32K questions from ChromaDB to a pandas DataFrame
  - [ ] Add `BenchmarkVectorDB.get_all_questions_as_dataframe()` method
  - [ ] Include all metadata (domain, difficulty, success_rate, etc.)
- [ ] Verify stratification labels (domain × difficulty)
- [ ] Create initial train/val/test split (simple 70/15/15) for a baseline
- [ ] Document dataset statistics per split

### Phase 3: Nested CV Framework (Week 3)

- [ ] Implement `NestedCVEvaluator` class
  - [ ] Outer CV loop (5-fold stratified)
  - [ ] Inner CV loop (3-fold grid search)
  - [ ] Temporary vector DB creation per fold
- [ ] Define hyperparameter search grid
  - `k_neighbors`: [3, 5, 7, 10]
  - `similarity_threshold`: [0.6, 0.7, 0.8]
  - `low_sim_penalty`: [0.3, 0.5, 0.7]
  - `variance_penalty`: [1.0, 2.0, 3.0]
  - `low_avg_penalty`: [0.2, 0.4, 0.6]
- [ ] Implement evaluation metrics (AUROC, FPR@TPR95, ECE)

### Phase 4: Baseline Evaluation (Week 3-4)

- [ ] Run the current ToGMAL (naive weighted average) on the simple split
- [ ] Compute baseline metrics:
  - [ ] AUROC on the test set
  - [ ] FPR@TPR95
  - [ ] Expected Calibration Error
  - [ ] Brier Score
- [ ] Analyze failure modes:
  - [ ] Low similarity cases (max_sim < 0.6)
  - [ ] High variance matches
  - [ ] Cross-domain queries
- [ ] Document baseline performance for comparison

### Phase 5: Nested CV Hyperparameter Tuning (Week 4-5)

- [ ] Run full nested CV (5 outer × 3 inner = 15 train-test runs)
- [ ] Track computational cost (time per fold)
- [ ] Collect best hyperparameters per outer fold
- [ ] Identify the most common optimal parameters
- [ ] Compute mean ± std generalization performance

### Phase 6: Final Model Training (Week 5)

- [ ] Train the final model on ALL 32K questions with the best hyperparameters
- [ ] Re-index the full vector database
- [ ] Update `togmal_mcp.py` to use adaptive scoring
- [ ] Deploy to the MCP server and HTTP facade

### Phase 7: OOD Testing (Week 6)

- [ ] Create OOD test sets:
  - [ ] **Adversarial**: Hand-crafted edge cases
    - "Prove [false scientific claim]"
    - Jailbreak attempts disguised as questions
    - Taxonomy submissions from users
  - [ ] **Domain Shift**: Categories not in MMLU
    - Creative writing prompts
    - Code generation tasks
    - Real-world user queries
  - [ ] **Temporal OOD**: New benchmarks (2024+)
    - SimpleQA (if available)
    - Latest MMLU updates
- [ ] Evaluate on each OOD set
- [ ] Analyze degradation vs. in-distribution performance

### Phase 8: Iteration & Documentation (Week 7)

- [ ] Analyze failures on OOD sets
- [ ] Add new heuristics for missed patterns
- [ ] Re-run nested CV with updated features
- [ ] Generate calibration plots (reliability diagrams)
- [ ] Write technical report:
  - [ ] Methodology (nested CV protocol)
  - [ ] Results (baseline vs. adaptive)
  - [ ] Ablation studies (each penalty component)
  - [ ] OOD generalization analysis
  - [ ] Failure mode documentation

---

## Expected Improvements

Based on OOD detection literature and nested CV best practices:

1. **Adaptive scoring** should improve AUROC by 5-15% on low-similarity cases
   - Baseline: ~0.75 AUROC (naive weighted average)
   - Target: ~0.85+ AUROC (adaptive with uncertainty)
2. **Nested CV** will give honest performance estimates
   - Simple train/test: single point estimate (could be lucky/unlucky)
   - Nested CV: mean ± std across 5 folds (robust estimate)
3. **Domain calibration** should reduce false positives by 10-20%
   - Expected: FPR@TPR95 drops from ~0.25 to ~0.15
4. **Multi-signal fusion** should catch edge cases like "prove false premise"
   - Combine vector similarity + rule-based heuristics
   - Expected: improved recall on adversarial examples
5. **Calibration improvements**
   - Expected Calibration Error (ECE) < 0.05
   - Better alignment between predicted risk and actual difficulty

---

## Validation Checklist

Before deploying to production:

- ✓ Nested CV completed with no data leakage
- ✓ Hyperparameters tuned on inner CV folds only
- ✓ Generalization performance estimated on outer CV folds
- ✓ OOD sets tested (adversarial, domain-shift, temporal)
- ✓ Calibration error measured and within acceptable range (ECE < 0.1)
- ✓ Failure modes documented with specific examples
- ✓ Ablation studies show each component contributes positively
- ✓ Performance comparison: adaptive > baseline on all metrics
- ✓ Real-world testing with user queries from taxonomy submissions

---

## Key References

1. **Similarity Thresholds**: Cosine similarity of 0.7-0.8 is recommended as a starting point for "relevant" matches; lower values are increasingly unreliable
2. **OOD Metrics**: AUROC and FPR@TPR95 are standard; conformal prediction provides probabilistic guarantees
3. **Adaptive Methods**: Uncertainty-aware thresholds outperform fixed thresholds in retrieval tasks
4. **Holdout Validation**: 60-20-20 or 70-15-15 splits are common; stratification by domain/difficulty is essential
5. **Calibration**: Expected Calibration Error (ECE) measures whether predicted probabilities match observed frequencies
6. **Nested CV**: Gold standard for hyperparameter tuning; prevents leakage from repeated validation peeking
7. **Stratified K-Fold**: Maintains class distribution across folds; essential for imbalanced datasets

---

## Quick Start: Immediate Implementation

### Step 1: Add Adaptive Scoring to `benchmark_vector_db.py` (Today)

Replace the naive weighted average in `query_similar_questions()` with adaptive uncertainty-aware scoring:

```python
def query_similar_questions(
    self,
    prompt: str,
    k: int = 5,
    domain_filter: Optional[str] = None,
    # NEW: Adaptive scoring parameters
    similarity_threshold: float = 0.7,
    low_sim_penalty: float = 0.5,
    variance_penalty: float = 2.0,
    low_avg_penalty: float = 0.4
) -> Dict[str, Any]:
    """Find the k most similar benchmark questions with adaptive uncertainty penalties."""

    # ... existing code to query ChromaDB ...
    # Extract similarities and difficulty scores
    similarities = []
    difficulty_scores = []
    success_rates = []

    for i in range(len(results['ids'][0])):
        metadata = results['metadatas'][0][i]
        distance = results['distances'][0][i]

        # Convert L2 distance to cosine similarity
        # (for unit-normalized embeddings, distance^2 = 2 * (1 - cos_sim))
        similarity = max(0, 1 - (distance ** 2) / 2)

        similarities.append(similarity)
        difficulty_scores.append(metadata['difficulty_score'])
        success_rates.append(metadata['success_rate'])

    # IMPROVED: Adaptive uncertainty-aware scoring
    weighted_difficulty = self._compute_adaptive_difficulty(
        similarities=similarities,
        difficulty_scores=difficulty_scores,
        similarity_threshold=similarity_threshold,
        low_sim_penalty=low_sim_penalty,
        variance_penalty=variance_penalty,
        low_avg_penalty=low_avg_penalty
    )

    # ... rest of existing code ...


def _compute_adaptive_difficulty(
    self,
    similarities: List[float],
    difficulty_scores: List[float],
    similarity_threshold: float = 0.7,
    low_sim_penalty: float = 0.5,
    variance_penalty: float = 2.0,
    low_avg_penalty: float = 0.4
) -> float:
    """
    Compute a difficulty score with adaptive uncertainty penalties.

    Key insight: when retrieved questions have low similarity to the prompt,
    we should INCREASE the risk estimate because we are extrapolating.

    Args:
        similarities: Cosine similarities of the k-NN results
        difficulty_scores: Difficulty scores (1 - success_rate) of the k-NN results
        similarity_threshold: Below this, apply the low similarity penalty (default: 0.7)
        low_sim_penalty: Weight for the low similarity penalty (default: 0.5)
        variance_penalty: Weight for the high variance penalty (default: 2.0)
        low_avg_penalty: Weight for the low average similarity penalty (default: 0.4)

    Returns:
        Adjusted difficulty score (0.0 to 1.0, higher = more risky)
    """
    import numpy as np

    # Base weighted average (original approach)
    weights = np.array(similarities) / sum(similarities)
    base_score = np.dot(weights, difficulty_scores)

    # Compute uncertainty indicators
    max_sim = max(similarities)
    avg_sim = np.mean(similarities)
    sim_variance = np.var(similarities)

    # Initialize the uncertainty penalty
    uncertainty_penalty = 0.0

    # Penalty 1: Low maximum similarity
    # If the best match is weak, we are likely OOD
    if max_sim < similarity_threshold:
        penalty = (similarity_threshold - max_sim) * low_sim_penalty
        uncertainty_penalty += penalty
        logger.debug(f"Low max similarity penalty: {penalty:.3f} (max_sim={max_sim:.3f})")

    # Penalty 2: High variance in similarities
    # If the k-NN results are very dissimilar to each other, the matches are unreliable
    variance_threshold = 0.05
    if sim_variance > variance_threshold:
        penalty = min(sim_variance * variance_penalty, 0.3)  # Cap at 0.3
        uncertainty_penalty += penalty
        logger.debug(f"High variance penalty: {penalty:.3f} (variance={sim_variance:.3f})")

    # Penalty 3: Low average similarity
    # If ALL matches are weak, we are definitely OOD
    avg_threshold = 0.5
    if avg_sim < avg_threshold:
        penalty = (avg_threshold - avg_sim) * low_avg_penalty
        uncertainty_penalty += penalty
        logger.debug(f"Low avg similarity penalty: {penalty:.3f} (avg_sim={avg_sim:.3f})")

    # Final adjusted score
    adjusted_score = base_score + uncertainty_penalty

    # Clip to the [0, 1] range
    adjusted_score = np.clip(adjusted_score, 0.0, 1.0)

    logger.info(
        f"Adaptive scoring: base={base_score:.3f}, penalty={uncertainty_penalty:.3f}, "
        f"adjusted={adjusted_score:.3f}"
    )

    return adjusted_score
```

**Why this helps:**

- **"Prove universe is 10,000 years old" example**: max_sim=0.57 triggers the low similarity penalty → risk increases from MODERATE to HIGH (see the worked numbers below)
- **Unrelated k-NN matches**: high variance → additional penalty → correctly flags the result as uncertain
- **Novel domains**: low average similarity across all matches → strong penalty → CRITICAL risk
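To make the first bullet concrete, here is the arithmetic with the default penalty weights. Only the max similarity of ~0.57 comes from the example earlier in this plan; the base score and the four lower similarities are assumed for illustration, and the risk bands are the MINIMAL/LOW/MODERATE/HIGH/CRITICAL cutoffs from the Problem 1 sketch:

```python
import numpy as np

# Assumed k-NN similarities: only the 0.57 maximum is from the plan's example.
similarities = [0.57, 0.52, 0.48, 0.45, 0.41]
base_score = 0.55                          # assumed weighted difficulty (MODERATE band)

max_sim = max(similarities)                # 0.57
avg_sim = float(np.mean(similarities))     # 0.486
sim_var = float(np.var(similarities))      # ~0.003, below the 0.05 cutoff -> no variance penalty

penalty = (0.7 - max_sim) * 0.5            # 0.065  (low maximum similarity)
penalty += (0.5 - avg_sim) * 0.4           # +0.006 (low average similarity)

adjusted = base_score + penalty            # ~0.62 -> HIGH band (0.6-0.8) vs. MODERATE before
```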
### Step 2: Export Database for Evaluation (This Week)

Add a method to export all questions as a DataFrame for nested CV:

```python
def get_all_questions_as_dataframe(self) -> 'pd.DataFrame':
    """
    Export all questions from ChromaDB as a pandas DataFrame.

    Used for train/val/test splitting and nested CV.

    Returns:
        DataFrame with columns:
        - question_id, source_benchmark, domain, question_text,
        - correct_answer, success_rate, difficulty_score, difficulty_label
    """
    import pandas as pd

    count = self.collection.count()
    logger.info(f"Exporting {count} questions from vector database...")

    # Get all questions from ChromaDB
    all_data = self.collection.get(
        limit=count,
        include=["metadatas", "documents"]
    )

    # Convert to a DataFrame
    rows = []
    for i, qid in enumerate(all_data['ids']):
        metadata = all_data['metadatas'][i]
        rows.append({
            'question_id': qid,
            'question_text': all_data['documents'][i],
            'source_benchmark': metadata['source'],
            'domain': metadata['domain'],
            'success_rate': metadata['success_rate'],
            'difficulty_score': metadata['difficulty_score'],
            'difficulty_label': metadata['difficulty_label'],
            'num_models_tested': metadata.get('num_models', 0)
        })

    df = pd.DataFrame(rows)
    logger.info(f"Exported {len(df)} questions to DataFrame")
    logger.info(f"  Domains: {df['domain'].nunique()}")
    logger.info(f"  Sources: {df['source_benchmark'].nunique()}")

    return df
```

### Step 3: Test Adaptive Scoring Immediately

Create a test script to compare baseline vs. adaptive scoring:

```python
#!/usr/bin/env python3
"""Test adaptive scoring improvements."""

from benchmark_vector_db import BenchmarkVectorDB
from pathlib import Path

# Initialize the database
db = BenchmarkVectorDB(
    db_path=Path("/Users/hetalksinmaths/togmal/data/benchmark_vector_db")
)

# Test cases that should trigger uncertainty penalties
test_cases = [
    # Low similarity - should get a penalty
    "Prove that the universe is exactly 10,000 years old using thermodynamics",
    # Novel domain - should get a penalty
    "Write a haiku about quantum entanglement in 17th century Japanese",
    # Should match well - no penalty
    "What is the capital of France?",
    # Should match GPQA physics - no penalty
    "Calculate the quantum correction to the partition function for a 3D harmonic oscillator"
]

print("=" * 80)
print("Adaptive Scoring Test")
print("=" * 80)

for prompt in test_cases:
    print(f"\nPrompt: {prompt[:100]}...")
    result = db.query_similar_questions(prompt, k=5)

    print(f"  Max Similarity: {max(q['similarity'] for q in result['similar_questions']):.3f}")
    print(f"  Avg Similarity: {result['avg_similarity']:.3f}")
    print(f"  Weighted Difficulty: {result['weighted_difficulty_score']:.3f}")
    print(f"  Risk Level: {result['risk_level']}")
    print(f"  Top Match: {result['similar_questions'][0]['domain']} - {result['similar_questions'][0]['source']}")
```

---

## Next Steps

1. **Immediate**: Implement the train/val/test split of the benchmark data (see the sketch below)
2. **This week**: Add similarity-based uncertainty penalties
3. **Next week**: Run validation experiments with different thresholds
4. **End of month**: Complete evaluation on the test set + OOD sets
5. **Ongoing**: Build an adversarial test set from user submissions
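A minimal sketch of that first step, assuming the DataFrame produced by `get_all_questions_as_dataframe()` above; the helper name `make_simple_split` and the use of sklearn's `train_test_split` are illustrative, but the 70/15/15 proportions and the (domain, difficulty) stratification come from the roadmap:

```python
from sklearn.model_selection import train_test_split

def make_simple_split(df, seed=42):
    """Stratified 70/15/15 train/val/test split by (domain, difficulty)."""
    strata = df['domain'].astype(str) + '_' + df['difficulty_label'].astype(str)

    # First carve off 70% for training
    train_df, rest_df = train_test_split(
        df, test_size=0.30, stratify=strata, random_state=seed
    )

    # Split the remaining 30% evenly into validation and test (15% each overall)
    rest_strata = rest_df['domain'].astype(str) + '_' + rest_df['difficulty_label'].astype(str)
    val_df, test_df = train_test_split(
        rest_df, test_size=0.50, stratify=rest_strata, random_state=seed
    )
    return train_df, val_df, test_df
```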
