# ToGMAL Improvement Plan: Adaptive Scoring & Evaluation Framework
## Executive Summary
This plan addresses two critical gaps in togmal's current implementation:
1. **Naive weighted averaging fails when retrieved questions have low similarity** to the prompt
2. **Lack of rigorous evaluation methodology** to measure OOD detection performance
---
## Problem 1: Low-Similarity Scoring Issues
### Current Limitation
Your system uses a simple weighted average of difficulty scores from k-nearest neighbors, which produces unreliable risk assessments when:
- Maximum similarity < 0.6 (semantically distant matches)
- Retrieved questions span multiple unrelated domains
- Query is truly novel/out-of-distribution
**Example:** "Prove the universe is 10,000 years old" was matched to factual-recall questions about the age of the Earth (similarity ≈ 0.57), resulting in a LOW risk rating despite fitting the "prove a false premise" pattern.
### Solution: Adaptive Uncertainty-Aware Scoring
#### 1. Similarity-Based Confidence Adjustment
Implement a **confidence decay function** that increases risk when similarity is low:
```python
import numpy as np

def compute_adaptive_risk(similarities, difficulties):
"""
Adjust risk score based on retrieval confidence
"""
# Base weighted score
weights = np.array(similarities) / sum(similarities)
base_score = np.dot(weights, difficulties)
# Confidence metrics
max_sim = max(similarities)
avg_sim = np.mean(similarities)
sim_variance = np.var(similarities)
# Uncertainty penalty - increase risk when:
# - Max similarity is low (< 0.7)
# - High variance in similarities (diverse matches)
# - Average similarity is low
uncertainty_penalty = 0.0
# Low maximum similarity threshold
if max_sim < 0.7:
uncertainty_penalty += (0.7 - max_sim) * 0.5
# High variance (retrieved questions are dissimilar to each other)
if sim_variance > 0.05:
uncertainty_penalty += min(sim_variance * 2, 0.3)
# Low average similarity
if avg_sim < 0.5:
uncertainty_penalty += (0.5 - avg_sim) * 0.4
    # Adjusted score (higher = more risky), capped at 1.0
    adjusted_score = min(base_score + uncertainty_penalty, 1.0)
    # Map to risk levels
    if adjusted_score < 0.2:
        level = "MINIMAL"
    elif adjusted_score < 0.4:
        level = "LOW"
    elif adjusted_score < 0.6:
        level = "MODERATE"
    elif adjusted_score < 0.8:
        level = "HIGH"
    else:
        level = "CRITICAL"
    # Return the numeric score (used by the fusion step below) and the label
    return adjusted_score, level
```
**Key Insight:** Research shows that cosine similarity thresholds vary by domain and task. Values of 0.7-0.8 are commonly recommended starting points for "relevant" matches; below 0.6, matches become increasingly unreliable.
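For reference, here is a minimal sketch of how such similarities are computed, assuming the same `all-MiniLM-L6-v2` sentence-transformers model the vector DB uses; the candidate question and interpretation bands are illustrative, not measured values:
```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

prompt = "Prove that the universe is exactly 10,000 years old using thermodynamics"
candidate = "According to radiometric dating, approximately how old is the Earth?"  # hypothetical retrieved question

# normalize_embeddings=True makes the dot product equal to cosine similarity
emb = model.encode([prompt, candidate], normalize_embeddings=True)
cosine_sim = float(np.dot(emb[0], emb[1]))

# Rough interpretation bands used in this plan
if cosine_sim >= 0.7:
    band = "likely relevant"
elif cosine_sim >= 0.6:
    band = "borderline"
else:
    band = "unreliable; treat retrieval as low-confidence"
print(f"cosine similarity = {cosine_sim:.3f} ({band})")
```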
#### 2. Multi-Signal Fusion
Combine multiple indicators beyond just k-NN similarity:
```python
def compute_risk_with_fusion(prompt, knn_results, heuristics):
"""
Fuse vector similarity with rule-based heuristics
"""
    # Vector-based numeric score from k-NN adaptive scoring (the risk label is ignored here)
    vector_score, _ = compute_adaptive_risk(
        knn_results['similarities'],
        knn_results['difficulties']
    )
# Rule-based heuristics (existing togmal patterns)
heuristic_score = heuristics.evaluate(prompt)
    # Domain classifier (is this math/physics/medical?); sketched after this block
    domain_confidence = classify_domain(prompt)
# Combine scores with learned weights
final_score = (
0.4 * vector_score +
0.4 * heuristic_score +
0.2 * domain_uncertainty(domain_confidence)
)
return final_score
```
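`classify_domain` and `domain_uncertainty` do not exist in togmal yet; one plausible sketch is a nearest-centroid classifier over per-domain embedding means, with uncertainty rising as the best match weakens. `DOMAIN_CENTROIDS` is a hypothetical precomputed mapping (one mean, unit-normalized embedding per domain), not an existing structure:
```python
from typing import Dict
import numpy as np
from sentence_transformers import SentenceTransformer

_model = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical: one mean (unit-normalized) embedding per domain, precomputed from the benchmark DB
DOMAIN_CENTROIDS: Dict[str, np.ndarray] = {}

def classify_domain(prompt: str) -> Dict[str, float]:
    """Cosine similarity of the prompt to each domain centroid."""
    emb = _model.encode([prompt], normalize_embeddings=True)[0]
    return {domain: float(np.dot(emb, centroid)) for domain, centroid in DOMAIN_CENTROIDS.items()}

def domain_uncertainty(domain_confidence: Dict[str, float]) -> float:
    """0.0 when some known domain matches strongly, approaching 1.0 for unknown domains."""
    best = max(domain_confidence.values()) if domain_confidence else 0.0
    return float(np.clip(1.0 - best, 0.0, 1.0))
```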
#### 3. Threshold Calibration per Domain
Different domains need different thresholds. Implement **domain-specific calibration**:
```python
# Learned from validation data
DOMAIN_THRESHOLDS = {
'math': {'low': 0.65, 'moderate': 0.75, 'high': 0.85},
'physics': {'low': 0.60, 'moderate': 0.70, 'high': 0.80},
'medical': {'low': 0.70, 'moderate': 0.80, 'high': 0.90},
'general': {'low': 0.60, 'moderate': 0.70, 'high': 0.80}
}
def get_calibrated_threshold(domain, risk_level):
return DOMAIN_THRESHOLDS.get(domain, DOMAIN_THRESHOLDS['general'])[risk_level]
```
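A usage sketch that ties the calibrated thresholds to the numeric adjusted score from section 1; the helper below is illustrative and complements `get_calibrated_threshold`:
```python
def risk_level_for_domain(score: float, domain: str) -> str:
    """Map an adjusted difficulty score to a risk level using domain-calibrated thresholds."""
    thresholds = DOMAIN_THRESHOLDS.get(domain, DOMAIN_THRESHOLDS['general'])
    if score >= thresholds['high']:
        return "HIGH"
    elif score >= thresholds['moderate']:
        return "MODERATE"
    elif score >= thresholds['low']:
        return "LOW"
    return "MINIMAL"

# The same score maps to different levels in different domains
print(risk_level_for_domain(0.72, 'medical'))  # LOW (medical requires higher scores before escalating)
print(risk_level_for_domain(0.72, 'physics')) # MODERATE
```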
---
## Problem 2: Evaluation & Generalization
### Proposed Evaluation Framework: Nested Cross-Validation (Gold Standard)
#### Why Nested CV > Simple Train/Val/Test Split
**Problem with simple splits:**
- Single validation set can be unrepresentative (lucky/unlucky split)
- Repeated "peeking" at validation during hyperparameter search causes leakage
- Test set provides only ONE estimate of generalization (high variance)
**Nested CV advantages** (illustrated with a minimal scikit-learn sketch after this list):
- **Outer loop**: K-fold CV for unbiased generalization estimate
- **Inner loop**: Hyperparameter search on each training fold
- **No leakage**: Test folds never seen during tuning
- **Multiple estimates**: Robust performance across K different test sets
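For intuition, the same nested structure expressed in plain scikit-learn, with a generic classifier standing in for ToGMAL's scorer; the custom implementation in the next section is needed because togmal's "fit" step builds a per-fold vector DB rather than training an estimator:
```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, random_state=42)

# Inner loop: hyperparameter search on each outer training fold
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"max_depth": [3, 5, None]},
    cv=inner_cv,
    scoring="roc_auc",
)

# Outer loop: unbiased generalization estimate
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(search, X, y, cv=outer_cv, scoring="roc_auc")
print(f"Nested CV AUROC: {scores.mean():.3f} ± {scores.std():.3f}")
```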
#### Phase 1: Nested Cross-Validation Implementation
```python
from sklearn.model_selection import StratifiedKFold
import numpy as np
from typing import Dict, List, Any
class NestedCVEvaluator:
"""
Nested cross-validation for ToGMAL hyperparameter tuning and evaluation.
Outer CV: 5-fold stratified CV for generalization estimate
Inner CV: 3-fold stratified CV for hyperparameter search
This prevents data leakage from "peeking" at validation during tuning.
"""
def __init__(
self,
benchmark_data,
outer_folds: int = 5,
inner_folds: int = 3,
random_state: int = 42
):
self.data = benchmark_data
self.outer_folds = outer_folds
self.inner_folds = inner_folds
self.random_state = random_state
# Stratify by (domain, difficulty) to ensure balanced folds
self.stratify_labels = (
benchmark_data['domain'].astype(str) + '_' +
benchmark_data['difficulty_label'].astype(str)
)
def run_nested_cv(
self,
param_grid: Dict[str, List[Any]],
scoring_metric: str = 'roc_auc'
) -> Dict[str, Any]:
"""
Run nested cross-validation.
Args:
param_grid: Hyperparameters to search (e.g., {'k': [3,5,7], 'threshold': [0.6,0.7]})
scoring_metric: Metric for optimization (roc_auc, f1, etc.)
Returns:
Dictionary with:
- outer_scores: Generalization performance on each outer fold
- best_params_per_fold: Optimal hyperparameters found in each inner CV
- mean_test_score: Average performance across outer folds
- std_test_score: Standard deviation (uncertainty estimate)
"""
# Outer CV: For generalization estimate
outer_cv = StratifiedKFold(
n_splits=self.outer_folds,
shuffle=True,
random_state=self.random_state
)
outer_scores = []
best_params_per_fold = []
print("Starting Nested Cross-Validation...")
print(f"Outer CV: {self.outer_folds} folds")
print(f"Inner CV: {self.inner_folds} folds")
print(f"Param grid: {param_grid}")
print("="*80)
for fold_idx, (train_idx, test_idx) in enumerate(outer_cv.split(self.data, self.stratify_labels)):
print(f"\nOuter Fold {fold_idx + 1}/{self.outer_folds}")
# Split data for this outer fold
train_data = self.data.iloc[train_idx]
test_data = self.data.iloc[test_idx]
# Inner CV: Hyperparameter search on training data ONLY
inner_cv = StratifiedKFold(
n_splits=self.inner_folds,
shuffle=True,
random_state=self.random_state
)
# Run grid search on inner folds
best_params, best_inner_score = self._inner_grid_search(
train_data,
param_grid,
inner_cv,
scoring_metric
)
print(f" Inner CV best params: {best_params}")
print(f" Inner CV best score: {best_inner_score:.4f}")
# Build ToGMAL vector DB with ONLY training data
vector_db = self._build_vector_db(train_data)
# Evaluate on held-out test fold with best hyperparameters
test_score = self._evaluate_on_test_fold(
vector_db,
test_data,
best_params,
scoring_metric
)
print(f" Outer test score: {test_score:.4f}")
outer_scores.append(test_score)
best_params_per_fold.append(best_params)
# Aggregate results
mean_score = np.mean(outer_scores)
std_score = np.std(outer_scores)
print("\n" + "="*80)
print("Nested CV Results:")
print(f" Outer scores: {[f'{s:.4f}' for s in outer_scores]}")
print(f" Mean ± Std: {mean_score:.4f} ± {std_score:.4f}")
print("="*80)
return {
'outer_scores': outer_scores,
'mean_test_score': mean_score,
'std_test_score': std_score,
'best_params_per_fold': best_params_per_fold,
'most_common_params': self._find_most_common_params(best_params_per_fold)
}
def _inner_grid_search(
self,
train_data,
param_grid: Dict[str, List[Any]],
inner_cv,
scoring_metric: str
) -> tuple:
"""
Grid search over hyperparameters using inner CV folds.
Returns (best_params, best_score)
"""
stratify = (
train_data['domain'].astype(str) + '_' +
train_data['difficulty_label'].astype(str)
)
best_score = -np.inf
best_params = {}
# Generate all parameter combinations
from itertools import product
param_names = list(param_grid.keys())
param_values = list(param_grid.values())
for param_combo in product(*param_values):
params = dict(zip(param_names, param_combo))
# Evaluate this parameter combination on inner folds
fold_scores = []
for inner_train_idx, inner_val_idx in inner_cv.split(train_data, stratify):
inner_train = train_data.iloc[inner_train_idx]
inner_val = train_data.iloc[inner_val_idx]
# Build vector DB with inner training data
inner_db = self._build_vector_db(inner_train)
# Evaluate on inner validation
score = self._evaluate_on_test_fold(
inner_db,
inner_val,
params,
scoring_metric
)
fold_scores.append(score)
avg_score = np.mean(fold_scores)
if avg_score > best_score:
best_score = avg_score
best_params = params
return best_params, best_score
def _build_vector_db(self, train_data):
"""Build vector database from training data."""
from benchmark_vector_db import BenchmarkVectorDB, BenchmarkQuestion
from pathlib import Path
import tempfile
# Create temporary DB for this fold
temp_dir = tempfile.mkdtemp()
db = BenchmarkVectorDB(
db_path=Path(temp_dir) / "fold_db",
embedding_model="all-MiniLM-L6-v2"
)
# Convert dataframe to BenchmarkQuestion objects
questions = [
BenchmarkQuestion(
question_id=row['question_id'],
source_benchmark=row['source_benchmark'],
domain=row['domain'],
question_text=row['question_text'],
correct_answer=row['correct_answer'],
success_rate=row['success_rate'],
difficulty_score=row['difficulty_score'],
difficulty_label=row['difficulty_label']
)
for _, row in train_data.iterrows()
]
db.index_questions(questions)
return db
def _evaluate_on_test_fold(
self,
vector_db,
test_data,
params: Dict[str, Any],
metric: str
) -> float:
"""
Evaluate ToGMAL on test fold with given hyperparameters.
Args:
vector_db: Vector database built from training data
test_data: Held-out test fold
params: Hyperparameters (e.g., k, similarity_threshold, weights)
metric: Scoring metric (roc_auc, f1, etc.)
"""
from sklearn.metrics import roc_auc_score, f1_score
predictions = []
ground_truth = []
for _, row in test_data.iterrows():
# Query vector DB with test question
result = vector_db.query_similar_questions(
prompt=row['question_text'],
k=params.get('k_neighbors', 5)
)
# Apply adaptive scoring with hyperparameters
risk_score = self._compute_adaptive_risk(
result,
params
)
predictions.append(risk_score)
# Ground truth: is this question hard? (success_rate < 0.5)
ground_truth.append(1 if row['success_rate'] < 0.5 else 0)
# Compute metric
if metric == 'roc_auc':
return roc_auc_score(ground_truth, predictions)
elif metric == 'f1':
# Binarize predictions at 0.5 threshold
binary_preds = [1 if p > 0.5 else 0 for p in predictions]
return f1_score(ground_truth, binary_preds)
else:
raise ValueError(f"Unknown metric: {metric}")
def _compute_adaptive_risk(
self,
query_result: Dict[str, Any],
params: Dict[str, Any]
) -> float:
"""
Compute risk score with adaptive uncertainty penalties.
Uses hyperparameters from inner CV search.
"""
similarities = [q['similarity'] for q in query_result['similar_questions']]
difficulties = [q['difficulty_score'] for q in query_result['similar_questions']]
# Base weighted average
weights = np.array(similarities) / sum(similarities)
base_score = np.dot(weights, difficulties)
# Adaptive uncertainty penalties
max_sim = max(similarities)
avg_sim = np.mean(similarities)
sim_variance = np.var(similarities)
uncertainty_penalty = 0.0
# Low similarity threshold (configurable)
sim_threshold = params.get('similarity_threshold', 0.7)
if max_sim < sim_threshold:
uncertainty_penalty += (sim_threshold - max_sim) * params.get('low_sim_penalty', 0.5)
# High variance penalty
if sim_variance > 0.05:
uncertainty_penalty += min(sim_variance * params.get('variance_penalty', 2.0), 0.3)
# Low average similarity
if avg_sim < 0.5:
uncertainty_penalty += (0.5 - avg_sim) * params.get('low_avg_penalty', 0.4)
# Final score
adjusted_score = base_score + uncertainty_penalty
return np.clip(adjusted_score, 0.0, 1.0)
def _find_most_common_params(self, params_list: List[Dict]) -> Dict:
"""Find the most frequently selected hyperparameters across folds."""
from collections import Counter
# For each parameter, find the most common value
all_param_names = params_list[0].keys()
most_common = {}
for param_name in all_param_names:
values = [p[param_name] for p in params_list]
most_common[param_name] = Counter(values).most_common(1)[0][0]
return most_common
# Example usage
if __name__ == "__main__":
    import pandas as pd
    from pathlib import Path
    from benchmark_vector_db import BenchmarkVectorDB
# Load all benchmark questions
db = BenchmarkVectorDB(db_path=Path("/Users/hetalksinmaths/togmal/data/benchmark_vector_db"))
stats = db.get_statistics()
    # Get all questions as a dataframe (see Quick Start Step 2 for get_all_questions_as_dataframe)
    all_questions_df = db.get_all_questions_as_dataframe()
# Define hyperparameter search grid
param_grid = {
'k_neighbors': [3, 5, 7, 10],
'similarity_threshold': [0.6, 0.7, 0.8],
'low_sim_penalty': [0.3, 0.5, 0.7],
'variance_penalty': [1.0, 2.0, 3.0],
'low_avg_penalty': [0.2, 0.4, 0.6]
}
# Run nested CV
evaluator = NestedCVEvaluator(
benchmark_data=all_questions_df,
outer_folds=5, # 5-fold outer CV
inner_folds=3 # 3-fold inner CV for hyperparameter search
)
results = evaluator.run_nested_cv(
param_grid=param_grid,
scoring_metric='roc_auc'
)
print("\nFinal Results:")
print(f"Generalization Performance: {results['mean_test_score']:.4f} ± {results['std_test_score']:.4f}")
print(f"Most Common Best Params: {results['most_common_params']}")
```
**Key Advantages:**
- **No leakage**: Each outer test fold is never seen during hyperparameter tuning
- **Robust estimates**: 5 different generalization scores (not just 1)
- **Automatic tuning**: Inner CV finds best hyperparameters for each fold
- **Confidence intervals**: The standard deviation across outer folds quantifies uncertainty in the estimate (a quick interval calculation is sketched below)
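One way to turn the outer-fold scores into an approximate 95% confidence interval, assuming roughly normal fold scores (the AUROC values below are purely hypothetical):
```python
import numpy as np
from scipy import stats

outer_scores = np.array([0.81, 0.84, 0.79, 0.86, 0.82])  # hypothetical outer-fold AUROCs
mean = outer_scores.mean()
sem = stats.sem(outer_scores)  # standard error of the mean
low, high = stats.t.interval(0.95, len(outer_scores) - 1, loc=mean, scale=sem)
print(f"AUROC = {mean:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```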
#### Phase 2: Define Evaluation Metrics
Use standard **OOD detection metrics** + **calibration metrics**:
1. **AUROC** (Area Under ROC Curve)
- Threshold-independent
- Measures overall discriminative ability
- Gold standard for OOD detection
- Interpretation: Probability that a random risky prompt is ranked higher than a random safe prompt
2. **FPR@TPR95** (False Positive Rate at 95% True Positive Rate)
- How many safe prompts are incorrectly flagged when catching 95% of risky ones
- Common in safety-critical applications
- Lower is better (want to minimize false alarms)
3. **AUPR** (Area Under Precision-Recall Curve)
- Better for imbalanced datasets
- Useful when risky prompts are rare
- Focuses on positive class (risky prompts)
4. **Expected Calibration Error (ECE)**
- Are your risk probabilities accurate?
- If the system predicts 70% risk, are roughly 70% of those prompts actually hard?
- Measures gap between predicted probabilities and observed frequencies
5. **Brier Score**
- Measures accuracy of probabilistic predictions
- Lower is better
- Combines discrimination and calibration
```python
from sklearn.metrics import roc_auc_score, precision_recall_curve, auc, brier_score_loss
import numpy as np
def compute_fpr_at_tpr(y_true, y_pred_proba, tpr_threshold=0.95):
"""Compute FPR when TPR is at specified threshold."""
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_true, y_pred_proba)
# Find index where TPR >= threshold
idx = np.argmax(tpr >= tpr_threshold)
return fpr[idx]
def expected_calibration_error(y_true, y_pred_proba, n_bins=10):
"""
Compute Expected Calibration Error (ECE).
Bins predictions into n_bins buckets and measures the gap between
predicted probability and observed frequency in each bin.
"""
bin_boundaries = np.linspace(0, 1, n_bins + 1)
bin_lowers = bin_boundaries[:-1]
bin_uppers = bin_boundaries[1:]
ece = 0.0
for bin_lower, bin_upper in zip(bin_lowers, bin_uppers):
# Find predictions in this bin
in_bin = (y_pred_proba > bin_lower) & (y_pred_proba <= bin_upper)
prop_in_bin = in_bin.mean()
if prop_in_bin > 0:
# Observed frequency in this bin
accuracy_in_bin = y_true[in_bin].mean()
# Average predicted probability in this bin
avg_confidence_in_bin = y_pred_proba[in_bin].mean()
# Contribution to ECE
ece += np.abs(avg_confidence_in_bin - accuracy_in_bin) * prop_in_bin
return ece
def evaluate_togmal(predictions, ground_truth):
"""
Comprehensive evaluation of ToGMAL performance.
Args:
predictions: Dict with 'risk_score' (continuous 0-1) and 'risk_level' (categorical)
ground_truth: Array of difficulty scores or binary labels (0=easy, 1=hard)
Returns:
Dictionary with all evaluation metrics
"""
# Convert ground truth to binary if needed (HIGH/CRITICAL = 1, else = 0)
if hasattr(ground_truth, 'success_rate'):
y_true = (ground_truth['success_rate'] < 0.5).astype(int)
else:
y_true = ground_truth
y_pred_proba = predictions['risk_score'] # Continuous 0-1
y_pred_binary = (y_pred_proba > 0.5).astype(int) # Binarized
# AUROC
auroc = roc_auc_score(y_true, y_pred_proba)
# FPR@TPR95
fpr_at_95_tpr = compute_fpr_at_tpr(y_true, y_pred_proba, tpr_threshold=0.95)
# AUPR
    precision_curve, recall_curve, _ = precision_recall_curve(y_true, y_pred_proba)
    aupr = auc(recall_curve, precision_curve)
# Calibration error
ece = expected_calibration_error(y_true, y_pred_proba, n_bins=10)
# Brier score (lower is better)
brier = brier_score_loss(y_true, y_pred_proba)
# Standard classification metrics (for reference)
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
accuracy = accuracy_score(y_true, y_pred_binary)
f1 = f1_score(y_true, y_pred_binary)
precision = precision_score(y_true, y_pred_binary)
recall = recall_score(y_true, y_pred_binary)
return {
# Primary OOD detection metrics
'AUROC': auroc,
'FPR@TPR95': fpr_at_95_tpr,
'AUPR': aupr,
# Calibration metrics
'ECE': ece,
'Brier_Score': brier,
# Standard classification (for reference)
'Accuracy': accuracy,
'F1': f1,
'Precision': precision,
'Recall': recall
}
def print_evaluation_report(metrics: dict):
"""Pretty print evaluation metrics."""
print("\n" + "="*80)
print("ToGMAL Evaluation Report")
print("="*80)
print("\nOOD Detection Performance:")
print(f" AUROC: {metrics['AUROC']:.4f} (higher is better, 0.5=random, 1.0=perfect)")
print(f" FPR@TPR95: {metrics['FPR@TPR95']:.4f} (lower is better, false alarm rate)")
print(f" AUPR: {metrics['AUPR']:.4f} (higher is better)")
print("\nCalibration:")
print(f" ECE: {metrics['ECE']:.4f} (lower is better, 0=perfect calibration)")
print(f" Brier Score: {metrics['Brier_Score']:.4f} (lower is better)")
print("\nClassification Metrics (for reference):")
print(f" Accuracy: {metrics['Accuracy']:.4f}")
print(f" F1 Score: {metrics['F1']:.4f}")
print(f" Precision: {metrics['Precision']:.4f}")
print(f" Recall: {metrics['Recall']:.4f}")
print("\n" + "="*80)
```
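A quick smoke test of the pipeline above with hypothetical labels and risk scores (real inputs would come from ToGMAL predictions and benchmark success rates):
```python
import numpy as np

# Hypothetical ground truth (1 = hard question) and predicted risk scores
y_true = np.array([0, 0, 0, 1, 1, 1, 0, 1, 0, 1])
predictions = {'risk_score': np.array([0.15, 0.30, 0.55, 0.70, 0.85, 0.60, 0.20, 0.90, 0.40, 0.75])}

metrics = evaluate_togmal(predictions, y_true)
print_evaluation_report(metrics)
```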
#### Phase 3: Out-of-Distribution Testing
**Critical:** Test on data that's truly OOD from your training benchmarks.
**OOD Test Sets to Create:**
1. **Temporal OOD**: New benchmark questions released after your training data cutoff
2. **Domain Shift**: Categories not in MMLU (e.g., creative writing prompts, coding challenges)
3. **Adversarial**: Hand-crafted examples designed to fool the system
- "Prove [false scientific claim]"
- Jailbreak attempts disguised as innocent questions
- Edge cases from your taxonomy submissions
```python
ood_test_sets = {
'adversarial_false_premises': load_false_premise_examples(),
'jailbreaks': load_jailbreak_attempts(),
'creative_writing': load_writing_prompts(),
'recent_benchmarks': load_benchmarks_after('2024-01'),
'user_submissions': load_taxonomy_entries()
}
# Evaluate on each OOD set
for name, test_data in ood_test_sets.items():
metrics = evaluate_togmal(model.predict(test_data), test_data.labels)
print(f"{name}: AUROC={metrics['AUROC']:.3f}, FPR@95={metrics['FPR@TPR95']:.3f}")
```
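The loaders above are placeholders. A hypothetical stub for one of them, hand-crafted false-premise prompts labeled as risky, might look like the following (how these records get wrapped into the `test_data`/`labels` interface is left to the evaluation harness):
```python
from typing import Dict, List

def load_false_premise_examples() -> List[Dict[str, object]]:
    """Hand-crafted 'prove a false premise' prompts; every one is labeled risky (1)."""
    prompts = [
        "Prove that the universe is exactly 10,000 years old using thermodynamics",
        "Show rigorously that 0.999... is strictly less than 1",
        "Explain why the Sun orbits the Earth once every 24 hours",
    ]
    return [{"prompt": p, "label": 1} for p in prompts]
```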
#### Phase 4: Hyperparameter Tuning Protocol
**Use the validation set ONLY** for tuning; never touch the test set until the final evaluation.
```python
# Protocol sketch: grid_search_cv and train_togmal below are illustrative helpers, not existing APIs
# Parameters to tune
param_grid = {
'similarity_threshold': [0.5, 0.6, 0.7, 0.8],
'k_neighbors': [3, 5, 7, 10],
'uncertainty_penalty_weight': [0.2, 0.4, 0.6],
'heuristic_weight': [0.3, 0.4, 0.5],
'vector_weight': [0.3, 0.4, 0.5]
}
# Cross-validation on validation set
best_params = grid_search_cv(
togmal_model,
param_grid,
val_set,
metric='AUROC',
cv=5 # 5-fold CV within validation set
)
# Train final model with best params on train + val
final_model = train_togmal(
train_set + val_set,
params=best_params
)
# Evaluate ONCE on test set
final_metrics = evaluate_togmal(
final_model.predict(test_set),
test_set.labels
)
```
---
## Implementation Roadmap
### Phase 1: Adaptive Scoring Implementation (Week 1-2)
- [x] Implement basic vector database with 32K questions
- [ ] Add adaptive uncertainty-aware scoring function
- [ ] Similarity threshold penalties
- [ ] Variance penalties for diverse matches
- [ ] Low average similarity penalties
- [ ] Implement domain-specific threshold calibration
- [ ] Add multi-signal fusion (vector + heuristics)
- [ ] Integrate into `benchmark_vector_db.py::query_similar_questions()`
### Phase 2: Data Export & Preparation (Week 2)
- [ ] Export all 32K questions from ChromaDB to pandas DataFrame
- [ ] Add `BenchmarkVectorDB.get_all_questions_as_dataframe()` method
- [ ] Include all metadata (domain, difficulty, success_rate, etc.)
- [ ] Verify stratification labels (domain × difficulty)
- [ ] Create initial train/val/test split (simple 70/15/15) for baseline (see the sketch after this phase's checklist)
- [ ] Document dataset statistics per split
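A minimal sketch of the stratified 70/15/15 split, assuming `db` is an initialized `BenchmarkVectorDB` and `get_all_questions_as_dataframe()` from Quick Start Step 2 is available (rare domain × difficulty strata may need merging before stratifying):
```python
from sklearn.model_selection import train_test_split

df = db.get_all_questions_as_dataframe()
strata = df['domain'].astype(str) + '_' + df['difficulty_label'].astype(str)

# 70% train, 30% held out
train_df, temp_df = train_test_split(df, test_size=0.30, stratify=strata, random_state=42)

# Split the held-out 30% evenly into validation and test (15% each overall)
temp_strata = temp_df['domain'].astype(str) + '_' + temp_df['difficulty_label'].astype(str)
val_df, test_df = train_test_split(temp_df, test_size=0.50, stratify=temp_strata, random_state=42)

print(f"train={len(train_df)}, val={len(val_df)}, test={len(test_df)}")
```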
### Phase 3: Nested CV Framework (Week 3)
- [ ] Implement `NestedCVEvaluator` class
- [ ] Outer CV loop (5-fold stratified)
- [ ] Inner CV loop (3-fold grid search)
- [ ] Temporary vector DB creation per fold
- [ ] Define hyperparameter search grid
- `k_neighbors`: [3, 5, 7, 10]
- `similarity_threshold`: [0.6, 0.7, 0.8]
- `low_sim_penalty`: [0.3, 0.5, 0.7]
- `variance_penalty`: [1.0, 2.0, 3.0]
- `low_avg_penalty`: [0.2, 0.4, 0.6]
- [ ] Implement evaluation metrics (AUROC, FPR@TPR95, ECE)
### Phase 4: Baseline Evaluation (Week 3-4)
- [ ] Run current ToGMAL (naive weighted average) on simple split
- [ ] Compute baseline metrics:
- [ ] AUROC on test set
- [ ] FPR@TPR95
- [ ] Expected Calibration Error
- [ ] Brier Score
- [ ] Analyze failure modes:
- [ ] Low similarity cases (max_sim < 0.6)
- [ ] High variance matches
- [ ] Cross-domain queries
- [ ] Document baseline performance for comparison
### Phase 5: Nested CV Hyperparameter Tuning (Week 4-5)
- [ ] Run full nested CV (5 outer × 3 inner = 15 train-test runs)
- [ ] Track computational cost (time per fold)
- [ ] Collect best hyperparameters per outer fold
- [ ] Identify most common optimal parameters
- [ ] Compute mean ± std generalization performance
### Phase 6: Final Model Training (Week 5)
- [ ] Train final model on ALL 32K questions with best hyperparameters
- [ ] Re-index full vector database
- [ ] Update `togmal_mcp.py` to use adaptive scoring
- [ ] Deploy to MCP server and HTTP facade
### Phase 7: OOD Testing (Week 6)
- [ ] Create OOD test sets:
- [ ] **Adversarial**: Hand-crafted edge cases
- "Prove [false scientific claim]"
- Jailbreak attempts disguised as questions
- Taxonomy submissions from users
- [ ] **Domain Shift**: Categories not in MMLU
- Creative writing prompts
- Code generation tasks
- Real-world user queries
- [ ] **Temporal OOD**: New benchmarks (2024+)
- SimpleQA (if available)
- Latest MMLU updates
- [ ] Evaluate on each OOD set
- [ ] Analyze degradation vs. in-distribution performance
### Phase 8: Iteration & Documentation (Week 7)
- [ ] Analyze failures on OOD sets
- [ ] Add new heuristics for missed patterns
- [ ] Re-run nested CV with updated features
- [ ] Generate calibration plots (reliability diagrams); see the sketch after this list
- [ ] Write technical report:
- [ ] Methodology (nested CV protocol)
- [ ] Results (baseline vs. adaptive)
- [ ] Ablation studies (each penalty component)
- [ ] OOD generalization analysis
- [ ] Failure mode documentation
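A sketch of the reliability diagram using scikit-learn's `calibration_curve`; `y_true` (1 = hard question) and `risk_scores` (ToGMAL's predicted risk in [0, 1]) are assumed to come from the held-out test set:
```python
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

# Fraction of hard questions observed in each predicted-risk bin
prob_true, prob_pred = calibration_curve(y_true, risk_scores, n_bins=10)

plt.plot(prob_pred, prob_true, marker='o', label='ToGMAL')
plt.plot([0, 1], [0, 1], linestyle='--', label='Perfect calibration')
plt.xlabel('Predicted risk')
plt.ylabel('Observed fraction of hard questions')
plt.legend()
plt.savefig('reliability_diagram.png', dpi=150)
```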
---
## Expected Improvements
Based on OOD detection literature and nested CV best practices:
1. **Adaptive scoring** should improve AUROC by 5-15% on low-similarity cases
- Baseline: ~0.75 AUROC (naive weighted average)
- Target: ~0.85+ AUROC (adaptive with uncertainty)
2. **Nested CV** will give honest performance estimates
- Simple train/test: Single point estimate (could be lucky/unlucky)
- Nested CV: Mean ± std across 5 folds (robust estimate)
3. **Domain calibration** should reduce false positives by 10-20%
- Expected: FPR@TPR95 drops from ~0.25 to ~0.15
4. **Multi-signal fusion** should catch edge cases like "prove false premise"
- Combine vector similarity + rule-based heuristics
- Expected: Improved recall on adversarial examples
5. **Calibration improvements**
- Expected Calibration Error (ECE) < 0.05
- Better alignment between predicted risk and actual difficulty
---
## Validation Checklist
Before deploying to production:
- [ ] Nested CV completed with no data leakage
- [ ] Hyperparameters tuned on inner CV folds only
- [ ] Generalization performance estimated on outer CV folds
- [ ] OOD sets tested (adversarial, domain-shift, temporal)
- [ ] Calibration error measured and within acceptable range (ECE < 0.1)
- [ ] Failure modes documented with specific examples
- [ ] Ablation studies show each component contributes positively
- [ ] Performance comparison: adaptive > baseline on all metrics
- [ ] Real-world testing with user queries from taxonomy submissions
---
## Key References
1. **Similarity Thresholds**: Cosine similarity 0.7-0.8 recommended as starting point for "relevant" matches; lower values increasingly unreliable
2. **OOD Metrics**: AUROC, FPR@TPR95 are standard; conformal prediction provides probabilistic guarantees
3. **Adaptive Methods**: Uncertainty-aware thresholds outperform fixed thresholds in retrieval tasks
4. **Holdout Validation**: 60-20-20 or 70-15-15 splits common; stratification by domain/difficulty essential
5. **Calibration**: Expected Calibration Error (ECE) measures if predicted probabilities match observed frequencies
6. **Nested CV**: Gold standard for hyperparameter tuning; prevents leakage from repeated validation peeking
7. **Stratified K-Fold**: Maintains class distribution across folds; essential for imbalanced datasets
---
## Quick Start: Immediate Implementation
### Step 1: Add Adaptive Scoring to `benchmark_vector_db.py` (Today)
Replace the naive weighted average in `query_similar_questions()` with adaptive uncertainty-aware scoring:
```python
def query_similar_questions(
self,
prompt: str,
k: int = 5,
domain_filter: Optional[str] = None,
# NEW: Adaptive scoring parameters
similarity_threshold: float = 0.7,
low_sim_penalty: float = 0.5,
variance_penalty: float = 2.0,
low_avg_penalty: float = 0.4
) -> Dict[str, Any]:
"""Find k most similar benchmark questions with adaptive uncertainty penalties."""
# ... existing code to query ChromaDB ...
# Extract similarities and difficulty scores
similarities = []
difficulty_scores = []
success_rates = []
for i in range(len(results['ids'][0])):
metadata = results['metadatas'][0][i]
distance = results['distances'][0][i]
        # Convert L2 distance to cosine similarity (valid for unit-normalized embeddings: sim = 1 - d^2 / 2)
        similarity = max(0, 1 - (distance ** 2) / 2)
similarities.append(similarity)
difficulty_scores.append(metadata['difficulty_score'])
success_rates.append(metadata['success_rate'])
# IMPROVED: Adaptive uncertainty-aware scoring
weighted_difficulty = self._compute_adaptive_difficulty(
similarities=similarities,
difficulty_scores=difficulty_scores,
similarity_threshold=similarity_threshold,
low_sim_penalty=low_sim_penalty,
variance_penalty=variance_penalty,
low_avg_penalty=low_avg_penalty
)
# ... rest of existing code ...
def _compute_adaptive_difficulty(
self,
similarities: List[float],
difficulty_scores: List[float],
similarity_threshold: float = 0.7,
low_sim_penalty: float = 0.5,
variance_penalty: float = 2.0,
low_avg_penalty: float = 0.4
) -> float:
"""
Compute difficulty score with adaptive uncertainty penalties.
Key insight: When retrieved questions have low similarity to the prompt,
we should INCREASE the risk estimate because we're extrapolating.
Args:
similarities: Cosine similarities of k-NN results
difficulty_scores: Difficulty scores (1 - success_rate) of k-NN results
similarity_threshold: Below this, apply low similarity penalty (default: 0.7)
low_sim_penalty: Weight for low similarity penalty (default: 0.5)
variance_penalty: Weight for high variance penalty (default: 2.0)
low_avg_penalty: Weight for low average similarity penalty (default: 0.4)
Returns:
Adjusted difficulty score (0.0 to 1.0, higher = more risky)
"""
import numpy as np
# Base weighted average (original approach)
weights = np.array(similarities) / sum(similarities)
base_score = np.dot(weights, difficulty_scores)
# Compute uncertainty indicators
max_sim = max(similarities)
avg_sim = np.mean(similarities)
sim_variance = np.var(similarities)
# Initialize uncertainty penalty
uncertainty_penalty = 0.0
# Penalty 1: Low maximum similarity
# If best match is weak, we're likely OOD
if max_sim < similarity_threshold:
penalty = (similarity_threshold - max_sim) * low_sim_penalty
uncertainty_penalty += penalty
logger.debug(f"Low max similarity penalty: {penalty:.3f} (max_sim={max_sim:.3f})")
# Penalty 2: High variance in similarities
# If k-NN results are very dissimilar to each other, matches are unreliable
variance_threshold = 0.05
if sim_variance > variance_threshold:
penalty = min(sim_variance * variance_penalty, 0.3) # Cap at 0.3
uncertainty_penalty += penalty
logger.debug(f"High variance penalty: {penalty:.3f} (variance={sim_variance:.3f})")
# Penalty 3: Low average similarity
# If ALL matches are weak, we're definitely OOD
avg_threshold = 0.5
if avg_sim < avg_threshold:
penalty = (avg_threshold - avg_sim) * low_avg_penalty
uncertainty_penalty += penalty
logger.debug(f"Low avg similarity penalty: {penalty:.3f} (avg_sim={avg_sim:.3f})")
# Final adjusted score
adjusted_score = base_score + uncertainty_penalty
# Clip to [0, 1] range
adjusted_score = np.clip(adjusted_score, 0.0, 1.0)
logger.info(
f"Adaptive scoring: base={base_score:.3f}, penalty={uncertainty_penalty:.3f}, "
f"adjusted={adjusted_score:.3f}"
)
return adjusted_score
```
**Why this helps:**
- **"Prove universe is 10,000 years old" example**: max_sim=0.57 triggers low similarity penalty → risk increases from MODERATE to HIGH
- **Unrelated k-NN matches**: High variance → additional penalty → correctly flags as uncertain
- **Novel domains**: Low average similarity across all matches → strong penalty → CRITICAL risk
### Step 2: Export Database for Evaluation (This Week)
Add method to export all questions as DataFrame for nested CV:
```python
def get_all_questions_as_dataframe(self) -> 'pd.DataFrame':
"""
Export all questions from ChromaDB as a pandas DataFrame.
Used for train/val/test splitting and nested CV.
Returns:
DataFrame with columns:
- question_id, source_benchmark, domain, question_text,
- correct_answer, success_rate, difficulty_score, difficulty_label
"""
import pandas as pd
count = self.collection.count()
logger.info(f"Exporting {count} questions from vector database...")
# Get all questions from ChromaDB
all_data = self.collection.get(
limit=count,
include=["metadatas", "documents"]
)
# Convert to DataFrame
rows = []
for i, qid in enumerate(all_data['ids']):
metadata = all_data['metadatas'][i]
rows.append({
'question_id': qid,
'question_text': all_data['documents'][i],
'source_benchmark': metadata['source'],
'domain': metadata['domain'],
'success_rate': metadata['success_rate'],
'difficulty_score': metadata['difficulty_score'],
'difficulty_label': metadata['difficulty_label'],
'num_models_tested': metadata.get('num_models', 0)
})
df = pd.DataFrame(rows)
logger.info(f"Exported {len(df)} questions to DataFrame")
logger.info(f" Domains: {df['domain'].nunique()}")
logger.info(f" Sources: {df['source_benchmark'].nunique()}")
return df
```
### Step 3: Test Adaptive Scoring Immediately
Create a test script that exercises the adaptive scoring on prompts that should (and should not) trigger uncertainty penalties; a baseline-vs-adaptive comparison sketch follows the script:
```python
#!/usr/bin/env python3
"""Test adaptive scoring improvements."""
from benchmark_vector_db import BenchmarkVectorDB
from pathlib import Path
# Initialize database
db = BenchmarkVectorDB(
db_path=Path("/Users/hetalksinmaths/togmal/data/benchmark_vector_db")
)
# Test cases that should trigger uncertainty penalties
test_cases = [
# Low similarity - should get penalty
"Prove that the universe is exactly 10,000 years old using thermodynamics",
# Novel domain - should get penalty
"Write a haiku about quantum entanglement in 17th century Japanese",
# Should match well - no penalty
"What is the capital of France?",
# Should match GPQA physics - no penalty
"Calculate the quantum correction to the partition function for a 3D harmonic oscillator"
]
print("="*80)
print("Adaptive Scoring Test")
print("="*80)
for prompt in test_cases:
print(f"\nPrompt: {prompt[:100]}...")
result = db.query_similar_questions(prompt, k=5)
print(f" Max Similarity: {max(q['similarity'] for q in result['similar_questions']):.3f}")
print(f" Avg Similarity: {result['avg_similarity']:.3f}")
print(f" Weighted Difficulty: {result['weighted_difficulty_score']:.3f}")
print(f" Risk Level: {result['risk_level']}")
print(f" Top Match: {result['similar_questions'][0]['domain']} - {result['similar_questions'][0]['source']}")
```
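For the baseline comparison, the same prompts can be re-scored with the uncertainty penalties disabled (all penalty weights set to 0), using the parameters added in Step 1:
```python
print("\n" + "=" * 80)
print("Baseline (penalties disabled) vs. Adaptive")
print("=" * 80)

for prompt in test_cases:
    # Baseline: naive weighted average (all uncertainty penalties zeroed out)
    baseline = db.query_similar_questions(
        prompt, k=5,
        low_sim_penalty=0.0, variance_penalty=0.0, low_avg_penalty=0.0
    )
    # Adaptive: default parameters apply the penalties
    adaptive = db.query_similar_questions(prompt, k=5)
    print(f"\n{prompt[:60]}...")
    print(f"  Baseline: score={baseline['weighted_difficulty_score']:.3f}  risk={baseline['risk_level']}")
    print(f"  Adaptive: score={adaptive['weighted_difficulty_score']:.3f}  risk={adaptive['risk_level']}")
```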
---
## Next Steps
1. **Immediate**: Implement train/val/test split of benchmark data
2. **This week**: Add similarity-based uncertainty penalties
3. **Next week**: Run validation experiments with different thresholds
4. **End of month**: Complete evaluation on test set + OOD sets
5. **Ongoing**: Build adversarial test set from user submissions