# Metrics Logic: Justification for Metric Selection
## Metric Definitions
**Core Metrics:**
- **Precision@K** = TP / (TP + FP) - Fraction of the top-K returned results that are relevant
- **Recall@K** = TP / (TP + FN) - Fraction of all relevant items found within the top-K results
- **F1@K** = 2 × (P × R) / (P + R) - Harmonic mean of Precision and Recall
- **File Discovery Rate** = Files Found / Files Expected - Fraction of expected files found
- **Substring Coverage** = Substrings Found / Substrings Required - Fraction of required content found
- **Partial Match Rate** = Partial Matches / Total Matches - Fraction of matched files where only part of the required content was found
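The sketch below shows one way these core metrics could be computed, assuming the ground truth for a query is a set of expected files plus required substrings per file; the function names and data structures here are illustrative, not the evaluator's actual API.

```python
def precision_recall_f1(returned: set, relevant: set) -> tuple[float, float, float]:
    """Precision@K, Recall@K, and F1@K over the top-K returned items."""
    tp = len(returned & relevant)   # relevant items that were returned
    fp = len(returned - relevant)   # returned items that are not relevant
    fn = len(relevant - returned)   # relevant items that were missed
    precision = tp / (tp + fp) if returned else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


def file_discovery_rate(found_files: set, expected_files: set) -> float:
    """Fraction of expected files that appear in the results."""
    return len(found_files & expected_files) / len(expected_files) if expected_files else 1.0


def substring_coverage(snippets_by_file: dict[str, str],
                       required_by_file: dict[str, list[str]]) -> float:
    """Fraction of required substrings present in the returned snippets."""
    required = sum(len(subs) for subs in required_by_file.values())
    found = sum(
        1
        for path, subs in required_by_file.items()
        for sub in subs
        if sub in snippets_by_file.get(path, "")
    )
    return found / required if required else 1.0
```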
**Stability Metrics** (multi-run evaluation):
- **Coefficient of Variation (CV)** = Std Dev / Mean - Measures consistency across runs
- **Stability Score** = Average of (1 - CV) for key metrics
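A minimal sketch of the stability computation, assuming each run produces a dictionary of metric values; the use of the population standard deviation and the particular choice of key metrics are assumptions, not requirements of the evaluator.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values: list[float]) -> float:
    """CV = standard deviation / mean across repeated runs (assumes a non-zero mean)."""
    return pstdev(values) / mean(values)


def stability_score(runs: list[dict[str, float]], keys: list[str]) -> float:
    """Average of (1 - CV) over the key metrics, e.g. precision and recall."""
    return mean(1.0 - coefficient_of_variation([run[k] for run in runs]) for k in keys)
```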
## The Logic Test: Proof of Perfection
**Theorem:** If all metrics score 1.0, the solution is mathematically perfect.
**Proof:**
1. **Precision = 1.0** ⟹ TP / (TP + FP) = 1.0 ⟹ FP = 0 (no false positives)
2. **Recall = 1.0** ⟹ TP / (TP + FN) = 1.0 ⟹ FN = 0 (no false negatives)
3. **File Discovery Rate = 1.0** ⟹ All expected files found
4. **Substring Coverage = 1.0** ⟹ All required content present
5. **Partial Match Rate = 0.0** ⟹ All matches are complete
**Completeness:** These metrics cover all failure modes:
- False positives → Precision < 1.0
- False negatives → Recall < 1.0
- File-level failures → File Discovery Rate < 1.0
- Content-level failures → Substring Coverage < 1.0
- Partial failures → Partial Match Rate > 0.0
**Conclusion:** Achieving 1.0 on all metrics guarantees the solution returns exactly the relevant snippets with complete content, and nothing else. **Q.E.D.**
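The logic test translates directly into a programmatic check. The sketch below is illustrative only: the metric names and the shape of the `report` dictionary are assumptions about how results might be structured, not the evaluator's actual output format.

```python
# Hypothetical report keys; adjust to the evaluator's real output schema.
PERFECTION_TARGETS = {
    "precision": 1.0,
    "recall": 1.0,
    "file_discovery_rate": 1.0,
    "substring_coverage": 1.0,
}


def is_perfect(report: dict[str, float], tol: float = 1e-9) -> bool:
    """Apply the logic test: every core metric at 1.0 and no partial matches."""
    core_ok = all(
        abs(report[name] - target) <= tol for name, target in PERFECTION_TARGETS.items()
    )
    return core_ok and report.get("partial_match_rate", 0.0) <= tol
```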
## Why These Metrics?
### Precision & Recall: The Foundation
Precision and Recall are the gold standard for information retrieval evaluation because they measure the two fundamental aspects of search quality:
- **Precision** answers: "Of what I returned, how much is correct?"
- **Recall** answers: "Of what should be returned, how much did I find?"
Together, they form a complete picture: Precision = 1.0 means no noise, Recall = 1.0 means no missed results.
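As a hypothetical example: if a query has 12 relevant snippets and the system returns 10 results of which 8 are relevant, then Precision = 8/10 = 0.80 (two noise results) and Recall = 8/12 ≈ 0.67 (four relevant snippets missed), giving F1 ≈ 0.73.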
### Granular Metrics: Beyond Binary Classification
Traditional Precision/Recall treat each result as binary (relevant/irrelevant). Our granular metrics provide deeper insight:
- **File Discovery Rate**: Distinguishes "file not found" from "file found but incomplete"
- **Substring Coverage**: Measures content completeness within found files
- **Partial Match Rate**: Identifies cases where snippets are too small or fragmented
This granularity is essential because semantic search failures often occur at different levels:
1. Wrong files returned (Precision < 1.0)
2. Correct files missed entirely (Recall < 1.0, File Discovery Rate < 1.0)
3. Correct files found but required content missing from the extracted snippets (Substring Coverage < 1.0)
4. Correct snippets found but fragmented or truncated (Partial Match Rate > 0.0)
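One practical way to use this granularity is to triage each evaluation case by the first level at which it fails. The function below is a hypothetical sketch of such a triage step, assuming the metrics have already been computed for a single query.

```python
def classify_failure(precision: float, recall: float, file_discovery_rate: float,
                     substring_coverage: float, partial_match_rate: float) -> str:
    """Return the first failure level implied by the metric values, or report success."""
    if precision < 1.0:
        return "wrong files returned"
    if recall < 1.0 or file_discovery_rate < 1.0:
        return "correct files missed"
    if substring_coverage < 1.0:
        return "required content missing from the extracted snippets"
    if partial_match_rate > 0.0:
        return "snippets fragmented or truncated"
    return "no failure detected"
```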
### Stability Metrics: Measuring Reproducibility
LLM-based systems are non-deterministic. Stability metrics measure whether results are consistent across runs:
- **CV < 10%**: Stable, reproducible results
- **CV ≥ 10%**: High variance; results may differ between runs
This is critical for production systems where users expect consistent behavior.
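For example, Recall values of 0.90, 0.85, and 0.95 across three hypothetical runs have a mean of 0.90 and a population standard deviation of roughly 0.04, so CV ≈ 4.5% (stable); values of 0.70, 0.90, and 1.00 give CV ≈ 14% (unstable).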
## Mathematical Rigor
The proof above demonstrates that:
1. Each metric, when equal to 1.0, rules out a specific failure mode
2. The metric set is complete (covers all failure modes)
3. Achieving 1.0 on all metrics therefore guarantees a perfect solution
This satisfies the requirement for mathematical certainty: if every metric equals 1.0, the solution is perfect.