# Phase 4: Validation
## Overview
Run the full stress test and the weight-tuning pass to validate that all pipeline
changes meet their accuracy targets. This is the acceptance gate: if any target is
missed, iterate on Phases 1-2 before proceeding.
**Depends on**: Phase 1 + Phase 2

**Entry state**: Pipeline rewritten with per-divider voting, fuzzy accuracy metric.

**Exit state**: Stress test passes all targets. `pipeline_weights.json` generated from data.
---
## Wave 4.1: Full stress test
### Task 4.1.1: Run stress test and verify targets
- **Description**: Run `tests/stress_test_real_library.py`. Verify all targets:
1. **0 MAJOR failures** in the stress test report
2. **Overall fuzzy GT accuracy >= 95%** (mean `fuzzy_accuracy_pct` across all
GT tables)
3. **Consensus never degrades by more than 5%**: for every GT table, pipeline
consensus `fuzzy_accuracy_pct` >= best single-method `fuzzy_accuracy_pct` - 5.0
4. **DEFAULT config >= FAST and MINIMAL** on mean fuzzy accuracy (variant
comparison section of the report)
If any target is missed, diagnose using the debug DB and iterate on Phases 1-2.
Common failure modes:
- Vacuous scores still appearing → Phase 1 metric not fully integrated
- Consensus degradation → Phase 2 acceptance threshold too aggressive/lenient
- Low accuracy on specific tables → investigate per-method results in debug DB
- **Files to modify**: None (execution task)
- **Tests**:
- The stress test IS the test. Acceptance verified by reading
`STRESS_TEST_REPORT.md` and `_stress_test_debug.db`.
- **Acceptance criteria**:
- 0 MAJOR failures
- Mean fuzzy GT accuracy >= 95%
- No table where consensus is > 5% worse than best single method
- DEFAULT config mean accuracy >= FAST and MINIMAL config mean accuracy
- Report shows no vacuous 100% scores
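The per-table targets above can also be checked programmatically against the debug DB. The sketch below is a minimal, hypothetical acceptance check: it assumes `_stress_test_debug.db` is SQLite with a `method_results` table holding one row per (table, method) with a `fuzzy_accuracy_pct` column, and that the pipeline consensus result is stored under the method name `consensus`. Those schema details (the `table_id` and `method` columns, and the `consensus` row) are assumptions for illustration, not confirmed by this plan.

```python
"""Hypothetical acceptance check over the stress-test debug DB.

Assumed schema (not confirmed by the plan): a SQLite `method_results`
table with columns (table_id, method, fuzzy_accuracy_pct), where the
pipeline consensus is stored under method name 'consensus'.
"""
import sqlite3
from collections import defaultdict


def check_targets(db_path="_stress_test_debug.db"):
    con = sqlite3.connect(db_path)
    rows = con.execute(
        "SELECT table_id, method, fuzzy_accuracy_pct FROM method_results"
    ).fetchall()
    con.close()

    # Group scores by GT table: {table_id: {method: fuzzy_accuracy_pct}}
    per_table = defaultdict(dict)
    for table_id, method, pct in rows:
        per_table[table_id][method] = pct

    failures = []

    # Target 2: mean consensus fuzzy accuracy across GT tables >= 95%.
    consensus = [m["consensus"] for m in per_table.values() if "consensus" in m]
    mean_acc = sum(consensus) / len(consensus)
    if mean_acc < 95.0:
        failures.append(f"mean fuzzy accuracy {mean_acc:.1f}% < 95%")

    # Target 3: consensus within 5 points of the best single method, per table.
    for table_id, methods in per_table.items():
        best_single = max(v for k, v in methods.items() if k != "consensus")
        if methods.get("consensus", 0.0) < best_single - 5.0:
            failures.append(f"{table_id}: consensus degrades > 5% vs best method")

    return failures
```

An empty return value means both DB-checkable targets pass; the MAJOR-failure count and variant comparison still come from reading `STRESS_TEST_REPORT.md`.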
### Task 4.1.2: Generate tuned weights and verify
- **Description**: Run `tests/tune_weights.py` against the stress test debug DB
to generate `tests/pipeline_weights.json` with confidence multipliers computed
from actual win rates. Then re-run the stress test to verify tuned weights
don't regress accuracy.
The workflow:
1. Run `tune_weights.py` — reads `_stress_test_debug.db`, computes per-method
win rates from `fuzzy_accuracy_pct` in `method_results`, outputs
`pipeline_weights.json`
2. Re-run stress test (which reads the new `pipeline_weights.json` at Pipeline init)
3. Verify all targets still met (no regression from tuned weights)
- **Files to modify**: None (execution task). `tests/pipeline_weights.json` is
generated by the script.
- **Tests**:
- The re-run of the stress test validates the tuned weights.
- **Acceptance criteria**:
- `pipeline_weights.json` contains multipliers computed from actual win rates
- Multipliers reflect the fuzzy accuracy metric (not the old vacuous metric)
- Re-running the stress test with tuned weights shows no regression
- All Phase 4.1.1 targets still met after weight tuning
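The win-rate computation in step 1 can be sketched as follows. This is an illustrative sketch only, not the actual `tests/tune_weights.py` logic: the method names are hypothetical, and the multiplier formula (`0.5 + win_rate`, where a method "wins" a table by matching the best `fuzzy_accuracy_pct` on it) is an assumption chosen to show the shape of the computation.

```python
"""Illustrative win-rate-based weight tuning (not the real tune_weights.py).

A method "wins" a GT table when it matches the best fuzzy_accuracy_pct on
that table; its confidence multiplier scales with its win rate. The
formula 0.5 + win_rate is an assumption for illustration.
"""
import json
from collections import defaultdict


def compute_multipliers(results):
    """results: iterable of (table_id, method, fuzzy_accuracy_pct) tuples."""
    per_table = defaultdict(dict)
    for table_id, method, pct in results:
        per_table[table_id][method] = pct

    wins = defaultdict(int)
    totals = defaultdict(int)
    for methods in per_table.values():
        best = max(methods.values())
        for method, pct in methods.items():
            totals[method] += 1
            if pct == best:  # ties count as wins for every tied method
                wins[method] += 1

    return {m: round(0.5 + wins[m] / totals[m], 3) for m in totals}


# Hypothetical per-method scores across three GT tables.
results = [
    ("t1", "camelot", 98.0), ("t1", "pdfplumber", 92.0),
    ("t2", "camelot", 90.0), ("t2", "pdfplumber", 95.0),
    ("t3", "camelot", 97.0), ("t3", "pdfplumber", 96.0),
]
weights = compute_multipliers(results)
print(json.dumps({"confidence_multipliers": weights}, indent=2))
```

Because the multipliers are derived from `fuzzy_accuracy_pct`, regenerating them after the Phase 1 metric change is what makes them reflect the fuzzy metric rather than the old vacuous one.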