# Phase 4: Validation
## Overview
Run the full stress test and the weight-tuning pass to validate that all pipeline
changes meet their accuracy targets. This is the acceptance gate: if any target is
missed, iterate on Phases 1-2 before proceeding.
**Depends on**: Phase 1 + Phase 2

**Entry state**: Pipeline rewritten with per-divider voting, fuzzy accuracy metric.

**Exit state**: Stress test passes all targets. `pipeline_weights.json` generated from data.
---
## Wave 4.1: Full stress test
### Task 4.1.1: Run stress test and verify targets
- **Description**: Run `tests/stress_test_real_library.py`. Verify all targets:
1. **0 MAJOR failures** in the stress test report
2. **Overall fuzzy GT accuracy >= 95%** (mean `fuzzy_accuracy_pct` across all
GT tables)
3. **Consensus never degrades by more than 5%**: for every GT table, pipeline
consensus `fuzzy_accuracy_pct` >= best single-method `fuzzy_accuracy_pct` - 5.0
4. **DEFAULT config >= FAST and MINIMAL** on mean fuzzy accuracy (variant
comparison section of the report)
If any target is missed, diagnose using the debug DB and iterate on Phases 1-2.
Common failure modes:
- Vacuous scores still appearing → Phase 1 metric not fully integrated
- Consensus degradation → Phase 2 acceptance threshold too aggressive/lenient
- Low accuracy on specific tables → investigate per-method results in debug DB
- **Files to modify**: None (execution task)
- **Tests**:
- The stress test IS the test. Acceptance verified by reading
`STRESS_TEST_REPORT.md` and `_stress_test_debug.db`.
- **Acceptance criteria**:
- 0 MAJOR failures
- Mean fuzzy GT accuracy >= 95%
- No table where consensus is > 5% worse than best single method
- DEFAULT config mean accuracy >= FAST and MINIMAL config mean accuracy
- Report shows no vacuous 100% scores
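The per-table targets above can also be checked programmatically against the debug DB. The sketch below is a minimal, hypothetical acceptance check: it assumes `_stress_test_debug.db` is SQLite with a `method_results` table holding one row per (table, method) with a `fuzzy_accuracy_pct` column, and that the pipeline consensus result is stored under the method name `consensus`. Those schema details (the `table_id` and `method` columns, and the `consensus` row) are assumptions for illustration, not confirmed by this plan.

```python
"""Hypothetical acceptance check over the stress-test debug DB.

Assumed schema (not confirmed by the plan): a SQLite `method_results`
table with columns (table_id, method, fuzzy_accuracy_pct), where the
pipeline consensus is stored under method name 'consensus'.
"""
import sqlite3
from collections import defaultdict


def check_targets(db_path="_stress_test_debug.db"):
    con = sqlite3.connect(db_path)
    rows = con.execute(
        "SELECT table_id, method, fuzzy_accuracy_pct FROM method_results"
    ).fetchall()
    con.close()

    # Group scores by GT table: {table_id: {method: fuzzy_accuracy_pct}}
    per_table = defaultdict(dict)
    for table_id, method, pct in rows:
        per_table[table_id][method] = pct

    failures = []

    # Target 2: mean consensus fuzzy accuracy across GT tables >= 95%.
    consensus = [m["consensus"] for m in per_table.values() if "consensus" in m]
    mean_acc = sum(consensus) / len(consensus)
    if mean_acc < 95.0:
        failures.append(f"mean fuzzy accuracy {mean_acc:.1f}% < 95%")

    # Target 3: consensus within 5 points of the best single method, per table.
    for table_id, methods in per_table.items():
        best_single = max(v for k, v in methods.items() if k != "consensus")
        if methods.get("consensus", 0.0) < best_single - 5.0:
            failures.append(f"{table_id}: consensus degrades > 5% vs best method")

    return failures
```

An empty return value means both DB-checkable targets pass; the MAJOR-failure count and variant comparison still come from reading `STRESS_TEST_REPORT.md`.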
### Task 4.1.2: Generate tuned weights and verify
- **Description**: Run `tests/tune_weights.py` against the stress test debug DB
to generate `tests/pipeline_weights.json` with confidence multipliers computed
from actual win rates. Then re-run the stress test to verify tuned weights
don't regress accuracy.
The workflow:
1. Run `tune_weights.py` — reads `_stress_test_debug.db`, computes per-method
win rates from `fuzzy_accuracy_pct` in `method_results`, outputs
`pipeline_weights.json`
2. Re-run stress test (which reads the new `pipeline_weights.json` at Pipeline init)
3. Verify all targets still met (no regression from tuned weights)
- **Files to modify**: None (execution task). `tests/pipeline_weights.json` is
generated by the script.
- **Tests**:
- The re-run of the stress test validates the tuned weights.
- **Acceptance criteria**:
- `pipeline_weights.json` contains multipliers computed from actual win rates
- Multipliers reflect the fuzzy accuracy metric (not the old vacuous metric)
- Re-running the stress test with tuned weights shows no regression
- All Phase 4.1.1 targets still met after weight tuning
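The win-rate computation in step 1 can be sketched as follows. This is an illustrative sketch only, not the actual `tests/tune_weights.py` logic: the method names are hypothetical, and the multiplier formula (`0.5 + win_rate`, where a method "wins" a table by matching the best `fuzzy_accuracy_pct` on it) is an assumption chosen to show the shape of the computation.

```python
"""Illustrative win-rate-based weight tuning (not the real tune_weights.py).

A method "wins" a GT table when it matches the best fuzzy_accuracy_pct on
that table; its confidence multiplier scales with its win rate. The
formula 0.5 + win_rate is an assumption for illustration.
"""
import json
from collections import defaultdict


def compute_multipliers(results):
    """results: iterable of (table_id, method, fuzzy_accuracy_pct) tuples."""
    per_table = defaultdict(dict)
    for table_id, method, pct in results:
        per_table[table_id][method] = pct

    wins = defaultdict(int)
    totals = defaultdict(int)
    for methods in per_table.values():
        best = max(methods.values())
        for method, pct in methods.items():
            totals[method] += 1
            if pct == best:  # ties count as wins for every tied method
                wins[method] += 1

    return {m: round(0.5 + wins[m] / totals[m], 3) for m in totals}


# Hypothetical per-method scores across three GT tables.
results = [
    ("t1", "camelot", 98.0), ("t1", "pdfplumber", 92.0),
    ("t2", "camelot", 90.0), ("t2", "pdfplumber", 95.0),
    ("t3", "camelot", 97.0), ("t3", "pdfplumber", 96.0),
]
weights = compute_multipliers(results)
print(json.dumps({"confidence_multipliers": weights}, indent=2))
```

Because the multipliers are derived from `fuzzy_accuracy_pct`, regenerating them after the Phase 1 metric change is what makes them reflect the fuzzy metric rather than the old vacuous one.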