---
phase: 07-integration-validation
plan: 01
type: execute
wave: 1
depends_on: []
files_modified:
- pyproject.toml
- tests/validation/__init__.py
- tests/validation/conftest.py
- tests/validation/fixtures/validation_pairs.json
- tests/validation/test_mrr_evaluation.py
- tests/validation/test_baselines.py
autonomous: true
must_haves:
truths:
- "30+ validation pairs exist covering 7 categories"
- "MRR can be calculated using ranx against validation pairs"
- "Hybrid retrieval MRR exceeds vector-only MRR"
- "Hybrid retrieval MRR exceeds graph-only MRR"
artifacts:
- path: "tests/validation/fixtures/validation_pairs.json"
provides: "Query-to-expected-component mapping"
contains: "query_id"
- path: "tests/validation/test_mrr_evaluation.py"
provides: "MRR calculation tests"
contains: "ranx"
- path: "tests/validation/test_baselines.py"
provides: "Baseline comparison tests"
contains: "hybrid_outperforms"
key_links:
- from: "tests/validation/conftest.py"
to: "tests/validation/fixtures/validation_pairs.json"
via: "fixture load"
pattern: "validation_pairs\\.json"
- from: "tests/validation/test_mrr_evaluation.py"
to: "ranx"
via: "evaluate function"
pattern: "from ranx import"
---
<objective>
Create a validation harness with 30+ query-to-component pairs and prove that hybrid retrieval outperforms single-mode baselines.
Purpose: Establish a ground-truth evaluation framework and validate that the hybrid (vector + graph) approach ranks relevant components higher than single-mode retrieval.
Output: Validation fixtures, MRR evaluation tests, and baseline comparison tests that demonstrate hybrid superiority.
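For reference, MRR is the mean over queries of the reciprocal rank of the first relevant result. The following is a minimal hand-rolled version, intended only as a sanity cross-check (ranx remains the source of truth), over the same qrels/run dict shapes used in the tasks below:
```python
def mean_reciprocal_rank(
    qrels: dict[str, dict[str, int]],
    run: dict[str, dict[str, float]],
) -> float:
    """Reference MRR: average over queries of 1/rank of the first relevant hit."""
    reciprocal_ranks = []
    for query_id, relevant in qrels.items():
        scores = run.get(query_id, {})
        ranked = sorted(scores, key=scores.get, reverse=True)  # best score first
        rank = next((i + 1 for i, doc_id in enumerate(ranked) if doc_id in relevant), None)
        reciprocal_ranks.append(1.0 / rank if rank else 0.0)  # no relevant hit -> 0.0
    return sum(reciprocal_ranks) / len(reciprocal_ranks) if reciprocal_ranks else 0.0
```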
</objective>
<execution_context>
@C:\Users\33641\.claude/get-shit-done/workflows/execute-plan.md
@C:\Users\33641\.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/07-integration-validation/07-RESEARCH.md
@src/skill_retriever/workflows/pipeline.py
@src/skill_retriever/nodes/retrieval/vector_search.py
@src/skill_retriever/nodes/retrieval/ppr_engine.py
</context>
<tasks>
<task type="auto">
<name>Task 1: Add ranx dependency and create validation directory structure</name>
<files>
pyproject.toml
tests/validation/__init__.py
tests/validation/conftest.py
tests/validation/fixtures/validation_pairs.json
</files>
<action>
1. Add ranx to dev dependencies in pyproject.toml:
```toml
"ranx>=0.3.20",
```
Run `uv sync` to install.
2. Create tests/validation/ directory with __init__.py (empty).
3. Create tests/validation/fixtures/ directory.
4. Create validation_pairs.json with a top-level `"pairs"` array holding 30+ pairs (the verify step below reads `d["pairs"]`), covering these categories:
- authentication (5 pairs): JWT, OAuth, login, refresh, session
- development (5 pairs): Git, GitHub, debugging, testing, CI
- content (4 pairs): LinkedIn, Medium, writing, posts
- analysis (4 pairs): research, Z1, insights, data
- infrastructure (4 pairs): MCP, settings, hooks, sandbox
- multi-component (5 pairs): queries expecting 2-3 related components
- negative (3 pairs): queries that should return specific types only
Structure each pair as:
```json
{
"query_id": "auth_01",
"query": "I need to authenticate users with JWT tokens",
"expected": {"skill-jwt": 1, "agent-auth": 1},
"category": "authentication"
}
```
Use binary relevance judgments (1 for relevant, key absent for not relevant), since MRR only requires binary relevance.
5. Create conftest.py with fixtures:
- `validation_pairs`: Load and parse validation_pairs.json
- `validation_qrels`: Convert pairs to ranx Qrels format
- `seeded_pipeline`: Pipeline with deterministic test data
The seeded_pipeline fixture must:
- Create graph store with sample nodes matching validation pair expected IDs
- Create vector store with deterministic embeddings (use np.random.default_rng(42))
- Return configured RetrievalPipeline instance
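A minimal sketch of conftest.py for these fixtures, assuming pytest and the ranx Qrels dict constructor; the graph/vector store and pipeline constructors inside seeded_pipeline are placeholders that must be swapped for the real classes in src/skill_retriever:
```python
# tests/validation/conftest.py -- sketch; store/pipeline wiring is a placeholder
import json
from pathlib import Path

import numpy as np
import pytest
from ranx import Qrels

FIXTURES = Path(__file__).parent / "fixtures"


@pytest.fixture(scope="session")
def validation_pairs() -> list[dict]:
    data = json.loads((FIXTURES / "validation_pairs.json").read_text())
    return data["pairs"]


@pytest.fixture(scope="session")
def validation_qrels(validation_pairs) -> Qrels:
    # Binary relevance: each expected component id maps to judgment 1.
    return Qrels({pair["query_id"]: pair["expected"] for pair in validation_pairs})


@pytest.fixture(scope="session")
def seeded_pipeline(validation_pairs):
    rng = np.random.default_rng(42)  # deterministic embeddings for reproducible MRR
    # Placeholder wiring -- replace with the real store/pipeline constructors:
    #   graph_store = build_graph_store(nodes_for(validation_pairs))
    #   vector_store = build_vector_store(embeddings=rng.normal(size=(num_nodes, dim)))
    #   return RetrievalPipeline(vector_store=vector_store, graph_store=graph_store)
    raise NotImplementedError("wire graph/vector stores from src/skill_retriever here")
```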
</action>
<verify>
Run: `uv sync && python -c "from ranx import Qrels, Run, evaluate; print('ranx OK')"`
Run: `python -c "import json; d=json.load(open('tests/validation/fixtures/validation_pairs.json')); print(f'{len(d[\"pairs\"])} pairs loaded')"`
Confirm 30+ pairs loaded.
</verify>
<done>
- ranx installed as dev dependency
- validation_pairs.json contains 30+ pairs across 7 categories
- conftest.py loads pairs and converts to ranx format
</done>
</task>
<task type="auto">
<name>Task 2: Create MRR evaluation tests</name>
<files>
tests/validation/test_mrr_evaluation.py
</files>
<action>
Create test_mrr_evaluation.py with tests that:
1. `test_mrr_above_threshold`: Run all validation queries through the seeded pipeline, compute MRR using ranx, and assert MRR >= 0.7.
```python
from ranx import Qrels, Run, evaluate
def test_mrr_above_threshold(seeded_pipeline, validation_pairs, validation_qrels):
run_dict = {}
for pair in validation_pairs:
result = seeded_pipeline.retrieve(pair["query"], top_k=10)
run_dict[pair["query_id"]] = {
c.component_id: c.score
for c in result.context.components
}
run = Run(run_dict)
mrr = evaluate(validation_qrels, run, "mrr")
assert mrr >= 0.7, f"MRR {mrr:.3f} below 0.7 threshold"
```
2. `test_mrr_per_category`: Compute MRR for each category separately and assert each is >= 0.5 (a lower per-category threshold, since each category has a smaller sample); a sketch appears after the edge-case notes below.
3. `test_no_empty_results`: Verify every validation query returns at least 1 result (no zero-result queries).
4. `test_relevant_in_top_k`: For each pair, verify at least one expected component appears in top-10 results.
Use the ranx evaluate() function with the "mrr" metric string. Handle edge cases:
- Empty result sets should contribute 0.0 to MRR
- Missing expected IDs in results are handled by ranx automatically
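A sketch of the per-category test, reusing the retrieval call shape from step 1 (seeded_pipeline.retrieve, result.context.components) and assuming each pair carries the "category" field defined in Task 1:
```python
from collections import defaultdict

from ranx import Qrels, Run, evaluate


def test_mrr_per_category(seeded_pipeline, validation_pairs):
    by_category: dict[str, list[dict]] = defaultdict(list)
    for pair in validation_pairs:
        by_category[pair["category"]].append(pair)

    for category, pairs in by_category.items():
        qrels = Qrels({p["query_id"]: p["expected"] for p in pairs})
        run_dict = {}
        for pair in pairs:
            result = seeded_pipeline.retrieve(pair["query"], top_k=10)
            run_dict[pair["query_id"]] = {
                c.component_id: c.score for c in result.context.components
            }
        mrr = evaluate(qrels, Run(run_dict), "mrr")
        assert mrr >= 0.5, f"{category}: MRR {mrr:.3f} below 0.5 threshold"
```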
</action>
<verify>
Run: `uv run pytest tests/validation/test_mrr_evaluation.py -v`
All tests should pass with seeded pipeline.
</verify>
<done>
- test_mrr_above_threshold passes with MRR >= 0.7
- test_mrr_per_category shows per-category breakdown
- test_no_empty_results confirms all queries return results
- test_relevant_in_top_k confirms ranking quality
</done>
</task>
<task type="auto">
<name>Task 3: Create baseline comparison tests</name>
<files>
tests/validation/test_baselines.py
</files>
<action>
Create test_baselines.py with tests proving that hybrid retrieval outperforms single-mode retrieval:
1. `test_hybrid_outperforms_vector_only`:
- Run vector-only retrieval (bypass PPR, use only vector_search results)
- Run hybrid retrieval (full pipeline)
- Compare MRR values
- Assert hybrid > vector-only
For the vector-only baseline, call search_with_type_filter directly, bypassing PPR:
```python
from skill_retriever.nodes.retrieval.vector_search import search_with_type_filter
vector_results = search_with_type_filter(query, vector_store, graph_store, top_k=10)
# Convert to run dict format
```
2. `test_hybrid_outperforms_graph_only`:
- Run graph-only retrieval (use PPR results without vector fusion)
- Run hybrid retrieval (full pipeline)
- Compare MRR values
- Assert hybrid > graph-only, relaxing to hybrid >= graph-only when the graph-only results are empty (queries with no entity matches must be handled gracefully)
For the graph-only baseline, call run_ppr_retrieval directly:
```python
from skill_retriever.nodes.retrieval.ppr_engine import run_ppr_retrieval
ppr_results = run_ppr_retrieval(query, graph_store, alpha=0.85, top_k=10)
```
3. `test_baseline_comparison_summary`: Print a summary table of MRR for all three modes for documentation.
Handle empty graph results gracefully: some queries may have no entity matches for PPR seeds, which is expected; score such cases as 0.0 MRR.
Clear pipeline cache between runs using `pipeline.clear_cache()` to prevent cache contamination between baselines.
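A sketch of the comparison scaffolding, reusing the fixture and pipeline calls quoted above; the vector-only and graph-only adapters are left as comments because their exact call signatures should be copied from the snippets in steps 1 and 2:
```python
from ranx import Run, evaluate


def _build_run(retrieve_fn, validation_pairs):
    """Build a ranx run dict from any callable returning (component_id, score) pairs."""
    return {
        pair["query_id"]: dict(retrieve_fn(pair["query"]))
        for pair in validation_pairs
    }


def test_baseline_comparison_summary(seeded_pipeline, validation_pairs, validation_qrels):
    def hybrid(query):
        seeded_pipeline.clear_cache()  # avoid cache contamination between modes
        result = seeded_pipeline.retrieve(query, top_k=10)
        return [(c.component_id, c.score) for c in result.context.components]

    # vector_only / graph_only adapters wrap search_with_type_filter and
    # run_ppr_retrieval (see steps 1 and 2) in the same (id, score) shape.
    scores = {"hybrid": evaluate(validation_qrels, Run(_build_run(hybrid, validation_pairs)), "mrr")}
    for mode, mrr in scores.items():
        print(f"{mode:12s} MRR={mrr:.3f}")
```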
</action>
<verify>
Run: `uv run pytest tests/validation/test_baselines.py -v`
Hybrid should outperform vector-only and at least match graph-only.
</verify>
<done>
- test_hybrid_outperforms_vector_only shows hybrid > vector MRR
- test_hybrid_outperforms_graph_only shows hybrid >= graph MRR
- Baseline comparison documented in test output
</done>
</task>
</tasks>
<verification>
All validation tests pass:
```bash
uv run pytest tests/validation/ -v
```
Verify MRR threshold met:
```bash
uv run pytest tests/validation/test_mrr_evaluation.py::test_mrr_above_threshold -v
```
Verify baseline comparison:
```bash
uv run pytest tests/validation/test_baselines.py -v
```
</verification>
<success_criteria>
- [ ] ranx installed and importable
- [ ] 30+ validation pairs in JSON fixture
- [ ] MRR >= 0.7 on full validation set
- [ ] Hybrid outperforms vector-only baseline
- [ ] Hybrid >= graph-only baseline (handles empty gracefully)
- [ ] All tests pass with `uv run pytest tests/validation/ -v`
</success_criteria>
<output>
After completion, create `.planning/phases/07-integration-validation/07-01-SUMMARY.md`
</output>