---
phase: 07-integration-validation
plan: 01
type: execute
wave: 1
depends_on: []
files_modified:
- pyproject.toml
- tests/validation/__init__.py
- tests/validation/conftest.py
- tests/validation/fixtures/validation_pairs.json
- tests/validation/test_mrr_evaluation.py
- tests/validation/test_baselines.py
autonomous: true
must_haves:
truths:
- "30+ validation pairs exist covering 7 categories"
- "MRR can be calculated using ranx against validation pairs"
- "Hybrid retrieval MRR exceeds vector-only MRR"
- "Hybrid retrieval MRR exceeds graph-only MRR"
artifacts:
- path: "tests/validation/fixtures/validation_pairs.json"
provides: "Query-to-expected-component mapping"
contains: "query_id"
- path: "tests/validation/test_mrr_evaluation.py"
provides: "MRR calculation tests"
contains: "ranx"
- path: "tests/validation/test_baselines.py"
provides: "Baseline comparison tests"
contains: "hybrid_outperforms"
key_links:
- from: "tests/validation/conftest.py"
to: "tests/validation/fixtures/validation_pairs.json"
via: "fixture load"
pattern: "validation_pairs\\.json"
- from: "tests/validation/test_mrr_evaluation.py"
to: "ranx"
via: "evaluate function"
pattern: "from ranx import"
---
<objective>
Create a validation harness with 30+ query-to-component pairs and prove that hybrid retrieval outperforms single-mode baselines.
Purpose: Establish a ground-truth evaluation framework and validate that the hybrid (vector + graph) approach ranks relevant components higher than single-mode retrieval.
Output: Validation fixtures, MRR evaluation tests, and baseline comparison tests that demonstrate hybrid superiority.
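For reference, MRR is the mean over queries of the reciprocal rank of the first relevant result. The following is a minimal hand-rolled version, intended only as a sanity cross-check (ranx remains the source of truth), over the same qrels/run dict shapes used in the tasks below:
```python
def mean_reciprocal_rank(
    qrels: dict[str, dict[str, int]],
    run: dict[str, dict[str, float]],
) -> float:
    """Reference MRR: average over queries of 1/rank of the first relevant hit."""
    reciprocal_ranks = []
    for query_id, relevant in qrels.items():
        scores = run.get(query_id, {})
        ranked = sorted(scores, key=scores.get, reverse=True)  # best score first
        rank = next((i + 1 for i, doc_id in enumerate(ranked) if doc_id in relevant), None)
        reciprocal_ranks.append(1.0 / rank if rank else 0.0)  # no relevant hit -> 0.0
    return sum(reciprocal_ranks) / len(reciprocal_ranks) if reciprocal_ranks else 0.0
```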
</objective>
<execution_context>
@C:\Users\33641\.claude/get-shit-done/workflows/execute-plan.md
@C:\Users\33641\.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/07-integration-validation/07-RESEARCH.md
@src/skill_retriever/workflows/pipeline.py
@src/skill_retriever/nodes/retrieval/vector_search.py
@src/skill_retriever/nodes/retrieval/ppr_engine.py
</context>
<tasks>
<task type="auto">
<name>Task 1: Add ranx dependency and create validation directory structure</name>
<files>
pyproject.toml
tests/validation/__init__.py
tests/validation/conftest.py
tests/validation/fixtures/validation_pairs.json
</files>
<action>
1. Add ranx to dev dependencies in pyproject.toml:
```toml
"ranx>=0.3.20",
```
Run `uv sync` to install.
2. Create tests/validation/ directory with __init__.py (empty).
3. Create tests/validation/fixtures/ directory.
4. Create validation_pairs.json with a top-level `"pairs"` array holding 30+ pairs (the verify step below reads `d["pairs"]`), covering these categories:
- authentication (5 pairs): JWT, OAuth, login, refresh, session
- development (5 pairs): Git, GitHub, debugging, testing, CI
- content (4 pairs): LinkedIn, Medium, writing, posts
- analysis (4 pairs): research, Z1, insights, data
- infrastructure (4 pairs): MCP, settings, hooks, sandbox
- multi-component (5 pairs): queries expecting 2-3 related components
- negative (3 pairs): queries that should return specific types only
Structure each pair as:
```json
{
"query_id": "auth_01",
"query": "I need to authenticate users with JWT tokens",
"expected": {"skill-jwt": 1, "agent-auth": 1},
"category": "authentication"
}
```
Use binary relevance judgments (1 for relevant, key absent for not relevant), since MRR only requires binary relevance.
5. Create conftest.py with fixtures:
- `validation_pairs`: Load and parse validation_pairs.json
- `validation_qrels`: Convert pairs to ranx Qrels format
- `seeded_pipeline`: Pipeline with deterministic test data
The seeded_pipeline fixture must:
- Create graph store with sample nodes matching validation pair expected IDs
- Create vector store with deterministic embeddings (use np.random.default_rng(42))
- Return configured RetrievalPipeline instance
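A minimal sketch of conftest.py for these fixtures, assuming pytest and the ranx Qrels dict constructor; the graph/vector store and pipeline constructors inside seeded_pipeline are placeholders that must be swapped for the real classes in src/skill_retriever:
```python
# tests/validation/conftest.py -- sketch; store/pipeline wiring is a placeholder
import json
from pathlib import Path

import numpy as np
import pytest
from ranx import Qrels

FIXTURES = Path(__file__).parent / "fixtures"


@pytest.fixture(scope="session")
def validation_pairs() -> list[dict]:
    data = json.loads((FIXTURES / "validation_pairs.json").read_text())
    return data["pairs"]


@pytest.fixture(scope="session")
def validation_qrels(validation_pairs) -> Qrels:
    # Binary relevance: each expected component id maps to judgment 1.
    return Qrels({pair["query_id"]: pair["expected"] for pair in validation_pairs})


@pytest.fixture(scope="session")
def seeded_pipeline(validation_pairs):
    rng = np.random.default_rng(42)  # deterministic embeddings for reproducible MRR
    # Placeholder wiring -- replace with the real store/pipeline constructors:
    #   graph_store = build_graph_store(nodes_for(validation_pairs))
    #   vector_store = build_vector_store(embeddings=rng.normal(size=(num_nodes, dim)))
    #   return RetrievalPipeline(vector_store=vector_store, graph_store=graph_store)
    raise NotImplementedError("wire graph/vector stores from src/skill_retriever here")
```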
</action>
<verify>
Run: `uv sync && python -c "from ranx import Qrels, Run, evaluate; print('ranx OK')"`
Run: `python -c "import json; d=json.load(open('tests/validation/fixtures/validation_pairs.json')); print(f'{len(d[\"pairs\"])} pairs loaded')"`
Confirm 30+ pairs loaded.
</verify>
<done>
- ranx installed as dev dependency
- validation_pairs.json contains 30+ pairs across 7 categories
- conftest.py loads pairs and converts to ranx format
</done>
</task>
<task type="auto">
<name>Task 2: Create MRR evaluation tests</name>
<files>
tests/validation/test_mrr_evaluation.py
</files>
<action>
Create test_mrr_evaluation.py with tests that:
1. `test_mrr_above_threshold`: Run all validation queries through the seeded pipeline, compute MRR using ranx, and assert MRR >= 0.7.
```python
from ranx import Qrels, Run, evaluate
def test_mrr_above_threshold(seeded_pipeline, validation_pairs, validation_qrels):
run_dict = {}
for pair in validation_pairs:
result = seeded_pipeline.retrieve(pair["query"], top_k=10)
run_dict[pair["query_id"]] = {
c.component_id: c.score
for c in result.context.components
}
run = Run(run_dict)
mrr = evaluate(validation_qrels, run, "mrr")
assert mrr >= 0.7, f"MRR {mrr:.3f} below 0.7 threshold"
```
2. `test_mrr_per_category`: Compute MRR for each category separately and assert each is >= 0.5 (a lower per-category threshold, since each category has a smaller sample); a sketch appears after the edge-case notes below.
3. `test_no_empty_results`: Verify every validation query returns at least 1 result (no zero-result queries).
4. `test_relevant_in_top_k`: For each pair, verify at least one expected component appears in top-10 results.
Use the ranx evaluate() function with the "mrr" metric string. Handle edge cases:
- Empty result sets should contribute 0.0 to MRR
- Missing expected IDs in results are handled by ranx automatically
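A sketch of the per-category test, reusing the retrieval call shape from step 1 (seeded_pipeline.retrieve, result.context.components) and assuming each pair carries the "category" field defined in Task 1:
```python
from collections import defaultdict

from ranx import Qrels, Run, evaluate


def test_mrr_per_category(seeded_pipeline, validation_pairs):
    by_category: dict[str, list[dict]] = defaultdict(list)
    for pair in validation_pairs:
        by_category[pair["category"]].append(pair)

    for category, pairs in by_category.items():
        qrels = Qrels({p["query_id"]: p["expected"] for p in pairs})
        run_dict = {}
        for pair in pairs:
            result = seeded_pipeline.retrieve(pair["query"], top_k=10)
            run_dict[pair["query_id"]] = {
                c.component_id: c.score for c in result.context.components
            }
        mrr = evaluate(qrels, Run(run_dict), "mrr")
        assert mrr >= 0.5, f"{category}: MRR {mrr:.3f} below 0.5 threshold"
```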
</action>
<verify>
Run: `uv run pytest tests/validation/test_mrr_evaluation.py -v`
All tests should pass with seeded pipeline.
</verify>
<done>
- test_mrr_above_threshold passes with MRR >= 0.7
- test_mrr_per_category shows per-category breakdown
- test_no_empty_results confirms all queries return results
- test_relevant_in_top_k confirms ranking quality
</done>
</task>
<task type="auto">
<name>Task 3: Create baseline comparison tests</name>
<files>
tests/validation/test_baselines.py
</files>
<action>
Create test_baselines.py with tests proving that hybrid retrieval outperforms single-mode retrieval:
1. `test_hybrid_outperforms_vector_only`:
- Run vector-only retrieval (bypass PPR, use only vector_search results)
- Run hybrid retrieval (full pipeline)
- Compare MRR values
- Assert hybrid > vector-only
For the vector-only baseline, call search_with_type_filter directly, bypassing PPR:
```python
from skill_retriever.nodes.retrieval.vector_search import search_with_type_filter
vector_results = search_with_type_filter(query, vector_store, graph_store, top_k=10)
# Convert to run dict format
```
2. `test_hybrid_outperforms_graph_only`:
- Run graph-only retrieval (use PPR results without vector fusion)
- Run hybrid retrieval (full pipeline)
- Compare MRR values
- Assert hybrid > graph-only, relaxing to hybrid >= graph-only when the graph-only results are empty (queries with no entity matches must be handled gracefully)
For the graph-only baseline, call run_ppr_retrieval directly:
```python
from skill_retriever.nodes.retrieval.ppr_engine import run_ppr_retrieval
ppr_results = run_ppr_retrieval(query, graph_store, alpha=0.85, top_k=10)
```
3. `test_baseline_comparison_summary`: Print a summary table of MRR for all three modes for documentation.
Handle empty graph results gracefully: some queries may have no entity matches for PPR seeds, which is expected; score such cases as 0.0 MRR.
Clear pipeline cache between runs using `pipeline.clear_cache()` to prevent cache contamination between baselines.
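A sketch of the comparison scaffolding, reusing the fixture and pipeline calls quoted above; the vector-only and graph-only adapters are left as comments because their exact call signatures should be copied from the snippets in steps 1 and 2:
```python
from ranx import Run, evaluate


def _build_run(retrieve_fn, validation_pairs):
    """Build a ranx run dict from any callable returning (component_id, score) pairs."""
    return {
        pair["query_id"]: dict(retrieve_fn(pair["query"]))
        for pair in validation_pairs
    }


def test_baseline_comparison_summary(seeded_pipeline, validation_pairs, validation_qrels):
    def hybrid(query):
        seeded_pipeline.clear_cache()  # avoid cache contamination between modes
        result = seeded_pipeline.retrieve(query, top_k=10)
        return [(c.component_id, c.score) for c in result.context.components]

    # vector_only / graph_only adapters wrap search_with_type_filter and
    # run_ppr_retrieval (see steps 1 and 2) in the same (id, score) shape.
    scores = {"hybrid": evaluate(validation_qrels, Run(_build_run(hybrid, validation_pairs)), "mrr")}
    for mode, mrr in scores.items():
        print(f"{mode:12s} MRR={mrr:.3f}")
```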
</action>
<verify>
Run: `uv run pytest tests/validation/test_baselines.py -v`
Hybrid should outperform vector-only and at least match graph-only.
</verify>
<done>
- test_hybrid_outperforms_vector_only shows hybrid > vector MRR
- test_hybrid_outperforms_graph_only shows hybrid >= graph MRR
- Baseline comparison documented in test output
</done>
</task>
</tasks>
<verification>
All validation tests pass:
```bash
uv run pytest tests/validation/ -v
```
Verify MRR threshold met:
```bash
uv run pytest tests/validation/test_mrr_evaluation.py::test_mrr_above_threshold -v
```
Verify baseline comparison:
```bash
uv run pytest tests/validation/test_baselines.py -v
```
</verification>
<success_criteria>
- [ ] ranx installed and importable
- [ ] 30+ validation pairs in JSON fixture
- [ ] MRR >= 0.7 on full validation set
- [ ] Hybrid outperforms vector-only baseline
- [ ] Hybrid >= graph-only baseline (handles empty gracefully)
- [ ] All tests pass with `uv run pytest tests/validation/ -v`
</success_criteria>
<output>
After completion, create `.planning/phases/07-integration-validation/07-01-SUMMARY.md`
</output>