Skill Retriever

02-03-PLAN.md•6.36 KiB

---
phase: 02-domain-models-ingestion
plan: 03
type: tdd
wave: 2
depends_on: ["02-01"]
files_modified:
  - src/skill_retriever/nodes/ingestion/resolver.py
  - tests/test_resolver.py
autonomous: true

must_haves:
  truths:
    - "Duplicate components with similar names and descriptions are merged into a single entity"
    - "Entity resolution only compares within the same component_type (blocking strategy)"
    - "Fuzzy string threshold of 80 catches name variants like 'code-reviewer' vs 'code_reviewer'"
    - "Embedding similarity threshold of 0.85 prevents false merges of similarly-named but different components"
    - "Merged entity retains the richest metadata from all duplicates"
  artifacts:
    - path: "src/skill_retriever/nodes/ingestion/resolver.py"
      provides: "EntityResolver with two-phase dedup pipeline"
      exports: ["EntityResolver"]
    - path: "tests/test_resolver.py"
      provides: "Entity resolution tests covering merge, blocking, and threshold behavior"
  key_links:
    - from: "src/skill_retriever/nodes/ingestion/resolver.py"
      to: "src/skill_retriever/entities/components.py"
      via: "Operates on list[ComponentMetadata], returns deduplicated list[ComponentMetadata]"
      pattern: "list\\[ComponentMetadata\\]"
    - from: "src/skill_retriever/nodes/ingestion/resolver.py"
      to: "rapidfuzz"
      via: "Phase 1 fuzzy matching with token_sort_ratio"
      pattern: "fuzz\\.token_sort_ratio"
    - from: "src/skill_retriever/nodes/ingestion/resolver.py"
      to: "fastembed"
      via: "Phase 2 embedding similarity confirmation"
      pattern: "TextEmbedding"
---

<objective>
Build the two-phase entity resolution pipeline that deduplicates components after extraction. Phase 1 uses RapidFuzz fuzzy string matching to find candidate pairs. Phase 2 confirms matches using FastEmbed embedding cosine similarity. This prevents duplicate graph nodes when the same component appears in multiple repos or under variant names.

Purpose: Without dedup, the knowledge graph accumulates redundant nodes that pollute retrieval results and inflate PPR computation.
Entity resolution is the quality gate between raw extraction and graph storage.

Output: `resolver.py` with EntityResolver class, comprehensive TDD tests.
</objective>

<execution_context>
@C:\Users\33641\.claude/get-shit-done/workflows/execute-plan.md
@C:\Users\33641\.claude/get-shit-done/templates/summary.md
</execution_context>

<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/02-domain-models-ingestion/02-RESEARCH.md
@.planning/phases/02-domain-models-ingestion/02-01-SUMMARY.md
@src/skill_retriever/entities/components.py
@src/skill_retriever/config.py
</context>

<feature>
<name>Two-Phase Entity Resolution Pipeline</name>
<files>
src/skill_retriever/nodes/ingestion/resolver.py
tests/test_resolver.py
</files>
<behavior>
EntityResolver.resolve(entities: list[ComponentMetadata]) -> list[ComponentMetadata]

Given a list of extracted components, returns a deduplicated list where:

**Phase 1 - Fuzzy String Matching (RapidFuzz):**
- Group entities by component_type (blocking strategy to avoid O(n^2) on full set)
- Within each group, compare names using `fuzz.token_sort_ratio`
- Pairs scoring above `fuzzy_threshold` (default 80.0) become candidate duplicates

**Phase 2 - Embedding Similarity Confirmation (FastEmbed):**
- For each candidate pair, compute cosine similarity of embeddings of "{name} {description}"
- Pairs above `embedding_threshold` (default 0.85) are confirmed duplicates

**Merge Strategy:**
- For confirmed duplicate sets, keep the entity with the richest metadata (longest description, most tags, most tools)
- Merge tags from all duplicates (union)
- Merge tools from all duplicates (union)
- Keep the most recent last_updated timestamp

Test cases:
- 2 agents with same name, same type -> merged (1 result)
- 2 agents with similar names ("code-reviewer" vs "code_reviewer"), same type -> merged
- 2 components with same name but different types (agent vs skill) -> NOT merged (blocking)
- 2 agents with similar names but very
different descriptions -> NOT merged (embedding rejects)
- Empty input -> empty output
- No duplicates -> same list returned (count preserved)
- Merged entity has union of tags from both sources
</behavior>

<implementation>
Create `EntityResolver` class:
- Constructor: `__init__(self, embedding_model: TextEmbedding | None = None, fuzzy_threshold: float = 80.0, embedding_threshold: float = 0.85)`
- If `embedding_model` is None, skip Phase 2 (fuzzy-only mode for testing without model load)
- Private method `_find_fuzzy_candidates(entities: list[ComponentMetadata]) -> list[tuple[int, int, float]]`: returns (idx_i, idx_j, score) triples
- Private method `_confirm_with_embeddings(entities: list[ComponentMetadata], candidates: list[tuple[int, int, float]]) -> list[tuple[int, int]]`: returns confirmed pairs
- Private method `_merge_group(entities: list[ComponentMetadata]) -> ComponentMetadata`: merges a group of duplicates into one
- Public method `resolve(entities: list[ComponentMetadata]) -> list[ComponentMetadata]`

Use `from rapidfuzz import fuzz, process` for fuzzy matching.
Use `numpy` for cosine similarity (already transitive dep of fastembed).
Use `from collections import defaultdict` for grouping by component_type.

Blocking strategy: `groups = defaultdict(list); for e in entities: groups[e.component_type].append(e)`. Run fuzzy+embedding within each group independently.

For merge: use `model_copy(update={...})` to create the merged entity. The "richest" entity is determined by `len(description) + len(tags) + len(tools)` score.
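
A minimal sketch of the blocking strategy and Phase 1 candidate search, under stated assumptions: `Component` is a hypothetical stand-in for `ComponentMetadata` carrying only the two fields this phase reads, and `find_fuzzy_candidates` and `name_score` are illustrative names, not the planned private methods. The `difflib` branch exists only so the sketch runs where `rapidfuzz` is not installed; its scores can differ from `fuzz.token_sort_ratio`.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import combinations

try:
    from rapidfuzz import fuzz

    def name_score(a: str, b: str) -> float:
        """Score two names on a 0-100 scale with RapidFuzz."""
        return fuzz.token_sort_ratio(a, b)

except ImportError:
    # Assumption: stdlib stand-in for environments without rapidfuzz.
    from difflib import SequenceMatcher

    def _sorted_tokens(s: str) -> str:
        return " ".join(sorted(s.split()))

    def name_score(a: str, b: str) -> float:
        """Approximate a token-sorted ratio on a 0-100 scale."""
        matcher = SequenceMatcher(None, _sorted_tokens(a), _sorted_tokens(b))
        return 100.0 * matcher.ratio()


@dataclass
class Component:
    """Hypothetical stand-in for ComponentMetadata."""

    name: str
    component_type: str


def find_fuzzy_candidates(entities, fuzzy_threshold=80.0):
    """Return (idx_i, idx_j, score) triples for same-type pairs above the threshold."""
    # Blocking: bucket indices by component_type so pairs never cross types.
    groups = defaultdict(list)
    for idx, entity in enumerate(entities):
        groups[entity.component_type].append(idx)

    candidates = []
    for indices in groups.values():
        # Pairwise comparison stays quadratic, but only within each bucket.
        for i, j in combinations(indices, 2):
            score = name_score(entities[i].name, entities[j].name)
            if score > fuzzy_threshold:
                candidates.append((i, j, score))
    return candidates
```

Returning indices into the original list, rather than the entities themselves, keeps the triples valid for the later embedding and merge steps even though comparison happens per bucket.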
</implementation>
</feature>

<verification>
```bash
uv run pytest tests/test_resolver.py -v
uv run pyright src/skill_retriever/nodes/ingestion/resolver.py
uv run ruff check src/skill_retriever/nodes/ingestion/resolver.py
```
</verification>

<success_criteria>
- All TDD test cases pass (RED -> GREEN -> REFACTOR cycle)
- Fuzzy-only mode works when embedding_model is None (for fast tests)
- Blocking by component_type prevents cross-type false merges
- Merge preserves richest metadata and unions tags/tools
- Pyright strict + ruff clean
</success_criteria>

<output>
After completion, create `.planning/phases/02-domain-models-ingestion/02-03-SUMMARY.md`
</output>
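
The plan specifies that confirmed pairs are merged as duplicate sets but does not say how pairs become sets. One possible approach, sketched here as an assumption rather than the planned implementation, is a small union-find over entity indices followed by the richness-based merge. `Component` is a hypothetical frozen dataclass standing in for `ComponentMetadata` (its field types are guesses), and `dataclasses.replace` stands in for Pydantic's `model_copy(update={...})`.

```python
from dataclasses import dataclass, replace
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Component:
    """Hypothetical stand-in for ComponentMetadata; field types are assumed."""

    name: str
    description: str = ""
    tags: tuple = ()
    tools: tuple = ()
    last_updated: Optional[datetime] = None


def group_pairs(n, pairs):
    """Union-find: turn confirmed (i, j) pairs into disjoint groups of indices."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for i, j in pairs:
        parent[find(i)] = find(j)

    groups = {}
    for idx in range(n):
        groups.setdefault(find(idx), []).append(idx)
    return list(groups.values())


def richness(c):
    """Score from the plan: len(description) + len(tags) + len(tools)."""
    return len(c.description) + len(c.tags) + len(c.tools)


def merge_group(group):
    """Keep the richest entity, union tags and tools, keep the newest timestamp."""
    best = max(group, key=richness)
    # Sorting the unions is a choice made here for deterministic output.
    tags = tuple(sorted({t for c in group for t in c.tags}))
    tools = tuple(sorted({t for c in group for t in c.tools}))
    stamps = [c.last_updated for c in group if c.last_updated is not None]
    newest = max(stamps) if stamps else None
    return replace(best, tags=tags, tools=tools, last_updated=newest)
```

Grouping transitively means that if A matches B and B matches C, all three merge into one entity even when A and C were never confirmed directly; whether that is desirable is a design decision the plan leaves open.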
