---
phase: 02-domain-models-ingestion
plan: 03
type: tdd
wave: 2
depends_on: ["02-01"]
files_modified:
- src/skill_retriever/nodes/ingestion/resolver.py
- tests/test_resolver.py
autonomous: true
must_haves:
  truths:
    - "Duplicate components with similar names and descriptions are merged into a single entity"
    - "Entity resolution only compares within the same component_type (blocking strategy)"
    - "Fuzzy string threshold of 80 catches name variants like 'code-reviewer' vs 'code_reviewer'"
    - "Embedding similarity threshold of 0.85 prevents false merges of similarly-named but different components"
    - "Merged entity retains the richest metadata from all duplicates"
  artifacts:
    - path: "src/skill_retriever/nodes/ingestion/resolver.py"
      provides: "EntityResolver with two-phase dedup pipeline"
      exports: ["EntityResolver"]
    - path: "tests/test_resolver.py"
      provides: "Entity resolution tests covering merge, blocking, and threshold behavior"
  key_links:
    - from: "src/skill_retriever/nodes/ingestion/resolver.py"
      to: "src/skill_retriever/entities/components.py"
      via: "Operates on list[ComponentMetadata], returns deduplicated list[ComponentMetadata]"
      pattern: "list\\[ComponentMetadata\\]"
    - from: "src/skill_retriever/nodes/ingestion/resolver.py"
      to: "rapidfuzz"
      via: "Phase 1 fuzzy matching with token_sort_ratio"
      pattern: "fuzz\\.token_sort_ratio"
    - from: "src/skill_retriever/nodes/ingestion/resolver.py"
      to: "fastembed"
      via: "Phase 2 embedding similarity confirmation"
      pattern: "TextEmbedding"
---
<objective>
Build the two-phase entity resolution pipeline that deduplicates components after extraction. Phase 1 uses RapidFuzz fuzzy string matching to find candidate pairs. Phase 2 confirms matches using FastEmbed embedding cosine similarity. This prevents duplicate graph nodes when the same component appears in multiple repos or under variant names.
Purpose: Without dedup, the knowledge graph accumulates redundant nodes that pollute retrieval results and inflate the cost of PPR (Personalized PageRank) computation. Entity resolution is the quality gate between raw extraction and graph storage.
Output: `resolver.py` with EntityResolver class, comprehensive TDD tests.
</objective>
<execution_context>
@C:\Users\33641\.claude/get-shit-done/workflows/execute-plan.md
@C:\Users\33641\.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/02-domain-models-ingestion/02-RESEARCH.md
@.planning/phases/02-domain-models-ingestion/02-01-SUMMARY.md
@src/skill_retriever/entities/components.py
@src/skill_retriever/config.py
</context>
<feature>
<name>Two-Phase Entity Resolution Pipeline</name>
<files>
src/skill_retriever/nodes/ingestion/resolver.py
tests/test_resolver.py
</files>
<behavior>
EntityResolver.resolve(entities: list[ComponentMetadata]) -> list[ComponentMetadata]
Given a list of extracted components, returns a deduplicated list where:
**Phase 1 — Fuzzy String Matching (RapidFuzz):**
- Group entities by component_type (blocking strategy that avoids O(n^2) comparisons across the full set)
- Within each group, compare names using `fuzz.token_sort_ratio`
- Pairs scoring above `fuzzy_threshold` (default 80.0) become candidate duplicates
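
A minimal sketch of Phase 1 under stated assumptions: `Component` is a hypothetical stand-in for `ComponentMetadata` with only the two fields used here, and `find_fuzzy_candidates` is an illustrative free function rather than the final private method. It tracks each entity's index in the input list so the returned pairs stay valid across blocks. The `difflib` fallback exists only so the sketch runs without rapidfuzz installed; its scores are close to, but not identical with, `fuzz.token_sort_ratio`.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass

try:
    from rapidfuzz import fuzz

    def name_score(a: str, b: str) -> float:
        return fuzz.token_sort_ratio(a, b)

except ImportError:
    # Assumption: fallback for environments without rapidfuzz (scores differ slightly).
    from difflib import SequenceMatcher

    def name_score(a: str, b: str) -> float:
        a, b = (" ".join(sorted(s.split())) for s in (a, b))
        return 100.0 * SequenceMatcher(None, a, b).ratio()


@dataclass(frozen=True)
class Component:  # hypothetical stand-in for ComponentMetadata
    name: str
    component_type: str


def find_fuzzy_candidates(entities, threshold=80.0):
    """Return (idx_i, idx_j, score) for same-type pairs scoring above threshold."""
    # Block by component_type, remembering each entity's index in the input list.
    blocks: dict[str, list[int]] = defaultdict(list)
    for idx, entity in enumerate(entities):
        blocks[entity.component_type].append(idx)

    candidates = []
    for indices in blocks.values():
        # Compare every unordered pair inside the block exactly once.
        for pos, i in enumerate(indices):
            for j in indices[pos + 1:]:
                score = name_score(entities[i].name, entities[j].name)
                if score > threshold:
                    candidates.append((i, j, score))
    return candidates
```

With two agents named `code-reviewer` and `code_reviewer` plus a skill named `code-reviewer`, only the agent pair becomes a candidate, because the skill sits in a different block.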
**Phase 2 — Embedding Similarity Confirmation (FastEmbed):**
- For each candidate pair, embed "{name} {description}" for both entities and compute the cosine similarity of the two vectors
- Pairs above `embedding_threshold` (default 0.85) are confirmed duplicates
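
The confirmation step can be sketched as follows. This is an illustration under assumptions, not the resolver itself: `embed` is a hypothetical callable that maps one string to one vector (standing in for the FastEmbed model), and the cosine is written with the standard library so the sketch runs without loading a model, whereas the plan computes it with numpy.

```python
from __future__ import annotations

import math
from collections.abc import Callable, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def confirm_candidates(
    texts: Sequence[str],
    candidates: Sequence[tuple[int, int, float]],
    embed: Callable[[str], Sequence[float]],  # hypothetical embedding function
    threshold: float = 0.85,
) -> list[tuple[int, int]]:
    """Keep only candidate pairs whose embeddings are more similar than threshold."""
    cache: dict[int, Sequence[float]] = {}

    def vector(idx: int) -> Sequence[float]:
        # Embed each entity at most once, even if it appears in several pairs.
        if idx not in cache:
            cache[idx] = embed(texts[idx])
        return cache[idx]

    return [
        (i, j)
        for i, j, _score in candidates
        if cosine_similarity(vector(i), vector(j)) > threshold
    ]
```

Here `texts[i]` would be the "{name} {description}" string for entity `i`. In the real resolver, `embed` could wrap the `TextEmbedding` model, and embedding all needed texts in one batch is likely cheaper than one call per entity.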
**Merge Strategy:**
- For confirmed duplicate sets, keep the entity with the richest metadata (longest description, most tags, most tools)
- Merge tags from all duplicates (union)
- Merge tools from all duplicates (union)
- Keep the most recent last_updated timestamp
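
A hedged sketch of the merge rules above. `Component` is again a hypothetical stand-in for `ComponentMetadata`; the field types (tuples for `tags` and `tools`, an optional `datetime` for `last_updated`) are assumptions, and `dataclasses.replace` stands in for the `model_copy(update={...})` call the plan uses on the real model.

```python
from __future__ import annotations

from dataclasses import dataclass, replace
from datetime import datetime


@dataclass(frozen=True)
class Component:  # hypothetical stand-in for ComponentMetadata
    name: str
    description: str = ""
    tags: tuple[str, ...] = ()
    tools: tuple[str, ...] = ()
    last_updated: datetime | None = None


def richness(entity: Component) -> int:
    """Score used to pick the base entity of a duplicate group."""
    return len(entity.description) + len(entity.tags) + len(entity.tools)


def merge_group(group: list[Component]) -> Component:
    """Merge a non-empty group of confirmed duplicates into one entity."""
    base = max(group, key=richness)
    # Order-preserving union, so the merged output is deterministic.
    tags = tuple(dict.fromkeys(t for e in group for t in e.tags))
    tools = tuple(dict.fromkeys(t for e in group for t in e.tools))
    stamps = [e.last_updated for e in group if e.last_updated is not None]
    return replace(
        base,
        tags=tags,
        tools=tools,
        last_updated=max(stamps) if stamps else None,
    )
```

The base entity contributes its name and description; tags, tools, and the timestamp are drawn from the whole group, so a sparse duplicate can still add a tag the richer one lacks.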
Test cases:
- 2 agents with same name, same type -> merged (1 result)
- 2 agents with similar names ("code-reviewer" vs "code_reviewer"), same type -> merged
- 2 components with same name but different types (agent vs skill) -> NOT merged (blocking)
- 2 agents with similar names but very different descriptions -> NOT merged (embedding rejects)
- Empty input -> empty output
- No duplicates -> same list returned (count preserved)
- Merged entity has union of tags from both sources
</behavior>
<implementation>
Create `EntityResolver` class:
- Constructor: `__init__(self, embedding_model: TextEmbedding | None = None, fuzzy_threshold: float = 80.0, embedding_threshold: float = 0.85)`
- If `embedding_model` is None, skip Phase 2 (fuzzy-only mode for testing without model load)
- Private method `_find_fuzzy_candidates(entities: list[ComponentMetadata]) -> list[tuple[int, int, float]]` — returns (idx_i, idx_j, score) triples
- Private method `_confirm_with_embeddings(entities: list[ComponentMetadata], candidates: list[tuple[int, int, float]]) -> list[tuple[int, int]]` — returns confirmed pairs
- Private method `_merge_group(entities: list[ComponentMetadata]) -> ComponentMetadata` — merges a group of duplicates into one
- Public method `resolve(entities: list[ComponentMetadata]) -> list[ComponentMetadata]`
Use `from rapidfuzz import fuzz` for fuzzy matching. Import `process` as well only if it is actually used, since ruff flags unused imports.
Use `numpy` for cosine similarity (already a transitive dependency of fastembed).
Use `from collections import defaultdict` for grouping by component_type.
Blocking strategy: build `groups = defaultdict(list)`, then append each entity to `groups[e.component_type]` in a loop over `entities`. Run fuzzy+embedding within each group independently.
For merge: use `model_copy(update={...})` to create the merged entity. The "richest" entity is the one with the highest `len(description) + len(tags) + len(tools)` score.
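
One step the method list leaves implicit is turning confirmed pairs into the groups that `_merge_group` receives. A possible approach, offered as an assumption rather than a decided design, is to treat pairs as edges and take connected components with a small union-find. `group_duplicates` is a hypothetical helper name.

```python
from __future__ import annotations


def group_duplicates(n: int, pairs: list[tuple[int, int]]) -> list[list[int]]:
    """Partition indices 0..n-1 into groups connected by confirmed pairs."""
    parent = list(range(n))

    def find(x: int) -> int:
        # Walk to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in pairs:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            # Attach the larger root under the smaller one for stable ordering.
            parent[max(root_i, root_j)] = min(root_i, root_j)

    groups: dict[int, list[int]] = {}
    for idx in range(n):
        groups.setdefault(find(idx), []).append(idx)
    return list(groups.values())
```

Entities that match nothing come back as singleton groups, so `resolve` can merge every group uniformly and still preserve the count when there are no duplicates. Note that this grouping is transitive: if A matches B and B matches C, all three merge even when A and C were never confirmed directly, which may or may not be the desired behavior.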
</implementation>
</feature>
<verification>
```bash
uv run pytest tests/test_resolver.py -v
uv run pyright src/skill_retriever/nodes/ingestion/resolver.py
uv run ruff check src/skill_retriever/nodes/ingestion/resolver.py
```
</verification>
<success_criteria>
- All TDD test cases pass (RED -> GREEN -> REFACTOR cycle)
- Fuzzy-only mode works when embedding_model is None (for fast tests)
- Blocking by component_type prevents cross-type false merges
- Merge preserves richest metadata and unions tags/tools
- Pyright strict + ruff clean
</success_criteria>
<output>
After completion, create `.planning/phases/02-domain-models-ingestion/02-03-SUMMARY.md`
</output>