Skill Retriever

skill-retriever
.planning
phases
02-domain-models-ingestion

02-03-SUMMARY.md•2.76 KiB

# Phase 2 Plan 3: Entity Resolution Pipeline Summary Two-phase EntityResolver using rapidfuzz fuzzy matching with optional fastembed cosine similarity confirmation, Union-Find for transitive grouping, and richest-metadata merge strategy. ## What Was Built ### EntityResolver (`src/skill_retriever/nodes/ingestion/resolver.py`) - **Phase 1 (Fuzzy):** Groups entities by component_type (blocking), then compares names within groups using `fuzz.token_sort_ratio`. Pairs above `fuzzy_threshold` (default 80.0) become candidates. - **Phase 2 (Embeddings):** Optional confirmation step. Computes cosine similarity on "{name} {description}" embeddings via fastembed TextEmbedding. Pairs above `embedding_threshold` (default 0.85) are confirmed. Skipped when no embedding_model is provided (fuzzy-only mode). - **Union-Find:** Transitive duplicate pairs are merged into groups (A~B and B~C yields {A,B,C}). - **Merge Strategy:** Keeps entity with richest metadata (longest description, most tags, most tools). Unions tags and tools from all group members. Keeps most recent `last_updated`. ### Tests (`tests/test_resolver.py`) 8 test cases covering: 1. Exact duplicate merging 2. Similar name merging (hyphen vs underscore) 3. Blocking by component_type (same name, different type not merged) 4. Embedding rejection of divergent descriptions (with mock embedding model) 5. Empty input 6. No duplicates (all unique) 7. Richest metadata preservation (tags/tools union, longest description, latest timestamp) 8. Transitive duplicate merging (A~B~C -> single entity) ## Verification Results - **pytest:** 8/8 passed - **pyright:** 0 errors, 0 warnings - **ruff:** All checks passed ## Deviations from Plan None - plan executed exactly as written. ## Decisions Made | Decision | Rationale | |----------|-----------| | ComponentMetadata import in TYPE_CHECKING block | Follows project ruff TC001 convention; works with `from __future__ import annotations` | | Sorted output for merged tags/tools | Deterministic output for testing and reproducibility | | Richness scoring as (description_len, tags_count, tools_count) tuple | Natural Python tuple comparison gives priority to description length | ## Key Files | File | Action | Purpose | |------|--------|---------| | `src/skill_retriever/nodes/ingestion/resolver.py` | Created | EntityResolver with two-phase pipeline | | `tests/test_resolver.py` | Created | 8 test cases for resolver | ## Commits | Hash | Message | |------|---------| | `5467184` | feat(02-03): two-phase entity resolution pipeline | ## Duration ~5 minutes ## Next Phase Readiness - EntityResolver ready for integration into ingestion workflow - Fuzzy-only mode enables fast unit testing without model downloads - Embedding confirmation path ready for integration tests with real fastembed models

Loading blob content...

Latest Blog Posts

Redis vs ioredis vs valkey-glide
By punkpeye on January 26, 2026.
benchmark
Redis
valkey
Quickstart: Publish an MCP Server to the MCP Registry
By punkpeye on January 24, 2026.
mcp
official reference mirror
Official MCP Registry Server.json Requirements
By punkpeye on January 24, 2026.
mcp
official reference mirror

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/AnthonyAlcaraz/skill-retriever'

If you have feedback or need assistance with the MCP directory API, please join our Discord server

02-03-SUMMARY.md•2.76 KiB