# Phase 02: Domain Models & Ingestion - Research
**Researched:** 2026-02-02
**Domain:** Pydantic entity modeling, Git repository crawling, entity resolution/deduplication
**Confidence:** HIGH
## Summary
This phase builds three layers: (1) Pydantic v2 domain models for the 7 component types in Claude Code ecosystems, (2) a strategy-pattern repository crawler that extracts components from any repo structure, and (3) a two-phase entity resolution pipeline combining fuzzy string matching with embedding similarity.
The davila7/claude-code-templates repository serves as the reference implementation, but the crawler must handle arbitrary structures. Component definitions are markdown files with YAML frontmatter containing `name`, `description`, `tools`/`allowed-tools`, `version`, and `model`/`priority` fields. The repository organizes components under `cli-tool/components/{type}/{category}/{component-name}/` with individual `.md` files or `SKILL.md` files inside named directories.
GitPython provides the simplest path to extracting git health signals (last commit date, commit frequency) without requiring C library dependencies. RapidFuzz is the clear choice over TheFuzz for fuzzy string matching in the entity resolution pipeline, offering C++ performance with API compatibility. FastEmbed (already a project dependency) handles the embedding similarity phase.
**Primary recommendation:** Build thin Pydantic models with Literal-discriminated unions for component types, use Protocol-based strategy pattern for extractors, and keep entity resolution as a standalone pipeline stage that runs after all extraction completes.
## Standard Stack
### Core
| Library | Version | Purpose | Why Standard |
|---------|---------|---------|--------------|
| pydantic | >=2.12.5 | Entity models, validation, serialization | Already pinned in pyproject.toml; Rust-backed validation, discriminated unions |
| gitpython | >=3.1.44 | Git metadata extraction (commit dates, frequency) | Pure Python, no C deps, wraps git CLI, mature (3.1.x stable) |
| rapidfuzz | >=3.14 | Fuzzy string matching for entity resolution | C++ performance, MIT license, drop-in TheFuzz replacement |
| fastembed | >=0.7.4 | Embedding similarity for entity resolution phase 2 | Already pinned in pyproject.toml; BAAI/bge-small-en-v1.5 configured |
### Supporting
| Library | Version | Purpose | When to Use |
|---------|---------|---------|-------------|
| pyyaml | >=6.0 | Parse YAML frontmatter from markdown component files | Every component file extraction |
| python-frontmatter | >=1.1 | Higher-level frontmatter parsing (wraps pyyaml) | Cleaner API for markdown+YAML files |
### Alternatives Considered
| Instead of | Could Use | Tradeoff |
|------------|-----------|----------|
| gitpython | pygit2 | pygit2 is faster (libgit2 C bindings) but requires compiled libgit2 dependency; gitpython is simpler for our needs (commit metadata only, not heavy git operations) |
| gitpython | subprocess git | Lower-level, more parsing work; gitpython already wraps this cleanly |
| rapidfuzz | thefuzz | TheFuzz is slower (pure Python Levenshtein), no longer actively developed; RapidFuzz is the community successor |
| python-frontmatter | manual yaml.safe_load | python-frontmatter handles edge cases (no frontmatter, malformed, content split) that manual parsing misses |
**Installation:**
```bash
uv add gitpython rapidfuzz python-frontmatter
```
## Architecture Patterns
### Recommended Project Structure
```
src/skill_retriever/
├── entities/
│   ├── __init__.py          # Re-exports all models
│   ├── components.py        # ComponentMetadata, ComponentType enum
│   ├── capabilities.py      # CapabilityEntity
│   └── graph.py             # GraphNode, GraphEdge, EdgeType
├── nodes/
│   ├── __init__.py
│   ├── ingestion/
│   │   ├── __init__.py
│   │   ├── crawler.py       # RepositoryCrawler (discovers files)
│   │   ├── extractors.py    # ExtractionStrategy Protocol + implementations
│   │   ├── frontmatter.py   # Markdown/YAML parsing utilities
│   │   ├── git_signals.py   # GitPython health signal extraction
│   │   └── resolver.py      # Entity resolution pipeline
│   └── ...
```
### Pattern 1: Literal-Discriminated Component Types
**What:** Use Pydantic's Literal discriminator to distinguish 7 component types in a union
**When to use:** Serializing/deserializing mixed component collections; validating component type at parse time
**Example:**
```python
from datetime import datetime
from enum import StrEnum
from typing import Literal

from pydantic import BaseModel, Field


class ComponentType(StrEnum):
    AGENT = "agent"
    SKILL = "skill"
    COMMAND = "command"
    SETTING = "setting"
    MCP = "mcp"
    HOOK = "hook"
    SANDBOX = "sandbox"


class ComponentMetadata(BaseModel):
    """Core entity representing any Claude Code component."""

    id: str = Field(description="Deterministic ID: {repo_owner}/{repo_name}/{type}/{name}")
    name: str
    component_type: ComponentType
    description: str = ""
    tags: list[str] = Field(default_factory=list)
    author: str = ""
    version: str = ""

    # Git health signals (INGS-04)
    last_updated: datetime | None = None
    commit_count: int = 0
    commit_frequency_30d: float = 0.0  # commits per day in last 30 days

    # Full definition (INGS-03)
    raw_content: str = ""  # Full markdown content
    parameters: dict[str, str] = Field(default_factory=dict)
    dependencies: list[str] = Field(default_factory=list)
    tools: list[str] = Field(default_factory=list)

    # Source tracking
    source_repo: str = ""
    source_path: str = ""  # Relative path within repo
    category: str = ""  # e.g., "ai-specialists", "development"
```
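The `ComponentMetadata` model above carries the shared fields; the Literal-discriminated union itself matters when per-type models diverge (e.g., agents carry `model`, skills carry `priority`). A minimal sketch of that variant, using hypothetical `AgentComponent`/`SkillComponent` models that are illustrative rather than part of the planned entity set:

```python
from typing import Annotated, Literal, Union

from pydantic import BaseModel, Field, TypeAdapter


class AgentComponent(BaseModel):
    component_type: Literal["agent"]
    name: str
    tools: list[str] = Field(default_factory=list)
    model: str = ""


class SkillComponent(BaseModel):
    component_type: Literal["skill"]
    name: str
    allowed_tools: list[str] = Field(default_factory=list, alias="allowed-tools")
    priority: str = ""


# Pydantic picks the right model at parse time based solely on component_type.
AnyComponent = Annotated[
    Union[AgentComponent, SkillComponent],
    Field(discriminator="component_type"),
]

adapter = TypeAdapter(AnyComponent)
skill = adapter.validate_python(
    {"component_type": "skill", "name": "clean-code", "allowed-tools": ["Read"]}
)
assert isinstance(skill, SkillComponent)
```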
**Source:** Pydantic v2 discriminated unions documentation (https://docs.pydantic.dev/latest/concepts/unions/)
### Pattern 2: Protocol-Based Extraction Strategy
**What:** Use `typing.Protocol` to define extraction interface; each repo structure gets its own concrete strategy
**When to use:** Supporting multiple repository layouts without conditional branching in the crawler
**Example:**
```python
from pathlib import Path
from typing import Protocol, runtime_checkable

from skill_retriever.entities.components import ComponentMetadata

# Directory names that map to component types in the davila7 layout
COMPONENT_TYPE_DIRS = {"agents", "skills", "commands", "hooks", "settings", "mcps", "sandbox"}


@runtime_checkable
class ExtractionStrategy(Protocol):
    """Strategy for extracting components from a specific repo structure."""

    def can_handle(self, repo_root: Path) -> bool:
        """Return True if this strategy understands the repo layout."""
        ...

    def discover(self, repo_root: Path) -> list[Path]:
        """Return paths to all component definition files."""
        ...

    def extract(self, file_path: Path, repo_root: Path) -> ComponentMetadata:
        """Parse a single component file into a ComponentMetadata entity."""
        ...


class Davila7Strategy:
    """Extracts from cli-tool/components/{type}/{category}/{name}/ structure."""

    def can_handle(self, repo_root: Path) -> bool:
        return (repo_root / "cli-tool" / "components").is_dir()

    def discover(self, repo_root: Path) -> list[Path]:
        components_dir = repo_root / "cli-tool" / "components"
        paths: list[Path] = []
        for type_dir in components_dir.iterdir():
            if type_dir.is_dir() and type_dir.name in COMPONENT_TYPE_DIRS:
                for md_file in type_dir.rglob("*.md"):
                    paths.append(md_file)
        return paths


class FlatDirectoryStrategy:
    """Extracts from flat .claude/ directory structure."""

    def can_handle(self, repo_root: Path) -> bool:
        return (repo_root / ".claude").is_dir()
```
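The source does not prescribe how the crawler consumes these strategies; one natural approach is first-match selection over a registered list. `select_strategy` below is a hypothetical helper, and it assumes the strategies implement the full Protocol, including the `extract` method omitted from the sketch above:

```python
from pathlib import Path


def select_strategy(repo_root: Path, strategies: list[ExtractionStrategy]) -> ExtractionStrategy:
    """Return the first registered strategy that claims this repo layout."""
    for strategy in strategies:
        if strategy.can_handle(repo_root):
            return strategy
    raise ValueError(f"No extraction strategy recognizes the layout at {repo_root}")


# Ordering matters: the most specific layout check goes first, generic fallbacks last.
registered = [Davila7Strategy(), FlatDirectoryStrategy()]
# strategy = select_strategy(Path("/path/to/cloned/repo"), registered)
```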
**Source:** PEP 544 Protocol classes, Strategy pattern (https://refactoring.guru/design-patterns/strategy/python/example)
### Pattern 3: Two-Phase Entity Resolution
**What:** First pass uses fuzzy string matching (RapidFuzz) on names/descriptions to find candidates; second pass computes embedding similarity to confirm matches above 0.85 threshold
**When to use:** After all extraction completes, before graph insertion
**Example:**
```python
import numpy as np
from fastembed import TextEmbedding
from rapidfuzz import fuzz, process


class EntityResolver:
    def __init__(
        self,
        embedding_model: TextEmbedding,
        fuzzy_threshold: float = 80.0,
        embedding_threshold: float = 0.85,
    ):
        self.embedding_model = embedding_model
        self.fuzzy_threshold = fuzzy_threshold
        self.embedding_threshold = embedding_threshold

    def resolve(self, entities: list[ComponentMetadata]) -> list[ComponentMetadata]:
        """Deduplicate entities using two-phase resolution."""
        # Phase 1: Fuzzy string matching on names
        candidates = self._find_fuzzy_candidates(entities)
        # Phase 2: Embedding similarity confirmation
        confirmed_duplicates = self._confirm_with_embeddings(candidates)
        # Merge duplicates, keeping richest metadata
        return self._merge_duplicates(entities, confirmed_duplicates)

    def _find_fuzzy_candidates(self, entities: list[ComponentMetadata]) -> list[tuple[int, int, float]]:
        names = [e.name for e in entities]
        pairs = []
        for i, name in enumerate(names):
            # limit=None returns every match above the cutoff, not just the default top 5
            matches = process.extract(
                name,
                names[i + 1:],
                scorer=fuzz.token_sort_ratio,
                score_cutoff=self.fuzzy_threshold,
                limit=None,
            )
            for _match_name, score, idx in matches:
                pairs.append((i, i + 1 + idx, score))
        return pairs
```
### Pattern 4: Deterministic Component IDs
**What:** Generate stable, reproducible IDs from source location: `{repo_owner}/{repo_name}/{type}/{normalized_name}`
**When to use:** Every component gets an ID at extraction time; IDs must be stable across re-ingestion runs
**Why:** Enables incremental updates and cross-repo deduplication without UUIDs
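**Example** (a minimal sketch; the exact normalization rules below are assumptions, not specified above):
```python
import re


def make_component_id(repo_owner: str, repo_name: str, component_type: str, name: str) -> str:
    """Build a stable ID of the form {repo_owner}/{repo_name}/{type}/{normalized_name}."""
    # Assumed normalization: lowercase, spaces/underscores to hyphens, drop other punctuation
    normalized = re.sub(r"[^a-z0-9-]", "", name.lower().replace(" ", "-").replace("_", "-"))
    return f"{repo_owner}/{repo_name}/{component_type}/{normalized}"


# Re-running ingestion over the same repo yields the same ID, which is what
# enables incremental updates and cross-repo deduplication without UUIDs.
assert (
    make_component_id("davila7", "claude-code-templates", "agent", "Prompt Engineer")
    == "davila7/claude-code-templates/agent/prompt-engineer"
)
```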
### Anti-Patterns to Avoid
- **Parsing markdown with regex:** Use python-frontmatter for YAML extraction; regex fails on edge cases (nested code blocks, multiline values, no frontmatter)
- **Loading entire repo history into memory:** Use `iter_commits(paths=file, max_count=N)` with GitPython; never `list(repo.iter_commits())` on large repos
- **Tight coupling between crawler and extractor:** Crawler discovers files, strategy extracts metadata. Never put repo-structure-specific logic in the crawler
- **Mutable entity models:** ComponentMetadata should be functionally immutable after creation. Use `model_copy(update={...})` for modifications
- **Running entity resolution during extraction:** Resolution must be a separate pipeline stage after ALL extraction completes, because it needs the full entity set for comparison
## Don't Hand-Roll
| Problem | Don't Build | Use Instead | Why |
|---------|-------------|-------------|-----|
| YAML frontmatter parsing | Regex-based frontmatter extraction | python-frontmatter | Handles missing frontmatter, malformed YAML, content splitting, encoding edge cases |
| Fuzzy string matching | Custom Levenshtein implementation | rapidfuzz | C++ performance, battle-tested scorers (token_sort_ratio, partial_ratio), `process.extract` for bulk matching |
| Git commit metadata | subprocess + git log parsing | gitpython `Repo.iter_commits()` | Object model for commits, handles encoding, cross-platform |
| Embedding computation | Manual model loading + tokenization | fastembed (already in deps) | Handles batching, ONNX runtime, model caching |
| Component ID generation | UUID or random strings | Deterministic path-based IDs | Enables incremental re-ingestion, stable references, cross-session consistency |
| Markdown content extraction | BeautifulSoup or custom parser | python-frontmatter `.content` attribute | Cleanly separates frontmatter from body; body is already the "full definition" |
**Key insight:** The davila7 repo components are markdown files with YAML frontmatter. This is a solved problem in the Python ecosystem. The complexity is in handling the variety of repo structures and the entity resolution, not the parsing itself.
## Common Pitfalls
### Pitfall 1: Assuming Uniform Frontmatter Fields Across Component Types
**What goes wrong:** Agent files have `name`, `description`, `tools`, `model`. Skill files have `name`, `description`, `allowed-tools`, `version`, `priority`. Hook files are JSON-based (`HOOK_PATTERNS_COMPRESSED.json`), not markdown at all. Settings and sandbox components may have entirely different structures.
**Why it happens:** Testing only against agent files during development
**How to avoid:** Define a minimal required field set (name, description) and treat everything else as optional. Use `model_validator(mode='before')` to normalize field names across formats (e.g., `tools` and `allowed-tools` both map to `tools` field).
**Warning signs:** Tests pass for agents but fail for skills or hooks
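A hedged sketch of the normalization idea on a trimmed-down model (only the `tools`/`allowed-tools` mapping is taken from the observed frontmatter; the model itself is illustrative, not the planned `ComponentMetadata`):

```python
from typing import Any

from pydantic import BaseModel, Field, model_validator


class NormalizedComponent(BaseModel):
    """Trimmed-down stand-in for ComponentMetadata, for illustration only."""

    name: str
    description: str = ""
    tools: list[str] = Field(default_factory=list)

    @model_validator(mode="before")
    @classmethod
    def _normalize_field_names(cls, data: Any) -> Any:
        # Skills use "allowed-tools"; agents use "tools". Map both onto one field.
        if isinstance(data, dict) and "allowed-tools" in data and "tools" not in data:
            data = {**data, "tools": data["allowed-tools"]}
        return data


skill = NormalizedComponent.model_validate({"name": "clean-code", "allowed-tools": ["Read", "Grep"]})
assert skill.tools == ["Read", "Grep"]
```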
### Pitfall 2: Git Operations on Cloned Repos Without .git Directory
**What goes wrong:** If repo was downloaded as ZIP or tarball (no `.git`), GitPython throws `InvalidGitRepositoryError`
**Why it happens:** Not all users will `git clone`; some may download and extract
**How to avoid:** Check for `.git` directory before attempting git operations. When absent, skip git health signals gracefully (set defaults: `last_updated=None`, `commit_count=0`)
**Warning signs:** `InvalidGitRepositoryError` in tests using fixture directories
### Pitfall 3: Entity Resolution Quadratic Blowup
**What goes wrong:** Comparing every entity against every other entity is O(n^2). With ~1400 components, that is ~1M comparisons for fuzzy matching alone, plus embedding computation
**Why it happens:** Naive all-pairs comparison without blocking
**How to avoid:** Use blocking strategy: only compare entities of the same `component_type`. This reduces from ~1400^2 to sum of (type_count^2), roughly 10x reduction. Also use `rapidfuzz.process.extract` with `score_cutoff` to skip low-scoring pairs early
**Warning signs:** Entity resolution takes >10 seconds on the full dataset
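A minimal sketch of the blocking step (the helper name is illustrative):

```python
from collections import defaultdict


def block_by_type(entities: list[ComponentMetadata]) -> dict[ComponentType, list[int]]:
    """Group entity indices by component_type so comparisons stay within a block."""
    blocks: dict[ComponentType, list[int]] = defaultdict(list)
    for idx, entity in enumerate(entities):
        blocks[entity.component_type].append(idx)
    return blocks


# Fuzzy matching then runs per block instead of across all entities at once:
# for component_type, indices in block_by_type(entities).items():
#     names = [entities[i].name for i in indices]
#     ...find duplicates within `names` and map local indices back through `indices`...
```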
### Pitfall 4: Confusing File Structure with Component Type
**What goes wrong:** In davila7, the directory name maps to component type (`agents/` = agent, `skills/` = skill). In flat repos, there may be no such directory structure. Mapping logic that depends on directory names breaks for other repos.
**How to avoid:** File path hints are one signal; frontmatter content is another. The ExtractionStrategy is responsible for determining component type, not the crawler. For flat repos, infer type from frontmatter fields or file content patterns.
**Warning signs:** All components from a flat repo get classified as "unknown" type
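One possible heuristic, sketched under the assumption that path hints are checked before frontmatter hints; the field-to-type mapping mirrors fields observed in the reference repo but is not a source-specified rule:

```python
def infer_component_type(relative_path: str, metadata: dict) -> ComponentType | None:
    """Best-effort type inference for repos without a type-named directory hierarchy."""
    parts = relative_path.lower().split("/")
    if "agents" in parts:
        return ComponentType.AGENT
    if "commands" in parts:
        return ComponentType.COMMAND
    if "skills" in parts or relative_path.endswith("SKILL.md"):
        return ComponentType.SKILL
    # Frontmatter hints: skills carry allowed-tools/priority, agents carry model
    if "allowed-tools" in metadata or "priority" in metadata:
        return ComponentType.SKILL
    if "model" in metadata:
        return ComponentType.AGENT
    return None  # let the caller decide how to handle unknown types
```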
### Pitfall 5: Over-Indexing on davila7 Structure
**What goes wrong:** Building an extractor that perfectly handles `cli-tool/components/{type}/{category}/{name}/` but fails on any other layout
**Why it happens:** Only testing against one repo
**How to avoid:** Success criteria #2 explicitly requires a flat-directory test. Create a minimal test fixture with 3-5 components in a flat `.claude/` structure during plan 02-02
**Warning signs:** Strategy only has `Davila7Strategy`, no `FlatDirectoryStrategy` or `GenericMarkdownStrategy`
### Pitfall 6: Forgetting the marketplace.json Metadata Source
**What goes wrong:** Ignoring `marketplace.json` means missing bundle-level metadata (keywords, version, license, mcpServers) that enriches individual components
**Why it happens:** Focus on individual .md files, not aggregate metadata files
**How to avoid:** Davila7Strategy should parse `marketplace.json` first to build a keyword/bundle index, then enrich individual components during extraction
**Warning signs:** Components have no tags/keywords despite marketplace.json containing rich keyword arrays
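A hedged sketch of the enrichment step. The exact marketplace.json shape has not been pinned down here, so the `plugins`/`keywords`/`components` field names below are assumptions to verify against the real file:

```python
import json
from pathlib import Path


def build_keyword_index(marketplace_path: Path) -> dict[str, list[str]]:
    """Map component file paths to the keywords of the bundle that references them."""
    data = json.loads(marketplace_path.read_text(encoding="utf-8"))
    index: dict[str, list[str]] = {}
    # Assumed shape: {"plugins": [{"keywords": [...], "components": ["path/to/file.md", ...]}]}
    for bundle in data.get("plugins", []):
        keywords = bundle.get("keywords", [])
        for component_path in bundle.get("components", []):
            index.setdefault(component_path, []).extend(keywords)
    return index


# During extraction, bundle keywords become tags:
# component.tags = index.get(component.source_path, [])
```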
## Code Examples
### Frontmatter Parsing with python-frontmatter
```python
from pathlib import Path

import frontmatter


def parse_component_file(file_path: Path) -> tuple[dict, str]:
    """Parse a component markdown file into metadata dict and content body."""
    post = frontmatter.load(str(file_path))
    metadata = dict(post.metadata)  # YAML frontmatter as dict
    content = post.content  # Markdown body
    return metadata, content


# Agent example yields:
# metadata = {"name": "prompt-engineer", "description": "...", "tools": ["Read", "Write", "Edit"], "model": "opus"}
# content = "## Expertise Areas\n..."
```
### Git Health Signal Extraction with GitPython
```python
from datetime import datetime, timedelta, timezone
from pathlib import Path

from git import InvalidGitRepositoryError, Repo


def extract_git_signals(repo_path: Path, file_relative_path: str) -> dict:
    """Extract git health signals for a specific file."""
    try:
        repo = Repo(repo_path)
    except InvalidGitRepositoryError:
        return {"last_updated": None, "commit_count": 0, "commit_frequency_30d": 0.0}

    commits = list(repo.iter_commits(paths=file_relative_path, max_count=500))
    if not commits:
        return {"last_updated": None, "commit_count": 0, "commit_frequency_30d": 0.0}

    last_updated = commits[0].committed_datetime
    commit_count = len(commits)

    # Commits in the last 30 days
    thirty_days_ago = datetime.now(timezone.utc) - timedelta(days=30)
    recent = [c for c in commits if c.committed_datetime > thirty_days_ago]
    frequency_30d = len(recent) / 30.0

    return {
        "last_updated": last_updated,
        "commit_count": commit_count,
        "commit_frequency_30d": round(frequency_30d, 4),
    }
```
### RapidFuzz Entity Matching
```python
from rapidfuzz import fuzz, process


def find_fuzzy_duplicates(
    names: list[str],
    threshold: float = 80.0,
) -> list[tuple[int, int, float]]:
    """Find name pairs above fuzzy similarity threshold."""
    duplicates = []
    for i in range(len(names)):
        # Only compare forward to avoid duplicate pairs
        remaining = names[i + 1:]
        if not remaining:
            continue
        matches = process.extract(
            names[i],
            remaining,
            scorer=fuzz.token_sort_ratio,
            score_cutoff=threshold,
            limit=None,  # return all matches above the cutoff, not just the default top 5
        )
        for _match_name, score, idx in matches:
            duplicates.append((i, i + 1 + idx, score))
    return duplicates
```
### Embedding Similarity Confirmation
```python
import numpy as np
from fastembed import TextEmbedding


def confirm_duplicates_with_embeddings(
    entities: list[ComponentMetadata],
    candidate_pairs: list[tuple[int, int, float]],
    model: TextEmbedding,
    threshold: float = 0.85,
) -> list[tuple[int, int]]:
    """Confirm fuzzy candidates using embedding cosine similarity."""
    if not candidate_pairs:
        return []

    # Batch embed all unique indices
    unique_indices = set()
    for i, j, _ in candidate_pairs:
        unique_indices.add(i)
        unique_indices.add(j)
    ordered = sorted(unique_indices)
    texts = [f"{entities[i].name} {entities[i].description}" for i in ordered]
    embeddings = list(model.embed(texts))
    idx_to_emb = dict(zip(ordered, embeddings))

    confirmed = []
    for i, j, _fuzzy_score in candidate_pairs:
        emb_i = idx_to_emb[i]
        emb_j = idx_to_emb[j]
        cosine_sim = np.dot(emb_i, emb_j) / (np.linalg.norm(emb_i) * np.linalg.norm(emb_j))
        if cosine_sim >= threshold:
            confirmed.append((i, j))
    return confirmed
```
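### Merging Confirmed Duplicates
The merge step is referenced in the resolver but not specified above; this is a minimal sketch that interprets "keep the richest metadata" as preferring the entity with the longer description and unioning tags (that heuristic is an assumption):

```python
def merge_duplicates(
    entities: list[ComponentMetadata],
    confirmed_pairs: list[tuple[int, int]],
) -> list[ComponentMetadata]:
    """Collapse confirmed duplicate pairs, keeping the richer entity of each pair."""
    dropped: set[int] = set()
    merged = {i: entity for i, entity in enumerate(entities)}
    for i, j in confirmed_pairs:
        if i in dropped or j in dropped:
            continue
        # Assumed "richness" heuristic: longer description wins
        keep, drop = (i, j) if len(entities[i].description) >= len(entities[j].description) else (j, i)
        # model_copy keeps ComponentMetadata functionally immutable while folding in tags
        merged[keep] = merged[keep].model_copy(
            update={"tags": sorted(set(merged[keep].tags) | set(entities[drop].tags))}
        )
        dropped.add(drop)
    return [merged[i] for i in range(len(entities)) if i not in dropped]
```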
## State of the Art
| Old Approach | Current Approach | When Changed | Impact |
|--------------|------------------|--------------|--------|
| fuzzywuzzy (GPL) | rapidfuzz (MIT, C++) | 2022+ | 10-50x faster, permissive license, active maintenance |
| Pydantic v1 validators | Pydantic v2 `field_validator` / `model_validator` | 2023 (v2.0) | Rust-backed core, discriminated unions in Rust, `model_copy` replaces `.copy()` |
| Manual YAML parsing | python-frontmatter | Stable since 2018 | Handles all edge cases of markdown+YAML |
| pygit2 for simple git ops | gitpython for metadata-only use cases | Ongoing | gitpython is simpler when you only need commit iteration, no C deps |
**Deprecated/outdated:**
- `fuzzywuzzy`: Replaced by `thefuzz` (rename) and superseded by `rapidfuzz` (performance)
- Pydantic v1 `@validator` decorator: Replaced by `@field_validator` in v2
- Pydantic v1 `.copy(update={})`: Replaced by `.model_copy(update={})` in v2
## Open Questions
1. **Hook component format**
- What we know: Hooks use `HOOK_PATTERNS_COMPRESSED.json` (JSON), not markdown. Individual hook dirs contain markdown files but the compressed JSON is the aggregate.
- What's unclear: Do individual hook `.md` files have frontmatter, or are they pure instruction text? The compressed JSON suggests hooks may need a JSON-based extractor alongside the markdown one.
- Recommendation: Build a `HookJsonExtractor` alongside the markdown-based strategies. Parse both the compressed JSON and individual `.md` files to get maximum coverage.
2. **marketplace.json as metadata enrichment source**
- What we know: 10 plugin bundles in marketplace.json with keywords, version, license, mcpServers arrays. Components referenced by file path.
- What's unclear: How to merge bundle-level keywords with individual component metadata. Should bundle keywords propagate to all components in that bundle?
- Recommendation: Parse marketplace.json early. Propagate bundle keywords as tags on each component in that bundle. Store bundle membership as metadata for potential graph edges later.
3. **Flat repo structure detection heuristic**
- What we know: Flat repos have `.claude/agents/` and `.claude/commands/` directly. No `cli-tool/components/` hierarchy.
- What's unclear: How many real-world repos use flat vs hierarchical structures? Are there other common patterns?
- Recommendation: Build `Davila7Strategy` and `FlatClaudeStrategy` for v1. Add a `GenericMarkdownStrategy` that recursively scans for any `.md` files with recognized frontmatter fields as fallback.
4. **Incremental re-ingestion scope**
- What we know: Context mentions "incremental updates after initial full build"
- What's unclear: Whether Phase 2 needs to support incremental, or if full rebuild is sufficient for v1
- Recommendation: Phase 2 builds full-rebuild only. Use deterministic IDs so that Phase 3 (memory layer) can detect "already ingested" components by ID. Incremental is a v2 optimization.
## Reference: davila7/claude-code-templates Repository Structure
```
davila7/claude-code-templates/
├── .claude/
│   ├── agents/              # Flat structure (empty in this repo)
│   └── commands/            # Flat structure (empty in this repo)
├── .claude-plugin/
│   └── marketplace.json     # 10 bundles with keywords, versions, component paths
├── cli-tool/
│   └── components/
│       ├── agents/          # 27+ category dirs, each with .md files
│       │   ├── ai-specialists/    # 7 agents (e.g., prompt-engineer.md)
│       │   ├── database/
│       │   ├── development-team/
│       │   └── ...
│       ├── skills/          # 18 category dirs, each with {skill-name}/SKILL.md
│       │   ├── ai-research/       # ~30 skills
│       │   ├── development/       # clean-code/, architecture/, etc.
│       │   └── ...
│       ├── commands/        # 22 category dirs with command .md files
│       ├── hooks/           # 10 category dirs + HOOK_PATTERNS_COMPRESSED.json
│       ├── settings/        # 12 category dirs
│       ├── mcps/            # 11 category dirs
│       └── sandbox/         # 3 dirs (cloudflare, docker, e2b)
```
**Component file formats observed:**
- Agents: `{category}/{agent-name}.md` with YAML frontmatter (`name`, `description`, `tools`, `model`)
- Skills: `{category}/{skill-name}/SKILL.md` with YAML frontmatter (`name`, `description`, `allowed-tools`, `version`, `priority`)
- Commands: `{category}/{command-name}.md` (format TBD, likely similar frontmatter)
- Hooks: JSON (`HOOK_PATTERNS_COMPRESSED.json`) + individual `.md` files in category dirs
- Settings: category dirs with configuration files (format TBD)
- MCPs: category dirs with MCP definition files (format TBD)
- Sandbox: `README.md` + implementation dirs (cloudflare, docker, e2b)
## Sources
### Primary (HIGH confidence)
- davila7/claude-code-templates GitHub API - repository tree, file contents, marketplace.json schema verified directly
- Pydantic v2 official docs (https://docs.pydantic.dev/latest/concepts/unions/) - discriminated unions, validators
- Pydantic v2 validators docs (https://docs.pydantic.dev/latest/concepts/validators/) - field_validator, model_validator
### Secondary (MEDIUM confidence)
- GitPython documentation (https://gitpython.readthedocs.io/en/stable/tutorial.html) - iter_commits, commit metadata APIs
- RapidFuzz documentation (https://rapidfuzz.github.io/RapidFuzz/) - process.extract, scorer options, score_cutoff
- RapidFuzz GitHub (https://github.com/rapidfuzz/RapidFuzz) - version 3.14.x, C++ backend confirmed
- python-frontmatter PyPI (https://pypi.org/project/python-frontmatter/) - frontmatter.load API
- PEP 544 Protocol classes - typing.Protocol for strategy pattern
### Tertiary (LOW confidence)
- Component counts (~329 agents, ~664 skills, etc.) from architecture research context - not independently verified against current repo state
## Metadata
**Confidence breakdown:**
- Standard stack: HIGH - all libraries verified via official docs/repos, versions confirmed
- Architecture: HIGH - patterns verified against actual davila7 repo structure, Protocol strategy pattern is well-established Python idiom
- Pitfalls: MEDIUM - derived from structural analysis of the repo and general engineering experience; some pitfalls (quadratic blowup thresholds) would benefit from empirical validation
**Research date:** 2026-02-02
**Valid until:** 2026-03-04 (30 days - stable domain, libraries are mature)