---
phase: 02-domain-models-ingestion
plan: 02
type: execute
wave: 2
depends_on: ["02-01"]
files_modified:
- src/skill_retriever/nodes/__init__.py
- src/skill_retriever/nodes/ingestion/__init__.py
- src/skill_retriever/nodes/ingestion/frontmatter.py
- src/skill_retriever/nodes/ingestion/git_signals.py
- src/skill_retriever/nodes/ingestion/extractors.py
- src/skill_retriever/nodes/ingestion/crawler.py
- tests/test_ingestion.py
- tests/fixtures/davila7_sample/cli-tool/components/agents/ai-specialists/prompt-engineer.md
- tests/fixtures/davila7_sample/cli-tool/components/skills/development/clean-code/SKILL.md
- tests/fixtures/davila7_sample/cli-tool/components/hooks/automation/pre-commit/pre-commit.md
- tests/fixtures/flat_sample/.claude/agents/code-reviewer.md
- tests/fixtures/flat_sample/.claude/commands/deploy.md
autonomous: true
must_haves:
truths:
- "Crawler auto-selects the correct extraction strategy based on repository structure"
- "Davila7 strategy extracts agents, skills, hooks, and other component types from the nested cli-tool/components/ layout"
- "Flat directory strategy extracts components from .claude/ directory structure"
- "Each extracted ComponentMetadata has name, type, tags, description, and source_path populated"
- "Git health signals (last_updated, commit_count, commit_frequency_30d) are extracted when .git exists and gracefully default when absent"
- "Frontmatter field name differences (tools vs allowed-tools) are normalized"
artifacts:
- path: "src/skill_retriever/nodes/ingestion/frontmatter.py"
provides: "Markdown+YAML frontmatter parsing with field normalization"
exports: ["parse_component_file", "normalize_frontmatter"]
- path: "src/skill_retriever/nodes/ingestion/git_signals.py"
provides: "Git health signal extraction with graceful fallback"
exports: ["extract_git_signals"]
- path: "src/skill_retriever/nodes/ingestion/extractors.py"
provides: "ExtractionStrategy Protocol, Davila7Strategy, FlatDirectoryStrategy, GenericMarkdownStrategy"
exports: ["ExtractionStrategy", "Davila7Strategy", "FlatDirectoryStrategy", "GenericMarkdownStrategy"]
- path: "src/skill_retriever/nodes/ingestion/crawler.py"
provides: "RepositoryCrawler that orchestrates strategy selection and extraction"
exports: ["RepositoryCrawler"]
- path: "tests/test_ingestion.py"
provides: "Ingestion pipeline tests against fixture repos"
key_links:
- from: "src/skill_retriever/nodes/ingestion/crawler.py"
to: "src/skill_retriever/nodes/ingestion/extractors.py"
via: "Crawler iterates registered strategies, calls can_handle then discover+extract"
pattern: "strategy\\.can_handle.*strategy\\.discover.*strategy\\.extract"
- from: "src/skill_retriever/nodes/ingestion/extractors.py"
to: "src/skill_retriever/nodes/ingestion/frontmatter.py"
via: "Strategies use parse_component_file to read markdown files"
pattern: "parse_component_file"
- from: "src/skill_retriever/nodes/ingestion/extractors.py"
to: "src/skill_retriever/entities/components.py"
via: "Strategies produce ComponentMetadata instances"
pattern: "ComponentMetadata"
---
<objective>
Build the repository crawling and component extraction pipeline. The crawler discovers component files in any repo structure using the strategy pattern, parses markdown/YAML frontmatter, extracts git health signals, and produces ComponentMetadata entities.
Purpose: This is the data entry point. Without reliable extraction, the graph has no nodes and vector search has no documents. Universal extraction means the system works beyond the davila7 repository layout.
Output: `ingestion/` subpackage with crawler, extractors, frontmatter parser, git signal extractor, test fixtures, and tests.
</objective>
<execution_context>
@C:\Users\33641\.claude/get-shit-done/workflows/execute-plan.md
@C:\Users\33641\.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/02-domain-models-ingestion/02-RESEARCH.md
@.planning/phases/02-domain-models-ingestion/02-01-SUMMARY.md
@src/skill_retriever/entities/components.py
</context>
<tasks>
<task type="auto">
<name>Task 1: Create test fixtures and frontmatter/git_signals utilities</name>
<files>
tests/fixtures/davila7_sample/cli-tool/components/agents/ai-specialists/prompt-engineer.md
tests/fixtures/davila7_sample/cli-tool/components/skills/development/clean-code/SKILL.md
tests/fixtures/davila7_sample/cli-tool/components/hooks/automation/pre-commit/pre-commit.md
tests/fixtures/flat_sample/.claude/agents/code-reviewer.md
tests/fixtures/flat_sample/.claude/commands/deploy.md
src/skill_retriever/nodes/ingestion/__init__.py
src/skill_retriever/nodes/__init__.py
src/skill_retriever/nodes/ingestion/frontmatter.py
src/skill_retriever/nodes/ingestion/git_signals.py
</files>
<action>
**Test fixtures** (minimal but realistic):
Create `tests/fixtures/davila7_sample/cli-tool/components/agents/ai-specialists/prompt-engineer.md`:
```markdown
---
name: prompt-engineer
description: Expert at crafting effective prompts for AI systems
tools:
- Read
- Write
- Edit
model: opus
---
## Expertise Areas
Prompt design, chain-of-thought, few-shot examples...
```
Create `tests/fixtures/davila7_sample/cli-tool/components/skills/development/clean-code/SKILL.md`:
```markdown
---
name: clean-code
description: Writes clean, maintainable code following best practices
allowed-tools:
- Read
- Write
- Edit
- Bash
version: "1.0"
priority: 2
---
## Instructions
Follow SOLID principles...
```
Create `tests/fixtures/davila7_sample/cli-tool/components/hooks/automation/pre-commit/pre-commit.md`:
```markdown
---
name: pre-commit
description: Runs linting and formatting before commits
---
## Hook Definition
Trigger on pre-commit events...
```
Create `tests/fixtures/flat_sample/.claude/agents/code-reviewer.md`:
```markdown
---
name: code-reviewer
description: Reviews code for bugs and style issues
tools:
- Read
- Grep
---
## Role
Thorough code review agent...
```
Create `tests/fixtures/flat_sample/.claude/commands/deploy.md`:
```markdown
---
name: deploy
description: Deploy to production environment
---
## Usage
/deploy [environment]
```
**frontmatter.py:**
Create `parse_component_file(file_path: Path) -> tuple[dict[str, Any], str]`:
- Use `frontmatter.load(str(file_path))` to parse
- Return `(dict(post.metadata), post.content)`
- Handle `FileNotFoundError`; for files with no frontmatter, return an empty dict plus the full content
Create `normalize_frontmatter(raw: dict[str, Any]) -> dict[str, Any]`:
- Map `allowed-tools` to `tools` key
- Map `allowed_tools` to `tools` key
- Ensure `tags` is always a list (split string on commas if string)
- Ensure `tools` is always a list
- Strip whitespace from `name` and `description` if present
- Return normalized dict
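A minimal sketch of both helpers, assuming the `python-frontmatter` package (imported as `frontmatter`, per the `frontmatter.load` call above). Exactly how `FileNotFoundError` is "handled" is left to the implementer; the sketch simply lets it propagate:
```python
"""frontmatter.py sketch — assumes the python-frontmatter package."""
from pathlib import Path
from typing import Any

import frontmatter


def parse_component_file(file_path: Path) -> tuple[dict[str, Any], str]:
    """Return (metadata, body); metadata is empty when no frontmatter exists."""
    # frontmatter.load already yields empty metadata + full content for files
    # without a frontmatter block; FileNotFoundError propagates from here.
    post = frontmatter.load(str(file_path))
    return dict(post.metadata), post.content


def normalize_frontmatter(raw: dict[str, Any]) -> dict[str, Any]:
    """Map divergent field names onto canonical keys and coerce list fields."""
    data = dict(raw)
    # Fold both alias spellings into the canonical "tools" key.
    for alias in ("allowed-tools", "allowed_tools"):
        if alias in data:
            data.setdefault("tools", data.pop(alias))
    # "tags" may arrive as a comma-separated string; always hand back a list.
    tags = data.get("tags", [])
    if isinstance(tags, str):
        tags = [t.strip() for t in tags.split(",") if t.strip()]
    data["tags"] = list(tags)
    tools = data.get("tools", [])
    data["tools"] = [tools.strip()] if isinstance(tools, str) else list(tools)
    for key in ("name", "description"):
        if isinstance(data.get(key), str):
            data[key] = data[key].strip()
    return data
```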
**git_signals.py:**
Create `extract_git_signals(repo_path: Path, file_relative_path: str) -> dict[str, Any]`:
- Try `Repo(repo_path)`, catch `InvalidGitRepositoryError` and return defaults: `{"last_updated": None, "commit_count": 0, "commit_frequency_30d": 0.0}`
- Use `repo.iter_commits(paths=file_relative_path, max_count=500)`
- Compute `last_updated` (the most recent commit's `committed_datetime` — `iter_commits` yields newest first), `commit_count` (len), and `commit_frequency_30d` (commits in the last 30 days / 30)
- Handle empty commit list (file never committed) with same defaults
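A minimal sketch, assuming GitPython (the source of the `Repo` and `InvalidGitRepositoryError` names above):
```python
"""git_signals.py sketch — assumes GitPython."""
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import Any

from git import InvalidGitRepositoryError, Repo

_DEFAULTS: dict[str, Any] = {
    "last_updated": None,
    "commit_count": 0,
    "commit_frequency_30d": 0.0,
}


def extract_git_signals(repo_path: Path, file_relative_path: str) -> dict[str, Any]:
    try:
        repo = Repo(repo_path)
    except InvalidGitRepositoryError:
        return dict(_DEFAULTS)  # not a git repo: graceful defaults
    commits = list(repo.iter_commits(paths=file_relative_path, max_count=500))
    if not commits:  # file present but never committed
        return dict(_DEFAULTS)
    cutoff = datetime.now(timezone.utc) - timedelta(days=30)
    recent = sum(1 for c in commits if c.committed_datetime >= cutoff)
    return {
        "last_updated": commits[0].committed_datetime,  # newest first
        "commit_count": len(commits),
        "commit_frequency_30d": recent / 30.0,
    }
```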
**__init__.py files:**
- `src/skill_retriever/nodes/__init__.py`: `"""Nodes package — self-contained AI logic units."""`
- `src/skill_retriever/nodes/ingestion/__init__.py`: Will re-export `RepositoryCrawler` once it exists after Task 2. For now, just the docstring: `"""Ingestion pipeline: crawl, extract, resolve."""`
</action>
<verify>
Run `uv run python -c "from skill_retriever.nodes.ingestion.frontmatter import parse_component_file; print('OK')"`.
Run `uv run python -c "from skill_retriever.nodes.ingestion.git_signals import extract_git_signals; print('OK')"`.
Verify fixture files exist: `ls tests/fixtures/davila7_sample/cli-tool/components/agents/ai-specialists/prompt-engineer.md`.
</verify>
<done>
Frontmatter parser handles YAML+markdown files with field normalization. Git signal extractor gracefully handles non-git directories. Five test fixture files exist across davila7 and flat layouts.
</done>
</task>
<task type="auto">
<name>Task 2: Create extraction strategies and repository crawler</name>
<files>
src/skill_retriever/nodes/ingestion/extractors.py
src/skill_retriever/nodes/ingestion/crawler.py
src/skill_retriever/nodes/ingestion/__init__.py
</files>
<action>
**extractors.py:**
Define `ExtractionStrategy` as a `typing.Protocol` (runtime_checkable) with three methods:
- `can_handle(self, repo_root: Path) -> bool`
- `discover(self, repo_root: Path) -> list[Path]` — returns paths to all component definition files
- `extract(self, file_path: Path, repo_root: Path) -> ComponentMetadata | None` — returns None if file cannot be parsed
Define a module-level dict `COMPONENT_TYPE_DIRS` mapping directory names to ComponentType: `{"agents": ComponentType.AGENT, "skills": ComponentType.SKILL, "commands": ComponentType.COMMAND, "hooks": ComponentType.HOOK, "settings": ComponentType.SETTING, "mcps": ComponentType.MCP, "sandbox": ComponentType.SANDBOX}`.
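A minimal sketch of the Protocol and the shared mapping (the `ComponentType` members are those listed above; strategy sketches follow):
```python
"""extractors.py sketch — Protocol plus the shared type-directory mapping."""
from pathlib import Path
from typing import Protocol, runtime_checkable

from skill_retriever.entities.components import ComponentMetadata, ComponentType


@runtime_checkable
class ExtractionStrategy(Protocol):
    """Structural interface every repo-layout strategy must satisfy."""

    def can_handle(self, repo_root: Path) -> bool: ...

    def discover(self, repo_root: Path) -> list[Path]: ...

    def extract(self, file_path: Path, repo_root: Path) -> ComponentMetadata | None: ...


# Directory name -> component type, shared by all strategies.
COMPONENT_TYPE_DIRS: dict[str, ComponentType] = {
    "agents": ComponentType.AGENT,
    "skills": ComponentType.SKILL,
    "commands": ComponentType.COMMAND,
    "hooks": ComponentType.HOOK,
    "settings": ComponentType.SETTING,
    "mcps": ComponentType.MCP,
    "sandbox": ComponentType.SANDBOX,
}
```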
**Davila7Strategy:**
- `can_handle`: Check `(repo_root / "cli-tool" / "components").is_dir()`
- `discover`: Iterate `cli-tool/components/{type_dir}/` for each key in COMPONENT_TYPE_DIRS. For each type dir, `rglob("*.md")` to find all markdown files. Return list of all found paths.
- `extract`: Determine component_type from the path (which type_dir it sits under). Use `parse_component_file` + `normalize_frontmatter`. Determine category from the directory between the type_dir and the file. Build a ComponentMetadata using `generate_id` with the strategy's repo_owner and repo_name; supply those through the constructor rather than a class attribute (prefer `Davila7Strategy(repo_owner: str, repo_name: str)`). A sketch follows below.
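A sketch of the strategy, continuing the same extractors.py module as above. The final ComponentMetadata construction is left in outline because its exact fields and the `generate_id` signature are defined in plan 02-01:
```python
# Continuing the extractors.py sketch above.
from skill_retriever.nodes.ingestion.frontmatter import (
    normalize_frontmatter,
    parse_component_file,
)


class Davila7Strategy:
    """Nested cli-tool/components/ layout (sketch)."""

    def __init__(self, repo_owner: str, repo_name: str) -> None:
        self.repo_owner = repo_owner
        self.repo_name = repo_name

    def can_handle(self, repo_root: Path) -> bool:
        return (repo_root / "cli-tool" / "components").is_dir()

    def discover(self, repo_root: Path) -> list[Path]:
        components = repo_root / "cli-tool" / "components"
        found: list[Path] = []
        for type_dir in COMPONENT_TYPE_DIRS:
            if (components / type_dir).is_dir():
                found.extend(sorted((components / type_dir).rglob("*.md")))
        return found

    def extract(self, file_path: Path, repo_root: Path) -> ComponentMetadata | None:
        metadata, _content = parse_component_file(file_path)
        fields = normalize_frontmatter(metadata)
        if not fields.get("name"):
            return None
        rel = file_path.relative_to(repo_root / "cli-tool" / "components")
        component_type = COMPONENT_TYPE_DIRS[rel.parts[0]]
        # Category is the directory between the type dir and the file, if any.
        category = rel.parts[1] if len(rel.parts) > 2 else None
        # Final construction elided: ComponentMetadata fields and generate_id
        # come from entities/components.py (plan 02-01).
        ...
```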
**FlatDirectoryStrategy:**
- `can_handle`: Check `(repo_root / ".claude").is_dir()` and at least one recognized subdirectory exists (agents, commands, skills, etc.)
- `discover`: Look for `.claude/{type_dir}/` directories matching COMPONENT_TYPE_DIRS keys. Glob `*.md` in each.
- `extract`: Similar to Davila7 but path structure is `.claude/{type}/{file}.md`. Infer component_type from directory name. Constructor: `FlatDirectoryStrategy(repo_owner: str, repo_name: str)`.
**GenericMarkdownStrategy:**
- `can_handle`: Always returns True (fallback). Should be registered last.
- `discover`: `repo_root.rglob("*.md")` excluding hidden dirs (`.git`, `.github`, `node_modules`, `__pycache__`). Filter to files that actually have recognized frontmatter fields (`name` key in metadata).
- `extract`: Parse frontmatter. If `name` field exists, create ComponentMetadata. Infer component_type from frontmatter hints or default to AGENT. Constructor: `GenericMarkdownStrategy(repo_owner: str, repo_name: str)`.
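A sketch of the fallback's discovery filter, again in the same module; its `extract` mirrors the Davila7 version with the AGENT default described above:
```python
# Continuing extractors.py: the fallback strategy's discovery filter (sketch).
_EXCLUDED_DIRS = {".git", ".github", "node_modules", "__pycache__"}


class GenericMarkdownStrategy:
    """Last-resort strategy for arbitrary repo layouts (sketch)."""

    def __init__(self, repo_owner: str, repo_name: str) -> None:
        self.repo_owner = repo_owner
        self.repo_name = repo_name

    def can_handle(self, repo_root: Path) -> bool:
        return True  # fallback; must be registered last

    def discover(self, repo_root: Path) -> list[Path]:
        found: list[Path] = []
        for path in sorted(repo_root.rglob("*.md")):
            if _EXCLUDED_DIRS & set(path.relative_to(repo_root).parts):
                continue  # skip VCS, vendored, and cache directories
            metadata, _ = parse_component_file(path)
            if "name" in metadata:  # keep only files with recognized frontmatter
                found.append(path)
        return found
```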
**crawler.py:**
Create `RepositoryCrawler`:
- Constructor: `__init__(self, repo_owner: str, repo_name: str, repo_path: Path)`
- Attribute `strategies: list[ExtractionStrategy]` initialized in order: `[Davila7Strategy(...), FlatDirectoryStrategy(...), GenericMarkdownStrategy(...)]`
- Method `crawl(self) -> list[ComponentMetadata]`:
1. Iterate strategies, find first where `can_handle(self.repo_path)` returns True
2. Call `strategy.discover(self.repo_path)` to get file paths
3. For each file path, call `strategy.extract(file_path, self.repo_path)`
4. For each successful extraction, call `extract_git_signals(self.repo_path, file_path.relative_to(self.repo_path).as_posix())` and merge signals into ComponentMetadata via `model_copy(update=signals)`
5. Collect and return all non-None ComponentMetadata objects
6. Log (using `logging`) the strategy selected and count of components found
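A minimal sketch of the crawler wiring the pieces above together:
```python
"""crawler.py sketch — orchestrates strategy selection and extraction."""
import logging
from pathlib import Path

from skill_retriever.entities.components import ComponentMetadata
from skill_retriever.nodes.ingestion.extractors import (
    Davila7Strategy,
    ExtractionStrategy,
    FlatDirectoryStrategy,
    GenericMarkdownStrategy,
)
from skill_retriever.nodes.ingestion.git_signals import extract_git_signals

logger = logging.getLogger(__name__)


class RepositoryCrawler:
    def __init__(self, repo_owner: str, repo_name: str, repo_path: Path) -> None:
        self.repo_path = repo_path
        self.strategies: list[ExtractionStrategy] = [
            Davila7Strategy(repo_owner, repo_name),
            FlatDirectoryStrategy(repo_owner, repo_name),
            GenericMarkdownStrategy(repo_owner, repo_name),  # always matches
        ]

    def crawl(self) -> list[ComponentMetadata]:
        # The generic fallback guarantees next() finds a strategy.
        strategy = next(s for s in self.strategies if s.can_handle(self.repo_path))
        results: list[ComponentMetadata] = []
        for file_path in strategy.discover(self.repo_path):
            component = strategy.extract(file_path, self.repo_path)
            if component is None:
                continue
            signals = extract_git_signals(
                self.repo_path, file_path.relative_to(self.repo_path).as_posix()
            )
            results.append(component.model_copy(update=signals))
        logger.info("strategy=%s components=%d", type(strategy).__name__, len(results))
        return results
```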
**Update __init__.py:**
Update `src/skill_retriever/nodes/ingestion/__init__.py` to re-export `RepositoryCrawler`.
</action>
<verify>
Run `uv run pyright src/skill_retriever/nodes/ingestion/` — zero errors.
Run `uv run ruff check src/skill_retriever/nodes/ingestion/` — zero errors.
Run `uv run python -c "from skill_retriever.nodes.ingestion import RepositoryCrawler; print('OK')"`.
</verify>
<done>
Three extraction strategies handle davila7, flat, and generic repo layouts. RepositoryCrawler auto-selects strategy and produces ComponentMetadata list with git health signals merged.
</done>
</task>
<task type="auto">
<name>Task 3: Write ingestion pipeline tests</name>
<files>tests/test_ingestion.py</files>
<action>
Create `tests/test_ingestion.py`. Put the shared path fixtures in `tests/conftest.py` (create it if it does not exist yet).
**Fixtures** (in `tests/conftest.py`):
- `davila7_repo` — returns `Path` to `tests/fixtures/davila7_sample`
- `flat_repo` — returns `Path` to `tests/fixtures/flat_sample`
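A minimal sketch of the two fixtures:
```python
# tests/conftest.py sketch — shared paths to the fixture repositories.
from pathlib import Path

import pytest

FIXTURES = Path(__file__).parent / "fixtures"


@pytest.fixture
def davila7_repo() -> Path:
    return FIXTURES / "davila7_sample"


@pytest.fixture
def flat_repo() -> Path:
    return FIXTURES / "flat_sample"
```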
**Test cases:**
1. `test_parse_component_file_with_frontmatter` — Parse the prompt-engineer.md fixture. Verify metadata dict has `name`, `description`, `tools`. Verify content contains "Expertise Areas".
2. `test_parse_component_file_no_frontmatter` — Create a temp file with no frontmatter (just markdown). Verify returns empty dict and full content.
3. `test_normalize_frontmatter_allowed_tools` — Pass `{"allowed-tools": ["Read", "Write"]}`, verify output has `tools` key with `["Read", "Write"]`.
4. `test_normalize_frontmatter_string_tags` — Pass `{"tags": "ai, testing, code"}`, verify output has `tags` as `["ai", "testing", "code"]` (stripped).
5. `test_extract_git_signals_no_git` — Call on fixture dir (no .git). Verify defaults returned.
6. `test_davila7_strategy_can_handle` — Davila7Strategy recognizes the davila7_sample fixture.
7. `test_davila7_strategy_cannot_handle_flat` — Davila7Strategy rejects the flat_sample fixture.
8. `test_davila7_strategy_discover` — Discover files in davila7_sample, verify at least 3 paths returned (prompt-engineer.md, SKILL.md, pre-commit.md).
9. `test_davila7_strategy_extract_agent` — Extract prompt-engineer.md. Verify: name="prompt-engineer", component_type=AGENT, description non-empty, tools contains "Read", category="ai-specialists".
10. `test_flat_strategy_can_handle` — FlatDirectoryStrategy recognizes flat_sample.
11. `test_flat_strategy_discover` — Discover files in flat_sample, verify 2 paths.
12. `test_flat_strategy_extract` — Extract code-reviewer.md. Verify: name="code-reviewer", component_type=AGENT, tools contains "Read".
13. `test_crawler_davila7` — Create RepositoryCrawler for davila7_sample, call crawl(). Verify returns list of ComponentMetadata, length >= 3, all have non-empty name and id.
14. `test_crawler_flat` — Same for flat_sample, length >= 2.
15. `test_crawler_component_ids_are_deterministic` — Crawl twice, verify same IDs produced.
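A sketch of two representative cases; the `repo_owner`/`repo_name` values passed to the crawler are illustrative placeholders:
```python
# test_ingestion.py sketch — cases 3 and 15 from the list above.
from pathlib import Path

from skill_retriever.nodes.ingestion.crawler import RepositoryCrawler
from skill_retriever.nodes.ingestion.frontmatter import normalize_frontmatter


def test_normalize_frontmatter_allowed_tools() -> None:
    out = normalize_frontmatter({"allowed-tools": ["Read", "Write"]})
    assert out["tools"] == ["Read", "Write"]


def test_crawler_component_ids_are_deterministic(davila7_repo: Path) -> None:
    crawler = RepositoryCrawler("davila7", "davila7_sample", davila7_repo)
    first = [c.id for c in crawler.crawl()]
    assert first == [c.id for c in crawler.crawl()]
```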
</action>
<verify>Run `uv run pytest tests/test_ingestion.py -v` — all 15 tests pass.</verify>
<done>15 tests verify frontmatter parsing, field normalization, strategy selection, extraction from both repo layouts, and crawler determinism. Both success criteria #1 (davila7 extraction) and #2 (flat directory extraction) are covered.</done>
</task>
</tasks>
<verification>
```bash
uv run pytest tests/test_ingestion.py tests/test_entities.py -v
uv run pyright src/skill_retriever/nodes/ingestion/
uv run ruff check src/skill_retriever/nodes/
```
All ingestion tests pass. Pyright strict and ruff clean on the ingestion subpackage.
</verification>
<success_criteria>
- Davila7 fixture extraction produces ComponentMetadata for agent, skill, and hook files
- Flat fixture extraction produces ComponentMetadata for agent and command files
- Frontmatter field normalization handles tools/allowed-tools divergence
- Git signal extraction gracefully defaults when no .git directory
- Crawler auto-selects correct strategy per repo structure
- Component IDs are deterministic across runs
- All 15 ingestion tests pass
- Pyright strict + ruff clean
</success_criteria>
<output>
After completion, create `.planning/phases/02-domain-models-ingestion/02-02-SUMMARY.md`
</output>