---
phase: 02-domain-models-ingestion
plan: 02
type: execute
wave: 2
depends_on: ["02-01"]
files_modified:
- src/skill_retriever/nodes/__init__.py
- src/skill_retriever/nodes/ingestion/__init__.py
- src/skill_retriever/nodes/ingestion/frontmatter.py
- src/skill_retriever/nodes/ingestion/git_signals.py
- src/skill_retriever/nodes/ingestion/extractors.py
- src/skill_retriever/nodes/ingestion/crawler.py
- tests/test_ingestion.py
- tests/fixtures/davila7_sample/cli-tool/components/agents/ai-specialists/prompt-engineer.md
- tests/fixtures/davila7_sample/cli-tool/components/skills/development/clean-code/SKILL.md
- tests/fixtures/davila7_sample/cli-tool/components/hooks/automation/pre-commit/pre-commit.md
- tests/fixtures/flat_sample/.claude/agents/code-reviewer.md
- tests/fixtures/flat_sample/.claude/commands/deploy.md
autonomous: true
must_haves:
truths:
- "Crawler auto-selects the correct extraction strategy based on repository structure"
- "Davila7 strategy extracts agents, skills, hooks, and other component types from the nested cli-tool/components/ layout"
- "Flat directory strategy extracts components from .claude/ directory structure"
- "Each extracted ComponentMetadata has name, type, tags, description, and source_path populated"
- "Git health signals (last_updated, commit_count, commit_frequency_30d) are extracted when .git exists and gracefully default when absent"
- "Frontmatter field name differences (tools vs allowed-tools) are normalized"
artifacts:
- path: "src/skill_retriever/nodes/ingestion/frontmatter.py"
provides: "Markdown+YAML frontmatter parsing with field normalization"
exports: ["parse_component_file", "normalize_frontmatter"]
- path: "src/skill_retriever/nodes/ingestion/git_signals.py"
provides: "Git health signal extraction with graceful fallback"
exports: ["extract_git_signals"]
- path: "src/skill_retriever/nodes/ingestion/extractors.py"
provides: "ExtractionStrategy Protocol, Davila7Strategy, FlatDirectoryStrategy, GenericMarkdownStrategy"
exports: ["ExtractionStrategy", "Davila7Strategy", "FlatDirectoryStrategy", "GenericMarkdownStrategy"]
- path: "src/skill_retriever/nodes/ingestion/crawler.py"
provides: "RepositoryCrawler that orchestrates strategy selection and extraction"
exports: ["RepositoryCrawler"]
- path: "tests/test_ingestion.py"
provides: "Ingestion pipeline tests against fixture repos"
key_links:
- from: "src/skill_retriever/nodes/ingestion/crawler.py"
to: "src/skill_retriever/nodes/ingestion/extractors.py"
via: "Crawler iterates registered strategies, calls can_handle then discover+extract"
pattern: "strategy\\.can_handle.*strategy\\.discover.*strategy\\.extract"
- from: "src/skill_retriever/nodes/ingestion/extractors.py"
to: "src/skill_retriever/nodes/ingestion/frontmatter.py"
via: "Strategies use parse_component_file to read markdown files"
pattern: "parse_component_file"
- from: "src/skill_retriever/nodes/ingestion/extractors.py"
to: "src/skill_retriever/entities/components.py"
via: "Strategies produce ComponentMetadata instances"
pattern: "ComponentMetadata"
---
<objective>
Build the repository crawling and component extraction pipeline. The crawler discovers component files in any repo structure using the strategy pattern, parses markdown/YAML frontmatter, extracts git health signals, and produces ComponentMetadata entities.
Purpose: This is the data entry point. Without reliable extraction, the graph has no nodes and vector search has no documents. Universal extraction means the system works beyond the davila7 repository layout.
Output: `ingestion/` subpackage with crawler, extractors, frontmatter parser, git signal extractor, test fixtures, and tests.
</objective>
<execution_context>
@C:\Users\33641\.claude/get-shit-done/workflows/execute-plan.md
@C:\Users\33641\.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/02-domain-models-ingestion/02-RESEARCH.md
@.planning/phases/02-domain-models-ingestion/02-01-SUMMARY.md
@src/skill_retriever/entities/components.py
</context>
<tasks>
<task type="auto">
<name>Task 1: Create test fixtures and frontmatter/git_signals utilities</name>
<files>
tests/fixtures/davila7_sample/cli-tool/components/agents/ai-specialists/prompt-engineer.md
tests/fixtures/davila7_sample/cli-tool/components/skills/development/clean-code/SKILL.md
tests/fixtures/davila7_sample/cli-tool/components/hooks/automation/pre-commit/pre-commit.md
tests/fixtures/flat_sample/.claude/agents/code-reviewer.md
tests/fixtures/flat_sample/.claude/commands/deploy.md
src/skill_retriever/nodes/ingestion/__init__.py
src/skill_retriever/nodes/__init__.py
src/skill_retriever/nodes/ingestion/frontmatter.py
src/skill_retriever/nodes/ingestion/git_signals.py
</files>
<action>
**Test fixtures** (minimal but realistic):
Create `tests/fixtures/davila7_sample/cli-tool/components/agents/ai-specialists/prompt-engineer.md`:
```markdown
---
name: prompt-engineer
description: Expert at crafting effective prompts for AI systems
tools:
- Read
- Write
- Edit
model: opus
---
## Expertise Areas
Prompt design, chain-of-thought, few-shot examples...
```
Create `tests/fixtures/davila7_sample/cli-tool/components/skills/development/clean-code/SKILL.md`:
```markdown
---
name: clean-code
description: Writes clean, maintainable code following best practices
allowed-tools:
- Read
- Write
- Edit
- Bash
version: "1.0"
priority: 2
---
## Instructions
Follow SOLID principles...
```
Create `tests/fixtures/davila7_sample/cli-tool/components/hooks/automation/pre-commit/pre-commit.md`:
```markdown
---
name: pre-commit
description: Runs linting and formatting before commits
---
## Hook Definition
Trigger on pre-commit events...
```
Create `tests/fixtures/flat_sample/.claude/agents/code-reviewer.md`:
```markdown
---
name: code-reviewer
description: Reviews code for bugs and style issues
tools:
- Read
- Grep
---
## Role
Thorough code review agent...
```
Create `tests/fixtures/flat_sample/.claude/commands/deploy.md`:
```markdown
---
name: deploy
description: Deploy to production environment
---
## Usage
/deploy [environment]
```
**frontmatter.py:**
Create `parse_component_file(file_path: Path) -> tuple[dict[str, Any], str]`:
- Use `frontmatter.load(str(file_path))` to parse
- Return `(dict(post.metadata), post.content)`
- Handle `FileNotFoundError`; for files with no frontmatter, return an empty dict plus the full content
Create `normalize_frontmatter(raw: dict[str, Any]) -> dict[str, Any]`:
- Map `allowed-tools` to `tools` key
- Map `allowed_tools` to `tools` key
- Ensure `tags` is always a list (split string on commas if string)
- Ensure `tools` is always a list
- Strip whitespace from `name` and `description` if present
- Return normalized dict
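A minimal sketch of both helpers, assuming the `python-frontmatter` package (imported as `frontmatter`, per the `frontmatter.load` call above). Exactly how `FileNotFoundError` is "handled" is left to the implementer; the sketch simply lets it propagate:
```python
"""frontmatter.py sketch — assumes the python-frontmatter package."""
from pathlib import Path
from typing import Any

import frontmatter


def parse_component_file(file_path: Path) -> tuple[dict[str, Any], str]:
    """Return (metadata, body); metadata is empty when no frontmatter exists."""
    # frontmatter.load already yields empty metadata + full content for files
    # without a frontmatter block; FileNotFoundError propagates from here.
    post = frontmatter.load(str(file_path))
    return dict(post.metadata), post.content


def normalize_frontmatter(raw: dict[str, Any]) -> dict[str, Any]:
    """Map divergent field names onto canonical keys and coerce list fields."""
    data = dict(raw)
    # Fold both alias spellings into the canonical "tools" key.
    for alias in ("allowed-tools", "allowed_tools"):
        if alias in data:
            data.setdefault("tools", data.pop(alias))
    # "tags" may arrive as a comma-separated string; always hand back a list.
    tags = data.get("tags", [])
    if isinstance(tags, str):
        tags = [t.strip() for t in tags.split(",") if t.strip()]
    data["tags"] = list(tags)
    tools = data.get("tools", [])
    data["tools"] = [tools.strip()] if isinstance(tools, str) else list(tools)
    for key in ("name", "description"):
        if isinstance(data.get(key), str):
            data[key] = data[key].strip()
    return data
```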
**git_signals.py:**
Create `extract_git_signals(repo_path: Path, file_relative_path: str) -> dict[str, Any]`:
- Try `Repo(repo_path)`, catch `InvalidGitRepositoryError` and return defaults: `{"last_updated": None, "commit_count": 0, "commit_frequency_30d": 0.0}`
- Use `repo.iter_commits(paths=file_relative_path, max_count=500)`
- Compute `last_updated` (the most recent commit's `committed_datetime` — `iter_commits` yields newest first), `commit_count` (len), and `commit_frequency_30d` (commits in the last 30 days / 30)
- Handle empty commit list (file never committed) with same defaults
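A minimal sketch, assuming GitPython (the source of the `Repo` and `InvalidGitRepositoryError` names above):
```python
"""git_signals.py sketch — assumes GitPython."""
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import Any

from git import InvalidGitRepositoryError, Repo

_DEFAULTS: dict[str, Any] = {
    "last_updated": None,
    "commit_count": 0,
    "commit_frequency_30d": 0.0,
}


def extract_git_signals(repo_path: Path, file_relative_path: str) -> dict[str, Any]:
    try:
        repo = Repo(repo_path)
    except InvalidGitRepositoryError:
        return dict(_DEFAULTS)  # not a git repo: graceful defaults
    commits = list(repo.iter_commits(paths=file_relative_path, max_count=500))
    if not commits:  # file present but never committed
        return dict(_DEFAULTS)
    cutoff = datetime.now(timezone.utc) - timedelta(days=30)
    recent = sum(1 for c in commits if c.committed_datetime >= cutoff)
    return {
        "last_updated": commits[0].committed_datetime,  # newest first
        "commit_count": len(commits),
        "commit_frequency_30d": recent / 30.0,
    }
```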
**__init__.py files:**
- `src/skill_retriever/nodes/__init__.py`: `"""Nodes package — self-contained AI logic units."""`
- `src/skill_retriever/nodes/ingestion/__init__.py`: Will re-export `RepositoryCrawler` once it exists after Task 2. For now, just the docstring: `"""Ingestion pipeline: crawl, extract, resolve."""`
</action>
<verify>
Run `uv run python -c "from skill_retriever.nodes.ingestion.frontmatter import parse_component_file; print('OK')"`.
Run `uv run python -c "from skill_retriever.nodes.ingestion.git_signals import extract_git_signals; print('OK')"`.
Verify fixture files exist: `ls tests/fixtures/davila7_sample/cli-tool/components/agents/ai-specialists/prompt-engineer.md`.
</verify>
<done>
Frontmatter parser handles YAML+markdown files with field normalization. Git signal extractor gracefully handles non-git directories. Five test fixture files exist across davila7 and flat layouts.
</done>
</task>
<task type="auto">
<name>Task 2: Create extraction strategies and repository crawler</name>
<files>
src/skill_retriever/nodes/ingestion/extractors.py
src/skill_retriever/nodes/ingestion/crawler.py
src/skill_retriever/nodes/ingestion/__init__.py
</files>
<action>
**extractors.py:**
Define `ExtractionStrategy` as a `typing.Protocol` (runtime_checkable) with three methods:
- `can_handle(self, repo_root: Path) -> bool`
- `discover(self, repo_root: Path) -> list[Path]` — returns paths to all component definition files
- `extract(self, file_path: Path, repo_root: Path) -> ComponentMetadata | None` — returns None if file cannot be parsed
Define a module-level dict `COMPONENT_TYPE_DIRS` mapping directory names to ComponentType: `{"agents": ComponentType.AGENT, "skills": ComponentType.SKILL, "commands": ComponentType.COMMAND, "hooks": ComponentType.HOOK, "settings": ComponentType.SETTING, "mcps": ComponentType.MCP, "sandbox": ComponentType.SANDBOX}`.
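A minimal sketch of the Protocol and the shared mapping (the `ComponentType` members are those listed above; strategy sketches follow):
```python
"""extractors.py sketch — Protocol plus the shared type-directory mapping."""
from pathlib import Path
from typing import Protocol, runtime_checkable

from skill_retriever.entities.components import ComponentMetadata, ComponentType


@runtime_checkable
class ExtractionStrategy(Protocol):
    """Structural interface every repo-layout strategy must satisfy."""

    def can_handle(self, repo_root: Path) -> bool: ...

    def discover(self, repo_root: Path) -> list[Path]: ...

    def extract(self, file_path: Path, repo_root: Path) -> ComponentMetadata | None: ...


# Directory name -> component type, shared by all strategies.
COMPONENT_TYPE_DIRS: dict[str, ComponentType] = {
    "agents": ComponentType.AGENT,
    "skills": ComponentType.SKILL,
    "commands": ComponentType.COMMAND,
    "hooks": ComponentType.HOOK,
    "settings": ComponentType.SETTING,
    "mcps": ComponentType.MCP,
    "sandbox": ComponentType.SANDBOX,
}
```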
**Davila7Strategy:**
- `can_handle`: Check `(repo_root / "cli-tool" / "components").is_dir()`
- `discover`: Iterate `cli-tool/components/{type_dir}/` for each key in COMPONENT_TYPE_DIRS. For each type dir, `rglob("*.md")` to find all markdown files. Return list of all found paths.
- `extract`: Determine component_type from the path (which type_dir it sits under). Use `parse_component_file` + `normalize_frontmatter`. Determine category from the directory between the type_dir and the file. Build a ComponentMetadata using `generate_id` with the strategy's repo_owner and repo_name; supply those through the constructor rather than a class attribute (prefer `Davila7Strategy(repo_owner: str, repo_name: str)`). A sketch follows below.
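A sketch of the strategy, continuing the same extractors.py module as above. The final ComponentMetadata construction is left in outline because its exact fields and the `generate_id` signature are defined in plan 02-01:
```python
# Continuing the extractors.py sketch above.
from skill_retriever.nodes.ingestion.frontmatter import (
    normalize_frontmatter,
    parse_component_file,
)


class Davila7Strategy:
    """Nested cli-tool/components/ layout (sketch)."""

    def __init__(self, repo_owner: str, repo_name: str) -> None:
        self.repo_owner = repo_owner
        self.repo_name = repo_name

    def can_handle(self, repo_root: Path) -> bool:
        return (repo_root / "cli-tool" / "components").is_dir()

    def discover(self, repo_root: Path) -> list[Path]:
        components = repo_root / "cli-tool" / "components"
        found: list[Path] = []
        for type_dir in COMPONENT_TYPE_DIRS:
            if (components / type_dir).is_dir():
                found.extend(sorted((components / type_dir).rglob("*.md")))
        return found

    def extract(self, file_path: Path, repo_root: Path) -> ComponentMetadata | None:
        metadata, _content = parse_component_file(file_path)
        fields = normalize_frontmatter(metadata)
        if not fields.get("name"):
            return None
        rel = file_path.relative_to(repo_root / "cli-tool" / "components")
        component_type = COMPONENT_TYPE_DIRS[rel.parts[0]]
        # Category is the directory between the type dir and the file, if any.
        category = rel.parts[1] if len(rel.parts) > 2 else None
        # Final construction elided: ComponentMetadata fields and generate_id
        # come from entities/components.py (plan 02-01).
        ...
```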
**FlatDirectoryStrategy:**
- `can_handle`: Check `(repo_root / ".claude").is_dir()` and at least one recognized subdirectory exists (agents, commands, skills, etc.)
- `discover`: Look for `.claude/{type_dir}/` directories matching COMPONENT_TYPE_DIRS keys. Glob `*.md` in each.
- `extract`: Similar to Davila7 but path structure is `.claude/{type}/{file}.md`. Infer component_type from directory name. Constructor: `FlatDirectoryStrategy(repo_owner: str, repo_name: str)`.
**GenericMarkdownStrategy:**
- `can_handle`: Always returns True (fallback). Should be registered last.
- `discover`: `repo_root.rglob("*.md")` excluding hidden dirs (`.git`, `.github`, `node_modules`, `__pycache__`). Filter to files that actually have recognized frontmatter fields (`name` key in metadata).
- `extract`: Parse frontmatter. If `name` field exists, create ComponentMetadata. Infer component_type from frontmatter hints or default to AGENT. Constructor: `GenericMarkdownStrategy(repo_owner: str, repo_name: str)`.
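A sketch of the fallback's discovery filter, again in the same module; its `extract` mirrors the Davila7 version with the AGENT default described above:
```python
# Continuing extractors.py: the fallback strategy's discovery filter (sketch).
_EXCLUDED_DIRS = {".git", ".github", "node_modules", "__pycache__"}


class GenericMarkdownStrategy:
    """Last-resort strategy for arbitrary repo layouts (sketch)."""

    def __init__(self, repo_owner: str, repo_name: str) -> None:
        self.repo_owner = repo_owner
        self.repo_name = repo_name

    def can_handle(self, repo_root: Path) -> bool:
        return True  # fallback; must be registered last

    def discover(self, repo_root: Path) -> list[Path]:
        found: list[Path] = []
        for path in sorted(repo_root.rglob("*.md")):
            if _EXCLUDED_DIRS & set(path.relative_to(repo_root).parts):
                continue  # skip VCS, vendored, and cache directories
            metadata, _ = parse_component_file(path)
            if "name" in metadata:  # keep only files with recognized frontmatter
                found.append(path)
        return found
```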
**crawler.py:**
Create `RepositoryCrawler`:
- Constructor: `__init__(self, repo_owner: str, repo_name: str, repo_path: Path)`
- Attribute `strategies: list[ExtractionStrategy]` initialized in order: `[Davila7Strategy(...), FlatDirectoryStrategy(...), GenericMarkdownStrategy(...)]`
- Method `crawl(self) -> list[ComponentMetadata]`:
1. Iterate strategies, find first where `can_handle(self.repo_path)` returns True
2. Call `strategy.discover(self.repo_path)` to get file paths
3. For each file path, call `strategy.extract(file_path, self.repo_path)`
4. For each successful extraction, call `extract_git_signals(self.repo_path, file_path.relative_to(self.repo_path).as_posix())` and merge signals into ComponentMetadata via `model_copy(update=signals)`
5. Collect and return all non-None ComponentMetadata objects
6. Log (using `logging`) the strategy selected and count of components found
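A minimal sketch of the crawler wiring the pieces above together:
```python
"""crawler.py sketch — orchestrates strategy selection and extraction."""
import logging
from pathlib import Path

from skill_retriever.entities.components import ComponentMetadata
from skill_retriever.nodes.ingestion.extractors import (
    Davila7Strategy,
    ExtractionStrategy,
    FlatDirectoryStrategy,
    GenericMarkdownStrategy,
)
from skill_retriever.nodes.ingestion.git_signals import extract_git_signals

logger = logging.getLogger(__name__)


class RepositoryCrawler:
    def __init__(self, repo_owner: str, repo_name: str, repo_path: Path) -> None:
        self.repo_path = repo_path
        self.strategies: list[ExtractionStrategy] = [
            Davila7Strategy(repo_owner, repo_name),
            FlatDirectoryStrategy(repo_owner, repo_name),
            GenericMarkdownStrategy(repo_owner, repo_name),  # always matches
        ]

    def crawl(self) -> list[ComponentMetadata]:
        # The generic fallback guarantees next() finds a strategy.
        strategy = next(s for s in self.strategies if s.can_handle(self.repo_path))
        results: list[ComponentMetadata] = []
        for file_path in strategy.discover(self.repo_path):
            component = strategy.extract(file_path, self.repo_path)
            if component is None:
                continue
            signals = extract_git_signals(
                self.repo_path, file_path.relative_to(self.repo_path).as_posix()
            )
            results.append(component.model_copy(update=signals))
        logger.info("strategy=%s components=%d", type(strategy).__name__, len(results))
        return results
```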
**Update __init__.py:**
Update `src/skill_retriever/nodes/ingestion/__init__.py` to re-export `RepositoryCrawler`.
</action>
<verify>
Run `uv run pyright src/skill_retriever/nodes/ingestion/` — zero errors.
Run `uv run ruff check src/skill_retriever/nodes/ingestion/` — zero errors.
Run `uv run python -c "from skill_retriever.nodes.ingestion import RepositoryCrawler; print('OK')"`.
</verify>
<done>
Three extraction strategies handle davila7, flat, and generic repo layouts. RepositoryCrawler auto-selects strategy and produces ComponentMetadata list with git health signals merged.
</done>
</task>
<task type="auto">
<name>Task 3: Write ingestion pipeline tests</name>
<files>tests/test_ingestion.py</files>
<action>
Create `tests/test_ingestion.py`. Put the shared path fixtures in `tests/conftest.py` (create it if it does not exist yet).
**Fixtures** (in `tests/conftest.py`):
- `davila7_repo` — returns `Path` to `tests/fixtures/davila7_sample`
- `flat_repo` — returns `Path` to `tests/fixtures/flat_sample`
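A minimal sketch of the two fixtures:
```python
# tests/conftest.py sketch — shared paths to the fixture repositories.
from pathlib import Path

import pytest

FIXTURES = Path(__file__).parent / "fixtures"


@pytest.fixture
def davila7_repo() -> Path:
    return FIXTURES / "davila7_sample"


@pytest.fixture
def flat_repo() -> Path:
    return FIXTURES / "flat_sample"
```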
**Test cases:**
1. `test_parse_component_file_with_frontmatter` — Parse the prompt-engineer.md fixture. Verify metadata dict has `name`, `description`, `tools`. Verify content contains "Expertise Areas".
2. `test_parse_component_file_no_frontmatter` — Create a temp file with no frontmatter (just markdown). Verify returns empty dict and full content.
3. `test_normalize_frontmatter_allowed_tools` — Pass `{"allowed-tools": ["Read", "Write"]}`, verify output has `tools` key with `["Read", "Write"]`.
4. `test_normalize_frontmatter_string_tags` — Pass `{"tags": "ai, testing, code"}`, verify output has `tags` as `["ai", "testing", "code"]` (stripped).
5. `test_extract_git_signals_no_git` — Call on fixture dir (no .git). Verify defaults returned.
6. `test_davila7_strategy_can_handle` — Davila7Strategy recognizes the davila7_sample fixture.
7. `test_davila7_strategy_cannot_handle_flat` — Davila7Strategy rejects the flat_sample fixture.
8. `test_davila7_strategy_discover` — Discover files in davila7_sample, verify at least 3 paths returned (prompt-engineer.md, SKILL.md, pre-commit.md).
9. `test_davila7_strategy_extract_agent` — Extract prompt-engineer.md. Verify: name="prompt-engineer", component_type=AGENT, description non-empty, tools contains "Read", category="ai-specialists".
10. `test_flat_strategy_can_handle` — FlatDirectoryStrategy recognizes flat_sample.
11. `test_flat_strategy_discover` — Discover files in flat_sample, verify 2 paths.
12. `test_flat_strategy_extract` — Extract code-reviewer.md. Verify: name="code-reviewer", component_type=AGENT, tools contains "Read".
13. `test_crawler_davila7` — Create RepositoryCrawler for davila7_sample, call crawl(). Verify returns list of ComponentMetadata, length >= 3, all have non-empty name and id.
14. `test_crawler_flat` — Same for flat_sample, length >= 2.
15. `test_crawler_component_ids_are_deterministic` — Crawl twice, verify same IDs produced.
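A sketch of two representative cases; the `repo_owner`/`repo_name` values passed to the crawler are illustrative placeholders:
```python
# test_ingestion.py sketch — cases 3 and 15 from the list above.
from pathlib import Path

from skill_retriever.nodes.ingestion.crawler import RepositoryCrawler
from skill_retriever.nodes.ingestion.frontmatter import normalize_frontmatter


def test_normalize_frontmatter_allowed_tools() -> None:
    out = normalize_frontmatter({"allowed-tools": ["Read", "Write"]})
    assert out["tools"] == ["Read", "Write"]


def test_crawler_component_ids_are_deterministic(davila7_repo: Path) -> None:
    crawler = RepositoryCrawler("davila7", "davila7_sample", davila7_repo)
    first = [c.id for c in crawler.crawl()]
    assert first == [c.id for c in crawler.crawl()]
```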
</action>
<verify>Run `uv run pytest tests/test_ingestion.py -v` — all 15 tests pass.</verify>
<done>15 tests verify frontmatter parsing, field normalization, strategy selection, extraction from both repo layouts, and crawler determinism. Both success criteria #1 (davila7 extraction) and #2 (flat directory extraction) are covered.</done>
</task>
</tasks>
<verification>
```bash
uv run pytest tests/test_ingestion.py tests/test_entities.py -v
uv run pyright src/skill_retriever/nodes/ingestion/
uv run ruff check src/skill_retriever/nodes/
```
All ingestion tests pass. Pyright strict and ruff clean on the ingestion subpackage.
</verification>
<success_criteria>
- Davila7 fixture extraction produces ComponentMetadata for agent, skill, and hook files
- Flat fixture extraction produces ComponentMetadata for agent and command files
- Frontmatter field normalization handles tools/allowed-tools divergence
- Git signal extraction gracefully defaults when no .git directory
- Crawler auto-selects correct strategy per repo structure
- Component IDs are deterministic across runs
- All 15 ingestion tests pass
- Pyright strict + ruff clean
</success_criteria>
<output>
After completion, create `.planning/phases/02-domain-models-ingestion/02-02-SUMMARY.md`
</output>