# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
MCP server for hierarchical RAG over 299 transcripts of Lenny Rachitsky's podcast. Provides semantic search with progressive disclosure: search → chapter → full transcript.
## Commands
```bash
# Install and run
pip install -e .
python -m src.server # Run MCP server directly
lenny-server # Run via entry point

# Preprocessing (requires Claude CLI)
python scripts/preprocess_haiku.py # Process unprocessed transcripts
python scripts/preprocess_haiku.py --file "Guest.txt" # Single file
python scripts/preprocess_haiku.py --limit 50 # Batch of 50
python scripts/preprocess_haiku.py --offset 50 --limit 50 # For parallel batches

# Embedding
python scripts/embed.py # Build embeddings (incremental)
python scripts/embed.py --rebuild # Rebuild all embeddings

# Dev tools
black --line-length 100 src/
ruff check src/
```
## Architecture
### Data Pipeline
```
transcripts/*.txt → preprocess_haiku.py → preprocessed/*.json → embed.py → chroma_db/
```
1. **Transcripts** (`transcripts/`): 299 raw .txt podcast transcripts
2. **Preprocessing**: Uses the Claude CLI with the Haiku model and `prompts/extraction.md` to extract structured JSON (topics, insights, examples)
3. **Embedding**: bge-small-en-v1.5 embeddings stored in ChromaDB under four document types: episode, topic, insight, example
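
In code terms, the embed step is roughly the following (a minimal sketch, assuming the `sentence-transformers` and `chromadb` packages; the collection name and metadata keys are illustrative, not read from `embed.py`):

```python
# Sketch: encode preprocessed documents and store them in ChromaDB.
import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")
client = chromadb.PersistentClient(path="chroma_db")
collection = client.get_or_create_collection("lenny")  # hypothetical collection name

def index_documents(docs: list[dict]) -> None:
    """Each doc carries an id, text, episode, and one of the four
    document types: episode, topic, insight, or example."""
    embeddings = model.encode([d["text"] for d in docs]).tolist()
    collection.add(
        ids=[d["id"] for d in docs],
        embeddings=embeddings,
        documents=[d["text"] for d in docs],
        metadatas=[{"type": d["type"], "episode": d["episode"]} for d in docs],
    )
```

Keeping `type` in metadata is what makes `type_filter` in `search_lenny` a cheap `where` clause rather than a separate index.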
### MCP Tools
- `search_lenny(query, top_k, type_filter)` - Semantic search; returns pointers for drilling down into chapters and transcripts
- `get_chapter(episode, topic_id)` - Load topic with insights, examples, transcript segment
- `get_full_transcript(episode)` - Full episode transcript with metadata
- `list_episodes(expertise_filter)` - Browse available episodes
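
As a sketch of this contract (using the `FastMCP` helper from the Python MCP SDK; `src/server.py` may wire handlers differently, and the parameter defaults here are assumptions):

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("lenny")

@mcp.tool()
def search_lenny(query: str, top_k: int = 10, type_filter: str | None = None) -> list[dict]:
    """Semantic search; each hit carries episode + topic_id pointers for drill-down."""
    ...

@mcp.tool()
def get_chapter(episode: str, topic_id: str) -> dict:
    """Resolve a search pointer into a topic's insights, examples, and transcript segment."""
    ...
```

`get_full_transcript` and `list_episodes` follow the same pattern, one tool per disclosure level.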
### Key Files
- `src/server.py` - MCP tool definitions and handlers
- `src/retrieval.py` - `LennyRetriever` class wrapping ChromaDB queries
- `src/utils.py` - File loading helpers (`load_transcript`, `load_preprocessed`, `get_topic_by_id`)
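
`LennyRetriever` is presumably a thin wrapper over a filtered ChromaDB query, along these lines (a sketch; the method and attribute names are illustrative, only the `collection.query` call reflects the real ChromaDB API):

```python
class LennyRetriever:
    """Thin wrapper around a ChromaDB collection of episode/topic/insight/example docs."""

    def __init__(self, collection, model):
        self.collection = collection  # chromadb collection built by embed.py
        self.model = model            # bge-small-en-v1.5 encoder

    def search(self, query: str, top_k: int = 10, type_filter: str | None = None) -> dict:
        where = {"type": type_filter} if type_filter else None
        return self.collection.query(
            query_embeddings=self.model.encode([query]).tolist(),
            n_results=top_k,
            where=where,  # optionally restrict to one of the four document types
        )
```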
### Extracted Data Schema
```json
{
  "episode": { "guest", "expertise_tags", "summary", "key_frameworks" },
  "topics": [{ "id", "title", "summary", "line_start", "line_end" }],
  "insights": [{ "id", "text", "context", "topic_id", "line_start", "line_end" }],
  "examples": [{ "id", "explicit_text", "inferred_identity", "confidence", "tags", "lesson", "topic_id" }]
}
```
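
The block above lists field names only; values are elided. Reading and traversing an extraction matches the helpers named in `src/utils.py`, roughly (a sketch; the one-JSON-per-episode layout under `preprocessed/` is an assumption):

```python
import json
from pathlib import Path

def load_preprocessed(episode: str) -> dict:
    """Read the extracted JSON for one episode (assumes preprocessed/<episode>.json)."""
    return json.loads(Path("preprocessed", f"{episode}.json").read_text())

def get_topic_by_id(data: dict, topic_id: str) -> dict | None:
    """Find a topic dict by its normalized topic_N id."""
    return next((t for t in data["topics"] if t["id"] == topic_id), None)
```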
## Notes
- Topic IDs are normalized to `topic_N` format across all preprocessed files
- Preprocessing validation: fewer than 10 topics, 15 insights, or 10 examples in an extraction triggers a warning
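
Together, those two notes imply checks along these lines (a sketch; the re-numbering strategy and warning format are illustrative):

```python
def normalize_topic_ids(data: dict) -> None:
    """Rewrite topic ids to the canonical topic_N form, updating cross-references."""
    mapping = {t["id"]: f"topic_{i + 1}" for i, t in enumerate(data["topics"])}
    for t in data["topics"]:
        t["id"] = mapping[t["id"]]
    for item in data["insights"] + data["examples"]:
        item["topic_id"] = mapping.get(item["topic_id"], item["topic_id"])

def validate_counts(data: dict) -> list[str]:
    """Flag suspiciously sparse extractions per the documented thresholds."""
    warnings = []
    for key, minimum in (("topics", 10), ("insights", 15), ("examples", 10)):
        if len(data[key]) < minimum:
            warnings.append(f"only {len(data[key])} {key} (expected >= {minimum})")
    return warnings
```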