# YouTube Knowledge Base MCP - Full RAG Pipeline
## INGESTION PIPELINE (YouTube URL → Stored Chunks)
```
1. VIDEO ACQUISITION
   Tool:   yt-dlp
   Input:  YouTube URL
   Output: Video metadata + subtitle file (.vtt)

   [YouTube URL] --> [yt-dlp (download)] --> [Subtitles + Metadata]

                             |
                             v

2. TRANSCRIPT PARSING
   Tool:      webvtt-py
   Technique: VTT timestamp extraction
   Output:    List of {text, start_time, end_time} segments

   [.vtt file] --> [VTT Parser] --> [Timestamped Segments]

                             |
                             v

3. SEMANTIC CHUNKING
   Technique: Embedding-based topic-shift detection
   Model:     all-MiniLM-L6-v2 (local, FREE) - NOT voyage-3-large
   Config:    ~500 chars/chunk, 150-char overlap

   Cost optimization: the cheap local model is used for topic detection
   only; Voyage/OpenAI is used only for final chunk embeddings (Step 5).

   How it works:
     1. Split the transcript into sentences
     2. Add a context window (1 sentence before/after each)
     3. Embed each sentence with all-MiniLM-L6-v2 (local, ~80 MB)
     4. Compute cosine distance between consecutive embeddings
     5. Place breakpoints at the 80th percentile of distances (topic shifts)
     6. Group sentences between breakpoints into chunks

   Fallback: if sentence-transformers is not installed, falls back to
   sentence-boundary chunking.

   [Timestamped Segments] --> [Semantic Chunker] --> [Coherent Chunks]

   Preserves: timestamp boundaries for each chunk

                             |
                             v

4. CONTEXTUAL RETRIEVAL (optional)
   Technique: Anthropic's Contextual Retrieval
   Model:     gpt-4o-mini (configurable)
   Purpose:   Add document-level context to each chunk

   For each chunk, the LLM generates context like:
     "This chunk discusses the relationship between exponentials and
      the Laplace transform, following the introduction of complex
      frequency in the previous section."

   [Semantic Chunks] --> [LLM (gpt-4o-mini)] --> [Chunks + Context]

   Cost:   1 API call per chunk (~1.8 s each)
   Toggle: DISABLE_CONTEXTUAL_RETRIEVAL=true

                             |
                             v

5. EMBEDDING GENERATION (provider locked)
   Default:     voyage-3-large (Voyage AI), 1024 dimensions
   Alternative: openai, bge, ollama (explicit config, NO FALLBACK)

   IMPORTANT: the provider is locked on first ingestion. All chunks must
   use the same provider. Switching requires:
     kb db migrate-embeddings --to X

   Input: "[context]\n\n[content]" if context exists,
          otherwise "[content]" only

   [Chunks + Context] --> [voyage-3-large (1024-dim)] --> [Vectors (1024-dim)]

                             |
                             v

6. STORAGE
   Database:     LanceDB (embedded vector database)
   Vector index: IVF-PQ for approximate nearest neighbor
   FTS index:    Tantivy (BM25) for full-text search

   [Vectors + Metadata] --> [LanceDB (persist)]

   Stored per chunk:
     - content (original text)
     - context (LLM-generated, if enabled)
     - vector (1024 floats)
     - timestamp_start, timestamp_end
     - source_id, tags, source_channel, source_type
```
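The breakpoint logic in step 3 can be sketched as follows. This is a minimal illustration operating on precomputed sentence embeddings (the real pipeline produces them locally with all-MiniLM-L6-v2); `find_chunks` is a hypothetical helper name, not the tool's actual API.

```python
import numpy as np

def find_chunks(sentences, embeddings, percentile=80):
    """Group sentences into chunks by splitting at topic shifts."""
    if len(sentences) < 2:
        return [" ".join(sentences)]
    vecs = np.asarray(embeddings, dtype=float)
    # Cosine distance between each pair of consecutive sentence embeddings.
    norms = np.linalg.norm(vecs, axis=1)
    sims = np.sum(vecs[:-1] * vecs[1:], axis=1) / (norms[:-1] * norms[1:])
    distances = 1.0 - sims
    # Distances above the chosen percentile mark topic shifts (breakpoints).
    threshold = np.percentile(distances, percentile)
    chunks, current = [], [sentences[0]]
    for i, dist in enumerate(distances):
        if dist > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i + 1])
    chunks.append(" ".join(current))
    return chunks
```

On toy 2-D embeddings where the third sentence points in a new direction, the split lands between sentences two and three, matching the "group sentences between breakpoints" step above.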
---
## RETRIEVAL PIPELINE (Question → Answer)
```
1. QUERY TRANSFORMATION (HyDE)
   Technique: Hypothetical Document Embeddings
   Model:     gpt-4o-mini
   Purpose:   Bridge the vocabulary gap between question and document

   Problem:  the user asks "What is Laplace?" but the document says
             "The transform converts time-domain to s-domain..."
   Solution: generate a hypothetical answer, then embed that instead

   [Query: "What is Laplace?"] --> [LLM (HyDE)] --> [Hypothetical Answer]

   Output: "The Laplace transform is a mathematical operation that
           converts a function of time into a function of complex
           frequency, enabling analysis of differential equations..."

                             |
                             v

2. QUERY EMBEDDING
   Model: voyage-3-large (same as ingestion, for consistency)
   Input: HyDE-transformed query (or the original if HyDE is disabled)

   [Hypothetical Answer] --> [voyage-3-large (1024-dim)] --> [Query Vector]

                             |
                             v

3. HYBRID RETRIEVAL (Stage 1: high recall) - SPLIT PIPELINE
   Technique: Vector Search + Full-Text Search + RRF fusion
   Database:  LanceDB
   Limit:     5x the final limit (candidate pool for reranking)

   CRITICAL: FTS uses the ORIGINAL query, not the HyDE output. If the
   user searches "Error 404", BM25 must match the literal "Error 404",
   not HyDE's "page not found issues".

                [Query]
                   |
                   +---------------------------+
                   |                           | (original query)
                   v                           v
           [HyDE Generation]            [FTS Search (BM25)]
                   |                           |
                   | (hypothetical doc)        |
                   v                           |
       [Vector Embed (Voyage AI)]              |
                   |                           |
                   v                           |
            [Vector Search]                    |
                   |                           |
                   +-------------+-------------+
                                 v
                          [RRF Fusion]    Reciprocal Rank Fusion (K=60);
                                 |        combines rankings from both searches
                                 v
                          [Candidates]    ~50 chunks (high recall)

                             |
                             v

4. CROSS-ENCODER RERANKING (Stage 2: high precision)
   Model:     ms-marco-MiniLM-L-12-v2 (FlashRank, ~34 MB, local)
   Technique: Cross-encoder scoring of (query, document) pairs

   Why a cross-encoder beats a bi-encoder:
     Bi-encoder (embedding):  query -> vec_q, doc -> vec_d, cos(q, d)
       Fast (separate encoding), but loses query-document interaction.
     Cross-encoder:           [query] [SEP] [doc] -> transformer -> score
       Slow (joint encoding), but sees the full query-document interaction.

   Input: original query + "[context]\n\n[content]" for each candidate
          (the context field improves reranking when available!)

   [Candidates (~50)] --> [Cross-Encoder (pairwise)] --> [Reranked Results]

                             |
                             v

5. DEDUPLICATION (Stage 3: diversity)
   Technique: Jaccard similarity on word sets
   Threshold: 0.9 (90% word overlap = duplicate)

   Why needed: the reranker optimizes for RELEVANCE, not DIVERSITY.
   Problem:    5 near-identical chunks about Laplace all score 0.99.
   Solution:   keep the first, skip the duplicates.

   [Reranked Results] --> [Jaccard Dedup] --> [Diverse Results]

                             |
                             v

6. ENRICHMENT
   Purpose: Add display metadata from the Source table
   Added:   source_title, source_url, timestamp_link

   [Diverse Results] --> [Source Lookup] --> [Final Results]

   Output per result:
     - chunk.content (text to display)
     - chunk.context (if available)
     - final_score (cross-encoder score, 0-1)
     - source_title ("Why Laplace transforms...")
     - timestamp_link ("https://youtube.com/watch?v=xxx&t=246s")
```
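The RRF fusion step in stage 3 is small enough to show in full. This is a generic sketch of Reciprocal Rank Fusion with the K=60 constant used above; `rrf_fuse` is an illustrative name, and each input is a list of chunk IDs ordered best-first.

```python
def rrf_fuse(vector_ranking, fts_ranking, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank(d)).

    A chunk appearing in both rankings accumulates two terms, so agreement
    between vector search and BM25 pushes it toward the top.
    """
    scores = {}
    for ranking in (vector_ranking, fts_ranking):
        for rank, chunk_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)
```

Because only ranks matter, RRF needs no score normalization between the cosine similarities of vector search and the BM25 scores of FTS, which is why it is a common choice for hybrid retrieval.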
---
## Technique Summary
| Stage | Technique |
|-------|-----------|
| Download | yt-dlp (subtitle extraction) |
| Parsing | webvtt-py (timestamp preservation) |
| Chunking | **SEMANTIC**: all-MiniLM-L6-v2 (local, FREE) for topic detection |
| | *Fallback*: Sentence-boundary aware (~500 chars) |
| Context Gen | Contextual Retrieval (Anthropic technique, gpt-4o-mini) |
| Final Embedding | **voyage-3-large** (1024-dim) - provider locked on first ingestion |
| Storage | LanceDB (embedded vector DB + Tantivy FTS) |
| Query Transform | HyDE - Hypothetical Document Embeddings (gpt-4o-mini) |
| Retrieval | **SPLIT PIPELINE**: HyDE→Vector + Original→FTS + RRF K=60 |
| Reranking | Cross-Encoder (ms-marco-MiniLM-L-12-v2, local) |
| Deduplication | Jaccard similarity (0.9 threshold) |
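The deduplication row above is simple to make concrete. A minimal sketch of Jaccard deduplication at the 0.9 threshold, keeping the first (highest-ranked) of any near-duplicate group; `dedup` is a hypothetical helper name:

```python
def dedup(chunks, threshold=0.9):
    """Drop chunks whose word-set Jaccard similarity with an already-kept
    chunk is >= threshold; preserves rank order (keep first, skip later)."""
    kept, kept_sets = [], []
    for text in chunks:
        words = set(text.lower().split())
        is_dup = any(
            len(words & seen) / len(words | seen) >= threshold
            for seen in kept_sets
            if words | seen  # skip comparison when both sets are empty
        )
        if not is_dup:
            kept.append(text)
            kept_sets.append(words)
    return kept
```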
---
## Configuration
### Embedding Provider (LOCKED - No Fallback)
**Critical**: The database locks to a single embedding provider on first ingestion.
Mixing embeddings from different providers silently corrupts search quality: vectors produced by different models are not comparable, so nearest-neighbor results become meaningless without any visible error.
**Available Providers:**
| Provider | Model | API Key Required | Notes |
|----------|-------|------------------|-------|
| `voyage` | voyage-3-large | `VOYAGE_API_KEY` | Recommended, best quality |
| `openai` | text-embedding-3-large | `OPENAI_API_KEY` | Good alternative |
| `bge` | BAAI/bge-m3 | None (local) | Requires sentence-transformers |
| `ollama` | mxbai-embed-large | None (local) | Requires Ollama server |
**Provider Lock Behavior:**
| Scenario | Behavior |
|----------|----------|
| Fresh database | Locks to configured provider on first ingestion |
| Legacy database (has data, no metadata) | Trust On First Use - locks to current config, warns |
| Provider mismatch | **Fails immediately** with clear error |
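The lock rules in the table above can be sketched as a single check. This is an illustrative reconstruction, not the tool's actual internals; `check_provider` and its parameters are hypothetical names.

```python
def check_provider(stored_provider, configured_provider, has_data):
    """Enforce the provider lock: fresh DBs lock to the configured provider,
    legacy DBs lock Trust-On-First-Use with a warning, mismatches fail fast."""
    if stored_provider is None:
        if has_data:
            # Legacy database (has data, no metadata): Trust On First Use.
            print(f"warning: locking legacy database to '{configured_provider}'")
        return configured_provider
    if stored_provider != configured_provider:
        raise RuntimeError(
            f"embedding provider mismatch: database is locked to "
            f"'{stored_provider}' but config says '{configured_provider}'; "
            f"run 'kb db migrate-embeddings --to {configured_provider}'"
        )
    return stored_provider
```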
**Switching Providers:**
```bash
# Check current configuration
kb config
# Migrate all chunks to a new provider
kb db migrate-embeddings --to openai
# Or skip confirmation
kb db migrate-embeddings --to voyage --yes
```
### Environment Variables
| Variable | Description |
|----------|-------------|
| `EMBEDDING_PROVIDER` | `voyage`, `openai`, `bge`, or `ollama` |
| `VOYAGE_API_KEY` | Voyage AI API key |
| `OPENAI_API_KEY` | OpenAI API key (also used for HyDE/Context) |
| `DISABLE_CONTEXTUAL_RETRIEVAL=true` | Skip context generation (faster) |
| `YOUTUBE_KB_DATA_DIR` | Custom data directory |
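Putting the variables above together, a minimal environment file might look like this (key values are placeholders; only the variables listed in the table are used):

```shell
# Lock embeddings to Voyage AI (the recommended default)
EMBEDDING_PROVIDER=voyage
VOYAGE_API_KEY=...
# OpenAI key is still needed for HyDE and contextual retrieval
OPENAI_API_KEY=...
# Uncomment to skip context generation for faster ingestion
# DISABLE_CONTEXTUAL_RETRIEVAL=true
# Optional: custom data directory
# YOUTUBE_KB_DATA_DIR=/path/to/data
```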
---
## Performance Characteristics
### Ingestion Time (per video)
| Mode | Time | API Calls |
|------|------|-----------|
| Without contextual retrieval | ~10s | 1 (embedding batch) |
| With contextual retrieval | ~2min | N+1 (1 per chunk + summary) |
### Search Latency
| Component | Time |
|-----------|------|
| HyDE transformation | ~500ms |
| Query embedding | ~100ms |
| Hybrid retrieval | ~50ms |
| Cross-encoder reranking | ~200ms |
| **Total** | **~2-3s** |