# YouTube Knowledge Base MCP - Full RAG Pipeline
## INGESTION PIPELINE (YouTube URL → Stored Chunks)
```
┌─────────────────────────────────────────────────────────────────────────────┐
│ 1. VIDEO ACQUISITION │
│ ──────────────────── │
│ Tool: yt-dlp │
│ Input: YouTube URL │
│ Output: Video metadata + Subtitle file (.vtt) │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ YouTube URL │ ───► │ yt-dlp │ ───► │ Subtitles │ │
│ │ │ │ (download) │ │ + Metadata │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
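```

A minimal sketch of this step, assuming yt-dlp's Python API (these are real yt-dlp options, though the exact configuration here is illustrative):

```python
import yt_dlp

def fetch_subtitles(url: str, out_dir: str = ".") -> dict:
    """Download metadata plus an English .vtt subtitle file; skip the video itself."""
    opts = {
        "skip_download": True,        # subtitles + metadata only, no video
        "writesubtitles": True,       # manually uploaded subtitles
        "writeautomaticsub": True,    # fall back to auto-generated captions
        "subtitleslangs": ["en"],
        "subtitlesformat": "vtt",
        "outtmpl": f"{out_dir}/%(id)s.%(ext)s",
    }
    with yt_dlp.YoutubeDL(opts) as ydl:
        info = ydl.extract_info(url, download=True)
    return {"id": info["id"], "title": info["title"], "channel": info.get("channel")}
```

```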
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ 2. TRANSCRIPT PARSING │
│ ───────────────────── │
│ Tool: webvtt-py │
│ Technique: VTT timestamp extraction │
│ Output: List of {text, start_time, end_time} segments │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ .vtt file │ ───► │ VTT Parser │ ───► │ Timestamped │ │
│ │ │ │ │ │ Segments │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
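```

A sketch of the parsing step with webvtt-py; the timestamp helper assumes the normalized `HH:MM:SS.mmm` form that webvtt-py emits:

```python
import webvtt

def to_seconds(ts: str) -> float:
    """Convert 'HH:MM:SS.mmm' to seconds."""
    h, m, s = ts.split(":")
    return int(h) * 3600 + int(m) * 60 + float(s)

def parse_vtt(path: str) -> list[dict]:
    segments = []
    for caption in webvtt.read(path):
        text = caption.text.strip().replace("\n", " ")
        if text:
            segments.append({
                "text": text,
                "start_time": to_seconds(caption.start),
                "end_time": to_seconds(caption.end),
            })
    return segments
```

```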
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ 3. SEMANTIC CHUNKING │
│ ──────────────────── │
│ Technique: Embedding-based topic shift detection │
│ Model: all-MiniLM-L6-v2 (local, FREE) - NOT voyage-3-large │
│ Config: ~500 chars/chunk, 150 char overlap │
│ │
│ Cost optimization: the cheap local model is used for topic detection only; │
│ Voyage/OpenAI is used only for the final chunk embeddings (Step 5). │
│ │
│ How it works: │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ 1. Split transcript into sentences │ │
│ │ 2. Add context window (1 sentence before/after each) │ │
│ │ 3. Embed each sentence with all-MiniLM-L6-v2 (local, ~80MB) │ │
│ │ 4. Calculate cosine distance between consecutive embeddings │ │
│ │ 5. Find breakpoints at 80th percentile (topic shifts) │ │
│ │ 6. Group sentences between breakpoints into chunks │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ Fallback: if sentence-transformers is not installed, plain sentence-boundary │
│ chunking (~500 chars) is used instead. │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Timestamped │ ───► │ Semantic │ ───► │ Coherent │ │
│ │ Segments │ │ Chunker │ │ Chunks │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
│ Preserves: Timestamp boundaries for each chunk │
└─────────────────────────────────────────────────────────────────────────────┘
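```

A condensed sketch of the breakpoint logic described above, using sentence-transformers and numpy; sentence splitting, timestamp carry-over, and the ~500-char/150-char-overlap sizing are omitted for brevity:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # ~80MB, runs locally

def semantic_chunks(sentences: list[str]) -> list[str]:
    if len(sentences) < 3:
        return [" ".join(sentences)]
    # Step 2: widen each sentence with one neighbor on each side.
    windows = [" ".join(sentences[max(0, i - 1):i + 2])
               for i in range(len(sentences))]
    # Step 3: embed locally; unit-normalized so dot product == cosine similarity.
    emb = model.encode(windows, normalize_embeddings=True)
    # Step 4: cosine distance between consecutive windows.
    distances = 1.0 - np.sum(emb[:-1] * emb[1:], axis=1)
    # Step 5: breakpoints where distance exceeds the 80th percentile.
    threshold = np.percentile(distances, 80)
    breaks = [i + 1 for i, d in enumerate(distances) if d > threshold]
    # Step 6: group sentences between breakpoints into chunks.
    bounds = [0, *breaks, len(sentences)]
    return [" ".join(sentences[a:b]) for a, b in zip(bounds, bounds[1:])]
```

```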
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ 4. CONTEXTUAL RETRIEVAL (Optional) │
│ ────────────────────────────────── │
│ Technique: Anthropic's Contextual Retrieval │
│ Model: gpt-4o-mini (configurable) │
│ Purpose: Add document-level context to each chunk │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ For each chunk, LLM generates context like: │ │
│ │ "This chunk discusses the relationship between exponentials and │ │
│ │ the Laplace transform, following the introduction of complex │ │
│ │ frequency in the previous section." │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Semantic │ ───► │ LLM │ ───► │ Chunks + │ │
│ │ Chunks │ │ (gpt-4o-mini)│ │ Context │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
│ Cost: 1 API call per chunk (~1.8s each) │
│ Toggle: DISABLE_CONTEXTUAL_RETRIEVAL=true │
└─────────────────────────────────────────────────────────────────────────────┘
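```

An illustrative sketch of the per-chunk call with the OpenAI SDK; the prompt paraphrases Anthropic's published Contextual Retrieval prompt and is not necessarily this tool's exact wording:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY

def contextualize(transcript: str, chunk: str) -> str:
    prompt = (
        f"<document>\n{transcript}\n</document>\n"
        f"Here is a chunk from the document:\n<chunk>\n{chunk}\n</chunk>\n"
        "Give a short, succinct context that situates this chunk within the "
        "overall document, to improve search retrieval of the chunk. "
        "Answer with only the context."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()
```

```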
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ 5. EMBEDDING GENERATION (Provider Locked) │
│ ─────────────────────────────────────────── │
│ Default: voyage-3-large (Voyage AI) - 1024 dimensions │
│ Alternative: openai, bge, ollama (explicit config, NO FALLBACK) │
│ │
│ IMPORTANT: Provider is locked on first ingestion. All chunks must use │
│ the same provider. Switching requires: kb db migrate-embeddings --to X │
│ │
│ Input: If context exists: "[context]\n\n[content]" │
│ Otherwise: "[content]" only │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Chunks + │ ───► │voyage-3-large│ ───► │ Vectors │ │
│ │ Context │ │ (1024-dim) │ │ (1024-dim) │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
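```

A sketch of the embedding call with the voyageai client, applying the context-prefixing rule shown above (the chunk dict shape is assumed):

```python
import voyageai

vo = voyageai.Client()  # reads VOYAGE_API_KEY

def embed_chunks(chunks: list[dict]) -> list[list[float]]:
    texts = [
        f"{c['context']}\n\n{c['content']}" if c.get("context") else c["content"]
        for c in chunks
    ]
    # input_type="document" tells Voyage these are corpus-side texts.
    return vo.embed(texts, model="voyage-3-large", input_type="document").embeddings
```

```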
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ 6. STORAGE │
│ ────────── │
│ Database: LanceDB (embedded vector database) │
│ Vector Index: IVF-PQ for approximate nearest neighbor │
│ FTS Index: Tantivy (BM25) for full-text search │
│ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Vectors │ ───► │ LanceDB │ │
│ │ + Metadata │ │ (persist) │ │
│ └──────────────┘ └──────────────┘ │
│ │
│ Stored per chunk: │
│ - content (original text) │
│ - context (LLM-generated, if enabled) │
│ - vector (1024 floats) │
│ - timestamp_start, timestamp_end │
│ - source_id, tags, source_channel, source_type │
└─────────────────────────────────────────────────────────────────────────────┘
```
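A sketch of the storage step with LanceDB's Python API; the table name and path are illustrative, and index parameters are left at their defaults:

```python
import lancedb

def store(chunks: list[dict], vectors: list[list[float]]) -> None:
    db = lancedb.connect("data/lancedb")  # illustrative path
    rows = [{**chunk, "vector": vec} for chunk, vec in zip(chunks, vectors)]
    if "chunks" in db.table_names():
        db.open_table("chunks").add(rows)
    else:
        table = db.create_table("chunks", data=rows)
        table.create_index(metric="cosine")  # IVF-PQ ANN index (needs enough rows)
        table.create_fts_index("content")    # Tantivy BM25 index
```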
---
## RETRIEVAL PIPELINE (Question → Answer)
```
┌─────────────────────────────────────────────────────────────────────────────┐
│ 1. QUERY TRANSFORMATION (HyDE) │
│ ────────────────────────────── │
│ Technique: Hypothetical Document Embeddings │
│ Model: gpt-4o-mini │
│ Purpose: Bridge vocabulary gap between question and document │
│ │
│ Problem: User asks "What is Laplace?" but document says │
│ "The transform converts time-domain to s-domain..." │
│ │
│ Solution: Generate hypothetical answer, then embed that instead │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Query │ ───► │ LLM │ ───► │ Hypothetical │ │
│ │ "What is │ │ (HyDE) │ │ Answer │ │
│ │ Laplace?" │ │ │ │ │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
│ Output: "The Laplace transform is a mathematical operation that │
│ converts a function of time into a function of complex │
│ frequency, enabling analysis of differential equations..." │
└─────────────────────────────────────────────────────────────────────────────┘
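```

A minimal HyDE sketch with the OpenAI SDK; the prompt is illustrative rather than the tool's exact wording:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY

def hyde(query: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": "Write a short passage that answers this question as if "
                       f"it came from a technical document:\n\n{query}",
        }],
    )
    return resp.choices[0].message.content
```

```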
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ 2. QUERY EMBEDDING │
│ ────────────────── │
│ Model: voyage-3-large (same as ingestion for consistency) │
│ Input: HyDE-transformed query (or original if HyDE disabled) │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Hypothetical │ ───► │voyage-3-large│ ───► │ Query │ │
│ │ Answer │ │ (1024-dim) │ │ Vector │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
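```

The same voyage-3-large model as ingestion, but with Voyage's query-side input type:

```python
import voyageai

vo = voyageai.Client()  # reads VOYAGE_API_KEY

def embed_query(hypothetical_answer: str) -> list[float]:
    # input_type="query" (vs "document" at ingestion) applies Voyage's
    # query-side encoding; same 1024-dim space as the stored chunks.
    return vo.embed(
        [hypothetical_answer], model="voyage-3-large", input_type="query"
    ).embeddings[0]
```

```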
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ 3. HYBRID RETRIEVAL (Stage 1: High Recall) - SPLIT PIPELINE │
│ ─────────────────────────────────────────────────────────── │
│ Technique: Vector Search + Full-Text Search + RRF Fusion │
│ Database: LanceDB │
│ Limit: 5x final limit (candidate pool for reranking) │
│ │
│ CRITICAL: FTS uses ORIGINAL query (not HyDE). If user searches "Error 404",│
│ BM25 must match literal "Error 404", not HyDE's "page not found issues". │
│ │
│ ┌──────────────┐ │
│ │ Query │───────────────────────────────┐ │
│ └──────┬───────┘ │ │
│ │ │ (Original Query) │
│ ▼ ▼ │
│ ┌──────────────┐ ┌─────────────┐ │
│ │ HyDE │ │ FTS Search │ │
│ │ Generation │ │ (BM25) │ │
│ └──────┬───────┘ └──────┬──────┘ │
│ │ (Hypothetical Doc) │ │
│ ▼ │ │
│ ┌──────────────┐ │ │
│ │ Vector Embed │ │ │
│ │ (Voyage AI) │ │ │
│ └──────┬───────┘ │ │
│ │ │ │
│ ▼ │ │
│ ┌──────────────┐ │ │
│ │ Vector │ │ │
│ │ Search │ │ │
│ └──────┬───────┘ │ │
│ │ │ │
│ └───────────────┬───────────────────────┘ │
│ ▼ │
│ ┌───────────────┐ │
│ │ RRF Fusion │ Reciprocal Rank Fusion (K=60) │
│ │ │ Combines rankings from both │
│ └───────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────┐ │
│ │ Candidates │ ~50 chunks (high recall) │
│ └───────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
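```

Reciprocal Rank Fusion itself is only a few lines; a sketch of the fusion above, where each document earns 1/(K + rank) from every ranking it appears in, so documents that rank well in both lists dominate:

```python
def rrf_fuse(vector_ids: list[str], fts_ids: list[str], k: int = 60) -> list[str]:
    """Fuse two ranked ID lists with Reciprocal Rank Fusion (K=60)."""
    scores: dict[str, float] = {}
    for ranking in (vector_ids, fts_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

```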
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ 4. CROSS-ENCODER RERANKING (Stage 2: High Precision) │
│ ──────────────────────────────────────────────────── │
│ Model: ms-marco-MiniLM-L-12-v2 (FlashRank, ~34MB local) │
│ Technique: Cross-encoder scoring (query-document pairs) │
│ │
│ Why cross-encoder beats bi-encoder: │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Bi-encoder (embedding): query → vec_q, doc → vec_d, score = cos(vec_q, vec_d) │ │
│ │ Fast (separate encoding), but loses query-doc interaction │ │
│ │ │ │
│ │ Cross-encoder: [query] [SEP] [doc] → transformer → score │ │
│ │ Slow (joint encoding), but sees full query-doc interaction │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ Input: Original query + "[context]\n\n[content]" for each candidate │
│ Context field improves reranking when available! │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Candidates │ ───► │Cross-Encoder │ ───► │ Reranked │ │
│ │ (~50) │ │ (pairwise) │ │ Results │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
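```

A sketch of this stage with FlashRank, feeding it `"[context]\n\n[content]"` per candidate as described above (the candidate dict shape is assumed):

```python
from flashrank import Ranker, RerankRequest

ranker = Ranker(model_name="ms-marco-MiniLM-L-12-v2")  # ~34MB, local

def rerank(query: str, candidates: list[dict], limit: int = 10) -> list[dict]:
    passages = [
        {
            "id": c["id"],
            "text": f"{c['context']}\n\n{c['content']}" if c.get("context") else c["content"],
            "meta": c,
        }
        for c in candidates
    ]
    scored = ranker.rerank(RerankRequest(query=query, passages=passages))
    return scored[:limit]  # FlashRank returns passages sorted by score
```

```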
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ 5. DEDUPLICATION (Stage 3: Diversity) │
│ ───────────────────────────────────── │
│ Technique: Jaccard similarity on word sets │
│ Threshold: 0.9 (90% word overlap = duplicate) │
│ │
│ Why needed: Reranker optimizes for RELEVANCE, not DIVERSITY │
│ Problem: 5 near-identical chunks about Laplace all score 0.99 │
│ Solution: Keep first, skip duplicates │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Reranked │ ───► │ Jaccard │ ───► │ Diverse │ │
│ │ Results │ │ Dedup │ │ Results │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
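```

A sketch of the dedup pass: Jaccard similarity on lowercase word sets, keeping the first (highest-ranked) of any near-duplicate pair:

```python
def deduplicate(results: list[dict], threshold: float = 0.9) -> list[dict]:
    kept: list[dict] = []
    kept_sets: list[set[str]] = []
    for r in results:  # results arrive sorted by rerank score
        words = set(r["content"].lower().split())
        is_dup = any(
            len(words & seen) / len(words | seen) >= threshold
            for seen in kept_sets
            if words | seen  # guard against empty word sets
        )
        if not is_dup:
            kept.append(r)
            kept_sets.append(words)
    return kept
```

```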
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ 6. ENRICHMENT │
│ ───────────── │
│ Purpose: Add display metadata from Source table │
│ Added: source_title, source_url, timestamp_link │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Diverse │ ───► │ Source │ ───► │ Final │ │
│ │ Results │ │ Lookup │ │ Results │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
│ Output per result: │
│ - chunk.content (text to display) │
│ - chunk.context (if available) │
│ - final_score (cross-encoder score, 0-1) │
│ - source_title ("Why Laplace transforms...") │
│ - timestamp_link ("https://youtube.com/watch?v=xxx&t=246s") │
└─────────────────────────────────────────────────────────────────────────────┘
```
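A sketch of the enrichment step; the Source lookup shape is assumed, and the timestamp link format matches the example above:

```python
def enrich(result: dict, source: dict) -> dict:
    """Attach display metadata; `source` is the Source row for result['source_id']."""
    start = int(result["timestamp_start"])
    return {
        **result,
        "source_title": source["title"],
        "source_url": source["url"],
        "timestamp_link": f"{source['url']}&t={start}s",
    }
```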
---
## Technique Summary
| Stage | Technique |
|-------|-----------|
| Download | yt-dlp (subtitle extraction) |
| Parsing | webvtt-py (timestamp preservation) |
| Chunking | **SEMANTIC**: all-MiniLM-L6-v2 (local, FREE) for topic detection |
| | *Fallback*: Sentence-boundary aware (~500 chars) |
| Context Gen | Contextual Retrieval (Anthropic technique, gpt-4o-mini) |
| Final Embedding | **voyage-3-large** (1024-dim) - provider locked on first ingestion |
| Storage | LanceDB (embedded vector DB + Tantivy FTS) |
| Query Transform | HyDE - Hypothetical Document Embeddings (gpt-4o-mini) |
| Retrieval | **SPLIT PIPELINE**: HyDE→Vector + Original→FTS + RRF K=60 |
| Reranking | Cross-Encoder (ms-marco-MiniLM-L-12-v2, local) |
| Deduplication | Jaccard similarity (0.9 threshold) |
---
## Configuration
### Embedding Provider (LOCKED - No Fallback)
**Critical**: The database locks to a single embedding provider on first ingestion.
Vectors from different models live in incompatible spaces, so mixing providers silently degrades search quality: similarity scores between mismatched embeddings are meaningless.
**Available Providers:**
| Provider | Model | API Key Required | Notes |
|----------|-------|------------------|-------|
| `voyage` | voyage-3-large | `VOYAGE_API_KEY` | Recommended, best quality |
| `openai` | text-embedding-3-large | `OPENAI_API_KEY` | Good alternative |
| `bge` | BAAI/bge-m3 | None (local) | Requires sentence-transformers |
| `ollama` | mxbai-embed-large | None (local) | Requires Ollama server |
**Provider Lock Behavior:**
| Scenario | Behavior |
|----------|----------|
| Fresh database | Locks to configured provider on first ingestion |
| Legacy database (has data, no metadata) | Trust On First Use - locks to current config, warns |
| Provider mismatch | **Fails immediately** with clear error |
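An illustrative sketch of the lock check implied by the table above; the metadata key and messages are assumptions, not the tool's actual code:

```python
def check_provider_lock(db_meta: dict, configured: str, has_data: bool) -> None:
    locked = db_meta.get("embedding_provider")
    if locked is None:
        if has_data:  # legacy database: Trust On First Use
            print(f"WARNING: no provider recorded; locking to '{configured}' (TOFU)")
        db_meta["embedding_provider"] = configured  # lock on first ingestion
    elif locked != configured:
        raise RuntimeError(
            f"Embedding provider mismatch: database locked to '{locked}', "
            f"config says '{configured}'. "
            f"Run: kb db migrate-embeddings --to {configured}"
        )
```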
**Switching Providers:**
```bash
# Check current configuration
kb config
# Migrate all chunks to a new provider
kb db migrate-embeddings --to openai
# Or skip confirmation
kb db migrate-embeddings --to voyage --yes
```
### Environment Variables
| Variable | Description |
|----------|-------------|
| `EMBEDDING_PROVIDER` | `voyage`, `openai`, `bge`, or `ollama` |
| `VOYAGE_API_KEY` | Voyage AI API key |
| `OPENAI_API_KEY` | OpenAI API key (also used for HyDE/Context) |
| `DISABLE_CONTEXTUAL_RETRIEVAL=true` | Skip context generation (faster) |
| `YOUTUBE_KB_DATA_DIR` | Custom data directory |
---
## Performance Characteristics
### Ingestion Time (per video)
| Mode | Time | API Calls |
|------|------|-----------|
| Without contextual retrieval | ~10s | 1 (embedding batch) |
| With contextual retrieval | ~2min | N+1 (1 per chunk + summary) |
### Search Latency
| Component | Time |
|-----------|------|
| HyDE transformation | ~500ms |
| Query embedding | ~100ms |
| Hybrid retrieval | ~50ms |
| Cross-encoder reranking | ~200ms |
| **Total** | **~2-3s** |