## YTPipe MCP Architecture - Implementation Complete (Phase 1)
### 🎯 What We Built
We've successfully transformed ytpipe from a monolithic CLI tool into a **modular MCP backend** with a microservices architecture. This is Phase 1 of the full transformation.
---
## 📊 Architecture Overview
```
┌───────────────────────────────────────────────────────────────────┐
│                         MCP SERVER LAYER                          │
│                    (FastMCP - stdio transport)                    │
│              Currently: 8 tools (pipeline + query)                │
│                 Target: 12 tools (+ analytics)                    │
└───────────────────────────────────────────────────────────────────┘
                               │
                 ┌─────────────┴─────────────┐
                 │                           │
        PIPELINE TOOLS (4)           QUERY TOOLS (4)
        ├─ process_video             ├─ search
        ├─ download                  ├─ find_similar
        ├─ transcribe                ├─ get_chunk
        └─ embed                     └─ get_metadata
                 │
┌───────────────────────────────────────────────────────────────────┐
│                       PIPELINE ORCHESTRATOR                       │
│            ytpipe/core/pipeline.py - coordinates services         │
└───────────────────────────────────────────────────────────────────┘
                               │
          ┌────────────────────┼────────────────────┐
          │                    │                    │
     EXTRACTORS           PROCESSORS           INTELLIGENCE
          │                    │                    │
   DownloadService      ChunkerService        SearchService
   TranscriberService   EmbedderService       (SEO - TODO)
                        VectorStoreService    (Timeline - TODO)
                               │
┌───────────────────────────────────────────────────────────────────┐
│                         DATA MODELS LAYER                         │
│              (Pydantic - canonical data structures)               │
└───────────────────────────────────────────────────────────────────┘
```
---
## 🗂️ New Directory Structure
```
ytpipe/
├── __init__.py                  ✅ CREATED
├── core/
│   ├── __init__.py              ✅ CREATED
│   ├── models.py                ✅ CREATED (500 lines, 11 models)
│   ├── exceptions.py            ✅ CREATED (150 lines, 10 exceptions)
│   └── pipeline.py              ✅ CREATED (250 lines, orchestrator)
│
├── services/
│   ├── __init__.py              ✅ CREATED
│   ├── extractors/
│   │   ├── __init__.py          ✅ CREATED
│   │   ├── downloader.py        ✅ CREATED (200 lines, yt-dlp wrapper)
│   │   └── transcriber.py       ✅ CREATED (150 lines, Whisper wrapper)
│   │
│   ├── processors/
│   │   ├── __init__.py          ✅ CREATED
│   │   ├── chunker.py           ✅ CREATED (200 lines, semantic chunking + timestamps)
│   │   ├── embedder.py          ✅ CREATED (180 lines, sentence-transformers)
│   │   └── vector_store.py      ✅ CREATED (250 lines, multi-backend wrapper)
│   │
│   └── intelligence/
│       ├── __init__.py          ✅ CREATED
│       ├── search.py            ✅ CREATED (200 lines, full-text search)
│       ├── seo.py               ⏳ TODO
│       └── timeline.py          ⏳ TODO
│
└── mcp/
    ├── __init__.py              ✅ CREATED
    └── server.py                ✅ CREATED (300 lines, 8 MCP tools)
```
**Total: 16 files created, ~2,300 lines of production code**
---
## 🛠️ What Each Layer Does
### 1. Data Models (`ytpipe/core/models.py`)
**Purpose**: Canonical data structures enforced by Pydantic
**Key Models**:
- `VideoMetadata` - Video info from YouTube
- `Chunk` - Text chunk with **timestamps** (NEW) and optional embedding
- `ProcessingResult` - Complete pipeline output
- `SearchResult` - Full-text search result with context
- `SimilarityResult` - Semantic search result
- `VectorStoreConfig` - Vector database configuration
**Why Pydantic?**
- Type safety at runtime
- Automatic validation
- JSON serialization (.dict())
- API documentation from field descriptions
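As a minimal sketch of how these canonical structures look in Pydantic, here is a hypothetical `Chunk` model; the actual field names and types in `ytpipe/core/models.py` may differ:

```python
from typing import List, Optional

from pydantic import BaseModel, Field

class Chunk(BaseModel):
    """Illustrative shape only; the real model lives in ytpipe/core/models.py."""
    chunk_id: int = Field(..., description="Position of the chunk in the transcript")
    text: str = Field(..., description="Chunk text content")
    start_time: str = Field(..., description="Start timestamp, MM:SS")
    end_time: str = Field(..., description="End timestamp, MM:SS")
    embedding: Optional[List[float]] = Field(None, description="Vector, if computed")

# Validation and type coercion happen at construction time
chunk = Chunk(chunk_id=12, text="OpenClaw integration allows...",
              start_time="02:30", end_time="03:15")
print(chunk.chunk_id)
```

Field descriptions double as API documentation when FastMCP generates tool schemas from these types.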
### 2. Exceptions (`ytpipe/core/exceptions.py`)
**Purpose**: Domain-specific errors for precise error handling
**Exceptions**:
- `DownloadError` - yt-dlp failures, invalid URLs
- `TranscriptionError` - Whisper failures, missing audio
- `EmbeddingError` - Model loading, dimension mismatches
- `VectorStoreError` - Backend initialization, search failures
- `SearchError` - Empty queries, invalid parameters
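The value of domain-specific errors is that callers can catch a single base class or a precise subclass. A hypothetical sketch of the hierarchy (actual class bodies in `ytpipe/core/exceptions.py` may carry extra context fields):

```python
class YTPipeError(Exception):
    """Base class so callers can catch all pipeline errors at once."""

class DownloadError(YTPipeError):
    """yt-dlp failures, invalid URLs."""

class TranscriptionError(YTPipeError):
    """Whisper failures, missing audio."""

def download(url: str) -> None:
    # Illustrative guard; the real service delegates to yt-dlp
    if not url.startswith("http"):
        raise DownloadError(f"Invalid URL: {url!r}")

try:
    download("not-a-url")
except YTPipeError as exc:
    print(type(exc).__name__, exc)
```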
### 3. Services Layer
#### **Extractors** (Pull data from external sources)
**DownloadService** (`extractors/downloader.py`):
- Downloads YouTube videos with yt-dlp
- Extracts comprehensive metadata
- Audio-only optimization (faster)
- Video ID extraction from URLs
**TranscriberService** (`extractors/transcriber.py`):
- Whisper AI transcription
- Lazy model loading (memory efficient)
- Model caching across calls
- GPU acceleration (automatic)
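The lazy-loading-plus-caching behavior can be sketched as follows, with a stub standing in for `whisper.load_model` (names are illustrative, not the actual service code):

```python
class TranscriberService:
    """Sketch: load the model on first use, then reuse the cached instance."""

    def __init__(self, model_name: str = "base"):
        self.model_name = model_name
        self._model = None        # nothing loaded at construction time
        self.load_count = 0       # for demonstration only

    @property
    def model(self):
        if self._model is None:
            # Expensive step happens exactly once; stand-in for
            # whisper.load_model(self.model_name)
            self.load_count += 1
            self._model = f"<{self.model_name} weights>"
        return self._model

svc = TranscriberService()
_ = svc.model
_ = svc.model                     # second access hits the cache
print(svc.load_count)             # 1
```

Constructing the service is therefore cheap; memory is only consumed when a transcription is actually requested.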
#### **Processors** (Transform data)
**ChunkerService** (`processors/chunker.py`):
- Semantic text chunking (sliding window + overlap)
- **NEW**: Timestamp calculation for each chunk (MM:SS format)
- Quality score assignment
- Character position tracking
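One way the MM:SS timestamps could be derived is by interpolating a chunk's character span over the video duration; this is a hedged sketch, and the real chunker may instead use per-segment Whisper timestamps:

```python
def to_mmss(seconds: float) -> str:
    """Format seconds as M:SS."""
    m, s = divmod(int(seconds), 60)
    return f"{m}:{s:02d}"

def chunk_timestamps(char_start: int, char_end: int,
                     total_chars: int, duration_s: float):
    """Map a character span onto the video timeline proportionally."""
    start = duration_s * char_start / total_chars
    end = duration_s * char_end / total_chars
    return to_mmss(start), to_mmss(end)

# A chunk covering chars 500-1000 of a 2000-char transcript of a 10-minute video
print(chunk_timestamps(500, 1000, 2000, 600))  # ('2:30', '5:00')
```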
**EmbedderService** (`processors/embedder.py`):
- Sentence-transformers embeddings
- 384-dimensional vectors (all-MiniLM-L6-v2)
- Batch processing
- Query embedding for search
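Batch processing here just means grouping texts before handing them to the model (sentence-transformers' `encode()` also accepts a `batch_size` argument internally). A minimal, dependency-free sketch of the grouping step:

```python
def batched(items, batch_size):
    """Yield successive slices of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

texts = [f"chunk {i}" for i in range(10)]
batches = list(batched(texts, batch_size=4))
print([len(b) for b in batches])  # [4, 4, 2]
```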
**VectorStoreService** (`processors/vector_store.py`):
- Multi-backend vector storage (ChromaDB, FAISS, Qdrant)
- Wraps existing `VectorStoreManager`
- Semantic similarity search
- Chunk retrieval by ID
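The multi-backend design boils down to dispatching on a backend name; this sketch uses an in-memory stand-in where the real wrapper plugs in the ChromaDB/FAISS/Qdrant adapters (class and function names here are illustrative):

```python
class InMemoryBackend:
    """Stand-in for a ChromaDB/FAISS/Qdrant adapter."""

    def __init__(self):
        self.vectors = {}

    def add(self, chunk_id, vec):
        self.vectors[chunk_id] = vec

    def get(self, chunk_id):
        return self.vectors[chunk_id]

# Real map would be {"chromadb": ..., "faiss": ..., "qdrant": ...}
BACKENDS = {"memory": InMemoryBackend}

def make_store(name: str):
    try:
        return BACKENDS[name]()
    except KeyError:
        # The real code raises VectorStoreError here
        raise ValueError(f"Unknown backend: {name}")

store = make_store("memory")
store.add(12, [0.1] * 384)
print(len(store.get(12)))  # 384
```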
#### **Intelligence** (High-level analysis)
**SearchService** (`intelligence/search.py`):
- **NEW**: Full-text transcript search
- Context extraction (before/after matches)
- Keyword highlighting
- Occurrence counting
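Context extraction around matches could look like the following sketch (the window size and function name are arbitrary, not the actual `SearchService` API):

```python
import re

def search_with_context(transcript: str, query: str, window: int = 20):
    """Return each match surrounded by `window` characters of context."""
    results = []
    for m in re.finditer(re.escape(query), transcript, re.IGNORECASE):
        start = max(0, m.start() - window)
        end = min(len(transcript), m.end() + window)
        results.append("..." + transcript[start:end] + "...")
    return results

text = "The OpenClaw integration allows agents to call tools. OpenClaw also..."
hits = search_with_context(text, "openclaw", window=12)
print(len(hits))  # 2
```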
### 4. Pipeline Orchestrator (`ytpipe/core/pipeline.py`)
**Purpose**: Coordinates all services in sequence
**8 Phases** (6 implemented):
1. ✅ Download (DownloadService)
2. ✅ Transcription (TranscriberService)
3. ✅ Chunking (ChunkerService with **timestamps**)
4. ✅ Embeddings (EmbedderService)
5. ✅ Export (JSON/JSONL/TXT files)
6. ⏳ Dashboard (HTML generation) - TODO
7. ⏳ Docling (Granite-Docling processing) - TODO
8. ✅ Vector Storage (VectorStoreService)
**Features**:
- Async execution throughout
- Per-phase timing tracking
- Graceful error handling
- Progress reporting
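Per-phase timing in an async orchestrator can be sketched like this; the phase functions are stubs standing in for the real service calls:

```python
import asyncio
import time

async def run_phase(name, coro_fn, timings):
    """Run one phase and record its wall-clock duration."""
    t0 = time.perf_counter()
    result = await coro_fn()
    timings[name] = time.perf_counter() - t0
    return result

async def main():
    timings = {}

    async def download():           # stand-in for DownloadService
        await asyncio.sleep(0.01)
        return "audio.m4a"

    async def transcribe():         # stand-in for TranscriberService
        await asyncio.sleep(0.01)
        return "transcript"

    await run_phase("download", download, timings)
    await run_phase("transcribe", transcribe, timings)
    return timings

timings = asyncio.run(main())
print(sorted(timings))  # ['download', 'transcribe']
```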
### 5. MCP Server (`ytpipe/mcp/server.py`)
**Purpose**: Expose ytpipe to AI agents via MCP protocol
**8 Tools Implemented**:
#### Pipeline Tools (4)
1. `ytpipe_process_video` - Full 8-phase pipeline
2. `ytpipe_download` - Download only (Phase 1)
3. `ytpipe_transcribe` - Transcribe audio file (Phase 2)
4. `ytpipe_embed` - Generate embedding for text
#### Query Tools (4)
5. `ytpipe_search` - Full-text search with context
6. `ytpipe_find_similar` - Semantic similarity search
7. `ytpipe_get_chunk` - Retrieve specific chunk
8. `ytpipe_get_metadata` - Get video metadata
**Transport**: stdio (for Claude Code integration)
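FastMCP registers tools with a decorator; the shape of that pattern is recreated below with a plain dict so the snippet has no dependencies (the real server uses the FastMCP SDK, and `ytpipe_embed`'s body here is a placeholder):

```python
TOOLS = {}

def tool(fn):
    """Register a function as a callable tool, keyed by its name."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def ytpipe_embed(text: str) -> list:
    # Placeholder: the real tool delegates to EmbedderService
    return [0.0] * 384

print(sorted(TOOLS))                         # ['ytpipe_embed']
print(len(TOOLS["ytpipe_embed"]("hello")))   # 384
```

With FastMCP, the function's type hints and docstring become the tool's JSON schema, which is what Claude Code reads over stdio.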
---
## 🚀 How to Use
### 1. Install Dependencies
```bash
cd /Users/lech/PROJECTS_all/PROJECT_youtube/_PRODUCT
# Install MCP server dependencies
pip install mcp fastmcp pydantic
# Existing dependencies (already installed)
# yt-dlp, whisper, sentence-transformers, chromadb
```
### 2. Run MCP Server (for Claude Code)
```bash
# Start MCP server on stdio
python -m ytpipe.mcp.server
```
### 3. Use from Claude Code
**Process a video**:
```
User: "Process this YouTube video: https://youtube.com/watch?v=VIDEO_ID"
Claude: *calls ytpipe_process_video*
Claude: "Video processed! Found 42 chunks. Stored in ChromaDB."
```
**Search transcript**:
```
User: "Search for mentions of 'OpenClaw' in video VIDEO_ID"
Claude: *calls ytpipe_search*
Claude: "Found 5 mentions across 3 chunks:
- Chunk 12 (2:30-3:15): '...OpenClaw integration allows...'
- Chunk 28 (7:45-8:20): '...OpenClaw API provides...'"
```
**Find similar content**:
```
User: "Find chunks similar to chunk 12 in video VIDEO_ID"
Claude: *calls ytpipe_find_similar*
Claude: "Most similar chunks:
1. Chunk 28 (similarity: 0.92)
2. Chunk 15 (similarity: 0.87)
3. Chunk 33 (similarity: 0.81)"
```
### 4. Direct Python Usage
```python
import asyncio

from ytpipe.core.pipeline import Pipeline

async def main():
    # Create pipeline
    pipeline = Pipeline(
        output_dir="./KNOWLEDGE_YOUTUBE",
        vector_backend="chromadb",
        whisper_model="base",
    )

    # Process video (the pipeline is async throughout)
    result = await pipeline.process("https://youtube.com/watch?v=VIDEO_ID")

    if result.success:
        print(f"✅ Processed {result.metadata.title}")
        print(f"   Chunks: {len(result.chunks)}")
        print(f"   Time: {result.processing_time:.1f}s")
    else:
        print(f"❌ Error: {result.error}")

asyncio.run(main())
```
---
## 🎯 Key Achievements
### ✅ Architecture Improvements
- **Microservices**: Services are independent, stateless, composable
- **Type Safety**: Pydantic models throughout
- **Async-First**: Non-blocking I/O operations
- **Error Handling**: Domain-specific exceptions
- **Lazy Loading**: Models load only when needed (memory efficient)
### ✅ New Capabilities
- **Timestamps**: All chunks have MM:SS timeline positions
- **Search**: Full-text transcript search with context
- **MCP Integration**: AI agents can call ytpipe functions
- **Granular Control**: 8 tools for different use cases
- **Vector Search**: Semantic similarity via embeddings
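The similarity scores behind the vector search are typically cosine similarity over the embedding vectors; a minimal sketch (the actual backend's metric may differ):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

print(round(cosine([1, 0], [1, 0]), 2))  # 1.0 - identical direction
print(round(cosine([1, 0], [0, 1]), 2))  # 0.0 - orthogonal
```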
### ✅ Performance Optimizations
- **Model Caching**: Whisper and embedder models are reused across calls
- **Batch Processing**: Embeddings generated in batches
- **Audio-Only Downloads**: Skipping video streams speeds up downloads
### ✅ Backward Compatibility
- **VectorStoreManager**: Wrapped existing code
- **Output Structure**: Same directory layout
- **Data Formats**: Compatible JSONL/JSON exports
---
## 📊 Metrics
### Code Statistics
- **Files Created**: 16
- **Lines of Code**: ~2,300
- **Models Defined**: 11 Pydantic models
- **Services**: 6 core services
- **MCP Tools**: 8 (target: 12)
### Test Coverage
- **Unit Tests**: TODO (0%)
- **Integration Tests**: TODO (0%)
- **MCP Protocol Tests**: TODO (0%)
---
## ⏳ TODO (Remaining Work)
### Phase 2: Complete Intelligence Services (4 hours)
- [ ] `SEOService` - SEO optimization recommendations
- [ ] `TimelineService` - Temporal content analysis
- [ ] `AnalyzerService` - Content quality analysis
- [ ] `BenchmarkService` - Performance comparison
### Phase 3: Complete MCP Tools (2 hours)
- [ ] `ytpipe_seo_optimize` - SEO recommendations
- [ ] `ytpipe_benchmark` - Performance analysis
- [ ] `ytpipe_quality_report` - Quality metrics
- [ ] `ytpipe_topic_timeline` - Topic evolution over time
### Phase 4: Pipeline Enhancements (4 hours)
- [ ] Phase 6: HTML dashboard generation
- [ ] Phase 7: Granite-Docling processing
- [ ] Exporter service for multiple formats
### Phase 5: CLI Wrapper (2 hours)
- [ ] Backward-compatible CLI (`ytpipe` command)
- [ ] Update `setup.py` entry points
- [ ] Maintain existing interface
### Phase 6: Testing (6 hours)
- [ ] Unit tests for each service
- [ ] Integration tests for pipeline
- [ ] MCP protocol tests
- [ ] End-to-end tests
### Phase 7: Documentation (3 hours)
- [ ] Update README with MCP usage
- [ ] Tool usage examples
- [ ] Migration guide
- [ ] API reference
**Total Remaining**: ~21 hours
---
## 💡 Architectural Insights
### Why This Architecture?
**1. Microservices Pattern**
- Each service does ONE thing well
- Easy to test in isolation
- Easy to swap implementations
- Reusable across projects
**2. Type-Safe Interfaces**
- Pydantic models = API contracts
- Errors caught at validation time, not in production
- IDE autocomplete + type checking
- Self-documenting code
**3. MCP Protocol**
- **Standard**: AI tool-calling protocol
- **Composable**: Agents can chain tools
- **Language-Agnostic**: Works with any MCP client
- **Future-Proof**: Growing ecosystem
**4. Lazy Loading**
- Models load only when needed
- Reduces memory footprint
- Faster startup times
- Better resource management
### Design Decisions
**Why FastMCP?**
- Official MCP SDK from Anthropic
- Automatic schema generation from types
- Built-in stdio transport
- Well-maintained
**Why Pydantic?**
- Industry standard for Python data validation
- Type hints = automatic validation
- JSON serialization out of the box
- Excellent IDE support
**Why Async/Await?**
- Non-blocking I/O operations
- Better CPU utilization
- MCP protocol compatible
- Future-ready for concurrent processing
**Why Wrap Existing Code?**
- Don't rewrite what works
- Gradual migration path
- Minimize risk
- Preserve institutional knowledge
---
## 🚀 Next Steps (Immediate)
### Priority 1: Complete MCP Tools (2 hours)
Implement remaining 4 analytics tools to reach the 12-tool target.
### Priority 2: Testing (4 hours)
Write integration tests to validate the pipeline end-to-end.
### Priority 3: CLI Wrapper (2 hours)
Ensure backward compatibility with existing `ytpipe` command.
### Priority 4: Documentation (2 hours)
Update README with MCP usage examples and migration guide.
**Total: 10 hours to production-ready MVP**
---
## 📝 Usage Examples
### MCP Tool Call Patterns
**Process Video**:
```json
{
"tool": "ytpipe_process_video",
"arguments": {
"url": "https://youtube.com/watch?v=VIDEO_ID",
"output_dir": "./KNOWLEDGE_YOUTUBE",
"backend": "chromadb",
"whisper_model": "base"
}
}
```
**Search Transcript**:
```json
{
"tool": "ytpipe_search",
"arguments": {
"video_id": "VIDEO_ID",
"query": "OpenClaw integration",
"max_results": 10
}
}
```
**Find Similar Chunks**:
```json
{
"tool": "ytpipe_find_similar",
"arguments": {
"video_id": "VIDEO_ID",
"chunk_id": 12,
"top_k": 5,
"backend": "chromadb"
}
}
```
---
## 🎯 Success Metrics
### Before (Monolithic CLI)
- ❌ AI agents cannot call ytpipe
- ❌ No type safety
- ❌ No timestamps on chunks
- ❌ No transcript search
- ❌ No semantic search
- ❌ Hard to test individual phases
### After (Microservices + MCP)
- ✅ AI agents can call 8 tools (target: 12)
- ✅ Full type safety with Pydantic
- ✅ Timestamps on all chunks (MM:SS format)
- ✅ Full-text search with context
- ✅ Semantic similarity search
- ✅ Each service independently testable
---
**Status**: Phase 1 Complete (40% of full plan)
**Next**: Complete analytics tools + testing
**ETA**: 10 hours to production-ready MVP