# YTPipe Complete Output Schema
**All information generated from processing a YouTube video**
---
## 📁 Output Directory Structure
```
{OUTPUT_DIR}/{VIDEO_ID}/
├── audio.mp3                             # Downloaded audio
└── exports/
    ├── 1. metadata.json                  # Video metadata (14 fields)
    ├── 2. chunks.jsonl                   # Semantic chunks (9 fields each)
    ├── 3. transcript.txt                 # Raw transcript text
    ├── 4. comprehensive_dashboard.html   # Interactive HTML dashboard
    ├── 5. transcript_structured.md       # Structured markdown
    ├── 6. granite_docling_enhanced.json  # Docling analysis
    ├── 7. granite_docling_enhanced.jsonl # Docling chunks (line-delimited)
    └── 8. docling_output/
        └── transcript_structured.json    # Docling processing output
```
**Plus**: Vector database in `./vector_store/{backend}/`
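A quick way to confirm a video was fully processed is to check the `exports/` directory against this layout. The helper below is a hypothetical sketch (not part of YTPipe itself), assuming the file names shown above:

```python
from pathlib import Path

# The seven export files listed in the directory tree above
# (8. docling_output/ is a directory, so it is checked separately if needed).
EXPECTED_EXPORTS = [
    "1. metadata.json",
    "2. chunks.jsonl",
    "3. transcript.txt",
    "4. comprehensive_dashboard.html",
    "5. transcript_structured.md",
    "6. granite_docling_enhanced.json",
    "7. granite_docling_enhanced.jsonl",
]

def missing_exports(video_dir: str) -> list[str]:
    """Return the names of expected export files that are absent."""
    exports = Path(video_dir) / "exports"
    return [name for name in EXPECTED_EXPORTS if not (exports / name).exists()]
```

An empty return value means every expected export file is present.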
---
## 📄 File 1: metadata.json (14 Fields)
**Purpose**: Complete video metadata and processing information
```json
{
  "video_id": "dQw4w9WgXcQ",                    // YouTube video ID (11 chars)
  "url": "https://youtube.com/...",             // Full YouTube URL
  "title": "Rick Astley - Never Gonna...",      // Video title
  "duration": 213,                              // Duration in seconds
  "upload_date": "20091025",                    // Upload date (YYYYMMDD)
  "view_count": 1738771804,                     // View count
  "like_count": 18772324,                       // Like count
  "comment_count": 2400000,                     // Comment count
  "channel": "Rick Astley",                     // Channel name
  "description": "The official video for...",   // Description (500 chars max)
  "processed_at": "2026-02-04 21:33:31.721602", // Processing timestamp
  "chunks_count": 1,                            // Total chunks generated
  "total_words": 386,                           // Total word count
  "transcription_method": "whisper-ai-base"     // Whisper model used
}
```
**Total**: 14 fields
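A minimal sketch of consuming this file, using the field names from the schema above (the `load_metadata` helper is illustrative, not part of YTPipe):

```python
import json

def load_metadata(path: str) -> dict:
    """Load metadata.json and add a human-readable duration."""
    with open(path, encoding="utf-8") as f:
        meta = json.load(f)
    # "duration" is stored in seconds; format as M:SS for display
    minutes, seconds = divmod(meta["duration"], 60)
    meta["duration_display"] = f"{minutes}:{seconds:02d}"
    return meta
```

For the example above, `duration: 213` yields a display value of `3:33`.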
---
## 📄 File 2: chunks.jsonl (9 Fields Per Chunk)
**Purpose**: Semantic text chunks with embeddings and timestamps
**Format**: JSON Lines (one chunk per line)
```json
{
  "id": 0,                                       // Chunk ID (0-indexed)
  "text": "There are no strangers to love...",   // Chunk text content
  "word_count": 386,                             // Words in this chunk
  "start_char": 0,                               // Start character position
  "end_char": 1885,                              // End character position
  "quality_score": 8.0,                          // Quality score (0-10)
  "timestamp_start": "0:00",                     // Start timestamp (MM:SS)
  "timestamp_end": "3:33",                       // End timestamp (MM:SS)
  "embedding": [0.123, 0.456, ...]               // 384-dim vector
}
```
**Total**: 9 fields per chunk
**Embedding**: 384-dimensional float array
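Because the file is JSON Lines, it can be read one record at a time, so even large files carrying 384-dim embeddings never need to fit in memory at once. A sketch (helper names are illustrative):

```python
import json
from typing import Iterator

def iter_chunks(path: str) -> Iterator[dict]:
    """Yield one chunk dict per line without loading the whole file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # tolerate blank lines
                yield json.loads(line)

def high_quality_texts(path: str, threshold: float = 7.0) -> list[str]:
    """Collect the text of chunks at or above a quality threshold."""
    return [c["text"] for c in iter_chunks(path)
            if c["quality_score"] >= threshold]
```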
---
## 📄 File 3: transcript.txt
**Purpose**: Raw transcript text (plain text)
**Format**: Single block of text
**Content**: Complete transcription from Whisper AI
**Encoding**: UTF-8
**Use**: Human-readable, simple parsing
---
## 📄 File 4: comprehensive_dashboard.html
**Purpose**: Interactive HTML dashboard for visualization
### Sections Included:
#### 1. **Header Section**
- Video title
- Channel name
- Duration (MM:SS format)
- View count (formatted)
- Like count (formatted)
- Upload date (formatted)
#### 2. **Content Overview** (4 metrics)
- Total chunks
- Total words
- Average words per chunk
- High quality chunk count
#### 3. **Content Chunks** (Visualization)
For each chunk:
- Chunk ID
- Quality badge (High/Medium/Low)
- Text preview (200 chars)
- Word count
- Timestamp range (NEW!)
#### 4. **Top Keywords** (15-20 keywords)
- Keyword text
- Frequency count
- Sorted by frequency
#### 5. **Quality Distribution** (3 categories)
- High quality (≥7.0): Count + percentage + progress bar
- Medium quality (5.0-7.0): Count + percentage + progress bar
- Low quality (<5.0): Count + percentage + progress bar
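The three-way split above can be sketched with the same thresholds (≥7.0 high, 5.0-7.0 medium, <5.0 low); function names here are illustrative:

```python
def quality_bucket(score: float) -> str:
    """Map a 0-10 quality score to its dashboard category."""
    if score >= 7.0:
        return "high"
    if score >= 5.0:
        return "medium"
    return "low"

def quality_distribution(scores: list[float]) -> dict[str, float]:
    """Return the percentage of chunks in each quality category."""
    counts = {"high": 0, "medium": 0, "low": 0}
    for s in scores:
        counts[quality_bucket(s)] += 1
    total = len(scores) or 1  # avoid division by zero on empty input
    return {k: 100.0 * v / total for k, v in counts.items()}
```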
### Design Features:
- Modern OKLCH color scheme
- Dark mode optimized
- Responsive layout
- APCA contrast optimization
- Hover effects
- Clean typography
**File Size**: ~12 KB (optimized)
---
## 📄 File 5: transcript_structured.md
**Purpose**: Structured markdown for Docling processing
### Sections:
```markdown
# {Video Title}
## Video Information
- Video ID
- Channel
- Duration
- Views
- Upload Date
---
## Transcript
{Full transcript text}
---
## Metadata
- Transcription Method
- Total Words
- Processing Date
```
**Format**: Markdown with headers and lists
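Assembling this file from the metadata fields can be sketched as follows; this is an assumed template based on the section outline above, and the exact formatting YTPipe emits may differ:

```python
def build_structured_md(meta: dict, transcript: str) -> str:
    """Render the structured markdown from metadata fields and transcript."""
    return "\n".join([
        f"# {meta['title']}",
        "",
        "## Video Information",
        f"- Video ID: {meta['video_id']}",
        f"- Channel: {meta['channel']}",
        f"- Duration: {meta['duration']} seconds",
        f"- Views: {meta['view_count']:,}",
        f"- Upload Date: {meta['upload_date']}",
        "",
        "---",
        "",
        "## Transcript",
        "",
        transcript,
        "",
        "---",
        "",
        "## Metadata",
        f"- Transcription Method: {meta['transcription_method']}",
        f"- Total Words: {meta['total_words']}",
        f"- Processing Date: {meta['processed_at']}",
    ])
```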
---
## 📄 File 6: granite_docling_enhanced.json
**Purpose**: Docling-enhanced chunk analysis
### Structure:
```json
{
  "chunks": [
    {
      "chunk_id": 0,
      "text": "...",
      "source_path": "...",
      "word_count": 426,
      "char_start": 0,
      "char_end": 2194,
      "markdown_section": "",
      "quality_score": 8.0
    }
  ],
  "metadata": {
    "total_chunks": 1,
    "total_words": 426,
    "processor": "granite-docling-enhanced",
    "version": "1.0",
    "timestamp": "2026-02-04T21:33:49.042420"
  }
}
```
**Chunk Fields**: 8 per chunk
**Metadata Fields**: 5
---
## 📄 File 7: granite_docling_enhanced.jsonl
**Purpose**: Same as File 6 but in JSONL format (one chunk per line)
**Format**: JSON Lines
**Use**: Streaming processing, easier parsing
**Content**: Same chunk structure as File 6
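Since Files 6 and 7 carry the same chunk records, one can be derived from the other. A sketch of the JSON-to-JSONL direction (helper name is illustrative):

```python
import json

def json_to_jsonl(json_path: str, jsonl_path: str) -> int:
    """Write each chunk from the JSON file as one JSONL line; return count."""
    with open(json_path, encoding="utf-8") as f:
        chunks = json.load(f)["chunks"]
    with open(jsonl_path, "w", encoding="utf-8") as f:
        for chunk in chunks:
            f.write(json.dumps(chunk) + "\n")
    return len(chunks)
```

Note that the top-level `metadata` object of File 6 has no counterpart in the line-delimited form.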
---
## 📄 File 8: docling_output/transcript_structured.json
**Purpose**: Docling processing output (structured JSON)
**Content**: Docling's internal representation of the structured markdown
**Size**: ~14 KB
**Format**: Docling-specific JSON schema
---
## 🗄️ Vector Database (Not a File)
**Purpose**: Semantic search capability
### Stored Information Per Chunk:
- Chunk ID
- Text content
- Embedding vector (384 dimensions)
- Metadata:
  - video_id
  - timestamp_start
  - timestamp_end
  - quality_score
  - word_count
**Backend Options**: ChromaDB, FAISS, Qdrant
**Query Capabilities**: Semantic similarity search
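Regardless of backend, the underlying query is nearest-neighbor search over the stored embeddings with optional metadata filtering. A backend-agnostic, brute-force sketch (function names are illustrative; a real backend indexes the vectors instead of scanning them):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def search(query_vec: list[float], chunks: list[dict],
           top_k: int = 3, min_quality: float = 0.0) -> list[dict]:
    """Rank chunks by similarity, optionally filtering on quality_score."""
    candidates = [c for c in chunks if c["quality_score"] >= min_quality]
    candidates.sort(key=lambda c: cosine(query_vec, c["embedding"]),
                    reverse=True)
    return candidates[:top_k]
```

The `min_quality` argument mirrors the metadata filtering the backends expose (e.g. filter by `video_id` or `quality_score`).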
---
## 📊 **Complete Information Summary**
### Video-Level Information (14 fields)
1. video_id (YouTube ID)
2. url (Full YouTube URL)
3. title
4. duration (seconds)
5. upload_date (YYYYMMDD)
6. view_count
7. like_count
8. comment_count
9. channel name
10. description (truncated to 500 chars)
11. processed_at (timestamp)
12. chunks_count
13. total_words
14. transcription_method
### Per-Chunk Information (9 fields)
1. id (chunk number)
2. text (chunk content)
3. word_count
4. start_char (character position)
5. end_char (character position)
6. quality_score (0-10 scale)
7. timestamp_start (MM:SS format) **NEW!**
8. timestamp_end (MM:SS format) **NEW!**
9. embedding (384-dim vector)
### Docling Enhanced (8 fields per chunk)
1. chunk_id
2. text
3. source_path
4. word_count
5. char_start
6. char_end
7. markdown_section
8. quality_score
### Dashboard Visualizations
1. Video metadata display
2. Content overview stats (4 metrics)
3. Chunk cards (first 20)
4. Top keywords (15-20)
5. Quality distribution (3 categories)
---
## 🎯 Total Information Captured
### Quantitative:
- **Video metadata**: 14 fields
- **Per chunk (standard)**: 9 fields
- **Per chunk (Docling)**: 8 fields
- **Embedding dimensions**: 384 per chunk
- **Keywords extracted**: 15-20
- **Quality categories**: 3 (high/medium/low)
### Qualitative:
- Full transcript text
- Structured markdown
- Interactive HTML dashboard
- Searchable vector database
- Temporal information (timestamps)
---
## 📦 Data Volume Example
**For 10-minute video** (~2,000 words):
- metadata.json: ~1 KB
- transcript.txt: ~12 KB
- chunks.jsonl: ~150 KB (with embeddings)
- dashboard.html: ~15 KB
- Docling JSON: ~30 KB
- Vector DB: ~50 KB
**Total**: ~260 KB of structured data per video
---
## 🔍 What Can Be Queried
### From Files:
- ✅ Video metadata (all 14 fields)
- ✅ Full transcript (searchable text)
- ✅ Individual chunks (by ID or content)
- ✅ Keywords and frequencies
- ✅ Quality distribution
- ✅ Timestamps for any chunk
### From Vector Database:
- ✅ Semantic similarity search
- ✅ Find similar chunks across videos
- ✅ Query by embedding vector
- ✅ Filter by metadata (video_id, quality, etc.)
### Via MCP Tools:
- ✅ Full-text search with context
- ✅ Semantic search
- ✅ SEO optimization insights
- ✅ Topic timeline analysis
- ✅ Quality metrics
- ✅ Performance benchmarks
---
**Every YouTube video becomes a rich, queryable knowledge base with 50+ data points!** 🚀