mcptube-vision
Allows ingesting YouTube videos to extract transcripts and visual frame descriptions, building a persistent wiki knowledge base that compounds information across videos.
mcptube-vision
YouTube video knowledge engine: transcripts, vision, and persistent wiki.
mcptube-vision transforms YouTube videos into a persistent, structured knowledge base using both transcripts and visual frame analysis. Built on the Karpathy LLM Wiki pattern: knowledge compounds with every video you add.
Evolved from mcptube v0.1: mcptube-vision replaces semantic chunk search with a persistent wiki that gets smarter with every video ingested.
How It Works
Traditional video tools re-discover knowledge from scratch on every query. mcptube-vision is different:
mcptube v0.1: Query → vector search → raw chunks → LLM → answer (from scratch every time)
mcptube-vision: Video ingested → LLM extracts knowledge → wiki pages created → cross-references built; Query → FTS5 + agent reasons over compiled knowledge → answer
Aspect | v0.1 (Video Search Engine) | vision (Video Knowledge Engine) |
On ingest | Chunk transcript, embed in vector DB | LLM watches + reads, writes wiki pages |
On query | Find similar chunks | Agent reasons over compiled knowledge |
Frames | Timestamp or keyword extraction | Scene-change detection + vision model |
Cross-video | Re-search all chunks each time | Connections already in the wiki |
Over time | Library of isolated videos | Compounding knowledge base |
Technical Architecture
mcptube-vision is built around a core insight: video knowledge should compound, not be re-discovered. Every architectural decision flows from this principle.
System Overview
flowchart TD
YT[YouTube URL] --> EXT[YouTubeExtractor\ntranscript + metadata]
EXT --> FRAMES[SceneFrameExtractor\nffmpeg scene-change detection]
FRAMES --> VISION[VisionDescriber\nLLM vision model]
VISION --> WIKI_EXT[WikiExtractor\nLLM knowledge extraction]
EXT --> WIKI_EXT
WIKI_EXT --> WIKI_ENG[WikiEngine\nmerge + update]
WIKI_ENG --> FILE[FileWikiRepository\nJSON pages on disk]
WIKI_ENG --> FTS[SQLite FTS5\nsearch index]
FILE --> AGENT[Ask Agent\nFTS5 → LLM reasoning]
FTS --> AGENT
FILE --> CLI[CLI / MCP Server]
FTS --> CLI
subgraph Ingestion Pipeline
EXT
FRAMES
VISION
WIKI_EXT
end
subgraph Knowledge Store
WIKI_ENG
FILE
FTS
end
subgraph Retrieval
AGENT
end
The system overview shows three distinct subsystems connected by a unidirectional data flow. The Ingestion Pipeline (left) transforms a raw YouTube URL into structured knowledge through four stages: transcript extraction, scene-change frame detection, vision-model description, and LLM-powered knowledge extraction. Each stage enriches the signal: raw video becomes text, and text becomes typed knowledge objects.
The Knowledge Store (center) is the persistent layer. The WikiEngine applies merge semantics, deciding whether to create new pages or append to existing ones, then writes JSON files to disk and updates the FTS5 search index. These two stores serve different access patterns: files for full-page reads and exports, FTS5 for sub-millisecond keyword retrieval.
The Retrieval layer (right) combines both stores. The Ask Agent first narrows via FTS5, then loads full pages from disk, and finally reasons over candidates with structural awareness from the wiki TOC. The CLI and MCP Server sit alongside as thin presentation layers; they never contain business logic.
Ingestion Flow
sequenceDiagram
participant User
participant CLI
participant YouTubeExtractor
participant SceneFrameExtractor
participant VisionDescriber
participant WikiExtractor
participant WikiEngine
participant FileRepo
participant FTS5
User->>CLI: mcptube add <url>
CLI->>YouTubeExtractor: fetch transcript + metadata
YouTubeExtractor-->>CLI: segments, duration, channel
CLI->>SceneFrameExtractor: extract scene frames (ffmpeg)
SceneFrameExtractor-->>CLI: frame images (scene_000x.jpg)
CLI->>VisionDescriber: describe frames (LLM vision)
VisionDescriber-->>CLI: frame descriptions (prose)
CLI->>WikiExtractor: extract knowledge\n(transcript + frame descriptions)
WikiExtractor-->>CLI: entities, topics, concepts, video page
CLI->>WikiEngine: merge into wiki
WikiEngine->>FileRepo: write/update JSON pages\n(append entities, rewrite synthesis)
WikiEngine->>FTS5: update search index
FileRepo-->>WikiEngine: ✓
FTS5-->>WikiEngine: ✓
WikiEngine-->>CLI: wiki processed
CLI-->>User: ✓ Added + Wiki: full_analysis
The ingestion flow is a write-once pipeline: LLM-heavy at ingest time, but never repeated for the same video. This is the key cost tradeoff: invest tokens upfront to build compiled knowledge, so retrieval is cheap.
The sequence shows two critical branching points. First, after transcript extraction, the pipeline forks into vision processing (scene frames → LLM vision descriptions) and feeds both streams into the WikiExtractor. This dual-signal approach means the LLM sees both what was said and what was shown, which is critical for content like coding tutorials or slide-based lectures where the transcript alone misses visual information.
Second, the WikiEngine merge step is where knowledge compounding happens. Rather than blindly writing new pages, it checks for existing entities, topics, and concepts, appending new video contributions to existing pages and rewriting synthesis summaries. This is why ingesting video #10 makes the wiki smarter about videos #1–9 too: shared concepts get richer synthesis with each new source.
The final FTS5 index update runs synchronously after the file write, ensuring search consistency. There is no eventual-consistency window: once add_video returns, all new knowledge is immediately searchable.
Retrieval Flow
sequenceDiagram
participant User
participant CLI
participant FTS5
participant FileRepo
participant Agent
User->>CLI: mcptube ask "What is RLHF?"
CLI->>FTS5: keyword search (sanitized query)
FTS5-->>CLI: candidate page slugs (ranked)
CLI->>FileRepo: load candidate pages (JSON)
FileRepo-->>CLI: wiki pages (entities, topics, concepts)
CLI->>FileRepo: load wiki TOC
FileRepo-->>CLI: table of contents (all page titles + types)
CLI->>Agent: candidates + TOC + question
Agent-->>CLI: reasoned answer with source citations
CLI-->>User: answer + (source-slug) citations
The retrieval flow is deliberately two-stage to balance cost and intelligence. The first stage, FTS5 keyword search, runs entirely locally with zero LLM tokens, narrowing thousands of wiki pages to a ranked handful in milliseconds. Query sanitization strips special characters (e.g. ?, !) that would break FTS5 syntax, ensuring robustness for natural-language questions.
The second stage loads two types of context for the agent: the candidate pages (full detail: summaries, contributions, entity references) and the wiki TOC (a compact structural map of all knowledge). The TOC is critical: it gives the agent awareness of what it doesn't know. Without it, the agent would hallucinate answers from weak matches. With it, the agent can reason: "The wiki has pages on RLHF and scaling laws, but nothing on quantum computing, so I should say I don't have that information."
In CLI mode (BYOK), the agent is an LLM call that synthesizes the final answer with source citations. In MCP server mode (passthrough), this stage returns the raw candidates and TOC to the client, letting the client's own model (Copilot, Claude, Gemini) do the reasoning. This dual-mode design means the server never requires an API key when used via MCP.
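The first, LLM-free stage described above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration under stated assumptions, not mcptube-vision's actual code: the pages table, its columns, the function names, and the choice to join quoted tokens with OR are all hypothetical, and the snippet needs an SQLite build with FTS5 enabled.

```python
import re
import sqlite3


def sanitize_fts_query(question: str) -> str:
    """Keep only word characters so punctuation cannot be parsed as FTS5 syntax."""
    tokens = re.findall(r"\w+", question)
    # Quote each token so reserved words such as AND/OR/NOT are treated as text.
    return " OR ".join(f'"{token}"' for token in tokens)


def search_pages(conn: sqlite3.Connection, question: str, limit: int = 5) -> list[str]:
    """Return page slugs ranked by FTS5's built-in BM25 score (best first)."""
    query = sanitize_fts_query(question)
    if not query:
        return []
    rows = conn.execute(
        "SELECT slug FROM pages WHERE pages MATCH ? ORDER BY bm25(pages) LIMIT ?",
        (query, limit),
    ).fetchall()
    return [slug for (slug,) in rows]


# Hypothetical schema: one FTS5 row per wiki page.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE pages USING fts5(slug UNINDEXED, title, content)")
conn.executemany(
    "INSERT INTO pages VALUES (?, ?, ?)",
    [
        ("concept-rlhf", "RLHF", "Reinforcement learning from human feedback."),
        ("topic-scaling-laws", "Scaling Laws", "How loss falls as compute grows."),
    ],
)
print(search_pages(conn, "What is RLHF?"))
```

Quoting every token is one simple way to keep a trailing "?" or a stray operator from raising a syntax error; a stricter AND-based query would trade recall for precision.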
Subsystem Breakdown
1. Ingestion Pipeline
YouTubeExtractor pulls transcript segments via youtube-transcript-api and video metadata via yt-dlp. Transcripts are chunked by natural segment boundaries, not fixed token windows, preserving semantic coherence.
SceneFrameExtractor uses ffmpeg's perceptual scene-change filter (select='gt(scene,{threshold})') rather than fixed-interval sampling. This is deliberate: fixed intervals waste tokens on static frames (slides held for 30s), while scene-change detection captures transitions, the moments of highest information density. The threshold (default 0.4) is configurable.
VisionDescriber sends detected frames to a vision-capable LLM (GPT-4o, Claude, or Gemini, auto-detected via API key priority). Frame descriptions are plain prose, not structured JSON, to maximise the LLM's descriptive latitude.
Why this matters: A transcript of a coding tutorial misses the code on screen. Scene-change vision capture recovers that signal without the token cost of dense fixed-interval sampling.
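To make the scene-change filter concrete, here is a sketch of how such an ffmpeg invocation could be assembled in Python. The function name, file names, and output pattern are assumptions for illustration (the pattern mirrors the scene_000x.jpg naming in the ingestion diagram); only the select filter expression and the 0.4 default come from the text above. The -vsync vfr flag is a common companion to select; newer ffmpeg builds prefer -fps_mode vfr.

```python
from pathlib import Path


def scene_frame_command(video: Path, out_dir: Path, threshold: float = 0.4) -> list[str]:
    """Build an ffmpeg argv that writes one JPEG per detected scene change."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must be between 0 and 1")
    return [
        "ffmpeg",
        "-i", str(video),
        # Keep only frames whose scene-change score exceeds the threshold.
        "-vf", f"select='gt(scene,{threshold})'",
        # Emit frames as they are selected instead of padding to a constant rate.
        "-vsync", "vfr",
        str(out_dir / "scene_%04d.jpg"),
    ]


# Hypothetical input and output locations.
cmd = scene_frame_command(Path("talk.mp4"), Path("frames"))
print(" ".join(cmd))
# To execute it: subprocess.run(cmd, check=True)
```

Raising the threshold yields fewer frames (only hard cuts), while lowering it captures subtler transitions at a higher vision-token cost.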
2. WikiEngine: The Novel Core
Inspired by the Karpathy LLM Wiki pattern, this is the most architecturally distinctive component.
WikiExtractor takes the combined transcript + frame descriptions and prompts an LLM to extract four typed knowledge objects:
Type | Semantics | Update Policy |
video | Immutable per-video summary + timestamps | Write-once |
entity | People, tools, companies | Append-only: new references added, never overwritten |
topic | Broad themes (e.g. "Scaling Laws") | Synthesis rewritten; per-video contributions immutable |
concept | Specific ideas (e.g. "RLHF") | Synthesis rewritten; per-video contributions immutable |
WikiEngine handles merge semantics: when a new video references an existing entity or concept, it integrates the new evidence without destroying prior contributions. This is a CRDT-like append model for knowledge, not a vector store replacement index.
Why this matters: Vector stores are retrieval indexes; they don't synthesize. Two videos about "attention mechanisms" produce two isolated chunks. The WikiEngine merges them into a single concept-attention-mechanisms page with a synthesis that evolves as evidence accumulates. Knowledge compounds.
Version history is maintained for all non-immutable pages: every synthesis rewrite is snapshotted, enabling full auditability.
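The merge rules above (immutable per-video contributions, a rewritable synthesis, and a snapshot on every rewrite) can be sketched as a small data model. The class, field, and function names here are hypothetical, and in the real system the new synthesis would come from an LLM call rather than being passed in by hand.

```python
from dataclasses import dataclass, field


@dataclass
class ConceptPage:
    slug: str
    synthesis: str = ""
    # video_id -> contribution text; entries are never changed once written.
    contributions: dict[str, str] = field(default_factory=dict)
    # Earlier synthesis texts, oldest first.
    history: list[str] = field(default_factory=list)


def merge_contribution(page: ConceptPage, video_id: str, text: str, new_synthesis: str) -> None:
    """Add one video's evidence, then replace the synthesis while keeping the old one."""
    if video_id in page.contributions:
        raise ValueError(f"{video_id} already contributed to {page.slug}")
    page.contributions[video_id] = text
    if page.synthesis:
        page.history.append(page.synthesis)  # snapshot before the rewrite
    page.synthesis = new_synthesis


page = ConceptPage(slug="concept-attention-mechanisms")
merge_contribution(page, "vid1", "Explains self-attention.", "Attention weighs tokens.")
merge_contribution(
    page, "vid2", "Covers multi-head attention.",
    "Attention weighs tokens; multiple heads run in parallel.",
)
print(page.history)
```

Rejecting a second contribution from the same video is one way to enforce the write-once rule; the actual engine may handle re-ingestion differently.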
3. Storage Layer
FileWikiRepository stores wiki pages as JSON on disk, one file per page. Chosen over a document DB deliberately:
Human-readable and git-diffable
Trivially exportable to markdown/HTML
Schema evolution without migrations
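A one-file-per-page store of this kind needs very little code. The sketch below is an assumption-laden illustration, not the FileWikiRepository implementation: the function names, the page fields, and the slug-based file name are hypothetical, while the per-type subdirectory follows the data layout documented later in this README.

```python
import json
import tempfile
from pathlib import Path


def save_page(root: Path, page: dict) -> Path:
    """Write one page to <root>/<type>/<slug>.json with stable, diff-friendly formatting."""
    path = root / page["type"] / f"{page['slug']}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    # Sorted keys and fixed indentation keep git diffs small between rewrites.
    text = json.dumps(page, indent=2, sort_keys=True, ensure_ascii=False)
    path.write_text(text + "\n", encoding="utf-8")
    return path


def load_page(root: Path, page_type: str, slug: str) -> dict:
    """Read a page back from its JSON file."""
    return json.loads((root / page_type / f"{slug}.json").read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "wiki"
    page = {"type": "concept", "slug": "concept-rlhf", "title": "RLHF", "tags": ["alignment"]}
    saved_path = save_page(root, page)
    round_tripped = load_page(root, "concept", "concept-rlhf")
    relative_path = saved_path.relative_to(root).as_posix()
    print(relative_path)
```

Because pages are plain dictionaries, adding a new field later requires no migration: older files simply lack the key.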
SQLite FTS5 maintains a parallel search index over page titles, tags, and content. Chosen over a vector store because:
Zero embedding cost at query time
Deterministic, auditable results
Sub-millisecond latency at thousands of pages
Why not ChromaDB/Pinecone? At wiki scale, BM25-style keyword search over compiled knowledge pages outperforms semantic similarity over raw chunks, because the wiki pages are already semantically rich by construction.
4. Hybrid Retrieval Agent
The ask command uses a deliberate two-stage pattern:
FTS5 keyword search: narrows the full wiki to a small candidate set (milliseconds, zero LLM cost)
LLM agent: receives candidates + the wiki table of contents, reasons about relevance, synthesizes a grounded answer with source citations
Why this matters over RAG: Standard RAG retrieves chunks and generates. The agent here retrieves compiled knowledge pages and reasons. The wiki TOC gives the agent structural awareness of what knowledge exists, enabling it to correctly say "I don't have information about X" rather than hallucinating from weak chunk matches.
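As a rough illustration of the second stage, the function below assembles the context an agent would receive: full candidate pages plus a compact table of contents. The function name, the prompt wording, and the page fields (slug, summary) are assumptions made for this sketch; the actual prompt used by mcptube-vision is not shown in this README.

```python
def build_agent_prompt(question: str, candidates: list[dict], toc: list[tuple[str, str]]) -> str:
    """Combine full candidate pages with a compact TOC of every page in the wiki."""
    toc_lines = [f"- {title} ({page_type})" for title, page_type in toc]
    page_blocks = [f"[{page['slug']}]\n{page['summary']}" for page in candidates]
    return "\n\n".join(
        [
            "Answer using only the pages below and cite sources as (slug).",
            "If the table of contents shows no relevant page, say the wiki has no information.",
            "Table of contents:\n" + "\n".join(toc_lines),
            "Candidate pages:\n" + "\n\n".join(page_blocks),
            f"Question: {question}",
        ]
    )


prompt = build_agent_prompt(
    "What is RLHF?",
    candidates=[{"slug": "concept-rlhf", "summary": "Reinforcement learning from human feedback."}],
    toc=[("RLHF", "concept"), ("Scaling Laws", "topic")],
)
print(prompt)
```

The TOC lists every page title, not just the candidates, which is what lets the model notice that a topic is absent rather than guessing from a weak match.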
5. MCP Server
Exposes all subsystems as tools consumable by any MCP-compatible client. Report and synthesis tools use a passthrough pattern, returning structured data for the client's own LLM to analyse rather than making a second LLM call server-side. This avoids double-billing and lets the client model apply its own reasoning style.
Key Design Decisions
Decision | Alternative Considered | Reason |
Scene-change frame extraction | Fixed-interval sampling | Higher signal/token ratio |
Wiki knowledge model | Vector store chunks | Knowledge compounds; no re-discovery per query |
FTS5 retrieval | Embedding similarity | Compiled wiki pages are already semantic |
File-based wiki storage | SQLite/document DB | Human-readable, git-diffable, zero migrations |
Append-only entity updates | Full rewrite | Source attribution preserved; full auditability |
Passthrough MCP reports | Server-side LLM | Avoids double-billing; client model reasons |
Features
Feature | CLI | MCP Server |
Add/remove YouTube videos | ✓ | ✓ |
Wiki knowledge base (auto-built) | ✓ | ✓ |
Scene-change frame extraction + vision analysis | ✓ | ✓ |
Full-text wiki search (FTS5) | ✓ | ✓ |
Agentic Q&A over wiki | ✓ | ✓ |
Browse wiki pages (entities, topics, concepts) | ✓ | ✓ |
Wiki version history | ✓ | ✓ |
Wiki export (markdown, HTML) | ✓ | ✓ |
Illustrated reports (single & cross-video) | ✓ (BYOK) | ✓ (passthrough) |
YouTube discovery + clustering | ✓ (BYOK) | ✓ |
Cross-video synthesis | ✓ (BYOK) | ✓ (passthrough) |
Text-only processing mode | ✓ | ✓ |
BYOK = Bring Your Own Key (Anthropic, OpenAI, or Google)
Passthrough = The MCP client's own LLM does the analysis
Installation
Prerequisites
Python 3.12 or 3.13
ffmpeg: required for frame extraction (install guide)
Recommended: pipx
pipx install mcptube --python python3.12
Alternative: pip
python3.12 -m venv venv
source venv/bin/activate
pip install mcptube
Verify installation
mcptube --help
Quick Start
# 1. Add a video (builds wiki automatically)
mcptube add "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
# 2. Add with text-only processing (cheaper, faster)
mcptube add "https://www.youtube.com/watch?v=abc123" --text-only
# 3. Browse the wiki
mcptube wiki list
mcptube wiki show "video-dQw4w9WgXcQ"
# 4. Search the knowledge base
mcptube search "main topic"
# 5. Ask a question (agentic retrieval over wiki)
mcptube ask "What are the key ideas discussed?"
# 6. View the table of contents
mcptube wiki toc
Tip: Always wrap multi-word arguments in double quotes.
CLI Reference
Library Management
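| Command | Description | Example |
| mcptube add <url> | Ingest video + build wiki (full analysis) | mcptube add "https://www.youtube.com/watch?v=dQw4w9WgXcQ" |
| mcptube add <url> --text-only | Ingest without vision processing | mcptube add "https://www.youtube.com/watch?v=abc123" --text-only |
| | List all videos with tags | |
| | Show full video details (transcript, chapters) | |
| | Remove video + clean wiki references | |
<query> can be a video index number, video ID, or partial title.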
Wiki Knowledge Base
| Command | Description | Example |
| mcptube wiki list | Browse all wiki pages | mcptube wiki list |
| | Filter by type | |
| | Filter by tag | |
| mcptube wiki show | Read a specific wiki page in full | mcptube wiki show "video-dQw4w9WgXcQ" |
| | Full-text search across all wiki pages | |
| mcptube wiki toc | Table of contents (all pages, compact) | mcptube wiki toc |
| | Version history for a wiki page | |
| | Export all pages as markdown (default) | |
| | Export all pages as single HTML file | |
| | Export a single page | |
Search & Ask
| Command | Description | Example |
| mcptube search | Full-text search, returns page list | mcptube search "main topic" |
| mcptube ask | Agentic Q&A over wiki (BYOK) | mcptube ask "What is RLHF?" |
Frames
| Command | Description | Example |
| | Extract frame at exact timestamp (seconds) | |
| | Extract frame by transcript match | |
Analysis & Reports (BYOK)
| Command | Description | Example |
| | LLM classify + tag a video | |
| | Generate illustrated report for one video | |
| | Guide report with a focus query | |
| | Save report as HTML | |
| | Cross-video report on a topic | |
| | Cross-video report filtered by tag | |
| | Save cross-video report | |
| | Cross-video theme synthesis | |
| | Save synthesis as HTML | |
| | Search YouTube, cluster results (no ingest) | |
Server
Command | Description |
| Start MCP server over HTTP (default |
| Start MCP server over stdio (for Claude Desktop) |
| Custom host/port |
| Hot-reload mode for development |
Wiki Page Types
When you ingest a video, mcptube-vision builds four types of wiki pages:
Page Type | Created From | Update Policy |
Video | Each ingested video | Write-once (immutable) |
Entity | People, companies, tools mentioned | Append-only (new references added) |
Topic | Broad themes (e.g., "Machine Learning") | Synthesis rewritten, per-video contributions immutable |
Concept | Specific ideas (e.g., "Scaling Laws") | Synthesis rewritten, per-video contributions immutable |
Principle: Raw source content (what was said/shown in each video) is never modified. Only synthesis summaries evolve as new videos are added. Version history is maintained for all changes.
How Search Works (Hybrid Retrieval)
mcptube-vision uses a two-step hybrid approach:
SQLite FTS5: keyword search narrows thousands of wiki pages to a handful of candidates (milliseconds, zero LLM cost)
LLM Agent: reads candidates + wiki table of contents, reasons about relevance, synthesizes an answer
This gives you the speed of keyword search with the intelligence of an LLM agent.
Vision Pipeline
When you ingest a video without --text-only, mcptube-vision:
Extracts key frames using ffmpeg scene-change detection (select='gt(scene,0.4)')
Sends frames to a vision-capable LLM (GPT-4o, Claude, Gemini) for description
Combines frame descriptions with transcript in the knowledge extraction pass
This captures visual content (slides, code, diagrams, demos) that transcripts alone miss.
MCP Client Setup
mcptube exposes 25+ MCP tools via two transports:
Transport | How it works | Used by |
Streamable HTTP ( | Client connects to a running mcptube server | VS Code, Claude Code, Cursor, Windsurf, Codex, Gemini CLI |
stdio | MCP client spawns | Claude Desktop |
Note: The MCP server is currently available for local use only. You must run mcptube serve locally or let the client spawn it.
VS Code + GitHub Copilot (Tested)
Open Cmd+Shift+P → MCP: Open User Configuration and add:
{
"servers": {
"mcptube": {
"url": "http://127.0.0.1:9093/mcp"
}
}
}
Then start the server in a terminal:
mcptube serve
Claude Code (Tested)
claude mcp add mcptube --transport http http://127.0.0.1:9093/mcp
Then start the server in a separate terminal:
mcptube serve
Claude Desktop
Edit ~/Library/Application Support/Claude/claude_desktop_config.json (macOS) or %APPDATA%\Claude\claude_desktop_config.json (Windows):
If installed via pipx (recommended):
{
"mcpServers": {
"mcptube": {
"command": "mcptube",
"args": ["serve", "--stdio"]
}
}
}
If installed in a virtual environment:
{
"mcpServers": {
"mcptube": {
"command": "/full/path/to/.venv/bin/mcptube",
"args": ["serve", "--stdio"]
}
}
}
No separate server needed: Claude Desktop spawns the process automatically.
Cursor
Create or edit ~/.cursor/mcp.json (global) or .cursor/mcp.json (project-scoped):
{
"mcpServers": {
"mcptube": {
"url": "http://127.0.0.1:9093/mcp"
}
}
}
Then start the server:
mcptube serve
Windsurf
Edit ~/.codeium/windsurf/mcp_config.json:
{
"mcpServers": {
"mcptube": {
"serverUrl": "http://127.0.0.1:9093/mcp"
}
}
}
Then start the server:
mcptube serve
OpenAI Codex
Edit ~/.codex/config.toml:
[mcp_servers.mcptube]
url = "http://127.0.0.1:9093/mcp"
Then start the server:
mcptube serve
Gemini CLI
Edit ~/.gemini/settings.json:
{
"mcpServers": {
"mcptube": {
"httpUrl": "http://127.0.0.1:9093/mcp"
}
}
}
Then start the server:
mcptube serve
Verify Connection
Once connected, ask your MCP client:
use mcptube. list all videos in my library
It should call the list_videos tool and return results.
MCP Tools
Tool | Description |
| Ingest video + build wiki |
| List library |
| Remove video + clean wiki |
| Browse wiki pages |
| Read a wiki page |
| Full-text search |
| Table of contents |
| Agentic Q&A |
| Version history |
| Extract frame (inline image) |
| Frame by transcript match |
| Get metadata for classification |
| Get data for single-video report |
| Get data for cross-video report |
| Get data for theme synthesis |
| Search YouTube |
| Single-video Q&A data |
| Multi-video Q&A data |
Configuration
All settings can be overridden via environment variables prefixed with MCPTUBE_:
| Setting | Default | Env Var |
| Data directory | ~/.mcptube | |
| Server host | 127.0.0.1 | |
| Server port | 9093 | |
| Default LLM model | | |
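A prefix-based override scheme like the one described above can be sketched in a few lines. This is a generic illustration, not the project's settings loader: the setting keys, the function name, and the example variable MCPTUBE_PORT are hypothetical, since the exact variable names are not recoverable from this page. The host and port defaults are taken from the server URLs used elsewhere in this README.

```python
import os
from collections.abc import Mapping

# Host and port as used in the client configs in this README; key names are hypothetical.
DEFAULTS = {"host": "127.0.0.1", "port": "9093"}


def load_settings(env: Mapping[str, str] = os.environ, prefix: str = "MCPTUBE_") -> dict[str, str]:
    """Overlay prefixed environment variables on top of the defaults."""
    settings = dict(DEFAULTS)
    for name, value in env.items():
        if name.startswith(prefix):
            # MCPTUBE_PORT (hypothetical name) would map to the "port" setting.
            settings[name[len(prefix):].lower()] = value
    return settings


print(load_settings({"MCPTUBE_PORT": "8080", "HOME": "/home/me"}))
```

Variables without the prefix are ignored, so unrelated environment entries cannot leak into the configuration.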
BYOK API Keys
Set one or more to enable LLM features:
export ANTHROPIC_API_KEY="sk-ant-..."
export OPENAI_API_KEY="sk-..."
export GOOGLE_API_KEY="AI..."
Auto-detection priority: Anthropic → OpenAI → Google.
Data Layout
~/.mcptube/
├── mcptube.db          # Video metadata (SQLite)
├── wiki.db             # FTS5 search index (SQLite)
├── wiki/
│   ├── video/          # Video pages (JSON)
│   ├── entity/         # Entity pages (JSON)
│   ├── topic/          # Topic pages (JSON)
│   ├── concept/        # Concept pages (JSON)
│   └── _history/       # Version history
└── frames/
    ├── <id>_<ts>.jpg   # Single extracted frames
    └── <id>_scenes/    # Scene-change frames + metadata
Development
git clone https://github.com/0xchamin/mcptube.git
cd mcptube
git checkout vision
python3.12 -m venv venv
source venv/bin/activate
pip install -e ".[dev]"
pytest
Roadmap
Wiki knowledge engine (entities, topics, concepts)
Scene-change frame extraction + vision analysis
Hybrid retrieval (FTS5 + agentic)
CLI + MCP server
Playlist/series support
Web app with early access sign-up
Token-based payment integration
License
MIT. See LICENSE for details.