youtube-mcp
Provides tools for watching YouTube videos by extracting frames and transcripts, enabling AI assistants to understand video content.
What is this?
An MCP server that lets AI assistants actually watch YouTube videos — not just read transcripts.
It extracts frames at scene changes and visual reference moments, pairs each frame with the exact words spoken at that timestamp, and returns everything as dense interleaved content. The AI sees what's on screen at the exact moment someone says "as you can see here."
No existing YouTube MCP server does this. Every other one is transcript-only. This is the first to combine transcript + vision.
The Token Math
Approach | Frames (10 min video) | Token cost |
Gemini native (1 FPS, 258 tok/frame) | 600 frames | ~155K tokens |
Sending raw JPEGs to any model | 600 frames | ~7.2M tokens |
youtube-mcp (dense interleave, 1 frame/5s) | 120 frames | ~1.4M tokens |
youtube-mcp describe mode (BLIP-2 → text) | 600 frames | ~18K tokens |
Describe mode gives you 400x savings over raw images — full visual coverage as pure text.
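The figures in the table can be reproduced with simple arithmetic. The sketch below is illustrative only: the 258 tokens/frame figure comes from the table, the ~12K tokens/frame figure for raw JPEGs is implied by 7.2M / 600, and the ~30 tokens/frame for describe mode is an assumption derived from ~18K / 600. The names `TOKENS_PER_FRAME` and `estimateTokens` are hypothetical and not part of youtube-mcp.

```typescript
// Approximate per-frame token costs. Only the Gemini figure is stated directly
// in the table; the other two are back-calculated from its totals (assumption).
const TOKENS_PER_FRAME = {
  geminiNative: 258,
  rawJpeg: 12_000,
  describeMode: 30,
} as const;

type Approach = keyof typeof TOKENS_PER_FRAME;

// Estimate the token cost of a video sampled once every `intervalSec` seconds.
function estimateTokens(approach: Approach, durationSec: number, intervalSec: number): number {
  const frames = Math.floor(durationSec / intervalSec);
  return frames * TOKENS_PER_FRAME[approach];
}

const tenMinutes = 600; // seconds
const gemini = estimateTokens("geminiNative", tenMinutes, 1);   // 600 frames * 258
const raw = estimateTokens("rawJpeg", tenMinutes, 1);           // 600 frames * 12,000
const interleave = estimateTokens("rawJpeg", tenMinutes, 5);    // 120 frames * 12,000
const describe = estimateTokens("describeMode", tenMinutes, 1); // 600 frames * 30
const savings = raw / describe;                                  // ratio behind the "400x" claim
```

With these assumptions the totals come out to 154,800, 7,200,000, 1,440,000, and 18,000 tokens, which round to the values in the table.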
Quick Install
Prerequisites
brew install yt-dlp ffmpeg
Setup
git clone https://github.com/coah80/youtube-mcp.git
cd youtube-mcp
bun install
Add to Claude Code
Add to your ~/.mcp.json:
{
  "mcpServers": {
    "youtube": {
      "command": "bun",
      "args": ["run", "/path/to/youtube-mcp/src/index.ts"]
    }
  }
}
Then restart Claude Code.
Optional: Enable describe mode (local BLIP-2)
This lets the AI "see" every second of a video as text descriptions — no images needed.
cd youtube-mcp
uv venv --python 3.11 .venv
uv pip install --python .venv/bin/python torch torchvision transformers accelerate pillow
First run downloads the BLIP-2 model (~3GB). Subsequent runs use the cached model.
AI Installer
For AI assistants: Copy and paste this into your MCP config to install youtube-mcp.
Add this to ~/.mcp.json (create the file if it doesn't exist):
{
  "mcpServers": {
    "youtube": {
      "command": "bun",
      "args": ["run", "/path/to/youtube-mcp/src/index.ts"]
    }
  }
}
Replace /path/to/youtube-mcp with the actual clone path.
If the file already has other servers, just add the "youtube" key inside the existing "mcpServers" object.
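For a config file that already lists other servers, the merge described above might look like the following sketch. The `McpConfig` shape mirrors the JSON shown earlier; the function name `addYoutubeServer` and the sample `other` server are hypothetical, and reading or writing ~/.mcp.json is deliberately left out.

```typescript
interface McpServerEntry {
  command: string;
  args: string[];
}

interface McpConfig {
  mcpServers?: Record<string, McpServerEntry>;
  [key: string]: unknown; // preserve any unrelated top-level keys
}

// Return a copy of the config with a "youtube" entry added under "mcpServers".
// Existing servers are kept; an existing "youtube" entry would be overwritten.
function addYoutubeServer(existing: McpConfig, clonePath: string): McpConfig {
  return {
    ...existing,
    mcpServers: {
      ...(existing.mcpServers ?? {}),
      youtube: { command: "bun", args: ["run", `${clonePath}/src/index.ts`] },
    },
  };
}

// Hypothetical existing config with one unrelated server.
const existingConfig: McpConfig = {
  mcpServers: { other: { command: "node", args: ["server.js"] } },
};
const mergedConfig = addYoutubeServer(existingConfig, "/path/to/youtube-mcp");
const mergedJson = JSON.stringify(mergedConfig, null, 2); // text to write back to the file
```

The function returns a new object rather than mutating its input, so the original config can still be compared or restored if something goes wrong.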
How It Works
YouTube URL
    │
    ├──→ yt-dlp ──→ Transcript (timestamped, word-level)
    │
    ├──→ yt-dlp ──→ Stream URL ──→ ffmpeg ──→ Frames
    │                                           │
    │                              ┌────────────┼────────────┐
    │                              │            │            │
    │                        Scene Change  Visual Cues    Regular
    │                          Detection  in Transcript  Intervals
    │                          (ffmpeg)   ("as you can  (fill gaps)
    │                                      see here")
    │                              │            │            │
    │                              └────────────┼────────────┘
    │                                           │
    │                             Frame Selection (prioritized)
    │                                           │
    └─────────────────────────────────→ Dense Interleave
                                                │
                                     ┌──────────┴──────────┐
                                     │                     │
                                Image Mode           Describe Mode
                             (raw screenshots)     (BLIP-2 captions)
                                     │                     │
                              Frame + "words       Text description
                              spoken during        + "words spoken
                              this frame"          during this frame"
Visual Cue Detection
The analyzer scans transcript text for 25+ patterns indicating the speaker is referencing something visual:
Pattern | Example |
| "As you can see here, the API returns..." |
| "Look at this graph" |
| "What's on screen right now is..." |
| "If you click here, it opens..." |
| "In this diagram, we have..." |
| "Notice how the color changes" |
When detected, a frame is extracted at that exact timestamp — so the AI sees what the speaker was pointing at.
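A minimal sketch of this kind of detection, assuming the transcript is available as timestamped segments. The regular expressions below are an illustrative subset guessed from the example phrases in the table; the project's actual 25+ patterns are not listed here, and the names `TranscriptSegment`, `VISUAL_CUE_PATTERNS`, and `findVisualCueTimestamps` are hypothetical.

```typescript
interface TranscriptSegment {
  start: number; // seconds from the beginning of the video
  text: string;
}

// Illustrative subset only (assumption): not the project's real pattern list.
const VISUAL_CUE_PATTERNS: RegExp[] = [
  /\bas you can see\b/i,
  /\blook at (this|that|the)\b/i,
  /\bon (the )?screen\b/i,
  /\b(click|tap) here\b/i,
  /\bin this (diagram|graph|chart|image|picture)\b/i,
  /\bnotice (how|that|the)\b/i,
];

// Return the start time of every segment whose text matches at least one pattern.
function findVisualCueTimestamps(segments: TranscriptSegment[]): number[] {
  return segments
    .filter((seg) => VISUAL_CUE_PATTERNS.some((re) => re.test(seg.text)))
    .map((seg) => seg.start);
}

const sampleSegments: TranscriptSegment[] = [
  { start: 0, text: "Welcome everyone to today's lecture." },
  { start: 83, text: "As you can see here, the API returns..." },
  { start: 93, text: "Notice how the color changes" },
];
const cueTimes = findVisualCueTimestamps(sampleSegments);
```

The patterns omit the `g` flag on purpose: a global regex keeps state in `lastIndex` between `test` calls, which would make repeated matching unreliable.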
Scene Change Detection
Uses ffmpeg's scene detection filter (select=gt(scene,0.3)) to find where the visual content actually changes. This means:
Static talking-head sections get fewer frames (nothing's changing)
Slide transitions, screen recordings, demos get more frames (lots changing)
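As a rough illustration, the filter can be combined with ffmpeg's `showinfo` filter so that the timestamp of each selected frame is logged to stderr. Only the `select=gt(scene,0.3)` expression comes from this README; pairing it with `showinfo` and `-f null -` is an assumption about how the timestamps could be collected, and `buildSceneDetectArgs` and `parseSceneTimestamps` are hypothetical names. The sketch builds the argument list and parses log text without launching ffmpeg.

```typescript
// Build ffmpeg arguments that select scene-change frames and log their timestamps.
// The single quotes protect the comma inside gt(...) from the filtergraph parser.
function buildSceneDetectArgs(input: string, threshold = 0.3): string[] {
  return [
    "-i", input,
    "-vf", `select='gt(scene,${threshold})',showinfo`,
    "-f", "null", "-", // discard the output; only the log is needed
  ];
}

// Extract every pts_time value (in seconds) from showinfo log output.
function parseSceneTimestamps(stderr: string): number[] {
  const times: number[] = [];
  const re = /pts_time:\s*([0-9]+(?:\.[0-9]+)?)/g;
  let match: RegExpExecArray | null;
  while ((match = re.exec(stderr)) !== null) {
    times.push(Number(match[1]));
  }
  return times;
}

// Hand-written, abbreviated sample in the style of showinfo log lines (not real output).
const sampleStderr = [
  "[Parsed_showinfo_1 @ 0x0] n:   0 pts:  12500 pts_time:12.5 duration: 1",
  "[Parsed_showinfo_1 @ 0x0] n:   1 pts:  47080 pts_time:47.08 duration: 1",
].join("\n");

const sceneArgs = buildSceneDetectArgs("input.mp4");
const sceneTimes = parseSceneTimestamps(sampleStderr);
```

Raising the threshold above 0.3 would make detection stricter and yield fewer frames; lowering it would yield more.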
Segment-Based Processing
For videos longer than 5 minutes, watch_video processes in 3-minute segments with ~1 frame every 5 seconds. The AI calls it repeatedly:
watch_video(url) → first 3 min, 36 frames
watch_video(url, start_time=180) → next 3 min, 36 frames
watch_video(url, start_time=360) → next 3 min, 36 frames
...until the end
Each response tells the AI how to continue: "To continue watching, call watch_video with start_time=360"
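The segment arithmetic above can be sketched as follows. This is a simplified model, not the server's implementation: it samples frames at a fixed 5-second interval only (ignoring scene changes and visual cues), does not apply the 5-minute threshold, and the names `planSegment`, `planAll`, and `SegmentPlan` are hypothetical.

```typescript
const SEGMENT_SEC = 180;       // 3-minute segments
const FRAME_INTERVAL_SEC = 5;  // ~1 frame every 5 seconds

interface SegmentPlan {
  startTime: number;
  endTime: number;
  frameTimes: number[];
  nextStartTime: number | null; // null once the end of the video is reached
}

// Plan one segment beginning at `startTime`.
function planSegment(durationSec: number, startTime = 0): SegmentPlan {
  const endTime = Math.min(startTime + SEGMENT_SEC, durationSec);
  const frameTimes: number[] = [];
  for (let t = startTime; t < endTime; t += FRAME_INTERVAL_SEC) {
    frameTimes.push(t);
  }
  return {
    startTime,
    endTime,
    frameTimes,
    nextStartTime: endTime < durationSec ? endTime : null,
  };
}

// Follow nextStartTime until the video ends, as a caller would with repeated calls.
function planAll(durationSec: number): SegmentPlan[] {
  const plans: SegmentPlan[] = [];
  let next: number | null = 0;
  while (next !== null) {
    const plan = planSegment(durationSec, next);
    plans.push(plan);
    next = plan.nextStartTime;
  }
  return plans;
}

const tenMinutePlans = planAll(600);
```

For a 10-minute video this produces four segments starting at 0, 180, 360, and 540 seconds: three full segments of 36 frames and a final one-minute segment of 12 frames.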
Tools
Tool | What it does |
watch_video | Dense frame↔transcript interleaving in segments. ~1 frame/5s. The full "watch" experience. |
describe_video | Full visual coverage via local BLIP-2. Every frame described as text. 400x fewer tokens than images. |
get_scene_overview | Composite grid image of scene changes. Quick visual summary of the whole video. |
get_frames | Extract frames at specific timestamps. For drilling into moments. |
| Full timestamped transcript. |
| Video metadata (title, channel, duration, views, description). |
Examples
"Watch this video and summarize it"
The AI calls watch_video and gets interleaved content like:
[1:23] (scene change) "and here's where it gets interesting"
[screenshot of code editor]
[1:28] "if you look at this function right here"
[screenshot showing the function being discussed]
[1:33] (visual reference) "notice how the state updates"
[screenshot at the exact moment they reference the visual]
"Describe this entire lecture for me"
The AI calls describe_video and gets pure text:
[0:00] [VISUAL] A title slide reading "Introduction to Neural Networks"
[0:00] Welcome everyone to today's lecture on neural networks.
[0:05] [VISUAL] A diagram showing interconnected nodes in layers
[0:05] We'll start with the basic architecture.
[0:10] [VISUAL] The same diagram with arrows highlighted between layers
[0:10] Each connection between nodes has a weight...
600 frames of a 10-minute video → ~18K tokens. Fits in any context window.
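Text output in the style of the describe-mode example above could be assembled roughly as follows. This sketch assumes frame captions and transcript lines are already available as timestamped arrays; the `Caption` and `TranscriptLine` shapes, the 5-second pairing window, and the function names are all assumptions rather than the project's actual code. Videos longer than an hour would need an hours field that `formatTimestamp` does not produce.

```typescript
interface Caption {
  time: number;        // seconds at which the frame was captured
  description: string; // text description of the frame
}

interface TranscriptLine {
  start: number; // seconds at which the line is spoken
  text: string;
}

// Format seconds as M:SS, e.g. 83 becomes "1:23".
function formatTimestamp(sec: number): string {
  const minutes = Math.floor(sec / 60);
  const seconds = Math.floor(sec % 60);
  return `${minutes}:${seconds < 10 ? "0" : ""}${seconds}`;
}

// Emit each caption followed by the words spoken within `windowSec` of that frame.
function interleave(captions: Caption[], transcript: TranscriptLine[], windowSec = 5): string {
  const lines: string[] = [];
  for (const cap of captions) {
    const stamp = formatTimestamp(cap.time);
    lines.push(`[${stamp}] [VISUAL] ${cap.description}`);
    const spoken = transcript
      .filter((t) => t.start >= cap.time && t.start < cap.time + windowSec)
      .map((t) => t.text)
      .join(" ");
    if (spoken) lines.push(`[${stamp}] ${spoken}`);
  }
  return lines.join("\n");
}

const lectureText = interleave(
  [
    { time: 0, description: "A title slide" },
    { time: 5, description: "A diagram showing interconnected nodes in layers" },
  ],
  [
    { start: 0, text: "Welcome everyone to today's lecture on neural networks." },
    { start: 5, text: "We'll start with the basic architecture." },
  ],
);
```

Frames with no speech in their window produce only a [VISUAL] line, so silent stretches of a video still appear in the output.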
Architecture
youtube-mcp/
├── src/
│   ├── index.ts       # MCP server: 6 tool definitions
│   ├── youtube.ts     # yt-dlp + ffmpeg operations (stream URL, frames, scenes)
│   ├── analyzer.ts    # Visual cue detection, chunking, dense interleaving
│   ├── describe.ts    # BLIP-2 integration (TypeScript wrapper)
│   └── captioner.py   # BLIP-2 inference (Python, runs on MPS/CUDA/CPU)
├── .venv/             # Python venv for BLIP-2 (optional)
├── package.json
├── tsconfig.json
└── README.md
Tech Stack
Runtime: Bun
MCP SDK: @modelcontextprotocol/sdk
Vision (optional): BLIP-2 via PyTorch (runs on Apple MPS, CUDA, or CPU)
Compatibility
Works with any MCP-compatible AI assistant:
Claude Code (CLI, Desktop, Web)
Claude Desktop
Cursor
Any future MCP host
The image-based tools (watch_video, get_frames, get_scene_overview) require a vision-capable model.
The text-based tool (describe_video) works with any model — even text-only ones — because BLIP-2 converts all visuals to text locally.
Roadmap
Gemini Flash proxy mode — use Gemini Flash ($0.10/1M tokens) as a visual encoder for higher-quality frame descriptions than BLIP-2
Frame deduplication — perceptual similarity hashing to skip near-identical frames
Keyframe extraction — use ffmpeg I-frame detection instead of fixed intervals
Whisper integration — local audio transcription when YouTube captions aren't available
Timestamp burning — burn MM:SS into frame pixels (requires ffmpeg with libfreetype)
npm package: npx youtube-mcp one-liner install
Research
This project was informed by deep research into how Gemini, GPT-4o, and open-source tools handle video:
Gemini processes video at 1 FPS using SigLIP-SO400M (258 tokens/frame) with native multimodal attention
GPT-4o sends base64 JPEG frames via the vision API (~12K tokens/frame)
No existing YouTube MCP server combines transcript + frame extraction — this is the first
Key references: LiveCC (CVPR 2025), mcp-deep-video, videostil, llm-video-frames
License
MIT