Provides tools for watching YouTube videos by extracting frames and transcripts, enabling AI assistants to understand video content.

How do I use youtube-mcp?

1. Click "Install Server".
2. Wait a few minutes for the server to deploy. Once ready, it shows a "Started" state.
3. In the chat, type @ followed by the MCP server name and your instructions, e.g., "@youtube-mcp Watch this video and describe the scene at 2:30: https://youtu.be/dQw4w9WgXcQ"

The server responds to your query, and you can continue using it as needed.

youtube-mcp

by coah80

TypeScript

Local

What is this?

An MCP server that lets AI assistants actually watch YouTube videos — not just read transcripts.

It extracts frames at scene changes and visual reference moments, pairs each frame with the exact words spoken at that timestamp, and returns everything as dense interleaved content. The AI sees what's on screen at the exact moment someone says "as you can see here."

No existing YouTube MCP server does this. Every other one is transcript-only. This is the first to combine transcript + vision.

The Token Math

Approach	Frames (10-min video)	Approx. token cost
Gemini native (1 FPS, 258 tok/frame)	600 frames	~155K tokens
Sending raw JPEGs to any model	600 frames	~7.2M tokens
youtube-mcp (dense interleave, 1 frame/5s)	120 frames	~1.4M tokens
youtube-mcp describe mode (BLIP-2 → text)	600 frames	~18K tokens

Describe mode gives you 400x savings over raw images (~7.2M tokens down to ~18K for the same 600 frames): full visual coverage as pure text.
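
The arithmetic behind the table can be checked with a short sketch. The per-frame costs are assumptions read off this README: 258 tokens/frame for Gemini, ~12K tokens/frame for a raw JPEG, and ~30 tokens per caption (derived from ~18K tokens over 600 frames). `estimateTokens` is a hypothetical helper, not part of the server.

```typescript
// Per-frame token costs taken from this README. The caption figure is derived
// (~18K tokens / 600 frames = 30), not measured.
const TOKENS_PER_FRAME = {
  geminiNative: 258,
  rawJpeg: 12_000,
  describeCaption: 30,
} as const;

type Mode = keyof typeof TOKENS_PER_FRAME;

/** Estimated tokens for a video of `durationSec` sampled once every `intervalSec` seconds. */
function estimateTokens(durationSec: number, intervalSec: number, mode: Mode): number {
  const frames = Math.floor(durationSec / intervalSec);
  return frames * TOKENS_PER_FRAME[mode];
}

const tenMinutes = 600; // seconds
const gemini = estimateTokens(tenMinutes, 1, "geminiNative");       // 600 frames
const raw = estimateTokens(tenMinutes, 1, "rawJpeg");               // 600 frames
const interleaved = estimateTokens(tenMinutes, 5, "rawJpeg");       // 120 frames
const described = estimateTokens(tenMinutes, 1, "describeCaption"); // 600 captions

console.log({ gemini, raw, interleaved, described, savings: raw / described });
```

Under these assumptions the ratio of raw-image tokens to describe-mode tokens is 400.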

Quick Install

Prerequisites

brew install yt-dlp ffmpeg

Setup

git clone https://github.com/coah80/youtube-mcp.git
cd youtube-mcp
bun install

Add to Claude Code

Add to your ~/.mcp.json:

{
  "mcpServers": {
    "youtube": {
      "command": "bun",
      "args": ["run", "/path/to/youtube-mcp/src/index.ts"]
    }
  }
}

Then restart Claude Code.

Optional: Enable describe mode (local BLIP-2)

This lets the AI "see" every second of a video as text descriptions — no images needed.

cd youtube-mcp
uv venv --python 3.11 .venv
uv pip install --python .venv/bin/python torch torchvision transformers accelerate pillow

First run downloads the BLIP-2 model (~3GB). Subsequent runs use the cached model.

AI Installer

For AI assistants: Copy and paste this into your MCP config to install youtube-mcp.

Add this to ~/.mcp.json (create the file if it doesn't exist):

{
  "mcpServers": {
    "youtube": {
      "command": "bun",
      "args": ["run", "/path/to/youtube-mcp/src/index.ts"]
    }
  }
}

Replace /path/to/youtube-mcp with the actual clone path.

If the file already has other servers, just add the "youtube" key inside the existing "mcpServers" object.
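
As an illustration of that merge, here is a minimal sketch that adds the "youtube" entry without touching other servers. The `McpConfig` type and `addYoutubeServer` helper are hypothetical names for this example. Reading and writing ~/.mcp.json is left out.

```typescript
type ServerEntry = { command: string; args: string[] };
type McpConfig = { mcpServers?: Record<string, ServerEntry>; [key: string]: unknown };

/** Returns a copy of `config` with the "youtube" server added or replaced. */
function addYoutubeServer(config: McpConfig, clonePath: string): McpConfig {
  return {
    ...config,
    mcpServers: {
      ...(config.mcpServers ?? {}), // keep any servers already configured
      youtube: { command: "bun", args: ["run", `${clonePath}/src/index.ts`] },
    },
  };
}

// Example: a config that already defines another (made-up) server.
const existingConfig: McpConfig = {
  mcpServers: { other: { command: "node", args: ["server.js"] } },
};
const mergedConfig = addYoutubeServer(existingConfig, "/path/to/youtube-mcp");
console.log(JSON.stringify(mergedConfig, null, 2));
```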

How It Works

YouTube URL
    │
    ├──→ yt-dlp ──→ Transcript (timestamped, word-level)
    │
    ├──→ yt-dlp ──→ Stream URL ──→ ffmpeg ──→ Frames
    │                                │
    │                    ┌────────────┼────────────┐
    │                    │            │            │
    │              Scene Change  Visual Cues  Regular
    │              Detection    in Transcript  Intervals
    │              (ffmpeg)     ("as you can   (fill gaps)
    │                           see here")
    │                    │            │            │
    │                    └────────────┼────────────┘
    │                                │
    │                    Frame Selection (prioritized)
    │                                │
    └──────────────────→ Dense Interleave
                              │
                   ┌──────────┴──────────┐
                   │                     │
              Image Mode            Describe Mode
           (raw screenshots)     (BLIP-2 captions)
                   │                     │
              Frame + "words         Text description
              spoken during          + "words spoken
              this frame"            during this frame"
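
The "Frame Selection (prioritized)" step in the diagram could be sketched as follows. This is not the code in analyzer.ts. The priority order, the `minGapSec` spacing rule, and all names are assumptions made for illustration, since the README only says that selection is prioritized and that regular intervals fill gaps.

```typescript
type Source = "visual_cue" | "scene_change" | "interval";
interface Candidate { timeSec: number; source: Source }

// Assumed ranking (lower wins). The README does not state the actual order.
const PRIORITY: Record<Source, number> = { visual_cue: 0, scene_change: 1, interval: 2 };

/** Picks up to `maxFrames` candidates, best priority first, keeping them at least `minGapSec` apart. */
function selectFrames(candidates: Candidate[], maxFrames: number, minGapSec: number): Candidate[] {
  const ranked = [...candidates].sort(
    (a, b) => PRIORITY[a.source] - PRIORITY[b.source] || a.timeSec - b.timeSec,
  );
  const chosen: Candidate[] = [];
  for (const c of ranked) {
    if (chosen.length >= maxFrames) break;
    const tooClose = chosen.some((k) => Math.abs(k.timeSec - c.timeSec) < minGapSec);
    if (!tooClose) chosen.push(c);
  }
  // Return in playback order so frames can be interleaved with the transcript.
  return chosen.sort((a, b) => a.timeSec - b.timeSec);
}

const picked = selectFrames(
  [
    { timeSec: 0, source: "interval" },
    { timeSec: 5, source: "interval" },
    { timeSec: 10, source: "interval" },
    { timeSec: 15, source: "interval" },
    { timeSec: 4, source: "scene_change" },
    { timeSec: 11, source: "visual_cue" },
  ],
  5,
  2,
);
console.log(picked.map((c) => `${c.timeSec}s (${c.source})`));
```

In this example the interval frames at 5s and 10s are dropped because a scene change (4s) and a visual cue (11s) sit within the 2-second gap.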

Visual Cue Detection

The analyzer scans transcript text for 25+ patterns indicating the speaker is referencing something visual:

Pattern	Example
`as you can see`	"As you can see here, the API returns..."
`look at this`	"Look at this graph"
`on screen`	"What's on screen right now is..."
`click here`	"If you click here, it opens..."
`this diagram`	"In this diagram, we have..."
`notice how`	"Notice how the color changes"

When detected, a frame is extracted at that exact timestamp — so the AI sees what the speaker was pointing at.
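
A minimal sketch of this kind of scan, using only the six phrases from the table above (the real pattern list is longer). The `TranscriptSegment` shape and `findVisualCues` name are assumptions for illustration, not the analyzer's actual API.

```typescript
interface TranscriptSegment { startSec: number; text: string }
interface VisualCue { timeSec: number; phrase: string; text: string }

// Only the six phrases shown in the table; none contain regex metacharacters.
const CUE_PHRASES = [
  "as you can see",
  "look at this",
  "on screen",
  "click here",
  "this diagram",
  "notice how",
];
const CUE_MATCHERS = CUE_PHRASES.map((phrase) => ({
  phrase,
  re: new RegExp(`\\b${phrase}\\b`, "i"), // case-insensitive, whole words
}));

/** Returns one cue per segment whose text contains a visual-reference phrase. */
function findVisualCues(segments: TranscriptSegment[]): VisualCue[] {
  const cues: VisualCue[] = [];
  for (const seg of segments) {
    const hit = CUE_MATCHERS.find((m) => m.re.test(seg.text));
    if (hit) cues.push({ timeSec: seg.startSec, phrase: hit.phrase, text: seg.text });
  }
  return cues;
}

const cues = findVisualCues([
  { startSec: 12, text: "As you can see here, the API returns JSON" },
  { startSec: 20, text: "Then we deploy the service" },
  { startSec: 31, text: "Notice how the color changes" },
]);
console.log(cues.map((c) => `${c.timeSec}s: ${c.phrase}`));
```

Each returned `timeSec` is a candidate timestamp for frame extraction.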

Scene Change Detection

Uses ffmpeg's scene detection filter (select=gt(scene,0.3)) to find where the visual content actually changes. This means:

Static talking-head sections get fewer frames (nothing's changing)
Slide transitions, screen recordings, demos get more frames (lots changing)
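
One way to drive that filter from TypeScript is to pair `select` with ffmpeg's `showinfo` filter and read the `pts_time` values it logs to stderr. The sketch below only builds the argument list and parses the log text. It does not spawn ffmpeg, and the function names are hypothetical rather than taken from youtube.ts.

```typescript
/** Builds ffmpeg arguments that log one showinfo line per detected scene change. */
function sceneDetectArgs(input: string, threshold = 0.3): string[] {
  return [
    "-hide_banner",
    "-i", input,
    // Keep only frames whose scene score exceeds the threshold, then log them.
    "-vf", `select='gt(scene,${threshold})',showinfo`,
    "-an",        // ignore audio
    "-f", "null", // decode only, write no output file
    "-",
  ];
}

/** Extracts the pts_time values (in seconds) from showinfo lines in ffmpeg's stderr. */
function parseSceneTimes(stderr: string): number[] {
  const times: number[] = [];
  const re = /pts_time:\s*([0-9]+(?:\.[0-9]+)?)/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(stderr)) !== null) times.push(Number(m[1]));
  return times;
}

// Sample log text in the shape showinfo emits (values made up for the example).
const sampleLog =
  "[Parsed_showinfo_1 @ 0x1] n:   0 pts:  12345 pts_time:12.345 pos: 1\n" +
  "[Parsed_showinfo_1 @ 0x1] n:   1 pts:  47500 pts_time:47.5 pos: 2\n";
console.log(sceneDetectArgs("video.mp4").join(" "));
console.log(parseSceneTimes(sampleLog));
```

Raising the threshold yields fewer, more pronounced scene changes; lowering it yields more frames.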

Segment-Based Processing

For videos longer than 5 minutes, watch_video processes the video in 3-minute segments at ~1 frame every 5 seconds (36 frames per segment). The AI calls it repeatedly:

watch_video(url) → first 3 min, 36 frames
watch_video(url, start_time=180) → next 3 min, 36 frames
watch_video(url, start_time=360) → next 3 min, 36 frames
...until the end

Each response tells the AI how to continue: "To continue watching, call watch_video with start_time=360"
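
The segment arithmetic can be sketched like this, assuming fixed 180-second segments and one frame every 5 seconds as described above. `planSegment`, `continuationHint`, and the end-of-video message are hypothetical, and the real tool also mixes in scene-change and visual-cue frames.

```typescript
const SEGMENT_SEC = 180;      // 3-minute segments
const FRAME_INTERVAL_SEC = 5; // ~1 frame every 5 seconds

interface SegmentPlan {
  startSec: number;
  endSec: number;
  frameTimes: number[];
  nextStartSec: number | null; // null once the video is fully covered
}

/** Plans one segment of a video that is `durationSec` long, starting at `startSec`. */
function planSegment(durationSec: number, startSec = 0): SegmentPlan {
  const endSec = Math.min(startSec + SEGMENT_SEC, durationSec);
  const frameTimes: number[] = [];
  for (let t = startSec; t < endSec; t += FRAME_INTERVAL_SEC) frameTimes.push(t);
  return { startSec, endSec, frameTimes, nextStartSec: endSec < durationSec ? endSec : null };
}

/** Builds the hint that tells the caller how to request the next segment. */
function continuationHint(plan: SegmentPlan): string {
  return plan.nextStartSec === null
    ? "End of video reached." // hypothetical wording
    : `To continue watching, call watch_video with start_time=${plan.nextStartSec}`;
}

const second = planSegment(600, 180); // second segment of a 10-minute video
console.log(second.frameTimes.length, continuationHint(second));
```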

Tools

Tool	What it does
`watch_video`	Dense frame↔transcript interleaving in segments. ~1 frame/5s. The full "watch" experience.
`describe_video`	Full visual coverage via local BLIP-2. Every frame described as text. 400x fewer tokens than images.
`get_scene_overview`	Composite grid image of scene changes. Quick visual summary of the whole video.
`get_frames`	Extract frames at specific timestamps. For drilling into moments.
`get_transcript`	Full timestamped transcript.
`get_video_info`	Video metadata (title, channel, duration, views, description).

Examples

"Watch this video and summarize it"

The AI calls watch_video and gets interleaved content like:

[1:23] (scene change) "and here's where it gets interesting"
[screenshot of code editor]

[1:28] "if you look at this function right here"
[screenshot showing the function being discussed]

[1:33] (visual reference) "notice how the state updates"
[screenshot at the exact moment they reference the visual]
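
The pairing shown above can be approximated by assigning each frame the words spoken from its timestamp until the next frame. This is a simplified sketch, not the logic in analyzer.ts. The `Word` and `Frame` shapes and the function names are assumptions.

```typescript
interface Word { timeSec: number; text: string }
interface Frame { timeSec: number; reason?: "scene change" | "visual reference" }
interface InterleavedEntry { label: string; spoken: string }

/** Formats seconds as M:SS, e.g. 83 becomes "1:23". */
function formatTimestamp(sec: number): string {
  const m = Math.floor(sec / 60);
  const s = Math.floor(sec % 60);
  return `${m}:${s < 10 ? "0" : ""}${s}`;
}

/** Pairs each frame with the words spoken from its timestamp up to the next frame. */
function interleave(frames: Frame[], words: Word[]): InterleavedEntry[] {
  const sorted = [...frames].sort((a, b) => a.timeSec - b.timeSec);
  return sorted.map((frame, i) => {
    const windowEnd = i + 1 < sorted.length ? sorted[i + 1].timeSec : Infinity;
    const spoken = words
      .filter((w) => w.timeSec >= frame.timeSec && w.timeSec < windowEnd)
      .map((w) => w.text)
      .join(" ");
    const tag = frame.reason ? ` (${frame.reason})` : "";
    return { label: `[${formatTimestamp(frame.timeSec)}]${tag}`, spoken };
  });
}

const entries = interleave(
  [{ timeSec: 83, reason: "scene change" }, { timeSec: 88 }],
  [
    { timeSec: 83, text: "and" },
    { timeSec: 84, text: "here's" },
    { timeSec: 88, text: "if" },
    { timeSec: 89, text: "you" },
  ],
);
for (const e of entries) console.log(`${e.label} "${e.spoken}"`);
```

In a real response, each entry would be followed by the corresponding image content.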

"Describe this entire lecture for me"

The AI calls describe_video and gets pure text:

[0:00] [VISUAL] A title slide reading "Introduction to Neural Networks"
[0:00] Welcome everyone to today's lecture on neural networks.
[0:05] [VISUAL] A diagram showing interconnected nodes in layers
[0:05] We'll start with the basic architecture.
[0:10] [VISUAL] The same diagram with arrows highlighted between layers
[0:10] Each connection between nodes has a weight...

600 frames of a 10-minute video → ~18K tokens. Fits in any context window.
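
The text layout above can be produced by merging captions and transcript lines by timestamp, with the caption first when both share a timestamp. A hedged sketch follows. `Caption`, `TranscriptLine`, and `renderDescribeOutput` are illustrative names, and the captions are assumed to come from the BLIP-2 step.

```typescript
interface Caption { timeSec: number; caption: string }
interface TranscriptLine { timeSec: number; text: string }

/** Merges captions and transcript lines into one text block ordered by time. */
function renderDescribeOutput(captions: Caption[], transcript: TranscriptLine[]): string {
  const stamp = (sec: number): string => {
    const s = Math.floor(sec % 60);
    return `${Math.floor(sec / 60)}:${s < 10 ? "0" : ""}${s}`;
  };
  const rows = [
    ...captions.map((c) => ({ timeSec: c.timeSec, order: 0, text: `[VISUAL] ${c.caption}` })),
    ...transcript.map((l) => ({ timeSec: l.timeSec, order: 1, text: l.text })),
  ];
  // Sort by time; at equal timestamps the visual description comes first.
  rows.sort((a, b) => a.timeSec - b.timeSec || a.order - b.order);
  return rows.map((r) => `[${stamp(r.timeSec)}] ${r.text}`).join("\n");
}

const describeText = renderDescribeOutput(
  [
    { timeSec: 0, caption: "A title slide" },
    { timeSec: 5, caption: "A diagram" },
  ],
  [
    { timeSec: 0, text: "Welcome everyone." },
    { timeSec: 5, text: "We'll start." },
  ],
);
console.log(describeText);
```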

Architecture

youtube-mcp/
├── src/
│   ├── index.ts        # MCP server — 6 tool definitions
│   ├── youtube.ts      # yt-dlp + ffmpeg operations (stream URL, frames, scenes)
│   ├── analyzer.ts     # Visual cue detection, chunking, dense interleaving
│   ├── describe.ts     # BLIP-2 integration (TypeScript wrapper)
│   └── captioner.py    # BLIP-2 inference (Python, runs on MPS/CUDA/CPU)
├── .venv/              # Python venv for BLIP-2 (optional)
├── package.json
├── tsconfig.json
└── README.md

Tech Stack

Runtime: Bun
MCP SDK: @modelcontextprotocol/sdk
Video: yt-dlp + ffmpeg
Vision (optional): BLIP-2 via PyTorch (runs on Apple MPS, CUDA, or CPU)

Compatibility

Works with any MCP-compatible AI assistant:

Claude Code (CLI, Desktop, Web)
Claude Desktop
Cursor
Any future MCP host

The image-based tools (watch_video, get_frames, get_scene_overview) require a vision-capable model.

The text-based tool (describe_video) works with any model — even text-only ones — because BLIP-2 converts all visuals to text locally.

Roadmap

Gemini Flash proxy mode — use Gemini Flash ($0.10/1M tokens) as a visual encoder for higher-quality frame descriptions than BLIP-2
Frame deduplication — perceptual similarity hashing to skip near-identical frames
Keyframe extraction — use ffmpeg I-frame detection instead of fixed intervals
Whisper integration — local audio transcription when YouTube captions aren't available
Timestamp burning — burn MM:SS into frame pixels (requires ffmpeg with libfreetype)
npm package — npx youtube-mcp one-liner install

Research

This project was informed by deep research into how Gemini, GPT-4o, and open-source tools handle video:

Gemini processes video at 1 FPS using SigLIP-SO400M (258 tokens/frame) with native multimodal attention
GPT-4o workflows send base64 JPEG frames through the vision API (~12K tokens/frame)
No existing YouTube MCP server combines transcript + frame extraction — this is the first

Key references: LiveCC (CVPR 2025), mcp-deep-video, videostil, llm-video-frames

License

MIT
