yt-mcp
This server provides AI-powered, local multi-modal analysis of YouTube videos — no API keys required. It supports standard YouTube URL formats (youtube.com/watch?v=ID, youtu.be/ID, youtube.com/shorts/ID).
Summarize videos (summarize_video): Generate text summaries at three detail levels: brief (2-3 sentences), medium (key points with timestamps), or detailed (comprehensive breakdown).
Ask questions about videos (ask_about_video): Pose specific questions and get AI-generated answers based on the video's content.
Extract screenshots automatically (extract_screenshots): AI identifies visually significant moments and extracts frames as base64 images, with configurable count (1-20), focus area, resolution (thumbnail to full), and optional disk saving.
Preview important timestamps (get_video_timestamps): Identify key moments and retrieve their timestamps without extracting frames, which is useful for planning before committing to extraction.
Extract frames at specific timestamps (extract_frames): Supply exact timestamps to retrieve corresponding frames, with configurable resolution and optional disk saving.
Underlying capabilities include local Whisper-based transcription, scene-change/interval keyframe extraction, and audio analysis (energy, tempo, RMS, music vs. speech detection via librosa).
Click on "Install Server".
Wait a few minutes for the server to deploy. Once ready, it will show a "Started" state.
In the chat, type @ followed by the MCP server name and your instructions, e.g., "@yt-mcp summarize https://www.youtube.com/watch?v=dQw4w9WgXcQ"
That's it! The server will respond to your query, and you can continue using it as needed.
yt-mcp
A fully local MCP (Model Context Protocol) server that gives AI assistants deep, multi-modal awareness of YouTube videos. No API keys required. All processing runs on-device via yt-dlp, OpenAI Whisper, FFmpeg, PySceneDetect, and librosa.
Note: This repository also contains an experimental TypeScript server (src/) that uses the Gemini API. That server is not under active development; the Python local server (server/) is the primary implementation.
How it works
YouTube URL
│
▼
yt-dlp ──────────────── download video.mp4
│ extract audio.wav (16 kHz mono)
▼
Whisper ─────────────── timestamped transcript (word-level)
│
▼
PySceneDetect ────────── detect scene-cut timestamps
│
▼
FFmpeg ──────────────── extract keyframe JPEGs at scene cuts
│
▼
OpenCV ──────────────── pixel-diff animation detection
│
▼
librosa ─────────────── energy · tempo · music vs speech
│
▼
timeline.py ─────────── unified JSON timeline (all signals, time-aligned)
All results are cached in /tmp/yt-analysis-cache/<video_id>/. Repeat calls for the same URL are served from the cache and return immediately.
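Because the cache is organized by video ID, a lookup first needs the ID from whichever URL form was passed in. A minimal sketch of that mapping follows, assuming the three URL formats this README lists and the cache root named above. `extract_video_id` and `cache_dir_for` are hypothetical helper names for illustration, not functions from this repository.

```python
from pathlib import Path
from urllib.parse import parse_qs, urlparse

# Cache root as stated in this README; the real server may make it configurable.
CACHE_ROOT = Path("/tmp/yt-analysis-cache")


def extract_video_id(url: str) -> str:
    """Return the video ID from a watch, youtu.be, or shorts URL (hypothetical helper)."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower().removeprefix("www.")

    if host == "youtu.be":
        # https://youtu.be/VIDEO_ID
        video_id = parsed.path.lstrip("/").split("/")[0]
    elif host == "youtube.com" and parsed.path == "/watch":
        # https://www.youtube.com/watch?v=VIDEO_ID
        video_id = parse_qs(parsed.query).get("v", [""])[0]
    elif host == "youtube.com" and parsed.path.startswith("/shorts/"):
        # https://youtube.com/shorts/VIDEO_ID
        video_id = parsed.path.split("/")[2]
    else:
        video_id = ""

    if not video_id:
        raise ValueError(f"Unsupported YouTube URL: {url}")
    return video_id


def cache_dir_for(url: str) -> Path:
    """Map a URL to its per-video cache directory (hypothetical helper)."""
    return CACHE_ROOT / extract_video_id(url)


if __name__ == "__main__":
    print(cache_dir_for("https://youtu.be/jNQXAC9IVRw"))
```

Keying on the ID rather than the raw URL means the watch, short-link, and shorts forms of the same video would all resolve to one cache directory.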
Prerequisites
# macOS
brew install ffmpeg
# Ubuntu / Debian
sudo apt install ffmpeg
# Verify
ffmpeg -version
python3 --version # must be 3.10+
Installation
git clone https://github.com/yourusername/yt-mcp.git
cd yt-mcp
# Create and activate a virtual environment (recommended)
python3 -m venv .venv
source .venv/bin/activate # macOS / Linux
# .venv\Scripts\activate # Windows
pip install -r requirements.txt
Whisper model weights download automatically on the first transcription call (roughly 140 MB for base, about 3 GB for large).
MCP integration
MCP clients spawn the server as a subprocess — they do not activate your shell or venv automatically. You must point them at the venv's Python interpreter directly using its absolute path.
Find your interpreter path after activating the venv:
source .venv/bin/activate
which python # e.g. /Users/you/repos/yt-mcp/.venv/bin/python
Claude Code:
claude mcp add -s user yt-mcp -- /path/to/yt-mcp/.venv/bin/python /path/to/yt-mcp/server/main.py
Claude Desktop: add the following to ~/Library/Application Support/Claude/claude_desktop_config.json:
{
"mcpServers": {
"yt-mcp": {
"command": "/path/to/yt-mcp/.venv/bin/python",
"args": ["/path/to/yt-mcp/server/main.py"]
}
}
}
Replace /path/to/yt-mcp with the absolute path to wherever you cloned the repo. On Windows the interpreter is at .venv\Scripts\python.exe.
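To avoid typing the absolute paths by hand, the JSON entry can be generated from the running interpreter. The sketch below is a small helper script, not part of this repository; `build_server_entry` is a hypothetical name, and it assumes you run it from the repo root with the venv active.

```python
import json
import sys
from pathlib import Path


def build_server_entry(repo_root: Path, interpreter: Path) -> dict:
    """Build the mcpServers entry from absolute paths (hypothetical helper)."""
    return {
        "mcpServers": {
            "yt-mcp": {
                "command": str(interpreter),
                "args": [str(repo_root / "server" / "main.py")],
            }
        }
    }


if __name__ == "__main__":
    # With the venv active, sys.executable is the venv's interpreter.
    # It is deliberately not resolved: following the symlink could point
    # at the base Python, which lacks the installed dependencies.
    entry = build_server_entry(Path.cwd().resolve(), Path(sys.executable))
    print(json.dumps(entry, indent=2))
```

Paste the printed object into the client's config file, merging it with any existing mcpServers entries instead of overwriting them.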
Tools
get_video_transcript
Transcribe a YouTube video using OpenAI Whisper (runs entirely locally).
Parameter | Type | Default | Description |
| string | — | Full YouTube URL |
| string | | |
Response:
{
"title": "Video Title",
"duration": 847,
"language": "en",
"full_text": "Welcome to this video...",
"segments": [
{
"t_start": 0.0,
"t_end": 4.5,
"text": "Welcome to this video.",
"words": [{ "word": "Welcome", "start": 0.0, "end": 0.6 }]
}
]
}
get_video_frames
Extract keyframes as base64-encoded JPEGs. Uses PySceneDetect for scene detection and FFmpeg for extraction.
Parameter | Type | Default | Description |
| string | — | Full YouTube URL |
| string | | |
| integer | | Seconds between frames (for |
Response:
{
"title": "Video Title",
"duration": 847,
"duration_formatted": "14:07",
"frame_count": 12,
"strategy": "scene",
"frames": [
{
"t": 0.0,
"t_formatted": "0:00",
"keyframe": "<base64 JPEG>",
"scene_change": false,
"animation_detected": false
}
],
"summary": [ /* same list without keyframe bytes — for quick review */ ]
}
get_audio_features
Analyze audio characteristics using librosa (runs locally).
Parameter | Type | Default | Description |
| string | — | Full YouTube URL |
| integer | | Analysis window size in seconds |
Response:
{
"title": "Video Title",
"duration": 847,
"segment_duration": 30,
"segments": [
{
"t_start": 0.0,
"t_end": 30.0,
"energy": "medium",
"music": false,
"tempo_bpm": 95.0,
"rms_db": -22.1
}
]
}
get_full_context
Primary tool. Returns a complete, synchronized multi-modal timeline — transcript + scene boundaries + animation detection + audio features, all time-aligned.
Parameter | Type | Default | Description |
| string | — | Full YouTube URL |
include_frames | boolean | | Embed base64 keyframes per segment |
| string | | Whisper model size |
Response:
{
"title": "How Transformers Work",
"channel": "AI Explained",
"duration": 847,
"duration_formatted": "14:07",
"language": "en",
"description": "In this video...",
"segments": [
{
"t_start": 0.0,
"t_end": 12.0,
"transcript": "Welcome to this video on transformers...",
"keyframe": null,
"scene_change": false,
"animation_detected": false,
"audio": {
"energy": "low",
"speech_rate": "normal",
"music": true,
"tempo_bpm": 0.0,
"rms_db": -28.4
}
}
]
}
Context window tip: Call get_full_context with include_frames=false first to understand the video structure, then call get_video_frames for specific timestamps of interest.
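The first pass of that workflow can be sketched as a filter over the frame-free get_full_context response, picking out the moments worth a second look. The field names follow the response shape documented above. The helper names and sample values are made up for illustration, and the m:ss formatting is an assumption based on values such as "14:07" in the examples.

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds as m:ss (assumed format, based on the documented examples)."""
    total = int(seconds)
    return f"{total // 60}:{total % 60:02d}"


def timestamps_of_interest(context: dict) -> list[float]:
    """Return start times of segments flagged as scene changes or animation."""
    return [
        segment["t_start"]
        for segment in context.get("segments", [])
        if segment.get("scene_change") or segment.get("animation_detected")
    ]


# Trimmed-down response in the documented shape; the values are invented.
sample_context = {
    "duration": 847,
    "segments": [
        {"t_start": 0.0, "t_end": 12.0, "scene_change": False, "animation_detected": False},
        {"t_start": 12.0, "t_end": 40.5, "scene_change": True, "animation_detected": False},
        {"t_start": 40.5, "t_end": 63.0, "scene_change": False, "animation_detected": True},
    ],
}

if __name__ == "__main__":
    for t in timestamps_of_interest(sample_context):
        print(format_timestamp(t), t)
```

Only the selected moments then need frames, which keeps base64 image data out of the context window until it is actually wanted.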
Supported URL Formats
https://www.youtube.com/watch?v=VIDEO_ID
https://youtu.be/VIDEO_ID
https://youtube.com/shorts/VIDEO_ID
Environment Variables
Variable | Default | Description |
| | Cache directory for downloaded videos and audio |
Development
# Activate the venv first
source .venv/bin/activate
# Run the server directly (stdio mode — same as MCP clients use)
python server/main.py
# Quick smoke test
python -c "
from server.utils.downloader import VideoDownloader
from server.tools.transcript import get_transcript
d = VideoDownloader()
vp, ap, info = d.download('https://www.youtube.com/watch?v=jNQXAC9IVRw')
print(get_transcript(ap)['language'])
"
Testing
The Python server has a full unit test suite — 164 tests across 6 modules. All tests run without any network access or model downloads; every external dependency (Whisper, librosa, FFmpeg, PySceneDetect, OpenCV, yt-dlp) is mocked.
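The general shape of such a mocked test can be sketched as follows. This is not code from the repository's suite: `transcribe_language` is a hypothetical stand-in for the Whisper wrapper, and the example assumes the wrapper imports `whisper` lazily so that a fake module can be injected through `sys.modules`.

```python
import sys
from unittest.mock import MagicMock, patch


def transcribe_language(audio_path: str, model_size: str = "base") -> str:
    """Hypothetical wrapper: load a Whisper model and return the detected language."""
    import whisper  # imported lazily so a test can substitute a fake module

    model = whisper.load_model(model_size)
    return model.transcribe(audio_path)["language"]


def test_transcribe_language_uses_mocked_whisper():
    # The fake module stands in for the real package: no weights are
    # downloaded and no audio file is read.
    fake_whisper = MagicMock()
    fake_whisper.load_model.return_value.transcribe.return_value = {"language": "ja"}

    with patch.dict(sys.modules, {"whisper": fake_whisper}):
        assert transcribe_language("audio.wav") == "ja"

    fake_whisper.load_model.assert_called_once_with("base")
    fake_whisper.load_model.return_value.transcribe.assert_called_once_with("audio.wav")
```

Patching `sys.modules` lets the test run on a machine where Whisper is not installed at all, which is one way to get a suite that needs neither network access nor model downloads.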
Install test dependencies
pip install -r requirements-dev.txt
Run the full suite
python -m pytest
Expected output: 164 passed in ~4s
Run tests for a specific module
python -m pytest tests/test_downloader.py # VideoDownloader + VideoInfo
python -m pytest tests/test_transcript.py # Whisper wrapper + range helpers
python -m pytest tests/test_frames.py # FFmpeg, PySceneDetect, OpenCV
python -m pytest tests/test_audio.py # librosa AudioAnalyzer
python -m pytest tests/test_timeline.py # build_timeline + speech rate
python -m pytest tests/test_main.py # all 4 MCP tool handlers
Run a single test by name
python -m pytest tests/test_timeline.py::TestBuildTimeline::test_rapid_cuts_below_min_merged -v
Live smoke test against a real video
The example below uses プリマドンナ / 星街すいせい (Hoshimachi Suisei · Suisei Channel, 2:52) — a Japanese music video that exercises every layer of the pipeline: multilingual Whisper transcription, music detection via librosa HPSS, rapid scene cuts via PySceneDetect, and animation detection via OpenCV pixel-diff.
from server.utils.downloader import VideoDownloader
from server.tools.transcript import get_transcript
from server.tools.audio import AudioAnalyzer
from server.tools.frames import detect_scene_timestamps
URL = "https://www.youtube.com/watch?v=M1GYqy0tHV0"
d = VideoDownloader()
video_path, audio_path, info = d.download(URL)
print(f"Title: {info.title}") # プリマドンナ / 星街すいせい(official)
print(f"Duration: {info.duration:.0f}s") # 172
transcript = get_transcript(audio_path, model_size="base")
print(f"Language: {transcript['language']}") # ja
cuts = detect_scene_timestamps(video_path)
print(f"Scene cuts detected: {len(cuts)}") # typically 30–60 for a music video
analyzer = AudioAnalyzer(audio_path)
seg = analyzer.analyze_segment(0, 30)
print(f"First 30s — energy: {seg['energy']}, music: {seg['music']}")
# energy: 'medium' or 'high', music: True
For the full test guide (fixtures, mock patterns, writing tests for new tools), see docs/testing.md.
Architecture
For a detailed explanation of system design, data flows, and how to add new tools:
docs/architecture.md — pipeline diagrams and key design decisions
docs/python-server.md — component reference for all modules
docs/extending.md — how to add new tools
docs/testing.md — test suite structure, fixtures, and writing new tests
TypeScript Server (archived)
The src/ directory contains an experimental TypeScript server that delegates video analysis to the Gemini API. It is not under active development and is kept only for reference.
If you're looking for fast cloud-based video Q&A, the TypeScript server's approach (passing the YouTube URL directly to Gemini) works well for a quick prototype — but the Python server is the only implementation that will receive ongoing maintenance.
See docs/typescript-server.md for its API reference.
License