What can you do with this server?

This server provides AI-powered, local multi-modal analysis of YouTube videos, with no API keys required. It supports standard YouTube URL formats (youtube.com/watch?v=ID, youtu.be/ID, youtube.com/shorts/ID).

* Summarize videos (summarize_video): Generate text summaries at three detail levels: brief (2-3 sentences), medium (key points with timestamps), or detailed (comprehensive breakdown).
* Ask questions about videos (ask_about_video): Pose specific questions and get AI-generated answers based on the video's content.
* Extract screenshots automatically (extract_screenshots): AI identifies visually significant moments and extracts frames as base64 images, with configurable count (1-20), focus area, resolution (thumbnail to full), and optional disk saving.
* Preview important timestamps (get_video_timestamps): Identify key moments and retrieve their timestamps without extracting frames, which is useful for planning before committing to extraction.
* Extract frames at specific timestamps (extract_frames): Supply exact timestamps to retrieve corresponding frames, with configurable resolution and optional disk saving.

Underlying capabilities include local Whisper-based transcription, scene-change/interval keyframe extraction, and audio analysis (energy, tempo, RMS, music vs. speech detection via librosa).

Which integrations are available for this server?

Uses Google's Gemini API to analyze YouTube videos, summarizing their content and answering questions about it.

1. Click on "Install Server".
2. Wait a few minutes for the server to deploy. Once ready, it will show a "Started" state.
3. In the chat, type @ followed by the MCP server name and your instructions, e.g., "@yt-mcp summarize https://www.youtube.com/watch?v=dQw4w9WgXcQ".

That's it! The server will respond to your query, and you can continue using it as needed.


yt-mcp

by PakmanGames


yt-mcp

A fully local MCP (Model Context Protocol) server that gives AI assistants deep, multi-modal awareness of YouTube videos. No API keys required. All processing runs on-device via yt-dlp, OpenAI Whisper, FFmpeg, PySceneDetect, and librosa.

Note: This repository also contains an experimental TypeScript server (src/) that uses the Gemini API. That server is not under active development — the Python local server (server/) is the primary implementation.


System overview

yt-mcp runs as a local subprocess that an AI assistant spawns and talks to over stdio using JSON-RPC 2.0. The assistant calls tools; the server downloads the video once, extracts multi-modal signals on-device, caches everything to disk, and returns structured JSON. No data leaves the machine except the one-time download from YouTube.

flowchart LR
    subgraph client["AI Assistant"]
        A["Claude Code / Desktop"]
    end

    subgraph server["yt-mcp · local subprocess"]
        M["FastMCP server<br/>get_video_transcript · get_video_frames<br/>get_audio_features · get_full_context"]
        P["Local pipeline<br/>yt-dlp · Whisper · FFmpeg<br/>PySceneDetect · OpenCV · librosa"]
        C[("Disk cache<br/>/tmp/yt-analysis-cache/<video_id>")]
        M --> P
        P <--> C
    end

    YT[("YouTube")]

    A -- "JSON-RPC 2.0 (stdio)" --> M
    M -- "JSON result" --> A
    P -- "download once" --> YT

    classDef store fill:#fff3cd,stroke:#d39e00,color:#332701;
    class C,YT store;
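
To make the stdio exchange concrete, here is a minimal sketch of the request an assistant might send to invoke a tool. The `tools/call` method and the `name`/`arguments` parameter shape follow the MCP specification; the helper name `build_tool_call` and the request id are illustrative assumptions, and a real client also performs the `initialize` handshake before calling any tool.

```python
import json


def build_tool_call(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize an MCP tools/call request as one newline-delimited JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    # The stdio transport frames each message as a single line of JSON.
    return json.dumps(message) + "\n"


line = build_tool_call(
    1,
    "get_full_context",
    {"youtube_url": "https://www.youtube.com/watch?v=VIDEO_ID", "include_frames": False},
)
print(line, end="")
```

In practice the MCP client library builds and sends these messages for you; the sketch only shows what crosses the pipe.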

How it works

A single download feeds three parallel analysis tracks, which timeline.py then re-aligns into one time-indexed JSON document.

flowchart TD
    URL([YouTube URL]) --> DL["<b>yt-dlp</b><br/>download video.mp4<br/>extract audio.wav · 16 kHz mono"]

    DL -->|audio.wav| W["<b>Whisper</b><br/>word-level transcript"]
    DL -->|video.mp4| SD["<b>PySceneDetect</b><br/>scene-cut timestamps"]
    DL -->|audio.wav| LR["<b>librosa</b><br/>energy · tempo · music vs speech"]

    SD --> FF["<b>FFmpeg</b><br/>keyframe JPEG at each cut"]
    SD --> CV["<b>OpenCV</b><br/>pixel-diff animation detection"]

    W --> TL["<b>timeline.py</b><br/>unified, time-aligned segments"]
    FF --> TL
    CV --> TL
    LR --> TL

    TL --> OUT([Structured JSON → MCP client])

    classDef io fill:#d1e7dd,stroke:#0f5132,color:#03190f;
    class URL,OUT io;
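
The re-alignment step can be pictured with a small sketch. This is not the project's `timeline.py`; it is an illustrative reduction that assumes scene cuts define segment boundaries, that a transcript line belongs to the segment containing its midpoint, and that each segment takes the audio window covering its start. The function name `build_segments` and these assignment rules are assumptions.

```python
from bisect import bisect_right


def build_segments(duration, cuts, transcript, audio_windows):
    """Split [0, duration] at scene cuts and attach transcript text and audio features."""
    bounds = [0.0] + sorted(c for c in cuts if 0.0 < c < duration) + [float(duration)]
    window_starts = [w["t_start"] for w in audio_windows]
    segments = []
    for i, (start, end) in enumerate(zip(bounds, bounds[1:])):
        # A transcript line is assigned to the segment that contains its midpoint.
        text = " ".join(
            t["text"]
            for t in transcript
            if start <= (t["t_start"] + t["t_end"]) / 2 < end
        )
        # Pick the last audio window that starts at or before the segment start.
        idx = max(bisect_right(window_starts, start) - 1, 0)
        audio = audio_windows[idx] if audio_windows else None
        segments.append({
            "t_start": start,
            "t_end": end,
            "transcript": text,
            "scene_change": i > 0,  # every segment after the first begins at a cut
            "audio": audio,
        })
    return segments


segments = build_segments(
    duration=20,
    cuts=[12.0],
    transcript=[
        {"t_start": 0.0, "t_end": 4.5, "text": "Welcome to this video."},
        {"t_start": 13.0, "t_end": 15.0, "text": "Next topic."},
    ],
    audio_windows=[
        {"t_start": 0.0, "t_end": 10.0, "energy": "low"},
        {"t_start": 10.0, "t_end": 20.0, "energy": "high"},
    ],
)
for seg in segments:
    print(seg["t_start"], seg["t_end"], seg["scene_change"], seg["transcript"])
```

The real implementation also merges rapid cuts and derives speech rate, which this sketch leaves out.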

All results are cached in /tmp/yt-analysis-cache/<video_id>/. Re-calling the same URL is instant — only the first call pays the download + transcription cost.
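
As a rough illustration of the cache layout, the sketch below resolves the per-video directory and checks whether an artifact is already present. The `YT_CACHE_DIR` variable and its default come from this README; the helper names and the `transcript.json` filename are hypothetical, since the actual file names inside the cache directory are not documented here.

```python
import os
from pathlib import Path

DEFAULT_CACHE_ROOT = "/tmp/yt-analysis-cache"


def cache_dir_for(video_id: str) -> Path:
    """Return the cache directory for one video, honoring YT_CACHE_DIR if set."""
    root = Path(os.environ.get("YT_CACHE_DIR", DEFAULT_CACHE_ROOT))
    return root / video_id


def is_cached(video_id: str, filename: str = "transcript.json") -> bool:
    """Check for a cached artifact; the default filename is a hypothetical example."""
    return (cache_dir_for(video_id) / filename).is_file()


print(cache_dir_for("VIDEO_ID"))
```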

Prerequisites

# macOS
brew install ffmpeg

# Ubuntu / Debian
sudo apt install ffmpeg

# Verify
ffmpeg -version
python3 --version   # must be 3.10+

Installation

Dependencies are managed with uv. Install it first if you don't have it (brew install uv, or see the install guide).

git clone https://github.com/yourusername/yt-mcp.git
cd yt-mcp

# Create the virtual environment (.venv) and install all dependencies from uv.lock
uv sync

uv sync creates a .venv/ in the project directory and installs the exact, locked versions of every dependency — including the dev tools (pytest). Add --no-dev to install runtime dependencies only.

Whisper model weights download automatically on the first transcription call (~142 MB for base, ~2.9 GB for large).

MCP integration

MCP clients spawn the server as a subprocess — they do not activate your shell or venv automatically. You must point them at the venv's Python interpreter directly using its absolute path.

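uv sync puts the interpreter at .venv/bin/python. Print its absolute path from the repo root. Do not resolve it with realpath: on macOS and Linux the venv interpreter is a symlink, and realpath follows it to the base interpreter outside the venv, which cannot see the installed dependencies.

echo "$(pwd)/.venv/bin/python"   # e.g. /Users/you/repos/yt-mcp/.venv/bin/python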

Claude Code:

claude mcp add -s user yt-mcp -- /path/to/yt-mcp/.venv/bin/python /path/to/yt-mcp/server/main.py

Claude Desktop — add to ~/Library/Application Support/Claude/claude_desktop_config.json:

{
  "mcpServers": {
    "yt-mcp": {
      "command": "/path/to/yt-mcp/.venv/bin/python",
      "args": ["/path/to/yt-mcp/server/main.py"]
    }
  }
}

Replace /path/to/yt-mcp with the absolute path to wherever you cloned the repo. On Windows the interpreter is at .venv\Scripts\python.exe.
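
If you would rather generate the Claude Desktop entry than edit paths by hand, a small sketch along these lines could print it. The helper name `desktop_config` is an assumption, and the layout assumes macOS or Linux (`.venv/bin/python`); adjust the interpreter path on Windows.

```python
import json
from pathlib import Path


def desktop_config(repo_root: str) -> dict:
    """Build the mcpServers entry using absolute paths under the cloned repo."""
    repo = Path(repo_root).expanduser().resolve()
    return {
        "mcpServers": {
            "yt-mcp": {
                # Resolve only the repo directory, not the interpreter itself:
                # resolving .venv/bin/python would follow the venv symlink.
                "command": str(repo / ".venv" / "bin" / "python"),
                "args": [str(repo / "server" / "main.py")],
            }
        }
    }


# Run from the repo root and paste the output into claude_desktop_config.json.
print(json.dumps(desktop_config("."), indent=2))
```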

Docker

Prefer not to install FFmpeg, Python, and the ML stack on the host? Build the image and let the MCP client spawn it. The server speaks MCP over stdio, so the container must be run with -i (interactive stdin):

docker build -t yt-mcp .
docker run -i --rm -v yt-mcp-cache:/data/cache yt-mcp

Wire it into a client the same way as the local install, but with docker as the command:

claude mcp add -s user yt-mcp -- docker run -i --rm -v yt-mcp-cache:/data/cache yt-mcp

The -v yt-mcp-cache:/data/cache volume persists downloaded videos and Whisper model weights across runs. The image installs the CPU-only build of PyTorch on purpose — see docs/deployment.md for the build details, the cache layout, and the rationale (plus how to build a GPU variant).

Tools

The server exposes four tools. get_full_context is the primary one — it combines every signal into a single timeline. Reach for the others when you need just one modality or want to control token usage.

flowchart TD
    Q{"What do you need?"}
    Q -->|"Complete situational awareness"| FC["<b>get_full_context</b><br/>transcript + scenes + audio,<br/>time-aligned · start here"]
    Q -->|"Exact words + timestamps"| TR["<b>get_video_transcript</b><br/>Whisper, word-level"]
    Q -->|"Visual keyframes"| FR["<b>get_video_frames</b><br/>JPEGs at scene cuts / intervals"]
    Q -->|"Energy · tempo · music"| AU["<b>get_audio_features</b><br/>librosa, per window"]

    classDef primary fill:#cfe2ff,stroke:#084298,color:#031633;
    class FC primary;

Tool	Modality	Returns base64 images?	Safe for long videos?
`get_full_context`	All (transcript + scene + audio)	Only if `include_frames=true`	Yes (default `include_frames=false`)
`get_video_transcript`	Speech → text	No	Yes
`get_video_frames`	Visual	Yes (always)	Use on short clips / specific ranges
`get_audio_features`	Audio	No	Yes

`get_video_transcript`

Transcribe a YouTube video using OpenAI Whisper (runs entirely locally).

Parameter	Type	Default	Description
`youtube_url`	string	—	Full YouTube URL
`model_size`	string	`base`	`tiny` · `base` · `small` · `medium` · `large`

Response:

{
  "title": "Video Title",
  "duration": 847,
  "language": "en",
  "full_text": "Welcome to this video...",
  "segments": [
    {
      "t_start": 0.0,
      "t_end": 4.5,
      "text": "Welcome to this video.",
      "words": [{ "word": "Welcome", "start": 0.0, "end": 0.6 }]
    }
  ]
}
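
Word-level timing makes it straightforward to locate where a term is spoken. The following sketch works against the response shape shown above; the helper name `find_word` is an assumption, and it strips surrounding whitespace and basic punctuation because Whisper word tokens often carry both.

```python
def find_word(response: dict, target: str) -> list[float]:
    """Return the start time of every occurrence of `target` in a transcript response."""
    target = target.lower()
    hits = []
    for segment in response["segments"]:
        for w in segment.get("words", []):
            # Normalize the token: drop padding, case, and trailing punctuation.
            token = w["word"].strip().lower().strip(".,!?")
            if token == target:
                hits.append(w["start"])
    return hits


sample = {
    "segments": [
        {
            "t_start": 0.0,
            "t_end": 4.5,
            "text": "Welcome to this video.",
            "words": [{"word": "Welcome", "start": 0.0, "end": 0.6}],
        }
    ]
}
print(find_word(sample, "welcome"))
```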

`get_video_frames`

Extract keyframes as base64-encoded JPEGs. Uses PySceneDetect for scene detection and FFmpeg for extraction.

Parameter	Type	Default	Description
`youtube_url`	string	—	Full YouTube URL
`strategy`	string	`scene`	`scene` · `interval` · `both`
`interval`	integer	`30`	Seconds between frames (for `interval` or `both` strategies)

Response:

{
  "title": "Video Title",
  "duration": 847,
  "duration_formatted": "14:07",
  "frame_count": 12,
  "strategy": "scene",
  "frames": [
    {
      "t": 0.0,
      "t_formatted": "0:00",
      "keyframe": "<base64 JPEG>",
      "scene_change": false,
      "animation_detected": false
    }
  ],
  "summary": [ /* same list without keyframe bytes — for quick review */ ]
}
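
Because frames arrive as base64 strings, a client usually decodes them before viewing or further processing. A minimal sketch, assuming the response shape above; the helper name `save_frames` and the file naming scheme are illustrative choices.

```python
import base64
from pathlib import Path


def save_frames(response: dict, out_dir: str) -> list[Path]:
    """Decode each base64 keyframe in a get_video_frames response to a .jpg file."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for frame in response["frames"]:
        # Name files by timestamp in milliseconds so they sort chronologically.
        millis = int(round(frame["t"] * 1000))
        path = out / f"frame_{millis:08d}.jpg"
        path.write_bytes(base64.b64decode(frame["keyframe"]))
        paths.append(path)
    return paths
```

Call it as `save_frames(response, "frames/")` with the parsed JSON result of the tool call.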

`get_audio_features`

Analyze audio characteristics using librosa (runs locally).

Parameter	Type	Default	Description
`youtube_url`	string	—	Full YouTube URL
`segment_duration`	integer	`30`	Analysis window size in seconds

Response:

{
  "title": "Video Title",
  "duration": 847,
  "segment_duration": 30,
  "segments": [
    {
      "t_start": 0.0,
      "t_end": 30.0,
      "energy": "medium",
      "music": false,
      "tempo_bpm": 95.0,
      "rms_db": -22.1
    }
  ]
}
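
One common use of the per-window output is to collapse consecutive music windows into continuous ranges. The sketch below assumes the response shape above; `music_ranges` is a hypothetical helper, not part of the server.

```python
def music_ranges(response: dict) -> list[tuple[float, float]]:
    """Merge consecutive windows flagged as music into (start, end) ranges."""
    ranges: list[tuple[float, float]] = []
    for seg in response["segments"]:
        if not seg["music"]:
            continue
        if ranges and ranges[-1][1] == seg["t_start"]:
            # This window starts where the previous range ends: extend it.
            ranges[-1] = (ranges[-1][0], seg["t_end"])
        else:
            ranges.append((seg["t_start"], seg["t_end"]))
    return ranges


example = {
    "segments": [
        {"t_start": 0.0, "t_end": 30.0, "music": False},
        {"t_start": 30.0, "t_end": 60.0, "music": True},
        {"t_start": 60.0, "t_end": 90.0, "music": True},
        {"t_start": 90.0, "t_end": 120.0, "music": False},
        {"t_start": 120.0, "t_end": 150.0, "music": True},
    ]
}
print(music_ranges(example))
```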

`get_full_context`

Primary tool. Returns a complete, synchronized multi-modal timeline — transcript + scene boundaries + animation detection + audio features, all time-aligned.

Parameter	Type	Default	Description
`youtube_url`	string	—	Full YouTube URL
`include_frames`	boolean	`false`	Embed base64 keyframes per segment
`model_size`	string	`base`	Whisper model size

Response:

{
  "title": "How Transformers Work",
  "channel": "AI Explained",
  "duration": 847,
  "duration_formatted": "14:07",
  "language": "en",
  "description": "In this video...",
  "segments": [
    {
      "t_start": 0.0,
      "t_end": 12.0,
      "transcript": "Welcome to this video on transformers...",
      "keyframe": null,
      "scene_change": false,
      "animation_detected": false,
      "audio": {
        "energy": "low",
        "speech_rate": "normal",
        "music": true,
        "tempo_bpm": 0.0,
        "rms_db": -28.4
      }
    }
  ]
}

Context window tip: Call get_full_context with include_frames=false first to understand the video structure. Call get_video_frames only when you need the images, then keep the frames whose t values fall near the timestamps of interest.
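
A frames-free response is small enough to condense into a quick outline before deciding where to look more closely. This sketch assumes the response shape above; `format_timestamp` and `outline` are hypothetical helpers, and the hours format is an assumption because the README only shows minute-level examples such as "14:07".

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds as M:SS, or H:MM:SS once the value reaches an hour."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes}:{secs:02d}"


def outline(context: dict, max_chars: int = 60) -> list[str]:
    """One line per segment: '*' marks a scene change, then time and a text preview."""
    lines = []
    for seg in context["segments"]:
        marker = "*" if seg["scene_change"] else " "
        preview = seg["transcript"][:max_chars]
        lines.append(f"{marker} {format_timestamp(seg['t_start'])} {preview}")
    return lines


context = {
    "segments": [
        {"t_start": 0.0, "scene_change": False, "transcript": "Welcome to this video on transformers..."},
        {"t_start": 12.0, "scene_change": True, "transcript": "First, attention."},
    ]
}
print("\n".join(outline(context)))
```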

Supported URL Formats

https://www.youtube.com/watch?v=VIDEO_ID
https://youtu.be/VIDEO_ID
https://youtube.com/shorts/VIDEO_ID

Environment Variables

Variable	Default	Description
`YT_CACHE_DIR`	`/tmp/yt-analysis-cache`	Cache directory for downloaded videos and audio

Development

# Run the server directly (stdio mode — same as MCP clients use)
# `uv run` executes inside the project venv without needing to activate it
uv run python server/main.py

# Quick smoke test
uv run python -c "
from server.utils.downloader import VideoDownloader
from server.tools.transcript import get_transcript
d = VideoDownloader()
vp, ap, info = d.download('https://www.youtube.com/watch?v=jNQXAC9IVRw')
print(get_transcript(ap)['language'])
"

Testing

The Python server has a full unit test suite — 164 tests across 6 modules. All tests run without any network access or model downloads; every external dependency (Whisper, librosa, FFmpeg, PySceneDetect, OpenCV, yt-dlp) is mocked.
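
The mocking approach can be illustrated with a self-contained sketch. The `transcribe` wrapper below is hypothetical and stands in for the project's real code; only the `load_model(...)` and `model.transcribe(...)` calls and the `text`/`language` result keys mirror the Whisper API. The point is that a `MagicMock` replaces the model, so the test needs no weights and no network.

```python
from unittest.mock import MagicMock


def transcribe(audio_path: str, load_model) -> dict:
    """Hypothetical wrapper: load a Whisper-style model and return language plus text."""
    model = load_model("base")
    result = model.transcribe(audio_path)
    return {"language": result["language"], "full_text": result["text"].strip()}


def test_transcribe_uses_mocked_model():
    # Stand-in for the model object that whisper.load_model would return.
    fake_model = MagicMock()
    fake_model.transcribe.return_value = {"language": "en", "text": " Hello world. "}
    load_model = MagicMock(return_value=fake_model)

    out = transcribe("audio.wav", load_model)

    assert out == {"language": "en", "full_text": "Hello world."}
    load_model.assert_called_once_with("base")
    fake_model.transcribe.assert_called_once_with("audio.wav")
```

Passing the loader in as an argument is one design choice that keeps the example free of patching; the project's own tests may instead patch module attributes, as described in docs/testing.md.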

Install test dependencies

The dev dependencies (pytest, pytest-mock) are installed by uv sync — no separate step needed.

Run the full suite

uv run pytest

Expected output: 164 passed in ~4s

Run tests for a specific module

uv run pytest tests/test_downloader.py   # VideoDownloader + VideoInfo
uv run pytest tests/test_transcript.py   # Whisper wrapper + range helpers
uv run pytest tests/test_frames.py       # FFmpeg, PySceneDetect, OpenCV
uv run pytest tests/test_audio.py        # librosa AudioAnalyzer
uv run pytest tests/test_timeline.py     # build_timeline + speech rate
uv run pytest tests/test_main.py         # all 4 MCP tool handlers

Run a single test by name

uv run pytest tests/test_timeline.py::TestBuildTimeline::test_rapid_cuts_below_min_merged -v

Live smoke test against a real video

The example below uses プリマドンナ / 星街すいせい (Hoshimachi Suisei · Suisei Channel, 2:52) — a Japanese music video that exercises every layer of the pipeline: multilingual Whisper transcription, music detection via librosa HPSS, rapid scene cuts via PySceneDetect, and animation detection via OpenCV pixel-diff.

from server.utils.downloader import VideoDownloader
from server.tools.transcript import get_transcript
from server.tools.audio import AudioAnalyzer
from server.tools.frames import detect_scene_timestamps

URL = "https://www.youtube.com/watch?v=M1GYqy0tHV0"

d = VideoDownloader()
video_path, audio_path, info = d.download(URL)

print(f"Title:    {info.title}")        # プリマドンナ / 星街すいせい(official)
print(f"Duration: {info.duration:.0f}s")  # 172

transcript = get_transcript(audio_path, model_size="base")
print(f"Language: {transcript['language']}")  # ja

cuts = detect_scene_timestamps(video_path)
print(f"Scene cuts detected: {len(cuts)}")    # typically 30–60 for a music video

analyzer = AudioAnalyzer(audio_path)
seg = analyzer.analyze_segment(0, 30)
print(f"First 30s — energy: {seg['energy']}, music: {seg['music']}")
# energy: 'medium' or 'high', music: True

For the full test guide — fixtures, mock patterns, writing tests for new tools — see docs/testing.md.

Documentation

Document	What it covers
SPEC.md	Formal specification — tool contracts, data schemas, algorithms, thresholds, and the error model. The authoritative reference.
docs/architecture.md	System design, data-flow and UML diagrams, and key design decisions
docs/python-server.md	Component reference for every module
docs/extending.md	How to add new tools
docs/testing.md	Test suite structure, fixtures, and writing new tests
docs/deployment.md	Docker build & run, MCP client config, cache volume, and the CPU-torch design decision
TODO.md	Running roadmap of planned improvements and ideas

TypeScript Server (archived)

The src/ directory contains an experimental TypeScript server that delegates video analysis to the Gemini API. It is not under active development and is kept only for reference.

If you're looking for fast cloud-based video Q&A, the TypeScript server's approach (passing the YouTube URL directly to Gemini) works well for a quick prototype — but the Python server is the only implementation that will receive ongoing maintenance.

See docs/typescript-server.md for its API reference.

License

MIT
