YouTube Transcript Fetcher (YTT)
Enables searching for YouTube videos and fetching high-accuracy transcripts using Whisper AI or YouTube's internal API, with support for batch processing and various output formats.
yttranscript-mcp
Fetch, clean, semantically search, and ask questions about YouTube transcripts — captions-first (fast, light, rate‑limit resistant) with an optional local Whisper fallback, an MCP server, a CLI, and an async Python library. No YouTube Data API key required. Everything runs locally.
Built for feeding transcripts to LLMs: the clean format strips rolling auto‑caption duplication, HTML entities, markup, and timestamps so you spend the fewest tokens possible.
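To make the idea of the clean format concrete, here is a minimal sketch of collapsing rolling auto-captions, where each cue repeats the tail of the previous one. It is not the library's actual implementation: the function name `clean_captions` and the word-overlap heuristic are assumptions made for illustration.

```python
import html
import re

# Inline markup such as <c>...</c> or <00:00:01.000> timing tags.
TAG_RE = re.compile(r"<[^>]+>")


def clean_captions(cues: list[str]) -> str:
    """Merge caption cues into plain text, dropping rolling duplication.

    Hypothetical helper: for each cue, find the longest run of words that
    both ends the text collected so far and starts the new cue, then
    append only the words after that overlap.
    """
    words: list[str] = []
    for cue in cues:
        # Strip markup first, then decode entities like &amp; or &#39;.
        text = html.unescape(TAG_RE.sub("", cue))
        new = text.split()
        overlap = 0
        for k in range(min(len(words), len(new)), 0, -1):
            if words[-k:] == new[:k]:
                overlap = k
                break
        words.extend(new[overlap:])
    return " ".join(words)
```

A heuristic like this can also swallow genuine repetition (a speaker who really says the same phrase twice in a row), so a real implementation would likely compare cue timing as well.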
What makes it special
Search inside videos. Ask a natural-language question; get the exact timestamped moments with deep-link URLs (https://youtu.be/ID?t=123). Hybrid BM25 + local embeddings, fused with Reciprocal Rank Fusion and diversified with MMR: exa-quality retrieval, fully local, zero new dependencies (it works with no model at all and gets better when you add one).
Ask questions (local RAG). ytt ask ID "question" retrieves the relevant passages and, if a local LLM is running, writes a grounded answer that cites timestamps. No LLM? You still get the cited passages.
Cross-video corpus search. Index a library of videos once (ytt index …), then ytt find "query" searches across all of them: exa for your own YouTube collection, in a single SQLite file.
Beats yt-dlp for transcripts, audio-free. List every caption language (ytt langs), machine-translate captions into any language (--translate), and dump rich metadata + chapters (ytt info), all from the lightweight captions path, no video download.
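For readers unfamiliar with the two ranking techniques named above, the following is a small, generic sketch of Reciprocal Rank Fusion and Maximal Marginal Relevance. It illustrates the textbook algorithms only; the function names, the `k=60` constant, and the `lam` trade-off value are assumptions, not the project's code or defaults.

```python
import math
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of ids: each list adds 1 / (k + rank) per id."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(relevance: dict[str, float], vectors: dict[str, list[float]],
        top_n: int, lam: float = 0.7) -> list[str]:
    """Greedy MMR: trade relevance against similarity to already-picked items."""
    selected: list[str] = []
    remaining = set(relevance)
    while remaining and len(selected) < top_n:
        def score(doc_id: str) -> float:
            redundancy = max(
                (cosine(vectors[doc_id], vectors[s]) for s in selected),
                default=0.0,
            )
            return lam * relevance[doc_id] - (1.0 - lam) * redundancy

        # sorted() makes tie-breaking deterministic.
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

In this sketch the lexical and dense rankings would be passed to `reciprocal_rank_fusion` as two lists of passage ids. Raw RRF scores are small compared with cosine similarities, so they would need rescaling (for example to the 0..1 range) before being used as `relevance` in `mmr`.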
vs. the tools you already use
| | yttranscript-mcp | yt-dlp | exa |
|---|---|---|---|
| Captions / subtitles | ✅ captions-first, multi-client anti-throttle | ✅ (downloads via page scrape) | ❌ |
| List caption languages | ✅ | ✅ | ❌ |
| Translate captions | ✅ | ✅ | ❌ |
| Metadata + chapters (no download) | ✅ | ⚠️ | ❌ |
| Semantic search inside content | ✅ timestamped passages | ❌ | ✅ (web, cloud) |
| Neural/RAG question answering | ✅ local | ❌ | ✅ (cloud) |
| Cross-document semantic search | ✅ local corpus index | ❌ | ✅ (cloud) |
| Runs fully local / private | ✅ | ✅ | ❌ |
Why it avoids rate limits
Captions first, audio last. It pulls YouTube's own caption tracks (a few KB) instead of downloading audio. Whisper only runs when no captions exist.
No watch‑page scrape on the hot path. A hardcoded public Innertube key skips the ~1 MB HTML fetch that triggers bot detection.
Realistic headers. Every request sends a browser/app User-Agent and language headers (the default python-requests UA is the #1 cause of 429s).
Multi‑client fallback. Tries ANDROID_VR → WEB → MWEB → watch‑page until one returns captions, sidestepping client‑specific blocks.
Retry/backoff. Transient 429/5xx are retried with exponential backoff + jitter, honoring Retry-After.
Escape hatches. YTT_PROXY and YTT_COOKIES_FILE route around IP‑level blocks and age/region gates.
Caching. SQLite cache avoids re‑fetching.
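The retry/backoff behaviour described above can be sketched roughly as follows. This is a generic illustration, not the project's code: `fetch_with_retry`, its parameters, and the `(status, headers, body)` response shape are all hypothetical, and only the delta-seconds form of `Retry-After` is handled (the HTTP-date form is ignored).

```python
import random
import time
from typing import Callable, Mapping

# Hypothetical response shape for this sketch: (status, headers, body).
Response = tuple[int, Mapping[str, str], bytes]


def fetch_with_retry(
    send: Callable[[], Response],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Call send() until it returns a non-retryable status or attempts run out."""
    for attempt in range(max_attempts):
        status, headers, body = send()
        retryable = status == 429 or 500 <= status < 600
        if not retryable or attempt == max_attempts - 1:
            return status, headers, body
        retry_after = headers.get("Retry-After", "")
        if retry_after.isdigit():
            # The server said how many seconds to wait; honor it.
            delay = float(retry_after)
        else:
            # Exponential backoff with full jitter: uniform in [0, base * 2^attempt].
            delay = random.uniform(0, base_delay * 2 ** attempt)
        sleep(min(delay, max_delay))
    raise ValueError("max_attempts must be at least 1")
```

Injecting `sleep` keeps the function testable without real waiting. The jitter spreads retries from many clients so they do not all hit the server at the same instant.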
Install
pip install yttranscript-mcp # core: captions + CLI + library
pip install "yttranscript-mcp[mcp]" # + MCP server
pip install "yttranscript-mcp[whisper]" # + local Whisper fallback (needs ffmpeg)
pip install "yttranscript-mcp[whisper,torch]" # + GPU

The Whisper fallback needs ffmpeg: winget install ffmpeg (Windows) · brew install ffmpeg (macOS) · sudo apt install ffmpeg (Linux).
CLI
ytt transcript VIDEO_ID # clean, LLM-ready text (default)
ytt transcript "https://youtu.be/VIDEO_ID" # URLs work too
ytt transcript VIDEO_ID -f json # or text | srt | vtt
ytt transcript VIDEO_ID -o out.txt # save to file
ytt transcript VIDEO_ID --no-whisper # captions only, never download audio
ytt transcript VIDEO_ID --translate es # machine-translate captions to Spanish
ytt batch VID1 VID2 VID3 # many videos, concurrent
ytt search "python tutorial" -n 5 # YouTube keyword search
ytt search "python tutorial" --rank # NEURAL re-rank by transcript relevance
ytt search "python tutorial" --with-transcripts
ytt cache-stats [--clean]

Semantic search & Q&A (fully local)
# Metadata + chapters, no audio download (beats yt-dlp --dump-json for the essentials)
ytt info VIDEO_ID
# Every available caption language + translation targets (like yt-dlp --list-subs)
ytt langs VIDEO_ID
# Search INSIDE a video — ask a question, get the exact timestamped moments
ytt ask VIDEO_ID "how does the event loop schedule coroutines?"
ytt ask VIDEO_ID "what's the main argument?" --passages-only # skip the LLM answer
# Build a local corpus index, then search across ALL indexed videos
ytt index VID1 VID2 VID3
ytt find "retrieval augmented generation tradeoffs"
ytt corpus # index stats

ytt ask retrieves the most relevant passages and, if a local LLM is reachable, writes a grounded answer citing timestamps; otherwise it just returns the cited passages. ytt find searches your whole indexed library. Both print deep-link URLs straight to the moment in each video.
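A deep link of the form shown earlier (https://youtu.be/ID?t=123) is simply the video id plus a whole-second offset. The helpers below are an illustrative sketch; `deep_link` and `format_timestamp` are hypothetical names, not part of the documented API.

```python
def deep_link(video_id: str, start_seconds: float) -> str:
    """Build a youtu.be URL that opens the video at the given offset."""
    # The t= parameter takes whole seconds, so fractions are truncated.
    return f"https://youtu.be/{video_id}?t={int(start_seconds)}"


def format_timestamp(seconds: float) -> str:
    """Render seconds as M:SS, or H:MM:SS once past the hour."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes}:{secs:02d}"
```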
Python library
import asyncio
from ytt import get_transcript, search, search_and_get_transcripts

async def main():
    # clean = deduplicated, no timestamps; best for LLMs
    r = await get_transcript("dQw4w9WgXcQ", output_format="clean")
    print(r.source, r.language, r.content[:200])
    for v in await search("python tutorial", max_results=5):
        print(v.video_id, v.title)
    for video, transcript in await search_and_get_transcripts("python", max_results=3):
        if transcript:
            print(video.title, "→", transcript.content[:80])

asyncio.run(main())

Semantic search, RAG, and the cross-video corpus index are all public too:
import asyncio
from ytt import (
    search_in_video, ask_video, index_videos, find_in_corpus, get_video_info, search_ranked,
)

async def main():
    # Search inside one video → timestamped passages with deep-link URLs.
    res = await search_in_video("VIDEO_ID", "the key tradeoffs")
    for p in res.passages:
        print(p.timestamp, p.score, p.url(), p.text[:80])
    # Local RAG: grounded answer + cited passages (falls back to passages if no LLM).
    a = await ask_video("VIDEO_ID", "what is the main conclusion?")
    print(a.answer or a.note)
    # Cross-video semantic search over a local corpus index.
    await index_videos(["VID1", "VID2", "VID3"])
    for p in await find_in_corpus("retrieval augmented generation"):
        print(p.title, p.timestamp, p.url())
    # Neural re-ranking of YouTube results by transcript relevance.
    for video, score in await search_ranked("python asyncio", max_results=5):
        print(f"{score:.3f}", video.title)
    meta = await get_video_info("VIDEO_ID")
    print(meta.title, meta.view_count, "views;", len(meta.chapters), "chapters")

asyncio.run(main())

Embedding backends (how "fully local" stays local)
Semantic search resolves an embedder in this order, and never needs the network by default:
| | Backend | Notes |
|---|---|---|
| | | Offline-safe, zero config |
| | Dependency-free hashing embedder | Instant, deterministic, no install |
| | Local neural model (CPU/GPU) | |
| | Local Ollama embedding model | Best quality; … |
| | Any OpenAI-compatible … | Point at localhost (llama.cpp, LM Studio, vLLM) |
With no embedding model at all, search still works as fast cross-video BM25. Add a real embedder and it automatically upgrades to hybrid lexical + dense ranking. Nothing leaves the machine unless you configure a remote endpoint.
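To show how an embedder can be dependency-free and deterministic, here is a minimal sketch of the feature-hashing technique. It is an assumption about the general approach, not the project's hashing embedder: `hash_embed`, the 256-dimension default, and the use of SHA-1 are all illustrative choices.

```python
import hashlib
import math
import re


def hash_embed(text: str, dim: int = 256) -> list[float]:
    """Map text to a unit-length vector by hashing each token into a bucket."""
    vec = [0.0] * dim
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.sha1(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % dim
        # A second hash bit picks the sign, which reduces collision bias.
        sign = 1.0 if digest[4] & 1 else -1.0
        vec[index] += sign
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def cosine(a: list[float], b: list[float]) -> float:
    """Dot product; equals cosine similarity because vectors are unit-length."""
    return sum(x * y for x, y in zip(a, b))
```

Vectors like these only capture shared vocabulary, not meaning, which is consistent with the table's ordering: a hashing embedder is instant and needs no install, while a neural model gives better quality.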
output_format: clean (default for the CLI/MCP), text, json, srt, vtt, summary.
Local summarization (save tokens)
Compress a transcript to a short summary with a local LLM, so the agent that consumes it ingests a summary instead of the whole transcript. Nothing leaves the machine. Opt-in; needs a local model (default Ollama).
ollama serve
ollama pull qwen3.6:27b # default; smaller options: qwen3:8b, qwen3:4b, qwen3.5:2b

ytt transcript VIDEO_ID --summarize
ytt transcript VIDEO_ID --summarize --summary-model qwen3:8b

import asyncio
from ytt import get_transcript

async def main():
    r = await get_transcript("VIDEO_ID", output_format="summary")
    print(r.content) # bullet-point summary

asyncio.run(main())

Long transcripts are summarized in map-reduce fashion (chunk → reduce) so that each call fits the model's context.
YTT_SUMMARY_KEEP_ALIVE keeps the model hot between calls (-1 = forever).
YTT_SUMMARY_AUTO_PULL=1 pulls a missing model on demand.
YTT_SUMMARY_PROVIDER=openai targets any OpenAI-compatible server (llama.cpp, LM Studio, vLLM) via YTT_SUMMARY_OPENAI_BASE.
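The map-reduce idea can be sketched in a few lines. This is a simplified illustration under stated assumptions: `map_reduce_summarize` and `chunk_text` are hypothetical names, chunks are split by word count rather than tokens, and the `summarize` callable stands in for whatever local LLM call is actually used.

```python
from typing import Callable


def chunk_text(text: str, max_words: int) -> list[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def map_reduce_summarize(
    text: str,
    summarize: Callable[[str], str],
    max_words: int = 1500,
) -> str:
    """Summarize each chunk (map), then summarize the joined summaries (reduce)."""
    chunks = chunk_text(text, max_words)
    if not chunks:
        return ""
    if len(chunks) == 1:
        # Short enough for a single call; no reduce step needed.
        return summarize(chunks[0])
    partials = [summarize(chunk) for chunk in chunks]
    return summarize("\n".join(partials))
```

This sketch performs a single reduce step. If the joined partial summaries were still too long for the model's context, a further round of chunking would be needed.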
MCP server
yttranscript-mcp # or: python -m ytt.mcp.server

Claude Desktop / Cursor config:
{
"mcpServers": {
"yt-transcript": {
"command": "yttranscript-mcp"
}
}
}

Tools:
Transcripts: get_transcript, get_transcripts_batch (default to the clean format), summarize_video (local-LLM summary)
Metadata: get_video_info (title/channel/views/chapters, no download), list_caption_languages (tracks + translation targets)
Semantic (fully local): search_transcript (timestamped passages inside a video), ask_video (local RAG with cited timestamps), index_videos + find_in_corpus (cross-video search)
Search: search_videos (add rank=True for neural transcript-relevance re-ranking)
GPU: setup_gpu, download_cuda
Configuration (environment variables)
Everything is tunable without code via YTT_* env vars:
| Variable | Default | Purpose |
|---|---|---|
| | – | |
| | – | Netscape … |
| | | Retry attempts on 429/5xx |
| | | Client‑side token‑bucket rate limit |
| | | Per‑request timeout (s) |
| | | Cache path |
| | | Cache TTL |
| | | Use GPU for Whisper if available |
| | | Local summary model |
| | | Ollama endpoint |
| | | Keep model hot (-1 = forever) |
| | | Pull missing model on demand |
| | – | Embedding model (e.g. …) |
| | | Retrieval window size + overlap |
| | | Corpus index path |
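The client-side token-bucket rate limit mentioned in the table is a standard technique; a minimal sketch follows. The class name `TokenBucket`, its parameters, and the non-blocking `try_acquire` method are assumptions for illustration and do not describe the project's internals.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full so the first burst is allowed
        self.clock = clock
        self.updated = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Take `cost` tokens if available; return False instead of blocking."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A caller would check `try_acquire()` before each outgoing request and wait briefly when it returns False. The injectable `clock` exists only so the behaviour can be tested without real delays.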
How it works
video id / url → cache → captions (ANDROID_VR → WEB → MWEB → watch page)
↓ none?
Whisper fallback (download audio, transcribe locally)
↓
cache → format (clean / text / json / srt / vtt)

Semantic search (ask / find / search --rank) runs on top, fully local:
transcript → timestamped chunks (rolling-caption dedup, overlap)
↓
BM25 (always) + dense embeddings (if a local backend is configured)
↓
Reciprocal Rank Fusion → MMR diversity → ranked passages w/ deep links
↓ (ask only)
local LLM writes a grounded answer citing the timestamps

Development
uv sync --extra dev --extra mcp
uv run pytest
uv run ruff check ytt/ && uv run black --check ytt/

License
MIT