media-context-mcp
Your assistant can read text and look at a picture, but it can't watch a video or listen to audio. media-context-mcp fills that gap. Point it at a file or a URL and it hands back clean, model-ready context — sampled frames, a transcript, or the text on screen — without sending anything to the cloud.
Features
Any source — video, audio, or images; a local file or a URL (YouTube, Vimeo, direct links, and 1000+ more).
See video — a quick montage overview, full-resolution stills, scene-change shots, or a dense filmstrip that catches glitches lasting a fraction of a second.
Hear audio — turn speech in a clip, voice note, or podcast into text.
Read screens — pull the exact text off a UI, an error dialog, or a screenshot.
Cheap by design — frames are tiled and downscaled, so a long clip costs a couple of images instead of hundreds.
Private & local — everything runs on your machine. No API keys, no uploads.
Works everywhere — any MCP client: Claude, Cursor, VS Code, and more.
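The "cheap by design" point comes down to simple arithmetic: when sampled frames are packed into a grid, the number of images the model receives drops by the grid size. The sketch below is illustrative only. The 4x4 grid, the 320-pixel tile width, and the one-frame-every-5-seconds rate are assumptions, not the server's documented defaults, and the helper names are hypothetical. The filter string uses ffmpeg's standard fps, scale, and tile filters.

```python
import math


def montage_count(duration_s: float, every_s: float, cols: int, rows: int) -> tuple[int, int]:
    """Return (sampled_frames, montage_images) for a clip of the given length."""
    frames = max(1, math.floor(duration_s / every_s))
    images = math.ceil(frames / (cols * rows))
    return frames, images


def tile_filter(every_s: float, tile_width: int, cols: int, rows: int) -> str:
    """Build an ffmpeg -vf string: sample, downscale, then pack frames into a grid."""
    # fps=1/N keeps one frame every N seconds; scale=W:-2 keeps the aspect ratio
    # with an even height; tile=CxR packs C*R frames into a single image.
    return f"fps=1/{every_s:g},scale={tile_width}:-2,tile={cols}x{rows}"


frames, images = montage_count(duration_s=160, every_s=5, cols=4, rows=4)
print(frames, images)  # 32 2
print(tile_filter(5, 320, 4, 4))
```

Under these assumed settings, a 160-second clip yields 32 sampled frames but only 2 montage images.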
Use cases
Give an LLM video context — turn a clip into frames and text your model can reason over.
Analyze a screen recording — read the on-screen error, walk a UI flow, or debug a bug video from QA.
Summarize a YouTube video — paste a link, get the gist plus a transcript.
Transcribe audio — meetings, standups, voice notes, podcasts → text, locally.
Extract text from a screenshot — pull an exact error, stack trace, or table out of an image.
Extract frames from a video — sampled stills for the model to read.
Catch UI glitches — frame-by-frame, including flickers shorter than a second.
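Catching a sub-second flicker is a sampling question: the gap between sampled frames has to be shorter than the glitch. A minimal sketch of how a dense burst of timestamps around a moment of interest could be computed. The function name, the 1-second window, and the 10 frames-per-second rate are hypothetical choices for illustration, not the server's actual parameters.

```python
def burst_timestamps(center_s: float, window_s: float, fps: float) -> list[float]:
    """Evenly spaced timestamps starting at max(0, center - window/2)."""
    start = max(0.0, center_s - window_s / 2)
    count = int(round(window_s * fps)) + 1
    return [round(start + i / fps, 3) for i in range(count)]


# A flicker "around 0:06": sample every 100 ms from 5.5 s to 6.5 s.
stamps = burst_timestamps(center_s=6.0, window_s=1.0, fps=10)
print(len(stamps), stamps[0], stamps[-1])
```

At 10 frames per second the gap between samples is 100 ms, so any glitch that stays on screen at least that long appears in at least one frame.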
Install
1. Add it to your MCP client. The launch command is always npx -y media-context-mcp.
Claude Code:
claude mcp add media-context -- npx -y media-context-mcp
Claude Desktop: Settings → Developer → Edit Config (claude_desktop_config.json). The env block is optional and only needed if the transcription / text-recognition tools aren't on your PATH:
{
  "mcpServers": {
    "media-context": {
      "command": "npx",
      "args": ["-y", "media-context-mcp"],
      "env": { "WHISPER_BIN": "/path/to/whisper", "TESSERACT_BIN": "/path/to/tesseract" }
    }
  }
}
Add to the client's MCP config (~/.cursor/mcp.json, ~/.codeium/windsurf/mcp_config.json, Cline settings, …):
{
  "mcpServers": {
    "media-context": { "command": "npx", "args": ["-y", "media-context-mcp"] }
  }
}
Create .vscode/mcp.json (VS Code uses the servers key):
{
  "servers": {
    "media-context": { "command": "npx", "args": ["-y", "media-context-mcp"] }
  }
}
~/.codex/config.toml:
[mcp_servers.media-context]
command = "npx"
args = ["-y", "media-context-mcp"]
2. Run setup. One command installs what the server needs via your OS package manager:
npx media-context-mcp setup          # everything for files + URLs + text
npx media-context-mcp setup --audio  # also enable transcription
check_media_deps shows what's ready at any time, and npx media-context-mcp setup --uninstall removes the tools again. Prefer to install by hand?
The package ships no binaries — it drives tools on your machine. Only ffmpeg is required; the rest are optional, one feature each.
| Tool | For | Install |
| ffmpeg | required | |
| yt-dlp | URLs | |
| tesseract | on-screen text | |
| whisper | transcription | |
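If you install by hand, a quick local check can confirm which tools are reachable before you start the server. The sketch below is not the server's check_media_deps implementation. It assumes the executables are named ffmpeg, yt-dlp, tesseract, and whisper (inferred from the environment variable names and the FAQ), and that WHISPER_BIN and TESSERACT_BIN take precedence over PATH when set. The helper names are hypothetical.

```python
import os
import shutil

# tool name -> (feature it enables, optional env var holding an explicit path)
# The names and the env-var precedence are assumptions for illustration.
TOOLS = {
    "ffmpeg": ("required", None),
    "yt-dlp": ("URLs", None),
    "tesseract": ("on-screen text", "TESSERACT_BIN"),
    "whisper": ("transcription", "WHISPER_BIN"),
}


def find_tool(name, env_var=None, environ=os.environ):
    """Return the path to a tool, or None if it cannot be found."""
    override = environ.get(env_var) if env_var else None
    if override:
        # An explicit path wins, but only if it points at a real file.
        return override if os.path.isfile(override) else None
    return shutil.which(name)


def report(environ=os.environ):
    """Map every known tool to its resolved path (or None)."""
    return {name: find_tool(name, env_var, environ) for name, (_, env_var) in TOOLS.items()}


for name, path in report().items():
    feature = TOOLS[name][0]
    print(f"{name:10} {feature:15} {path or 'not found'}")
```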
Examples
Just ask your assistant in plain language — it picks the right options for you.
“Summarize demo.mp4.” → a quick overview from sampled frames.
“What error does the app show at the end of bug.mp4?” → reads the on-screen text.
“Transcribe standup.m4a and list the action items.” → speech to text.
“Summarize https://youtu.be/VIDEO_ID and include the transcript.” → fetches and transcribes.
“In slider.mp4, find the frame where the slider flickers around 0:06.” → scans a dense burst of frames to catch a sub-second glitch.
Want finer control — modes, cropping, language, sampling rate? It's all in the usage guide.
Tools
The server exposes two tools, which your assistant calls automatically.
| Tool | What it does |
| | Turn a video, audio, or image (file or URL) into model-readable context. Auto-detects the type: video → frames, stills, scene montages, or a dense filmstrip; audio → a transcript; image → the picture plus optional text recognition. Supports cropping, time windows, language, and sampling rate. |
| check_media_deps | Report which optional capabilities (URL fetching, transcription, text recognition) are ready, with setup hints. |
Everything runs locally, and each call cleans up its temporary files when it returns.
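For readers curious what "calls automatically" means on the wire: MCP clients invoke tools with a JSON-RPC 2.0 tools/call request. The sketch below builds such a request for check_media_deps. That the tool accepts an empty arguments object is an assumption, and the tools_call helper is hypothetical; in practice your MCP client constructs and sends this message for you.

```python
import json


def tools_call(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize an MCP tools/call request as one JSON-RPC 2.0 message."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })


# Assumption: check_media_deps needs no arguments.
print(tools_call(1, "check_media_deps", {}))
```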
FAQ
Can Claude (or any LLM) watch a video? Not directly — models take images and text, not video. This server extracts frames and audio transcripts so your assistant can analyze the video.
How do I give Claude Code, Cursor, or VS Code video context? Add the server (see Install), then ask in plain language — it works in any MCP client.
Can it convert video or audio to text? Yes — it samples frames for the model to read and transcribes speech locally.
Does it work offline, without an API key? Yes. Everything runs on your machine; nothing is uploaded and no keys are required.
Does it support YouTube and other links? Yes — any yt-dlp-supported URL.
Is it free? Yes, open source under Apache-2.0.
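To make the "video or audio to text" answer concrete: local transcription generally starts by pulling the audio track out of the file. The sketch below builds an ffmpeg command that drops the video stream and writes 16 kHz mono WAV, the input format Whisper models work with. It illustrates the general technique, not the server's actual pipeline, and the function name is hypothetical.

```python
import shlex


def audio_extract_cmd(src: str, dst_wav: str) -> list[str]:
    """ffmpeg arguments that extract a 16 kHz mono PCM WAV from any media file."""
    return [
        "ffmpeg", "-y",        # overwrite the output file if it exists
        "-i", src,
        "-vn",                 # ignore any video stream
        "-ac", "1",            # downmix to mono
        "-ar", "16000",        # resample to 16 kHz
        "-c:a", "pcm_s16le",   # 16-bit PCM
        dst_wav,
    ]


# Pass the list to subprocess.run(...) to execute it; here it is only printed.
print(shlex.join(audio_extract_cmd("standup.m4a", "standup.wav")))
```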
Development
npm install
npm run build
npm test
Tests cover the pipeline end-to-end; the integration ones skip themselves when the optional tools aren't installed. Issues and PRs welcome.
License
Apache-2.0 © Vishal Gupta. Free and open — use it however you like.