media-mcp
Social media at your fingertips. 31 tools across Twitter/X, YouTube, Instagram, and video processing — from Claude Desktop, Claude Code, or any MCP client. 100% open source.
Point it at a tweet and get the full text, metrics, and video transcription. Give it a YouTube URL and get the transcript. Drop an Instagram reel and get the media downloaded plus audio transcribed. All transcription runs locally via Whisper — no audio leaves your machine.
The thesis: ears always, eyes only when the ears fail
Small Whisper models are great at hearing but terrible at reading. They mishear unusual names. They can't transcribe text on screen. They skip burned-in captions. For 90% of questions about a video this doesn't matter — the gist is enough.
But when a user asks "what's the install command in this reel?" or "what's the handle he showed?", transcription alone will confidently give the wrong answer. The URL was on screen. The proper noun was spelled out in the caption. Whisper never saw any of it.
media-mcp transcribes with per-token confidence via whisper-cli -ojf and flags uncertainty zones (where Whisper admits it was guessing) and demonstrative phrases ("visit our", "this command", "in the bio" — strong signals that on-screen content is being referenced). The LLM reads those markers and decides whether to call get_video_frames_at on the specific timestamps that need visual verification. Frames only come out when they need to. The LLM's own vision does the reading — no OCR, no second model.
Result: the agent has ears on every video, eyes only where ears fail. Minimum frames, maximum accuracy.
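As a rough illustration of the demonstrative-phrase scan described above, here is a minimal sketch. The `Segment` and `PhraseHit` shapes, the function name, and the phrase list (seeded only with the three examples quoted above) are assumptions for illustration, not the server's actual code.

```typescript
// Hypothetical segment shape; the real server's types may differ.
interface Segment {
  startS: number; // segment start, in seconds
  endS: number; // segment end, in seconds
  text: string;
}

interface PhraseHit {
  phrase: string;
  midpointS: number; // midpoint of the segment containing the phrase
}

// Illustrative list, seeded with the three examples quoted in the text.
const DEMONSTRATIVE_PHRASES = ["visit our", "this command", "in the bio"];

// Report every segment whose text contains one of the phrases.
function findDemonstrativeHits(segments: Segment[]): PhraseHit[] {
  const hits: PhraseHit[] = [];
  for (const seg of segments) {
    const lower = seg.text.toLowerCase();
    for (const phrase of DEMONSTRATIVE_PHRASES) {
      if (lower.includes(phrase)) {
        hits.push({ phrase, midpointS: (seg.startS + seg.endS) / 2 });
      }
    }
  }
  return hits;
}

const demoHits = findDemonstrativeHits([
  { startS: 0, endS: 4, text: "Welcome back to the channel." },
  { startS: 4, endS: 8, text: "Just run this command in your terminal." },
]);
console.log(demoHits);
```

Each hit carries a midpoint timestamp, which is the kind of value an agent could later pass to get_video_frames_at.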
What it does
Fetches tweets, threads, profiles, followers, trends, and search results from Twitter/X (26 tools via TwitterAPI.io REST API)
Transcribes video audio locally using whisper-cli — downloads media, extracts audio with ffmpeg, runs Whisper on your hardware, emits per-token confidence and demonstrative-phrase hits so the LLM knows where the audio channel is unreliable
Downloads Instagram posts, reels, and carousels to local folders via a self-hosted Cobalt instance
Extracts frames from any video URL at configurable FPS, or precisely at an array of timestamps via get_video_frames_at (cache-aware, no re-download on follow-ups)
Monitors Twitter users in real-time and filters tweets by keyword rules
Caches downloaded videos in ~/.media-mcp/cache/videos/ (sha256-of-URL keyed, 24h TTL) so transcription + frame-lookup on the same video happens in one download
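The cache keying described above could be sketched roughly as follows. This assumes the raw URL string is hashed without normalization and that freshness is judged by file modification time; both are guesses, and the function names are hypothetical.

```typescript
import { createHash } from "node:crypto";
import { homedir } from "node:os";
import { join } from "node:path";

const CACHE_TTL_MS = 24 * 60 * 60 * 1000; // 24h TTL, as stated in the text

// Map a URL to its cache file: <cacheDir>/<sha256(url)>.mp4
// Assumption: the URL is hashed as-is, with no normalization.
function cachePathFor(
  url: string,
  cacheDir: string = join(homedir(), ".media-mcp", "cache", "videos"),
): string {
  const key = createHash("sha256").update(url).digest("hex");
  return join(cacheDir, `${key}.mp4`);
}

// A cached file is reusable while it is younger than the TTL.
// Assumption: age is measured from the file's modification time.
function isFresh(mtimeMs: number, nowMs: number = Date.now()): boolean {
  return nowMs - mtimeMs < CACHE_TTL_MS;
}

console.log(cachePathFor("https://example.com/clip.mp4", "/tmp/cache"));
```

Because the key depends only on the URL, a transcription call and a later frame lookup for the same URL resolve to the same file.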
How it works
The LLM never scrapes HTML or parses DOM. Every tool calls a purpose-built API and returns structured, LLM-ready text.
For text data (tweets, profiles, trends): one REST call to TwitterAPI.io, parsed into formatted output.
For transcription (tweet videos, YouTube, Instagram reels): the pipeline downloads media to the shared cache, extracts audio with ffmpeg (16kHz mono WAV), transcribes with whisper-cli using -ojf (output-json-full) to preserve per-token probabilities, then returns an LLM-readable transcript with inline ⟨token p=0.XX⟩ markers plus summary blocks for uncertainty zones and demonstrative phrases. For YouTube, captions are tried first (instant); Whisper is only the fallback.
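To make the inline marker format concrete, here is a minimal sketch of rendering one segment. It assumes each token in the whisper-cli JSON carries a `text`, a probability `p`, and millisecond `offsets`, and that special tokens start with `[_`; that is my understanding of the output, not something this README confirms. Marking only tokens below 0.5 is also an assumption about the rendering policy.

```typescript
// Assumed token shape from whisper-cli's full JSON output.
interface WhisperToken {
  text: string;
  p: number; // per-token probability
  offsets: { from: number; to: number }; // milliseconds
}

// Wrap low-confidence tokens in ⟨token p=0.XX⟩ markers; pass others through.
function renderSegment(tokens: WhisperToken[], threshold = 0.5): string {
  return tokens
    .filter((t) => !t.text.startsWith("[_")) // skip special tokens such as [_BEG_]
    .map((t) => {
      if (t.p >= threshold) return t.text;
      const leading = t.text.startsWith(" ") ? " " : "";
      return `${leading}⟨${t.text.trim()} p=${t.p.toFixed(2)}⟩`;
    })
    .join("")
    .trim();
}

const rendered = renderSegment([
  { text: "[_BEG_]", p: 0.99, offsets: { from: 0, to: 0 } },
  { text: " Install", p: 0.97, offsets: { from: 0, to: 400 } },
  { text: " with", p: 0.95, offsets: { from: 400, to: 600 } },
  { text: " pnpm", p: 0.31, offsets: { from: 600, to: 1000 } },
]);
console.log(rendered); // Install with ⟨pnpm p=0.31⟩
```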
For visual data (Instagram images, video frames): media is downloaded to a local folder and absolute file paths are returned so the LLM can read them directly with vision. Frame extraction has two modes: bulk (extract_video_frames at configurable FPS) and precision (get_video_frames_at — one JPG per timestamp, for targeted verification of transcription-uncertain moments).
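The two frame-extraction modes map onto two different ffmpeg invocations. The sketch below only builds argument lists; the helper names and output file naming are hypothetical, while `-vf fps=N`, `-ss`, and `-frames:v 1` are standard ffmpeg options.

```typescript
// Bulk mode: sample the whole video at a fixed rate.
function bulkFrameArgs(input: string, outDir: string, fps: number): string[] {
  return ["-i", input, "-vf", `fps=${fps}`, `${outDir}/frame_%04d.jpg`];
}

// Precision mode: seek to one timestamp and write a single frame.
function frameAtArgs(input: string, outDir: string, timestampS: number): string[] {
  const t = timestampS.toFixed(3);
  return ["-ss", t, "-i", input, "-frames:v", "1", `${outDir}/frame_${t}s.jpg`];
}

console.log(["ffmpeg", ...bulkFrameArgs("video.mp4", "frames", 1)].join(" "));
console.log(["ffmpeg", ...frameAtArgs("video.mp4", "frames", 12.5)].join(" "));
```

Placing `-ss` before `-i` asks ffmpeg to seek in the input rather than decode from the start, which suits grabbing a handful of isolated frames.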
Pipeline
URL ──► Detect platform
         │
         ├── Twitter ──► TwitterAPI.io REST ──► structured text
         │                    │
         │              has video? ──► cache ──► ffmpeg ──► whisper-cli -ojf
         │                                                       │
         │                                    transcript + confidence markers
         │
         ├── YouTube ──► try captions (instant)
         │                    │
         │              no captions? ──► yt-dlp ──► ffmpeg ──► whisper-cli -ojf
         │
         ├── Instagram ──► Cobalt API ──► download to cache
         │                    │
         │              has video? ──► ffmpeg ──► whisper-cli -ojf
         │
         ├── Video URL ──► cache ──► ffmpeg -vf fps=N ──► frame JPGs
         │
         └── Video URL + timestamps[] ──► cache ──► ffmpeg -ss each ──► one JPG per timestamp
              (for targeted verification when transcription uncertainty demands it)
Transcription always includes per-token confidence and demonstrative-phrase scans. The LLM routes to frame extraction when those signals say it's needed.
All transcription is local. All temp files are cleaned up. Downloaded videos live in a shared cache (~/.media-mcp/cache/videos/) for 24h so follow-up calls on the same URL don't re-download. The LLM gets structured text or file paths — never raw API JSON.
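The "Detect platform" step at the top of the pipeline might look something like this. The hostname list and the `Platform` type are illustrative assumptions; the real server may recognize more URL forms.

```typescript
type Platform = "twitter" | "youtube" | "instagram" | "video";

// Classify a URL by hostname; anything unrecognized is treated as a generic video URL.
function detectPlatform(rawUrl: string): Platform {
  const host = new URL(rawUrl).hostname.replace(/^(www\.|m\.|mobile\.)/, "");
  if (host === "twitter.com" || host === "x.com") return "twitter";
  if (host === "youtube.com" || host === "youtu.be") return "youtube";
  if (host === "instagram.com") return "instagram";
  return "video";
}

console.log(detectPlatform("https://youtu.be/dQw4w9WgXcQ")); // youtube
```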
Design principles
Structured data, not scraping. Every tool calls a purpose-built API. No HTML parsing, no fragile selectors, no browser automation.
Local transcription only. Audio never leaves the machine. Whisper runs on local hardware.
Captions first, Whisper second. Don't burn compute when the platform already did the work.
One tool, one job. No multi-purpose tools with mode flags. Each tool does exactly one thing.
File paths for visual content. Return absolute paths so the LLM can see images directly.
Ears always, eyes only when ears fail. Transcription is cheap; vision tokens are expensive. The LLM sees frames only at timestamps where Whisper admits it was unsure, or where the speaker is explicitly referencing something on screen. Not at 1 fps. Not as keyframes. Exactly where accuracy actually needs it.
No OCR layer. Claude's vision reads the frames directly. One model doing all multimodal reasoning beats a two-model seam where OCR and vision compete.
See SKILL.md for the full pipeline details, tool reference, and anti-patterns.
Get started
git clone https://github.com/woosal1337/media-mcp.git
cd media-mcp
npm install && npm run build
Download the Whisper model:
mkdir -p models
curl -L -o models/ggml-base.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.bin
Create .env:
cp .env.example .env
# Edit with your keys:
# TWITTER_API_KEY=your_twitterapi_io_key
# WHISPER_MODEL_PATH=/absolute/path/to/models/ggml-base.bin
# COBALT_API_URL=http://localhost:9000 (optional, for Instagram)
# COBALT_API_KEY=your_cobalt_key (optional)
# CLOUDFLARE_ACCOUNT_ID=your_account_id (optional, for fetch_markdown)
# CLOUDFLARE_API_TOKEN=your_api_token (optional, for fetch_markdown)
Prerequisites
Dependency | Required | What it does | Install |
Node.js 20+ | Yes | Runs the MCP server | |
ffmpeg | Yes | Audio extraction + frame extraction | |
whisper-cli | Yes | Local audio transcription | |
yt-dlp | Yes | Video downloads from YouTube + others | |
TwitterAPI.io key | Yes | Powers all Twitter/X tools | |
Cobalt instance | Optional | Instagram downloads | See Cobalt setup |
Configuration
Claude Code
Add to ~/.claude/settings.json:
{
  "mcpServers": {
    "media-mcp": {
      "command": "node",
      "args": ["/absolute/path/to/media-mcp/dist/index.js"],
      "env": {
        "TWITTER_API_KEY": "your_key",
        "WHISPER_MODEL_PATH": "/absolute/path/to/media-mcp/models/ggml-base.bin",
        "COBALT_API_URL": "http://localhost:9000",
        "COBALT_API_KEY": "your_cobalt_key",
        "CLOUDFLARE_ACCOUNT_ID": "your_account_id",
        "CLOUDFLARE_API_TOKEN": "your_api_token"
      }
    }
  }
}
Claude Desktop
Add to ~/Library/Application Support/Claude/claude_desktop_config.json (macOS) or %APPDATA%\Claude\claude_desktop_config.json (Windows) — same structure as above.
Environment variables
Variable | Required | Description |
TWITTER_API_KEY | Yes | API key from twitterapi.io |
WHISPER_MODEL_PATH | No | Path to Whisper model (defaults to …) |
COBALT_API_URL | No | URL of your Cobalt instance (required for Instagram) |
COBALT_API_KEY | No | Cobalt API key if auth is enabled |
CLOUDFLARE_ACCOUNT_ID | No | Cloudflare account ID (required for fetch_markdown) |
CLOUDFLARE_API_TOKEN | No | Cloudflare API token with Browser Rendering permission (required for fetch_markdown) |
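One way to read these settings at startup is sketched below. The variable names come from the .env example earlier; the `MediaMcpConfig` shape and `loadConfig` function are hypothetical and only illustrate the required-versus-optional split.

```typescript
// Hypothetical config shape; only TWITTER_API_KEY is marked required in the docs.
interface MediaMcpConfig {
  twitterApiKey: string;
  whisperModelPath: string | undefined;
  cobaltApiUrl: string | undefined;
  cobaltApiKey: string | undefined;
  cloudflareAccountId: string | undefined;
  cloudflareApiToken: string | undefined;
}

// Fail fast on the one required variable; leave optional ones undefined.
function loadConfig(env: Record<string, string | undefined>): MediaMcpConfig {
  const twitterApiKey = env["TWITTER_API_KEY"];
  if (!twitterApiKey) {
    throw new Error("TWITTER_API_KEY is required");
  }
  return {
    twitterApiKey,
    whisperModelPath: env["WHISPER_MODEL_PATH"],
    cobaltApiUrl: env["COBALT_API_URL"],
    cobaltApiKey: env["COBALT_API_KEY"],
    cloudflareAccountId: env["CLOUDFLARE_ACCOUNT_ID"],
    cloudflareApiToken: env["CLOUDFLARE_API_TOKEN"],
  };
}

const exampleConfig = loadConfig({
  TWITTER_API_KEY: "your_key",
  COBALT_API_URL: "http://localhost:9000",
});
console.log(exampleConfig.cobaltApiUrl);
```

In a real process the argument would typically be `process.env`.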
Tools
Twitter/X — 26 tools
Fetching tweets
Tool | Action | What it does |
| Fetch + Transcribe | Fetches tweet by URL with text, author, metrics, media, threads, articles. Transcribes video audio via Whisper. |
| Fetch | Recent tweets from a user (paginated, 20/page) |
| Search | Advanced search with operators ( |
| Fetch | Replies to a tweet (paginated, 20/page) |
| Fetch + Sort | Replies with sorting: Relevance, Latest, or Likes |
| Fetch | Quote tweets of a tweet (paginated, 20/page) |
| Fetch | Users who retweeted a tweet (paginated, 100/page) |
| Fetch | Tweets from a Twitter list |
| Fetch | Tweets from a Twitter community |
| Fetch | Trending topics (worldwide or by WOEID location) |
Fetching profiles
Tool | Action | What it does |
| Fetch | User bio, follower counts, verification, location, website |
| Fetch | Extended profile info beyond the basic profile |
| Fetch | Followers of a user (paginated, 200/page) |
| Fetch | Accounts a user follows (paginated, 200/page) |
| Fetch | Tweets mentioning a user (paginated, 20/page) |
| Fetch | Verified (blue check) followers (paginated, 20/page) |
| Search | Search users by keyword |
| Check | Whether user A follows user B and vice versa |
| Fetch | Twitter Space metadata (title, host, speakers, state) |
Real-time monitoring
Tool | Action | What it does |
| Start | Begin real-time monitoring of a user's tweets |
| List | All currently monitored users |
| Stop | Stop monitoring a user |
| Create | Add a keyword filter rule for monitoring |
| List | All active filter rules |
| Delete | Remove a filter rule |
YouTube — 1 tool
Tool | Action | What it does |
| Fetch + Transcribe | Gets video transcript. Tries captions first (instant). Falls back to yt-dlp + ffmpeg + Whisper if no captions. |
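The captions-first fallback can be expressed as a small piece of control flow. In this sketch the two fetchers are injected as parameters, so nothing about the real caption or Whisper code is assumed; treating a failed caption fetch the same as "no captions" is my assumption.

```typescript
type TranscriptSource = "captions" | "whisper";

interface TranscriptResult {
  source: TranscriptSource;
  text: string;
}

// Try the cheap path first; only run local transcription when it yields nothing.
async function transcribeYouTube(
  url: string,
  fetchCaptions: (url: string) => Promise<string | null>,
  runWhisper: (url: string) => Promise<string>,
): Promise<TranscriptResult> {
  // Assumption: a caption fetch error is treated like missing captions.
  const captions = await fetchCaptions(url).catch(() => null);
  if (captions !== null && captions.trim().length > 0) {
    return { source: "captions", text: captions };
  }
  return { source: "whisper", text: await runWhisper(url) };
}

// Usage with stub fetchers: no captions available, so Whisper is used.
transcribeYouTube(
  "https://youtu.be/dQw4w9WgXcQ",
  async () => null,
  async () => "transcribed locally",
).then((result) => console.log(result.source));
```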
Instagram — 1 tool
Tool | Action | What it does |
| Download + Transcribe | Downloads all media (images, videos, carousels) to local folder via Cobalt. Transcribes video audio with Whisper. Returns local file paths. |
Cloudflare — 1 tool
Tool | Action | What it does |
fetch_markdown | Extract | Extracts clean markdown from any webpage using Cloudflare Browser Rendering. Works on JS-heavy pages, SPAs, and sites where simple fetch fails. |
Video — 2 tools
Tool | Action | What it does |
extract_video_frames | Download + Extract | Downloads video from any URL, extracts frames at configurable FPS via ffmpeg. Supports time ranges. Returns local frame paths. Cache-aware. |
get_video_frames_at | Precision Extract | Grabs one JPG per specified timestamp. Pairs with the transcription tools — when the transcript flags uncertainty zones or demonstrative phrases, pass their
How transcription works
video → cache → ffmpeg -ar 16000 -ac 1 → audio.wav → whisper-cli -ojf → audio.wav.json
                                                                            │
                                                                            ▼
                                                          parse per-token probabilities
                                                                            │
                                                                            ▼
                                                      transcript with ⟨token p=0.XX⟩ markers
                                                      + Uncertainty zones summary (midpoint_s each)
                                                      + Demonstrative phrases block (midpoint_s each)
Video is downloaded to ~/.media-mcp/cache/videos/<sha256>.mp4 (reused if present, <24h old)
ffmpeg extracts audio as 16kHz mono WAV
whisper-cli transcribes locally with -ojf (output-json-full); the JSON includes per-token p values
Tokens below p=0.5 are merged into contiguous spans (≤150ms gap) and reported as uncertainty zones
The segment text is scanned for demonstrative phrases that typically reference on-screen content
The LLM receives segment-level transcript + uncertainty zones + demonstrative hits, and decides whether to call get_video_frames_at with the relevant timestamps
For YouTube, captions are tried first (instant, already timestamped). Whisper is the fallback. All transcription happens locally — no audio is sent to external services.
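The zone-merging step can be sketched as a single pass over time-ordered tokens. The thresholds (p below 0.5, gaps of at most 150 ms) come from the description above; the type names, field names, and the choice to measure the gap between consecutive low-confidence tokens are my assumptions.

```typescript
interface TimedToken {
  text: string;
  p: number; // per-token probability
  fromMs: number;
  toMs: number;
}

interface UncertaintyZone {
  startMs: number;
  endMs: number;
  midpointS: number; // seconds, suitable as a frame timestamp
  text: string;
  minP: number;
}

// Merge low-confidence tokens into zones when they are at most maxGapMs apart.
// Assumes tokens are sorted by start time.
function findUncertaintyZones(
  tokens: TimedToken[],
  threshold = 0.5,
  maxGapMs = 150,
): UncertaintyZone[] {
  const zones: UncertaintyZone[] = [];
  for (const tok of tokens) {
    if (tok.p >= threshold) continue; // confident token, not part of any zone
    const last = zones[zones.length - 1];
    if (last !== undefined && tok.fromMs - last.endMs <= maxGapMs) {
      // Close enough to the previous zone: extend it.
      last.endMs = Math.max(last.endMs, tok.toMs);
      last.text += tok.text;
      last.minP = Math.min(last.minP, tok.p);
      last.midpointS = (last.startMs + last.endMs) / 2000;
    } else {
      zones.push({
        startMs: tok.fromMs,
        endMs: tok.toMs,
        midpointS: (tok.fromMs + tok.toMs) / 2000,
        text: tok.text,
        minP: tok.p,
      });
    }
  }
  return zones;
}

const exampleZones = findUncertaintyZones([
  { text: " run", p: 0.9, fromMs: 0, toMs: 300 },
  { text: " pnpm", p: 0.3, fromMs: 300, toMs: 700 },
  { text: " dlx", p: 0.4, fromMs: 750, toMs: 1000 },
  { text: " today", p: 0.95, fromMs: 1000, toMs: 1400 },
  { text: " maybe", p: 0.2, fromMs: 2000, toMs: 2400 },
]);
console.log(exampleZones.map((z) => z.midpointS)); // [ 0.65, 2.2 ]
```

In this example the first two low-confidence tokens sit 50 ms apart and merge into one zone, while the last one is a full second later and starts its own.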
Cobalt setup
Cobalt is an open-source media downloader supporting 21 platforms. media-mcp uses it for Instagram. You need your own instance — the public API requires JWT auth that doesn't work server-to-server.
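For orientation, a request to a self-hosted instance might be assembled as below. This reflects my understanding of the Cobalt API (a JSON POST to the instance root, with an optional `Api-Key` authorization header) and should be checked against the Cobalt documentation; the helper name and return shape are hypothetical, and nothing is sent over the network here.

```typescript
interface CobaltRequest {
  endpoint: string;
  init: {
    method: "POST";
    headers: Record<string, string>;
    body: string;
  };
}

// Build (but do not send) a download request for a self-hosted Cobalt instance.
function buildCobaltRequest(apiUrl: string, mediaUrl: string, apiKey?: string): CobaltRequest {
  const headers: Record<string, string> = {
    Accept: "application/json",
    "Content-Type": "application/json",
  };
  if (apiKey) {
    // Assumed header format when API key auth is enabled on the instance.
    headers["Authorization"] = `Api-Key ${apiKey}`;
  }
  return {
    endpoint: apiUrl,
    init: { method: "POST", headers, body: JSON.stringify({ url: mediaUrl }) },
  };
}

const cobaltReq = buildCobaltRequest(
  "http://localhost:9000/",
  "https://www.instagram.com/reel/abc/",
  "your-uuid",
);
console.log(cobaltReq.init.headers);
// To send it: await fetch(cobaltReq.endpoint, cobaltReq.init)
```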
Docker (recommended)
# docker-compose.yml
services:
  cobalt:
    image: ghcr.io/imputnet/cobalt:11
    init: true
    read_only: true
    restart: unless-stopped
    ports:
      - 9000:9000/tcp
    environment:
      API_URL: "http://localhost:9000/"
    labels:
      - com.centurylinklabs.watchtower.scope=cobalt
  watchtower:
    image: ghcr.io/containrrr/watchtower
    restart: unless-stopped
    command: --cleanup --scope cobalt --interval 900 --include-restarting
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
docker compose up -d
curl http://localhost:9000/ # verify
Adding API key auth
node -e "console.log(crypto.randomUUID())" # generate key
Create keys.json:
{
  "your-uuid": {
    "name": "media-mcp",
    "limit": "unlimited",
    "allowedServices": "all"
  }
}
Add to cobalt environment:
environment:
  API_KEY_URL: "file:///keys.json"
  API_AUTH_REQUIRED: 1
volumes:
  - ./keys.json:/keys.json:ro
Adding cookies (for private content)
Create cookies.json with your Instagram sessionid, mount as /cookies.json, and set COOKIE_PATH: "/cookies.json" in environment.
Production hardening
environment:
  CORS_WILDCARD: 0
  CORS_URL: "http://localhost"
  RATELIMIT_WINDOW: 60
  RATELIMIT_MAX: 100
  DURATION_LIMIT: 10800
Supported platforms
Cobalt supports 21 platforms. Currently media-mcp uses it only for Instagram. Future versions will add more of the remaining platforms: YouTube, TikTok, Twitter/X, Reddit, Facebook, Pinterest, Snapchat, Bluesky, Twitch, Vimeo, SoundCloud, Dailymotion, Tumblr, Bilibili, Loom, Streamable, Rutube, Newgrounds, OK.ru, VK.
One-command setup
Copy the contents of PROMPT.md and paste it into Claude Code. It will install all prerequisites, clone the repo, configure everything, and connect media-mcp automatically.
Development
npm run dev # watch mode (recompiles on change)
npm run build # one-time build
npm start # run the server
License
MIT