vidsight
Allows ingestion of YouTube video URLs, enabling the MCP server to download and process YouTube videos for AI agent queries.
Status: pre-release / in active development. The core pipeline, SQLite timeline, local provider stack, CLI, and MCP server are working today. Distribution is GitHub-only for now (not yet on PyPI). See docs/DESIGN.md for the full architecture and roadmap.
Why Vidsight
Vidsight is built around a durable timeline instead of a one-off transcript dump:
More than transcripts. Speech (ASR), selected frame captions, and on-screen text (OCR) are aligned on the same timeline, so an agent can retrieve what was said, shown, or written.
Local-first. The core store and query path are local. URL ingest uses yt-dlp, and model weights may download on first use when you install the optional local stack.
Lean by default. The default install is vidsight[mcp]. Heavy ASR/OCR/VLM dependencies are opt-in with --local or vidsight[local,mcp].
Input can be a local file or a URL. URLs require the local extra because they use yt-dlp.
Process once into a durable store, then search, zoom into time ranges, or assemble cited context.
Install
Option A — agent installer (recommended)
curl -fsSL https://raw.githubusercontent.com/szepix/vidsight/main/install.sh | bash
This is safe for an agent to run. The default install is lightweight: it installs vidsight[mcp],
creates a self-contained environment under ~/.vidsight/, and registers the MCP server for detected
agent environments. Supported targets today:
Claude Code
Codex
OpenCode
Installer source: install.sh.
Manual target selection:
curl -fsSL https://raw.githubusercontent.com/szepix/vidsight/main/install.sh | bash -s -- --target claude --target codex
Install the local provider stack (ASR/OCR/VLM) up front:
curl -fsSL https://raw.githubusercontent.com/szepix/vidsight/main/install.sh | bash -s -- --local
From a local clone, the same options work:
git clone https://github.com/szepix/vidsight && cd vidsight
bash install.sh --target claude --local
Runtime files live under ~/.vidsight/: a dedicated venv/, model cache/, downloaded video
data/, and the vidsight.db timeline. Agent integrations are written to the relevant config
directories. For Claude Code that includes the vidsight skill and /watch slash command; for
Codex/OpenCode it includes MCP config and a matching skill.
Restart/reconnect your agent after install.
Install size & requirements
The default installer is intentionally small. Use --local only on machines that should process video
themselves.
Mode | What gets installed | Typical first install | Disk before videos/model cache | Runtime expectations |
default | Vidsight CLI + MCP server | about 1-3 minutes | hundreds of MB | light CPU/RAM; can query existing timelines |
--local | default + local provider stack (ASR/OCR/VLM) | about 5-20 minutes | about 1.1 GB on a clean macOS test before model weights | CPU works; GPU/MPS helps vision; 8 GB RAM minimum, 16 GB recommended |
First video ingest can take longer than installation because it may download ASR/VLM model weights and
transcribe the whole video. Disk usage then includes the downloaded source video, extracted audio, the
SQLite timeline, and model caches. Speech-only ingest is the cheapest path; VLM captioning is the heaviest
stage. The default local ingest favors complete context; use --scan speech for a fast transcript-only pass.
Uninstall removes all of the above, including the venv/cache/data/db folder:
curl -fsSL https://raw.githubusercontent.com/szepix/vidsight/main/uninstall.sh | bash
Uninstaller source: uninstall.sh.
To only unregister integrations while keeping ~/.vidsight, add --keep-data.
Option B — manual (any MCP client / CLI use)
git clone https://github.com/szepix/vidsight && cd vidsight
pip install ".[local,mcp]" # or ".[mcp]" for a tiny install without the local model stack
Extras:
Extra | Pulls in |
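local | local provider stack: faster-whisper (ASR), moondream2 via transformers (captions), RapidOCR (OCR), and yt-dlp for URL ingest |
mcp | the MCP server (FastMCP) |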
(mlx Apple acceleration and cloud providers are on the roadmap, not yet shipped.)
System requirement: ffmpeg (brew install ffmpeg · apt install ffmpeg · winget install ffmpeg).
Python: 3.11+.
Windows note: the shell installer is for macOS/Linux/WSL. Native Windows should use the manual
pip install path and the MCP config shown below until a .ps1 installer exists.
Use With Your Agent
After installing with --local, just talk to the agent (the bundled skill triggers on video questions):
"What is this video about? https://www.youtube.com/watch?v=…"
"Ingest ~/talks/keynote.mp4, then tell me every moment they show a code example and what the code says."
Or use the slash command in Claude Code: /watch <youtube-url | path | vid_id> [question].
With the lightweight default install, the MCP server can start and query an existing timeline, but ingesting
new videos needs a provider such as faster-whisper. Install the local stack with --local when you want
the agent to process videos on this machine.
When local providers are installed, agents should prefer the high-level watch tool/CLI command.
It ingests the video, collects metadata, stage status, segment counts, transcript text, timeline
samples, and cited retrieval context in one response. Lower-level tools remain available for
diagnostics and follow-up searches.
A plain ingest uses the auto scan preset. auto means a full scan for normal profiles, but with
profile="tiny" it resolves to screen so it does not silently claim VLM while the profile disables
visual captions. Agents should choose a cheaper scan when the request is narrow:
Scan preset | Runs | Best for |
full | ASR + OCR + VLM | broad summaries, "what happens", reading visible text, mixed questions |
screen | ASR + OCR | talks, slides, demos, trailers with important on-screen text |
speech | ASR only | fast transcript-only summaries or long videos where the user wants a cheap first pass |
For long videos, batches, or unclear cost expectations, the agent should ask before running auto/full.
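The preset resolution described above can be sketched in a few lines of Python. This is an illustration only: `resolve_scan`, `stages_for`, and `SCAN_STAGES` are hypothetical names, not Vidsight's API, and the stage sets per preset are assumed from the table (full = ASR + OCR + VLM, screen = ASR + OCR, speech = ASR only).

```python
# Hypothetical helper: names and signatures are illustrative, not Vidsight's API.
SCAN_STAGES = {
    "full": ("asr", "ocr", "vlm"),
    "screen": ("asr", "ocr"),
    "speech": ("asr",),
}


def resolve_scan(scan: str = "auto", profile: str = "balanced") -> str:
    """Map a requested scan preset to a concrete one."""
    if scan == "auto":
        # The tiny profile disables visual captions, so auto must not claim VLM.
        return "screen" if profile == "tiny" else "full"
    if scan not in SCAN_STAGES:
        raise ValueError(f"unknown scan preset: {scan!r}")
    return scan


def stages_for(scan: str = "auto", profile: str = "balanced") -> tuple[str, ...]:
    """Return the stages that the resolved preset would run."""
    return SCAN_STAGES[resolve_scan(scan, profile)]
```

An agent following the guidance above would pass an explicit `speech` or `screen` preset for narrow requests and fall back to `auto` only when broad context is wanted.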
Optional dependencies. If an optional visual capability needs a package that is not installed, that
stage is skipped and reported in ingest warnings. The agent can then, with your consent, call
install_dependency(...) to install into the server's own venv. For a clean local setup, prefer reinstalling
with --local.
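One plausible way to implement the skip-and-report behavior is to check whether a stage's module is importable before running it. The sketch below is an assumption about the approach, not Vidsight's actual code; `plan_stages` and the module names in `STAGE_MODULES` are hypothetical.

```python
import importlib.util

# Hypothetical mapping of optional stages to the top-level module each imports.
# The module names are assumptions for illustration.
STAGE_MODULES = {"vlm": "transformers", "ocr": "rapidocr_onnxruntime"}


def plan_stages(requested, stage_modules=STAGE_MODULES):
    """Split requested stages into runnable ones and warnings for skipped ones."""
    runnable, warnings = [], []
    for stage in requested:
        module = stage_modules.get(stage)
        # find_spec returns None when a top-level module is not installed.
        if module is not None and importlib.util.find_spec(module) is None:
            warnings.append(
                f"{stage}: skipped, missing optional dependency '{module}'"
            )
            continue
        runnable.append(stage)
    return runnable, warnings
```

The warnings list is what an agent would surface to the user before asking for consent to install the missing package.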
To register the MCP server manually in another client, use the following config (the default database is ~/.vidsight/vidsight.db):
{
  "mcpServers": {
    "vidsight": {
      "command": "vidsight",
      "args": ["mcp"]
    }
  }
}
How It Works
Vidsight turns one video source into one searchable timeline:
Stage | Output |
Resolve + probe | Local file, metadata, stable content-based video id |
Scene detection | Time ranges for visual sampling |
Audio extraction → ASR | speech segments |
Frame sampling → VLM | caption segments |
Frame sampling → OCR | ocr segments |
Embedding + storage | SQLite timeline with baseline text search |
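Because the video id is derived from a sha256 hash of the file, the same content yields the same id regardless of filename or location. A minimal sketch of that idea follows; the function name and the 16-character truncation are assumptions made for illustration, not Vidsight's actual id format.

```python
import hashlib
from pathlib import Path


def content_video_id(path, chunk_size=1 << 20, length=16):
    """Hash a file in chunks and return a short hex id (truncation is hypothetical)."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        # Read in fixed-size chunks so large videos are never loaded whole.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()[:length]
```

A content-based id is what lets a re-ingest of a renamed or re-downloaded copy map onto the existing timeline instead of creating a duplicate.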
Ingest: resolve a local path or URL (yt-dlp) to a normalized file; hash it (sha256, so the video id is content-based).
Probe & scene-detect: read duration/fps; find shot boundaries with ffmpeg-backed sampling.
Scene-aware sampling: caption + OCR run on a representative frame per scene (not every second). This keeps even a slow CPU vision model cheap, so it works on modest hardware.
Extract: audio goes to ASR; sampled frames go to the vision model (caption) and OCR (on-screen text).
Fuse: every signal becomes a time-stamped row in one SQLite timeline (kind = speech | caption | ocr | audio_event), with dependency-free text vectors for baseline search. Provenance is recorded.
Query: your agent searches, zooms into a time window, or asks a question.
Processing is idempotent and resumable: re-ingesting the same video skips finished stages, and an optional stage (VLM/OCR) that fails a dependency check is skipped, reported, and retried on the next ingest. Such a failure is never fatal.
Vidsight retrieves and grounds; your agent's own model does the reasoning. No chat LLM is bundled.
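The idempotent, resumable behavior can be pictured as a per-stage status map that is consulted before each stage runs. The sketch below is a simplified model under that assumption; `run_pipeline`, the status strings, and the in-memory `state` dict are hypothetical stand-ins for whatever the real store records.

```python
def run_pipeline(video_id, stages, state, missing=()):
    """Run unfinished stages for one video.

    stages:  dict mapping stage name -> callable taking the video id
    state:   dict mapping (video_id, stage) -> "done" or "skipped"
    missing: stage names whose optional dependency check failed
    """
    warnings = []
    for name, run in stages.items():
        if state.get((video_id, name)) == "done":
            continue  # finished on an earlier ingest, so skip it
        if name in missing:
            # Not fatal: record the skip so the next ingest retries it.
            state[(video_id, name)] = "skipped"
            warnings.append(f"{name}: skipped, dependency missing")
            continue
        run(video_id)
        state[(video_id, name)] = "done"
    return warnings
```

Calling it twice on the same video reruns only the stages that were skipped the first time, which mirrors the retry-on-next-ingest behavior described above.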
Tools (MCP)
Tool | What it does |
watch | One-call agent workflow: ingest, then return video metadata, segment counts, stage states, transcript text, timeline excerpt, warnings, and cited context for the question. |
| Lower-level ingest into the timeline store; streams progress and returns |
| List processed videos and their status. |
| Baseline text-vector search across speech / captions / OCR; returns time-stamped hits. |
| Everything in a time window — "what's on screen at 2:00–2:30". |
| Plain transcript, optionally for a range. |
| Retrieval-only: returns assembled, cited context for your agent to answer from. |
| A single segment by id. |
install_dependency | Self-healing deps: install a package into the server's own venv, then reset its module cache so the new capability works on the next call without a restart. Used (with user consent) when an optional stage reports a missing dependency. |
Choose scan presets per ingest with scan, or choose exact backends with providers, e.g.
providers={"asr": "faster-whisper", "vlm": "moondream2", "ocr": "rapidocr"}.
Examples
CLI (bash / macOS / Linux / WSL)
vidsight watch "https://www.youtube.com/watch?v=..." --question "summarize this video"
Lower-level CLI:
VIDEO_ID=$(vidsight ingest ~/talks/keynote.mp4)
vidsight search "pricing slide" --video "$VIDEO_ID"
vidsight timeline "$VIDEO_ID" 120 150
vidsight transcript "$VIDEO_ID" > keynote.txt
vidsight providers # show registered backends
Fast transcript-only pass:
VIDEO_ID=$(vidsight ingest ~/talks/keynote.mp4 --scan speech)
CLI (PowerShell)
$videoId = vidsight ingest "$HOME\Videos\keynote.mp4"
vidsight search "pricing slide" --video $videoId
vidsight timeline $videoId 120 150
vidsight transcript $videoId | Out-File keynote.txt
vidsight providers
Agent prompts (once the MCP server is connected)
"Summarize this lecture and cite the timestamps for each main point."
"Find where the speaker demos the dashboard and read the text shown on screen."
"What product names appear visually in the first 5 minutes?"
Architecture
Layered: a pure Python core engine with thin CLI and MCP frontends — so the engine is testable in isolation and reusable.
core/ ingest · probe · scenes · audio · frames · pipeline · store · query · paths · config
providers/ asr · vlm · ocr · embed (swappable backends, discovered via entry points)
cli/ command-line frontend
mcp/ FastMCP server (thin wrapper over core query + ingest, runs ingest off the event loop)
The timeline is the unifying model: every extracted signal is a time-stamped segment row, so search and
retrieval are uniform regardless of which provider produced it. Heavy/optional dependencies are imported
lazily, so the base install stays tiny and import vidsight pulls in no model framework.
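As a rough illustration of the timeline model, the sketch below stores every signal as one row and answers a time-window query with a single overlap filter. The table name, column names, and sample rows are hypothetical; only the four `kind` values come from this document, and the real schema may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE segments (
        id       INTEGER PRIMARY KEY,
        video_id TEXT NOT NULL,
        kind     TEXT NOT NULL
                 CHECK (kind IN ('speech', 'caption', 'ocr', 'audio_event')),
        start_s  REAL NOT NULL,
        end_s    REAL NOT NULL,
        text     TEXT NOT NULL,
        provider TEXT  -- provenance: which backend produced the row
    )
    """
)
conn.executemany(
    "INSERT INTO segments (video_id, kind, start_s, end_s, text, provider) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("vid1", "speech", 118.0, 126.5, "Here is the pricing slide", "faster-whisper"),
        ("vid1", "ocr", 121.0, 140.0, "Pricing: Pro plan", "rapidocr"),
        ("vid1", "caption", 121.0, 140.0, "A slide with a pricing table", "moondream2"),
        ("vid1", "speech", 200.0, 205.0, "Thanks everyone", "faster-whisper"),
    ],
)


def window(conn, video_id, start_s, end_s):
    """Return every segment that overlaps [start_s, end_s], whatever its kind."""
    return conn.execute(
        "SELECT kind, start_s, end_s, text FROM segments "
        "WHERE video_id = ? AND start_s < ? AND end_s > ? "
        "ORDER BY start_s, kind",
        (video_id, end_s, start_s),
    ).fetchall()
```

Because speech, captions, and OCR share one table, a call such as `window(conn, "vid1", 120, 150)` needs no provider-specific logic to return all three.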
Compatibility & Providers
Providers are pluggable behind ABCs; third parties can ship a provider package with Python entry points
without forking the core. The local extra installs the current local model provider set.
Cloud/BYOK providers are not implemented yet. Today there is no built-in support for OpenAI,
Anthropic, Groq, or other API keys. The future cloud extra is reserved for bring-your-own-key
providers.
Step | Default (local) | Notes |
Speech (ASR) | faster-whisper | language auto-detected |
Vision (caption) | moondream2 (transformers) | chooses CUDA, Apple MPS, or CPU; weights download lazily |
On-screen text (OCR) | RapidOCR (ONNX) | |
Embeddings | built-in text vectors (dependency-free) | lexical baseline; model embedders can be added later |
Profiles: tiny (CPU, vision off) · balanced (default, vision on) · quality (larger ASR, more frames).
Test coverage currently runs on macOS in development. Linux/macOS/Windows CI is planned.
Note: on Apple Silicon the vision model runs on MPS; the MCP server is registered with
PYTORCH_ENABLE_MPS_FALLBACK=1 so any op not yet implemented on MPS falls back to CPU instead of erroring.
Roadmap
Core pipeline + SQLite timeline + local provider stack
MCP server + CLI
Agent installer for Claude Code, Codex, and OpenCode
Claude Code skill and /watch command
Native PowerShell installer for Windows
Speaker diarization (who said what)
Audio-event tagging beyond speech (music / applause / …)
cloud and mlx provider extras
Batch mode for many videos
Timeline browser (web UI)
Contributing
Issues and PRs welcome. Adding a provider (a new ASR/vision/OCR/embedding backend) is the easiest first
contribution — implement the interface and register an entry point; no core fork needed. See
docs/DESIGN.md.
License
Source-available under the PolyForm Noncommercial License 1.0.0. Noncommercial use, modification, and sharing are allowed with preserved author credit. Commercial use is not granted by the repo license; it requires a separate paid written license or agreement with Daniel Szepietowski (szepix) that grants consent and defines compensation.