Skip to main content
Glama

Status: pre-release / in active development. The core pipeline, SQLite timeline, local provider stack, CLI, and MCP server are working today. Distribution is GitHub-only for now (not yet on PyPI). See docs/DESIGN.md for the full architecture and roadmap.


Why Vidsight

Vidsight is built around a durable timeline instead of a one-off transcript dump:

  • More than transcripts. Speech (ASR), selected frame captions, and on-screen text (OCR) are aligned on the same timeline, so an agent can retrieve what was said, shown, or written.

  • Local-first. The core store and query path are local. URL ingest uses yt-dlp, and model weights may download on first use when you install the optional local stack.

  • Lean by default. The default install is vidsight[mcp]. Heavy ASR/OCR/VLM dependencies are opt-in with --local or vidsight[local,mcp].

Input can be a local file or a URL. URLs require the local extra because they use yt-dlp. Process once into a durable store, then search, zoom into time ranges, or assemble cited context.

Related MCP server: YouTube Transcript MCP Server

Install

curl -fsSL https://raw.githubusercontent.com/szepix/vidsight/main/install.sh | bash

This is safe for an agent to run. The default install is lightweight: it installs vidsight[mcp], creates a self-contained environment under ~/.vidsight/, and registers the MCP server for detected agent environments. Supported targets today:

  • Claude Code

  • Codex

  • OpenCode

Installer source: install.sh.

Manual target selection:

curl -fsSL https://raw.githubusercontent.com/szepix/vidsight/main/install.sh | bash -s -- --target claude --target codex

Install the local provider stack (ASR/OCR/VLM) up front:

curl -fsSL https://raw.githubusercontent.com/szepix/vidsight/main/install.sh | bash -s -- --local

From a local clone, the same options work:

git clone https://github.com/szepix/vidsight && cd vidsight
bash install.sh --target claude --local

Runtime files live under ~/.vidsight/: a dedicated venv/, model cache/, downloaded video data/, and the vidsight.db timeline. Agent integrations are written to the relevant config directories. For Claude Code that includes the vidsight skill and /watch slash command; for Codex/OpenCode it includes MCP config and a matching skill.

Restart/reconnect your agent after install.

Install size & requirements

The default installer is intentionally small. Use --local only on machines that should process video themselves.

Mode

What gets installed

Typical first install

Disk before videos/model cache

Runtime expectations

default

Vidsight CLI + MCP server

about 1-3 minutes

hundreds of MB

light CPU/RAM; can query existing timelines

--local

default + yt-dlp, ASR, OCR, VLM stack, Torch

about 5-20 minutes

about 1.1 GB on a clean macOS test before model weights

CPU works; GPU/MPS helps vision; 8 GB RAM minimum, 16 GB recommended

First video ingest can take longer than installation because it may download ASR/VLM model weights and transcribe the whole video. Disk usage then includes the downloaded source video, extracted audio, the SQLite timeline, and model caches. Speech-only ingest is the cheapest path; VLM captioning is the heaviest stage. The default local ingest favors complete context; use --scan speech for a fast transcript-only pass.

Uninstall removes all of the above, including the venv/cache/data/db folder:

curl -fsSL https://raw.githubusercontent.com/szepix/vidsight/main/uninstall.sh | bash

Uninstaller source: uninstall.sh. To only unregister integrations while keeping ~/.vidsight, add --keep-data.

Option B — manual (any MCP client / CLI use)

git clone https://github.com/szepix/vidsight && cd vidsight
pip install ".[local,mcp]"     # or ".[mcp]" for a tiny install without the local model stack

Extras:

Extra

Pulls in

local

local provider stack: yt-dlp, faster-whisper, transformers (<5), torch, accelerate, Pillow, rapidocr-onnxruntime

mcp

the MCP server (mcp[cli])

dev

pytest, ruff, mypy

(mlx Apple acceleration and cloud providers are on the roadmap, not yet shipped.)

System requirement: ffmpeg (brew install ffmpeg · apt install ffmpeg · winget install ffmpeg). Python: 3.11+.

Windows note: the shell installer is for macOS/Linux/WSL. Native Windows should use the manual pip install path and the MCP config shown below until a .ps1 installer exists.

Use With Your Agent

After installing with --local, just talk to the agent (the bundled skill triggers on video questions):

"What is this video about? https://www.youtube.com/watch?v=…" "Ingest ~/talks/keynote.mp4, then tell me every moment they show a code example and what the code says."

Or use the slash command in Claude Code: /watch <youtube-url | path | vid_id> [question].

With the lightweight default install, the MCP server can start and query an existing timeline, but ingesting new videos needs a provider such as faster-whisper. Install the local stack with --local when you want the agent to process videos on this machine.

When local providers are installed, agents should prefer the high-level watch tool/CLI command. It ingests the video, collects metadata, stage status, segment counts, transcript text, timeline samples, and cited retrieval context in one response. Lower-level tools remain available for diagnostics and follow-up searches.

A plain ingest uses the auto scan preset. auto means a full scan for normal profiles, but with profile="tiny" it resolves to screen so it does not silently claim VLM while the profile disables visual captions. Agents should choose a cheaper scan when the request is narrow:

Scan preset

Runs

Best for

auto / full

ASR + OCR + VLM

broad summaries, "what happens", reading visible text, mixed questions

screen

ASR + OCR

talks, slides, demos, trailers with important on-screen text

speech

ASR only

fast transcript-only summaries or long videos where the user wants a cheap first pass

For long videos, batches, or unclear cost expectations, the agent should ask before running auto/full.

Optional dependencies. If an optional visual capability needs a package that is not installed, that stage is skipped and reported in ingest warnings. The agent can then, with your consent, call install_dependency(...) to install into the server's own venv. For a clean local setup, prefer reinstalling with --local.

For a manual MCP registration in another client, the default database is ~/.vidsight/vidsight.db:

{
  "mcpServers": {
    "vidsight": {
      "command": "vidsight",
      "args": ["mcp"]
    }
  }
}

How It Works

Vidsight turns one video source into one searchable timeline:

Stage

Output

Resolve + probe

Local file, metadata, stable vid_... id

Scene detection

Time ranges for visual sampling

Audio extraction → ASR

speech segments

Frame sampling → VLM

caption segments

Frame sampling → OCR

ocr segments

Embedding + storage

SQLite timeline with baseline text search

  1. Ingest — resolve a local path or URL (yt-dlp) to a normalized file; hash it (sha256, so the video id is content-based).

  2. Probe & scene-detect — read duration/fps; find shot boundaries with ffmpeg-backed sampling.

  3. Scene-aware sampling — caption + OCR run on a representative frame per scene (not every second). This keeps even a slow CPU vision model cheap, so it works on modest hardware.

  4. Extract — audio goes to ASR; sampled frames go to the vision model (caption) and OCR (on-screen text).

  5. Fuse — every signal becomes a time-stamped row in one SQLite timeline (kind = speech | caption | ocr | audio_event), with dependency-free text vectors for baseline search. Provenance is recorded.

  6. Query — your agent searches, zooms into a time window, or asks a question. Processing is idempotent and resumable: re-ingesting the same video skips finished stages, and an optional stage (VLM/OCR) that fails a dependency check is skipped, reported, and retried on the next ingest — never fatal.

Vidsight retrieves and grounds; your agent's own model does the reasoning. No chat LLM is bundled.

Tools (MCP)

Tool

What it does

watch(source, question?, profile?, scan?, providers?, k?)

One-call agent workflow: ingest, then return video metadata, segment counts, stage states, transcript text, timeline excerpt, warnings, and cited context for the question.

ingest(source, profile?, scan?, providers?)

Lower-level ingest into the timeline store; streams progress and returns {video_id, status, warnings}. scan="auto" chooses the local preset from the profile unless providers is supplied.

list_videos()

List processed videos and their status.

search(query, video_id?, kinds?, k?)

Baseline text-vector search across speech / captions / OCR; returns time-stamped hits.

get_timeline(video_id, t_start, t_end)

Everything in a time window — "what's on screen at 2:00–2:30".

get_transcript(video_id, t_start?, t_end?)

Plain transcript, optionally for a range.

ask(question, video_id?, kinds?, k?)

Retrieval-only: returns assembled, cited context for your agent to answer from.

get_segment(segment_id)

A single segment by id.

install_dependency(package)

Self-healing deps: install a package into the server's own venv, then reset its module cache so the new capability works on the next call — no restart. Used (with user consent) when an optional stage reports a missing dependency.

Choose scan presets per ingest with scan, or choose exact backends with providers, e.g. providers={"asr": "faster-whisper", "vlm": "moondream2", "ocr": "rapidocr"}.

Examples

CLI (bash / macOS / Linux / WSL)

vidsight watch "https://www.youtube.com/watch?v=..." --question "summarize this video"

Lower-level CLI:

VIDEO_ID=$(vidsight ingest ~/talks/keynote.mp4)
vidsight search "pricing slide" --video "$VIDEO_ID"
vidsight timeline "$VIDEO_ID" 120 150
vidsight transcript "$VIDEO_ID" > keynote.txt
vidsight providers                             # show registered backends

Fast transcript-only pass:

VIDEO_ID=$(vidsight ingest ~/talks/keynote.mp4 --scan speech)

CLI (PowerShell)

$videoId = vidsight ingest "$HOME\Videos\keynote.mp4"
vidsight search "pricing slide" --video $videoId
vidsight timeline $videoId 120 150
vidsight transcript $videoId | Out-File keynote.txt
vidsight providers

Agent prompts (once the MCP server is connected)

  • "Summarize this lecture and cite the timestamps for each main point."

  • "Find where the speaker demos the dashboard and read the text shown on screen."

  • "What product names appear visually in the first 5 minutes?"

Architecture

Layered: a pure Python core engine with thin CLI and MCP frontends — so the engine is testable in isolation and reusable.

core/      ingest · probe · scenes · audio · frames · pipeline · store · query · paths · config
providers/ asr · vlm · ocr · embed     (swappable backends, discovered via entry points)
cli/       command-line frontend
mcp/       FastMCP server (thin wrapper over core query + ingest, runs ingest off the event loop)

The timeline is the unifying model: every extracted signal is a time-stamped segment row, so search and retrieval are uniform regardless of which provider produced it. Heavy/optional dependencies are imported lazily, so the base install stays tiny and import vidsight pulls in no model framework.

Compatibility & Providers

Providers are pluggable behind ABCs; third parties can ship a provider package with Python entry points without forking the core. The local extra installs the current local model provider set.

Cloud/BYOK providers are not implemented yet. Today there is no built-in support for OpenAI, Anthropic, Groq, or other API keys. The future cloud extra is reserved for bring-your-own-key providers.

Step

Default (local)

Notes

Speech (ASR)

faster-whisper base (CTranslate2)

language auto-detected

Vision (caption)

moondream2 (transformers)

chooses CUDA, Apple MPS, or CPU; weights download lazily

On-screen text (OCR)

RapidOCR (ONNX)

Embeddings

SimpleTextEmbedding (dependency-free)

lexical baseline; model embedders can be added later

Profiles: tiny (CPU, vision off) · balanced (default, vision on) · quality (larger ASR, more frames). Test coverage currently runs on macOS in development. Linux/macOS/Windows CI is planned.

Note: on Apple Silicon the vision model runs on MPS; the MCP server is registered with PYTORCH_ENABLE_MPS_FALLBACK=1 so any op not yet implemented on MPS falls back to CPU instead of erroring.

Roadmap

  • Core pipeline + SQLite timeline + local provider stack

  • MCP server + CLI

  • Agent installer for Claude Code, Codex, and OpenCode

  • Claude Code skill and /watch command

  • Native PowerShell installer for Windows

  • Speaker diarization (who said what)

  • Audio-event tagging beyond speech (music / applause / …)

  • cloud and mlx provider extras

  • Batch mode for many videos

  • Timeline browser (web UI)

Contributing

Issues and PRs welcome. Adding a provider (a new ASR/vision/OCR/embedding backend) is the easiest first contribution — implement the interface and register an entry point; no core fork needed. See docs/DESIGN.md.

License

Source-available under the PolyForm Noncommercial License 1.0.0. Noncommercial use, modification, and sharing are allowed with preserved author credit. Commercial use is not granted by the repo license; it requires a separate paid written license or agreement with Daniel Szepietowski (szepix) that grants consent and defines compensation.

A
license - permissive license
-
quality - not tested
B
maintenance

Maintenance

Maintainers
Response time
Release cycle
Releases (12mo)
Commit activity

Resources

Unclaimed servers have limited discoverability.

Looking for Admin?

If you are the server author, to access and configure the admin panel.

Latest Blog Posts

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/szepix/vidsight'

If you have feedback or need assistance with the MCP directory API, please join our Discord server