Glama

get_video_preview

Generate a contact-sheet preview of a YouTube video, sampling frames evenly across the video or a time window, with timestamps for each tile to identify scenes and pick moments for closer inspection.

Instructions

Get a visual overview of a YouTube video as ONE tiled contact-sheet image that tiles frames sampled evenly across the video (or across a start..end window of it), plus a text legend mapping each tile to its mm:ss timestamp.
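As a rough illustration of "sampled evenly" and of the legend, the sketch below picks one timestamp at the midpoint of each equal slice of the window and formats it as mm:ss. The midpoint rule and the names sample_timestamps, format_mmss, and build_legend are assumptions made for illustration; the server's actual sampling strategy is not documented here, and real frames may land on nearby keyframes instead.

```python
def sample_timestamps(start: float, end: float, tiles: int) -> list[float]:
    """Return one timestamp per tile, at the midpoint of each equal slice.

    Assumption: midpoint sampling. The real tool may sample differently.
    """
    if tiles <= 0:
        raise ValueError("tiles must be positive")
    if end <= start:
        raise ValueError("end must be greater than start")
    step = (end - start) / tiles
    return [start + step * (i + 0.5) for i in range(tiles)]


def format_mmss(seconds: float) -> str:
    """Format seconds as mm:ss (minutes may exceed 59 for long videos)."""
    total = int(seconds)
    return f"{total // 60:02d}:{total % 60:02d}"


def build_legend(start: float, end: float, tiles: int) -> list[str]:
    """Map each tile number (1-based) to its mm:ss timestamp."""
    stamps = sample_timestamps(start, end, tiles)
    return [f"tile {i}: {format_mmss(t)}" for i, t in enumerate(stamps, start=1)]
```

For a 2-minute window and 4 tiles, this yields timestamps at 15, 45, 75, and 105 seconds.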

Use this to see what's going on across a video (talking head vs slides vs demo footage, scene changes, "is there a chart anywhere?") and to pick moments worth a closer look. To inspect one part in more detail, call it again with start/end around that part -- but pick the window from the transcript, chapters, or get_most_replayed first and zoom once; don't binary-search the video with repeated sheets, since every returned image stays in context. Read each tile's timestamp from the legend -- do not count grid cells yourself. Tiles are small and not readable: to read a slide, caption, or UI, follow up with get_video_frame(video, at=<that tile's timestamp>). Windows under ~1 minute may return near-duplicate tiles (frames land on keyframes, a few seconds apart).
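The "zoom once" advice can be sketched as a small helper that turns a moment of interest (taken from the transcript, chapters, or get_most_replayed) into a start/end window. The helper name zoom_window, the 120-second default width, and the 60-second floor are illustrative assumptions; the floor is only a reading of the note that windows under about 1 minute may return near-duplicate tiles.

```python
def zoom_window(
    center: float,
    duration: float,
    width: float = 120.0,
    min_width: float = 60.0,
) -> tuple[float, float]:
    """Return a (start, end) window in seconds around `center`.

    The window is at least `min_width` wide (when the video is long enough)
    and is shifted, not shrunk, to stay inside [0, duration].
    """
    if duration <= 0:
        raise ValueError("duration must be positive")
    width = max(width, min_width)   # avoid very short windows
    width = min(width, duration)    # never wider than the video itself
    start = center - width / 2
    start = max(0.0, min(start, duration - width))
    return start, start + width
```

The resulting pair would be passed as the start and end arguments of a single follow-up get_video_preview call, rather than issuing several progressively narrower sheets.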

Requires ffmpeg on the server (the system binary, or the one bundled by the [media] extra).
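A minimal sketch of checking for the system ffmpeg binary before relying on this tool. It only looks on PATH with the standard library; how the server locates the binary bundled by the [media] extra is not described here, so that case is not covered. The function name ffmpeg_available is hypothetical.

```python
import shutil


def ffmpeg_available(binary: str = "ffmpeg") -> bool:
    """Return True if `binary` is found on the system PATH.

    Note: this does not detect a bundled ffmpeg installed by an extra.
    """
    return shutil.which(binary) is not None
```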

Args:
  video: A YouTube URL (watch, youtu.be, shorts, embed, live) or an 11-character video ID.
  tiles: How many frames to sample (clamped 4..24; default 12).
  tile_width: Width in pixels of each tile (clamped 160..480; default 320). The whole sheet stays around 1000-1300 px wide at the defaults -- cheap on a vision model's image budget while keeping tiles recognizable.
  start: Optional window start -- seconds (e.g. 90) or a "mm:ss" / "h:mm:ss" string. Defaults to the beginning of the video.
  end: Optional window end, same forms. Defaults to (and is clamped to) the video's end.
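The clamping and time formats described in the Args can be sketched as follows. This is an illustration of the documented ranges and input forms, not the server's code: the names clamp, parse_time, and normalize_args are hypothetical, and accepting a plain numeric string such as "90" is an assumption beyond what the description states.

```python
from typing import Union


def clamp(value: int, low: int, high: int) -> int:
    """Restrict value to the inclusive range [low, high]."""
    return max(low, min(high, value))


def parse_time(value: Union[int, float, str]) -> float:
    """Convert seconds, "mm:ss", or "h:mm:ss" into seconds."""
    if isinstance(value, (int, float)):
        seconds = float(value)
    else:
        parts = value.strip().split(":")
        if len(parts) > 3:
            raise ValueError(f"unsupported time format: {value!r}")
        seconds = 0.0
        for part in parts:
            # Each further field is a smaller unit: hours -> minutes -> seconds.
            seconds = seconds * 60 + float(part)
    if seconds < 0:
        raise ValueError("time must not be negative")
    return seconds


def normalize_args(tiles: int = 12, tile_width: int = 320) -> dict:
    """Apply the documented ranges: tiles 4..24, tile_width 160..480."""
    return {
        "tiles": clamp(tiles, 4, 24),
        "tile_width": clamp(tile_width, 160, 480),
    }
```

For example, parse_time("1:30") gives 90.0 seconds, and a request for 100 tiles would be reduced to 24.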

Input Schema

Name        Required  Description  Default
video       Yes
tiles       No
tile_width  No
start       No
end         No
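The video argument is described as accepting a YouTube URL (watch, youtu.be, shorts, embed, live) or an 11-character video ID. A hedged sketch of normalizing those forms to an ID is shown below; the function name extract_video_id, the exact hosts accepted, and the ID character set are assumptions for illustration, not the server's implementation.

```python
import re
from urllib.parse import parse_qs, urlparse

# Assumption: IDs are 11 characters drawn from letters, digits, "_" and "-".
_VIDEO_ID = re.compile(r"[A-Za-z0-9_-]{11}")


def extract_video_id(video: str) -> str:
    """Return the 11-character video ID from a URL or a bare ID."""
    video = video.strip()
    if _VIDEO_ID.fullmatch(video):
        return video

    url = urlparse(video)
    host = (url.hostname or "").lower()
    segments = [s for s in url.path.split("/") if s]
    candidate = ""

    if host == "youtu.be" and segments:
        candidate = segments[0]
    elif host == "youtube.com" or host.endswith(".youtube.com"):
        if segments[:1] == ["watch"]:
            candidate = parse_qs(url.query).get("v", [""])[0]
        elif len(segments) >= 2 and segments[0] in {"shorts", "embed", "live"}:
            candidate = segments[1]

    if not _VIDEO_ID.fullmatch(candidate):
        raise ValueError(f"could not find a video ID in {video!r}")
    return candidate
```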
Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations, the description discloses behavioral traits: it returns one image, tiles are not readable, image dimensions, the server dependency (ffmpeg), clamping of parameters, and potential duplicate tiles. It lacks an explicit read-only statement, though read-only behavior is implied. It does not mention auth or rate limits, which is acceptable for this tool.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is well-organized with clear sections (what, when, requirements, parameters). Every sentence adds value, but it is slightly lengthy. The embedded Args block is justified given the schema lacks descriptions.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given 5 parameters, no output schema, and no annotations, the description covers purpose, usage, parameter details, constraints, and common pitfalls (near-duplicate tiles). It could be more explicit about the output format (e.g., image dimensions), but overall it is sufficiently complete.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 5/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

The schema has 0% description coverage, but the description's Args section adds extensive meaning: it explains that video accepts a URL or an ID, that tiles and tile_width are clamped and have defaults, and that start/end accept seconds or time strings and have defaults. This fully compensates for the schema gap.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

Clearly states the tool returns a single tiled contact-sheet image with timestamp legend, distinguishing it from sibling tools like get_video_frame (single frame) and get_transcript (text). The verb 'get' and resource 'visual overview' are specific.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 5/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

Explicitly advises when to use (to get an overview), when not to (avoid binary-searching), and how to zoom in using start/end parameters with guidance to first use transcript or get_most_replayed. Also warns about near-duplicate tiles for short windows.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/realiti4/youtube-context-mcp'
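The same request can be sketched in Python with the standard library. The split of the path into owner and name, the helper names, and the assumption that the endpoint returns JSON are not stated in the source and are labeled here as illustrative only.

```python
import json
import urllib.request

API_BASE = "https://glama.ai/api/mcp/v1/servers"


def build_server_request(owner: str, name: str) -> urllib.request.Request:
    """Build a GET request for one server entry in the MCP directory API."""
    url = f"{API_BASE}/{owner}/{name}"
    return urllib.request.Request(url, method="GET")


def fetch_server_info(owner: str, name: str, timeout: float = 10.0):
    """Send the request and decode the body.

    Assumption: the response body is JSON.
    """
    request = build_server_request(owner, name)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Example (performs a network call):
# info = fetch_server_info("realiti4", "youtube-context-mcp")
```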

If you have feedback or need assistance with the MCP directory API, please join our Discord server.