Analyze a video, audio, or image file or URL
analyze_mediaExtract visual and textual context from video, audio, or image files locally—frames, transcripts, OCR, and scene changes—for AI analysis.
Instructions
Turn a local media file or URL (video, audio, or image) into compact context a model can read — fully local, no paid APIs. Video: montage frames (mode 'sheet', cheapest default), individual stills ('frames'), or scene changes ('scenes'); add transcript and/or ocr. Audio: speech transcript. Image: the picture plus optional OCR. For app/screen recordings use detail:'high' + ocr:true. To catch a transient UI glitch (a flicker/jump lasting <1s), use mode:'filmstrip' with a narrow startSec/endSec window, a high fps (10–15), and a crop around the affected control — it stacks dense frames so you can spot a frame whose value disagrees with the visual. Use the cheap default for everything else. Pass context to frame the analysis.
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| source | Yes | Local media file path (video, audio, or image) OR an http(s) URL. | |
| context | No | Optional note about the video to frame the analysis, e.g. 'signup flow, focus on the validation error'. | |
| detail | No | high = readable stills for screen recordings (frames + large scale + png); low = cheap montage. Overrides only the fields you leave unset. | |
| mode | No | sheet = montage grids (cheapest, default); frames = individual stills; scenes = only scene-change montages; filmstrip = dense near-native-fps vertical strip for catching transient UI glitches (pair with a narrow startSec/endSec window, fps, and crop). | sheet |
| format | No | Image encoding. webp = smallest/fewest tokens (default); png = lossless for crisp text. | webp |
| maxFrames | No | Upper bound on sampled frames across the whole window. | |
| grid | No | Tiles per row/column for sheet/scenes modes (grid x grid). | |
| scale | No | Width in px of each frame before tiling. Lower = fewer tokens. | |
| sceneThreshold | No | Scene-change sensitivity for 'scenes' mode (higher = fewer cuts). | |
| fps | No | Explicit sampling rate (frames/sec); overrides the auto rate for sheet/frames/filmstrip. Use a high value (e.g. 10–15) with filmstrip to catch sub-second glitches. | |
| crop | No | Rectangle to crop before sampling — zoom into a UI region for sharper frames/OCR. Pixels, or fractions 0–1 of the frame (e.g. {x:0,y:0.7,width:1,height:0.3} = bottom 30%). | |
| stripRows | No | Tiles stacked per image in 'filmstrip' mode. | |
| startSec | No | Window start in seconds. | |
| endSec | No | Window end in seconds. | |
| transcript | No | Also run local Whisper to produce a speech transcript. | |
| whisperModel | No | Whisper model name (tiny, base, small, medium, large). | small |
| ocr | No | Extract on-screen text via OCR — ideal for app/screen recordings. Implies detail:high unless set. | |
| ocrLang | No | Tesseract language code(s) for OCR, e.g. 'eng' or 'eng+deu'. | eng |
| ocrPsm | No | Tesseract page-segmentation mode. 3 = auto (default), 6 = uniform block, 11 = sparse/scattered UI labels. | |
| ocrMaxFrames | No | Frames to OCR (sampled at full resolution, independent of the display images). | |
| detectJumps | No | Track the on-screen number (e.g. a slider %) across frames and report non-monotonic 'jump-back' glitches with timestamps. Pair with a crop around the value and a narrow window for best results. | |
| maxDurationSec | No | Reject URL downloads longer than this many seconds. | |
| maxFileSizeMb | No | Abort a URL download once it exceeds this size in MB. |