omni_video_ingest
Ingests raw video footage, produces word-level transcripts and a visual scene graph for B-roll searching, and returns the project metadata path.
Instructions
Ingests a directory of video files, generates word-level audio transcripts, and constructs a semantic Visual Scene Graph for B-Roll searching. Returns the path to the generated project metadata.
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| request | Yes |
Output Schema
| Name | Required | Description | Default |
|---|---|---|---|
| result | Yes |
Implementation Reference
- server.py:54-101 (handler)The main handler for the omni_video_ingest MCP tool. It takes a directory path, finds all video files (*.mp4, *.mov, *.mkv), transcribes each with ElevenLabs Scribe via helpers/transcribe.py, packs transcripts into a markdown file via helpers/pack_transcripts.py, and stubs a Visual Scene Graph JSON.
@mcp.tool() async def omni_video_ingest(request: IngestRequest) -> str: """ Ingests a directory of video files, generates word-level audio transcripts, and constructs a semantic Visual Scene Graph for B-Roll searching. Returns the path to the generated project metadata. """ directory = Path(request.directory_path).resolve() if not directory.exists(): return f"Error: Directory {directory} not found." edit_dir = directory / "edit" edit_dir.mkdir(exist_ok=True) try: api_key = load_api_key() except Exception as e: return f"Error loading ElevenLabs API Key: {e}" # 1. Audio Transcription & Packing video_files = [] for ext in ["*.mp4", "*.mov", "*.mkv"]: video_files.extend(directory.glob(ext)) if not video_files: return f"Error: No video files found in {directory}." packed_entries = [] for video in video_files: try: json_path = transcribe_one(video=video, edit_dir=edit_dir, api_key=api_key, verbose=False) entry = pack_one_file(json_path, silence_threshold=0.5) packed_entries.append(entry) except Exception as e: return f"Error transcribing {video.name}: {e}" markdown = render_markdown(packed_entries, silence_threshold=0.5) takes_packed_path = edit_dir / "takes_packed.md" takes_packed_path.write_text(markdown) # 2. Visual Scene Graph (Placeholder for Vision Model integration) scene_graph_path = edit_dir / "scene_graph.json" scene_graph_path.write_text(json.dumps({ "status": "pending_vision_extraction", "message": "Visual Scene Graph generation requires Gemini Flash vision API integration." })) return f"Success: Ingested {len(video_files)} videos. Transcript packed at {takes_packed_path}. Visual Scene Graph stubbed." - server.py:16-17 (schema)Pydantic input schema for the omni_video_ingest tool, defining the required 'directory_path' field.
class IngestRequest(BaseModel): directory_path: str = Field(..., description="Absolute path to the directory containing raw video footage.") - server.py:54-54 (registration)The tool is registered as an MCP tool via the @mcp.tool() decorator on the FastMCP instance 'mcp'.
@mcp.tool() - helpers/transcribe.py:90-132 (helper)Helper function called by omni_video_ingest to transcribe a single video file using ElevenLabs Scribe API and save the transcript JSON.
def transcribe_one( video: Path, edit_dir: Path, api_key: str, language: str | None = None, num_speakers: int | None = None, verbose: bool = True, ) -> Path: """Transcribe a single video. Returns path to transcript JSON. Cached: returns existing path immediately if the transcript already exists. """ transcripts_dir = edit_dir / "transcripts" transcripts_dir.mkdir(parents=True, exist_ok=True) out_path = transcripts_dir / f"{video.stem}.json" if out_path.exists(): if verbose: print(f"cached: {out_path.name}") return out_path if verbose: print(f" extracting audio from {video.name}", flush=True) t0 = time.time() with tempfile.TemporaryDirectory() as tmp: audio = Path(tmp) / f"{video.stem}.wav" extract_audio(video, audio) size_mb = audio.stat().st_size / (1024 * 1024) if verbose: print(f" uploading {video.stem}.wav ({size_mb:.1f} MB)", flush=True) payload = call_scribe(audio, api_key, language, num_speakers) out_path.write_text(json.dumps(payload, indent=2)) dt = time.time() - t0 if verbose: kb = out_path.stat().st_size / 1024 print(f" saved: {out_path.name} ({kb:.1f} KB) in {dt:.1f}s") if isinstance(payload, dict) and "words" in payload: print(f" words: {len(payload['words'])}") return out_path - helpers/pack_transcripts.py:137-137 (helper)Helper function called by omni_video_ingest to render packed transcript entries into a human-readable markdown file.
def render_markdown(entries: list[tuple[str, float, list[dict]]], silence_threshold: float) -> str: