Glama

omni_video_ingest

Ingests raw video footage, produces word-level transcripts and a visual scene graph for B-roll searching, and returns the project metadata path.

Instructions

Ingests a directory of video files, generates word-level audio transcripts, and constructs a semantic Visual Scene Graph for B-Roll searching. Returns the path to the generated project metadata.

Input Schema

Name      Required  Description  Default
request   Yes
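The single `request` argument wraps the `IngestRequest` fields shown in the implementation reference. As a minimal sketch, assuming the server is reached over standard MCP JSON-RPC, a `tools/call` message might be assembled as below. The helper name `build_ingest_call` and the path `/footage/day1` are illustrative, not part of the server.

```python
import json


def build_ingest_call(directory_path: str, request_id: int = 1) -> dict:
    """Build a JSON-RPC 2.0 `tools/call` message for omni_video_ingest.

    `build_ingest_call` is a hypothetical helper, not part of the server.
    """
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "omni_video_ingest",
            # The handler's only parameter is named `request`, so the
            # IngestRequest fields nest under that key.
            "arguments": {"request": {"directory_path": directory_path}},
        },
    }


message = build_ingest_call("/footage/day1")  # hypothetical absolute path
print(json.dumps(message, indent=2))
```

The schema asks for an absolute path. The handler also calls `resolve()` on the value, so a relative path would be interpreted against the server's working directory.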

Output Schema

Name      Required  Description  Default
result    Yes
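Judging by the handler in the implementation reference, `result` is a plain string, and failures are returned as strings beginning with "Error" rather than raised as exceptions. A caller may therefore want to inspect the prefix. The following is a sketch under that assumption; `parse_ingest_result` is a hypothetical helper.

```python
def parse_ingest_result(result: str) -> tuple[bool, str]:
    """Return (ok, message) for a string returned by omni_video_ingest.

    Hypothetical helper: it relies only on the "Success:" and "Error"
    prefixes visible in the handler's return statements.
    """
    if result.startswith("Success:"):
        return True, result[len("Success:"):].strip()
    # Covers "Error: ...", "Error loading ...", and "Error transcribing ...",
    # and treats any unrecognized shape as a failure rather than guessing.
    return False, result


ok, message = parse_ingest_result("Error: No video files found in /footage/day1.")
print(ok, message)
```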

Implementation Reference

  • The main handler for the omni_video_ingest MCP tool. It takes a directory path, finds all video files (*.mp4, *.mov, *.mkv), transcribes each with ElevenLabs Scribe via helpers/transcribe.py, packs the transcripts into a markdown file via helpers/pack_transcripts.py, and writes a placeholder Visual Scene Graph JSON (the scene graph itself is not yet generated).
    @mcp.tool()
    async def omni_video_ingest(request: IngestRequest) -> str:
        """
        Ingests a directory of video files, generates word-level audio transcripts, 
        and constructs a semantic Visual Scene Graph for B-Roll searching.
        Returns the path to the generated project metadata.
        """
        directory = Path(request.directory_path).resolve()
        if not directory.exists():
            return f"Error: Directory {directory} not found."
            
        edit_dir = directory / "edit"
        edit_dir.mkdir(exist_ok=True)
        
        try:
            api_key = load_api_key()
        except Exception as e:
            return f"Error loading ElevenLabs API Key: {e}"
            
        # 1. Audio Transcription & Packing
        video_files = []
        for ext in ["*.mp4", "*.mov", "*.mkv"]:
            video_files.extend(directory.glob(ext))
            
        if not video_files:
            return f"Error: No video files found in {directory}."
            
        packed_entries = []
        for video in video_files:
            try:
                json_path = transcribe_one(video=video, edit_dir=edit_dir, api_key=api_key, verbose=False)
                entry = pack_one_file(json_path, silence_threshold=0.5)
                packed_entries.append(entry)
            except Exception as e:
                return f"Error transcribing {video.name}: {e}"
                
        markdown = render_markdown(packed_entries, silence_threshold=0.5)
        takes_packed_path = edit_dir / "takes_packed.md"
        takes_packed_path.write_text(markdown)
        
        # 2. Visual Scene Graph (Placeholder for Vision Model integration)
        scene_graph_path = edit_dir / "scene_graph.json"
        scene_graph_path.write_text(json.dumps({
            "status": "pending_vision_extraction",
            "message": "Visual Scene Graph generation requires Gemini Flash vision API integration."
        }))
        
        return f"Success: Ingested {len(video_files)} videos. Transcript packed at {takes_packed_path}. Visual Scene Graph stubbed."
  • Pydantic input schema for the omni_video_ingest tool, defining the required 'directory_path' field.
    class IngestRequest(BaseModel):
        directory_path: str = Field(..., description="Absolute path to the directory containing raw video footage.")
  • server.py:54-54 (registration)
    The tool is registered as an MCP tool via the @mcp.tool() decorator on the FastMCP instance 'mcp'.
    @mcp.tool()
  • Helper function called by omni_video_ingest to transcribe a single video file using ElevenLabs Scribe API and save the transcript JSON.
    def transcribe_one(
        video: Path,
        edit_dir: Path,
        api_key: str,
        language: str | None = None,
        num_speakers: int | None = None,
        verbose: bool = True,
    ) -> Path:
        """Transcribe a single video. Returns path to transcript JSON.
    
        Cached: returns existing path immediately if the transcript already exists.
        """
        transcripts_dir = edit_dir / "transcripts"
        transcripts_dir.mkdir(parents=True, exist_ok=True)
        out_path = transcripts_dir / f"{video.stem}.json"
    
        if out_path.exists():
            if verbose:
                print(f"cached: {out_path.name}")
            return out_path
    
        if verbose:
            print(f"  extracting audio from {video.name}", flush=True)
    
        t0 = time.time()
        with tempfile.TemporaryDirectory() as tmp:
            audio = Path(tmp) / f"{video.stem}.wav"
            extract_audio(video, audio)
            size_mb = audio.stat().st_size / (1024 * 1024)
            if verbose:
                print(f"  uploading {video.stem}.wav ({size_mb:.1f} MB)", flush=True)
            payload = call_scribe(audio, api_key, language, num_speakers)
    
        out_path.write_text(json.dumps(payload, indent=2))
        dt = time.time() - t0
    
        if verbose:
            kb = out_path.stat().st_size / 1024
            print(f"  saved: {out_path.name} ({kb:.1f} KB) in {dt:.1f}s")
            if isinstance(payload, dict) and "words" in payload:
                print(f"    words: {len(payload['words'])}")
    
        return out_path
  • Helper function called by omni_video_ingest to render packed transcript entries into a human-readable markdown file.
    def render_markdown(entries: list[tuple[str, float, list[dict]]], silence_threshold: float) -> str:
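Taken together, the listings above suggest that every output lands under an `edit/` subdirectory of the footage directory. The sketch below derives those paths and checks whether `scene_graph.json` still holds the placeholder payload. Both function names are hypothetical, and the layout is inferred from the code shown here rather than from separate documentation.

```python
import json
from pathlib import Path


def expected_artifacts(directory: Path, video_names: list[str]) -> dict:
    """Paths the handler and transcribe_one appear to write to (inferred)."""
    edit_dir = directory / "edit"
    return {
        # transcribe_one names each transcript after the video's stem.
        "transcripts": [
            edit_dir / "transcripts" / f"{Path(name).stem}.json"
            for name in video_names
        ],
        "takes_packed": edit_dir / "takes_packed.md",
        "scene_graph": edit_dir / "scene_graph.json",
    }


def is_scene_graph_stub(raw_json: str) -> bool:
    """True if the scene graph file content is still the placeholder."""
    data = json.loads(raw_json)
    return isinstance(data, dict) and data.get("status") == "pending_vision_extraction"


paths = expected_artifacts(Path("/footage/day1"), ["a.mp4", "a.mov", "b.mkv"])
print(paths["transcripts"])
```

Because the transcript name uses only the stem, `a.mp4` and `a.mov` map to the same `a.json`. With the caching shown in `transcribe_one`, the second file would reuse the first file's transcript.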
Behavior: 3/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries the full burden. It discloses the main actions (ingest, transcribe, construct graph, return metadata) but does not detail side effects (e.g., file modifications, resource usage, error behavior, permissions). Context like 'generates' and 'constructs' suggests non-destructive creation, but more depth is needed.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
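Reading the handler and `transcribe_one` listings, the side effects appear to be: creating `edit/` if it is missing, uploading audio to ElevenLabs for every video without a cached transcript, and overwriting `takes_packed.md` and `scene_graph.json` on each call. A hypothetical dry-run helper (not part of the server) could surface these before the tool is invoked:

```python
import tempfile
from pathlib import Path

VIDEO_PATTERNS = ("*.mp4", "*.mov", "*.mkv")  # patterns used by the handler


def plan_ingest(directory: Path) -> dict:
    """Dry run: list what an ingest call would likely create, upload, or overwrite."""
    edit_dir = directory / "edit"
    transcripts_dir = edit_dir / "transcripts"
    videos = sorted(p for pat in VIDEO_PATTERNS for p in directory.glob(pat))
    # transcribe_one skips the API call when <stem>.json already exists.
    cached = [v for v in videos if (transcripts_dir / f"{v.stem}.json").exists()]
    return {
        "creates_edit_dir": not edit_dir.exists(),
        "cached": cached,
        "to_upload": [v for v in videos if v not in cached],
        "overwrites": [
            p
            for p in (edit_dir / "takes_packed.md", edit_dir / "scene_graph.json")
            if p.exists()
        ],
    }


# Small demonstration on a throwaway directory with one cached transcript.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.mp4").touch()
    (root / "b.mov").touch()
    (root / "edit" / "transcripts").mkdir(parents=True)
    (root / "edit" / "transcripts" / "a.json").write_text("{}")
    plan = plan_ingest(root)
    summary = {
        key: [p.name for p in value] if isinstance(value, list) else value
        for key, value in plan.items()
    }
print(summary)
```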

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Two concise sentences, front-loaded with the core action, followed by additional processing details and output. Every sentence is substantive and non-redundant.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

For a tool of moderate complexity, the description covers the main functionality and output. An output schema exists, so return value details are assumed to be structured. It lacks information on processing time, error handling, or system requirements, but the core completeness is adequate.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

The description does not mention the single parameter 'directory_path'. The input schema describes it clearly, although the context signal reports 0% schema description coverage. The tool description therefore adds no parameter meaning beyond the schema, but the parameter is simple and inferable from the tool's purpose.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the verb 'ingests' with a specific resource (directory of video files), and outlines the processing steps (transcripts, scene graph) and output (metadata path). It distinguishes from siblings (VFX, preview, render) by indicating this is the intake step.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 3/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description implies use for initial ingestion and processing, but provides no explicit guidance on when to use this tool versus alternatives (e.g., if only transcripts are needed), nor does it mention prerequisites or constraints like supported video formats.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
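On the point about supported formats, the handler listing matches only `*.mp4`, `*.mov`, and `*.mkv` at the top level of the directory. The sketch below approximates that filter by file suffix, so a caller can see which files would be skipped. `classify_footage` is a hypothetical helper, and the comparison is case-sensitive, which mirrors glob matching on case-sensitive file systems only.

```python
from pathlib import PurePath

SUPPORTED_SUFFIXES = {".mp4", ".mov", ".mkv"}  # from the handler's glob patterns


def classify_footage(names: list[str]) -> tuple[list[str], list[str]]:
    """Split file names into (matched, skipped) by suffix.

    Assumption: an exact, case-sensitive suffix comparison stands in for
    the handler's glob patterns; platform behaviour may differ.
    """
    matched, skipped = [], []
    for name in names:
        if PurePath(name).suffix in SUPPORTED_SUFFIXES:
            matched.append(name)
        else:
            skipped.append(name)
    return matched, skipped


matched, skipped = classify_footage(["a.mp4", "b.MOV", "c.avi", "d.mkv"])
print(matched, skipped)
```

Under these assumptions, an upper-case extension such as `.MOV` or a container such as `.avi` would be left out, and a directory holding only such files would produce the "No video files found" error string.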

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/buildwithtaza/omni-video-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.
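The same GET request can be issued from Python's standard library. This is a minimal sketch: the URL is the one from the curl command above, while the assumption that the endpoint returns a JSON body, and the helper names, are not confirmed by this page.

```python
import json
import urllib.request

API_URL = "https://glama.ai/api/mcp/v1/servers/buildwithtaza/omni-video-mcp"


def build_request(url: str = API_URL) -> urllib.request.Request:
    """Prepare the GET request without sending it."""
    return urllib.request.Request(
        url, method="GET", headers={"Accept": "application/json"}
    )


def fetch_server_info(url: str = API_URL, timeout: float = 10.0) -> dict:
    """Send the request and decode the body (assumed, not confirmed, to be JSON)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return json.load(response)


request = build_request()
print(request.get_method(), request.full_url)
```

Calling `fetch_server_info()` performs the network request. The response fields are not documented on this page, so the sketch returns the decoded body without interpreting it.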