Glama

Generate Video

generate_video

Create realistic 4-8 second videos from text prompts using AI. Supports optional starting image and reference images for consistency, with synchronized audio.

Instructions

Generate videos using Google Veo 3.1 AI model. Creates realistic 4-8 second videos from text prompts with optional first-frame image and reference images for character/style consistency. Supports native audio generation. Processing time: 2-5 minutes for 1080p videos. Returns video file path with optional thumbnail and HTML preview player. ⚠️ IMPORTANT: Video generation is ASYNC and takes 2-5 minutes. The tool will poll for completion automatically.
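The description says the tool polls for completion on its own, but the server's polling code is not shown here. As a rough illustration of what such a loop involves, the sketch below polls a hypothetical `check` callable until it returns a file path or a deadline passes. The names `poll_until_done` and `check`, the 10-second interval, and the 600-second timeout are assumptions chosen for the example; they are not taken from the server.

```python
import time
from typing import Callable, Optional


def poll_until_done(
    check: Callable[[], Optional[str]],
    interval_s: float = 10.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Call `check` until it returns a result or the timeout elapses.

    `check` is a hypothetical status probe: it returns None while the job
    is still running and the output file path once the job has finished.
    """
    deadline = clock() + timeout_s
    while True:
        result = check()
        if result is not None:
            return result
        if clock() >= deadline:
            raise TimeoutError(f"video not ready after {timeout_s} seconds")
        # Wait before probing again so the backend is not hammered.
        sleep(interval_s)
```

Because generation is documented as taking 2-5 minutes, a caller that imposes its own request timeout would probably want it comfortably above five minutes. `sleep` and `clock` are injectable only so the loop can be exercised without real waiting.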

Input Schema

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| prompt | Yes | Detailed description of the video to generate. Be specific about actions, camera movements, lighting, and style. Example: "A close-up shot of a futuristic coffee machine brewing a glowing blue espresso, with steam rising dramatically. Cinematic lighting, 4K quality." | |
| model | No | Video generation model (default: veo-3.1-generate-preview) | veo-3.1-generate-preview |
| aspectRatio | No | Video aspect ratio: 16:9 (landscape) or 9:16 (portrait/vertical) | 16:9 |
| resolution | No | Video resolution. Higher resolutions take longer to generate and result in larger files. | 1080p |
| durationSeconds | No | Video duration in seconds (4, 6, or 8 seconds) | |
| generateAudio | No | Generate native synchronized audio effects and dialogue based on the prompt | |
| sampleCount | No | Number of video samples to generate (1-4). Each sample is a separate generation. | |
| seed | No | Optional seed for deterministic output. Use the same seed with the same prompt for consistent results. | |
| outputPath | No | Optional custom output path for the video file (e.g., C:/videos/output.mp4). If not provided, saves to default output directory with timestamped filename. | |
| generateThumbnail | No | Extract thumbnail from video (requires ffmpeg installed). Thumbnail is saved alongside video. | |
| generateHTMLPlayer | No | Generate interactive HTML video player with preview and download options | |
| firstFrameImage | No | Starting frame image for image-to-video generation. Provide via filePath (local file) or data+mimeType (base64). The video will animate from this image. Supports JPEG, PNG, WebP. | |
| referenceImages | No | Up to 3 reference images for character/style consistency. Each needs a referenceType ("asset" or "style") and an image. | |
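The constraints in the schema above (allowed durations, sample count range, aspect ratios, reference image limit) can be checked on the caller's side before a long-running request is sent. The following is a minimal sketch under stated assumptions: the helper names `inline_image` and `validate_arguments` are hypothetical, the MIME strings are the standard ones for JPEG, PNG, and WebP, and the nesting of each reference image as `{"referenceType": ..., "image": ...}` is inferred from the parameter description rather than confirmed by the JSON schema.

```python
import base64
from typing import Any, Dict, List

ALLOWED_ASPECT_RATIOS = {"16:9", "9:16"}
ALLOWED_DURATIONS = {4, 6, 8}
ALLOWED_REFERENCE_TYPES = {"asset", "style"}
ALLOWED_IMAGE_MIME_TYPES = {"image/jpeg", "image/png", "image/webp"}


def inline_image(raw: bytes, mime_type: str) -> Dict[str, str]:
    """Build the data+mimeType (base64) form of an image from raw bytes."""
    if mime_type not in ALLOWED_IMAGE_MIME_TYPES:
        raise ValueError(f"unsupported image type: {mime_type}")
    return {"data": base64.b64encode(raw).decode("ascii"), "mimeType": mime_type}


def validate_arguments(args: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means every check passed."""
    problems: List[str] = []

    prompt = args.get("prompt")
    if not isinstance(prompt, str) or not prompt.strip():
        problems.append("prompt is required and must be a non-empty string")

    if "aspectRatio" in args and args["aspectRatio"] not in ALLOWED_ASPECT_RATIOS:
        problems.append("aspectRatio must be 16:9 or 9:16")

    if "durationSeconds" in args and args["durationSeconds"] not in ALLOWED_DURATIONS:
        problems.append("durationSeconds must be 4, 6, or 8")

    if "sampleCount" in args:
        count = args["sampleCount"]
        if not isinstance(count, int) or not 1 <= count <= 4:
            problems.append("sampleCount must be an integer from 1 to 4")

    references = args.get("referenceImages", [])
    if len(references) > 3:
        problems.append("referenceImages accepts at most 3 entries")
    for index, reference in enumerate(references):
        # Assumed shape: {"referenceType": "asset" | "style", "image": {...}}
        if reference.get("referenceType") not in ALLOWED_REFERENCE_TYPES:
            problems.append(f"referenceImages[{index}].referenceType must be asset or style")
        if "image" not in reference:
            problems.append(f"referenceImages[{index}] is missing an image")

    return problems
```

A caller might run `validate_arguments` on the argument dictionary and only invoke the tool when the returned list is empty, which avoids spending several minutes on a request that the schema would reject.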
Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description fully carries the behavioral transparency burden. It discloses the async processing, auto-polling, processing time estimates, and output features (file path, thumbnail, HTML player). No contradictions exist since annotations are absent.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is front-loaded with the core purpose and structured into two clear paragraphs. It is concise but includes necessary warnings and details. Minor redundancy (e.g., mentioning processing time twice) could be trimmed, but the description is efficient overall.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's complexity (13 parameters, nested objects, no output schema), the description adequately covers inputs, behavior, and basic output format. However, it lacks explicit error handling details, failure modes, or return type schema, which would be beneficial for a complex async tool.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema coverage is 100%, so parameters are well-documented. The description adds value beyond the schema by summarizing key behaviors (e.g., resolution impact on time, audio generation, thumbnail extraction) and providing a concrete prompt example, which aids agent decision-making.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states that the tool generates videos using the Google Veo 3.1 AI model, specifying duration (4-8 seconds), input types (text prompts with optional images), and outputs (file path, thumbnail, HTML player). It distinguishes itself from siblings like generate_image (image generation) and transcribe (audio processing) by focusing on video creation.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description implicitly guides usage by noting the async nature and 2-5 minute processing time, but it does not explicitly state when to use this tool versus alternatives like generate_image for static images or when not to use it (e.g., for editing existing videos). The warning and auto-polling note help manage expectations.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/Raindancer118/gemini-mcp'
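The same GET request can be issued from Python with the standard library. This is a sketch under assumptions: the helper names `server_url` and `fetch_server` are hypothetical, the owner/name path pattern is inferred from the single URL shown above, and the response is assumed to be a JSON body whose fields are not documented here.

```python
import json
import urllib.request
from urllib.parse import quote

API_BASE = "https://glama.ai/api/mcp/v1/servers"


def server_url(owner: str, name: str) -> str:
    """Build the directory URL for one server (path pattern assumed)."""
    return f"{API_BASE}/{quote(owner, safe='')}/{quote(name, safe='')}"


def fetch_server(owner: str, name: str, timeout_s: float = 10.0):
    """GET the server record and parse it, assuming a JSON response."""
    request = urllib.request.Request(server_url(owner, name), method="GET")
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_server("Raindancer118", "gemini-mcp")` would request the same URL as the curl command.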

If you have feedback or need assistance with the MCP directory API, please join our Discord server.