
MMAudio MCP

Official
by mmaudio

video_to_audio

Extract or generate synchronized audio from video content using AI, allowing customization of sound effects, ambient noise, and atmospheric elements based on descriptive prompts.

Instructions

Generate AI-powered audio from video content using MMAudio technology. Analyzes video frames and generates synchronized audio including sound effects, ambient noise, and atmospheric elements.

Input Schema

Name             Required  Default  Description
video_url        Yes       —        URL of the video file to generate audio for (supports mp4, webm, avi, mov formats)
prompt           Yes       —        Describe the audio you want to generate (e.g., "forest sounds with birds chirping", "urban traffic noise", "peaceful ocean waves")
negative_prompt  No        ""       Describe what you want to avoid in the generated audio (optional)
seed             No        —        Random seed for reproducible results (optional)
num_steps        No        25       Number of inference steps (higher = better quality, slower)
duration         No        8        Duration of generated audio in seconds
cfg_strength     No        4.5      Classifier-free guidance strength (higher = more adherence to prompt)
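Putting the table together, a minimal call needs only video_url and prompt; a sketch of an arguments object (the URL and prompt values here are invented for illustration) might look like:

```javascript
// Hypothetical arguments for a video_to_audio call; only video_url and
// prompt are required, the rest fall back to the documented defaults.
const args = {
  video_url: 'https://example.com/clip.mp4', // hypothetical URL
  prompt: 'peaceful ocean waves with distant seagulls',
  // Optional overrides:
  negative_prompt: 'music, speech',
  seed: 42,
  num_steps: 25,     // 1-50, default 25
  duration: 8,       // 1-30 seconds, default 8
  cfg_strength: 4.5, // 1-10, default 4.5
};

console.log(JSON.stringify(args, null, 2));
```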

Implementation Reference

  • The primary handler function for the 'video_to_audio' tool. It validates input with VideoToAudioInputSchema, sends a POST request to the MMAudio API (/api/video-to-audio), maps HTTP error statuses to MCP errors, parses and validates the response, and returns structured content with the audio details.
    async handleVideoToAudio(args) {
      this.ensureConfigured();
    
      try {
        const input = VideoToAudioInputSchema.parse(args);
        
        console.error(`[MMAudio] Starting video-to-audio generation for: ${input.video_url}`);
    
        const response = await fetch(`${this.config.baseUrl}/api/video-to-audio`, {
          method: 'POST',
          headers: {
            'Content-Type': 'application/json',
            'Authorization': `Bearer ${this.config.apiKey}`,
            'User-Agent': 'MMAudio-MCP/1.0.0',
          },
          body: JSON.stringify(input),
          // fetch has no `timeout` option; abort via an AbortSignal instead
          signal: AbortSignal.timeout(this.config.timeout),
        });
    
        if (!response.ok) {
          const errorText = await response.text();
          let errorMessage = `HTTP ${response.status}`;
          
          try {
            const errorData = JSON.parse(errorText);
            errorMessage = errorData.error || errorMessage;
          } catch {
            errorMessage = errorText || errorMessage;
          }
    
          if (response.status === 401) {
            throw new McpError(ErrorCode.InvalidRequest, 'Invalid API key. Please check your MMAudio API key.');
          } else if (response.status === 403) {
            throw new McpError(ErrorCode.InvalidRequest, 'Insufficient credits for video-to-audio generation.');
          } else if (response.status === 429) {
            throw new McpError(ErrorCode.InvalidRequest, 'Rate limit exceeded. Please try again later.');
          }
    
          throw new Error(errorMessage);
        }
    
        const result = await response.json();
        const validatedResult = VideoToAudioResponseSchema.parse(result);
    
        console.error(`[MMAudio] Video-to-audio generation completed successfully`);
    
        return {
          content: [
            {
              type: 'text',
              text: JSON.stringify({
                success: true,
                message: 'Audio generated successfully from video',
                result: {
                  audio_url: validatedResult.video.url,
                  content_type: validatedResult.video.content_type,
                  file_name: validatedResult.video.file_name,
                  file_size: validatedResult.video.file_size,
                  duration: input.duration,
                  prompt: input.prompt,
                }
              }, null, 2),
            },
          ],
        };
      } catch (error) {
        if (error instanceof z.ZodError) {
          throw new McpError(
            ErrorCode.InvalidParams,
            `Invalid input parameters: ${error.errors.map(e => `${e.path.join('.')}: ${e.message}`).join(', ')}`
          );
        }
        throw error;
      }
    }
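Because the handler surfaces HTTP 429 as an error whose message reads "Rate limit exceeded…", a caller could wrap tool invocations with simple retry logic. A minimal sketch, assuming a hypothetical `callTool` function that throws exactly that error on 429:

```javascript
// Retry-with-backoff sketch for rate-limited calls. `callTool` is a
// hypothetical async function mirroring the handler above: it throws an
// Error containing 'Rate limit exceeded' when the API returns HTTP 429.
async function callWithRetry(callTool, args, maxAttempts = 3, baseDelayMs = 1000) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await callTool(args);
    } catch (err) {
      const rateLimited = /Rate limit exceeded/.test(err.message);
      if (!rateLimited || attempt === maxAttempts) throw err;
      // Exponential backoff: baseDelayMs, 2x, 4x, ...
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** (attempt - 1)));
    }
  }
}
```

Non-rate-limit errors are rethrown immediately, since retrying an invalid API key or a malformed request would never succeed.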
  • Zod input schema defining parameters for video_to_audio tool: video_url (required), prompt (required), and optional parameters for generation control.
    const VideoToAudioInputSchema = z.object({
      video_url: z.string().url('Invalid video URL'),
      prompt: z.string().min(1, 'Prompt is required'),
      negative_prompt: z.string().optional().default(''),
      seed: z.number().int().optional().nullable(),
      num_steps: z.number().int().min(1).max(50).default(25),
      duration: z.number().min(1).max(30).default(8),
      cfg_strength: z.number().min(1).max(10).default(4.5),
    });
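The defaults above mean a minimal call is expanded before it reaches the API. A plain-JS illustration of that expansion (this is not the actual Zod parse, just a sketch of the resulting shape):

```javascript
// Illustration only: mimics how the Zod schema above fills defaults for
// omitted optional fields. The real code uses VideoToAudioInputSchema.parse.
function applyDefaults(args) {
  // seed is optional/nullable with no default, so it is left as-is.
  return {
    negative_prompt: '',
    num_steps: 25,
    duration: 8,
    cfg_strength: 4.5,
    ...args,
  };
}

const expanded = applyDefaults({
  video_url: 'https://example.com/clip.mp4', // hypothetical URL
  prompt: 'forest sounds with birds chirping',
});
console.log(expanded.num_steps); // 25
console.log(expanded.duration);  // 8
```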
  • Zod response schema for video_to_audio tool output, referencing shared AudioResponseSchema (lines 57-62).
    const VideoToAudioResponseSchema = z.object({
      video: AudioResponseSchema,
    });
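A hypothetical response matching this schema, with field names taken from what the handler reads (url, content_type, file_name, file_size) and all values invented:

```javascript
// Invented example payload that would satisfy VideoToAudioResponseSchema.
// Note the generated audio is nested under a `video` key.
const exampleResponse = {
  video: {
    url: 'https://example.com/generated/audio.flac', // hypothetical URL
    content_type: 'audio/flac',
    file_name: 'audio.flac',
    file_size: 1048576, // bytes
  },
};

console.log(exampleResponse.video.content_type); // audio/flac
```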
  • Tool registration in ListToolsRequestSchema handler, defining name, description, and JSON schema mirroring the Zod input schema.
    {
      name: 'video_to_audio',
      description: 'Generate AI-powered audio from video content using MMAudio technology. Analyzes video frames and generates synchronized audio including sound effects, ambient noise, and atmospheric elements.',
      inputSchema: {
        type: 'object',
        properties: {
          video_url: {
            type: 'string',
            format: 'uri',
            description: 'URL of the video file to generate audio for (supports mp4, webm, avi, mov formats)',
          },
          prompt: {
            type: 'string',
            description: 'Describe the audio you want to generate (e.g., "forest sounds with birds chirping", "urban traffic noise", "peaceful ocean waves")',
          },
          negative_prompt: {
            type: 'string',
            description: 'Describe what you want to avoid in the generated audio (optional)',
            default: '',
          },
          seed: {
            type: 'integer',
            description: 'Random seed for reproducible results (optional)',
            nullable: true,
          },
          num_steps: {
            type: 'integer',
            minimum: 1,
            maximum: 50,
            default: 25,
            description: 'Number of inference steps (higher = better quality, slower)',
          },
          duration: {
            type: 'number',
            minimum: 1,
            maximum: 30,
            default: 8,
            description: 'Duration of generated audio in seconds',
          },
          cfg_strength: {
            type: 'number',
            minimum: 1,
            maximum: 10,
            default: 4.5,
            description: 'Classifier-free guidance strength (higher = more adherence to prompt)',
          },
        },
        required: ['video_url', 'prompt'],
      },
    },
  • Dispatch case in CallToolRequestSchema handler that routes 'video_to_audio' calls to the handleVideoToAudio method.
    case 'video_to_audio':
      return await this.handleVideoToAudio(args);
Behavior 2/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries the full burden of behavioral disclosure. It mentions that the tool "analyzes video frames and generates synchronized audio", implying processing and creation, but fails to disclose critical traits such as whether the operation is read-only or destructive, rate limits, authentication requirements, or output format (e.g., file type, size). This leaves significant gaps for an AI agent trying to understand the tool's behavior.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is concise and front-loaded, with two sentences that directly state the tool's purpose and key functionality. Every sentence earns its place by explaining the core action and the types of audio generated, though it could be slightly more structured by explicitly listing output details.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness 2/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the complexity of a 7-parameter tool with no annotations and no output schema, the description is incomplete. It says nothing about the output (e.g., audio format, how to access it), error handling, performance expectations, or constraints beyond what is implied, leaving an AI agent without the context needed to use the tool reliably.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 100%, so the input schema already documents every parameter thoroughly. The description adds nothing beyond the schema, such as explaining interactions between parameters or providing usage examples; however, since the schema is comprehensive, a baseline score of 3 is appropriate because the description does not need to compensate.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose with specific verbs ('generate AI-powered audio from video content') and resources ('video content'), distinguishing it from sibling tools like 'text_to_audio' by specifying video input. It also mentions the technology used ('MMAudio technology'), which adds specificity.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines 2/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides no guidance on when to use this tool versus alternatives like 'text_to_audio' or 'validate_api_key'. It lacks explicit instructions on prerequisites, such as whether the video must be pre-processed or if there are usage limits, leaving the agent without context for tool selection.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
