create_lipsync

Synchronize mouth movements in videos with audio using text-to-speech or custom audio upload. Works with human faces in real, 3D, or 2D videos to create lip-sync content.

Instructions

Create a lip-sync video by synchronizing mouth movements with audio. Supports either text-to-speech (TTS) with various voice options or custom audio upload. The original video must contain a clear, steady human face with a visible mouth. Works with real, 3D, or 2D human characters (not animals). Video length is limited to 10 seconds.

Input Schema

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| video_url | Yes | URL of the video to apply lip-sync to (must contain clear human face) | |
| audio_url | No | URL of custom audio file (mp3, wav, flac, ogg; max 20MB, 60s). If provided, TTS parameters are ignored | |
| tts_text | No | Text for text-to-speech synthesis (used only if audio_url is not provided) | |
| tts_voice | No | Voice style for TTS. Includes Chinese and English voice options | male-warm |
| tts_speed | No | Speech speed for TTS (0.5-2.0) | 1.0 |
| model_name | No | Model version to use | kling-v2-master |

Implementation Reference

  • Core handler function that implements the lip-sync tool logic by calling the Kling AI lip-sync API endpoint, processing video URLs, handling audio upload vs TTS modes, and returning the task ID.
    async createLipsync(request: LipsyncRequest): Promise<{ task_id: string }> {
      const path = '/v1/videos/lip-sync';
      
      // Process video URL
      const video_url = await this.processImageUrl(request.video_url);
      
      const input: Record<string, unknown> = {
        video_url: video_url!,
      };
    
      if (request.audio_url) {
        input.mode = 'audio2video';
        input.audio_type = 'url';
        input.audio_url = request.audio_url;
      } else if (request.tts_text) {
        input.mode = 'text2video';
        input.text = request.tts_text;
        input.voice_id = request.tts_voice || 'male-warm'; // fall back to the schema's documented default
        input.voice_language = 'en';
        input.voice_speed = request.tts_speed || 1.0;
      } else {
        throw new Error('Either audio_url or tts_text must be provided');
      }
    
      const body = { input };
    
      try {
        const response = await this.axiosInstance.post(path, body);
        return response.data.data;
      } catch (error) {
        if (axios.isAxiosError(error)) {
          throw new Error(`Kling API error: ${error.response?.data?.message || error.message}`);
        }
        throw error;
      }
    }
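The branching inside createLipsync can be factored into a pure helper so the payload construction is testable without network access. This is a sketch assuming the same LipsyncRequest fields; the fallback voice here follows the schema's documented male-warm default, which is an assumption about the API's accepted voice IDs.

```typescript
interface LipsyncRequest {
  video_url: string;
  audio_url?: string;
  tts_text?: string;
  tts_voice?: string;
  tts_speed?: number;
}

// Builds the request body posted to /v1/videos/lip-sync.
// A provided audio_url takes precedence over the TTS parameters.
function buildLipsyncInput(request: LipsyncRequest): Record<string, unknown> {
  const input: Record<string, unknown> = { video_url: request.video_url };

  if (request.audio_url) {
    input.mode = 'audio2video';
    input.audio_type = 'url';
    input.audio_url = request.audio_url;
  } else if (request.tts_text) {
    input.mode = 'text2video';
    input.text = request.tts_text;
    input.voice_id = request.tts_voice ?? 'male-warm'; // schema default (assumption)
    input.voice_language = 'en';
    input.voice_speed = request.tts_speed ?? 1.0;
  } else {
    throw new Error('Either audio_url or tts_text must be provided');
  }
  return input;
}
```

Isolating the payload this way also makes the audio-over-TTS precedence rule explicit and easy to unit-test.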
  • MCP protocol handler in the CallToolRequestSchema that parses arguments, validates inputs, calls klingClient.createLipsync, and formats the response with task ID.
    case 'create_lipsync': {
      const lipsyncRequest = {
        video_url: args.video_url as string,
        audio_url: args.audio_url as string | undefined,
        tts_text: args.tts_text as string | undefined,
        tts_voice: args.tts_voice as string | undefined,
        tts_speed: (args.tts_speed as number) ?? 1.0,
        model_name: (args.model_name as 'kling-v1' | 'kling-v1.5' | 'kling-v1.6' | 'kling-v2-master' | undefined) || 'kling-v2-master',
      };
    
      // Validate that either audio_url or tts_text is provided
      if (!lipsyncRequest.audio_url && !lipsyncRequest.tts_text) {
        throw new Error('Either audio_url or tts_text must be provided for lip-sync');
      }
    
      const result = await klingClient.createLipsync(lipsyncRequest);
      
      return {
        content: [
          {
            type: 'text',
            text: `Lip-sync video creation started successfully!\nTask ID: ${result.task_id}\n\nThe video will be processed with ${lipsyncRequest.audio_url ? 'custom audio' : 'text-to-speech'}.\nUse the check_video_status tool with this task ID to check the progress.`,
          },
        ],
      };
    }
  • Tool schema definition including name, description, and detailed inputSchema with properties, enums, and validation for the create_lipsync tool.
    {
      name: 'create_lipsync',
      description: 'Create a lip-sync video by synchronizing mouth movements with audio. Supports either text-to-speech (TTS) with various voice options or custom audio upload. The original video must contain a clear, steady human face with a visible mouth. Works with real, 3D, or 2D human characters (not animals). Video length is limited to 10 seconds.',
      inputSchema: {
        type: 'object',
        properties: {
          video_url: {
            type: 'string',
            description: 'URL of the video to apply lip-sync to (must contain clear human face)',
          },
          audio_url: {
            type: 'string',
            description: 'URL of custom audio file (mp3, wav, flac, ogg; max 20MB, 60s). If provided, TTS parameters are ignored',
          },
          tts_text: {
            type: 'string',
            description: 'Text for text-to-speech synthesis (used only if audio_url is not provided)',
          },
          tts_voice: {
            type: 'string',
            enum: ['male-warm', 'male-energetic', 'female-gentle', 'female-professional', 'male-deep', 'female-cheerful', 'male-calm', 'female-youthful'],
            description: 'Voice style for TTS (default: male-warm). Includes Chinese and English voice options',
          },
          tts_speed: {
            type: 'number',
            description: 'Speech speed for TTS (0.5-2.0, default: 1.0)',
            minimum: 0.5,
            maximum: 2.0,
          },
          model_name: {
            type: 'string',
            enum: ['kling-v1', 'kling-v1.5', 'kling-v1.6', 'kling-v2-master'],
            description: 'Model version to use (default: kling-v2-master)',
          },
        },
        required: ['video_url'],
      },
    },
  • TypeScript interface defining the LipsyncRequest parameters used by the createLipsync handler.
    export interface LipsyncRequest {
      video_url: string;
      audio_url?: string;
      tts_text?: string;
      tts_voice?: string;
      tts_speed?: number;
      model_name?: 'kling-v1' | 'kling-v1.5' | 'kling-v1.6' | 'kling-v2-master';
    }
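A runtime guard could validate raw MCP arguments against the constraints above before a LipsyncRequest is constructed. This helper is a hypothetical sketch, not part of the server code.

```typescript
// Returns a list of validation errors; an empty list means the arguments
// are usable. Mirrors the schema constraints: video_url is required, one
// audio source (audio_url or tts_text) is required, and tts_speed must
// fall within 0.5-2.0.
function validateLipsyncArgs(args: Record<string, unknown>): string[] {
  const errors: string[] = [];
  if (typeof args.video_url !== 'string' || args.video_url.length === 0) {
    errors.push('video_url is required and must be a non-empty string');
  }
  if (!args.audio_url && !args.tts_text) {
    errors.push('either audio_url or tts_text must be provided');
  }
  const speed = args.tts_speed;
  if (speed !== undefined && (typeof speed !== 'number' || speed < 0.5 || speed > 2.0)) {
    errors.push('tts_speed must be a number between 0.5 and 2.0');
  }
  return errors;
}
```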
  • src/index.ts:467-469 (registration)
    Registration of all tools including create_lipsync via the ListToolsRequestSchema handler that returns the TOOLS array containing the create_lipsync tool definition.
    server.setRequestHandler(ListToolsRequestSchema, async () => ({
      tools: TOOLS,
    }));
Behavior: 3/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries full burden. It discloses key behavioral traits: video constraints (clear face, 10s limit), supported character types (human/3D/2D, not animals), and TTS vs custom audio options. However, it doesn't cover permissions, rate limits, or what happens to the original video.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is appropriately sized (4 sentences) and front-loaded with the core purpose. Every sentence adds value: purpose, input options, video requirements, and constraints. No wasted words or redundant information.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

For a 6-parameter creation tool with no annotations or output schema, the description provides good context about what the tool does and its constraints. It covers the main use case and limitations, though it could benefit from more behavioral details about the creation process and output format.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 100%, so the schema already documents all 6 parameters thoroughly. The description adds some context about TTS/custom audio trade-offs and video requirements, but doesn't provide additional parameter semantics beyond what's in the schema descriptions.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose with specific verbs ('create', 'synchronizing') and resources ('lip-sync video', 'mouth movements', 'audio'). It distinguishes from siblings by focusing on lip-sync creation rather than effects, generation, or status checking.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides clear context for when to use this tool: for creating lip-sync videos with specific input requirements (clear human face, video length ≤10s). It doesn't explicitly mention when not to use it or name alternatives among siblings, but the constraints guide appropriate usage.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
