
Baidu Digital Human MCP Server

Official
by baidu-xiling

generateDhVideo

Generate digital human videos by combining selected avatar IDs, voice IDs, and text or audio content with customizable settings for resolution, camera angles, backgrounds, and subtitles.

Instructions

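Tool description: Generates a digital human video from the selected avatar (figure) ID and voice ID.

Example 1:

User input: Using the avatar with ID xxx and the voice with ID yyy, with the video content "大家好,我是数字人播报的内容" ("Hello everyone, this is content narrated by a digital human"), the landscape full-body camera position, and the background image "https://digital-human-material.bj.bcebos.com/-%5BLjava.lang.String%3B%4046f6cc1e.png", enable automatic actions, enable subtitles, and generate a 1080P digital human video.

Reasoning:

1. The user wants to generate a digital human video from an avatar ID and has requirements for the voice, background, subtitles, and resolution. This is not a simple digital human video, so the "generateDhVideo" tool is needed.
2. The tool needs the parameters FigureId, driveType, text, person, inputAudioUrl, width, height, cameraID, enable, backgroundimageUrl, and autoAnimoji.
3. FigureId is the avatar ID to use, so its value is xxx. The content to be spoken is given as text, so driveType is text-driven and text is "大家好,我是数字人播报的内容". A voice ID has been provided, so person is yyy. Automatic actions are enabled, so autoAnimoji is true. Subtitles are enabled, so enabled is true. The resolution is 1080P, which splits into a width of 1920 and a height of 1080. backgroundimageUrl is "https://digital-human-material.bj.bcebos.com/-%5BLjava.lang.String%3B%4046f6cc1e.png".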
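As a rough illustration, the request in Example 1 could be expressed with the parameter names from the Input Schema as shown below. The example's reasoning uses other names (person, width, enabled), which appear to be the fields of the underlying request rather than the tool's arguments. The cameraId value of 2 is taken from the schema's "landscape full body" option, and xxx and yyy are the placeholder IDs from the example, not real IDs.

```python
# Arguments for generateDhVideo derived from Example 1.
# "xxx" and "yyy" are placeholders copied from the example, not real IDs.
example_arguments = {
    "figureId": "xxx",
    "voiceId": "yyy",
    "driveType": "TEXT",  # the content is supplied as text
    "text": "大家好,我是数字人播报的内容",
    "cameraId": 2,  # 2 = landscape full body in the input schema
    "resolutionWidth": 1920,  # 1080P split into width and height
    "resolutionHeight": 1080,
    "backgroundImageUrl": (
        "https://digital-human-material.bj.bcebos.com/"
        "-%5BLjava.lang.String%3B%4046f6cc1e.png"
    ),
    "subtitleEnable": True,
    "autoAnimoji": True,
}
```

Parameters that the example does not mention (inputAudioUrl, backgroundTransparent, callbackUrl) are simply left out here.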

Input Schema

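| Name | Required | Description | Default |
|---|---|---|---|
| figureId | No | Avatar (figure) ID | |
| voiceId | No | Voice ID | |
| text | No | Text to be spoken | |
| inputAudioUrl | No | URL of the driving audio | |
| resolutionWidth | No | Resolution: width | |
| resolutionHeight | No | Resolution: height | |
| backgroundTransparent | No | Whether the background is transparent | |
| cameraId | No | Digital human camera position. 0: landscape half body, 1: portrait half body, 2: landscape full body, 3: portrait full body | |
| backgroundImageUrl | No | Background image | |
| callbackUrl | No | Callback URL | |
| driveType | No | Drive type. TEXT: text-driven, VOICE: audio-driven | TEXT |
| subtitleEnable | No | Whether to enable subtitles | |
| autoAnimoji | No | Automatically add digital human actions | |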
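The four cameraId values in the table above are easy to mix up when passed as bare integers. One way to keep them readable on the caller's side is a small enumeration. This is a sketch only: CameraPosition and is_landscape are hypothetical names that are not part of the server.

```python
from enum import IntEnum


class CameraPosition(IntEnum):
    """cameraId values from the input schema; the member names are illustrative."""

    LANDSCAPE_HALF_BODY = 0
    PORTRAIT_HALF_BODY = 1
    LANDSCAPE_FULL_BODY = 2
    PORTRAIT_FULL_BODY = 3


def is_landscape(camera_id: int) -> bool:
    """Return True for the two landscape positions; raise ValueError for unknown IDs."""
    position = CameraPosition(camera_id)
    return position in (
        CameraPosition.LANDSCAPE_HALF_BODY,
        CameraPosition.LANDSCAPE_FULL_BODY,
    )
```

Because IntEnum members are integers, CameraPosition.LANDSCAPE_FULL_BODY can be passed wherever the tool expects cameraId.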

Implementation Reference

  • The handler function that executes the generateDhVideo tool. It constructs a VideoGenerateRequest with provided parameters and calls the DH API client to generate the video asynchronously.
    async def generateDhVideo(
            figureId: Annotated[str, Field(description="Avatar (figure) ID", default=None)],
            voiceId: Annotated[str, Field(description="Voice ID", default=None)],
            text: Annotated[str, Field(description="Text to be spoken", default=None)],
            inputAudioUrl: Annotated[str, Field(description="URL of the driving audio", default=None)],
            resolutionWidth: Annotated[int, Field(description="Resolution: width", default=768)],
            resolutionHeight: Annotated[int, Field(description="Resolution: height", default=1280)],
            backgroundTransparent: Annotated[bool, Field(description="Whether the background is transparent", default=False)],
            cameraId: Annotated[int,
                Field(description="Digital human camera position. 0: landscape half body, 1: portrait half body, 2: landscape full body, 3: portrait full body", default=3)],
            backgroundImageUrl: Annotated[str, Field(description="Background image", default=None)],
            callbackUrl: Annotated[str, Field(description="Callback URL", default=None)],
            driveType: Annotated[Literal["TEXT", "VOICE"],
                Field(description="Drive type. TEXT: text-driven, VOICE: audio-driven", default="TEXT")],
            subtitleEnable: Annotated[bool, Field(description="Whether to enable subtitles", default=False)],
            autoAnimoji: Annotated[bool, Field(description="Automatically add digital human actions", default=False)]
    ) -> MCPVideoGenerateResponse:
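        """
        Generate a new digital human video using the DH API.
    
        Args:
            figureId: Avatar (figure) ID
            driveType: Drive type. TEXT: text-driven, VOICE: audio-driven
            text: Text content to be spoken
            voiceId: Voice ID
            inputAudioUrl: URL of the driving audio
            resolutionWidth: Resolution width
            resolutionHeight: Resolution height
            backgroundTransparent: Transparent background
            cameraId: 0: landscape half body, 1: portrait half body, 2: landscape full body, 3: portrait full body
            subtitleEnable: Enable subtitles
            backgroundImageUrl: Background image
            autoAnimoji: Automatically add digital human actions
            callbackUrl: Callback URL
    
        Returns:
            taskId: Task ID
        """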
        try:
            request = VideoGenerateRequest(
                figureId=figureId,
                driveType=driveType,
                text=text,
                # TTS speed, volume, and pitch are fixed at "5"; only the voice is configurable.
                # Note: str(None) is the string "None" when voiceId is omitted.
                ttsParams=TtsParams(person=str(voiceId), speed="5", volume="5", pitch="5"),
                inputAudioUrl=inputAudioUrl,
                videoParams=VideoParams(width=resolutionWidth, height=resolutionHeight, transparent=backgroundTransparent),
                dhParams=DHParams(cameraId=cameraId),
                # Subtitle params (SRT policy) are sent only when subtitles are enabled.
                subtitleParams=SubtitleParams(subtitlePolicy="SRT", enabled=True) if subtitleEnable else None,
                backgroundImageUrl=backgroundImageUrl,
                callbackUrl=callbackUrl,
                autoAnimoji=autoAnimoji,
            )
    
            client = await getDhClient()
            ret = await client.generate_avatar_video(request)
            return ret
        except Exception as e:
            return MCPVideoGenerateResponse(error=str(e))
  • The @mcp.tool decorator registers the generateDhVideo tool with its name and detailed description including usage examples.
    @mcp.tool(
        name="generateDhVideo",
        description=(
        """
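    # Tool description: Generates a digital human video from the selected avatar (figure) ID and voice ID.
    # Example 1:
    User input: Using the avatar with ID xxx and the voice with ID yyy, with the video content \
    "大家好,我是数字人播报的内容" ("Hello everyone, this is content narrated by a digital human"), \
    the landscape full-body camera position, and the background image \
    "https://digital-human-material.bj.bcebos.com/-%5BLjava.lang.String%3B%4046f6cc1e.png", \
    enable automatic actions, enable subtitles, and generate a 1080P digital human video.
    Reasoning:
    1. The user wants to generate a digital human video from an avatar ID and has requirements for the voice, \
    background, subtitles, and resolution. This is not a simple digital human video, so the "generateDhVideo" tool is needed.
    2. The tool needs the parameters FigureId, driveType, text, person, inputAudioUrl, width, height, cameraID, enable, \
    backgroundimageUrl, and autoAnimoji.
    3. FigureId is the avatar ID to use, so its value is xxx. The content to be spoken is given as text, so driveType \
    is text-driven and text is "大家好,我是数字人播报的内容". \
    A voice ID has been provided, so person is yyy. Automatic actions are enabled, so autoAnimoji is true. \
    Subtitles are enabled, so enabled is true. The resolution is 1080P, \
    which splits into a width of 1920 and a height of 1080. backgroundimageUrl is \
    "https://digital-human-material.bj.bcebos.com/-%5BLjava.lang.String%3B%4046f6cc1e.png"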
        """)
    )
  • Input schema defined via Annotated types and Field descriptions in the function signature for MCP tool validation.
    async def generateDhVideo(
            figureId: Annotated[str, Field(description="Avatar (figure) ID", default=None)],
            voiceId: Annotated[str, Field(description="Voice ID", default=None)],
            text: Annotated[str, Field(description="Text to be spoken", default=None)],
            inputAudioUrl: Annotated[str, Field(description="URL of the driving audio", default=None)],
            resolutionWidth: Annotated[int, Field(description="Resolution: width", default=768)],
            resolutionHeight: Annotated[int, Field(description="Resolution: height", default=1280)],
            backgroundTransparent: Annotated[bool, Field(description="Whether the background is transparent", default=False)],
            cameraId: Annotated[int,
                Field(description="Digital human camera position. 0: landscape half body, 1: portrait half body, 2: landscape full body, 3: portrait full body", default=3)],
            backgroundImageUrl: Annotated[str, Field(description="Background image", default=None)],
            callbackUrl: Annotated[str, Field(description="Callback URL", default=None)],
            driveType: Annotated[Literal["TEXT", "VOICE"],
                Field(description="Drive type. TEXT: text-driven, VOICE: audio-driven", default="TEXT")],
            subtitleEnable: Annotated[bool, Field(description="Whether to enable subtitles", default=False)],
            autoAnimoji: Annotated[bool, Field(description="Automatically add digital human actions", default=False)]
    ) -> MCPVideoGenerateResponse:
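
The handler described in the first bullet follows a pattern that can be tried in isolation: build the request, attach subtitle parameters only when subtitles are enabled, await the client, and turn any exception into an error field on the response instead of raising. The sketch below reproduces that pattern with stand-ins. FakeClient, VideoResponse, and the dictionary-based request are hypothetical simplifications, not the server's real classes.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass
class SubtitleParams:
    """Stand-in for the server's model of the same name."""

    subtitlePolicy: str
    enabled: bool


@dataclass
class VideoResponse:
    """Hypothetical response shape, not the real MCPVideoGenerateResponse."""

    taskId: Optional[str] = None
    error: Optional[str] = None


class FakeClient:
    """Hypothetical client standing in for the real DH API client."""

    def __init__(self, fail: bool = False) -> None:
        self.fail = fail

    async def generate_avatar_video(self, request: dict) -> VideoResponse:
        if self.fail:
            raise RuntimeError("upstream unavailable")
        return VideoResponse(taskId="task-123")


def build_subtitle_params(subtitle_enable: bool) -> Optional[SubtitleParams]:
    # Mirrors the handler: subtitle params are only sent when subtitles are on.
    if subtitle_enable:
        return SubtitleParams(subtitlePolicy="SRT", enabled=True)
    return None


async def generate(client: FakeClient, figure_id: str, subtitle_enable: bool) -> VideoResponse:
    try:
        request = {
            "figureId": figure_id,
            "subtitleParams": build_subtitle_params(subtitle_enable),
        }
        return await client.generate_avatar_video(request)
    except Exception as exc:
        # Failures are reported in the response instead of being raised.
        return VideoResponse(error=str(exc))


ok = asyncio.run(generate(FakeClient(), "xxx", subtitle_enable=True))
failed = asyncio.run(generate(FakeClient(fail=True), "xxx", subtitle_enable=False))
print(ok.taskId, failed.error)
```

A consequence of this design is that callers cannot rely on exceptions to detect failure and have to inspect the error field of the returned object.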
Behavior: 2/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries full burden for behavioral disclosure. It mentions the tool generates videos but doesn't describe what happens after generation (e.g., where the video is stored, if it's returned immediately, processing time, or error conditions). The example implies it's a creation/mutation tool, but there's no information about permissions, rate limits, or side effects. This is a significant gap for a complex tool with 13 parameters.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 2/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is poorly structured and verbose. It includes a lengthy example with a '思考过程' (thought process) section that walks through parameter mapping, which is redundant given the comprehensive schema. The front-loaded tool purpose is clear, but the example occupies most of the description without adding proportional value. The content could be significantly condensed while maintaining utility.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 2/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's complexity (13 parameters, no annotations, no output schema), the description is incomplete. It focuses heavily on parameter mapping in an example but lacks critical context: what the tool returns (no output schema), behavioral traits like processing time or storage, and differentiation from sibling tools. For a video generation tool with many configuration options, this leaves significant gaps for an AI agent to understand proper usage.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 100%, so the schema already documents all 13 parameters thoroughly. The description adds minimal value beyond the schema: it mentions FigureId, driveType, text, person, inputAudioUrl, width, height, cameraID, enable, backgroundimageUrl, and autoAnimoji in the example, but these are already well-documented in the schema with descriptions and defaults. No additional semantic context is provided beyond what's in the structured schema.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
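
One relationship the schema leaves implicit is how driveType interacts with text and inputAudioUrl. The following is a minimal sketch of a caller-side check, assuming (the source does not confirm this) that TEXT-driven requests need text and VOICE-driven requests need inputAudioUrl. check_drive_inputs is a hypothetical helper, not part of the server.

```python
from typing import List, Optional


def check_drive_inputs(
    drive_type: str = "TEXT",
    text: Optional[str] = None,
    input_audio_url: Optional[str] = None,
) -> List[str]:
    """Return a list of problems; an empty list means the inputs look consistent."""
    problems: List[str] = []
    if drive_type not in ("TEXT", "VOICE"):
        problems.append(f"driveType must be TEXT or VOICE, got {drive_type!r}")
    elif drive_type == "TEXT" and not text:
        # Assumption: a text-driven video needs something to speak.
        problems.append("driveType TEXT was chosen but text is empty")
    elif drive_type == "VOICE" and not input_audio_url:
        # Assumption: an audio-driven video needs an audio source.
        problems.append("driveType VOICE was chosen but inputAudioUrl is empty")
    return problems
```

Running such a check before calling the tool would surface a mismatched request early, although the authoritative rules are whatever the upstream API enforces.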

Purpose: 4/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose: '根据所选数字人像ID及发音人ID,生成数字人视频' (generate digital human video based on selected figure ID and voice ID). It specifies the verb '生成' (generate) and resource '数字人视频' (digital human video), making the purpose unambiguous. However, it doesn't explicitly differentiate from sibling tools like generateDh123Video or generateLite2dGeneralVideo, which appear to be similar video generation tools.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 2/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides no explicit guidance on when to use this tool versus alternatives. While it mentions a '思考过程' (thought process) that suggests using this tool for complex video generation with specific parameters, it doesn't name alternative tools or specify exclusion criteria. The example shows usage but lacks comparative context with siblings like generateDh123Video or generateText2Audio.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/baidu-xiling/mcp'
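
For scripts, the same GET request can be issued from Python's standard library. This is a sketch only: it assumes the endpoint returns a JSON body, which the page does not state, and build_request and fetch_server_info are illustrative names.

```python
import json
import urllib.request

SERVER_URL = "https://glama.ai/api/mcp/v1/servers/baidu-xiling/mcp"


def build_request(url: str = SERVER_URL) -> urllib.request.Request:
    """Build the GET request without sending it."""
    return urllib.request.Request(
        url, method="GET", headers={"Accept": "application/json"}
    )


def fetch_server_info(url: str = SERVER_URL, timeout: float = 10.0) -> dict:
    """Send the request and decode the body, assuming it is JSON."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling fetch_server_info() performs the network request; build_request() alone is enough to inspect the URL and headers offline.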

If you have feedback or need assistance with the MCP directory API, please join our Discord server.