ElevenLabs MCP Server

generate_audio_script

Converts text scripts into audio with multiple voices and actors using the ElevenLabs text-to-speech API. The script can be supplied as plain text or as structured JSON that assigns a voice to each part.

Instructions

Generate audio from a structured script with multiple voices and actors. Accepts either:

1. Plain text string
2. JSON string with format: { "script": [ { "text": "Text to speak", "voice_id": "optional-voice-id", "actor": "optional-actor-name" }, ... ] }

Input Schema

Name	Required	Description	Default
`script`	Yes	JSON string containing script array or plain text. For JSON format, provide an object with a 'script' array containing objects with 'text' (required), 'voice_id' (optional), and 'actor' (optional) fields.
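
To make the two accepted input shapes concrete, the following minimal sketch builds a `script` argument in each form. The voice ID and actor names are made-up placeholders, not real ElevenLabs voices, and the variable names are illustrative only.

```python
import json

# Form 1: plain text, passed through as-is.
plain_args = {"script": "Hello, and welcome to the show."}

# Form 2: a JSON *string* (not a nested object) holding a "script" array.
# The voice_id value below is a placeholder, not a real voice.
structured = {
    "script": [
        {"text": "Good morning, Dana.", "voice_id": "voice-a-placeholder", "actor": "Alex"},
        {"text": "Morning! Ready to record?", "actor": "Dana"},  # voice_id omitted
    ]
}
json_args = {"script": json.dumps(structured)}

print(json_args["script"])
```

Note that the schema types `script` as a string, so the structured form is serialized with `json.dumps` before being passed as the argument.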

Implementation Reference

src/elevenlabs_mcp/server.py:414-478 (handler)

Handler for the 'generate_audio_script' tool inside the call_tool method. It parses the script input, creates a job record in the database and updates its status as the work proceeds, generates the audio via the API, and returns a text status message together with the audio as an embedded resource.

elif name == "generate_audio_script":
    script_json = arguments.get("script", "{}")
    script_parts, parse_debug_info = self.parse_script(script_json)
    debug_info.extend(parse_debug_info)

    # Create job record
    job_id = str(uuid.uuid4())
    job = AudioJob(
        id=job_id,
        status="pending",
        script_parts=script_parts,
        total_parts=len(script_parts)
    )
    await self.db.insert_job(job)
    debug_info.append(f"Created job record: {job_id}")

    try:
        job.status = "processing"
        await self.db.update_job(job)

        output_file, api_debug_info, completed_parts = self.api.generate_full_audio(
            script_parts,
            self.output_dir
        )
        debug_info.extend(api_debug_info)

        job.status = "completed"
        job.output_file = str(output_file)
        job.completed_parts = completed_parts
        await self.db.update_job(job)
    except Exception as e:
        job.status = "failed"
        job.error = str(e)
        await self.db.update_job(job)
        raise
    
    # Read the generated audio file and encode it as base64
    with open(output_file, 'rb') as f:
        audio_bytes = f.read()
        audio_base64 = base64.b64encode(audio_bytes).decode('utf-8')
        
    # Generate unique URI for the resource
    filename = Path(output_file).name
    resource_uri = f"audio://{filename}"
        
    # Return both a status message and the audio file content
    return [
        types.TextContent(
            type="text",
            text="\n".join([
                "Audio generation successful. Debug info:",
                *debug_info
            ])
        ),
        types.EmbeddedResource(
            type="resource",
            resource=types.BlobResourceContents(
                uri=resource_uri,
                name=filename,
                blob=audio_base64,
                mimeType="audio/mpeg"
            )
        )
    ]
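
On the receiving side, the embedded resource carries the audio as a base64 string plus an `audio://` URI. The sketch below shows one way a client might turn those two fields back into a file. The helper name `save_audio_blob`, the sample file name, and the fake payload are assumptions for illustration; they are not part of the server.

```python
import base64
import tempfile
from pathlib import Path


def save_audio_blob(blob_b64: str, uri: str, out_dir: Path) -> Path:
    """Decode a base64 audio blob and write it under out_dir.

    The file name is taken from the part of the URI after 'audio://'.
    """
    # Keep only the final path component so the file stays inside out_dir.
    filename = Path(uri.removeprefix("audio://")).name
    target = out_dir / filename
    target.write_bytes(base64.b64decode(blob_b64))
    return target


# Stand-in values for the blob and uri fields of the returned resource.
fake_blob = base64.b64encode(b"ID3 fake mp3 bytes").decode("utf-8")
out_dir = Path(tempfile.mkdtemp())
saved = save_audio_blob(fake_blob, "audio://full_audio_example.mp3", out_dir)
print(saved.name, saved.stat().st_size)
```

Because the whole file travels inside the response, long scripts produce large payloads; the blob is roughly a third larger than the MP3 itself due to base64 encoding.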

src/elevenlabs_mcp/server.py:227-252 (registration)

Registration of the 'generate_audio_script' tool in list_tools(), including name, description, and input schema definition.

types.Tool(
    name="generate_audio_script",
    description="""Generate audio from a structured script with multiple voices and actors. 
    Accepts either:
    1. Plain text string
    2. JSON string with format: {
        "script": [
            {
                "text": "Text to speak",
                "voice_id": "optional-voice-id",
                "actor": "optional-actor-name"
            },
            ...
        ]
    }""",
    inputSchema={
        "type": "object",
        "properties": {
            "script": {
                "type": "string",
                "description": "JSON string containing script array or plain text. For JSON format, provide an object with a 'script' array containing objects with 'text' (required), 'voice_id' (optional), and 'actor' (optional) fields."
            }
        },
        "required": ["script"]
    }
),
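
The input schema above is small enough to check by hand before calling the tool. The sketch below is a hand-rolled check that mirrors only this schema (one required string property); it is not a general JSON Schema validator, and `check_arguments` is a hypothetical helper name.

```python
def check_arguments(arguments: dict) -> list[str]:
    """Return a list of problems; an empty list means the arguments fit the schema."""
    problems = []
    if "script" not in arguments:
        problems.append("missing required property 'script'")
    elif not isinstance(arguments["script"], str):
        # A common slip: passing the script object itself instead of a JSON string.
        kind = type(arguments["script"]).__name__
        problems.append(f"'script' must be a string, got {kind}")
    return problems


print(check_arguments({"script": "Hello"}))
print(check_arguments({"script": {"script": [{"text": "Hello"}]}}))
print(check_arguments({}))
```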

src/elevenlabs_mcp/server.py:63-132 (helper)

Helper method that parses the script input (a JSON object with a 'script' array, a bare JSON array, or plain text) into structured parts with text, voice_id, and actor.

def parse_script(self, script_json: str) -> tuple[list[dict], list[str]]:
    """
    Parse the input into a list of script parts and collect debug information.
    Accepts:
    1. A JSON string with a script array containing dialogue parts
    2. Plain text to be converted to speech
    
    Each dialogue part should have:
    - text (required): The text to speak
    - voice_id (optional): The voice to use
    - actor (optional): The actor/character name
    
    Args:
        script_json: Input text or JSON string
        
    Returns:
        tuple containing:
            - list of parsed script parts
            - list of debug information strings
    """
    debug_info = []
    debug_info.append(f"Raw input: {script_json}")
    
    script_array = []
    
    # Remove any leading/trailing whitespace
    script_json = script_json.strip()
    
    try:
        # Try to parse as JSON first
        if script_json.startswith('['):
            # Direct array of script parts
            script_array = json.loads(script_json)
        elif script_json.startswith('{'):
            # Object with script array
            script_data = json.loads(script_json)
            script_array = script_data.get('script', [])
        else:
            # Treat as plain text if not JSON formatted
            script_array = [{"text": script_json}]
    except json.JSONDecodeError as e:
        # json.loads only runs on input starting with '{' or '[', so a decode
        # error here always means malformed JSON, never plain text
        debug_info.append(f"JSON parsing failed: {str(e)}")
        raise Exception("Invalid JSON format")
    
    script_parts = []
    for part in script_array:
        if not isinstance(part, dict):
            debug_info.append(f"Skipping non-dict part: {part}")
            continue
            
        text = part.get("text", "").strip()
        if not text:
            debug_info.append("Missing or empty text field")
            raise Exception("Missing required field 'text'")
            
        new_part = {
            "text": text,
            "voice_id": part.get("voice_id"),
            "actor": part.get("actor")
        }
        debug_info.append(f"Created part: {new_part}")
        script_parts.append(new_part)
    
    debug_info.append(f"Final script_parts: {script_parts}")
    return script_parts, debug_info
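
The branching in parse_script is easier to see with concrete inputs. The sketch below is a condensed, standalone restatement of that logic without the debug bookkeeping. It is an illustration, not the repository code: the name `parse_script_sketch` is hypothetical, and it raises `ValueError` where the original raises a bare `Exception`.

```python
import json


def parse_script_sketch(raw: str) -> list[dict]:
    """Condensed restatement of the input handling in parse_script."""
    raw = raw.strip()
    try:
        if raw.startswith("["):
            parts = json.loads(raw)  # bare array of parts
        elif raw.startswith("{"):
            parts = json.loads(raw).get("script", [])  # object wrapping the array
        else:
            parts = [{"text": raw}]  # anything else is plain text
    except json.JSONDecodeError as exc:
        raise ValueError("Invalid JSON format") from exc

    result = []
    for part in parts:
        if not isinstance(part, dict):
            continue  # non-dict entries are skipped, as in the original
        text = part.get("text", "").strip()
        if not text:
            raise ValueError("Missing required field 'text'")
        result.append(
            {"text": text, "voice_id": part.get("voice_id"), "actor": part.get("actor")}
        )
    return result


print(parse_script_sketch("Hello there"))
print(parse_script_sketch('[{"text": "Hi", "actor": "Ann"}]'))
print(parse_script_sketch('{"script": [{"text": " Hi ", "voice_id": "v1"}]}'))
print(parse_script_sketch('{"lines": []}'))
```

The last call shows an edge case worth knowing: a JSON object without a 'script' key produces an empty part list rather than an error.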

Tool Definition Quality

Grade: B (3/5.0)

Behavior: 2/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries full burden for behavioral disclosure. It explains the input format but doesn't describe what happens after submission: whether this is an async operation, how to retrieve results, expected output format, error conditions, rate limits, or authentication requirements. For a tool that likely creates audio files, this leaves significant behavioral gaps.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is appropriately sized and front-loaded with the core purpose. The format explanation is necessary but could be slightly more streamlined. Every sentence serves a purpose, though the JSON example takes significant space that might be better in schema examples.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 2/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given this is a generation tool with no annotations and no output schema, the description is incomplete. It doesn't explain what the tool returns (audio file URL? job ID? error messages?), how to handle the output, or any post-generation steps. For a tool that presumably creates audio content, this leaves the agent without crucial information about result handling.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 100%, so the schema already documents the single parameter. The description adds valuable context by explaining the two acceptable formats (plain text vs. JSON) and providing a concrete JSON structure example. This goes beyond what the schema provides, though it could elaborate more on when to use each format.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 4/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose: 'Generate audio from a structured script with multiple voices and actors.' This specifies the verb ('generate audio'), resource ('structured script'), and key capabilities ('multiple voices and actors'). However, it doesn't explicitly differentiate from sibling tools like 'generate_audio_simple', which likely has a simpler interface.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 2/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides no guidance on when to use this tool versus alternatives. It doesn't mention sibling tools like 'generate_audio_simple' or explain scenarios where this more complex script format is preferred over simpler options. There's no context about prerequisites, limitations, or typical use cases.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
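
Several of the gaps noted in this review (return format, side effects, error conditions) can be filled from the handler source shown earlier. The sketch below is one possible rewording of the tool description, written as a Python constant. It is illustrative only and not the server's actual description; it also leaves out guidance on sibling tools, since their behavior is not shown on this page.

```python
# Hypothetical revised description; the wording is an illustration, not repository text.
REVISED_DESCRIPTION = (
    "Generate one audio file from a script with one or more voices. "
    "Input: plain text, or a JSON string of the form "
    '{"script": [{"text": "...", "voice_id": "...", "actor": "..."}]}; '
    "only 'text' is required in each part. "
    "Behavior: the call returns after generation finishes, records a job "
    "in the server's database, and writes the audio file to the server's "
    "output directory. "
    "Returns: a text message with debug info, plus the audio as an embedded "
    "base64 resource with MIME type audio/mpeg. "
    "Errors: malformed JSON or a part with empty 'text' is rejected by the parser."
)

print(len(REVISED_DESCRIPTION.split()), "words")
```

A description along these lines trades some conciseness for behavioral and completeness coverage, which is the trade-off the scores above point toward.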

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/mamertofabian/elevenlabs-mcp-server'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.