mamertofabian

ElevenLabs MCP Server

generate_audio_script

Convert text scripts into audio with multiple voices and actors using the ElevenLabs text-to-speech API. Accepts either plain text or a structured JSON format that assigns voices to individual lines.

Instructions

Generate audio from a structured script with multiple voices and actors. Accepts either:

  1. Plain text string
  2. JSON string with format:

    {
        "script": [
            {
                "text": "Text to speak",
                "voice_id": "optional-voice-id",
                "actor": "optional-actor-name"
            },
            ...
        ]
    }
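As a rough illustration of the JSON form described above, the following sketch builds the `script` argument in Python. The helper name `build_script_argument`, the actor names, and the voice IDs are placeholders invented for this example; they are not values from the server.

```python
import json


def build_script_argument(lines):
    """Serialize (text, voice_id, actor) tuples into the JSON string form.

    Hypothetical helper for illustration; not part of the server.
    """
    parts = []
    for text, voice_id, actor in lines:
        part = {"text": text}  # 'text' is the only required field
        if voice_id is not None:
            part["voice_id"] = voice_id
        if actor is not None:
            part["actor"] = actor
        parts.append(part)
    return json.dumps({"script": parts})


# Placeholder voice IDs and actors; substitute real ElevenLabs voice IDs.
arguments = {
    "script": build_script_argument([
        ("Welcome to the show.", "voice-id-a", "Host"),
        ("Glad to be here.", "voice-id-b", "Guest"),
        ("Let's begin.", None, None),
    ])
}
```

The input schema types `script` as a string, so the object is serialized with `json.dumps` instead of being passed as nested JSON.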

Input Schema

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| script | Yes | JSON string containing script array or plain text. For JSON format, provide an object with a 'script' array containing objects with 'text' (required), 'voice_id' (optional), and 'actor' (optional) fields. | |

Implementation Reference

  • Handler function for the 'generate_audio_script' tool within the call_tool method. Parses script input, creates and manages job in database, generates audio via API, and returns embedded audio resource.
    elif name == "generate_audio_script":
        script_json = arguments.get("script", "{}")
        script_parts, parse_debug_info = self.parse_script(script_json)
        debug_info.extend(parse_debug_info)
    
        # Create job record
        job_id = str(uuid.uuid4())
        job = AudioJob(
            id=job_id,
            status="pending",
            script_parts=script_parts,
            total_parts=len(script_parts)
        )
        await self.db.insert_job(job)
        debug_info.append(f"Created job record: {job_id}")
    
        try:
            job.status = "processing"
            await self.db.update_job(job)
    
            output_file, api_debug_info, completed_parts = self.api.generate_full_audio(
                script_parts,
                self.output_dir
            )
            debug_info.extend(api_debug_info)
    
            job.status = "completed"
            job.output_file = str(output_file)
            job.completed_parts = completed_parts
            await self.db.update_job(job)
        except Exception as e:
            job.status = "failed"
            job.error = str(e)
            await self.db.update_job(job)
            raise
        
        # Read the generated audio file and encode it as base64
        with open(output_file, 'rb') as f:
            audio_bytes = f.read()
            audio_base64 = base64.b64encode(audio_bytes).decode('utf-8')
            
        # Generate unique URI for the resource
        filename = Path(output_file).name
        resource_uri = f"audio://{filename}"
            
        # Return both a status message and the audio file content
        return [
            types.TextContent(
                type="text",
                text="\n".join([
                    "Audio generation successful. Debug info:",
                    *debug_info
                ])
            ),
            types.EmbeddedResource(
                type="resource",
                resource=types.BlobResourceContents(
                    uri=resource_uri,
                    name=filename,
                    blob=audio_base64,
                    mimeType="audio/mpeg"
                )
            )
        ]
  • Registration of the 'generate_audio_script' tool in list_tools(), including name, description, and input schema definition.
    types.Tool(
        name="generate_audio_script",
        description="""Generate audio from a structured script with multiple voices and actors. 
        Accepts either:
        1. Plain text string
        2. JSON string with format: {
            "script": [
                {
                    "text": "Text to speak",
                    "voice_id": "optional-voice-id",
                    "actor": "optional-actor-name"
                },
                ...
            ]
        }""",
        inputSchema={
            "type": "object",
            "properties": {
                "script": {
                    "type": "string",
                    "description": "JSON string containing script array or plain text. For JSON format, provide an object with a 'script' array containing objects with 'text' (required), 'voice_id' (optional), and 'actor' (optional) fields."
                }
            },
            "required": ["script"]
        }
    ),
  • Helper method to parse the script input (JSON or plain text) into structured parts with text, voice_id, and actor.
    def parse_script(self, script_json: str) -> tuple[list[dict], list[str]]:
        """
        Parse the input into a list of script parts and collect debug information.
        Accepts:
        1. A JSON string with a script array containing dialogue parts
        2. Plain text to be converted to speech
        
        Each dialogue part should have:
        - text (required): The text to speak
        - voice_id (optional): The voice to use
        - actor (optional): The actor/character name
        
        Args:
            script_json: Input text or JSON string
            
        Returns:
            tuple containing:
                - list of parsed script parts
                - list of debug information strings
        """
        debug_info = []
        debug_info.append(f"Raw input: {script_json}")
        
        script_array = []
        
        # Remove any leading/trailing whitespace
        script_json = script_json.strip()
        
        try:
            # Try to parse as JSON first
            if script_json.startswith('['):
                # Direct array of script parts
                script_array = json.loads(script_json)
            elif script_json.startswith('{'):
                # Object with script array
                script_data = json.loads(script_json)
                script_array = script_data.get('script', [])
            else:
                # Treat as plain text if not JSON formatted
                script_array = [{"text": script_json}]
        except json.JSONDecodeError as e:
            # If JSON parsing fails and input looks like JSON, raise error
            if script_json.startswith('{') or script_json.startswith('['):
                debug_info.append(f"JSON parsing failed: {str(e)}")
                raise Exception("Invalid JSON format")
            # Otherwise treat as plain text
            debug_info.append("Input is plain text")
            script_array = [{"text": script_json}]
        
        script_parts = []
        for part in script_array:
            if not isinstance(part, dict):
                debug_info.append(f"Skipping non-dict part: {part}")
                continue
                
            text = part.get("text", "").strip()
            if not text:
                debug_info.append("Missing or empty text field")
                raise Exception("Missing required field 'text'")
                
            new_part = {
                "text": text,
                "voice_id": part.get("voice_id"),
                "actor": part.get("actor")
            }
            debug_info.append(f"Created part: {new_part}")
            script_parts.append(new_part)
        
        debug_info.append(f"Final script_parts: {script_parts}")
        return script_parts, debug_info
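
The branching in `parse_script` depends only on the first non-whitespace character of the input. The sketch below repeats those prefix checks in a standalone function so the three input shapes can be tried without the server. `classify_script_input` is a hypothetical name, and unlike `parse_script` it lets `json.JSONDecodeError` propagate instead of raising a generic "Invalid JSON format" exception.

```python
import json


def classify_script_input(raw: str) -> str:
    """Report which branch of parse_script a given input would take.

    Illustrative stand-in only: it repeats the prefix checks from
    parse_script but does not build the script parts.
    """
    stripped = raw.strip()
    if stripped.startswith("["):
        json.loads(stripped)  # raises json.JSONDecodeError if malformed
        return "bare array"
    if stripped.startswith("{"):
        json.loads(stripped)  # raises json.JSONDecodeError if malformed
        return "object with 'script' key"
    return "plain text"
```

Two consequences follow from the prefix checks in the source: a bare JSON array of parts is accepted even though the tool description only mentions the object form, and plain text that happens to begin with `[` or `{` (for example a stage direction such as `[laughs] Hello`) is treated as JSON and rejected.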
Behavior: 2/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries full burden for behavioral disclosure. It explains the input format but doesn't describe what happens after submission: whether this is an async operation, how to retrieve results, expected output format, error conditions, rate limits, or authentication requirements. For a tool that likely creates audio files, this leaves significant behavioral gaps.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is appropriately sized and front-loaded with the core purpose. The format explanation is necessary but could be slightly more streamlined. Every sentence serves a purpose, though the JSON example takes significant space that might be better in schema examples.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 2/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given this is a generation tool with no annotations and no output schema, the description is incomplete. It doesn't explain what the tool returns (audio file URL? job ID? error messages?), how to handle the output, or any post-generation steps. For a tool that presumably creates audio content, this leaves the agent without crucial information about result handling.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
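
For context on result handling: the handler shown in the Implementation Reference returns a text block with debug information plus an embedded resource whose `blob` is base64-encoded MP3 data and whose `uri` has the form `audio://<filename>`. A minimal sketch of how a client might save that audio follows. It assumes the resource has already been unpacked into a plain dict with `uri` and `blob` keys, and `save_embedded_audio` is a hypothetical helper, not part of the server or of any MCP SDK.

```python
import base64
from pathlib import Path


def save_embedded_audio(resource: dict, out_dir: Path) -> Path:
    """Decode a base64 audio blob and write it under out_dir.

    `resource` is a plain-dict stand-in (an assumption) carrying the
    'uri' and 'blob' fields that the handler sets.
    """
    # The handler builds URIs as audio://<filename>; keep only the final
    # path component so a crafted URI cannot escape out_dir.
    filename = Path(resource["uri"].removeprefix("audio://")).name
    if not filename:
        raise ValueError("resource URI does not contain a filename")
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / filename
    target.write_bytes(base64.b64decode(resource["blob"]))
    return target
```

`str.removeprefix` needs Python 3.9 or later, which matches the built-in generic annotations used in the `parse_script` signature.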

Parameters: 4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 100%, so the schema already documents the single parameter. The description adds valuable context by explaining the two acceptable formats (plain text vs. JSON) and providing a concrete JSON structure example. This goes beyond what the schema provides, though it could elaborate more on when to use each format.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 4/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose: 'Generate audio from a structured script with multiple voices and actors.' This specifies the verb ('generate audio'), resource ('structured script'), and key capabilities ('multiple voices and actors'). However, it doesn't explicitly differentiate from sibling tools like 'generate_audio_simple', which likely has a simpler interface.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 2/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides no guidance on when to use this tool versus alternatives. It doesn't mention sibling tools like 'generate_audio_simple' or explain scenarios where this more complex script format is preferred over simpler options. There's no context about prerequisites, limitations, or typical use cases.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/mamertofabian/elevenlabs-mcp-server'
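
The same request can be sketched in Python with only the standard library. The response format is not documented on this page, so decoding the body as JSON in `fetch_server_info` is an assumption, and both function names are illustrative.

```python
import json
import urllib.request

SERVER_URL = "https://glama.ai/api/mcp/v1/servers/mamertofabian/elevenlabs-mcp-server"


def build_request(url: str = SERVER_URL) -> urllib.request.Request:
    """Build the GET request that mirrors the curl command."""
    return urllib.request.Request(url, method="GET")


def fetch_server_info(url: str = SERVER_URL, timeout: float = 10.0):
    """Send the request and decode the body, assuming it is JSON."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```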

If you have feedback or need assistance with the MCP directory API, please join our Discord server.