text_to_speech

Convert text to speech using customizable voices and save the output as an audio file. Choose from multiple voice models, adjust speech parameters like speed and stability, and select an output format.

Instructions

Convert text to speech with a given voice and save the output audio file to a given directory. The directory is optional; if not provided, the output file is saved to $HOME/Desktop. Only one of voice_id or voice_name may be provided. If neither is given, the default voice is used.

⚠️ COST WARNING: This tool makes an API call to ElevenLabs which may incur costs. Only use when explicitly requested by the user.

Args:
    text (str): The text to convert to speech.
    voice_name (str, optional): The name of the voice to use.
    voice_id (str, optional): The ID of the voice to use. Mutually exclusive with voice_name.
    model_id (str, optional): The model ID to use for speech synthesis. Options include:
        - eleven_multilingual_v2: High quality multilingual model (29 languages)
        - eleven_flash_v2_5: Fastest model with ultra-low latency (32 languages)
        - eleven_turbo_v2_5: Balanced quality and speed (32 languages)
        - eleven_flash_v2: Fast English-only model
        - eleven_turbo_v2: Balanced English-only model
        - eleven_monolingual_v1: Legacy English model
        Defaults to eleven_multilingual_v2 or environment variable ELEVENLABS_MODEL_ID.
    stability (float, optional): Stability of the generated audio. Determines how stable the voice is and the randomness between each generation. Lower values introduce broader emotional range for the voice. Higher values can result in a monotonous voice with limited emotion. Range is 0 to 1.
    similarity_boost (float, optional): Similarity boost of the generated audio. Determines how closely the AI should adhere to the original voice when attempting to replicate it. Range is 0 to 1.
    style (float, optional): Style of the generated audio. Determines the style exaggeration of the voice. This setting attempts to amplify the style of the original speaker. It does consume additional computational resources and might increase latency if set to anything other than 0. Range is 0 to 1.
    use_speaker_boost (bool, optional): Boosts the similarity of the generated audio to the original speaker. Requires a slightly higher computational load, which in turn increases latency.
    speed (float, optional): Speed of the generated speech. Range is 0.7 to 1.2, with 1.0 being the default. Lower values create slower, more deliberate speech, while higher values produce faster-paced speech. Extreme values can degrade the quality of the generated audio.
    output_directory (str, optional): Directory where files should be saved.
        Defaults to $HOME/Desktop if not provided.
    language (str, optional): ISO 639-1 language code for the voice. Defaults to "en".
    output_format (str, optional): Output format of the generated audio, formatted as codec_sample_rate_bitrate. For example, an MP3 with a 22.05 kHz sample rate at 32 kbps is represented as mp3_22050_32. MP3 at 192 kbps requires a Creator-tier subscription or above; PCM at a 44.1 kHz sample rate requires Pro tier or above. Note that the μ-law format (sometimes written mu-law, often approximated as u-law) is commonly used for Twilio audio inputs.
        Defaults to "mp3_44100_128". Must be one of:
        mp3_22050_32
        mp3_44100_32
        mp3_44100_64
        mp3_44100_96
        mp3_44100_128
        mp3_44100_192
        pcm_8000
        pcm_16000
        pcm_22050
        pcm_24000
        pcm_44100
        ulaw_8000
        alaw_8000
        opus_48000_32
        opus_48000_64
        opus_48000_96
        opus_48000_128
        opus_48000_192

Returns:
    Text content with the path to the output file and name of the voice used.

Input Schema

Name               Required  Default
language           No        en
model_id           No
output_directory   No
output_format      No        mp3_44100_128
similarity_boost   No
speed              No
stability          No
style              No
text               Yes
use_speaker_boost  No
voice_id           No
voice_name         No
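As a sketch, the schema above admits call arguments like the following. The values and the voice name are illustrative placeholders, not taken from the server; only "text" is required.

```python
# Illustrative arguments for a text_to_speech tool call, matching the
# input schema. The voice_name is a made-up placeholder.
arguments = {
    "text": "Welcome to the demo.",
    "voice_name": "ExampleVoice",      # hypothetical voice name
    "model_id": "eleven_flash_v2_5",
    "stability": 0.4,                  # range 0 to 1
    "speed": 1.1,                      # range 0.7 to 1.2
    "output_format": "mp3_44100_128",
}

# Sanity-check the documented ranges before sending the request.
assert 0.0 <= arguments["stability"] <= 1.0
assert 0.7 <= arguments["speed"] <= 1.2
```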

Implementation Reference

  • The core handler function for the 'text_to_speech' tool. Converts input text to speech using the ElevenLabs API, handles voice selection, generates audio with specified parameters, saves the output as an MP3 file, and returns the file path.
    def text_to_speech(
        text: str,
        voice_name: str | None = None,
        output_directory: str | None = None,
        voice_id: str | None = None,
        stability: float = 0.5,
        similarity_boost: float = 0.75,
        style: float = 0,
        use_speaker_boost: bool = True,
        speed: float = 1.0,
        language: str = "en",
        output_format: str = "mp3_44100_128",
        model_id: str | None = None,
    ):
        if text == "":
            make_error("Text is required.")
    
        if voice_id is not None and voice_name is not None:
            make_error("voice_id and voice_name cannot both be provided.")
    
        voice = None
        if voice_id is not None:
            voice = client.voices.get(voice_id=voice_id)
        elif voice_name is not None:
            voices = client.voices.search(search=voice_name)
            if len(voices.voices) == 0:
                make_error("No voices found with that name.")
            voice = next((v for v in voices.voices if v.name == voice_name), None)
            if voice is None:
                make_error(f"Voice with name: {voice_name} does not exist.")
    
        voice_id = voice.voice_id if voice else DEFAULT_VOICE_ID
    
        output_path = make_output_path(output_directory, base_path)
        output_file_name = make_output_file("tts", text, output_path, "mp3")
    
        if model_id is None:
            model_id = (
                "eleven_flash_v2_5"
                if language in ["hu", "no", "vi"]
                else "eleven_multilingual_v2"
            )
    
        audio_data = client.text_to_speech.convert(
            text=text,
            voice_id=voice_id,
            model_id=model_id,
            output_format=output_format,
            voice_settings={
                "stability": stability,
                "similarity_boost": similarity_boost,
                "style": style,
                "use_speaker_boost": use_speaker_boost,
                "speed": speed,
            },
        )
        audio_bytes = b"".join(audio_data)
    
        output_path.parent.mkdir(parents=True, exist_ok=True)
        with open(output_path / output_file_name, "wb") as f:
            f.write(audio_bytes)
    
        return TextContent(
            type="text",
            text=f"Success. File saved as: {output_path / output_file_name}. Voice used: {voice.name if voice else DEFAULT_VOICE_ID}",
        )
  • MCP tool registration decorator for 'text_to_speech', including comprehensive description that defines the input schema, parameters, and usage instructions.
    @mcp.tool(
        description="""Convert text to speech with a given voice and save the output audio file to a given directory.
        Directory is optional, if not provided, the output file will be saved to $HOME/Desktop.
        Only one of voice_id or voice_name can be provided. If none are provided, the default voice will be used.
    
        ⚠️ COST WARNING: This tool makes an API call to ElevenLabs which may incur costs. Only use when explicitly requested by the user.
    
         Args:
            text (str): The text to convert to speech.
            voice_name (str, optional): The name of the voice to use.
            voice_id (str, optional): The ID of the voice to use. Mutually exclusive with voice_name.
            model_id (str, optional): The model ID to use for speech synthesis. Options include:
                - eleven_multilingual_v2: High quality multilingual model (29 languages)
                - eleven_flash_v2_5: Fastest model with ultra-low latency (32 languages)
                - eleven_turbo_v2_5: Balanced quality and speed (32 languages)
                - eleven_flash_v2: Fast English-only model
                - eleven_turbo_v2: Balanced English-only model
                - eleven_monolingual_v1: Legacy English model
                Defaults to eleven_multilingual_v2 or environment variable ELEVENLABS_MODEL_ID.
            stability (float, optional): Stability of the generated audio. Determines how stable the voice is and the randomness between each generation. Lower values introduce broader emotional range for the voice. Higher values can result in a monotonous voice with limited emotion. Range is 0 to 1.
            similarity_boost (float, optional): Similarity boost of the generated audio. Determines how closely the AI should adhere to the original voice when attempting to replicate it. Range is 0 to 1.
            style (float, optional): Style of the generated audio. Determines the style exaggeration of the voice. This setting attempts to amplify the style of the original speaker. It does consume additional computational resources and might increase latency if set to anything other than 0. Range is 0 to 1.
            use_speaker_boost (bool, optional): Boosts the similarity of the generated audio to the original speaker. Requires a slightly higher computational load, which in turn increases latency.
            speed (float, optional): Speed of the generated speech. Range is 0.7 to 1.2, with 1.0 being the default. Lower values create slower, more deliberate speech, while higher values produce faster-paced speech. Extreme values can degrade the quality of the generated audio.
            output_directory (str, optional): Directory where files should be saved.
                Defaults to $HOME/Desktop if not provided.
            language (str, optional): ISO 639-1 language code for the voice. Defaults to "en".
            output_format (str, optional): Output format of the generated audio, formatted as codec_sample_rate_bitrate. For example, an MP3 with a 22.05 kHz sample rate at 32 kbps is represented as mp3_22050_32. MP3 at 192 kbps requires a Creator-tier subscription or above; PCM at a 44.1 kHz sample rate requires Pro tier or above. Note that the μ-law format (sometimes written mu-law, often approximated as u-law) is commonly used for Twilio audio inputs.
                Defaults to "mp3_44100_128". Must be one of:
                mp3_22050_32
                mp3_44100_32
                mp3_44100_64
                mp3_44100_96
                mp3_44100_128
                mp3_44100_192
                pcm_8000
                pcm_16000
                pcm_22050
                pcm_24000
                pcm_44100
                ulaw_8000
                alaw_8000
                opus_48000_32
                opus_48000_64
                opus_48000_96
                opus_48000_128
                opus_48000_192
    
        Returns:
            Text content with the path to the output file and name of the voice used.
        """
    )
Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries full burden and does an excellent job disclosing behavioral traits. It clearly explains the cost implications (API call to ElevenLabs), file saving behavior (default directory, optional parameter), voice selection logic (mutual exclusivity rules), and various audio quality parameters with their effects on performance and quality.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 3/5

Is the description appropriately sized, front-loaded, and free of redundancy?

While comprehensive, the description is quite lengthy with extensive parameter documentation that might be better placed in a separate reference. The core purpose and critical warnings are front-loaded appropriately, but the detailed parameter explanations (while valuable) make it less concise than ideal for quick scanning.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's complexity (12 parameters, no annotations, no output schema), the description provides complete context. It covers purpose, usage constraints, behavioral details, parameter semantics, and even specifies the return format ('path to the output file and name of the voice used'), making it fully self-contained despite the lack of structured metadata.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 5/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

With 0% schema description coverage and 12 parameters, the description provides exceptional value by explaining every parameter's purpose, constraints, defaults, and practical implications. It goes far beyond what the bare schema provides, offering detailed explanations for model_id options, range constraints, computational trade-offs, and format requirements with tier restrictions.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose with specific verbs ('convert text to speech', 'save the output audio file') and resources (text, voice, directory). It distinguishes itself from siblings like 'text_to_voice' or 'text_to_sound_effects' by focusing on speech synthesis with ElevenLabs API and file saving functionality.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides clear usage context with the cost warning and explicit instruction to 'only use when explicitly requested by the user.' However, it doesn't explicitly compare with alternatives like 'text_to_voice' or 'speech_to_speech' to guide when to choose this specific tool over siblings.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
