Glama

text_to_audio

Convert text to speech audio files using customizable voices, emotions, and audio settings, then save them to a specified directory.

Instructions

Convert text to audio with a given voice and save the output audio file to a given directory. The directory is optional; if it is not provided, the output file is saved to $HOME/Desktop. The voice ID is optional; if it is not provided, the default voice is used.

COST WARNING: This tool makes an API call to Minimax which may incur costs. Only use when explicitly requested by the user.

Args:
    text (str): The text to convert to speech.
    voice_id (str, optional): The id of the voice to use. For example, "male-qn-qingse"/"audiobook_female_1"/"cute_boy"/"Charming_Lady"...
    model (str, optional): The model to use.
    speed (float, optional): Speed of the generated audio. Controls the speed of the generated speech. Values range from 0.5 to 2.0, with 1.0 being the default speed. 
    vol (float, optional): Volume of the generated audio. Controls the volume of the generated speech. Values range from 0 to 10, with 1 being the default volume.
    pitch (int, optional): Pitch of the generated audio. Controls the pitch of the generated speech. Values range from -12 to 12, with 0 being the default pitch.
    emotion (str, optional): Emotion of the generated audio. Controls the emotion of the generated speech. Values range ["happy", "sad", "angry", "fearful", "disgusted", "surprised", "neutral"], with "happy" being the default emotion.
    sample_rate (int, optional): Sample rate of the generated audio. Controls the sample rate of the generated speech. Values range [8000,16000,22050,24000,32000,44100] with 32000 being the default sample rate.
    bitrate (int, optional): Bitrate of the generated audio. Controls the bitrate of the generated speech. Values range [32000,64000,128000,256000] with 128000 being the default bitrate.
    channel (int, optional): Channel of the generated audio. Controls the channel of the generated speech. Values range [1, 2] with 1 being the default channel.
    format (str, optional): Format of the generated audio. Controls the format of the generated speech. Values range ["pcm", "mp3","flac"] with "mp3" being the default format.
    language_boost (str, optional): Language boost of the generated audio. Controls the language boost of the generated speech. Values range ['Chinese', 'Chinese,Yue', 'English', 'Arabic', 'Russian', 'Spanish', 'French', 'Portuguese', 'German', 'Turkish', 'Dutch', 'Ukrainian', 'Vietnamese', 'Indonesian', 'Japanese', 'Italian', 'Korean', 'Thai', 'Polish', 'Romanian', 'Greek', 'Czech', 'Finnish', 'Hindi', 'auto'] with "auto" being the default language boost.
    output_directory (str, optional): The directory to save the audio to. If not provided, the file is saved to $HOME/Desktop.

Returns:
    Text content with the path to the output file and name of the voice used.
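
The documented ranges can be checked on the client side before the paid API call is made. The following is a minimal sketch, not part of the MiniMax server: `build_arguments` is a hypothetical helper, and the allowed values are copied from the Args list above. `voice_id`, `model`, and `language_boost` are passed through unchecked.

```python
from pathlib import Path

# Allowed values copied from the Args list in the tool description.
EMOTIONS = {"happy", "sad", "angry", "fearful", "disgusted", "surprised", "neutral"}
SAMPLE_RATES = {8000, 16000, 22050, 24000, 32000, 44100}
BITRATES = {32000, 64000, 128000, 256000}
FORMATS = {"pcm", "mp3", "flac"}

# One predicate per option that has a documented range.
CHECKS = {
    "speed": lambda v: 0.5 <= v <= 2.0,
    "vol": lambda v: 0 <= v <= 10,
    "pitch": lambda v: isinstance(v, int) and -12 <= v <= 12,
    "emotion": lambda v: v in EMOTIONS,
    "sample_rate": lambda v: v in SAMPLE_RATES,
    "bitrate": lambda v: v in BITRATES,
    "channel": lambda v: v in (1, 2),
    "format": lambda v: v in FORMATS,
}


def build_arguments(text, output_directory=None, **options):
    """Hypothetical helper: validate options and return an arguments dict."""
    if not text:
        raise ValueError("text is required")
    for name, value in options.items():
        check = CHECKS.get(name)
        if check is not None and not check(value):
            raise ValueError(f"{name}={value!r} is outside the documented range")
    arguments = {"text": text, **options}
    # Assumption: mirror the documented $HOME/Desktop fallback on the client.
    arguments["output_directory"] = str(output_directory or Path.home() / "Desktop")
    return arguments
```

Calling `build_arguments("Hello", pitch=20)` raises `ValueError`, because 20 is outside the documented range of -12 to 12.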

Input Schema

Name              Required  Description  Default
text              Yes       -            -
output_directory  No        -            -
voice_id          No        -            female-shaonv
model             No        -            speech-02-hd
speed             No        -            -
vol               No        -            -
pitch             No        -            -
emotion           No        -            happy
sample_rate       No        -            -
bitrate           No        -            -
channel           No        -            -
format            No        -            mp3
language_boost    No        -            auto

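
An MCP client invokes a tool by sending a JSON-RPC `tools/call` request whose `params` carry the tool name and an `arguments` object shaped by the input schema. The sketch below only builds and prints such a payload; `tools_call_request` is a hypothetical helper, and transport (stdio or HTTP) is left to the client library in use.

```python
import json


def tools_call_request(arguments, request_id=1):
    """Hypothetical helper: wrap tool arguments in a JSON-RPC tools/call request."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "text_to_audio", "arguments": arguments},
    }


# Only "text" is required; omitted fields fall back to the server defaults.
request = tools_call_request(
    {"text": "Hello, world.", "emotion": "neutral", "format": "mp3"}
)
print(json.dumps(request, indent=2))
```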
Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries the full burden of behavioral disclosure. It effectively describes key behaviors: it's a mutation tool (creates/saves files), includes a cost warning for API calls to Minimax, specifies default behaviors (directory, voice, parameters), and mentions the return format. It doesn't cover rate limits or error handling, but provides substantial operational context.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 3/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is appropriately front-loaded with the core purpose and key optional parameters, but becomes verbose with repetitive phrasing ('Controls the... of the generated speech') for each audio parameter. While informative, this repetition reduces efficiency. The structure is logical but could be more streamlined.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's complexity (13 parameters, mutation operation, no annotations, no output schema), the description is largely complete. It covers purpose, usage warning, parameter details, and return information. Minor gaps include lack of error scenarios or file naming conventions, but it provides sufficient context for effective use.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 5/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

The schema description coverage is 0%, so the description must fully compensate. It does so excellently by providing detailed semantics for all 13 parameters: purpose, optionality, default values, value ranges, and examples. This goes far beyond what the bare schema provides, making parameters fully understandable.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose with specific verbs ('convert text to audio', 'save the output audio file') and resources ('text', 'audio file', 'directory'). It distinguishes from sibling tools like 'play_audio' (which plays rather than creates) and 'voice_clone' (which clones rather than converts text).

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides clear context for usage with the COST WARNING section ('Only use when explicitly requested by the user'), which helps guide when to use it. However, it doesn't explicitly mention when NOT to use it or name specific alternatives among siblings (e.g., 'voice_design' or 'music_generation'), though the purpose distinction is implied.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/swesmith-repos/MiniMax-AI__MiniMax-MCP.aa97ac39'
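
The same request can be issued from Python with the standard library. This is a sketch under assumptions: `build_request` and `fetch_server_info` are hypothetical helper names, and the response is assumed to be a JSON body, which this page does not state.

```python
import json
import urllib.request

SERVER_URL = (
    "https://glama.ai/api/mcp/v1/servers/"
    "swesmith-repos/MiniMax-AI__MiniMax-MCP.aa97ac39"
)


def build_request(url=SERVER_URL):
    """Build the GET request without sending it."""
    return urllib.request.Request(url, method="GET")


def fetch_server_info(url=SERVER_URL, timeout=10):
    """Send the request and decode the body (assumed to be JSON)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_server_info()` performs the network request; `build_request()` alone can be used to inspect the URL and method first.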

If you have feedback or need assistance with the MCP directory API, please join our Discord server.