MiniMax MCP Server

Official

Overview Schema Related Servers Score Discussions

text_to_audio

Convert text into spoken audio using a chosen voice and save it to your specified directory. Adjust speed, pitch, emotion, and other settings to customize the output.

Instructions

Convert text to audio with a given voice and save the output audio file to a given directory. Directory is optional, if not provided, the output file will be saved to $HOME/Desktop. Voice id is optional, if not provided, the default voice will be used.

COST WARNING: This tool makes an API call to Minimax which may incur costs. Only use when explicitly requested by the user.

Args:
    text (str): The text to convert to speech.
    voice_id (str, optional): The id of the voice to use. For example, "male-qn-qingse"/"audiobook_female_1"/"cute_boy"/"Charming_Lady"...
    model (string, optional): The model to use.
    speed (float, optional): Speed of the generated audio. Controls the speed of the generated speech. Values range from 0.5 to 2.0, with 1.0 being the default speed. 
    vol (float, optional): Volume of the generated audio. Controls the volume of the generated speech. Values range from 0 to 10, with 1 being the default volume.
    pitch (int, optional): Pitch of the generated audio. Controls the speed of the generated speech. Values range from -12 to 12, with 0 being the default speed.
    emotion (str, optional): Emotion of the generated audio. Controls the emotion of the generated speech. Values range ["happy", "sad", "angry", "fearful", "disgusted", "surprised", "neutral"], with "happy" being the default emotion.
    sample_rate (int, optional): Sample rate of the generated audio. Controls the sample rate of the generated speech. Values range [8000,16000,22050,24000,32000,44100] with 32000 being the default sample rate.
    bitrate (int, optional): Bitrate of the generated audio. Controls the bitrate of the generated speech. Values range [32000,64000,128000,256000] with 128000 being the default bitrate.
    channel (int, optional): Channel of the generated audio. Controls the channel of the generated speech. Values range [1, 2] with 1 being the default channel.
    format (str, optional): Format of the generated audio. Controls the format of the generated speech. Values range ["pcm", "mp3","flac"] with "mp3" being the default format.
    language_boost (str, optional): Language boost of the generated audio. Controls the language boost of the generated speech. Values range ['Chinese', 'Chinese,Yue', 'English', 'Arabic', 'Russian', 'Spanish', 'French', 'Portuguese', 'German', 'Turkish', 'Dutch', 'Ukrainian', 'Vietnamese', 'Indonesian', 'Japanese', 'Italian', 'Korean', 'Thai', 'Polish', 'Romanian', 'Greek', 'Czech', 'Finnish', 'Hindi', 'auto'] with "auto" being the default language boost.
Returns:
    Text content with the path to the output file and name of the voice used.

Input Schema

TableJSON Schema

Name	Required	Default
`vol`	No
`text`	Yes
`model`	No	speech-02-hd
`pitch`	No
`speed`	No
`format`	No	mp3
`bitrate`	No
`channel`	No
`emotion`	No	happy
`voice_id`	No	female-shaonv
`sample_rate`	No
`language_boost`	No	auto
`output_directory`	No

Tool Definition Quality

A4.2/5.0

Behavior3/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

No annotations are provided, so the description must disclose all behavioral traits. It mentions the output file creation and return value, but lacks details on overwrite behavior, error handling, or auth requirements. The cost warning adds some transparency, but overall the description is adequate but not comprehensive.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is well-structured, starting with a clear summary, then a cost warning, then a detailed parameter list. It is front-loaded and organized, though slightly verbose in parameter explanations. Still, it earns its place.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's complexity (13 parameters, no output schema), the description covers purpose, cost, parameter meanings, and return type. It is fairly complete, though it could mention file naming conventions or directory creation behavior.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters5/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

With 0% schema description coverage, the description's Args section provides extensive meaning for all 13 parameters, including ranges, examples, and defaults (e.g., 'speed: 0.5 to 2.0, default 1.0'). This fully compensates for the missing schema descriptions.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states 'Convert text to audio with a given voice and save the output audio file to a given directory,' using a specific verb ('convert') and resource ('text to audio'). It distinguishes from sibling tools like text_to_image and generate_video, as it focuses solely on audio generation from text.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description includes a cost warning advising to only use when explicitly requested, providing explicit usage context. However, it does not differentiate from sibling tools like voice_clone or list_voices, which could be relevant when selecting a voice or cloning.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

Install Server

Other Tools

Latest Blog Posts

Your AI Chatbot Just Exposed Your CEO's Salary to an Intern
By Om-Shree-0709 on July 2, 2026.
Agent Identity
MCP Security
OAuth Delegation
Why MCP Servers Need Execution Sandboxing (And Why Your Current Stack Isn't Enough)
By Om-Shree-0709 on June 30, 2026.
Agentic Ai
Prompt Injection
WebAssembly
Lightport: Open-Sourcing Glama's AI Gateway
By punkpeye on April 27, 2026.
OpenAI
open source

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/MiniMax-AI/MiniMax-MCP'

If you have feedback or need assistance with the MCP directory API, please join our Discord server