# text_to_audio

Convert text to audio with customizable voice, speed, and emotion, saving the file to a specified directory. Integrates with the MiniMax API for high-quality speech synthesis.

## Instructions

Convert text to audio with a given voice and save the output audio file to a given directory. If no directory is provided, the file is saved to the desktop. If no voice ID is provided, the default voice is used.

Note: This tool calls the MiniMax API and may incur costs. Use it only when explicitly requested by the user.
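The save-path behaviour described above (and detailed under `outputDirectory` in the schema) can be sketched as follows. This is an illustration, not the project's code: `resolveSavePath` is a hypothetical helper, and the desktop default mirrors the documented fallback.

```typescript
import path from 'path';

// Hypothetical helper mirroring the documented behaviour: the final save
// path is `${basePath}/${outputDirectory}`; when no directory is given,
// files land directly under basePath (the desktop by default).
function resolveSavePath(basePath: string, outputDirectory?: string): string {
  return outputDirectory ? path.join(basePath, outputDirectory) : basePath;
}

// With MINIMAX_MCP_BASE_PATH=~/Desktop and outputDirectory=workspace,
// output is saved under ~/Desktop/workspace.
console.log(resolveSavePath('/home/user/Desktop', 'workspace')); // "/home/user/Desktop/workspace"
console.log(resolveSavePath('/home/user/Desktop'));              // "/home/user/Desktop"
```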
## Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| bitrate | No | Bitrate (bps), values: [64000, 96000, 128000, 160000, 192000, 224000, 256000, 320000] | |
| channel | No | Audio channels, values: [1, 2] | |
| emotion | No | Speech emotion, values: ["happy", "sad", "angry", "fearful", "disgusted", "surprised", "neutral"] | happy |
| format | No | Audio format, values: ["pcm", "mp3", "flac", "wav"] | mp3 |
| languageBoost | No | Enhance recognition of specified languages and dialects. Supported values: 'Chinese', 'Chinese,Yue', 'English', 'Arabic', 'Russian', 'Spanish', 'French', 'Portuguese', 'German', 'Turkish', 'Dutch', 'Ukrainian', 'Vietnamese', 'Indonesian', 'Japanese', 'Italian', 'Korean', 'Thai', 'Polish', 'Romanian', 'Greek', 'Czech', 'Finnish', 'Hindi', 'auto' | auto |
| model | No | Model to use | speech-02-hd |
| outputDirectory | No | The directory to save the output file. `outputDirectory` is relative to `MINIMAX_MCP_BASE_PATH` (or `basePath` in config). The final save path is `${basePath}/${outputDirectory}`. For example, if `MINIMAX_MCP_BASE_PATH=~/Desktop` and `outputDirectory=workspace`, the output will be saved to `~/Desktop/workspace/` | |
| outputFile | No | Path to save the generated audio file, automatically generated if not provided | |
| pitch | No | Speech pitch (-12 to 12) | 0 |
| sampleRate | No | Sample rate (Hz), values: [8000, 16000, 22050, 24000, 32000, 44100] | |
| speed | No | Speech speed (0.5-2.0) | 1.0 |
| subtitleEnable | No | Whether the subtitle service is enabled. The model must be 'speech-01-turbo' or 'speech-01-hd' | false |
| text | Yes | Text to convert to audio | |
| voiceId | No | Voice ID to use, e.g. "female-shaonv" | male-qn-qingse |
| vol | No | Speech volume (0.1-10.0) | 1.0 |
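A hypothetical arguments payload satisfying this schema. Only `text` is required; all other fields fall back to the defaults above (the specific values chosen here are illustrative):

```typescript
// Minimal call: everything except `text` is optional.
const minimalArgs = { text: 'Welcome to the demo.' };

// Fully specified call, with every value drawn from the allowed sets above.
const fullArgs = {
  text: 'Welcome to the demo.',
  voiceId: 'female-shaonv',
  model: 'speech-02-hd',
  speed: 1.2,
  vol: 1.0,
  pitch: 0,
  emotion: 'neutral',
  format: 'mp3',
  sampleRate: 32000,
  bitrate: 128000,
  channel: 1,
  languageBoost: 'English',
  subtitleEnable: false,
  outputDirectory: 'workspace',
};

console.log(Object.keys(minimalArgs).length); // 1
console.log(Object.keys(fullArgs).length);    // 14
```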
## Implementation Reference
- **src/api/tts.ts:16-111 (handler)** — Core handler that implements text-to-audio conversion by calling the MiniMax TTS endpoint `/v1/t2a_v2`. It validates parameters, builds the nested request payload, and handles the response (hex-to-binary conversion, file saving, or URL return); helper validators (`ensureValid*`) normalize individual parameters.

```typescript
async generateSpeech(request: TTSRequest): Promise<any> {
  // Validate required parameters
  if (!request.text || request.text.trim() === '') {
    throw new MinimaxRequestError(ERROR_TEXT_REQUIRED);
  }

  // Process output file
  let outputFile = request.outputFile;
  if (!outputFile) {
    // If no output file is provided, generate one based on text content
    const textPrefix = request.text.substring(0, 20).replace(/[^\w]/g, '_');
    outputFile = `tts_${textPrefix}_${Date.now()}`;
  }
  if (!path.extname(outputFile)) {
    // If no extension, add one based on format
    const format = request.format || 'mp3';
    outputFile = buildOutputFile(outputFile, request.outputDirectory, format);
  }

  // Prepare request data according to MiniMax API nested structure
  const requestData: Record<string, any> = {
    model: this.ensureValidModel(request.model),
    text: request.text,
    voice_setting: {
      voice_id: request.voiceId || 'male-qn-qingse',
      speed: request.speed || 1.0,
      vol: request.vol || 1.0,
      pitch: request.pitch || 0,
      emotion: this.ensureValidEmotion(request.emotion, this.ensureValidModel(request.model))
    },
    audio_setting: {
      sample_rate: this.ensureValidSampleRate(request.sampleRate),
      bitrate: this.ensureValidBitrate(request.bitrate),
      format: this.ensureValidFormat(request.format),
      channel: this.ensureValidChannel(request.channel)
    },
    language_boost: request.languageBoost || 'auto',
    stream: request.stream,
    subtitle_enable: request.subtitleEnable
  };

  // Add output format (if specified)
  if (request.outputFormat === RESOURCE_MODE_URL) {
    requestData.output_format = 'url';
  }

  // Filter out undefined fields (recursive)
  const filteredData = this.removeUndefinedFields(requestData);

  try {
    // Send request
    const response = await this.api.post<any>('/v1/t2a_v2', filteredData);

    // Process response
    const audioData = response?.data?.audio;
    const subtitleFile = response?.data?.subtitle_file;
    if (!audioData) {
      throw new MinimaxRequestError('Could not get audio data from response');
    }

    // If URL mode, return URL directly
    if (request.outputFormat === RESOURCE_MODE_URL) {
      return { audio: audioData, subtitle: subtitleFile };
    }

    // Otherwise, decode and save file
    try {
      // Convert hex string to binary
      const audioBuffer = Buffer.from(audioData, 'hex');

      // Ensure output directory exists
      const outputDir = path.dirname(outputFile);
      if (!fs.existsSync(outputDir)) {
        fs.mkdirSync(outputDir, { recursive: true });
      }

      // Write to file
      fs.writeFileSync(outputFile, audioBuffer);
      return { audio: outputFile, subtitle: subtitleFile };
    } catch (error) {
      throw new MinimaxRequestError(`Failed to save audio file: ${String(error)}`);
    }
  } catch (error) {
    throw error;
  }
}
```
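Two details of `generateSpeech` are worth isolating: the audio payload comes back as a hex string (decoded with `Buffer.from(data, 'hex')`), and auto-generated filenames are derived from the first 20 characters of the input text. A standalone sketch (the hex payload here is hypothetical):

```typescript
// Hex payload decoding, as in generateSpeech. '49443303' is a hypothetical
// 4-byte payload prefix ("ID3" plus a version byte, the start of an ID3 tag).
const hexAudio = '49443303';
const audioBuffer = Buffer.from(hexAudio, 'hex');
console.log(audioBuffer.length); // 4

// Filename prefix derivation, as in generateSpeech: first 20 characters,
// non-word characters replaced with underscores.
const text = 'Hello, world! How are you?';
const textPrefix = text.substring(0, 20).replace(/[^\w]/g, '_');
console.log(textPrefix); // "Hello__world__How_ar"
```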
- **src/mcp-server.ts:123-256 (registration)** — Tool registration via `McpServer.tool()` under the name `text_to_audio`, with a detailed description, a Zod input schema defining all parameters (descriptions, defaults, validations), and an async handler that prepares the parameters and calls `TTSAPI.generateSpeech`.

```typescript
private registerTextToAudioTool(): void {
  this.server.tool(
    'text_to_audio',
    'Convert text to audio with a given voice and save the output audio file to a given directory. If no directory is provided, the file will be saved to desktop. If no voice ID is provided, the default voice will be used.\n\nNote: This tool calls MiniMax API and may incur costs. Use only when explicitly requested by the user.',
    {
      text: z.string().describe('Text to convert to audio'),
      outputDirectory: COMMON_PARAMETERS_SCHEMA.outputDirectory,
      voiceId: z.string().optional().default(DEFAULT_VOICE_ID).describe('Voice ID to use, e.g. "female-shaonv"'),
      model: z.string().optional().default(DEFAULT_SPEECH_MODEL).describe('Model to use'),
      speed: z.number().min(0.5).max(2.0).optional().default(DEFAULT_SPEED).describe('Speech speed'),
      vol: z.number().min(0.1).max(10.0).optional().default(DEFAULT_VOLUME).describe('Speech volume'),
      pitch: z.number().min(-12).max(12).optional().default(DEFAULT_PITCH).describe('Speech pitch'),
      emotion: z
        .string()
        .optional()
        .default(DEFAULT_EMOTION)
        .describe('Speech emotion, values: ["happy", "sad", "angry", "fearful", "disgusted", "surprised", "neutral"]'),
      format: z
        .string()
        .optional()
        .default(DEFAULT_FORMAT)
        .describe('Audio format, values: ["pcm", "mp3", "flac", "wav"]'),
      sampleRate: z
        .number()
        .optional()
        .default(DEFAULT_SAMPLE_RATE)
        .describe('Sample rate (Hz), values: [8000, 16000, 22050, 24000, 32000, 44100]'),
      bitrate: z
        .number()
        .optional()
        .default(DEFAULT_BITRATE)
        .describe('Bitrate (bps), values: [64000, 96000, 128000, 160000, 192000, 224000, 256000, 320000]'),
      channel: z.number().optional().default(DEFAULT_CHANNEL).describe('Audio channels, values: [1, 2]'),
      languageBoost: z
        .string()
        .optional()
        .default(DEFAULT_LANGUAGE_BOOST)
        .describe(`Enhance the ability to recognize specified languages and dialects. Supported values include: 'Chinese', 'Chinese,Yue', 'English', 'Arabic', 'Russian', 'Spanish', 'French', 'Portuguese', 'German', 'Turkish', 'Dutch', 'Ukrainian', 'Vietnamese', 'Indonesian', 'Japanese', 'Italian', 'Korean', 'Thai', 'Polish', 'Romanian', 'Greek', 'Czech', 'Finnish', 'Hindi', 'auto', default is 'auto'`),
      subtitleEnable: z
        .boolean()
        .optional()
        .default(false)
        .describe(
          `The parameter controls whether the subtitle service is enabled. The model must be 'speech-01-turbo' or 'speech-01-hd'. If this parameter is not provided, the default value is false`,
        ),
      outputFile: z
        .string()
        .optional()
        .describe('Path to save the generated audio file, automatically generated if not provided'),
    },
    async (args, extra) => {
      try {
        // Build TTS request parameters
        const ttsParams = {
          text: args.text,
          outputDirectory: args.outputDirectory,
          voiceId: args.voiceId || DEFAULT_VOICE_ID,
          model: args.model || DEFAULT_SPEECH_MODEL,
          speed: args.speed || DEFAULT_SPEED,
          vol: args.vol || DEFAULT_VOLUME,
          pitch: args.pitch || DEFAULT_PITCH,
          emotion: args.emotion || DEFAULT_EMOTION,
          format: args.format || DEFAULT_FORMAT,
          sampleRate: args.sampleRate || DEFAULT_SAMPLE_RATE,
          bitrate: args.bitrate || DEFAULT_BITRATE,
          channel: args.channel || DEFAULT_CHANNEL,
          languageBoost: args.languageBoost || DEFAULT_LANGUAGE_BOOST,
          subtitleEnable: args.subtitleEnable || false,
          outputFile: args.outputFile,
        };

        // Use global configuration
        const requestApiKey = this.config.apiKey;
        if (!requestApiKey) {
          throw new Error(ERROR_API_KEY_REQUIRED);
        }

        // Update configuration with request-specific parameters
        const requestConfig: Partial<Config> = {
          apiKey: requestApiKey,
          apiHost: this.config.apiHost,
          resourceMode: this.config.resourceMode,
        };

        // Update API instance
        const requestApi = new MiniMaxAPI(requestConfig as Config);
        const requestTtsApi = new TTSAPI(requestApi);

        // Automatically set resource mode (if not specified)
        const outputFormat = requestConfig.resourceMode;
        const ttsRequest = {
          ...ttsParams,
          outputFormat,
        };

        // If no output filename is provided, generate one automatically
        if (!ttsRequest.outputFile) {
          const textPrefix = ttsRequest.text.substring(0, 20).replace(/[^\w]/g, '_');
          ttsRequest.outputFile = `tts_${textPrefix}_${Date.now()}`;
        }

        const result = await requestTtsApi.generateSpeech(ttsRequest);

        // Return different messages based on output format
        if (outputFormat === RESOURCE_MODE_URL) {
          return {
            content: [
              {
                type: 'text',
                text: `Success. Audio URL: ${result.audio}. ${ttsParams.subtitleEnable ? `Subtitle file saved: ${result.subtitle}` : ''}`,
              },
            ],
          };
        } else {
          return {
            content: [
              {
                type: 'text',
                text: `Audio file saved: ${result.audio}. ${ttsParams.subtitleEnable ? `Subtitle file saved: ${result.subtitle}. ` : ''}Voice used: ${ttsParams.voiceId}`,
              },
            ],
          };
        }
      } catch (error) {
        return {
          content: [
            {
              type: 'text',
              text: `Failed to generate audio: ${error instanceof Error ? error.message : String(error)}`,
            },
          ],
        };
      }
    },
  );
}
```
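`removeUndefinedFields` is called in `generateSpeech` but its body is not shown in the excerpts. A recursive filter along the following lines would match the described behaviour; this is a sketch, not the project's implementation:

```typescript
// Recursively drop undefined-valued keys so they are omitted from the
// JSON request body (sketch of the behaviour described in the reference).
function removeUndefinedFields(obj: Record<string, any>): Record<string, any> {
  const out: Record<string, any> = {};
  for (const [key, value] of Object.entries(obj)) {
    if (value === undefined) continue; // drop undefined leaves
    out[key] =
      value !== null && typeof value === 'object' && !Array.isArray(value)
        ? removeUndefinedFields(value) // recurse into nested objects
        : value;
  }
  return out;
}

const filtered = removeUndefinedFields({
  model: 'speech-02-hd',
  stream: undefined,
  voice_setting: { voice_id: 'male-qn-qingse', emotion: undefined },
});
console.log(JSON.stringify(filtered));
// {"model":"speech-02-hd","voice_setting":{"voice_id":"male-qn-qingse"}}
```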
- **src/mcp-rest-server.ts:204-245 (schema)** — JSON schema definition for the `text_to_audio` tool used in the REST server's list-tools handler; includes all input parameters with types and descriptions.

```typescript
{
  name: 'text_to_audio',
  description: 'Convert text to audio',
  arguments: [
    { name: 'text', description: 'Text to convert to audio', required: true },
    { name: 'outputDirectory', description: OUTPUT_DIRECTORY_DESCRIPTION, required: false },
    { name: 'voiceId', description: 'Voice ID to use, e.g. "female-shaonv"', required: false },
    { name: 'model', description: 'Model to use', required: false },
    { name: 'speed', description: 'Speech speed (0.5-2.0)', required: false },
    { name: 'vol', description: 'Speech volume (0.1-10.0)', required: false },
    { name: 'pitch', description: 'Speech pitch (-12 to 12)', required: false },
    { name: 'emotion', description: 'Speech emotion, values: ["happy", "sad", "angry", "fearful", "disgusted", "surprised", "neutral"]', required: false },
    { name: 'format', description: 'Audio format, values: ["pcm", "mp3", "flac", "wav"]', required: false },
    { name: 'sampleRate', description: 'Sample rate (Hz), values: [8000, 16000, 22050, 24000, 32000, 44100]', required: false },
    { name: 'bitrate', description: 'Bitrate (bps), values: [64000, 96000, 128000, 160000, 192000, 224000, 256000, 320000]', required: false },
    { name: 'channel', description: 'Audio channels, values: [1, 2]', required: false },
    { name: 'languageBoost', description: `Enhance the ability to recognize specified languages and dialects. Supported values include: 'Chinese', 'Chinese,Yue', 'English', 'Arabic', 'Russian', 'Spanish', 'French', 'Portuguese', 'German', 'Turkish', 'Dutch', 'Ukrainian', 'Vietnamese', 'Indonesian', 'Japanese', 'Italian', 'Korean', 'Thai', 'Polish', 'Romanian', 'Greek', 'Czech', 'Finnish', 'Hindi', 'auto', default is 'auto'`, required: false },
    { name: 'subtitleEnable', description: `The parameter controls whether the subtitle service is enabled. The model must be 'speech-01-turbo' or 'speech-01-hd'. If this parameter is not provided, the default value is false`, required: false },
    { name: 'outputFile', description: 'Output file path, auto-generated if not provided', required: false }
  ],
  inputSchema: {
    type: 'object',
    properties: {
      text: { type: 'string' },
      outputDirectory: { type: 'string' },
      voiceId: { type: 'string' },
      model: { type: 'string' },
      speed: { type: 'number' },
      vol: { type: 'number' },
      pitch: { type: 'number' },
      emotion: { type: 'string' },
      format: { type: 'string' },
      sampleRate: { type: 'number' },
      bitrate: { type: 'number' },
      channel: { type: 'number' },
      languageBoost: { type: 'string' },
      subtitleEnable: { type: 'boolean' },
      outputFile: { type: 'string' }
    },
    required: ['text']
  }
},
```
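The REST server's `inputSchema` is plain JSON Schema with only `text` in its `required` list. A minimal standalone check of that list (illustrative only; the server's actual validation logic is not shown in the excerpt):

```typescript
// A trimmed-down copy of the schema's shape, for illustration.
const inputSchema = {
  type: 'object',
  properties: { text: { type: 'string' }, speed: { type: 'number' } },
  required: ['text'],
} as const;

// Check that every required property is present on the arguments object.
function hasRequired(args: Record<string, unknown>): boolean {
  return inputSchema.required.every((key) => key in args);
}

console.log(hasRequired({ text: 'hi', speed: 1.2 })); // true
console.log(hasRequired({ speed: 1.2 }));             // false
```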
- **src/mcp-rest-server.ts:256-489 (handler)** — Tool-call dispatcher in the REST server. It routes `text_to_audio` requests to `handleTextToAudio`, which wraps `mediaService.generateSpeech` with retry logic (note the `attempt` parameter). The excerpt below elides the list-tools schema definitions for the other tools (`play_audio`, `text_to_image`, `generate_video`, `voice_clone`, `image_to_video`, `music_generation`, `voice_design`) that share this source range.

```typescript
// … schema definitions for the other tools elided …

// Call tool handler
this.server.setRequestHandler(CallToolRequestSchema, async (request) => {
  const toolName = request.params.tool;
  const toolParams = request.params.params || {};
  try {
    // Create configuration and API instance for this request
    const requestConfig = this.getRequestConfig(request);
    const requestApi = new MiniMaxAPI(requestConfig);
    const mediaService = new MediaService(requestApi);

    // Log API key (partially hidden)
    const apiKey = this.extractApiKeyFromRequest(request);
    const maskedKey = apiKey ? `${apiKey.substring(0, 4)}****${apiKey.substring(apiKey.length - 4)}` : 'not provided';

    // Choose different handler function based on tool name
    switch (toolName) {
      case 'text_to_audio':
        return await this.handleTextToAudio(toolParams, requestApi, mediaService);
      case 'list_voices':
        return await this.handleListVoices(toolParams, requestApi, mediaService);
      case 'play_audio':
        return await this.handlePlayAudio(toolParams);
      case 'text_to_image':
        return await this.handleTextToImage(toolParams, requestApi, mediaService);
      case 'generate_video':
        return await this.handleGenerateVideo(toolParams, requestApi, mediaService);
      case 'voice_clone':
        return await this.handleVoiceClone(toolParams, requestApi, mediaService);
      case 'image_to_video':
        return await this.handleImageToVideo(toolParams, requestApi, mediaService);
      case 'query_video_generation':
        return await this.handleVideoGenerationQuery(toolParams, requestApi, mediaService);
      case 'music_generation':
        return await this.handleGenerateMusic(toolParams, requestApi, mediaService);
      case 'voice_design':
        return await this.handleVoiceDesign(toolParams, requestApi, mediaService);
      default:
        throw new Error(`Unknown tool: ${toolName}`);
    }
  } catch (error) {
    throw this.wrapError(`Failed to call tool ${toolName}`, error);
  }
});

/**
 * Handle text to speech request
 */
private async handleTextToAudio(args: any, api: MiniMaxAPI, mediaService: MediaService, attempt = 1): Promise<any> {
```
- **src/services/media-service.ts:70-78 (handler)** — `MediaService` wrapper that delegates text-to-audio execution to `TTSAPI.generateSpeech`.

```typescript
public async generateSpeech(params: any): Promise<string> {
  this.checkInitialized();
  try {
    return await this.ttsApi.generateSpeech(params);
  } catch (error) {
    throw this.wrapError('Failed to generate speech', error);
  }
}
```
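`wrapError` itself is not shown in the excerpts; the pattern it appears to implement is contextual error wrapping. A sketch under that assumption (the name and exact shape are guesses, not the project's code):

```typescript
// Hypothetical sketch of contextual error wrapping: prefix a context
// message while preserving the original error's detail.
function wrapError(context: string, error: unknown): Error {
  const detail = error instanceof Error ? error.message : String(error);
  return new Error(`${context}: ${detail}`);
}

const wrapped = wrapError('Failed to generate speech', new Error('timeout'));
console.log(wrapped.message); // "Failed to generate speech: timeout"
```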