generate_speech

Convert text to speech with customizable options like emotion, pitch, speed, and voice selection using the Minimax MCP Tools API. Save audio files in multiple formats for diverse use cases.

Input Schema

| Name | Required | Description |
| --- | --- | --- |
| bitrate | No | Bitrate of the generated audio (MP3 only) |
| channel | No | Number of audio channels (1 = mono, 2 = stereo) |
| emotion | No | Emotion of the speech |
| format | No | Audio format |
| languageBoost | No | Enhance recognition of specific languages |
| latexRead | No | Whether to read LaTeX formulas aloud |
| model | No | Model version to use for speech generation. speech-02-hd is the newest high-definition model with better quality and naturalness; speech-02-turbo offers excellent performance with low latency. |
| outputFile | Yes | Absolute path to save the generated audio file |
| pitch | No | Speech pitch (-12 to 12) |
| pronunciationDict | No | List of pronunciation replacements |
| sampleRate | No | Sample rate of the generated audio |
| speed | No | Speech speed (0.5-2.0) |
| stream | No | Whether to use streaming mode |
| subtitleEnable | No | Whether to enable subtitle generation |
| text | Yes | Text to convert to speech |
| timberWeights | No | Voice timbre weights for voice mixing |
| voiceId | No | Voice ID to use (see the list of available voices below) |
| volume | No | Speech volume (0.1-10.0) |

Available voice IDs:

- Male voices: male-qn-qingse (innocent youth), male-qn-jingying (elite youth), male-qn-badao (domineering youth), male-qn-daxuesheng (young college student)
- Female voices: female-shaonv (young girl), female-yujie (mature "yujie"), female-chengshu (mature woman), female-tianmei (sweet woman)
- Presenters: presenter_male (male host), presenter_female (female host)
- Audiobooks: audiobook_male_1 (male audiobook 1), audiobook_male_2 (male audiobook 2), audiobook_female_1 (female audiobook 1), audiobook_female_2 (female audiobook 2)
- Beta voices: male-qn-qingse-jingpin (innocent youth, beta), male-qn-jingying-jingpin (elite youth, beta), male-qn-badao-jingpin (domineering youth, beta), male-qn-daxuesheng-jingpin (young college student, beta), female-shaonv-jingpin (young girl, beta), female-yujie-jingpin (yujie, beta), female-chengshu-jingpin (mature woman, beta), female-tianmei-jingpin (sweet woman, beta)
- Character voices: clever_boy (clever boy), cute_boy (cute boy), lovely_girl (adorable girl), cartoon_pig (cartoon pig Xiaoqi), bingjiao_didi (yandere little brother), junlang_nanyou (handsome boyfriend), chunzhen_xuedi (innocent male junior), lengdan_xiongzhang (aloof senior), badao_shaoye (domineering young master), tianxin_xiaoling (sweetheart Xiaoling), qiaopi_mengmei (playful cute girl), wumei_yujie (seductive yujie), diadia_xuemei (coquettish female junior), danya_xuejie (elegant female senior)
- Western characters: Santa_Claus, Grinch, Rudolph, Arnold, Charming_Santa, Charming_Lady, Sweet_Girl, Cute_Elf, Attractive_Girl, Serene_Woman
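For orientation, a hypothetical invocation might supply only the required fields plus a few overrides. All values here are illustrative, not server defaults:

```typescript
// Hypothetical example parameters for generate_speech (field names from the schema above).
const params = {
  text: "Hello, world! <#0.50#> Welcome to the demo.", // pause marker: 0.5 s between phrases
  outputFile: "/tmp/hello.mp3",                        // must be an absolute path
  voiceId: "female-shaonv",                            // one of the voice IDs listed above
  speed: 1.2,                                          // allowed range 0.5-2.0
  pitch: -2,                                           // allowed range -12 to 12 semitones
  emotion: "happy",
  format: "mp3",
};
```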

Implementation Reference

  • Core handler function implementing text-to-speech generation: constructs the API payload, sends a POST request to the Minimax TTS endpoint, and processes the response, saving the audio file to disk.

    ```typescript
    async generateSpeech(params: TextToSpeechParams): Promise<TTSResult> {
      try {
        // Build API payload (MCP handles validation)
        const payload = this.buildPayload(params);

        // Make API request
        const response = await this.post(API_CONFIG.ENDPOINTS.TEXT_TO_SPEECH, payload) as TTSResponse;

        // Process response
        return await this.processTTSResponse(response, params);
      } catch (error: any) {
        const processedError = ErrorHandler.handleAPIError(error);
        ErrorHandler.logError(processedError, { service: 'tts', params });
        // Throw the error so the task manager can properly mark it as failed
        throw processedError;
      }
    }
    ```
  • Input schema (Zod) defining parameters for the generateSpeech tool, including text, voice settings, audio format, and optional voice modifications.

    ```typescript
    export const textToSpeechSchema = z.object({
      text: z.string()
        .min(1, 'Text is required')
        .max(CONSTRAINTS.TTS.TEXT_MAX_LENGTH, `Text to convert to speech. Max ${CONSTRAINTS.TTS.TEXT_MAX_LENGTH} characters. Use newlines for paragraph breaks. For custom pauses, insert <#x#> where x is seconds (0.01-99.99, max 2 decimals). Pause markers must be between pronounceable text and cannot be consecutive`),
      outputFile: filePathSchema.describe('Absolute path for audio file'),
      highQuality: z.boolean()
        .default(false)
        .describe('Use high-quality model (speech-02-hd) for audiobooks/premium content. Default: false (uses faster speech-02-turbo)'),
      voiceId: z.enum(Object.keys(VOICES) as [VoiceId, ...VoiceId[]])
        .default('female-shaonv' as VoiceId)
        .describe(`Voice ID for speech generation. Available voices: ${Object.keys(VOICES).map(id => `${id} (${VOICES[id as VoiceId]?.name || id})`).join(', ')}`),
      speed: z.number()
        .min(CONSTRAINTS.TTS.SPEED_MIN)
        .max(CONSTRAINTS.TTS.SPEED_MAX)
        .default(1.0)
        .describe(`Speech speed multiplier (${CONSTRAINTS.TTS.SPEED_MIN}-${CONSTRAINTS.TTS.SPEED_MAX}). Higher values = faster speech`),
      volume: z.number()
        .min(CONSTRAINTS.TTS.VOLUME_MIN)
        .max(CONSTRAINTS.TTS.VOLUME_MAX)
        .default(1.0)
        .describe(`Audio volume level (${CONSTRAINTS.TTS.VOLUME_MIN}-${CONSTRAINTS.TTS.VOLUME_MAX}). Higher values = louder audio`),
      pitch: z.number()
        .min(CONSTRAINTS.TTS.PITCH_MIN)
        .max(CONSTRAINTS.TTS.PITCH_MAX)
        .default(0)
        .describe(`Pitch adjustment in semitones (${CONSTRAINTS.TTS.PITCH_MIN} to ${CONSTRAINTS.TTS.PITCH_MAX}). Negative = lower pitch, Positive = higher pitch`),
      emotion: z.enum(CONSTRAINTS.TTS.EMOTIONS as readonly [Emotion, ...Emotion[]])
        .default('neutral' as Emotion)
        .describe(`Emotional tone of the speech. Options: ${CONSTRAINTS.TTS.EMOTIONS.join(', ')}`),
      format: z.enum(CONSTRAINTS.TTS.FORMATS as readonly [AudioFormat, ...AudioFormat[]])
        .default('mp3' as AudioFormat)
        .describe(`Output audio format. Options: ${CONSTRAINTS.TTS.FORMATS.join(', ')}`),
      sampleRate: z.enum(CONSTRAINTS.TTS.SAMPLE_RATES as readonly [SampleRate, ...SampleRate[]])
        .default("32000" as SampleRate)
        .describe(`Audio sample rate in Hz. Options: ${CONSTRAINTS.TTS.SAMPLE_RATES.join(', ')}`),
      bitrate: z.enum(CONSTRAINTS.TTS.BITRATES as readonly [Bitrate, ...Bitrate[]])
        .default("128000" as Bitrate)
        .describe(`Audio bitrate in bps. Options: ${CONSTRAINTS.TTS.BITRATES.join(', ')}`),
      languageBoost: z.string()
        .default('auto')
        .describe('Enhance recognition for specific languages/dialects. Options: Chinese, Chinese,Yue, English, Arabic, Russian, Spanish, French, Portuguese, German, Turkish, Dutch, Ukrainian, Vietnamese, Indonesian, Japanese, Italian, Korean, Thai, Polish, Romanian, Greek, Czech, Finnish, Hindi, Bulgarian, Danish, Hebrew, Malay, Persian, Slovak, Swedish, Croatian, Filipino, Hungarian, Norwegian, Slovenian, Catalan, Nynorsk, Tamil, Afrikaans, auto. Use "auto" for automatic detection'),
      intensity: z.number()
        .int()
        .min(CONSTRAINTS.TTS.VOICE_MODIFY_INTENSITY_MIN)
        .max(CONSTRAINTS.TTS.VOICE_MODIFY_INTENSITY_MAX)
        .optional()
        .describe('Voice intensity adjustment (-100 to 100). Values closer to -100 make the voice more robust, closer to 100 make it softer'),
      timbre: z.number()
        .int()
        .min(CONSTRAINTS.TTS.VOICE_MODIFY_TIMBRE_MIN)
        .max(CONSTRAINTS.TTS.VOICE_MODIFY_TIMBRE_MAX)
        .optional()
        .describe('Voice timbre adjustment (-100 to 100). Values closer to -100 make the voice more mellow, closer to 100 make it more crisp'),
      sound_effects: z.enum(CONSTRAINTS.TTS.SOUND_EFFECTS as readonly [SoundEffect, ...SoundEffect[]])
        .optional()
        .describe(getSoundEffectsDescription())
    });
    ```
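The schema delegates its numeric bounds to CONSTRAINTS.TTS. A minimal hand-rolled sketch of the same range checks (constants assumed to match the input schema table above; Zod itself not required here):

```typescript
// Sketch of the numeric range checks the Zod schema enforces.
// The bounds are assumptions taken from the parameter table, not from CONSTRAINTS.TTS directly.
interface RangeCheck { min: number; max: number; }

const RANGES: Record<string, RangeCheck> = {
  speed:  { min: 0.5, max: 2.0 },
  volume: { min: 0.1, max: 10.0 },
  pitch:  { min: -12, max: 12 },
};

// Returns null when the value is in range, otherwise an error message.
function checkRange(name: string, value: number): string | null {
  const r = RANGES[name];
  if (!r) return `unknown parameter: ${name}`;
  return value >= r.min && value <= r.max
    ? null
    : `${name} must be between ${r.min} and ${r.max}`;
}
```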
  • Helper method to build the API payload for the Minimax TTS endpoint from input parameters, applying defaults and cleaning undefined values.

    ```typescript
    private buildPayload(params: TextToSpeechParams): TTSPayload {
      const ttsDefaults = DEFAULTS.TTS as any;

      // Map highQuality parameter to the appropriate Speech 2.6 model
      const model = (params as any).highQuality ? 'speech-2.6-hd' : 'speech-2.6-turbo';

      const payload: TTSPayload = {
        model: model,
        text: params.text,
        voice_setting: {
          voice_id: params.voiceId || ttsDefaults.voiceId,
          speed: params.speed || ttsDefaults.speed,
          vol: params.volume || ttsDefaults.volume,
          pitch: params.pitch || ttsDefaults.pitch,
          emotion: params.emotion || ttsDefaults.emotion
        },
        audio_setting: {
          sample_rate: parseInt(params.sampleRate || ttsDefaults.sampleRate),
          bitrate: parseInt(params.bitrate || ttsDefaults.bitrate),
          format: params.format || ttsDefaults.format,
          channel: ttsDefaults.channel
        }
      };

      // Add optional parameters
      if (params.languageBoost) {
        payload.language_boost = params.languageBoost;
      }

      // Add voice modify parameters if present
      if (params.intensity !== undefined || params.timbre !== undefined || params.sound_effects !== undefined) {
        payload.voice_modify = {};
        if (params.intensity !== undefined) {
          payload.voice_modify.intensity = params.intensity;
        }
        if (params.timbre !== undefined) {
          payload.voice_modify.timbre = params.timbre;
        }
        if (params.sound_effects !== undefined) {
          payload.voice_modify.sound_effects = params.sound_effects;
        }
      }

      // Voice mixing feature removed for simplicity

      // Filter out undefined values
      return this.cleanPayload(payload) as TTSPayload;
    }
    ```
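buildPayload ends by calling this.cleanPayload, which is not shown in this reference. A plausible sketch, assuming it simply drops undefined keys recursively:

```typescript
// Hypothetical stand-in for the cleanPayload helper referenced above:
// recursively remove keys whose value is undefined, so they are omitted
// from the JSON body sent to the API.
function cleanPayload<T extends Record<string, unknown>>(payload: T): Partial<T> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(payload)) {
    if (value === undefined) continue; // drop unset parameters
    out[key] =
      value !== null && typeof value === "object" && !Array.isArray(value)
        ? cleanPayload(value as Record<string, unknown>) // clean nested settings objects
        : value;
  }
  return out as Partial<T>;
}
```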
  • Helper method to process the TTS API response: decodes hex audio data, saves it to file, and constructs the result object with metadata.

    ```typescript
    private async processTTSResponse(response: TTSResponse, params: TextToSpeechParams): Promise<TTSResult> {
      const audioHex = response.data?.audio;
      if (!audioHex) {
        throw new Error('No audio data received from API');
      }

      // Convert hex to bytes and save
      const audioBytes = Buffer.from(audioHex, 'hex');
      await FileHandler.writeFile(params.outputFile, audioBytes);

      const ttsDefaults = DEFAULTS.TTS as any;
      const result: TTSResult = {
        audioFile: params.outputFile,
        voiceUsed: params.voiceId || ttsDefaults.voiceId,
        model: (params as any).highQuality ? 'speech-2.6-hd' : 'speech-2.6-turbo',
        duration: response.data?.duration || null,
        format: params.format || ttsDefaults.format,
        sampleRate: parseInt(params.sampleRate || ttsDefaults.sampleRate),
        bitrate: parseInt(params.bitrate || ttsDefaults.bitrate)
      };

      // Subtitles feature removed for simplicity
      return result;
    }
    ```
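The decode step in processTTSResponse relies on Node's built-in hex decoding. A standalone illustration (the hex string here is a made-up four-byte sample, not real API output):

```typescript
// The API returns audio as a hex-encoded string; Buffer.from(..., "hex")
// converts it to raw bytes ready to write to disk.
const audioHex = "49443303"; // "ID3" + a version byte, as a hex string
const audioBytes = Buffer.from(audioHex, "hex");
// audioBytes now holds binary data suitable for fs.writeFile / FileHandler.writeFile
```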
  • src/index.ts:90-119 (registration)

    MCP tool registration for speech generation (named 'submit_speech_generation'), which queues the generateSpeech call via the task manager for rate limiting.

    ```typescript
    server.registerTool(
      "submit_speech_generation",
      {
        title: "Submit Speech Generation Task",
        description: "Convert text to speech asynchronously. RECOMMENDED: Submit multiple tasks in batch to saturate rate limits, then call task_barrier once to wait for all completions. Returns task ID only - actual files available after task_barrier.",
        inputSchema: textToSpeechSchema.shape
      },
      async (params: unknown): Promise<ToolResponse> => {
        try {
          const validatedParams = validateTTSParams(params);
          const { taskId } = await taskManager.submitTTSTask(async () => {
            return await ttsService.generateSpeech(validatedParams);
          });
          return { content: [{ type: "text", text: `Task ${taskId} submitted` }] };
        } catch (error: any) {
          ErrorHandler.logError(error, { tool: 'submit_speech_generation', params });
          return { content: [{ type: "text", text: `❌ Failed to submit TTS task: ${ErrorHandler.formatErrorForUser(error)}` }] };
        }
      }
    );
    ```
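The tool description recommends batching submissions and waiting once on task_barrier. That queueing pattern can be sketched with a simplified in-memory task manager (a hypothetical stand-in for the real taskManager, not its actual implementation):

```typescript
// Simplified sketch of the submit-then-barrier pattern: each submit starts
// work immediately and records the promise; barrier awaits everything at once.
class TaskManagerSketch {
  private pending: Promise<unknown>[] = [];

  submit<T>(work: () => Promise<T>): { taskId: number } {
    this.pending.push(work());          // kick off the task
    return { taskId: this.pending.length }; // sequential task IDs, for illustration
  }

  async barrier(): Promise<void> {
    await Promise.all(this.pending);    // wait for all submitted tasks
    this.pending = [];
  }
}

const tm = new TaskManagerSketch();
const { taskId } = tm.submit(async () => "speech task 1");
tm.submit(async () => "speech task 2");
```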

MCP directory API

We provide all the information about MCP servers via our MCP API.

```shell
curl -X GET 'https://glama.ai/api/mcp/v1/servers/PsychArch/minimax-mcp-tools'
```

If you have feedback or need assistance with the MCP directory API, please join our Discord server