
Submit Speech Generation Task

submit_speech_generation

Convert text to speech asynchronously. Submit text with voice, emotion, and audio settings; get a task ID. Use task_barrier to collect completed audio files.

Instructions

Convert text to speech asynchronously. RECOMMENDED: submit multiple tasks in a batch to saturate rate limits, then call task_barrier once to wait for all completions. Returns a task ID only; the actual audio files become available after task_barrier.
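A minimal sketch of the recommended batch workflow. The `client` object and its `callTool` method are hypothetical stand-ins for a real MCP client; the point is the call shape (submit everything, then one barrier), not the transport:

```typescript
// Stub standing in for a real MCP client (hypothetical API).
type ToolResult = { content: { type: string; text: string }[] };

const client = {
  async callTool(name: string, args: Record<string, unknown>): Promise<ToolResult> {
    // A real client would send the request over the MCP transport.
    return { content: [{ type: "text", text: `Task stub-${name} submitted` }] };
  },
};

async function batchTTS(lines: string[]): Promise<number> {
  // Submit every task first so requests overlap up to the rate limit...
  const results = await Promise.all(
    lines.map((text, i) =>
      client.callTool("submit_speech_generation", {
        text,
        outputFile: `/tmp/chapter-${i + 1}.mp3`,
        voiceId: "audiobook_female_1",
      })
    )
  );
  // ...then wait once for all audio files to land on disk.
  await client.callTool("task_barrier", {});
  return results.length;
}
```

Submitting sequentially and waiting after each task would serialize the work; the single barrier at the end is what lets the server process the batch concurrently.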

Input Schema

  • text (required) — Text to convert to speech. Use newlines for paragraph breaks. For custom pauses, insert <#x#> where x is seconds (0.01-99.99, max 2 decimals); pause markers must sit between pronounceable text and cannot be consecutive.
  • outputFile (required) — Absolute path for the audio file.
  • highQuality (optional, default false) — Use the high-quality model (speech-2.8-hd) for audiobooks/premium content; false uses the faster speech-2.8-turbo.
  • voiceId (optional, default female-shaonv) — Voice ID for speech generation. Available voices: male-qn-qingse (Youthful Young Man), male-qn-jingying (Elite Young Man), male-qn-badao (Domineering Young Man), male-qn-daxuesheng (Young College Student), female-shaonv (Young Girl), female-yujie (Mature "Yujie"), female-chengshu (Mature Woman), female-tianmei (Sweet Woman), presenter_male (Male Presenter), presenter_female (Female Presenter), audiobook_male_1 (Male Audiobook 1), audiobook_male_2 (Male Audiobook 2), audiobook_female_1 (Female Audiobook 1), audiobook_female_2 (Female Audiobook 2), male-qn-qingse-jingpin (Youthful Young Man, beta), male-qn-jingying-jingpin (Elite Young Man, beta), male-qn-badao-jingpin (Domineering Young Man, beta), male-qn-daxuesheng-jingpin (Young College Student, beta), female-shaonv-jingpin (Young Girl, beta), female-yujie-jingpin (Mature "Yujie", beta), female-chengshu-jingpin (Mature Woman, beta), female-tianmei-jingpin (Sweet Woman, beta), clever_boy (Clever Boy), cute_boy (Cute Boy), lovely_girl (Lovely Girl), cartoon_pig (Cartoon Pig Xiaoqi), bingjiao_didi (Yandere Little Brother), junlang_nanyou (Handsome Boyfriend), chunzhen_xuedi (Innocent Junior), lengdan_xiongzhang (Aloof Senior), badao_shaoye (Domineering Young Master), tianxin_xiaoling (Sweetheart Xiaoling), qiaopi_mengmei (Playful Cute Girl), wumei_yujie (Charming Mature Woman), diadia_xuemei (Coquettish Junior Girl), danya_xuejie (Elegant Senior Girl), Santa_Claus, Grinch, Rudolph, Arnold, Charming_Santa, Charming_Lady, Sweet_Girl, Cute_Elf, Attractive_Girl, Serene_Woman.
  • speed (optional, default 1.0) — Speech speed multiplier (0.5-2). Higher values = faster speech.
  • volume (optional, default 1.0) — Audio volume level (0.1-10). Higher values = louder audio.
  • pitch (optional, default 0) — Pitch adjustment in semitones (-12 to 12). Negative = lower pitch, positive = higher pitch.
  • emotion (optional, default neutral) — Emotional tone of the speech. Options: neutral, happy, sad, angry, fearful, disgusted, surprised.
  • format (optional, default mp3) — Output audio format. Options: mp3, wav, flac, pcm.
  • sampleRate (optional, default 32000) — Audio sample rate in Hz. Options: 8000, 16000, 22050, 24000, 32000, 44100.
  • bitrate (optional, default 128000) — Audio bitrate in bps. Options: 64000, 96000, 128000, 160000, 192000, 224000, 256000, 320000.
  • languageBoost (optional, default auto) — Enhance recognition for specific languages/dialects. Options: Chinese, Chinese,Yue, English, Arabic, Russian, Spanish, French, Portuguese, German, Turkish, Dutch, Ukrainian, Vietnamese, Indonesian, Japanese, Italian, Korean, Thai, Polish, Romanian, Greek, Czech, Finnish, Hindi, Bulgarian, Danish, Hebrew, Malay, Persian, Slovak, Swedish, Croatian, Filipino, Hungarian, Norwegian, Slovenian, Catalan, Nynorsk, Tamil, Afrikaans, auto. Use "auto" for automatic detection.
  • intensity (optional) — Voice intensity adjustment (-100 to 100). Closer to -100 = more robust voice; closer to 100 = softer voice.
  • timbre (optional) — Voice timbre adjustment (-100 to 100). Closer to -100 = more mellow; closer to 100 = more crisp.
  • sound_effects (optional) — Sound effect to apply. Options: spacious_echo (spacious echo), auditorium_echo (auditorium broadcast), lofi_telephone (telephone distortion), robotic (robotic voice). Only one sound effect per request.
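The pause-marker rules for the text parameter (0.01-99.99 seconds, at most 2 decimals, only between pronounceable text, never consecutive) can be honored with a small helper. This `joinWithPauses` function is a hypothetical sketch, not part of the server:

```typescript
// Join sentences with <#x#> pause markers per the `text` parameter rules:
// x is seconds in 0.01-99.99 with at most 2 decimals, markers must sit
// between pronounceable text, and two markers may never be adjacent.
function joinWithPauses(sentences: string[], pauseSeconds: number): string {
  if (pauseSeconds < 0.01 || pauseSeconds > 99.99) {
    throw new RangeError("pause must be 0.01-99.99 seconds");
  }
  // toFixed(2) guarantees the max-2-decimals constraint.
  const marker = `<#${pauseSeconds.toFixed(2)}#>`;
  // Drop empty/whitespace-only segments so two markers never end up adjacent.
  return sentences.filter((s) => s.trim().length > 0).join(marker);
}

// joinWithPauses(["Hello.", "World."], 0.5) → "Hello.<#0.50#>World."
```

Filtering empty segments before joining is what prevents the "consecutive markers" error when the input contains blank lines.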

Implementation Reference

  • src/index.ts:89-120 (registration)
    Tool registration for 'submit_speech_generation' via server.registerTool(), with title, description, and inputSchema from textToSpeechSchema.
    // Text-to-speech tool
    server.registerTool(
      "submit_speech_generation",
      {
        title: "Submit Speech Generation Task", 
        description: "Convert text to speech asynchronously. RECOMMENDED: Submit multiple tasks in batch to saturate rate limits, then call task_barrier once to wait for all completions. Returns task ID only - actual files available after task_barrier.",
        inputSchema: textToSpeechSchema.shape
      },
      async (params: unknown): Promise<ToolResponse> => {
        try {
          const validatedParams = validateTTSParams(params);
          const { taskId } = await taskManager.submitTTSTask(async () => {
            return await ttsService.generateSpeech(validatedParams);
          });
    
          return {
            content: [{
              type: "text",
              text: `Task ${taskId} submitted`
            }]
          };
        } catch (error: any) {
          ErrorHandler.logError(error, { tool: 'submit_speech_generation', params });
          return {
            content: [{
              type: "text",
              text: `❌ Failed to submit TTS task: ${ErrorHandler.formatErrorForUser(error)}`
            }]
          };
        }
      }
    );
  • Handler function that validates params via validateTTSParams, submits async TTS task via taskManager.submitTTSTask calling ttsService.generateSpeech, and returns task ID.
  • Input schema (textToSpeechSchema) defining all TTS parameters: text, outputFile, highQuality, voiceId, speed, volume, pitch, emotion, format, sampleRate, bitrate, languageBoost, intensity, timbre, sound_effects.
    export const textToSpeechSchema = z.object({
      text: z.string()
        .min(1, 'Text is required')
        .max(CONSTRAINTS.TTS.TEXT_MAX_LENGTH, `Text to convert to speech. Max ${CONSTRAINTS.TTS.TEXT_MAX_LENGTH} characters. Use newlines for paragraph breaks. For custom pauses, insert <#x#> where x is seconds (0.01-99.99, max 2 decimals). Pause markers must be between pronounceable text and cannot be consecutive`),
        
      outputFile: filePathSchema.describe('Absolute path for audio file'),
      
      highQuality: z.boolean()
        .default(false)
        .describe('Use high-quality model (speech-2.8-hd) for audiobooks/premium content. Default: false (uses faster speech-2.8-turbo)'),
        
      voiceId: z.enum(Object.keys(VOICES) as [VoiceId, ...VoiceId[]])
        .default('female-shaonv' as VoiceId)
        .describe(`Voice ID for speech generation. Available voices: ${Object.keys(VOICES).map(id => `${id} (${VOICES[id as VoiceId]?.name || id})`).join(', ')}`),
        
      speed: z.number()
        .min(CONSTRAINTS.TTS.SPEED_MIN)
        .max(CONSTRAINTS.TTS.SPEED_MAX)
        .default(1.0)
        .describe(`Speech speed multiplier (${CONSTRAINTS.TTS.SPEED_MIN}-${CONSTRAINTS.TTS.SPEED_MAX}). Higher values = faster speech`),
        
      volume: z.number()
        .min(CONSTRAINTS.TTS.VOLUME_MIN)
        .max(CONSTRAINTS.TTS.VOLUME_MAX)
        .default(1.0)
        .describe(`Audio volume level (${CONSTRAINTS.TTS.VOLUME_MIN}-${CONSTRAINTS.TTS.VOLUME_MAX}). Higher values = louder audio`),
        
      pitch: z.number()
        .min(CONSTRAINTS.TTS.PITCH_MIN)
        .max(CONSTRAINTS.TTS.PITCH_MAX)
        .default(0)
        .describe(`Pitch adjustment in semitones (${CONSTRAINTS.TTS.PITCH_MIN} to ${CONSTRAINTS.TTS.PITCH_MAX}). Negative = lower pitch, Positive = higher pitch`),
        
      emotion: z.enum(CONSTRAINTS.TTS.EMOTIONS as readonly [Emotion, ...Emotion[]])
        .default('neutral' as Emotion)
        .describe(`Emotional tone of the speech. Options: ${CONSTRAINTS.TTS.EMOTIONS.join(', ')}`),
        
      format: z.enum(CONSTRAINTS.TTS.FORMATS as readonly [AudioFormat, ...AudioFormat[]])
        .default('mp3' as AudioFormat)
        .describe(`Output audio format. Options: ${CONSTRAINTS.TTS.FORMATS.join(', ')}`),
        
      sampleRate: z.enum(CONSTRAINTS.TTS.SAMPLE_RATES as readonly [SampleRate, ...SampleRate[]])
        .default("32000" as SampleRate)
        .describe(`Audio sample rate in Hz. Options: ${CONSTRAINTS.TTS.SAMPLE_RATES.join(', ')}`),
        
      bitrate: z.enum(CONSTRAINTS.TTS.BITRATES as readonly [Bitrate, ...Bitrate[]])
        .default("128000" as Bitrate)  
        .describe(`Audio bitrate in bps. Options: ${CONSTRAINTS.TTS.BITRATES.join(', ')}`),
        
      languageBoost: z.string().default('auto').describe('Enhance recognition for specific languages/dialects. Options: Chinese, Chinese,Yue, English, Arabic, Russian, Spanish, French, Portuguese, German, Turkish, Dutch, Ukrainian, Vietnamese, Indonesian, Japanese, Italian, Korean, Thai, Polish, Romanian, Greek, Czech, Finnish, Hindi, Bulgarian, Danish, Hebrew, Malay, Persian, Slovak, Swedish, Croatian, Filipino, Hungarian, Norwegian, Slovenian, Catalan, Nynorsk, Tamil, Afrikaans, auto. Use "auto" for automatic detection'),
    
      
      intensity: z.number()
        .int()
        .min(CONSTRAINTS.TTS.VOICE_MODIFY_INTENSITY_MIN)
        .max(CONSTRAINTS.TTS.VOICE_MODIFY_INTENSITY_MAX)
        .optional()
        .describe('Voice intensity adjustment (-100 to 100). Values closer to -100 make voice more robust, closer to 100 make voice softer'),
        
      timbre: z.number()
        .int()
        .min(CONSTRAINTS.TTS.VOICE_MODIFY_TIMBRE_MIN)
        .max(CONSTRAINTS.TTS.VOICE_MODIFY_TIMBRE_MAX)
        .optional()
        .describe('Voice timbre adjustment (-100 to 100). Values closer to -100 make voice more mellow, closer to 100 make voice more crisp'),
        
      sound_effects: z.enum(CONSTRAINTS.TTS.SOUND_EFFECTS as readonly [SoundEffect, ...SoundEffect[]])
        .optional()
        .describe(getSoundEffectsDescription())
    });
  • validateTTSParams function that parses and validates TTS parameters using textToSpeechSchema.
    export function validateTTSParams(params: unknown): TextToSpeechParams {
      try {
        return textToSpeechSchema.parse(params);
      } catch (error) {
        if (error instanceof z.ZodError) {
          const messages = error.errors.map(e => `${e.path.join('.')}: ${e.message}`);
          throw new Error(`Validation failed: ${messages.join(', ')}`);
        }
        throw error;
      }
    }
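For callers, the useful part of validateTTSParams is the error shape: a single Error whose message concatenates each Zod issue as "path: message". A dependency-free sketch of the same range checks and message format, using constraint values from the parameter table (the helper itself is hypothetical):

```typescript
// Sketch of the numeric range checks the Zod schema enforces, producing the
// same "Validation failed: path: message" format validateTTSParams throws.
interface RangeRule { min: number; max: number }

const RANGES: Record<string, RangeRule> = {
  speed: { min: 0.5, max: 2 },
  volume: { min: 0.1, max: 10 },
  pitch: { min: -12, max: 12 },
};

function checkRanges(params: Record<string, number>): void {
  const messages: string[] = [];
  for (const [key, rule] of Object.entries(RANGES)) {
    const value = params[key];
    if (value !== undefined && (value < rule.min || value > rule.max)) {
      messages.push(`${key}: must be between ${rule.min} and ${rule.max}`);
    }
  }
  if (messages.length > 0) {
    // Aggregate all failures into one error, mirroring the ZodError handling.
    throw new Error(`Validation failed: ${messages.join(", ")}`);
  }
}
```

Aggregating every failure into one message lets an agent fix all out-of-range parameters in a single retry instead of discovering them one at a time.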
  • TextToSpeechService.generateSpeech method that builds API payload, makes request to text-to-speech endpoint, processes response, and saves audio file.
    async generateSpeech(params: TextToSpeechParams): Promise<TTSResult> {
      try {
        // Build API payload (MCP handles validation)
        const payload = this.buildPayload(params);
        
        // Make API request
        const response = await this.post(API_CONFIG.ENDPOINTS.TEXT_TO_SPEECH, payload) as TTSResponse;
        
        // Process response
        return await this.processTTSResponse(response, params);
        
      } catch (error: any) {
        const processedError = ErrorHandler.handleAPIError(error);
        ErrorHandler.logError(processedError, { service: 'tts', params });
        
        // Throw the error so task manager can properly mark it as failed
        throw processedError;
      }
    }
Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Describes async behavior, return value (task ID), and dependency on task_barrier, but lacks details on error handling, auth, or rate limit specifics.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Three sentences, front-loaded with purpose, then guidance, then return info. No superfluous content.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Covers core functionality and async workflow adequately given 15 parameters and no output schema, but could mention voice selection or edge cases.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema coverage is 93%, so the description adds little beyond the schema; the baseline score of 3 is appropriate.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool converts text to speech asynchronously, distinguishing it from sibling tools like submit_image_generation and task_barrier.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 5/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

Explicitly recommends batching submissions to saturate rate limits and using task_barrier for completion, providing actionable guidance.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
