Submit Speech Generation Task
`submit_speech_generation` — Convert text to speech asynchronously. Submit text with voice, emotion, and audio settings; get a task ID. Use `task_barrier` to collect completed audio files.
Instructions
Convert text to speech asynchronously. RECOMMENDED: Submit multiple tasks in batch to saturate rate limits, then call `task_barrier` once to wait for all completions. Returns a task ID only; the actual audio files become available after `task_barrier`.
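The recommended batch pattern can be sketched as follows. This is an illustrative, hypothetical client-side sketch, not part of this tool's API: the chapter texts and output paths are made up, and the commented-out `callTool` helper is a stand-in for however your MCP client invokes tools. The runnable part only builds the per-task argument objects the tool expects.

```typescript
// Build one submit_speech_generation argument object per chapter,
// submit them all, then wait once with task_barrier.
const chapters: string[] = ["Chapter one text.", "Chapter two text."];

const requests = chapters.map((text, i) => ({
  name: "submit_speech_generation",
  arguments: {
    text,
    outputFile: `/tmp/audiobook/chapter-${i + 1}.mp3`, // hypothetical path
    highQuality: true,            // speech-2.8-hd for audiobook quality
    voiceId: "audiobook_male_1",
    format: "mp3",
  },
}));

// In a real session (callTool is a hypothetical MCP client helper):
//   await Promise.all(requests.map(r => callTool(r)));
//   await callTool({ name: "task_barrier", arguments: {} });
```

Submitting all requests before the single barrier call lets the server run them concurrently instead of serializing one round-trip per chapter.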
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| text | Yes | Text to convert to speech. Maximum length is server-configured. Use newlines for paragraph breaks. For custom pauses, insert <#x#> where x is seconds (0.01-99.99, max 2 decimals); pause markers must be placed between pronounceable text and cannot be consecutive | |
| outputFile | Yes | Absolute path for audio file | |
| highQuality | No | Use high-quality model (speech-2.8-hd) for audiobooks/premium content. Default: false (uses faster speech-2.8-turbo) | |
| voiceId | No | Voice ID for speech generation. Available voices: male-qn-qingse (Youthful Young Male), male-qn-jingying (Elite Young Male), male-qn-badao (Domineering Young Male), male-qn-daxuesheng (Young College Student), female-shaonv (Young Girl), female-yujie (Assertive Mature Female), female-chengshu (Mature Woman), female-tianmei (Sweet Woman), presenter_male (Male Presenter), presenter_female (Female Presenter), audiobook_male_1 (Male Audiobook 1), audiobook_male_2 (Male Audiobook 2), audiobook_female_1 (Female Audiobook 1), audiobook_female_2 (Female Audiobook 2), male-qn-qingse-jingpin (Youthful Young Male beta), male-qn-jingying-jingpin (Elite Young Male beta), male-qn-badao-jingpin (Domineering Young Male beta), male-qn-daxuesheng-jingpin (Young College Student beta), female-shaonv-jingpin (Young Girl beta), female-yujie-jingpin (Assertive Mature Female beta), female-chengshu-jingpin (Mature Woman beta), female-tianmei-jingpin (Sweet Woman beta), clever_boy (Clever Boy), cute_boy (Cute Boy), lovely_girl (Adorable Girl), cartoon_pig (Cartoon Pig Xiaoqi), bingjiao_didi (Yandere Little Brother), junlang_nanyou (Handsome Boyfriend), chunzhen_xuedi (Innocent Junior), lengdan_xiongzhang (Aloof Senior), badao_shaoye (Domineering Young Master), tianxin_xiaoling (Sweetheart Xiaoling), qiaopi_mengmei (Playful Cutie), wumei_yujie (Charming Mature Female), diadia_xuemei (Coquettish Junior Girl), danya_xuejie (Elegant Senior Girl), Santa_Claus (Santa Claus), Grinch (Grinch), Rudolph (Rudolph), Arnold (Arnold), Charming_Santa (Charming Santa), Charming_Lady (Charming Lady), Sweet_Girl (Sweet Girl), Cute_Elf (Cute Elf), Attractive_Girl (Attractive Girl), Serene_Woman (Serene Woman) | female-shaonv |
| speed | No | Speech speed multiplier (0.5-2). Higher values = faster speech | |
| volume | No | Audio volume level (0.1-10). Higher values = louder audio | |
| pitch | No | Pitch adjustment in semitones (-12 to 12). Negative = lower pitch, Positive = higher pitch | |
| emotion | No | Emotional tone of the speech. Options: neutral, happy, sad, angry, fearful, disgusted, surprised | neutral |
| format | No | Output audio format. Options: mp3, wav, flac, pcm | mp3 |
| sampleRate | No | Audio sample rate in Hz. Options: 8000, 16000, 22050, 24000, 32000, 44100 | 32000 |
| bitrate | No | Audio bitrate in bps. Options: 64000, 96000, 128000, 160000, 192000, 224000, 256000, 320000 | 128000 |
| languageBoost | No | Enhance recognition for specific languages/dialects. Options: Chinese, Chinese,Yue, English, Arabic, Russian, Spanish, French, Portuguese, German, Turkish, Dutch, Ukrainian, Vietnamese, Indonesian, Japanese, Italian, Korean, Thai, Polish, Romanian, Greek, Czech, Finnish, Hindi, Bulgarian, Danish, Hebrew, Malay, Persian, Slovak, Swedish, Croatian, Filipino, Hungarian, Norwegian, Slovenian, Catalan, Nynorsk, Tamil, Afrikaans, auto. Use "auto" for automatic detection | auto |
| intensity | No | Voice intensity adjustment (-100 to 100). Values closer to -100 make voice more robust, closer to 100 make voice softer | |
| timbre | No | Voice timbre adjustment (-100 to 100). Values closer to -100 make voice more mellow, closer to 100 make voice more crisp | |
| sound_effects | No | Sound effects. Options: spacious_echo (spacious echo), auditorium_echo (auditorium broadcast), lofi_telephone (telephone distortion), robotic (robotic voice). Only one sound effect can be used per request | |
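The numeric parameters above are range-checked server-side. As a rough illustration of the documented ranges (this is a hypothetical client-side pre-check, not the server's actual validator), a pre-flight check might look like:

```typescript
// Illustrative client-side pre-check mirroring the documented ranges:
// speed 0.5-2, volume 0.1-10, pitch -12 to 12 semitones.
// Hypothetical helper; the server validates with its own zod schema.
interface TTSNumericParams {
  speed?: number;
  volume?: number;
  pitch?: number;
}

function checkRanges(p: TTSNumericParams): string[] {
  const errors: string[] = [];
  const inRange = (v: number, lo: number, hi: number) => v >= lo && v <= hi;
  if (p.speed !== undefined && !inRange(p.speed, 0.5, 2)) {
    errors.push(`speed ${p.speed} outside 0.5-2`);
  }
  if (p.volume !== undefined && !inRange(p.volume, 0.1, 10)) {
    errors.push(`volume ${p.volume} outside 0.1-10`);
  }
  if (p.pitch !== undefined && !inRange(p.pitch, -12, 12)) {
    errors.push(`pitch ${p.pitch} outside -12 to 12`);
  }
  return errors;
}
```

Catching an out-of-range value before submission avoids burning a task slot on a request the server will reject at validation time.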
Implementation Reference
- src/index.ts:89-120 (registration): Tool registration for 'submit_speech_generation' via server.registerTool(), with title, description, and inputSchema from textToSpeechSchema.
```typescript
// Text-to-speech tool
server.registerTool(
  "submit_speech_generation",
  {
    title: "Submit Speech Generation Task",
    description: "Convert text to speech asynchronously. RECOMMENDED: Submit multiple tasks in batch to saturate rate limits, then call task_barrier once to wait for all completions. Returns task ID only - actual files available after task_barrier.",
    inputSchema: textToSpeechSchema.shape
  },
  async (params: unknown): Promise<ToolResponse> => {
    try {
      const validatedParams = validateTTSParams(params);
      const { taskId } = await taskManager.submitTTSTask(async () => {
        return await ttsService.generateSpeech(validatedParams);
      });
      return { content: [{ type: "text", text: `Task ${taskId} submitted` }] };
    } catch (error: any) {
      ErrorHandler.logError(error, { tool: 'submit_speech_generation', params });
      return { content: [{ type: "text", text: `❌ Failed to submit TTS task: ${ErrorHandler.formatErrorForUser(error)}` }] };
    }
  }
);
```

- src/index.ts:97-120 (handler): Handler function that validates params via validateTTSParams, submits an async TTS task via taskManager.submitTTSTask calling ttsService.generateSpeech, and returns the task ID.
```typescript
async (params: unknown): Promise<ToolResponse> => {
  try {
    const validatedParams = validateTTSParams(params);
    const { taskId } = await taskManager.submitTTSTask(async () => {
      return await ttsService.generateSpeech(validatedParams);
    });
    return { content: [{ type: "text", text: `Task ${taskId} submitted` }] };
  } catch (error: any) {
    ErrorHandler.logError(error, { tool: 'submit_speech_generation', params });
    return { content: [{ type: "text", text: `❌ Failed to submit TTS task: ${ErrorHandler.formatErrorForUser(error)}` }] };
  }
}
);
```

- src/config/schemas.ts:72-141 (schema): Input schema (textToSpeechSchema) defining all TTS parameters: text, outputFile, highQuality, voiceId, speed, volume, pitch, emotion, format, sampleRate, bitrate, languageBoost, intensity, timbre, sound_effects.
```typescript
export const textToSpeechSchema = z.object({
  text: z.string()
    .min(1, 'Text is required')
    .max(CONSTRAINTS.TTS.TEXT_MAX_LENGTH, `Text to convert to speech. Max ${CONSTRAINTS.TTS.TEXT_MAX_LENGTH} characters. Use newlines for paragraph breaks. For custom pauses, insert <#x#> where x is seconds (0.01-99.99, max 2 decimals). Pause markers must be between pronounceable text and cannot be consecutive`),
  outputFile: filePathSchema.describe('Absolute path for audio file'),
  highQuality: z.boolean()
    .default(false)
    .describe('Use high-quality model (speech-2.8-hd) for audiobooks/premium content. Default: false (uses faster speech-2.8-turbo)'),
  voiceId: z.enum(Object.keys(VOICES) as [VoiceId, ...VoiceId[]])
    .default('female-shaonv' as VoiceId)
    .describe(`Voice ID for speech generation. Available voices: ${Object.keys(VOICES).map(id => `${id} (${VOICES[id as VoiceId]?.name || id})`).join(', ')}`),
  speed: z.number()
    .min(CONSTRAINTS.TTS.SPEED_MIN)
    .max(CONSTRAINTS.TTS.SPEED_MAX)
    .default(1.0)
    .describe(`Speech speed multiplier (${CONSTRAINTS.TTS.SPEED_MIN}-${CONSTRAINTS.TTS.SPEED_MAX}). Higher values = faster speech`),
  volume: z.number()
    .min(CONSTRAINTS.TTS.VOLUME_MIN)
    .max(CONSTRAINTS.TTS.VOLUME_MAX)
    .default(1.0)
    .describe(`Audio volume level (${CONSTRAINTS.TTS.VOLUME_MIN}-${CONSTRAINTS.TTS.VOLUME_MAX}). Higher values = louder audio`),
  pitch: z.number()
    .min(CONSTRAINTS.TTS.PITCH_MIN)
    .max(CONSTRAINTS.TTS.PITCH_MAX)
    .default(0)
    .describe(`Pitch adjustment in semitones (${CONSTRAINTS.TTS.PITCH_MIN} to ${CONSTRAINTS.TTS.PITCH_MAX}). Negative = lower pitch, Positive = higher pitch`),
  emotion: z.enum(CONSTRAINTS.TTS.EMOTIONS as readonly [Emotion, ...Emotion[]])
    .default('neutral' as Emotion)
    .describe(`Emotional tone of the speech. Options: ${CONSTRAINTS.TTS.EMOTIONS.join(', ')}`),
  format: z.enum(CONSTRAINTS.TTS.FORMATS as readonly [AudioFormat, ...AudioFormat[]])
    .default('mp3' as AudioFormat)
    .describe(`Output audio format. Options: ${CONSTRAINTS.TTS.FORMATS.join(', ')}`),
  sampleRate: z.enum(CONSTRAINTS.TTS.SAMPLE_RATES as readonly [SampleRate, ...SampleRate[]])
    .default("32000" as SampleRate)
    .describe(`Audio sample rate in Hz. Options: ${CONSTRAINTS.TTS.SAMPLE_RATES.join(', ')}`),
  bitrate: z.enum(CONSTRAINTS.TTS.BITRATES as readonly [Bitrate, ...Bitrate[]])
    .default("128000" as Bitrate)
    .describe(`Audio bitrate in bps. Options: ${CONSTRAINTS.TTS.BITRATES.join(', ')}`),
  languageBoost: z.string()
    .default('auto')
    .describe('Enhance recognition for specific languages/dialects. Options: Chinese, Chinese,Yue, English, Arabic, Russian, Spanish, French, Portuguese, German, Turkish, Dutch, Ukrainian, Vietnamese, Indonesian, Japanese, Italian, Korean, Thai, Polish, Romanian, Greek, Czech, Finnish, Hindi, Bulgarian, Danish, Hebrew, Malay, Persian, Slovak, Swedish, Croatian, Filipino, Hungarian, Norwegian, Slovenian, Catalan, Nynorsk, Tamil, Afrikaans, auto. Use "auto" for automatic detection'),
  intensity: z.number()
    .int()
    .min(CONSTRAINTS.TTS.VOICE_MODIFY_INTENSITY_MIN)
    .max(CONSTRAINTS.TTS.VOICE_MODIFY_INTENSITY_MAX)
    .optional()
    .describe('Voice intensity adjustment (-100 to 100). Values closer to -100 make voice more robust, closer to 100 make voice softer'),
  timbre: z.number()
    .int()
    .min(CONSTRAINTS.TTS.VOICE_MODIFY_TIMBRE_MIN)
    .max(CONSTRAINTS.TTS.VOICE_MODIFY_TIMBRE_MAX)
    .optional()
    .describe('Voice timbre adjustment (-100 to 100). Values closer to -100 make voice more mellow, closer to 100 make voice more crisp'),
  sound_effects: z.enum(CONSTRAINTS.TTS.SOUND_EFFECTS as readonly [SoundEffect, ...SoundEffect[]])
    .optional()
    .describe(getSoundEffectsDescription())
});
```

- src/config/schemas.ts:338-348 (helper): validateTTSParams function that parses and validates TTS parameters using textToSpeechSchema.
```typescript
export function validateTTSParams(params: unknown): TextToSpeechParams {
  try {
    return textToSpeechSchema.parse(params);
  } catch (error) {
    if (error instanceof z.ZodError) {
      const messages = error.errors.map(e => `${e.path.join('.')}: ${e.message}`);
      throw new Error(`Validation failed: ${messages.join(', ')}`);
    }
    throw error;
  }
}
```

- src/services/tts-service.ts:60-78 (handler): TextToSpeechService.generateSpeech method that builds the API payload, makes the request to the text-to-speech endpoint, processes the response, and saves the audio file.
```typescript
async generateSpeech(params: TextToSpeechParams): Promise<TTSResult> {
  try {
    // Build API payload (MCP handles validation)
    const payload = this.buildPayload(params);

    // Make API request
    const response = await this.post(API_CONFIG.ENDPOINTS.TEXT_TO_SPEECH, payload) as TTSResponse;

    // Process response
    return await this.processTTSResponse(response, params);
  } catch (error: any) {
    const processedError = ErrorHandler.handleAPIError(error);
    ErrorHandler.logError(processedError, { service: 'tts', params });
    // Throw the error so task manager can properly mark it as failed
    throw processedError;
  }
}
```
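The submit/barrier split used throughout this tool can be reduced to a minimal in-memory sketch. The `SimpleTaskManager` below is hypothetical (the project's real TaskManager and `submitTTSTask` are not shown in this section); it only demonstrates the pattern of starting work immediately, returning a task ID, and collecting results or errors at the barrier.

```typescript
// Minimal, hypothetical sketch of the submit/barrier pattern behind
// submit_speech_generation and task_barrier. Not the project's code.
class SimpleTaskManager<T> {
  private tasks = new Map<string, Promise<T>>();
  private nextId = 1;

  // Start the work immediately, but hand back only an ID.
  submit(work: () => Promise<T>): { taskId: string } {
    const taskId = `task-${this.nextId++}`;
    this.tasks.set(taskId, work());
    return { taskId };
  }

  // Wait for every submitted task; collect each result (or error) by ID.
  async barrier(): Promise<Map<string, T | Error>> {
    const results = new Map<string, T | Error>();
    for (const [id, promise] of this.tasks) {
      try {
        results.set(id, await promise);
      } catch (e) {
        results.set(id, e as Error);
      }
    }
    this.tasks.clear();
    return results;
  }
}
```

Because `submit` kicks off the promise before returning, many tasks run concurrently while the caller keeps submitting, and a failed task surfaces as an `Error` at the barrier instead of rejecting the submit call, which matches the error-throwing behavior of `generateSpeech` above.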