
Submit Speech Generation Task

submit_speech_generation

Convert text to speech asynchronously. Submit text with voice, emotion, and audio settings; get a task ID. Use task_barrier to collect completed audio files.

Instructions

Convert text to speech asynchronously. RECOMMENDED: submit multiple tasks in a batch to saturate rate limits, then call task_barrier once to wait for all completions. Returns a task ID only; the actual audio files become available after task_barrier.
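A minimal sketch of the recommended batch workflow. The `client` object and its `callTool` method are hypothetical stand-ins for a real MCP client; the point is the call shape (submit everything, then one barrier), not the transport:

```typescript
// Stub standing in for a real MCP client (hypothetical API).
type ToolResult = { content: { type: string; text: string }[] };

const client = {
  async callTool(name: string, args: Record<string, unknown>): Promise<ToolResult> {
    // A real client would send the request over the MCP transport.
    return { content: [{ type: "text", text: `Task stub-${name} submitted` }] };
  },
};

async function batchTTS(lines: string[]): Promise<number> {
  // Submit every task first so requests overlap up to the rate limit...
  const results = await Promise.all(
    lines.map((text, i) =>
      client.callTool("submit_speech_generation", {
        text,
        outputFile: `/tmp/chapter-${i + 1}.mp3`,
        voiceId: "audiobook_female_1",
      })
    )
  );
  // ...then wait once for all audio files to land on disk.
  await client.callTool("task_barrier", {});
  return results.length;
}
```

Submitting sequentially and waiting after each task would serialize the work; the single barrier at the end is what lets the server process the batch concurrently.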

Input Schema

  • text (required) — Text to convert to speech. Use newlines for paragraph breaks. For custom pauses, insert <#x#> where x is seconds (0.01-99.99, max 2 decimals); pause markers must sit between pronounceable text and cannot be consecutive.
  • outputFile (required) — Absolute path for the audio file.
  • highQuality (optional, default false) — Use the high-quality model (speech-2.8-hd) for audiobooks/premium content; false uses the faster speech-2.8-turbo.
  • voiceId (optional, default female-shaonv) — Voice ID for speech generation. Available voices: male-qn-qingse (Youthful Young Man), male-qn-jingying (Elite Young Man), male-qn-badao (Domineering Young Man), male-qn-daxuesheng (Young College Student), female-shaonv (Young Girl), female-yujie (Mature "Yujie"), female-chengshu (Mature Woman), female-tianmei (Sweet Woman), presenter_male (Male Presenter), presenter_female (Female Presenter), audiobook_male_1 (Male Audiobook 1), audiobook_male_2 (Male Audiobook 2), audiobook_female_1 (Female Audiobook 1), audiobook_female_2 (Female Audiobook 2), male-qn-qingse-jingpin (Youthful Young Man, beta), male-qn-jingying-jingpin (Elite Young Man, beta), male-qn-badao-jingpin (Domineering Young Man, beta), male-qn-daxuesheng-jingpin (Young College Student, beta), female-shaonv-jingpin (Young Girl, beta), female-yujie-jingpin (Mature "Yujie", beta), female-chengshu-jingpin (Mature Woman, beta), female-tianmei-jingpin (Sweet Woman, beta), clever_boy (Clever Boy), cute_boy (Cute Boy), lovely_girl (Lovely Girl), cartoon_pig (Cartoon Pig Xiaoqi), bingjiao_didi (Yandere Little Brother), junlang_nanyou (Handsome Boyfriend), chunzhen_xuedi (Innocent Junior), lengdan_xiongzhang (Aloof Senior), badao_shaoye (Domineering Young Master), tianxin_xiaoling (Sweetheart Xiaoling), qiaopi_mengmei (Playful Cute Girl), wumei_yujie (Charming Mature Woman), diadia_xuemei (Coquettish Junior Girl), danya_xuejie (Elegant Senior Girl), Santa_Claus, Grinch, Rudolph, Arnold, Charming_Santa, Charming_Lady, Sweet_Girl, Cute_Elf, Attractive_Girl, Serene_Woman.
  • speed (optional, default 1.0) — Speech speed multiplier (0.5-2). Higher values = faster speech.
  • volume (optional, default 1.0) — Audio volume level (0.1-10). Higher values = louder audio.
  • pitch (optional, default 0) — Pitch adjustment in semitones (-12 to 12). Negative = lower pitch, positive = higher pitch.
  • emotion (optional, default neutral) — Emotional tone of the speech. Options: neutral, happy, sad, angry, fearful, disgusted, surprised.
  • format (optional, default mp3) — Output audio format. Options: mp3, wav, flac, pcm.
  • sampleRate (optional, default 32000) — Audio sample rate in Hz. Options: 8000, 16000, 22050, 24000, 32000, 44100.
  • bitrate (optional, default 128000) — Audio bitrate in bps. Options: 64000, 96000, 128000, 160000, 192000, 224000, 256000, 320000.
  • languageBoost (optional, default auto) — Enhance recognition for specific languages/dialects. Options: Chinese, Chinese,Yue, English, Arabic, Russian, Spanish, French, Portuguese, German, Turkish, Dutch, Ukrainian, Vietnamese, Indonesian, Japanese, Italian, Korean, Thai, Polish, Romanian, Greek, Czech, Finnish, Hindi, Bulgarian, Danish, Hebrew, Malay, Persian, Slovak, Swedish, Croatian, Filipino, Hungarian, Norwegian, Slovenian, Catalan, Nynorsk, Tamil, Afrikaans, auto. Use "auto" for automatic detection.
  • intensity (optional) — Voice intensity adjustment (-100 to 100). Closer to -100 = more robust voice; closer to 100 = softer voice.
  • timbre (optional) — Voice timbre adjustment (-100 to 100). Closer to -100 = more mellow; closer to 100 = more crisp.
  • sound_effects (optional) — Sound effect to apply. Options: spacious_echo (spacious echo), auditorium_echo (auditorium broadcast), lofi_telephone (telephone distortion), robotic (robotic voice). Only one sound effect per request.
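The pause-marker rules for the text parameter (0.01-99.99 seconds, at most 2 decimals, only between pronounceable text, never consecutive) can be honored with a small helper. This `joinWithPauses` function is a hypothetical sketch, not part of the server:

```typescript
// Join sentences with <#x#> pause markers per the `text` parameter rules:
// x is seconds in 0.01-99.99 with at most 2 decimals, markers must sit
// between pronounceable text, and two markers may never be adjacent.
function joinWithPauses(sentences: string[], pauseSeconds: number): string {
  if (pauseSeconds < 0.01 || pauseSeconds > 99.99) {
    throw new RangeError("pause must be 0.01-99.99 seconds");
  }
  // toFixed(2) guarantees the max-2-decimals constraint.
  const marker = `<#${pauseSeconds.toFixed(2)}#>`;
  // Drop empty/whitespace-only segments so two markers never end up adjacent.
  return sentences.filter((s) => s.trim().length > 0).join(marker);
}

// joinWithPauses(["Hello.", "World."], 0.5) → "Hello.<#0.50#>World."
```

Filtering empty segments before joining is what prevents the "consecutive markers" error when the input contains blank lines.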

Implementation Reference

  • src/index.ts:89-120 (registration)
    Tool registration for 'submit_speech_generation' via server.registerTool(), with title, description, and inputSchema from textToSpeechSchema.
    // Text-to-speech tool
    server.registerTool(
      "submit_speech_generation",
      {
        title: "Submit Speech Generation Task", 
        description: "Convert text to speech asynchronously. RECOMMENDED: Submit multiple tasks in batch to saturate rate limits, then call task_barrier once to wait for all completions. Returns task ID only - actual files available after task_barrier.",
        inputSchema: textToSpeechSchema.shape
      },
      async (params: unknown): Promise<ToolResponse> => {
        try {
          const validatedParams = validateTTSParams(params);
          const { taskId } = await taskManager.submitTTSTask(async () => {
            return await ttsService.generateSpeech(validatedParams);
          });
    
          return {
            content: [{
              type: "text",
              text: `Task ${taskId} submitted`
            }]
          };
        } catch (error: any) {
          ErrorHandler.logError(error, { tool: 'submit_speech_generation', params });
          return {
            content: [{
              type: "text",
              text: `❌ Failed to submit TTS task: ${ErrorHandler.formatErrorForUser(error)}`
            }]
          };
        }
      }
    );
  • Handler function that validates params via validateTTSParams, submits async TTS task via taskManager.submitTTSTask calling ttsService.generateSpeech, and returns task ID.
  • Input schema (textToSpeechSchema) defining all TTS parameters: text, outputFile, highQuality, voiceId, speed, volume, pitch, emotion, format, sampleRate, bitrate, languageBoost, intensity, timbre, sound_effects.
    export const textToSpeechSchema = z.object({
      text: z.string()
        .min(1, 'Text is required')
        .max(CONSTRAINTS.TTS.TEXT_MAX_LENGTH, `Text to convert to speech. Max ${CONSTRAINTS.TTS.TEXT_MAX_LENGTH} characters. Use newlines for paragraph breaks. For custom pauses, insert <#x#> where x is seconds (0.01-99.99, max 2 decimals). Pause markers must be between pronounceable text and cannot be consecutive`),
        
      outputFile: filePathSchema.describe('Absolute path for audio file'),
      
      highQuality: z.boolean()
        .default(false)
        .describe('Use high-quality model (speech-2.8-hd) for audiobooks/premium content. Default: false (uses faster speech-2.8-turbo)'),
        
      voiceId: z.enum(Object.keys(VOICES) as [VoiceId, ...VoiceId[]])
        .default('female-shaonv' as VoiceId)
        .describe(`Voice ID for speech generation. Available voices: ${Object.keys(VOICES).map(id => `${id} (${VOICES[id as VoiceId]?.name || id})`).join(', ')}`),
        
      speed: z.number()
        .min(CONSTRAINTS.TTS.SPEED_MIN)
        .max(CONSTRAINTS.TTS.SPEED_MAX)
        .default(1.0)
        .describe(`Speech speed multiplier (${CONSTRAINTS.TTS.SPEED_MIN}-${CONSTRAINTS.TTS.SPEED_MAX}). Higher values = faster speech`),
        
      volume: z.number()
        .min(CONSTRAINTS.TTS.VOLUME_MIN)
        .max(CONSTRAINTS.TTS.VOLUME_MAX)
        .default(1.0)
        .describe(`Audio volume level (${CONSTRAINTS.TTS.VOLUME_MIN}-${CONSTRAINTS.TTS.VOLUME_MAX}). Higher values = louder audio`),
        
      pitch: z.number()
        .min(CONSTRAINTS.TTS.PITCH_MIN)
        .max(CONSTRAINTS.TTS.PITCH_MAX)
        .default(0)
        .describe(`Pitch adjustment in semitones (${CONSTRAINTS.TTS.PITCH_MIN} to ${CONSTRAINTS.TTS.PITCH_MAX}). Negative = lower pitch, Positive = higher pitch`),
        
      emotion: z.enum(CONSTRAINTS.TTS.EMOTIONS as readonly [Emotion, ...Emotion[]])
        .default('neutral' as Emotion)
        .describe(`Emotional tone of the speech. Options: ${CONSTRAINTS.TTS.EMOTIONS.join(', ')}`),
        
      format: z.enum(CONSTRAINTS.TTS.FORMATS as readonly [AudioFormat, ...AudioFormat[]])
        .default('mp3' as AudioFormat)
        .describe(`Output audio format. Options: ${CONSTRAINTS.TTS.FORMATS.join(', ')}`),
        
      sampleRate: z.enum(CONSTRAINTS.TTS.SAMPLE_RATES as readonly [SampleRate, ...SampleRate[]])
        .default("32000" as SampleRate)
        .describe(`Audio sample rate in Hz. Options: ${CONSTRAINTS.TTS.SAMPLE_RATES.join(', ')}`),
        
      bitrate: z.enum(CONSTRAINTS.TTS.BITRATES as readonly [Bitrate, ...Bitrate[]])
        .default("128000" as Bitrate)  
        .describe(`Audio bitrate in bps. Options: ${CONSTRAINTS.TTS.BITRATES.join(', ')}`),
        
      languageBoost: z.string().default('auto').describe('Enhance recognition for specific languages/dialects. Options: Chinese, Chinese,Yue, English, Arabic, Russian, Spanish, French, Portuguese, German, Turkish, Dutch, Ukrainian, Vietnamese, Indonesian, Japanese, Italian, Korean, Thai, Polish, Romanian, Greek, Czech, Finnish, Hindi, Bulgarian, Danish, Hebrew, Malay, Persian, Slovak, Swedish, Croatian, Filipino, Hungarian, Norwegian, Slovenian, Catalan, Nynorsk, Tamil, Afrikaans, auto. Use "auto" for automatic detection'),
    
      
      intensity: z.number()
        .int()
        .min(CONSTRAINTS.TTS.VOICE_MODIFY_INTENSITY_MIN)
        .max(CONSTRAINTS.TTS.VOICE_MODIFY_INTENSITY_MAX)
        .optional()
        .describe('Voice intensity adjustment (-100 to 100). Values closer to -100 make voice more robust, closer to 100 make voice softer'),
        
      timbre: z.number()
        .int()
        .min(CONSTRAINTS.TTS.VOICE_MODIFY_TIMBRE_MIN)
        .max(CONSTRAINTS.TTS.VOICE_MODIFY_TIMBRE_MAX)
        .optional()
        .describe('Voice timbre adjustment (-100 to 100). Values closer to -100 make voice more mellow, closer to 100 make voice more crisp'),
        
      sound_effects: z.enum(CONSTRAINTS.TTS.SOUND_EFFECTS as readonly [SoundEffect, ...SoundEffect[]])
        .optional()
        .describe(getSoundEffectsDescription())
    });
  • validateTTSParams function that parses and validates TTS parameters using textToSpeechSchema.
    export function validateTTSParams(params: unknown): TextToSpeechParams {
      try {
        return textToSpeechSchema.parse(params);
      } catch (error) {
        if (error instanceof z.ZodError) {
          const messages = error.errors.map(e => `${e.path.join('.')}: ${e.message}`);
          throw new Error(`Validation failed: ${messages.join(', ')}`);
        }
        throw error;
      }
    }
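For callers, the useful part of validateTTSParams is the error shape: a single Error whose message concatenates each Zod issue as "path: message". A dependency-free sketch of the same range checks and message format, using constraint values from the parameter table (the helper itself is hypothetical):

```typescript
// Sketch of the numeric range checks the Zod schema enforces, producing the
// same "Validation failed: path: message" format validateTTSParams throws.
interface RangeRule { min: number; max: number }

const RANGES: Record<string, RangeRule> = {
  speed: { min: 0.5, max: 2 },
  volume: { min: 0.1, max: 10 },
  pitch: { min: -12, max: 12 },
};

function checkRanges(params: Record<string, number>): void {
  const messages: string[] = [];
  for (const [key, rule] of Object.entries(RANGES)) {
    const value = params[key];
    if (value !== undefined && (value < rule.min || value > rule.max)) {
      messages.push(`${key}: must be between ${rule.min} and ${rule.max}`);
    }
  }
  if (messages.length > 0) {
    // Aggregate all failures into one error, mirroring the ZodError handling.
    throw new Error(`Validation failed: ${messages.join(", ")}`);
  }
}
```

Aggregating every failure into one message lets an agent fix all out-of-range parameters in a single retry instead of discovering them one at a time.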
  • TextToSpeechService.generateSpeech method that builds API payload, makes request to text-to-speech endpoint, processes response, and saves audio file.
    async generateSpeech(params: TextToSpeechParams): Promise<TTSResult> {
      try {
        // Build API payload (MCP handles validation)
        const payload = this.buildPayload(params);
        
        // Make API request
        const response = await this.post(API_CONFIG.ENDPOINTS.TEXT_TO_SPEECH, payload) as TTSResponse;
        
        // Process response
        return await this.processTTSResponse(response, params);
        
      } catch (error: any) {
        const processedError = ErrorHandler.handleAPIError(error);
        ErrorHandler.logError(processedError, { service: 'tts', params });
        
        // Throw the error so task manager can properly mark it as failed
        throw processedError;
      }
    }
Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Describes async behavior, return value (task ID), and dependency on task_barrier, but lacks details on error handling, auth, or rate limit specifics.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Three sentences, front-loaded with purpose, then guidance, then return info. No superfluous content.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Covers core functionality and async workflow adequately given 15 parameters and no output schema, but could mention voice selection or edge cases.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema coverage is 93%, so the description adds little beyond the schema; the baseline score of 3 is appropriate.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool converts text to speech asynchronously, distinguishing it from sibling tools like submit_image_generation and task_barrier.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 5/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

Explicitly recommends batching submissions to saturate rate limits and using task_barrier for completion, providing actionable guidance.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
