text_to_audio

Convert text to audio with customizable voice, speed, and emotion, saving the file to a specified directory. Integrates with MiniMax API for high-quality speech synthesis.

Instructions

Convert text to audio with a given voice and save the output audio file to a given directory. If no directory is provided, the file will be saved to the desktop. If no voice ID is provided, the default voice will be used.

Note: This tool calls MiniMax API and may incur costs. Use only when explicitly requested by the user.

Input Schema

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| `bitrate` | No | Bitrate (bps), values: [64000, 96000, 128000, 160000, 192000, 224000, 256000, 320000] | |
| `channel` | No | Audio channels, values: [1, 2] | |
| `emotion` | No | Speech emotion, values: ["happy", "sad", "angry", "fearful", "disgusted", "surprised", "neutral"] | happy |
| `format` | No | Audio format, values: ["pcm", "mp3", "flac", "wav"] | mp3 |
| `languageBoost` | No | Enhance the ability to recognize specified languages and dialects. Supported values: 'Chinese', 'Chinese,Yue', 'English', 'Arabic', 'Russian', 'Spanish', 'French', 'Portuguese', 'German', 'Turkish', 'Dutch', 'Ukrainian', 'Vietnamese', 'Indonesian', 'Japanese', 'Italian', 'Korean', 'Thai', 'Polish', 'Romanian', 'Greek', 'Czech', 'Finnish', 'Hindi', 'auto' | auto |
| `model` | No | Model to use | speech-02-hd |
| `outputDirectory` | No | The directory to save the output file. `outputDirectory` is relative to `MINIMAX_MCP_BASE_PATH` (or `basePath` in config); the final save path is `${basePath}/${outputDirectory}`. For example, if `MINIMAX_MCP_BASE_PATH=~/Desktop` and `outputDirectory=workspace`, the output is saved to `~/Desktop/workspace/` | |
| `outputFile` | No | Path to save the generated audio file; auto-generated if not provided | |
| `pitch` | No | Speech pitch | |
| `sampleRate` | No | Sample rate (Hz), values: [8000, 16000, 22050, 24000, 32000, 44100] | |
| `speed` | No | Speech speed | |
| `subtitleEnable` | No | Controls whether the subtitle service is enabled. The model must be 'speech-01-turbo' or 'speech-01-hd' | false |
| `text` | Yes | Text to convert to audio | |
| `voiceId` | No | Voice ID to use, e.g. "female-shaonv" | male-qn-qingse |
| `vol` | No | Speech volume | |
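
To make the `outputDirectory` resolution rule above concrete, here is a minimal sketch. The `resolveOutputDir` helper is hypothetical (the server inlines this logic); it only mirrors the documented rule that the final save path is `${basePath}/${outputDirectory}`.

```typescript
import * as path from 'path';
import * as os from 'os';

// Hypothetical helper mirroring the documented rule:
// final save path = `${basePath}/${outputDirectory}`, where basePath
// comes from MINIMAX_MCP_BASE_PATH (or `basePath` in config).
function resolveOutputDir(basePath: string, outputDirectory?: string): string {
  // Expand a leading `~` to the user's home directory
  const expanded = basePath.startsWith('~')
    ? path.join(os.homedir(), basePath.slice(1))
    : basePath;
  return outputDirectory ? path.join(expanded, outputDirectory) : expanded;
}

// With MINIMAX_MCP_BASE_PATH=~/Desktop and outputDirectory=workspace,
// output lands under ~/Desktop/workspace/
const dir = resolveOutputDir('~/Desktop', 'workspace');
```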

Implementation Reference

  • Core handler that performs text-to-audio conversion by calling the MiniMax TTS endpoint `/v1/t2a_v2`: it validates parameters, builds the request payload, and processes the response (hex-to-binary conversion, then file saving or URL return), with helper validators for the individual parameters.
    async generateSpeech(request: TTSRequest): Promise<any> {
      // Validate required parameters
      if (!request.text || request.text.trim() === '') {
        throw new MinimaxRequestError(ERROR_TEXT_REQUIRED);
      }
    
      // Process output file
      let outputFile = request.outputFile;
      if (!outputFile) {
        // If no output file is provided, generate one based on text content
        const textPrefix = request.text.substring(0, 20).replace(/[^\w]/g, '_');
        outputFile = `tts_${textPrefix}_${Date.now()}`;
      }
    
      if (!path.extname(outputFile)) {
        // If no extension, add one based on format
        const format = request.format || 'mp3';
        outputFile = buildOutputFile(outputFile, request.outputDirectory, format);
      }
    
      // Prepare request data according to MiniMax API nested structure
      const requestData: Record<string, any> = {
        model: this.ensureValidModel(request.model),
        text: request.text,
        voice_setting: {
          voice_id: request.voiceId || 'male-qn-qingse',
          speed: request.speed || 1.0,
          vol: request.vol || 1.0,
          pitch: request.pitch || 0,
          emotion: this.ensureValidEmotion(request.emotion, this.ensureValidModel(request.model))
        },
        audio_setting: {
          sample_rate: this.ensureValidSampleRate(request.sampleRate),
          bitrate: this.ensureValidBitrate(request.bitrate),
          format: this.ensureValidFormat(request.format),
          channel: this.ensureValidChannel(request.channel)
        },
        language_boost: request.languageBoost || 'auto',
        stream: request.stream,
        subtitle_enable: request.subtitleEnable
      };
    
      // Add output format (if specified)
      if (request.outputFormat === RESOURCE_MODE_URL) {
        requestData.output_format = 'url';
      }
    
      // Filter out undefined fields (recursive)
      const filteredData = this.removeUndefinedFields(requestData);
    
      // Send request
      const response = await this.api.post<any>('/v1/t2a_v2', filteredData);
    
      // Process response
      const audioData = response?.data?.audio;
      const subtitleFile = response?.data?.subtitle_file;
    
      if (!audioData) {
        throw new MinimaxRequestError('Could not get audio data from response');
      }
    
      // If URL mode, return the URL directly
      if (request.outputFormat === RESOURCE_MODE_URL) {
        return {
          audio: audioData,
          subtitle: subtitleFile
        };
      }
    
      // Otherwise the API returned hex-encoded audio; decode it and save the file
      try {
        // Convert hex string to binary
        const audioBuffer = Buffer.from(audioData, 'hex');
    
        // Ensure the output directory exists
        const outputDir = path.dirname(outputFile);
        if (!fs.existsSync(outputDir)) {
          fs.mkdirSync(outputDir, { recursive: true });
        }
    
        // Write to file
        fs.writeFileSync(outputFile, audioBuffer);
    
        return {
          audio: outputFile,
          subtitle: subtitleFile
        };
      } catch (error) {
        throw new MinimaxRequestError(`Failed to save audio file: ${String(error)}`);
      }
    }
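The hex-to-binary step in the handler relies on Node's built-in `Buffer` hex decoding: MiniMax returns the synthesized audio as a hex string in non-URL mode, and `Buffer.from(hex, 'hex')` turns every two hex characters into one byte. A standalone sketch (the sample bytes are illustrative, not real audio):

```typescript
// Two hex characters decode to one byte: 0x49 0x44 0x33 ("ID3")
const hexAudio = '494433'; // illustrative bytes, not real audio data
const audioBuffer = Buffer.from(hexAudio, 'hex');

console.log(audioBuffer.length);             // 3
console.log(audioBuffer.toString('latin1')); // "ID3"
```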
  • Tool registration via `McpServer.tool()` with the name 'text_to_audio', a detailed description, a Zod input schema defining every parameter (descriptions, defaults, validations), and an async handler that assembles the parameters and calls `TTSAPI.generateSpeech`.
    private registerTextToAudioTool(): void {
      this.server.tool(
        'text_to_audio',
        'Convert text to audio with a given voice and save the output audio file to a given directory. If no directory is provided, the file will be saved to desktop. If no voice ID is provided, the default voice will be used.\n\nNote: This tool calls MiniMax API and may incur costs. Use only when explicitly requested by the user.',
        {
          text: z.string().describe('Text to convert to audio'),
          outputDirectory: COMMON_PARAMETERS_SCHEMA.outputDirectory,
          voiceId: z.string().optional().default(DEFAULT_VOICE_ID).describe('Voice ID to use, e.g. "female-shaonv"'),
          model: z.string().optional().default(DEFAULT_SPEECH_MODEL).describe('Model to use'),
          speed: z.number().min(0.5).max(2.0).optional().default(DEFAULT_SPEED).describe('Speech speed'),
          vol: z.number().min(0.1).max(10.0).optional().default(DEFAULT_VOLUME).describe('Speech volume'),
          pitch: z.number().min(-12).max(12).optional().default(DEFAULT_PITCH).describe('Speech pitch'),
          emotion: z
            .string()
            .optional()
            .default(DEFAULT_EMOTION)
            .describe('Speech emotion, values: ["happy", "sad", "angry", "fearful", "disgusted", "surprised", "neutral"]'),
          format: z
            .string()
            .optional()
            .default(DEFAULT_FORMAT)
            .describe('Audio format, values: ["pcm", "mp3","flac", "wav"]'),
          sampleRate: z
            .number()
            .optional()
            .default(DEFAULT_SAMPLE_RATE)
            .describe('Sample rate (Hz), values: [8000, 16000, 22050, 24000, 32000, 44100]'),
          bitrate: z
            .number()
            .optional()
            .default(DEFAULT_BITRATE)
            .describe('Bitrate (bps), values: [64000, 96000, 128000, 160000, 192000, 224000, 256000, 320000]'),
          channel: z.number().optional().default(DEFAULT_CHANNEL).describe('Audio channels, values: [1, 2]'),
          languageBoost: z.string().optional().default(DEFAULT_LANGUAGE_BOOST)
            .describe(`Enhance the ability to recognize specified languages and dialects. Supported values include: 'Chinese', 'Chinese,Yue', 'English', 'Arabic', 'Russian', 'Spanish', 'French', 'Portuguese', 'German', 'Turkish', 'Dutch', 'Ukrainian', 'Vietnamese', 'Indonesian', 'Japanese', 'Italian', 'Korean', 'Thai', 'Polish', 'Romanian', 'Greek', 'Czech', 'Finnish', 'Hindi', 'auto', default is 'auto'`),
          subtitleEnable: z
            .boolean()
            .optional()
            .default(false)
            .describe(
              `The parameter controls whether the subtitle service is enabled. The model must be 'speech-01-turbo' or 'speech-01-hd'. If this parameter is not provided, the default value is false`,
            ),
          outputFile: z
            .string()
            .optional()
            .describe('Path to save the generated audio file, automatically generated if not provided'),
        },
        async (args, extra) => {
          try {
            // Build TTS request parameters
            const ttsParams = {
              text: args.text,
              outputDirectory: args.outputDirectory,
              voiceId: args.voiceId ?? DEFAULT_VOICE_ID,
              model: args.model ?? DEFAULT_SPEECH_MODEL,
              speed: args.speed ?? DEFAULT_SPEED,
              vol: args.vol ?? DEFAULT_VOLUME,
              // Use ?? rather than || so valid falsy values (e.g. pitch 0) survive
              pitch: args.pitch ?? DEFAULT_PITCH,
              emotion: args.emotion ?? DEFAULT_EMOTION,
              format: args.format ?? DEFAULT_FORMAT,
              sampleRate: args.sampleRate ?? DEFAULT_SAMPLE_RATE,
              bitrate: args.bitrate ?? DEFAULT_BITRATE,
              channel: args.channel ?? DEFAULT_CHANNEL,
              languageBoost: args.languageBoost ?? DEFAULT_LANGUAGE_BOOST,
              subtitleEnable: args.subtitleEnable ?? false,
              outputFile: args.outputFile,
            };
    
            // Use global configuration
            const requestApiKey = this.config.apiKey;
    
            if (!requestApiKey) {
              throw new Error(ERROR_API_KEY_REQUIRED);
            }
    
            // Update configuration with request-specific parameters
            const requestConfig: Partial<Config> = {
              apiKey: requestApiKey,
              apiHost: this.config.apiHost,
              resourceMode: this.config.resourceMode,
            };
    
            // Update API instance
            const requestApi = new MiniMaxAPI(requestConfig as Config);
            const requestTtsApi = new TTSAPI(requestApi);
    
            // Automatically set resource mode (if not specified)
            const outputFormat = requestConfig.resourceMode;
            const ttsRequest = {
              ...ttsParams,
              outputFormat,
            };
    
            // If no output filename is provided, generate one automatically
            if (!ttsRequest.outputFile) {
              const textPrefix = ttsRequest.text.substring(0, 20).replace(/[^\w]/g, '_');
              ttsRequest.outputFile = `tts_${textPrefix}_${Date.now()}`;
            }
    
            const result = await requestTtsApi.generateSpeech(ttsRequest);
    
            // Return different messages based on output format
            if (outputFormat === RESOURCE_MODE_URL) {
              return {
                content: [
                  {
                    type: 'text',
                    text: `Success. Audio URL: ${result.audio}. ${ttsParams.subtitleEnable ? `Subtitle file saved: ${result.subtitle}` : ''}`,
                  },
                ],
              };
            } else {
              return {
                content: [
                  {
                    type: 'text',
                    text: `Audio file saved: ${result.audio}. ${ttsParams.subtitleEnable ? `Subtitle file saved: ${result.subtitle}. ` : ''}Voice used: ${ttsParams.voiceId}`,
                  },
                ],
              };
            }
          } catch (error) {
            return {
              content: [
                {
                  type: 'text',
                  text: `Failed to generate audio: ${error instanceof Error ? error.message : String(error)}`,
                },
              ],
            };
          }
        },
      );
    }
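The default-filename rule appears twice in the source (in `generateSpeech` and again in the tool handler). Isolated as a hypothetical helper (`defaultTtsFilename` is not in the repository; the logic is inlined there): take the first 20 characters of the text, replace non-word characters with underscores, and append a timestamp for uniqueness.

```typescript
// Mirrors the inlined auto-filename logic from the handler.
function defaultTtsFilename(text: string, now: number = Date.now()): string {
  // First 20 chars of the text, with non-word characters replaced by '_'
  const textPrefix = text.substring(0, 20).replace(/[^\w]/g, '_');
  return `tts_${textPrefix}_${now}`;
}

const name = defaultTtsFilename('Hello, world!', 1700000000000);
// "tts_Hello__world__1700000000000"
```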
  • JSON schema definition for the text_to_audio tool used in the REST server's list-tools handler; it includes all input parameters with types and descriptions.
    {
      name: 'text_to_audio',
      description: 'Convert text to audio',
      arguments: [
        { name: 'text', description: 'Text to convert to audio', required: true },
        { name: 'outputDirectory', description: OUTPUT_DIRECTORY_DESCRIPTION, required: false },
        { name: 'voiceId', description: 'Voice ID to use, e.g. "female-shaonv"', required: false },
        { name: 'model', description: 'Model to use', required: false },
        { name: 'speed', description: 'Speech speed (0.5-2.0)', required: false },
        { name: 'vol', description: 'Speech volume (0.1-10.0)', required: false },
        { name: 'pitch', description: 'Speech pitch (-12 to 12)', required: false },
        { name: 'emotion', description: 'Speech emotion, values: ["happy", "sad", "angry", "fearful", "disgusted", "surprised", "neutral"]', required: false },
        { name: 'format', description: 'Audio format, values: ["pcm", "mp3","flac", "wav"]', required: false },
        { name: 'sampleRate', description: 'Sample rate (Hz), values: [8000, 16000, 22050, 24000, 32000, 44100]', required: false },
        { name: 'bitrate', description: 'Bitrate (bps), values: [64000, 96000, 128000, 160000, 192000, 224000, 256000, 320000]', required: false },
        { name: 'channel', description: 'Audio channels, values: [1, 2]', required: false },
        { name: 'languageBoost', description: `Enhance the ability to recognize specified languages and dialects. Supported values include: 'Chinese', 'Chinese,Yue', 'English', 'Arabic', 'Russian', 'Spanish', 'French', 'Portuguese', 'German', 'Turkish', 'Dutch', 'Ukrainian', 'Vietnamese', 'Indonesian', 'Japanese', 'Italian', 'Korean', 'Thai', 'Polish', 'Romanian', 'Greek', 'Czech', 'Finnish', 'Hindi', 'auto', default is 'auto'`, required: false },
        { name: 'subtitleEnable', description: `The parameter controls whether the subtitle service is enabled. The model must be 'speech-01-turbo' or 'speech-01-hd'. If this parameter is not provided, the default value is false`, required: false },
        { name: 'outputFile', description: 'Output file path, auto-generated if not provided', required: false }
      ],
      inputSchema: {
        type: 'object',
        properties: {
          text: { type: 'string' },
          outputDirectory: { type: 'string' },
          voiceId: { type: 'string' },
          model: { type: 'string' },
          speed: { type: 'number' },
          vol: { type: 'number' },
          pitch: { type: 'number' },
          emotion: { type: 'string' },
          format: { type: 'string' },
          sampleRate: { type: 'number' },
          bitrate: { type: 'number' },
          channel: { type: 'number' },
          languageBoost: { type: 'string' },
          subtitleEnable: { type: 'boolean' },
          outputFile: { type: 'string' }
        },
        required: ['text']
      }
    },
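Since the dispatcher below reads `request.params.tool` and `request.params.params`, a call-tool payload for this schema might look like the following. This is a sketch of the envelope shape only; the exact transport wrapper is an assumption, and the argument values are illustrative.

```typescript
// Hypothetical call-tool payload matching the dispatcher's
// `request.params.tool` / `request.params.params` shape.
const callToolRequest = {
  params: {
    tool: 'text_to_audio',
    params: {
      text: 'Hello from MiniMax',   // the only required argument
      voiceId: 'female-shaonv',
      format: 'mp3',
      outputDirectory: 'workspace',
    },
  },
};

console.log(callToolRequest.params.tool); // "text_to_audio"
```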
  • Handler wrapper in REST server that dispatches to mediaService.generateSpeech with retry logic.
                  },
                  required: ['voiceType']
                }
              },
              {
                name: 'play_audio',
                description: 'Play audio file. Supports WAV and MP3 formats. Does not support video.',
                arguments: [
                  { name: 'inputFilePath', description: 'Path to audio file to play', required: true },
                  { name: 'isUrl', description: 'Whether audio file is a URL', required: false }
                ],
                inputSchema: {
                  type: 'object',
                  properties: {
                    inputFilePath: { type: 'string' },
                    isUrl: { type: 'boolean' }
                  },
                  required: ['inputFilePath']
                }
              },
              {
                name: 'text_to_image',
                description: 'Generate image based on text prompt',
                arguments: [
                  { name: 'prompt', description: 'Text prompt for image generation', required: true },
                  { name: 'model', description: 'Model to use', required: false },
                  { name: 'aspectRatio', description: 'Image aspect ratio, values: ["1:1", "16:9","4:3", "3:2", "2:3", "3:4", "9:16", "21:9"]', required: false },
                  { name: 'n', description: 'Number of images to generate (1-9)', required: false },
                  { name: 'promptOptimizer', description: 'Whether to optimize prompt', required: false },
                  { name: 'outputDirectory', description: OUTPUT_DIRECTORY_DESCRIPTION, required: false },
                  { name: 'outputFile', description: 'Output file path, auto-generated if not provided', required: false }
                ],
                inputSchema: {
                  type: 'object',
                  properties: {
                    prompt: { type: 'string' },
                    model: { type: 'string' },
                    aspectRatio: { type: 'string' },
                    n: { type: 'number' },
                    promptOptimizer: { type: 'boolean' },
                    outputDirectory: { type: 'string' },
                    outputFile: { type: 'string' }
                  },
                  required: ['prompt']
                }
              },
              {
                name: 'generate_video',
                description: 'Generate video based on text prompt',
                arguments: [
                  { name: 'prompt', description: 'Text prompt for video generation', required: true },
                  { name: 'model', description: 'Model to use, values: ["T2V-01", "T2V-01-Director", "I2V-01", "I2V-01-Director", "I2V-01-live", "MiniMax-Hailuo-02"]', required: false },
                  { name: 'firstFrameImage', description: 'First frame image', required: false },
                  { name: 'outputDirectory', description: OUTPUT_DIRECTORY_DESCRIPTION, required: false },
                  { name: 'outputFile', description: 'Output file path, auto-generated if not provided', required: false },
                  { name: 'async_mode', description: 'Whether to use async mode. Defaults to False. If True, the video generation task will be submitted asynchronously and the response will return a task_id. Should use `query_video_generation` tool to check the status of the task and get the result', required: false },
                  { name: 'resolution', description: 'The resolution of the video. The model must be "MiniMax-Hailuo-02". Values range ["768P", "1080P"]', required: false },
                  { name: 'duration', description: 'The duration of the video. The model must be "MiniMax-Hailuo-02". Values can be 6 and 10.', required: false },
                ],
                inputSchema: {
                  type: 'object',
                  properties: {
                    prompt: { type: 'string' },
                    model: { type: 'string' },
                    firstFrameImage: { type: 'string' },
                    outputDirectory: { type: 'string' },
                    outputFile: { type: 'string' },
                    async_mode: { type: 'boolean' },
                    resolution: { type: 'string' },
                    duration: { type: 'number' }
                  },
                  required: ['prompt']
                }
              },
              {
                name: 'voice_clone',
                description: 'Clone voice using provided audio file',
                arguments: [
                  { name: 'voiceId', description: 'Voice ID to use', required: true },
                  { name: 'audioFile', description: 'Audio file path', required: true },
                  { name: 'text', description: 'Text for demo audio', required: false },
                  { name: 'outputDirectory', description: OUTPUT_DIRECTORY_DESCRIPTION, required: false },
                  { name: 'isUrl', description: 'Whether audio file is a URL', required: false }
                ],
                inputSchema: {
                  type: 'object',
                  properties: {
                    voiceId: { type: 'string' },
                    audioFile: { type: 'string' },
                    text: { type: 'string' },
                    outputDirectory: { type: 'string' },
                    isUrl: { type: 'boolean' }
                  },
                  required: ['voiceId', 'audioFile']
                }
              },
              {
                name: 'image_to_video',
                description: 'Generate video based on image',
                arguments: [
                  { name: 'prompt', description: 'Text prompt for video generation', required: true },
                  { name: 'firstFrameImage', description: 'Path to first frame image', required: true },
                  { name: 'model', description: 'Model to use, values: ["I2V-01", "I2V-01-Director", "I2V-01-live"]', required: false },
                  { name: 'outputDirectory', description: OUTPUT_DIRECTORY_DESCRIPTION, required: false },
                  { name: 'outputFile', description: 'Output file path, auto-generated if not provided', required: false },
                  { name: 'async_mode', description: 'Whether to use async mode. Defaults to False. If True, the video generation task will be submitted asynchronously and the response will return a task_id. Should use `query_video_generation` tool to check the status of the task and get the result', required: false }
                ],
                inputSchema: {
                  type: 'object',
                  properties: {
                    prompt: { type: 'string' },
                    firstFrameImage: { type: 'string' },
                    model: { type: 'string' },
                    outputDirectory: { type: 'string' },
                    outputFile: { type: 'string' },
                    async_mode: { type: 'boolean' }
                  },
                  required: ['prompt', 'firstFrameImage']
                }
              },
              {
                name: 'music_generation',
                description: 'Generate music based on text prompt and lyrics',
                arguments: [
                  { name: 'prompt', description: 'Music creation inspiration describing style, mood, scene, etc.', required: true },
                  { name: 'lyrics', description: 'Song lyrics for music generation.\nUse newline (\\n) to separate each line of lyrics. Supports lyric structure tags [Intro][Verse][Chorus][Bridge][Outro]\nto enhance musicality. Character range: [10, 600] (each Chinese character, punctuation, and letter counts as 1 character)', required: true },
                  { name: 'sampleRate', description: 'Sample rate of generated music', required: false },
                  { name: 'bitrate', description: 'Bitrate of generated music', required: false },
                  { name: 'format', description: 'Format of generated music', required: false },
                  { name: 'outputDirectory', description: OUTPUT_DIRECTORY_DESCRIPTION, required: false }
                ],
                inputSchema: {
                  type: 'object',
                  properties: {
                    prompt: { type: 'string' },
                    lyrics: { type: 'string' },
                    sampleRate: { type: 'number' },
                    bitrate: { type: 'number' },
                    format: { type: 'string' },
                    outputDirectory: { type: 'string' }
                  },
                  required: ['prompt', 'lyrics']
                }
              },
              {
                name: 'voice_design',
                description: 'Generate a voice based on description prompts',
                arguments: [
                  { name: 'prompt', description: 'The prompt to generate the voice from', required: true },
                  { name: 'previewText', description: 'The text to preview the voice', required: true },
                  { name: 'voiceId', description: 'The id of the voice to use', required: false },
                  { name: 'outputDirectory', description: OUTPUT_DIRECTORY_DESCRIPTION, required: false }
                ],
                inputSchema: {
                  type: 'object',
                  properties: {
                    prompt: { type: 'string' },
                    previewText: { type: 'string' },
                    voiceId: { type: 'string' },
                    outputDirectory: { type: 'string' }
                  },
                  required: ['prompt', 'previewText']
                }
              }
            ]
          };
        } catch (error) {
          throw this.wrapError('Failed to get tool list', error);
        }
      });
    
      // Call tool handler
      this.server.setRequestHandler(CallToolRequestSchema, async (request) => {
        const toolName = request.params.tool;
        const toolParams = request.params.params || {};
    
        try {
          // Create configuration and API instance for this request
          const requestConfig = this.getRequestConfig(request);
          const requestApi = new MiniMaxAPI(requestConfig);
          const mediaService = new MediaService(requestApi);
    
          // Log API key (partially hidden)
          const apiKey = this.extractApiKeyFromRequest(request);
          const maskedKey = apiKey
            ? `${apiKey.substring(0, 4)}****${apiKey.substring(apiKey.length - 4)}`
            : 'not provided';
          // console.log(`[${new Date().toISOString()}] Using API key: ${maskedKey} to call tool: ${toolName}`);
    
          // Choose different handler function based on tool name
          switch (toolName) {
            case 'text_to_audio':
              return await this.handleTextToAudio(toolParams, requestApi, mediaService);
    
            case 'list_voices':
              return await this.handleListVoices(toolParams, requestApi, mediaService);
    
            case 'play_audio':
              return await this.handlePlayAudio(toolParams);
    
            case 'text_to_image':
              return await this.handleTextToImage(toolParams, requestApi, mediaService);
    
            case 'generate_video':
              return await this.handleGenerateVideo(toolParams, requestApi, mediaService);
    
            case 'voice_clone':
              return await this.handleVoiceClone(toolParams, requestApi, mediaService);
    
            case 'image_to_video':
              return await this.handleImageToVideo(toolParams, requestApi, mediaService);
    
            case 'query_video_generation':
              return await this.handleVideoGenerationQuery(toolParams, requestApi, mediaService);
            
            case 'music_generation':
              return await this.handleGenerateMusic(toolParams, requestApi, mediaService);
    
            case 'voice_design':
              return await this.handleVoiceDesign(toolParams, requestApi, mediaService);
    
            default:
              throw new Error(`Unknown tool: ${toolName}`);
          }
        } catch (error) {
          throw this.wrapError(`Failed to call tool ${toolName}`, error);
        }
      });
    }
    
    /**
     * Handle text to speech request
     */
    private async handleTextToAudio(args: any, api: MiniMaxAPI, mediaService: MediaService, attempt = 1): Promise<any> {
  • MediaService wrapper that delegates text-to-audio execution to TTSAPI.generateSpeech.
    // Returns the TTSAPI result object ({ audio, subtitle }), not a plain string
    public async generateSpeech(params: any): Promise<any> {
      this.checkInitialized();
      try {
        return await this.ttsApi.generateSpeech(params);
      } catch (error) {
        // console.error(`[${new Date().toISOString()}] Failed to generate speech:`, error);
        throw this.wrapError('Failed to generate speech', error);
      }
    }
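The `attempt = 1` parameter in the `handleTextToAudio` signature above suggests an attempt-counting retry loop around `mediaService.generateSpeech`. A generic sketch of that pattern (the `withRetry` helper, its attempt limit, and its backoff are assumptions, not the repository's actual code):

```typescript
// Hypothetical retry helper illustrating the attempt-based pattern hinted at
// by handleTextToAudio(args, api, mediaService, attempt = 1).
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  delayMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      if (attempt < maxAttempts) {
        // Simple linear backoff between attempts
        await new Promise((resolve) => setTimeout(resolve, delayMs * attempt));
      }
    }
  }
  throw lastError;
}
```

Usage would look like `withRetry(() => mediaService.generateSpeech(params))`, letting transient API failures be retried while the final error still propagates to the caller.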