
speak

Convert text to speech using Rime's API for audio output when users request spoken responses or need verbal explanations after completing commands.

Instructions

Speak text aloud using Rime's text-to-speech API. Should be used when user asks you to speak or to announce and explain when you finish a command

User configuration:

WHO_TO_ADDRESS: user

WHEN_TO_SPEAK: when asked to speak or when finishing a command

VOICE: cove

GUIDANCE: Use the speak tool to convert text to speech when the user requests audio output or when providing verbal responses
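The configuration lines above are optional settings that get embedded in the tool description. As a minimal sketch of how such a block could be assembled so that unset values leave no empty lines, assuming the settings are passed in as a plain object (`SpeakConfig` and `buildConfigBlock` are hypothetical names, not part of the rime-mcp source):

```typescript
// Hypothetical helper for illustration; not taken from the rime-mcp source.
type SpeakConfig = {
  WHO_TO_ADDRESS?: string;
  WHEN_TO_SPEAK?: string;
  VOICE?: string;
  GUIDANCE?: string;
};

// Render only the settings that are set, one per paragraph,
// and return an empty string when nothing is configured.
function buildConfigBlock(config: SpeakConfig): string {
  const keys = ["WHO_TO_ADDRESS", "WHEN_TO_SPEAK", "VOICE", "GUIDANCE"] as const;
  const lines = keys
    .filter((key) => Boolean(config[key]))
    .map((key) => `${key}: ${config[key]}`);
  return lines.length > 0 ? `User configuration:\n\n${lines.join("\n\n")}` : "";
}

console.log(buildConfigBlock({ WHO_TO_ADDRESS: "user", VOICE: "cove" }));
```

Filtering before rendering keeps the description short when only some settings are present, which matters because the whole description is sent to the agent on every tool listing.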

Input Schema

| Name | Required | Description | Default |
|---|---|---|---|
| text | Yes | The text to speak aloud | |
| speaker | No | The voice to use (defaults to 'cove') | |
| speedAlpha | No | Speech speed multiplier (default: 1.0) | |
| reduceLatency | No | Whether to optimize for lower latency (default: false) | |
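The schema marks only `text` as required and states the defaults in the descriptions rather than in a `default` field. The following is a sketch of how a caller or server could check raw arguments against that table and fill in the stated defaults. `SpeakArgs`, `optional`, and `parseSpeakArgs` are hypothetical names introduced for illustration; only the field names, types, and default values come from the table.

```typescript
// Hypothetical types and helpers; only field names and defaults come from the schema.
interface SpeakArgs {
  text: string;
  speaker: string;
  speedAlpha: number;
  reduceLatency: boolean;
}

// Return the fallback when the value is absent, otherwise check its runtime type.
function optional<T>(
  value: unknown,
  type: "string" | "number" | "boolean",
  fallback: T,
  name: string
): T {
  if (value === undefined) return fallback;
  if (typeof value !== type) {
    throw new TypeError(`'${name}' must be a ${type}`);
  }
  return value as T;
}

function parseSpeakArgs(raw: unknown): SpeakArgs {
  if (typeof raw !== "object" || raw === null) {
    throw new TypeError("arguments must be an object");
  }
  const { text, speaker, speedAlpha, reduceLatency } = raw as Record<string, unknown>;
  if (typeof text !== "string") {
    throw new TypeError("'text' is required and must be a string");
  }
  return {
    text,
    speaker: optional(speaker, "string", "cove", "speaker"),
    speedAlpha: optional(speedAlpha, "number", 1.0, "speedAlpha"),
    reduceLatency: optional(reduceLatency, "boolean", false, "reduceLatency"),
  };
}

console.log(parseSpeakArgs({ text: "Build finished" }));
```

Validating at the boundary means a malformed call fails with a clear message before any request is sent to the TTS service.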

Implementation Reference

  • index.ts:97-130 (handler)
    The doSpeak function implements the core execution logic for the 'speak' tool, handling parameters and delegating TTS to playText while returning structured content.
    async function doSpeak(params: {
      text: string;
      speaker?: string;
      speedAlpha?: number;
      reduceLatency?: boolean;
    }) {
      try {
        // Use the playText function from stream-audio.ts
        await playText(params.text, {
          speaker: params.speaker || "cove",
          speedAlpha: params.speedAlpha || 1.0,
          reduceLatency: params.reduceLatency || false,
        });
    
        return {
          content: [
            {
              type: "text",
              text: JSON.stringify({
                success: true,
                text: params.text,
                speaker: params.speaker || "cove",
              }),
            },
          ],
        };
      } catch (error: unknown) {
        log("ERROR", `Error: ${error instanceof Error ? error.message : String(error)}`);
        throw new McpError(
          ErrorCode.InternalError,
          `Rime API error: ${error instanceof Error ? error.message : String(error)}`
        );
      }
    }
  • Defines the SPEAK_TOOL object, including name, description, and detailed inputSchema for validating 'speak' tool parameters.
    const SPEAK_TOOL: Tool = {
      name: "speak",
      description: `Speak text aloud using Rime's text-to-speech API. Should be used when user asks you to speak or to announce and explain when you finish a command
        
    User configuration:
    
    ${WHO_TO_ADDRESS ? `WHO_TO_ADDRESS: ${WHO_TO_ADDRESS}` : ""}
    
    ${WHEN_TO_SPEAK ? `WHEN_TO_SPEAK: ${WHEN_TO_SPEAK}` : ""}
    
    ${VOICE ? `VOICE: ${VOICE}` : ""}
    
    ${GUIDANCE ? `GUIDANCE: ${GUIDANCE}` : ""}
        `,
      inputSchema: {
        type: "object",
        properties: {
          text: {
            type: "string",
            description: "The text to speak aloud",
          },
          speaker: {
            type: "string",
            description: `The voice to use (defaults to '${VOICE}')`,
          },
          speedAlpha: {
            type: "number",
            description: "Speech speed multiplier (default: 1.0)",
          },
          reduceLatency: {
            type: "boolean",
            description: "Whether to optimize for lower latency (default: false)",
          },
        },
        required: ["text"],
      },
    };
  • index.ts:89-91 (registration)
    Registers the 'speak' tool by including it in the response to ListToolsRequestSchema.
    server.setRequestHandler(ListToolsRequestSchema, async () => ({
      tools: [SPEAK_TOOL],
    }));
  • The CallToolRequestSchema handler dispatches calls to the 'speak' tool by invoking the doSpeak function.
    server.setRequestHandler(CallToolRequestSchema, async (request) => {
      if (request.params.name === "speak") {
        console.error("Speak tool called with:", request.params.arguments);
        const input = request.params.arguments as {
          text: string;
          speaker?: string;
          speedAlpha?: number;
          reduceLatency?: boolean;
        };
        return doSpeak(input);
      }
    
      throw new McpError(ErrorCode.MethodNotFound, `Unknown tool: ${request.params.name}`);
    });
  • Supporting function playText that handles the Rime TTS API call, audio file management, and playback using system audio players.
    export async function playText(text: string, customConfig?: Partial<TtsConfig>): Promise<void> {
      const config: TtsConfig = { ...DEFAULT_CONFIG, ...customConfig };
    
      console.error("Starting Rime TTS with text:");
      console.error(`"${text}"`);
    
      try {
        const apiKey = getApiKey();
    
        // Create temporary directory for audio files
        const tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), "rime-stream-"));
        const audioFilePath = path.join(tmpDir, "audio.mp3");
    
        const cleanup = () => {
          try {
            fs.rmSync(tmpDir, { recursive: true, force: true });
          } catch (error) {
            console.error("Failed to clean up temporary directory:", error);
          }
        };
    
        // Prepare API request
        const modelId = findModelId(config.speaker);
    
        const options = {
          method: "POST",
          headers: {
            Accept: "audio/mp3",
            Authorization: `Bearer ${apiKey}`,
            "Content-Type": "application/json",
          },
          body: JSON.stringify({
            speaker: config.speaker,
            text: text,
            modelId: modelId,
            lang: "eng",
            samplingRate: config.samplingRate,
            speedAlpha: config.speedAlpha,
            reduceLatency: config.reduceLatency,
          }),
        };
    
        // Make the API request and save the audio. If any step fails before
        // playback starts, remove the temporary directory so it is not leaked.
        console.error("Sending request to Rime API...");
        try {
          const response = await fetch("https://users.rime.ai/v1/rime-tts", options);
    
          if (!response.ok) {
            const errorText = await response.text();
            throw new Error(
              `API request failed: ${response.status} ${response.statusText} - ${errorText}`
            );
          }
    
          // Get audio data as arrayBuffer
          const audioBuffer = await response.arrayBuffer();
    
          // Write audio data to file
          fs.writeFileSync(audioFilePath, Buffer.from(audioBuffer));
          console.error(`Audio saved to ${audioFilePath}`);
        } catch (error) {
          cleanup();
          throw error;
        }
    
        return new Promise((resolve, reject) => {
          try {
            console.error("Starting audio playback...");
            const player = getAudioPlayerCommand();
    
            const playerProcess = spawn(player.cmd, [...player.args, audioFilePath]);
    
            playerProcess.stdout?.on("data", (data) => {
              console.error(`Player output: ${data}`);
            });
    
            playerProcess.stderr?.on("data", (data) => {
              console.error(`Player error: ${data}`);
            });
    
            playerProcess.on("close", (code) => {
              console.error(`Player process exited with code ${code || 0}`);
              cleanup();
              resolve();
            });
    
            playerProcess.on("error", (error: Error) => {
              console.error("Player process error:", error);
              cleanup();
              reject(error);
            });
          } catch (err) {
            cleanup();
            reject(err);
          }
        });
      } catch (error) {
        console.error("Error:", error);
        throw error;
      }
    }
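The playText function above calls getAudioPlayerCommand() and reads its `cmd` and `args` fields, but that helper is not shown on this page. The sketch below illustrates one way a per-platform player lookup could work. It is an assumption, not the project's implementation: `pickAudioPlayer` is a hypothetical name, and the choice of `afplay` on macOS and `mpg123 -q` on Linux is illustrative only.

```typescript
// Hypothetical sketch; the real getAudioPlayerCommand in rime-mcp is not shown
// here and may select different players or flags.
type PlayerCommand = { cmd: string; args: string[] };

// Map a platform identifier (for example Node's process.platform) to a
// command-line audio player that can play an MP3 file given as the last argument.
function pickAudioPlayer(platform: string): PlayerCommand {
  switch (platform) {
    case "darwin":
      // afplay ships with macOS.
      return { cmd: "afplay", args: [] };
    case "linux":
      // Assumes mpg123 is installed; -q suppresses its status output.
      return { cmd: "mpg123", args: ["-q"] };
    default:
      throw new Error(`No audio player configured for platform: ${platform}`);
  }
}

console.log(pickAudioPlayer("darwin"));
```

With a helper of this shape, the spawn call in playText would receive `player.cmd` and `[...player.args, audioFilePath]` exactly as it does in the listing above.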
Behavior: 3/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

No annotations are provided, so the description carries the full burden. It names the API ('Rime's text-to-speech API') and the usage context, but says nothing about rate limits, authentication requirements, error handling, or output format. The result is basic operational context without deeper behavioral insight.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 3/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is somewhat verbose and includes redundant sections like 'User configuration:' with WHO_TO_ADDRESS, WHEN_TO_SPEAK, VOICE, and GUIDANCE, which could be integrated more efficiently. While it provides useful information, the structure is not optimally front-loaded, and some sentences (e.g., the configuration headers) do not add significant value beyond the core description.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's moderate complexity (4 parameters, no output schema, no annotations), the description is fairly complete. It covers purpose, usage guidelines, and basic context, but lacks details on behavioral aspects like performance or errors. Without annotations or output schema, it does enough to guide usage but could be more comprehensive for full transparency.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 100%, so the schema fully documents all parameters. The description does not add any parameter-specific information beyond what the schema provides (e.g., it mentions 'VOICE: cove' but the schema already describes the 'speaker' parameter with a default). Baseline score of 3 is appropriate as the schema handles parameter documentation effectively.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 4/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose: 'Speak text aloud using Rime's text-to-speech API.' It specifies the verb ('speak') and resource ('text'), and distinguishes it from potential alternatives by mentioning the specific API. However, since there are no sibling tools, the differentiation aspect is not applicable, preventing a perfect score.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 5/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides explicit usage guidelines: 'Should be used when user asks you to speak or to announce and explain when you finish a command' and 'Use the speak tool to convert text to speech when the user requests audio output or when providing verbal responses.' It clearly defines when to use the tool, including specific scenarios, making it highly actionable for an AI agent.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/MatthewDailey/rime-mcp'
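The same request can be made from TypeScript. This is a minimal sketch that assumes a runtime with a global `fetch` (for example Node 18 or later) and assumes the endpoint returns JSON; the response shape is not documented on this page, so it is left as `unknown`. The function names are hypothetical.

```typescript
// Build the URL used by the curl command above from its two path segments.
function mcpServerUrl(owner: string, repo: string): string {
  return `https://glama.ai/api/mcp/v1/servers/${encodeURIComponent(owner)}/${encodeURIComponent(repo)}`;
}

// Fetch the server record. Assumption: the body is JSON; its fields are not
// specified here, so the caller must inspect or validate the result.
async function fetchServerInfo(owner: string, repo: string): Promise<unknown> {
  const response = await fetch(mcpServerUrl(owner, repo));
  if (!response.ok) {
    throw new Error(`Request failed: ${response.status} ${response.statusText}`);
  }
  return response.json();
}

console.log(mcpServerUrl("MatthewDailey", "rime-mcp"));
```

Calling `fetchServerInfo("MatthewDailey", "rime-mcp")` issues the same GET request as the curl command.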

If you have feedback or need assistance with the MCP directory API, please join our Discord server.