Pollinations Multimodal MCP Server

respondAudio

Convert text prompts into audio responses using customizable voice and format options for accessible content creation.

Instructions

Generate an audio response to a text prompt

Input Schema

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| prompt | Yes | The text prompt to respond to with audio | |
| voice | No | Voice to use for audio generation | "alloy" |
| format | No | Format of the audio (mp3, wav, etc.) | "mp3" |
| voiceInstructions | No | Additional instructions for voice character/style (e.g., "Speak with enthusiasm" or "Use a calm tone") | |
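
For reference, an invocation from an MCP client might look like the following hypothetical sketch. The client setup and transport are assumed; only the argument shape comes from the schema above.

    // Hypothetical call from an already-connected MCP client
    // (@modelcontextprotocol/sdk); argument names follow the schema above.
    const result = await client.callTool({
        name: "respondAudio",
        arguments: {
            prompt: "Summarize today's weather in one sentence.",
            voice: "alloy", // optional, defaults to "alloy"
            format: "mp3", // optional, defaults to "mp3"
            voiceInstructions: "Use a calm tone", // optional
        },
    });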

Implementation Reference

  • The handler function that implements the core logic of the respondAudio tool. It takes a text prompt, generates audio using the Pollinations Text-to-Speech API, converts it to base64, optionally plays it, and returns an MCP response with the audio data.
    async function respondAudio(params) {
        const {
            prompt,
            voice = "alloy",
            format = "mp3",
            voiceInstructions,
            audioPlayer,
            tempDir,
        } = params;
    
        if (!prompt || typeof prompt !== "string") {
            throw new Error("Prompt is required and must be a string");
        }
    
        // Prepare the query parameters
        const queryParams = {
            model: "openai-audio",
            voice,
            format,
        };
    
        // Prepare the prompt
        let finalPrompt = prompt;
    
        // Add voice instructions if provided
        if (voiceInstructions) {
            finalPrompt = `${voiceInstructions}\n\n${prompt}`;
        }
    
        // Build the URL using the utility function
        const url = buildUrl(
            AUDIO_API_BASE_URL,
            encodeURIComponent(finalPrompt),
            queryParams,
        );
    
        try {
            // Fetch the audio from the URL
            const response = await fetch(url);
    
            if (!response.ok) {
                throw new Error(`Failed to generate audio: ${response.statusText}`);
            }
    
            // Get the audio data as an ArrayBuffer
            const audioBuffer = await response.arrayBuffer();
    
            // Convert the ArrayBuffer to a base64 string
            const base64Data = Buffer.from(audioBuffer).toString("base64");
    
            // Determine the mime type from the format
            const mimeType = `audio/${format === "mp3" ? "mpeg" : format}`;
    
            // Play the audio if an audio player is provided
            if (audioPlayer) {
                const tempDirPath = tempDir || os.tmpdir();
                await playAudio(
                    base64Data,
                    mimeType,
                    "respond_audio",
                    audioPlayer,
                    tempDirPath,
                );
            }
    
            // Return the response in MCP format
            return createMCPResponse([
                {
                    type: "audio",
                    data: base64Data,
                    mimeType,
                },
                createTextContent(
                    `Generated audio response for prompt: "${prompt}"\n\nVoice: ${voice}\nFormat: ${format}`,
                ),
            ]);
        } catch (error) {
            console.error("Error generating audio:", error);
            throw error;
        }
    }
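
    The handler also depends on helpers (buildUrl, createMCPResponse, createTextContent, playAudio) and on Node's os module, all imported elsewhere in the source file and not shown in this excerpt. As a point of reference, a minimal buildUrl consistent with the call site above might look like this; the real implementation may differ.

    // Sketch of a buildUrl helper inferred from its call site above;
    // the exact query-string handling is an assumption.
    function buildUrl(baseUrl, encodedPrompt, queryParams) {
        const query = new URLSearchParams(queryParams).toString();
        return `${baseUrl}/${encodedPrompt}${query ? `?${query}` : ""}`;
    }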
  • The registration entry for the respondAudio tool within the exported audioTools array, which is used to register the tool with the MCP server. Includes name, description, input schema, and reference to the handler function.
    [
        "respondAudio",
        "Generate an audio response to a text prompt",
        {
            prompt: z
                .string()
                .describe("The text prompt to respond to with audio"),
            voice: z
                .string()
                .optional()
                .describe(
                    'Voice to use for audio generation (default: "alloy")',
                ),
            format: z
                .string()
                .optional()
                .describe("Format of the audio (mp3, wav, etc.)"),
            voiceInstructions: z
                .string()
                .optional()
                .describe(
                    'Additional instructions for voice character/style (e.g., "Speak with enthusiasm" or "Use a calm tone")',
                ),
        },
        respondAudio,
    ],
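
    For context, entries of this shape can be wired into a server with the MCP SDK's tool() method. The following loop is a hypothetical sketch; the server name, version, and actual startup code are assumptions and do not appear on this page.

    // Hypothetical registration of the audioTools entries with an MCP server.
    import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";

    const server = new McpServer({ name: "pollinations-multimodal", version: "1.0.0" });
    for (const [name, description, schema, handler] of audioTools) {
        server.tool(name, description, schema, handler);
    }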
  • Helper function used by respondAudio to play the generated audio through the provided audioPlayer, writing it to a temporary file first and removing the file after playback. It uses Node's fs and path modules, imported elsewhere in the source file.
    function playAudio(audioData, mimeType, prefix, audioPlayer, tempDir) {
        if (!audioPlayer || !tempDir) {
            return Promise.resolve();
        }
    
        return new Promise((resolve, reject) => {
            try {
                const format = getFormatFromMimeType(mimeType);
                const tempFile = path.join(
                    tempDir,
                    `${prefix}_${Date.now()}.${format}`,
                );
                fs.writeFileSync(tempFile, Buffer.from(audioData, "base64"));
    
                audioPlayer.play(tempFile, (err) => {
                    if (err) {
                        console.error("Error playing audio:", err);
                    }
    
                    // Clean up temp file after playing
                    try {
                        fs.unlinkSync(tempFile);
                    } catch (e) {
                        console.error("Error removing temp file:", e);
                    }
    
                    resolve();
                });
            } catch (error) {
                console.error("Error playing audio:", error);
                reject(error);
            }
        });
    }
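
    The audioPlayer argument is expected to expose a play(file, callback) method; this matches the interface of the play-sound npm package, though which package is actually used is an assumption. The helper also relies on getFormatFromMimeType (not shown), which presumably inverts the format-to-MIME mapping in respondAudio (e.g., audio/mpeg back to mp3). A hypothetical playback-enabled call:

    // Hypothetical caller enabling local playback; play-sound is an assumed
    // choice whose play(file, cb) interface matches audioPlayer.play above.
    import playSound from "play-sound";
    import os from "node:os";

    const audioPlayer = playSound();
    await respondAudio({
        prompt: "Say hello in a cheerful voice.",
        audioPlayer,
        tempDir: os.tmpdir(),
    });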
Behavior: 2/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries the full burden of behavioral disclosure. It states that the tool generates audio but omits critical details: whether the operation is read-only or mutative, what permissions or authentication may be required, potential rate limits, output details beyond the parameters, and error handling. For a tool with four parameters and no annotations, this is a significant gap in transparency.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is extremely concise and front-loaded: 'Generate an audio response to a text prompt' is a single, clear sentence that directly states the tool's function. There is no wasted verbiage or unnecessary elaboration, making it efficient and easy to parse for an AI agent.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 2/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the complexity (four parameters, no output schema, no annotations), the description is incomplete. It doesn't explain what the tool returns (e.g., audio data, a file URL, or metadata), error conditions, or behavioral traits such as mutability or side effects. For an audio-generation tool with multiple inputs, more context is needed to ensure correct usage.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

The description names the 'prompt' parameter but adds no meaning beyond what the input schema provides. With 100% schema description coverage, every parameter is documented in the schema ('prompt' as the text input, 'voice' with its default, 'format' as the audio type, 'voiceInstructions' for style). The description doesn't elaborate on parameter interactions or give usage examples, so it meets the baseline set by high schema coverage without adding extra value.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 4/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose: 'Generate an audio response to a text prompt.' It specifies the verb ('generate'), the resource ('audio response'), and the input ('text prompt'), making it easy to understand what the tool does. However, it doesn't explicitly differentiate the tool from siblings like 'sayText' or 'listAudioVoices', which would require more specific context about when to use each.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 2/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides no guidance on when to use this tool versus alternatives. It doesn't mention sibling tools like 'sayText' (which might be for direct text-to-speech) or 'listAudioVoices' (which could list available voices), nor does it specify prerequisites, exclusions, or contextual cues. This leaves an agent with little basis for choosing between them.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
