OpenRouter MCP Multimodal Server

by hoangdn3

mcp_openrouter_analyze_audio

Transcribe audio files from URLs or local paths to extract text content. Supports wav and mp3 formats using OpenRouter models for accurate speech-to-text conversion.

Instructions

Transcribe audio files and provide raw content. Supports wav/mp3 files from CDN URLs or local paths.

Input Schema

audio_url (required): Path or URL to the audio file (supports CDN URLs, local file paths, wav/mp3 formats).
model (optional): OpenRouter model to use (e.g., "mistralai/voxtral-small-24b-2507", "openai/gpt-4o-audio-preview"). Defaults to the server's configured audio model.
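A call to this tool might pass arguments like the following sketch; the CDN URL and model name are illustrative, not fixed values:

```typescript
// Hypothetical arguments for a mcp_openrouter_analyze_audio call.
const exampleArgs = {
  audio_url: "https://cdn.example.com/uploads/interview.mp3",
  // Optional: omit to fall back to the server's default audio model.
  model: "mistralai/voxtral-small-24b-2507",
};

console.log(JSON.stringify(exampleArgs));
```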

Implementation Reference

  • Core handler function for the mcp_openrouter_analyze_audio tool. It fetches audio from a URL or local path, encodes it to base64, detects the format (wav/mp3), builds an OpenAI chat-completion request with input_audio content, calls the selected OpenRouter model (falling back to a backup model and then a free model on failure), and returns a JSON-structured transcription result or error.
    export async function handleAnalyzeAudio(
      request: { params: { arguments: AnalyzeAudioToolRequest } },
      openai: OpenAI,
      defaultModel?: string
    ): Promise<any> {
      try {
        const args = request.params.arguments;
    
        // Validate input
        if (!args.audio_url) {
          throw new McpError(
            ErrorCode.InvalidParams,
            'audio_url parameter is required'
          );
        }
    
        // Fetch audio
        const buffer = await fetchAudio(args.audio_url);
        const format = detectAudioFormat(buffer, args.audio_url);
        const base64 = buffer.toString('base64');
    
        // Build content array
        const content: Array<{
          type: string;
          text?: string;
          input_audio?: {
            data: string;
            format: string;
          };
        }> = [];
    
        // Add fixed transcription instruction
        content.push({
          type: 'text',
          text: 'Please transcribe and provide me the raw content of this audio.'
        });
    
        // Add audio
        content.push({
          type: 'input_audio',
          input_audio: {
            data: base64,
            format: format
          }
        });
    
        // Select model
        const model = args.model || defaultModel || DEFAULT_AUDIO_MODEL;
        console.error(`[Audio Tool] Using AUDIO model: ${model}`);
    
        // Try primary model first
        try {
          const completion = await openai.chat.completions.create({
            model,
            messages: [{
              role: 'user',
              content
            }] as any
          });
    
          const response = completion as any;
          return {
            content: [
              {
                type: 'text',
                text: JSON.stringify({
                  id: response.id,
                  analysis: completion.choices[0].message.content || '',
                  model: response.model,
                  usage: response.usage
                }),
              },
            ],
          };
        } catch (primaryError: any) {
          // Try backup model
          const backupModel = process.env.OPENROUTER_DEFAULT_MODEL_AUDIO_BACKUP;
          if (backupModel && backupModel !== model) {
            try {
              const completion = await openai.chat.completions.create({
                model: backupModel,
                messages: [{
                  role: 'user',
                  content
                }] as any
              });
    
              const resp = completion as any;
              return {
                content: [
                  {
                    type: 'text',
                    text: JSON.stringify({
                      id: resp.id,
                      analysis: completion.choices[0].message.content || '',
                      model: resp.model,
                      usage: resp.usage
                    }),
                  },
                ],
              };
            } catch (backupError: any) {
              // Try free audio model
              const freeModel = await findSuitableFreeAudioModel(openai);
              if (freeModel && freeModel !== model && freeModel !== backupModel) {
                const completion = await openai.chat.completions.create({
                  model: freeModel,
                  messages: [{
                    role: 'user',
                    content
                  }] as any
                });
    
                const resp = completion as any;
                return {
                  content: [
                    {
                      type: 'text',
                      text: JSON.stringify({
                        id: resp.id,
                        analysis: completion.choices[0].message.content || '',
                        model: resp.model,
                        usage: resp.usage
                      }),
                    },
                  ],
                };
              } else {
                throw backupError;
              }
            }
          } else {
            // No backup, try free model directly
            const freeModel = await findSuitableFreeAudioModel(openai);
            if (freeModel && freeModel !== model) {
              const completion = await openai.chat.completions.create({
                model: freeModel,
                messages: [{
                  role: 'user',
                  content
                }] as any
              });
    
              const resp = completion as any;
              return {
                content: [
                  {
                    type: 'text',
                    text: JSON.stringify({
                      id: resp.id,
                      analysis: completion.choices[0].message.content || '',
                      model: resp.model,
                      usage: resp.usage
                    }),
                  },
                ],
              };
            } else {
              throw primaryError;
            }
          }
        }
      } catch (error) {
        if (error instanceof McpError) {
          throw error;
        }
    
        return {
          content: [
            {
              type: 'text',
              text: JSON.stringify({
                error: error instanceof Error ? error.message : String(error),
                model: request.params.arguments.model || defaultModel || DEFAULT_AUDIO_MODEL,
                usage: { prompt_tokens: 0, completion_tokens: 0, total_tokens: 0 }
              }),
            },
          ],
          isError: true,
        };
      }
    }
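  • The nested try/catch fallback in the handler above could equivalently be written as a loop over candidate models. This is a sketch, not code from this server; `callModel` stands in for the `openai.chat.completions.create` call:

```typescript
// Try each candidate model in order, returning the first successful result.
// Duplicates and empty entries are skipped, mirroring the `!== model`
// guards in the handler.
async function tryModelsInOrder<T>(
  candidates: Array<string | undefined>,
  callModel: (model: string) => Promise<T>
): Promise<T> {
  let lastError: unknown = new Error('no candidate models provided');
  const seen = new Set<string>();
  for (const model of candidates) {
    if (!model || seen.has(model)) continue;
    seen.add(model);
    try {
      return await callModel(model);
    } catch (err) {
      // Remember the failure and fall through to the next candidate.
      lastError = err;
    }
  }
  throw lastError;
}
```

    With this shape, the handler would build the candidate list once (primary, backup from the environment, then a free model) and construct the JSON response in a single place, removing the three duplicated response blocks.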
  • Registers the mcp_openrouter_analyze_audio tool in the MCP server's listTools handler, defining its name, description, and input schema.
    {
      name: 'mcp_openrouter_analyze_audio',
      description: 'Transcribe audio files and provide raw content. Supports wav/mp3 files from CDN URLs or local paths.',
      inputSchema: {
        type: 'object',
        properties: {
          audio_url: {
            type: 'string',
            description: 'Path or URL to the audio file (supports CDN URLs, local file paths, wav/mp3 formats)',
          },
          model: {
            type: 'string',
            description: 'OpenRouter model to use (e.g., "mistralai/voxtral-small-24b-2507", "openai/gpt-4o-audio-preview")',
          },
        },
        required: ['audio_url'],
      },
    },
  • Switch case in CallToolRequestSchema handler that routes calls to 'mcp_openrouter_analyze_audio' to the handleAnalyzeAudio function.
    case 'mcp_openrouter_analyze_audio':
      return handleAnalyzeAudio({
        params: {
          arguments: request.params.arguments as unknown as AnalyzeAudioToolRequest
        }
      }, this.openai, this.defaultAudioModel);
  • TypeScript interface defining the input parameters for the analyze audio tool.
    export interface AnalyzeAudioToolRequest {
      audio_url: string;
      model?: string;
    }
  • Helper function to fetch audio data from remote URL or local file path and return as Buffer.
    async function fetchAudio(audioPath: string): Promise<Buffer> {
      if (isUrl(audioPath)) {
        // Fetch from URL
        const response = await fetch(audioPath);
        if (!response.ok) {
          throw new Error(`Failed to fetch audio: ${response.statusText}`);
        }
        const arrayBuffer = await response.arrayBuffer();
        return Buffer.from(arrayBuffer);
      } else {
        // Read from local file
        const normalizedPath = normalizePath(audioPath);
        let resolvedPath = normalizedPath;
    
        if (!path.isAbsolute(resolvedPath)) {
          resolvedPath = path.resolve(process.cwd(), resolvedPath);
        }
    
        return await fs.readFile(resolvedPath);
      }
    }
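  • The `detectAudioFormat` helper referenced by the handler is not shown on this page. A plausible sketch (an assumption about the real implementation, not the server's code) checks magic bytes first and falls back to the file extension:

```typescript
// Hypothetical detectAudioFormat: inspect magic bytes, then the extension.
function detectAudioFormat(buffer: Buffer, audioPath: string): string {
  // WAV files start with "RIFF" and carry "WAVE" at offset 8.
  if (
    buffer.length >= 12 &&
    buffer.toString('ascii', 0, 4) === 'RIFF' &&
    buffer.toString('ascii', 8, 12) === 'WAVE'
  ) {
    return 'wav';
  }
  // MP3 files start with an ID3 tag or an MPEG frame-sync (0xFF Ex/Fx).
  if (buffer.length >= 3 && buffer.toString('ascii', 0, 3) === 'ID3') {
    return 'mp3';
  }
  if (buffer.length >= 2 && buffer[0] === 0xff && (buffer[1] & 0xe0) === 0xe0) {
    return 'mp3';
  }
  // Magic bytes inconclusive: fall back to the extension.
  return audioPath.toLowerCase().endsWith('.wav') ? 'wav' : 'mp3';
}
```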
Behavior: 2/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

No annotations are provided, so the description carries the full burden of behavioral disclosure. It mentions that the tool transcribes audio and provides raw content, but it lacks details on behavioral traits such as rate limits, authentication needs, error handling, or whether the operation is read-only or has side effects. This is a significant gap for a tool with no annotation coverage.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is concise and well-structured in a single sentence, front-loaded with the core purpose ('Transcribe audio files and provide raw content') followed by supporting details. Every part earns its place without redundancy, making it efficient for an agent to parse.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 3/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's complexity (audio transcription with model selection), lack of annotations, and no output schema, the description is moderately complete. It covers the basic purpose and input sources but misses behavioral context and output details. It's adequate as a minimum viable description but has clear gaps in guiding the agent fully.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 100%, so the input schema already documents both parameters (audio_url and model) thoroughly. The description adds minimal value beyond the schema by reiterating supported formats and sources, but it doesn't provide additional semantic context like examples of model choices or usage tips. Baseline 3 is appropriate as the schema does the heavy lifting.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 4/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose: 'Transcribe audio files and provide raw content.' It specifies the action (transcribe), resource (audio files), and output (raw content). However, it doesn't explicitly differentiate from sibling tools like mcp_openrouter_chat_completion or mcp_openrouter_analyze_image, which handle different media types, so it misses the top score.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 3/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description implies usage by mentioning supported file formats (wav/mp3) and sources (CDN URLs or local paths), but it doesn't provide explicit guidance on when to use this tool versus alternatives. For example, it doesn't compare to sibling tools like mcp_openrouter_chat_completion for text-based tasks or specify prerequisites. This leaves some ambiguity for the agent.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
