Garblesnarff

Gemini MCP Server for Claude Desktop

gemini-transcribe-audio

Convert audio files to text using AI transcription with language and context support for accurate results.

Instructions

Transcribe audio files to text using Gemini's multimodal capabilities (with learned user preferences)

Input Schema

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| file_path | Yes | Path to the audio file to transcribe (supports MP3, WAV, FLAC, AAC, OGG, WEBM) | — |
| language | No | Optional language hint for better transcription accuracy (e.g., "en", "es", "fr") | — |
| context | No | Optional context for intelligent enhancement (e.g., "medical", "legal", "technical") | — |
| preserve_spelled_acronyms | No | Keep spelled-out letters (U-R-L) instead of converting to acronyms (URL) | false |
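For reference, a hypothetical invocation of this tool might pass arguments shaped like the following; the file path and values here are illustrative only, not taken from the server's documentation:

```javascript
// Hypothetical example arguments for a gemini-transcribe-audio call.
// Only file_path is required; the other fields are optional hints.
const args = {
  file_path: '/tmp/meeting.mp3',     // required: path to the audio file
  language: 'en',                    // optional: language hint for accuracy
  context: 'technical',              // optional: enhancement context
  preserve_spelled_acronyms: false,  // optional: handler defaults this to false
};
```

Passing `context: 'verbatim'` instead would trigger the word-for-word transcription mode handled specially in the implementation below.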

Implementation Reference

  • The primary handler function that executes the gemini-transcribe-audio tool. Validates inputs, reads audio file, constructs enhanced prompts (with intelligence system or verbatim mode), calls Gemini service for transcription, post-processes if verbatim, and learns from interaction.
    async execute(args) {
      const filePath = validateNonEmptyString(args.file_path, 'file_path');
      const language = args.language ? validateString(args.language, 'language') : '';
      const context = args.context ? validateString(args.context, 'context') : null;
      const preserveSpelledAcronyms = args.preserve_spelled_acronyms || false;
    
      log(`Transcribing audio file: "${filePath}" with context: ${context || 'general'}`, this.name);
    
      try {
        validateFileSize(filePath, config.MAX_AUDIO_SIZE_MB);
        const audioBuffer = readFileAsBuffer(filePath);
        const audioBase64 = audioBuffer.toString('base64');
        const mimeType = getMimeType(filePath, config.SUPPORTED_AUDIO_MIMES);
    
        log(`Audio file loaded: ${(audioBuffer.length / 1024).toFixed(2)}KB, MIME type: ${mimeType}`, this.name);
    
        let basePrompt = 'Please transcribe this audio file accurately. Provide the complete text of what is spoken.';
        if (language) {
          basePrompt += ` The audio is in ${language}.`;
        }
    
        let enhancedPrompt = basePrompt;
        if (this.intelligenceSystem.initialized && context !== 'verbatim') {
          try {
            enhancedPrompt = await this.intelligenceSystem.enhancePrompt(basePrompt, context, this.name);
            log('Applied Tool Intelligence enhancement', this.name);
          } catch (err) {
            log(`Tool Intelligence enhancement failed: ${err.message}`, this.name);
          }
        } else if (context === 'verbatim') {
          // For verbatim mode, apply special enhancement directly
          enhancedPrompt = 'TRANSCRIBE VERBATIM: Provide ONLY an exact word-for-word transcription of the spoken content. Do NOT analyze, interpret, or explain the content. ';
          
          enhancedPrompt += 'Include all utterances, pauses, filler words (um, uh, etc.), repetitions, and incomplete sentences. Preserve emotional expressions in brackets like [laughs] or [sighs]. ';
          
          if (preserveSpelledAcronyms) {
            enhancedPrompt += 'Keep spelled-out letters exactly as spoken with hyphens (like U-R-L, H-T-T-P-S, not URL or HTTPS). ';
          } else {
            enhancedPrompt += 'Keep spelled-out letters as hyphenated (like U-R-L not URL). ';
          }
          
          enhancedPrompt += 'Maintain original punctuation including ellipses (...) and dashes. Do not add any artifacts or sounds at the end of the transcription. ';
          enhancedPrompt += 'CRITICAL: Do NOT provide analysis, interpretation, summary, or explanation. ONLY provide the exact spoken words. ';
          enhancedPrompt += 'This is a verbatim transcription task - transcribe exactly what is spoken, nothing more.';
          
          log('Applied verbatim transcription enhancement', this.name);
        }
    
        let transcriptionPrompt = enhancedPrompt;
        if (context) {
          transcriptionPrompt += ` Context: ${context}`;
        }
    
        let transcriptionText = await this.geminiService.transcribeAudio('AUDIO_TRANSCRIPTION', audioBase64, mimeType, transcriptionPrompt);
    
        // Post-process transcription for verbatim mode
        if (context === 'verbatim' && transcriptionText) {
          transcriptionText = this.cleanVerbatimTranscription(transcriptionText, preserveSpelledAcronyms);
        }
    
        if (transcriptionText) {
          log('Audio transcription completed successfully', this.name);
    
          if (this.intelligenceSystem.initialized) {
            try {
              await this.intelligenceSystem.learnFromInteraction(basePrompt, enhancedPrompt, `Transcription completed successfully: ${transcriptionText.length} characters`, context, this.name);
              log('Tool Intelligence learned from interaction', this.name);
            } catch (err) {
              log(`Tool Intelligence learning failed: ${err.message}`, this.name);
            }
          }
    
          let finalResponse = `✓ Audio file transcribed successfully:\n\n**File:** ${filePath}\n**Size:** ${(audioBuffer.length / 1024).toFixed(2)}KB\n**Format:** ${filePath.split('.').pop().toUpperCase()}\n\n**Transcription:**\n${transcriptionText}`; // eslint-disable-line max-len
          if (context && this.intelligenceSystem.initialized) {
            finalResponse += `\n\n---\n_Enhancement applied based on context: ${context}_`;
          }
    
          return {
            content: [
              {
                type: 'text',
                text: finalResponse,
              },
            ],
          };
        }
        log('No transcription text generated', this.name);
        return {
          content: [
            {
              type: 'text',
              text: `Could not transcribe audio file: "${filePath}". The audio may be unclear, too quiet, or in an unsupported language.`,
            },
          ],
        };
      } catch (error) {
        log(`Error transcribing audio: ${error.message}`, this.name);
        throw new Error(`Error transcribing audio: ${error.message}`);
      }
    }
  • Input schema/JSON Schema definition for the tool parameters passed to the constructor.
    {
      type: 'object',
      properties: {
        file_path: {
          type: 'string',
          description: 'Path to the audio file to transcribe (supports MP3, WAV, FLAC, AAC, OGG, WEBM)',
        },
        language: {
          type: 'string',
          description: 'Optional language hint for better transcription accuracy (e.g., "en", "es", "fr")',
        },
        context: {
          type: 'string',
          description: 'Optional context for intelligent enhancement (e.g., "medical", "legal", "technical")',
        },
        preserve_spelled_acronyms: {
          type: 'boolean',
          description: 'Keep spelled-out letters (U-R-L) instead of converting to acronyms (URL)',
        },
      },
      required: ['file_path'],
    },
  • Instantiates and registers the AudioTranscriptionTool with name 'gemini-transcribe-audio' in the central tool registry.
    registerTool(new AudioTranscriptionTool(intelligenceSystem, geminiService));
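The registry itself is not shown on this page; a minimal sketch of the shape `registerTool` implies might look like the following (hypothetical, assuming tools expose a `name` property as the handler above does, and not the server's actual implementation):

```javascript
// Hypothetical minimal tool registry; not the server's actual code.
const toolRegistry = new Map();

function registerTool(tool) {
  if (toolRegistry.has(tool.name)) {
    throw new Error(`Tool already registered: ${tool.name}`);
  }
  toolRegistry.set(tool.name, tool); // index tools by their declared name
}

// Usage sketch with a stub tool object:
registerTool({ name: 'gemini-transcribe-audio', execute: async () => ({ content: [] }) });
```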
  • Supporting helper method for post-processing verbatim transcriptions to remove artifacts, format emotional expressions in brackets, and clean formatting.
    cleanVerbatimTranscription(text, preserveSpelledAcronyms = false) {
      // Remove common end-of-file artifacts (pppp, Ppppppppppp)
      text = text.replace(/\s*[Pp]+\s*$/, '');
      
      // Fix spacing before punctuation
      text = text.replace(/\s+([.!?,;:])/g, '$1');
      
      // Fix common emotional expressions that may have lost brackets
      const emotions = [
        'laughs?', 'laugh', 'laughing', 'laughed',
        'sighs?', 'sigh', 'sighing', 'sighed',
        'clears? throat', 'clearing throat', 'cleared throat',
        'coughs?', 'cough', 'coughing', 'coughed',
        'sniffles?', 'sniffle', 'sniffling', 'sniffled',
        'pauses?', 'pause', 'pausing', 'paused',
        'giggles?', 'giggle', 'giggling', 'giggled',
        'whispers?', 'whisper', 'whispering', 'whispered',
        'yells?', 'yell', 'yelling', 'yelled',
        'groans?', 'groan', 'groaning', 'groaned',
        'gasps?', 'gasp', 'gasping', 'gasped',
        'chuckles?', 'chuckle', 'chuckling', 'chuckled',
        'smiles?', 'smile', 'smiling', 'smiled',
        'cries?', 'cry', 'crying', 'cried',
        'frustrated sigh', 'heavy sigh', 'deep sigh', 'long sigh',
        'nervous laugh', 'awkward laugh', 'bitter laugh',
        'scoffs?', 'scoff', 'scoffing', 'scoffed',
        'mumbles?', 'mumble', 'mumbling', 'mumbled',
        'mutters?', 'mutter', 'muttering', 'muttered',
        'stammers?', 'stammer', 'stammering', 'stammered',
        'stutters?', 'stutter', 'stuttering', 'stuttered',
        'shouts?', 'shout', 'shouting', 'shouted',
        'screams?', 'scream', 'screaming', 'screamed',
        'exhales?', 'exhale', 'exhaling', 'exhaled',
        'inhales?', 'inhale', 'inhaling', 'inhaled',
        'breathes? deeply', 'breathing deeply', 'breathed deeply',
        'nods?', 'nod', 'nodding', 'nodded',
        'shakes? head', 'shaking head', 'shook head',
        'silence', 'long pause', 'brief pause', 'awkward silence'
      ];
      const emotionRegex = new RegExp(`\\b(${emotions.join('|')})\\b`, 'gi');
      text = text.replace(emotionRegex, '[$1]');
      
      // Ensure ellipses are preserved properly
      text = text.replace(/\s+\.\s+\.\s+\./g, '...');
      
      // Clean up any double spaces
      text = text.replace(/\s{2,}/g, ' ');
      
      return text.trim();
    }
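Setting aside the emotion bracketing, the regex cleanup steps above can be exercised in isolation. This standalone sketch reproduces only the artifact-strip, punctuation-spacing, ellipsis, and whitespace rules, in the same order the method applies them:

```javascript
// Simplified standalone sketch of the verbatim cleanup rules above.
// The emotion-bracketing step is deliberately omitted here.
function cleanVerbatimSketch(text) {
  return text
    .replace(/\s*[Pp]+\s*$/, '')        // strip trailing "pppp" artifacts
    .replace(/\s+([.!?,;:])/g, '$1')    // remove space before punctuation
    .replace(/\s+\.\s+\.\s+\./g, '...') // collapse spaced dots into an ellipsis
    .replace(/\s{2,}/g, ' ')            // collapse runs of whitespace
    .trim();
}
```

Note that the order matters: the punctuation-spacing rule already collapses `" . . ."` into `"..."` before the dedicated ellipsis rule runs, so the latter acts as a safety net.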
Behavior: 2/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries full burden but offers minimal behavioral disclosure. It mentions 'learned user preferences' but doesn't explain what this entails (e.g., customization, history). It lacks details on rate limits, authentication needs, output format, error handling, or processing time. For a tool with 4 parameters and no annotations, this is insufficient.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is a single, efficient sentence that front-loads the core purpose. However, it could be more structured by separating functional description from behavioral context. It avoids redundancy but misses opportunities to add crucial usage details.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 2/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given 4 parameters, no annotations, and no output schema, the description is incomplete. It doesn't explain the return value (e.g., transcription text format), error cases, or how 'learned user preferences' affect behavior. For a tool with moderate complexity and no structured safety hints, more context is needed.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 100%, so the schema fully documents all parameters. The description adds no additional parameter semantics beyond what's in the schema (e.g., it doesn't clarify 'context' usage or 'learned preferences' interaction with parameters). Baseline 3 is appropriate when schema does the heavy lifting.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the specific action ('Transcribe audio files to text'), identifies the resource ('audio files'), and mentions the unique capability ('using Gemini's multimodal capabilities with learned user preferences'). It distinguishes itself from sibling tools by focusing on audio transcription rather than image/video analysis, code execution, or file uploads.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 2/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides no guidance on when to use this tool versus alternatives. It doesn't mention prerequisites (e.g., file accessibility), exclusions (e.g., unsupported formats beyond those in schema), or comparisons with other transcription tools. The context is implied but not explicit.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
