gemini-transcribe-audio

Convert audio files to text with Gemini-powered transcription, using optional language and context hints for more accurate results.

Instructions

Transcribe audio files to text using Gemini's multimodal capabilities (with learned user preferences)

Input Schema

Name | Required | Description | Default
--- | --- | --- | ---
file_path | Yes | Path to the audio file to transcribe (supports MP3, WAV, FLAC, AAC, OGG, WEBM) |
language | No | Optional language hint for better transcription accuracy (e.g., "en", "es", "fr") |
context | No | Optional context for intelligent enhancement (e.g., "medical", "legal", "technical") |
preserve_spelled_acronyms | No | Keep spelled-out letters (U-R-L) instead of converting to acronyms (URL) |
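
For illustration, a tools/call request for this tool might carry arguments like the following. The file path and option values are hypothetical examples; only file_path is required.

    // Hypothetical MCP tools/call payload for gemini-transcribe-audio.
    // Values are examples, not defaults.
    const request = {
      method: 'tools/call',
      params: {
        name: 'gemini-transcribe-audio',
        arguments: {
          file_path: '/path/to/interview.mp3', // required
          language: 'en',                      // optional language hint
          context: 'verbatim',                 // optional; 'verbatim' requests strict word-for-word output
          preserve_spelled_acronyms: true,     // optional; keep "U-R-L" instead of "URL"
        },
      },
    };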

Implementation Reference

  • The primary handler function that executes the gemini-transcribe-audio tool. It validates inputs, reads the audio file, constructs the enhanced prompt (via the intelligence system or verbatim mode), calls the Gemini service for transcription, post-processes the result in verbatim mode, and learns from the interaction.
    async execute(args) {
      const filePath = validateNonEmptyString(args.file_path, 'file_path');
      const language = args.language ? validateString(args.language, 'language') : '';
      const context = args.context ? validateString(args.context, 'context') : null;
      const preserveSpelledAcronyms = args.preserve_spelled_acronyms || false;

      log(`Transcribing audio file: "${filePath}" with context: ${context || 'general'}`, this.name);

      try {
        validateFileSize(filePath, config.MAX_AUDIO_SIZE_MB);
        const audioBuffer = readFileAsBuffer(filePath);
        const audioBase64 = audioBuffer.toString('base64');
        const mimeType = getMimeType(filePath, config.SUPPORTED_AUDIO_MIMES);

        log(`Audio file loaded: ${(audioBuffer.length / 1024).toFixed(2)}KB, MIME type: ${mimeType}`, this.name);

        let basePrompt = 'Please transcribe this audio file accurately. Provide the complete text of what is spoken.';
        if (language) {
          basePrompt += ` The audio is in ${language}.`;
        }

        let enhancedPrompt = basePrompt;
        if (this.intelligenceSystem.initialized && context !== 'verbatim') {
          try {
            enhancedPrompt = await this.intelligenceSystem.enhancePrompt(basePrompt, context, this.name);
            log('Applied Tool Intelligence enhancement', this.name);
          } catch (err) {
            log(`Tool Intelligence enhancement failed: ${err.message}`, this.name);
          }
        } else if (context === 'verbatim') {
          // For verbatim mode, apply special enhancement directly
          enhancedPrompt = 'TRANSCRIBE VERBATIM: Provide ONLY an exact word-for-word transcription of the spoken content. Do NOT analyze, interpret, or explain the content. ';
          enhancedPrompt += 'Include all utterances, pauses, filler words (um, uh, etc.), repetitions, and incomplete sentences. Preserve emotional expressions in brackets like [laughs] or [sighs]. ';
          if (preserveSpelledAcronyms) {
            enhancedPrompt += 'Keep spelled-out letters exactly as spoken with hyphens (like U-R-L, H-T-T-P-S, not URL or HTTPS). ';
          } else {
            enhancedPrompt += 'Keep spelled-out letters as hyphenated (like U-R-L not URL). ';
          }
          enhancedPrompt += 'Maintain original punctuation including ellipses (...) and dashes. Do not add any artifacts or sounds at the end of the transcription. ';
          enhancedPrompt += 'CRITICAL: Do NOT provide analysis, interpretation, summary, or explanation. ONLY provide the exact spoken words. ';
          enhancedPrompt += 'This is a verbatim transcription task - transcribe exactly what is spoken, nothing more.';
          log('Applied verbatim transcription enhancement', this.name);
        }

        let transcriptionPrompt = enhancedPrompt;
        if (context) {
          transcriptionPrompt += ` Context: ${context}`;
        }

        let transcriptionText = await this.geminiService.transcribeAudio('AUDIO_TRANSCRIPTION', audioBase64, mimeType, transcriptionPrompt);

        // Post-process transcription for verbatim mode
        if (context === 'verbatim' && transcriptionText) {
          transcriptionText = this.cleanVerbatimTranscription(transcriptionText, preserveSpelledAcronyms);
        }

        if (transcriptionText) {
          log('Audio transcription completed successfully', this.name);

          if (this.intelligenceSystem.initialized) {
            try {
              await this.intelligenceSystem.learnFromInteraction(basePrompt, enhancedPrompt, `Transcription completed successfully: ${transcriptionText.length} characters`, context, this.name);
              log('Tool Intelligence learned from interaction', this.name);
            } catch (err) {
              log(`Tool Intelligence learning failed: ${err.message}`, this.name);
            }
          }

          let finalResponse = `✓ Audio file transcribed successfully:\n\n**File:** ${filePath}\n**Size:** ${(audioBuffer.length / 1024).toFixed(2)}KB\n**Format:** ${filePath.split('.').pop().toUpperCase()}\n\n**Transcription:**\n${transcriptionText}`; // eslint-disable-line max-len
          if (context && this.intelligenceSystem.initialized) {
            finalResponse += `\n\n---\n_Enhancement applied based on context: ${context}_`;
          }

          return {
            content: [
              {
                type: 'text',
                text: finalResponse,
              },
            ],
          };
        }

        log('No transcription text generated', this.name);
        return {
          content: [
            {
              type: 'text',
              text: `Could not transcribe audio file: "${filePath}". The audio may be unclear, too quiet, or in an unsupported language.`,
            },
          ],
        };
      } catch (error) {
        log(`Error transcribing audio: ${error.message}`, this.name);
        throw new Error(`Error transcribing audio: ${error.message}`);
      }
    }
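  • The GeminiService called above (geminiService.transcribeAudio) is not reproduced on this page. As a rough sketch only, assuming the official @google/generative-ai SDK and an arbitrary model name, the service call might look something like this:
    // Sketch only: the real GeminiService in the repository may differ.
    const { GoogleGenerativeAI } = require('@google/generative-ai');

    class GeminiServiceSketch {
      constructor(apiKey) {
        this.genAI = new GoogleGenerativeAI(apiKey);
      }

      async transcribeAudio(taskType, audioBase64, mimeType, prompt) {
        // taskType ('AUDIO_TRANSCRIPTION') would presumably select model/config internally.
        const model = this.genAI.getGenerativeModel({ model: 'gemini-1.5-flash' }); // assumed model name
        const result = await model.generateContent([
          { inlineData: { data: audioBase64, mimeType } }, // base64 audio prepared by execute()
          { text: prompt },                                // the enhanced transcription prompt
        ]);
        return result.response.text();
      }
    }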
  • The JSON Schema definition for the tool's input parameters, passed to the constructor.
    {
      type: 'object',
      properties: {
        file_path: {
          type: 'string',
          description: 'Path to the audio file to transcribe (supports MP3, WAV, FLAC, AAC, OGG, WEBM)',
        },
        language: {
          type: 'string',
          description: 'Optional language hint for better transcription accuracy (e.g., "en", "es", "fr")',
        },
        context: {
          type: 'string',
          description: 'Optional context for intelligent enhancement (e.g., "medical", "legal", "technical")',
        },
        preserve_spelled_acronyms: {
          type: 'boolean',
          description: 'Keep spelled-out letters (U-R-L) instead of converting to acronyms (URL)',
        },
      },
      required: ['file_path'],
    },
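  • Not part of the repository: arguments can be checked against the schema above with a generic JSON Schema validator such as Ajv (the inputSchema variable name below is assumed).
    const Ajv = require('ajv');
    const ajv = new Ajv();
    const validate = ajv.compile(inputSchema); // the schema object shown above

    console.log(validate({ file_path: '/tmp/test.wav' })); // true
    console.log(validate({ language: 'en' }));             // false: file_path is required
    console.log(validate.errors);                          // details of the failed "required" check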
  • Instantiates the AudioTranscriptionTool and registers it in the central tool registry under the name 'gemini-transcribe-audio'.
    registerTool(new AudioTranscriptionTool(intelligenceSystem, geminiService));
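  • The registry itself is not shown on this page. A minimal hypothetical sketch, assuming it is a simple name-to-instance map consulted when tools are listed and called, could look like:
    // Hypothetical registry sketch; the actual implementation may differ.
    const toolRegistry = new Map();

    function registerTool(tool) {
      // Tools are keyed by their MCP name, e.g. 'gemini-transcribe-audio'.
      toolRegistry.set(tool.name, tool);
    }

    async function callTool(name, args) {
      const tool = toolRegistry.get(name);
      if (!tool) throw new Error(`Unknown tool: ${name}`);
      return tool.execute(args);
    }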
  • Supporting helper method that post-processes verbatim transcriptions: it strips trailing artifacts, wraps emotional expressions in brackets, and normalizes spacing and punctuation.
    cleanVerbatimTranscription(text, preserveSpelledAcronyms = false) {
      // Remove common end-of-file artifacts (pppp, Ppppppppppp)
      text = text.replace(/\s*[Pp]+\s*$/, '');

      // Fix spacing before punctuation
      text = text.replace(/\s+([.!?,;:])/g, '$1');

      // Fix common emotional expressions that may have lost brackets
      const emotions = [
        'laughs?', 'laugh', 'laughing', 'laughed',
        'sighs?', 'sigh', 'sighing', 'sighed',
        'clears? throat', 'clearing throat', 'cleared throat',
        'coughs?', 'cough', 'coughing', 'coughed',
        'sniffles?', 'sniffle', 'sniffling', 'sniffled',
        'pauses?', 'pause', 'pausing', 'paused',
        'giggles?', 'giggle', 'giggling', 'giggled',
        'whispers?', 'whisper', 'whispering', 'whispered',
        'yells?', 'yell', 'yelling', 'yelled',
        'groans?', 'groan', 'groaning', 'groaned',
        'gasps?', 'gasp', 'gasping', 'gasped',
        'chuckles?', 'chuckle', 'chuckling', 'chuckled',
        'smiles?', 'smile', 'smiling', 'smiled',
        'cries?', 'cry', 'crying', 'cried',
        'frustrated sigh', 'heavy sigh', 'deep sigh', 'long sigh',
        'nervous laugh', 'awkward laugh', 'bitter laugh',
        'scoffs?', 'scoff', 'scoffing', 'scoffed',
        'mumbles?', 'mumble', 'mumbling', 'mumbled',
        'mutters?', 'mutter', 'muttering', 'muttered',
        'stammers?', 'stammer', 'stammering', 'stammered',
        'stutters?', 'stutter', 'stuttering', 'stuttered',
        'shouts?', 'shout', 'shouting', 'shouted',
        'screams?', 'scream', 'screaming', 'screamed',
        'exhales?', 'exhale', 'exhaling', 'exhaled',
        'inhales?', 'inhale', 'inhaling', 'inhaled',
        'breathes? deeply', 'breathing deeply', 'breathed deeply',
        'nods?', 'nod', 'nodding', 'nodded',
        'shakes? head', 'shaking head', 'shook head',
        'silence', 'long pause', 'brief pause', 'awkward silence'
      ];
      const emotionRegex = new RegExp(`\\b(${emotions.join('|')})\\b`, 'gi');
      text = text.replace(emotionRegex, '[$1]');

      // Ensure ellipses are preserved properly
      text = text.replace(/\s+\.\s+\.\s+\./g, '...');

      // Clean up any double spaces
      text = text.replace(/\s{2,}/g, ' ');

      return text.trim();
    }
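  • For illustration only, a made-up raw transcription and the output the cleaner above would produce:
    // Illustrative input/output, not taken from a real transcription.
    const raw = 'I checked the U-R-L and, um, it worked laughs Ppppppppppp';
    const cleaned = tool.cleanVerbatimTranscription(raw, true);
    // => 'I checked the U-R-L and, um, it worked [laughs]'
    //    The trailing run of P's is stripped and 'laughs' gains brackets.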

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/Garblesnarff/gemini-mcp-server'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.