speech_to_text
Transcribe audio files into text with speaker diarization, supporting automatic language detection and flexible output options for saved files or direct text return.
Instructions
Transcribe speech from an audio file and either save the output text file to a given directory or return the text to the client directly.
⚠️ COST WARNING: This tool makes an API call to ElevenLabs which may incur costs. Only use when explicitly requested by the user.
Args:
input_file_path: Path to the audio file to transcribe
language_code: ISO 639-3 language code for transcription. If not provided, the language will be detected automatically.
diarize: Whether to diarize the audio file. If True, the transcription is annotated with the speaker of each passage.
save_transcript_to_file: Whether to save the transcript to a file.
return_transcript_to_client_directly: Whether to return the transcript to the client directly.
output_directory: Directory where files should be saved. Defaults to $HOME/Desktop if not provided.
Returns:
TextContent containing the transcription. If save_transcript_to_file is True, the transcription will be saved to a file in the output directory.
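For orientation, a call to this tool over MCP might be sketched as the JSON-RPC `tools/call` request below. The audio path, the request `id`, and the `build_request` helper are illustrative assumptions and do not come from the server code; the argument names follow the Input Schema table.

```python
import json


def build_request(input_file_path: str, request_id: int = 1) -> dict:
    """Build an MCP tools/call request for speech_to_text (illustrative helper)."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "speech_to_text",
            "arguments": {
                "input_file_path": input_file_path,
                "language_code": "eng",  # ISO 639-3; omit to auto-detect
                "diarize": True,
                # Return the text in the response instead of writing a file.
                "save_transcript_to_file": False,
                "return_transcript_to_client_directly": True,
            },
        },
    }


request = build_request("/path/to/meeting.mp3")  # hypothetical path
print(json.dumps(request, indent=2))
```

Because the call is billed by ElevenLabs, a client would normally send a request like this only after the user has explicitly asked for a transcription.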
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
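| diarize | No | Whether to annotate the transcription with speaker labels. | `False` |
| input_file_path | Yes | Path to the audio file to transcribe. | |
| language_code | No | ISO 639-3 language code. Detected automatically if not provided. | `None` |
| output_directory | No | Directory where files should be saved. | `None` (falls back to `$HOME/Desktop`) |
| return_transcript_to_client_directly | No | Whether to return the transcript to the client directly. | `False` |
| save_transcript_to_file | No | Whether to save the transcript to a file. | `True` |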
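The defaults in the handler signature, together with the one cross-field rule the handler checks (at least one of `save_transcript_to_file` and `return_transcript_to_client_directly` must be true), can be sketched as a small dataclass. `SpeechToTextArgs` and `validate` are hypothetical names used only for illustration; the server itself takes plain function parameters and reports the problem through its own `make_error` helper.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class SpeechToTextArgs:
    """Hypothetical mirror of the tool's inputs and their defaults."""

    input_file_path: str
    language_code: str | None = None  # None means auto-detect
    diarize: bool = False
    save_transcript_to_file: bool = True
    return_transcript_to_client_directly: bool = False
    output_directory: str | None = None  # documented fallback: $HOME/Desktop

    def validate(self) -> None:
        # Same rule the handler checks before it calls the API.
        if not (
            self.save_transcript_to_file
            or self.return_transcript_to_client_directly
        ):
            raise ValueError(
                "Must save transcript to file or return it to the client directly."
            )


args = SpeechToTextArgs("/path/to/audio.wav")  # hypothetical path
args.validate()  # passes: saving to file is on by default
```

Checking this rule on the client side avoids sending a request that the handler would reject before any transcription happens.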
Implementation Reference
- elevenlabs_mcp/server.py:278-323 (handler) The core handler function implementing the speech_to_text tool logic: it loads the audio file, calls the ElevenLabs speech-to-text API with the language and diarization options, formats the output, and saves or returns the transcript.

  ```python
  def speech_to_text(
      input_file_path: str,
      language_code: str | None = None,
      diarize: bool = False,
      save_transcript_to_file: bool = True,
      return_transcript_to_client_directly: bool = False,
      output_directory: str | None = None,
  ) -> TextContent:
      if not save_transcript_to_file and not return_transcript_to_client_directly:
          make_error("Must save transcript to file or return it to the client directly.")
      file_path = handle_input_file(input_file_path)
      if save_transcript_to_file:
          output_path = make_output_path(output_directory, base_path)
          output_file_name = make_output_file("stt", file_path.name, output_path, "txt")
      with file_path.open("rb") as f:
          audio_bytes = f.read()

      if language_code == "" or language_code is None:
          language_code = None

      transcription = client.speech_to_text.convert(
          model_id="scribe_v1",
          file=audio_bytes,
          language_code=language_code,
          enable_logging=True,
          diarize=diarize,
          tag_audio_events=True,
      )

      # Format transcript with speaker identification if diarization was enabled
      if diarize:
          formatted_transcript = format_diarized_transcript(transcription)
      else:
          formatted_transcript = transcription.text

      if save_transcript_to_file:
          with open(output_path / output_file_name, "w") as f:
              f.write(formatted_transcript)

      if return_transcript_to_client_directly:
          return TextContent(type="text", text=formatted_transcript)
      else:
          return TextContent(
              type="text", text=f"Transcription saved to {output_path / output_file_name}"
          )
  ```
- elevenlabs_mcp/server.py:260-277 (registration) The @mcp.tool decorator that registers the speech_to_text function as an MCP tool, including the description of its inputs and outputs.

  ```python
  @mcp.tool(
      description="""Transcribe speech from an audio file and either save the output text file to a given directory or return the text to the client directly.

      ⚠️ COST WARNING: This tool makes an API call to ElevenLabs which may incur costs. Only use when explicitly requested by the user.

      Args:
          file_path: Path to the audio file to transcribe
          language_code: ISO 639-3 language code for transcription. If not provided, the language will be detected automatically.
          diarize: Whether to diarize the audio file. If True, which speaker is currently speaking will be annotated in the transcription.
          save_transcript_to_file: Whether to save the transcript to a file.
          return_transcript_to_client_directly: Whether to return the transcript to the client directly.
          output_directory: Directory where files should be saved.
              Defaults to $HOME/Desktop if not provided.

      Returns:
          TextContent containing the transcription. If save_transcript_to_file is True, the transcription will be saved to a file in the output directory.
      """
  )
  ```
- elevenlabs_mcp/server.py:60-139 (helper) Helper function that formats diarized transcripts with speaker labels. The speech_to_text handler calls it only when diarization is enabled.

  ```python
  def format_diarized_transcript(transcription) -> str:
      """Format transcript with speaker labels from diarized response."""
      try:
          # Try to access words array - the exact attribute might vary
          words = None
          if hasattr(transcription, "words"):
              words = transcription.words
          elif hasattr(transcription, "__dict__"):
              # Try to find words in the response dict
              for key, value in transcription.__dict__.items():
                  if key == "words" or (
                      isinstance(value, list)
                      and len(value) > 0
                      and (
                          hasattr(value[0], "speaker_id")
                          if hasattr(value[0], "__dict__")
                          else (
                              "speaker_id" in value[0]
                              if isinstance(value[0], dict)
                              else False
                          )
                      )
                  ):
                      words = value
                      break

          if not words:
              return transcription.text

          formatted_lines = []
          current_speaker = None
          current_text = []

          for word in words:
              # Get speaker_id - might be an attribute or dict key
              word_speaker = None
              if hasattr(word, "speaker_id"):
                  word_speaker = word.speaker_id
              elif isinstance(word, dict) and "speaker_id" in word:
                  word_speaker = word["speaker_id"]

              # Get text - might be an attribute or dict key
              word_text = None
              if hasattr(word, "text"):
                  word_text = word.text
              elif isinstance(word, dict) and "text" in word:
                  word_text = word["text"]

              if not word_speaker or not word_text:
                  continue

              # Skip spacing/punctuation types if they exist
              if hasattr(word, "type") and word.type == "spacing":
                  continue
              elif isinstance(word, dict) and word.get("type") == "spacing":
                  continue

              if current_speaker != word_speaker:
                  # Save previous speaker's text
                  if current_speaker and current_text:
                      speaker_label = current_speaker.upper().replace("_", " ")
                      formatted_lines.append(f"{speaker_label}: {' '.join(current_text)}")

                  # Start new speaker
                  current_speaker = word_speaker
                  current_text = [word_text.strip()]
              else:
                  current_text.append(word_text.strip())

          # Add final speaker's text
          if current_speaker and current_text:
              speaker_label = current_speaker.upper().replace("_", " ")
              formatted_lines.append(f"{speaker_label}: {' '.join(current_text)}")

          return "\n\n".join(formatted_lines)

      except Exception:
          # Fallback to regular text if something goes wrong
          return transcription.text
  ```
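
To illustrate the grouping that the helper performs, here is a simplified sketch that merges consecutive words from the same speaker into labelled paragraphs. It uses `itertools.groupby` in place of the helper's explicit loop and handles only attribute-style words, not dicts. The `group_by_speaker` and `word` names and the sample words are made up for the example and are not real API output.

```python
from itertools import groupby
from types import SimpleNamespace


def group_by_speaker(words) -> str:
    """Join consecutive words per speaker, skipping spacing tokens."""
    spoken = [
        w for w in words
        if w.speaker_id and w.text and getattr(w, "type", None) != "spacing"
    ]
    lines = []
    # groupby only merges adjacent items, so a speaker who talks twice
    # with someone else in between gets two separate paragraphs.
    for speaker, run in groupby(spoken, key=lambda w: w.speaker_id):
        label = speaker.upper().replace("_", " ")
        lines.append(f"{label}: {' '.join(w.text.strip() for w in run)}")
    return "\n\n".join(lines)


def word(text, speaker_id, type="word"):
    """Build a stand-in for one word entry of a diarized response."""
    return SimpleNamespace(text=text, speaker_id=speaker_id, type=type)


sample = [
    word("Hello", "speaker_0"),
    word(" ", "speaker_0", "spacing"),
    word("there.", "speaker_0"),
    word("Hi!", "speaker_1"),
]
print(group_by_speaker(sample))
# SPEAKER 0: Hello there.
#
# SPEAKER 1: Hi!
```

Unlike this sketch, the helper above also accepts dict-shaped words and falls back to `transcription.text` when it finds no word list or hits an exception.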