Glama

speech_to_text

Transcribe audio files into text with speaker diarization, supporting automatic language detection and flexible output options for saved files or direct text return.

Instructions

Transcribe speech from an audio file and either save the output text file to a given directory or return the text to the client directly.

⚠️ COST WARNING: This tool makes an API call to ElevenLabs which may incur costs. Only use when explicitly requested by the user.

Args:
    input_file_path: Path to the audio file to transcribe
    language_code: ISO 639-3 language code for transcription. If not provided, the language will be detected automatically.
    diarize: Whether to diarize the audio file. If True, which speaker is currently speaking will be annotated in the transcription.
    save_transcript_to_file: Whether to save the transcript to a file.
    return_transcript_to_client_directly: Whether to return the transcript to the client directly.
    output_directory: Directory where files should be saved.
        Defaults to $HOME/Desktop if not provided.

Returns:
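    TextContent containing the transcription if return_transcript_to_client_directly is True; otherwise a message with the path of the saved file. If save_transcript_to_file is True, the transcription will be saved to a file in the output directory.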
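
The argument rules described above (at least one output mode must be selected, and a missing or empty language code means automatic detection) can be sketched as a small pre-flight check. This is an illustrative sketch only: `check_stt_args` is a hypothetical helper, not part of the server, and it raises a plain `ValueError` where the real tool reports its own error.

```python
from typing import Optional


def check_stt_args(
    save_transcript_to_file: bool = True,
    return_transcript_to_client_directly: bool = False,
    language_code: Optional[str] = None,
) -> Optional[str]:
    """Hypothetical pre-flight check; returns the normalized language code."""
    # At least one output mode has to be selected, otherwise the transcript is lost.
    if not save_transcript_to_file and not return_transcript_to_client_directly:
        raise ValueError(
            "Must save transcript to file or return it to the client directly."
        )
    # An empty string is treated like "not provided": None requests auto-detection.
    return language_code or None
```

Calling `check_stt_args(language_code="")` returns `None`, while `check_stt_args(False, False)` raises, which mirrors the guard at the top of the handler.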

Input Schema

Name                                   Required   Description   Default
diarize                                No
input_file_path                        Yes
language_code                          No
output_directory                       No
return_transcript_to_client_directly   No
save_transcript_to_file                No
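
As an illustration of the schema, a client could package the arguments into an MCP `tools/call` request along these lines. This is a sketch under assumptions: `build_speech_to_text_call` is a hypothetical helper, the file path is made up, and the envelope follows the general JSON-RPC 2.0 shape used by MCP rather than anything specific to this server.

```python
import json


def build_speech_to_text_call(arguments: dict, request_id: int = 1) -> str:
    """Serialize a JSON-RPC 2.0 `tools/call` request for the speech_to_text tool."""
    if "input_file_path" not in arguments:
        # input_file_path is the only required field in the schema.
        raise ValueError("input_file_path is required")
    return json.dumps(
        {
            "jsonrpc": "2.0",
            "id": request_id,
            "method": "tools/call",
            "params": {"name": "speech_to_text", "arguments": arguments},
        }
    )


request = build_speech_to_text_call(
    {
        "input_file_path": "/tmp/interview.mp3",  # made-up example path
        "language_code": "eng",  # ISO 639-3 code; omit to auto-detect
        "diarize": True,
        "save_transcript_to_file": False,
        "return_transcript_to_client_directly": True,
    }
)
print(request)
```

Optional fields that are left out fall back to the defaults in the handler signature.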

Implementation Reference

  • The core handler function implementing the speech_to_text tool: it loads the audio file, calls the ElevenLabs speech-to-text API with the language and diarization options, formats the output, and saves or returns the transcript.
    def speech_to_text(
        input_file_path: str,
        language_code: str | None = None,
        diarize: bool = False,
        save_transcript_to_file: bool = True,
        return_transcript_to_client_directly: bool = False,
        output_directory: str | None = None,
    ) -> TextContent:
        if not save_transcript_to_file and not return_transcript_to_client_directly:
            make_error("Must save transcript to file or return it to the client directly.")
        file_path = handle_input_file(input_file_path)
        if save_transcript_to_file:
            output_path = make_output_path(output_directory, base_path)
            output_file_name = make_output_file("stt", file_path.name, output_path, "txt")
        with file_path.open("rb") as f:
            audio_bytes = f.read()
    
        # An empty string means "not provided": pass None so the API auto-detects the language
        language_code = language_code or None
    
        transcription = client.speech_to_text.convert(
            model_id="scribe_v1",
            file=audio_bytes,
            language_code=language_code,
            enable_logging=True,
            diarize=diarize,
            tag_audio_events=True,
        )
    
        # Format transcript with speaker identification if diarization was enabled
        if diarize:
            formatted_transcript = format_diarized_transcript(transcription)
        else:
            formatted_transcript = transcription.text
    
        if save_transcript_to_file:
            with open(output_path / output_file_name, "w", encoding="utf-8") as f:
                f.write(formatted_transcript)
    
        if return_transcript_to_client_directly:
            return TextContent(type="text", text=formatted_transcript)
        else:
            return TextContent(
                type="text", text=f"Transcription saved to {output_path / output_file_name}"
            )
  • The @mcp.tool decorator that registers the speech_to_text function as an MCP tool, including the schema description for inputs and outputs.
    @mcp.tool(
        description="""Transcribe speech from an audio file and either save the output text file to a given directory or return the text to the client directly.
    
        ⚠️ COST WARNING: This tool makes an API call to ElevenLabs which may incur costs. Only use when explicitly requested by the user.
    
        Args:
            input_file_path: Path to the audio file to transcribe
            language_code: ISO 639-3 language code for transcription. If not provided, the language will be detected automatically.
            diarize: Whether to diarize the audio file. If True, which speaker is currently speaking will be annotated in the transcription.
            save_transcript_to_file: Whether to save the transcript to a file.
            return_transcript_to_client_directly: Whether to return the transcript to the client directly.
            output_directory: Directory where files should be saved.
                Defaults to $HOME/Desktop if not provided.
    
        Returns:
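            TextContent containing the transcription if return_transcript_to_client_directly is True; otherwise a message with the path of the saved file. If save_transcript_to_file is True, the transcription will be saved to a file in the output directory.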
        """
    )
  • Helper function to format diarized transcripts with speaker labels, used conditionally in the speech_to_text handler.
    def format_diarized_transcript(transcription) -> str:
        """Format transcript with speaker labels from diarized response."""
        try:
            # Try to access words array - the exact attribute might vary
            words = None
            if hasattr(transcription, "words"):
                words = transcription.words
            elif hasattr(transcription, "__dict__"):
                # Try to find words in the response dict
                for key, value in transcription.__dict__.items():
                    if key == "words" or (
                        isinstance(value, list)
                        and len(value) > 0
                        and (
                            hasattr(value[0], "speaker_id")
                            if hasattr(value[0], "__dict__")
                            else (
                                "speaker_id" in value[0]
                                if isinstance(value[0], dict)
                                else False
                            )
                        )
                    ):
                        words = value
                        break
    
            if not words:
                return transcription.text
    
            formatted_lines = []
            current_speaker = None
            current_text = []
    
            for word in words:
                # Get speaker_id - might be an attribute or dict key
                word_speaker = None
                if hasattr(word, "speaker_id"):
                    word_speaker = word.speaker_id
                elif isinstance(word, dict) and "speaker_id" in word:
                    word_speaker = word["speaker_id"]
    
                # Get text - might be an attribute or dict key
                word_text = None
                if hasattr(word, "text"):
                    word_text = word.text
                elif isinstance(word, dict) and "text" in word:
                    word_text = word["text"]
    
                if not word_speaker or not word_text:
                    continue
    
                # Skip spacing/punctuation types if they exist
                if hasattr(word, "type") and word.type == "spacing":
                    continue
                elif isinstance(word, dict) and word.get("type") == "spacing":
                    continue
    
                if current_speaker != word_speaker:
                    # Save previous speaker's text
                    if current_speaker and current_text:
                        speaker_label = current_speaker.upper().replace("_", " ")
                        formatted_lines.append(f"{speaker_label}: {' '.join(current_text)}")
    
                    # Start new speaker
                    current_speaker = word_speaker
                    current_text = [word_text.strip()]
                else:
                    current_text.append(word_text.strip())
    
            # Add final speaker's text
            if current_speaker and current_text:
                speaker_label = current_speaker.upper().replace("_", " ")
                formatted_lines.append(f"{speaker_label}: {' '.join(current_text)}")
    
            return "\n\n".join(formatted_lines)
    
        except Exception:
            # Fallback to regular text if something goes wrong
            return transcription.text
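
The speaker-grouping step in format_diarized_transcript can be sketched more compactly with `itertools.groupby`. This is a simplified illustration, not the server's code: `group_by_speaker` is a hypothetical name, and the sketch assumes every word is an object with `speaker_id`, `text`, and `type` attributes, so it omits the dict handling and the fallback to `transcription.text`.

```python
from itertools import groupby
from types import SimpleNamespace


def group_by_speaker(words) -> str:
    """Collapse consecutive words from the same speaker into labelled lines."""
    # Keep only real words that carry both a speaker and some text.
    spoken = [
        w
        for w in words
        if getattr(w, "type", None) != "spacing" and w.speaker_id and w.text
    ]
    lines = []
    # groupby yields one run per stretch of consecutive words with the same speaker.
    for speaker, run in groupby(spoken, key=lambda w: w.speaker_id):
        label = speaker.upper().replace("_", " ")
        lines.append(f"{label}: {' '.join(w.text.strip() for w in run)}")
    return "\n\n".join(lines)


# Stand-in data shaped like a diarized response (assumed structure).
words = [
    SimpleNamespace(speaker_id="speaker_0", text="Hello", type="word"),
    SimpleNamespace(speaker_id="speaker_0", text=" ", type="spacing"),
    SimpleNamespace(speaker_id="speaker_0", text="there.", type="word"),
    SimpleNamespace(speaker_id="speaker_1", text="Hi!", type="word"),
]
print(group_by_speaker(words))
```

For the sample data this produces one line for SPEAKER 0 ("Hello there.") and one for SPEAKER 1 ("Hi!"), separated by a blank line, which matches the layout the helper writes to the transcript.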

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/projectservan8n/elevenlabs-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.