speech_to_text

Transcribe audio files into text with speaker diarization, supporting automatic language detection and flexible output options for saved files or direct text return.

Instructions

Transcribe speech from an audio file and either save the output text file to a given directory or return the text to the client directly.

⚠️ COST WARNING: This tool makes an API call to ElevenLabs which may incur costs. Only use when explicitly requested by the user.

Args:
    input_file_path: Path to the audio file to transcribe.
    language_code: ISO 639-3 language code for transcription. If not provided, the language will be detected automatically.
    diarize: Whether to diarize the audio file. If True, the transcript will annotate which speaker is speaking.
    save_transcript_to_file: Whether to save the transcript to a file.
    return_transcript_to_client_directly: Whether to return the transcript to the client directly.
    output_directory: Directory where files should be saved.
        Defaults to $HOME/Desktop if not provided.

Returns:
    TextContent containing the transcription. If save_transcript_to_file is True, the transcription will be saved to a file in the output directory.

Input Schema

Name                                    Required  Default
diarize                                 No        false
input_file_path                         Yes       (none)
language_code                           No        auto-detected
output_directory                        No        $HOME/Desktop
return_transcript_to_client_directly    No        false
save_transcript_to_file                 No        true
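For reference, an arguments payload satisfying this schema might look like the following sketch (all values are illustrative; only input_file_path is required):

```python
import json

# Illustrative speech_to_text arguments; only input_file_path is required.
arguments = {
    "input_file_path": "/home/user/audio/interview.mp3",
    "language_code": "eng",  # ISO 639-3; omit to auto-detect the language
    "diarize": True,  # annotate speaker turns in the transcript
    "save_transcript_to_file": True,
    "return_transcript_to_client_directly": False,
    "output_directory": "/home/user/transcripts",
}

print(json.dumps(arguments, indent=2))
```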

Implementation Reference

  • The core handler function implementing the speech_to_text tool logic: it loads the audio file, calls the ElevenLabs speech-to-text API with language and diarization options, formats the output, and saves or returns the transcript.
    def speech_to_text(
        input_file_path: str,
        language_code: str | None = None,
        diarize: bool = False,
        save_transcript_to_file: bool = True,
        return_transcript_to_client_directly: bool = False,
        output_directory: str | None = None,
    ) -> TextContent:
        if not save_transcript_to_file and not return_transcript_to_client_directly:
            make_error("Must save transcript to file or return it to the client directly.")
        file_path = handle_input_file(input_file_path)
        if save_transcript_to_file:
            output_path = make_output_path(output_directory, base_path)
            output_file_name = make_output_file("stt", file_path.name, output_path, "txt")
        with file_path.open("rb") as f:
            audio_bytes = f.read()
    
        if not language_code:
            language_code = None
    
        transcription = client.speech_to_text.convert(
            model_id="scribe_v1",
            file=audio_bytes,
            language_code=language_code,
            enable_logging=True,
            diarize=diarize,
            tag_audio_events=True,
        )
    
        # Format transcript with speaker identification if diarization was enabled
        if diarize:
            formatted_transcript = format_diarized_transcript(transcription)
        else:
            formatted_transcript = transcription.text
    
        if save_transcript_to_file:
            with open(output_path / output_file_name, "w", encoding="utf-8") as f:
                f.write(formatted_transcript)
    
        if return_transcript_to_client_directly:
            return TextContent(type="text", text=formatted_transcript)
        else:
            return TextContent(
                type="text", text=f"Transcription saved to {output_path / output_file_name}"
            )
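The branching between the two output modes can be sketched in isolation (a simplification of the handler above; the function name and the example path are illustrative, not part of the source):

```python
# Standalone sketch of the handler's output-mode branching (names illustrative).
def select_output(transcript: str,
                  save_transcript_to_file: bool = True,
                  return_transcript_to_client_directly: bool = False,
                  saved_path: str = "/home/user/Desktop/stt_example.txt") -> str:
    # Mirror the tool's precondition: at least one output mode must be enabled.
    if not save_transcript_to_file and not return_transcript_to_client_directly:
        raise ValueError(
            "Must save transcript to file or return it to the client directly."
        )
    # Returning the text directly takes precedence over the saved-path message.
    if return_transcript_to_client_directly:
        return transcript
    return f"Transcription saved to {saved_path}"

print(select_output("Hello world"))
# Transcription saved to /home/user/Desktop/stt_example.txt
```

Note that when both flags are set, the handler still writes the file and returns the transcript text, since the save branch runs before the return branch.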
  • The @mcp.tool decorator that registers the speech_to_text function as an MCP tool, including the schema description for inputs and outputs.
    @mcp.tool(
        description="""Transcribe speech from an audio file and either save the output text file to a given directory or return the text to the client directly.
    
        ⚠️ COST WARNING: This tool makes an API call to ElevenLabs which may incur costs. Only use when explicitly requested by the user.
    
        Args:
        Args:
            input_file_path: Path to the audio file to transcribe.
            language_code: ISO 639-3 language code for transcription. If not provided, the language will be detected automatically.
            diarize: Whether to diarize the audio file. If True, the transcript will annotate which speaker is speaking.
            save_transcript_to_file: Whether to save the transcript to a file.
            return_transcript_to_client_directly: Whether to return the transcript to the client directly.
            output_directory: Directory where files should be saved.
                Defaults to $HOME/Desktop if not provided.
    
        Returns:
            TextContent containing the transcription. If save_transcript_to_file is True, the transcription will be saved to a file in the output directory.
        """
    )
  • Helper function to format diarized transcripts with speaker labels, used conditionally in the speech_to_text handler.
    def format_diarized_transcript(transcription) -> str:
        """Format transcript with speaker labels from diarized response."""
        try:
            # Try to access words array - the exact attribute might vary
            words = None
            if hasattr(transcription, "words"):
                words = transcription.words
            elif hasattr(transcription, "__dict__"):
                # Try to find words in the response dict
                for key, value in transcription.__dict__.items():
                    if key == "words" or (
                        isinstance(value, list)
                        and len(value) > 0
                        and (
                            hasattr(value[0], "speaker_id")
                            if hasattr(value[0], "__dict__")
                            else (
                                "speaker_id" in value[0]
                                if isinstance(value[0], dict)
                                else False
                            )
                        )
                    ):
                        words = value
                        break
    
            if not words:
                return transcription.text
    
            formatted_lines = []
            current_speaker = None
            current_text = []
    
            for word in words:
                # Get speaker_id - might be an attribute or dict key
                word_speaker = None
                if hasattr(word, "speaker_id"):
                    word_speaker = word.speaker_id
                elif isinstance(word, dict) and "speaker_id" in word:
                    word_speaker = word["speaker_id"]
    
                # Get text - might be an attribute or dict key
                word_text = None
                if hasattr(word, "text"):
                    word_text = word.text
                elif isinstance(word, dict) and "text" in word:
                    word_text = word["text"]
    
                if not word_speaker or not word_text:
                    continue
    
                # Skip spacing/punctuation types if they exist
                if hasattr(word, "type") and word.type == "spacing":
                    continue
                elif isinstance(word, dict) and word.get("type") == "spacing":
                    continue
    
                if current_speaker != word_speaker:
                    # Save previous speaker's text
                    if current_speaker and current_text:
                        speaker_label = current_speaker.upper().replace("_", " ")
                        formatted_lines.append(f"{speaker_label}: {' '.join(current_text)}")
    
                    # Start new speaker
                    current_speaker = word_speaker
                    current_text = [word_text.strip()]
                else:
                    current_text.append(word_text.strip())
    
            # Add final speaker's text
            if current_speaker and current_text:
                speaker_label = current_speaker.upper().replace("_", " ")
                formatted_lines.append(f"{speaker_label}: {' '.join(current_text)}")
    
            return "\n\n".join(formatted_lines)
    
        except Exception:
            # Fallback to regular text if something goes wrong
            return transcription.text
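The grouping logic of this helper can be exercised on its own; below is a condensed standalone sketch over dict-shaped words (field names follow the helper above, while the function name and sample data are made up):

```python
def group_by_speaker(words: list[dict]) -> str:
    """Condensed version of the formatter: merge consecutive words per speaker."""
    lines: list[str] = []
    current_speaker, current_text = None, []
    for word in words:
        # Skip spacing/punctuation tokens, as the helper does.
        if word.get("type") == "spacing":
            continue
        speaker, text = word.get("speaker_id"), word.get("text")
        if not speaker or not text:
            continue
        if current_speaker != speaker:
            # Flush the previous speaker's accumulated words.
            if current_speaker and current_text:
                label = current_speaker.upper().replace("_", " ")
                lines.append(f"{label}: {' '.join(current_text)}")
            current_speaker, current_text = speaker, [text.strip()]
        else:
            current_text.append(text.strip())
    # Flush the final speaker's words.
    if current_speaker and current_text:
        label = current_speaker.upper().replace("_", " ")
        lines.append(f"{label}: {' '.join(current_text)}")
    return "\n\n".join(lines)

sample = [
    {"speaker_id": "speaker_0", "text": "Hello", "type": "word"},
    {"speaker_id": "speaker_0", "text": "there.", "type": "word"},
    {"speaker_id": "speaker_1", "text": "Hi!", "type": "word"},
]
print(group_by_speaker(sample))
# SPEAKER 0: Hello there.
#
# SPEAKER 1: Hi!
```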
Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries full burden and does well: it discloses cost implications (API call to ElevenLabs with potential costs), describes two output behaviors (save to file or return directly), mentions automatic language detection fallback, and explains default behavior for output_directory. It doesn't cover rate limits, error conditions, or authentication needs, but provides substantial behavioral context beyond basic functionality.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is well-structured with purpose statement, cost warning, parameter explanations, and return behavior. Each sentence adds value, though the parameter explanations could be slightly more concise. The warning is appropriately highlighted, and information is logically organized from general to specific.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

For a 6-parameter tool with no annotations and no output schema, the description provides comprehensive coverage: clear purpose, cost warning, detailed parameter semantics, and return behavior explanation. The main gap is lack of output format details (what TextContent contains, structure of diarized output), but given the tool's moderate complexity, this is mostly complete.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 5/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

The description provides excellent parameter semantics despite 0% schema description coverage. It explains all 6 parameters clearly: input_file_path purpose, language_code format (ISO 639-3) and auto-detection behavior, diarize functionality (speaker annotation), save_transcript_to_file and return_transcript_to_client_directly purposes and interaction, and output_directory default value. This fully compensates for the schema coverage gap.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 4/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose: 'Transcribe speech from an audio file' with specific actions (save to file or return to client). It distinguishes from siblings like 'text_to_speech' or 'voice_clone' by focusing on transcription from audio. However, it doesn't explicitly differentiate from 'speech_to_speech' which might involve similar input processing.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 3/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description includes a cost warning ('⚠️ COST WARNING: This tool makes an API call to ElevenLabs which may incur costs. Only use when explicitly requested by the user'), which provides some usage context. However, it lacks explicit guidance on when to use this tool versus alternatives like 'speech_to_speech' or 'isolate_audio', and doesn't mention prerequisites or typical scenarios for transcription.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
