speech_to_text

Transcribe audio files into text using a specified ASR model, optionally including word timestamps and a boosted vocabulary, and save the output file to a designated directory or the default location.

Instructions

Convert speech to text with a given model and save the output text file to a given directory. The directory is optional; if not provided, the output file is saved to $HOME/Desktop.

⚠️ COST WARNING: This tool makes an API call to Whissle which may incur costs. Only use when explicitly requested by the user.

Args:
    audio_file_path (str): Path to the audio file to transcribe
    model_name (str, optional): The name of the ASR model to use. Defaults to "en-NER".
    timestamps (bool, optional): Whether to include word timestamps.
    boosted_lm_words (List[str], optional): Words to boost in recognition.
    boosted_lm_score (int, optional): Score for boosted words (0-100).
    output_directory (str, optional): Directory where files should be saved. Defaults to $HOME/Desktop if not provided.

Returns:
    TextContent with the transcription and path to the output file.
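As a sketch of how an MCP client might invoke this tool, the following builds the JSON-RPC `tools/call` request body for `speech_to_text`. The file path and boosted-word list are hypothetical placeholders, not values from the source.

```python
import json

# Hypothetical tools/call request for the speech_to_text tool.
# The audio path and boosted words below are made-up examples.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "speech_to_text",
        "arguments": {
            "audio_file_path": "/tmp/meeting.wav",  # hypothetical path
            "model_name": "en-NER",                 # the documented default
            "timestamps": True,
            "boosted_lm_words": ["Whissle", "MCP"],  # hypothetical boost list
            "boosted_lm_score": 80,
        },
    },
}

print(json.dumps(request, indent=2))
```

The server responds with a `result` containing the TextContent described above.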

Input Schema

| Name             | Required | Description                              | Default |
|------------------|----------|------------------------------------------|---------|
| audio_file_path  | Yes      | Path to the audio file to transcribe     |         |
| boosted_lm_score | No       | Score for boosted words (0-100)          |         |
| boosted_lm_words | No       | Words to boost in recognition            |         |
| model_name       | No       | The name of the ASR model to use         | en-NER  |
| timestamps       | No       | Whether to include word timestamps       |         |
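The checks implied by this schema can be expressed as a small validator. This is an assumed sketch, not the server's code; it mirrors the required field and the types and range documented above.

```python
def validate_arguments(args: dict) -> list[str]:
    """Return a list of schema violations (empty list means valid).

    A minimal sketch of the input-schema checks: audio_file_path is the
    only required field; the rest are typed optionals with defaults.
    """
    errors = []
    if "audio_file_path" not in args:
        errors.append("audio_file_path is required")
    elif not isinstance(args["audio_file_path"], str):
        errors.append("audio_file_path must be a string")
    if not isinstance(args.get("model_name", "en-NER"), str):
        errors.append("model_name must be a string")
    if not isinstance(args.get("timestamps", True), bool):
        errors.append("timestamps must be a boolean")
    words = args.get("boosted_lm_words", [])
    if not (isinstance(words, list) and all(isinstance(w, str) for w in words)):
        errors.append("boosted_lm_words must be a list of strings")
    score = args.get("boosted_lm_score", 80)
    if not (isinstance(score, int) and 0 <= score <= 100):
        errors.append("boosted_lm_score must be an integer in 0-100")
    return errors
```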

Implementation Reference

  • The core handler function implementing the speech_to_text tool. It validates the input audio file (existence, size, format), calls the WhissleClient.speech_to_text API with retries and error handling, and returns the transcription result, including the transcript, duration, language code, and word timestamps when available.
    def speech_to_text(audio_file_path: str, model_name: str = "en-NER", timestamps: bool = True, boosted_lm_words: List[str] = None, boosted_lm_score: int = 80) -> Dict:
        """Convert speech to text using Whissle API"""
        try:
            # Check if file exists
            if not os.path.exists(audio_file_path):
                logger.error(f"Audio file not found: {audio_file_path}")
                return {"error": f"Audio file not found: {audio_file_path}"}

            # Check file size
            file_size = os.path.getsize(audio_file_path)
            if file_size == 0:
                logger.error(f"Audio file is empty: {audio_file_path}")
                return {"error": f"Audio file is empty: {audio_file_path}"}

            # Check file format
            file_ext = os.path.splitext(audio_file_path)[1].lower()
            if file_ext not in ['.wav', '.mp3', '.ogg', '.flac', '.m4a']:
                logger.error(f"Unsupported audio format: {file_ext}")
                return {"error": f"Unsupported audio format: {file_ext}. Supported formats: wav, mp3, ogg, flac, m4a"}

            # Check file size limits
            max_size_mb = 25
            if file_size > max_size_mb * 1024 * 1024:
                logger.error(f"File too large: {file_size / (1024*1024):.2f} MB")
                return {"error": f"File too large ({file_size / (1024*1024):.2f} MB). Maximum size is {max_size_mb} MB."}

            # Log the request details
            logger.info(f"Transcribing audio file: {audio_file_path}")
            logger.info(f"File size: {file_size / (1024*1024):.2f} MB")
            logger.info(f"File format: {file_ext}")

            # Try with a different model if the default one fails
            models_to_try = ["en-NER"]
            last_error = None
            for try_model in models_to_try:
                retry_count = 0
                max_retries = 2
                while retry_count <= max_retries:
                    try:
                        logger.info(f"Attempting transcription with model: {try_model} (Attempt {retry_count+1}/{max_retries+1})")
                        response = client.speech_to_text(
                            audio_file_path=audio_file_path,
                            model_name=try_model,
                            timestamps=timestamps,
                            boosted_lm_words=boosted_lm_words,
                            boosted_lm_score=boosted_lm_score
                        )
                        if response and hasattr(response, 'transcript'):
                            logger.info(f"Transcription successful with model: {try_model}")
                            result = {
                                "transcript": response.transcript,
                                "duration_seconds": getattr(response, 'duration_seconds', 0),
                                "language_code": getattr(response, 'language_code', 'en')
                            }
                            if hasattr(response, 'timestamps'):
                                result["timestamps"] = response.timestamps
                            if hasattr(response, 'diarize_output') and response.diarize_output:
                                result["diarize_output"] = response.diarize_output
                            return result
                        else:
                            last_error = "No transcription was returned from the API"
                            logger.error(f"No transcription returned from API with model {try_model}")
                            break
                    except Exception as api_error:
                        error_msg = str(api_error)
                        logger.error(f"Error with model {try_model}: {error_msg}")
                        last_error = error_msg
                        error_result = handle_api_error(error_msg, "transcription", retry_count, max_retries)
                        if error_result is not None:
                            if retry_count == max_retries:
                                break
                            else:
                                return {"error": error_result}
                        retry_count += 1

            if "HTTP 500" in last_error:
                logger.error(f"All transcription attempts failed with HTTP 500: {last_error}")
                return {"error": f"Server error during transcription. This might be a temporary issue with the Whissle API. Please try again later or contact Whissle support. Error: {last_error}"}
            else:
                logger.error(f"All transcription attempts failed: {last_error}")
                return {"error": f"Failed to transcribe audio: {last_error}"}
        except Exception as e:
            logger.error(f"Unexpected error during transcription: {str(e)}")
            return {"error": f"Failed to transcribe audio: {str(e)}"}
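The pre-flight file checks from the handler can be exercised locally without calling the Whissle API. The following is a standalone sketch (not the server's code) that mirrors those checks in the same order and returns None on success or an error string:

```python
import os

SUPPORTED_EXTS = {'.wav', '.mp3', '.ogg', '.flac', '.m4a'}
MAX_SIZE_MB = 25

def check_audio_file(path: str):
    """Mirror the handler's validation: existence, emptiness,
    extension, then size limit. Returns None if all checks pass."""
    if not os.path.exists(path):
        return f"Audio file not found: {path}"
    size = os.path.getsize(path)
    if size == 0:
        return f"Audio file is empty: {path}"
    ext = os.path.splitext(path)[1].lower()
    if ext not in SUPPORTED_EXTS:
        return f"Unsupported audio format: {ext}"
    if size > MAX_SIZE_MB * 1024 * 1024:
        return f"File too large ({size / (1024*1024):.2f} MB). Maximum size is {MAX_SIZE_MB} MB."
    return None
```

Because these checks run before any network call, a missing or malformed file never incurs API cost.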
  • The @mcp.tool decorator registers speech_to_text as an MCP tool. Its description docstring doubles as the tool schema, documenting audio_file_path, model_name, timestamps, boosted_lm_words, boosted_lm_score, and output_directory (although output_directory does not appear in the function signature).
    @mcp.tool(
        description="""Convert speech to text with a given model and save the output text file to a given directory.
        Directory is optional, if not provided, the output file will be saved to $HOME/Desktop.

        ⚠️ COST WARNING: This tool makes an API call to Whissle which may incur costs. Only use when explicitly requested by the user.

        Args:
            audio_file_path (str): Path to the audio file to transcribe
            model_name (str, optional): The name of the ASR model to use. Defaults to "en-NER"
            timestamps (bool, optional): Whether to include word timestamps
            boosted_lm_words (List[str], optional): Words to boost in recognition
            boosted_lm_score (int, optional): Score for boosted words (0-100)
            output_directory (str, optional): Directory where files should be saved. Defaults to $HOME/Desktop if not provided.

        Returns:
            TextContent with the transcription and path to the output file.
        """
    )
    def speech_to_text(audio_file_path: str, model_name: str = "en-NER", timestamps: bool = True, boosted_lm_words: List[str] = None, boosted_lm_score: int = 80) -> Dict:
