# speech_recognition
Convert audio files to text transcriptions using AI speech recognition. Process spoken content from audio URLs into readable text format for documentation, analysis, or accessibility purposes.
## Instructions

Transcribe audio to text using the DeepInfra OpenAI-compatible API (Whisper).
## Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| audio_url | Yes | URL of the audio file to download and transcribe | |
| model | No | Whisper model used for transcription | `openai/whisper-large-v3` |
## Implementation Reference
- src/mcp_deepinfra/server.py:98-117 (handler): Handler function for the `speech_recognition` tool, registered via the `@app.tool()` decorator. It downloads the audio from the provided URL and transcribes it with the Whisper model hosted on DeepInfra via the OpenAI-compatible API.

  ```python
  if "all" in ENABLED_TOOLS or "speech_recognition" in ENABLED_TOOLS:

      @app.tool()
      async def speech_recognition(audio_url: str) -> str:
          """Transcribe audio to text using DeepInfra OpenAI-compatible API (Whisper)."""
          model = DEFAULT_MODELS["speech_recognition"]
          try:
              async with httpx.AsyncClient(timeout=120.0) as http_client:
                  # Download the audio file
                  audio_response = await http_client.get(audio_url)
                  audio_response.raise_for_status()
                  audio_content = audio_response.content
                  # Use the OpenAI-compatible Whisper API
                  response = await client.audio.transcriptions.create(
                      model=model,
                      file=("audio.mp3", audio_content),
                  )
                  return response.text
          except Exception as e:
              return f"Error transcribing audio: {type(e).__name__}: {str(e)}"
  ```
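The handler's download-then-transcribe flow can be exercised in isolation. The sketch below factors the transcription step into a standalone coroutine and substitutes a fake transcriptions API for the real OpenAI client, so the happy path and the error-message format can be checked without network access; `FakeTranscriptionsAPI` and `transcribe_bytes` are illustrative stand-ins, not part of the server.

```python
import asyncio


class FakeTranscriptionsAPI:
    """Stand-in for client.audio.transcriptions: create() returns an object with .text."""

    async def create(self, model, file):
        class Result:
            text = "hello world"

        return Result()


async def transcribe_bytes(transcriptions, audio_content, model="openai/whisper-large-v3"):
    """Mirror the handler: pass raw bytes as a named file tuple, fold errors into a string."""
    try:
        response = await transcriptions.create(
            model=model,
            file=("audio.mp3", audio_content),
        )
        return response.text
    except Exception as e:
        return f"Error transcribing audio: {type(e).__name__}: {str(e)}"


result = asyncio.run(transcribe_bytes(FakeTranscriptionsAPI(), b"\x00fake-audio-bytes"))
print(result)  # hello world
```

Returning the error as a string rather than raising matches the handler above: MCP tool results are delivered to the model as text, so a readable failure message is more useful than an unhandled exception.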
- src/mcp_deepinfra/server.py:31-42 (helper): Configuration dictionary defining the default model for every tool; `speech_recognition` defaults to `openai/whisper-large-v3`.

  ```python
  DEFAULT_MODELS = {
      "generate_image": os.getenv("MODEL_GENERATE_IMAGE", "Bria/Bria-3.2"),
      "text_generation": os.getenv("MODEL_TEXT_GENERATION", "meta-llama/Llama-2-7b-chat-hf"),
      "embeddings": os.getenv("MODEL_EMBEDDINGS", "sentence-transformers/all-MiniLM-L6-v2"),
      "speech_recognition": os.getenv("MODEL_SPEECH_RECOGNITION", "openai/whisper-large-v3"),
      "zero_shot_image_classification": os.getenv("MODEL_ZERO_SHOT_IMAGE_CLASSIFICATION", "openai/gpt-4o-mini"),
      "object_detection": os.getenv("MODEL_OBJECT_DETECTION", "openai/gpt-4o-mini"),
      "image_classification": os.getenv("MODEL_IMAGE_CLASSIFICATION", "openai/gpt-4o-mini"),
      "text_classification": os.getenv("MODEL_TEXT_CLASSIFICATION", "microsoft/DialoGPT-medium"),
      "token_classification": os.getenv("MODEL_TOKEN_CLASSIFICATION", "microsoft/DialoGPT-medium"),
      "fill_mask": os.getenv("MODEL_FILL_MASK", "microsoft/DialoGPT-medium"),
  }
  ```
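Because each entry reads its value through `os.getenv` with a fallback, the default model can be swapped per deployment without code changes. A minimal sketch of that pattern, using the `MODEL_SPEECH_RECOGNITION` variable from the dictionary above (the override value `some-org/other-whisper` is a hypothetical model name, chosen only to show the mechanism):

```python
import os

# With the variable unset, the built-in default wins.
os.environ.pop("MODEL_SPEECH_RECOGNITION", None)
default_model = os.getenv("MODEL_SPEECH_RECOGNITION", "openai/whisper-large-v3")
print(default_model)  # openai/whisper-large-v3

# Setting the variable before the server imports DEFAULT_MODELS overrides it.
os.environ["MODEL_SPEECH_RECOGNITION"] = "some-org/other-whisper"
overridden_model = os.getenv("MODEL_SPEECH_RECOGNITION", "openai/whisper-large-v3")
print(overridden_model)  # some-org/other-whisper
```

Note the override must be in the environment before `server.py` is imported, since `DEFAULT_MODELS` is evaluated once at module load time.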