MCP Server Whisper
A Model Context Protocol (MCP) server for advanced audio transcription and processing using OpenAI's Whisper and GPT-4o models.
Overview
MCP Server Whisper provides a standardized way to process audio files through OpenAI's latest transcription and speech services. By implementing the Model Context Protocol, it enables AI assistants like Claude to seamlessly interact with audio processing capabilities.
Key features:
🔍 Advanced file searching with regex patterns, file metadata filtering, and sorting capabilities
⚡ MCP-native parallel processing - call multiple tools simultaneously
🔄 Format conversion between supported audio types
📦 Automatic compression for oversized files
🎯 Multi-model transcription with support for all OpenAI audio models
🗣️ Interactive audio chat with GPT-4o audio models
✏️ Enhanced transcription with specialized prompts and timestamp support
🎙️ Text-to-speech generation with customizable voices, instructions, and speed
📊 Comprehensive metadata including duration, file size, and format support
🚀 High-performance caching for repeated operations
🔒 Type-safe responses with Pydantic models for all tool outputs
Note: This project is unofficial and not affiliated with, endorsed by, or sponsored by OpenAI. It provides a Model Context Protocol interface to OpenAI's publicly available APIs.
Installation
Environment Setup
Create a .env file based on the provided .env.example:
Edit .env with your actual values:
Note: Environment variables must be available at runtime. For local development with Claude, use a tool like dotenv-cli to load them (see Usage section below).
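A minimal .env might look like the following. The variable names are taken from elsewhere in this README (OPENAI_API_KEY for the OpenAI client, AUDIO_FILES_PATH for the audio directory); the values are placeholders to replace with your own:

```shell
# .env — example values only; substitute your own key and path
OPENAI_API_KEY=sk-your-key-here
AUDIO_FILES_PATH=/path/to/your/audio/files
```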
Usage
Local Development with Claude
The project includes a .mcp.json configuration file for local development with Claude. To use it:
Ensure your .env file is configured with the required environment variables
Launch Claude with environment variables loaded:
This will:
Load environment variables from your .env file
Launch Claude with the MCP server configured per .mcp.json
Enable hot-reloading during development
The .mcp.json configuration:
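The file's contents are not reproduced here, but a .mcp.json for a uv-managed server typically looks roughly like this (the server name, command, and env passthrough shown are assumptions, not the project's exact configuration):

```json
{
  "mcpServers": {
    "whisper": {
      "command": "uv",
      "args": ["run", "mcp-server-whisper"],
      "env": {
        "OPENAI_API_KEY": "${OPENAI_API_KEY}",
        "AUDIO_FILES_PATH": "${AUDIO_FILES_PATH}"
      }
    }
  }
}
```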
Exposed MCP Tools
Audio File Management
list_audio_files - Lists audio files with comprehensive filtering and sorting options:
Filter by regex pattern matching on filenames
Filter by file size, duration, modification time, or format
Sort by name, size, duration, modification time, or format
Returns type-safe FilePathSupportParams with full metadata
get_latest_audio - Gets the most recently modified audio file with model support info
Audio Processing
convert_audio - Converts audio files to supported formats (mp3 or wav)
Returns AudioProcessingResult with output path
compress_audio - Compresses audio files that exceed size limits
Returns AudioProcessingResult with output path
Transcription
transcribe_audio - Advanced transcription using OpenAI's models:
Supports whisper-1, gpt-4o-transcribe, and gpt-4o-mini-transcribe
Custom prompts for guided transcription
Optional timestamp granularities for word and segment-level timing
JSON response format option
Returns TranscriptionResult with text, usage data, and optional timestamps
chat_with_audio - Interactive audio analysis using GPT-4o audio models:
Supports gpt-4o-audio-preview (recommended) and dated versions
Note: gpt-4o-mini-audio-preview has limitations with audio chat and is not recommended
Custom system and user prompts
Provides conversational responses to audio content
Returns ChatResult with response text
transcribe_with_enhancement - Enhanced transcription with specialized templates:
detailed - Includes tone, emotion, and background details
storytelling - Transforms the transcript into a narrative form
professional - Creates formal, business-appropriate transcriptions
analytical - Adds analysis of speech patterns and key points
Returns TranscriptionResult with enhanced output
Text-to-Speech
create_audio - Generate text-to-speech audio using OpenAI's TTS API:
Supports gpt-4o-mini-tts (preferred) and other speech models
Multiple voice options (alloy, ash, ballad, coral, echo, sage, shimmer, verse, marin, cedar)
Speed adjustment and custom instructions
Customizable output file paths
Handles texts of any length by automatically splitting and joining audio segments
Returns TTSResult with output path
Supported Audio Formats
| Model | Supported Formats |
|-------|-------------------|
| Transcribe | flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm |
| Chat | mp3, wav |
Note: Files larger than 25MB are automatically compressed to meet API limits.
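The size guard behind that note is straightforward; a minimal sketch (the 25 MB figure comes from the note above, while the function name is hypothetical):

```python
import os

# OpenAI's audio endpoints reject uploads larger than 25 MB.
MAX_UPLOAD_BYTES = 25 * 1024 * 1024


def needs_compression(path: str) -> bool:
    """Return True if the file exceeds the API upload limit."""
    return os.path.getsize(path) > MAX_UPLOAD_BYTES
```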
Example Usage with Claude
Claude will automatically:
Find the latest audio file using get_latest_audio
Determine the appropriate transcription method
Process the file with transcribe_with_enhancement using the "detailed" template
Return the enhanced transcription
Claude will:
Convert the date to a timestamp
Use list_audio_files with appropriate filters:
min_duration_seconds: 300 (5 minutes)
min_modified_time: <timestamp for Jan 1, 2024>
sort_by: "size"
Return a sorted list of matching audio files with comprehensive metadata
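The filter step above amounts to converting the date and passing flat arguments. A sketch of what that looks like (the argument names follow the filters listed in this README; the dict itself is illustrative, not a literal client API):

```python
from datetime import datetime, timezone

# Midnight Jan 1, 2024 (UTC) as a Unix timestamp.
cutoff = datetime(2024, 1, 1, tzinfo=timezone.utc).timestamp()

# Flat arguments for a list_audio_files tool call.
list_audio_files_args = {
    "min_duration_seconds": 300,  # at least 5 minutes
    "min_modified_time": cutoff,
    "sort_by": "size",
}
```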
Claude will:
Search for files using list_audio_files with pattern and format filters
Make multiple parallel transcribe_with_enhancement tool calls (MCP handles parallelism natively)
Each call uses enhancement_type: "professional" and returns a typed TranscriptionResult
Return all transcriptions with full metadata in a well-formatted output
Claude will:
Use the create_audio tool with:
text_prompt containing the script
voice: "shimmer"
model: "gpt-4o-mini-tts" (default high-quality model)
instructions: "Speak in an enthusiastic, podcast host style" (optional)
speed: 1.0 (default, can be adjusted)
Generate the audio file and save it to the configured audio directory
Provide the path to the generated audio file
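Such a create_audio call boils down to a flat argument set like the following (the parameter names come from the tool description in this README; the text value is a placeholder):

```python
# Hypothetical flat arguments for a create_audio tool call.
create_audio_args = {
    "text_prompt": "Welcome to the show!",  # placeholder script
    "voice": "shimmer",
    "model": "gpt-4o-mini-tts",
    "instructions": "Speak in an enthusiastic, podcast host style",
    "speed": 1.0,
}
```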
Configuration with Claude Desktop
For production use with Claude Desktop (as opposed to local development), add this to your claude_desktop_config.json:
UVX
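A uvx-based entry in claude_desktop_config.json might look roughly like this (the package name mcp-server-whisper and the env values are assumptions; use your real key and audio directory):

```json
{
  "mcpServers": {
    "whisper": {
      "command": "uvx",
      "args": ["mcp-server-whisper"],
      "env": {
        "OPENAI_API_KEY": "sk-your-key-here",
        "AUDIO_FILES_PATH": "/path/to/your/audio/files"
      }
    }
  }
}
```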
Recommendation (macOS only)
Install Screen Recorder By Omi (free)
Set AUDIO_FILES_PATH to /Users/<user>/Movies/Omi Screen Recorder and replace <user> with your username
As you record audio with the app, you can transcribe multiple files in parallel with Claude
Development
This project uses modern Python development tools including uv, pytest, ruff, and mypy.
CI/CD Workflow
The project uses GitHub Actions for CI/CD:
Lint & Type Check: Ensures code quality with ruff and strict mypy type checking
Tests: Runs tests on multiple Python versions (3.10, 3.11, 3.12, 3.13, 3.14, 3.14t)
Release & Publish: Dual-trigger workflow for flexible release management
Note: Python 3.14t is the free-threaded build (without GIL) for testing true parallelism.
Creating a New Release
The release workflow supports two approaches:
Option 1: Automated Release (Recommended)
Push a tag to automatically create a release and publish to PyPI:
This will:
Verify the tag version matches pyproject.toml
Build the package
Create a GitHub release with auto-generated notes
Automatically publish to PyPI
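The steps above can be sketched as the following shell commands. The grep/sed extraction of the version from pyproject.toml is illustrative (it assumes a version = "X.Y.Z" line), and the tag/push commands are shown commented out so nothing is published by accident:

```shell
# Read the package version from pyproject.toml (illustrative extraction).
VERSION=$(grep -m1 '^version' pyproject.toml | sed 's/.*"\(.*\)".*/\1/')
echo "Releasing v${VERSION}"

# Then tag and push, which triggers the release workflow:
# git tag "v${VERSION}" && git push origin "v${VERSION}"
```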
Option 2: Manual Release
Create a release manually via GitHub UI, then publish optionally:
Go to Releases on GitHub
Click "Draft a new release"
Create a new tag or select an existing one
Fill in release details
Click "Publish release"
When you publish the release, the workflow will automatically publish to PyPI. You can also create a draft release to delay publishing.
API Design Philosophy
MCP Server Whisper follows a flat, type-safe API design optimized for MCP clients:
Flat Arguments: All tools accept flat parameters instead of nested objects for simpler, more intuitive calls
Type-Safe Responses: Every tool returns a strongly-typed Pydantic model (TranscriptionResult, ChatResult, AudioProcessingResult, TTSResult)
Single-Item Operations: One call processes one file, with the MCP protocol handling parallelism natively
Per-File Error Handling: Failures are isolated to individual operations, not entire batches
Self-Documenting: Type hints provide autocomplete and validation in IDEs and AI models
This design makes it significantly easier for AI assistants to use the tools correctly and handle results reliably.
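The flat-arguments-in, typed-result-out shape can be sketched as follows. The project itself uses Pydantic models; a stdlib dataclass stands in here, and the field names and stub body are assumptions based on the tool descriptions above, not the real implementation:

```python
from dataclasses import dataclass
from typing import Optional


# Stand-in for the project's TranscriptionResult Pydantic model.
@dataclass
class TranscriptionResult:
    text: str
    duration: Optional[float] = None


def transcribe_audio(file_path: str, model: str = "whisper-1") -> TranscriptionResult:
    """Flat arguments in, typed result out — the shape every tool follows."""
    # The real tool calls OpenAI; stubbed here for illustration.
    return TranscriptionResult(text=f"[transcript of {file_path} via {model}]")


result = transcribe_audio("meeting.mp3")
```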
How It Works
For detailed architecture information, see Architecture Documentation.
MCP Server Whisper is built on the Model Context Protocol, which standardizes how AI models interact with external tools and data sources. The server:
Exposes Audio Processing Capabilities: Through standardized MCP tool interfaces with flat, type-safe APIs
Implements Parallel Processing: Using anyio structured concurrency; MCP clients handle parallelism natively
Manages File Operations: Handles detection, validation, conversion, and compression
Provides Rich Transcription: Via different OpenAI models and enhancement templates
Optimizes Performance: With caching mechanisms for repeated operations
Ensures Type Safety: All responses use Pydantic models for validation and IDE support
Under the hood, it uses:
pydub for audio file manipulation (with audioop-lts for Python 3.13+)
anyio for structured concurrency and task group management
aioresult for collecting results from parallel task groups
OpenAI's latest transcription models (including gpt-4o-transcribe)
OpenAI's GPT-4o audio models for enhanced understanding
OpenAI's gpt-4o-mini-tts for high-quality speech synthesis
FastMCP for simplified MCP server implementation
Type hints and strict mypy validation throughout the codebase
Contributing
Contributions are welcome! Please follow these steps:
Fork the repository
Create a new branch for your feature (git checkout -b feature/amazing-feature)
Make your changes
Run the tests and linting (uv run pytest && uv run ruff check src && uv run mypy --strict src)
Commit your changes (git commit -m 'Add some amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
Model Context Protocol (MCP) - For the protocol specification
pydub - For audio processing
OpenAI Whisper - For audio transcription
FastMCP - For MCP server implementation
Anthropic Claude - For natural language interaction
MCP Review - This MCP Server is certified by MCP Review