Voice Mode

by mbailey
whisper-setup.md
# Whisper Speech-to-Text Setup

Whisper is a local speech recognition service that converts audio to text for VoiceMode using OpenAI's Whisper model. It provides offline STT with a range of model sizes that trade speed against accuracy.

## Quick Start

```bash
# Install whisper service with default base model (includes Core ML on Apple Silicon!)
voice-mode whisper install

# Install with a different model
voice-mode whisper install --model large-v3

# Install without any model
voice-mode whisper install --no-model

# List available models and their status
voice-mode whisper models

# Download additional models (with Core ML support on Apple Silicon)
voice-mode whisper model install large-v2

# Start the service
voice-mode whisper start
```

**Apple Silicon bonus:** On M1/M2/M3/M4 Macs, VoiceMode automatically downloads pre-built Core ML models for 2-3x faster performance. No Xcode or Python dependencies required!

Default endpoint: `http://127.0.0.1:2022/v1`

## Installation Methods

### Automatic Installation (Recommended)

VoiceMode includes an installation tool that sets up Whisper.cpp automatically:

```bash
# Install with default base model (142 MB) - good balance of speed and accuracy
voice-mode whisper install

# Install with a specific model
voice-mode whisper install --model small

# Skip Core ML on Apple Silicon (not recommended)
voice-mode whisper install --skip-core-ml

# Install without downloading any model
voice-mode whisper install --no-model
```

This will:

- Clone and build Whisper.cpp with GPU support (if available)
- Download the specified model (default: base)
- **On Apple Silicon:** automatically download pre-built Core ML models for 2-3x faster performance
- Create a start script with environment variable support
- Set up automatic startup (launchd on macOS, systemd on Linux)

### Manual Installation

#### macOS

```bash
# Install via Homebrew
brew install whisper.cpp

# Download model
mkdir -p ~/.voicemode/models/whisper
cd ~/.voicemode/models/whisper
curl -LO https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v2.bin
```

#### Linux

```bash
# Clone and build whisper.cpp
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
make

# Download model
mkdir -p ~/.voicemode/models/whisper
cd ~/.voicemode/models/whisper
wget https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v2.bin
```

### Prerequisites

**macOS**:

- Xcode Command Line Tools (`xcode-select --install`) - only needed to build whisper.cpp
- Homebrew (https://brew.sh)
- cmake (`brew install cmake`)

**Note for Apple Silicon users:** Core ML models are pre-built and downloaded automatically. No Xcode, PyTorch, or coremltools required!

**Linux**:

- Build essentials (`sudo apt install build-essential` on Ubuntu/Debian)

## Core ML Acceleration (Apple Silicon)

On Apple Silicon Macs (M1/M2/M3/M4), VoiceMode automatically downloads pre-built Core ML models from Hugging Face for 2-3x faster transcription:

- **Automatic:** Core ML models download alongside regular models
- **No dependencies:** no PyTorch, Xcode, or coremltools needed
- **Pre-built:** models are pre-compiled and ready to use
- **Performance:** 2-3x faster than Metal acceleration alone

To skip Core ML (not recommended):

```bash
voice-mode whisper model install large-v3 --skip-core-ml
```

## Model Management

### Available Models

| Model | Size | RAM Usage | Accuracy | Speed | Language Support |
|-------|------|-----------|----------|-------|------------------|
| **tiny** | 39 MB | ~390 MB | Low | Fastest | Multilingual |
| **tiny.en** | 39 MB | ~390 MB | Low | Fastest | English only |
| **base** | 142 MB | ~500 MB | Fair | Fast | Multilingual |
| **base.en** | 142 MB | ~500 MB | Fair | Fast | English only |
| **small** | 466 MB | ~1 GB | Good | Moderate | Multilingual |
| **small.en** | 466 MB | ~1 GB | Good | Moderate | English only |
| **medium** | 1.5 GB | ~2.6 GB | Very Good | Slow | Multilingual |
| **medium.en** | 1.5 GB | ~2.6 GB | Very Good | Slow | English only |
| **large-v1** | 2.9 GB | ~3.9 GB | Excellent | Slower | Multilingual |
| **large-v2** | 2.9 GB | ~3.9 GB | Excellent | Slower | Multilingual (recommended) |
| **large-v3** | 3.1 GB | ~3.9 GB | Best | Slowest | Multilingual |
| **large-v3-turbo** | 1.6 GB | ~2.5 GB | Very Good | Moderate | Multilingual |

### Model Commands

```bash
# List all models with installation status
voice-mode whisper models

# Show/set active model
voice-mode whisper model active
voice-mode whisper model active small.en

# Install models
voice-mode whisper model install                    # Install default (large-v2)
voice-mode whisper model install medium             # Install specific model
voice-mode whisper model install all                # Install all models
voice-mode whisper model install large-v3 --force   # Force re-download

# Remove models
voice-mode whisper model remove tiny
voice-mode whisper model remove tiny.en --force     # Skip confirmation
```

Note: After changing the active model, restart the whisper service for the change to take effect.

## Service Configuration

### Environment Variables

Configure in `~/.voicemode/voicemode.env`:

```bash
VOICEMODE_WHISPER_MODEL=large-v2
VOICEMODE_WHISPER_PORT=2022
VOICEMODE_WHISPER_LANGUAGE=auto
VOICEMODE_WHISPER_MODEL_PATH=~/.voicemode/models/whisper
```

### Running the Server

#### OpenAI-Compatible Server Mode

```bash
whisper-server \
  --model models/ggml-large-v2.bin \
  --host 127.0.0.1 \
  --port 2022 \
  --inference-path "/v1/audio/transcriptions" \
  --threads 4 \
  --processors 1 \
  --convert \
  --print-progress
```

Key options:

- `--model`: Path to model file
- `--host`: Server host (default: 127.0.0.1)
- `--port`: Server port (VoiceMode expects 2022)
- `--inference-path`: OpenAI-compatible endpoint path
- `--threads`: Number of threads for processing
- `--processors`: Number of parallel processors
- `--convert`: Convert audio to the required format automatically
- `--print-progress`: Show transcription progress

### Service Management

#### macOS (LaunchAgent)

```bash
# Start/stop service
launchctl load ~/Library/LaunchAgents/com.voicemode.whisper.plist
launchctl unload ~/Library/LaunchAgents/com.voicemode.whisper.plist

# Enable/disable at startup
launchctl load -w ~/Library/LaunchAgents/com.voicemode.whisper.plist
launchctl unload -w ~/Library/LaunchAgents/com.voicemode.whisper.plist

# Check status
launchctl list | grep whisper
```

#### Linux (systemd)

```bash
# Start/stop service
systemctl --user start whisper
systemctl --user stop whisper

# Enable/disable at startup
systemctl --user enable whisper
systemctl --user disable whisper

# Check status and logs
systemctl --user status whisper
journalctl --user -u whisper -f
```

## Hardware Acceleration

### Apple Silicon (Core ML)

Core ML provides significantly faster transcription on Apple Silicon Macs:

```bash
# Core ML models are installed by default; add --skip-core-ml to opt out
voice-mode whisper model install base.en

# Performance comparison
# CPU only:        ~1x baseline
# Metal:           ~3-4x faster
# Core ML + Metal: ~8-12x faster
```

**Note**: Current releases download pre-built Core ML models, so a full Xcode installation is no longer required; only older versions that compiled Core ML models locally needed it.

### GPU Acceleration

The installation tool automatically detects and enables:

- **Mac (Apple Silicon)**: Metal acceleration
- **NVIDIA GPU**: CUDA acceleration
- **CPU**: Optimized CPU builds

## Integration with VoiceMode

VoiceMode automatically detects Whisper when available:

1. **First**: Checks for Whisper.cpp on `http://127.0.0.1:2022/v1`
2. **Fallback**: Uses the OpenAI API (requires `OPENAI_API_KEY`)

### Custom Configuration

To use a different endpoint or force Whisper use:

```bash
export STT_BASE_URL=http://127.0.0.1:2022/v1
```

Or in MCP configuration:

```json
"voice-mode": {
  "env": {
    "STT_BASE_URL": "http://127.0.0.1:2022/v1"
  }
}
```

## Fully Local Setup

For completely offline voice processing, combine Whisper with Kokoro:

```bash
export STT_BASE_URL=http://127.0.0.1:2022/v1   # Whisper for STT
export TTS_BASE_URL=http://127.0.0.1:8880/v1   # Kokoro for TTS
export TTS_VOICE=af_sky                        # Kokoro voice
```

## Troubleshooting

### Service Won't Start

- Check if port 2022 is already in use: `lsof -i :2022`
- Verify the model file exists at the configured path
- Check service logs for error messages

### Poor Transcription Quality

- Try a larger model (base → small → medium → large)
- Ensure audio input quality is good
- Set a specific language instead of `auto` if known

### High CPU Usage

- Use a smaller model for better performance
- Consider English-only models (`.en`) for English content
- Enable GPU acceleration if available

### Model Installation Issues

- Verify adequate disk space (models range from 39 MB to 3.1 GB)
- Check network connectivity to Hugging Face
- Use the `--force` flag to re-download corrupted models

## Performance Monitoring

```bash
# Check service status
voice-mode whisper status

# View performance statistics
voice-mode statistics

# Monitor real-time processing
tail -f ~/.voicemode/services/whisper/logs/performance.log

# Test model functionality
voice-mode whisper model test base.en
```

## File Locations

- **Models**: `~/.voicemode/models/whisper/` or `~/.voicemode/services/whisper/models/`
- **Service config**: `~/.voicemode/services/whisper/config.json`
- **Model preferences**: `~/.voicemode/whisper-models.txt`
- **Logs**: `~/.voicemode/services/whisper/logs/`
- **LaunchAgent** (macOS): `~/Library/LaunchAgents/com.voicemode.whisper.plist`
- **systemd service** (Linux): `~/.config/systemd/user/whisper.service`
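As a quick smoke test of the OpenAI-compatible transcription endpoint described above, you can POST an audio file directly with `curl`. This is a sketch: `test.wav` is a placeholder for any audio file you have on hand, the `model` field is accepted for OpenAI compatibility, and the guard keeps the command from hanging when the server is down.

```shell
# Send an audio file to the local whisper-server transcription endpoint.
# The multipart fields mirror OpenAI's /v1/audio/transcriptions API.
URL="http://127.0.0.1:2022/v1/audio/transcriptions"

# Probe the server first so the request fails fast when it is not running
if curl -sf -o /dev/null --max-time 2 "http://127.0.0.1:2022/" 2>/dev/null; then
  # test.wav is a placeholder; substitute any audio file
  curl -s "$URL" -F "file=@test.wav" -F "model=whisper-1"
else
  echo "whisper-server is not running on port 2022"
fi
```

A JSON body with a `text` field indicates the service is transcribing correctly end to end.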
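The local-first detection order (local Whisper, then OpenAI fallback) can also be made explicit in your own scripts. This sketch is not part of VoiceMode; the helper name `stt_base_url` is ours, and it probes the server's base URL with a short timeout:

```shell
# Echo the STT base URL: the local Whisper server if it answers,
# otherwise the hosted OpenAI API (which needs OPENAI_API_KEY).
stt_base_url() {
  if curl -sf -o /dev/null --max-time 1 "http://127.0.0.1:2022/" 2>/dev/null; then
    echo "http://127.0.0.1:2022/v1"
  else
    echo "https://api.openai.com/v1"
  fi
}

export STT_BASE_URL="$(stt_base_url)"
echo "$STT_BASE_URL"
```

Exporting the result into `STT_BASE_URL` matches the variable VoiceMode already reads, so the same setting works whether the local service is up or not.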
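The disk-space check suggested in the troubleshooting section can be scripted before installing a large model. A sketch, with two assumptions of ours: the 4 GB threshold (sized for large-v3 at 3.1 GB plus headroom) and the default model directory when `VOICEMODE_WHISPER_MODEL_PATH` is unset.

```shell
# Warn if the filesystem holding the whisper model directory has
# less than ~4 GB free before downloading a large model.
MODEL_DIR="${VOICEMODE_WHISPER_MODEL_PATH:-$HOME/.voicemode/models/whisper}"
mkdir -p "$MODEL_DIR"

# df -Pk prints POSIX-format 1024-byte blocks; field 4 is available space
avail_kb=$(df -Pk "$MODEL_DIR" | awk 'NR==2 {print $4}')

if [ "$avail_kb" -lt $((4 * 1024 * 1024)) ]; then
  echo "low disk space: $((avail_kb / 1024)) MB free in $MODEL_DIR"
else
  echo "disk space ok: $((avail_kb / 1024)) MB free"
fi
```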
