Voice Mode

by mbailey
architecture.md (7.83 kB)
# VoiceMode Architecture

Understanding how VoiceMode components work together to enable voice conversations.

## System Overview

VoiceMode is built as a Model Context Protocol (MCP) server that provides voice capabilities to AI assistants. It follows a modular architecture with clear separation between voice services, audio processing, and client interfaces.

```
┌──────────────────────────────────────────────┐
│             MCP Client (Claude)              │
└─────────────────┬────────────────────────────┘
                  │ MCP Protocol
┌─────────────────┴────────────────────────────┐
│             VoiceMode MCP Server             │
├──────────────────────────────────────────────┤
│  Core Components                             │
│   ┌──────────┐  ┌──────────┐  ┌──────────┐   │
│   │  Tools   │  │ Providers│  │  Config  │   │
│   └──────────┘  └──────────┘  └──────────┘   │
├──────────────────────────────────────────────┤
│  Voice Services                              │
│   ┌──────────┐  ┌──────────┐  ┌──────────┐   │
│   │ Whisper  │  │  Kokoro  │  │ LiveKit  │   │
│   │  (STT)   │  │  (TTS)   │  │  (RTC)   │   │
│   └──────────┘  └──────────┘  └──────────┘   │
└──────────────────────────────────────────────┘
```

## Core Components

### MCP Server

The FastMCP-based server (`server.py`) is the entry point that:

- Exposes tools, resources, and prompts via MCP protocol
- Handles stdio transport for communication
- Manages service lifecycle and health checks
- Auto-imports all tools from the tools directory

### Tools System

Tools are the primary interface for voice interactions:

**converse**: Main voice conversation tool

- Handles audio recording and playback
- Manages TTS/STT service selection
- Implements silence detection and VAD
- Supports multiple transport methods (local, LiveKit)

**Service tools**: Installation and management

- `whisper_install`, `kokoro_install`, `livekit_install`
- Service start/stop/status operations
- Model and configuration management

### Provider System

The provider system (`providers.py`) implements service discovery and failover:

1. **Discovery**: Automatically finds running services
2. **Health Checks**: Validates service availability
3. **Failover**: Falls back to alternative services
4. **Load Balancing**: Distributes requests across providers

Provider selection priority:

1. User-specified URL (environment variable)
2. Local services (auto-discovered)
3. Cloud services (OpenAI)

### Configuration Layer

Multi-layered configuration system (`config.py`):

1. **Environment Variables**: Highest priority
2. **Project Config**: `.voicemode.env` in working directory
3. **User Config**: `~/.voicemode/voicemode.env`
4. **Defaults**: Built-in sensible defaults

## Voice Services

### Whisper (Speech-to-Text)

Local STT service using OpenAI's Whisper model:

- Runs on port 2022 by default
- Provides OpenAI-compatible API
- Supports multiple model sizes
- Hardware acceleration (Metal, CUDA)

### Kokoro (Text-to-Speech)

Local TTS service with natural voices:

- Runs on port 8880 by default
- OpenAI-compatible API
- Multiple languages and voices
- Efficient caching system

### LiveKit (Real-Time Communication)

WebRTC-based room communication:

- Server on port 7880
- Frontend on port 3000
- Room-based architecture
- Low-latency audio transport

## Audio Pipeline

### Recording Flow

```
Microphone → Audio Capture → VAD → Silence Detection → STT Service → Text
```

1. **Audio Capture**: PyAudio or LiveKit SDK
2. **VAD**: WebRTC VAD filters non-speech
3. **Silence Detection**: Determines recording end
4. **STT Processing**: Converts audio to text

### Playback Flow

```
Text → TTS Service → Audio Stream → Format Conversion → Speaker
```

1. **TTS Generation**: Creates audio from text
2. **Streaming**: Chunks for real-time playback
3. **Format Conversion**: FFmpeg handles formats
4. **Playback**: PyAudio or LiveKit output

## Service Architecture

### Service Lifecycle

1. **Installation**: Download binaries, create configs
2. **Registration**: systemd/launchd service files
3. **Startup**: Health checks, port binding
4. **Discovery**: Auto-detection by VoiceMode
5. **Monitoring**: Status checks, log rotation

### Service Communication

All services expose OpenAI-compatible APIs:

- Unified interface for TTS/STT
- Standard authentication (API keys)
- Consistent error handling
- Format negotiation

## Transport Methods

### Local Transport

Direct microphone/speaker access:

- PyAudio for audio I/O
- Low latency
- No network overhead
- Privacy-focused

### LiveKit Transport

Room-based WebRTC communication:

- Multi-participant support
- Network resilient
- Browser compatible
- Scalable architecture

## Frontend Architecture

### Next.js Application

The web frontend (`frontend/`) provides:

- Voice conversation UI
- Room management
- Real-time status
- WebRTC integration

### Build System

Frontend is bundled with Python package:

1. Built during package creation
2. Served by MCP server
3. Auto-installed dependencies
4. Hot reload in development

## Security Model

### API Key Management

- Never stored in code
- Environment variable priority
- Secure MCP transport
- Optional local-only mode

### Audio Privacy

- Local processing option
- No cloud storage
- Encrypted transport (LiveKit)
- User-controlled recording

## Performance Optimization

### Caching Strategy

- Model caching (Whisper/Kokoro)
- Audio format caching
- Provider health caching
- Configuration caching

### Resource Management

- Lazy service loading
- Connection pooling
- Memory limits (systemd)
- CPU throttling

## Error Handling

### Graceful Degradation

1. Primary service fails
2. Attempt fallback service
3. Use cloud service if available
4. Return informative error

### Recovery Mechanisms

- Automatic service restart
- Connection retry logic
- Circuit breaker pattern
- Health check recovery

## Extension Points

### Adding New Tools

1. Create tool in `tools/` directory
2. Implement with FastMCP decorators
3. Auto-imported by server
4. Available via MCP

### Custom Providers

1. Implement provider interface
2. Add discovery logic
3. Register in provider system
4. Configure endpoints

### Service Integration

1. Create service installer
2. Add systemd/launchd templates
3. Implement health checks
4. Update CLI commands

## Deployment Patterns

### Development

- Local services
- Debug logging
- Hot reload
- Mock providers

### Production

- Service supervision
- Log rotation
- Health monitoring
- Failover configuration

### Containerized

- Docker compose setup
- Service orchestration
- Volume management
- Network isolation

## Future Architecture

### Planned Enhancements

- Plugin system for tools
- Webhook support
- Multi-language support
- GPU cluster support

### Scalability Path

- Distributed services
- Queue-based processing
- Caching layers
- Load balancing
