architecture.md•7.83 kB
# VoiceMode Architecture
Understanding how VoiceMode components work together to enable voice conversations.
## System Overview
VoiceMode is built as a Model Context Protocol (MCP) server that provides voice capabilities to AI assistants. It follows a modular architecture with clear separation between voice services, audio processing, and client interfaces.
```
┌─────────────────────────────────────────────┐
│ MCP Client (Claude) │
└─────────────────┬───────────────────────────┘
│ MCP Protocol
┌─────────────────┴───────────────────────────┐
│ VoiceMode MCP Server │
├──────────────────────────────────────────────┤
│ Core Components │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Tools │ │ Providers│ │ Config │ │
│ └──────────┘ └──────────┘ └──────────┘ │
├──────────────────────────────────────────────┤
│ Voice Services │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Whisper │ │ Kokoro │ │ LiveKit │ │
│ │ (STT) │ │ (TTS) │ │ (RTC) │ │
│ └──────────┘ └──────────┘ └──────────┘ │
└──────────────────────────────────────────────┘
```
## Core Components
### MCP Server
The FastMCP-based server (`server.py`) is the entry point that:
- Exposes tools, resources, and prompts via MCP protocol
- Handles stdio transport for communication
- Manages service lifecycle and health checks
- Auto-imports all tools from the tools directory
### Tools System
Tools are the primary interface for voice interactions:
**converse**: Main voice conversation tool
- Handles audio recording and playback
- Manages TTS/STT service selection
- Implements silence detection and VAD
- Supports multiple transport methods (local, LiveKit)
**Service tools**: Installation and management
- `whisper_install`, `kokoro_install`, `livekit_install`
- Service start/stop/status operations
- Model and configuration management
### Provider System
The provider system (`providers.py`) implements service discovery and failover:
1. **Discovery**: Automatically finds running services
2. **Health Checks**: Validates service availability
3. **Failover**: Falls back to alternative services
4. **Load Balancing**: Distributes requests across providers
Provider selection priority:
1. User-specified URL (environment variable)
2. Local services (auto-discovered)
3. Cloud services (OpenAI)
### Configuration Layer
Multi-layered configuration system (`config.py`):
1. **Environment Variables**: Highest priority
2. **Project Config**: `.voicemode.env` in working directory
3. **User Config**: `~/.voicemode/voicemode.env`
4. **Defaults**: Built-in sensible defaults
## Voice Services
### Whisper (Speech-to-Text)
Local STT service using OpenAI's Whisper model:
- Runs on port 2022 by default
- Provides OpenAI-compatible API
- Supports multiple model sizes
- Hardware acceleration (Metal, CUDA)
### Kokoro (Text-to-Speech)
Local TTS service with natural voices:
- Runs on port 8880 by default
- OpenAI-compatible API
- Multiple languages and voices
- Efficient caching system
### LiveKit (Real-Time Communication)
WebRTC-based room communication:
- Server on port 7880
- Frontend on port 3000
- Room-based architecture
- Low-latency audio transport
## Audio Pipeline
### Recording Flow
```
Microphone → Audio Capture → VAD → Silence Detection → STT Service → Text
```
1. **Audio Capture**: PyAudio or LiveKit SDK
2. **VAD**: WebRTC VAD filters non-speech
3. **Silence Detection**: Determines recording end
4. **STT Processing**: Converts audio to text
### Playback Flow
```
Text → TTS Service → Audio Stream → Format Conversion → Speaker
```
1. **TTS Generation**: Creates audio from text
2. **Streaming**: Chunks for real-time playback
3. **Format Conversion**: FFmpeg handles formats
4. **Playback**: PyAudio or LiveKit output
## Service Architecture
### Service Lifecycle
1. **Installation**: Download binaries, create configs
2. **Registration**: systemd/launchd service files
3. **Startup**: Health checks, port binding
4. **Discovery**: Auto-detection by VoiceMode
5. **Monitoring**: Status checks, log rotation
### Service Communication
All services expose OpenAI-compatible APIs:
- Unified interface for TTS/STT
- Standard authentication (API keys)
- Consistent error handling
- Format negotiation
## Transport Methods
### Local Transport
Direct microphone/speaker access:
- PyAudio for audio I/O
- Low latency
- No network overhead
- Privacy-focused
### LiveKit Transport
Room-based WebRTC communication:
- Multi-participant support
- Network resilient
- Browser compatible
- Scalable architecture
## Frontend Architecture
### Next.js Application
The web frontend (`frontend/`) provides:
- Voice conversation UI
- Room management
- Real-time status
- WebRTC integration
### Build System
Frontend is bundled with Python package:
1. Built during package creation
2. Served by MCP server
3. Auto-installed dependencies
4. Hot reload in development
## Security Model
### API Key Management
- Never stored in code
- Environment variable priority
- Secure MCP transport
- Optional local-only mode
### Audio Privacy
- Local processing option
- No cloud storage
- Encrypted transport (LiveKit)
- User-controlled recording
## Performance Optimization
### Caching Strategy
- Model caching (Whisper/Kokoro)
- Audio format caching
- Provider health caching
- Configuration caching
### Resource Management
- Lazy service loading
- Connection pooling
- Memory limits (systemd)
- CPU throttling
## Error Handling
### Graceful Degradation
1. Primary service fails
2. Attempt fallback service
3. Use cloud service if available
4. Return informative error
### Recovery Mechanisms
- Automatic service restart
- Connection retry logic
- Circuit breaker pattern
- Health check recovery
## Extension Points
### Adding New Tools
1. Create tool in `tools/` directory
2. Implement with FastMCP decorators
3. Auto-imported by server
4. Available via MCP
### Custom Providers
1. Implement provider interface
2. Add discovery logic
3. Register in provider system
4. Configure endpoints
### Service Integration
1. Create service installer
2. Add systemd/launchd templates
3. Implement health checks
4. Update CLI commands
## Deployment Patterns
### Development
- Local services
- Debug logging
- Hot reload
- Mock providers
### Production
- Service supervision
- Log rotation
- Health monitoring
- Failover configuration
### Containerized
- Docker compose setup
- Service orchestration
- Volume management
- Network isolation
## Future Architecture
### Planned Enhancements
- Plugin system for tools
- Webhook support
- Multi-language support
- GPU cluster support
### Scalability Path
- Distributed services
- Queue-based processing
- Caching layers
- Load balancing