Voice Mode

MIT License

streaming-audio-playback.md•6.53 kB

# Streaming Audio Playback Specification

## Overview

This specification describes how to implement progressive (streaming) audio playback in voicemode. The goal is to reduce latency by playing audio as it arrives rather than waiting for the complete file.

## Current State

- Audio is fully downloaded before playback begins
- Uses `with_streaming_response.create()` but reads the entire response
- Waiting for full TTS generation adds 3-6 seconds of latency
- All formats are buffered completely before playing

## Proposed Implementation

### 1. Streaming Architecture

```python
import asyncio

from openai import AsyncOpenAI


async def stream_tts_audio(text: str, client: AsyncOpenAI, **params):
    """Stream TTS audio with progressive playback."""
    # Audio pipeline components
    audio_queue = asyncio.Queue(maxsize=10)  # Bounded buffer of chunks
    playback_started = asyncio.Event()
    playback_complete = asyncio.Event()

    # Download task
    async def download_chunks():
        try:
            async with client.audio.speech.with_streaming_response.create(
                input=text, **params
            ) as response:
                async for chunk in response.iter_bytes(chunk_size=4096):
                    await audio_queue.put(chunk)
        finally:
            await audio_queue.put(None)  # Signal end, even after an error

    # Playback task
    async def play_chunks():
        # Initialize audio stream; create_audio_stream is supplied by the
        # playback backend (see section 3)
        stream = await create_audio_stream(params['response_format'])
        try:
            while True:
                chunk = await audio_queue.get()
                if chunk is None:
                    break
                # Decode and play chunk
                await stream.play_chunk(chunk)
                if not playback_started.is_set():
                    playback_started.set()
        finally:
            await stream.close()
            playback_complete.set()

    # Run both tasks concurrently
    download_task = asyncio.create_task(download_chunks())
    playback_task = asyncio.create_task(play_chunks())

    # Wait for both to finish. Do not block on playback_started here:
    # an empty response would never set it.
    await asyncio.gather(download_task, playback_task)
```

### 2. Format-Specific Considerations

#### PCM (Recommended for Streaming)

- **Advantage**: No decoding needed, direct playback
- **Disadvantage**: Large bandwidth requirement
- **Use case**: Local TTS with minimal latency

#### Opus

- **Container**: OGG with Opus codec
- **Advantages**:
  - Designed for low-latency streaming
  - Small chunk sizes (2.5-60ms frames)
  - Built-in error resilience
- **Implementation**: Use `opuslib` or `pyogg` for decoding
- **Critical limitation**: Despite the codec's low-latency design, it does not work for streaming playback in this pipeline; the audio must be fully buffered before playback
- **Recommendation**: Use PCM for streaming TTS, Opus only for file storage

#### MP3

- **Challenge**: Frame boundaries may not align with chunks
- **Solution**: Buffer until a valid frame header is found
- **Typical frame size**: ~400-1400 bytes

### 3. Audio Playback Backend Options

#### Option 1: PyAudio Streaming (Cross-platform)

```python
import queue

import pyaudio


class PyAudioStream:
    def __init__(self, format, channels, rate):
        # Create the buffer first: the callback can fire as soon as
        # the stream is opened
        self.buffer = queue.Queue()
        self.pa = pyaudio.PyAudio()
        self.stream = self.pa.open(
            format=format,
            channels=channels,
            rate=rate,
            output=True,
            stream_callback=self.callback,
            frames_per_buffer=1024,
        )

    def callback(self, in_data, frame_count, time_info, status):
        # Pull from buffer and play. get_audio_data (not shown) should
        # return frame_count frames, padding with silence on underrun.
        data = self.get_audio_data(frame_count)
        return (data, pyaudio.paContinue)
```

#### Option 2: Sounddevice Streaming

```python
import queue

import sounddevice as sd


class SoundDeviceStream:
    def __init__(self, samplerate, channels):
        # Thread-safe queue: the callback runs on the audio thread,
        # where an asyncio.Queue is not safe to use
        self.buffer = queue.Queue()
        self.stream = sd.OutputStream(
            samplerate=samplerate,
            channels=channels,
            callback=self.audio_callback,
            blocksize=2048,
        )

    def audio_callback(self, outdata, frames, time, status):
        # Placeholder: a real callback fills outdata from self.buffer;
        # this one only writes silence
        outdata.fill(0)

    async def play_chunk(self, audio_data):
        self.buffer.put_nowait(audio_data)
```

### 4. Buffering Strategy

- **Initial buffer**: 100-200ms before starting playback
- **Adaptive buffering**: Adjust based on network conditions
- **Underrun handling**: Pause and rebuffer if needed
- **Maximum buffer**: 1-2 seconds to prevent memory issues

### 5. Implementation Phases

#### Phase 1: Basic Streaming (PCM/WAV)

- Implement streaming for uncompressed formats
- Test latency improvements
- Establish buffering parameters

#### Phase 2: Compressed Format Support

- Implement MP3 frame boundary detection
- Handle format-specific quirks

#### Phase 3: Adaptive Streaming

- Monitor buffer health
- Adjust chunk sizes dynamically
- Implement quality fallback

### 6. Configuration

```bash
# Enable streaming playback
VOICEMODE_STREAMING_ENABLED=true

# Initial buffer size (ms)
VOICEMODE_STREAM_BUFFER_MS=150

# Chunk size for download (bytes)
VOICEMODE_STREAM_CHUNK_SIZE=4096

# Maximum buffer size (seconds)
VOICEMODE_STREAM_MAX_BUFFER=2.0
```

### 7. Error Handling

- **Network interruption**: Pause playback, attempt reconnection
- **Decoder errors**: Skip corrupted chunks, log warnings
- **Buffer underrun**: Insert silence, increase buffer size
- **Format incompatibility**: Fall back to buffered playback

### 8. Metrics and Monitoring

Track these metrics for optimization:

- Time to first audio (TTFA)
- Buffer underrun count
- Average buffer level
- Network throughput
- Chunk decode time

### 9. Testing Strategy

1. **Unit tests**: Mock streaming responses, test decoders
2. **Integration tests**: Real API calls with streaming
3. **Network simulation**: Test with various latencies and bandwidths
4. **Format testing**: Verify each format streams correctly
5. **Stress testing**: Multiple concurrent streams

### 10. Benefits

- **Reduced latency**: Start playback within 150-200ms
- **Better UX**: User hears the beginning of the response almost immediately
- **Memory efficient**: No need to buffer the entire response
- **Scalable**: Handle longer responses without memory issues

### 11. Risks and Mitigations

| Risk | Mitigation |
|------|------------|
| Audio glitches | Adequate buffering, quality monitoring |
| Format incompatibility | Fallback to buffered mode |
| CPU overhead | Efficient decoders, worker threads |
| Network issues | Adaptive buffering, reconnection logic |

### 12. Future Enhancements

- **WebRTC integration**: For ultra-low latency
- **Spatial audio**: Stream multi-channel audio
- **Adaptive bitrate**: Adjust quality based on network
- **Caching**: Stream from cache for repeated phrases
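The MP3 approach described under Format-Specific Considerations ("buffer until a valid frame header is found") could look roughly like the sketch below. The names `find_mp3_frame_start` and `Mp3SyncBuffer` are hypothetical and not part of voicemode. The check only tests whether four bytes look like a plausible MPEG audio frame header; a sturdier version would also confirm that another header follows at the computed frame length, since ID3 tags or audio data can contain false sync patterns.

```python
def find_mp3_frame_start(buf: bytes) -> int:
    """Return the offset of the first plausible MPEG frame header, or -1."""
    for i in range(len(buf) - 3):  # a header is 4 bytes long
        b1, b2 = buf[i + 1], buf[i + 2]
        # 11 sync bits: 0xFF followed by the top three bits of the next byte
        if buf[i] != 0xFF or (b1 & 0xE0) != 0xE0:
            continue
        version = (b1 >> 3) & 0x03      # 1 is reserved
        layer = (b1 >> 1) & 0x03        # 0 is reserved
        bitrate_idx = (b2 >> 4) & 0x0F  # 0 is "free", 15 is invalid
        rate_idx = (b2 >> 2) & 0x03     # 3 is reserved
        if version == 1 or layer == 0:
            continue
        if bitrate_idx in (0, 15) or rate_idx == 3:
            continue
        return i
    return -1


class Mp3SyncBuffer:
    """Withholds data until a frame header is seen, then passes chunks through."""

    def __init__(self) -> None:
        self._buf = bytearray()
        self._synced = False

    def feed(self, chunk: bytes) -> bytes:
        if self._synced:
            return chunk
        self._buf.extend(chunk)
        start = find_mp3_frame_start(bytes(self._buf))
        if start < 0:
            return b""  # keep buffering
        self._synced = True
        out = bytes(self._buf[start:])
        self._buf.clear()
        return out
```

Because the buffer keeps unsynced bytes between calls, a header split across two network chunks is still detected once the second chunk arrives.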
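To make the Buffering Strategy thresholds concrete, the following sketch converts milliseconds into byte counts for raw PCM and holds back playback until the initial buffer is filled. It assumes 24 kHz, 16-bit mono PCM, which this specification does not fix, and `pcm_bytes_for_ms` and `PrebufferGate` are illustrative names only.

```python
def pcm_bytes_for_ms(ms: float, sample_rate: int = 24000,
                     channels: int = 1, sample_width: int = 2) -> int:
    """Bytes of raw PCM covering `ms` milliseconds, in whole frames."""
    frames = int(sample_rate * ms / 1000)
    return frames * channels * sample_width


class PrebufferGate:
    """Holds chunks until the initial buffer is filled, then releases them."""

    def __init__(self, initial_ms: float = 150) -> None:
        self.initial_bytes = pcm_bytes_for_ms(initial_ms)
        self.pending = bytearray()
        self.started = False

    def feed(self, chunk: bytes) -> bytes:
        """Add a chunk; return the bytes ready for playback (may be empty)."""
        self.pending.extend(chunk)
        if not self.started and len(self.pending) < self.initial_bytes:
            return b""
        self.started = True
        return self.flush()

    def flush(self) -> bytes:
        """Release whatever is pending, e.g. when the download ends early."""
        ready = bytes(self.pending)
        self.pending.clear()
        return ready
```

Under these assumptions a 150 ms initial buffer is 7200 bytes, so with 4096-byte chunks playback would begin once the second chunk arrives. Calling `flush()` at end of stream covers utterances shorter than the initial buffer.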
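The environment variables from the Configuration section could be read along these lines. This is a sketch, not voicemode's actual loader: `StreamConfig` and `load_stream_config` are hypothetical names, and the fallback values (streaming disabled, 150 ms, 4096 bytes, 2.0 s) are assumptions borrowed from the example values rather than documented defaults.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class StreamConfig:
    enabled: bool
    buffer_ms: int
    chunk_size: int
    max_buffer_s: float


def load_stream_config(env: Optional[Mapping[str, str]] = None) -> StreamConfig:
    """Read streaming settings from `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    enabled_raw = env.get("VOICEMODE_STREAMING_ENABLED", "false")  # assumed default
    return StreamConfig(
        enabled=enabled_raw.strip().lower() in ("1", "true", "yes", "on"),
        buffer_ms=int(env.get("VOICEMODE_STREAM_BUFFER_MS", "150")),
        chunk_size=int(env.get("VOICEMODE_STREAM_CHUNK_SIZE", "4096")),
        max_buffer_s=float(env.get("VOICEMODE_STREAM_MAX_BUFFER", "2.0")),
    )
```

Passing a plain dictionary instead of relying on `os.environ` keeps the function easy to exercise in the unit tests mentioned under Testing Strategy.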
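For the "insert silence" response to buffer underruns listed under Error Handling, an audio callback needs a fixed-size block even when too little data has arrived. A minimal sketch, assuming signed 16-bit PCM (where zero bytes are silence) and a hypothetical helper named `take_block`:

```python
from typing import Tuple


def take_block(pending: bytearray, nbytes: int) -> Tuple[bytes, bool]:
    """Remove up to `nbytes` from `pending`, padding with silence if short.

    Returns the block and a flag that is True when padding was needed,
    so the caller can count underruns or enlarge its buffer.
    """
    data = bytes(pending[:nbytes])
    del pending[:nbytes]
    underrun = len(data) < nbytes
    if underrun:
        data += b"\x00" * (nbytes - len(data))
    return data, underrun
```

The returned flag maps directly onto the "buffer underrun count" metric from Metrics and Monitoring.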
