# CLAUDE.md - LocalVoiceMode Project
## CRITICAL: USE VOICEMODE FOR ALL RESPONSES
**THIS IS NOT OPTIONAL.** You have voicemode MCP tools. USE THEM.
After EVERY task completion, file edit, search, or significant action:
```
mcp__voicemode__speak("Brief summary here")
```
If you finish something and don't speak it aloud, YOU ARE DOING IT WRONG.
---
## Voice Integration Behavior (IMPORTANT)
**Claude MUST use voice for the following scenarios:**
1. **Agent Completions** - After ANY Task tool completes, call `mcp__voicemode__speak` with a brief summary like "Agent done, found 3 matching files" or "Search complete, 3 results" (NOTE: Never say "subagent" - the TTS mangles it)
2. **Error Alerts** - When a build fails, tests break, or any significant error occurs, immediately speak a brief alert like "Build failed, 2 type errors" or "Tests failing in auth module"
3. **Question Prompts** - When asking the user a question or needing input, speak the question aloud so they don't miss it
4. **Periodic Updates** - During long operations (more than 30 seconds), provide brief voice updates like "Still working, halfway through" or "Processing file 5 of 12"
5. **Task Completions** - When finishing a user-requested task, briefly announce completion like "Done, changes saved" or "Refactoring complete"
**Voice Style Guidelines:**
- Keep all voice announcements SHORT (under 15 words)
- Use natural, conversational language
- Don't be over-the-top or dramatic
- Use `mcp__voicemode__speak` with `wait_for_response=false` for announcements
- Only use `mcp__voicemode__converse` when you need to hear the user's response
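For example (informal call syntax; both tool names and the `wait_for_response` parameter are from the voicemode MCP toolset described above):
```
mcp__voicemode__speak("Tests passing, all green", wait_for_response=false)  # fire-and-forget announcement
mcp__voicemode__converse("Should I also update the docs?")                  # waits for the user's spoken reply
```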
**Voice Commands (user can say these anytime):**
- "status" - Claude reports current progress
- "stop" - End current operation
- "pause" - Pause voice announcements temporarily
---
## Project Overview
**LocalVoiceMode** is a self-contained voice interface for AI chat using:
- **Pocket TTS** (Kyutai) - CPU-based text-to-speech, 100M params, ~200ms latency
- **Parakeet TDT 0.6B v3** (NVIDIA) - GPU-accelerated speech recognition via onnx-asr
- **Auto-Provider Detection** - LM Studio, OpenRouter, OpenAI (priority order)
- **Character Skills** - YAML-defined personalities with custom voices
- **MCP Server** - Integrates with Claude Code for tool-based voice control
Location: `C:\AI\localvoicemode`
## MCP Server Integration (Claude Code)
The voice MCP server runs invisibly in the background. From Claude Code, just say or type:
- **"start voice"** or **"start voice mode"** - Begins listening
- **"stop voice"** - Ends voice mode
- **"voice status"** - Check if running
- **"list voices"** - See available characters
- **"provider status"** - See available LLM providers
**MCP Tools:**
- `start_voice(skill)` - Start voice chat (default: "assistant")
- `stop_voice()` - Stop voice chat
- `voice_status()` - Check status
- `list_voices()` - List characters
- `provider_status()` - Show available providers
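For example, to launch voice chat as a specific character and confirm it started (informal call syntax using the tools listed above):
```
start_voice(skill="hermione")
voice_status()
```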
**Features:**
- Auto-detects LM Studio on common ports (1234, 1235, 1236, 8080, 5000)
- Falls back to OpenRouter if OPENROUTER_API_KEY is set
- Falls back to OpenAI if OPENAI_API_KEY is set
- Runs headlessly - no terminal window
- Voice commands work while running:
- Say "stop" or "goodbye" to end
- Say "change voice" to switch characters
## Start Voice Mode
**From Claude Code:** Just type "start voice" or call `start_voice()`
**Standalone (with UI):** Double-click `VoiceChat.bat` or run:
```bash
.venv/Scripts/python.exe voice_client.py --api-url http://localhost:1235/v1
```
**Headless (no window):**
```bash
.venv/Scripts/python.exe voice_client.py --headless --api-url http://localhost:1235/v1
```
**Available characters:** `assistant` (default), `hermione`
**Prerequisites:**
1. LM Studio (or any OpenAI-compatible API) running, OR `OPENROUTER_API_KEY` set
2. On first run: `huggingface-cli login` (required to download the TTS model)
## Quick Setup Commands
```bash
# Navigate to project
cd C:\AI\localvoicemode
# Create and activate virtual environment
python -m venv .venv
.venv\Scripts\activate # Windows
# source .venv/bin/activate # Linux/Mac
# Install dependencies
pip install -r requirements.txt
# Test voice client
python voice_client.py --list-skills
```
## Architecture
```
C:\AI\localvoicemode\
├── voice_client.py       # Main entry point - handles TTS, ASR, LLM, skills, UI
├── mcp_server.py         # MCP server for Claude Code integration
├── requirements.txt      # Python deps: mcp, pocket-tts, transformers, rich, httpx
├── setup.bat             # Windows setup script
├── VoiceChat.bat         # Windows launcher
│
├── skills/               # Character skill definitions
│   ├── README.md         # Skill creation guide
│   ├── hermione-companion/
│   │   ├── SKILL.md      # YAML frontmatter + system prompt
│   │   ├── references/   # Lore/knowledge markdown files
│   │   └── scripts/      # Python helper scripts
│   └── assistant-default/
│       └── SKILL.md
│
└── voice_references/     # Custom .wav voice files for TTS cloning
```
## Key Components
### voice_client.py
Main classes:
- `Config` - Configuration dataclass, reads from env vars
- `SkillLoader` - Parses SKILL.md files (YAML frontmatter + markdown body); see the parsing sketch after this list
- `TTSFilter` - Intelligent content filtering for TTS (removes code, URLs, JSON, etc.)
- `TTSEngine` - Pocket TTS wrapper with voice caching (applies `TTSFilter` before speaking)
- `ASREngine` - Parakeet TDT 0.6B v3 via onnx-asr (GPU accelerated)
- `SileroVAD` - Lightweight speech detection (used by Smart Turn)
- `SmartTurnVAD` - Semantic turn detection using Whisper Tiny encoder
- `AudioRecorder` - VAD and PTT recording modes (uses Smart Turn for voice activity detection)
- `ProviderManager` - Detects and manages LLM providers (LM Studio, OpenRouter, OpenAI)
- `LLMClient` - Unified client for OpenAI-compatible APIs
- `VoiceModeController` - Main loop orchestrating everything with Rich UI
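As a rough illustration, the frontmatter split that `SkillLoader` performs looks like this (a minimal sketch; the function name and return shape are assumptions, only the file format comes from this document):
```python
import yaml  # pyyaml, already a dependency
from pathlib import Path

def parse_skill(path: Path) -> tuple[dict, str]:
    """Split a SKILL.md file into (frontmatter dict, markdown body)."""
    text = path.read_text(encoding="utf-8")
    # Frontmatter sits between the first pair of '---' delimiters.
    _, frontmatter, body = text.split("---", 2)
    meta = yaml.safe_load(frontmatter)  # id, name, voice, metadata, ...
    return meta, body.strip()           # body carries the system prompt
```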
### UI Components (Rich-based)
- `Theme` - Consistent color palette
- `AudioLevelMeter` - Real-time audio visualization
- `StatusIndicator` - Animated spinners with states
- `CharacterCard` - Visual character display panel
- `ConversationPanel` - Scrolling message history
- `ShortcutBar` - Keyboard shortcuts footer
### SKILL.md Format
```markdown
---
id: skill-id
name: Character Name
display_name: "Display Name"
description: Brief description for listing
voice: reference.wav        # Optional custom voice
avatar: avatar.png          # Optional image
metadata:
  setting: "Scene description"
  greeting: "First thing character says"
  initial_memories: []      # For memory systems
  personality_traits: []    # Character traits
  speech_patterns: []       # TTS optimization hints
allowed_tools: []           # Tool permissions
---

# Character Name

## System Prompt
Full system prompt / character instructions here...
```
## Provider Priority
LocalVoiceMode auto-detects providers in this order:
1. **LM Studio** - Scans ports 1234, 1235, 1236, 8080, 5000
2. **OpenRouter** - Uses `OPENROUTER_API_KEY` env var
3. **OpenAI** - Uses `OPENAI_API_KEY` env var
Force a specific provider with `VOICE_PROVIDER` env var.
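Illustratively, the detection order boils down to something like this (a minimal sketch, not the actual `ProviderManager` implementation; the cloud base URLs are the providers' public endpoints):
```python
import os
import httpx

LM_STUDIO_PORTS = [1234, 1235, 1236, 8080, 5000]

def detect_provider() -> tuple[str, str] | None:
    """Return (provider, base_url) for the first available backend."""
    forced = os.environ.get("VOICE_PROVIDER")  # lm_studio, openrouter, openai
    # 1. LM Studio: probe the OpenAI-compatible /models endpoint on each port.
    if forced in (None, "lm_studio"):
        for port in LM_STUDIO_PORTS:
            url = f"http://localhost:{port}/v1"
            try:
                if httpx.get(f"{url}/models", timeout=0.5).status_code == 200:
                    return "lm_studio", url
            except httpx.HTTPError:
                continue
    # 2. OpenRouter, then 3. OpenAI, keyed off their API-key env vars.
    if forced in (None, "openrouter") and os.environ.get("OPENROUTER_API_KEY"):
        return "openrouter", "https://openrouter.ai/api/v1"
    if forced in (None, "openai") and os.environ.get("OPENAI_API_KEY"):
        return "openai", "https://api.openai.com/v1"
    return None
```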
## Common Tasks
### Add a New Character Skill
```bash
# Create skill directory
mkdir skills/my-character
mkdir skills/my-character/references
mkdir skills/my-character/scripts
# Create SKILL.md with YAML frontmatter + system prompt
# See skills/hermione-companion/SKILL.md for example
```
### Test a Skill
```bash
python voice_client.py --skill my-character --mode vad
```
### Add Custom Voice (Voice Cloning)
Pocket TTS supports voice cloning from reference audio.
**Reference Audio Requirements:**
- **Format**: WAV file (16-bit PCM recommended)
- **Duration**: 10 seconds of clean speech (optimal)
- **Quality**: Clear recording, minimal background noise
- **Content**: Natural conversational speech works best
**Best Practices:**
1. Record in a quiet environment
2. Use consistent microphone distance
3. Speak naturally - avoid reading robotically
4. Include varied intonation (questions, statements)
5. Consider using [ai-coustics](https://ai-coustics.com/) to enhance recordings
**Voice File Locations:**
- `skills/my-character/reference.wav` (per-skill)
- `voice_references/my-character.wav` (global)
- `voice_references/default.wav` (default voice override)
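To sanity-check a recording against the requirements above, a quick script using `scipy` (already a dependency) works; this is a convenience sketch, not part of the project:
```python
from scipy.io import wavfile

def check_reference(path: str) -> None:
    """Print basic stats for a voice-cloning reference WAV."""
    rate, data = wavfile.read(path)
    duration = len(data) / rate
    channels = 1 if data.ndim == 1 else data.shape[1]
    print(f"{path}: {rate} Hz, {channels} ch, {duration:.1f}s, dtype={data.dtype}")
    if not 8.0 <= duration <= 15.0:
        print("  warning: aim for roughly 10 seconds of clean speech")
    if data.dtype.name != "int16":
        print("  warning: 16-bit PCM is recommended")

check_reference("voice_references/my-character.wav")
```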
**Pre-made Voices (from HuggingFace):**
- `alba` - Female, casual (default)
- `marius` - Male
- `javert` - Male
- `fantine` - Female
- See: https://huggingface.co/kyutai/tts-voices
### Switch LLM Backend
```bash
# LM Studio (default) - requires LM Studio running on localhost:1234
python voice_client.py --api-url http://localhost:1234/v1
# OpenRouter (access many models including Claude)
set OPENROUTER_API_KEY=sk-or-...
python voice_client.py
# OpenAI
set OPENAI_API_KEY=sk-...
python voice_client.py
# Any OpenAI-compatible endpoint
python voice_client.py --api-url http://your-server/v1 --api-key YOUR_KEY --model your-model
```
### ASR Device Selection
```bash
# Use GPU (default) - uses TensorRT if available, falls back to CUDA
python voice_client.py --device cuda
# Use CPU (slower but no GPU required)
python voice_client.py --device cpu
# Check GPU availability
python -c "import onnxruntime as ort; print(ort.get_available_providers())"
```
**GPU Providers (in priority order):**
1. TensorRT - Best performance (requires TensorRT installation)
2. CUDA - Good performance (requires CUDA/cuDNN)
3. CPU - Fallback (always available)
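This priority maps directly onto ONNX Runtime's provider names; a sketch of selecting them (the provider strings are ONNX Runtime's real identifiers, the model path is a placeholder):
```python
import onnxruntime as ort

PREFERRED = [
    "TensorrtExecutionProvider",  # best performance, needs TensorRT
    "CUDAExecutionProvider",      # needs CUDA/cuDNN
    "CPUExecutionProvider",       # always available
]

available = ort.get_available_providers()
providers = [p for p in PREFERRED if p in available]
print("Using providers:", providers)
# session = ort.InferenceSession("parakeet.onnx", providers=providers)
```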
## Dependencies
Core Python packages:
- `mcp` - MCP server SDK for Claude Code integration
- `pocket-tts` - TTS (requires HuggingFace login for model download)
- `onnx-asr[gpu,hub]` - Parakeet TDT speech recognition with GPU support
- `rich` - Terminal UI
- `sounddevice` - Audio I/O
- `scipy` - Audio processing
- `numpy` - Array operations
- `httpx` - HTTP client for OpenAI-compatible APIs
- `pynput` - Keyboard handling (PTT mode)
- `pyyaml` - SKILL.md parsing
For GPU acceleration, also install:
- `onnxruntime-gpu[cuda,cudnn]` - CUDA execution provider
- `tensorrt-cu12-libs` - TensorRT for best performance (optional)
## Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `VOICE_API_URL` | `http://localhost:1234/v1` | OpenAI-compatible API endpoint |
| `VOICE_API_KEY` | (none) | API key for OpenAI-compatible providers |
| `VOICE_MODEL` | (none) | Model name (uses server default if not set) |
| `VOICE_PROVIDER` | (auto) | Force provider: lm_studio, openrouter, openai |
| `OPENROUTER_API_KEY` | (none) | OpenRouter API key |
| `OPENAI_API_KEY` | (none) | OpenAI API key |
| `VOICE_TTS_VOICE` | `alba` | Default Pocket TTS voice |
| `VOICE_DEVICE` | `cuda` | ASR device: `cuda` (GPU) or `cpu` |
| `VOICE_SMART_TURN_THRESHOLD` | `0.5` | Smart Turn completion probability threshold (0.0-1.0) |
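The `Config` dataclass reads these in the usual env-with-default pattern; a sketch (field names are assumptions, the variable names and defaults come from the table above):
```python
import os
from dataclasses import dataclass, field

def _env(name: str, default: str | None = None) -> str | None:
    return os.environ.get(name, default)

@dataclass
class Config:
    api_url: str = field(default_factory=lambda: _env("VOICE_API_URL", "http://localhost:1234/v1"))
    api_key: str | None = field(default_factory=lambda: _env("VOICE_API_KEY"))
    model: str | None = field(default_factory=lambda: _env("VOICE_MODEL"))
    provider: str | None = field(default_factory=lambda: _env("VOICE_PROVIDER"))
    tts_voice: str = field(default_factory=lambda: _env("VOICE_TTS_VOICE", "alba"))
    device: str = field(default_factory=lambda: _env("VOICE_DEVICE", "cuda"))
    smart_turn_threshold: float = field(
        default_factory=lambda: float(_env("VOICE_SMART_TURN_THRESHOLD", "0.5")))
```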
## Voice Activity Detection (VAD)
LocalVoiceMode uses **Smart Turn** for semantic voice activity detection.
### Smart Turn
Uses **Silero VAD** for speech detection + **Smart Turn v3** for semantic turn completion.
- Understands when the speaker is *done talking*, rather than just detecting silence
- Handles pauses, "um"s, and thinking silences naturally
- Supports 23 languages
- ~12ms CPU inference, 8MB model
- Downloads automatically on first use
Configure the turn completion threshold:
```bash
# Lower threshold = faster response (more false positives)
# Higher threshold = waits longer (fewer interruptions)
# Default: 0.5
set VOICE_SMART_TURN_THRESHOLD=0.5
```
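Conceptually, the two models combine like this (pseudocode-level sketch; every method name here is hypothetical, the real `SmartTurnVAD`/`SileroVAD` APIs will differ):
```python
def wait_for_turn_end(recorder, silero, smart_turn, threshold=0.5):
    """Stop recording only once the utterance is semantically complete."""
    audio = []
    while True:
        chunk = recorder.read_chunk()
        audio.append(chunk)
        if silero.is_silence(chunk):                      # acoustic pause detected
            p = smart_turn.completion_probability(audio)  # semantic check
            if p >= threshold:
                return audio                              # speaker is actually done
            # below threshold: treat as a thinking pause and keep listening
```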
## TTS Content Filtering
LocalVoiceMode includes intelligent filtering to prevent reading code, JSON, or technical content aloud.
### What Gets Filtered
The TTS filter automatically handles:
- **Code blocks** - Replaced with "[code omitted]"
- **Inline code** - Simple names kept, complex code removed
- **URLs** - Replaced with "[link]"
- **File paths** - Only filename spoken if simple
- **JSON objects/arrays** - Replaced with "[data]" or "[list]"
- **UUIDs and hashes** - Replaced with "[identifier]"
- **Long lists** - Summarized (e.g., "item1, item2, and 8 more items")
- **Tables** - Replaced with "[table omitted]"
- **Stack traces** - Replaced with "[error details omitted]"
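Illustratively, a few of these rules as regex substitutions (a simplified sketch of the kind of thing `TTSFilter` does, not its actual rules):
```python
import re

def filter_for_tts(text: str) -> str:
    """Replace unspeakable content with short spoken placeholders."""
    text = re.sub(r"`{3}.*?`{3}", "[code omitted]", text, flags=re.DOTALL)  # fenced code blocks
    text = re.sub(r"https?://\S+", "[link]", text)                          # URLs
    text = re.sub(
        r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b",
        "[identifier]", text, flags=re.IGNORECASE)                          # UUIDs
    return text
```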
### Speech Modes
Control how much Claude speaks using `set_speech_mode`:
| Mode | Description |
|------|-------------|
| `speak-all` | Speak everything (with filtering) - default |
| `speak-summaries` | Summarize long content before speaking |
| `speak-minimal` | Only speak short announcements (<100 chars) |
| `silent` | No speech output |
```bash
# In MCP/Claude Code
set_speech_mode("minimal") # Only announcements
set_speech_mode("silent") # Mute all speech
set_speech_mode("all") # Resume normal speech
```
## Troubleshooting
### Pocket TTS fails to load
```bash
# Login to HuggingFace (required for model access)
pip install huggingface-hub
huggingface-cli login
# Accept license at: https://huggingface.co/kyutai/pocket-tts
```
### No audio input
```python
# Check audio devices
import sounddevice
print(sounddevice.query_devices())
```
### LM Studio connection refused
1. Open LM Studio
2. Load a model
3. Go to Local Server tab -> Start Server
4. Verify: `curl http://localhost:1234/v1/models`
### OpenRouter/OpenAI not working
1. Verify API key is set correctly (OPENROUTER_API_KEY or OPENAI_API_KEY)
2. Check `python voice_client.py --list-providers` to see detected providers
### ASR/GPU not working
```bash
# Check available ONNX Runtime providers
python -c "import onnxruntime as ort; print(ort.get_available_providers())"
# Install GPU support
pip install onnxruntime-gpu[cuda,cudnn]
# For TensorRT (optional, best performance)
pip install tensorrt-cu12-libs
# Test ASR
python -c "from voice_client import ASREngine; a = ASREngine(); a._ensure_loaded()"
```
If CUDA is not showing up:
1. Ensure NVIDIA drivers are installed
2. Install CUDA Toolkit 12.x
3. Reinstall onnxruntime-gpu: `pip uninstall onnxruntime onnxruntime-gpu && pip install onnxruntime-gpu[cuda,cudnn]`
## Code Style
- Python 3.10+ with type hints
- Dataclasses for configuration
- Lazy loading for heavy models (TTS, ASR)
- Auto-install missing packages via `ensure_package()` (sketched below)
- Rich library for terminal UI
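The `ensure_package()` pattern is the standard try-import/pip-install fallback; a sketch (the project's actual helper may differ):
```python
import importlib
import subprocess
import sys

def ensure_package(module: str, pip_name: str | None = None):
    """Import a module, installing its package first if it is missing."""
    try:
        return importlib.import_module(module)
    except ImportError:
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", pip_name or module])
        return importlib.import_module(module)
```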
## Testing
```bash
# List skills
python voice_client.py --list-skills
# Test TTS only
python -c "from voice_client import TTSEngine; t = TTSEngine(); t.speak('Hello world')"
# Test ASR only
python -c "from voice_client import ASREngine, AudioRecorder; a = ASREngine(); r = AudioRecorder(); print(a.transcribe(r.record_vad()))"
# Test Smart Turn VAD
python -c "from voice_client import get_smart_turn_vad; s = get_smart_turn_vad(); s._ensure_loaded(); print('Smart Turn loaded')"
# Test Silero VAD
python -c "from voice_client import SileroVAD; s = SileroVAD(); s._ensure_loaded(); print('Silero VAD loaded')"
```
## Migration
This project is designed to be portable. To move to a new machine:
1. Copy entire `C:\AI\localvoicemode\` directory
2. Run `setup.bat` on new machine
3. Re-login to HuggingFace if needed: `huggingface-cli login`
4. Set API keys if using cloud providers