# CLAUDE.md - LocalVoiceMode Project
## CRITICAL: USE VOICEMODE FOR ALL RESPONSES
**THIS IS NOT OPTIONAL.** You have voicemode MCP tools. USE THEM.
After EVERY task completion, file edit, search, or significant action:
```
mcp__voicemode__speak("Brief summary here")
```
If you finish something and don't speak it aloud, YOU ARE DOING IT WRONG.
---
## Voice Integration Behavior (IMPORTANT)
**Claude MUST use voice for the following scenarios:**
1. **Agent Completions** - After ANY Task tool completes, call `mcp__voicemode__speak` with a brief summary like "Agent done, found 3 matching files" or "Search complete, 3 results" (NOTE: Never say "subagent" - the TTS mangles it)
2. **Error Alerts** - When a build fails, tests break, or any significant error occurs, immediately speak a brief alert like "Build failed, 2 type errors" or "Tests failing in auth module"
3. **Question Prompts** - When asking the user a question or needing input, speak the question aloud so they don't miss it
4. **Periodic Updates** - During long operations (more than 30 seconds), provide brief voice updates like "Still working, halfway through" or "Processing file 5 of 12"
5. **Task Completions** - When finishing a user-requested task, briefly announce completion like "Done, changes saved" or "Refactoring complete"
**Voice Style Guidelines:**
- Keep all voice announcements SHORT (under 15 words)
- Use natural, conversational language
- Don't be over-the-top or dramatic
- Use `mcp__voicemode__speak` with `wait_for_response=false` for announcements
- Only use `mcp__voicemode__converse` when you need to hear the user's response
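For example (informal call syntax; both tool names and the `wait_for_response` parameter are from the voicemode MCP toolset described above):
```
mcp__voicemode__speak("Tests passing, all green", wait_for_response=false)  # fire-and-forget announcement
mcp__voicemode__converse("Should I also update the docs?")                  # waits for the user's spoken reply
```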
**Voice Commands (user can say these anytime):**
- "status" - Claude reports current progress
- "stop" - End current operation
- "pause" - Pause voice announcements temporarily
---
## Project Overview
**LocalVoiceMode** is a self-contained voice interface for AI chat using:
- **Pocket TTS** (Kyutai) - CPU-based text-to-speech, 100M params, ~200ms latency
- **Parakeet TDT 0.6B v3** (NVIDIA) - GPU-accelerated speech recognition via onnx-asr
- **Auto-Provider Detection** - LM Studio, OpenRouter, OpenAI (priority order)
- **Character Skills** - YAML-defined personalities with custom voices
- **MCP Server** - Integrates with Claude Code for tool-based voice control
Location: `C:\AI\localvoicemode`
## MCP Server Integration (Claude Code)
The voice MCP server runs invisibly in the background. From Claude Code, just say or type:
- **"start voice"** or **"start voice mode"** - Begins listening
- **"stop voice"** - Ends voice mode
- **"voice status"** - Check if running
- **"list voices"** - See available characters
- **"provider status"** - See available LLM providers
**MCP Tools:**
- `start_voice(skill)` - Start voice chat (default: "assistant")
- `stop_voice()` - Stop voice chat
- `voice_status()` - Check status
- `list_voices()` - List characters
- `provider_status()` - Show available providers
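For example, to launch voice chat as a specific character and confirm it started (informal call syntax using the tools listed above):
```
start_voice(skill="hermione")
voice_status()
```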
**Features:**
- Auto-detects LM Studio on common ports (1234, 1235, 1236, 8080, 5000)
- Falls back to OpenRouter if OPENROUTER_API_KEY is set
- Falls back to OpenAI if OPENAI_API_KEY is set
- Runs headlessly - no terminal window
- Voice commands work while running:
- Say "stop" or "goodbye" to end
- Say "change voice" to switch characters
## Start Voice Mode
**From Claude Code:** Just type "start voice" or call `start_voice()`
**Standalone (with UI):** Double-click `VoiceChat.bat` or run:
```bash
.venv/Scripts/python.exe voice_client.py --api-url http://localhost:1235/v1
```
**Headless (no window):**
```bash
.venv/Scripts/python.exe voice_client.py --headless --api-url http://localhost:1235/v1
```
**Available characters:** `assistant` (default), `hermione`
**Prerequisites:**
1. LM Studio (or any OpenAI-compatible API) running, OR `OPENROUTER_API_KEY` set
2. On first run: `huggingface-cli login` (required to download the TTS model)
## Quick Setup Commands
```bash
# Navigate to project
cd C:\AI\localvoicemode
# Create and activate virtual environment
python -m venv .venv
.venv\Scripts\activate # Windows
# source .venv/bin/activate # Linux/Mac
# Install dependencies
pip install -r requirements.txt
# Test voice client
python voice_client.py --list-skills
```
## Architecture
```
C:\AI\localvoicemode\
├── voice_client.py       # Main entry point - handles TTS, ASR, LLM, skills, UI
├── mcp_server.py         # MCP server for Claude Code integration
├── requirements.txt      # Python deps: mcp, pocket-tts, transformers, rich, httpx
├── setup.bat             # Windows setup script
├── VoiceChat.bat         # Windows launcher
│
├── skills/               # Character skill definitions
│   ├── README.md         # Skill creation guide
│   ├── hermione-companion/
│   │   ├── SKILL.md      # YAML frontmatter + system prompt
│   │   ├── references/   # Lore/knowledge markdown files
│   │   └── scripts/      # Python helper scripts
│   └── assistant-default/
│       └── SKILL.md
│
└── voice_references/     # Custom .wav voice files for TTS cloning
```
## Key Components
### voice_client.py
Main classes:
- `Config` - Configuration dataclass, reads from env vars
- `SkillLoader` - Parses SKILL.md files (YAML frontmatter + markdown body); see the parsing sketch after this list
- `TTSFilter` - Intelligent content filtering for TTS (removes code, URLs, JSON, etc.)
- `TTSEngine` - Pocket TTS wrapper with voice caching (applies `TTSFilter` before speaking)
- `ASREngine` - Parakeet TDT 0.6B v3 via onnx-asr (GPU accelerated)
- `SileroVAD` - Lightweight speech detection (used by Smart Turn)
- `SmartTurnVAD` - Semantic turn detection using Whisper Tiny encoder
- `AudioRecorder` - VAD and PTT recording modes (uses Smart Turn for voice activity detection)
- `ProviderManager` - Detects and manages LLM providers (LM Studio, OpenRouter, OpenAI)
- `LLMClient` - Unified client for OpenAI-compatible APIs
- `VoiceModeController` - Main loop orchestrating everything with Rich UI
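As a rough illustration, the frontmatter split that `SkillLoader` performs looks like this (a minimal sketch; the function name and return shape are assumptions, only the file format comes from this document):
```python
import yaml  # pyyaml, already a dependency
from pathlib import Path

def parse_skill(path: Path) -> tuple[dict, str]:
    """Split a SKILL.md file into (frontmatter dict, markdown body)."""
    text = path.read_text(encoding="utf-8")
    # Frontmatter sits between the first pair of '---' delimiters.
    _, frontmatter, body = text.split("---", 2)
    meta = yaml.safe_load(frontmatter)  # id, name, voice, metadata, ...
    return meta, body.strip()           # body carries the system prompt
```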
### UI Components (Rich-based)
- `Theme` - Consistent color palette
- `AudioLevelMeter` - Real-time audio visualization
- `StatusIndicator` - Animated spinners with states
- `CharacterCard` - Visual character display panel
- `ConversationPanel` - Scrolling message history
- `ShortcutBar` - Keyboard shortcuts footer
### SKILL.md Format
```markdown
---
id: skill-id
name: Character Name
display_name: "Display Name"
description: Brief description for listing
voice: reference.wav        # Optional custom voice
avatar: avatar.png          # Optional image
metadata:
  setting: "Scene description"
  greeting: "First thing character says"
  initial_memories: []      # For memory systems
  personality_traits: []    # Character traits
  speech_patterns: []       # TTS optimization hints
allowed_tools: []           # Tool permissions
---

# Character Name

## System Prompt
Full system prompt / character instructions here...
```
## Provider Priority
LocalVoiceMode auto-detects providers in this order:
1. **LM Studio** - Scans ports 1234, 1235, 1236, 8080, 5000
2. **OpenRouter** - Uses `OPENROUTER_API_KEY` env var
3. **OpenAI** - Uses `OPENAI_API_KEY` env var
Force a specific provider with `VOICE_PROVIDER` env var.
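Illustratively, the detection order boils down to something like this (a minimal sketch, not the actual `ProviderManager` implementation; the cloud base URLs are the providers' public endpoints):
```python
import os
import httpx

LM_STUDIO_PORTS = [1234, 1235, 1236, 8080, 5000]

def detect_provider() -> tuple[str, str] | None:
    """Return (provider, base_url) for the first available backend."""
    forced = os.environ.get("VOICE_PROVIDER")  # lm_studio, openrouter, openai
    # 1. LM Studio: probe the OpenAI-compatible /models endpoint on each port.
    if forced in (None, "lm_studio"):
        for port in LM_STUDIO_PORTS:
            url = f"http://localhost:{port}/v1"
            try:
                if httpx.get(f"{url}/models", timeout=0.5).status_code == 200:
                    return "lm_studio", url
            except httpx.HTTPError:
                continue
    # 2. OpenRouter, then 3. OpenAI, keyed off their API-key env vars.
    if forced in (None, "openrouter") and os.environ.get("OPENROUTER_API_KEY"):
        return "openrouter", "https://openrouter.ai/api/v1"
    if forced in (None, "openai") and os.environ.get("OPENAI_API_KEY"):
        return "openai", "https://api.openai.com/v1"
    return None
```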
## Common Tasks
### Add a New Character Skill
```bash
# Create skill directory
mkdir skills/my-character
mkdir skills/my-character/references
mkdir skills/my-character/scripts
# Create SKILL.md with YAML frontmatter + system prompt
# See skills/hermione-companion/SKILL.md for example
```
### Test a Skill
```bash
python voice_client.py --skill my-character --mode vad
```
### Add Custom Voice (Voice Cloning)
Pocket TTS supports voice cloning from reference audio.
**Reference Audio Requirements:**
- **Format**: WAV file (16-bit PCM recommended)
- **Duration**: 10 seconds of clean speech (optimal)
- **Quality**: Clear recording, minimal background noise
- **Content**: Natural conversational speech works best
**Best Practices:**
1. Record in a quiet environment
2. Use consistent microphone distance
3. Speak naturally - avoid reading robotically
4. Include varied intonation (questions, statements)
5. Consider using [ai-coustics](https://ai-coustics.com/) to enhance recordings
**Voice File Locations:**
- `skills/my-character/reference.wav` (per-skill)
- `voice_references/my-character.wav` (global)
- `voice_references/default.wav` (default voice override)
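To sanity-check a recording against the requirements above, a quick script using `scipy` (already a dependency) works; this is a convenience sketch, not part of the project:
```python
from scipy.io import wavfile

def check_reference(path: str) -> None:
    """Print basic stats for a voice-cloning reference WAV."""
    rate, data = wavfile.read(path)
    duration = len(data) / rate
    channels = 1 if data.ndim == 1 else data.shape[1]
    print(f"{path}: {rate} Hz, {channels} ch, {duration:.1f}s, dtype={data.dtype}")
    if not 8.0 <= duration <= 15.0:
        print("  warning: aim for roughly 10 seconds of clean speech")
    if data.dtype.name != "int16":
        print("  warning: 16-bit PCM is recommended")

check_reference("voice_references/my-character.wav")
```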
**Pre-made Voices (from HuggingFace):**
- `alba` - Female, casual (default)
- `marius` - Male
- `javert` - Male
- `fantine` - Female
- See: https://huggingface.co/kyutai/tts-voices
### Switch LLM Backend
```bash
# LM Studio (default) - requires LM Studio running on localhost:1234
python voice_client.py --api-url http://localhost:1234/v1
# OpenRouter (access many models including Claude)
set OPENROUTER_API_KEY=sk-or-...
python voice_client.py
# OpenAI
set OPENAI_API_KEY=sk-...
python voice_client.py
# Any OpenAI-compatible endpoint
python voice_client.py --api-url http://your-server/v1 --api-key YOUR_KEY --model your-model
```
### ASR Device Selection
```bash
# Use GPU (default) - uses TensorRT if available, falls back to CUDA
python voice_client.py --device cuda
# Use CPU (slower but no GPU required)
python voice_client.py --device cpu
# Check GPU availability
python -c "import onnxruntime as ort; print(ort.get_available_providers())"
```
**GPU Providers (in priority order):**
1. TensorRT - Best performance (requires TensorRT installation)
2. CUDA - Good performance (requires CUDA/cuDNN)
3. CPU - Fallback (always available)
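This priority maps directly onto ONNX Runtime's provider names; a sketch of selecting them (the provider strings are ONNX Runtime's real identifiers, the model path is a placeholder):
```python
import onnxruntime as ort

PREFERRED = [
    "TensorrtExecutionProvider",  # best performance, needs TensorRT
    "CUDAExecutionProvider",      # needs CUDA/cuDNN
    "CPUExecutionProvider",       # always available
]

available = ort.get_available_providers()
providers = [p for p in PREFERRED if p in available]
print("Using providers:", providers)
# session = ort.InferenceSession("parakeet.onnx", providers=providers)
```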
## Dependencies
Core Python packages:
- `mcp` - MCP server SDK for Claude Code integration
- `pocket-tts` - TTS (requires HuggingFace login for model download)
- `onnx-asr[gpu,hub]` - Parakeet TDT speech recognition with GPU support
- `rich` - Terminal UI
- `sounddevice` - Audio I/O
- `scipy` - Audio processing
- `numpy` - Array operations
- `httpx` - HTTP client for OpenAI-compatible APIs
- `pynput` - Keyboard handling (PTT mode)
- `pyyaml` - SKILL.md parsing
For GPU acceleration, also install:
- `onnxruntime-gpu[cuda,cudnn]` - CUDA execution provider
- `tensorrt-cu12-libs` - TensorRT for best performance (optional)
## Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `VOICE_API_URL` | `http://localhost:1234/v1` | OpenAI-compatible API endpoint |
| `VOICE_API_KEY` | (none) | API key for OpenAI-compatible providers |
| `VOICE_MODEL` | (none) | Model name (uses server default if not set) |
| `VOICE_PROVIDER` | (auto) | Force provider: lm_studio, openrouter, openai |
| `OPENROUTER_API_KEY` | (none) | OpenRouter API key |
| `OPENAI_API_KEY` | (none) | OpenAI API key |
| `VOICE_TTS_VOICE` | `alba` | Default Pocket TTS voice |
| `VOICE_DEVICE` | `cuda` | ASR device: `cuda` (GPU) or `cpu` |
| `VOICE_SMART_TURN_THRESHOLD` | `0.5` | Smart Turn completion probability threshold (0.0-1.0) |
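The `Config` dataclass reads these in the usual env-with-default pattern; a sketch (field names are assumptions, the variable names and defaults come from the table above):
```python
import os
from dataclasses import dataclass, field

def _env(name: str, default: str | None = None) -> str | None:
    return os.environ.get(name, default)

@dataclass
class Config:
    api_url: str = field(default_factory=lambda: _env("VOICE_API_URL", "http://localhost:1234/v1"))
    api_key: str | None = field(default_factory=lambda: _env("VOICE_API_KEY"))
    model: str | None = field(default_factory=lambda: _env("VOICE_MODEL"))
    provider: str | None = field(default_factory=lambda: _env("VOICE_PROVIDER"))
    tts_voice: str = field(default_factory=lambda: _env("VOICE_TTS_VOICE", "alba"))
    device: str = field(default_factory=lambda: _env("VOICE_DEVICE", "cuda"))
    smart_turn_threshold: float = field(
        default_factory=lambda: float(_env("VOICE_SMART_TURN_THRESHOLD", "0.5")))
```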
## Voice Activity Detection (VAD)
LocalVoiceMode uses **Smart Turn** for semantic voice activity detection.
### Smart Turn
Uses **Silero VAD** for speech detection + **Smart Turn v3** for semantic turn completion.
- Understands when the speaker is *done talking*, rather than just detecting silence
- Handles pauses, "um"s, and thinking silences naturally
- Supports 23 languages
- ~12ms CPU inference, 8MB model
- Downloads automatically on first use
Configure the turn completion threshold:
```bash
# Lower threshold = faster response (more false positives)
# Higher threshold = waits longer (fewer interruptions)
# Default: 0.5
set VOICE_SMART_TURN_THRESHOLD=0.5
```
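Conceptually, the two models combine like this (pseudocode-level sketch; every method name here is hypothetical, the real `SmartTurnVAD`/`SileroVAD` APIs will differ):
```python
def wait_for_turn_end(recorder, silero, smart_turn, threshold=0.5):
    """Stop recording only once the utterance is semantically complete."""
    audio = []
    while True:
        chunk = recorder.read_chunk()
        audio.append(chunk)
        if silero.is_silence(chunk):                      # acoustic pause detected
            p = smart_turn.completion_probability(audio)  # semantic check
            if p >= threshold:
                return audio                              # speaker is actually done
            # below threshold: treat as a thinking pause and keep listening
```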
## TTS Content Filtering
LocalVoiceMode includes intelligent filtering to prevent reading code, JSON, or technical content aloud.
### What Gets Filtered
The TTS filter automatically handles:
- **Code blocks** - Replaced with "[code omitted]"
- **Inline code** - Simple names kept, complex code removed
- **URLs** - Replaced with "[link]"
- **File paths** - Only filename spoken if simple
- **JSON objects/arrays** - Replaced with "[data]" or "[list]"
- **UUIDs and hashes** - Replaced with "[identifier]"
- **Long lists** - Summarized (e.g., "item1, item2, and 8 more items")
- **Tables** - Replaced with "[table omitted]"
- **Stack traces** - Replaced with "[error details omitted]"
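Illustratively, a few of these rules as regex substitutions (a simplified sketch of the kind of thing `TTSFilter` does, not its actual rules):
```python
import re

def filter_for_tts(text: str) -> str:
    """Replace unspeakable content with short spoken placeholders."""
    text = re.sub(r"`{3}.*?`{3}", "[code omitted]", text, flags=re.DOTALL)  # fenced code blocks
    text = re.sub(r"https?://\S+", "[link]", text)                          # URLs
    text = re.sub(
        r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b",
        "[identifier]", text, flags=re.IGNORECASE)                          # UUIDs
    return text
```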
### Speech Modes
Control how much Claude speaks using `set_speech_mode`:
| Mode | Description |
|------|-------------|
| `speak-all` | Speak everything (with filtering) - default |
| `speak-summaries` | Summarize long content before speaking |
| `speak-minimal` | Only speak short announcements (<100 chars) |
| `silent` | No speech output |
```bash
# In MCP/Claude Code
set_speech_mode("minimal") # Only announcements
set_speech_mode("silent") # Mute all speech
set_speech_mode("all") # Resume normal speech
```
## Troubleshooting
### Pocket TTS fails to load
```bash
# Login to HuggingFace (required for model access)
pip install huggingface-hub
huggingface-cli login
# Accept license at: https://huggingface.co/kyutai/pocket-tts
```
### No audio input
```python
# Check audio devices
import sounddevice
print(sounddevice.query_devices())
```
### LM Studio connection refused
1. Open LM Studio
2. Load a model
3. Go to Local Server tab -> Start Server
4. Verify: `curl http://localhost:1234/v1/models`
### OpenRouter/OpenAI not working
1. Verify API key is set correctly (OPENROUTER_API_KEY or OPENAI_API_KEY)
2. Check `python voice_client.py --list-providers` to see detected providers
### ASR/GPU not working
```bash
# Check available ONNX Runtime providers
python -c "import onnxruntime as ort; print(ort.get_available_providers())"
# Install GPU support
pip install onnxruntime-gpu[cuda,cudnn]
# For TensorRT (optional, best performance)
pip install tensorrt-cu12-libs
# Test ASR
python -c "from voice_client import ASREngine; a = ASREngine(); a._ensure_loaded()"
```
If CUDA is not showing up:
1. Ensure NVIDIA drivers are installed
2. Install CUDA Toolkit 12.x
3. Reinstall onnxruntime-gpu: `pip uninstall onnxruntime onnxruntime-gpu && pip install onnxruntime-gpu[cuda,cudnn]`
## Code Style
- Python 3.10+ with type hints
- Dataclasses for configuration
- Lazy loading for heavy models (TTS, ASR)
- Auto-install missing packages via `ensure_package()` (sketched below)
- Rich library for terminal UI
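The `ensure_package()` pattern is the standard try-import/pip-install fallback; a sketch (the project's actual helper may differ):
```python
import importlib
import subprocess
import sys

def ensure_package(module: str, pip_name: str | None = None):
    """Import a module, installing its package first if it is missing."""
    try:
        return importlib.import_module(module)
    except ImportError:
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", pip_name or module])
        return importlib.import_module(module)
```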
## Testing
```bash
# List skills
python voice_client.py --list-skills
# Test TTS only
python -c "from voice_client import TTSEngine; t = TTSEngine(); t.speak('Hello world')"
# Test ASR only
python -c "from voice_client import ASREngine, AudioRecorder; a = ASREngine(); r = AudioRecorder(); print(a.transcribe(r.record_vad()))"
# Test Smart Turn VAD
python -c "from voice_client import get_smart_turn_vad; s = get_smart_turn_vad(); s._ensure_loaded(); print('Smart Turn loaded')"
# Test Silero VAD
python -c "from voice_client import SileroVAD; s = SileroVAD(); s._ensure_loaded(); print('Silero VAD loaded')"
```
## Migration
This project is designed to be portable. To move to a new machine:
1. Copy entire `C:\AI\localvoicemode\` directory
2. Run `setup.bat` on new machine
3. Re-login to HuggingFace if needed: `huggingface-cli login`
4. Set API keys if using cloud providers