VoiceLayer
Allows extraction of audio from YouTube videos to collect voice samples for voice cloning.
Your AI agent can't hear you. VoiceLayer gives it ears and a voice.
Voice I/O for AI coding assistants. Press F5, speak to Claude Code, get on-device transcription in under 1.5 seconds. Your AI speaks back. Works with any MCP client.
You ──🎤──> whisper.cpp ──> Claude Code ──> edge-tts ──🔊──> You
            STT (local)     MCP tools       TTS (free)
Local-first. Free. Open-source. No cloud APIs, no API keys, no data leaves your machine. Part of the Golems ecosystem.
VoiceLayer runs as a persistent singleton daemon on a Unix socket — every Claude session connects through a lightweight socat shim instead of spawning its own process. 2 canonical MCP tools plus 9 backward-compatible aliases ship with full ToolAnnotations.
Architecture
    ┌─────────────────────────────────────┐
    │          VoiceLayer Daemon          │
    │      /tmp/voicelayer-mcp.sock       │
    │                                     │
    │ MCP JSONRPC ──> Tool Handlers       │
    │ (Content-Length ├── voice_speak     │
    │  framing)       └── voice_ask       │
    │                                     │
    │ TTS: edge-tts (retry + 30s timeout) │
    │ STT: whisper.cpp / Wispr Flow       │
    │ VAD: Silero ONNX (speech detection) │
    │ IPC: Voice Bar ← NDJSON events      │
    └──────────┬──────────────────────────┘
               │ Unix socket
┌──────────────┼──────────────┐
│              │              │
Claude Code    Claude Code    Cursor/Codex
(socat shim)   (socat shim)   (socat shim)
Why a daemon? The original design spawned a new Bun process per Claude session. With 17+ repos open, that meant 17 competing processes (700+ MB RAM) fighting over one Voice Bar socket, crashing edge-tts with PATH issues, and leaving orphans that never died. The daemon architecture, shipped in PRs #67-72, replaced all of that with a single process and socat shims.
Metric | Before (spawn-per-session) | After (daemon) |
Processes | N per session (17+ typical) | 1 daemon + socat shims |
RAM | ~700 MB (17 x 41 MB) | ~50 MB |
Orphan cleanup | Manual | PID lockfile auto-kills stale |
edge-tts failures | Random (PATH, contention) | Retry with 30s hard timeout |
voice_ask hang | Up to 300s (5 min!) | 30s default + outer guard |
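The architecture diagram lists Content-Length framing for MCP JSON-RPC traffic on the socket. As a rough illustration of what that framing involves, here is a minimal sketch. The function names `encodeFrame` and `decodeFrames` are hypothetical and are not taken from `src/mcp-framing.ts`; the header format (`Content-Length: N` followed by a blank line) is assumed to match the common LSP-style convention.

```typescript
// Hypothetical helpers for illustration only; not VoiceLayer's actual API.
const encoder = new TextEncoder();
const decoder = new TextDecoder();

/** Wrap a JSON-RPC message in a Content-Length header (length is in bytes, not characters). */
function encodeFrame(message: unknown): Uint8Array {
  const body = encoder.encode(JSON.stringify(message));
  const header = encoder.encode(`Content-Length: ${body.length}\r\n\r\n`);
  const frame = new Uint8Array(header.length + body.length);
  frame.set(header, 0);
  frame.set(body, header.length);
  return frame;
}

/** Find the start of the next "\r\n\r\n" separator at or after `from`, or -1. */
function indexOfSeparator(buf: Uint8Array, from: number): number {
  for (let i = from; i + 3 < buf.length; i++) {
    if (buf[i] === 13 && buf[i + 1] === 10 && buf[i + 2] === 13 && buf[i + 3] === 10) return i;
  }
  return -1;
}

/** Extract every complete frame from a buffer; return parsed messages plus leftover bytes. */
function decodeFrames(buffer: Uint8Array): { messages: unknown[]; rest: Uint8Array } {
  const messages: unknown[] = [];
  let offset = 0;
  while (true) {
    const headerEnd = indexOfSeparator(buffer, offset);
    if (headerEnd === -1) break; // header not complete yet
    const header = decoder.decode(buffer.subarray(offset, headerEnd));
    const match = /Content-Length:\s*(\d+)/i.exec(header);
    if (!match) throw new Error("missing Content-Length header");
    const length = Number(match[1]);
    const bodyStart = headerEnd + 4;
    if (buffer.length < bodyStart + length) break; // body not complete yet
    const body = decoder.decode(buffer.subarray(bodyStart, bodyStart + length));
    messages.push(JSON.parse(body));
    offset = bodyStart + length;
  }
  return { messages, rest: buffer.subarray(offset) };
}
```

A socket reader would append each incoming chunk to `rest` and call `decodeFrames` again, so messages split across reads are handled without special cases.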
Quick Start
# Install from npm
bun add -g voicelayer-mcp
# Prerequisites
brew install sox socat
pip3 install edge-tts
brew install whisper-cpp # optional — local STT
# Download a whisper model (recommended)
mkdir -p ~/.cache/whisper
curl -L -o ~/.cache/whisper/ggml-large-v3-turbo.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3-turbo.bin
Or install from source:
git clone https://github.com/EtanHey/voicelayer.git
cd voicelayer && bun install
Start the Daemon
# Option A: LaunchAgent (auto-start on login, auto-restart on crash)
./launchd/install.sh
# Option B: Manual
bun run src/mcp-server-daemon.ts
Disabling VoiceLayer
DISABLE_VOICELAYER=1 is a hard kill-switch for the MCP daemon.
# Install the LaunchAgent in a disabled state and sync the runtime daemon flag
DISABLE_VOICELAYER=1 ./launchd/install.sh
# Or edit the template-generated plist and add:
# <key>DISABLE_VOICELAYER</key>
# <string>1</string>
If the daemon is already running, create /tmp/.voicelayer-daemon-disabled and it will shut down within 5 seconds. ./launchd/install.sh also keeps that file in sync with DISABLE_VOICELAYER, so VoiceBar-launched daemons stay disabled too. To re-enable it, remove the env var from ~/Library/LaunchAgents/com.voicelayer.mcp-daemon.plist, delete /tmp/.voicelayer-daemon-disabled if present, and restart the agent:
launchctl kickstart -k "gui/$(id -u)/com.voicelayer.mcp-daemon"
Configure MCP Clients
Add to your .mcp.json (in any repo where you use Claude Code):
{
  "mcpServers": {
    "voicelayer": {
      "command": "socat",
      "args": ["STDIO", "UNIX-CONNECT:/tmp/voicelayer-mcp.sock"]
    }
  }
}
Or migrate all repos at once:
bash scripts/migrate-to-daemon.sh # migrates every .mcp.json under ~/Gits
bash scripts/migrate-to-daemon.sh --dry-run # preview without changes
Grant microphone access to your terminal (macOS: System Settings > Privacy > Microphone).
Voice Tools
Primary tools
Tool | Behavior | Blocking | readOnly | destructive | idempotent |
voice_speak | TTS with auto-mode (announce/brief/consult/think), replay, toggle | No | false | false | true |
voice_ask | Speak question + record mic + transcribe response | Yes | false | false | false |
Backward-compatible aliases
Alias | Maps to | idempotent |
| | true |
| | true |
| | true |
| | true |
| | false |
| | true |
| | true |
| | false |
| | false |
All 11 tools include MCP ToolAnnotations. No VoiceLayer tools are destructive. All have openWorldHint: false.
How voice_ask Works
Waits for any playing voice_speak audio to finish
Speaks the question via edge-tts (with retry on failure)
Records mic at device native rate, resamples to 16kHz
Silero VAD detects speech onset and silence end
whisper.cpp transcribes locally (~200-400ms on Apple Silicon)
Returns transcription to the AI agent
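The resampling step can be illustrated with a small sketch. The source does not say how VoiceLayer resamples (it may well delegate to sox), so the linear interpolation below is a stand-in technique, and `resampleLinear` is a hypothetical name. Note that plain linear interpolation applies no low-pass filter, so a production downsampler would usually filter first to limit aliasing.

```typescript
// Illustrative stand-in: linear-interpolation resampling of mono PCM samples.
// This is an assumption for demonstration, not VoiceLayer's actual resampler.
function resampleLinear(input: Float32Array, fromRate: number, toRate: number): Float32Array {
  if (fromRate <= 0 || toRate <= 0) throw new RangeError("sample rates must be positive");
  if (input.length === 0 || fromRate === toRate) return input.slice();

  const outLength = Math.max(1, Math.round((input.length * toRate) / fromRate));
  const output = new Float32Array(outLength);
  const step = fromRate / toRate; // input samples advanced per output sample

  for (let i = 0; i < outLength; i++) {
    const pos = i * step;
    const left = Math.min(Math.floor(pos), input.length - 1);
    const right = Math.min(left + 1, input.length - 1);
    const frac = pos - left;
    // Blend the two neighbouring input samples by their distance from `pos`.
    output[i] = (input[left] ?? 0) * (1 - frac) + (input[right] ?? 0) * frac;
  }
  return output;
}

// Usage: one second of 48 kHz audio becomes 16,000 samples at 16 kHz.
const oneSecond = new Float32Array(48000);
const resampled = resampleLinear(oneSecond, 48000, 16000);
console.log(resampled.length);
```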
Reliability Features
PID lockfile (/tmp/voicelayer-mcp.pid): On startup, detects and kills any orphan MCP server from a previous session
edge-tts retry: Health check (cached 60s) + automatic retry with 30s hard timeout per attempt
Outer timeout guard: Promise.race wrapper around the entire voice_ask flow; if anything hangs, returns an error instead of blocking forever
Session booking: Lockfile mutex prevents mic conflicts between concurrent sessions
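The timeout guard and per-attempt retry described above can be sketched as follows. This is a minimal illustration under assumptions: `withTimeout` and `retryWithTimeout` are hypothetical names, and the number of attempts is a parameter because the source does not state how many retries edge-tts gets.

```typescript
// Hypothetical helpers for illustration; not VoiceLayer's actual implementation.

/** Reject if `work` has not settled within `ms` milliseconds. */
function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms} ms`)), ms);
  });
  // Promise.race settles with whichever promise finishes first.
  // Note: the losing promise is not cancelled, only ignored.
  return Promise.race([work, timeout]).finally(() => clearTimeout(timer));
}

/** Run `attempt` up to `attempts` times, each bounded by `perAttemptMs`. */
async function retryWithTimeout<T>(
  attempt: () => Promise<T>,
  attempts: number,
  perAttemptMs: number,
): Promise<T> {
  let lastError: unknown = new Error("no attempts were made");
  for (let i = 1; i <= attempts; i++) {
    try {
      return await withTimeout(attempt(), perAttemptMs, `attempt ${i}`);
    } catch (err) {
      lastError = err; // remember the failure and try again
    }
  }
  throw lastError;
}
```

Because `Promise.race` only stops waiting, a real guard would also need to stop the underlying recorder or TTS process when the timer wins.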
Recording Controls
Method | How |
Stop signal | |
VAD silence | Configurable: quick (0.5s), standard (1.5s), thoughtful (2.5s) |
Timeout | 30s default, configurable 5-3600s per call |
Push-to-talk | |
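The silence presets and timeout range in the table can be captured in a few lines. This is a sketch under assumptions: the option names (`silenceMode`, `timeoutSeconds`) are hypothetical, "standard" is assumed to be the default preset, and out-of-range timeouts are clamped here although the real tool may reject them instead.

```typescript
// Values come from the table above; names and defaults are assumptions.
type SilenceMode = "quick" | "standard" | "thoughtful";

const SILENCE_SECONDS: Record<SilenceMode, number> = {
  quick: 0.5,
  standard: 1.5,
  thoughtful: 2.5,
};

const DEFAULT_TIMEOUT_SECONDS = 30;
const MIN_TIMEOUT_SECONDS = 5;
const MAX_TIMEOUT_SECONDS = 3600;

interface RecordingOptions {
  silenceMode?: SilenceMode;
  timeoutSeconds?: number;
}

/** Resolve per-call options into concrete silence and timeout values. */
function resolveRecordingOptions(opts: RecordingOptions = {}): {
  silenceSeconds: number;
  timeoutSeconds: number;
} {
  const mode = opts.silenceMode ?? "standard"; // assumed default preset
  const requested = opts.timeoutSeconds ?? DEFAULT_TIMEOUT_SECONDS;
  if (!Number.isFinite(requested)) throw new RangeError("timeoutSeconds must be a finite number");
  // Clamp into the documented 5-3600s range (assumption: clamp rather than reject).
  const timeoutSeconds = Math.min(MAX_TIMEOUT_SECONDS, Math.max(MIN_TIMEOUT_SECONDS, requested));
  return { silenceSeconds: SILENCE_SECONDS[mode], timeoutSeconds };
}
```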
STT Backends
Backend | Type | Latency | Setup |
whisper.cpp | Local (default) | ~200-400ms | |
Wispr Flow | Cloud (fallback) | ~500ms + network | Set |
Auto-detected. Override with QA_VOICE_STT_BACKEND=whisper|wispr|auto.
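One plausible reading of the auto-detection rule is sketched below. The function name `selectSttBackend` and the `available` flags are hypothetical, and the preference order in `auto` mode (whisper.cpp first, Wispr Flow second) is inferred from the "default" and "fallback" labels in the table rather than confirmed by the source.

```typescript
// Sketch of backend selection for QA_VOICE_STT_BACKEND=whisper|wispr|auto.
// Availability detection itself (model file present, API key set) is left to the caller.
type SttBackend = "whisper" | "wispr";

function selectSttBackend(
  setting: string | undefined,
  available: { whisper: boolean; wispr: boolean },
): SttBackend {
  const choice = (setting ?? "auto").trim().toLowerCase();

  if (choice === "whisper" || choice === "wispr") {
    // An explicit override is honoured or fails loudly; it never silently falls back.
    if (!available[choice]) throw new Error(`STT backend "${choice}" was requested but is not available`);
    return choice;
  }
  if (choice !== "auto") throw new Error(`unknown STT backend setting: "${choice}"`);

  // Assumed auto order: local whisper.cpp first, cloud Wispr Flow as fallback.
  if (available.whisper) return "whisper";
  if (available.wispr) return "wispr";
  throw new Error("no STT backend is available");
}

// Usage: unset variable with both backends present resolves to the local one.
console.log(selectSttBackend(undefined, { whisper: true, wispr: true }));
```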
Voice Bar (macOS)
Floating SwiftUI widget providing visual feedback during voice interactions. Connects to the daemon via NDJSON over /tmp/voicelayer.sock.
Teleprompter with word-level highlighting and auto-scroll
Waveform visualization during recording
Expandable pill UI — collapses to dot after 5s idle
Draggable, position persisted across launches
Global hotkey: F5 (hold for push-to-talk)
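Voice Bar receives its events as NDJSON (one JSON object per line) over the socket. The sketch below shows how such a stream can be decoded when chunks arrive at arbitrary boundaries. `NdjsonDecoder` and `encodeEvent` are hypothetical names, and the event shapes in the usage lines are invented for illustration; the real event schema is not documented here.

```typescript
// Minimal NDJSON helpers; names and event shapes are assumptions.

/** Serialize one event as a single line of JSON. */
function encodeEvent(event: object): string {
  return JSON.stringify(event) + "\n";
}

/** Buffers partial lines so events split across chunks are parsed only once complete. */
class NdjsonDecoder {
  private pending = "";

  push(chunk: string): unknown[] {
    this.pending += chunk;
    const lines = this.pending.split("\n");
    // The last element is either "" (chunk ended on a newline) or an incomplete line.
    this.pending = lines.pop() ?? "";
    const events: unknown[] = [];
    for (const line of lines) {
      const trimmed = line.trim();
      if (trimmed !== "") events.push(JSON.parse(trimmed));
    }
    return events;
  }
}

// Usage: the second event is split across two chunks.
const ndjson = new NdjsonDecoder();
const wire = encodeEvent({ type: "recording" }) + encodeEvent({ type: "idle" });
console.log(ndjson.push(wire.slice(0, 30)).length);
console.log(ndjson.push(wire.slice(30)).length);
```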
bun add -g voicelayer-mcp
voicelayer hotkey install # Install F5/Dictation -> F18 relay
voicelayer bar # Build and launch Voice Bar
Hotkey Notes:
Requires Input Monitoring permission (System Settings > Privacy & Security)
On keyboards where the physical key is Apple's Dictation key, voicelayer hotkey install installs a hidutil LaunchAgent to map F5/Dictation to VoiceBar's internal F18 relay.
The installer preserves non-VoiceBar hidutil mappings and is safe to rerun.
Shift+F5 re-pastes the latest transcript.
Advanced: Voice Cloning
Three-tier TTS engine cascade for cloned voices, with edge-tts as the final fallback:
XTTS-v2 fine-tuned (cadence + timbre)
F5-TTS MLX zero-shot (local, no daemon)
Qwen3-TTS daemon (HTTP-based)
edge-tts fallback (always available)
voicelayer extract <youtube-url> # Extract voice samples
voicelayer clone <name> # Build voice profile
voicelayer daemon --port 8880 # Run Qwen3-TTS server
The Qwen3 daemon now uses bearer auth from ~/.voicelayer/daemon.secret
(created on first launch with mode 0600). The TypeScript bridge reads the
same file automatically. Override the location with
VOICELAYER_TTS_DAEMON_SECRET_FILE,
VOICELAYER_TTS_AUTH_TOKEN_FILE, or
voicelayer daemon --daemon-secret-file ... if you need a custom launcher
path. The daemon only accepts Host: 127.0.0.1:8880 /
Host: localhost:8880, rejects non-local Origin headers, and only reads
reference_wav files that resolve under ~/.voicelayer/voices/.
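The Host, Origin, and bearer checks described above could look roughly like this. It is a sketch, not the daemon's code: the function names are hypothetical, header keys are assumed to be lower-cased already, and treating a missing Origin header as acceptable (as non-browser clients send none) is an assumption.

```typescript
// Illustrative request checks for a localhost-only daemon on port 8880.
const ALLOWED_HOSTS = new Set(["127.0.0.1:8880", "localhost:8880"]);

function isAllowedHost(host: string | undefined): boolean {
  return host !== undefined && ALLOWED_HOSTS.has(host.trim().toLowerCase());
}

function isLocalOrigin(origin: string | undefined): boolean {
  if (origin === undefined) return true; // assumption: no Origin header is fine
  try {
    const { hostname } = new URL(origin);
    return hostname === "127.0.0.1" || hostname === "localhost";
  } catch {
    return false; // unparseable values such as "null" are rejected
  }
}

/** Compare two strings without exiting early on the first differing character. */
function safeEqual(a: string, b: string): boolean {
  if (a.length !== b.length) return false;
  let diff = 0;
  for (let i = 0; i < a.length; i++) diff |= a.charCodeAt(i) ^ b.charCodeAt(i);
  return diff === 0;
}

/** `headers` is assumed to use lower-case keys; `secret` is the shared bearer secret. */
function authorizeRequest(headers: Record<string, string | undefined>, secret: string): boolean {
  return (
    isAllowedHost(headers["host"]) &&
    isLocalOrigin(headers["origin"]) &&
    safeEqual(headers["authorization"] ?? "", `Bearer ${secret}`)
  );
}
```

Pinning the Host header is a common defence against DNS rebinding, which is presumably why the daemon checks it in addition to the bearer secret.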
Environment Variables
Variable | Default | Description |
| | STT backend: |
| auto-detected | Path to whisper.cpp GGML model |
| -- | Wispr Flow API key (cloud fallback) |
| | edge-tts voice ID |
| | Base speech rate |
| | Preferred override for the shared Qwen3 daemon bearer secret file |
| | Backward-compatible override for the shared Qwen3 daemon bearer secret file |
Testing
bun test # 585 Bun tests + 1 skip (latest verified on PR #190 pre-push gate)
bash flow-bar/run_tests.sh # 144 Swift tests for VoiceBar
git config core.hooksPath .githooks # install repo pre-push hook once per clone (#181, #182)
Test coverage includes: MCP protocol framing, tool handlers, TTS synthesis + retry, VAD speech detection, session booking, process lock lifecycle, socket client reconnection, edge-tts health checks, schema validation, Hebrew STT eval baselines, daemon resilience, ToolAnnotations, SSML sanitization, and secure path hardening.
Recent Hardening (2026-04-27 → 2026-05-02)
One-week sprint focused on VoiceBar reliability and a recording corpus to fight STT regressions. Every line below traces to a merged PR.
Recording reliability
Recording control clickability restored — F6 socket controls remained interactive while the pill animated (#188).
Pill bottom anchor preserved during resize so the UI doesn't drift off-screen (#187).
Waveform animates again on real audio input + redundant "listening" copy removed (#184).
Waveform dynamic range restored above the silence gate (#185).
Custom VoiceBar install paths supported (no more hard-coded /Applications/VoiceBar.app) (#186).
VoiceBar transcription preserved through the recording RMS gate so quiet speech survives (#177).
Stale daemon restart detection — VoiceBar transcription resumes automatically after the daemon restarts (#183).
STT quality
No-input STT hallucinations suppressed (#189).
Zero-RMS audio ingestion watchdog catches a silent mic before whisper.cpp guesses (#178).
VoiceBar dictation corpus (Phase 1) — #190
Every successful VoiceBar dictation is archived under ~/.local/share/voicelayer/recordings/YYYY-MM-DD/<timestamp-id>/ with audio.wav + voicelayer-transcript.txt + metadata.json (schema v1, SHA-256 over WAV bytes).
Atomic rename + fsync so partial writes never appear in the corpus.
Cancelled or empty transcriptions are skipped; only real dictations land on disk.
Re-paste hotkey moved to Shift+F5; plain F5 is now the default record-start/stop activation through VoiceBar's F18 relay.
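The archive behaviour described above (fsync, atomic rename, SHA-256 over the WAV bytes, skipping empty transcripts) can be sketched with Node-compatible APIs. This is an assumption-laden illustration: `archiveDictation`, the `.partial` staging suffix, and the metadata field names are hypothetical, and the actual VoiceBar archiver may well be written in Swift with a different layout inside metadata.json.

```typescript
import { createHash } from "node:crypto";
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Corpus root as stated in the README; everything else here is illustrative.
const DEFAULT_ROOT = path.join(os.homedir(), ".local", "share", "voicelayer", "recordings");

/** Write a file and fsync it so the bytes are on disk before the rename. */
function writeSynced(file: string, data: Uint8Array | string): void {
  const fd = fs.openSync(file, "w");
  try {
    fs.writeFileSync(fd, data);
    fs.fsyncSync(fd);
  } finally {
    fs.closeSync(fd);
  }
}

/** Returns the final directory, or null when the transcript is empty and nothing is stored. */
function archiveDictation(
  id: string,
  wav: Uint8Array,
  transcript: string,
  root: string = DEFAULT_ROOT,
): string | null {
  if (transcript.trim() === "") return null; // cancelled or empty dictations are skipped

  const day = new Date().toISOString().slice(0, 10); // YYYY-MM-DD (UTC in this sketch)
  const finalDir = path.join(root, day, id);
  const stagingDir = `${finalDir}.partial`; // hypothetical staging name

  fs.mkdirSync(stagingDir, { recursive: true });
  const sha256 = createHash("sha256").update(wav).digest("hex");
  writeSynced(path.join(stagingDir, "audio.wav"), wav);
  writeSynced(path.join(stagingDir, "voicelayer-transcript.txt"), transcript);
  // Field names below are assumptions; only "schema v1" and the hash come from the README.
  writeSynced(path.join(stagingDir, "metadata.json"), JSON.stringify({ schema_version: 1, sha256 }, null, 2));

  // rename() is atomic on the same filesystem, so readers see the whole entry or nothing.
  fs.renameSync(stagingDir, finalDir);
  return finalDir;
}
```

Staging the whole entry in a sibling directory and renaming it once is one way to get the "partial writes never appear" guarantee; renaming each file individually is another.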
Test infrastructure
VoiceLayer pre-push regression gate (#181) plus exit-0 fix on the success path (#182).
voicelayer run_tests.sh orchestrator script unifies Bun + Swift + daemon-boot + Karabiner smoke runs (#180).
VoiceBar audio fixtures for golden-path STT regressions (#179).
Project Structure
voicelayer/
├── src/                      # TypeScript/Bun (18K lines, 69 files)
│   ├── mcp-server-daemon.ts  # Singleton daemon entry point
│   ├── mcp-server.ts         # Stdio MCP server (legacy)
│   ├── mcp-daemon.ts         # Unix socket server (dual-protocol)
│   ├── mcp-framing.ts        # Content-Length + NDJSON framing
│   ├── mcp-handler.ts        # JSONRPC request router
│   ├── process-lock.ts       # PID lockfile (orphan prevention)
│   ├── handlers.ts           # Tool handler implementations
│   ├── tts.ts                # Multi-engine TTS with playback queue
│   ├── tts-health.ts         # edge-tts health check + retry
│   ├── input.ts              # Mic recording + STT pipeline
│   ├── vad.ts                # Silero VAD (ONNX inference)
│   ├── stt.ts                # STT backend abstraction
│   ├── socket-client.ts      # Voice Bar IPC (auto-reconnect)
│   ├── session-booking.ts    # Lockfile mutex
│   ├── paths.ts              # Centralized path constants
│   └── __tests__/            # 536 tests across 48 files
├── flow-bar/                 # SwiftUI macOS app (1.9K lines, 9 files)
│   ├── Sources/VoiceBar/     # App source
│   └── Tests/                # Swift tests
├── scripts/
│   ├── migrate-to-daemon.sh  # Batch .mcp.json migration
│   └── edge-tts-words.py     # Word-level TTS with timestamps
├── launchd/                  # macOS LaunchAgent auto-start
├── models/                   # Silero VAD ONNX model
└── package.json              # v2.0.0
Platform Support
Platform | TTS | STT | Recording | Voice Bar |
macOS | edge-tts + afplay | whisper.cpp (CoreML) | sox | SwiftUI app |
Linux | edge-tts + mpv/ffplay | whisper.cpp | sox | -- |
Part of Golems
VoiceLayer is one of three open-source MCP servers in the Golems ecosystem:
Server | What it does | Tools |
BrainLayer | Persistent memory for AI agents: knowledge graph + hybrid search | 12 |
VoiceLayer | Voice I/O: local STT, neural TTS, F5 push-to-talk | 11 |
| Terminal orchestration: spawn panes, read screens, coordinate agents | 22 |
Pair with BrainLayer to remember voice conversations across sessions.
License