VoiceLayer
Allows extraction of audio from YouTube videos to collect voice samples for voice cloning.
Your AI agent can't hear you. VoiceLayer gives it ears and a voice.
Voice I/O for AI coding assistants. Press F5, speak to Claude Code, get on-device transcription in under 1.5 seconds. Your AI speaks back. Works with any MCP client.
You ──🎤──> whisper.cpp ──> Claude Code ──> edge-tts ──🔊──> You
            STT (local)     MCP tools       TTS (free)

Local-first. Free. Open-source. No cloud APIs, no API keys, no data leaves your machine. Part of the Golems ecosystem.
VoiceLayer runs as a persistent singleton daemon on a Unix socket — every Claude session connects through a lightweight socat shim instead of spawning its own process. 2 canonical MCP tools plus 9 backward-compatible aliases ship with full ToolAnnotations.
Architecture
┌─────────────────────────────────────────┐
│           VoiceLayer Daemon             │
│        /tmp/voicelayer-mcp.sock         │
│                                         │
│  MCP JSONRPC ──> Tool Handlers          │
│  (Content-Length ├── voice_speak        │
│   framing)       └── voice_ask          │
│                                         │
│  TTS: edge-tts (retry + 30s timeout)    │
│  STT: whisper.cpp / Wispr Flow          │
│  VAD: Silero ONNX (speech detection)    │
│  IPC: Voice Bar ← NDJSON events         │
└────────────────────┬────────────────────┘
                     │ Unix socket
      ┌──────────────┼──────────────┐
      │              │              │
 Claude Code    Claude Code   Cursor/Codex
(socat shim)   (socat shim)   (socat shim)

Why a daemon? The original design spawned a new Bun process per Claude session. With 17+ repos open, that meant 17 competing processes (700+ MB RAM), fighting over one Voice Bar socket, crashing edge-tts with PATH issues, and leaving orphans that never died. The daemon architecture, shipped in PRs #67-72, replaced all of that with a single process and socat shims.
Metric | Before (spawn-per-session) | After (daemon) |
Processes | N per session (17+ typical) | 1 daemon + socat shims |
RAM | ~700 MB (17 x 41 MB) | ~50 MB |
Orphan cleanup | Manual | PID lockfile auto-kills stale |
edge-tts failures | Random (PATH, contention) | Retry with 30s hard timeout |
voice_ask hang | Up to 300s (5 min!) | 30s default + outer guard |
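The diagram above mentions Content-Length framing for MCP JSON-RPC traffic on the socket. As a rough illustration, here is a minimal sketch that assumes the common LSP-style layout (a `Content-Length: N` header, a blank line, then N bytes of UTF-8 JSON). The names `encodeFrame` and `FrameDecoder` are hypothetical and are not taken from VoiceLayer's mcp-framing.ts.

```typescript
// Hypothetical sketch of LSP-style Content-Length framing; not VoiceLayer's actual code.
const encoder = new TextEncoder();
const decoder = new TextDecoder();

/** Wrap a JSON-RPC message in a Content-Length header (the length counts UTF-8 bytes). */
function encodeFrame(message: unknown): Uint8Array {
  const body = encoder.encode(JSON.stringify(message));
  const header = encoder.encode(`Content-Length: ${body.length}\r\n\r\n`);
  const frame = new Uint8Array(header.length + body.length);
  frame.set(header, 0);
  frame.set(body, header.length);
  return frame;
}

/** Find the offset of the first \r\n\r\n sequence, or -1 if the header is incomplete. */
function indexOfHeaderEnd(bytes: Uint8Array): number {
  for (let i = 0; i + 3 < bytes.length; i++) {
    if (bytes[i] === 13 && bytes[i + 1] === 10 && bytes[i + 2] === 13 && bytes[i + 3] === 10) return i;
  }
  return -1;
}

/** Accumulates socket chunks and returns every complete message seen so far. */
class FrameDecoder {
  private buffer: Uint8Array = new Uint8Array(0);

  push(chunk: Uint8Array): unknown[] {
    const merged = new Uint8Array(this.buffer.length + chunk.length);
    merged.set(this.buffer, 0);
    merged.set(chunk, this.buffer.length);
    this.buffer = merged;

    const messages: unknown[] = [];
    for (;;) {
      const headerEnd = indexOfHeaderEnd(this.buffer);
      if (headerEnd < 0) break; // header not complete yet
      const header = decoder.decode(this.buffer.subarray(0, headerEnd));
      const match = /content-length:\s*(\d+)/i.exec(header);
      if (!match) throw new Error("missing Content-Length header");
      const bodyStart = headerEnd + 4; // skip the \r\n\r\n separator
      const bodyEnd = bodyStart + Number(match[1]);
      if (this.buffer.length < bodyEnd) break; // body not complete yet
      messages.push(JSON.parse(decoder.decode(this.buffer.subarray(bodyStart, bodyEnd))));
      this.buffer = this.buffer.slice(bodyEnd);
    }
    return messages;
  }
}
```

A decoder like this tolerates a message arriving split across several socket reads, which matters when several shims share one daemon.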
Quick Start
# Install from npm
bun add -g voicelayer-mcp
# Prerequisites
brew install sox socat
pip3 install edge-tts
brew install whisper-cpp # optional — local STT
# Download a whisper model (recommended)
mkdir -p ~/.cache/whisper
curl -L -o ~/.cache/whisper/ggml-large-v3-turbo.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3-turbo.bin
Or install from source:
git clone https://github.com/EtanHey/voicelayer.git
cd voicelayer && bun install
Start VoiceBar
# Build + install /Applications/VoiceBar.app (also retires the old
# standalone daemon LaunchAgent via launchd/install.sh).
voicelayer build-app # or: bash flow-bar/build-app.sh
# Launch the installed app; the app owns the child daemon
voicelayer bar
voicelayer build-app is the canonical builder: it compiles VoiceBar from
source, installs to /Applications/VoiceBar.app (override with
--install-path), refuses to overwrite a running VoiceBar, and runs
launchd/install.sh afterward. voicelayer bar then opens the installed
/Applications/VoiceBar.app (it no longer builds a bare dev binary); if the
app isn't installed, it tells you to run voicelayer build-app first.
VoiceLayer's daily-driver supervision chain is launchd -> VoiceBar.app -> child MCP daemon. The standalone com.voicelayer.mcp-daemon LaunchAgent is retired
because a launchd-owned daemon cannot reliably inherit VoiceBar's microphone
permission. launchd/install.sh now removes that old LaunchAgent; it does not
install or bootstrap a daemon plist. VoiceBar launches the daemon child from the
installed bundle or checkout and restarts it on crashes, clean exits, and
broken-mic silence signals.
Disabling VoiceLayer
DISABLE_VOICELAYER=1 and /tmp/.voicelayer-daemon-disabled are hard
kill-switches for the MCP daemon child.
# Disable daemon child launch/restart
touch /tmp/.voicelayer-daemon-disabled
# Re-enable daemon child launch/restart
rm -f /tmp/.voicelayer-daemon-disabled
When the flag exists, VoiceBar treats exit 0 as an explicit terminal stop. All other child exits reschedule a restart.
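The restart rule above fits in a single predicate. This is only an illustrative sketch: VoiceBar is a Swift app, and the `ChildExit` type and `shouldRestart` name are hypothetical.

```typescript
// Hypothetical sketch of the supervision rule; the real logic lives in VoiceBar (Swift).
type ChildExit = {
  code: number | null; // null when the child was killed by a signal
  disableFlagPresent: boolean; // whether /tmp/.voicelayer-daemon-disabled exists
};

/** Returns true when the supervisor should schedule another daemon launch. */
function shouldRestart(exit: ChildExit): boolean {
  // Only "flag present AND clean exit" counts as an explicit terminal stop.
  const terminalStop = exit.disableFlagPresent && exit.code === 0;
  return !terminalStop;
}
```

Under this reading, a crash while the flag is present still triggers a restart attempt; only a clean exit ends the loop.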
Updating VoiceLayer
voicelayer update is a cross-machine updater that runs on whichever Mac you
invoke it on (it auto-detects a git checkout vs a global package install). It
updates the package, rebuilds /Applications/VoiceBar.app, runs
launchd/install.sh, pulls the Qwen3-TTS model into ~/.voicelayer if missing,
and restarts the VoiceBar stack.
voicelayer update --dry-run # print the plan without running it
voicelayer update # package + app + model + restart
# Optionally sync personal runtime data (voices, vocabulary, daemon secret):
voicelayer update --data-mode direct --data-source other-mac.local:/Users/<you>
voicelayer update --data-mode brain-drive --data-source /Volumes/BrainDrive/VoiceLayerBackup/<you>
Personal-data sync is opt-in (--data-mode skip is the default). When enabled,
it rsyncs ~/.voicelayer/voices, voices.json, pronunciation.yaml,
daemon.secret, and the STT vocabulary from the source host.
Configure MCP Clients
Add to your .mcp.json (in any repo where you use Claude Code):
{
  "mcpServers": {
    "voicelayer": {
      "command": "socat",
      "args": ["STDIO", "UNIX-CONNECT:/tmp/voicelayer-mcp.sock"]
    }
  }
}
Or migrate all repos at once:
bash scripts/migrate-to-daemon.sh # migrates every .mcp.json under ~/Gits
bash scripts/migrate-to-daemon.sh --dry-run # preview without changes
Grant microphone access to your terminal (macOS: System Settings > Privacy & Security > Microphone).
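For anyone who prefers to script the change themselves, the rewrite of a single .mcp.json can be sketched as a pure function. This is an assumption about what such a migration amounts to, not a port of scripts/migrate-to-daemon.sh; the `McpConfig` types and `migrateToDaemon` name are hypothetical.

```typescript
// Hypothetical sketch; scripts/migrate-to-daemon.sh may behave differently.
interface McpServerEntry {
  command: string;
  args?: string[];
  env?: Record<string, string>;
}
interface McpConfig {
  mcpServers?: Record<string, McpServerEntry>;
}

const SOCKET_PATH = "/tmp/voicelayer-mcp.sock";

/** Returns a copy of the config whose "voicelayer" entry uses the socat shim. */
function migrateToDaemon(config: McpConfig): McpConfig {
  return {
    ...config,
    mcpServers: {
      ...config.mcpServers, // keep every other server untouched
      voicelayer: { command: "socat", args: ["STDIO", `UNIX-CONNECT:${SOCKET_PATH}`] },
    },
  };
}
```

Returning a new object rather than mutating the input makes a dry-run mode straightforward: compare the two objects and print the difference.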
Voice Tools
Primary tools
Tool | Behavior | Blocking | readOnly | destructive | idempotent |
voice_speak | TTS with auto-mode (announce/brief/consult/think), replay, toggle | No | false | false | true |
voice_ask | Speak question + record mic + transcribe response | Yes | false | false | false |
Backward-compatible aliases
Alias | Maps to | idempotent |
| | true |
| | true |
| | true |
| | true |
| | false |
| | true |
| | true |
| | false |
| | false |
All 11 tools include MCP ToolAnnotations. No VoiceLayer tools are destructive. All have openWorldHint: false.
How voice_ask Works
Waits for any playing voice_speak audio to finish
Speaks the question via edge-tts (with retry on failure)
Records mic at device native rate, resamples to 16kHz
Silero VAD detects speech onset and silence end
whisper.cpp transcribes locally (~200-400ms on Apple Silicon)
Returns transcription to the AI agent
Reliability Features
PID lockfile (/tmp/voicelayer-mcp.pid): On startup, detects and kills any orphan MCP server from a previous session
edge-tts retry: Health check (cached 60s) + automatic retry with 30s hard timeout per attempt
Outer timeout guard: Promise.race wrapper around the entire voice_ask flow; if anything hangs, it returns an error instead of blocking forever
Session booking: Lockfile mutex prevents mic conflicts between concurrent sessions
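The outer timeout guard can be illustrated with a small helper built on Promise.race. This is a sketch of the general technique only; the `withTimeout` name and its parameters are hypothetical and do not come from the VoiceLayer source.

```typescript
// Hypothetical sketch of an outer timeout guard built on Promise.race.

/** Rejects with an Error if `work` has not settled within `ms` milliseconds. */
async function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms} ms`)), ms);
  });
  try {
    // Whichever promise settles first decides the outcome.
    return await Promise.race([work, timeout]);
  } finally {
    clearTimeout(timer); // do not leave the timer pending once the race is over
  }
}
```

Note that Promise.race does not cancel the losing promise. A real guard would also need to stop the recorder or TTS subprocess that is still running after the timeout fires.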
Recording Controls
Method | How |
Stop signal | |
VAD silence | Configurable: quick (0.5s), standard (1.5s), thoughtful (2.5s) |
Timeout | 30s default, configurable 5-3600s per call |
Push-to-talk | |
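To make the VAD silence modes concrete, the sketch below shows one way to turn per-frame speech probabilities into a stop decision. The three durations come from the table above; the frame length, probability threshold, and function name are assumptions, and the actual vad.ts may work differently.

```typescript
// Hypothetical sketch of silence-based endpointing; frame size and threshold are assumptions.
type SilenceMode = "quick" | "standard" | "thoughtful";

const SILENCE_SECONDS: Record<SilenceMode, number> = { quick: 0.5, standard: 1.5, thoughtful: 2.5 };

/**
 * Returns the index of the frame at which recording should stop, or -1 if the
 * utterance has not ended yet. `speechProbs` holds one VAD probability per frame.
 */
function findEndOfSpeech(
  speechProbs: number[],
  mode: SilenceMode,
  frameSeconds = 0.032, // assumed: 512 samples at 16 kHz
  threshold = 0.5, // assumed speech-probability cutoff
): number {
  const framesNeeded = Math.ceil(SILENCE_SECONDS[mode] / frameSeconds);
  let speechSeen = false;
  let silentRun = 0;
  for (let i = 0; i < speechProbs.length; i++) {
    if (speechProbs[i] >= threshold) {
      speechSeen = true; // speech onset (or continued speech) resets the silence counter
      silentRun = 0;
    } else if (speechSeen) {
      silentRun += 1;
      if (silentRun >= framesNeeded) return i;
    }
  }
  return -1;
}
```

Silence before the first detected speech is ignored here, so a slow start does not end the recording early; the per-call timeout covers the case where nobody speaks at all.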
STT Backends
Backend | Type | Latency | Setup |
whisper.cpp | Local (default) | ~200-400ms | |
Wispr Flow | Cloud (fallback) | ~500ms + network | Set |
Auto-detected. Override with QA_VOICE_STT_BACKEND=whisper|wispr|auto.
Voice Bar (macOS)
Floating SwiftUI widget providing visual feedback during voice interactions. Connects to the daemon via NDJSON over /tmp/voicelayer.sock.
Teleprompter with word-level highlighting and auto-scroll
Waveform visualization during recording
Expandable pill UI — collapses to dot after 5s idle
Draggable, position persisted across launches
Global hotkey: F5 (hold for push-to-talk)
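Since the widget talks to the daemon over NDJSON (one JSON value per line), a reader has to buffer partial lines between socket reads. The class below is a generic sketch of that idea; `NdjsonDecoder` is a hypothetical name and nothing here reflects the actual event schema.

```typescript
// Hypothetical sketch of an NDJSON reader; the event shape is left untyped on purpose.
class NdjsonDecoder {
  private pending = "";

  /** Feed a decoded text chunk; returns every complete JSON value received so far. */
  push(chunk: string): unknown[] {
    this.pending += chunk;
    const lines = this.pending.split("\n");
    this.pending = lines.pop() ?? ""; // the last piece is an incomplete line (or "")
    return lines
      .filter((line) => line.trim() !== "") // tolerate blank keep-alive lines
      .map((line) => JSON.parse(line) as unknown);
  }
}
```

Unlike Content-Length framing, this relies on JSON serializers escaping newlines inside strings, so a raw line break always marks a message boundary.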
bun add -g voicelayer-mcp
voicelayer build-app # Build + install /Applications/VoiceBar.app
voicelayer hotkey install # Install Dictation-key -> F18 relay
voicelayer bar # Launch the installed VoiceBar.app
Hotkey Notes:
Requires Input Monitoring permission (System Settings > Privacy & Security)
On keyboards where the physical key is Apple's Dictation key, voicelayer hotkey install installs a hidutil LaunchAgent to map the Dictation key to VoiceBar's internal F18 relay. Physical F5 is handled natively by VoiceBar.
The installer preserves non-VoiceBar hidutil mappings and is safe to rerun.
Shift+F5 re-pastes the latest transcript.
voicelayer hotkey status prints the LaunchAgent state and the current hidutil key mapping.
Settings
Open VoiceBar's Settings window for in-app configuration:
General — a permissions and hotkey setup panel (Microphone, Accessibility, Input Monitoring, and the hidutil relay) with per-row status plus quick "Open" and "Set up" actions.
Audio — microphone input device + a Performance effort picker (see below).
Dictionary — STT corrections and prompt terms (the same vocabulary managed by
voicelayer vocab).
Performance (STT effort tiers). The Audio tab exposes three decoding-effort tiers that trade latency for accuracy on the same large-v3-turbo model (they only change whisper.cpp's beam search/best-of, not the model):
Tier | whisper.cpp args | Notes |
Fast | | Lowest decode cost |
Balanced | | Middle ground |
Accurate | | Default; widest beam |
The selection persists to ~/.local/state/voicelayer/whisper-performance.json. Override per-process with QA_VOICE_WHISPER_PERFORMANCE_EFFORT=fast|balanced|accurate.
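The precedence described above (environment override, then the persisted selection, then the Accurate default) can be sketched as follows. The `effort` key inside the JSON file is an assumption, since the file's schema is not documented here, and the function names are hypothetical.

```typescript
// Hypothetical sketch of the precedence; the persisted JSON key name is an assumption.
type Effort = "fast" | "balanced" | "accurate";
const EFFORTS: readonly string[] = ["fast", "balanced", "accurate"];

function parseEffort(value: unknown): Effort | undefined {
  return typeof value === "string" && EFFORTS.includes(value) ? (value as Effort) : undefined;
}

/** Env override wins, then the persisted selection, then the documented default. */
function resolveEffort(envValue: string | undefined, persistedJson: string | undefined): Effort {
  const fromEnv = parseEffort(envValue);
  if (fromEnv) return fromEnv;
  if (persistedJson) {
    try {
      const parsed = JSON.parse(persistedJson) as { effort?: unknown }; // assumed key name
      const fromFile = parseEffort(parsed.effort);
      if (fromFile) return fromFile;
    } catch {
      // unreadable or malformed state file: fall through to the default
    }
  }
  return "accurate";
}
```

A caller would pass the value of QA_VOICE_WHISPER_PERFORMANCE_EFFORT and the contents of whisper-performance.json, if the file exists.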
Advanced: Voice Cloning
Three-tier TTS engine cascade for cloned voices, with edge-tts as the final fallback:
XTTS-v2 fine-tuned (cadence + timbre)
F5-TTS MLX zero-shot (local, no daemon)
Qwen3-TTS daemon (HTTP-based)
edge-tts fallback (always available)
voicelayer extract <youtube-url> # Extract voice samples
voicelayer clone <name> # Build voice profile
voicelayer daemon --port 8880 # Run Qwen3-TTS server
The Qwen3 daemon now uses bearer auth from ~/.voicelayer/daemon.secret
(created on first launch with mode 0600). The TypeScript bridge reads the
same file automatically. Override the location with
VOICELAYER_TTS_DAEMON_SECRET_FILE,
VOICELAYER_TTS_AUTH_TOKEN_FILE, or
voicelayer daemon --daemon-secret-file ... if you need a custom launcher
path. The daemon only accepts Host: 127.0.0.1:8880 /
Host: localhost:8880, rejects non-local Origin headers, and only reads
reference_wav files that resolve under ~/.voicelayer/voices/.
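The header checks described above can be pictured as a small gate that runs before any request handler. This sketch assumes lowercase header names and a plain string comparison for the token; the function names are hypothetical and the real daemon's implementation may differ. The reference_wav path restriction is not covered here.

```typescript
// Hypothetical sketch of the request checks; not the daemon's actual code.
const ALLOWED_HOSTS = new Set(["127.0.0.1:8880", "localhost:8880"]);

function isLocalOrigin(origin: string): boolean {
  try {
    const { hostname } = new URL(origin);
    return hostname === "127.0.0.1" || hostname === "localhost";
  } catch {
    return false; // unparsable Origin values are treated as non-local
  }
}

/** Returns null when the request may proceed, otherwise a reason for rejecting it. */
function rejectReason(headers: Record<string, string | undefined>, expectedToken: string): string | null {
  if (!ALLOWED_HOSTS.has(headers["host"] ?? "")) return "host not allowed";
  const origin = headers["origin"];
  if (origin !== undefined && !isLocalOrigin(origin)) return "non-local origin";
  if (headers["authorization"] !== `Bearer ${expectedToken}`) return "bad bearer token";
  return null;
}
```

Checking Host and Origin in addition to the token guards against DNS-rebinding and cross-site requests from a browser on the same machine. A production check would likely compare the token in constant time.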
Environment Variables
Variable | Default | Description |
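QA_VOICE_STT_BACKEND | | STT backend: whisper, wispr, or auto |
| auto-detected | Path to whisper.cpp GGML model |
QA_VOICE_WHISPER_PERFORMANCE_EFFORT | accurate | STT decode effort: fast, balanced, or accurate |
| -- | Wispr Flow API key (cloud fallback) |
| | edge-tts voice ID |
| | Base speech rate |
VOICELAYER_TTS_DAEMON_SECRET_FILE | | Preferred override for the shared Qwen3 daemon bearer secret file |
VOICELAYER_TTS_AUTH_TOKEN_FILE | | Backward-compatible override for the shared Qwen3 daemon bearer secret file |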
Testing
bun test # 585 Bun tests + 1 skip (latest verified on PR #190 pre-push gate)
bash flow-bar/run_tests.sh # 144 Swift tests for VoiceBar
git config core.hooksPath .githooks # install repo pre-push hook once per clone (#181, #182)
Test coverage includes: MCP protocol framing, tool handlers, TTS synthesis + retry, VAD speech detection, session booking, process lock lifecycle, socket client reconnection, edge-tts health checks, schema validation, Hebrew STT eval baselines, daemon resilience, ToolAnnotations, SSML sanitization, and secure path hardening.
Recent Hardening (2026-04-27 → 2026-05-02)
One-week sprint focused on VoiceBar reliability and a recording corpus to fight STT regressions. Every line below traces to a merged PR.
Recording reliability
Recording control clickability restored — F6 socket controls remained interactive while the pill animated (#188).
Pill bottom anchor preserved during resize so the UI doesn't drift off-screen (#187).
Waveform animates again on real audio input + redundant "listening" copy removed (#184).
Waveform dynamic range restored above the silence gate (#185).
Custom VoiceBar install paths supported (no more hard-coded /Applications/VoiceBar.app) (#186).
VoiceBar transcription preserved through the recording RMS gate so quiet speech survives (#177).
Stale daemon restart detection — VoiceBar transcription resumes automatically after the daemon restarts (#183).
STT quality
No-input STT hallucinations suppressed (#189).
Zero-RMS audio ingestion watchdog catches a silent mic before whisper.cpp guesses (#178).
VoiceBar dictation corpus (Phase 1) — #190
Every successful VoiceBar dictation is archived under ~/.local/share/voicelayer/recordings/YYYY-MM-DD/<timestamp-id>/ with audio.wav + voicelayer-transcript.txt + metadata.json (schema v1, SHA-256 over WAV bytes).
Atomic rename + fsync so partial writes never appear in the corpus.
Cancelled or empty transcriptions are skipped; only real dictations land on disk.
Re-paste hotkey moved to Shift+F5; plain F5 is now the default record-start/stop activation through VoiceBar's F18 relay.
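The "atomic rename + fsync" pattern from the corpus notes can be sketched with Node's fs module. The three file names match the list above; the staging-directory suffix, the metadata field names, and the function names are assumptions made for illustration.

```typescript
import { createHash } from "node:crypto";
import { closeSync, existsSync, fsyncSync, mkdirSync, mkdtempSync, openSync, readFileSync, renameSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical sketch; metadata fields other than the hash are assumptions.

/** Write a file and flush it to disk before returning. */
function writeSynced(path: string, data: string | Uint8Array): void {
  const fd = openSync(path, "w");
  try {
    writeFileSync(fd, data);
    fsyncSync(fd);
  } finally {
    closeSync(fd);
  }
}

/** Stage all files in a sibling directory, then rename it into place in one step. */
function archiveDictation(finalDir: string, wav: Uint8Array, transcript: string): string {
  const stagingDir = `${finalDir}.partial`; // assumed naming for the staging area
  mkdirSync(stagingDir, { recursive: true });
  const sha256 = createHash("sha256").update(wav).digest("hex"); // hash covers the WAV bytes only
  writeSynced(join(stagingDir, "audio.wav"), wav);
  writeSynced(join(stagingDir, "voicelayer-transcript.txt"), transcript);
  writeSynced(join(stagingDir, "metadata.json"), JSON.stringify({ schemaVersion: 1, sha256 }));
  renameSync(stagingDir, finalDir); // readers see either nothing or the complete entry
  return sha256;
}

// Usage: archive into a throwaway directory and read the metadata back.
const demoRoot = mkdtempSync(join(tmpdir(), "voicelayer-demo-"));
const demoWav = new Uint8Array([82, 73, 70, 70]); // placeholder bytes, not a real WAV
const demoHash = archiveDictation(join(demoRoot, "example-id"), demoWav, "hello world");
const demoMeta = JSON.parse(readFileSync(join(demoRoot, "example-id", "metadata.json"), "utf8"));
const stagingLeftBehind = existsSync(join(demoRoot, "example-id.partial"));
```

The rename is only atomic when the staging and final directories sit on the same filesystem. A stricter version would also fsync the parent directory after the rename, which this sketch omits.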
Test infrastructure
VoiceLayer pre-push regression gate (#181) plus exit-0 fix on the success path (#182).
voicelayer run_tests.sh orchestrator script unifies Bun + Swift + daemon-boot + Karabiner smoke runs (#180).
VoiceBar audio fixtures for golden-path STT regressions (#179).
Project Structure
voicelayer/
├── src/ # TypeScript/Bun (18K lines, 69 files)
│ ├── mcp-server-daemon.ts # Singleton daemon entry point
│ ├── mcp-server.ts # Stdio MCP server (legacy)
│ ├── mcp-daemon.ts # Unix socket server (dual-protocol)
│ ├── mcp-framing.ts # Content-Length + NDJSON framing
│ ├── mcp-handler.ts # JSONRPC request router
│ ├── process-lock.ts # PID lockfile (orphan prevention)
│ ├── handlers.ts # Tool handler implementations
│ ├── tts.ts # Multi-engine TTS with playback queue
│ ├── tts-health.ts # edge-tts health check + retry
│ ├── input.ts # Mic recording + STT pipeline
│ ├── vad.ts # Silero VAD (ONNX inference)
│ ├── stt.ts # STT backend abstraction
│ ├── socket-client.ts # Voice Bar IPC (auto-reconnect)
│ ├── session-booking.ts # Lockfile mutex
│ ├── paths.ts # Centralized path constants
│ └── __tests__/ # 536 tests across 48 files
├── flow-bar/ # SwiftUI macOS app (1.9K lines, 9 files)
│ ├── Sources/VoiceBar/ # App source
│ └── Tests/ # Swift tests
├── scripts/
│ ├── migrate-to-daemon.sh # Batch .mcp.json migration
│ └── edge-tts-words.py # Word-level TTS with timestamps
├── launchd/ # VoiceBar LaunchAgent + retired daemon cleanup
├── models/ # Silero VAD ONNX model
└── package.json # v2.0.0
Platform Support
Platform | TTS | STT | Recording | Voice Bar |
macOS | edge-tts + afplay | whisper.cpp (CoreML) | sox | SwiftUI app |
Linux | edge-tts + mpv/ffplay | whisper.cpp | sox | -- |
Part of Golems
VoiceLayer is one of three open-source MCP servers in the Golems ecosystem:
Server | What it does | Tools |
BrainLayer | Persistent memory for AI agents: knowledge graph + hybrid search | 12 |
VoiceLayer | Voice I/O: local STT, neural TTS, F5 push-to-talk | 11 |
| Terminal orchestration: spawn panes, read screens, coordinate agents | 22 |
Pair with BrainLayer to remember voice conversations across sessions.
License