voice-mcp
Provides text-to-speech using DashScope Qwen-TTS Voice Cloning API.
Provides text-to-speech using ElevenLabs API with voice cloning and style support.
An MCP (Model Context Protocol) server for AI voice synthesis with an inline audio player. Give your AI assistant a custom cloned voice!
Fork Notice
This repository is a fork of garan0613/voice-mcp, released under the MIT License.
This fork lives at Yinglianchun/voice-mcp and keeps the original MCP speak(text) behavior while adding provider switching, ElevenLabs support, and a live visualizer panel.
What Changed in This Fork
Added TTS_PROVIDER switching between DashScope/CosyVoice and ElevenLabs.
Kept the old speak(text) call compatible, and extended it to speak(text, style?, raw_tags?).
Added ElevenLabs TTS support with configurable model, output format, voice settings, and optional v3 audio tags.
Added style-to-tag mapping for ElevenLabs v3, while stripping raw audio tags before DashScope/CosyVoice calls.
Added /status fields for provider, model, voice, configuration state, and audio tag availability.
Added /panel, a breathing audio visualizer that listens for the latest MCP speak result.
Added /events/latest so the panel can receive the newest generated voice and text.
Added ElevenLabs history loading through /history?id=....
Added line-style captions, playback-linked caption timing when ElevenLabs timing data is available, and MP3 download from the panel.
Features
Custom Voice Cloning — Use DashScope Qwen-TTS Voice Cloning API or ElevenLabs TTS with your own cloned voice
Inline Audio Player — Beautiful WeChat-style player with waveform visualization
Breathing Visualizer Panel — Use /panel to listen for the latest MCP speak output
Transcript Toggle — Show/hide the spoken text
Dark Mode Support — Automatic theme adaptation
Cloudflare Workers — Fast, serverless deployment
Demo
When you call the speak tool, you get:
A sleek audio player with play/pause button
Animated waveform that follows playback progress
Duration display
Expandable transcript
Quick Start
1. Clone the repository
git clone https://github.com/Yinglianchun/voice-mcp.git
cd voice-mcp
2. Install dependencies
npm install
3. Configure TTS provider
Set the provider. If omitted, the worker uses DashScope.
npx wrangler secret put TTS_PROVIDER # dashscope or elevenlabs
DashScope / CosyVoice
You'll need an Alibaba Cloud DashScope account with Qwen-TTS Voice Cloning access.
Add your secrets to Cloudflare:
npx wrangler secret put DASHSCOPE_API_KEY
npx wrangler secret put VOICE_ID
npx wrangler secret put BOT_NAME # Optional, defaults to "AI"
Optional:
npx wrangler secret put TTS_MODEL # Default: qwen3-tts-vc-2026-01-22
ElevenLabs
Add your ElevenLabs secrets to Cloudflare:
npx wrangler secret put ELEVENLABS_API_KEY
npx wrangler secret put ELEVENLABS_VOICE_ID
npx wrangler secret put ELEVENLABS_VOICE_ID_ZH
npx wrangler secret put ELEVENLABS_VOICE_ID_EN
Optional:
npx wrangler secret put ELEVENLABS_MODEL_ID # Default: eleven_v3
npx wrangler secret put ELEVENLABS_OUTPUT_FORMAT # Default: mp3_44100_128
npx wrangler secret put ELEVENLABS_LANGUAGE_CODE # Example: zh
npx wrangler secret put ELEVENLABS_LANGUAGE_CODE_ZH # Default with zh voice: zh
npx wrangler secret put ELEVENLABS_LANGUAGE_CODE_EN # Default with en voice: en
npx wrangler secret put ELEVENLABS_STABILITY # Example: 0.36
npx wrangler secret put ELEVENLABS_STYLE # Example: 0.85
npx wrangler secret put ELEVENLABS_SPEED # Example: 1.20
eleven_v3 supports audio tags such as [whispers], [sighs], and [laughs].
eleven_multilingual_v2 is a steadier choice for ordinary reading.
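The provider switch and the defaults noted in the comments above could be resolved inside the Worker roughly as follows. This is a minimal sketch, not the fork's actual code: the Env interface, the ResolvedConfig type, and the resolveConfig function are hypothetical names, and treating an unrecognized TTS_PROVIDER value as DashScope is an assumption.

```typescript
// Hypothetical subset of the Worker's bindings; only the variable names
// come from the setup steps above.
interface Env {
  TTS_PROVIDER?: string;
  TTS_MODEL?: string;
  ELEVENLABS_MODEL_ID?: string;
  ELEVENLABS_OUTPUT_FORMAT?: string;
}

type Provider = "dashscope" | "elevenlabs";

interface ResolvedConfig {
  provider: Provider;
  model: string;
  outputFormat?: string;
}

function resolveConfig(env: Env): ResolvedConfig {
  const raw = (env.TTS_PROVIDER ?? "").trim().toLowerCase();
  if (raw === "elevenlabs") {
    return {
      provider: "elevenlabs",
      model: env.ELEVENLABS_MODEL_ID || "eleven_v3",
      outputFormat: env.ELEVENLABS_OUTPUT_FORMAT || "mp3_44100_128",
    };
  }
  // Unset provider falls back to DashScope, as stated in step 3.
  // Assumption: unrecognized values are treated the same way.
  return {
    provider: "dashscope",
    model: env.TTS_MODEL || "qwen3-tts-vc-2026-01-22",
  };
}
```

Calling resolveConfig({}) yields the DashScope defaults, so a deployment that only sets DASHSCOPE_API_KEY and VOICE_ID keeps working without a TTS_PROVIDER secret.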
4. Deploy
npx wrangler deploy
5. Connect to Claude.ai
Go to Settings -> Connectors -> Add Connector
Enter your Worker URL:
https://your-worker.workers.dev/mcp
Done! The speak tool is now available.
Configuration
Variable | Required | Description |
TTS_PROVIDER | No | dashscope or elevenlabs (default: dashscope) |
DASHSCOPE_API_KEY | DashScope | Your DashScope API key |
VOICE_ID | DashScope | The cloned voice ID (Qwen-TTS VC) |
BOT_NAME | No | Display name (default: "AI") |
TTS_MODEL | No | DashScope TTS model (default: qwen3-tts-vc-2026-01-22) |
ELEVENLABS_API_KEY | ElevenLabs | Your ElevenLabs API key |
ELEVENLABS_VOICE_ID | ElevenLabs | Default/fallback ElevenLabs voice ID |
ELEVENLABS_VOICE_ID_ZH | No | Chinese ElevenLabs voice ID; auto-selected when text contains Chinese |
ELEVENLABS_VOICE_ID_EN | No | English ElevenLabs voice ID; auto-selected for English text |
ELEVENLABS_MODEL_ID | No | ElevenLabs model (default: eleven_v3) |
ELEVENLABS_OUTPUT_FORMAT | No | ElevenLabs output format (default: mp3_44100_128) |
ELEVENLABS_LANGUAGE_CODE | No | ElevenLabs request language code, such as zh |
ELEVENLABS_LANGUAGE_CODE_ZH | No | Chinese request language code; defaults to zh |
ELEVENLABS_LANGUAGE_CODE_EN | No | English request language code; defaults to en |
ELEVENLABS_STABILITY | No | ElevenLabs voice setting override, such as 0.36 |
| No | ElevenLabs voice setting override |
ELEVENLABS_STYLE | No | ElevenLabs voice setting override, such as 0.85 |
| No | ElevenLabs voice setting override, |
ELEVENLABS_SPEED | No | ElevenLabs voice setting override, such as 1.20 |
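The language-specific voice IDs described in the table could be selected along these lines. This is a sketch under stated assumptions: VoiceEnv and pickVoice are hypothetical names, "contains Chinese" is taken to mean at least one CJK Unified Ideograph, and "English text" is taken to mean text with ASCII letters and no Chinese characters. The fork's real detection logic may differ.

```typescript
// Hypothetical subset of the bindings listed in the configuration table.
interface VoiceEnv {
  ELEVENLABS_VOICE_ID: string;
  ELEVENLABS_VOICE_ID_ZH?: string;
  ELEVENLABS_VOICE_ID_EN?: string;
  ELEVENLABS_LANGUAGE_CODE?: string;
  ELEVENLABS_LANGUAGE_CODE_ZH?: string;
  ELEVENLABS_LANGUAGE_CODE_EN?: string;
}

interface VoiceChoice {
  voiceId: string;
  languageCode?: string;
}

// Assumption: one character in the CJK Unified Ideographs block marks text as Chinese.
const HAN = /[\u4e00-\u9fff]/;
// Assumption: ASCII letters without any Han character mark text as English.
const LATIN = /[A-Za-z]/;

function pickVoice(text: string, env: VoiceEnv): VoiceChoice {
  const hasHan = HAN.test(text);
  if (hasHan && env.ELEVENLABS_VOICE_ID_ZH) {
    return {
      voiceId: env.ELEVENLABS_VOICE_ID_ZH,
      languageCode: env.ELEVENLABS_LANGUAGE_CODE_ZH || "zh",
    };
  }
  if (!hasHan && LATIN.test(text) && env.ELEVENLABS_VOICE_ID_EN) {
    return {
      voiceId: env.ELEVENLABS_VOICE_ID_EN,
      languageCode: env.ELEVENLABS_LANGUAGE_CODE_EN || "en",
    };
  }
  // Fall back to the default voice and the general language code, if any.
  return {
    voiceId: env.ELEVENLABS_VOICE_ID,
    languageCode: env.ELEVENLABS_LANGUAGE_CODE,
  };
}
```

With only ELEVENLABS_VOICE_ID set, every request uses that voice, which matches its description as the default/fallback voice ID.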
API Endpoints
Endpoint | Description |
/mcp | MCP server (SSE protocol) |
/panel | Breathing voice visualizer that listens for MCP speak results |
/events/latest | Latest generated voice event for the visualizer |
/history?id=... | Load an ElevenLabs history item into the visualizer |
| Direct audio file |
| Direct audio file with optional style |
| Preserve ElevenLabs v3 audio tags |
| Health check |
The MCP speak tool accepts:
speak(text: string, style?: string, raw_tags?: boolean)
Existing speak(text) calls remain compatible.
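One way to keep old speak(text) calls compatible is to validate the arguments with both new fields optional. The sketch below is illustrative only: SpeakArgs and parseSpeakArgs are hypothetical names, and defaulting raw_tags to false is an assumption rather than documented behavior.

```typescript
// Argument shape taken from the tool signature above.
interface SpeakArgs {
  text: string;
  style?: string;
  raw_tags?: boolean;
}

function parseSpeakArgs(input: unknown): SpeakArgs {
  if (typeof input !== "object" || input === null) {
    throw new TypeError("speak arguments must be an object");
  }
  const raw = input as Record<string, unknown>;
  const text = raw.text;
  const style = raw.style;
  const rawTags = raw.raw_tags;

  if (typeof text !== "string" || text.trim() === "") {
    throw new TypeError("text must be a non-empty string");
  }
  if (style !== undefined && typeof style !== "string") {
    throw new TypeError("style must be a string when provided");
  }
  if (rawTags !== undefined && typeof rawTags !== "boolean") {
    throw new TypeError("raw_tags must be a boolean when provided");
  }
  return {
    text,
    style: style as string | undefined,
    // Assumption: an omitted raw_tags behaves like false.
    raw_tags: (rawTags as boolean | undefined) ?? false,
  };
}
```

A legacy call such as parseSpeakArgs({ text: "Hello" }) passes validation and comes back with style undefined and raw_tags false.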
When the MCP speak tool succeeds, the Worker stores the latest voice event for
/panel. Keep /panel open while using speak; when a new voice arrives, the
visualizer loads it and enables playback.
ElevenLabs uses the speech-with-timing API to store line-level caption cues for
sync; providers without timing data fall back to approximate caption progress.
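The two caption paths described above can be sketched as follows. The Alignment shape mirrors the character-level fields that ElevenLabs returns alongside audio from its timestamped speech endpoint, as far as I understand that response. Cue, cuesFromAlignment, and approximateCues are hypothetical names, and splitting captions on newline characters is an assumption about how "line-level" cues are formed.

```typescript
// Character-level timing, modeled on the ElevenLabs timestamped response.
interface Alignment {
  characters: string[];
  character_start_times_seconds: number[];
  character_end_times_seconds: number[];
}

interface Cue {
  text: string;
  start: number; // seconds
  end: number;   // seconds
}

// Group per-character timings into one cue per line of text.
function cuesFromAlignment(a: Alignment): Cue[] {
  const cues: Cue[] = [];
  let text = "";
  let start = 0;
  let end = 0;
  const flush = (): void => {
    if (text.trim() !== "") cues.push({ text: text.trim(), start, end });
    text = "";
  };
  a.characters.forEach((ch, i) => {
    if (ch === "\n") {
      flush();
      return;
    }
    if (text === "") start = a.character_start_times_seconds[i];
    text += ch;
    end = a.character_end_times_seconds[i];
  });
  flush();
  return cues;
}

// Fallback without timing data: share the duration by line length.
function approximateCues(text: string, durationSeconds: number): Cue[] {
  const lines = text.split("\n").map((l) => l.trim()).filter((l) => l !== "");
  const total = lines.reduce((n, l) => n + l.length, 0);
  let elapsed = 0;
  return lines.map((line) => {
    const start = elapsed;
    elapsed += (line.length / total) * durationSeconds;
    return { text: line, start, end: elapsed };
  });
}
```

The fallback assumes a constant speaking rate, which is why captions from providers without timing data can only track playback approximately.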
When TTS_PROVIDER=elevenlabs and ELEVENLABS_MODEL_ID=eleven_v3, raw_tags=true
passes text through unchanged. Without raw_tags=true, supported styles map to
ElevenLabs v3 audio tags:
[Table: Style | Audio tag. The mapping rows are missing from this copy.]
DashScope/CosyVoice and non-v3 ElevenLabs calls strip raw audio tags before sending text to the provider.
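The tag handling rules above could be combined into one text-preparation step, sketched below. Several parts are assumptions: the style names in STYLE_TAGS are placeholders (only the tags [whispers], [sighs], and [laughs] appear in this README), an audio tag is taken to be a short bracketed run of letters and spaces, and on eleven_v3 without raw_tags the sketch strips existing tags before prepending the style tag. stripAudioTags and prepareText are hypothetical names.

```typescript
// Placeholder style names; the fork's actual style table is not reproduced here.
const STYLE_TAGS = new Map<string, string>([
  ["whisper", "[whispers]"],
  ["sigh", "[sighs]"],
  ["laugh", "[laughs]"],
]);

// Assumption: a tag looks like [letters and spaces], optionally followed by whitespace.
const AUDIO_TAG = /\[[A-Za-z][A-Za-z ]*\]\s*/g;

function stripAudioTags(text: string): string {
  return text.replace(AUDIO_TAG, "").trim();
}

interface PrepareOptions {
  provider: "dashscope" | "elevenlabs";
  model: string;
  style?: string;
  rawTags?: boolean;
}

function prepareText(text: string, o: PrepareOptions): string {
  const tagsSupported = o.provider === "elevenlabs" && o.model === "eleven_v3";
  // DashScope/CosyVoice and non-v3 ElevenLabs models never receive raw tags.
  if (!tagsSupported) return stripAudioTags(text);
  // raw_tags=true passes the text through unchanged.
  if (o.rawTags) return text;
  // Assumption: otherwise existing tags are dropped and the style tag is prepended.
  const clean = stripAudioTags(text);
  const tag = o.style ? STYLE_TAGS.get(o.style) : undefined;
  return tag ? `${tag} ${clean}` : clean;
}
```

For example, prepareText("hi", { provider: "elevenlabs", model: "eleven_v3", style: "whisper" }) produces "[whispers] hi", while the same call with provider "dashscope" produces plain "hi".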
Tech Stack
Cloudflare Workers — Serverless runtime
MCP SDK — Model Context Protocol
DashScope Qwen-TTS VC — Voice synthesis
ElevenLabs Text to Speech — Voice synthesis
ext-apps — Inline UI rendering
License
MIT. This fork preserves the upstream license from garan0613/voice-mcp.