supertone-mcp
The official supertone-mcp server provides a full-featured interface to the Supertone TTS API, enabling speech synthesis, voice management, and custom voice cloning from MCP-compatible clients.
Text-to-Speech Synthesis: Convert text into natural-sounding audio across 23+ languages, with support for speed (0.5x–2.0x), pitch shift (-24 to +24 semitones), emotion styles, and MP3/WAV output. Audio can be saved to disk, returned as MCP resources, or both. Long text is auto-chunked beyond the 300-character limit.
Duration & Cost Prediction: Estimate output audio duration and credit cost before synthesizing, using the same parameters as synthesis.
Voice Catalog Search: Browse and filter Supertone's preset voice catalog by language, gender, age, use case, style, model, name, or description.
Voice Details & Previews: Retrieve full metadata for a specific voice and access sample audio URLs, optionally filtered by language, style, and model.
Credit Balance Check: Monitor the remaining API credit balance for your Supertone API key.
Voice Cloning: Create a custom cloned voice from a local WAV or MP3 file (max 3MB), immediately usable for synthesis.
Custom Voice Management: List/filter cloned voices, update their name or description, or permanently delete them.
A composable MCP toolkit for the Supertone TTS API. Rather than a single "speak this text" command, it exposes Supertone's SDK as a set of building-block tools — synthesis, voice discovery, preview, duration/credit prediction, usage tracking, and full voice-cloning CRUD — that an LLM assembles to fulfill a request. Works in Claude Desktop, Cursor, or any MCP-compatible client.
Covers 31 languages in total, including Korean, English, and Japanese. Supports speed (0.5x–2.0x), pitch shift (-24 to +24 semitones), emotion styles, per-call output mode, streaming, and model selection.
Features
Synthesis
text_to_speech: Convert text to audio. Per-call control of output_mode (files / resources / both), autoplay, streaming, and model, plus include_phonemes / normalized_text. Long text is auto-chunked by the SDK.
predict_duration: Estimate audio length (and credit cost) without synthesizing.
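How the SDK chunks long text is not described here. Purely as an illustration, the sketch below splits input at whitespace so that each piece stays within a length limit. The name chunk_text and the 300-character default are assumptions (300 is the old hard limit mentioned elsewhere in this README); the SDK's real algorithm may differ.

```python
def chunk_text(text: str, limit: int = 300) -> list[str]:
    """Split text into pieces of at most `limit` characters, breaking at whitespace."""
    chunks: list[str] = []
    current = ""
    for word in text.split():
        # A single word longer than the limit has to be hard-split.
        while len(word) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:limit])
            word = word[limit:]
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= limit:
            current = candidate
        else:
            # Adding the word would overflow, so close the current chunk.
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks
```

Because credit and latency scale with length, a client could use a helper like this to estimate how many synthesis requests a long script is likely to need.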
Voice discovery (preset)
search_voice: Filter the catalog by language, gender, age, use_case, style, model, name, or description.
get_voice: Full detail for one voice.
preview_voice: Sample audio URLs for a voice (filterable by language/style/model).
Custom voice cloning
clone_voice: Create a cloned voice from a local WAV/MP3 (≤3MB).
search_custom_voice: List/filter cloned voices.
get_custom_voice: Full detail for one cloned voice.
edit_custom_voice: Update name and/or description.
delete_custom_voice: Permanently delete (irreversible).
Usage & credits
get_credit_balance: Remaining credits.
get_usage_history: Usage over a time window.
get_voice_usage: Usage for a specific voice.
Breaking changes & migration (0.2.0)
0.2.0 moves behavior control out of environment variables and into per-call tool parameters — so the LLM decides per request, not the server config.
Before (env var) | After (per-call parameter) | Note |
SUPERTONE_MCP_OUTPUT_MODE | output_mode | Default unchanged |
SUPERTONE_MCP_AUTOPLAY | autoplay | Default changed (now false) |
(always streamed) | streaming | New parameter |
Other changes:
Default model changed: sona_speech_1 → sona_speech_2_flash.
list_voices was removed (since the discovery release) and replaced by search_voice; call it with no arguments to reproduce the old "list everything" behavior.
No more hard 300-character limit: longer text is auto-chunked by the SDK (credit and latency scale with length).
If you previously set SUPERTONE_MCP_OUTPUT_MODE or SUPERTONE_MCP_AUTOPLAY, remove them from your client config and pass output_mode / autoplay per call instead. (The server prints a one-time stderr notice if it sees the removed vars.)
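As a rough sketch of that cleanup, the helper below drops the two removed variables from a client config dict shaped like the Claude Desktop example in this README. The function name strip_removed_vars is hypothetical and not part of supertone-mcp.

```python
import json

# Environment variables removed in 0.2.0, as named in the migration notes.
REMOVED_VARS = ("SUPERTONE_MCP_OUTPUT_MODE", "SUPERTONE_MCP_AUTOPLAY")


def strip_removed_vars(config: dict) -> dict:
    """Return a copy of an MCP client config without the removed env vars."""
    cleaned = json.loads(json.dumps(config))  # deep copy via a JSON round-trip
    for server in cleaned.get("mcpServers", {}).values():
        env = server.get("env", {})
        for name in REMOVED_VARS:
            env.pop(name, None)
    return cleaned
```

After the cleanup, the same choices are expressed per call, for example by passing output_mode and autoplay as arguments to text_to_speech.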
Installation
# Using uvx (recommended)
uvx supertone-mcp
# Using pip
pip install supertone-mcp
Configuration
Claude Desktop
Add to claude_desktop_config.json:
{
  "mcpServers": {
    "supertone-tts": {
      "command": "uvx",
      "args": ["supertone-mcp"],
      "env": {
        "SUPERTONE_API_KEY": "your-api-key-here"
      }
    }
  }
}
Cursor
Add to your Cursor MCP settings (same JSON shape as above).
Environment Variables
Only authentication and stable defaults are configured via the environment — all behavior is controlled per call.
Variable | Required | Default | Description |
SUPERTONE_API_KEY | Yes | — | Your Supertone API key |
| No | preset voice (Aiden, multilingual) | Default voice_id |
| No | | Directory where audio files are saved (used by the files and both output modes) |
Removed in 0.2.0: SUPERTONE_MCP_OUTPUT_MODE and SUPERTONE_MCP_AUTOPLAY (see Breaking changes & migration).
Output modes (text_to_speech output_mode)
Mode | Returns | Use when |
files | Plain text with the saved file path + metadata | You want the file on disk |
resources | MCP resource | The client renders audio inline (e.g., Claude.ai chat) |
both | File on disk and MCP resource | You want both: preview inline, keep the file |
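The three output_mode values (files, resources, both) can be summarized as a small lookup. This is only an illustration of what each mode yields, not the server's code; the function name and the dictionary keys are made up for this sketch.

```python
def outputs_for_mode(output_mode: str) -> dict[str, bool]:
    """Describe what a text_to_speech call yields for a given output_mode."""
    table = {
        "files": {"file_on_disk": True, "mcp_resource": False},
        "resources": {"file_on_disk": False, "mcp_resource": True},
        "both": {"file_on_disk": True, "mcp_resource": True},
    }
    if output_mode not in table:
        raise ValueError(f"unknown output_mode: {output_mode!r}")
    return table[output_mode]
```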
Usage Examples
The MCP client routes natural-language requests across these tools — the value of the toolkit is composition: the LLM chains several tools to satisfy one request.
Example 1 — Discover → preview → estimate cost → synthesize
"Find a calm Korean female voice, let me hear a sample, check the cost, then make this announcement as an mp3."
The LLM assembles:
search_voice(language="ko", gender="female", style="neutral") # find candidates
→ preview_voice(voice_id) # sample URLs to confirm the voice
→ predict_duration(text, voice_id) + get_credit_balance() # gauge cost before spending
→ text_to_speech(text, voice_id, output_format="mp3",
output_mode="files") # synthesize
Example 2 — Clone my voice → use it right away
"Make a cloned voice from ~/recordings/sample.wav named MyVoice, then read this greeting with it and play it for me."
The LLM assembles:
clone_voice(name="MyVoice", audio_path="~/recordings/sample.wav") # create the cloned voice
→ get_custom_voice(voice_id) # confirm it was created
→ text_to_speech(text, voice_id=<cloned>, autoplay=true) # synthesize, then play immediately
autoplay is a per-call parameter (default false), so playback happens only when explicitly requested.
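Under the hood, an MCP client sends each step of these chains as a JSON-RPC 2.0 tools/call request. The sketch below builds such requests for illustration. The helper name tool_call and the placeholder argument values are assumptions; the tool and argument names are the ones used in the examples above.

```python
import json
from itertools import count

_ids = count(1)  # simple incrementing JSON-RPC request ids


def tool_call(name: str, **arguments) -> str:
    """Build a JSON-RPC 2.0 `tools/call` request, as used by MCP clients."""
    request = {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    return json.dumps(request, ensure_ascii=False)


# The first and last steps of Example 1, with placeholder values.
print(tool_call("search_voice", language="ko", gender="female", style="neutral"))
print(tool_call("text_to_speech", text="Hello", voice_id="<voice_id>",
                output_format="mp3", output_mode="files"))
```

In normal use the client builds these requests itself; the sketch only shows where the per-call parameters end up.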
Tool Parameters
text_to_speech
Parameter | Type | Required | Default | Description |
text | string | Yes | — | Text to convert (long text is auto-chunked by the SDK) |
voice_id | string | No | env or preset | Voice identifier (browse via search_voice) |
language | string | No | | Language code (one of 31) |
output_format | string | No | | mp3 or wav |
model | string | No | sona_speech_2_flash | |
speed | float | No | | 0.5–2.0 |
| int | No | | -24 to +24 semitones |
style | string | No | — | Emotion style (varies by voice) |
output_mode | string | No | | files, resources, or both |
autoplay | bool | No | false | Play the audio locally after synthesis (macOS
streaming | bool | No | | Stream synthesis. Only supported by
include_phonemes | bool | No | | Return phoneme timing data alongside the audio |
normalized_text | string | No | — | Pre-normalized text (only used by
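A client may want to check the two numeric ranges before calling the tool. The sketch below is a minimal example built from the ranges documented above (speed 0.5–2.0, pitch shift -24 to +24 semitones). The function and its argument names are hypothetical and need not match the server's own parameter names or validation.

```python
def validate_tts_ranges(speed: float, pitch_semitones: int) -> list[str]:
    """Return a list of problems; an empty list means both values are in range."""
    problems: list[str] = []
    if not 0.5 <= speed <= 2.0:
        problems.append(f"speed {speed} is outside 0.5-2.0")
    if not -24 <= pitch_semitones <= 24:
        problems.append(f"pitch shift {pitch_semitones} is outside -24 to +24 semitones")
    return problems
```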
predict_duration
Same core parameter schema as text_to_speech (long text auto-chunked). Returns a message such as "Predicted duration: 2.34s (credit usage is proportional to duration)."
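If a client needs the number rather than the sentence, the duration can be pulled out of that message. The sketch below assumes the response keeps the "Predicted duration: <seconds>s" shape shown above; the helper name is hypothetical.

```python
import re

# Matches the seconds value in "Predicted duration: 2.34s ...".
_DURATION_RE = re.compile(r"Predicted duration:\s*([0-9]+(?:\.[0-9]+)?)s")


def parse_predicted_duration(message: str) -> float:
    """Extract the predicted duration in seconds from a predict_duration reply."""
    match = _DURATION_RE.search(message)
    if match is None:
        raise ValueError("no predicted duration found in message")
    return float(match.group(1))
```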
search_voice
All parameters optional. With no filters → full catalog. With any filter → first response line is Filters applied: ....
Parameter | Type | Description |
language | string | e.g., ko |
gender | string | e.g., female |
age | string | e.g., |
use_case | string | e.g., |
style | string | e.g., neutral |
model | string | e.g., sona_speech_2_flash |
name | string | partial match |
description | string | partial match |
get_voice / preview_voice
Tool | Required | Optional |
get_voice | voice_id | — |
preview_voice | voice_id | language, style, model |
clone_voice
Parameter | Type | Required | Description |
name | string | Yes | Display name (non-empty) |
audio_path | string | Yes | Local WAV or MP3 path (≤3MB). Supports
description | string | No | Optional note |
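Before calling clone_voice, a client could check the source file locally. The sketch below checks the extension and the size limit described above. It assumes "3MB" means 3 × 1024 × 1024 bytes and that ~ in the path should be expanded; the helper name is hypothetical, and the server performs its own validation regardless.

```python
from pathlib import Path

MAX_BYTES = 3 * 1024 * 1024  # assumption: "3MB" read as 3 MiB
ALLOWED_SUFFIXES = {".wav", ".mp3"}


def check_clone_source(audio_path: str) -> Path:
    """Validate a local audio file before cloning and return the resolved path."""
    path = Path(audio_path).expanduser()  # allow paths such as ~/recordings/sample.wav
    if path.suffix.lower() not in ALLOWED_SUFFIXES:
        raise ValueError(f"unsupported file type: {path.suffix or '(none)'}")
    if not path.is_file():
        raise FileNotFoundError(f"no such file: {path}")
    size = path.stat().st_size
    if size > MAX_BYTES:
        raise ValueError(f"file is {size} bytes, over the {MAX_BYTES}-byte limit")
    return path
```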
Custom voice CRUD
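Tool | Required | Optional |
search_custom_voice | — | |
get_custom_voice | voice_id | — |
edit_custom_voice | voice_id | name, description |
delete_custom_voice | voice_id | — (IRREVERSIBLE) |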
Usage & credits
Tool | Required | Optional |
get_credit_balance | — | — |
get_usage_history | — | — (reports a recent default window) |
get_voice_usage | | — |
Development
# Clone and install
git clone https://github.com/supertone-inc/supertone-mcp.git
cd supertone-mcp
uv sync
# Run tests
uv run pytest -q
# Run with coverage
uv run pytest --cov=src --cov-report=term-missing
License
MIT