Utilizes FFmpeg for audio conversion, supporting multiple output formats (WAV, MP3, FLAC, OGG) for the synthesized speech
Offers Node.js implementation for running the TTS server, with TypeScript support for type-safe interactions
Uses ONNX runtime for the Kokoro neural voice models, providing high-quality text-to-speech synthesis with multiple voices and emotional expressions
Provides Python API for interacting with the TTS engine, supporting various speech synthesis operations and batch processing
Advanced TTS MCP Server
A high-quality, feature-rich Text-to-Speech MCP server with native TypeScript implementation. Designed for professional applications requiring natural, expressive speech synthesis with advanced controls and zero external dependencies.
Features
Advanced Voice Control
10 High-Quality Voices - Male and female voices with distinct personalities
Emotion Control - Neutral, happy, excited, calm, serious, casual, confident
Dynamic Pacing - Natural, conversational, presentation, tutorial, narrative modes
Speed & Volume - Precise control from 0.25x to 3.0x speed, 0.1x to 2.0x volume
Professional Capabilities
Streaming Audio - Real-time synthesis and playback
Batch Processing - Handle multiple text segments efficiently
Multiple Formats - WAV, MP3, FLAC, OGG output support
Natural Speech Enhancement - Automatic pause insertion and emotion markers
Queue Management - Handle multiple concurrent requests
MCP Integration
6 Powerful Tools - Complete synthesis, batch processing, voice management
2 Rich Resources - Voice capabilities and usage examples
Real-time Status - Track processing progress and manage requests
File Management - Save, list, and organize audio outputs
Quick Start
Option 1: Deploy to Smithery.ai (Recommended)
One-Click Deployment to Smithery Platform
Deploy Now: Visit Smithery.ai and import this repository
Configure: Set your preferred voice and speech settings
Use Instantly: Access via Claude Desktop or any MCP-compatible client
Benefits:
Zero setup required
Automatic scaling and updates
No model downloads needed
Enterprise-grade hosting
Full Smithery Deployment Guide
Option 2: Local Installation
Prerequisites:
Node.js 18+
Installation:
Clone the repository
Install dependencies
Configure Claude Desktop
Add to your claude_desktop_config.json:
Start using!
Restart Claude Desktop and start synthesizing with natural, expressive voices.
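As a minimal sketch of step 3, assuming the server is launched with `node` from a built entry script, the TypeScript below builds and prints an entry in the `mcpServers` format that Claude Desktop reads from claude_desktop_config.json. The entry name `advanced-tts` and the script path are placeholders, not values taken from this repository.

```typescript
// Shape of one entry under "mcpServers" in claude_desktop_config.json.
interface McpServerEntry {
  command: string;
  args: string[];
  env?: Record<string, string>;
}

// Hypothetical helper: "advanced-tts" and the script path are placeholders.
function buildConfig(serverScriptPath: string): { mcpServers: Record<string, McpServerEntry> } {
  return {
    mcpServers: {
      "advanced-tts": {
        command: "node",
        args: [serverScriptPath],
      },
    },
  };
}

// Replace the path with the real location of the built server script.
const config = buildConfig("/absolute/path/to/server/entry.js");
console.log(JSON.stringify(config, null, 2));
```

The printed JSON can be merged into an existing config file; other servers already listed under `mcpServers` should be kept.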
Available Voices
| Voice ID | Name | Gender | Description |
| | Heart | Female | Warm, friendly voice (default) |
| | Sky | Female | Clear, bright voice |
| | Bella | Female | Elegant, sophisticated voice |
| | Sarah | Female | Professional, confident voice |
| | Nicole | Female | Gentle, soothing voice |
| | Adam | Male | Strong, authoritative voice |
| | Michael | Male | Friendly, approachable voice |
| | Emma | Female | Young, energetic voice |
| | Isabella | Female | Mature, expressive voice |
| | Lewis | Male | Deep, resonant voice |
Usage Examples
Basic Synthesis
Emotional Expression
Professional Presentation
Batch Processing
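The four scenarios above can be sketched as MCP `tools/call` requests. Tool and parameter names come from the Available Tools section; the sample text, the request ids, and the assumption that `segments` is a list of plain strings are illustrative only. `voice_id` is left out on the assumption that it is optional and falls back to the default voice.

```typescript
// JSON-RPC 2.0 envelope an MCP client sends to invoke a tool.
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

let nextId = 1;

// Wrap a tool name and its arguments in a tools/call request.
function toolCall(name: string, args: Record<string, unknown>): ToolCallRequest {
  const request: ToolCallRequest = {
    jsonrpc: "2.0",
    id: nextId,
    method: "tools/call",
    params: { name: name, arguments: args },
  };
  nextId += 1;
  return request;
}

// Basic synthesis: only the required text.
const basic = toolCall("synthesize_speech", { text: "Hello from the TTS server." });

// Emotional expression: pick an emotion and nudge the speed.
const emotional = toolCall("synthesize_speech", {
  text: "We shipped the release today!",
  emotion: "excited",
  speed: 1.1,
});

// Professional presentation: presentation pacing, saved as MP3.
const presentation = toolCall("synthesize_speech", {
  text: "Welcome to the quarterly review.",
  emotion: "confident",
  pacing: "presentation",
  output_format: "mp3",
  save_file: true,
});

// Batch processing: several segments merged with a half-second pause.
const batch = toolCall("batch_synthesize", {
  segments: ["Chapter one.", "Chapter two."],
  merge_output: true,
  segment_pause: 0.5,
});

console.log(JSON.stringify([basic, emotional, presentation, batch], null, 2));
```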
Available Tools
synthesize_speech
Convert text to natural speech with full control over voice characteristics.
Parameters:
text - Text to synthesize (max 10,000 chars)
voice_id - Voice selection (see table above)
speed - Speech rate (0.25-3.0)
emotion - Voice emotion (neutral, happy, excited, calm, serious, casual, confident)
pacing - Speech style (natural, conversational, presentation, tutorial, narrative, fast, slow)
volume - Audio volume (0.1-2.0)
output_format - File format (wav, mp3, flac, ogg)
save_file - Save to file (boolean)
filename - Custom filename
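The limits listed above lend themselves to a client-side check before a request is sent. The following is a sketch under assumptions, not the server's own validation: the function names are hypothetical, and treating every parameter except `text` as optional is a guess.

```typescript
const EMOTIONS = ["neutral", "happy", "excited", "calm", "serious", "casual", "confident"];
const PACINGS = ["natural", "conversational", "presentation", "tutorial", "narrative", "fast", "slow"];
const FORMATS = ["wav", "mp3", "flac", "ogg"];

interface SynthesizeArgs {
  text: string;
  voice_id?: string;
  speed?: number;
  emotion?: string;
  pacing?: string;
  volume?: number;
  output_format?: string;
  save_file?: boolean;
  filename?: string;
}

// Record an error when a numeric value is present but outside [min, max] (NaN is rejected too).
function checkRange(name: string, value: number | undefined, min: number, max: number, errors: string[]): void {
  if (value !== undefined && !(value >= min && value <= max)) {
    errors.push(name + " must be between " + min + " and " + max);
  }
}

// Record an error when a string value is present but not in the allowed list.
function checkChoice(name: string, value: string | undefined, allowed: string[], errors: string[]): void {
  if (value !== undefined && allowed.indexOf(value) === -1) {
    errors.push(name + " must be one of: " + allowed.join(", "));
  }
}

// Returns a list of problems; an empty list means the arguments look valid.
function validateSynthesizeArgs(args: SynthesizeArgs): string[] {
  const errors: string[] = [];
  if (args.text.length === 0 || args.text.length > 10000) {
    errors.push("text must be 1 to 10,000 characters");
  }
  checkRange("speed", args.speed, 0.25, 3.0, errors);
  checkRange("volume", args.volume, 0.1, 2.0, errors);
  checkChoice("emotion", args.emotion, EMOTIONS, errors);
  checkChoice("pacing", args.pacing, PACINGS, errors);
  checkChoice("output_format", args.output_format, FORMATS, errors);
  return errors;
}

console.log(validateSynthesizeArgs({ text: "Hello", speed: 5, emotion: "angry" }));
```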
batch_synthesize
Process multiple text segments efficiently with optional merging.
Parameters:
segments - List of text segments
merge_output - Combine into single file
segment_pause - Pause between segments (0.0-5.0s)
All synthesis parameters from above
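One way to picture `merge_output` together with `segment_pause` is concatenating the synthesized segments with a stretch of silence between them. The sketch below assumes each segment is mono PCM held in a `Float32Array`; `mergeSegments` and its sample-rate parameter are hypothetical and do not describe how the server merges audio internally.

```typescript
// Concatenate PCM segments, inserting pauseSeconds of silence between neighbours.
function mergeSegments(segments: Float32Array[], pauseSeconds: number, sampleRate: number): Float32Array {
  const pauseSamples = Math.round(pauseSeconds * sampleRate);
  const gaps = Math.max(segments.length - 1, 0);

  let total = gaps * pauseSamples;
  for (let i = 0; i < segments.length; i++) {
    total += segments[i].length;
  }

  // A new Float32Array is zero-filled, so skipped ranges are already silent.
  const merged = new Float32Array(total);
  let offset = 0;
  for (let i = 0; i < segments.length; i++) {
    merged.set(segments[i], offset);
    offset += segments[i].length;
    if (i < segments.length - 1) {
      offset += pauseSamples; // leave the gap as zeros
    }
  }
  return merged;
}

// Tiny example: 3 samples, a 2-sample pause (0.5 s at 4 Hz), then 2 samples.
const first = new Float32Array([1, 1, 1]);
const second = new Float32Array([1, 1]);
const mergedExample = mergeSegments([first, second], 0.5, 4);
console.log(mergedExample.length);
```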
get_voices
Retrieve complete voice information and capabilities.
get_status
Check processing status for synthesis requests.
cancel_request
Cancel active synthesis operations.
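The interplay between `get_status` and `cancel_request` can be illustrated with a small in-memory tracker. This is a sketch under assumptions: the state names, the `req-N` id format, and the `RequestTracker` class are all hypothetical and are not taken from the server's code.

```typescript
type RequestState = "queued" | "processing" | "completed" | "cancelled";

class RequestTracker {
  // Prototype-free object so arbitrary ids never collide with built-in keys.
  private states: Record<string, RequestState> = Object.create(null);
  private counter = 0;

  // Register a new request and return its id.
  enqueue(): string {
    this.counter += 1;
    const id = "req-" + this.counter;
    this.states[id] = "queued";
    return id;
  }

  // Move an active request forward; finished or cancelled requests are left alone.
  advance(id: string, next: "processing" | "completed"): void {
    const state = this.states[id];
    if (state === "queued" || state === "processing") {
      this.states[id] = next;
    }
  }

  // What a get_status-style lookup would report; undefined for unknown ids.
  getStatus(id: string): RequestState | undefined {
    return this.states[id];
  }

  // What a cancel_request-style call would do; only active requests can be cancelled.
  cancel(id: string): boolean {
    const state = this.states[id];
    if (state === "queued" || state === "processing") {
      this.states[id] = "cancelled";
      return true;
    }
    return false;
  }
}

const tracker = new RequestTracker();
const requestId = tracker.enqueue();
tracker.advance(requestId, "processing");
console.log(requestId, tracker.getStatus(requestId));
```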
list_output_files
Browse saved audio files with metadata.
Voice Controls
Emotions
Neutral - Standard, professional tone
Happy - Upbeat, cheerful expression
Excited - Enthusiastic, energetic delivery
Calm - Relaxed, soothing tone
Serious - Formal, authoritative delivery
Casual - Relaxed, conversational style
Confident - Assured, professional tone
Pacing Styles
Natural - Balanced, human-like rhythm
Conversational - Casual discussion pace
Presentation - Professional speaking rhythm
Tutorial - Educational, clear delivery
Narrative - Storytelling pace
Fast - Quick delivery (1.2x base speed)
Slow - Deliberate delivery (0.8x base speed)
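The Fast and Slow styles are described as multipliers on the base speed. As a worked sketch, assuming the multiplier is applied to the `speed` parameter and the result is clamped to the documented 0.25-3.0 range (how the server actually combines the two is not stated here), the effective rate could be computed like this. `pacingMultiplier` and `effectiveSpeed` are hypothetical names.

```typescript
// Multiplier for each pacing style; only fast and slow are documented as changing speed.
function pacingMultiplier(pacing: string): number {
  switch (pacing) {
    case "fast":
      return 1.2;
    case "slow":
      return 0.8;
    default:
      return 1.0;
  }
}

// Scale the requested speed by the pacing multiplier, then clamp to 0.25-3.0.
function effectiveSpeed(speed: number, pacing: string): number {
  const scaled = speed * pacingMultiplier(pacing);
  return Math.min(3.0, Math.max(0.25, scaled));
}

console.log(effectiveSpeed(1.0, "fast"));
console.log(effectiveSpeed(3.0, "fast"));
```

Clamping keeps a combination such as speed 3.0 with fast pacing inside the supported range instead of producing 3.6.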
Audio Formats
Format | Quality | Use Case |
WAV | Uncompressed | Highest quality, editing |
MP3 | Compressed | Web, streaming, sharing |
FLAC | Lossless | Archival, high-quality storage |
OGG | Compressed | Open source alternative |
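Since conversion is handled by FFmpeg, the mapping from `output_format` to an FFmpeg invocation might look like the sketch below. It only builds an argument list; the codec choices, the `ffmpegArgs` helper, and the file names are assumptions rather than the server's actual command line, and the `libmp3lame` and `libvorbis` encoders are available only when the FFmpeg build includes them.

```typescript
type OutputFormat = "wav" | "mp3" | "flac" | "ogg";

// Assumed encoder per format; the server may use different codecs or options.
const CODEC: Record<OutputFormat, string> = {
  wav: "pcm_s16le",
  mp3: "libmp3lame",
  flac: "flac",
  ogg: "libvorbis",
};

// Build ffmpeg arguments to convert a WAV file, optionally applying a volume factor.
function ffmpegArgs(inputWav: string, outputBase: string, format: OutputFormat, volume: number = 1.0): string[] {
  const args = ["-y", "-i", inputWav];
  if (volume !== 1.0) {
    args.push("-filter:a", "volume=" + volume);
  }
  args.push("-c:a", CODEC[format], outputBase + "." + format);
  return args;
}

console.log(["ffmpeg"].concat(ffmpegArgs("speech.wav", "speech", "mp3")).join(" "));
```

The resulting array can be passed to a process spawner such as Node's `child_process.spawn("ffmpeg", args)`.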
Configuration
Environment Variables
Server Configuration
Architecture
Contributing
Contributions welcome! Areas for improvement:
Additional voice models
Real-time streaming synthesis
Advanced audio effects
Multi-language support
Performance optimizations
License
MIT License - see LICENSE for details.
Acknowledgments
Kokoro TTS - High-quality neural voice synthesis
MCP Protocol - Seamless AI model integration
FastMCP - Efficient server framework
Developed by
Transform your text into natural, expressive speech with Advanced TTS MCP Server.