Kokoro MCP Server
Kokoro MCP SERVER: Text To Speech (TTS)
A comprehensive Text-to-Speech toolkit built on Kokoro-82M with audio enhancement, Model Context Protocol (MCP) server integration, CLI interface, and Docker deployment.
Features
Kokoro-82M TTS Engine: Open-weight model with 82M parameters (510 tokens per pass)
🌐 Streamlit Web UI: Enterprise-grade management interface with real-time preview (OPTIONAL)
Audio Enhancement: Professional processing with librosa (normalization, noise reduction, fade in/out)
MCP Server: Model Context Protocol integration for Claude Desktop, Cursor, and other AI tools (OPTIONAL)
CLI Interface: Command-line tools for quick generation (OPTIONAL)
Batch Processing: Generate multiple audio files efficiently
Script Processing: Convert complete video scripts with automatic text chunking
Docker Support: Containerization with docker-compose
Enterprise Features: Structured logging, configuration management, comprehensive testing
CI/CD: GitHub Actions pipeline with automated testing
Streamlit Web Interface (Optional)
Beautiful web UI for managing all TTS functionality:
🎯 Single Generation - Convert text with real-time preview
📦 Batch Processing - Process multiple texts in one go
📄 Script Processing - Complete video script conversion
🔍 Voice Explorer - Compare all available voices side-by-side
⚙️ Configuration - Manage settings visually
📊 Analytics - Track generations with charts and statistics
Install with: pip install -e ".[streamlit]" or pip install -e ".[complete]"
Quick Start:
python run_streamlit.py
# Opens at http://localhost:8501
📚 See STREAMLIT_README.md for complete Streamlit documentation.
Building on Kokoro-82M
What Kokoro-82M Provides Out-of-the-Box: Kokoro-82M is an exceptional open-weight TTS model that delivers: core neural TTS inference with 82M parameters, a basic Python inference library (KPipeline), 10 professional voice packs (male/female, American/British), phonemization (G2P) system, and raw 24kHz audio output with a 510-token processing limit per pass.
What aparsoft-tts Adds: We integrate Kokoro-82M's excellent TTS inference with comprehensive development tooling and workflow enhancements. This toolkit adds:
Audio post-processing - Normalization, noise reduction, silence trimming, and fade in/out using librosa
Automated script workflows - Direct script-to-voiceover conversion with paragraph detection and gap management
IDE-native generation - MCP server integration eliminates context switching for Claude Desktop and Cursor users
Deployment infrastructure - Docker deployment, structured logging, configuration management, and comprehensive testing
Batch processing - CLI and Python APIs for processing multiple segments efficiently
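As a rough illustration of the paragraph detection mentioned above, the sketch below splits a script on blank lines. The function name `split_paragraphs` is hypothetical and is not part of the toolkit's API; the actual detection logic in aparsoft-tts may differ.

```python
import re


def split_paragraphs(script_text: str) -> list[str]:
    """Split a script into paragraphs separated by one or more blank lines.

    Hypothetical helper for illustration only, not the toolkit's implementation.
    """
    # Normalize Windows and old Mac line endings so the regex only sees "\n".
    normalized = script_text.replace("\r\n", "\n").replace("\r", "\n")
    # A paragraph break is a newline, optional whitespace, then another newline.
    blocks = re.split(r"\n\s*\n", normalized)
    # Collapse hard line wraps inside a paragraph and drop empty blocks.
    return [" ".join(block.split()) for block in blocks if block.strip()]


script = "Welcome to the tutorial.\nToday we cover setup.\n\n\nThanks for watching."
for number, paragraph in enumerate(split_paragraphs(script), start=1):
    print(number, paragraph)
```

Each paragraph could then be synthesized separately, with a silence gap inserted between the resulting audio segments.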
Technical Implementation
Audio Enhancement (librosa Integration):
This toolkit adds an audio processing pipeline on top of Kokoro's generated TTS output:
# Without enhancement - raw Kokoro output
raw_audio = kokoro_pipeline(text)
# With enhancement
audio = enhance_audio(
    raw_audio,
    normalize=True,        # Consistent volume
    trim_silence=True,     # Remove dead air
    noise_reduction=True,  # Spectral gating
    add_fade=True          # Smooth transitions
)
Result: Voiceovers ready for YouTube, podcasts, or content creation without additional audio editing.
MCP Server Integration:
Traditional workflow:
# 1. Write script in Claude/Cursor
# 2. Copy text to terminal
# 3. Run Python script
# 4. Switch back to editor
# 5. Repeat for each segment
With MCP server:
# In Claude Desktop or Cursor:
"Generate voiceover for this section using am_michael voice"
# Done. Audio generated without leaving your workspace.
Workflow Enhancement:
Content creators: Write scripts in AI editors, generate voiceovers inline
Developers: Generate test audio during development without context switching
Teams: Standardized TTS across tools (Claude, Cursor, CLI, API)
Automation: AI agents can generate audio as part of content pipelines
Deployment Features:
The toolkit wraps Kokoro with common deployment and development needs:
Configuration management - Environment-based settings, no hardcoded values
Structured logging - JSON logs for aggregation, correlation IDs for tracing
Error handling - Custom exceptions, graceful failures, detailed error context
Testing - Comprehensive test suite, CI/CD integration
Docker deployment - Containerized with health checks, resource limits
CLI interface - Quick access without writing code
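To make "environment-based settings" concrete, here is a minimal sketch of reading TTS settings from environment variables with fallbacks. The variable names `TTS_VOICE`, `TTS_SPEED`, and `TTS_ENHANCE_AUDIO` are the ones this README lists under Environment Variables; the class `EnvTTSSettings` and the function `load_settings` are hypothetical names for illustration, and the toolkit's own configuration classes may work differently.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvTTSSettings:
    """Hypothetical settings container; defaults mirror the README's examples."""
    voice: str = "am_michael"
    speed: float = 1.0
    enhance_audio: bool = True


def _as_bool(raw: str) -> bool:
    # Accept the common truthy spellings used in .env files.
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env=None) -> EnvTTSSettings:
    """Build settings from a mapping (os.environ by default), falling back to defaults."""
    env = os.environ if env is None else env
    defaults = EnvTTSSettings()
    return EnvTTSSettings(
        voice=env.get("TTS_VOICE", defaults.voice),
        speed=float(env.get("TTS_SPEED", defaults.speed)),
        enhance_audio=_as_bool(env.get("TTS_ENHANCE_AUDIO", str(defaults.enhance_audio))),
    )


settings = load_settings(
    {"TTS_VOICE": "bm_george", "TTS_SPEED": "1.2", "TTS_ENHANCE_AUDIO": "false"}
)
print(settings)
```

Passing a plain dictionary instead of `os.environ` keeps the loader easy to test without touching the real environment.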
Use Cases
YouTube/Podcast Production:
# Process entire video script with proper gaps
engine.process_script("script.txt", "voiceover.wav", gap_duration=0.5)
AI-Assisted Content Creation:
# In Claude Desktop with MCP:
User: "Generate a 30-second intro for my coding tutorial"
Claude: *generates script and voiceover via MCP*
Batch Content Generation:
# Generate 100 audio segments for e-learning course
engine.batch_generate(lesson_texts, output_dir="lessons/")
Development/Testing:
# Quick CLI test during development
aparsoft-tts generate "Test message" -o test.wav
Quick Start
Installation
System Dependencies (Required):
# Ubuntu/Debian
sudo apt-get install espeak-ng ffmpeg libsndfile1
# macOS
brew install espeak ffmpeg
# Windows: Download from
# - espeak-ng: https://github.com/espeak-ng/espeak-ng/releases
# - ffmpeg: https://ffmpeg.org/download.html
Python Package - Choose Your Installation:
# Create virtual environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# OPTION 1: Complete installation (RECOMMENDED)
# Includes: TTS Engine + MCP Server + CLI + Streamlit Web UI
pip install -e ".[complete]"
# OPTION 2: Without Streamlit (Developers)
# Includes: TTS Engine + MCP Server + CLI (no web UI)
pip install -e ".[mcp,cli]"
# OPTION 3: Streamlit Only
# Includes: TTS Engine + Streamlit Web UI (no MCP, no CLI)
pip install -e ".[streamlit]"
# OPTION 4: Core Only (Minimal)
# Includes: TTS Engine only (Python API)
pip install -e .
# OPTION 5: Everything (Contributors)
# Includes: All features + development tools
pip install -e ".[all]"
📚 See INSTALLATION.md for detailed installation options and troubleshooting.
Quick Launch
Streamlit Web UI:
# Cross-platform launcher
python run_streamlit.py
# Or use platform-specific scripts
./run_streamlit.sh # Linux/macOS
run_streamlit.bat # Windows
# Or direct
streamlit run streamlit_app.py
MCP Server (for Claude Desktop/Cursor):
See MCP Integration section below
Basic Usage
from aparsoft_tts import TTSEngine
# Initialize engine
engine = TTSEngine()
# Generate speech
engine.generate(
text="Welcome to Kokoro YouTube TTS",
output_path="output.wav"
)
CLI Usage
# Generate audio
aparsoft-tts generate "Hello world" -o output.wav
# List available voices
aparsoft-tts voices
# Process video script
aparsoft-tts script video_script.txt -o voiceover.wav
# Batch generate
aparsoft-tts batch "Intro" "Body" "Outro" -d segments/
Available Voices
Male Voices:
am_adam - American male (natural inflection)
am_michael - American male (deeper tones, professional)
bm_george - British male (classic accent)
bm_lewis - British male (modern accent)
Female Voices:
af_bella - American female (warm tones)
af_nicole - American female (dynamic range)
af_sarah - American female (clear articulation)
af_sky - American female (youthful energy)
bf_emma - British female (professional)
bf_isabella - British female (soft tones)
Special Voices:
af - Default mix (50-50 blend of Bella and Sarah)
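The voice IDs above appear to follow a two-letter prefix convention: the first letter for accent (a = American, b = British) and the second for gender (f = female, m = male). The sketch below decodes an ID on that assumption. The convention is inferred from the list above, and `describe_voice` is a hypothetical helper, not a toolkit function.

```python
ACCENTS = {"a": "American", "b": "British"}
GENDERS = {"f": "female", "m": "male"}


def describe_voice(voice_id: str) -> dict:
    """Decode a voice ID such as 'am_michael' into accent, gender, and name.

    Assumes the prefix convention inferred from the voice list; IDs that do
    not match it raise ValueError.
    """
    prefix, _, name = voice_id.partition("_")
    if len(prefix) != 2 or prefix[0] not in ACCENTS or prefix[1] not in GENDERS:
        raise ValueError(f"Unrecognized voice id: {voice_id!r}")
    return {
        "accent": ACCENTS[prefix[0]],
        "gender": GENDERS[prefix[1]],
        "name": name or None,  # the blended 'af' voice has no speaker name
    }


for voice in ("am_michael", "bf_emma", "af"):
    print(voice, describe_voice(voice))
```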
Advanced Usage
Custom Configuration
from aparsoft_tts import TTSEngine, TTSConfig
# Create custom configuration
config = TTSConfig(
voice="bm_george",
speed=1.2,
enhance_audio=True,
fade_duration=0.2
)
engine = TTSEngine(config=config)
engine.generate("Custom configuration", "output.wav")
Audio Enhancement
from aparsoft_tts.utils.audio import enhance_audio
# Generate raw audio
audio = engine.generate("Test audio")
# Apply custom enhancement
enhanced = enhance_audio(
audio,
sample_rate=24000,
normalize=True,
trim_silence=True,
trim_db=25.0,
noise_reduction=True,
add_fade=True,
fade_duration=0.15
)
Batch Processing
# Process multiple texts
texts = [
"Welcome to the tutorial",
"Let's explore the features",
"Thanks for watching"
]
paths = engine.batch_generate(
texts=texts,
output_dir="segments/",
voice="am_michael"
)
Script Processing
# Process complete video script with automatic text chunking
engine.process_script(
script_path="video_script.txt",
output_path="complete_voiceover.wav",
gap_duration=0.5, # Gap between paragraphs
voice="am_michael",
speed=1.0
)
# Note: Kokoro processes up to 510 tokens per pass.
# Long scripts are automatically chunked and combined seamlessly.
Podcast Generation (Multi-Voice)
Create podcast-style content with different voices and speeds per segment. Perfect for interviews, dialogues, or multi-speaker content.
Via MCP (Claude Desktop/Cursor):
"Create a podcast with these segments:
- Intro by am_michael: 'Welcome to Tech Talk'
- Guest by af_bella at 0.95 speed: 'Thanks for having me'
- Outro by am_michael: 'See you next week'"
Via Python API:
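from aparsoft_tts import TTSEngine
from aparsoft_tts.utils.audio import combine_audio_segments, save_audio
# Initialize the engine once and reuse it for every segment
engine = TTSEngine()
# Define podcast segments with different voices/speeds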
segments = [
{"text": "Welcome to the show", "voice": "am_michael", "speed": 1.0},
{"text": "Great to be here", "voice": "af_bella", "speed": 0.95},
{"text": "Thanks for listening", "voice": "am_michael", "speed": 1.0},
]
# Generate each segment
audio_segments = []
for seg in segments:
    audio = engine.generate(
        text=seg["text"],
        voice=seg["voice"],
        speed=seg["speed"]
    )
    audio_segments.append(audio)
# Combine with gaps
combined = combine_audio_segments(
    audio_segments,
    sample_rate=24000,
    gap_duration=0.6  # Pause between segments
)
# Save
save_audio(combined, "podcast.wav", sample_rate=24000)
Via Streamlit UI:
Open Streamlit:
python run_streamlit.py
Navigate to "🎙️ Podcast Generation" tab
Click "➕ Add Segment" for each speaker
Configure voice, speed, and text per segment
Adjust gap duration in settings panel
Click "🎧 Generate Podcast"
Features:
Per-segment voice control (host/guest conversations)
Individual speed settings (emphasis/pacing)
Configurable gaps between segments
Audio enhancement (normalization, crossfades)
Segment reordering (move up/down)
Template support for quick start
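The configurable gap between segments can be pictured as blocks of silence inserted between consecutive clips. The sketch below shows the idea with plain Python lists so that it runs without dependencies. `join_with_gaps` is a hypothetical name; the toolkit's `combine_audio_segments` presumably operates on audio arrays and may also apply crossfades, which this sketch does not.

```python
def join_with_gaps(segments, sample_rate=24000, gap_duration=0.6):
    """Concatenate mono segments, inserting silence between (not after) them.

    Illustrative only: samples are plain floats in lists, and no crossfade
    or normalization is applied.
    """
    # Number of zero-valued samples that make up one gap.
    gap = [0.0] * int(round(sample_rate * gap_duration))
    combined = []
    for index, segment in enumerate(segments):
        if index > 0:
            combined.extend(gap)
        combined.extend(segment)
    return combined


host = [0.1] * 24000   # placeholder for 1.0 s of host audio
guest = [0.2] * 12000  # placeholder for 0.5 s of guest audio
combined = join_with_gaps([host, guest], gap_duration=0.5)
print(len(combined) / 24000, "seconds")
```

At a 24 kHz sample rate, a 0.5 s gap is 12,000 samples, so the two placeholder clips plus one gap come to 48,000 samples.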
Streaming Generation
# Generate audio in chunks
for chunk in engine.generate_stream(
    text="Long text for streaming...",
    voice="am_michael"
):
    # Process chunk as it's generated (process_audio_chunk is your own handler)
    process_audio_chunk(chunk)
Model Context Protocol (MCP) Integration
Quick MCP Setup (5 Minutes)
What is MCP? Model Context Protocol lets Claude Desktop and Cursor generate speech directly from your conversations. No copy-pasting, no context switching.
For Developers: Quick Start
# 1. Find your Python path
which python # Linux/Mac
where python # Windows
# Example output: /home/ram/projects/youtube-creator/venv/bin/python
Claude Desktop:
# 1. Open config (creates if doesn't exist)
code ~/Library/Application\ Support/Claude/claude_desktop_config.json # macOS
code ~/.config/Claude/claude_desktop_config.json # Linux
notepad %APPDATA%\Claude\claude_desktop_config.json # Windows
# 2. Add this (use YOUR absolute Python path):
{
"mcpServers": {
"aparsoft-tts": {
"command": "/absolute/path/to/your/venv/bin/python",
"args": ["-m", "aparsoft_tts.mcp_server"]
}
}
}
# 3. Restart Claude (Cmd/Ctrl + R)
Cursor:
# 1. Create/edit config
mkdir -p ~/.cursor && code ~/.cursor/mcp.json
# 2. Add this (use YOUR absolute Python path):
{
"mcpServers": {
"aparsoft-tts": {
"command": "/absolute/path/to/your/venv/bin/python",
"args": ["-m", "aparsoft_tts.mcp_server"]
}
}
}
# 3. Restart Cursor completely
Testing MCP Server
# Quick test - should print server info
python -m aparsoft_tts.mcp_server --help
# Interactive testing with MCP Inspector
npx @modelcontextprotocol/inspector \
--command "/path/to/venv/bin/python" \
--args "-m" "aparsoft_tts.mcp_server"
# Opens UI at http://localhost:6274
Usage Examples
In Claude Desktop or Cursor, just ask naturally:
# Basic generation
"Generate speech for 'Welcome to my channel' using am_michael voice"
# Voice discovery
"List all available TTS voices"
# Batch processing
"Create voiceovers for these three segments: 'Intro', 'Main', 'Outro'"
# Script processing
"Process video_script.txt and create a complete voiceover"
# Custom parameters
"Generate 'Test message' at 1.3x speed with British accent"
MCP Tools Available
generate_speech: Single audio generation with full control
Text input (up to 10,000 characters)
Voice selection (any voice listed under Available Voices)
Speed control (0.5x - 2.0x)
Audio enhancement toggle
list_voices: Get voice catalog with descriptions
batch_generate: Process multiple texts efficiently
process_script: Complete video script conversion
Automatic paragraph detection
Configurable gap duration
Handles long texts via automatic chunking
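The limits stated for generate_speech (up to 10,000 characters of text, speed between 0.5x and 2.0x) can be checked before a request is sent. The sketch below is one way to do that. `validate_speech_request` is a hypothetical helper, and how the server itself reports violations is not documented here.

```python
def validate_speech_request(
    text: str,
    speed: float = 1.0,
    max_chars: int = 10_000,
    min_speed: float = 0.5,
    max_speed: float = 2.0,
) -> list[str]:
    """Return a list of human-readable problems; an empty list means the request looks valid.

    Limits default to the values this README states for generate_speech.
    """
    problems = []
    if not text.strip():
        problems.append("text is empty")
    elif len(text) > max_chars:
        problems.append(f"text has {len(text)} characters; limit is {max_chars}")
    if not (min_speed <= speed <= max_speed):
        problems.append(f"speed {speed} is outside {min_speed}-{max_speed}")
    return problems


print(validate_speech_request("Welcome to my channel", speed=1.3))
print(validate_speech_request("Test message", speed=2.5))
```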
Troubleshooting MCP
"Could not attach to MCP server"
Use absolute path:
/full/path/to/venv/bin/python
Test server runs:
python -m aparsoft_tts.mcp_server
Check Python version:
python --version (needs 3.10+)
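The path and version checks above can also be run from inside the virtual environment itself, which avoids guessing which interpreter `python` resolves to. This is a small sketch using only the standard library; `interpreter_report` is a hypothetical helper, and the "command" value it prints is what would go into the "command" field of the MCP config.

```python
import os
import sys

MIN_VERSION = (3, 10)  # minimum Python version stated above


def interpreter_report(executable=sys.executable or "", version=tuple(sys.version_info[:2])):
    """Summarize whether an interpreter path and version look usable for the MCP config."""
    return {
        "command": executable,
        "is_absolute": os.path.isabs(executable),
        "version": ".".join(str(part) for part in version),
        "version_ok": tuple(version) >= MIN_VERSION,
    }


# Run this with the virtual environment activated.
for key, value in interpreter_report().items():
    print(f"{key}: {value}")
```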
"Tool not found"
# Reinstall MCP dependencies
pip install -e ".[mcp]"
# Verify FastMCP
python -c "from fastmcp import FastMCP; print('✅ OK')"
Detailed Documentation: See TUTORIAL.md for comprehensive MCP guide with advanced features, debugging, and production deployment.
Docker Deployment
Build and Run
# Build image
docker build -t aparsoft-tts:latest .
# Run MCP server
docker run -d \
--name aparsoft-tts \
-v $(pwd)/outputs:/app/outputs \
-v $(pwd)/logs:/app/logs \
aparsoft-tts:latest
# Run CLI commands
docker run --rm \
-v $(pwd)/outputs:/app/outputs \
aparsoft-tts:latest \
aparsoft-tts generate "Docker test" -o /app/outputs/test.wav
Docker Compose
# Start services
docker-compose up -d
# View logs
docker-compose logs -f
# Stop services
docker-compose down
Environment Variables
# TTS Configuration
TTS_VOICE=am_michael
TTS_SPEED=1.0
TTS_ENHANCE_AUDIO=true
# MCP Server
MCP_SERVER_NAME=aparsoft-tts-server
MCP_ENABLE_RATE_LIMITING=true
# Logging
LOG_LEVEL=INFO
LOG_FORMAT=json
Project Structure
youtube-creator/
├── aparsoft_tts/
│ ├── core/
│ │ └── engine.py # TTS engine
│ ├── utils/
│ │ ├── audio.py # Audio processing with librosa
│ │ ├── logging.py # Structured logging
│ │ └── exceptions.py # Custom exceptions
│ ├── config.py # Configuration management
│ ├── cli.py # CLI interface
│ └── mcp_server.py # MCP server (FastMCP)
├── tests/
│ ├── unit/ # Unit tests
│ └── integration/ # Integration tests
├── examples/ # Usage examples
├── pyproject.toml # Project metadata
├── Dockerfile # Docker configuration
└── docker-compose.yml # Docker Compose config
Audio Processing
The toolkit enhances Kokoro's output with professional audio processing:
Features:
Normalization: Consistent volume levels
Silence Trimming: Remove quiet sections (configurable threshold)
Noise Reduction: Spectral gating for cleaner audio
Fade In/Out: Smooth transitions, prevents clicks
Custom Processing: Extensible with librosa/scipy
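To show what two of these steps do to the samples, here is a dependency-free sketch of peak normalization and linear fade in/out. It is an illustration under stated assumptions, not the toolkit's implementation: the real pipeline uses librosa, and the function names, the 0.95 target peak, and the linear fade shape are all choices made for this example.

```python
def peak_normalize(samples, target_peak=0.95):
    """Scale samples so the loudest one reaches target_peak (assumed value)."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silence or empty input: nothing to scale
    scale = target_peak / peak
    return [s * scale for s in samples]


def apply_fades(samples, sample_rate=24000, fade_duration=0.1):
    """Apply a linear fade-in and fade-out of fade_duration seconds each."""
    out = list(samples)
    # Never let the two fades overlap on very short clips.
    n = min(int(sample_rate * fade_duration), len(out) // 2)
    for i in range(n):
        gain = i / n  # ramps from 0.0 up toward 1.0
        out[i] *= gain        # fade in from the start
        out[-1 - i] *= gain   # fade out toward the end
    return out


clip = [0.25, -0.5, 0.5, 0.25, -0.25, 0.5, 0.5, -0.5, 0.25, 0.5]
normalized = peak_normalize(clip)
faded = apply_fades(normalized, sample_rate=10, fade_duration=0.2)
print(faded)
```

Starting and ending at zero gain is what prevents audible clicks at clip boundaries.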
Enhancement Pipeline:
from aparsoft_tts.utils.audio import enhance_audio, save_audio
# Generate raw audio
audio = engine.generate("Your text here")
# Apply enhancement pipeline
enhanced = enhance_audio(
audio,
sample_rate=24000,
normalize=True, # Normalize volume
trim_silence=True, # Trim silence
trim_db=20.0, # Threshold in dB
noise_reduction=True, # Apply noise gate
add_fade=True, # Add fade in/out
fade_duration=0.1 # 100ms fade
)
# Save enhanced audio
save_audio(enhanced, "enhanced.wav", sample_rate=24000)
Configuration
Using Configuration Files
from aparsoft_tts import TTSConfig, MCPConfig, LoggingConfig, Config
# TTS settings
tts_config = TTSConfig(
voice="am_michael",
speed=1.0,
enhance_audio=True,
sample_rate=24000,
output_format="wav"
)
# MCP server settings
mcp_config = MCPConfig(
server_name="aparsoft-tts-production",
enable_rate_limiting=True,
rate_limit_calls=100
)
# Logging settings
logging_config = LoggingConfig(
level="INFO",
format="json",
output="file"
)
# Combined configuration
config = Config(
tts=tts_config,
mcp=mcp_config,
logging=logging_config
)
Environment Variables
Create .env file:
# TTS Settings
TTS_VOICE=am_michael
TTS_SPEED=1.0
TTS_ENHANCE_AUDIO=true
TTS_SAMPLE_RATE=24000
# Audio Processing
TTS_TRIM_SILENCE=true
TTS_TRIM_DB=20.0
TTS_FADE_DURATION=0.1
# Logging
LOG_LEVEL=INFO
LOG_FORMAT=console
LOG_OUTPUT=stdout
Testing
# Run all tests
pytest
# Run with coverage
pytest --cov=aparsoft_tts --cov-report=html
# Run specific test file
pytest tests/unit/test_engine.py
# Run only fast tests
pytest -m "not slow"
Development
Setup Development Environment
# Clone repository
git clone https://github.com/aparsoft/kokoro-youtube-tts.git
cd kokoro-youtube-tts
# Install with dev dependencies
pip install -e ".[dev,mcp,cli,all]"
# Install pre-commit hooks
pre-commit install
Running CI Locally
The project includes a GitHub Actions workflow for CI/CD:
Code quality checks (Black, Ruff, mypy)
Tests on multiple Python versions (3.10, 3.11, 3.12)
Docker build verification
Security scanning with Trivy
API Reference
TTSEngine
Initialization:
TTSEngine(config: TTSConfig | None = None)
Methods:
generate(text, output_path, voice, speed, enhance) - Generate speech
generate_stream(text, voice, speed) - Stream audio chunks
batch_generate(texts, output_dir, voice, speed) - Batch processing
process_script(script_path, output_path, gap_duration, voice, speed) - Process scripts
list_voices() - Get available voices
Configuration Classes
TTSConfig - TTS engine settings
MCPConfig - MCP server configuration
LoggingConfig - Logging configuration
Config - Main application configuration
Audio Utilities
enhance_audio(audio, ...) - Apply audio enhancement
combine_audio_segments(segments, ...) - Combine audio files
save_audio(audio, path, ...) - Save audio to file
load_audio(path, ...) - Load audio from file
chunk_audio(audio, ...) - Split audio into chunks
get_audio_duration(audio, ...) - Get audio duration
Examples
See the examples/ directory for complete examples:
basic_usage.py - Simple generation examples
youtube_workflow.py - Complete YouTube video production workflow
Troubleshooting
espeak-ng not found
# Ubuntu/Debian
sudo apt-get install espeak-ng
# macOS
brew install espeak
# Windows: Download from https://github.com/espeak-ng/espeak-ng/releases
Audio quality issues
Enable audio enhancement:
engine.generate(text="Your text", enhance=True)
Import errors
Ensure virtual environment is activated:
source venv/bin/activate # Linux/Mac
venv\Scripts\activate # Windows
Docker issues
Check container logs:
docker logs aparsoft-tts
Performance
Benchmarks (on typical consumer hardware):
Model Loading: ~2-3 seconds (one-time)
Generation Speed: ~0.5s per second of audio
Memory Usage: ~2GB RAM (model loaded)
Token Processing: Up to 510 tokens per pass
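These figures allow a back-of-the-envelope estimate of how long a job will take. The sketch below simply applies the approximate numbers from the list (about 0.5 s of compute per second of audio, plus a one-time model load of about 2-3 s, taken here as 2.5 s). `estimate_generation_seconds` is a hypothetical helper, and real timings will vary with hardware.

```python
def estimate_generation_seconds(
    audio_seconds: float,
    realtime_factor: float = 0.5,      # ~0.5 s of compute per second of audio
    model_load_seconds: float = 2.5,   # midpoint of the ~2-3 s load time
    model_already_loaded: bool = False,
) -> float:
    """Rough wall-clock estimate based on the approximate benchmarks above."""
    load = 0.0 if model_already_loaded else model_load_seconds
    return load + audio_seconds * realtime_factor


# A 10-minute voiceover (600 s of audio) on a freshly started engine:
print(estimate_generation_seconds(600))
# The same job when the engine instance is reused:
print(estimate_generation_seconds(600, model_already_loaded=True))
```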
Text Length Limits:
Kokoro-82M processes up to 510 tokens in a single pass. For longer texts:
Automatic chunking: Engine automatically splits long texts
Script processing: Handles unlimited length via intelligent segmentation
Batch processing: Each segment processed independently
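The automatic chunking described above can be sketched as greedy sentence packing: whole sentences are added to a chunk until the next one would exceed the budget. This is a simplified illustration, not the engine's actual algorithm. `chunk_text` is a hypothetical name, and the default `count_units=len` counts characters as a stand-in, whereas Kokoro's 510-token limit applies to its own tokens.

```python
import re


def chunk_text(text, max_units=510, count_units=len):
    """Greedily pack whole sentences into chunks of at most max_units.

    count_units is an assumption: it should approximate the model's token
    count. A single sentence longer than max_units is kept whole here.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and count_units(candidate) > max_units:
            chunks.append(current)  # budget exceeded: close the current chunk
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "One two. Three four. Five six."
print(chunk_text(sample, max_units=20))
```

Splitting on sentence boundaries rather than at a fixed length keeps each chunk's prosody natural when the generated audio is joined back together.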
Optimization Tips:
Reuse engine instances (avoid reloading model)
Disable enhancement for draft generations (enhance=False)
Use streaming for long texts (automatic chunking)
Batch process multiple files for efficiency
Enable GPU acceleration on supported platforms
For very long texts, use process_script() for optimal chunking
Credits & Acknowledgements
This project builds upon excellent open-source software:
Core Dependencies
Kokoro-82M by hexgrad - Apache License 2.0
Open-weight TTS model with 82M parameters
Processes up to 510 tokens per pass
Architecture by @yl4579 (StyleTTS 2)
24kHz audio output, <100 hours training data
librosa - ISC License
Audio analysis and processing
FastMCP - MIT License
Model Context Protocol server framework
Additional Dependencies
soundfile - Audio I/O
pydantic - Configuration management
structlog - Structured logging
typer - CLI framework
pytest - Testing framework
Special Thanks
🛠️ @yl4579 for StyleTTS 2 architecture
🏆 hexgrad team for Kokoro model and inference library
🌐 Anthropic for Model Context Protocol
📊 All contributors to the open-source dependencies
License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Third-Party Licenses:
Kokoro-82M: Apache License 2.0
librosa: ISC License
FastMCP: MIT License
Support
Email: contact@aparsoft.com
Website: https://aparsoft.com
Issues: GitHub Issues
Citation
If you use this toolkit in your research or project, please cite:
@software{kokoro_youtube_tts,
author = {Aparsoft},
title = {Kokoro YouTube TTS: Comprehensive TTS Toolkit},
year = {2025},
url = {https://github.com/aparsoft/kokoro-youtube-tts}
}
For the Kokoro model:
@software{kokoro_tts,
author = {hexgrad},
title = {Kokoro-82M: Open-weight TTS Model},
year = {2024},
url = {https://huggingface.co/hexgrad/Kokoro-82M}
}
Built with ❤️ for the video creator community