Interactive Voice MCP Server (Kokoro TTS + NeMo ASR)

A Model Context Protocol server that provides Text-to-Speech (TTS) capabilities using Kokoro and Speech-to-Text (STT) capabilities using NVIDIA NeMo Parakeet models, enabling interactive voice dialogues.

Demo: https://www.youtube.com/watch?v=LxlUvTrZ93s
Available Tools
interactive_voice_dialog
- Synthesizes text to speech, plays it, then listens for user speech input and returns the transcription.
- Required arguments:
  - `text_to_speak` (string): The text for the assistant to speak.
- Optional arguments:
  - `voice` (string): The voice to use for TTS (e.g., 'af_heart'). Defaults to 'af_heart'.
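For illustration, the arguments of a call to this tool might look like the following (only the argument names come from the schema above; the exact request envelope depends on your MCP client):

```json
{
  "text_to_speak": "Hello! How can I help you today?",
  "voice": "af_heart"
}
```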
Installation
Prerequisites
Some of the underlying TTS models require `espeak-ng` to be installed on your system.

Windows Installation:
- Go to the espeak-ng releases page.
- Click on "Latest release".
- Download the appropriate `*.msi` file (e.g. `espeak-ng-20191129-b702b03-x64.msi`).
- Run the downloaded installer.
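On Linux or macOS, `espeak-ng` is typically available from the system package manager. The package names below are assumed from the upstream project name; verify them for your distribution:

```shell
# Debian/Ubuntu
sudo apt-get install espeak-ng

# macOS (Homebrew)
brew install espeak-ng
```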
Local Development Installation
To allow Claude Desktop to launch this server using `python -m mcp_server_tts`, you need to install it as a Python module. Installing in "editable" mode (`-e`) is recommended for development, as it means changes to the source code are reflected immediately without needing to reinstall.

Navigate to the directory containing the `pyproject.toml` file (the root of this server project) and run:
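Assuming a standard `pyproject.toml`-based layout, the editable install described above is:

```shell
pip install -e .
```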
After installation, you can run it as a script using:
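With the module installed, the invocation is the one named above (this is also the command Claude Desktop will use to launch the server):

```shell
python -m mcp_server_tts
```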
Configuration
To use this server with Claude Desktop, you need to add it to your `claude_desktop_config.json` file. The location of this file is typically: `C:\Users\<YourUsername>\AppData\Roaming\Claude\claude_desktop_config.json`

Add an entry for this server under the `mcpServers` object in that file. For example, your `mcpServers` section might look like this:
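A minimal sketch, assuming the `python -m mcp_server_tts` invocation described above; the server key name (`interactive-voice` here) is arbitrary, and the exact schema should be checked against Claude Desktop's documentation:

```json
{
  "mcpServers": {
    "interactive-voice": {
      "command": "python",
      "args": ["-m", "mcp_server_tts"]
    }
  }
}
```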
Recommended prompt instruction
The `interactive_voice_dialog` tool, when available, enables a two-way voice conversation.

1. Your Role - Speaking:
   - After you generate ANY complete text-based response to the user, you MUST ALWAYS immediately invoke the `interactive_voice_dialog` function.
   - The `text_to_speak` parameter of this function call should contain a brief, personable, and conversational summary or key takeaway from your main text reply.
   - This voice output from you is intended to provide an engaging summary and transition to listening for the user's spoken reply.

2. Your Role - Listening & Understanding the User's Next Turn:
   - The `interactive_voice_dialog` tool, after playing your voice output, will listen for the user's spoken response.
   - The tool will then return its result to you. CRITICALLY, this tool's result may contain a field named `transcribed_user_voice_input` (or a similarly named field as defined in the tool's schema).
   - Example structure of a tool result:
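A sketch of what such a result might look like; field names other than `transcribed_user_voice_input` are illustrative, not taken from the tool's actual schema:

```json
{
  "status": "success",
  "spoken_text": "Here's a quick summary of my answer.",
  "transcribed_user_voice_input": "Yes, tell me more about that."
}
```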
3. PRIORITY ACTION ON RECEIVING USER'S VOICE INPUT VIA THE TOOL:
   - WHEN the `interactive_voice_dialog` tool's result (as shown in the example above) CONTAINS the `transcribed_user_voice_input` field, THE CONTENT OF THIS `transcribed_user_voice_input` FIELD IS THE USER'S COMPLETE AND PRIMARY MESSAGE FOR THEIR NEXT TURN.
   - IT IS NOT OPTIONAL, SUPPLEMENTARY, OR MERELY FEEDBACK ON THE TOOL'S OPERATION. It is the user's direct communication to you.
   - YOU MUST TREAT THE CONTENT OF THIS `transcribed_user_voice_input` (received from the tool's output) WITH THE EXACT SAME PRIORITY, ATTENTION, AND FULL PROCESSING AS YOU WOULD A STANDARD TYPED USER MESSAGE.
   - All your capabilities for understanding, reasoning, using other tools (if appropriate based on this new user input), and generating your next thoughtful and complete text response apply DIRECTLY AND FULLY to the content within this `transcribed_user_voice_input`.
   - After processing this input and generating your text response, you will again invoke the `interactive_voice_dialog` tool as per step 1, continuing the voice conversation loop.

4. Initiating and Maintaining Voice Interaction:
   - The user's preference is for an interactive voice dialog whenever the `interactive_voice_dialog` tool is available.
   - Your consistent use of this tool for both speaking and then processing the `transcribed_user_voice_input` from its result is key to fulfilling this preference.

5. Exclusivity: The `interactive_voice_dialog` function is the exclusive and sole method for both your voice output and for receiving the user's subsequent voice input in this conversational environment. Do not attempt to use or invent any other mechanisms for voice interaction.
Illustrative Conversational Flow:
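One turn of the loop described above can be sketched in Python with the MCP tool stubbed out. The stub bodies and the `generate_text_response` helper are illustrative placeholders, not the server's real implementation:

```python
def interactive_voice_dialog(text_to_speak: str, voice: str = "af_heart") -> dict:
    """Stub: a real call synthesizes text_to_speak with Kokoro, plays it,
    then records the user's reply and transcribes it with NeMo."""
    return {"transcribed_user_voice_input": "What's the weather like today?"}


def generate_text_response(user_message: str) -> str:
    """Stub for the assistant's normal text-generation step."""
    return f"Responding to: {user_message}"


# Step 1: the assistant produces its full text reply as usual.
reply = generate_text_response("Tell me about Kokoro TTS.")

# Step 2: it immediately speaks a brief conversational summary via the tool.
result = interactive_voice_dialog(text_to_speak="Kokoro is a lightweight TTS model!")

# Step 3: the tool's result carries the user's next spoken message, which is
# then treated exactly like a typed user message, and the loop repeats.
next_user_message = result["transcribed_user_voice_input"]
```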
Note: this is a local-only server. It can only run on the client's local machine because it depends on local resources (audio playback and microphone capture).