
Speech AI - Pronunciation, STT & TTS

Server Details

Pronunciation scoring, speech-to-text, and text-to-speech for language learning

Status: Healthy
Transport: Streamable HTTP
Repository: fasuizu-br/speech-ai-examples
GitHub Stars: 0

Glama MCP Gateway

Connect through Glama MCP Gateway for full control over tool access and complete visibility into every call.

MCP client → Glama → MCP server

Full call logging

Every tool call is logged with complete inputs and outputs, so you can debug issues and audit what your agents are doing.

Tool access control

Enable or disable individual tools per connector, so you decide what your agents can and cannot do.

Managed credentials

Glama handles OAuth flows, token storage, and automatic rotation, so credentials never expire on your clients.

Usage analytics

See which tools your agents call, how often, and when, so you can understand usage patterns and catch anomalies.

100% free. Your data is private.

Tool Definition Quality

Score is being calculated. Check back soon.

Available Tools

10 tools
assess_pronunciation (Assess Pronunciation), Grade: A
Read-only, Idempotent

Assess English pronunciation quality from audio.

Scores pronunciation at four levels: overall, sentence, word, and phoneme. Each score is 0-100. Phonemes are returned in both IPA and ARPAbet notation. Sub-300ms inference latency.

Args:
  - audio_base64: Base64-encoded audio data. Supports WAV, MP3, OGG, and WebM formats.
  - text: The reference English text that the speaker was expected to read aloud.
  - audio_format: Audio format hint — one of 'wav', 'mp3', 'ogg', 'webm'. Defaults to 'wav'.

Returns: dict with keys:
  - overallScore (int 0-100): Overall pronunciation quality
  - sentenceScore (int 0-100): Sentence-level fluency and accuracy
  - words (list): Per-word scores, each containing:
    - word (str): The word
    - score (int 0-100): Word pronunciation score
    - phonemes (list): Per-phoneme scores with IPA/ARPAbet notation
  - decodedTranscript (str): What the model heard (ASR transcript)
  - transcript (str): Reference text
  - confidence (float 0-1): Scoring confidence
  - warnings (list[str]): Quality warnings if any
  - audioQuality (dict): Audio metrics (SNR, peak/RMS dB, etc.)

Parameters (JSON Schema):
  - text (required): The reference English text that the speaker was expected to read aloud.
  - audio_base64 (required): Base64-encoded audio data. Supports WAV, MP3, OGG, and WebM formats.
  - audio_format (optional, default 'wav'): Audio format hint — one of 'wav', 'mp3', 'ogg', 'webm'.
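Building the argument payload for assess_pronunciation can be sketched in Python. This is a hedged illustration of the documented schema, not server code: build_assess_args is a hypothetical helper, and the dummy bytes stand in for a real recording.

```python
import base64

def build_assess_args(audio_bytes: bytes, reference_text: str,
                      audio_format: str = "wav") -> dict:
    """Assemble the documented arguments: audio_base64, text, audio_format.

    The format must be one of 'wav', 'mp3', 'ogg', 'webm' per the schema;
    'wav' is the documented default.
    """
    if audio_format not in ("wav", "mp3", "ogg", "webm"):
        raise ValueError(f"unsupported format: {audio_format}")
    return {
        "audio_base64": base64.b64encode(audio_bytes).decode("ascii"),
        "text": reference_text,
        "audio_format": audio_format,
    }

# Dummy bytes standing in for a real WAV recording:
args = build_assess_args(b"RIFF....WAVEfmt ", "The quick brown fox")
```

The resulting dict is what an MCP client would pass as the tool call arguments.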
Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

The description adds substantial behavioral context beyond the annotations: inference latency ('Sub-300ms'), scoring methodology (four hierarchical levels), phoneme notation formats (IPA and ARPAbet), and audio quality metrics (SNR, dB levels). Since annotations only cover safety hints (readOnly/destructive), this additional performance and methodology detail provides valuable transparency.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is well-structured with a clear front-loaded purpose statement ('Assess English pronunciation quality from audio'), followed by behavioral details, Args, and Returns sections. Despite length, every section earns its place—particularly the comprehensive Returns documentation which compensates for the lack of structured output schema. No excessive verbosity or redundancy.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the absence of a structured output schema, the description comprehensively documents the complex nested return structure (overall/sentence/word/phoneme scores, ASR transcript, confidence, audio quality metrics). Parameters have 100% schema coverage. Annotations cover safety profile. All critical information is present.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

With 100% schema description coverage, the baseline is 3. The Args section largely mirrors the schema descriptions without adding substantial semantic enrichment (e.g., no example base64 strings, no guidance on optimal audio length, no cross-parameter constraints). It does not contradict the schema but doesn't significantly augment it either.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description opens with a specific verb-resource combination ('Assess English pronunciation quality from audio') and clarifies scope (English language, four scoring levels). The detailed return structure (scores at phoneme level) distinguishes it from simpler transcription siblings like 'transcribe_audio' and health-check siblings like 'check_pronunciation_service'.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 3/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description implies usage through the detailed return value documentation (suggesting use when detailed pronunciation feedback is needed), but lacks explicit guidance on when to choose this over 'transcribe_audio' (which performs ASR without reference comparison) or when to use 'check_pronunciation_service'. No explicit 'when not to use' or alternative tool references.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

check_pronunciation_service (Check Pronunciation Service), Grade: A
Read-only, Idempotent

Check if the pronunciation assessment service is healthy and ready.

Returns: dict with keys:
  - status (str): 'healthy' or error state
  - modelLoaded (bool): Whether the scoring model is loaded
  - version (str): API version

Parameters (JSON Schema): No parameters
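The documented return shape lends itself to a small readiness gate before calling assess_pronunciation. A minimal sketch, assuming the payload keys documented above; service_ready is a hypothetical client-side helper, not part of the server:

```python
def service_ready(health: dict) -> bool:
    """Interpret the documented health payload: ready only when status
    is the string 'healthy' and the scoring model is loaded."""
    return health.get("status") == "healthy" and health.get("modelLoaded", False)
```

An agent might call check_pronunciation_service first and only proceed to scoring when this predicate holds.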

Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations already declare read-only, idempotent, and non-destructive safety properties. The description adds valuable behavioral context by detailing the exact return structure (dict with status, modelLoaded, version), which is not captured in the structured annotations or input schema.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is front-loaded with the action and efficiently structured into purpose and return value sections. The 'Returns:' block, while slightly informal, clearly documents output keys without verbosity.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

For a simple health check diagnostic with no input parameters, the description is complete. It explains what the tool does and documents the return values (acting as informal output schema), which is sufficient for invocation decisions.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

The tool accepts zero parameters (empty schema), which per the rubric establishes a baseline score of 4. No parameter documentation is required or expected.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description uses a specific verb ('Check') and resource ('pronunciation assessment service') and explicitly names the service being checked, distinguishing it from sibling health check tools like check_stt_service and check_tts_service.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

While it lacks explicit 'when-not-to-use' directives, the specific mention of 'pronunciation assessment service' provides clear context for selecting this over other check_* siblings without redundancy. It implies usage context sufficiently for a health probe.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

check_stt_service (Check STT Service), Grade: A
Read-only, Idempotent

Check if the speech-to-text service is healthy and ready.

Returns: dict with keys:
  - status (str): 'healthy' or error state
  - modelLoaded (bool): Whether the STT model is loaded
  - version (str): API version

Parameters (JSON Schema): No parameters

Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Excellent disclosure of return structure (dict with status, modelLoaded, version) since no output schema exists. Aligns with annotations (readOnly/destructive hints). Could improve by clarifying if connection failures throw exceptions or return 'error state' string.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Two concise sentences: purpose declaration followed by structured return documentation. No redundant information. Excellent use of docstring-style formatting for return values.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Appropriately complete for a simple health-check tool with no inputs. Compensates for missing output schema by documenting return structure in description. Annotations cover safety profile (idempotent, read-only), though could mention external dependency nature implied by openWorldHint.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Zero parameters present, meeting baseline of 4. Schema is empty object with no properties requiring semantic clarification.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 4/5

Does the description clearly state what the tool does and how it differs from similar tools?

Clear verb 'Check' and resource 'speech-to-text service' with specific scope (health/ready status). Distinguishes from 'tts' and 'pronunciation' siblings by naming the correct domain, though relationship to 'check_whisper_service' (also STT) remains unclear.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 2/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

No explicit when-to-use guidance or comparison to siblings like 'check_whisper_service'. The health-check pattern is universally understood, but the description misses opportunity to clarify when to prefer this over checking specific models (whisper) or the general pronunciation service.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

check_tts_service (Check TTS Service), Grade: A
Read-only, Idempotent

Check if the text-to-speech service is healthy and ready.

Returns: dict with keys:
  - status (str): 'healthy' or error state
  - modelLoaded (bool): Whether the TTS model is loaded
  - version (str): API version

Parameters (JSON Schema): No parameters

Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Complements the readonly/idempotent annotations by documenting the exact return structure (status, modelLoaded, version keys) since no structured output schema exists. This adds crucial behavioral context about what the agent receives.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Front-loaded with the action sentence, followed by a structured Returns section. No wasted words, though the Returns formatting slightly extends length without reducing clarity.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's simplicity (no params) and rich annotations, the description is complete. It compensates for the missing output schema by documenting return values, which is sufficient for this health-check utility.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Zero parameters exist per the empty schema, meeting the baseline score of 4. The description correctly implies no arguments are needed.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 4/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the verb (Check) and resource (text-to-speech service health), distinguishing it from synthesis operations (synthesize_speech) and other health checks (check_stt_service, check_whisper_service) by specifying 'TTS'/'text-to-speech'.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 3/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

Implies usage as a health check ('healthy and ready'), but lacks explicit guidance on when to call this versus alternatives, such as calling before synthesis or troubleshooting steps if unhealthy.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

check_whisper_service (Check Whisper Service), Grade: A
Read-only, Idempotent

Check if the Whisper STT Pro service is healthy and ready.

Returns: dict with keys:
  - status (str): 'healthy' or error state
  - modelLoaded (bool): Whether the Whisper model is loaded
  - diarizeLoaded (bool): Whether the diarization pipeline is loaded
  - version (str): API version
  - modelName (str): Whisper model name (e.g. 'large-v3-turbo')

Parameters (JSON Schema): No parameters

Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations cover safety (readOnly, non-destructive, idempotent); the description adds valuable return structure details (dict keys with types and semantics) and clarifies what 'healthy' means (modelLoaded, diarizeLoaded states) that structured fields omit.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

First sentence efficiently captures purpose; the Returns block is structured, readable, and necessary given the absence of a structured output schema—no wasted prose.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

For a zero-parameter health check with rich annotations, the description is complete: it specifies the exact capability being verified (Whisper STT Pro health) and comprehensively documents the return payload structure that lacks formal schema definition.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Tool accepts zero parameters; baseline score of 4 applies per rubric.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

"Check" specifies the verb, "Whisper STT Pro service" identifies the specific resource, and "healthy and ready" clarifies the scope, distinguishing it from sibling check_*_service tools (e.g., check_stt_service, check_tts_service) by explicitly naming the Whisper implementation.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 3/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

Usage is implied by the specific service name (Whisper STT Pro vs generic STT), but the description lacks explicit when-to-use guidance or distinctions from check_stt_service for agents deciding which health check to invoke.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

get_phoneme_inventory (Get Phoneme Inventory), Grade: A
Read-only, Idempotent

Get the full phoneme inventory supported by the pronunciation scorer.

Returns a list of all English phonemes the engine can assess, including ARPAbet symbol, IPA equivalent, example word, and phoneme category (vowel, consonant, diphthong).

Returns: list of dicts, each with keys:
  - arpabet (str): ARPAbet symbol (e.g. 'AA', 'TH')
  - ipa (str): IPA notation
  - example (str): Example word containing the phoneme
  - category (str): vowel, consonant, or diphthong

Parameters (JSON Schema): No parameters
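A client can slice the documented list-of-dicts return by category. A minimal sketch, using a hypothetical three-entry sample in place of the real inventory; the field names follow the documented keys:

```python
from collections import defaultdict

# Hypothetical sample of the documented return shape:
inventory = [
    {"arpabet": "AA", "ipa": "ɑ", "example": "odd", "category": "vowel"},
    {"arpabet": "TH", "ipa": "θ", "example": "thin", "category": "consonant"},
    {"arpabet": "AY", "ipa": "aɪ", "example": "hide", "category": "diphthong"},
]

# Group ARPAbet symbols by their documented category field.
by_category = defaultdict(list)
for phoneme in inventory:
    by_category[phoneme["category"]].append(phoneme["arpabet"])
```

Grouping like this is useful when presenting vowel and consonant drills separately in a learning flow.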

Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations establish read-only/idempotent safety; description adds valuable return-value context detailing the four specific fields (arpabet, ipa, example, category) and their formats. Does not mention rate limits or authentication, but sufficiently describes the data structure returned.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Excellent structure: purpose front-loaded in first sentence, followed by content summary, then structured Returns documentation. No redundant text; every sentence conveys distinct information about capability or output format.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Complete for a zero-parameter lookup tool: annotations cover behavioral safety, description covers business purpose and output schema (compensating for lack of formal output_schema). Details on ARPAbet/IPA mappings and categories provide necessary domain context.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Zero parameters per schema; baseline 4 applies. Description does not need to compensate for missing parameter documentation, and appropriately implies no filtering is possible (returns 'full' inventory).

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

Description opens with specific verb 'Get' and clear resource 'phoneme inventory', plus scope 'full' and domain 'supported by the pronunciation scorer'. Clearly distinguishes from siblings like 'assess_pronunciation' (scoring) and 'synthesize_speech' (generation) by focusing on metadata retrieval.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 3/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

Provides implied usage context (retrieve inventory before assessment), but lacks explicit when-to-use guidance versus siblings or prerequisites. No mention of when to prefer this over querying individual phonemes or how it relates to the assessment workflow.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

list_tts_voices (List TTS Voices), Grade: A
Read-only, Idempotent

List all available text-to-speech voices with metadata.

Returns: dict with keys:
  - voices (list): Available voices, each with id, name, gender, accent, grade
  - defaultVoice (str): Default voice ID

Parameters (JSON Schema): No parameters

Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations already declare readOnly/idempotent/destructive hints. The description adds valuable output structure documentation—detailing the returned dict contains 'voices' (with specific metadata fields: id, name, gender, accent, grade) and 'defaultVoice'—which compensates for the lack of an output schema.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Efficiently structured with two distinct sections: purpose declaration and Returns documentation. No filler text; every sentence serves a necessary function. The markdown-style formatting of the return structure is clear and readable.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Adequately complete for a zero-parameter discovery tool. The description successfully compensates for the missing output schema by documenting return values. Minor gap: could explicitly mention that returned voice IDs are used with synthesize_speech, but this is implied by the sibling tool names.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Zero parameters present, which warrants the baseline score of 4 per the evaluation rules. The empty schema requires no additional semantic explanation.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

Specific verb 'List' with clear resource 'text-to-speech voices' and scope 'all available'. The description clearly distinguishes this from sibling tools like synthesize_speech (which consumes voices) and check_tts_service (which checks service health rather than enumerating voices).

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 3/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

No explicit when-to-use guidance or comparison to alternatives. While the discovery purpose is implied (getting voice IDs for synthesis), it lacks explicit guidance like 'Call this before synthesize_speech to select a voice ID' or clarification of when to use this versus check_tts_service.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

synthesize_speech (Synthesize Speech), Grade: A
Read-only, Idempotent

Generate natural speech audio from English text.

Produces high-quality speech with 12 English voices. Returns base64-encoded WAV audio (16-bit PCM, 24kHz mono) along with metadata.

Available voices:

  • af_heart (default), af_bella, af_nicole, af_sarah, af_sky (American female)

  • am_adam, am_michael (American male)

  • bf_emma, bf_isabella (British female)

  • bm_george, bm_lewis, bm_daniel (British male)

Args:
  - text: English text to synthesize (1-5000 characters).
  - voice: Voice ID. See list above. Defaults to 'af_heart'.
  - speed: Speed multiplier from 0.5 to 2.0 (default: 1.0).

Returns: dict with keys:
  - audio_base64 (str): Base64-encoded WAV audio (16-bit PCM, 24kHz)
  - duration_ms (str): Audio duration in milliseconds
  - voice (str): Voice ID used
  - text_length (str): Input text character count
  - processing_ms (str): Synthesis time in milliseconds

Parameters (JSON Schema):
  - text (required): English text to convert to speech. Max 5000 characters.
  - speed (optional): Speech speed multiplier (0.5 = half speed, 2.0 = double).
  - voice (optional): Voice ID (e.g. 'af_heart', 'am_adam'). Uses default if omitted.
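Validating arguments client-side against the documented constraints (1-5000 characters, speed 0.5-2.0, the twelve listed voice IDs) can be sketched as follows. build_tts_args is a hypothetical helper; the voice set is copied from the list above:

```python
# The twelve documented voice IDs:
VOICES = {
    "af_heart", "af_bella", "af_nicole", "af_sarah", "af_sky",
    "am_adam", "am_michael",
    "bf_emma", "bf_isabella",
    "bm_george", "bm_lewis", "bm_daniel",
}

def build_tts_args(text: str, voice: str = "af_heart",
                   speed: float = 1.0) -> dict:
    """Check the documented ranges before sending a synthesize_speech call."""
    if not 1 <= len(text) <= 5000:
        raise ValueError("text must be 1-5000 characters")
    if not 0.5 <= speed <= 2.0:
        raise ValueError("speed must be between 0.5 and 2.0")
    if voice not in VOICES:
        raise ValueError(f"unknown voice: {voice}")
    return {"text": text, "voice": voice, "speed": speed}
```

Rejecting out-of-range values locally saves a round trip when the server would refuse the call anyway.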
Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

While annotations cover safety aspects (readOnlyHint, destructiveHint, idempotentHint), the description adds substantial technical context absent from structured fields: the specific audio output format (base64-encoded WAV, 16-bit PCM, 24kHz mono), the complete voice inventory with gender/accent classifications, and the return value structure. No contradictions with annotations are present.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is well-structured with front-loaded purpose, followed by technical specs, enumerated voice options (necessary given the lack of enum constraints in schema), and clearly labeled Args/Returns sections. While information-dense, every section serves a necessary function with minimal redundancy.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the absence of a formal output schema, the description comprehensively documents the return structure including all keys (audio_base64, duration_ms, etc.), voice options, input constraints, and audio encoding details. Combined with complete input schema coverage and annotations, the description provides everything needed for correct invocation.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

With 100% schema coverage, the baseline is 3. The description adds value by specifying concrete default values ('af_heart', 1.0) and explicit ranges (0.5-2.0, 1-5000 characters) that are only implied or absent in the schema, and it clarifies the English language requirement.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description explicitly states the core function with specific verb and resource: 'Generate natural speech audio from English text.' It clearly distinguishes from sibling tools like transcribe_audio (which performs the inverse operation) and list_tts_voices (which only enumerates options) by focusing on the synthesis/generation aspect.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides clear context that this is for text-to-speech conversion and specifies the English language constraint (implying not to use for other languages). However, it lacks explicit guidance comparing it to siblings like 'use transcribe_audio instead for speech-to-text' or noting prerequisite steps like checking service availability first.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

transcribe_audio (Transcribe Audio), Grade: A
Read-only, Idempotent

Transcribe audio to text with word-level timestamps.

Converts spoken English audio into text with optional word-level timestamps and per-word confidence scores.

Args:
  - audio_base64: Base64-encoded audio data (WAV, MP3, OGG, FLAC, WebM).
  - audio_format: Audio format hint. Auto-detected from magic bytes if omitted.
  - include_timestamps: Whether to include word-level timing (default: true).

Returns: dict with keys:
  - text (str): Full decoded transcript
  - words (list): Per-word results with timestamps, each containing:
    - word (str): The transcribed word
    - start (float): Start time in seconds
    - end (float): End time in seconds
    - confidence (float 0-1): Word-level confidence
  - audioDurationMs (int): Audio duration in milliseconds
  - metadata (dict): Processing time, audio length, model version
  - audioQuality (dict): Audio metrics (SNR, peak/RMS dB, etc.)

Parameters (JSON Schema)
- audio_base64 (required): Base64-encoded audio data. Supports WAV, MP3, OGG, FLAC, and WebM formats.
- audio_format (optional): Audio format hint: 'wav', 'mp3', 'ogg', 'flac', 'webm'. Auto-detected if omitted.
- include_timestamps (optional): If true, include word-level start/end times and confidence.
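As a sketch of how a client might prepare a transcribe_audio call: the tool and parameter names come from the listing above, but the helper function and the in-memory WAV clip are illustrative, not part of the server.

```python
import base64
import io
import wave

def build_transcribe_args(audio_bytes: bytes, audio_format: str = "wav",
                          include_timestamps: bool = True) -> dict:
    """Build the arguments payload for a transcribe_audio tool call.

    Hypothetical helper: the keys match the documented parameters;
    the function itself is not part of the server API.
    """
    return {
        "audio_base64": base64.b64encode(audio_bytes).decode("ascii"),
        "audio_format": audio_format,
        "include_timestamps": include_timestamps,
    }

# Generate a short silent WAV clip in memory so the example is self-contained.
buf = io.BytesIO()
with wave.open(buf, "wb") as wav:
    wav.setnchannels(1)       # mono
    wav.setsampwidth(2)       # 16-bit samples
    wav.setframerate(16000)   # 16 kHz
    wav.writeframes(b"\x00\x00" * 16000)  # one second of silence

args = build_transcribe_args(buf.getvalue())
print(sorted(args.keys()))
```

The resulting dict would then be passed as the tool-call arguments through whatever MCP client is in use; since audio_format is auto-detected from magic bytes, it could also be omitted.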
Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Adds valuable context beyond annotations: specifies 'spoken English' language scope, documents auto-detection of audio formats from magic bytes, and extensively details return structure including audio quality metrics. No contradictions with readOnly/idempotent annotations.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Front-loaded summary followed by Args/Returns sections. Slightly verbose Returns section, but justified given lack of output schema. Zero redundancy between description and schema definitions.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Excellent compensation for missing output schema: Returns section exhaustively documents 5 return keys with nested structures (word-level timing, confidence, audio metrics). Complete for an audio processing tool with 3 parameters.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema coverage is 100% (baseline 3). Description enhances by enumerating supported formats (WAV, MP3, etc.) and noting default behaviors (include_timestamps defaults to true, format auto-detected if omitted).

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 4/5

Does the description clearly state what the tool does and how it differs from similar tools?

Clear specific purpose: 'Transcribe audio to text with word-level timestamps' with verb+resource. However, it fails to distinguish from sibling 'transcribe_audio_pro' or contrast with pronunciation assessment tools.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 2/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

No guidance on when to choose this over 'transcribe_audio_pro' or when to use 'check_stt_service' instead. No prerequisites or error conditions mentioned despite complexity.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

transcribe_audio_pro (Transcribe Audio Pro, Whisper)
Read-only · Idempotent

Transcribe audio with Whisper Large V3 Turbo — multilingual STT.

Supports 99 languages with automatic language detection, word-level timestamps, per-word confidence scores, and optional speaker diarization (identifies who spoke each word). Best-in-class WER (~2%).

Args:
- audio_base64: Base64-encoded audio (WAV, MP3, OGG, FLAC, WebM).
- language: Language code. Auto-detected if omitted. Supports 99 languages.
- diarize: Enable speaker diarization (default: false). When true, each word includes a speaker label (e.g. SPEAKER_00, SPEAKER_01).

Returns: dict with keys:
- text (str): Full decoded transcript
- words (list): Per-word results with timestamps, each containing:
  - word (str), start (float), end (float), confidence (float 0-1)
  - speaker (str|null): Speaker label when diarize=true
- speakers (dict|null): Speaker info with count and labels
- audioDurationMs (int): Audio duration in milliseconds
- metadata (dict): Processing time, language, languageProbability
- audioQuality (dict): Audio metrics (SNR, peak/RMS dB, etc.)

Parameters (JSON Schema)
- diarize (optional): Enable speaker diarization to identify who spoke each word.
- language (optional): Language code (e.g. 'en', 'es', 'zh'). Auto-detected when omitted.
- audio_base64 (required): Base64-encoded audio data. Supports WAV, MP3, OGG, FLAC, and WebM formats.
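One way a client might consume a diarized result is to group words by speaker label. The response dict below is invented sample data shaped like the documented Returns structure; only the field names come from the listing.

```python
from collections import defaultdict

# Illustrative response matching the documented transcribe_audio_pro shape.
# The transcript content and timings here are made up for the example.
response = {
    "text": "hello there hi",
    "words": [
        {"word": "hello", "start": 0.0, "end": 0.4, "confidence": 0.98, "speaker": "SPEAKER_00"},
        {"word": "there", "start": 0.5, "end": 0.9, "confidence": 0.95, "speaker": "SPEAKER_00"},
        {"word": "hi", "start": 1.2, "end": 1.4, "confidence": 0.97, "speaker": "SPEAKER_01"},
    ],
    "speakers": {"count": 2, "labels": ["SPEAKER_00", "SPEAKER_01"]},
}

def words_by_speaker(resp: dict) -> dict:
    """Group transcribed words by their diarization speaker label.

    Words without a speaker label (diarize=false) fall under "UNKNOWN".
    """
    grouped = defaultdict(list)
    for w in resp["words"]:
        grouped[w.get("speaker") or "UNKNOWN"].append(w["word"])
    return dict(grouped)

turns = words_by_speaker(response)
print(turns)
```

Because speaker is null when diarize=false, code like this should always handle the missing-label case rather than assume every word carries a speaker.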
Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Adds substantial context beyond annotations: detailed output structure (word objects with timestamps/confidence), 99-language support specifics, quality metrics (WER, audio SNR), and processing characteristics. Aligns with readOnlyHint (transcription is non-destructive).

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Well-structured with clear sections (summary → args → returns). The length is justified by the absence of an output schema, which makes a detailed Returns section necessary, though some of the Args detail duplicates the complete schema.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Comprehensive coverage compensating for missing output schema: details return dict structure, nested word objects, speaker labels, metadata fields, and audio quality metrics—everything needed for agent to consume results.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema coverage is 100%, providing full parameter documentation. The description restates those schema details as a narrative 'Args:' section, adding minimal new semantics beyond the schema definitions and so meeting only the baseline expectation.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

Excellent specificity: identifies exact model (Whisper Large V3 Turbo), core function (multilingual STT), and distinguishes from sibling 'transcribe_audio' by listing advanced capabilities (word-level timestamps, confidence scores, diarization, ~2% WER).

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 3/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

Implicit differentiation through feature listing (speaker diarization, per-word timestamps) suggests this is the advanced option vs. basic 'transcribe_audio', but lacks explicit 'use this when...' guidance or direct sibling comparison.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
