Speech AI - Pronunciation, TTS & STT

Server Details

Pronunciation scoring, text-to-speech (12 voices), and speech-to-text with timestamps.

Status: Healthy
Transport: Streamable HTTP
Repository: fasuizu-br/speech-ai-examples
GitHub Stars: 0

Available Tools

10 tools
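
Each tool below is invoked through the MCP `tools/call` method, which is a JSON-RPC 2.0 request carrying the tool name and an `arguments` object. The following is a minimal sketch of building that request body only. The helper name `build_tool_call` is hypothetical, and sending the request (session initialization, headers, and the server URL, which this page does not show) is left to whichever MCP client is in use.

```python
import json
from itertools import count
from typing import Optional

# JSON-RPC ids only need to be unique within a session; a counter is enough here.
_request_ids = count(1)


def build_tool_call(name: str, arguments: Optional[dict] = None) -> dict:
    """Build a JSON-RPC 2.0 request body for the MCP `tools/call` method.

    `build_tool_call` is an illustrative helper, not part of this server.
    """
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments or {}},
    }


# Tools without parameters are called with an empty arguments object.
request = build_tool_call("check_tts_service")
print(json.dumps(request, indent=2))
```

In practice an MCP client library usually builds this envelope itself, so the sketch is mainly useful for reading logs or debugging raw traffic.
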
assess_pronunciation

Assess English pronunciation quality from audio.

Scores pronunciation at four levels: overall, sentence, word, and phoneme. Each score ranges from 0 to 100. Phonemes are returned in both IPA and ARPAbet notation. Stated inference latency is under 300 ms.

Args:
  audio_base64: Base64-encoded audio data. Supports WAV, MP3, OGG, and WebM formats.
  text: The reference English text that the speaker was expected to read aloud.
  audio_format: Audio format hint, one of 'wav', 'mp3', 'ogg', 'webm'. Defaults to 'wav'.

Returns: dict with keys:
  - overallScore (int 0-100): Overall pronunciation quality
  - sentenceScore (int 0-100): Sentence-level fluency and accuracy
  - words (list): Per-word scores, each containing:
    - word (str): The word
    - score (int 0-100): Word pronunciation score
    - phonemes (list): Per-phoneme scores with IPA/ARPAbet notation
  - decodedTranscript (str): What the model heard (ASR transcript)
  - transcript (str): Reference text
  - confidence (float 0-1): Scoring confidence
  - warnings (list[str]): Quality warnings if any
  - audioQuality (dict): Audio metrics (SNR, peak/RMS dB, etc.)

Parameters
Name | Required | Description | Default
text | Yes | The reference English text that the speaker was expected to read aloud. |
audio_base64 | Yes | Base64-encoded audio data. Supports WAV, MP3, OGG, and WebM formats. |
audio_format | No | Audio format hint: one of 'wav', 'mp3', 'ogg', 'webm'. | wav
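
As a rough illustration of the arguments and return keys described for assess_pronunciation, the sketch below base64-encodes raw audio bytes into an arguments dict and then picks out low-scoring words from a result. The helper names, the threshold of 60, and the sample response values are all assumptions made for the example; only the key names come from the description above.

```python
import base64

ALLOWED_FORMATS = {"wav", "mp3", "ogg", "webm"}


def build_assessment_args(audio: bytes, text: str, audio_format: str = "wav") -> dict:
    """Build the arguments dict for assess_pronunciation (hypothetical helper)."""
    if audio_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported audio_format: {audio_format!r}")
    return {
        "audio_base64": base64.b64encode(audio).decode("ascii"),
        "text": text,
        "audio_format": audio_format,
    }


def weak_words(result: dict, threshold: int = 60) -> list:
    """Return words whose score falls below `threshold` (an arbitrary cutoff)."""
    return [w["word"] for w in result.get("words", []) if w["score"] < threshold]


# Illustrative response only: the values are invented, the keys follow the docs.
sample_result = {
    "overallScore": 72,
    "sentenceScore": 70,
    "words": [
        {"word": "the", "score": 95, "phonemes": []},
        {"word": "thought", "score": 48, "phonemes": []},
    ],
}

print(weak_words(sample_result))
```
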
check_pronunciation_service

Check if the pronunciation assessment service is healthy and ready.

Returns: dict with keys:
  - status (str): 'healthy' or error state
  - modelLoaded (bool): Whether the scoring model is loaded
  - version (str): API version

Parameters
No parameters

check_stt_service

Check if the speech-to-text service is healthy and ready.

Returns: dict with keys:
  - status (str): 'healthy' or error state
  - modelLoaded (bool): Whether the STT model is loaded
  - version (str): API version

Parameters
No parameters

check_tts_service

Check if the text-to-speech service is healthy and ready.

Returns: dict with keys:
  - status (str): 'healthy' or error state
  - modelLoaded (bool): Whether the TTS model is loaded
  - version (str): API version

Parameters
No parameters

check_whisper_service

Check if the Whisper STT Pro service is healthy and ready.

Returns: dict with keys:
  - status (str): 'healthy' or error state
  - modelLoaded (bool): Whether the Whisper model is loaded
  - diarizeLoaded (bool): Whether the diarization pipeline is loaded
  - version (str): API version
  - modelName (str): Whisper model name (e.g. 'large-v3-turbo')

Parameters
No parameters
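
The four health-check tools return similarly shaped dicts, so one readiness check can cover all of them. The sketch below is an assumption about how a caller might interpret those fields; `is_ready` is a hypothetical helper and the sample payloads (including the version string) are invented for illustration.

```python
def is_ready(health: dict, require_diarization: bool = False) -> bool:
    """Interpret a health-check result from any of the check_* tools.

    A service counts as ready when status is 'healthy' and its model is loaded.
    `require_diarization` only makes sense for check_whisper_service, which is
    the one tool documented as returning `diarizeLoaded`.
    """
    if health.get("status") != "healthy":
        return False
    if not health.get("modelLoaded", False):
        return False
    if require_diarization and not health.get("diarizeLoaded", False):
        return False
    return True


# Illustrative payloads: values are made up, keys follow the descriptions above.
tts_health = {"status": "healthy", "modelLoaded": True, "version": "1.0.0"}
whisper_health = {
    "status": "healthy",
    "modelLoaded": True,
    "diarizeLoaded": False,
    "version": "1.0.0",
    "modelName": "large-v3-turbo",
}

print(is_ready(tts_health))
print(is_ready(whisper_health, require_diarization=True))
```

Checking `diarizeLoaded` before calling transcribe_audio_pro with diarization enabled lets a caller fall back to plain transcription instead of failing mid-request.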

get_phoneme_inventory

Get the full phoneme inventory supported by the pronunciation scorer.

Returns a list of all English phonemes the engine can assess, including ARPAbet symbol, IPA equivalent, example word, and phoneme category (vowel, consonant, diphthong).

Returns: list of dicts, each with keys:
  - arpabet (str): ARPAbet symbol (e.g. 'AA', 'TH')
  - ipa (str): IPA notation
  - example (str): Example word containing the phoneme
  - category (str): vowel, consonant, or diphthong

Parameters
No parameters
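
Because the inventory is a flat list of dicts, it converts easily into lookup tables. The sketch below builds an ARPAbet-to-IPA map and groups symbols by category. The helper names are hypothetical and the three sample entries are illustrative; the real inventory may use different example words or IPA variants.

```python
from collections import defaultdict


def arpabet_to_ipa(inventory: list) -> dict:
    """Map each ARPAbet symbol to its IPA notation."""
    return {p["arpabet"]: p["ipa"] for p in inventory}


def symbols_by_category(inventory: list) -> dict:
    """Group ARPAbet symbols under 'vowel', 'consonant', or 'diphthong'."""
    groups = defaultdict(list)
    for p in inventory:
        groups[p["category"]].append(p["arpabet"])
    return dict(groups)


# Illustrative subset only, shaped like the documented return value.
sample_inventory = [
    {"arpabet": "AA", "ipa": "ɑ", "example": "odd", "category": "vowel"},
    {"arpabet": "TH", "ipa": "θ", "example": "thin", "category": "consonant"},
    {"arpabet": "AY", "ipa": "aɪ", "example": "hide", "category": "diphthong"},
]

print(arpabet_to_ipa(sample_inventory))
print(symbols_by_category(sample_inventory))
```
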

list_tts_voices

List all available text-to-speech voices with metadata.

Returns: dict with keys:
  - voices (list): Available voices, each with id, name, gender, accent, grade
  - defaultVoice (str): Default voice ID

Parameters
No parameters
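
One way to use this listing is to pick a voice by ID prefix and fall back to `defaultVoice` when nothing matches. The sketch below assumes the prefix convention visible in the voice IDs documented for synthesize_speech (for example `af_` for American female, `bm_` for British male). The helper `choose_voice` is hypothetical, and the sample listing includes only the `id` field because the exact values of the other metadata fields are not documented here.

```python
def choose_voice(listing: dict, prefix: str) -> str:
    """Return the first voice ID starting with `prefix`, else the default voice."""
    for voice in listing.get("voices", []):
        if voice["id"].startswith(prefix):
            return voice["id"]
    return listing["defaultVoice"]


# Illustrative listing: IDs come from the documented voice list, the rest is omitted.
sample_listing = {
    "voices": [{"id": "af_heart"}, {"id": "am_adam"}, {"id": "bm_george"}],
    "defaultVoice": "af_heart",
}

print(choose_voice(sample_listing, "bm_"))
print(choose_voice(sample_listing, "xx_"))
```
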

synthesize_speech

Generate natural speech audio from English text.

Produces high-quality speech with 12 English voices. Returns base64-encoded WAV audio (16-bit PCM, 24kHz mono) along with metadata.

Available voices:

  • af_heart (default), af_bella, af_nicole, af_sarah, af_sky (American female)

  • am_adam, am_michael (American male)

  • bf_emma, bf_isabella (British female)

  • bm_george, bm_lewis, bm_daniel (British male)

Args:
  text: English text to synthesize (1-5000 characters).
  voice: Voice ID. See list above. Defaults to 'af_heart'.
  speed: Speed multiplier from 0.5 to 2.0 (default: 1.0).

Returns: dict with keys:
  - audio_base64 (str): Base64-encoded WAV audio (16-bit PCM, 24kHz)
  - duration_ms (str): Audio duration in milliseconds
  - voice (str): Voice ID used
  - text_length (str): Input text character count
  - processing_ms (str): Synthesis time in milliseconds

Parameters
Name | Required | Description | Default
text | Yes | English text to convert to speech. Max 5000 characters. |
speed | No | Speech speed multiplier (0.5 = half speed, 2.0 = double). |
voice | No | Voice ID (e.g. 'af_heart', 'am_adam'). Uses default if omitted. |
transcribe_audio

Transcribe audio to text with word-level timestamps.

Converts spoken English audio into text with optional word-level timestamps and per-word confidence scores.

Args:
  audio_base64: Base64-encoded audio data (WAV, MP3, OGG, FLAC, WebM).
  audio_format: Audio format hint. Auto-detected from magic bytes if omitted.
  include_timestamps: Whether to include word-level timing (default: true).

Returns: dict with keys:
  - text (str): Full decoded transcript
  - words (list): Per-word results with timestamps, each containing:
    - word (str): The transcribed word
    - start (float): Start time in seconds
    - end (float): End time in seconds
    - confidence (float 0-1): Word-level confidence
  - audioDurationMs (int): Audio duration in milliseconds
  - metadata (dict): Processing time, audio length, model version
  - audioQuality (dict): Audio metrics (SNR, peak/RMS dB, etc.)

Parameters
Name | Required | Description | Default
audio_base64 | Yes | Base64-encoded audio data. Supports WAV, MP3, OGG, FLAC, and WebM formats. |
audio_format | No | Audio format hint: 'wav', 'mp3', 'ogg', 'flac', 'webm'. Auto-detected if omitted. |
include_timestamps | No | If true, include word-level start/end times and confidence. |
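
The per-word `start`, `end`, and `confidence` fields lend themselves to a timestamped listing that also flags uncertain words. The sketch below shows one possible rendering. The helper names, the `mm:ss.mmm` layout, and the sample result are assumptions for illustration; only the field names come from the description above.

```python
def format_timestamp(seconds: float) -> str:
    """Render seconds as mm:ss.mmm."""
    total_ms = round(seconds * 1000)
    minutes, rest_ms = divmod(total_ms, 60_000)
    secs, ms = divmod(rest_ms, 1000)
    return f"{minutes:02d}:{secs:02d}.{ms:03d}"


def timed_lines(result: dict, min_confidence: float = 0.0) -> list:
    """One line per word; words below `min_confidence` are marked with (?)."""
    lines = []
    for w in result.get("words", []):
        flag = " (?)" if w["confidence"] < min_confidence else ""
        span = f"{format_timestamp(w['start'])} - {format_timestamp(w['end'])}"
        lines.append(f"[{span}] {w['word']}{flag}")
    return lines


# Illustrative result: values are invented, keys follow the documented shape.
sample_transcript = {
    "text": "hello world",
    "words": [
        {"word": "hello", "start": 0.0, "end": 0.42, "confidence": 0.98},
        {"word": "world", "start": 0.5, "end": 0.9, "confidence": 0.41},
    ],
}

for line in timed_lines(sample_transcript, min_confidence=0.5):
    print(line)
```
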
transcribe_audio_pro

Transcribe audio with Whisper Large V3 Turbo — multilingual STT.

Supports 99 languages with automatic language detection, word-level timestamps, per-word confidence scores, and optional speaker diarization (identifies who spoke each word). The listing reports a word error rate (WER) of about 2%.

Args:
  audio_base64: Base64-encoded audio (WAV, MP3, OGG, FLAC, WebM).
  language: Language code. Auto-detected if omitted. Supports 99 languages.
  diarize: Enable speaker diarization (default: false). When true, each word includes a speaker label (e.g. SPEAKER_00, SPEAKER_01).

Returns: dict with keys:
  - text (str): Full decoded transcript
  - words (list): Per-word results with timestamps, each containing:
    - word (str), start (float), end (float), confidence (float 0-1)
    - speaker (str|null): Speaker label when diarize=true
  - speakers (dict|null): Speaker info with count and labels
  - audioDurationMs (int): Audio duration in milliseconds
  - metadata (dict): Processing time, language, languageProbability
  - audioQuality (dict): Audio metrics (SNR, peak/RMS dB, etc.)

Parameters
Name | Required | Description | Default
diarize | No | Enable speaker diarization to identify who spoke each word. |
language | No | Language code (e.g. 'en', 'es', 'zh'). Auto-detected when omitted. |
audio_base64 | Yes | Base64-encoded audio data. Supports WAV, MP3, OGG, FLAC, and WebM formats. |
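
When `diarize` is true, each word carries a `speaker` label, and consecutive words from the same speaker can be merged into turns. The sketch below shows one way to do that with `itertools.groupby`. The helper `speaker_turns`, the `UNKNOWN` fallback label for a null speaker, and the sample words are assumptions for illustration; timestamps and confidence are omitted from the sample for brevity.

```python
from itertools import groupby


def speaker_turns(result: dict) -> list:
    """Merge consecutive words with the same speaker label into (speaker, text) turns."""
    turns = []
    words = result.get("words", [])
    # groupby only merges adjacent items, which is what a turn boundary needs.
    for speaker, group in groupby(words, key=lambda w: w.get("speaker")):
        text = " ".join(w["word"] for w in group)
        turns.append((speaker or "UNKNOWN", text))
    return turns


# Illustrative diarized result: labels follow the documented SPEAKER_NN pattern.
sample_diarized = {
    "words": [
        {"word": "hi", "speaker": "SPEAKER_00"},
        {"word": "there", "speaker": "SPEAKER_00"},
        {"word": "hello", "speaker": "SPEAKER_01"},
        {"word": "bye", "speaker": "SPEAKER_00"},
    ]
}

for speaker, text in speaker_turns(sample_diarized):
    print(f"{speaker}: {text}")
```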
