
Speech AI - Pronunciation, STT & TTS

Server Details

Pronunciation scoring, speech-to-text, and text-to-speech for language learning

Status: Healthy
Transport: Streamable HTTP
Repository: fasuizu-br/speech-ai-examples
GitHub Stars: 0

Glama MCP Gateway

Connect through Glama MCP Gateway for full control over tool access and complete visibility into every call.

MCP client → Glama → MCP server

Full call logging

Every tool call is logged with complete inputs and outputs, so you can debug issues and audit what your agents are doing.

Tool access control

Enable or disable individual tools per connector, so you decide what your agents can and cannot do.

Managed credentials

Glama handles OAuth flows, token storage, and automatic rotation, so credentials never expire on your clients.

Usage analytics

See which tools your agents call, how often, and when, so you can understand usage patterns and catch anomalies.

100% free. Your data is private.

Tool Definition Quality

Score is being calculated. Check back soon.

Available Tools

10 tools
assess_pronunciation (Assess Pronunciation), Grade: A
Read-only, Idempotent

Assess English pronunciation quality from audio.

Scores pronunciation at four levels: overall, sentence, word, and phoneme. Each score is 0-100. Phonemes are returned in both IPA and ARPAbet notation. Sub-300ms inference latency.

Args:
  - audio_base64: Base64-encoded audio data. Supports WAV, MP3, OGG, and WebM formats.
  - text: The reference English text that the speaker was expected to read aloud.
  - audio_format: Audio format hint — one of 'wav', 'mp3', 'ogg', 'webm'. Defaults to 'wav'.

Returns: dict with keys:
  - overallScore (int 0-100): Overall pronunciation quality
  - sentenceScore (int 0-100): Sentence-level fluency and accuracy
  - words (list): Per-word scores, each containing:
    - word (str): The word
    - score (int 0-100): Word pronunciation score
    - phonemes (list): Per-phoneme scores with IPA/ARPAbet notation
  - decodedTranscript (str): What the model heard (ASR transcript)
  - transcript (str): Reference text
  - confidence (float 0-1): Scoring confidence
  - warnings (list[str]): Quality warnings if any
  - audioQuality (dict): Audio metrics (SNR, peak/RMS dB, etc.)

Parameters (JSON Schema):
  - text (required): The reference English text that the speaker was expected to read aloud.
  - audio_base64 (required): Base64-encoded audio data. Supports WAV, MP3, OGG, and WebM formats.
  - audio_format (optional, default 'wav'): Audio format hint — one of 'wav', 'mp3', 'ogg', 'webm'.
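Building the argument payload for assess_pronunciation can be sketched in Python. This is a hedged illustration of the documented schema, not server code: build_assess_args is a hypothetical helper, and the dummy bytes stand in for a real recording.

```python
import base64

def build_assess_args(audio_bytes: bytes, reference_text: str,
                      audio_format: str = "wav") -> dict:
    """Assemble the documented arguments: audio_base64, text, audio_format.

    The format must be one of 'wav', 'mp3', 'ogg', 'webm' per the schema;
    'wav' is the documented default.
    """
    if audio_format not in ("wav", "mp3", "ogg", "webm"):
        raise ValueError(f"unsupported format: {audio_format}")
    return {
        "audio_base64": base64.b64encode(audio_bytes).decode("ascii"),
        "text": reference_text,
        "audio_format": audio_format,
    }

# Dummy bytes standing in for a real WAV recording:
args = build_assess_args(b"RIFF....WAVEfmt ", "The quick brown fox")
```

The resulting dict is what an MCP client would pass as the tool call arguments.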
Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

The description adds substantial behavioral context beyond the annotations: inference latency ('Sub-300ms'), scoring methodology (four hierarchical levels), phoneme notation formats (IPA and ARPAbet), and audio quality metrics (SNR, dB levels). Since annotations only cover safety hints (readOnly/destructive), this additional performance and methodology detail provides valuable transparency.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is well-structured with a clear front-loaded purpose statement ('Assess English pronunciation quality from audio'), followed by behavioral details, Args, and Returns sections. Despite length, every section earns its place—particularly the comprehensive Returns documentation which compensates for the lack of structured output schema. No excessive verbosity or redundancy.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the absence of a structured output schema, the description comprehensively documents the complex nested return structure (overall/sentence/word/phoneme scores, ASR transcript, confidence, audio quality metrics). Parameters have 100% schema coverage. Annotations cover safety profile. All critical information is present.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

With 100% schema description coverage, the baseline is 3. The Args section largely mirrors the schema descriptions without adding substantial semantic enrichment (e.g., no example base64 strings, no guidance on optimal audio length, no cross-parameter constraints). It does not contradict the schema but doesn't significantly augment it either.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description opens with a specific verb-resource combination ('Assess English pronunciation quality from audio') and clarifies scope (English language, four scoring levels). The detailed return structure (scores at phoneme level) distinguishes it from simpler transcription siblings like 'transcribe_audio' and health-check siblings like 'check_pronunciation_service'.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 3/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description implies usage through the detailed return value documentation (suggesting use when detailed pronunciation feedback is needed), but lacks explicit guidance on when to choose this over 'transcribe_audio' (which performs ASR without reference comparison) or when to use 'check_pronunciation_service'. No explicit 'when not to use' or alternative tool references.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

check_pronunciation_service (Check Pronunciation Service), Grade: A
Read-only, Idempotent

Check if the pronunciation assessment service is healthy and ready.

Returns: dict with keys:
  - status (str): 'healthy' or error state
  - modelLoaded (bool): Whether the scoring model is loaded
  - version (str): API version

Parameters (JSON Schema): No parameters
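The documented return shape lends itself to a small readiness gate before calling assess_pronunciation. A minimal sketch, assuming the payload keys documented above; service_ready is a hypothetical client-side helper, not part of the server:

```python
def service_ready(health: dict) -> bool:
    """Interpret the documented health payload: ready only when status
    is the string 'healthy' and the scoring model is loaded."""
    return health.get("status") == "healthy" and health.get("modelLoaded", False)
```

An agent might call check_pronunciation_service first and only proceed to scoring when this predicate holds.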

Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations already declare read-only, idempotent, and non-destructive safety properties. The description adds valuable behavioral context by detailing the exact return structure (dict with status, modelLoaded, version), which is not captured in the structured annotations or input schema.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is front-loaded with the action and efficiently structured into purpose and return value sections. The 'Returns:' block, while slightly informal, clearly documents output keys without verbosity.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

For a simple health check diagnostic with no input parameters, the description is complete. It explains what the tool does and documents the return values (acting as informal output schema), which is sufficient for invocation decisions.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

The tool accepts zero parameters (empty schema), which per the rubric establishes a baseline score of 4. No parameter documentation is required or expected.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description uses a specific verb ('Check') and resource ('pronunciation assessment service') and explicitly names the service being checked, distinguishing it from sibling health check tools like check_stt_service and check_tts_service.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

While it lacks explicit 'when-not-to-use' directives, the specific mention of 'pronunciation assessment service' provides clear context for selecting this over other check_* siblings without redundancy. It implies usage context sufficiently for a health probe.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

check_stt_service (Check STT Service), Grade: A
Read-only, Idempotent

Check if the speech-to-text service is healthy and ready.

Returns: dict with keys:
  - status (str): 'healthy' or error state
  - modelLoaded (bool): Whether the STT model is loaded
  - version (str): API version

Parameters (JSON Schema): No parameters

Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Excellent disclosure of return structure (dict with status, modelLoaded, version) since no output schema exists. Aligns with annotations (readOnly/destructive hints). Could improve by clarifying if connection failures throw exceptions or return 'error state' string.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Two concise sentences: purpose declaration followed by structured return documentation. No redundant information. Excellent use of docstring-style formatting for return values.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Appropriately complete for a simple health-check tool with no inputs. Compensates for missing output schema by documenting return structure in description. Annotations cover safety profile (idempotent, read-only), though could mention external dependency nature implied by openWorldHint.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Zero parameters present, meeting baseline of 4. Schema is empty object with no properties requiring semantic clarification.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 4/5

Does the description clearly state what the tool does and how it differs from similar tools?

Clear verb 'Check' and resource 'speech-to-text service' with specific scope (health/ready status). Distinguishes from 'tts' and 'pronunciation' siblings by naming the correct domain, though relationship to 'check_whisper_service' (also STT) remains unclear.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 2/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

No explicit when-to-use guidance or comparison to siblings like 'check_whisper_service'. The health-check pattern is universally understood, but the description misses opportunity to clarify when to prefer this over checking specific models (whisper) or the general pronunciation service.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

check_tts_service (Check TTS Service), Grade: A
Read-only, Idempotent

Check if the text-to-speech service is healthy and ready.

Returns: dict with keys:
  - status (str): 'healthy' or error state
  - modelLoaded (bool): Whether the TTS model is loaded
  - version (str): API version

Parameters (JSON Schema): No parameters

Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Complements the readonly/idempotent annotations by documenting the exact return structure (status, modelLoaded, version keys) since no structured output schema exists. This adds crucial behavioral context about what the agent receives.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Front-loaded with the action sentence, followed by a structured Returns section. No wasted words, though the Returns formatting slightly extends length without reducing clarity.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's simplicity (no params) and rich annotations, the description is complete. It compensates for the missing output schema by documenting return values, which is sufficient for this health-check utility.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Zero parameters exist per the empty schema, meeting the baseline score of 4. The description correctly implies no arguments are needed.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 4/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the verb (Check) and resource (text-to-speech service health), distinguishing it from synthesis operations (synthesize_speech) and other health checks (check_stt_service, check_whisper_service) by specifying 'TTS'/'text-to-speech'.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 3/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

Implies usage as a health check ('healthy and ready'), but lacks explicit guidance on when to call this versus alternatives, such as calling before synthesis or troubleshooting steps if unhealthy.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

check_whisper_service (Check Whisper Service), Grade: A
Read-only, Idempotent

Check if the Whisper STT Pro service is healthy and ready.

Returns: dict with keys:
  - status (str): 'healthy' or error state
  - modelLoaded (bool): Whether the Whisper model is loaded
  - diarizeLoaded (bool): Whether the diarization pipeline is loaded
  - version (str): API version
  - modelName (str): Whisper model name (e.g. 'large-v3-turbo')

Parameters (JSON Schema): No parameters

Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations cover safety (readOnly, non-destructive, idempotent); the description adds valuable return structure details (dict keys with types and semantics) and clarifies what 'healthy' means (modelLoaded, diarizeLoaded states) that structured fields omit.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

First sentence efficiently captures purpose; the Returns block is structured, readable, and necessary given the absence of a structured output schema—no wasted prose.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

For a zero-parameter health check with rich annotations, the description is complete: it specifies the exact capability being verified (Whisper STT Pro health) and comprehensively documents the return payload structure that lacks formal schema definition.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Tool accepts zero parameters; baseline score of 4 applies per rubric.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

"Check" specifies the verb, "Whisper STT Pro service" identifies the specific resource, and "healthy and ready" clarifies the scope, distinguishing it from sibling check_*_service tools (e.g., check_stt_service, check_tts_service) by explicitly naming the Whisper implementation.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 3/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

Usage is implied by the specific service name (Whisper STT Pro vs generic STT), but the description lacks explicit when-to-use guidance or distinctions from check_stt_service for agents deciding which health check to invoke.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

get_phoneme_inventory (Get Phoneme Inventory), Grade: A
Read-only, Idempotent

Get the full phoneme inventory supported by the pronunciation scorer.

Returns a list of all English phonemes the engine can assess, including ARPAbet symbol, IPA equivalent, example word, and phoneme category (vowel, consonant, diphthong).

Returns: list of dicts, each with keys:
  - arpabet (str): ARPAbet symbol (e.g. 'AA', 'TH')
  - ipa (str): IPA notation
  - example (str): Example word containing the phoneme
  - category (str): vowel, consonant, or diphthong

Parameters (JSON Schema): No parameters
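A client can slice the documented list-of-dicts return by category. A minimal sketch, using a hypothetical three-entry sample in place of the real inventory; the field names follow the documented keys:

```python
from collections import defaultdict

# Hypothetical sample of the documented return shape:
inventory = [
    {"arpabet": "AA", "ipa": "ɑ", "example": "odd", "category": "vowel"},
    {"arpabet": "TH", "ipa": "θ", "example": "thin", "category": "consonant"},
    {"arpabet": "AY", "ipa": "aɪ", "example": "hide", "category": "diphthong"},
]

# Group ARPAbet symbols by their documented category field.
by_category = defaultdict(list)
for phoneme in inventory:
    by_category[phoneme["category"]].append(phoneme["arpabet"])
```

Grouping like this is useful when presenting vowel and consonant drills separately in a learning flow.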

Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations establish read-only/idempotent safety; description adds valuable return-value context detailing the four specific fields (arpabet, ipa, example, category) and their formats. Does not mention rate limits or authentication, but sufficiently describes the data structure returned.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Excellent structure: purpose front-loaded in first sentence, followed by content summary, then structured Returns documentation. No redundant text; every sentence conveys distinct information about capability or output format.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Complete for a zero-parameter lookup tool: annotations cover behavioral safety, description covers business purpose and output schema (compensating for lack of formal output_schema). Details on ARPAbet/IPA mappings and categories provide necessary domain context.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Zero parameters per schema; baseline 4 applies. Description does not need to compensate for missing parameter documentation, and appropriately implies no filtering is possible (returns 'full' inventory).

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

Description opens with specific verb 'Get' and clear resource 'phoneme inventory', plus scope 'full' and domain 'supported by the pronunciation scorer'. Clearly distinguishes from siblings like 'assess_pronunciation' (scoring) and 'synthesize_speech' (generation) by focusing on metadata retrieval.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 3/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

Provides implied usage context (retrieve inventory before assessment), but lacks explicit when-to-use guidance versus siblings or prerequisites. No mention of when to prefer this over querying individual phonemes or how it relates to the assessment workflow.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

list_tts_voices (List TTS Voices), Grade: A
Read-only, Idempotent

List all available text-to-speech voices with metadata.

Returns: dict with keys:
  - voices (list): Available voices, each with id, name, gender, accent, grade
  - defaultVoice (str): Default voice ID

Parameters (JSON Schema): No parameters

Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations already declare readOnly/idempotent/destructive hints. The description adds valuable output structure documentation—detailing the returned dict contains 'voices' (with specific metadata fields: id, name, gender, accent, grade) and 'defaultVoice'—which compensates for the lack of an output schema.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Efficiently structured with two distinct sections: purpose declaration and Returns documentation. No filler text; every sentence serves a necessary function. The markdown-style formatting of the return structure is clear and readable.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Adequately complete for a zero-parameter discovery tool. The description successfully compensates for the missing output schema by documenting return values. Minor gap: could explicitly mention that returned voice IDs are used with synthesize_speech, but this is implied by the sibling tool names.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Zero parameters present, which warrants the baseline score of 4 per the evaluation rules. The empty schema requires no additional semantic explanation.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

Specific verb 'List' with clear resource 'text-to-speech voices' and scope 'all available'. The description clearly distinguishes this from sibling tools like synthesize_speech (which consumes voices) and check_tts_service (which checks service health rather than enumerating voices).

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 3/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

No explicit when-to-use guidance or comparison to alternatives. While the discovery purpose is implied (getting voice IDs for synthesis), it lacks explicit guidance like 'Call this before synthesize_speech to select a voice ID' or clarification of when to use this versus check_tts_service.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

synthesize_speech (Synthesize Speech), Grade: A
Read-only, Idempotent

Generate natural speech audio from English text.

Produces high-quality speech with 12 English voices. Returns base64-encoded WAV audio (16-bit PCM, 24kHz mono) along with metadata.

Available voices:

  • af_heart (default), af_bella, af_nicole, af_sarah, af_sky (American female)

  • am_adam, am_michael (American male)

  • bf_emma, bf_isabella (British female)

  • bm_george, bm_lewis, bm_daniel (British male)

Args:
  - text: English text to synthesize (1-5000 characters).
  - voice: Voice ID. See list above. Defaults to 'af_heart'.
  - speed: Speed multiplier from 0.5 to 2.0 (default: 1.0).

Returns: dict with keys:
  - audio_base64 (str): Base64-encoded WAV audio (16-bit PCM, 24kHz)
  - duration_ms (str): Audio duration in milliseconds
  - voice (str): Voice ID used
  - text_length (str): Input text character count
  - processing_ms (str): Synthesis time in milliseconds

Parameters (JSON Schema):
  - text (required): English text to convert to speech. Max 5000 characters.
  - speed (optional): Speech speed multiplier (0.5 = half speed, 2.0 = double).
  - voice (optional): Voice ID (e.g. 'af_heart', 'am_adam'). Uses default if omitted.
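Validating arguments client-side against the documented constraints (1-5000 characters, speed 0.5-2.0, the twelve listed voice IDs) can be sketched as follows. build_tts_args is a hypothetical helper; the voice set is copied from the list above:

```python
# The twelve documented voice IDs:
VOICES = {
    "af_heart", "af_bella", "af_nicole", "af_sarah", "af_sky",
    "am_adam", "am_michael",
    "bf_emma", "bf_isabella",
    "bm_george", "bm_lewis", "bm_daniel",
}

def build_tts_args(text: str, voice: str = "af_heart",
                   speed: float = 1.0) -> dict:
    """Check the documented ranges before sending a synthesize_speech call."""
    if not 1 <= len(text) <= 5000:
        raise ValueError("text must be 1-5000 characters")
    if not 0.5 <= speed <= 2.0:
        raise ValueError("speed must be between 0.5 and 2.0")
    if voice not in VOICES:
        raise ValueError(f"unknown voice: {voice}")
    return {"text": text, "voice": voice, "speed": speed}
```

Rejecting out-of-range values locally saves a round trip when the server would refuse the call anyway.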
Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

While annotations cover safety aspects (readOnlyHint, destructiveHint, idempotentHint), the description adds substantial technical context absent from structured fields: the specific audio output format (base64-encoded WAV, 16-bit PCM, 24kHz mono), the complete voice inventory with gender/accent classifications, and the return value structure. No contradictions with annotations are present.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is well-structured with front-loaded purpose, followed by technical specs, enumerated voice options (necessary given the lack of enum constraints in schema), and clearly labeled Args/Returns sections. While information-dense, every section serves a necessary function with minimal redundancy.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the absence of a formal output schema, the description comprehensively documents the return structure including all keys (audio_base64, duration_ms, etc.), voice options, input constraints, and audio encoding details. Combined with complete input schema coverage and annotations, the description provides everything needed for correct invocation.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

With 100% schema coverage, the baseline is 3. The description adds value by specifying concrete default values ('af_heart', 1.0) and explicit ranges (0.5-2.0, 1-5000 characters) that are only implied or absent in the schema, and it clarifies the English language requirement.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description explicitly states the core function with specific verb and resource: 'Generate natural speech audio from English text.' It clearly distinguishes from sibling tools like transcribe_audio (which performs the inverse operation) and list_tts_voices (which only enumerates options) by focusing on the synthesis/generation aspect.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides clear context that this is for text-to-speech conversion and specifies the English language constraint (implying not to use for other languages). However, it lacks explicit guidance comparing it to siblings like 'use transcribe_audio instead for speech-to-text' or noting prerequisite steps like checking service availability first.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

transcribe_audio (Transcribe Audio), Grade: A
Read-only, Idempotent

Transcribe audio to text with word-level timestamps.

Converts spoken English audio into text with optional word-level timestamps and per-word confidence scores.

Args:
  - audio_base64: Base64-encoded audio data (WAV, MP3, OGG, FLAC, WebM).
  - audio_format: Audio format hint. Auto-detected from magic bytes if omitted.
  - include_timestamps: Whether to include word-level timing (default: true).

Returns: dict with keys:
  - text (str): Full decoded transcript
  - words (list): Per-word results with timestamps, each containing:
    - word (str): The transcribed word
    - start (float): Start time in seconds
    - end (float): End time in seconds
    - confidence (float 0-1): Word-level confidence
  - audioDurationMs (int): Audio duration in milliseconds
  - metadata (dict): Processing time, audio length, model version
  - audioQuality (dict): Audio metrics (SNR, peak/RMS dB, etc.)

Parameters (JSON Schema)
- audio_base64 (required): Base64-encoded audio data. Supports WAV, MP3, OGG, FLAC, and WebM formats.
- audio_format (optional): Audio format hint: 'wav', 'mp3', 'ogg', 'flac', 'webm'. Auto-detected if omitted.
- include_timestamps (optional): If true, include word-level start/end times and confidence.
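As a sketch of how a client might prepare a transcribe_audio call: the tool and parameter names come from the listing above, but the helper function and the in-memory WAV clip are illustrative, not part of the server.

```python
import base64
import io
import wave

def build_transcribe_args(audio_bytes: bytes, audio_format: str = "wav",
                          include_timestamps: bool = True) -> dict:
    """Build the arguments payload for a transcribe_audio tool call.

    Hypothetical helper: the keys match the documented parameters;
    the function itself is not part of the server API.
    """
    return {
        "audio_base64": base64.b64encode(audio_bytes).decode("ascii"),
        "audio_format": audio_format,
        "include_timestamps": include_timestamps,
    }

# Generate a short silent WAV clip in memory so the example is self-contained.
buf = io.BytesIO()
with wave.open(buf, "wb") as wav:
    wav.setnchannels(1)       # mono
    wav.setsampwidth(2)       # 16-bit samples
    wav.setframerate(16000)   # 16 kHz
    wav.writeframes(b"\x00\x00" * 16000)  # one second of silence

args = build_transcribe_args(buf.getvalue())
print(sorted(args.keys()))
```

The resulting dict would then be passed as the tool-call arguments through whatever MCP client is in use; since audio_format is auto-detected from magic bytes, it could also be omitted.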
Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Adds valuable context beyond annotations: specifies 'spoken English' language scope, documents auto-detection of audio formats from magic bytes, and extensively details return structure including audio quality metrics. No contradictions with readOnly/idempotent annotations.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Front-loaded summary followed by Args/Returns sections. Slightly verbose Returns section, but justified given lack of output schema. Zero redundancy between description and schema definitions.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Excellent compensation for missing output schema: Returns section exhaustively documents 5 return keys with nested structures (word-level timing, confidence, audio metrics). Complete for an audio processing tool with 3 parameters.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema coverage is 100% (baseline 3). Description enhances by enumerating supported formats (WAV, MP3, etc.) and noting default behaviors (include_timestamps defaults to true, format auto-detected if omitted).

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 4/5

Does the description clearly state what the tool does and how it differs from similar tools?

Clear specific purpose: 'Transcribe audio to text with word-level timestamps' with verb+resource. However, it fails to distinguish from sibling 'transcribe_audio_pro' or contrast with pronunciation assessment tools.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 2/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

No guidance on when to choose this over 'transcribe_audio_pro' or when to use 'check_stt_service' instead. No prerequisites or error conditions mentioned despite complexity.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

transcribe_audio_pro (Transcribe Audio Pro, Whisper)
Read-only · Idempotent

Transcribe audio with Whisper Large V3 Turbo — multilingual STT.

Supports 99 languages with automatic language detection, word-level timestamps, per-word confidence scores, and optional speaker diarization (identifies who spoke each word). Best-in-class WER (~2%).

Args:
- audio_base64: Base64-encoded audio (WAV, MP3, OGG, FLAC, WebM).
- language: Language code. Auto-detected if omitted. Supports 99 languages.
- diarize: Enable speaker diarization (default: false). When true, each word includes a speaker label (e.g. SPEAKER_00, SPEAKER_01).

Returns: dict with keys:
- text (str): Full decoded transcript
- words (list): Per-word results with timestamps, each containing:
  - word (str), start (float), end (float), confidence (float 0-1)
  - speaker (str|null): Speaker label when diarize=true
- speakers (dict|null): Speaker info with count and labels
- audioDurationMs (int): Audio duration in milliseconds
- metadata (dict): Processing time, language, languageProbability
- audioQuality (dict): Audio metrics (SNR, peak/RMS dB, etc.)

Parameters (JSON Schema)
- diarize (optional): Enable speaker diarization to identify who spoke each word.
- language (optional): Language code (e.g. 'en', 'es', 'zh'). Auto-detected when omitted.
- audio_base64 (required): Base64-encoded audio data. Supports WAV, MP3, OGG, FLAC, and WebM formats.
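One way a client might consume a diarized result is to group words by speaker label. The response dict below is invented sample data shaped like the documented Returns structure; only the field names come from the listing.

```python
from collections import defaultdict

# Illustrative response matching the documented transcribe_audio_pro shape.
# The transcript content and timings here are made up for the example.
response = {
    "text": "hello there hi",
    "words": [
        {"word": "hello", "start": 0.0, "end": 0.4, "confidence": 0.98, "speaker": "SPEAKER_00"},
        {"word": "there", "start": 0.5, "end": 0.9, "confidence": 0.95, "speaker": "SPEAKER_00"},
        {"word": "hi", "start": 1.2, "end": 1.4, "confidence": 0.97, "speaker": "SPEAKER_01"},
    ],
    "speakers": {"count": 2, "labels": ["SPEAKER_00", "SPEAKER_01"]},
}

def words_by_speaker(resp: dict) -> dict:
    """Group transcribed words by their diarization speaker label.

    Words without a speaker label (diarize=false) fall under "UNKNOWN".
    """
    grouped = defaultdict(list)
    for w in resp["words"]:
        grouped[w.get("speaker") or "UNKNOWN"].append(w["word"])
    return dict(grouped)

turns = words_by_speaker(response)
print(turns)
```

Because speaker is null when diarize=false, code like this should always handle the missing-label case rather than assume every word carries a speaker.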
Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Adds substantial context beyond annotations: detailed output structure (word objects with timestamps/confidence), 99-language support specifics, quality metrics (WER, audio SNR), and processing characteristics. Aligns with readOnlyHint (transcription is non-destructive).

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Well-structured with clear sections (summary → args → returns). The length is justified by the absence of an output schema, which makes a detailed Returns section necessary, though some of the Args detail duplicates the complete schema.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Comprehensive coverage compensating for missing output schema: details return dict structure, nested word objects, speaker labels, metadata fields, and audio quality metrics—everything needed for agent to consume results.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema coverage is 100%, providing full parameter documentation. The description restates those schema details as a narrative 'Args:' section, adding minimal new semantics beyond the schema definitions and so meeting only the baseline expectation.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

Excellent specificity: identifies exact model (Whisper Large V3 Turbo), core function (multilingual STT), and distinguishes from sibling 'transcribe_audio' by listing advanced capabilities (word-level timestamps, confidence scores, diarization, ~2% WER).

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 3/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

Implicit differentiation through feature listing (speaker diarization, per-word timestamps) suggests this is the advanced option vs. basic 'transcribe_audio', but lacks explicit 'use this when...' guidance or direct sibling comparison.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
