transcribe-audio

Transcribes audio from URLs, base64, or local files using Whisper, with support for large files via chunked upload and options for language, timestamps, and async processing.

Instructions

Transcribes audio via Whisper. Preferred: audio_url (most token-efficient; server fetches bytes). audio_base64 is for small clips only (<= ~60KB raw per call). audio_path only works when the MCP host shares a filesystem with the caller (often false on Claude.ai / Claude Code). For larger payloads in sandboxed environments, use transcribe_upload_start / transcribe_upload_append / transcribe_upload_finalize. Server re-encodes to Opus 16kHz mono 16kbps before Whisper unless skip_compression=true. Long audio (>5min) or async=true returns a job_id; poll transcribe_get_job.
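The input preferences above can be sketched as a small selection helper. This is a minimal sketch, not the server's own logic: the function name `choose_audio_input`, the `shares_filesystem` flag, and the exact 60,000-byte cutoff are assumptions made for illustration (the description only says "<= ~60KB raw"), and the relative order of audio_path versus audio_base64 is a guess.

```python
import base64

# Approximate single-call limit for audio_base64, taken from the description
# ("<= ~60KB raw per call"). The exact byte count is an assumption.
MAX_INLINE_RAW_BYTES = 60_000


def choose_audio_input(audio_url=None, raw_bytes=None, audio_path=None,
                       shares_filesystem=False):
    """Return tool arguments for one audio source, or None when the
    chunked upload tools should be used instead (hypothetical helper)."""
    # Preferred: let the server fetch the bytes itself.
    if audio_url:
        return {"audio_url": audio_url}
    # audio_path only works if the MCP host can see the caller's filesystem.
    if audio_path and shares_filesystem:
        return {"audio_path": audio_path}
    # Small clips can be sent inline as base64.
    if raw_bytes is not None and len(raw_bytes) <= MAX_INLINE_RAW_BYTES:
        return {"audio_base64": base64.b64encode(raw_bytes).decode("ascii")}
    # Anything else: fall back to transcribe_upload_start / append / finalize.
    return None
```

A `None` result signals that the payload is too large for a single call and no URL or shared path is available.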

Input Schema

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| audio_path | No | Absolute path to a local audio file on the MCP host (often unusable from sandboxed clients). | |
| audio_base64 | No | Base64-encoded audio payload (single-call max ~60KB raw; use chunked upload for larger). | |
| audio_resource_uri | No | Audio resource URI. Supported schemes: file:// and data:...;base64,... | |
| audio_url | No | HTTP(S) URL the server will fetch (requires TRANSCRIPT_MCP_URL_ALLOWLIST). | |
| filename | No | Optional filename hint (used when magic-byte detection is inconclusive). | |
| skip_compression | No | If true, skip Opus 16kbps recompression (caller already optimized). | false |
| engine | No | Transcription engine preference. 'auto' uses OpenAI first and falls back to local whisper when available. | |
| language | No | Language code for transcription (e.g., 'en', 'es', 'fr'). | auto-detect |
| include_timestamps | No | When as_text=true, include [MM:SS] timestamps in the plain text output. | true |
| as_text | No | If true, return only the joined transcript string. If false, return structured JSON. | false |
| async | No | If true, always enqueue an async job (returns job_id). | false |

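As an illustration of how these parameters might be passed, the sketch below wraps a set of arguments in an MCP `tools/call` JSON-RPC request. The tool name `transcribe-audio` is taken from this page's title and may differ from the name the server registers; the helper `build_tool_call` and the example URL are hypothetical.

```python
import json

# Parameter names as listed in the input schema table above.
KNOWN_PARAMETERS = {
    "audio_path", "audio_base64", "audio_resource_uri", "audio_url",
    "filename", "skip_compression", "engine", "language",
    "include_timestamps", "as_text", "async",
}


def build_tool_call(arguments, request_id=1, tool_name="transcribe-audio"):
    """Build an MCP tools/call request body (hypothetical helper).

    The tool name is an assumption based on this page's title.
    """
    unknown = set(arguments) - KNOWN_PARAMETERS
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Example: plain-text transcript with [MM:SS] timestamps for an English recording.
request = build_tool_call({
    "audio_url": "https://example.com/meeting.mp3",  # placeholder URL
    "language": "en",
    "as_text": True,
    "include_timestamps": True,
})
print(json.dumps(request, indent=2))
```

In practice an MCP client library would normally construct this envelope; the sketch only shows where each schema parameter ends up.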
Behavior: 5/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Despite the absence of annotations, the description fully discloses server-side behavior: the server fetches the audio bytes, re-encodes to Opus 16kHz mono 16kbps unless skip_compression=true, and returns a job_id for long audio or async=true, to be polled via transcribe_get_job. No contradictions.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is dense and front-loaded with the most critical information (preferred method, limits, async). Slightly lengthy but every sentence earns its place; no redundancy.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

With 11 parameters and no output schema, the description covers input methods, async handling, job polling, and compression. It lacks explicit output format details, but the as_text parameter clarifies the two return types. Sufficient for a complex tool.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 5/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

The schema has 100% description coverage, but the tool description adds significant value: token efficiency for audio_url, the size limit for audio_base64 (<= ~60KB raw), the semantics of skip_compression, and async behavior. It goes well beyond the schema.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

Clearly states it transcribes audio using Whisper, and distinguishes between sibling tools for upload and job polling (transcribe_upload_start/append/finalize, transcribe_get_job). Specific verb+resource+method.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 5/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

Explicitly advises when to use each input method (audio_url most efficient, audio_base64 for small clips, audio_path only with shared filesystem), and when to use chunked upload vs this tool. Also covers async usage.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
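For the chunked upload path mentioned in the guidance above, a caller first has to split the audio into pieces that fit the per-call limit. The sketch below only does the splitting: the argument names of transcribe_upload_start, transcribe_upload_append, and transcribe_upload_finalize are not given on this page, so no calls to them are shown. The helper name `iter_base64_chunks` and the 60,000-byte chunk size are assumptions.

```python
import base64


def iter_base64_chunks(raw_bytes, max_raw_bytes=60_000):
    """Yield base64 strings, each encoding at most max_raw_bytes of audio.

    Hypothetical helper. With a chunk size that is a multiple of 3, no
    padding appears mid-stream, so the chunks concatenate to the same
    string as encoding the whole payload at once.
    """
    if max_raw_bytes <= 0:
        raise ValueError("max_raw_bytes must be positive")
    for offset in range(0, len(raw_bytes), max_raw_bytes):
        piece = raw_bytes[offset:offset + max_raw_bytes]
        yield base64.b64encode(piece).decode("ascii")


# Example with synthetic bytes standing in for real audio.
data = bytes(range(256)) * 500  # 128,000 bytes
chunks = list(iter_base64_chunks(data))
```

Each yielded string would then be handed to the append step, one call per chunk, in whatever argument shape the upload tools actually expect.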

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/JamesANZ/transcript-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.