Glama

transcribe_audio

Transcribe audio or video files on Windows using whisper.cpp. Supports MP3, WAV, and auto-converts other formats via FFmpeg. Optionally run in background or enable privacy mode.

Instructions

Transcribe a single audio or video file using whisper.cpp on Windows. Natively supports mp3 and wav. Automatically converts mp4, mkv, avi, mov, webm, m4a, flac, ogg etc. via FFmpeg — no manual conversion needed. Output defaults to timestamps format (with time codes). For files that may take more than 4 minutes, set background=true to run as a detached job and use check_progress to monitor it. ⚠️ Privacy: transcript text returned by this tool is processed by Claude's API. Pass privacy_mode=true to this tool to enable metadata-only responses per call — no transcript text will be transmitted. Set WHISPER_PRIVACY_MODE=true in env to enable globally. When privacy mode is active, a confirmation is required before every operation.
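As a rough illustration of the background and privacy switches described above, the following Python sketch assembles an arguments object for transcribe_audio. The helper names (effective_privacy_mode, build_arguments), the estimated_minutes input, and the case-insensitive parsing of WHISPER_PRIVACY_MODE are assumptions made for illustration; how the server itself parses the variable is not documented here.

```python
import os

# From the description: jobs expected to run longer than this should be detached.
BACKGROUND_THRESHOLD_MINUTES = 4


def effective_privacy_mode(privacy_mode=None, env=None):
    """Resolve privacy mode: a per-call value wins, otherwise use the global flag."""
    env = os.environ if env is None else env
    if privacy_mode is not None:
        return bool(privacy_mode)
    # Assumption: the global flag is switched on by the string "true".
    return env.get("WHISPER_PRIVACY_MODE", "").strip().lower() == "true"


def build_arguments(file_path, estimated_minutes, privacy_mode=None):
    """Build a transcribe_audio arguments dict (hypothetical helper)."""
    args = {"file_path": file_path}
    if estimated_minutes > BACKGROUND_THRESHOLD_MINUTES:
        # Detached job; the caller would then poll with check_progress.
        args["background"] = True
    if privacy_mode is not None:
        # Only send the override when the caller asked for one.
        args["privacy_mode"] = privacy_mode
    return args
```

Leaving privacy_mode out of the arguments lets the global WHISPER_PRIVACY_MODE setting apply, which matches the "omit to use global" behavior in the input schema.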

Input Schema

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| file_path | Yes | Absolute Windows path, e.g. C:\Users\You\Downloads\recording.mp4 | |
| model | No | Override model path. Leave blank to use active model. | |
| language | No | Language code (e.g. en, ja, es, fr) or 'auto' to detect automatically. Defaults to en. | en |
| output_format | No | timestamps = with time codes (default), text = plain, json = structured, srt = SRT subtitle file, vtt = WebVTT subtitle file, lrc = LRC lyrics/karaoke, csv = CSV with timestamps. | timestamps |
| threads | No | CPU threads. Defaults to 2 of 2. | |
| save_to_file | No | Save transcript as .txt next to the source file. | |
| background | No | Run as a detached background job. Returns a job ID immediately. Use check_progress to monitor. Recommended for files over 10 minutes. | |
| privacy_mode | No | Override privacy mode for this call. true = metadata only, no transcript text transmitted to API. false = return text (even if WHISPER_PRIVACY_MODE=true globally). Omit to use global WHISPER_PRIVACY_MODE setting. When active, requires confirmation before each operation. | |
| temperature | No | Sampling temperature 0.0–1.0. Default 0.0 (deterministic). | |
| prompt | No | Prior context string injected before transcription. Improves accuracy for domain-specific vocabulary or speaker names. Example: 'Names: Keemstar, DramaAlert.' | |
| condition_on_prev_text | No | Re-enable conditioning each segment on its own prior output. Default false. | |
| no_speech_thold | No | Confidence threshold below which segments are treated as silence. Default 0.6. | |
| beam_size | No | Beam search width. Higher = more accurate but slower. Default 5. | |
| best_of | No | Number of candidate sequences to evaluate. Default 5. | |
| gpu_device | No | GPU/Vulkan device index for multi-GPU systems. Overrides the WHISPER_GPU_DEVICE env default. Check whisper-cli's startup log for the index that lists your target card. | |
| processors | No | Number of parallel processors. Default 1. | |
| word_timestamps | No | Output one word per timestamped segment. Useful for clip alignment. | |
| max_segment_length | No | Maximum segment length in characters. | |
| split_on_word | No | Split segments at word boundaries. | |
| diarize | No | Stereo speaker diarization; requires stereo audio with speakers on separate channels. | |
| vad_model | No | Absolute path to a Silero VAD model .bin file. Strips silence before transcription. | |
| offset_t | No | Start transcription at this offset in milliseconds. | |
| duration | No | Process only this many milliseconds of audio from offset_t. | |
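Several of the constraints listed in the input schema can be checked on the client before a call is made. The sketch below is a minimal, hypothetical pre-flight check (validate_arguments is not part of the server); it covers only the constraints stated in the schema, plus one labeled assumption that millisecond values should not be negative.

```python
from pathlib import PureWindowsPath

# Output formats listed in the schema's output_format row.
OUTPUT_FORMATS = {"timestamps", "text", "json", "srt", "vtt", "lrc", "csv"}


def validate_arguments(args):
    """Return a list of problems; an empty list means every check passed."""
    problems = []

    file_path = args.get("file_path")
    if not file_path:
        problems.append("file_path is required")
    elif not PureWindowsPath(file_path).is_absolute():
        # PureWindowsPath applies Windows path rules on any host OS.
        problems.append("file_path must be an absolute Windows path")

    output_format = args.get("output_format", "timestamps")
    if output_format not in OUTPUT_FORMATS:
        problems.append(f"unknown output_format: {output_format}")

    temperature = args.get("temperature", 0.0)
    if not 0.0 <= temperature <= 1.0:
        problems.append("temperature must be between 0.0 and 1.0")

    # Assumption: negative offsets or durations are not meaningful.
    for key in ("offset_t", "duration"):
        if key in args and args[key] < 0:
            problems.append(f"{key} must be a non-negative number of milliseconds")

    return problems
```

For example, validate_arguments({"file_path": "recording.mp4"}) reports a problem because the path is relative, while a path such as C:\Users\You\Downloads\recording.mp4 passes.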
Behavior: 5/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations, the description carries the full burden of disclosure. It covers the key behaviors: the tool uses whisper.cpp, requires FFmpeg for format conversion, defaults to the timestamps output format, supports a privacy mode that requires confirmation, can run as a background job, and returns transcript text that is processed by Claude's API. This provides excellent transparency.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is a single dense paragraph but front-loads the core function and important caveats (privacy, background). Every sentence adds value, but it could be more structured (e.g., using bullet points) for easier scanning. It is not overly verbose given the tool's complexity.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's 23 parameters, 1 required, and no output schema, the description covers essential usage: supported formats, conversion, output types, privacy, background jobs, and references to sibling tools. It provides enough information for an agent to use the tool correctly without gaps.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

All 23 parameters have descriptions in the input schema (100% coverage), so the description's role is to add context. It does this well by explaining automatic format conversion, the default output format, the rationale for background jobs, the privacy mode override logic, and the global env variable. This adds value beyond the schema, though the schema already covers most of the same details.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's function: transcribing audio/video files using whisper.cpp on Windows. It specifies supported formats, automatic conversion via FFmpeg, and mentions output defaults, making the purpose unambiguous. The description also implicitly differentiates from sibling tools like transcribe_batch and check_progress by discussing background jobs and progress monitoring.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides clear guidance on when to use background mode for long files and when to use privacy_mode. It references check_progress for monitoring background jobs. It does not explicitly state when not to use this tool, and it does not compare the tool to all of its siblings, but the context given is sufficient for typical use.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/eviscerations/whisper-windows-mcp'
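The same request can be made from Python with only the standard library. This is a minimal sketch: server_url and fetch_server are hypothetical helper names, and it assumes the endpoint responds with a JSON document, which the text above does not state.

```python
import json
import urllib.request
from urllib.parse import quote

API_BASE = "https://glama.ai/api/mcp/v1/servers"


def server_url(owner, name):
    """Build the directory URL for one server, escaping each path segment."""
    return f"{API_BASE}/{quote(owner, safe='')}/{quote(name, safe='')}"


def fetch_server(owner, name, timeout=10):
    """GET the server record (performs a network request when called)."""
    # Assumption: the response body is JSON.
    with urllib.request.urlopen(server_url(owner, name), timeout=timeout) as response:
        return json.load(response)
```

Calling fetch_server("eviscerations", "whisper-windows-mcp") requests the same URL as the curl command above.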

If you have feedback or need assistance with the MCP directory API, please join our Discord server.