transcribe_audio
Transcribe audio or video files on Windows using whisper.cpp. Supports MP3, WAV, and auto-converts other formats via FFmpeg. Optionally run in background or enable privacy mode.
Instructions
Transcribe a single audio or video file using whisper.cpp on Windows. MP3 and WAV are supported natively. Other formats (mp4, mkv, avi, mov, webm, m4a, flac, ogg, etc.) are converted automatically via FFmpeg, so no manual conversion is needed. Output defaults to the timestamps format (with time codes).

For files that may take more than 4 minutes to process, set background=true to run the transcription as a detached job, then use check_progress to monitor it.

⚠️ Privacy: transcript text returned by this tool is processed by Claude's API. Pass privacy_mode=true to get a metadata-only response for that call; no transcript text will be transmitted. Set WHISPER_PRIVACY_MODE=true in the environment to enable this globally. When privacy mode is active, a confirmation is required before every operation.
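The precedence between the per-call privacy_mode argument and the global WHISPER_PRIVACY_MODE variable can be sketched as below. This only illustrates the documented rule and is not the tool's actual implementation. The function name `resolve_privacy_mode` is hypothetical, and treating only the string `true` (in any casing) as enabling the global setting is an assumption.

```python
import os
from typing import Mapping, Optional


def resolve_privacy_mode(
    per_call: Optional[bool],
    env: Mapping[str, str] = os.environ,
) -> bool:
    """Return True when only metadata should be returned (hypothetical helper)."""
    # An explicit per-call value always wins, including an explicit False.
    if per_call is not None:
        return per_call
    # Otherwise fall back to the global setting. Assumption: only "true"
    # (any casing) enables it; the real parsing rules are not documented here.
    return env.get("WHISPER_PRIVACY_MODE", "").strip().lower() == "true"
```

Passing `per_call=False` returns transcript text even when the global variable is set, which matches the override behavior described for the privacy_mode parameter.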
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| file_path | Yes | Absolute Windows path, e.g. C:\Users\You\Downloads\recording.mp4 | |
| model | No | Override model path. Leave blank to use active model. | |
| language | No | Language code (e.g. en, ja, es, fr) or 'auto' to detect automatically. Defaults to en. | en |
| output_format | No | timestamps = with time codes (default), text = plain, json = structured, srt = SRT subtitle file, vtt = WebVTT subtitle file, lrc = LRC lyrics/karaoke, csv = CSV with timestamps. | timestamps |
| threads | No | Number of CPU threads to use. Defaults to 2 of 2 available. | |
| save_to_file | No | Save transcript as .txt next to the source file. | |
| background | No | Run as a detached background job. Returns a job ID immediately. Use check_progress to monitor. Recommended for files over 10 minutes. | |
| privacy_mode | No | Override privacy mode for this call. true = metadata only, no transcript text transmitted to API. false = return text (even if WHISPER_PRIVACY_MODE=true globally). Omit to use global WHISPER_PRIVACY_MODE setting. When active, requires confirmation before each operation. | |
| temperature | No | Sampling temperature 0.0–1.0. Default 0.0 (deterministic). | 0.0 |
| prompt | No | Prior context string injected before transcription. Improves accuracy for domain-specific vocabulary or speaker names. Example: 'Names: Keemstar, DramaAlert.' | |
| condition_on_prev_text | No | Re-enable conditioning of each segment on the previously transcribed text. Default false. | false |
| no_speech_thold | No | No-speech probability threshold. Segments whose no-speech probability exceeds it are treated as silence. Default 0.6. | 0.6 |
| beam_size | No | Beam search width. Higher = more accurate but slower. Default 5. | 5 |
| best_of | No | Number of candidate sequences to evaluate. Default 5. | 5 |
| gpu_device | No | GPU/Vulkan device index for multi-GPU systems. Overrides the WHISPER_GPU_DEVICE env default. Check whisper-cli's startup log for the index that lists your target card. | |
| processors | No | Number of parallel processors. Default 1. | 1 |
| word_timestamps | No | Output one word per timestamped segment. Useful for clip alignment. | |
| max_segment_length | No | Maximum segment length in characters. | |
| split_on_word | No | Split segments at word boundaries. | |
| diarize | No | Stereo speaker diarization — requires stereo audio with speakers on separate channels. | |
| vad_model | No | Absolute path to a Silero VAD model .bin file. Strips silence before transcription. | |
| offset_t | No | Start transcription at this offset in milliseconds. | |
| duration | No | Process only this many milliseconds of audio from offset_t. | |
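Because offset_t and duration are both given in milliseconds, and duration counts from offset_t rather than from the start of the file, a small helper can avoid unit mistakes when transcribing a clip. The sketch below builds an arguments object for this tool from a start and end time. The helpers `to_ms` and `clip_arguments` are hypothetical, and how the resulting object is sent to the tool depends on your client.

```python
import json


def to_ms(minutes: int, seconds: float = 0.0) -> int:
    """Convert minutes and seconds to whole milliseconds (hypothetical helper)."""
    return int(round((minutes * 60 + seconds) * 1000))


def clip_arguments(file_path: str, start_ms: int, end_ms: int) -> dict:
    """Build arguments that transcribe only the span from start_ms to end_ms."""
    if end_ms <= start_ms:
        raise ValueError("end_ms must be greater than start_ms")
    return {
        "file_path": file_path,
        "output_format": "srt",
        "offset_t": start_ms,           # where to start, in milliseconds
        "duration": end_ms - start_ms,  # how much to process, in milliseconds
    }


# Transcribe from 1:30 to 4:00 of the example file from the table above.
args = clip_arguments(
    r"C:\Users\You\Downloads\recording.mp4",
    start_ms=to_ms(1, 30),
    end_ms=to_ms(4),
)
print(json.dumps(args, indent=2))
```

Here offset_t is 90000 and duration is 150000, so processing stops at the 4-minute mark of the source file.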