transcribe_audio
Transcribe audio or video files into text, timestamps, JSON, or SRT subtitles using Whisper on Windows. Supports MP3, WAV, MP4, MKV, and more; formats other than MP3 and WAV are converted automatically via FFmpeg.
Instructions
Transcribe a single audio or video file using whisper.cpp on Windows. MP3 and WAV are supported natively; other formats (mp4, mkv, avi, mov, webm, m4a, flac, ogg, etc.) are converted automatically via FFmpeg, so no manual conversion is needed. Output can be plain text, timestamped text, JSON, or an SRT subtitle file. For files whose transcription may take more than 4 minutes, set background=true to run as a detached job and use check_progress to monitor it.
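
The instructions above can be sketched as a small helper that assembles the arguments for a call. This is an illustrative sketch, not part of the tool: the function name `build_transcribe_args` and the `est_minutes` parameter (the caller's own guess at processing time) are hypothetical, and how the resulting dictionary is sent to the tool depends on your client.

```python
import json
from pathlib import PureWindowsPath

# Threshold taken from the instructions: jobs expected to run longer
# than this should be started with background=true.
BACKGROUND_THRESHOLD_MIN = 4

VALID_FORMATS = {"text", "timestamps", "json", "srt"}


def build_transcribe_args(file_path, output_format="text", language="en",
                          est_minutes=0.0):
    """Assemble an arguments dict for transcribe_audio (hypothetical helper).

    est_minutes is the caller's own estimate of processing time; the tool
    itself does not accept such a parameter.
    """
    path = PureWindowsPath(file_path)
    if not path.is_absolute():
        raise ValueError("file_path must be an absolute Windows path")
    if output_format not in VALID_FORMATS:
        raise ValueError(f"output_format must be one of {sorted(VALID_FORMATS)}")

    args = {
        "file_path": str(path),
        "language": language,
        "output_format": output_format,
    }
    # Long jobs run detached; the call then returns a job ID that
    # check_progress can monitor.
    if est_minutes > BACKGROUND_THRESHOLD_MIN:
        args["background"] = True
    return args


args = build_transcribe_args(
    r"C:\Users\You\Downloads\recording.mp4",
    output_format="srt",
    est_minutes=12,
)
print(json.dumps(args, indent=2))
```

Parameters left out of the dictionary fall back to the defaults listed in the input schema.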
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| file_path | Yes | Absolute Windows path, e.g. C:\Users\You\Downloads\recording.mp4 | |
| model | No | Override model path. Leave blank to use active model. | |
| language | No | Language code (e.g. en, ja, es, fr) or 'auto' to detect automatically. Defaults to en. | en |
| output_format | No | text = plain (default), timestamps = with time codes, json = structured, srt = subtitle file saved next to source. | text |
| threads | No | Number of CPU threads to use. Defaults to 2 (of 2 available). | |
| save_to_file | No | Save transcript as .txt next to the source file. | |
| background | No | Run as a detached background job. Returns a job ID immediately. Use check_progress to monitor. Recommended for files over 10 minutes. | |
| temperature | No | Sampling temperature 0.0–1.0. Default 0.0 (deterministic). Higher values reduce hallucination on noisy audio at the cost of consistency. | |
| prompt | No | Prior context string injected before transcription. Improves accuracy for domain-specific vocabulary, speaker names, or technical terms. Example: 'Names: Keemstar, DramaAlert.' | |
| condition_on_prev_text | No | Re-enable conditioning of each segment on the previously transcribed text (removes the --max-context 0 flag). Default false (off). Enable only for highly structured audio where context continuity helps. | |
| no_speech_thold | No | No-speech probability threshold. Segments whose no-speech probability exceeds this value are treated as silence rather than transcribed. Default 0.6. | |
| beam_size | No | Beam search width. Higher = more accurate but slower. Default 5. | |
| best_of | No | Number of candidate sequences to evaluate. Default 5. | |
| gpu_device | No | GPU device index for multi-GPU systems. Use check_system to see available GPUs. Default 0. | |
| processors | No | Number of parallel processors for chunk processing. Default 1. | |
| word_timestamps | No | Output one word per timestamped segment (sets --max-len 1 --split-on-word). Useful for clip alignment and precise timecode search. | |
| max_segment_length | No | Maximum segment length in characters. Controls line break frequency in output. Ignored when word_timestamps=true. | |
| split_on_word | No | Split segments at word boundaries rather than mid-word. Defaults to false. | |
| diarize | No | Stereo speaker diarization — labels left/right channel speakers in transcript. Requires stereo audio with speakers on separate channels. | |
| vad_model | No | Absolute path to a Silero VAD model .bin file. When provided, voice activity detection strips silence before transcription — reduces hallucinations and speeds up processing. Download via download_model. | |
| offset_t | No | Start transcription at this offset in milliseconds. Use to process a specific section of a file. | |
| duration | No | Process only this many milliseconds of audio starting from offset_t (or the beginning). Use with offset_t to target a specific time window. | |
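
Because offset_t and duration are both given in milliseconds, a time window written as timecodes has to be converted first. The sketch below shows one way to do that, along with a light check of two constraints stated in the table (the temperature range, and max_segment_length being ignored when word_timestamps=true). The helper names `to_ms`, `window_args`, and `check_args` are hypothetical and are not part of the tool.

```python
def to_ms(timecode):
    """Convert 'HH:MM:SS', 'MM:SS', or 'SS' (fractions allowed) to milliseconds."""
    parts = timecode.split(":")
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"unrecognized timecode: {timecode!r}")
    seconds = 0.0
    for part in parts:
        # Each step to the right is a factor of 60 smaller.
        seconds = seconds * 60 + float(part)
    return round(seconds * 1000)


def window_args(start, end):
    """Build offset_t/duration arguments for the window [start, end)."""
    offset = to_ms(start)
    stop = to_ms(end)
    if stop <= offset:
        raise ValueError("end must be later than start")
    return {"offset_t": offset, "duration": stop - offset}


def check_args(args):
    """Return warnings for argument combinations the table flags."""
    warnings = []
    temperature = args.get("temperature")
    if temperature is not None and not 0.0 <= temperature <= 1.0:
        warnings.append("temperature must be within 0.0-1.0")
    if args.get("word_timestamps") and "max_segment_length" in args:
        warnings.append("max_segment_length is ignored when word_timestamps=true")
    return warnings


# Transcribe only 00:12:30 to 00:15:00, one word per timestamped segment.
args = {
    "file_path": r"C:\Users\You\Downloads\recording.mp4",
    "output_format": "timestamps",
    "word_timestamps": True,
    **window_args("00:12:30", "00:15:00"),
}
print(args["offset_t"], args["duration"])  # 750000 150000
print(check_args(args))                    # []
```

Omitting duration while setting offset_t leaves the end of the window open, matching the table's description of duration as a limit counted from offset_t.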