transcribe_audio
Transcribe audio or video files using whisper.cpp on Windows. Supports common formats via automatic FFmpeg conversion. Output plain text, timestamps, JSON, or SRT subtitles. Background mode for long files.
Instructions
Transcribe a single audio or video file using whisper.cpp on Windows. Natively supports mp3 and wav. Automatically converts mp4, mkv, avi, mov, webm, m4a, flac, ogg etc. via FFmpeg — no manual conversion needed. Can output plain text, timestamps, JSON, or SRT subtitle files. For files that may take more than 4 minutes, set background=true to run as a detached job and use check_progress to monitor it.
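As a rough sketch of how the tool is driven (argument names come from the schema below; the file path is a placeholder, not a real file), a background-mode call looks like:

```typescript
// Hypothetical transcribe_audio tool-call payload. file_path is a placeholder.
// background: true makes the tool return a job ID immediately instead of
// blocking; check_progress is then polled with that job ID.
const backgroundCall = {
  name: "transcribe_audio",
  arguments: {
    file_path: "C:\\Users\\You\\Downloads\\recording.mp4",
    output_format: "text",
    background: true,
  },
};
console.log(backgroundCall.arguments.background); // true
```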
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| file_path | Yes | Absolute Windows path, e.g. C:\Users\You\Downloads\recording.mp4 | |
| model | No | Override model path. Leave blank to use active model. | |
| language | No | Language code (e.g. en, ja, es, fr) or 'auto' to detect automatically. Defaults to en. | en |
| output_format | No | text = plain (default), timestamps = with time codes, json = structured, srt = subtitle file saved next to source. | text |
| threads | No | CPU threads. Defaults to 2 of the 2 available system threads. | |
| save_to_file | No | Save transcript as .txt next to the source file. | |
| background | No | Run as a detached background job. Returns a job ID immediately. Use check_progress to monitor. Recommended for files over 10 minutes. | |
| temperature | No | Sampling temperature 0.0–1.0. Default 0.0 (deterministic). Higher values reduce hallucination on noisy audio at the cost of consistency. | |
| prompt | No | Prior context string injected before transcription. Improves accuracy for domain-specific vocabulary, speaker names, or technical terms. Example: 'Names: Keemstar, DramaAlert.' | |
| condition_on_prev_text | No | Re-enable conditioning each segment on its own prior output (removes --max-context 0 flag). Default false (off). Only enable for highly structured audio where context continuity helps. | |
| no_speech_thold | No | Confidence threshold below which segments are treated as silence rather than transcribed. Default 0.6. | |
| beam_size | No | Beam search width. Higher = more accurate but slower. Default 5. | |
| best_of | No | Number of candidate sequences to evaluate. Default 5. | |
| gpu_device | No | GPU device index for multi-GPU systems. Use check_system to see available GPUs. Default 0. | |
| processors | No | Number of parallel processors for chunk processing. Default 1. | |
| word_timestamps | No | Output one word per timestamped segment (sets --max-len 1 --split-on-word). Useful for clip alignment and precise timecode search. | |
| max_segment_length | No | Maximum segment length in characters. Controls line break frequency in output. Ignored when word_timestamps=true. | |
| split_on_word | No | Split segments at word boundaries rather than mid-word. Defaults to false. | |
| diarize | No | Stereo speaker diarization — labels left/right channel speakers in transcript. Requires stereo audio with speakers on separate channels. | |
| vad_model | No | Absolute path to a Silero VAD model .bin file. When provided, voice activity detection strips silence before transcription — reduces hallucinations and speeds up processing. Download via download_model. | |
| offset_t | No | Start transcription at this offset in milliseconds. Use to process a specific section of a file. | |
| duration | No | Process only this many milliseconds of audio starting from offset_t (or the beginning). Use with offset_t to target a specific time window. | |
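Tying several of these parameters together, an illustrative argument set for a quality-tuned run (every value here is an example, not a recommendation; both paths are placeholders):

```typescript
// Illustrative transcribe_audio arguments. All values are examples chosen to
// show how the quality/control parameters combine; paths are placeholders.
const exampleArgs = {
  file_path: "C:\\recordings\\interview.mkv",      // converted to WAV via FFmpeg
  language: "auto",                                // detect language automatically
  output_format: "srt",                            // subtitle file saved next to source
  prompt: "Names: Keemstar, DramaAlert.",          // domain vocabulary hint
  beam_size: 8,                                    // wider beam: slower, more accurate
  vad_model: "C:\\models\\silero-vad.bin",         // placeholder VAD model path
  offset_t: 60000,                                 // start at 1:00
  duration: 300000,                                // process 5 minutes of audio
};
console.log(Object.keys(exampleArgs).length);
```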
Implementation Reference
- src/index.ts:889-929 (registration): Registration of the transcribe_audio tool in the ListToolsRequestSchema handler. Defines the tool name, description, and input schema with all parameters (file_path, model, language, output_format, threads, save_to_file, background, temperature, prompt, condition_on_prev_text, no_speech_thold, beam_size, best_of, gpu_device, processors, word_timestamps, max_segment_length, split_on_word, diarize, vad_model, offset_t, duration).

```typescript
name: "transcribe_audio",
description:
  "Transcribe a single audio or video file using whisper.cpp on Windows. " +
  "Natively supports mp3 and wav. Automatically converts mp4, mkv, avi, mov, " +
  "webm, m4a, flac, ogg etc. via FFmpeg — no manual conversion needed. " +
  "Can output plain text, timestamps, JSON, or SRT subtitle files. " +
  "For files that may take more than 4 minutes, set background=true to run as a detached job " +
  "and use check_progress to monitor it.",
inputSchema: {
  type: "object",
  properties: {
    file_path: { type: "string", description: "Absolute Windows path, e.g. C:\\Users\\You\\Downloads\\recording.mp4" },
    model: { type: "string", description: "Override model path. Leave blank to use active model." },
    language: { type: "string", description: "Language code (e.g. en, ja, es, fr) or 'auto' to detect automatically. Defaults to en.", default: "en" },
    output_format: {
      type: "string",
      enum: ["text", "timestamps", "json", "srt"],
      description: "text = plain (default), timestamps = with time codes, json = structured, srt = subtitle file saved next to source.",
      default: "text",
    },
    threads: { type: "number", description: `CPU threads. Defaults to ${WHISPER_THREADS} of ${SYSTEM_THREADS}.` },
    save_to_file: { type: "boolean", description: "Save transcript as .txt next to the source file.", default: false },
    background: { type: "boolean", description: "Run as a detached background job. Returns a job ID immediately. Use check_progress to monitor. Recommended for files over 10 minutes.", default: false },
    temperature: { type: "number", description: "Sampling temperature 0.0–1.0. Default 0.0 (deterministic). Higher values reduce hallucination on noisy audio at the cost of consistency." },
    prompt: { type: "string", description: "Prior context string injected before transcription. Improves accuracy for domain-specific vocabulary, speaker names, or technical terms. Example: 'Names: Keemstar, DramaAlert.'" },
    condition_on_prev_text: { type: "boolean", description: "Re-enable conditioning each segment on its own prior output (removes --max-context 0 flag). Default false (off). Only enable for highly structured audio where context continuity helps.", default: false },
    no_speech_thold: { type: "number", description: "Confidence threshold below which segments are treated as silence rather than transcribed. Default 0.6.", default: 0.6 },
    beam_size: { type: "number", description: "Beam search width. Higher = more accurate but slower. Default 5." },
    best_of: { type: "number", description: "Number of candidate sequences to evaluate. Default 5." },
    gpu_device: { type: "number", description: "GPU device index for multi-GPU systems. Use check_system to see available GPUs. Default 0." },
    processors: { type: "number", description: "Number of parallel processors for chunk processing. Default 1." },
    word_timestamps: { type: "boolean", description: "Output one word per timestamped segment (sets --max-len 1 --split-on-word). Useful for clip alignment and precise timecode search.", default: false },
    max_segment_length: { type: "number", description: "Maximum segment length in characters. Controls line break frequency in output. Ignored when word_timestamps=true." },
    split_on_word: { type: "boolean", description: "Split segments at word boundaries rather than mid-word. Defaults to false.", default: false },
    diarize: { type: "boolean", description: "Stereo speaker diarization — labels left/right channel speakers in transcript. Requires stereo audio with speakers on separate channels.", default: false },
    vad_model: { type: "string", description: "Absolute path to a Silero VAD model .bin file. When provided, voice activity detection strips silence before transcription — reduces hallucinations and speeds up processing. Download via download_model." },
    offset_t: { type: "number", description: "Start transcription at this offset in milliseconds. Use to process a specific section of a file." },
    duration: { type: "number", description: "Process only this many milliseconds of audio starting from offset_t (or the beginning). Use with offset_t to target a specific time window." },
  },
  required: ["file_path"],
},
```

- src/index.ts:1554-1625 (handler): Main handler for the transcribe_audio tool. Extracts and validates all input parameters, builds extra WhisperOptions from advanced quality/control params, validates file path and config, then either spawns a background detached process (spawnDetached) or runs blocking transcription (transcribeSingle).
```typescript
if (name === "transcribe_audio") {
  const filePath = args?.file_path as string;
  const model = (args?.model as string) || WHISPER_MODEL;
  const language = (args?.language as string) || "en";
  const outputFormat = ((args?.output_format as string) || "text") as OutputFormat;
  const threads = Math.min(SYSTEM_THREADS, Math.max(1, Math.round((args?.threads as number) || WHISPER_THREADS)));
  const saveToFile = (args?.save_to_file as boolean) || false;
  const background = (args?.background as boolean) || false;

  // v2.2.0 quality and control params
  const extraOpts: Partial<WhisperOptions> = {};
  if (args?.temperature !== undefined) extraOpts.temperature = Number(args.temperature);
  if (args?.prompt) extraOpts.prompt = String(args.prompt);
  if (args?.condition_on_prev_text !== undefined) extraOpts.conditionOnPrevText = Boolean(args.condition_on_prev_text);
  if (args?.no_speech_thold !== undefined) extraOpts.noSpeechThold = Number(args.no_speech_thold);
  if (args?.beam_size !== undefined) extraOpts.beamSize = Number(args.beam_size);
  if (args?.best_of !== undefined) extraOpts.bestOf = Number(args.best_of);
  if (args?.gpu_device !== undefined) extraOpts.gpuDevice = Number(args.gpu_device);
  if (args?.processors !== undefined) extraOpts.processors = Number(args.processors);
  if (args?.word_timestamps) extraOpts.wordTimestamps = true;
  if (args?.max_segment_length !== undefined) extraOpts.maxLen = Number(args.max_segment_length);
  if (args?.split_on_word) extraOpts.splitOnWord = true;
  if (args?.diarize) extraOpts.diarize = true;
  if (args?.vad_model) extraOpts.vadModel = String(args.vad_model);
  if (args?.offset_t !== undefined) extraOpts.offsetT = Number(args.offset_t);
  if (args?.duration !== undefined) extraOpts.duration = Number(args.duration);

  if (!filePath) return { content: [{ type: "text", text: "file_path is required." }], isError: true };
  const pathError = validateInputPath(filePath);
  if (pathError) return { content: [{ type: "text", text: pathError }], isError: true };
  if (!existsSync(filePath)) return { content: [{ type: "text", text: `File not found: ${filePath}` }], isError: true };
  const configError = validatePaths();
  if (configError) return { content: [{ type: "text", text: configError }], isError: true };

  // Background mode — detached process, returns immediately
  if (background) {
    if (await isWhisperRunning()) {
      return {
        content: [{ type: "text", text: "Transcription already in progress. Wait for the current job to finish before starting another." }],
        isError: true,
      };
    }
    try {
      const { jobId, pid } = await spawnDetached(filePath, model, language, threads, outputFormat === "srt" ? "srt" : "text", extraOpts);
      return {
        content: [{
          type: "text",
          text: `⏳ Background transcription started.\n\n` +
            `Source: ${basename(filePath)}\n` +
            `Job ID: ${jobId}\n` +
            `PID: ${pid}\n\n` +
            `Call check_progress with job_id="${jobId}" to monitor progress.\n` +
            `Output will be saved to: ${filePath.replace(/\.[^.]+$/, ".txt")}`,
        }],
      };
    } catch (err: any) {
      return { content: [{ type: "text", text: `Failed to start background job:\n\n${err?.message || String(err)}` }], isError: true };
    }
  }

  // Blocking mode (default)
  try {
    const result = await transcribeSingle(filePath, model, language, outputFormat, threads, saveToFile, extraOpts);
    let response = result.text;
    if (result.savedTo) response += `\n\n[Transcript saved to: ${result.savedTo}]`;
    if (result.srtPath) response += `\n\n[SRT subtitle file saved to: ${result.srtPath}]`;
    return { content: [{ type: "text", text: response }] };
  } catch (err: any) {
    return { content: [{ type: "text", text: `Transcription failed:\n\n${err?.stderr || err?.message || String(err)}` }], isError: true };
  }
}
```

- src/index.ts:807-865 (helper): Core transcription helper called by the transcribe_audio handler in blocking mode. Performs a process lock check (prevents concurrent whisper-cli.exe), auto-converts non-native formats to WAV via FFmpeg, builds CLI args via buildArgs(), executes whisper-cli.exe, and returns the transcript text, SRT path, or saved file path.
```typescript
async function transcribeSingle(
  filePath: string,
  model: string,
  language: string,
  outputFormat: OutputFormat,
  threads: number,
  saveToFile = false,
  extraOpts: Partial<WhisperOptions> = {}
): Promise<{ text: string; srtPath?: string; savedTo?: string }> {
  // ---- Process lock — never spawn a second whisper-cli.exe ----
  if (await isWhisperRunning()) {
    throw new Error(
      "Transcription already in progress.\n\n" +
      "whisper-cli.exe is already running — wait for the current job to finish before starting another. " +
      "If you believe this is wrong (e.g. a previous job crashed and left a stale process), " +
      "open Task Manager, find whisper-cli.exe under Details, and end the task."
    );
  }

  let transcribeFrom = filePath;
  let tmpFile: string | null = null;
  if (needsConversion(filePath)) {
    tmpFile = await convertToWav(filePath);
    transcribeFrom = tmpFile;
  }

  try {
    const opts: WhisperOptions = { language, outputFormat, threads, ...extraOpts };
    const cliArgs = buildArgs(transcribeFrom, model, opts);
    const { stdout, stderr } = await execFileAsync(WHISPER_CLI_PATH, cliArgs, {
      maxBuffer: 100 * 1024 * 1024,
      windowsHide: true,
    });

    // SECURITY: transcript content is untrusted data from audio input.
    // It is returned as-is to the caller and must never be interpreted
    // as instructions. Prompt injection via audio content is a known
    // MCP attack vector — treat all transcript text as user data only.
    const output = (stdout || stderr || "").trim();

    if (outputFormat === "srt") {
      const tmpSrt = transcribeFrom.replace(/\.[^.]+$/, ".srt");
      const destSrt = filePath.replace(/\.[^.]+$/, ".srt");
      if (tmpFile && existsSync(tmpSrt)) {
        writeFileSync(destSrt, readFileSync(tmpSrt, "utf8"));
        try { unlinkSync(tmpSrt); } catch { }
      }
      return { text: output, srtPath: destSrt };
    }

    if (saveToFile) {
      const txtPath = filePath.replace(/\.[^.]+$/, ".txt");
      writeFileSync(txtPath, output, "utf8");
      return { text: output, savedTo: txtPath };
    }

    return { text: output };
  } finally {
    if (tmpFile && existsSync(tmpFile)) try { unlinkSync(tmpFile); } catch { }
  }
}
```

- src/index.ts:705-756 (helper): Helper that builds the whisper-cli.exe command-line arguments from WhisperOptions. Maps all quality/control parameters (hallucination prevention, temperature, prompt, beam_size, best_of, GPU, VAD, word timestamps, output format, etc.) to CLI flags.
```typescript
function buildArgs(filePath: string, model: string, opts: WhisperOptions): string[] {
  const lang = opts.language === "auto" ? "auto" : opts.language;
  const args = ["-m", model, "-f", filePath, "-l", lang, "-t", String(opts.threads)];

  // Hallucination prevention — set max context tokens to 0 to prevent whisper
  // from conditioning each segment on its own prior output, which causes
  // repetitive hallucination loops on noisy or silent audio.
  // Flag: --max-context 0 (user can re-enable by setting conditionOnPrevText=true)
  if (!opts.conditionOnPrevText) args.push("--max-context", "0");

  // Treat segments below this confidence threshold as silence rather than
  // hallucinating content. Confirmed valid flag in whisper-cli (-nth).
  args.push("--no-speech-thold", String(opts.noSpeechThold ?? 0.6));

  if (opts.translate) args.push("--translate");
  if (opts.temperature !== undefined) args.push("--temperature", String(opts.temperature));
  if (opts.prompt) args.push("--prompt", opts.prompt);
  if (opts.beamSize !== undefined) args.push("--beam-size", String(opts.beamSize));
  if (opts.bestOf !== undefined) args.push("--best-of", String(opts.bestOf));
  if (opts.gpuDevice !== undefined) args.push("-g", String(opts.gpuDevice));
  if (opts.processors !== undefined && opts.processors > 1) args.push("-p", String(opts.processors));
  if (opts.offsetT !== undefined) args.push("--offset-t", String(opts.offsetT));
  if (opts.duration !== undefined) args.push("--duration", String(opts.duration));
  if (opts.diarize) args.push("--diarize");

  // word_timestamps: sets max-len=1 + split-on-word for per-word output
  // without requiring JSON parsing — simpler than -oj approach.
  if (opts.wordTimestamps) {
    args.push("--max-len", "1", "--split-on-word");
  } else {
    if (opts.maxLen !== undefined) args.push("--max-len", String(opts.maxLen));
    if (opts.splitOnWord) args.push("--split-on-word");
  }

  // VAD: voice activity detection — strips silence before whisper sees the audio
  if (opts.vadModel && existsSync(opts.vadModel)) {
    args.push("--vad", "--vad-model", opts.vadModel);
  }

  // Output format
  if (opts.outputFormat === "srt") {
    args.push("-osrt", "-of", filePath.replace(/\.[^.]+$/, ""));
  } else if (opts.outputFormat === "json") {
    args.push("-oj");
  } else if (opts.outputFormat === "text") {
    args.push("--no-timestamps");
  }
  // "timestamps" format: no flag — whisper default stdout includes timestamps

  return args;
}
```

- src/index.ts:132-234 (helper): Background job spawning helper used by transcribe_audio when background=true. Spawns whisper-cli as a detached process with all quality/control parameters, writes job metadata to disk, and returns a job ID for progress tracking via check_progress.
```typescript
async function spawnDetached(
  filePath: string,
  model: string,
  language: string,
  threads: number,
  outputFormat: "text" | "srt" = "text",
  extraOpts: Partial<WhisperOptions> = {}
): Promise<{ jobId: string; pid: number }> {
  ensureJobsDir();
  const jobId = `job_${Date.now()}`;
  const logPath = join(JOBS_DIR, `${jobId}.log`);
  const jobPath = join(JOBS_DIR, `${jobId}.json`);

  // Convert to WAV first if needed (fast, blocking)
  let transcribeFrom = filePath;
  let isTmp = false;
  if (needsConversion(filePath)) {
    transcribeFrom = await convertToWav(filePath);
    isTmp = true;
  }

  // Use a clean ASCII job-ID-based output path to avoid Unicode filename issues.
  // After completion, readJobProgress will move the file to the correct destination.
  const tmpOutputBase = join(JOBS_DIR, jobId);

  // Determine final destination path
  const sourceBase = filePath.replace(/\.[^.]+$/, "");
  const ext = outputFormat === "srt" ? ".srt" : ".txt";
  const outputPath =
    outputFormat === "srt" && language !== "en" && language !== "auto"
      ? `${sourceBase}.${language}.srt`
      : `${sourceBase}${ext}`;

  // Build args using shared options — ensures quality flags are always applied
  // in background mode, matching blocking mode behaviour.
  const lang = language === "auto" ? "auto" : language;
  const args = [
    "-m", model,
    "-f", transcribeFrom,
    "-l", lang,
    "-t", String(threads),
    // Hallucination prevention — must be in background mode too.
    // --max-context 0 prevents conditioning on prior segment output.
    ...(extraOpts.conditionOnPrevText ? [] : ["--max-context", "0"]),
    // Confirmed valid flag (-nth). Suppresses silent segments from hallucinating.
    "--no-speech-thold", String(extraOpts.noSpeechThold ?? 0.6),
  ];
  if (extraOpts.temperature !== undefined) args.push("--temperature", String(extraOpts.temperature));
  if (extraOpts.prompt) args.push("--prompt", extraOpts.prompt);
  if (extraOpts.beamSize !== undefined) args.push("--beam-size", String(extraOpts.beamSize));
  if (extraOpts.bestOf !== undefined) args.push("--best-of", String(extraOpts.bestOf));
  if (extraOpts.gpuDevice !== undefined) args.push("-g", String(extraOpts.gpuDevice));
  if (extraOpts.processors !== undefined && extraOpts.processors > 1) args.push("-p", String(extraOpts.processors));
  if (extraOpts.offsetT !== undefined) args.push("--offset-t", String(extraOpts.offsetT));
  if (extraOpts.duration !== undefined) args.push("--duration", String(extraOpts.duration));
  if (extraOpts.diarize) args.push("--diarize");
  if (extraOpts.vadModel && existsSync(extraOpts.vadModel)) args.push("--vad", "--vad-model", extraOpts.vadModel);
  if (extraOpts.wordTimestamps) {
    args.push("--max-len", "1", "--split-on-word");
  } else {
    if (extraOpts.maxLen !== undefined) args.push("--max-len", String(extraOpts.maxLen));
    if (extraOpts.splitOnWord) args.push("--split-on-word");
  }

  // Output format
  if (outputFormat === "srt") {
    args.push("-osrt", "-of", tmpOutputBase);
  } else {
    args.push("-otxt", "-of", tmpOutputBase);
  }

  // Spawn detached, redirect stdout+stderr to log file
  const logFd = openSync(logPath, "w");
  const child = spawn(WHISPER_CLI_PATH, args, {
    detached: true,
    stdio: ["ignore", logFd, logFd],
    windowsHide: true,
  });
  closeSync(logFd);
  child.unref();
  const pid = child.pid ?? 0;

  const job: Job = {
    jobId,
    pid,
    sourceFile: filePath,
    transcribeFrom,
    isTmp,
    outputPath,
    tmpOutputBase,
    outputFormat,
    logPath,
    jobPath,
    startTime: new Date().toISOString(),
    model,
    language,
    threads,
    durationSec: 0,
    status: "running",
  };
  writeFileSync(jobPath, JSON.stringify(job, null, 2), "utf8");
  return { jobId, pid };
}
```
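To make the flag mapping concrete, here is a simplified standalone sketch (not the real buildArgs, which lives in src/index.ts) of the default argument set produced for a plain-text run with no advanced options set; the model and file paths are placeholders:

```typescript
// Simplified sketch of the defaults buildArgs() emits for output_format = "text"
// with all advanced options left unset. Paths are placeholders, not real files.
function defaultTextArgs(model: string, file: string, threads: number): string[] {
  return [
    "-m", model,
    "-f", file,
    "-l", "en",
    "-t", String(threads),
    "--max-context", "0",       // hallucination prevention (condition_on_prev_text off)
    "--no-speech-thold", "0.6", // treat low-confidence segments as silence
    "--no-timestamps",          // "text" format strips time codes
  ];
}

const argv = defaultTextArgs("C:\\models\\ggml-base.en.bin", "C:\\audio\\a.wav", 2);
console.log(argv.join(" "));
```

Note that the "timestamps" format corresponds to omitting the final flag entirely, since whisper-cli's default stdout already includes time codes.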