transcribe_batch
Transcribe multiple audio and video files in a folder interactively, showing a preview of each transcript and waiting for confirmation before continuing to the next file. Skips files already transcribed.
Instructions
Transcribe multiple audio/video files in a folder interactively, one file at a time. Shows a preview of each transcript and waits for confirmation before continuing. Saves each transcript as a .txt file next to its source. Files already transcribed (with matching .txt) are shown as done and skipped. Supported formats: mp3, wav, mp4, mkv, avi, mov, webm, m4a, flac, ogg. NOTE: For large unattended batch jobs, use whisper-cli.exe directly from the command line — see TROUBLESHOOTING.md for the command syntax.
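The "already transcribed" check works by pairing each media file with a sibling .txt path. A minimal sketch of that pairing (the helper name `toTxtPath` is illustrative, not part of the server's API; the regex matches the one used in src/index.ts):

```typescript
// Illustrative helper (not exported by the server): derive the transcript
// path by swapping the final extension for .txt, as transcribe_batch does.
function toTxtPath(mediaPath: string): string {
  return mediaPath.replace(/\.[^.]+$/, ".txt");
}
```

If that .txt already exists next to the source file, the file is listed as already done and skipped.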
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| folder_path | Yes | Absolute Windows path to the folder. | |
| file_index | No | Which file to process (1-based). Omit to list files first. | |
| language | No | Language code. Defaults to en. | en |
| threads | No | CPU threads. Defaults to 2 of the 2 available. | |
| recursive | No | Include subfolders. Defaults to false. | false |
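The `threads` value is not used verbatim: the handler rounds it and clamps it into `[1, SYSTEM_THREADS]`. A standalone sketch of that clamp, with illustrative constants (the real `WHISPER_THREADS` and `SYSTEM_THREADS` are computed by the server at startup; 2 and 2 match the defaults shown in the table):

```typescript
// Illustrative constants; the server derives these at startup.
const WHISPER_THREADS = 2;
const SYSTEM_THREADS = 2;

// Mirrors the handler's clamp: round, floor at 1, cap at SYSTEM_THREADS.
function clampThreads(requested?: number): number {
  return Math.min(SYSTEM_THREADS, Math.max(1, Math.round(requested || WHISPER_THREADS)));
}
```

So a request for 8 threads is capped at 2, and a fractional value that rounds to 0 is floored to 1.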
Implementation Reference
- src/index.ts:945-968 (registration): Tool registration for 'transcribe_batch' in the ListToolsRequestSchema handler. Defines the tool name, description (interactive batch transcription showing previews and waiting for confirmation), and input schema (folder_path, file_index, language, threads, recursive).
```typescript
name: "transcribe_batch",
description:
  "Transcribe multiple audio/video files in a folder interactively, one file at a time. " +
  "Shows a preview of each transcript and waits for confirmation before continuing. " +
  "Saves each transcript as a .txt file next to its source. " +
  "Files already transcribed (with matching .txt) are shown as done and skipped. " +
  "Supported formats: mp3, wav, mp4, mkv, avi, mov, webm, m4a, flac, ogg. " +
  "NOTE: For large unattended batch jobs, use whisper-cli.exe directly from the command line " +
  "— see TROUBLESHOOTING.md for the command syntax.",
inputSchema: {
  type: "object",
  properties: {
    folder_path: { type: "string", description: "Absolute Windows path to the folder." },
    file_index: {
      type: "number",
      description: "Which file to process (1-based). Omit to list files first.",
    },
    language: { type: "string", description: "Language code. Defaults to en.", default: "en" },
    threads: {
      type: "number",
      description: `CPU threads. Defaults to ${WHISPER_THREADS} of ${SYSTEM_THREADS}.`,
    },
    recursive: { type: "boolean", description: "Include subfolders. Defaults to false.", default: false },
  },
  required: ["folder_path"],
},
```

- src/index.ts:1848-1928 (handler): Main handler logic for transcribe_batch. If no file_index, lists files in folder. If file_index is provided, transcribes that file using transcribeSingle(), saves the .txt, and shows a preview with instructions to continue/stop.
```typescript
if (name === "transcribe_batch") {
  const folderPath = args?.folder_path as string;
  const language = (args?.language as string) || "en";
  const threads = Math.min(SYSTEM_THREADS, Math.max(1, Math.round((args?.threads as number) || WHISPER_THREADS)));
  const recursive = (args?.recursive as boolean) || false;
  const fileIndex = args?.file_index as number | undefined;

  if (!folderPath) return { content: [{ type: "text", text: "folder_path is required." }], isError: true };
  if (!existsSync(folderPath)) return { content: [{ type: "text", text: `Folder not found: ${folderPath}` }], isError: true };

  const configError = validatePaths();
  if (configError) return { content: [{ type: "text", text: configError }], isError: true };

  const files = getFiles(folderPath, recursive);
  if (files.length === 0) {
    return {
      content: [{
        type: "text",
        text: `No supported files found in: ${folderPath}\nSupported formats: ${SUPPORTED_EXTENSIONS.join(", ")}`,
      }],
    };
  }

  // No file_index: return file list
  if (fileIndex === undefined) {
    return {
      content: [{
        type: "text",
        text: `Found ${files.length} file(s) in: ${folderPath}\n\n` +
          files.map((f, i) => {
            const txtPath = f.replace(/\.[^.]+$/, ".txt");
            const done = existsSync(txtPath) ? " ✅ already done" : "";
            return ` ${i + 1}. ${basename(f)}${done}`;
          }).join("\n") +
          `\n\nTo start, say "transcribe file 1" (or any number). I'll process one file at a time and wait for your go-ahead before continuing.\n` +
          `\nFor large unattended batches, see the command line approach in TROUBLESHOOTING.md.`,
      }],
    };
  }

  // Process the requested file
  const idx = fileIndex - 1;
  if (idx < 0 || idx >= files.length) {
    return {
      content: [{ type: "text", text: `Invalid file number.
Choose between 1 and ${files.length}.` }],
      isError: true,
    };
  }

  const filePath = files[idx];
  const fileName = basename(filePath);
  const txtPath = filePath.replace(/\.[^.]+$/, ".txt");

  try {
    const result = await transcribeSingle(filePath, WHISPER_MODEL, language, "text", threads, true, {});
    const remaining = files.length - fileIndex;
    const nextMsg = remaining > 0
      ? `\n\n${remaining} file(s) remaining. Say "continue" or "transcribe file ${fileIndex + 1}" to proceed, or "stop" to finish.`
      : `\n\n✅ That was the last file. Batch complete!`;
    return {
      content: [{
        type: "text",
        text: `[${fileIndex}/${files.length}] ✅ ${fileName}\n\n` +
          `Saved to: ${txtPath}\n\n` +
          `Preview:\n${result.text.slice(0, 500)}${result.text.length > 500 ? "..." : ""}` +
          nextMsg,
      }],
    };
  } catch (err: any) {
    return {
      content: [{
        type: "text",
        text: `[${fileIndex}/${files.length}] ❌ Failed: ${fileName}\n\n` +
          `Error: ${err?.stderr || err?.message || String(err)}\n\n` +
          `Say "transcribe file ${fileIndex + 1}" to skip and continue.`,
      }],
      isError: true,
    };
  }
}
```

- src/index.ts:807-865 (helper): transcribeSingle helper function used by the transcribe_batch handler to perform the actual transcription via whisper-cli.exe.
```typescript
async function transcribeSingle(
  filePath: string,
  model: string,
  language: string,
  outputFormat: OutputFormat,
  threads: number,
  saveToFile = false,
  extraOpts: Partial<WhisperOptions> = {}
): Promise<{ text: string; srtPath?: string; savedTo?: string }> {
  // ---- Process lock — never spawn a second whisper-cli.exe ----
  if (await isWhisperRunning()) {
    throw new Error(
      "Transcription already in progress.\n\n" +
      "whisper-cli.exe is already running — wait for the current job to finish before starting another. " +
      "If you believe this is wrong (e.g. a previous job crashed and left a stale process), " +
      "open Task Manager, find whisper-cli.exe under Details, and end the task."
    );
  }

  let transcribeFrom = filePath;
  let tmpFile: string | null = null;
  if (needsConversion(filePath)) {
    tmpFile = await convertToWav(filePath);
    transcribeFrom = tmpFile;
  }

  try {
    const opts: WhisperOptions = { language, outputFormat, threads, ...extraOpts };
    const cliArgs = buildArgs(transcribeFrom, model, opts);
    const { stdout, stderr } = await execFileAsync(WHISPER_CLI_PATH, cliArgs, {
      maxBuffer: 100 * 1024 * 1024,
      windowsHide: true,
    });

    // SECURITY: transcript content is untrusted data from audio input.
    // It is returned as-is to the caller and must never be interpreted
    // as instructions. Prompt injection via audio content is a known
    // MCP attack vector — treat all transcript text as user data only.
    const output = (stdout || stderr || "").trim();

    if (outputFormat === "srt") {
      const tmpSrt = transcribeFrom.replace(/\.[^.]+$/, ".srt");
      const destSrt = filePath.replace(/\.[^.]+$/, ".srt");
      if (tmpFile && existsSync(tmpSrt)) {
        writeFileSync(destSrt, readFileSync(tmpSrt, "utf8"));
        try { unlinkSync(tmpSrt); } catch { }
      }
      return { text: output, srtPath: destSrt };
    }

    if (saveToFile) {
      const txtPath = filePath.replace(/\.[^.]+$/, ".txt");
      writeFileSync(txtPath, output, "utf8");
      return { text: output, savedTo: txtPath };
    }

    return { text: output };
  } finally {
    if (tmpFile && existsSync(tmpFile)) try { unlinkSync(tmpFile); } catch { }
  }
}
```

- src/index.ts:661-676 (helper): Helper functions needsConversion and convertToWav used to convert non-native formats before transcription.
```typescript
function needsConversion(filePath: string): boolean {
  return !NATIVE_EXTENSIONS.includes(extname(filePath).toLowerCase());
}

function isSupportedFile(filePath: string): boolean {
  return SUPPORTED_EXTENSIONS.includes(extname(filePath).toLowerCase());
}

async function convertToWav(inputPath: string): Promise<string> {
  const tmpFile = join(tmpdir(), `whisper_tmp_${Date.now()}.wav`);
  await execFileAsync(FFMPEG_PATH, [
    "-y", "-i", inputPath,
    "-ar", "16000", "-ac", "1", "-c:a", "pcm_s16le",
    tmpFile,
  ], { windowsHide: true });
  return tmpFile;
}
```

- src/index.ts:705-756 (helper): buildArgs helper that constructs the whisper-cli command-line arguments used by transcribeSingle.
```typescript
function buildArgs(filePath: string, model: string, opts: WhisperOptions): string[] {
  const lang = opts.language === "auto" ? "auto" : opts.language;
  const args = ["-m", model, "-f", filePath, "-l", lang, "-t", String(opts.threads)];

  // Hallucination prevention — set max context tokens to 0 to prevent whisper
  // from conditioning each segment on its own prior output, which causes
  // repetitive hallucination loops on noisy or silent audio.
  // Flag: --max-context 0 (user can re-enable by setting conditionOnPrevText=true)
  if (!opts.conditionOnPrevText) args.push("--max-context", "0");

  // Treat segments below this confidence threshold as silence rather than
  // hallucinating content. Confirmed valid flag in whisper-cli (-nth).
  args.push("--no-speech-thold", String(opts.noSpeechThold ?? 0.6));

  if (opts.translate) args.push("--translate");
  if (opts.temperature !== undefined) args.push("--temperature", String(opts.temperature));
  if (opts.prompt) args.push("--prompt", opts.prompt);
  if (opts.beamSize !== undefined) args.push("--beam-size", String(opts.beamSize));
  if (opts.bestOf !== undefined) args.push("--best-of", String(opts.bestOf));
  if (opts.gpuDevice !== undefined) args.push("-g", String(opts.gpuDevice));
  if (opts.processors !== undefined && opts.processors > 1) args.push("-p", String(opts.processors));
  if (opts.offsetT !== undefined) args.push("--offset-t", String(opts.offsetT));
  if (opts.duration !== undefined) args.push("--duration", String(opts.duration));
  if (opts.diarize) args.push("--diarize");

  // word_timestamps: sets max-len=1 + split-on-word for per-word output
  // without requiring JSON parsing — simpler than -oj approach.
  if (opts.wordTimestamps) {
    args.push("--max-len", "1", "--split-on-word");
  } else {
    if (opts.maxLen !== undefined) args.push("--max-len", String(opts.maxLen));
    if (opts.splitOnWord) args.push("--split-on-word");
  }

  // VAD: voice activity detection — strips silence before whisper sees the audio
  if (opts.vadModel && existsSync(opts.vadModel)) {
    args.push("--vad", "--vad-model", opts.vadModel);
  }

  // Output format
  if (opts.outputFormat === "srt") {
    args.push("-osrt", "-of", filePath.replace(/\.[^.]+$/, ""));
  } else if (opts.outputFormat === "json") {
    args.push("-oj");
  } else if (opts.outputFormat === "text") {
    args.push("--no-timestamps");
  }
  // "timestamps" format: no flag — whisper default stdout includes timestamps

  return args;
}
```
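For the transcribe_batch defaults (text output, `conditionOnPrevText` unset, no optional flags), the argument vector buildArgs produces reduces to a fixed shape. A sketch of that default path (the model name and file path are placeholders, not real paths):

```typescript
// Illustrative reconstruction of buildArgs' output on the default "text"
// path: base flags, then the hallucination guards, then --no-timestamps.
function defaultWhisperArgs(model: string, file: string, lang: string, threads: number): string[] {
  return [
    "-m", model, "-f", file, "-l", lang, "-t", String(threads),
    "--max-context", "0",        // conditionOnPrevText unset
    "--no-speech-thold", "0.6",  // noSpeechThold default
    "--no-timestamps",           // outputFormat === "text"
  ];
}
```

All other flags (translate, temperature, diarize, VAD, etc.) are appended only when the corresponding option is explicitly set, so the batch tool's whisper-cli invocations stay this minimal.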