transcribe_podcast
Transcribes podcast files with word-level timestamps and speaker detection. Returns lightweight metadata; retrieve full transcript via get_ui_state.
Instructions
STEP 1 — Transcribe a podcast video/audio file. This is typically the first tool you call.
What it does: Uses Whisper AI for word-level timestamps and pyannote for speaker detection (who said what).

Returns: Lightweight metadata only: duration, language, word/segment counts, speaker summary, and the packed_ready flag. The actual transcript body is NOT returned here (it would be 500KB+ for a typical episode). Read the content via get_ui_state(include_transcript: true), which returns a compact, phrase-grouped markdown view (~10x smaller than raw segments).

Caching: Results are cached by file hash, so the same file won't be re-transcribed.

Supported formats: MP4, MOV, WebM, MKV, MP3, WAV.
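To illustrate what caching by file hash implies, here is a minimal sketch of a content-based cache key. The tool's actual hashing scheme is not documented here; SHA-256 and the helper name `file_cache_key` are assumptions made for illustration only.

```python
import hashlib
from pathlib import Path


def file_cache_key(path: str, chunk_size: int = 1 << 20) -> str:
    """Return a hex digest of the file's contents.

    SHA-256 is an assumption; the tool only states that it caches by file hash.
    """
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        # Read in chunks so large media files are never fully loaded into memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Under this scheme a renamed or moved copy of the same episode maps to the same key, while any change to the file's bytes (for example, re-encoding) produces a new key and would trigger a fresh transcription.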
After transcription: call get_ui_state(include_transcript: true) to read the transcript, then analyze it for viral moments and call suggest_clips.
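The call order above can be sketched as two request payloads. This assumes the tools are exposed through an MCP-style JSON-RPC 2.0 `tools/call` method, which this page does not state; the helper `tool_call` and the file path are hypothetical. The follow-up `suggest_clips` call is left out because its arguments are not documented here.

```python
import json


def tool_call(request_id: int, name: str, arguments: dict) -> dict:
    """Build a JSON-RPC 2.0 `tools/call` request (MCP-style transport is an assumption)."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


# Step 1: transcribe. Only lightweight metadata comes back.
# The path below is a hypothetical example.
step1 = tool_call(1, "transcribe_podcast", {"file_path": "/data/episode.mp4"})

# Step 2: fetch the compact transcript view for analysis.
step2 = tool_call(2, "get_ui_state", {"include_transcript": True})

for request in (step1, step2):
    print(json.dumps(request))
```

Keeping the transcript out of the first response means the metadata can be inspected (duration, speaker summary, packed_ready) before the larger transcript view is requested.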
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| language | No | ISO language code | |
| file_path | Yes | Absolute path to the podcast file | |
| model_size | No | Whisper model size | base |
| num_speakers | No | Exact number of speakers if known (e.g. 2). Auto-detects if omitted. | |
| enable_diarization | No | Enable speaker detection (who is speaking). | true |
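A client might assemble and sanity-check the arguments before calling the tool. The sketch below is one possible approach, not part of the tool itself: `build_arguments` is a hypothetical helper, the extension list is derived from the supported formats named above, and whether the server performs the same checks is unknown. Valid `model_size` values are not listed on this page, so they are passed through unchecked.

```python
from pathlib import PurePosixPath, PureWindowsPath
from typing import Optional

# Derived from the supported formats listed for this tool.
SUPPORTED_EXTENSIONS = {".mp4", ".mov", ".webm", ".mkv", ".mp3", ".wav"}


def build_arguments(
    file_path: str,
    language: Optional[str] = None,
    model_size: str = "base",
    num_speakers: Optional[int] = None,
    enable_diarization: bool = True,
) -> dict:
    """Return an arguments dict for transcribe_podcast (client-side sketch)."""
    # file_path must be absolute; accept both POSIX and Windows styles.
    is_absolute = (
        PurePosixPath(file_path).is_absolute()
        or PureWindowsPath(file_path).is_absolute()
    )
    if not is_absolute:
        raise ValueError("file_path must be an absolute path")

    suffix = PurePosixPath(file_path).suffix.lower()
    if suffix not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"unsupported file format: {suffix or '(none)'}")

    if num_speakers is not None and num_speakers < 1:
        raise ValueError("num_speakers must be a positive integer")

    arguments = {
        "file_path": file_path,
        "model_size": model_size,
        "enable_diarization": enable_diarization,
    }
    # Omit optional values so the server can fall back to auto-detection.
    if language is not None:
        arguments["language"] = language
    if num_speakers is not None:
        arguments["num_speakers"] = num_speakers
    return arguments
```

For a two-person interview, `build_arguments("/data/episode.mp3", num_speakers=2)` yields a dict containing the path, the default `model_size` of `base`, diarization enabled, and the explicit speaker count.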