Transcribe Audio/Video
transcribe
Convert speech from audio or video files to text. Add timestamps, label speakers, and provide language hints to improve accuracy.
Instructions
Transcribe speech from an audio or video file to text using Gemini. Optionally add timestamps, speaker labels (diarization), and a spoken-language hint. Provide either a local file_path (loaded server-side) or inline base64 data together with its mime_type.
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| file_path | No | Absolute path to a local audio/video file (mp3, wav, m4a, mp4, mov, webm, ...). | |
| data | No | Base64-encoded audio/video data (alternative to file_path). | |
| mime_type | No | MIME type for inline data (e.g. "audio/mpeg", "video/mp4"). Required with data. | |
| language | No | Optional spoken-language hint (e.g. "German"). Improves accuracy. | |
| timestamps | No | Prefix lines with [mm:ss] timestamps. | |
| diarization | No | Label distinct speakers (Speaker 1, Speaker 2, ...). | |
| prompt | No | Optional extra instruction appended to the transcription prompt. | |
| model | No | Model to use (defaults to the configured image/analysis model). | |
| max_tokens | No | Maximum tokens in the response; the default is sized for long transcripts. | 32768 |
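As a rough illustration of how the input parameters fit together, the sketch below assembles an arguments dictionary for the transcribe tool. The helper name `build_transcribe_args` is hypothetical. Treating `timestamps` and `diarization` as booleans, and rejecting calls that pass both `file_path` and `data`, are assumptions that the schema does not state explicitly.

```python
import base64
from typing import Any, Dict, Optional


def build_transcribe_args(
    file_path: Optional[str] = None,
    data: Optional[bytes] = None,
    mime_type: Optional[str] = None,
    language: Optional[str] = None,
    timestamps: bool = False,
    diarization: bool = False,
) -> Dict[str, Any]:
    """Assemble an arguments dict using the parameter names from the input schema."""
    # Assumption: exactly one audio source should be supplied.
    if (file_path is None) == (data is None):
        raise ValueError("provide exactly one of file_path or data")

    args: Dict[str, Any] = {}
    if file_path is not None:
        args["file_path"] = file_path
    else:
        # The schema marks mime_type as required whenever inline data is used.
        if not mime_type:
            raise ValueError("mime_type is required with data")
        args["data"] = base64.b64encode(data).decode("ascii")
        args["mime_type"] = mime_type

    # Optional fields are only included when set, so server defaults apply otherwise.
    if language:
        args["language"] = language
    if timestamps:
        args["timestamps"] = True  # assumed boolean flag
    if diarization:
        args["diarization"] = True  # assumed boolean flag
    return args
```

For example, `build_transcribe_args(file_path="/abs/path/interview.mp3", language="German", timestamps=True)` yields a dictionary with just those three keys; how it is sent to the server depends on the MCP client in use.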
Output Schema
| Name | Required | Description | Default |
|---|---|---|---|
| content | Yes | | |
| success | Yes | | |
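The output schema does not describe the shape of `content`. The sketch below assumes it is plain text with one utterance per line, where `timestamps` adds a `[mm:ss]` prefix and `diarization` adds a `Speaker N:` label after it. That combined line layout is an assumption, as are the names `Segment` and `parse_transcript`; lines that do not match are kept as untimed text rather than dropped.

```python
import re
from typing import List, NamedTuple, Optional

# Assumed line layout: "[mm:ss] Speaker N: text", with the speaker label optional.
LINE_RE = re.compile(r"^\[(\d+):(\d{2})\]\s*(?:(Speaker \d+):\s*)?(.*)$")


class Segment(NamedTuple):
    seconds: Optional[int]   # offset from the start, if a timestamp was present
    speaker: Optional[str]   # e.g. "Speaker 1", if diarization labels were present
    text: str


def parse_transcript(content: str) -> List[Segment]:
    """Split transcript text into segments, tolerating lines without a timestamp."""
    segments: List[Segment] = []
    for raw in content.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = LINE_RE.match(line)
        if match:
            minutes, secs, speaker, text = match.groups()
            segments.append(Segment(int(minutes) * 60 + int(secs), speaker, text))
        else:
            # Unrecognized layout: keep the text, leave the metadata empty.
            segments.append(Segment(None, None, line))
    return segments
```

Checking `success` before reading `content` is a sensible first step on the caller's side, since both fields are marked as required in the output.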