Voxtral speech-to-text
voxtral_transcribe
Transcribe audio files to text using Mistral Voxtral models. Accepts public URLs or uploaded files, with options for language hint, speaker diarization, and context bias.
Instructions
Transcribe an audio file to text using Mistral Voxtral.
Accepted models:
- voxtral-mini-latest
- voxtral-small-latest

Audio source is one of:
- { type: "file_url", fileUrl: "https://..." } (public URL)
- { type: "file", fileId: "<id-from-files-api>" }

Options:
- `language`: ISO-639-1 hint (e.g. 'fr', 'en'). Boosts accuracy when known.
- `temperature`: sampling temperature.
- `diarize`: return per-speaker segments (default false).
- `timestampGranularities`: ['segment'] to return per-segment timestamps.
- `contextBias`: list of phrases/terms that should bias the decoder.

Returns plain `text`, detected `language`, optional `segments[]`, and token usage.
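The two audio source shapes and the options above can be sketched as TypeScript types with example arguments. The type names, the example URL, and the bias terms are assumptions for illustration; the server defines its inputs with zod schemas, not these types.

```typescript
// Hypothetical types mirroring the documented inputs; not exported by the server.
type AudioSource =
  | { type: "file_url"; fileUrl: string }
  | { type: "file"; fileId: string };

interface TranscribeArgs {
  audio: AudioSource;
  model?: "voxtral-mini-latest" | "voxtral-small-latest";
  language?: string;
  temperature?: number;
  diarize?: boolean;
  timestampGranularities?: Array<"segment">;
  contextBias?: string[];
}

// Public URL source with diarization and per-segment timestamps.
// The URL is a placeholder.
const fromUrl: TranscribeArgs = {
  audio: { type: "file_url", fileUrl: "https://example.com/meeting.mp3" },
  language: "en",
  diarize: true,
  timestampGranularities: ["segment"],
  contextBias: ["Voxtral", "Mistral"],
};

// Previously uploaded file; the ID is a placeholder to be replaced
// with a real ID from the Files API.
const fromUpload: TranscribeArgs = {
  audio: { type: "file", fileId: "<id-from-files-api>" },
  model: "voxtral-small-latest",
};
```

Only `audio` is required, so the second object is already a complete call.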
Input Schema
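| Name | Required | Description | Default |
|---|---|---|---|
| audio | Yes | Audio source: `{ type: "file_url", fileUrl }` (public HTTPS URL) or `{ type: "file", fileId }` (file previously uploaded via the Files API). | |
| model | No | STT model: voxtral-mini-latest or voxtral-small-latest. | voxtral-mini-latest |
| language | No | ISO-639-1 language hint (e.g. 'fr', 'en'). | |
| temperature | No | Sampling temperature, from 0 to 2. | |
| diarize | No | Return per-speaker segments. | false |
| timestampGranularities | No | Array of timestamp granularities. Only 'segment' is currently supported. | |
| contextBias | No | List of phrases/terms that should bias the decoder. | |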
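A client may want to check arguments against these constraints before sending them. The following is a minimal sketch of such a check in plain TypeScript; `validateTranscribeArgs` is a hypothetical helper, not part of the server, which performs its own validation through zod schemas.

```typescript
// Hypothetical pre-flight check mirroring the documented input constraints.
// Returns a list of human-readable problems; an empty list means the
// arguments look acceptable.
function validateTranscribeArgs(args: Record<string, unknown>): string[] {
  const problems: string[] = [];

  // `audio` is required and must be one of the two documented shapes.
  const audio = args.audio as Record<string, unknown> | undefined;
  if (!audio || typeof audio !== "object") {
    problems.push("audio is required");
  } else if (audio.type === "file_url") {
    if (typeof audio.fileUrl !== "string") {
      problems.push("audio.fileUrl must be a string");
    }
  } else if (audio.type === "file") {
    if (typeof audio.fileId !== "string") {
      problems.push("audio.fileId must be a string");
    }
  } else {
    problems.push('audio.type must be "file_url" or "file"');
  }

  // `temperature` is optional but bounded to [0, 2].
  const t = args.temperature;
  if (t !== undefined && (typeof t !== "number" || t < 0 || t > 2)) {
    problems.push("temperature must be a number between 0 and 2");
  }

  // Only the 'segment' granularity is currently supported.
  const g = args.timestampGranularities;
  if (g !== undefined && !(Array.isArray(g) && g.every((x) => x === "segment"))) {
    problems.push("timestampGranularities may only contain 'segment'");
  }

  return problems;
}
```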
Output Schema
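| Name | Required | Description | Default |
|---|---|---|---|
| text | Yes | Transcribed plain text. | |
| model | Yes | Model name reported in the API response. | |
| language | Yes | Detected language; may be null. | |
| segments | No | Array of segments, each with text, start, end, and optional score and speaker_id. | |
| usage | No | Token usage. | |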
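When `diarize` is enabled, a consumer will often want to merge consecutive segments from the same speaker into turns. A minimal sketch, assuming segments arrive in chronological order; `Segment`, `Turn`, `mergeTurns`, and the "unknown" fallback label are hypothetical names chosen for illustration.

```typescript
// Shape of one entry in `segments`, per the output schema.
interface Segment {
  text: string;
  start: number;
  end: number;
  score?: number;
  speaker_id?: string;
}

// Hypothetical merged turn: consecutive segments by the same speaker.
interface Turn {
  speaker: string;
  start: number;
  end: number;
  text: string;
}

function mergeTurns(segments: Segment[]): Turn[] {
  const turns: Turn[] = [];
  for (const seg of segments) {
    // Segments without a speaker_id are grouped under a fallback label.
    const speaker = seg.speaker_id ?? "unknown";
    const last = turns[turns.length - 1];
    if (last && last.speaker === speaker) {
      // Same speaker keeps talking: extend the current turn.
      last.end = seg.end;
      last.text += " " + seg.text.trim();
    } else {
      // Speaker changed (or first segment): start a new turn.
      turns.push({ speaker, start: seg.start, end: seg.end, text: seg.text.trim() });
    }
  }
  return turns;
}
```

Each turn keeps the start of its first segment and the end of its last, so the result can be rendered as a timestamped dialogue.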
Implementation Reference
- src/tools-audio.ts:145-195 (handler) Handler function for voxtral_transcribe. Calls mistral.audio.transcriptions.complete with the audio source (fileUrl or fileId) and optional parameters (language, temperature, diarize, timestampGranularities, contextBias). Maps the response segments and returns text content plus structured metadata.

```typescript
  async (input) => {
    try {
      const model = input.model ?? DEFAULT_STT_MODEL;
      const request: {
        model: string;
        fileUrl?: string;
        fileId?: string;
        language?: string;
        temperature?: number;
        diarize?: boolean;
        timestampGranularities?: Array<"segment">;
        contextBias?: string[];
      } = { model };
      if (input.audio.type === "file_url")
        request.fileUrl = input.audio.fileUrl;
      if (input.audio.type === "file") request.fileId = input.audio.fileId;
      if (input.language !== undefined) request.language = input.language;
      if (input.temperature !== undefined)
        request.temperature = input.temperature;
      if (input.diarize !== undefined) request.diarize = input.diarize;
      if (input.timestampGranularities !== undefined)
        request.timestampGranularities = input.timestampGranularities;
      if (input.contextBias !== undefined)
        request.contextBias = input.contextBias;

      const res = await mistral.audio.transcriptions.complete(
        request as never
      );

      const segments = res.segments?.map((s) => ({
        text: s.text,
        start: s.start,
        end: s.end,
        score: s.score ?? undefined,
        speaker_id: s.speakerId ?? undefined,
      }));

      const structured = {
        text: res.text,
        model: res.model,
        language: res.language,
        segments,
        usage: mapUsage(res.usage),
      };

      return {
        content: [toTextBlock(res.text)],
        structuredContent: structured,
      };
    } catch (err) {
      return errorResult("voxtral_transcribe", err);
    }
  }
);
```

- src/tools-audio.ts:30-45 (schema) Output schema for voxtral_transcribe. Defines the TranscribeOutputShape with text, model, language, optional segments (each with text/start/end/score/speaker_id), and optional usage.

```typescript
const TranscriptionSegmentSchema = z.object({
  text: z.string(),
  start: z.number(),
  end: z.number(),
  score: z.number().optional(),
  speaker_id: z.string().optional(),
});

export const TranscribeOutputShape = {
  text: z.string(),
  model: z.string(),
  language: z.string().nullable(),
  segments: z.array(TranscriptionSegmentSchema).optional(),
  usage: UsageSchema.optional(),
};
export const TranscribeOutputSchema = z.object(TranscribeOutputShape);
```

- src/tools-audio.ts:69-84 (schema) Input schema for audio source (file_url or file discriminator). Used as part of the voxtral_transcribe input.

```typescript
const AudioSourceSchema = z.union([
  z.object({
    type: z.literal("file_url"),
    fileUrl: z
      .string()
      .describe("HTTPS URL to an audio file (mp3/wav/flac/ogg/webm/m4a)."),
  }),
  z.object({
    type: z.literal("file"),
    fileId: z
      .string()
      .describe(
        "ID of an audio file previously uploaded via the Files API (purpose=audio)."
      ),
  }),
]);
```

- src/tools-audio.ts:119-135 (schema) Input schema for voxtral_transcribe tool: audio source, optional model, language, temperature, diarize, timestampGranularities, and contextBias.

```typescript
inputSchema: {
  audio: AudioSourceSchema,
  model: SttModelSchema.optional().describe(
    `STT model. Default: ${DEFAULT_STT_MODEL}.`
  ),
  language: z
    .string()
    .optional()
    .describe("ISO-639-1 language hint (e.g. 'fr', 'en')."),
  temperature: z.number().min(0).max(2).optional(),
  diarize: z.boolean().optional(),
  timestampGranularities: z
    .array(z.enum(["segment"]))
    .optional()
    .describe("Only 'segment' is currently supported."),
  contextBias: z.array(z.string()).optional(),
},
```

- src/tools-audio.ts:96-195 (registration) Registration of the voxtral_transcribe tool on the MCP server via server.registerTool, including title, description, input/output schemas, and annotations.

```typescript
server.registerTool(
  "voxtral_transcribe",
  {
    title: "Voxtral speech-to-text",
    description: [
      "Transcribe an audio file to text using Mistral Voxtral.",
      "",
      "Accepted models:",
      STT_MODELS.map((m) => `  - ${m}`).join("\n"),
      "",
      "Audio source is one of:",
      '  - { type: "file_url", fileUrl: "https://..." } (public URL)',
      '  - { type: "file", fileId: "<id-from-files-api>" }',
      "",
      "Options:",
      "  - `language`: ISO-639-1 hint (e.g. 'fr', 'en'). Boosts accuracy when known.",
      "  - `temperature`: sampling temperature.",
      "  - `diarize`: return per-speaker segments (default false).",
      "  - `timestampGranularities`: ['segment'] to return per-segment timestamps.",
      "  - `contextBias`: list of phrases/terms that should bias the decoder.",
      "",
      "Returns plain `text`, detected `language`, optional `segments[]`, and token usage.",
    ].join("\n"),
    inputSchema: {
      audio: AudioSourceSchema,
      model: SttModelSchema.optional().describe(
        `STT model. Default: ${DEFAULT_STT_MODEL}.`
      ),
      language: z
        .string()
        .optional()
        .describe("ISO-639-1 language hint (e.g. 'fr', 'en')."),
      temperature: z.number().min(0).max(2).optional(),
      diarize: z.boolean().optional(),
      timestampGranularities: z
        .array(z.enum(["segment"]))
        .optional()
        .describe("Only 'segment' is currently supported."),
      contextBias: z.array(z.string()).optional(),
    },
    outputSchema: TranscribeOutputShape,
    annotations: {
      title: "Voxtral speech-to-text",
      readOnlyHint: true,
      destructiveHint: false,
      idempotentHint: true,
      openWorldHint: true,
    },
  },
  async (input) => {
    try {
      const model = input.model ?? DEFAULT_STT_MODEL;
      const request: {
        model: string;
        fileUrl?: string;
        fileId?: string;
        language?: string;
        temperature?: number;
        diarize?: boolean;
        timestampGranularities?: Array<"segment">;
        contextBias?: string[];
      } = { model };
      if (input.audio.type === "file_url")
        request.fileUrl = input.audio.fileUrl;
      if (input.audio.type === "file") request.fileId = input.audio.fileId;
      if (input.language !== undefined) request.language = input.language;
      if (input.temperature !== undefined)
        request.temperature = input.temperature;
      if (input.diarize !== undefined) request.diarize = input.diarize;
      if (input.timestampGranularities !== undefined)
        request.timestampGranularities = input.timestampGranularities;
      if (input.contextBias !== undefined)
        request.contextBias = input.contextBias;

      const res = await mistral.audio.transcriptions.complete(
        request as never
      );

      const segments = res.segments?.map((s) => ({
        text: s.text,
        start: s.start,
        end: s.end,
        score: s.score ?? undefined,
        speaker_id: s.speakerId ?? undefined,
      }));

      const structured = {
        text: res.text,
        model: res.model,
        language: res.language,
        segments,
        usage: mapUsage(res.usage),
      };

      return {
        content: [toTextBlock(res.text)],
        structuredContent: structured,
      };
    } catch (err) {
      return errorResult("voxtral_transcribe", err);
    }
  }
);
```
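
The handler copies each optional parameter onto the request only when it is defined, so unset options are never sent. The same pattern can be sketched as a small reusable helper; `pickDefined` is a hypothetical name and is not part of the server's source, which uses the explicit `if` chain shown above.

```typescript
// Hypothetical helper: copy only the listed keys whose value is not
// undefined. Values such as false, 0, or "" are kept, matching the
// handler's `!== undefined` checks.
function pickDefined(
  source: Record<string, unknown>,
  keys: string[]
): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const key of keys) {
    const value = source[key];
    if (value !== undefined) out[key] = value;
  }
  return out;
}

// Example input with one option set, one explicitly false, one undefined.
const exampleInput = {
  language: "fr",
  diarize: false,
  temperature: undefined,
};

// The model default and key list here are illustrative.
const exampleRequest = {
  model: "voxtral-mini-latest",
  ...pickDefined(exampleInput, [
    "language",
    "temperature",
    "diarize",
    "timestampGranularities",
    "contextBias",
  ]),
};
```

The explicit chain in the handler trades brevity for per-field type checking, which an untyped helper like this one gives up.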