Mistral multimodal chat (vision)
`mistral_vision`: Analyze images and text in chat messages to produce assistant responses using Mistral vision models.
Instructions
Chat completion with multimodal input: text + image_url parts.
Requires a vision-capable model. Accepted:
- pixtral-large-latest
- pixtral-12b-latest
- mistral-large-latest
- mistral-medium-latest
- mistral-small-latest
Each message's `content` is either a plain string (pure text) or an array of
parts `{ type: 'text', text }` / `{ type: 'image_url', imageUrl }`. The image URL
can be an https URL or a data: URI base64 payload.
Returns the assistant text + token usage. For non-visual requests, prefer `mistral_chat`.
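A minimal sketch of a valid `messages` array mixing a plain-string message with multimodal parts (the image URL is a placeholder, not a real endpoint):

```typescript
// Hypothetical request payload; the image URL is illustrative only.
const messages = [
  { role: "system", content: "Describe the image concisely." },
  {
    role: "user",
    content: [
      { type: "text", text: "What is shown in this picture?" },
      { type: "image_url", imageUrl: "https://example.com/photo.jpg" },
    ],
  },
];
```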
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| messages | Yes | Chat messages. Pure-text requests are accepted, but this tool is intended primarily for multimodal prompts containing image parts. | |
| model | No | Vision-capable Mistral model. Default: pixtral-large-latest. | |
| temperature | No | Sampling temperature. | |
| max_tokens | No | Maximum number of tokens to generate. | |
| top_p | No | Nucleus sampling probability mass. | |
| seed | No | Random seed for deterministic sampling. Maps to Mistral's `random_seed`. | |
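Since `imageUrl` also accepts a `data:` URI, a base64 payload can be built from raw image bytes; a minimal sketch (the MIME type should match the actual image format):

```typescript
import { Buffer } from "node:buffer";

// Wrap raw image bytes as a data: URI suitable for an image_url part.
function toDataUri(bytes: Buffer, mime: string): string {
  return `data:${mime};base64,${bytes.toString("base64")}`;
}
```

The resulting string goes directly into the `imageUrl` field of an `image_url` part.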
Output Schema
| Name | Required | Description | Default |
|---|---|---|---|
| text | Yes | Assistant response text. | |
| model | Yes | Model that produced the response. | |
| usage | No | Token usage for the request. | |
| finish_reason | No | Reason the completion finished. | |
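For a consumer, the structured result can be typed like this sketch (the `usage` value shape is an assumption; the actual field names come from `mapUsage`, which is not shown here):

```typescript
// Hypothetical TypeScript view of the tool's structured output.
interface VisionResult {
  text: string;                   // required: assistant reply text
  model: string;                  // required: model that answered
  usage?: Record<string, number>; // optional token accounting (assumed shape)
  finish_reason?: string;         // optional, e.g. "stop"
}

// Illustrative value, not real model output.
const sample: VisionResult = {
  text: "A red bicycle leaning against a brick wall.",
  model: "pixtral-large-latest",
  finish_reason: "stop",
};
```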
Implementation Reference
- src/tools-vision.ts:190-221 (handler): Async handler that calls `mistral.chat.complete()` with multimodal messages, extracts text from the response, and returns structured content with text, model, usage, and finish_reason.

```typescript
async (input) => {
  try {
    const model = input.model ?? DEFAULT_VISION_MODEL;
    const res = await mistral.chat.complete({
      model,
      messages: input.messages as never,
      temperature: input.temperature,
      maxTokens: input.max_tokens,
      topP: input.top_p,
      randomSeed: input.seed,
    });
    const choice = res.choices?.[0];
    const content = choice?.message?.content ?? "";
    const text = typeof content === "string" ? content : JSON.stringify(content);
    const structured = {
      text,
      model,
      usage: mapUsage(res.usage),
      finish_reason: choice?.finishReason ?? undefined,
    };
    return {
      content: [toTextBlock(text)],
      structuredContent: structured,
    };
  } catch (err) {
    return errorResult("mistral_vision", err);
  }
}
```

- src/tools-vision.ts:151-222 (registration): Registration of the `mistral_vision` tool via `server.registerTool()` within the `registerVisionTools()` function, including input/output schemas and annotations.
```typescript
export function registerVisionTools(server: McpServer, mistral: Mistral) {
  // ========== mistral_vision ==========
  server.registerTool(
    "mistral_vision",
    {
      title: "Mistral multimodal chat (vision)",
      description: [
        "Chat completion with multimodal input: text + image_url parts.",
        "",
        "Requires a vision-capable model. Accepted:",
        VISION_MODELS.map((m) => ` - ${m}`).join("\n"),
        "",
        "Each message's `content` is either a plain string (pure text) or an array of",
        "parts `{ type: 'text', text }` / `{ type: 'image_url', imageUrl }`. The image URL",
        "can be an https URL or a data: URI base64 payload.",
        "",
        "Returns the assistant text + token usage. For non-visual requests, prefer `mistral_chat`.",
      ].join("\n"),
      inputSchema: {
        messages: z
          .array(MultimodalMessageSchema)
          .min(1)
          .describe(
            "Chat messages. Pure-text requests are accepted, but this tool is intended primarily for multimodal prompts containing image parts."
          ),
        model: VisionModelSchema.optional().describe(
          `Vision-capable Mistral model. Default: ${DEFAULT_VISION_MODEL}.`
        ),
        ...ChatSamplingParams,
      },
      outputSchema: VisionOutputShape,
      annotations: {
        title: "Mistral multimodal chat (vision)",
        readOnlyHint: true,
        destructiveHint: false,
        idempotentHint: false,
        openWorldHint: true,
      },
    },
    async (input) => {
      try {
        const model = input.model ?? DEFAULT_VISION_MODEL;
        const res = await mistral.chat.complete({
          model,
          messages: input.messages as never,
          temperature: input.temperature,
          maxTokens: input.max_tokens,
          topP: input.top_p,
          randomSeed: input.seed,
        });
        const choice = res.choices?.[0];
        const content = choice?.message?.content ?? "";
        const text = typeof content === "string" ? content : JSON.stringify(content);
        const structured = {
          text,
          model,
          usage: mapUsage(res.usage),
          finish_reason: choice?.finishReason ?? undefined,
        };
        return {
          content: [toTextBlock(text)],
          structuredContent: structured,
        };
      } catch (err) {
        return errorResult("mistral_vision", err);
      }
    }
  );
```

- src/tools-vision.ts:38-43 (schema): `VisionOutputShape` and `VisionOutputSchema` defining the tool's output shape (text, model, usage, finish_reason).
```typescript
export const VisionOutputShape = {
  text: z.string(),
  model: z.string(),
  usage: UsageSchema.optional(),
  finish_reason: z.string().optional(),
};
```

- src/shared.ts:28-56 (schema): `ContentPartSchema` and `MultimodalMessageSchema` used by the vision tool's input schema for multimodal messages with text, image_url, and document_url parts.
```typescript
export const ContentPartSchema = z.union([
  z.object({
    type: z.literal("text"),
    text: z.string(),
  }),
  z.object({
    type: z.literal("image_url"),
    imageUrl: z.union([
      z
        .string()
        .describe("https URL or data:image/...;base64,... payload"),
      z.object({
        url: z.string(),
        detail: z.enum(["auto", "low", "high"]).optional(),
      }),
    ]),
  }),
  z.object({
    type: z.literal("document_url"),
    documentUrl: z.string(),
    documentName: z.string().optional(),
  }),
]);

/** Multimodal chat message (text OR array of parts). */
export const MultimodalMessageSchema = z.object({
  role: z.enum(["system", "user", "assistant"]),
  content: z.union([z.string(), z.array(ContentPartSchema).min(1)]),
});
```

- src/models.ts:107-113 (helper): `DEFAULT_VISION_MODEL` constant (`'pixtral-large-latest'`) and `VISION_MODELS` list used as defaults and validation for the vision tool.
```typescript
export const DEFAULT_VISION_MODEL: (typeof VISION_MODELS)[number] =
  "pixtral-large-latest";
export const DEFAULT_OCR_MODEL: (typeof OCR_MODELS)[number] =
  "mistral-ocr-latest";
export const DEFAULT_STT_MODEL: (typeof STT_MODELS)[number] =
  "voxtral-mini-latest";
export const DEFAULT_MODERATION_MODEL: (typeof MODERATION_MODELS)[number] =
  "mistral-moderation-latest";
```