Mistral multimodal chat (vision)
mistral_visionCombine text and image inputs to generate responses with vision-capable models. Accepts image URLs or data URIs for visual reasoning.
Instructions
Chat completion with multimodal input: text + image_url parts.
Requires a vision-capable model. Accepted:
pixtral-large-latest
pixtral-12b-latest
mistral-large-latest
mistral-medium-latest
mistral-small-latest
Each message's content is either a plain string (pure text) or an array of
parts { type: 'text', text } / { type: 'image_url', imageUrl }. The image URL
can be an https URL or a data: URI base64 payload.
Returns the assistant text + token usage. For non-visual requests, prefer mistral_chat.
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| messages | Yes | Chat messages. Pure-text requests are accepted, but this tool is intended primarily for multimodal prompts containing image parts. | |
| model | No | Vision-capable Mistral model. Default: pixtral-large-latest. | |
| temperature | No | ||
| max_tokens | No | ||
| top_p | No | ||
| seed | No | Random seed for deterministic sampling. Maps to Mistral's `random_seed`. |
Output Schema
| Name | Required | Description | Default |
|---|---|---|---|
| text | Yes | ||
| model | Yes | ||
| usage | No | ||
| finish_reason | No |