
Mistral multimodal chat (vision)

mistral_vision
Read-only

Analyze images and text in chat messages to produce assistant responses using Mistral vision models.

Instructions

Chat completion with multimodal input: text + image_url parts.

Requires a vision-capable model. Accepted:

  • pixtral-large-latest

  • pixtral-12b-latest

  • mistral-large-latest

  • mistral-medium-latest

  • mistral-small-latest

Each message's content is either a plain string (pure text) or an array of parts { type: 'text', text } / { type: 'image_url', imageUrl }. The image URL can be an https URL or a data: URI base64 payload.

Returns the assistant text + token usage. For non-visual requests, prefer mistral_chat.
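For example, a single-turn request with one text part and one image part could look like this (an illustrative payload; the field names follow the schemas shown under Implementation Reference below):

    const input = {
      model: "pixtral-large-latest",
      messages: [
        { role: "system", content: "Describe images precisely and concisely." },
        {
          role: "user",
          content: [
            { type: "text", text: "What is shown in this photo?" },
            { type: "image_url", imageUrl: "https://example.com/photo.jpg" },
            // A base64 data: URI is also accepted:
            // { type: "image_url", imageUrl: "data:image/png;base64,iVBORw0KGgo..." },
          ],
        },
      ],
    };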

Input Schema

  • messages (required): Chat messages. Pure-text requests are accepted, but this tool is intended primarily for multimodal prompts containing image parts.
  • model (optional): Vision-capable Mistral model. Default: pixtral-large-latest.
  • temperature (optional)
  • max_tokens (optional)
  • top_p (optional)
  • seed (optional): Random seed for deterministic sampling. Maps to Mistral's `random_seed`.

Output Schema

  • text (required)
  • model (required)
  • usage (optional)
  • finish_reason (optional)
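A successful call returns a structured result shaped like the following (values are illustrative, and the exact field names inside usage depend on the server's mapUsage helper, which is not shown in this excerpt):

    {
      "text": "The photo shows a tabby cat sitting on a windowsill.",
      "model": "pixtral-large-latest",
      "usage": { "promptTokens": 512, "completionTokens": 23, "totalTokens": 535 },
      "finish_reason": "stop"
    }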

Implementation Reference

  • Async handler function that calls mistral.chat.complete() with multimodal messages, extracts text from the response, and returns structured content with text, model, usage, and finish_reason. The handler body is shown inline in the registration snippet below.
  • Registration of the 'mistral_vision' tool via server.registerTool() within the registerVisionTools() function, including input/output schemas and annotations.
    export function registerVisionTools(server: McpServer, mistral: Mistral) {
      // ========== mistral_vision ==========
      server.registerTool(
        "mistral_vision",
        {
          title: "Mistral multimodal chat (vision)",
          description: [
            "Chat completion with multimodal input: text + image_url parts.",
            "",
            "Requires a vision-capable model. Accepted:",
            VISION_MODELS.map((m) => `  - ${m}`).join("\n"),
            "",
            "Each message's `content` is either a plain string (pure text) or an array of",
            "parts `{ type: 'text', text }` / `{ type: 'image_url', imageUrl }`. The image URL",
            "can be an https URL or a data: URI base64 payload.",
            "",
            "Returns the assistant text + token usage. For non-visual requests, prefer `mistral_chat`.",
          ].join("\n"),
          inputSchema: {
            messages: z
              .array(MultimodalMessageSchema)
              .min(1)
              .describe(
                "Chat messages. Pure-text requests are accepted, but this tool is intended primarily for multimodal prompts containing image parts."
              ),
            model: VisionModelSchema.optional().describe(
              `Vision-capable Mistral model. Default: ${DEFAULT_VISION_MODEL}.`
            ),
            ...ChatSamplingParams,
          },
          outputSchema: VisionOutputShape,
          annotations: {
            title: "Mistral multimodal chat (vision)",
            readOnlyHint: true,
            destructiveHint: false,
            idempotentHint: false,
            openWorldHint: true,
          },
        },
        async (input) => {
          try {
            const model = input.model ?? DEFAULT_VISION_MODEL;
            const res = await mistral.chat.complete({
              model,
              messages: input.messages as never,
              temperature: input.temperature,
              maxTokens: input.max_tokens,
              topP: input.top_p,
              randomSeed: input.seed,
            });
    
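            // The SDK may return message content as a plain string or as an
            // array of content parts; non-string content is serialized to JSON below.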
            const choice = res.choices?.[0];
            const content = choice?.message?.content ?? "";
            const text =
              typeof content === "string" ? content : JSON.stringify(content);
    
            const structured = {
              text,
              model,
              usage: mapUsage(res.usage),
              finish_reason: choice?.finishReason ?? undefined,
            };
    
            return {
              content: [toTextBlock(text)],
              structuredContent: structured,
            };
          } catch (err) {
            return errorResult("mistral_vision", err);
          }
        }
      );
    }
  • VisionOutputShape and VisionOutputSchema defining the tool's output shape (text, model, usage, finish_reason).
    export const VisionOutputShape = {
      text: z.string(),
      model: z.string(),
      usage: UsageSchema.optional(),
      finish_reason: z.string().optional(),
    };
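The companion VisionOutputSchema mentioned in the bullet above is not included in this excerpt; presumably it simply wraps the shape, along the lines of:

    // Assumption: the full output schema wraps the shape in a Zod object.
    export const VisionOutputSchema = z.object(VisionOutputShape);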
  • ContentPartSchema and MultimodalMessageSchema used by the vision tool's input schema for multimodal messages with text, image_url, and document_url parts.
    export const ContentPartSchema = z.union([
      z.object({
        type: z.literal("text"),
        text: z.string(),
      }),
      z.object({
        type: z.literal("image_url"),
        imageUrl: z.union([
          z
            .string()
            .describe("https URL or data:image/...;base64,... payload"),
          z.object({
            url: z.string(),
            detail: z.enum(["auto", "low", "high"]).optional(),
          }),
        ]),
      }),
      z.object({
        type: z.literal("document_url"),
        documentUrl: z.string(),
        documentName: z.string().optional(),
      }),
    ]);
    
    /** Multimodal chat message (text OR array of parts). */
    export const MultimodalMessageSchema = z.object({
      role: z.enum(["system", "user", "assistant"]),
      content: z.union([z.string(), z.array(ContentPartSchema).min(1)]),
    });
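As a quick sanity check, a message can be validated against the schema before it is sent (a hypothetical usage sketch, assuming the schemas are imported from this module):

    // Throws a ZodError if the message does not match the expected shape.
    const msg = MultimodalMessageSchema.parse({
      role: "user",
      content: [
        { type: "text", text: "Summarize the attached document." },
        {
          type: "document_url",
          documentUrl: "https://example.com/report.pdf",
          documentName: "report.pdf",
        },
      ],
    });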
  • DEFAULT_VISION_MODEL constant ('pixtral-large-latest') and VISION_MODELS list used as defaults and validation for the vision tool.
    export const DEFAULT_VISION_MODEL: (typeof VISION_MODELS)[number] =
      "pixtral-large-latest";
    export const DEFAULT_OCR_MODEL: (typeof OCR_MODELS)[number] = "mistral-ocr-latest";
    export const DEFAULT_STT_MODEL: (typeof STT_MODELS)[number] =
      "voxtral-mini-latest";
    export const DEFAULT_MODERATION_MODEL: (typeof MODERATION_MODELS)[number] =
      "mistral-moderation-latest";

Behavior: 5/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

The description explains the input format, accepted model list, and return value (assistant text + token usage). Annotations already declare readOnlyHint=true, and the description adds context about multimodal capabilities without contradicting annotations.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is front-loaded with the core purpose, followed by a concise list of models, then message format details. Every sentence serves a purpose, and the structure is easy to parse. No unnecessary verbosity.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

The description is complete for a multimodal chat tool: it covers input structure, accepted models, return value, and when to use an alternative. Given the presence of an output schema and thorough description, agents have sufficient context.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

The description adds meaning beyond the schema by explaining that content can be a plain string or array of typed parts, and that image URL can be https or data URI. Schema coverage is 50%, so the description compensates well, especially for the complex messages parameter.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool does chat completion with multimodal input (text + image_url parts). It lists accepted models and differentiates from the sibling mistral_chat for non-visual requests. The purpose is specific and actionable.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 5/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides explicit guidance on when to use this tool (requires vision-capable model) and when not to (for non-visual requests, prefer mistral_chat). It also details the required message format and model options, leaving no ambiguity.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
