
image_understanding

Analyze images using Grok AI vision capabilities to extract information, answer questions, and understand visual content based on text prompts.

Instructions

Analyze images using Grok AI vision capabilities (Note: Grok 3 may support image creation)

Input Schema

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| base64_image | No | Base64-encoded image data (without the data:image prefix) | |
| image_url | No | URL of the image to analyze | |
| model | No | Grok vision model to use (e.g., grok-2-vision-latest, potentially grok-3 variants) | grok-2-vision-latest |
| prompt | Yes | Text prompt to accompany the image | |
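
The schema marks only prompt as required, but the handler shown in the implementation reference also rejects calls that supply neither image_url nor base64_image. A minimal sketch of those rules as a type plus a validator follows; the names ImageUnderstandingArgs, DEFAULT_MODEL, and validateArgs are hypothetical and do not appear in the server's source.

```typescript
// Hypothetical type mirroring the documented input schema.
interface ImageUnderstandingArgs {
  prompt: string;
  image_url?: string;
  base64_image?: string; // raw base64, without the "data:image/...;base64," prefix
  model?: string;
}

const DEFAULT_MODEL = 'grok-2-vision-latest';

// Applies the same two checks the handler performs, then fills in the default model.
function validateArgs(
  args: Partial<ImageUnderstandingArgs>
): ImageUnderstandingArgs & { model: string } {
  if (!args.prompt) {
    throw new Error('Prompt is required');
  }
  if (!args.image_url && !args.base64_image) {
    throw new Error('Either image_url or base64_image is required');
  }
  return { ...args, prompt: args.prompt, model: args.model || DEFAULT_MODEL };
}
```

Checking arguments before the call lets a client surface the error locally instead of waiting for the server to throw.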

Implementation Reference

  • The primary handler function for the image_understanding tool. It validates inputs (prompt required, image_url or base64_image), constructs a user message with image content and text prompt, calls the Grok API client for image understanding, and formats the response.
    private async handleImageUnderstanding(args: any) {
      console.error('[Tool] Handling image_understanding tool call');
      
      const { image_url, base64_image, prompt, model, ...otherOptions } = args;
      
      // Validate inputs
      if (!prompt) {
        throw new Error('Prompt is required');
      }
      
      if (!image_url && !base64_image) {
        throw new Error('Either image_url or base64_image is required');
      }
      
      // Prepare message content
      const content: any[] = [];
      
      // Add image
      if (image_url) {
        content.push({
          type: 'image_url',
          image_url: {
            url: image_url,
            detail: 'high',
          },
        });
      } else if (base64_image) {
        content.push({
          type: 'image_url',
          image_url: {
            url: `data:image/jpeg;base64,${base64_image}`,
            detail: 'high',
          },
        });
      }
      
      // Add text prompt
      content.push({
        type: 'text',
        text: prompt,
      });
      
      // Create messages array
      const messages = [
        {
          role: 'user',
          content,
        },
      ];
      
      // Create options object
      const options = {
        model: model || 'grok-2-vision-latest',
        ...otherOptions
      };
      
      // Call Grok API
      const response = await this.grokClient.createImageUnderstanding(messages, options);
      
      return {
        content: [
          {
            type: 'text',
            text: response.choices[0].message.content,
          },
        ],
      };
    }
  • src/index.ts:107-133 (registration)
    Tool registration in the ListTools response, including name, description, and input schema definition.
    {
      name: 'image_understanding',
      description: 'Analyze images using Grok AI vision capabilities (Note: Grok 3 may support image creation)',
      inputSchema: {
        type: 'object',
        properties: {
          image_url: {
            type: 'string',
            description: 'URL of the image to analyze'
          },
          base64_image: {
            type: 'string',
            description: 'Base64-encoded image data (without the data:image prefix)'
          },
          prompt: {
            type: 'string',
            description: 'Text prompt to accompany the image'
          },
          model: {
            type: 'string',
            description: 'Grok vision model to use (e.g., grok-2-vision-latest, potentially grok-3 variants)',
            default: 'grok-2-vision-latest'
          }
        },
        required: ['prompt']
      }
    },
  • Input schema defining the expected parameters for the image_understanding tool: prompt (required), image_url or base64_image, model.
    inputSchema: {
      type: 'object',
      properties: {
        image_url: {
          type: 'string',
          description: 'URL of the image to analyze'
        },
        base64_image: {
          type: 'string',
          description: 'Base64-encoded image data (without the data:image prefix)'
        },
        prompt: {
          type: 'string',
          description: 'Text prompt to accompany the image'
        },
        model: {
          type: 'string',
          description: 'Grok vision model to use (e.g., grok-2-vision-latest, potentially grok-3 variants)',
          default: 'grok-2-vision-latest'
        }
      },
      required: ['prompt']
    }
  • Helper method in GrokApiClient that sends the vision-enabled chat completion request to the xAI API endpoint /chat/completions.
    async createImageUnderstanding(messages: any[], options: any = {}): Promise<any> {
      try {
        console.error('[API] Creating image understanding request...');
        
        const requestBody = {
          messages,
          model: options.model || 'grok-2-vision-latest',
          ...options
        };
        
        const response = await this.axiosInstance.post('/chat/completions', requestBody);
        return response.data;
      } catch (error) {
        console.error('[Error] Failed to create image understanding request:', error);
        throw error;
      }
    }
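
One detail of createImageUnderstanding is worth illustrating. Because ...options is spread after the explicit model key, any model or messages key present in options replaces the value set before it, including a model key whose value is undefined. The handler always passes a concrete model, so this should not matter for this tool. The sketch below shows a variant that spreads the caller's options first; buildRequestBody and ChatMessage are hypothetical names, not part of GrokApiClient.

```typescript
type ChatMessage = { role: string; content: unknown };

// Hypothetical helper: caller options go first, then the fields that must hold,
// so neither `messages` nor the model fallback can be overwritten by the spread.
function buildRequestBody(
  messages: ChatMessage[],
  options: Record<string, unknown> = {}
) {
  const model =
    typeof options.model === 'string' && options.model.length > 0
      ? options.model
      : 'grok-2-vision-latest';
  return { ...options, messages, model };
}
```

With this ordering, extra options such as sampling parameters still pass through unchanged.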
Behavior: 2/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries the full burden of behavioral disclosure. It states the tool analyzes images but does not describe what the analysis entails (e.g., object detection, captioning, OCR), potential limitations (e.g., image size restrictions, rate limits), or authentication needs. The note about Grok 3 adds confusion rather than transparency.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 3/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is brief but includes a parenthetical note that is speculative and not directly relevant to the tool's current functionality, reducing efficiency. It is front-loaded with the core purpose, but the extra sentence detracts from conciseness without adding value.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 2/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

For a tool with no annotations and no output schema, the description is incomplete. It lacks details on what the analysis returns (e.g., text descriptions, structured data), error conditions, or behavioral traits like rate limits. The note about Grok 3 does not compensate for these gaps, leaving the agent with insufficient context for effective use.
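
On the return shape, the handler in the implementation reference returns a single text content item holding response.choices[0].message.content, and it would fail with a TypeError if choices were empty. Below is a sketch of a more defensive extraction, assuming the chat-completion response shape the handler already relies on; ChatCompletionLike, extractText, and toToolResult are hypothetical names.

```typescript
// Assumed minimal shape of the response, based on the fields the handler reads.
interface ChatCompletionLike {
  choices?: Array<{ message?: { content?: string | null } }>;
}

// Returns the first choice's text, or throws a descriptive error if it is missing.
function extractText(response: ChatCompletionLike): string {
  const text = response.choices?.[0]?.message?.content;
  if (typeof text !== 'string' || text.length === 0) {
    throw new Error('Grok API returned no text content');
  }
  return text;
}

// Wraps the text in the same result structure the handler returns.
function toToolResult(response: ChatCompletionLike) {
  return { content: [{ type: 'text' as const, text: extractText(response) }] };
}
```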

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 100%, so the schema already documents all four parameters thoroughly. The description adds no additional meaning about parameters beyond what the schema provides, such as explaining interactions between base64_image and image_url or elaborating on model options. Baseline 3 is appropriate as the schema does the heavy lifting.
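
The interaction between image_url and base64_image can be read off the handler in the implementation reference: when both are supplied, image_url wins and base64_image is ignored, and base64 data is always wrapped as a data:image/jpeg;base64 URL regardless of the actual image format. A sketch of that selection logic follows; ImagePart and buildImagePart are hypothetical names.

```typescript
interface ImagePart {
  type: 'image_url';
  image_url: { url: string; detail: 'high' };
}

// Mirrors the handler's precedence: a URL is preferred over base64 data.
function buildImagePart(imageUrl?: string, base64Image?: string): ImagePart {
  if (imageUrl) {
    return { type: 'image_url', image_url: { url: imageUrl, detail: 'high' } };
  }
  if (base64Image) {
    // The handler hard-codes the JPEG media type for base64 input.
    return {
      type: 'image_url',
      image_url: { url: `data:image/jpeg;base64,${base64Image}`, detail: 'high' },
    };
  }
  throw new Error('Either image_url or base64_image is required');
}
```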

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 4/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose as 'Analyze images using Grok AI vision capabilities' with a specific verb ('Analyze') and resource ('images'), distinguishing it from sibling tools like chat_completion and function_calling. However, it includes a parenthetical note about Grok 3 potentially supporting image creation, which slightly dilutes the clarity by introducing unrelated future capabilities.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 2/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides no guidance on when to use this tool versus alternatives like chat_completion or function_calling. It mentions Grok 3 may support image creation, but this is speculative and not actionable for current usage decisions. No explicit when/when-not scenarios or prerequisites are included.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/Bob-lance/grok-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.