---
description: AGNO provides native support for multimodal interactions, allowing agents to process and generate not just text, but also images, audio, and video. This rule outlines how to implement multimodal capabilities in AGNO agents.
globs:
alwaysApply: false
---
# AGNO Multimodal Capabilities
AGNO provides native support for multimodal interactions, allowing agents to process and generate not just text, but also images, audio, and video. This rule outlines how to implement multimodal capabilities in AGNO agents.
## Multimodal Input Support
AGNO agents can accept various types of inputs:
```python
from pathlib import Path

from agno.agent import Agent
from agno.media import Image
from agno.models.anthropic import Claude
from agno.tools.reasoning import ReasoningTools

# Create a multimodal agent using Claude
multimodal_agent = Agent(
    model=Claude(id="claude-3-7-sonnet-latest"),  # Supports image understanding
    tools=[ReasoningTools(add_instructions=True)],
    instructions=[
        "Analyze both text and images provided by the user",
        "Describe image content in detail when present",
        "Reference visual elements in your responses when relevant",
    ],
    markdown=True,
)

# Read an image from disk
image_path = Path("data/sample_image.jpg")
with open(image_path, "rb") as f:
    image_data = f.read()

# Process text and image together
response = multimodal_agent.run(
    "What can you tell me about this image?",
    images=[Image(content=image_data)],  # Pass images as a list of Image objects
)
```
## Image Processing
### Working with Image Data
AGNO supports multiple ways to provide image data:
```python
from agno.agent import Agent
from agno.media import Image
from agno.models.openai import OpenAIChat

vision_agent = Agent(
    model=OpenAIChat(id="gpt-4o"),  # Vision-capable model
    instructions=["Analyze images and provide detailed descriptions"],
)

# Method 1: From a file path
response = vision_agent.run(
    "What's in this image?",
    images=[Image(filepath="data/image1.jpg")],
)

# Method 2: From binary data
with open("data/image2.png", "rb") as f:
    image_bytes = f.read()
response = vision_agent.run(
    "Describe this image in detail",
    images=[Image(content=image_bytes)],
)

# Method 3: From a URL
response = vision_agent.run(
    "What does this diagram show?",
    images=[Image(url="https://example.com/diagram.jpg")],
)

# Method 4: Multiple images
response = vision_agent.run(
    "Compare these two images",
    images=[Image(filepath="data/image1.jpg"), Image(filepath="data/image2.jpg")],
)
```
### Image Analysis with Tools
Combine image understanding with other tools:
```python
from agno.agent import Agent
from agno.media import Image
from agno.models.openai import OpenAIChat
from agno.tools.duckduckgo import DuckDuckGoTools
from agno.tools.reasoning import ReasoningTools

# Create an agent that can analyze images and search for information
image_analysis_agent = Agent(
    model=OpenAIChat(id="gpt-4o"),
    tools=[
        ReasoningTools(add_instructions=True),
        DuckDuckGoTools(),
    ],
    instructions=[
        "First analyze any images provided by the user",
        "If needed, search for additional information about items in the image",
        "Provide a comprehensive analysis combining visual and found information",
    ],
    markdown=True,
)

# Process an image and potentially search for related information
response = image_analysis_agent.run(
    "What is this building and when was it constructed?",
    images=[Image(filepath="data/architecture.jpg")],
)
```
## Multimodal Output Generation
AGNO can also generate multimodal outputs using appropriate models:
```python
from agno.agent import Agent
from agno.models.openai import OpenAIChat
from agno.tools.dalle import DalleTools

# Create an agent that can generate images
image_generation_agent = Agent(
    model=OpenAIChat(id="gpt-4o"),
    tools=[DalleTools()],
    instructions=[
        "Use DALL-E to generate images when relevant to the user's request",
        "Create detailed prompts for image generation",
    ],
    markdown=True,
)

# Generate an image based on user request
response = image_generation_agent.run(
    "Create an image of a futuristic city with flying cars and vertical gardens"
)
```
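Once the run completes, the generated image can be retrieved from the agent. A minimal sketch, assuming the agent exposes generated images through a `get_images()` helper that returns artifacts with a `url` attribute, as in AGNO's DALL-E examples:
```python
# Retrieve image artifacts produced by the DALL-E tool during the run
# (get_images() and the `url` attribute are assumptions based on AGNO's image-generation examples)
images = image_generation_agent.get_images()
if images:
    for image in images:
        print(f"Generated image URL: {image.url}")
```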
## Audio Processing
AGNO supports audio processing capabilities:
```python
from agno.agent import Agent
from agno.media import Audio
from agno.models.openai import OpenAIChat
from agno.tools.audio import AudioTranscriptionTools, TextToSpeechTools

# Create an agent that can process audio and generate speech
audio_agent = Agent(
    model=OpenAIChat(id="gpt-4o"),
    tools=[
        AudioTranscriptionTools(),  # For transcribing audio to text
        TextToSpeechTools(),        # For converting text to speech
    ],
    instructions=[
        "Transcribe audio inputs and respond appropriately",
        "Generate audio responses when requested",
    ],
)

# Process audio input
with open("data/recording.mp3", "rb") as f:
    audio_data = f.read()

response = audio_agent.run(
    "Transcribe this audio and summarize its content",
    audio=[Audio(content=audio_data, format="mp3")],
)

# Generate audio output
text_response = audio_agent.run("Convert this response to speech")
audio_response = audio_agent.tools["text_to_speech"].convert(
    text=text_response.content,
    voice="alloy",  # Specify voice style
)
```
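As an alternative to a text-to-speech tool, audio output can also come directly from a model with native audio support. A hedged sketch, assuming `OpenAIChat` accepts `modalities`/`audio` options, the response exposes `response_audio`, and a `write_audio_to_file` helper exists, as in AGNO's audio-output examples:
```python
from agno.agent import Agent
from agno.models.openai import OpenAIChat
from agno.utils.audio import write_audio_to_file  # assumed helper from AGNO's audio examples

# Model configured to return both text and audio
native_audio_agent = Agent(
    model=OpenAIChat(
        id="gpt-4o-audio-preview",
        modalities=["text", "audio"],
        audio={"voice": "alloy", "format": "wav"},
    ),
)

response = native_audio_agent.run("Tell me a short story about a lighthouse keeper")
if response.response_audio is not None:
    # response_audio.content holds the generated audio, which the helper writes to disk
    write_audio_to_file(audio=response.response_audio.content, filename="story.wav")
```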
## Combined Multimodal Agent
Create a fully multimodal agent that can handle various input and output types:
```python
from agno.agent import Agent
from agno.media import Image
from agno.models.openai import OpenAIChat
from agno.tools.audio import AudioTranscriptionTools, TextToSpeechTools
from agno.tools.dalle import DalleTools
from agno.tools.reasoning import ReasoningTools

# Create a comprehensive multimodal agent
multimodal_agent = Agent(
    model=OpenAIChat(id="gpt-4o"),  # Supports image input
    tools=[
        ReasoningTools(add_instructions=True),
        DalleTools(),               # Image generation
        AudioTranscriptionTools(),  # Audio transcription
        TextToSpeechTools(),        # Text-to-speech
    ],
    instructions=[
        "Process text, image, and audio inputs",
        "Generate text, image, and audio outputs as appropriate",
        "Use reasoning to determine the best output modality for each response",
    ],
    markdown=True,
)

# Process mixed-modal input
response = multimodal_agent.run(
    "What's happening in this image? Can you also generate a similar image but at night?",
    images=[Image(filepath="data/city_day.jpg")],
)
```
## Structured Multimodal Responses
Combine structured outputs with multimodal content:
```python
from typing import List, Optional

from pydantic import BaseModel, Field

from agno.agent import Agent
from agno.media import Image
from agno.models.openai import OpenAIChat
from agno.tools.dalle import DalleTools

# Define a structured response model with image URLs
class ImageAnalysisResult(BaseModel):
    description: str = Field(..., description="Detailed description of the analyzed image")
    identified_objects: List[str] = Field(..., description="List of objects identified in the image")
    main_colors: List[str] = Field(..., description="Main colors present in the image")
    style_assessment: str = Field(..., description="Assessment of the image style or aesthetic")
    similar_images: Optional[List[str]] = Field(None, description="URLs of generated similar images")

# Create a structured multimodal agent
structured_vision_agent = Agent(
    model=OpenAIChat(id="gpt-4o"),
    tools=[DalleTools()],
    response_model=ImageAnalysisResult,
    instructions=[
        "Analyze the provided image comprehensively",
        "Identify objects, colors, and style",
        "Generate similar images if requested",
    ],
)

# Process image with structured response
response = structured_vision_agent.run(
    "Analyze this image and generate a similar one with different lighting",
    images=[Image(filepath="data/sample_image.jpg")],
)

# Access structured data with image URLs
result = response.content
print(f"Description: {result.description}")
print(f"Objects: {', '.join(result.identified_objects)}")
print(f"Colors: {', '.join(result.main_colors)}")
print(f"Style: {result.style_assessment}")
if result.similar_images:
    print(f"Similar images: {result.similar_images}")
```
## Best Practices for Multimodal Agents
1. **Model Selection**:
   - Use vision-capable models for image inputs (e.g., GPT-4o, Claude 3.7 Sonnet)
   - Match model capabilities to your multimodal needs
2. **Input Preparation** (see the resizing sketch after this list):
   - Optimize images before sending (resize large images)
   - Send images in appropriate formats (JPEG, PNG)
   - Consider image quality vs. token usage tradeoffs
3. **Instructions for Multimodal Models**:
   - Give clear instructions for how to handle different modalities
   - Specify output format expectations for different media types
   - Provide examples of desired multimodal interactions
4. **Performance Considerations** (see the async sketch after this list):
   - Multimodal processing uses more tokens and compute
   - Cache results when possible
   - Consider async processing for large media files
5. **User Experience**:
   - Provide fallback text descriptions for image outputs
   - Consider accessibility needs for multimodal interactions
   - Be explicit about which modalities are being processed
## Example: Advanced Image Analysis Agent
```python
from agno.agent import Agent
from agno.media import Image
from agno.models.openai import OpenAIChat
from agno.tools.dalle import DalleTools
from agno.tools.duckduckgo import DuckDuckGoTools
from agno.tools.reasoning import ReasoningTools

# Create advanced image analysis agent
image_expert = Agent(
    model=OpenAIChat(id="gpt-4o"),
    tools=[
        ReasoningTools(add_instructions=True),
        DalleTools(),
        DuckDuckGoTools(),
    ],
    instructions=[
        "Provide expert analysis of images with the following steps:",
        "1. Describe the image contents in detail",
        "2. Identify any notable objects, people, or elements",
        "3. Analyze composition, style, and technical aspects",
        "4. Research relevant information about identified elements",
        "5. Generate similar or modified images when requested",
    ],
    markdown=True,
)

# Function to handle image analysis with progress tracking
def analyze_image(image_path, query):
    print(f"Analyzing image: {image_path}")
    print("Processing...")
    with open(image_path, "rb") as f:
        image_data = f.read()
    # print_response streams the analysis directly to the console
    image_expert.print_response(
        query,
        images=[Image(content=image_data)],
        stream=True,
        show_full_reasoning=True,
    )
    print("\nAnalysis complete!")

# Example usage
analyze_image(
    "data/artwork.jpg",
    "Analyze this artwork's style and composition. What art movement does it belong to? Can you generate a similar image in a different style?"
)
```