
Gemini Audio Upload

by unscene

analyze_audio

Analyze audio files using Google Gemini AI to extract descriptions, insights, or answers to specific prompts with optional JSON context and system instructions.

Instructions

Analyze an audio file using Google Gemini.

Args:
    audio_path: Path to the audio file (wav, mp3, etc.)
    prompt: The prompt to send to Gemini.
    json_path: Optional path to a JSON file to provide as context.
    json_context: Optional JSON string to provide as context (overrides json_path if provided).
    instruction_file: Optional path to a text file containing system instructions.
    model: The Gemini model to use.

Input Schema

Name              Required  Default
audio_path        Yes
prompt            No        Describe this audio.
json_path         No
json_context      No
instruction_file  No
model             No        gemini-3-pro-preview
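As a sketch of how the schema fits together, here is a hypothetical arguments payload an MCP client might send for this tool. The file path and prompt are illustrative only; per the docstring, json_context takes precedence over json_path when both are supplied.

```python
# Hypothetical example of an arguments payload for analyze_audio.
# File paths and the prompt are illustrative, not from the server itself.
arguments = {
    "audio_path": "/tmp/interview.wav",                 # required
    "prompt": "Summarize the main points discussed.",
    "json_context": '{"speakers": ["host", "guest"]}',  # takes precedence over json_path
    "model": "gemini-3-pro-preview",                    # the default, shown explicitly
}

# Only audio_path is required; the rest fall back to schema defaults.
optional = sorted(set(arguments) - {"audio_path"})
print(optional)
```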

Output Schema

Name    Required
result  Yes

Implementation Reference

  • The MCP handler function for the 'analyze_audio' tool, which handles parameters, loads system instructions if provided, and delegates to analyze_audio_content.
    @mcp.tool()
    def analyze_audio( # pylint: disable=too-many-arguments, too-many-positional-arguments
            audio_path: str,
            prompt: str = "Describe this audio.",
            json_path: Optional[str] = None,
            json_context: Optional[str] = None,
            instruction_file: Optional[str] = None,
            model: str = "gemini-3-pro-preview"
    ) -> str:
        """
        Analyze an audio file using Google Gemini.
    
        Args:
            audio_path: Path to the audio file (wav, mp3, etc.)
            prompt: The prompt to send to Gemini.
            json_path: Optional path to a JSON file to provide as context.
            json_context: Optional JSON string to provide as context (overrides json_path if provided).
            instruction_file: Optional path to a text file containing system instructions.
            model: The Gemini model to use.
        """
        api_key = os.getenv("GOOGLE_API_KEY")
        if not api_key:
            return "Error: GOOGLE_API_KEY not set."
    
        system_instruction = None
        if instruction_file:
            if os.path.exists(instruction_file):
                try:
                    with open(instruction_file, 'r', encoding='utf-8') as f:
                        system_instruction = f.read()
                except Exception as e: # pylint: disable=broad-exception-caught
                    return f"Error reading instruction file: {str(e)}"
            else:
                return f"Error: Instruction file not found at {instruction_file}"
    
        try:
            return analyze_audio_content(
                audio_path,
                prompt,
                json_path,
                json_context,
                model,
                api_key,
                system_instruction
            )
        except Exception as e: # pylint: disable=broad-exception-caught
            return f"Error analyzing audio: {str(e)}"
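Note that the handler never raises to the client: every failure path returns a plain string beginning with "Error". A minimal sketch of how a caller might branch on that convention (is_tool_error is a hypothetical helper, not part of the server):

```python
def is_tool_error(result: str) -> bool:
    """Detect the handler's failure convention: errors come back as
    plain strings starting with 'Error' rather than raised exceptions."""
    return result.startswith("Error")

assert is_tool_error("Error: GOOGLE_API_KEY not set.")
assert is_tool_error("Error analyzing audio: upload failed")
assert not is_tool_error("The clip contains a short piano melody.")
```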
  • Core helper function implementing the audio analysis logic: uploads audio to Gemini, handles JSON context, waits for processing, configures model, and generates response.
    def analyze_audio_content( # pylint: disable=too-many-arguments, too-many-locals, too-many-positional-arguments
            audio_path,
            prompt,
            json_path=None,
            json_context=None,
            model_name="gemini-3-pro-preview",
            api_key=None,
            system_instruction=None
    ):
        """
        Analyzes audio content using Google Gemini.
    
        Args:
            audio_path (str): Path to the audio file.
            prompt (str): Prompt for the model.
            json_path (str, optional): Path to a JSON context file.
            json_context (str, optional): JSON context string.
            model_name (str, optional): Gemini model name.
            api_key (str, optional): Google API key.
            system_instruction (str, optional): System instruction for the model.
    
        Returns:
            str: The model's response.
        """
        if not api_key:
            raise ValueError("API key is required")
    
        genai.configure(api_key=api_key) # type: ignore
    
        files_to_upload = []
    
        # Upload Audio
        if not os.path.exists(audio_path):
            return f"Error: Audio file not found at {audio_path}"
    
        print(f"Uploading audio: {audio_path}")
        # Minimal MIME detection: default to audio/wav, switch to audio/mp3 by extension
        mime_type = "audio/wav"
        if audio_path.lower().endswith(".mp3"):
            mime_type = "audio/mp3"
    
        audio_file = upload_to_gemini(audio_path, mime_type=mime_type)
        files_to_upload.append(audio_file)
    
        # Handle JSON
        json_content = ""
        if json_context:
            json_content = json_context
        elif json_path:
            if os.path.exists(json_path):
                print(f"Reading JSON: {json_path}")
                with open(json_path, 'r', encoding='utf-8') as f:
                    json_content = f.read()
            else:
                print(f"Warning: JSON file not found at {json_path}")
    
        # Wait for processing
        wait_for_files_active(files_to_upload)
    
        # Create the model
        generation_config = {
            "temperature": 1,
            "top_p": 0.95,
            "top_k": 64,
            "max_output_tokens": 8192,
            "response_mime_type": "text/plain",
        }
    
        model = genai.GenerativeModel( # type: ignore
            model_name=model_name,
            generation_config=generation_config, # type: ignore
            system_instruction=system_instruction
        )
    
        # Construct the prompt parts
        prompt_parts = []
        if json_content:
            prompt_parts.append(f"Context JSON:\n{json_content}\n")
    
        prompt_parts.append(prompt)
        prompt_parts.append(audio_file)
    
        # Generate content
        print("Generating content...")
        response = model.generate_content(prompt_parts)
    
        return response.text
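The MIME detection above only distinguishes .mp3 from a .wav default; any other extension is silently labeled audio/wav. A broader guess using the standard library's mimetypes module could look like this (guess_audio_mime is a hypothetical helper, not part of the server):

```python
import mimetypes

def guess_audio_mime(path: str, default: str = "audio/wav") -> str:
    """Guess an audio MIME type from the file extension, falling back
    to a default when the extension is unknown or not an audio type."""
    mime, _ = mimetypes.guess_type(path)
    if mime and mime.startswith("audio/"):
        return mime
    return default

print(guess_audio_mime("clip.mp3"))  # typically "audio/mpeg"
print(guess_audio_mime("clip.xyz"))  # unknown extension: falls back to "audio/wav"
```

One caveat: Python's mimetypes maps .mp3 to the standard "audio/mpeg", while the server hard-codes "audio/mp3" (the label used in Gemini's media documentation), so a swap-in may need to normalize between the two.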
  • Helper function to upload audio file to Gemini API.
    def upload_to_gemini(path, mime_type=None):
        """Uploads the given file to Gemini.
    
        See https://ai.google.dev/gemini-api/docs/prompting_with_media
        """
        file = genai.upload_file(path, mime_type=mime_type) # type: ignore
        print(f"Uploaded file '{file.display_name}' as: {file.uri}")
        return file
  • Helper function to wait for uploaded files to be processed and active in Gemini API.
    def wait_for_files_active(files):
        """Waits for the given files to be active.
    
        Some files uploaded to the Gemini API need to be processed before they can
        be used as prompt inputs. The status can be seen by querying the file's
        "state" field.
    
        This implementation relies on the file's "name" field to perform the query,
        and if the state is not ACTIVE, it waits 10 seconds and checks again.
        """
        print("Waiting for file processing...")
        for name in (file.name for file in files):
            file = genai.get_file(name) # type: ignore
            while file.state.name == "PROCESSING":
                print(".", end="", flush=True)
                time.sleep(10)
                file = genai.get_file(name) # type: ignore
            if file.state.name != "ACTIVE":
                raise RuntimeError(f"File {file.name} failed to process")
        print("...all files ready")
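As written, wait_for_files_active polls every 10 seconds with no upper bound, so a file stuck in PROCESSING would hang the tool indefinitely. A sketch of the same loop with a hard timeout (wait_with_timeout is hypothetical; a real caller would pass a closure over genai.get_file):

```python
import time

def wait_with_timeout(get_state, timeout=300.0, interval=10.0):
    """Poll get_state() until it returns 'ACTIVE', failing fast on a bad
    state and raising TimeoutError instead of spinning forever. A real
    caller would pass something like: lambda: genai.get_file(name).state.name
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        state = get_state()
        if state == "ACTIVE":
            return
        if state != "PROCESSING":
            raise RuntimeError(f"File failed to process (state={state})")
        time.sleep(interval)
    raise TimeoutError("File did not become ACTIVE before the timeout")

# Simulated polling: two PROCESSING responses, then ACTIVE.
states = iter(["PROCESSING", "PROCESSING", "ACTIVE"])
wait_with_timeout(lambda: next(states), timeout=5, interval=0)
```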
  • MCP server run block that registers and runs the tools, including the help mentioning analyze_audio.
    if __name__ == "__main__":
        if "--help" in sys.argv:
            print("GeminiAudio MCP Server")
            print("Run this server using an MCP client (e.g. Claude Desktop, VS Code MCP extension).")
            print("\nTools:")
            print("  - analyze_audio: Analyze an audio file using Google Gemini.")
            sys.exit(0)
        mcp.run()
Behavior: 2/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

No annotations are provided, so the description carries the full burden of behavioral disclosure. While 'analyze' implies a read-only operation, the description doesn't clarify permissions, rate limits, costs, response format, or error handling. It names Google Gemini but gives no detail on what the analysis entails or how the tool behaves.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is appropriately sized and front-loaded with the core purpose in the first sentence. The parameter list is structured but could be more concise; some explanations are brief yet clear. No redundant information is present, making it efficient.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 3/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given 6 parameters with 0% schema description coverage and no annotations, the description provides basic parameter info but lacks behavioral context, usage guidelines, and output details. An output schema exists, so return values needn't be explained, but a complex tool with multiple inputs warrants fuller coverage of constraints and parameter interactions.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 0%, so the description must compensate. It lists all 6 parameters with brief explanations (e.g., 'Path to the audio file', 'The prompt to send to Gemini'), adding basic semantics beyond schema titles. However, it doesn't detail formats (e.g., audio file types beyond 'wav, mp3, etc.'), constraints, or interactions between parameters like json_context overriding json_path.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 4/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose: 'Analyze an audio file using Google Gemini.' It specifies the verb ('analyze'), resource ('audio file'), and technology ('Google Gemini'), making the function unambiguous. However, with no sibling tools provided, there's no opportunity to differentiate from alternatives, preventing a score of 5.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 2/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides no guidance on when to use this tool versus alternatives, prerequisites, or typical use cases. It simply lists parameters without contextual advice. With no sibling tools mentioned, there's no comparison, but general usage context is missing.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
