Skip to main content
Glama

multimodal-mcp

Multi-provider media generation MCP server. Generate images, videos, audio, and transcriptions from text prompts using OpenAI, xAI, Gemini, ElevenLabs, and BFL (FLUX) through a single unified interface.

Features

  • 🎨 Image Generation — Generate images via OpenAI (gpt-image-1), xAI (grok-imagine-image), Gemini (imagen-4), or BFL (FLUX Pro 1.1)

  • ✏️ Image Editing — Edit images via OpenAI, xAI, Gemini, or BFL (FLUX Kontext)

  • 🎬 Video Generation — Generate videos via OpenAI (sora-2), xAI (grok-imagine-video), or Gemini (veo-3.1)

  • 🔊 Audio Generation — Text-to-speech via OpenAI (tts-1), Gemini, or ElevenLabs (Flash v2.5). Sound effects via ElevenLabs

  • 🎙️ Audio Transcription — Speech-to-text via OpenAI (Whisper) or ElevenLabs (Scribe)

  • 🔄 Auto-Discovery — Automatically detects configured providers from environment variables

  • 🎯 Provider Selection — Auto-selects or explicitly choose a provider per request

  • 📁 File Output — Saves all generated media to disk with descriptive filenames

Related MCP server: universal-image-mcp

Quick Start

Set the API key for at least one provider. Most users only need one — add more to access additional providers.

# Using OpenAI
claude mcp add multimodal-mcp -e OPENAI_API_KEY=sk-... -- npx -y @r16t/multimodal-mcp@latest

# Or using xAI
# claude mcp add multimodal-mcp -e XAI_API_KEY=xai-... -- npx -y @r16t/multimodal-mcp@latest

# Or using Gemini
# claude mcp add multimodal-mcp -e GEMINI_API_KEY=AIza... -- npx -y @r16t/multimodal-mcp@latest

# Or using ElevenLabs (audio + transcription)
# claude mcp add multimodal-mcp -e ELEVENLABS_API_KEY=xi-... -- npx -y @r16t/multimodal-mcp@latest

# Or using BFL/FLUX (images)
# claude mcp add multimodal-mcp -e BFL_API_KEY=... -- npx -y @r16t/multimodal-mcp@latest

Using a different editor? See setup instructions for Claude Desktop, Cursor, VS Code, Windsurf, and Cline.

Environment Variables

Variable

Required

Description

OPENAI_API_KEY

At least one provider key

OpenAI API key — enables image, video, audio generation, and transcription via gpt-image-1, sora-2, tts-1, and whisper-1

XAI_API_KEY

At least one provider key

xAI API key — enables image and video generation via grok-imagine-image and grok-imagine-video

GEMINI_API_KEY

At least one provider key

Gemini API key — enables image, video, and audio generation via imagen-4, veo-3.1, and gemini-2.5-flash-preview-tts

GOOGLE_API_KEY

Alias for GEMINI_API_KEY; either name is accepted

ELEVENLABS_API_KEY

At least one provider key

ElevenLabs API key — enables audio generation (TTS, sound effects) and transcription via Flash v2.5 and Scribe v1

BFL_API_KEY

At least one provider key

BFL API key — enables image generation and editing via FLUX Pro 1.1 and FLUX Kontext

MEDIA_OUTPUT_DIR

No

Directory for saved media files. Defaults to the current working directory

Available Tools

generate_image

Generate an image from a text prompt.

Parameter

Type

Required

Description

prompt

string

Yes

Text description of the image to generate

provider

string

No

Provider to use: openai, xai, google, bfl. Auto-selects if omitted

aspectRatio

string

No

Aspect ratio: 1:1, 16:9, 9:16, 4:3, 3:4

quality

string

No

Quality level: low, standard, high

outputDirectory

string

No

Directory to save the generated file. Absolute or relative path. Defaults to MEDIA_OUTPUT_DIR or cwd

providerOptions

object

No

Provider-specific parameters passed through directly

generate_video

Generate a video from a text prompt. Video generation is asynchronous and may take several minutes.

Parameter

Type

Required

Description

prompt

string

Yes

Text description of the video to generate

provider

string

No

Provider to use: openai, xai, google. Auto-selects if omitted

duration

number

No

Video duration in seconds (provider limits apply)

aspectRatio

string

No

Aspect ratio: 16:9, 9:16, 1:1

resolution

string

No

Resolution: 480p, 720p, 1080p

outputDirectory

string

No

Directory to save the generated file. Absolute or relative path. Defaults to MEDIA_OUTPUT_DIR or cwd

providerOptions

object

No

Provider-specific parameters passed through directly

generate_audio

Generate audio from text. Supports text-to-speech and sound effects. Audio generation is synchronous.

Parameter

Type

Required

Description

text

string

Yes

Text to convert to speech, or a description of the sound effect to generate

provider

string

No

Provider to use: openai, google, elevenlabs. Auto-selects if omitted

voice

string

No

Voice name (provider-specific). OpenAI: alloy, ash, coral, echo, fable, nova, onyx, sage, shimmer. Google: Kore, Charon, Fenrir, Aoede, Puck, etc. ElevenLabs: voice ID

speed

number

No

Speech speed multiplier (OpenAI only): 0.25 to 4.0

format

string

No

Output format (OpenAI only): mp3, opus, aac, flac, wav, pcm

outputDirectory

string

No

Directory to save the generated file. Absolute or relative path. Defaults to MEDIA_OUTPUT_DIR or cwd

providerOptions

object

No

Provider-specific parameters passed through directly. ElevenLabs: set mode: "sound-effect" for sound effects, model for TTS model selection

transcribe_audio

Transcribe audio to text (speech-to-text).

Parameter

Type

Required

Description

audioPath

string

Yes

Absolute path to the audio file to transcribe

provider

string

No

Provider to use: openai, elevenlabs. Auto-selects if omitted

language

string

No

Language code (e.g., en, fr, es) to hint the transcription language

providerOptions

object

No

Provider-specific parameters passed through directly

list_providers

List all configured media generation providers and their capabilities. Takes no parameters.

Provider Capabilities

Provider

Image

Image Editing

Video

Audio

Transcription

Key Models

OpenAI

gpt-image-1, sora-2, tts-1, whisper-1

xAI

grok-imagine-image, grok-imagine-video

Gemini

imagen-4, veo-3.1, gemini-2.5-flash-preview-tts

ElevenLabs

eleven_flash_v2_5, scribe_v1

BFL

flux-pro-1.1, flux-kontext-pro

Image Aspect Ratios

Provider

1:1

16:9

9:16

4:3

3:4

OpenAI

xAI

Gemini

BFL

Video Aspect Ratios & Resolutions

Provider

16:9

9:16

1:1

480p

720p

1080p

OpenAI

xAI

Gemini

Audio Formats

Provider

mp3

opus

aac

flac

wav

pcm

OpenAI

Gemini

ElevenLabs

Troubleshooting

No providers configured

[config] No provider API keys detected

Set at least one of OPENAI_API_KEY, XAI_API_KEY, GEMINI_API_KEY, ELEVENLABS_API_KEY, or BFL_API_KEY in the MCP server's env block.

Provider not available for requested media type

Each provider supports different media types (see Provider Capabilities). If you specify a provider that isn't configured (no API key) or doesn't support the requested media type, you'll receive an error. Omit the provider parameter to auto-select from configured providers.

Video generation timeout

Video generation polls for up to 10 minutes. If your video hasn't completed in that window, the request will fail with a timeout error. Try a shorter duration or a simpler prompt.

xAI image generation returned no data

This indicates the xAI API returned an empty response. Check that your XAI_API_KEY is valid and that your prompt does not violate xAI content policies.

Gemini image/video generation failed: 403

Verify your GEMINI_API_KEY has the Generative Language API enabled in Google Cloud Console.

Development

npm run build      # Compile TypeScript to build/
npm test           # Run tests with Vitest
npm run lint       # Lint and auto-fix with ESLint
npm run typecheck  # Type-check without emitting
npm run dev        # Watch mode for TypeScript compilation

Editor Setup

Replace OPENAI_API_KEY with your provider of choice (XAI_API_KEY, GEMINI_API_KEY, ELEVENLABS_API_KEY, BFL_API_KEY). You can set multiple keys to enable multiple providers.

Claude Desktop

Add to ~/Library/Application Support/Claude/claude_desktop_config.json:

{
  "mcpServers": {
    "multimodal-mcp": {
      "command": "npx",
      "args": ["@r16t/multimodal-mcp@latest"],
      "env": {
        "OPENAI_API_KEY": "sk-..."
      }
    }
  }
}

Cursor

Add to .cursor/mcp.json in your project root (or ~/.cursor/mcp.json globally):

{
  "mcpServers": {
    "multimodal-mcp": {
      "command": "npx",
      "args": ["@r16t/multimodal-mcp@latest"],
      "env": {
        "OPENAI_API_KEY": "sk-..."
      }
    }
  }
}

VS Code (GitHub Copilot)

Add to .vscode/mcp.json in your project root:

{
  "servers": {
    "multimodal-mcp": {
      "command": "npx",
      "args": ["@r16t/multimodal-mcp@latest"],
      "env": {
        "OPENAI_API_KEY": "sk-..."
      }
    }
  }
}

Windsurf

Add to ~/.codeium/windsurf/mcp_config.json:

{
  "mcpServers": {
    "multimodal-mcp": {
      "command": "npx",
      "args": ["@r16t/multimodal-mcp@latest"],
      "env": {
        "OPENAI_API_KEY": "sk-..."
      }
    }
  }
}

Cline

Add to ~/Library/Application Support/Code/User/globalStorage/saoudrizwan.claude-dev/settings/cline_mcp_settings.json:

{
  "mcpServers": {
    "multimodal-mcp": {
      "command": "npx",
      "args": ["@r16t/multimodal-mcp@latest"],
      "env": {
        "OPENAI_API_KEY": "sk-..."
      }
    }
  }
}

License

MIT

Install Server
A
license - permissive license
A
quality
C
maintenance

Maintenance

Maintainers
Response time
0dRelease cycle
4Releases (12mo)
Commit activity

Resources

Unclaimed servers have limited discoverability.

Looking for Admin?

If you are the server author, to access and configure the admin panel.

Latest Blog Posts

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/rsmdt/multimodal-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server