Which integrations are available for this server?

Provides tools for managing Kubeflow training jobs on Kubernetes clusters, enabling AI agents to plan, submit, monitor, and manage training jobs through natural language.

How do I use Kubeflow MCP Server?

1. Click on "Install Server". 2. Wait a few minutes for the server to deploy. Once ready, it will show a "Started" state. 3. In the chat, type @ followed by the MCP server name and your instructions, e.g., "@Kubeflow MCP Server Fine-tune gemma-2b on the alpaca dataset" That's it! The server will respond to your query, and you can continue using it as needed. Here is a step-by-step guide with screenshots.

Kubeflow MCP Server

Official

by kubeflow

Overview Schema Related Servers Score Discussions

Python

Remote

Kubeflow MCP Server

License Python Join Slack

Proposal: KEP-936 · ROADMAP · SECURITY · CONTRIBUTING

Overview

The Kubeflow MCP Server exposes Kubeflow Training operations as Model Context Protocol tools, enabling AI agents (Claude, Cursor, Claude Code, or any custom agents etc.) to plan, submit, monitor, and manage training jobs through natural language — without users needing to learn Kubernetes or the Kubeflow SDK directly.

Benefits

Agent-Native: Tools auto-discovered via MCP — no manual API wiring
Guided Workflow: Phase ordering with next-step hints (Plan → Discover → Train → Monitor)
Preview-Before-Submit: Every mutating operation requires explicit confirmation
Security-First: Persona gating, namespace enforcement, input validation, bearer/JWT auth
Multi-Platform: Auto-detects OpenShift, EKS, GKE with platform-specific guidance
Token-Efficient: Progressive/semantic modes compress 23 tools into 2-3 meta-tools
Extensible: Plugin architecture for additional Kubeflow clients (TODO: optimizer, hub)

Related MCP server: mlctl

Get Started

Install from source

git clone https://github.com/kubeflow/mcp-server.git
cd mcp-server
pip install .

Run the server

kubeflow-mcp serve

Once published to PyPI, install with pip install kubeflow-mcp.

Run with Docker

Pre-built multi-arch images are published to GHCR on every release:

docker run --rm -p 8000:8000 \
  -e KUBEFLOW_MCP_AUTH_TOKEN=my-secret-token \
  ghcr.io/kubeflow/mcp-server:latest

The server listens on http://localhost:8000/mcp.

Environment variables

Variable	Default	Description
`MCP_TRANSPORT`	`http`	Transport protocol (`http`, `sse`, `stdio`)
`KUBEFLOW_MCP_AUTH_TOKEN`	(none)	Bearer token for HTTP auth
`KUBEFLOW_MCP_JWKS_URI`	(none)	JWKS endpoint for JWT verification (production)
`KUBEFLOW_MCP_JWT_ISSUER`	(none)	Expected JWT issuer
`KUBEFLOW_MCP_JWT_AUDIENCE`	(none)	Expected JWT audience
`KUBEFLOW_MCP_CLIENTS`	`trainer`	Comma-separated client modules to load
`KUBEFLOW_MCP_PERSONA`	`readonly`	Tool persona (`readonly`, `data-scientist`, `ml-engineer`, `platform-admin`)
`LOG_FORMAT`	`json`	Log format (`json`, `console`)
`LOG_LEVEL`	`INFO`	Log level (`DEBUG`, `INFO`, `WARNING`, `ERROR`)

MCP client config (HTTP transport)

{
  "mcpServers": {
    "kubeflow": {
      "url": "http://localhost:8000/mcp",
      "headers": { "Authorization": "Bearer my-secret-token" }
    }
  }
}

For in-cluster deployments, replace localhost:8000 with the Kubernetes Service address and mount KUBEFLOW_MCP_AUTH_TOKEN from a Secret.

Example: Fine-tune a model via AI agent

Once connected, your AI agent can run a complete training workflow through natural language:

User: "Fine-tune gemma-2b on the alpaca dataset"

Agent calls: check_compatibility()        → ✅ K8s 1.29, Trainer CRD installed
Agent calls: get_cluster_resources()      → 4x A100 GPUs available
Agent calls: estimate_resources("google/gemma-2b") → needs ~16GB GPU, 1x A100
Agent calls: list_runtimes()              → torchtune-llama, torchtune-gemma, ...
Agent calls: fine_tune(                   → preview config (confirmed=False)
    model="hf://google/gemma-2b",
    dataset="hf://tatsu-lab/alpaca",
    runtime="torchtune-gemma-2b"
)
Agent calls: fine_tune(..., confirmed=True) → TrainJob "train-gemma-abc" created
Agent calls: get_training_logs("train-gemma-abc") → training progress...

Every mutating tool requires confirmed=True — agents always preview before submitting.

MCP Client Config

Add to .cursor/mcp.json (or use the .mcp.json at the repo root for local dev):

{
  "mcpServers": {
    "kubeflow": {
      "command": "uv",
      "args": ["run", "kubeflow-mcp", "serve"]
    }
  }
}

claude mcp add kubeflow -- kubeflow-mcp serve

Tools

23 tools organized by workflow phase:

Phase	Tools	Description
Planning	`pre_flight`, `check_compatibility`, `get_cluster_resources`, `estimate_resources`	Environment validation and resource estimation
Discovery	`list_training_jobs`, `get_training_job`, `list_runtimes`, `get_runtime`	Browse jobs and available runtimes
Training	`fine_tune`, `run_custom_training`, `run_container_training`	Submit LoRA/QLoRA fine-tuning, custom scripts, or container jobs
Monitoring	`get_training_logs`, `get_training_events`, `wait_for_training`	Track progress, debug failures
Lifecycle	`delete_training_job`, `update_training_job`	Manage existing jobs (ownership-guarded)
Platform	`inspect_crd`, `inspect_controller`, `patch_runtime`, `create_runtime`, `delete_runtime`	Cluster inspection and runtime management
Health	`health_check`, `get_server_logs`	Server diagnostics

Requirements

MCP Server	Kubeflow Trainer	Kubeflow SDK	Python	Kubernetes
0.1.x	>= 2.2.0	>= 0.4.0	3.10 - 3.12	>= 1.27

CLI Reference

`kubeflow-mcp serve`

kubeflow-mcp serve \
  --clients trainer \             # modules: trainer, optimizer (stub), hub (stub)
  --persona ml-engineer \         # readonly | data-scientist | ml-engineer | platform-admin
  --mode full \                   # full | progressive | semantic
  --instruction-tier full \       # full | compact | minimal
  --transport stdio \             # stdio | http | sse
  --auth-token SECRET \           # bearer token for HTTP auth (dev/staging)
  --log-level INFO \              # DEBUG | INFO | WARNING | ERROR
  --log-format console \          # console | json (auto-detected if omitted)
  --no-banner                     # suppress startup banner

--mode progressive exposes 3 meta-tools (~85 tokens) for hierarchical discovery. --mode semantic exposes 2 meta-tools (~69 tokens) using embedding search. Both reduce token consumption significantly for agent workflows.

When using --transport http, configure auth to secure the endpoint:

# Simple API key (dev/staging)
kubeflow-mcp serve --transport http --auth-token my-secret-token

# Or via env var
export KUBEFLOW_MCP_AUTH_TOKEN=my-secret-token
kubeflow-mcp serve --transport http

# JWT verification (production)
export KUBEFLOW_MCP_JWKS_URI=https://auth.example.com/.well-known/jwks.json
export KUBEFLOW_MCP_JWT_ISSUER=https://auth.example.com
export KUBEFLOW_MCP_JWT_AUDIENCE=kubeflow-mcp
kubeflow-mcp serve --transport http

Without auth configured, the server logs a warning that the HTTP endpoint is open.

kubeflow-mcp agent \
  --backend ollama \              # ollama (default; more backends planned)
  --model qwen3:8b \              # model name for the backend
  --mode full \                   # full | progressive | semantic
  --thinking                      # enable thinking output (supported models)

Development

make install-dev                  # setup environment
make verify                       # lint + format check
make test-python                  # run tests
make inspector                    # launch MCP Inspector (stdio)
make inspector TRANSPORT=http     # Inspector + Streamable HTTP (start server separately)
make inspector TRANSPORT=sse      # Inspector + SSE (start server separately)

Community

Slack: Join #kubeflow-ml-experience on CNCF Slack
Meetings: Attend the Kubeflow SDK and ML Experience bi-weekly call
GitHub: Issues and contributions at kubeflow/mcp-server

Documentation

CONTRIBUTING: Development workflow and PR guidelines
KEP-936: Design proposal

License

Apache License 2.0 — see LICENSE.

This server cannot be installed

license - permissive license

quality - not tested

maintenance

How are these scores calculated?

Maintenance

–Maintainers

<1hResponse time

–Release cycle

–Releases (12mo)

Commit activity

Issues opened vs closed

Resources

GitHub Repository

Need Help?

Related Servers

Unclaimed servers have limited discoverability.

Looking for Admin?

If you are the server author, to access and configure the admin panel.

Latest Blog Posts

Your AI Chatbot Just Exposed Your CEO's Salary to an Intern
By Om-Shree-0709 on July 2, 2026.
Agent Identity
MCP Security
OAuth Delegation
Why MCP Servers Need Execution Sandboxing (And Why Your Current Stack Isn't Enough)
By Om-Shree-0709 on June 30, 2026.
Agentic Ai
Prompt Injection
WebAssembly
Lightport: Open-Sourcing Glama's AI Gateway
By punkpeye on April 27, 2026.
OpenAI
open source

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/kubeflow/mcp-server'

If you have feedback or need assistance with the MCP directory API, please join our Discord server