
Kubernetes + Prometheus SRE MCP Server

🤖 Kubernetes + Prometheus SRE MCP Server: natural language cluster ops, SLO monitoring, and PromQL queries via Claude

Natural language Kubernetes operations, powered by Model Context Protocol (MCP)
Built to scale from a single cluster to multi-cluster, multi-team enterprise environments.



🎯 What Is This?

An MCP (Model Context Protocol) server that exposes Kubernetes SRE operations as tools an AI assistant can call.

You:    "Run the high error rate runbook for the production namespace"

Claude: [calls run_runbook → executes org-approved diagnosis sequence]
        Step 1: Checked deployments: nginx (3/3), api-service (1/3 ⚠️)
        Step 2: Found pod api-service-7f9d: 47 restarts, OOMKilled
        Step 3: Warning events: OOMKilled x3 in last 10 minutes
        Recommendation: Increase memory limit to 512Mi + scale to 5 replicas

✨ What's New in v2.0

| Feature | v1 | v2 |
| --- | --- | --- |
| Clusters supported | 1 (hardcoded) | Many (dynamic context switching) |
| Write operations | Unrestricted | Policy-checked with guardrails |
| Audit trail | None | Full structured JSON log |
| Incident diagnosis | Ad-hoc | Encoded runbooks (standardized) |
| Operational consistency | Per-engineer | Org-wide enforced |


🛠️ Tools

Read

| Tool | Description |
| --- | --- |
| list_clusters | All clusters in kubeconfig |
| get_pods | Pod status, restarts, container states |
| get_crashlooping_pods | CrashLoopBackOff pods across all namespaces |
| get_pod_logs | Logs, including those of the previous crashed container |
| get_node_health | Node readiness and pressure conditions |
| get_deployments | Desired vs ready vs available replicas |
| get_events | Warning events (key incident signal) |
| get_namespaces | All namespaces |
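
To illustrate what a tool such as get_crashlooping_pods has to inspect, here is a minimal sketch that filters pod objects for containers waiting with reason CrashLoopBackOff. It works on plain dictionaries shaped like the Kubernetes API's JSON pod representation. The function name `find_crashlooping` and the sample data are hypothetical, and the project's actual code in tools/k8s_tools.py may work differently.

```python
from typing import Any


def find_crashlooping(pods: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return one record per container stuck in CrashLoopBackOff."""
    hits = []
    for pod in pods:
        meta = pod.get("metadata", {})
        statuses = pod.get("status", {}).get("containerStatuses") or []
        for cs in statuses:
            # A crash-looping container sits in the "waiting" state
            # with reason "CrashLoopBackOff".
            waiting = (cs.get("state") or {}).get("waiting") or {}
            if waiting.get("reason") == "CrashLoopBackOff":
                hits.append({
                    "namespace": meta.get("namespace"),
                    "pod": meta.get("name"),
                    "container": cs.get("name"),
                    "restarts": cs.get("restartCount", 0),
                })
    return hits


# Hypothetical sample, shaped like the items of `kubectl get pods -o json`.
SAMPLE_PODS = [
    {"metadata": {"name": "api-service-7f9d", "namespace": "production"},
     "status": {"containerStatuses": [
         {"name": "api", "restartCount": 47,
          "state": {"waiting": {"reason": "CrashLoopBackOff"}}}]}},
    {"metadata": {"name": "nginx-5c6b", "namespace": "production"},
     "status": {"containerStatuses": [
         {"name": "nginx", "restartCount": 0, "state": {"running": {}}}]}},
]

if __name__ == "__main__":
    for hit in find_crashlooping(SAMPLE_PODS):
        print(hit)
```

Keeping the filter as a pure function over dictionaries makes it testable without a live cluster.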

Write (Policy-checked + Audit-logged)

| Tool | Policy Enforced |
| --- | --- |
| scale_deployment | Max replicas · Blocked namespaces · Prod minimums |

SRE Runbooks

| Tool | Description |
| --- | --- |
| list_runbooks | Available runbooks with triggers |
| run_runbook | Execute org-standard diagnosis sequence |

Governance

| Tool | Description |
| --- | --- |
| get_audit_log | All recent operations with timestamps |
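
A structured JSON audit trail can be kept as one JSON object per line. The following is a minimal sketch under that assumption; the field names (`tool`, `params`, `outcome`, `reason`) and the helpers `write_audit_entry` and `read_recent` are hypothetical and are not taken from the project's audit.py.

```python
import io
import json
from datetime import datetime, timezone
from typing import Any, TextIO


def write_audit_entry(stream: TextIO, tool: str, params: dict[str, Any],
                      outcome: str, reason: str = "") -> dict[str, Any]:
    """Append one JSON object per line (JSON Lines) and return it."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "tool": tool,
        "params": params,
        "outcome": outcome,  # assumed values: "allowed", "denied", "error"
        "reason": reason,
    }
    stream.write(json.dumps(entry, sort_keys=True) + "\n")
    return entry


def read_recent(stream: TextIO, limit: int = 20) -> list[dict[str, Any]]:
    """Return the last `limit` entries, oldest first."""
    stream.seek(0)
    lines = [line for line in stream if line.strip()]
    return [json.loads(line) for line in lines[-limit:]]


if __name__ == "__main__":
    log = io.StringIO()  # stands in for a log file opened in "a+" mode
    write_audit_entry(log, "scale_deployment",
                      {"namespace": "production", "replicas": 0},
                      "denied", "below production minimum")
    print(read_recent(log))
```

One object per line keeps the log appendable and easy to tail or ship to a log pipeline.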


🏗️ Architecture

Claude Desktop (MCP Host)
       │
       │  MCP Protocol (stdio / JSON-RPC)
       ▼
┌─────────────────────────────────────┐
│         SRE MCP Server v2           │
│  server.py        ← entry point     │
│  cluster_manager  ← multi-cluster   │
│  policy.py        ← write guards    │
│  audit.py         ← JSON audit log  │
│  runbooks.py      ← SRE runbooks    │
└──────────────┬──────────────────────┘
               │  kubernetes Python SDK
               ▼
    ┌──────────────────────┐
    │  Kubernetes Clusters │
    │  (any kubeconfig     │
    │   context)           │
    └──────────────────────┘

🚀 Quick Start

git clone https://github.com/ManishMaurya22/sre-mcp-server
cd sre-mcp-server
python3.11 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Then register the server with Claude Desktop by editing its configuration file (on macOS, ~/Library/Application Support/Claude/claude_desktop_config.json):

{
  "mcpServers": {
    "sre-k8s": {
      "command": "/Users/<YOUR_USERNAME>/sre-mcp-server/venv/bin/python",
      "args": ["/Users/<YOUR_USERNAME>/sre-mcp-server/server.py"]
    }
  }
}
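
To avoid hand-editing the two absolute paths, the entry can be generated from the checkout directory. This is a small sketch only: `build_config_entry` is a hypothetical helper that is not part of the repository, and it assumes the venv layout created in the steps above. Merging the result into an existing configuration file is left out.

```python
import json
from pathlib import Path


def build_config_entry(repo_dir: Path) -> dict:
    """Build the Claude Desktop entry shown above for a given checkout."""
    return {
        "mcpServers": {
            "sre-k8s": {
                # Interpreter inside the virtual environment, so the
                # server runs with the installed requirements.
                "command": str(repo_dir / "venv" / "bin" / "python"),
                "args": [str(repo_dir / "server.py")],
            }
        }
    }


if __name__ == "__main__":
    # Assumes the repository was cloned into the home directory.
    repo = Path.home() / "sre-mcp-server"
    print(json.dumps(build_config_entry(repo), indent=2))
```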

See docs/SETUP.md for full setup guide.


🔐 Policy Configuration

export POLICY_MAX_REPLICAS=30
export POLICY_SCALE_BLOCKED_NS="kube-system,gatekeeper-system"
export POLICY_PROD_NAMESPACES="production,prod"
export POLICY_PROD_MIN_REPLICAS=2

With these settings, a request that violates the policy is denied:

You:    "Scale nginx to 0 in production"
Claude: ❌ Policy Denied: scaling to 0 not allowed in production (min: 2)
        Operation audit-logged.
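
The guardrail logic implied by these variables can be sketched as a pure function. The variable names come from the exports above; the function `check_scale`, the fallback defaults, the order of the checks, and the wording of the denial reasons are assumptions and may not match the project's policy.py.

```python
import os
from typing import Mapping


def _csv(value: str) -> set[str]:
    """Split a comma-separated setting into a set of names."""
    return {item.strip() for item in value.split(",") if item.strip()}


def check_scale(namespace: str, replicas: int,
                env: Mapping[str, str] = os.environ) -> tuple[bool, str]:
    """Return (allowed, reason) for a requested scale operation."""
    # Fallback defaults below are assumptions, not documented behavior.
    max_replicas = int(env.get("POLICY_MAX_REPLICAS", "30"))
    blocked = _csv(env.get("POLICY_SCALE_BLOCKED_NS", ""))
    prod = _csv(env.get("POLICY_PROD_NAMESPACES", ""))
    prod_min = int(env.get("POLICY_PROD_MIN_REPLICAS", "0"))

    if replicas < 0:
        return False, "replicas must be non-negative"
    if namespace in blocked:
        return False, f"scaling is blocked in namespace {namespace}"
    if replicas > max_replicas:
        return False, f"{replicas} exceeds max replicas ({max_replicas})"
    if namespace in prod and replicas < prod_min:
        return False, f"production minimum is {prod_min} replicas"
    return True, "ok"


if __name__ == "__main__":
    example_env = {
        "POLICY_MAX_REPLICAS": "30",
        "POLICY_SCALE_BLOCKED_NS": "kube-system,gatekeeper-system",
        "POLICY_PROD_NAMESPACES": "production,prod",
        "POLICY_PROD_MIN_REPLICAS": "2",
    }
    print(check_scale("production", 0, example_env))
    print(check_scale("production", 5, example_env))
```

Passing the environment as a mapping keeps the check free of global state, so the same function can be exercised with different policies in tests.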

📋 Encoded Runbooks

Available: high_error_rate · node_pressure · deployment_rollback

You: "Run the high_error_rate runbook for production"

Claude runs in order:
  1. get_deployments    → spot unhealthy deployments
  2. get_pods           → check restart counts
  3. get_events         → surface warning signals
  4. get_crashlooping_pods → cluster-wide check
  + surfaces remediation hints
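
An encoded runbook of this kind can be represented as an ordered list of tool names that a small runner executes in sequence. The sketch below assumes this representation. The step order follows the list above, while `HIGH_ERROR_RATE`, the `run_runbook` signature, and the stub tools are hypothetical and are not the project's runbooks.py.

```python
from typing import Any, Callable

Tool = Callable[..., Any]

# Step order taken from the high_error_rate description above.
# True: the tool is called with the namespace. False: cluster-wide call.
HIGH_ERROR_RATE = [
    ("get_deployments", True),
    ("get_pods", True),
    ("get_events", True),
    ("get_crashlooping_pods", False),
]


def run_runbook(steps: list[tuple[str, bool]], tools: dict[str, Tool],
                namespace: str) -> list[dict[str, Any]]:
    """Run each step in order; a failing step is recorded, not fatal."""
    report = []
    for number, (tool_name, namespaced) in enumerate(steps, start=1):
        try:
            tool = tools[tool_name]
            result = tool(namespace) if namespaced else tool()
            report.append({"step": number, "tool": tool_name, "result": result})
        except Exception as exc:
            # Keep going so later steps can still surface useful signals.
            report.append({"step": number, "tool": tool_name, "error": str(exc)})
    return report


if __name__ == "__main__":
    # Stub tools stand in for the real Kubernetes-backed implementations.
    stubs = {
        "get_deployments": lambda ns: f"deployments in {ns}",
        "get_pods": lambda ns: f"pods in {ns}",
        "get_events": lambda ns: f"warning events in {ns}",
        "get_crashlooping_pods": lambda: "crash-looping pods, all namespaces",
    }
    for row in run_runbook(HIGH_ERROR_RATE, stubs, "production"):
        print(row)
```

Injecting the tools as a dictionary keeps the runbook definition declarative and lets the sequence be tested without a cluster.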

🗂️ Structure

sre-mcp-server/
├── server.py              # Main MCP server
├── cluster_manager.py     # Multi-cluster context management
├── policy.py              # Write operation guardrails
├── audit.py               # Structured audit trail
├── runbooks.py            # Encoded SRE runbooks
├── requirements.txt
├── tools/k8s_tools.py
├── config/claude_desktop_config.example.json
├── docs/
│   ├── SETUP.md
│   └── INTERVIEW_GUIDE.md
└── .github/workflows/ci.yaml

🗺️ Roadmap

  • Prometheus MCP: SLO burn rate queries
  • PagerDuty MCP: incident acknowledgement
  • ArgoCD MCP: GitOps sync and triggers
  • Central MCP Gateway: auth + multi-team routing


📄 License

MIT. See LICENSE


Built by Manish Maurya, DevOps/SRE Leader | 16+ Years | Abu Dhabi, UAE
Website: https://manishmaurya22.github.io/
