PilotOps MCP

by muhammedehab35


✈️ PilotOps MCP

AI-powered Incident Response Autopilot for DevOps & SRE teams

Prometheus Grafana Loki PagerDuty Slack Docker

Connect Claude AI to your entire monitoring stack and respond to incidents in natural language — no more jumping between 5 different tools at 3am.

The Problem

When an incident fires at 3am, an SRE must manually:

Step	Tool	Time
Check alerts	Prometheus	2 min
Analyze metrics	Grafana	5 min
Search logs	Loki / ELK	10 min
Diagnose root cause	Brain	15 min
Write runbook	Notion / Confluence	10 min
Page on-call	PagerDuty	2 min
Notify team	Slack	2 min
Total	7 tools	~46 min

The Solution

With PilotOps MCP, you just tell Claude:

"There's an alert on prod, investigate and generate a runbook"

And Claude handles everything in under 2 minutes.

How It Works

┌─────────────────────────────────────────────────────────────┐
│                        You (Claude Desktop)                  │
│  "Investigate the active alert on prod-server-01"           │
└────────────────────────┬────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────────┐
│                    PilotOps MCP Server                       │
│                                                              │
│  1. prometheus_get_active_alerts()                          │
│     → CPU 95% on prod-server-01 for 10 min                  │
│                                                              │
│  2. prometheus_get_metrics("node_cpu...")                    │
│     → Spike started at 22:15, still climbing                │
│                                                              │
│  3. loki_get_logs('{host="prod-server-01"}')                │
│     → 847 errors: "OOM Killer activated"                    │
│                                                              │
│  4. analyze_incident(alerts, metrics, logs)                  │
│     → P1 | Memory leak in payments-api | Confidence: HIGH   │
│                                                              │
│  5. generate_runbook("memory_leak", "P1")                   │
│     → 4-phase runbook generated                             │
│                                                              │
│  6. pagerduty_create_incident("P1: Memory leak")            │
│     → On-call engineer paged                                │
│                                                              │
│  7. slack_notify("#incidents", severity="critical")          │
│     → Team notified with communication template             │
│                                                              │
│  8. grafana_create_annotation("[P1 START] 22:15")           │
│     → Incident marked on all dashboards                     │
└─────────────────────────────────────────────────────────────┘
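
The flow in the diagram can be read as a plain sequential pipeline in which each step feeds the next. The sketch below is illustrative only: the step names mirror the tool names in the diagram, but the `investigate` helper, the stub tools, and the shape of the analysis result are assumptions, not the project's actual code. The PromQL expression is passed in by the caller because the diagram elides it.

```python
from typing import Any, Callable, Dict, List


def investigate(tools: Dict[str, Callable[..., Any]], host: str, promql: str) -> Dict[str, Any]:
    """Run the eight steps from the diagram in order (hypothetical helper)."""
    alerts = tools["prometheus_get_active_alerts"]()
    metrics = tools["prometheus_get_metrics"](promql)
    logs = tools["loki_get_logs"](f'{{host="{host}"}}')
    # Assumed result shape: {"type": ..., "severity": ...}
    analysis = tools["analyze_incident"](alerts, metrics, logs)
    runbook = tools["generate_runbook"](analysis["type"], analysis["severity"])
    title = f'{analysis["severity"]}: {analysis["type"]}'
    tools["pagerduty_create_incident"](title)
    tools["slack_notify"]("#incidents", title)
    tools["grafana_create_annotation"](f'[{analysis["severity"]} START]')
    return {"analysis": analysis, "runbook": runbook}


# Stub tools that only record the call order, so the sketch runs offline.
calls: List[str] = []


def stub(name: str, result: Any = None) -> Callable[..., Any]:
    def tool(*args: Any, **kwargs: Any) -> Any:
        calls.append(name)
        return result
    return tool


names = (
    "prometheus_get_active_alerts", "prometheus_get_metrics", "loki_get_logs",
    "analyze_incident", "generate_runbook", "pagerduty_create_incident",
    "slack_notify", "grafana_create_annotation",
)
tools = {name: stub(name) for name in names}
tools["analyze_incident"] = stub("analyze_incident", {"type": "memory_leak", "severity": "P1"})
tools["generate_runbook"] = stub("generate_runbook", "4-phase runbook")

result = investigate(tools, "prod-server-01", "up")
print(calls)
```

In practice Claude decides which tools to call and in what order, so a fixed pipeline like this is only a mental model of the typical sequence.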

Features

12 MCP Tools across 5 integrations
AI Correlation Engine — matches alerts + metrics + logs against 7 incident patterns
Auto Runbook Generator — produces 4-phase runbooks (Triage → Mitigation → Investigation → Resolution)
Slack Communication Templates — ready-to-send status updates
Full Docker Demo Stack — simulate real incidents locally with 1 command
Zero vendor lock-in — works with any Prometheus-compatible stack

Tools Reference

Prometheus

Tool	Description
`prometheus_get_active_alerts`	Fetch all firing alerts with severity, labels, and annotations
`prometheus_get_metrics`	Query any PromQL expression with time range
`prometheus_silence_alert`	Silence an alert for a specified duration
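
For orientation, these tools presumably sit on top of the Prometheus and Alertmanager HTTP APIs. The sketch below builds a range-query URL for the Prometheus `/api/v1/query_range` endpoint and a silence body for the Alertmanager `/api/v2/silences` endpoint. The helper names, default values, and the choice to match on `alertname` alone are assumptions made for illustration; the project's own implementation may differ.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict
from urllib.parse import urlencode


def range_query_url(base_url: str, promql: str, start: datetime,
                    end: datetime, step: str = "30s") -> str:
    """Build a Prometheus range-query URL (hypothetical helper)."""
    params = {
        "query": promql,
        "start": start.timestamp(),  # Unix seconds
        "end": end.timestamp(),
        "step": step,
    }
    return f"{base_url.rstrip('/')}/api/v1/query_range?{urlencode(params)}"


def silence_payload(alertname: str, duration_minutes: int, now: datetime,
                    created_by: str = "pilotops",
                    comment: str = "Silenced during incident response") -> Dict[str, Any]:
    """Build an Alertmanager v2 silence body (hypothetical helper)."""
    ends = now + timedelta(minutes=duration_minutes)
    return {
        "matchers": [{"name": "alertname", "value": alertname, "isRegex": False}],
        "startsAt": now.isoformat(),
        "endsAt": ends.isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }


t0 = datetime(2024, 1, 1, 22, 15, tzinfo=timezone.utc)
print(range_query_url("http://localhost:9090", "up", t0, t0 + timedelta(minutes=15)))
print(silence_payload("HighCPUUsage", 60, t0))
```

Note that silences are created against Alertmanager (port 9093 in the demo stack), not against Prometheus itself.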

Grafana

Tool	Description
`grafana_get_dashboards`	List and search available dashboards
`grafana_create_annotation`	Mark incident start/end on dashboards for post-mortem
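
As a rough illustration of what marking an incident involves, the sketch below prepares a request for Grafana's `POST /api/annotations` endpoint, which expects the timestamp in epoch milliseconds. The `annotation_request` helper and its parameters are hypothetical; the request is only constructed, not sent.

```python
import json
from datetime import datetime, timezone
from typing import List
from urllib.request import Request


def annotation_request(grafana_url: str, api_key: str, text: str,
                       when: datetime, tags: List[str]) -> Request:
    """Build (but do not send) an annotation request (hypothetical helper)."""
    body = {
        "time": int(when.timestamp() * 1000),  # Grafana expects epoch milliseconds
        "tags": tags,
        "text": text,
    }
    return Request(
        f"{grafana_url.rstrip('/')}/api/annotations",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


start = datetime(2024, 1, 1, 22, 15, tzinfo=timezone.utc)
req = annotation_request("http://localhost:3000", "your_grafana_api_key",
                         "[P1 START] Memory leak", start, ["incident", "P1"])
print(req.get_method(), req.full_url)
```

Because no dashboard is specified in the body, an annotation like this is organization-wide, which is one way to make it appear across dashboards.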

Loki

Tool	Description
`loki_get_logs`	Query logs via LogQL with level filtering and error detection
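
To make the LogQL part concrete, here is a small sketch that assembles a stream selector with an optional line filter and a URL for Loki's `/loki/api/v1/query_range` endpoint, which takes timestamps in Unix nanoseconds. The helper names and the use of a plain `|=` line filter for error detection are assumptions; the tool's real filtering logic is not shown in this README.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional
from urllib.parse import urlencode


def logql(selector: Dict[str, str], contains: Optional[str] = None) -> str:
    """Build a LogQL query: a label selector plus an optional line filter."""
    labels = ",".join(f'{key}="{value}"' for key, value in sorted(selector.items()))
    query = "{" + labels + "}"
    if contains:
        query += f' |= "{contains}"'
    return query


def loki_range_url(base_url: str, query: str, start: datetime,
                   end: datetime, limit: int = 100) -> str:
    """Build a Loki range-query URL (hypothetical helper)."""
    params = {
        "query": query,
        # Integer arithmetic keeps the nanosecond value exact.
        "start": int(start.timestamp()) * 1_000_000_000,
        "end": int(end.timestamp()) * 1_000_000_000,
        "limit": limit,
    }
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{urlencode(params)}"


end = datetime(2024, 1, 1, 22, 15, tzinfo=timezone.utc)
query = logql({"host": "prod-server-01"}, contains="OOM")
print(query)
print(loki_range_url("http://localhost:3100", query, end - timedelta(minutes=15), end))
```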

PagerDuty

Tool	Description
`pagerduty_get_incidents`	List open incidents by status
`pagerduty_create_incident`	Create P1-P4 incident and page on-call
`pagerduty_update_incident`	Acknowledge or resolve with timeline note
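
The sketch below shows roughly what an incident-creation request to the PagerDuty REST API (`POST https://api.pagerduty.com/incidents`) looks like. The mapping from P1-P4 to PagerDuty's `high`/`low` urgency is an assumption made for this example, as are the helper names; the README does not say how the project maps severities.

```python
from typing import Any, Dict

# Assumed mapping: PagerDuty urgency only distinguishes "high" and "low".
URGENCY_BY_SEVERITY = {"P1": "high", "P2": "high", "P3": "low", "P4": "low"}


def incident_payload(title: str, service_id: str, severity: str,
                     details: str = "") -> Dict[str, Any]:
    """Build the JSON body for creating an incident (hypothetical helper)."""
    if severity not in URGENCY_BY_SEVERITY:
        raise ValueError(f"unknown severity: {severity!r}")
    return {
        "incident": {
            "type": "incident",
            "title": title,
            "service": {"id": service_id, "type": "service_reference"},
            "urgency": URGENCY_BY_SEVERITY[severity],
            "body": {"type": "incident_body", "details": details},
        }
    }


def incident_headers(api_key: str, from_email: str) -> Dict[str, str]:
    """Headers for the request; `From` must be a valid PagerDuty user email."""
    return {
        "Authorization": f"Token token={api_key}",
        "From": from_email,
        "Content-Type": "application/json",
        "Accept": "application/vnd.pagerduty+json;version=2",
    }


payload = incident_payload("P1: Memory leak", "PXXXXXX", "P1",
                           details="OOM kills on prod-server-01")
print(payload["incident"]["urgency"])
```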

Slack

Tool	Description
`slack_notify`	Send color-coded alert with severity emoji
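
A color-coded message of this kind can be expressed as a `chat.postMessage` body with a colored attachment. The following is a minimal sketch, assuming a three-level severity scale; the specific colors, emoji shortcodes, and the `slack_message` helper are illustrative choices, not the project's actual mapping.

```python
from typing import Any, Dict, Tuple

# Assumed severity styling: (attachment color, emoji shortcode).
STYLE_BY_SEVERITY: Dict[str, Tuple[str, str]] = {
    "critical": ("danger", ":rotating_light:"),
    "warning": ("warning", ":warning:"),
    "info": ("good", ":information_source:"),
}


def slack_message(channel: str, severity: str, text: str) -> Dict[str, Any]:
    """Build a chat.postMessage body with a colored attachment (hypothetical helper)."""
    color, emoji = STYLE_BY_SEVERITY.get(severity, STYLE_BY_SEVERITY["info"])
    headline = f"{emoji} [{severity.upper()}] {text}"
    return {
        "channel": channel,
        "text": headline,  # Fallback text for notifications
        "attachments": [{"color": color, "text": headline}],
    }


message = slack_message("#incidents", "critical", "Memory leak in payments-api")
print(message["text"])
```

The body would be sent as JSON to `https://slack.com/api/chat.postMessage` with an `Authorization: Bearer` header carrying the bot token.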

AI Core

Tool	Description
`analyze_incident`	Correlates alerts + metrics + logs → root cause + confidence
`generate_runbook`	Generates structured 4-phase runbook with Slack template

Supported Incident Types

Type	Trigger	Pattern
`memory_leak`	OOM kills, heap growth	Memory > 85% + OOM logs
`high_cpu`	CPU saturation	CPU > 80% sustained
`disk_full`	Disk space exhaustion	No space left errors
`network_issue`	Connectivity problems	Timeouts + packet loss
`database_issue`	DB overload / deadlocks	Slow queries + connection pool
`service_crash`	App crash / restart loop	Segfault + panic logs
`deployment_issue`	Failed K8s rollout	CrashLoopBackOff + ImagePull
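
To illustrate how the patterns in this table could be matched, here is a minimal rule-based sketch covering three of the seven types. The thresholds come from the table; the `classify` function, its inputs, and the order in which rules are checked are assumptions. It also ignores the "sustained" qualifier for CPU, which would need a time window.

```python
from typing import List, Optional


def classify(cpu_pct: float, mem_pct: float, log_lines: List[str]) -> Optional[str]:
    """Return an incident type for the given signals, or None if nothing matches."""
    text = "\n".join(log_lines).lower()
    # memory_leak: Memory > 85% + OOM logs
    if mem_pct > 85 and "oom" in text:
        return "memory_leak"
    # disk_full: "No space left" errors
    if "no space left" in text:
        return "disk_full"
    # high_cpu: CPU > 80% (a single sample here, not a sustained window)
    if cpu_pct > 80:
        return "high_cpu"
    return None


print(classify(cpu_pct=95, mem_pct=91, log_lines=["OOM Killer activated"]))
```

Checking the more specific patterns first matters: an OOM incident often drives CPU up as well, so testing CPU first would mislabel it.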

Tech Stack

Language    : Python 3.11+
MCP Server  : FastMCP (from the official MCP Python SDK)
Metrics     : Prometheus + Alertmanager
Dashboards  : Grafana
Logs        : Loki + Promtail
Incidents   : PagerDuty
Alerts      : Slack
Containers  : Docker + Docker Compose

Quick Start

Prerequisites

Python 3.11+
Docker & Docker Compose
Claude Desktop

1. Clone & install

git clone https://github.com/muhammedehab35/PILOT_OPS-MCP.git
cd PILOT_OPS-MCP
pip install -r requirements.txt

2. Configure

cp .env.example .env

# Minimum required for local demo
PROMETHEUS_URL=http://localhost:9090
GRAFANA_URL=http://localhost:3000
GRAFANA_API_KEY=your_grafana_api_key
LOKI_URL=http://localhost:3100

# Optional: for full incident workflow
PAGERDUTY_API_KEY=your_pagerduty_key
PAGERDUTY_SERVICE_ID=PXXXXXX
SLACK_BOT_TOKEN=xoxb-your-slack-token
SLACK_DEFAULT_CHANNEL=#incidents
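
The project structure lists `config.py` as a Pydantic settings module. As a dependency-free stand-in, the sketch below shows the same idea with the standard library: required variables fail fast, optional ones fall back to defaults. The `Settings` and `load_settings` names are hypothetical, and defaulting the Slack channel to `#incidents` is an assumption taken from the example values above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

REQUIRED = ("PROMETHEUS_URL", "GRAFANA_URL", "GRAFANA_API_KEY", "LOKI_URL")


@dataclass(frozen=True)
class Settings:
    prometheus_url: str
    grafana_url: str
    grafana_api_key: str
    loki_url: str
    pagerduty_api_key: Optional[str] = None
    pagerduty_service_id: Optional[str] = None
    slack_bot_token: Optional[str] = None
    slack_default_channel: str = "#incidents"


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from an environment mapping, failing fast on missing keys."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(f"missing required settings: {', '.join(missing)}")
    return Settings(
        prometheus_url=env["PROMETHEUS_URL"],
        grafana_url=env["GRAFANA_URL"],
        grafana_api_key=env["GRAFANA_API_KEY"],
        loki_url=env["LOKI_URL"],
        pagerduty_api_key=env.get("PAGERDUTY_API_KEY"),
        pagerduty_service_id=env.get("PAGERDUTY_SERVICE_ID"),
        slack_bot_token=env.get("SLACK_BOT_TOKEN"),
        slack_default_channel=env.get("SLACK_DEFAULT_CHANNEL", "#incidents"),
    )


demo = load_settings({
    "PROMETHEUS_URL": "http://localhost:9090",
    "GRAFANA_URL": "http://localhost:3000",
    "GRAFANA_API_KEY": "your_grafana_api_key",
    "LOKI_URL": "http://localhost:3100",
})
print(demo.slack_default_channel)
```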

3. Launch the full demo stack

cd docker
docker-compose up -d

Service	URL	Credentials
Demo App	http://localhost:8080	—
Prometheus	http://localhost:9090	—
Alertmanager	http://localhost:9093	—
Grafana	http://localhost:3000	admin / admin123
Loki	http://localhost:3100	—

4. Trigger a real incident

# CPU spike → fires HighCPUUsage alert after 30s
curl -X POST http://localhost:8080/simulate/cpu-spike

# Memory leak → fires HighMemoryUsage alert after 30s
curl -X POST http://localhost:8080/simulate/memory-leak

# High error rate → fires HighErrorRate alert after 30s
curl -X POST http://localhost:8080/simulate/high-errors

# Slow responses → fires SlowResponseTime alert after 30s
curl -X POST http://localhost:8080/simulate/slow-response

# Reset all incidents
curl -X POST http://localhost:8080/simulate/reset
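
If you prefer scripting the demo from Python instead of curl, a minimal sketch follows. The endpoint paths are the ones listed above; the `simulate_request` and `trigger` helpers are hypothetical, and `trigger` needs the demo stack to be running.

```python
from urllib.request import Request, urlopen

SCENARIOS = ("cpu-spike", "memory-leak", "high-errors", "slow-response", "reset")


def simulate_request(scenario: str, base_url: str = "http://localhost:8080") -> Request:
    """Build the POST request for one of the demo app's /simulate endpoints."""
    if scenario not in SCENARIOS:
        raise ValueError(f"unknown scenario: {scenario!r}")
    return Request(f"{base_url}/simulate/{scenario}", method="POST")


def trigger(scenario: str) -> int:
    """Send the request and return the HTTP status (requires the demo stack)."""
    with urlopen(simulate_request(scenario), timeout=5) as response:
        return response.status


req = simulate_request("memory-leak")
print(req.get_method(), req.full_url)
```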

5. Connect to Claude Desktop

Add the following to %APPDATA%\Claude\claude_desktop_config.json (Windows) or ~/Library/Application Support/Claude/claude_desktop_config.json (macOS):

{
  "mcpServers": {
    "pilotops": {
      "command": "python",
      "args": ["/full/path/to/PILOT_OPS-MCP/server.py"],
      "env": {
        "PROMETHEUS_URL": "http://localhost:9090",
        "GRAFANA_URL": "http://localhost:3000",
        "GRAFANA_API_KEY": "your_key",
        "LOKI_URL": "http://localhost:3100",
        "PAGERDUTY_API_KEY": "your_key",
        "SLACK_BOT_TOKEN": "your_token"
      }
    }
  }
}
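
If the config file already defines other MCP servers, the `pilotops` entry has to be merged in rather than pasted over them. A minimal sketch of that merge, assuming a hypothetical `add_pilotops` helper and a made-up existing server named `other`:

```python
import json
from typing import Any, Dict


def add_pilotops(config: Dict[str, Any], server_path: str,
                 env: Dict[str, str]) -> Dict[str, Any]:
    """Return a copy of the config with a 'pilotops' server entry added or replaced."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers["pilotops"] = {"command": "python", "args": [server_path], "env": env}
    merged["mcpServers"] = servers
    return merged


existing = {"mcpServers": {"other": {"command": "node", "args": ["other.js"]}}}
updated = add_pilotops(
    existing,
    "/full/path/to/PILOT_OPS-MCP/server.py",
    {"PROMETHEUS_URL": "http://localhost:9090", "LOKI_URL": "http://localhost:3100"},
)
print(json.dumps(updated, indent=2))
```

To apply it for real, load the JSON from the config path for your platform, pass it through the helper, and write the result back.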

Restart Claude Desktop → look for the 🔨 hammer icon in the chat bar.

6. Run your first incident response

You:     "There's an active alert on prod, investigate and generate a runbook"

Claude:  → Fetching active alerts from Prometheus...
         → Querying CPU and memory metrics...
         → Pulling last 15 minutes of error logs from Loki...
         → Analyzing correlation...
         → [P1] Memory leak detected in payments-api (confidence: HIGH)
         → Generating runbook...
         → Creating PagerDuty incident #42...
         → Notifying #incidents on Slack...
         ✅ Full incident response completed in 45 seconds.

Project Structure

PILOT_OPS-MCP/
├── server.py                    # FastMCP server — registers all 12 tools
├── config.py                    # Pydantic settings — loads from .env
├── requirements.txt
├── .env.example
│
├── tools/                       # One file per integration
│   ├── prometheus.py            # get_alerts, get_metrics, silence
│   ├── grafana.py               # dashboards, annotations
│   ├── loki.py                  # log queries via LogQL
│   ├── pagerduty.py             # create / update incidents
│   └── slack.py                 # team notifications
│
├── core/                        # AI intelligence layer
│   ├── correlator.py            # Pattern-matching correlation engine
│   └── runbook.py               # 4-phase runbook generator (7 types)
│
└── docker/                      # Full local demo environment
    ├── docker-compose.yml
    ├── demo-app/                # Flask app — simulates real incidents
    │   ├── app.py               # /simulate/* endpoints + Prometheus metrics
    │   ├── Dockerfile
    │   └── requirements.txt
    ├── prometheus/
    │   ├── prometheus.yml       # Scrape config
    │   └── alerts.yml           # 5 alert rules
    ├── grafana/
    │   ├── provisioning/        # Auto-configured datasources
    │   └── dashboards/          # Pre-built infrastructure dashboard
    ├── loki/loki-config.yml
    ├── promtail/promtail-config.yml
    └── alertmanager/alertmanager.yml

Example Runbook Output

📋 RUNBOOK: Memory Leak / OOM Incident
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Severity : P1  |  SLA: 15 minutes
Services : payments-api
Hosts    : prod-server-01

PHASE 1 — TRIAGE
  1. Confirm memory usage: free -h or Grafana memory dashboard
  2. Identify top memory consumers: ps aux --sort=-%mem | head -20
  3. Check OOM kills: dmesg | grep -i 'oom'

PHASE 2 — MITIGATION
  1. Restart the affected service to free memory immediately
  2. Enable memory limits (K8s: resources.limits.memory)
  3. Set up swap if not present

PHASE 3 — INVESTIGATION
  1. Collect heap dump (JVM: jmap, Go: pprof)
  2. Review recent code changes for memory regressions
  3. Check GC logs for anomalies

PHASE 4 — RESOLUTION
  1. Deploy fix or roll back the problematic version
  2. Verify memory returns to baseline
  3. Resolve PagerDuty + post-mortem

💬 SLACK TEMPLATE:
  [P1 INCIDENT] Memory Leak / OOM
  • Affected: payments-api
  • Hosts: prod-server-01
  • Status: Investigating
  • SLA: Resolve within 15 minutes
  • Next update: In 15 minutes
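
The Slack template at the end of the runbook is simple enough to reproduce as a formatting function. The sketch below renders the same lines from structured fields; the `slack_template` name and signature are assumptions, and tying the next-update interval to the SLA is inferred from this one example only.

```python
from typing import List


def slack_template(severity: str, title: str, services: List[str],
                   hosts: List[str], sla_minutes: int,
                   status: str = "Investigating") -> str:
    """Render a status update in the style of the example runbook output."""
    return "\n".join([
        f"[{severity} INCIDENT] {title}",
        f"• Affected: {', '.join(services)}",
        f"• Hosts: {', '.join(hosts)}",
        f"• Status: {status}",
        f"• SLA: Resolve within {sla_minutes} minutes",
        f"• Next update: In {sla_minutes} minutes",
    ])


update = slack_template("P1", "Memory Leak / OOM", ["payments-api"],
                        ["prod-server-01"], sla_minutes=15)
print(update)
```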

Contributing

Contributions are welcome! Ideas for new integrations:

OpsGenie support
Datadog metrics
Kubernetes events via kubectl
Jira ticket creation
Email notifications

Author

Ehab Muhammed, DevOps Engineer
GitHub: @muhammedehab35

License
