Kubectl MCP Tool


SKILL.md•4.9 KiB

---
name: k8s-incident
description: Respond to Kubernetes incidents with runbooks and diagnostics. Use for outages, pod failures, node issues, network problems, and emergency response.
license: Apache-2.0
metadata:
  author: rohitg00
  version: "1.0.0"
  tools: 15
  category: observability
---

# Kubernetes Incident Response

Runbooks and diagnostic workflows for common Kubernetes incidents.

## When to Apply

Use this skill when:

- User mentions: "incident", "outage", "emergency", "down", "not working"
- Operations: emergency response, production issues, service degradation
- Keywords: "urgent", "broken", "fix", "restore", "recover"

## Priority Rules

| Priority | Rule | Impact | Tools |
|----------|------|--------|-------|
| 1 | Check control plane first | CRITICAL | `get_pods(namespace="kube-system")` |
| 2 | Assess node health | CRITICAL | `get_nodes` |
| 3 | Gather events before changes | HIGH | `get_events` |
| 4 | Document timeline | HIGH | Manual notes |
| 5 | Rollback if safe | MEDIUM | `rollback_deployment` |

## Quick Reference

| Incident | First Tool | Next Steps |
|----------|------------|------------|
| Pod failure | `get_pod_logs(previous=True)` | `describe_pod`, `get_events` |
| Node down | `describe_node` | Check kubelet logs |
| Service unreachable | `get_endpoints` | `get_network_policies` |
| Control plane | `get_pods(namespace="kube-system")` | Check API server logs |

## Incident Triage

### Quick Health Check

```python
get_nodes()
get_pods(namespace="kube-system")
get_events(namespace)
```

### Severity Assessment

| Indicator | Severity | Action |
|-----------|----------|--------|
| Multiple nodes NotReady | Critical | Escalate immediately |
| kube-system pods failing | Critical | Control plane issue |
| Single pod CrashLoop | Medium | Debug pod |
| High latency | Medium | Check resources |

## Runbook: Pod Failures

### CrashLoopBackOff

```python
get_pod_logs(name, namespace, previous=True)
describe_pod(name, namespace)
get_events(namespace, field_selector="involvedObject.name=<pod>")
get_pod_metrics(name, namespace)
```

**Common Causes:**

- OOMKilled → Increase memory limits
- Exit code 1 → Application error in logs
- Exit code 137 → Killed by OOM or SIGKILL
- Exit code 143 → Graceful SIGTERM

### ImagePullBackOff

```python
describe_pod(name, namespace)
get_secrets(namespace)
```

### Pending Pod

```python
describe_pod(name, namespace)
get_nodes()
get_events(namespace)
```

## Runbook: Node Issues

### Node NotReady

```python
describe_node(name)
get_events(namespace="", field_selector="involvedObject.name=<node>")
node_logs_tool(name, "kubelet")
```

### Node DiskPressure

```python
describe_node(name)
get_pods(field_selector="spec.nodeName=<node>")
```

## Runbook: Network Issues

### Service Not Accessible

```python
get_services(namespace)
get_endpoints(namespace)
get_pods(namespace, label_selector="<service-selector>")
get_network_policies(namespace)
```

### DNS Resolution Failures

```python
get_pods(namespace="kube-system", label_selector="k8s-app=kube-dns")
get_pod_logs("coredns-xxx", "kube-system")
```

### With Cilium

```python
cilium_status_tool()
cilium_endpoints_list_tool(namespace)
hubble_flows_query_tool(namespace)
```

### With Istio

```python
istio_analyze_tool(namespace)
istio_proxy_status_tool()
```

## Runbook: Storage Issues

### PVC Pending

```python
describe_pvc(name, namespace)
get_storage_classes()
get_events(namespace)
```

### Pod Stuck in ContainerCreating

```python
describe_pod(name, namespace)
get_pvc(namespace)
get_events(namespace)
```

## Runbook: Control Plane Issues

### API Server Unavailable

```python
get_pods(namespace="kube-system", label_selector="component=kube-apiserver")
get_events(namespace="kube-system")
```

### etcd Issues

```python
get_pods(namespace="kube-system", label_selector="component=etcd")
get_pod_logs("etcd-xxx", "kube-system")
```

## Emergency Actions

### Force Delete Pod

```python
delete_pod(name, namespace, grace_period=0, force=True)
```

### Rollback Deployment

```python
rollback_deployment(name, namespace, revision=0)
```

### Helm Rollback

```python
rollback_helm_release(name, namespace, revision=1)
```

## Diagnostic Collection Script

For comprehensive incident diagnostics, see [scripts/collect-diagnostics.py](scripts/collect-diagnostics.py).

## Multi-Cluster Incident Response

Check all clusters:

```python
for context in ["prod-1", "prod-2", "staging"]:
    get_nodes(context=context)
    get_pods(namespace="kube-system", context=context)
    get_events(namespace="kube-system", context=context)
```

## Post-Incident

### Document Timeline

1. When did the incident start?
2. What was the impact?
3. What was the root cause?
4. What fixed it?

### Prevent Recurrence

- Add monitoring/alerting
- Improve resource limits
- Add readiness probes
- Document runbook

## Related Skills

- [k8s-troubleshoot](../k8s-troubleshoot/SKILL.md) - Detailed debugging
- [k8s-security](../k8s-security/SKILL.md) - Security incidents
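
## Example: Decoding Exit Codes

The exit codes listed under Common Causes in the CrashLoopBackOff runbook follow the usual Unix convention: a value above 128 generally means the process was terminated by signal number `code - 128`, so 137 is 128 + 9 (SIGKILL) and 143 is 128 + 15 (SIGTERM). The sketch below illustrates that arithmetic. `explain_exit_code` and `SIGNAL_NAMES` are hypothetical names used for illustration only; they are not tools provided by this server, and the signal table is deliberately partial.

```python
# Hypothetical helper (not part of kubectl-mcp-server): interpret a container exit code.
# Partial table of common Linux signal numbers; extend as needed.
SIGNAL_NAMES = {2: "SIGINT", 6: "SIGABRT", 9: "SIGKILL", 11: "SIGSEGV", 15: "SIGTERM"}


def explain_exit_code(code: int) -> str:
    """Return a short, human-readable reading of a container exit code."""
    if code == 0:
        return "exited cleanly"
    if code > 128:
        # Codes above 128 conventionally encode 128 + signal number.
        signum = code - 128
        name = SIGNAL_NAMES.get(signum, f"signal {signum}")
        return f"terminated by {name}"
    # Anything else is treated here as an application-level failure.
    return "application error (check logs)"


for code in (1, 137, 143):
    print(code, explain_exit_code(code))
```

An exit code of 137 alone does not distinguish an OOM kill from a manual SIGKILL; the `describe_pod` output (for example a termination reason of OOMKilled) is still needed to tell them apart.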
