Enables GPU diagnostics and monitoring through tools that track GPU utilization, run nvidia-smi inside GPU-enabled pods for real-time process and memory details, and check for GPU hardware errors (XID) and throttling events across the cluster.
Provides monitoring tools that query Prometheus metrics via OpenShift route for real-time cluster observability, including resource balance analysis, pod restart detection, GPU utilization tracking, and vLLM inference server performance metrics.
Click on "Install Server".
Wait a few minutes for the server to deploy. Once ready, it will show a "Started" state.
In the chat, type
@followed by the MCP server name and your instructions, e.g., "@OpenShift MCP Servercheck GPU utilization in the cluster"
That's it! The server will respond to your query, and you can continue using it as needed.
Here is a step-by-step guide with screenshots.
OpenShift MCP Server
A Model Context Protocol (MCP) server for OpenShift diagnostics and troubleshooting.
Features
Storage Tools
Storage Analysis:
get_cluster_storage_report- comprehensive report of ephemeral storage usage on nodes, including top pod consumers.Deep Forensics:
inspect_node_storage_forensics- deep analysis of disk usage on a specific node, checking for unused images and container writable layers.PV Capacity:
check_persistent_volume_capacity- monitor persistent volume usage across namespaces with configurable thresholds.
Monitoring Tools
Resource Balance:
get_cluster_resource_balance- analyze CPU and memory resource distribution across nodes.Pod Restarts:
detect_pod_restarts_anomalies- identify pods with excessive restart counts within a time window.GPU Utilization:
get_gpu_utilization- track GPU usage and identify idle GPU resources.Inspect GPU Pod:
inspect_gpu_pod- runnvidia-smiinside a GPU-enabled pod to view real-time process and memory details.Check GPU Health:
check_gpu_health- check for GPU hardware errors (XID) and throttling events across the cluster.vLLM Metrics:
get_vllm_metrics- monitor vLLM inference server performance metrics (throughput, queue size, cache usage).
All monitoring tools use Prometheus metrics via OpenShift route for real-time cluster observability.
Pod Diagnostics Tools
Pod Logs:
get_pod_logs- retrieve and analyze logs from a specific pod, with support for previous container logs, tail limits, and time-based filtering.Pod Diagnostics:
get_pod_diagnostics- comprehensive health check of a pod including status, conditions, container states, restart counts, and issue detection.
Installation
# Using uv (recommended)
uv tool install .
# Or pip
pip install .Configuration
This server relies on the oc command line tool.
Ensure
ocis installed and in your PATH.Ensure you are authenticated (
oc login ...) to your target cluster before running the server.
MCP Client Configuration
Configure the MCP server in your Claude Desktop or Gemini CLI settings:
{
"mcpServers": {
"openshift-tools": {
"command": "uv",
"args": ["run", "openshift-mcp-server"]
}
}
}Example Usage
Once configured, you can ask questions like:
"Give me a summary of storage usage for all nodes"
"Check GPU utilization in the cluster"
"Diagnose pod health for my-app-pod in production namespace"
Simulated Tool Output (get_cluster_storage_report):
# Storage Usage Report (3 nodes)
### Node: master0.example.com
- **Filesystem**: Used: 36.70 Gi | Capacity: 99.44 Gi | Available: 62.74 Gi
- **Image FS**: Used: 34.17 Gi
- **Total Pod Ephemeral Storage**: 5.19 Gi
**Top Pod Consumers:**
- 2.60 Gi: `openshift-marketplace/redhat-operators-gb8ff`
- 974.96 Mi: `openshift-marketplace/community-operators-fq744`Simulated Tool Output (get_gpu_utilization):
### GPU Utilization Report
**Total GPUs Found:** 4
| Node | GPU | Utilization | Memory Used | Status |
|------|-----|-------------|-------------|--------|
| `host-a:9400` | 0 | **0.0%** | 0.0% | ⚠️ Idle |
| `host-a:9400` | 1 | **85.2%** | 92.3% | ✅ Active |Simulated Tool Output (detect_pod_restarts_anomalies):
### Pod Restart Anomalies (>5 in last 1h)
| Namespace | Pod | Restarts |
|-----------|-----|----------|
| `ns-1` | `pod-a-7b666bd598-cvrlk` | **34** |
| `ns-2` | `pod-b-6dcf7d7bb8-dw8sg` | **16** |
#### 📋 Recommendations
1. **Check Logs**: `oc logs <pod> -n <namespace> --previous`
2. **Check Events**: `oc get events -n <namespace>`Development
# Run locally
uv run openshift-mcp-serverTesting the server directly
When run directly, the server expects JSON-RPC messages on standard input. You can verify the registered tools by simulating a full client handshake (Initialize -> Initialized -> Tools/List):
(echo '{"jsonrpc": "2.0", "id": 1, "method": "initialize", "params": {"protocolVersion": "2024-11-05", "capabilities": {}, "clientInfo": {"name": "test-client", "version": "1.0"}}}'; sleep 0.5; echo '{"jsonrpc": "2.0", "method": "notifications/initialized"}'; sleep 0.5; echo '{"jsonrpc": "2.0", "id": 2, "method": "tools/list"}') | uv run openshift-mcp-serverExpected output (truncated for brevity):
{"jsonrpc":"2.0","id":1,"result":{...}}
{"jsonrpc":"2.0","id":2,"result":{"tools":[{"name":"get_cluster_storage_report",...},{"name":"inspect_node_storage_forensics",...},...]}}Resources
Unclaimed servers have limited discoverability.
Looking for Admin?
If you are the server author, to access and configure the admin panel.