OpenShift MCP Server

by junzzhu

Server Configuration

Describes the environment variables required to run the server.

Name | Required | Description | Default

No environment variables are required.

Capabilities

Server capabilities have not been inspected yet.

Tools

Functions exposed to the LLM to take actions

Name | Description
get_cluster_storage_report
Use this FIRST. Fast, high-level summary of storage usage for all nodes or a specific node. Checks quotas and reported usage from the Kubelet API. If 'node' is provided, analyzes only that node. If 'node' is not provided, analyzes all worker nodes.
inspect_node_storage_forensics
SLOW operation (10s+). Performs a deep forensic analysis of a node's storage. Use this tool ONLY when: (1) a specific node is known to be problematic (e.g., full disk), and (2) 'get_cluster_storage_report' does not reveal the root cause. This tool runs a debug pod on the node to calculate: real disk usage (df -h); reclaimable space from UNUSED images; and growth of container writable layers (indicating log/file issues inside containers).
check_persistent_volume_capacity
Monitor Persistent Volume Claim (PVC) capacity usage across the cluster. Critical for preventing database crashes and data loss due to full disks. This checks PVCs (persistent storage), which is distinct from ephemeral storage. Args: namespace: optional namespace to filter PVCs (if None, checks all namespaces); threshold: alert threshold percentage (default: 85%; PVCs above this are flagged). Returns: formatted report of PVC usage with warnings for volumes exceeding the threshold.
get_cluster_resource_balance
Analyze cluster resource balance, focusing on request-vs-usage gaps. Why essential: scheduling bottleneck diagnosis (explains why pods are Pending despite apparent capacity); resource fragmentation detection (identifies request-vs-usage gaps); cost efficiency (reveals over-provisioned nodes). Returns: Markdown table showing CPU/memory requests vs. usage per node.
detect_pod_restarts_anomalies
Identify unstable pods experiencing high restart rates. Why: application stability indicator (high restart rates signal code issues such as OOM kills, panics, or misconfigurations); proactive detection (catches intermittent failures before they become incidents); actionable (points directly to problematic workloads). Args: threshold: minimum number of restarts to flag (default: 5); duration: window of time to analyze (e.g., '1h', '24h', '10m'). Returns: Markdown report of unstable pods.
get_gpu_utilization
Monitor GPU usage and health across the cluster. Why: cost efficiency (GPUs are expensive; low utilization indicates wasted money); resource optimization (identifies idle GPUs that could be deallocated); hardware health (high error rates indicate hardware issues). Prerequisites: NVIDIA GPU Operator installed (exports DCGM metrics). Returns: Markdown report of GPU utilization per node.
get_pod_logs
Retrieve container logs from a pod. Args: namespace: pod namespace; pod_name: pod name; container: specific container name (if None, gets all containers); previous: get logs from the previous container instance; tail: number of recent lines to retrieve (default: 100); since: time duration to retrieve logs from (e.g., "1h", "30m"). Returns: formatted logs with container separation.
get_pod_diagnostics
Comprehensive pod health analysis with events, status, and actionable recommendations. Args: namespace: pod namespace; pod_name: pod name. Returns: detailed diagnostic report with recommendations.
inspect_gpu_pod
Run 'nvidia-smi' inside a GPU-enabled pod to view real-time process and memory details. Why: debug OOM (see exact memory usage per process); verify allocation (confirm the pod actually sees the GPU); check processes (identify zombie processes or unexpected workloads). Args: namespace: pod namespace; pod_name: pod name. Returns: output of nvidia-smi from inside the pod.
check_gpu_health
Check for GPU hardware errors (XID) and throttling events across the cluster. Why: detect hardware failures (XID errors often indicate physical GPU faults); explain performance issues (thermal or power throttling explains why a model is slow). Returns: Markdown report of GPU health issues.
get_vllm_metrics
Monitor vLLM inference server performance metrics by directly querying pods. Why: performance monitoring (track request latency and throughput); capacity planning (monitor queue size and running requests); resource optimization (track GPU cache usage); proactive alerting (detect performance degradation). Args: namespace: optional namespace filter; pod_filter: optional pod name filter (supports partial match). Returns: Markdown report of vLLM metrics.
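
For orientation, here is a minimal client-side sketch (not part of this project) showing how the tools above could be invoked over stdio with the official MCP Python SDK. The launch command and script name are placeholders; the tool name and optional 'node' argument follow the get_cluster_storage_report description above.

# Minimal sketch: calling a tool on this server via the MCP Python SDK (stdio transport).
# The launch command below is a placeholder; substitute the server's real entry point.
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

server_params = StdioServerParameters(
    command="python",                      # placeholder launch command
    args=["openshift_mcp_server.py"],      # placeholder entry point
)

async def main() -> None:
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Discover the tools listed in the table above.
            tools = await session.list_tools()
            print([t.name for t in tools.tools])

            # Fast, high-level storage summary for a single node.
            result = await session.call_tool(
                "get_cluster_storage_report",
                arguments={"node": "worker-0"},  # 'node' is optional per the description above
            )
            print(result.content)

asyncio.run(main())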

Prompts

Interactive templates invoked by user choice

Name | Description

No prompts

Resources

Contextual data attached and managed by the client

Name | Description

No resources

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/junzzhu/openshift-mcp-server'
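
Equivalently, a small Python sketch using only the standard library; the response schema is not documented here, so it simply prints whatever JSON the API returns.

# Fetch this server's directory entry from the Glama MCP API and print the raw JSON.
import json
import urllib.request

URL = "https://glama.ai/api/mcp/v1/servers/junzzhu/openshift-mcp-server"

with urllib.request.urlopen(URL) as resp:
    data = json.load(resp)

print(json.dumps(data, indent=2))  # response schema is not documented here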

If you have feedback or need assistance with the MCP directory API, please join our Discord server.