Red Hat OpenShift AI (RHOAI) Observability MCP

Requires Python 3.11+. License: MIT.

An MCP (Model Context Protocol) server that gives AI assistants direct access to Red Hat OpenShift AI observability data. Query Prometheus metrics, Alertmanager alerts, Loki logs, Tempo traces, Grafana dashboards, and Kubernetes cluster state to troubleshoot vLLM inference workloads.

Features

  • 21 tools across 7 categories for comprehensive observability

  • vLLM-aware metrics (TTFT, TPOT, E2E latency, KV cache, queue depth)

  • Composite investigation tools that correlate metrics, logs, and alerts automatically

  • Auto-detection of in-cluster vs external access to OpenShift services

  • Built on FastMCP with async backends via httpx
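The auto-detection feature listed above can be illustrated with a minimal sketch. This is not the project's actual implementation: the function names `detect_access_mode` and `resolve_token` are hypothetical. The sketch relies only on two standard Kubernetes behaviours, namely that pods receive a `KUBERNETES_SERVICE_HOST` environment variable and a mounted service account token.

```python
import os
from pathlib import Path

# Standard location of the service account token that Kubernetes mounts into pods.
SA_TOKEN_PATH = Path("/var/run/secrets/kubernetes.io/serviceaccount/token")


def detect_access_mode(env=None, token_path=SA_TOKEN_PATH):
    """Return "in-cluster" when running inside a pod, otherwise "external".

    Hypothetical helper for illustration; the real server may use other signals.
    """
    env = os.environ if env is None else env
    # Kubernetes injects KUBERNETES_SERVICE_HOST into every pod's environment.
    if env.get("KUBERNETES_SERVICE_HOST") and Path(token_path).is_file():
        return "in-cluster"
    return "external"


def resolve_token(env=None, token_path=SA_TOKEN_PATH):
    """Pick a bearer token: the mounted one in-cluster, OPENSHIFT_TOKEN otherwise."""
    env = os.environ if env is None else env
    if detect_access_mode(env, token_path) == "in-cluster":
        return Path(token_path).read_text().strip()
    # Outside the cluster, fall back to the token exported in Quick Start.
    return env.get("OPENSHIFT_TOKEN")
```

Passing `env` and `token_path` explicitly keeps the detection logic testable without a running cluster.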

Architecture

graph TD
    A[Claude / AI Assistant] -->|MCP Protocol| B[rhoai-observability-mcp]
    B --> C[Thanos / Prometheus]
    B --> D[Alertmanager]
    B --> E[Loki]
    B --> F[Tempo]
    B --> G[Grafana]
    B --> H[Kubernetes / OpenShift]

Backends:

| Backend | Purpose | Source |
|---|---|---|
| Prometheus (Thanos) | Metrics queries (PromQL) | backends/prometheus.py |
| Alertmanager | Active alerts and alert groups | backends/alertmanager.py |
| Loki | Log queries (LogQL) | backends/loki.py |
| Tempo | Distributed trace queries (TraceQL) | backends/tempo.py |
| Grafana | Dashboard discovery and panel queries | backends/grafana.py |
| Kubernetes (OpenShift) | Pods, events, nodes, InferenceServices | backends/openshift.py |

Quick Start

# Clone and install
git clone https://github.com/opendatahub-io/rhoai-observability-mcp.git
cd rhoai-observability-mcp
uv pip install -e ".[dev]"

# Configure (see INSTALL.md for all options)
export THANOS_URL=https://thanos-querier.openshift-monitoring.svc:9091
export ALERTMANAGER_URL=https://alertmanager-main.openshift-monitoring.svc:9093
export OPENSHIFT_TOKEN=$(oc whoami -t)

# Run
python -m rhoai_obs_mcp.server

See INSTALL.md for detailed setup, configuration, and Claude Desktop integration.
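Before starting the server, it can help to check that the environment variables from the Quick Start are set. The sketch below is an assumption-laden illustration: `BackendConfig` and `load_config` are hypothetical names, and treating all three variables as required is a guess rather than something the project documents here.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BackendConfig:
    """Hypothetical container for the variables exported in Quick Start."""

    thanos_url: Optional[str]
    alertmanager_url: Optional[str]
    token: Optional[str]

    def missing(self):
        """Return the names of variables that are unset or empty."""
        values = {
            "THANOS_URL": self.thanos_url,
            "ALERTMANAGER_URL": self.alertmanager_url,
            "OPENSHIFT_TOKEN": self.token,
        }
        return [name for name, value in values.items() if not value]


def load_config(env=None):
    """Read the Quick Start variables from the environment (or a dict)."""
    env = os.environ if env is None else env
    return BackendConfig(
        thanos_url=env.get("THANOS_URL"),
        alertmanager_url=env.get("ALERTMANAGER_URL"),
        token=env.get("OPENSHIFT_TOKEN"),
    )
```

Calling `load_config().missing()` returns a list of unset variable names, which makes a quick preflight check before launching the server.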

Build & Deploy

Build the container image

make build

Override the image name or tag:

make build IMAGE_NAME=quay.io/myorg/rhoai-observability-mcp IMAGE_TAG=v1.0.0

Push to registry

make push

Deploy to OpenShift

Prerequisites: log in to your cluster with oc login, install kustomize, and create the target project:

oc new-project rhoai-obs-mcp

Then deploy:

make deploy

This uses Kustomize to build the OpenShift overlay (deploy/overlays/openshift/) on top of the base manifests (deploy/base/) and applies them to the rhoai-obs-mcp namespace. To deploy to a different namespace:

make deploy NAMESPACE=my-namespace

Undeploy

make undeploy

If you deployed to a custom namespace, pass the same value:

make undeploy NAMESPACE=my-namespace

CI-built images

Container images are automatically built from main and published to GHCR:

ghcr.io/opendatahub-io/rhoai-observability-mcp:latest

Local Development with Kind

Set up a local Kubernetes cluster with mock observability backends for development and testing:

# Prerequisites: kind, kubectl, helm, kustomize
make kind-up

This creates a Kind cluster, installs Prometheus + Alertmanager + Grafana via Helm, deploys a fake vLLM metrics exporter, and deploys the MCP server. Access the MCP server at http://localhost:30080.

To point at real external backends instead of the mocks:

make kind-deploy \
  THANOS_URL=https://real-cluster:9091 \
  ALERTMANAGER_URL=https://real-cluster:9093 \
  GRAFANA_URL=https://real-cluster:3000 \
  TEMPO_URL=https://real-cluster:8080

Tear down:

make kind-down

Tool Reference

Metrics

| Tool | Description |
|---|---|
| query_prometheus | Execute a raw PromQL query against Thanos Querier |
| query_prometheus_range | Execute a PromQL range query to get time-series data (trends, spikes, correlations) |
| get_vllm_metrics | Get a summary of key vLLM metrics (TTFT, TPOT, E2E, cache, queue) for a model |
| list_metrics | List available Prometheus metric names, optionally filtered by regex |
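To give a feel for what a range query over a vLLM latency metric might look like, here is a small sketch. It is illustrative only: the helper names are hypothetical, and the metric name `vllm:time_to_first_token_seconds` with a `model_name` label is an assumption based on metrics vLLM has exposed, which may differ between vLLM versions. The `/api/v1/query_range` path and its `query`, `start`, `end`, and `step` parameters come from the Prometheus HTTP API.

```python
from urllib.parse import urlencode


def ttft_quantile_promql(model, quantile=0.95, window="5m"):
    """Build a PromQL expression for a TTFT quantile of one model.

    Assumes a histogram named vllm:time_to_first_token_seconds with a
    model_name label; adjust to whatever your vLLM version exports.
    """
    selector = f'vllm:time_to_first_token_seconds_bucket{{model_name="{model}"}}'
    return f"histogram_quantile({quantile}, sum by (le) (rate({selector}[{window}])))"


def range_query_url(base_url, promql, start, end, step="60s"):
    """Build a Prometheus range-query URL (start/end are Unix timestamps)."""
    params = urlencode({"query": promql, "start": start, "end": end, "step": step})
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"
```

The resulting URL can be fetched with any HTTP client, sending the OpenShift token as a bearer token in the `Authorization` header.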

Alerts

| Tool | Description |
|---|---|
| get_alerts | Get active alerts from Alertmanager, filterable by severity and labels |
| get_alert_groups | Get alerts grouped by their routing labels |
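Filtering by severity and labels maps naturally onto the `filter` matchers of the Alertmanager v2 API. The following sketch shows one way to build such a request URL; `alerts_url` is a hypothetical helper, not the project's code, and whether the server filters on the Alertmanager side or after fetching is not stated in this README.

```python
from urllib.parse import urlencode


def alerts_url(base_url, severity=None, labels=None, active_only=True):
    """Build a GET /api/v2/alerts URL with label matchers.

    Hypothetical helper: each matcher is sent as a repeated `filter` parameter.
    """
    matchers = []
    if severity:
        matchers.append(f'severity="{severity}"')
    for name, value in (labels or {}).items():
        matchers.append(f'{name}="{value}"')

    params = [("filter", m) for m in matchers]
    if active_only:
        # Exclude silenced and inhibited alerts so only firing ones remain.
        params += [("active", "true"), ("silenced", "false"), ("inhibited", "false")]

    query = urlencode(params)
    url = f"{base_url.rstrip('/')}/api/v2/alerts"
    return f"{url}?{query}" if query else url
```

For example, `alerts_url(base, severity="critical", labels={"namespace": "demo"})` produces a URL with two `filter` parameters, one per matcher.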

Logs

| Tool | Description |
|---|---|
| query_logs | Execute a LogQL query against OpenShift LokiStack |
| get_pod_logs | Get logs for a specific pod by namespace and name |
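A pod-level log lookup can be expressed as a LogQL stream selector. The sketch below shows one possible shape, under the assumption that logs carry the `kubernetes_namespace_name` and `kubernetes_pod_name` labels commonly used by OpenShift Logging; the helper names are hypothetical and the label names may differ in your deployment.

```python
def pod_logql(namespace, pod, contains=None):
    """Build a LogQL query selecting one pod's log stream.

    Label names are an assumption about the OpenShift Logging setup.
    """
    selector = (
        f'{{kubernetes_namespace_name="{namespace}", '
        f'kubernetes_pod_name="{pod}"}}'
    )
    if contains:
        # Escape backslashes and quotes so the line filter stays valid LogQL.
        escaped = contains.replace("\\", "\\\\").replace('"', '\\"')
        return f'{selector} |= "{escaped}"'
    return selector


def loki_query_params(logql, start_ns, end_ns, limit=100):
    """Parameters for Loki's query_range endpoint (times in Unix nanoseconds)."""
    return {
        "query": logql,
        "start": str(start_ns),
        "end": str(end_ns),
        "limit": str(limit),
        "direction": "backward",  # newest entries first
    }
```

The optional `contains` argument adds a `|=` line filter, which is handy for narrowing output to lines such as `ERROR`.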

Traces

| Tool | Description |
|---|---|
| get_trace | Fetch a distributed trace by its trace ID |
| search_traces | Search for traces using TraceQL expressions |
| list_trace_tags | List available trace tag names for building TraceQL queries |
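The three trace tools correspond roughly to three Tempo HTTP endpoints. As a hedged sketch (the helper names are hypothetical, and a multi-tenant Tempo deployment on OpenShift may put a gateway prefix in front of these paths), the request URLs could be built like this:

```python
from urllib.parse import quote, urlencode


def trace_url(base_url, trace_id):
    """URL for fetching a single trace by ID."""
    return f"{base_url.rstrip('/')}/api/traces/{quote(trace_id, safe='')}"


def search_url(base_url, traceql, limit=20):
    """URL for a TraceQL search; the expression goes in the `q` parameter."""
    params = urlencode({"q": traceql, "limit": limit})
    return f"{base_url.rstrip('/')}/api/search?{params}"


def tags_url(base_url):
    """URL for listing the tag names known to Tempo."""
    return f"{base_url.rstrip('/')}/api/search/tags"
```

A TraceQL expression such as `{ resource.service.name = "vllm" && duration > 2s }` would find slow spans from a service named `vllm`; the service name here is only an example.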

Cluster

| Tool | Description |
|---|---|
| get_pods | List pods in a namespace with status, restarts, and creation time |
| get_events | List Kubernetes events, filterable by resource and reason |
| get_node_status | Get node status, capacity, and GPU allocation info |
| describe_resource | Get detailed description of a Kubernetes resource |
| get_inference_services | List KServe InferenceService resources |
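When listing InferenceServices, the most useful fields for troubleshooting are usually the readiness condition and the serving URL. The sketch below condenses a raw InferenceService object (as a plain dict, for example from the Kubernetes API) into those fields. `summarize_inference_service` is a hypothetical helper, and the tool's real output format is not described in this README.

```python
def summarize_inference_service(obj):
    """Reduce an InferenceService dict to name, namespace, readiness, and URL."""
    meta = obj.get("metadata", {})
    status = obj.get("status", {})
    # The service is ready when a condition of type "Ready" has status "True".
    ready = any(
        cond.get("type") == "Ready" and cond.get("status") == "True"
        for cond in status.get("conditions", [])
    )
    return {
        "name": meta.get("name"),
        "namespace": meta.get("namespace"),
        "ready": ready,
        "url": status.get("url"),
    }
```

An object with no `status` block, such as one that was just created, is reported as not ready with a `url` of `None`.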

Dashboards

| Tool | Description |
|---|---|
| list_dashboards | List available Grafana dashboards, filterable by tag or title |
| get_dashboard_panels | Get panels and their queries from a Grafana dashboard |
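Extracting panels and their queries amounts to walking the dashboard JSON model. The sketch below assumes the input is the `dashboard` object from Grafana's dashboard-by-UID API response and that the panels use Prometheus-style targets with an `expr` field. `extract_panel_queries` is a hypothetical name, not the project's function.

```python
def extract_panel_queries(dashboard):
    """Collect each panel's title and query expressions from a dashboard dict."""
    results = []

    def walk(panels):
        for panel in panels:
            # Collapsed rows nest their child panels under "panels".
            if panel.get("type") == "row":
                walk(panel.get("panels", []))
                continue
            exprs = [t["expr"] for t in panel.get("targets", []) if t.get("expr")]
            results.append({"title": panel.get("title", ""), "queries": exprs})

    walk(dashboard.get("panels", []))
    return results
```

Row panels themselves are skipped, so the result contains only panels that can carry queries.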

Investigation

| Tool | Description |
|---|---|
| investigate_latency | Correlate latency metrics, error logs, and alerts for a vLLM model |
| investigate_gpu | Correlate GPU utilization, KV cache, queue depth, and pod status |
| investigate_errors | Correlate error logs, alerts, and Kubernetes events in a namespace |
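A composite investigation tool has to query several backends and still return something useful when one of them is unavailable. One common pattern for this with async backends is `asyncio.gather` with `return_exceptions=True`. The sketch below is a generic illustration of that pattern with hypothetical names, not the project's implementation.

```python
import asyncio


async def investigate(fetchers):
    """Run named async fetchers concurrently and collect a per-source report.

    `fetchers` maps a source name (e.g. "metrics") to a zero-argument
    coroutine function. A failing source is recorded, not raised.
    """
    names = list(fetchers)
    results = await asyncio.gather(
        *(fetchers[name]() for name in names),
        return_exceptions=True,  # keep going if one backend fails
    )
    report = {}
    for name, result in zip(names, results):
        if isinstance(result, Exception):
            report[name] = {"ok": False, "error": str(result)}
        else:
            report[name] = {"ok": True, "data": result}
    return report
```

Because failures are captured per source, an assistant can still reason over metrics and alerts even when, say, the log backend times out.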

Documentation

License

MIT
