ThoTischner

observability-mcp

Server Configuration

Describes the environment variables the server reads. All of them are optional.

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| LOKI_URL | No | URL of Loki server (e.g., http://localhost:3100). | |
| GRAFANA_TOKEN | No | Grafana Cloud API token for basic auth. | |
| PROMETHEUS_URL | No | URL of Prometheus server (e.g., http://localhost:9090). | |
| GRAFANA_LOKI_USER | No | Grafana Cloud Loki instance ID (numeric) for basic auth. | |
| GRAFANA_PROM_USER | No | Grafana Cloud Prometheus instance ID (numeric) for basic auth. | |
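
To illustrate how these variables relate to one another, here is a minimal Python sketch that turns them into per-backend connection settings. The helper names (`backend_settings`, `load_settings`) are hypothetical, and the rule that basic auth applies only when both an instance ID and GRAFANA_TOKEN are present is an assumption. The server's real loading logic is not documented here.

```python
import os
from typing import Mapping, Optional, Tuple


def backend_settings(env: Mapping[str, str], url_var: str, user_var: str) -> Optional[dict]:
    """Return connection settings for one backend, or None if its URL is unset."""
    url = env.get(url_var)
    if not url:
        return None  # backend not configured; every variable is optional
    user = env.get(user_var)
    token = env.get("GRAFANA_TOKEN")
    # Assumption: basic auth is used only when both the instance ID and the token are set.
    auth: Optional[Tuple[str, str]] = (user, token) if user and token else None
    return {"url": url.rstrip("/"), "auth": auth}


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Collect settings for both backends from the documented variables."""
    return {
        "prometheus": backend_settings(env, "PROMETHEUS_URL", "GRAFANA_PROM_USER"),
        "loki": backend_settings(env, "LOKI_URL", "GRAFANA_LOKI_USER"),
    }
```

Passing a plain dictionary instead of `os.environ` makes the sketch easy to try without touching the real environment.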

Capabilities

Features and capabilities supported by this server

| Capability | Details |
| --- | --- |
| tools | { "listChanged": true } |
| prompts | { "listChanged": true } |
| resources | { "listChanged": true } |

Tools

Functions exposed to the LLM to take actions

Each entry below gives the tool name, followed by its description.
list_sources

List the configured observability backends (Prometheus, Loki, and any connector) and whether each is currently reachable. When to use: call this first to learn which source names exist and are healthy before passing source to other tools, or to debug why a query returns no data. Behavior: read-only, no side effects. Returns one entry per source with its name, type, signal types (metrics/logs), and a live up/down status (the backend URL is intentionally not exposed — it may carry embedded credentials). Never throws for an unreachable backend — the backend is reported as down instead. Related: use list_services to see what is monitored within these sources.

list_services

Discover the service names that can be queried, aggregated across every connected backend. When to use: call this before query_metrics, query_logs, or get_service_health to obtain the exact, case-sensitive service name those tools require. Behavior: read-only, no side effects. Returns one entry per service with the service name, the source(s) it was discovered in, and which signals are available for it (metrics, logs, or both). Related: list_sources for backend health; get_service_health for a per-service overview.

query_metrics

Fetch the raw time-series for ONE metric of ONE service over a look-back window, returned together with pre-computed summary statistics. When to use: when you need the actual numeric values or the trend of a known metric. For a 'is this service OK?' verdict use get_service_health; to find which services are misbehaving use detect_anomalies. Prerequisites: get the exact service name from list_services and choose a metric from the list at the end of this description. Behavior: read-only, no side effects. Returns an ordered array of {timestamp, value} points plus a summary {current, average, min, max, trend}. When no series matched (e.g. a logs-only service has no such metric), values is empty and summary is null (not all-zeros) with a note — absent data is not a real zero reading. With groupBy set, returns one labelled series per distinct label value under groups instead of a single aggregated series. Units depend on the metric (e.g. CPU as %, latency as ms, rates as per-second). An unknown service/metric or an unreachable backend yields a structured explanatory error, never an exception. Available metrics: No metrics sources configured.
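
The distinction between "no data" and "a real zero" is easy to get wrong on the client side. The following Python sketch shows one way a caller might branch on it. The field names `values`, `summary`, `note`, and `groups` come from the description above. The function name, the exact container type of `groups`, and the sample `trend` value used in the test data are assumptions.

```python
from typing import Any, Dict


def describe_metric_result(result: Dict[str, Any]) -> str:
    """Summarise a query_metrics-style result without mistaking 'no data' for zero."""
    if "groups" in result:
        # With groupBy set, one labelled series per distinct label value is returned.
        return f"{len(result['groups'])} grouped series"
    summary = result.get("summary")
    if summary is None:
        # Empty values plus a null summary means no series matched, not a zero reading.
        return "no data: " + result.get("note", "no series matched")
    return f"current={summary['current']} avg={summary['average']} trend={summary['trend']}"
```

Checking `summary is None` rather than testing truthiness keeps a genuine all-zero summary from being treated as missing.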

query_logs

Fetch recent log entries for ONE service over a look-back window, with a pre-computed summary (error/warning counts and the most frequent error patterns). When to use: to inspect what a service actually logged, or to investigate an error spike surfaced by detect_anomalies / get_service_health. For numeric metrics use query_metrics instead. Golden rule: filter + aggregate server-side — pass labels to scope and aggregate (count_over_time/sum/topk) to get numbers, not raw rows. A high-volume window returned raw will blow past your context limit. Prerequisites: get the exact service name from list_services (the service must expose a logs signal). Behavior: read-only, no side effects. Returns the matching log entries (newest first, capped by limit) plus a summary with total/error/warn counts and top recurring error patterns. No matches yields an empty result with a zeroed summary; an unreachable backend yields a structured explanatory error, never an exception.

get_anomaly_history

Replay historical anomaly scores for a service from the TSDB the gateway writes to (omcp_anomaly_score series). When to use: post-mortem reconstruction, trend analysis on detector noise, or pulling context for the LLM when an incident is reviewed after the fact. Prerequisites: the operator must have OMCP_ANOMALY_HISTORY_REMOTE_WRITE configured AND a Prometheus source pointed at the same TSDB so the round-trip closes. Behavior: read-only. Returns the time-series of scores. Empty result means either no anomalies in the window or history is disabled. Related: detect_anomalies for the live scores; query_metrics if you want to write the PromQL by hand.

generate_postmortem

Stitch the gateway's primitives (anomaly history, blast-radius, traces, log highlights) into a single markdown post-mortem report for one service over a given window. When to use: after an incident, when the operator or LLM wants 'one document the on-call can read in 60 seconds' instead of poking the individual tools. Prerequisites: anomaly history requires OMCP_ANOMALY_HISTORY_REMOTE_WRITE + a Prometheus source. Traces require Tempo / Jaeger. Blast-radius requires a topology provider. Behavior: read-only. Returns markdown by default; pass format='json' for the structured shape. Output capped (timeline 20 rows, blast-radius 30 nodes, 10 traces) — JSON shape carries the full data. Related: get_anomaly_history, query_traces, get_blast_radius for the underlying primitives.

query_traces

Query distributed traces for a service over a given timeframe. Returns ranked trace summaries (duration, span count, error status) with a p50/p95 aggregate across the returned set. When to use: investigate tail-latency outliers, walk call chains across services for a specific time window, or pull traces related to an anomaly that the metric/log tools surfaced first. Prerequisites: get the exact service name from list_services. A traces connector (e.g. Tempo, installable from the connector hub) must be configured — none is bundled by default, so without one this returns a clean 'No trace backends configured' result. Behavior: read-only. filter accepts the backend's native query language (e.g. TraceQL on Tempo). When errorsOnly=true, only traces with at least one error span are returned. Default limit is 50.
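
For readers unfamiliar with percentile aggregates, this Python sketch computes a p50/p95 over a set of trace summaries using the nearest-rank method. It is an illustration only: the description does not say which percentile method the server uses, and the `durationMs` field name is hypothetical.

```python
import math
from typing import Dict, List


def nearest_rank(sorted_values: List[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of values at or below it."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def latency_aggregate(traces: List[Dict]) -> Dict[str, float]:
    """Compute p50/p95 across the returned set only, not across all traffic."""
    durations = sorted(t["durationMs"] for t in traces)  # hypothetical field name
    if not durations:
        return {}
    return {"p50": nearest_rank(durations, 50), "p95": nearest_rank(durations, 95)}
```

Because the aggregate covers only the returned traces, a small `limit` or `errorsOnly=true` changes what the percentiles describe.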

get_service_health

Produce a single aggregated health verdict for ONE service by combining its metrics and logs. When to use: the fastest way to answer 'is this service healthy right now and why?'. Use query_metrics/query_logs to drill into the underlying numbers, or detect_anomalies to scan many services at once. Prerequisites: get the exact service name from list_services. Behavior: read-only, no side effects. Returns a weighted health score (0–100), a status of healthy | degraded | critical, the key contributing metrics, a log error summary, detected anomalies, and cross-signal correlations explaining the score. A service with no data yields an explanatory result rather than an exception.

detect_anomalies

Scan one or all monitored services for abnormal behavior and return the findings ranked by severity. When to use: the entry point for 'is anything wrong anywhere?' triage. Once a service is flagged, follow up with get_service_health for the verdict or query_metrics/query_logs for the raw evidence. Behavior: read-only, no side effects. Applies z-score analysis to metrics, detects log error-rate spikes, and correlates the two. Returns a list of anomalies, each with the affected service, metric/signal, severity, the deviation (e.g. σ and % change), and a short explanation. No anomalies yields an empty list, not an error. Related: get_service_health (single-service verdict), query_metrics (raw series behind a flagged metric).
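
As background on the z-score analysis mentioned above, here is a minimal Python sketch of the general technique: compare the current value with the mean of a baseline window, measured in standard deviations. The function names, the use of the population standard deviation, and the threshold of 3.0 are all assumptions. The server's actual window and threshold are not stated.

```python
from statistics import mean, pstdev
from typing import List, Optional


def z_score(history: List[float], current: float) -> Optional[float]:
    """Number of standard deviations by which `current` departs from the baseline mean."""
    if len(history) < 2:
        return None  # not enough points for a baseline
    sigma = pstdev(history)
    if sigma == 0:
        return None  # flat baseline: the z-score is undefined
    return (current - mean(history)) / sigma


def is_anomalous(history: List[float], current: float, threshold: float = 3.0) -> bool:
    """Flag a value whose absolute z-score reaches the (assumed) threshold."""
    z = z_score(history, current)
    return z is not None and abs(z) >= threshold
```

Returning `None` for a flat or too-short baseline avoids a division by zero and keeps "cannot judge" separate from "normal".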

get_topology

Return the infrastructure topology graph (Resources and Edges) from every topology-capable connector. When to use: when an agent needs to reason about which workload runs on which host, who owns whom, or which scope (namespace/project/folder) a resource belongs to. Pair with get_blast_radius for shared-host RCA. Behavior: read-only, no side effects. Returns { sources, resources, edges, total, truncated }. Filters compose: source to one connector, kind to one resource type (e.g. 'pod', 'node', 'deployment'), scope to members of a namespace/folder/project. Output is capped by limit (default 500, max 5000) and edges referencing dropped resources are removed. Related: get_blast_radius to evaluate the impact of a host failure; list_sources to discover topology-capable connectors.

get_blast_radius

Given a resource, return who else fails if its underlying host(s) fail. When to use: cross-cutting RCA — when several services degrade together and you suspect a shared host. Works for any RUNS_ON relationship: pod→node, vm→hypervisor, container→host. Behavior: read-only, no side effects. Resolves resource to a Resource (accepts canonical id, exact name, or unique substring), determines its host(s) via RUNS_ON, then lists every other resource that runs on those hosts, bucketed by ownership root (the terminal OWNED_BY target — e.g. the Deployment, not the ReplicaSet). If the target is itself a host, its tenants are reported. Returns a structured error if the resource is ambiguous or unknown. Related: get_topology for the full graph; get_service_health for the per-service verdict on each co-tenant.
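
The resolution steps described above can be sketched as a small graph traversal in Python. This is a sketch under stated assumptions, not the gateway's implementation: the edge shape `(source, relation, target)` and both function names are hypothetical, and name or substring resolution is omitted, so the input is already a canonical id.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str, str]  # (source id, relation, target id); this shape is an assumption


def ownership_root(resource: str, owned_by: Dict[str, str]) -> str:
    """Follow OWNED_BY links to the terminal owner (e.g. the Deployment, not the ReplicaSet)."""
    seen: Set[str] = set()
    while resource in owned_by and resource not in seen:
        seen.add(resource)  # guard against cyclic ownership data
        resource = owned_by[resource]
    return resource


def blast_radius(resource: str, edges: Iterable[Edge]) -> Dict[str, List[str]]:
    """List co-tenants of the resource's host(s), bucketed by ownership root."""
    runs_on: Dict[str, Set[str]] = defaultdict(set)
    tenants: Dict[str, Set[str]] = defaultdict(set)
    owned_by: Dict[str, str] = {}
    for src, rel, dst in edges:
        if rel == "RUNS_ON":
            runs_on[src].add(dst)
            tenants[dst].add(src)
        elif rel == "OWNED_BY":
            owned_by[src] = dst
    # If the target runs on nothing, treat it as a host and report its tenants.
    hosts = runs_on.get(resource) or {resource}
    buckets: Dict[str, List[str]] = defaultdict(list)
    for host in hosts:
        for tenant in tenants.get(host, ()):
            if tenant != resource:
                buckets[ownership_root(tenant, owned_by)].append(tenant)
    return {root: sorted(set(members)) for root, members in buckets.items()}
```

Bucketing by ownership root groups replicas of the same workload together, so the result reads as "which workloads are affected" rather than as a flat list of pods.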

enrich_ips

Resolve a batch of IPv4 or IPv6 addresses to geo (country/city), ASN/org, and a hosting/proxy flag. When to use: answering 'where are these visitors from?' or 'which of these IPs are bots / datacenter / VPN exit nodes?' over access logs, without an out-of-band geo-API call per IP. Both IPv4 and IPv6 clients are resolved — don't pre-filter v6 out. Behavior: read-only. By default looks each IP up in a LOCAL offline dataset the operator configured (OMCP_IP_ENRICH_FILE) with NO external network call — safe in air-gapped deployments. Optionally, if the operator enabled OMCP_IP_ENRICH_RDAP, IPs the dataset doesn't cover fall back to an online RDAP query (country/org only) and the result carries via:'rdap'; the offline dataset is always preferred. Returns one row per input IP with found=true/false plus any known fields. If neither is configured it returns a clear notice explaining how to enable them. RDAP rate-limits: a row with found=false AND transient:true (error names the cause, e.g. 'rate_limited') is NOT a confirmed negative — the registry throttled or failed the lookup, so the IP may resolve on a later retry or in a smaller batch. Such rows are counted in summary.transient (separate from summary.unmatched) and a top-level note is added. Don't treat transient rows as 'unknown/suspicious'; retry them (results are cached, so repeats are cheap). Related: pull the IPs from query_logs (use labels/aggregate to find the IPs of interest first).
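
The transient-versus-unmatched rule above can be sketched in Python as a simple partition of the returned rows. The `found` and `transient` fields come from the description. The function name, the bucket names, and the `ip` key used in the usage example are hypothetical.

```python
from typing import Any, Dict, List


def partition_rows(rows: List[Dict[str, Any]]) -> Dict[str, List[Dict[str, Any]]]:
    """Split enrich_ips-style rows into resolved, confirmed-unmatched, and retryable."""
    out: Dict[str, List[Dict[str, Any]]] = {"resolved": [], "unmatched": [], "retry": []}
    for row in rows:
        if row.get("found"):
            out["resolved"].append(row)
        elif row.get("transient"):
            # Throttled or failed lookup: not a confirmed negative, so retry later.
            out["retry"].append(row)
        else:
            out["unmatched"].append(row)
    return out
```

A caller might use it like this, resubmitting only the `retry` bucket in a smaller batch:

```python
rows = [
    {"ip": "192.0.2.1", "found": True, "via": "rdap"},
    {"ip": "192.0.2.2", "found": False, "transient": True, "error": "rate_limited"},
    {"ip": "192.0.2.3", "found": False},
]
buckets = partition_rows(rows)
retry_ips = [row["ip"] for row in buckets["retry"]]
```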

Prompts

Interactive templates invoked by user choice

| Name | Description |
| --- | --- |
| triage-incident | Guided incident triage for one service: health verdict, anomaly scan, blast radius, and the log slice that matters. |
| write-postmortem | Generate and refine a post-incident report for one service over a window. |

Resources

Contextual data attached and managed by the client

| Name | Description |
| --- | --- |
| agent-usage-guide | How to use this gateway effectively as an agent: the proven filter→aggregate→enrich triage recipe, signal-vs-silence behaviours, and the operator flags that unlock optional tools. |

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/ThoTischner/observability-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.