issues
Identify and prioritize live Kubernetes failures: crashing pods, missing references, scheduling blockers, and unhealthy conditions. Shows critical and warning issues with severity and diagnostic context.
Instructions
Use when the agent's decision is 'what's broken right now?' — LIVE OPERATIONAL STATE, not config posture. Returns a ranked list of currently failing resources: failing Deployments/StatefulSets/CronJobs/HPAs/Nodes/Jobs/PVCs, dangling-reference errors like Pod→missing PVC/CM/Secret/SA, HPA→missing scaleTargetRef, Ingress→missing backend Service, RoleBinding→missing Role, webhook→missing Service, pod startup blockers — why a Pod can't reach Running: unschedulable (arch/taint/resources/affinity), admission-rejected (quota/PodSecurity/webhook), or stuck post-bind (CNI/volume), and False .status.conditions on CRDs from Argo/Flux/Knative/Crossplane/cert-manager/KEDA. Severity normalized to critical/warning. This is one curated stream — there is no source filter; each row carries a source label (problem|missing_ref|scheduling|condition) you can slice on via the CEL filter= if needed. Some rows include diagnostic_context: deterministic facts such as explicit missing refs, selected backend issues, or workload rollups; treat these as triage context, not proof of root cause. When recent_changes is present, consider it if the issue list does not explain the reported symptom; recent_changes_reason says why Radar attached it. It lists recent spec/config changes that may explain failures not yet visible as runtime issues, or help distinguish creation-time baseline failures from the active incident. For raw Kubernetes Warning events use get_events; for static best-practice / security-posture findings (runAsRoot, missing PDB, no probes, missing resource limits) use get_cluster_audit — a separate axis that must never be conflated (a healthy pod can have many audit findings; a crashing pod can have zero). Kyverno PolicyReport violations are not in either — they surface per-resource via get_resource's resourceContext policy rollup. After identifying a suspect issue, call diagnose when the affected resource is a workload (Pod/Deployment/StatefulSet/DaemonSet) or GitOps reconciler (Application/Kustomization/HelmRelease). For other non-workload kinds, call get_resource. Use get_neighborhood when the failure likely crosses Services/workloads/Pods/dependencies. Use namespace for app-local triage; omit it when the root may be cluster-scoped or outside the app namespace.
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| namespace | No | filter to one namespace | |
| severity | No | comma-separated: critical,warning | |
| kind | No | comma-separated kind filter (e.g. Deployment,Pod) | |
| limit | No | max issues returned (default 200, max 1000) | |
| filter | No | optional CEL boolean expression run against each composed Issue. Bindings: severity (critical|warning), category (e.g. crashloop, image_pull_failed, missing_config_ref, gitops_sync_failed), category_group (startup|runtime|scheduling|configuration|networking|storage|scaling|security|control_plane; runtime here is an issue taxonomy group, not issue_timing), source (problem=built-in Radar detector, missing_ref=dangling by-name reference, scheduling=pod startup blocker, condition=False controller/CRD condition), kind, group, ns (the namespace — use 'ns', not 'namespace' which is a CEL reserved word), name, reason, message, cause, action, remediation_kind, remediation_target, count (int, the affected-resource fan-out), grouping_scope (workload|service|node|…), restart_count (int), last_terminated_reason, operation_retry_count (int, a GitOps controller's sync-operation retries — distinct from restart_count), stuck (bool, issue not expected to self-recover), issue_timing (string timing evidence: 'started_at_resource_creation' = evidence places the failing state during resource creation or first reconciliation; 'started_after_resource_was_healthy' = evidence shows a meaningful healthy window before the failing condition appeared; absent = Radar has no clean signal, do NOT infer timing from age alone; this is timing evidence, not a root-cause verdict), issue_timing_basis (string: evidence used — 'condition' | 'owner_condition' | 'pod_creation' | 'deletion' | 'phase' | 'spec'), first_seen + last_seen (unix seconds — prefer first_seen for onset/age; last_seen churns to compose-time). For cross-cluster scoping use clusters= (not a CEL predicate). Examples: 'severity == "critical" && count > 5', 'category_group == "startup"', 'restart_count > 10', 'remediation_kind == "create-namespace"', 'stuck && operation_retry_count >= 5', 'issue_timing == "started_after_resource_was_healthy"', 'first_seen < timestamp("2026-05-01T00:00:00Z").getSeconds()' |