get_vllm_metrics
Monitor vLLM inference server performance by querying pods directly, tracking request latency, throughput, queue size, and GPU cache usage for capacity planning and proactive alerting.
Instructions
Monitor vLLM inference server performance metrics by directly querying pods.
Why:
- Performance monitoring: Track request latency and throughput
- Capacity planning: Monitor queue size and running requests
- Resource optimization: Track GPU cache usage
- Proactive alerting: Detect performance degradation
Args:
namespace: Optional namespace filter
pod_filter: Optional pod name filter (supports partial match)
Returns:
Markdown report of vLLM metrics
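The sketch below illustrates one way such a collector could work; it is not the tool's actual implementation. It assumes the Kubernetes Python client and `requests` are installed, that the code has network access to pod IPs (i.e. runs in-cluster), that each vLLM server exposes its Prometheus endpoint on the default port 8000 at `/metrics`, and that "partial match" means a substring match on the pod name.

```python
# Minimal sketch (assumptions noted above), not the tool's actual implementation.
from __future__ import annotations

import requests
from kubernetes import client, config


def scrape_vllm_metrics(namespace: str | None = None,
                        pod_filter: str | None = None) -> dict[str, str]:
    """Return raw vLLM Prometheus lines keyed by "namespace/pod"."""
    config.load_incluster_config()   # or config.load_kube_config() outside the cluster
    v1 = client.CoreV1Api()
    pods = (v1.list_namespaced_pod(namespace).items
            if namespace else v1.list_pod_for_all_namespaces().items)

    results: dict[str, str] = {}
    for pod in pods:
        name = pod.metadata.name
        # "supports partial match" is taken to mean a substring match on the pod name.
        if pod_filter and pod_filter not in name:
            continue
        if not pod.status.pod_ip:
            continue
        try:
            # Assumption: vLLM's OpenAI-compatible server on its default port 8000.
            resp = requests.get(f"http://{pod.status.pod_ip}:8000/metrics", timeout=5)
            resp.raise_for_status()
        except requests.RequestException:
            continue   # skip pods that are unreachable or not vLLM servers
        # Keep only the vLLM-prefixed Prometheus lines (latency, throughput, queue, cache).
        vllm_lines = [l for l in resp.text.splitlines() if l.startswith("vllm:")]
        if vllm_lines:
            results[f"{pod.metadata.namespace}/{name}"] = "\n".join(vllm_lines)
    return results
```

A production implementation would more likely reach the pods through the Kubernetes API proxy or port-forwarding rather than raw pod IPs; the direct-IP call simply keeps the sketch short.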
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| namespace | No | Optional namespace filter | |
| pod_filter | No | Optional pod name filter (supports partial match) | |
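Both arguments are optional; omitting them presumably scans all namespaces and all pods. Since the tool returns a markdown report, the fragment below shows one hedged way the raw Prometheus text scraped from a pod could be reduced to a small table. The gauge names used here (`vllm:num_requests_running`, `vllm:num_requests_waiting`, `vllm:gpu_cache_usage_perc`) are those exposed by recent vLLM versions, but exact names can differ across releases, and the layout is illustrative rather than the tool's actual output format.

```python
# Sketch of turning scraped Prometheus text into a small markdown table.
# Gauge names are assumed from recent vLLM versions and may vary by release.
GAUGES = {
    "vllm:num_requests_running": "Running requests",
    "vllm:num_requests_waiting": "Queued requests",
    "vllm:gpu_cache_usage_perc": "GPU KV-cache usage",
}


def to_markdown(pod: str, metrics_text: str) -> str:
    rows = [f"### {pod}", "", "| Metric | Value |", "|---|---|"]
    for line in metrics_text.splitlines():
        if line.startswith("#"):
            continue                      # skip Prometheus HELP/TYPE comment lines
        name, _, value = line.partition(" ")
        base = name.split("{", 1)[0]      # strip labels such as {model_name="..."}
        if base in GAUGES:
            rows.append(f"| {GAUGES[base]} | {value.strip()} |")
    return "\n".join(rows)
```

Per-pod fragments produced this way could then be concatenated into the single markdown report described under Returns.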