Overwatch MCP

usage-guide.md•6.69 KiB

# Overwatch MCP Usage Guide How to ask questions and get useful answers from your observability data. ## Quick Start Examples ### Finding Problems ``` "Show me errors from the last hour" "What's failing in production?" "Find any timeout errors in the last 6 hours" "Are there any 5xx errors from the API?" ``` ### Investigating Issues ``` "Search for errors mentioning 'connection refused'" "Find logs where the database query failed" "Show me authentication failures" "What happened around 2pm today with the payment service?" ``` ### System Health ``` "Is everything up?" "Show me the CPU usage trend for the last hour" "What's the request rate for the API?" "Are there any services with high error rates?" ``` --- ## Graylog (Log Search) ### Basic Searches | Ask This | What It Does | |----------|--------------| | "Show me errors" | Searches for ERROR level logs | | "Find warnings from the last 30 minutes" | Time-bounded WARN search | | "Search logs for 'timeout'" | Keyword search | | "Show me logs from service X" | Filter by source/service | ### Time Ranges ``` "errors from the last hour" → -1h "what happened in the last 6 hours" → -6h "show me today's errors" → -24h (max) "errors between 2pm and 3pm" → specific time range ``` ### Filtering ``` "Show me only ERROR level logs" "Find logs from the api-gateway service" "Search for 'NullPointerException' errors" "Show logs where http_status is 500" ``` ### Combining Filters ``` "Find errors from the payment service mentioning 'timeout'" "Show warnings from any API service in the last 2 hours" "Search for database connection errors excluding test environments" ``` ### Non-Production Searches By default, searches focus on production. To search other environments: ``` "Search all environments for errors" "Show me staging logs with errors" "Find dev environment warnings" "Search non-prod for deployment issues" ``` --- ## Prometheus (Metrics) ### Current State ``` "Is everything up?" "Show me current CPU usage" "What's the memory usage right now?" "Are all services healthy?" ``` ### Trends Over Time ``` "Show me CPU usage over the last hour" "What's the request rate trend for the last 6 hours?" "Graph memory usage for today" "Show error rate over time" ``` ### Specific Metrics ``` "What's the p95 latency for HTTP requests?" "Show me the request count per service" "What's the disk usage percentage?" "How many active connections are there?" ``` ### Comparisons ``` "Compare CPU usage between services" "Which service has the highest error rate?" "Show me the top 5 services by request volume" ``` --- ## InfluxDB (Time Series Data) ### Basic Queries ``` "Show me CPU metrics from the last hour" "What's the average memory usage?" "Get disk I/O statistics" ``` ### Aggregations ``` "Show me average CPU per host" "What's the max memory usage in the last hour?" "Sum of all requests by endpoint" ``` ### Filtering ``` "Show metrics for host server-01" "Get CPU data where usage > 80%" "Filter metrics by region" ``` --- ## Investigation Workflows ### When Something is Broken 1. **Start broad**: "What errors are happening right now?" 2. **Identify the source**: "Which service has the most errors?" 3. **Dig deeper**: "Show me the actual error messages from [service]" 4. **Find the pattern**: "When did these errors start?" ### Performance Issues 1. **Check metrics**: "Show me CPU and memory trends for the last hour" 2. **Look for correlation**: "Did request latency increase at the same time?" 3. **Find the cause**: "Search logs for slow query warnings" ### After a Deployment ``` "Compare error rates before and after 2pm" "Show me any new errors in the last 30 minutes" "What's the current health of [service]?" ``` --- ## Tips for Better Results ### Be Specific About Time ``` ❌ "Show me errors" → Defaults to 1 hour ✅ "Show me errors from the last 6 hours" ✅ "Errors between 2pm and 4pm today" ``` ### Name Services/Sources ``` ❌ "What's broken?" ✅ "What errors are coming from the API service?" ✅ "Show me payment service logs" ``` ### Start Narrow, Then Expand ``` 1. "Show me ERROR level logs only" 2. "Include WARN as well" 3. "Search all log levels for 'timeout'" ``` ### Use Keywords ``` "timeout", "connection", "failed", "exception", "denied" "refused", "unauthorized", "500", "503", "null" ``` ### Avoid Leading Wildcards ``` ❌ source:*api* → Leading wildcards often fail ✅ source:api* → Trailing wildcards work ✅ source:api-service → Exact match is best ``` --- ## Understanding Results ### Log Results - **total_results**: How many logs matched (may be more than shown) - **truncated**: True if there are more results than displayed - **level_breakdown**: Count of logs by severity level - **source_breakdown**: Which services/sources produced the logs ### Hints Section Results include `_hints` with: - **analysis_tips**: Suggestions for what to look at first - **suggested_filters**: Recommended follow-up queries - **level_breakdown**: ERROR/WARN/INFO counts ### When You See "truncated: true" ``` "There are more results. Show me only ERRORs to narrow it down." "Filter to just the [service-name] source" "Narrow the time range to the last 30 minutes" ``` --- ## Common Query Patterns ### Error Investigation ``` "Show me all errors from the last hour" → Review the source_breakdown → "Now show me just the errors from [top source]" → "What was happening 5 minutes before these errors started?" ``` ### Service Health Check ``` "Is the [service] healthy?" "Show me error rate for [service] over the last hour" "Any warnings from [service] recently?" ``` ### Incident Response ``` "What errors are happening right now?" "Which services are affected?" "When did this start?" "Show me related logs from 10 minutes before the incident" ``` ### Post-Mortem ``` "Show me all errors between [start time] and [end time]" "What was the error rate during the incident?" "Compare metrics before, during, and after" ``` --- ## Environment Filtering ### Default Behavior - Searches automatically focus on **production** logs - This saves tokens and focuses on what matters ### Override When Needed ``` "Search staging for this error" "Show me dev environment logs" "Search ALL environments for this issue" "Include non-prod in the search" ``` --- ## Getting Help ### Discover Available Fields ``` "What fields are available in Graylog?" "Show me fields related to HTTP" "List all kubernetes-related fields" ``` ### Discover Available Metrics ``` "What Prometheus metrics are available?" "Show me metrics related to CPU" "List HTTP metrics" ``` ### Query Syntax Help ``` "How do I search for exact phrase in Graylog?" "What's the PromQL for error rate?" "How do I filter by multiple services?" ```

Loading blob content...

Latest Blog Posts

Redis vs ioredis vs valkey-glide
By punkpeye on January 26, 2026.
benchmark
Redis
valkey
Quickstart: Publish an MCP Server to the MCP Registry
By punkpeye on January 24, 2026.
mcp
official reference mirror
Official MCP Registry Server.json Requirements
By punkpeye on January 24, 2026.
mcp
official reference mirror

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/malindarathnayake/Overwatch-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server

usage-guide.md•6.69 KiB