# Monitoring & Logging Guide
**Tableau MCP Server - Phase 6**
**Version**: 1.0
**Last Updated**: November 18, 2025
---
## Table of Contents
1. [Overview](#overview)
2. [Cloud Run Logging](#cloud-run-logging)
3. [Metrics & Monitoring](#metrics--monitoring)
4. [Alerting](#alerting)
5. [Performance Monitoring](#performance-monitoring)
6. [Error Tracking](#error-tracking)
7. [Cost Monitoring](#cost-monitoring)
8. [Troubleshooting](#troubleshooting)
9. [Log Retention](#log-retention)
10. [Monitoring Checklist](#monitoring-checklist)
11. [Useful Queries](#useful-queries)
12. [Resources](#resources)
---
## Overview
This guide covers monitoring and logging for the Tableau MCP Server deployed on Google Cloud Run. Cloud Run automatically integrates with Google Cloud's operations suite (formerly Stackdriver) for comprehensive monitoring.
### Key Monitoring Areas
1. **Logs**: Application logs, request logs, system logs
2. **Metrics**: Request count, latency, error rate, resource usage
3. **Traces**: Request tracing for performance analysis
4. **Alerts**: Automated notifications for issues
---
## Cloud Run Logging
### View Logs in Console
1. Navigate to [Cloud Run Console](https://console.cloud.google.com/run)
2. Click on your service (e.g., `tableau-mcp-staging`)
3. Click **"Logs"** tab
### View Logs via gcloud CLI
#### Recent Logs
```bash
# Staging - last 50 entries
gcloud run services logs read tableau-mcp-staging \
  --region=australia-southeast1 \
  --limit=50

# Production - last 50 entries
gcloud run services logs read tableau-mcp-production \
  --region=australia-southeast1 \
  --limit=50
```
#### Real-Time Log Streaming
```bash
# Staging - tail logs (log tailing lives under the beta command group)
gcloud beta run services logs tail tableau-mcp-staging \
  --region=australia-southeast1

# Production - tail logs
gcloud beta run services logs tail tableau-mcp-production \
  --region=australia-southeast1
```
#### Filter Logs
```bash
# Only errors
gcloud run services logs read tableau-mcp-staging \
  --region=australia-southeast1 \
  --log-filter='severity>=ERROR' \
  --limit=20

# Entries at or after a given time (timestamp is part of the filter)
gcloud logging read 'resource.type="cloud_run_revision" AND resource.labels.service_name="tableau-mcp-staging" AND timestamp>="2025-01-01T00:00:00Z"' \
  --limit=100 \
  --format='value(timestamp,textPayload)'

# Search for specific text
gcloud run services logs read tableau-mcp-staging \
  --region=australia-southeast1 \
  --log-filter='textPayload=~"authentication failed"' \
  --limit=20
```
### Log Levels
The server logs at different severity levels:
- **DEBUG**: Detailed debugging information
- **INFO**: General informational messages (default)
- **WARN**: Warning messages (non-critical issues)
- **ERROR**: Error messages (failures)
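Filters such as `severity>=ERROR` depend on how Cloud Logging orders severities. The sketch below illustrates that ordering using the numeric ranks of Cloud Logging's `LogSeverity` enum. The function names are illustrative, and treating `WARN` as equivalent to Cloud Logging's `WARNING` is an assumption about how the server labels that level.

```bash
# Map a severity name to its Cloud Logging LogSeverity rank.
severity_rank() {
  case "$1" in
    DEBUG)        echo 100 ;;
    INFO)         echo 200 ;;
    WARN|WARNING) echo 400 ;;   # WARN treated as WARNING (assumption)
    ERROR)        echo 500 ;;
    CRITICAL)     echo 600 ;;
    *)            echo 0 ;;     # DEFAULT or unknown
  esac
}

# Succeeds when severity $1 would match a filter of the form severity>=$2.
severity_at_least() {
  [ "$(severity_rank "$1")" -ge "$(severity_rank "$2")" ]
}
```

For example, `severity_at_least ERROR WARN` succeeds, while `severity_at_least INFO ERROR` fails, which mirrors which entries a `severity>=...` filter returns.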
### Structured Logging
The server implements structured logging with:
- **Timestamp**: When the event occurred
- **Level**: Severity level
- **Message**: Log message
- **Context**: Request ID, user info, etc.
Example log entry:
```json
{
"timestamp": "2025-11-18T12:34:56.789Z",
"severity": "INFO",
"message": "MCP tool executed successfully",
"tool": "tableau_list_workbooks",
"requestId": "abc-123-def-456",
"duration": "234ms"
}
```
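Cloud Run parses each JSON line written to stdout or stderr into `jsonPayload` and treats the `severity` and `message` fields specially. As a rough sketch of that line format (not the server's actual logger), a helper script could emit compatible entries like this. The function names `json_escape` and `log_json` are hypothetical.

```bash
# Escape backslashes and double quotes so a value is safe inside a JSON string.
# Newlines and other control characters are not handled in this sketch.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Usage: log_json SEVERITY MESSAGE [key=value ...]
# Prints one JSON object per call, with all values emitted as strings.
log_json() {
  local severity="$1" message="$2"
  shift 2
  local line pair
  line=$(printf '{"severity":"%s","message":"%s"' \
    "$(json_escape "$severity")" "$(json_escape "$message")")
  for pair in "$@"; do
    line+=$(printf ',"%s":"%s"' \
      "$(json_escape "${pair%%=*}")" "$(json_escape "${pair#*=}")")
  done
  printf '%s}\n' "$line"
}
```

Calling `log_json INFO "MCP tool executed successfully" tool=tableau_list_workbooks` prints a single line that Cloud Logging can index by `jsonPayload.tool`.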
### Sensitive Data Sanitization
The server automatically sanitizes sensitive data in logs:
- API keys are never logged
- Tableau credentials are never logged
- Request/response bodies with secrets are sanitized
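To make the idea of sanitization concrete, here is a minimal sketch that masks secret-looking values in a JSON log line before it is written. It is not the server's implementation. The key names (`apiKey`, `password`, `token`, `personalAccessToken`) are hypothetical examples, and the pattern does not handle values containing escaped quotes.

```bash
# Replace the values of secret-looking JSON keys with a fixed placeholder.
# Reads log lines on stdin and writes the sanitized lines to stdout.
redact_secrets() {
  sed -E 's/"(apiKey|password|token|personalAccessToken)"[[:space:]]*:[[:space:]]*"[^"]*"/"\1":"[REDACTED]"/g'
}
```

A line such as `{"user":"ana","password":"s3cret"}` piped through `redact_secrets` keeps the `user` field and replaces only the password value.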
---
## Metrics & Monitoring
### View Metrics in Console
1. Navigate to [Cloud Run Console](https://console.cloud.google.com/run)
2. Click on your service
3. Click **"Metrics"** tab
### Key Metrics
#### Request Metrics
- **Request Count**: Total number of requests
- **Request Latency**: Response time (p50, p95, p99)
- **Error Rate**: Percentage of failed requests
#### Resource Metrics
- **CPU Utilization**: Percentage of allocated CPU used
- **Memory Utilization**: Memory usage
- **Container Instance Count**: Number of running instances
#### Billable Metrics
- **Billable Instance Time**: Time instances were running
- **Request Time**: Time spent processing requests
### Query Metrics via gcloud
```bash
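# Queries the Cloud Monitoring API (projects.timeSeries.list) for the last hour.
# Assumes PROJECT_ID is set, and that GNU date and jq are available.
START=$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)
END=$(date -u +%Y-%m-%dT%H:%M:%SZ)
query_metric() {
  curl -sG "https://monitoring.googleapis.com/v3/projects/${PROJECT_ID}/timeSeries" \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    --data-urlencode "filter=metric.type=\"$1\" AND resource.labels.service_name=\"tableau-mcp-staging\"" \
    --data-urlencode "interval.startTime=${START}" \
    --data-urlencode "interval.endTime=${END}"
}
# Request count (last hour)
query_metric run.googleapis.com/request_count | jq -r '.timeSeries[]?.points[].value.int64Value'
# Average latency
query_metric run.googleapis.com/request_latencies | jq -r '.timeSeries[]?.points[].value.distributionValue.mean'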
```
### Custom Metrics
The server exposes custom metrics via logs that can be converted to metrics:
1. **MCP Tool Execution Count**
   - Filter: `jsonPayload.message=~"MCP tool executed"` (a regex match, so it also covers messages such as "MCP tool executed successfully")
   - Group by: `jsonPayload.tool`
2. **Authentication Failures**
   - Filter: `jsonPayload.message="Authentication failed"`
3. **Tableau API Errors**
   - Filter: `jsonPayload.message=~"Tableau API error"`
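One way to turn these filters into log-based metrics is `gcloud logging metrics create`. The sketch below only builds the filter and prints the command so it can be reviewed before running. The metric name `mcp_auth_failures` and both function names are hypothetical, and scoping the filter to a single service is an assumption.

```bash
# Build the log filter for one custom metric, scoped to a single service.
metric_filter() {
  local service="$1" message="$2"
  printf 'resource.type="cloud_run_revision" AND resource.labels.service_name="%s" AND jsonPayload.message="%s"' \
    "$service" "$message"
}

# Print (rather than run) the gcloud command that would create the metric.
print_metric_command() {
  local name="$1" service="$2" message="$3"
  printf 'gcloud logging metrics create %s --description=%q --log-filter=%q\n' \
    "$name" "Count of: $message" "$(metric_filter "$service" "$message")"
}

# Example (prints the command only):
# print_metric_command mcp_auth_failures tableau-mcp-staging "Authentication failed"
```

This produces a simple counter metric. Grouping by `jsonPayload.tool` additionally needs a label extractor on the metric, which this sketch does not configure.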
---
## Alerting
### Recommended Alerts
#### 1. High Error Rate
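Alert when errors spike (target: an error rate above 5%). Note that the condition below counts `ERROR` log entries rather than computing a ratio: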
```bash
# Create alert policy (via Console or Terraform recommended)
# Filter: resource.type="cloud_run_revision" AND severity >= ERROR
# Threshold: Count > 10 in 5 minutes
```
**Console Setup**:
1. Go to [Monitoring > Alerting](https://console.cloud.google.com/monitoring/alerting)
2. Click **"Create Policy"**
3. Add condition:
   - Resource type: Cloud Run Revision
   - Metric: Log entries
   - Filter: `severity >= ERROR`
   - Threshold: `> 10` over `5 minutes`
4. Add notification channel (email, SMS, etc.)
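A count-based threshold and a percentage target are related through the request volume in the same window. The following sketch shows the arithmetic for checking a 5% error rate from two counts. The function names are illustrative, and the counts are assumed to come from the same time window.

```bash
# Print the error rate as a percentage with two decimals (0.00 when there is no traffic).
error_rate_pct() {
  local errors="$1" total="$2"
  awk -v e="$errors" -v t="$total" \
    'BEGIN { if (t > 0) printf "%.2f\n", 100 * e / t; else print "0.00" }'
}

# Succeeds when the error rate is strictly above the threshold (default 5 percent).
error_rate_exceeds() {
  local errors="$1" total="$2" threshold="${3:-5}"
  awk -v e="$errors" -v t="$total" -v th="$threshold" \
    'BEGIN { exit !(t > 0 && 100 * e / t > th) }'
}
```

With 200 requests in five minutes, 12 errors give a 6% rate and `error_rate_exceeds 12 200` succeeds, while 10 errors sit exactly at 5% and do not trip the check.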
#### 2. High Latency
Alert when p95 latency exceeds 2 seconds:
```bash
# Condition:
# Metric: run.googleapis.com/request_latencies
# Aggregation: 95th percentile
# Threshold: > 2000ms
```
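To get a feel for what a p95 threshold means, the sketch below computes a percentile from a list of latencies with the nearest-rank method. Cloud Monitoring estimates percentiles from distribution buckets, so its values can differ slightly from this exact calculation. The function name `percentile` is illustrative.

```bash
# Print the Nth percentile (nearest-rank method) of numbers read from stdin,
# one value per line. N is expected to be an integer between 1 and 100.
percentile() {
  local p="$1"
  sort -n | awk -v p="$p" '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = int((p * NR + 99) / 100)   # ceil(p * NR / 100) for integer p
      if (rank < 1) rank = 1
      print v[rank]
    }'
}
```

For ten samples where nine are under one second and one takes 3000 ms, the p95 is the slow outlier, so a `> 2000ms` condition would fire even though the median looks healthy.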
#### 3. Service Down
Alert when service is unhealthy:
```bash
# Condition:
# Metric: run.googleapis.com/request_count
# Condition: No requests in last 5 minutes (for production)
# Note: Don't use for staging if it scales to zero
```
#### 4. Authentication Failures
Alert on repeated authentication failures:
```bash
# Filter: jsonPayload.message="Authentication failed"
# Threshold: > 20 in 10 minutes
```
#### 5. Tableau API Failures
Alert on Tableau API connectivity issues:
```bash
# Filter: jsonPayload.message=~"Tableau API error" OR jsonPayload.message=~"Authentication failed"
# Threshold: > 5 in 5 minutes
```
### Notification Channels
Set up notification channels:
1. **Email**: Send alerts to team email
2. **Slack**: Integrate with Slack channel
3. **PagerDuty**: For production critical alerts
4. **SMS**: For urgent production issues
### Alert Configuration via Console
1. Navigate to [Monitoring > Alerting](https://console.cloud.google.com/monitoring/alerting)
2. Click **"Create Policy"**
3. Configure:
   - **Condition**: Metric, threshold, duration
   - **Notification**: Channels to notify
   - **Documentation**: Help text for responders
   - **Incident auto-close**: Duration before auto-close
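The same condition, threshold, and duration can also be captured in a policy file kept under version control. The sketch below prints a Cloud Monitoring `AlertPolicy` JSON document for a p95 latency condition. The function name, the 300-second alignment period, and the 300-second duration are assumptions rather than values taken from this deployment.

```bash
# Print an AlertPolicy JSON document for a p95 latency threshold.
# Usage: latency_policy_json SERVICE [THRESHOLD_MS]
latency_policy_json() {
  local service="$1" threshold_ms="${2:-2000}"
  cat <<EOF
{
  "displayName": "${service} p95 latency > ${threshold_ms}ms",
  "combiner": "OR",
  "conditions": [
    {
      "displayName": "p95 request latency",
      "conditionThreshold": {
        "filter": "resource.type=\"cloud_run_revision\" AND resource.labels.service_name=\"${service}\" AND metric.type=\"run.googleapis.com/request_latencies\"",
        "aggregations": [
          { "alignmentPeriod": "300s", "perSeriesAligner": "ALIGN_PERCENTILE_95" }
        ],
        "comparison": "COMPARISON_GT",
        "thresholdValue": ${threshold_ms},
        "duration": "300s"
      }
    }
  ]
}
EOF
}

# Example: write the policy to a file for review.
# latency_policy_json tableau-mcp-production > latency-policy.json
```

The resulting file can then be passed to `gcloud alpha monitoring policies create --policy-from-file=latency-policy.json`. Notification channels are not included here and would still need to be attached.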
---
## Performance Monitoring
### Request Tracing
Inspect request traces in Cloud Trace for detailed performance analysis:
```bash
# View traces in Console
# Navigate to: Trace > Trace List
# Filter by service: tableau-mcp-staging or tableau-mcp-production
```
### Latency Analysis
#### View Latency Breakdown
1. Go to Cloud Run service
2. Click **"Metrics"** tab
3. View **"Request latencies"** chart
4. Analyze p50, p95, p99 percentiles
#### Identify Slow Requests
```bash
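# Find slow requests in logs
# Note: this comparison assumes jsonPayload.duration is logged as a number of
# milliseconds; string values such as "234ms" will not compare numerically.
gcloud run services logs read tableau-mcp-staging \
  --region=australia-southeast1 \
  --log-filter='jsonPayload.duration>1000' \
  --limit=20 \
  --format='table(timestamp,jsonPayload.tool,jsonPayload.duration)'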
```
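If durations are logged as strings with a unit suffix, as in the `"234ms"` example entry, slow requests can instead be filtered locally after fetching the logs. This is a minimal sketch under that assumption. The function name `slow_requests` is illustrative, and input is expected as whitespace-separated `tool duration` pairs.

```bash
# Read "tool duration" pairs from stdin and print those slower than the threshold.
# Durations may be plain numbers or carry an "ms" suffix, e.g. 234 or 234ms.
slow_requests() {
  local threshold_ms="${1:-1000}"
  awk -v th="$threshold_ms" '
    {
      d = $2
      sub(/ms$/, "", d)              # drop the unit suffix if present
      if (d + 0 > th) print $1, d + 0
    }'
}
```

The input can come from a log query that uses `--format='value(jsonPayload.tool,jsonPayload.duration)'`, piped into `slow_requests 1000`.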
### Cold Start Monitoring
Monitor cold start frequency and duration:
```bash
# Find cold start events (assumes the server logs a "Cold start" message)
gcloud run services logs read tableau-mcp-staging \
  --region=australia-southeast1 \
  --log-filter='textPayload=~"Cold start"' \
  --limit=10
```
**Reduce Cold Starts**:
- For staging: Accept cold starts (cost-effective)
- For production: Set `minScale=1` (already configured)
---
## Error Tracking
### Error Log Analysis
#### View All Errors
```bash
# Last 50 errors
gcloud run services logs read tableau-mcp-staging \
  --region=australia-southeast1 \
  --log-filter='severity>=ERROR' \
  --limit=50 \
  --format='table(timestamp,severity,textPayload)'
```
#### Group Errors by Type
```bash
# Authentication errors
gcloud run services logs read tableau-mcp-staging \
  --region=australia-southeast1 \
  --log-filter='severity>=ERROR AND textPayload=~"authentication"' \
  --limit=20

# Tableau API errors
gcloud run services logs read tableau-mcp-staging \
  --region=australia-southeast1 \
  --log-filter='severity>=ERROR AND textPayload=~"Tableau"' \
  --limit=20

# MCP tool errors (":*" matches entries where the field is present)
gcloud run services logs read tableau-mcp-staging \
  --region=australia-southeast1 \
  --log-filter='severity>=ERROR AND jsonPayload.tool:*' \
  --limit=20
```
### Error Rate Dashboard
Create a custom dashboard in Cloud Monitoring:
1. Go to [Monitoring > Dashboards](https://console.cloud.google.com/monitoring/dashboards)
2. Create dashboard
3. Add charts:
   - Request count (success vs error)
   - Error rate percentage
   - Errors by type
   - Top error messages
---
## Cost Monitoring
### View Cloud Run Costs
1. Navigate to [Billing > Reports](https://console.cloud.google.com/billing)
2. Filter by:
   - Service: Cloud Run
   - Project: Your project
   - Time range: Last 30 days
### Monitor Resource Usage
```bash
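# Queries the Cloud Monitoring API for the last 24 hours (window chosen arbitrarily); assumes PROJECT_ID, GNU date, jq.
START=$(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ)
END=$(date -u +%Y-%m-%dT%H:%M:%SZ)
query_metric() {
  curl -sG "https://monitoring.googleapis.com/v3/projects/${PROJECT_ID}/timeSeries" \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    --data-urlencode "filter=metric.type=\"$1\" AND resource.labels.service_name=\"tableau-mcp-staging\"" \
    --data-urlencode "interval.startTime=${START}" \
    --data-urlencode "interval.endTime=${END}"
}
# View instance count over time
query_metric run.googleapis.com/container/instance_count \
  | jq -r '.timeSeries[]?.points[] | "\(.value.int64Value)\t\(.interval.endTime)"'
# View billable time
query_metric run.googleapis.com/container/billable_instance_time \
  | jq -r '.timeSeries[]?.points[] | "\(.value.doubleValue)\t\(.interval.endTime)"'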
```
### Cost Optimization Tips
1. **Staging**: Keep `minScale=0` to scale to zero when idle
2. **Production**: Monitor actual usage and adjust `maxScale`
3. **Request batching**: Batch Tableau API calls where possible
4. **Caching**: Consider caching frequently accessed data
5. **Log retention**: Reduce log retention period for non-critical logs
---
## Troubleshooting
### Issue: No Logs Appearing
**Solutions**:
```bash
# 1. Check logs are enabled
gcloud run services describe tableau-mcp-staging \
--region=australia-southeast1 \
--format='value(spec.template.metadata.annotations)'
# 2. Verify service is receiving requests
curl https://YOUR_SERVICE_URL/health
# 3. Confirm that Cloud Run entries are reaching Cloud Logging at all
gcloud logging read "resource.type=cloud_run_revision" --limit=1
```
### Issue: High Memory Usage
**Solutions**:
```bash
# 1. Check current memory limit
gcloud run services describe tableau-mcp-staging \
  --region=australia-southeast1 \
  --format='value(spec.template.spec.containers[0].resources.limits.memory)'

# 2. Increase memory if needed
gcloud run services update tableau-mcp-staging \
  --region=australia-southeast1 \
  --memory=1Gi

# 3. Look for memory-related messages in logs
gcloud run services logs read tableau-mcp-staging \
  --region=australia-southeast1 \
  --log-filter='textPayload=~"memory"' \
  --limit=50
```
### Issue: Alerts Not Firing
**Solutions**:
1. Verify alert policy is enabled
2. Check notification channels are configured
3. Test with manual threshold breach
4. Verify metric filter is correct
5. Check alert history for past alerts
---
## Log Retention
### Default Retention
- **Cloud Run logs**: 30 days (default)
- **Can be extended**: Up to 3650 days (10 years)
### Configure Retention
```bash
# Via gcloud (for specific log bucket)
gcloud logging buckets update _Default \
--location=global \
--retention-days=90
```
**Recommendation**:
- **Staging**: 30 days (default)
- **Production**: 90 days (compliance)
---
## Monitoring Checklist
After deployment, verify:
- [ ] Logs are appearing in Cloud Logging
- [ ] Metrics are populating in Cloud Monitoring
- [ ] Health check endpoint is monitored
- [ ] Error rate alert is configured
- [ ] High latency alert is configured
- [ ] Authentication failure alert is configured
- [ ] Notification channels are tested
- [ ] Cost monitoring is set up
- [ ] Log retention is configured
- [ ] Dashboard is created (optional)
---
## Useful Queries
### Top 10 Most Common Errors
```bash
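# Falls back to jsonPayload.message for structured entries, which have no textPayload
gcloud logging read "resource.type=cloud_run_revision \
AND resource.labels.service_name=tableau-mcp-staging \
AND severity>=ERROR" \
--limit=100 \
--format=json | jq -r '.[] | .textPayload // .jsonPayload.message' | sort | uniq -c | sort -rn | head -10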
```
### Request Count by Hour
```bash
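# Restricts the query to the request log so application entries are not counted as requests
gcloud logging read "resource.type=cloud_run_revision \
AND resource.labels.service_name=tableau-mcp-staging \
AND logName:\"run.googleapis.com%2Frequests\"" \
--limit=1000 \
--format='value(timestamp)' | cut -d'T' -f2 | cut -d':' -f1 | sort | uniq -c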
```
### Average Response Time
```bash
# ":*" selects entries where the field is present; awk reads the leading number of values like "234ms"
gcloud logging read "resource.type=cloud_run_revision \
AND resource.labels.service_name=tableau-mcp-staging \
AND jsonPayload.duration:*" \
--limit=100 \
--format='value(jsonPayload.duration)' | awk '{sum+=$1; n++} END {if (n) print sum/n}'
```
---
## Resources
- [Cloud Run Logging Docs](https://cloud.google.com/run/docs/logging)
- [Cloud Monitoring Docs](https://cloud.google.com/monitoring/docs)
- [Alerting Docs](https://cloud.google.com/monitoring/alerts)
- [Log Query Language](https://cloud.google.com/logging/docs/view/logging-query-language)