Hostaway MCP Server

MONITORING_OBSERVABILITY.md•14.8 KiB

# Monitoring & Observability Guide
# Context Window Protection - Cursor-Based Pagination

**Version**: 1.0
**Date**: October 15, 2025

## Overview

This guide covers monitoring and observability for the cursor-based pagination feature, including metrics collection, dashboards, alerting, and troubleshooting.

## Health Endpoint

The `/health` endpoint provides real-time metrics about the pagination system.

### Endpoint

```
GET /health
```

**No authentication required** (public endpoint for monitoring)

### Response

```json
{
  "status": "healthy",
  "timestamp": "2025-10-15T14:32:10.123456Z",
  "version": "0.1.0",
  "service": "hostaway-mcp",
  "context_protection": {
    "total_requests": 15234,
    "pagination_adoption": 0.87,
    "summarization_adoption": 0.45,
    "avg_response_size_bytes": 2400,
    "avg_latency_ms": 145.2,
    "oversized_events": 12,
    "uptime_seconds": 86400
  }
}
```

### Metrics Explained

| Metric | Type | Description | Target |
|--------|------|-------------|--------|
| `total_requests` | Counter | Total API requests processed | N/A |
| `pagination_adoption` | Rate (0-1) | Fraction of requests using pagination | >0.95 |
| `summarization_adoption` | Rate (0-1) | Fraction of responses summarized | >0.70 |
| `avg_response_size_bytes` | Gauge | Average response size in bytes | <2500 |
| `avg_latency_ms` | Gauge | Average response time in milliseconds | <200 |
| `oversized_events` | Counter | Responses exceeding token budget | <100/day |
| `uptime_seconds` | Counter | Service uptime in seconds | N/A |

### Monitoring Health Endpoint

```bash
# Manual check
curl -s https://api.example.com/health | jq .

# Automated monitoring (every 60s)
watch -n 60 'curl -s https://api.example.com/health | jq .context_protection'

# Prometheus-style "name value" lines (for feeding an exporter)
curl -s https://api.example.com/health | jq -r '.context_protection | to_entries | .[] | "\(.key) \(.value)"'
```

## Metrics Collection

### Built-in Telemetry Service

The application includes a telemetry service (`src/services/telemetry_service.py`) that tracks:

1. **Request Counters**
   - Total requests
   - Paginated vs non-paginated requests
   - Summarized vs full responses

2. **Performance Metrics**
   - Response times (histogram)
   - Response sizes (histogram)
   - Cursor encode/decode times

3. **Error Tracking**
   - Invalid cursor attempts
   - Token budget exceeded events
   - API errors by type

### Accessing Metrics

```python
from src.services.telemetry_service import get_telemetry_service

telemetry = get_telemetry_service()
metrics = telemetry.get_metrics()

print(f"Total requests: {metrics['total_requests']}")
print(f"Pagination adoption: {metrics['pagination_adoption']:.1%}")
print(f"Avg response size: {metrics['avg_response_size']:.0f} bytes")
```

## Log Aggregation

### Structured Logging

All logs are output in JSON format for easy parsing:

```json
{
  "timestamp": "2025-10-15T14:32:10.123456Z",
  "level": "INFO",
  "logger": "src.api.routes.listings",
  "message": "Cursor pagination requested",
  "extra": {
    "correlation_id": "abc123",
    "endpoint": "/api/listings",
    "cursor_present": true,
    "limit": 50,
    "duration_ms": 145
  }
}
```

### Log Levels

| Level | Usage | Examples |
|-------|-------|----------|
| DEBUG | Development debugging | Cursor encoding details |
| INFO | Normal operations | API requests, pagination events |
| WARNING | Recoverable issues | Invalid cursors, high load |
| ERROR | Error conditions | Database failures, auth errors |
| CRITICAL | System failures | Service unavailable |

### Key Log Events

#### Pagination Events

```json
{
  "event": "pagination_requested",
  "endpoint": "/api/listings",
  "limit": 50,
  "cursor_provided": true
}
```

```json
{
  "event": "cursor_encoded",
  "offset": 50,
  "encode_time_ms": 0.5
}
```

```json
{
  "event": "cursor_decoded",
  "offset": 50,
  "decode_time_ms": 0.6
}
```

#### Error Events

```json
{
  "event": "invalid_cursor",
  "reason": "Signature verification failed",
  "client_ip": "192.168.1.100"
}
```

```json
{
  "event": "cursor_expired",
  "age_minutes": 15,
  "client_ip": "192.168.1.100"
}
```

### Log Queries (Examples)
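
Where `jq` is not available, the same kind of query can be run from Python. The following is a minimal sketch, not part of the project: the helper names `recent_events` and `count_by` are hypothetical, and it assumes each log line is one JSON object with top-level `timestamp` (ISO 8601, UTC) and `event` fields, as in the examples above.

```python
import json
from collections import Counter
from datetime import datetime, timedelta, timezone


def recent_events(lines, event, window=timedelta(hours=1), now=None):
    """Yield parsed log records of type `event` no older than `window`.

    Hypothetical helper: assumes one JSON object per line with top-level
    "timestamp" and "event" keys.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - window
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if not isinstance(record, dict) or record.get("event") != event:
            continue
        raw_ts = record.get("timestamp")
        if not isinstance(raw_ts, str):
            continue
        try:
            # Older Python versions reject a trailing "Z", so normalise it.
            when = datetime.fromisoformat(raw_ts.replace("Z", "+00:00"))
        except ValueError:
            continue
        if when.tzinfo is None:
            when = when.replace(tzinfo=timezone.utc)  # assume UTC
        if when >= cutoff:
            yield record


def count_by(records, field):
    """Count records by the value of `field` (missing values become "unknown")."""
    return Counter(record.get(field, "unknown") for record in records)
```

Typical usage would be `count_by(recent_events(open("application.log"), "invalid_cursor"), "reason")`, which gives a per-reason tally for the last hour. Parsing the timestamp rather than comparing strings keeps the time window correct even if the log mixes timestamp precisions.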
#### Find all invalid cursor attempts (last hour)

```bash
# Using jq, restricted to the last hour
# (fractional seconds are stripped so fromdateiso8601 can parse the timestamp)
jq 'select(.event == "invalid_cursor")
    | select((.timestamp | sub("\\.[0-9]+Z$"; "Z") | fromdateiso8601) > (now - 3600))' application.log

# Using jq, all invalid cursor attempts
jq 'select(.event == "invalid_cursor")' application.log

# Using CloudWatch Logs Insights (paste into the query editor, not the shell)
fields @timestamp, @message
| filter event = "invalid_cursor"
| sort @timestamp desc
| limit 100
```

#### Track pagination adoption over time

```bash
# Group by hour of day
jq -r 'select(.event == "pagination_requested") | .timestamp' application.log | \
  cut -d'T' -f2 | cut -d':' -f1 | sort | uniq -c
```

## Dashboards

### Grafana Dashboard Setup

#### Data Source: Prometheus

**Prometheus Configuration** (`prometheus.yml`):

```yaml
scrape_configs:
  - job_name: 'hostaway-mcp'
    scrape_interval: 30s
    metrics_path: '/metrics'  # If Prometheus exporter added
    static_configs:
      - targets: ['localhost:8000']

  # Or scrape health endpoint
  # Note: /health returns JSON, which Prometheus cannot parse directly.
  # This job assumes an exporter (for example json_exporter) translates it.
  - job_name: 'hostaway-mcp-health'
    scrape_interval: 60s
    metrics_path: '/health'
    static_configs:
      - targets: ['localhost:8000']
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'context_protection_(.*)'
        target_label: __name__
        replacement: 'hostaway_${1}'
```

#### Dashboard Panels

**1. Request Volume**

```promql
# Request rate (per second, averaged over 1 minute)
rate(hostaway_total_requests[1m])

# Paginated vs non-paginated
rate(hostaway_paginated_requests[1m])
rate(hostaway_non_paginated_requests[1m])
```

**2. Pagination Adoption**

```promql
# Adoption rate (%)
(hostaway_paginated_requests / hostaway_total_requests) * 100
```

**3. Response Time**

```promql
# p95 response time
histogram_quantile(0.95, rate(hostaway_response_time_bucket[5m]))

# By endpoint
histogram_quantile(0.95, rate(hostaway_response_time_bucket{endpoint="/api/listings"}[5m]))
```

**4. Error Rate**

```promql
# Invalid cursor rate
rate(hostaway_invalid_cursor_total[1m])

# Overall error rate
rate(hostaway_errors_total[1m]) / rate(hostaway_total_requests[1m])
```

**5. Token Budget Usage**

```promql
# Average response size
avg(hostaway_response_size_bytes)

# Oversized responses (per second)
rate(hostaway_oversized_events[1m])
```

### CloudWatch Dashboard (AWS)

**Metrics to Track:**

1. **Custom Metrics** (via CloudWatch SDK):

   ```python
   from datetime import UTC, datetime  # datetime.UTC requires Python 3.11+

   import boto3

   cloudwatch = boto3.client('cloudwatch')

   # `metrics` is the dict returned by telemetry.get_metrics()
   cloudwatch.put_metric_data(
       Namespace='HostawayMCP',
       MetricData=[
           {
               'MetricName': 'PaginationAdoption',
               # Stored as a 0-1 fraction; the Percent unit expects 0-100
               'Value': metrics['pagination_adoption'] * 100,
               'Unit': 'Percent',
               'Timestamp': datetime.now(UTC)
           },
           {
               'MetricName': 'AvgResponseSize',
               'Value': metrics['avg_response_size'],
               'Unit': 'Bytes',
               'Timestamp': datetime.now(UTC)
           }
       ]
   )
   ```

2. **Application Logs** (via CloudWatch Logs):

   ```python
   import logging

   import watchtower

   logger = logging.getLogger(__name__)

   # Parameter names as of watchtower 2.0+; older releases used
   # log_group= and stream_name=
   logger.addHandler(watchtower.CloudWatchLogHandler(
       log_group_name='/aws/hostaway-mcp',
       log_stream_name='pagination'
   ))
   ```

### Datadog Dashboard

**Integration:**

```python
from datadog import initialize, statsd

initialize(
    api_key='your-api-key',
    app_key='your-app-key'
)

# Track metrics (response_size and adoption_rate come from your request handler)
statsd.increment('hostaway.pagination.requests')
statsd.histogram('hostaway.response.size', response_size)
statsd.gauge('hostaway.adoption.rate', adoption_rate)
```

**Dashboard Widgets:**
- Timeseries: Request volume over time
- Query value: Current pagination adoption %
- Heatmap: Response time distribution
- Top list: Most common error types

## Alerting

### Alert Rules

#### 1. High Invalid Cursor Rate

```yaml
alert: HighInvalidCursorRate
expr: |
  (
    rate(hostaway_invalid_cursor_total[5m])
    /
    rate(hostaway_total_requests[5m])
  ) > 0.05
for: 10m
severity: warning
annotations:
  summary: "High invalid cursor rate detected"
  description: "{{ $value | humanizePercentage }} of requests have invalid cursors"
action:
  - notify: slack-oncall
  - create: jira-ticket
```

#### 2. Pagination Adoption Too Low

```yaml
alert: LowPaginationAdoption
expr: hostaway_pagination_adoption < 0.20
for: 24h
severity: info
annotations:
  summary: "Pagination adoption below target"
  description: "Only {{ $value | humanizePercentage }} of clients using pagination"
action:
  - notify: slack-team
```

#### 3. High Token Budget Overruns

```yaml
alert: FrequentOversizedResponses
# increase() counts events over the window; rate() would give a per-second value
expr: increase(hostaway_oversized_events[1h]) > 10
for: 1h
severity: warning
annotations:
  summary: "Too many oversized responses"
  description: "{{ $value }} responses exceeded token budget in the last hour"
action:
  - notify: slack-oncall
```

#### 4. Cursor Encoding Performance

```yaml
alert: SlowCursorEncoding
expr: hostaway_cursor_encode_time_p95 > 5
for: 5m
severity: warning
annotations:
  summary: "Cursor encoding is slow"
  description: "p95 cursor encoding time is {{ $value }}ms (target: 1ms)"
action:
  - notify: slack-oncall
  - escalate: pagerduty
```

### Alert Channels

**Slack Integration:**

```python
import requests

def send_slack_alert(metric, value, threshold):
    webhook_url = "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"

    message = {
        "text": f"🚨 Alert: {metric}",
        "blocks": [
            {
                "type": "section",
                "text": {
                    "type": "mrkdwn",
                    "text": f"*{metric}*\nCurrent: `{value}`\nThreshold: `{threshold}`"
                }
            },
            {
                "type": "actions",
                "elements": [
                    {
                        "type": "button",
                        "text": {"type": "plain_text", "text": "View Dashboard"},
                        "url": "https://grafana.example.com/d/pagination"
                    }
                ]
            }
        ]
    }

    # Fail fast instead of hanging, and surface non-2xx responses
    response = requests.post(webhook_url, json=message, timeout=10)
    response.raise_for_status()
```

**PagerDuty Integration:**

```python
import pypd

pypd.api_key = 'your-api-key'

def create_pagerduty_incident(title, description):
    incident = pypd.Incident.create(
        incident_key='pagination-alert',
        title=title,
        service=pypd.Service.find_one(name='Hostaway MCP'),
        urgency='high',
        body={'type': 'incident_body', 'details': description}
    )
    return incident
```

## Tracing

### Distributed Tracing with OpenTelemetry

**Setup:**

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.jaeger.thrift import JaegerExporter

# Configure tracer
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

# Export to Jaeger
jaeger_exporter = JaegerExporter(
    agent_host_name='localhost',
    agent_port=6831,
)

trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(jaeger_exporter)
)
```

**Instrument Pagination:**

```python
@tracer.start_as_current_span("encode_cursor")
def encode_cursor(offset: int, secret: str) -> str:
    span = trace.get_current_span()
    span.set_attribute("pagination.offset", offset)

    # _do_encoding stands in for the actual encoding routine
    cursor = _do_encoding(offset, secret)

    span.set_attribute("pagination.cursor_length", len(cursor))
    return cursor
```

**View Traces:**
- Jaeger UI: `http://localhost:16686`
- Filter by operation: `encode_cursor`, `decode_cursor`
- Analyze latency breakdown

## Performance Profiling

### Python Profiling

**Profile cursor operations:**

```python
import cProfile
import pstats

# encode_cursor / decode_cursor are the project's cursor helpers;
# import them from the module where they are defined.

# Profile encode/decode
profiler = cProfile.Profile()
profiler.enable()

for _ in range(1000):
    cursor = encode_cursor(offset=50, secret="test-secret")
    decode_cursor(cursor, secret="test-secret")

profiler.disable()

stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(10)
```

**Memory profiling:**

```python
from memory_profiler import profile

@profile
def test_cursor_memory():
    cursors = []
    for i in range(10000):
        cursor = encode_cursor(offset=i, secret="test")
        cursors.append(cursor)
    return cursors
```

### Load Testing

**Locust load test:**

```python
from locust import HttpUser, task, between

class PaginationUser(HttpUser):
    wait_time = between(1, 3)

    @task
    def fetch_listings_with_pagination(self):
        cursor = None
        for _ in range(5):  # Fetch up to 5 pages
            params = {"limit": 50}
            if cursor:
                params["cursor"] = cursor

            with self.client.get(
                "/api/listings",
                params=params,
                headers={"X-API-Key": "test-key"},
                catch_response=True
            ) as response:
                if response.status_code == 200:
                    data = response.json()
                    cursor = data.get("nextCursor")
                    response.success()
                else:
                    response.failure(f"Got status {response.status_code}")
                    break

            if not cursor:
                break  # Last page reached; avoid re-fetching the first page
```

**Run load test:**

```bash
locust -f locustfile.py --host=https://api.example.com --users=100 --spawn-rate=10
```

## Troubleshooting

### Common Issues

#### High Invalid Cursor Rate

**Symptoms:**
- Alert: `HighInvalidCursorRate`
- Logs show frequent "Invalid cursor" errors

**Diagnosis:**

```bash
# Check error distribution
jq -r 'select(.event == "invalid_cursor") | .reason' logs.json | sort | uniq -c

# Common reasons:
# - "Signature verification failed" (tampering or wrong secret)
# - "Cursor expired" (>10 min old)
# - "Invalid format" (malformed cursor)
```

**Resolution:**
- If "signature failed": Verify that the cursor secret matches across instances
- If "expired": The client may be caching cursors too long
- If "invalid format": The client may be URL-encoding the cursor incorrectly

#### Slow Response Times

**Symptoms:**
- Alert: `SlowResponseTime`
- p95 latency >1000ms

**Diagnosis:**

```bash
# Average request duration per endpoint
jq 'select(.event == "request_completed") | {endpoint: .endpoint, duration: .duration_ms}' logs.json | \
  jq -s 'group_by(.endpoint) | map({endpoint: .[0].endpoint, avg: (map(.duration) | add / length)})'
```

**Resolution:**
- Check database query performance
- Verify cursor cache hit rate
- Check if token estimation is slow

#### Memory Leak

**Symptoms:**
- Gradually increasing memory usage
- OOM errors

**Diagnosis:**

```python
# Check cursor storage size
from src.services.cursor_storage import get_cursor_storage

storage = get_cursor_storage()
print(f"Cached cursors: {len(storage._storage)}")
```

**Resolution:**
- Verify TTL cleanup is working
- Check for circular references
- Review cursor storage implementation

### Debug Mode

Enable debug logging for detailed pagination traces:

```bash
# Set environment variable
export LOG_LEVEL=DEBUG

# Or set it inline for a single run
LOG_LEVEL=DEBUG python -m src.api.main
```

Debug logs include:
- Cursor encoding/decoding details
- Token estimation breakdowns
- Cache hit/miss events
- Performance timing for each operation

---

**Last Updated**: October 15, 2025
**Maintained By**: DevOps Team
**Related Docs**: [Deployment Checklist](./DEPLOYMENT_CHECKLIST.md)
