Session Buddy


health-check-implementation.md•18.6 KiB

# Migration: Health Check Implementation

Note: This document describes a planned HTTP health/metrics layer. The current Session Buddy server does not expose HTTP endpoints; use exec/import checks or build a wrapper if needed.

## Overview

Add production-ready health checks to your MCP stack using mcp_common's health check system. This is an optional HTTP layer for wrappers or custom servers.

**Priority**: ⚡ **HIGH** - Critical for production deployments

## Prerequisites

- `mcp-common>=2.0.0` installed
- FastMCP server running
- Python 3.13+

## Benefits

- ✅ **Production Monitoring**: Real-time health status of all components (when exposed via a wrapper)
- ✅ **Early Problem Detection**: Identify issues before they cause failures
- ✅ **Standardized Format**: ComponentHealth pattern works across all MCP servers
- ✅ **Latency Tracking**: Performance diagnostics built-in
- ✅ **Actionable Metadata**: Rich context for debugging
- ✅ **Integration Ready**: Works with monitoring tools (Prometheus, DataDog, etc.)

## Current vs. New Pattern

### Before (No Health Checks)

```text
# No health monitoring - issues discovered through failures
@mcp.tool()
async def status():
    return {"status": "running"}  # Not very useful!
```

### After (Comprehensive Health Checks)

```text
import asyncio

from mcp_common.health import ComponentHealth, HealthStatus
from mcp_common.http_health import check_http_client_health

# The remaining check_* functions are defined in health_checks.py (Step 2)


@mcp.tool()
async def health_check():
    """Comprehensive health check."""
    components = await asyncio.gather(
        check_python_environment_health(),
        check_database_health(),
        check_http_client_health(),
        check_file_system_health(),
    )

    return {
        "status": "healthy"
        if all(c.status == HealthStatus.HEALTHY for c in components)
        else "degraded",
        "components": [
            {
                "name": c.name,
                "status": c.status.value,
                "message": c.message,
                "latency_ms": c.latency_ms,
                "metadata": c.metadata,
            }
            for c in components
        ],
    }
```

## Migration Steps

### Step 1: Install/Update mcp-common

```bash
# Using uv (recommended)
uv add "mcp-common>=2.0.0"

# Using pip
pip install "mcp-common>=2.0.0"
```

### Step 2: Create Component Health Checks

Create a health check module for your server:

```python
# health_checks.py
from __future__ import annotations

import time
import typing as t

from mcp_common.health import ComponentHealth, HealthStatus


async def check_python_environment_health() -> ComponentHealth:
    """Check Python runtime health."""
    import sys
    import platform

    python_version = sys.version_info

    if python_version < (3, 13):
        return ComponentHealth(
            name="python_env",
            status=HealthStatus.DEGRADED,
            message=f"Python {python_version.major}.{python_version.minor} (3.13+ recommended)",
            metadata={
                "python_version": f"{python_version.major}.{python_version.minor}.{python_version.micro}",
                "platform": platform.system(),
            },
        )

    return ComponentHealth(
        name="python_env",
        status=HealthStatus.HEALTHY,
        message=f"Python {python_version.major}.{python_version.minor}.{python_version.micro}",
        metadata={
            "python_version": f"{python_version.major}.{python_version.minor}.{python_version.micro}",
            "platform": platform.system(),
        },
    )


async def check_database_health() -> ComponentHealth:
    """Check database connection health."""
    start_time = time.perf_counter()

    try:
        # Placeholder import: replace with your server's database accessor
        from my_server.database import get_database

        db = await get_database()
        result = await db.ping()

        latency_ms = (time.perf_counter() - start_time) * 1000

        return ComponentHealth(
            name="database",
            status=HealthStatus.HEALTHY,
            message="Database operational",
            latency_ms=latency_ms,
            metadata={
                "connections": result.get("connections", 0),
                "version": result.get("version", "unknown"),
            },
        )
    except Exception as e:
        return ComponentHealth(
            name="database",
            status=HealthStatus.UNHEALTHY,
            message=f"Database error: {e}",
            metadata={"error": str(e), "error_type": type(e).__name__},
        )


async def check_file_system_health() -> ComponentHealth:
    """Check file system accessibility."""
    start_time = time.perf_counter()

    from pathlib import Path
    import os

    app_dir = Path.home() / ".my-app"
    required_dirs = [
        app_dir / "logs",
        app_dir / "data",
        app_dir / "temp",
    ]

    issues = []
    for directory in required_dirs:
        if not directory.exists():
            issues.append(f"Missing: {directory.name}")
        elif not os.access(directory, os.W_OK):
            issues.append(f"Not writable: {directory.name}")

    latency_ms = (time.perf_counter() - start_time) * 1000

    if issues:
        return ComponentHealth(
            name="file_system",
            status=HealthStatus.DEGRADED,
            message=f"File system issues: {', '.join(issues)}",
            latency_ms=latency_ms,
            metadata={"issues": issues},
        )

    return ComponentHealth(
        name="file_system",
        status=HealthStatus.HEALTHY,
        message="File system accessible",
        latency_ms=latency_ms,
    )


async def get_all_health_checks() -> list[ComponentHealth]:
    """Execute all health checks concurrently."""
    import asyncio

    checks = await asyncio.gather(
        check_python_environment_health(),
        check_database_health(),
        check_file_system_health(),
        return_exceptions=True,
    )

    # Convert exceptions to UNHEALTHY ComponentHealth
    results = []
    for check in checks:
        if isinstance(check, Exception):
            results.append(
                ComponentHealth(
                    name="unknown",
                    status=HealthStatus.UNHEALTHY,
                    message=f"Health check crashed: {check}",
                    metadata={"error": str(check)},
                )
            )
        else:
            results.append(check)

    return results
```

### Step 3: Add HTTP Client Health Check

If using HTTPClientAdapter:

```text
# In health_checks.py, add:
import asyncio

from mcp_common.http_health import check_http_client_health, check_http_connectivity


async def check_http_health() -> list[ComponentHealth]:
    """Check HTTP client and connectivity."""
    return await asyncio.gather(
        check_http_client_health(),
        check_http_connectivity(test_url="https://api.example.com/health"),
    )
```

### Step 4: Register MCP Health Check Tool

```text
# server.py
import typing as t
from datetime import datetime

from fastmcp import FastMCP
from mcp_common.health import HealthStatus
from my_server.health_checks import get_all_health_checks

mcp = FastMCP("my-server")


@mcp.tool()
async def health_check() -> dict[str, t.Any]:
    """Comprehensive server health check.

    Returns health status for all server components including:
    - Python environment
    - Database connections
    - HTTP client
    - File system
    """
    components = await get_all_health_checks()

    # Determine overall status (worst component status)
    statuses = [c.status for c in components]
    if HealthStatus.UNHEALTHY in statuses:
        overall_status = "unhealthy"
    elif HealthStatus.DEGRADED in statuses:
        overall_status = "degraded"
    else:
        overall_status = "healthy"

    return {
        "status": overall_status,
        "timestamp": datetime.now().isoformat(),
        "components": [
            {
                "name": c.name,
                "status": c.status.value,
                "message": c.message,
                "latency_ms": c.latency_ms,
                "metadata": c.metadata,
            }
            for c in components
        ],
    }
```

### Step 5: Add Health Check Endpoint (HTTP Servers)

For HTTP-based health check endpoints:

```python
from starlette.applications import Starlette
from starlette.responses import JSONResponse
from starlette.routing import Route

from mcp_common.health import HealthStatus
from my_server.health_checks import get_all_health_checks


async def http_health_check(request):
    """HTTP health check endpoint."""
    components = await get_all_health_checks()

    # Determine overall status and status code (worst component wins)
    statuses = [c.status for c in components]
    if HealthStatus.UNHEALTHY in statuses:
        overall_status, status_code = "unhealthy", 503  # Service Unavailable
    elif HealthStatus.DEGRADED in statuses:
        overall_status, status_code = "degraded", 200  # OK but with warnings
    else:
        overall_status, status_code = "healthy", 200  # OK

    return JSONResponse(
        {
            "status": overall_status,
            "components": [
                {
                    "name": c.name,
                    "status": c.status.value,
                    "message": c.message,
                }
                for c in components
            ],
        },
        status_code=status_code,
    )


app = Starlette(
    routes=[
        Route("/health", http_health_check),
    ]
)
```

### Step 6: Add Health Check UI (Optional)

Use ServerPanels for terminal UI:

```python
import asyncio

from mcp_common.ui import ServerPanels
from my_server.health_checks import get_all_health_checks


async def display_health_status():
    """Display health status with Rich UI."""
    components = await get_all_health_checks()
    ServerPanels.status(components)


# Call at startup or on demand
asyncio.run(display_health_status())
```

### Step 7: Add Monitoring Integration (Optional)

Export metrics for Prometheus:

```python
from prometheus_client import CONTENT_TYPE_LATEST, Gauge, generate_latest
from starlette.responses import Response

from mcp_common.health import HealthStatus
from my_server.health_checks import get_all_health_checks

# Create Prometheus metrics
health_status = Gauge("server_health_status", "Component health status", ["component"])
health_latency = Gauge(
    "server_health_latency_ms", "Health check latency", ["component"]
)


async def export_health_metrics():
    """Export health metrics for Prometheus."""
    components = await get_all_health_checks()

    for component in components:
        # Map status to numeric value (1=healthy, 0.5=degraded, 0=unhealthy)
        status_value = {
            HealthStatus.HEALTHY: 1.0,
            HealthStatus.DEGRADED: 0.5,
            HealthStatus.UNHEALTHY: 0.0,
        }[component.status]

        health_status.labels(component=component.name).set(status_value)

        if component.latency_ms is not None:
            health_latency.labels(component=component.name).set(component.latency_ms)


# Expose metrics endpoint: add Route("/metrics", metrics_endpoint)
# to the Starlette routes from Step 5
async def metrics_endpoint(request):
    """Refresh the gauges, then serve them in Prometheus text format."""
    await export_health_metrics()
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)
```

## Validation

### Test 1: Basic Health Check

```python
import asyncio

from my_server.health_checks import get_all_health_checks


async def test_health_check():
    components = await get_all_health_checks()

    print("Health Check Results:")
    for component in components:
        status_emoji = {
            "healthy": "✅",
            "degraded": "⚠️",
            "unhealthy": "❌",
        }[component.status.value]

        print(f"{status_emoji} {component.name}: {component.message}")
        # Compare against None so a 0.0 ms latency is still printed
        if component.latency_ms is not None:
            print(f"   Latency: {component.latency_ms:.2f}ms")
        if component.metadata:
            print(f"   Metadata: {component.metadata}")


asyncio.run(test_health_check())
```

### Test 2: MCP Tool Integration

```bash
# Test via MCP client
echo '{"tool": "health_check", "arguments": {}}' | my-server
```

### Test 3: HTTP Endpoint (if applicable)

```bash
# Test HTTP health endpoint
curl http://localhost:8000/health

# Expected response:
# {
#   "status": "healthy",
#   "components": [
#     {"name": "python_env", "status": "healthy", "message": "Python 3.13.7"},
#     {"name": "database", "status": "healthy", "message": "Database operational"},
#     {"name": "file_system", "status": "healthy", "message": "File system accessible"}
#   ]
# }
```

### Test 4: Monitoring Integration

```bash
# Test Prometheus metrics (if implemented)
curl http://localhost:8000/metrics

# Expected output includes:
# server_health_status{component="database"} 1.0
# server_health_latency_ms{component="database"} 5.2
```

## Rollback Procedure

Health checks are additive and safe to roll back:

### Step 1: Remove Health Check Tool

Comment out or remove the health_check tool:

```text
# @mcp.tool()
# async def health_check():
#     ...
```

### Step 2: Remove Health Check Module

Delete or rename `health_checks.py`:

```bash
mv health_checks.py health_checks.py.bak
```

### Step 3: Remove UI Dependencies (if added)

If you added Rich UI panels:

```bash
# Remove from dependencies
uv remove rich
```

## Common Issues

### Issue 1: Import Error for ComponentHealth

**Symptom**:

```
ImportError: cannot import name 'ComponentHealth' from 'mcp_common'
```

**Solution**:

```bash
# Ensure mcp-common 2.0+ is installed
uv add "mcp-common>=2.0.0"

# Verify import
python -c "from mcp_common.health import ComponentHealth; print('✅ Import successful')"
```

### Issue 2: Health Check Hangs

**Symptom**: Health check never completes or times out.

**Solution**: Add a timeout to individual health checks:

```python
import asyncio

from mcp_common.health import ComponentHealth, HealthStatus


async def check_with_timeout(check_func, timeout_seconds=5.0):
    """Wrap health check with timeout."""
    try:
        return await asyncio.wait_for(check_func(), timeout=timeout_seconds)
    except asyncio.TimeoutError:
        return ComponentHealth(
            name=check_func.__name__,
            status=HealthStatus.UNHEALTHY,
            message=f"Health check timed out after {timeout_seconds}s",
        )
```

### Issue 3: Health Check Crashes Server

**Symptom**: Health check raises an unhandled exception that crashes the server.

**Solution**: Always catch exceptions in health checks:

```python
from mcp_common.health import ComponentHealth, HealthStatus


async def safe_health_check(check_func):
    """Wrap health check to prevent crashes."""
    try:
        return await check_func()
    except Exception as e:
        return ComponentHealth(
            name=check_func.__name__,
            status=HealthStatus.UNHEALTHY,
            message=f"Health check crashed: {e}",
            metadata={"error": str(e), "error_type": type(e).__name__},
        )
```

### Issue 4: Slow Health Checks

**Symptom**: Health check endpoint takes >1 second to respond.

**Solution**: Run checks concurrently with `asyncio.gather()`:

```text
# ✅ Good: Concurrent execution (fast)
components = await asyncio.gather(
    check_1(),
    check_2(),
    check_3(),
)

# ❌ Bad: Sequential execution (slow)
components = [
    await check_1(),
    await check_2(),
    await check_3(),
]
```

## Best Practices

### 1. Measure Latency

```text
# ✅ Good: Always measure latency
start_time = time.perf_counter()
# ... perform check ...
latency_ms = (time.perf_counter() - start_time) * 1000
```

### 2. Provide Actionable Metadata

```python
# ✅ Good: Detailed metadata
return ComponentHealth(
    name="database",
    status=HealthStatus.DEGRADED,
    message="High connection count",
    metadata={
        "active_connections": 95,
        "max_connections": 100,
        "recommendation": "Scale database or optimize queries",
    },
)
```

### 3. Use Meaningful Status Levels

- **HEALTHY**: Everything is normal, no action needed
- **DEGRADED**: Functional but issues detected, monitor closely
- **UNHEALTHY**: Not operational, immediate attention required

### 4. Run Checks Concurrently

```python
# ✅ Good: Parallel execution
components = await asyncio.gather(
    check_python_environment_health(),
    check_database_health(),
    check_file_system_health(),
)
```

### 5. Handle Errors Gracefully

```text
# ✅ Good: Never let exceptions propagate
try:
    # Check logic
    return ComponentHealth(name="my_component", status=HealthStatus.HEALTHY)
except Exception as e:
    return ComponentHealth(
        name="my_component",
        status=HealthStatus.UNHEALTHY,
        message=f"Check failed: {e}",
        metadata={"error": str(e)},
    )
```

### 6. Cache Health Check Results (Optional)

For high-traffic servers:

```python
from datetime import datetime

from my_server.health_checks import get_all_health_checks

_health_cache = None
_health_cache_time = None


async def get_cached_health_checks(ttl_seconds=30):
    """Get health checks with caching."""
    global _health_cache, _health_cache_time

    now = datetime.now()

    if _health_cache is not None and _health_cache_time is not None:
        age = (now - _health_cache_time).total_seconds()
        if age < ttl_seconds:
            return _health_cache

    # Cache missing or expired, run checks
    _health_cache = await get_all_health_checks()
    _health_cache_time = now

    return _health_cache
```

## Health Check Patterns

### Pattern 1: Simple Binary Check

```python
async def check_service_health() -> ComponentHealth:
    """Simple up/down check."""
    try:
        # `service` is a placeholder for your own client object
        await service.ping()
        return ComponentHealth(
            name="service", status=HealthStatus.HEALTHY, message="Service responding"
        )
    except Exception as e:
        return ComponentHealth(
            name="service",
            status=HealthStatus.UNHEALTHY,
            message=f"Service not responding: {e}",
        )
```

### Pattern 2: Threshold-Based Check

```python
async def check_memory_health() -> ComponentHealth:
    """Memory usage with thresholds."""
    import psutil

    memory = psutil.virtual_memory()
    usage_percent = memory.percent

    if usage_percent > 90:
        status = HealthStatus.UNHEALTHY
        message = f"Critical memory usage: {usage_percent}%"
    elif usage_percent > 75:
        status = HealthStatus.DEGRADED
        message = f"High memory usage: {usage_percent}%"
    else:
        status = HealthStatus.HEALTHY
        message = f"Memory usage normal: {usage_percent}%"

    return ComponentHealth(
        name="memory",
        status=status,
        message=message,
        metadata={
            "usage_percent": usage_percent,
            "available_mb": memory.available / 1024 / 1024,
            "total_mb": memory.total / 1024 / 1024,
        },
    )
```

### Pattern 3: Connectivity Check

```python
async def check_api_connectivity() -> ComponentHealth:
    """Check external API connectivity."""
    from mcp_common.http_health import check_http_connectivity

    return await check_http_connectivity(
        test_url="https://api.example.com/health", timeout_ms=3000
    )
```

## Additional Resources

- [ComponentHealth API Reference](../reference/API_REFERENCE.md)
- [ARCHITECTURE.md - Health Check Architecture](../developer/ARCHITECTURE.md)
- [HTTPClientAdapter Health Checks](../reference/API_REFERENCE.md)
- [Prometheus Integration Guide](https://prometheus.io/docs/guides/go-application/)

______________________________________________________________________

**Need help?** Open an issue on GitHub with the `health-checks` label.

**Example implementations?** Check `session_buddy/health_checks.py` for reference.
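
The timeout wrapper from Issue 2, the crash guard from Issue 3, and the worst-status aggregation from Step 4 can be combined into a single helper. The following is a minimal, self-contained sketch of that combination, not the mcp_common implementation: `HealthStatus` and `ComponentHealth` here are local stand-ins whose fields are assumed from the examples above, and `guarded`, `run_checks`, `overall_status`, and the three demo checks are hypothetical names introduced only for illustration.

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Awaitable, Callable


class HealthStatus(Enum):
    """Local stand-in for mcp_common.health.HealthStatus (assumed values)."""

    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


@dataclass
class ComponentHealth:
    """Local stand-in for mcp_common.health.ComponentHealth (assumed fields)."""

    name: str
    status: HealthStatus
    message: str = ""
    latency_ms: float | None = None
    metadata: dict[str, Any] = field(default_factory=dict)


CheckFunc = Callable[[], Awaitable[ComponentHealth]]


async def guarded(check: CheckFunc, timeout_seconds: float = 5.0) -> ComponentHealth:
    """Run one check; report timeouts and crashes as UNHEALTHY results."""
    name = getattr(check, "__name__", "unknown")
    try:
        return await asyncio.wait_for(check(), timeout=timeout_seconds)
    except asyncio.TimeoutError:
        return ComponentHealth(
            name, HealthStatus.UNHEALTHY, f"Health check timed out after {timeout_seconds}s"
        )
    except Exception as exc:
        return ComponentHealth(
            name,
            HealthStatus.UNHEALTHY,
            f"Health check crashed: {exc}",
            metadata={"error_type": type(exc).__name__},
        )


async def run_checks(
    checks: list[CheckFunc], timeout_seconds: float = 5.0
) -> list[ComponentHealth]:
    """Run all checks concurrently, each behind the timeout and crash guard."""
    return list(await asyncio.gather(*(guarded(c, timeout_seconds) for c in checks)))


def overall_status(components: list[ComponentHealth]) -> HealthStatus:
    """Return the worst status among the components (HEALTHY if there are none)."""
    statuses = {c.status for c in components}
    if HealthStatus.UNHEALTHY in statuses:
        return HealthStatus.UNHEALTHY
    if HealthStatus.DEGRADED in statuses:
        return HealthStatus.DEGRADED
    return HealthStatus.HEALTHY


# Demo checks (hypothetical): one healthy, one that hangs, one that raises.
async def ok_check() -> ComponentHealth:
    return ComponentHealth("ok_check", HealthStatus.HEALTHY, "All good")


async def slow_check() -> ComponentHealth:
    await asyncio.sleep(1.0)  # longer than the demo timeout below
    return ComponentHealth("slow_check", HealthStatus.HEALTHY, "Finished late")


async def crashing_check() -> ComponentHealth:
    raise RuntimeError("boom")


if __name__ == "__main__":
    demo = asyncio.run(run_checks([ok_check, slow_check, crashing_check], 0.05))
    for component in demo:
        print(f"{component.name}: {component.status.value} ({component.message})")
    print("overall:", overall_status(demo).value)
```

Wrapping each check individually, rather than putting one timeout around the whole `asyncio.gather()` call, means a single hanging component is reported as unhealthy while the results of the other checks are still returned.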
