Homelab MCP Server

ERROR_HANDLING_FIXES.md•7.15 kB

# 🛡️ Error Handling Fixes for MCP Server ## 🚨 **Issues Identified** The MCP server had several critical error handling problems that could cause crashes on timeouts: ### **1. Main Server Loop Vulnerabilities** - **No timeout protection** on `reader.readline()` - could hang indefinitely - **No timeout wrapper** around `handle_request()` - timeouts would crash the server - **No consecutive error limits** - infinite error loops possible - **No graceful shutdown** handling ### **2. Tool Execution Vulnerabilities** - **No timeout wrapper** on `execute_tool()` - long-running operations could hang - **No structured error responses** - malformed responses could break client communication - **No health monitoring** - no way to track server stability ### **3. SSH Operation Inconsistencies** - **Inconsistent timeout handling** across different SSH functions - **No retry logic** for transient network failures - **Generic error messages** - poor debugging experience ## ✅ **Fixes Implemented** ### **1. Enhanced Server Core (`server.py`)** #### **Robust Main Loop** ```python # Added timeout protection to prevent hanging line_bytes = await asyncio.wait_for(reader.readline(), timeout=300.0) # Added consecutive error tracking consecutive_errors = 0 max_consecutive_errors = 10 # Added graceful shutdown on too many errors if consecutive_errors >= max_consecutive_errors: logger.error(f"Too many consecutive errors, shutting down") break ``` #### **Request Timeout Protection** ```python # All requests now have timeout protection response = await asyncio.wait_for( self.handle_request(request), timeout=120.0 # 2 minute timeout ) # Tool execution has separate timeout result = await asyncio.wait_for( execute_tool(tool_name, tool_args), timeout=60.0 # 60 second timeout ) ``` #### **Health Monitoring** ```python # New health endpoint elif method == "health/status": health_status = health_checker.get_health_status() return self._success_response(request_id, health_status) ``` ### **2. Error Handling Framework (`error_handling.py`)** #### **Timeout Wrapper Decorator** ```python @timeout_wrapper(timeout_seconds=30.0) async def some_function(): # Function will timeout gracefully after 30 seconds # Returns structured error response instead of crashing ``` #### **Retry Logic with Exponential Backoff** ```python @retry_on_failure(max_retries=3, delay_seconds=1.0, backoff_multiplier=2.0) async def network_operation(): # Retries on connection errors: 1s, 2s, 4s delays # Gives up after 3 attempts with structured error ``` #### **SSH-Specific Error Handling** ```python @ssh_connection_wrapper(timeout_seconds=15.0) async def ssh_function(): # Specialized SSH error handling with helpful suggestions ``` ### **3. Enhanced SSH Tools (`ssh_tools.py`)** #### **Applied Decorators** ```python @ssh_connection_wrapper(timeout_seconds=30.0) @retry_on_failure(max_retries=2, delay_seconds=2.0) async def setup_remote_mcp_admin(): # Critical operations get retries and timeout protection ``` #### **Improved Error Messages** ```python { "status": "error", "error_type": "ssh_timeout", "suggestions": [ "Check if the target device is reachable", "Verify SSH service is running on the target", "Check network connectivity" ] } ``` ### **4. Tool Registry Protection (`tools.py`)** #### **Wrapped Tool Execution** ```python @timeout_wrapper(timeout_seconds=45.0) async def execute_tool(tool_name: str, arguments: Dict[str, Any]): # All tool execution now has timeout protection logger.info(f"Executing tool: {tool_name}") ``` ## 🔧 **Configuration Options** ### **Timeout Settings** - **SSH Connections**: 15-30 seconds depending on operation - **Tool Execution**: 45 seconds (with server-level 60s backup) - **Request Handling**: 120 seconds total - **Server Input**: 300 seconds (5 minutes) ### **Retry Settings** - **SSH Setup Operations**: 2 retries with 2s, 4s delays - **SSH Discovery**: 1 retry with 1s delay - **Connection Errors Only**: Non-connection errors fail immediately ### **Health Monitoring** - **Request Tracking**: Total requests, errors, timeouts - **Error Rate Calculation**: Automatic degraded status at >50% error rate - **Uptime Tracking**: Server start time and current uptime ## 🧪 **Testing the Fixes** ### **Run the Test Script** ```bash # Test the enhanced error handling python test_error_handling.py ``` ### **Manual Testing Scenarios** ```bash # Test timeout with unreachable host echo '{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"ssh_discover","arguments":{"hostname":"192.0.2.1","username":"test"}}}' | python run_server.py # Test invalid tool (should not crash) echo '{"jsonrpc":"2.0","id":2,"method":"tools/call","params":{"name":"invalid_tool","arguments":{}}}' | python run_server.py # Test health status echo '{"jsonrpc":"2.0","id":3,"method":"health/status"}' | python run_server.py ``` ## 📊 **Expected Behavior Changes** ### **Before Fixes** - ❌ Server crashes on SSH timeouts - ❌ Hangs indefinitely on network issues - ❌ No error recovery or retry logic - ❌ Poor error messages - ❌ No health monitoring ### **After Fixes** - ✅ Graceful timeout handling with structured errors - ✅ Automatic retry on transient failures - ✅ Server remains stable under error conditions - ✅ Detailed error messages with suggestions - ✅ Health monitoring and degraded state detection - ✅ Configurable timeout and retry parameters ## 🚀 **Performance Impact** - **Minimal overhead** from decorators (~1-2ms per operation) - **Better resource usage** due to proper timeouts - **Improved reliability** reduces overall system load - **Faster error detection** prevents hanging operations ## 📋 **Deployment Notes** ### **Backwards Compatibility** - All existing tool calls work unchanged - JSON-RPC protocol remains the same - No breaking changes to client code ### **New Features** - Health status endpoint: `{"method": "health/status"}` - Enhanced error responses with error_type and timestamps - Automatic server stability monitoring ### **Logging Improvements** - Structured logging with levels (INFO, WARNING, ERROR) - Request tracking and timing information - Health status changes logged automatically ## 🔍 **Monitoring & Debugging** ### **Log Patterns to Watch** ```bash # Successful operations INFO:homelab_mcp.tools:Executing tool: ssh_discover # Timeout warnings WARNING:homelab_mcp.error_handling:Attempt 1/3 failed for 'ssh_discover': Connection timeout # Server health INFO:homelab_mcp.server:Server health: healthy (uptime: 1234.5s) # Critical errors ERROR:homelab_mcp.server:Too many consecutive errors (10), shutting down ``` ### **Health Metrics** Monitor these fields from the `health/status` endpoint: - `error_rate`: Should stay below 0.5 (50%) - `timeout_errors`: High values indicate network issues - `status`: "healthy" vs "degraded" This comprehensive fix ensures the MCP server will remain stable and responsive even under adverse network conditions, timeout scenarios, and unexpected errors.

Latest Blog Posts

What Is Context Bloat in MCP?
By Om-Shree-0709 on December 16, 2025.
mcp
Context Bloat
MCP Moves to the Linux Foundation: Neutral Stewardship for Agentic Infrastructure
By Om-Shree-0709 on December 15, 2025.
mcp
anthropic
Linux Foundation
Code Execution with MCP: Architecting Agentic Efficiency
By Om-Shree-0709 on December 14, 2025.
mcp
Token bloat

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/washyu/mcp_python_server'

If you have feedback or need assistance with the MCP directory API, please join our Discord server