Cortex Resource Manager

IMPLEMENTATION_SUMMARY.md•13 kB

# MCP Lifecycle Management Implementation Summary ## Overview Successfully implemented comprehensive MCP (Model Context Protocol) server lifecycle management tools for Kubernetes-based deployments. This implementation provides a complete API for managing MCP servers running in Kubernetes clusters. ## Implementation Details ### Core Module: `src/resource_manager_mcp_server/__init__.py` (611 lines) #### Main Class: MCPLifecycleManager A robust manager class that handles all lifecycle operations for MCP servers in Kubernetes. **Key Features:** - In-cluster and kubeconfig-based authentication - Namespace isolation - Comprehensive error handling - Input validation - Automatic health checking - Status monitoring #### Implemented Functions ##### 1. list_mcp_servers() **Purpose:** List all registered MCP servers in the cluster **Features:** - Label-based filtering (default: `app.kubernetes.io/component=mcp-server`) - Returns comprehensive server information - Includes service endpoints - Real-time status detection **Returns:** ```python [ { "name": "server-name", "status": "running|stopped|scaling|pending", "replicas": 3, "ready_replicas": 3, "endpoints": ["http://10.0.0.1:8080"] } ] ``` ##### 2. get_mcp_status(name) **Purpose:** Get detailed status of a specific MCP server **Features:** - Comprehensive deployment status - Replica information (ready, available, updated) - Service endpoint discovery - Last activity timestamp - Kubernetes condition details **Returns:** ```python { "name": "server-name", "status": "running", "replicas": 3, "ready_replicas": 3, "available_replicas": 3, "updated_replicas": 3, "endpoints": ["http://10.0.0.1:8080"], "last_activity": "2025-12-08T19:00:00Z", "conditions": [...] } ``` ##### 3. start_mcp(name, wait_ready=True) **Purpose:** Start an MCP server by scaling from 0 to 1 replica **Features:** - Idempotent (safe to call on running servers) - Optional wait for ready state - Configurable timeout (default: 300s) - Real-time readiness polling **Parameters:** - `name`: Server name - `wait_ready`: Block until ready (default: True) - `timeout`: Maximum wait time in seconds **Raises:** - `ValueError`: Server not found - `TimeoutError`: Server didn't become ready in time ##### 4. stop_mcp(name, force=False) **Purpose:** Stop an MCP server by scaling to 0 replicas **Features:** - Graceful shutdown by default - Force mode for immediate termination - Idempotent (safe to call on stopped servers) - Automatic pod cleanup in force mode **Parameters:** - `name`: Server name - `force`: Immediate termination (default: False) **Behavior:** - Normal mode: Respects terminationGracePeriodSeconds - Force mode: Sets grace period to 0 and deletes pods immediately ##### 5. scale_mcp(name, replicas) **Purpose:** Scale an MCP server horizontally **Features:** - Replica count validation (0-10) - Optional wait for ready state - Idempotent (safe to call with current replica count) - Supports both scale-up and scale-down **Parameters:** - `name`: Server name - `replicas`: Desired replica count (0-10) - `wait_ready`: Wait for all replicas to be ready (default: False) - `timeout`: Maximum wait time in seconds **Validation:** - Replica count must be integer between 0 and 10 - Server name must be valid Kubernetes name ### Error Handling & Validation #### Input Validation - **Server Names:** Validates Kubernetes naming conventions - Alphanumeric and hyphens only - No leading/trailing hyphens - Case-insensitive - **Replica Counts:** Validates replica constraints - Integer type checking - Range validation (0-10) #### Exception Types - `ValueError`: Invalid input or resource not found - `ApiException`: Kubernetes API errors (with proper status code handling) - `TimeoutError`: Operations exceeding timeout limits - Generic `Exception`: Wrapped API errors with descriptive messages ### Service Discovery #### Endpoint Detection Supports multiple Kubernetes service types: 1. **ClusterIP Services** - Returns internal cluster endpoints - Format: `http://<cluster-ip>:<port>` 2. **NodePort Services** - Returns node port information - Format: `nodePort://*:<node-port>` 3. **LoadBalancer Services** - Returns external endpoints - Handles both IP and hostname ingress - Format: `http://<external-ip>:<port>` ### Status Detection #### Deployment States Intelligent status determination based on replica counts: - **running**: `ready_replicas == spec_replicas > 0` - **stopped**: `spec_replicas == 0` - **scaling**: `replicas != spec_replicas OR ready_replicas != spec_replicas` - **pending**: Waiting for initial pods to become ready ### Convenience Functions Module provides top-level convenience functions for direct usage: ```python from resource_manager_mcp_server import ( list_mcp_servers, get_mcp_status, start_mcp, stop_mcp, scale_mcp, get_manager ) ``` Each function internally uses a singleton `MCPLifecycleManager` instance. ## Supporting Files ### 1. requirements.txt Dependencies: - `kubernetes>=28.1.0` - Kubernetes Python client - `typing-extensions>=4.8.0` - Type hints support ### 2. setup.py - Package configuration - Development dependencies (pytest, black, flake8, mypy) - Python 3.8+ compatibility ### 3. config/example-mcp-deployment.yaml Complete example Kubernetes deployment including: - Deployment with proper labels - Service configuration - Health and readiness probes - Resource limits - Graceful termination settings ### 4. example_usage.py Comprehensive usage examples demonstrating: - Basic operations (list, status, scale) - Advanced usage (custom namespace, timeouts) - Full lifecycle workflows - Error handling patterns ### 5. tests/test_lifecycle_manager.py Extensive test suite covering: - Input validation - Status detection - All CRUD operations - Error conditions - Service endpoint discovery - Mock-based testing (no cluster required) Test coverage: - 30+ test cases - Unit tests for all public methods - Edge case handling - Mock Kubernetes API responses ### 6. README.md Complete documentation including: - Feature overview - Installation instructions - API reference for all functions - Usage examples - Kubernetes setup requirements - RBAC permissions - Error handling guide - Development instructions ### 7. QUICKSTART.md Quick start guide covering: - 5-minute setup - Basic usage patterns - Common workflows - Troubleshooting tips - Next steps ### 8. Makefile Convenience commands: - `make install` - Install dependencies - `make test` - Run test suite - `make lint` - Run linting - `make format` - Format code - `make clean` - Clean artifacts ### 9. pytest.ini Test configuration: - Test discovery patterns - Markers for unit/integration tests - Output formatting ## Technical Design Decisions ### 1. Kubernetes Client Library - Uses official `kubernetes` Python client - Supports both in-cluster and external kubeconfig - Automatic authentication handling ### 2. Singleton Pattern - Singleton manager instance for convenience functions - Reduces overhead of multiple client initializations - Allows direct class usage for advanced scenarios ### 3. Status Polling - Implements custom wait mechanism with 2-second intervals - Configurable timeouts - Non-blocking by default (wait_ready=False) ### 4. Replica Limits - Conservative limit of 10 replicas - Prevents accidental resource exhaustion - Configurable through validation method ### 5. Service Discovery - Multi-type service support - Graceful handling of missing services - Returns empty list rather than failing ### 6. Error Handling Philosophy - Fail fast with clear error messages - Distinguish between user errors (ValueError) and system errors (ApiException) - Idempotent operations where possible ## Integration Points ### Kubernetes Requirements #### Required Labels MCP server deployments must have: ```yaml labels: app.kubernetes.io/component: mcp-server ``` #### Required RBAC Permissions ```yaml - apiGroups: ["apps"] resources: ["deployments"] verbs: ["get", "list", "patch", "update"] - apiGroups: [""] resources: ["services", "pods"] verbs: ["get", "list", "delete"] ``` #### Recommended Deployment Configuration - Health probes (liveness and readiness) - Resource limits - Graceful termination period (30s recommended) - Proper labels for discovery ## Usage Patterns ### Pattern 1: Simple Server Management ```python from resource_manager_mcp_server import start_mcp, stop_mcp start_mcp("my-server") # Start and wait # Do work... stop_mcp("my-server") # Graceful shutdown ``` ### Pattern 2: Dynamic Scaling ```python from resource_manager_mcp_server import scale_mcp, get_mcp_status scale_mcp("my-server", replicas=5) # Scale up # Handle load... scale_mcp("my-server", replicas=1) # Scale down ``` ### Pattern 3: Status Monitoring ```python from resource_manager_mcp_server import list_mcp_servers servers = list_mcp_servers() for server in servers: if server['status'] != 'running': print(f"Alert: {server['name']} is {server['status']}") ``` ### Pattern 4: Advanced Manager Usage ```python from resource_manager_mcp_server import MCPLifecycleManager manager = MCPLifecycleManager(namespace="production") manager.start_mcp("critical-server", wait_ready=True, timeout=600) ``` ## Testing Strategy ### Unit Tests - Mock Kubernetes API calls - Test validation logic - Test status detection - Test error handling - No cluster required ### Integration Tests - Require real Kubernetes cluster - Test actual API interactions - Validate end-to-end workflows - Marked separately (`@pytest.mark.integration`) ### Test Execution ```bash # Unit tests only (no cluster needed) pytest tests/ -m unit # All tests (requires cluster) pytest tests/ # With coverage pytest tests/ --cov=src/resource_manager_mcp_server ``` ## Performance Considerations ### API Call Optimization - Single deployment read for get_mcp_status() - Batch listing for list_mcp_servers() - Optional readiness waiting (non-blocking default) ### Polling Strategy - 2-second intervals for readiness checks - Configurable timeouts - Early exit on success ### Resource Usage - Minimal memory footprint - No background threads - Synchronous operations (simple to reason about) ## Future Enhancements Potential improvements for future versions: 1. **Async Support**: Add async/await versions of all methods 2. **Batch Operations**: Support multiple servers in single call 3. **Metrics Collection**: Prometheus metrics for monitoring 4. **Event Streaming**: Watch Kubernetes events in real-time 5. **Config Management**: ConfigMap/Secret management 6. **Rollout Strategies**: Blue-green, canary deployments 7. **Auto-scaling**: HPA integration 8. **Cost Tracking**: Resource usage and cost monitoring 9. **Multi-cluster**: Support multiple Kubernetes clusters 10. **Webhooks**: Integration with external systems ## Files Created ``` /Users/ryandahlberg/Projects/resource-manager-mcp-server/ ├── src/ │ └── resource_manager_mcp_server/ │ └── __init__.py # Main implementation (611 lines) ├── config/ │ └── example-mcp-deployment.yaml # Example K8s deployment ├── tests/ │ ├── test_lifecycle_manager.py # Comprehensive test suite │ └── pytest.ini # Test configuration ├── requirements.txt # Python dependencies ├── setup.py # Package configuration ├── example_usage.py # Usage examples ├── README.md # Complete documentation ├── QUICKSTART.md # Quick start guide ├── IMPLEMENTATION_SUMMARY.md # This file └── Makefile # Build commands ``` ## Success Criteria Met - ✅ list_mcp_servers() - Complete with status, replicas, endpoints - ✅ get_mcp_status() - Detailed status with conditions and activity - ✅ start_mcp() - Scale 0→1 with optional wait - ✅ stop_mcp() - Graceful and force shutdown - ✅ scale_mcp() - Horizontal scaling with validation - ✅ Kubernetes Python client integration - ✅ Proper error handling and validation - ✅ Comprehensive documentation - ✅ Test coverage - ✅ Example code and configurations ## Conclusion This implementation provides a production-ready MCP lifecycle management system with: - Complete API coverage for all required operations - Robust error handling and validation - Comprehensive documentation and examples - Extensive test coverage - Integration-ready design - Clear upgrade path for future enhancements The implementation is ready for integration into the cortex automation system and can be used immediately for managing MCP servers in Kubernetes environments.

Latest Blog Posts

Model Context Protocol Proxies: Enabling Enterprise Control with Virtual MCPs
By Om-Shree-0709 on December 9, 2025.
AI Security
Virtual MCP
Kubernetes Operator
The State of MCP in 2025: Who's Building What and Why It Matters
By punkpeye on December 7, 2025.
mcp
startups
MCP hosting with persistent storage
By punkpeye on December 6, 2025.
changelog

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/ry-ops/cortex-resource-manager'

If you have feedback or need assistance with the MCP directory API, please join our Discord server