Cortex Resource Manager

WORKER_MANAGEMENT.md•11.1 kB

# Worker Management Implementation Comprehensive worker management tools for the Resource Manager MCP Server. ## Overview This implementation adds Kubernetes worker management capabilities to the resource-manager-mcp-server, enabling dynamic provisioning and lifecycle management of burst workers alongside permanent infrastructure. ## Features Implemented ### 1. list_workers(type_filter=None) Lists all Kubernetes workers with comprehensive information. **Capabilities:** - Filter by worker type (permanent/burst) - Returns node status (ready/busy/draining/not_ready) - Includes resource information (CPU, memory, pods) - Shows TTL expiration for burst workers - Displays labels and annotations **Safety Features:** - Read-only operation - No cluster modifications ### 2. provision_workers(count, ttl, size="medium") Provisions burst workers with automatic VM creation and cluster joining. **Capabilities:** - Creates 1-10 workers per request - Configurable TTL (1-168 hours) - Three size options: small, medium, large - Integrates with Talos MCP or Proxmox MCP - Automatic labeling and TTL annotation **Worker Sizes:** - Small: 2 CPU, 4GB RAM, 50GB disk - Medium: 4 CPU, 8GB RAM, 100GB disk - Large: 8 CPU, 16GB RAM, 200GB disk **Safety Features:** - Input validation (count, TTL, size) - Automatic cleanup after TTL expiration - Labels for easy identification ### 3. drain_worker(worker_id) Gracefully drains a worker node before removal. **Capabilities:** - Evicts all pods to other nodes - Marks node as unschedulable - 5-minute grace period for pod termination - Ignores DaemonSets - Handles pods with emptyDir volumes **Safety Features:** - Graceful pod migration - No service disruption - Configurable timeout ### 4. destroy_worker(worker_id, force=False) Safely destroys burst workers with protection for permanent infrastructure. **Capabilities:** - Removes node from Kubernetes cluster - Deletes VM via Talos/Proxmox MCP - Optional force flag (not recommended) - Partial failure handling **Safety Features:** - **CRITICAL:** Only burst workers can be destroyed - Permanent worker protection (raises exception) - Requires drain before destroy (unless force=True) - Protected worker name patterns - Verification of worker type before deletion ### 5. get_worker_details(worker_id) Retrieves detailed information about a specific worker. **Capabilities:** - Complete node status information - Resource capacity and allocatable - All labels and annotations - Node conditions (Ready, MemoryPressure, etc.) - IP addresses - TTL information for burst workers **Safety Features:** - Read-only operation - No cluster modifications ## Architecture ### Components 1. **worker_manager.py** - Core worker management logic - WorkerManager class - Kubernetes API integration via kubectl - MCP server integration placeholders - Safety checks and validation 2. **server.py** - MCP server integration - Tool registration - Request handling - Error handling - JSON response formatting 3. **config/worker-config.yaml** - Configuration - Kubernetes settings - MCP server endpoints - Worker size templates - Drain configuration - Safety settings 4. **tests/test_worker_manager.py** - Unit tests - Comprehensive test coverage - Mocked Kubernetes API - Safety feature validation ## Safety Mechanisms ### Permanent Worker Protection The most critical safety feature prevents accidental deletion of permanent infrastructure: ```python # SAFETY CHECK: Verify this is a burst worker worker_type = self._get_node_type(node) if worker_type != WorkerType.BURST: raise WorkerManagerError( f"SAFETY VIOLATION: Cannot destroy permanent worker {worker_id}. " f"Only burst workers can be destroyed." ) ``` **How it works:** 1. Check worker labels for `worker-type=burst` 2. Check annotations for `worker-ttl` 3. Permanent workers (no burst label) CANNOT be destroyed 4. Exception raised with clear error message ### Drain Before Destroy Workers must be drained before destruction (unless force=True): ```python # Check if worker is drained (unless force is True) if not force: spec = node.get("spec", {}) if not spec.get("unschedulable", False): raise WorkerManagerError( f"Worker {worker_id} is not drained. " f"Run drain_worker first or use force=True (not recommended)" ) ``` ### Input Validation All inputs are validated before execution: ```python # Validate worker count if count < 1 or count > 10: raise WorkerManagerError("Worker count must be between 1 and 10") # Validate TTL if ttl < 1 or ttl > 168: # Max 1 week raise WorkerManagerError("TTL must be between 1 and 168 hours") # Validate size if size not in WORKER_SIZES: raise WorkerManagerError(f"Invalid size. Must be one of: {list(WORKER_SIZES.keys())}") ``` ### Protected Worker Patterns Configuration supports protected name patterns (regex): ```yaml safety: protected_worker_patterns: - "^master-.*" - "^control-plane-.*" - "^permanent-.*" ``` ## Integration Points ### Kubernetes API Uses kubectl commands for all Kubernetes operations: ```python def _run_kubectl(self, args: List[str]) -> Dict[str, Any]: cmd = ["kubectl"] if self.kubectl_context: cmd.extend(["--context", self.kubectl_context]) cmd.extend(args) result = subprocess.run(cmd, capture_output=True, text=True, check=True) return json.loads(result.stdout) ``` ### MCP Server Integration Designed to integrate with Talos MCP and Proxmox MCP: ```python def _call_mcp_server(self, server: str, method: str, params: Dict[str, Any]) -> Dict[str, Any]: # Placeholder for MCP protocol integration # Will use MCP protocol to communicate with: # - talos-mcp-server for Talos Linux VMs # - proxmox-mcp-server for Proxmox VMs ``` **Integration TODO:** - Implement MCP protocol client - Add VM creation methods - Add VM deletion methods - Add cluster join automation - Add health checking ## Usage Examples ### List All Workers ```python from worker_manager import WorkerManager manager = WorkerManager() workers = manager.list_workers() for worker in workers: print(f"{worker['name']}: {worker['type']} - {worker['status']}") ``` ### Provision Burst Workers ```python # Provision 3 medium workers with 24-hour TTL workers = manager.provision_workers(count=3, ttl=24, size="medium") for worker in workers: print(f"Created: {worker['name']}") print(f" Expires: {worker['ttl_expires']}") ``` ### Safe Worker Removal ```python worker_id = "burst-worker-1234567890-0" # Step 1: Verify it's a burst worker details = manager.get_worker_details(worker_id) if details['type'] != 'burst': raise Exception("Cannot destroy permanent worker!") # Step 2: Drain the worker drain_result = manager.drain_worker(worker_id) print(f"Drained: {drain_result['status']}") # Step 3: Destroy the worker destroy_result = manager.destroy_worker(worker_id) print(f"Destroyed: {destroy_result['status']}") ``` ### MCP Tool Calls Via the MCP server interface: ```json { "tool": "list_workers", "arguments": { "type_filter": "burst" } } ``` ```json { "tool": "provision_workers", "arguments": { "count": 2, "ttl": 48, "size": "large" } } ``` ## Error Handling All functions raise `WorkerManagerError` for operational errors: ```python try: manager.destroy_worker("permanent-worker-1") except WorkerManagerError as e: print(f"Error: {e}") # Output: SAFETY VIOLATION: Cannot destroy permanent worker... ``` ## Configuration ### config/worker-config.yaml Comprehensive configuration including: - Kubernetes context and namespace - MCP server endpoints - Worker size templates - Burst worker limits - Drain configuration - Safety settings - Logging configuration ### Environment Variables Can override config with environment variables: - `KUBECTL_CONTEXT` - Kubernetes context - `TALOS_MCP_ENDPOINT` - Talos MCP endpoint - `PROXMOX_MCP_ENDPOINT` - Proxmox MCP endpoint ## Testing ### Unit Tests Comprehensive test suite in `tests/test_worker_manager.py`: - Worker listing and filtering - Worker type detection - Status detection - Provisioning validation - Drain operations - Destroy operations with safety checks - Error handling - Edge cases ### Running Tests ```bash # Install test dependencies pip install pytest pytest-mock # Run tests pytest tests/test_worker_manager.py -v # Run with coverage pytest tests/test_worker_manager.py --cov=src/worker_manager --cov-report=html ``` ## Files Created 1. `/Users/ryandahlberg/Projects/resource-manager-mcp-server/src/worker_manager.py` - 700+ lines of worker management logic - Complete implementation of all 5 tools - Comprehensive safety checks 2. `/Users/ryandahlberg/Projects/resource-manager-mcp-server/src/server.py` - Updated with worker management tools - Tool registration and handlers - Error handling 3. `/Users/ryandahlberg/Projects/resource-manager-mcp-server/config/worker-config.yaml` - Complete configuration template - All settings documented 4. `/Users/ryandahlberg/Projects/resource-manager-mcp-server/example_worker_usage.py` - Comprehensive usage examples - Safe workflow demonstrations - Error handling examples 5. `/Users/ryandahlberg/Projects/resource-manager-mcp-server/tests/test_worker_manager.py` - 20+ unit tests - Mock Kubernetes API - Safety validation tests 6. `/Users/ryandahlberg/Projects/resource-manager-mcp-server/README.md` - Updated with worker management documentation - API reference - Usage examples ## Next Steps ### MCP Integration (Required) 1. Implement MCP protocol client 2. Integrate with talos-mcp-server: - VM creation - VM deletion - Cluster join automation 3. Integrate with proxmox-mcp-server: - VM creation - VM deletion - Cluster join automation ### Enhanced Features (Optional) 1. Automatic TTL cleanup background task 2. Worker health monitoring 3. Automatic scale-up/scale-down based on load 4. Cost tracking for burst workers 5. Worker usage metrics 6. Notification on worker events ### Production Readiness 1. Add comprehensive logging 2. Add metrics collection (Prometheus) 3. Add alerting for worker issues 4. Add performance benchmarks 5. Load testing 6. Security audit ## Security Considerations 1. **RBAC Permissions**: Service account needs: - `nodes`: get, list, delete, patch - `pods`: get, list, delete (for drain) 2. **Worker Protection**: Multiple layers prevent permanent worker deletion 3. **Input Validation**: All inputs validated before execution 4. **Audit Logging**: All operations should be logged for audit trail 5. **Rate Limiting**: Consider rate limits for provisioning operations ## License MIT License - See LICENSE file for details ## Contributing Contributions welcome! Areas for contribution: - MCP server integration - Additional safety checks - Performance optimizations - Documentation improvements - Test coverage expansion

Latest Blog Posts

Model Context Protocol Proxies: Enabling Enterprise Control with Virtual MCPs
By Om-Shree-0709 on December 9, 2025.
AI Security
Virtual MCP
Kubernetes Operator
The State of MCP in 2025: Who's Building What and Why It Matters
By punkpeye on December 7, 2025.
mcp
startups
MCP hosting with persistent storage
By punkpeye on December 6, 2025.
changelog

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/ry-ops/cortex-resource-manager'

If you have feedback or need assistance with the MCP directory API, please join our Discord server