Skip to main content
Glama
WORKER_MANAGEMENT.md11.1 kB
# Worker Management Implementation Comprehensive worker management tools for the Resource Manager MCP Server. ## Overview This implementation adds Kubernetes worker management capabilities to the resource-manager-mcp-server, enabling dynamic provisioning and lifecycle management of burst workers alongside permanent infrastructure. ## Features Implemented ### 1. list_workers(type_filter=None) Lists all Kubernetes workers with comprehensive information. **Capabilities:** - Filter by worker type (permanent/burst) - Returns node status (ready/busy/draining/not_ready) - Includes resource information (CPU, memory, pods) - Shows TTL expiration for burst workers - Displays labels and annotations **Safety Features:** - Read-only operation - No cluster modifications ### 2. provision_workers(count, ttl, size="medium") Provisions burst workers with automatic VM creation and cluster joining. **Capabilities:** - Creates 1-10 workers per request - Configurable TTL (1-168 hours) - Three size options: small, medium, large - Integrates with Talos MCP or Proxmox MCP - Automatic labeling and TTL annotation **Worker Sizes:** - Small: 2 CPU, 4GB RAM, 50GB disk - Medium: 4 CPU, 8GB RAM, 100GB disk - Large: 8 CPU, 16GB RAM, 200GB disk **Safety Features:** - Input validation (count, TTL, size) - Automatic cleanup after TTL expiration - Labels for easy identification ### 3. drain_worker(worker_id) Gracefully drains a worker node before removal. **Capabilities:** - Evicts all pods to other nodes - Marks node as unschedulable - 5-minute grace period for pod termination - Ignores DaemonSets - Handles pods with emptyDir volumes **Safety Features:** - Graceful pod migration - No service disruption - Configurable timeout ### 4. destroy_worker(worker_id, force=False) Safely destroys burst workers with protection for permanent infrastructure. **Capabilities:** - Removes node from Kubernetes cluster - Deletes VM via Talos/Proxmox MCP - Optional force flag (not recommended) - Partial failure handling **Safety Features:** - **CRITICAL:** Only burst workers can be destroyed - Permanent worker protection (raises exception) - Requires drain before destroy (unless force=True) - Protected worker name patterns - Verification of worker type before deletion ### 5. get_worker_details(worker_id) Retrieves detailed information about a specific worker. **Capabilities:** - Complete node status information - Resource capacity and allocatable - All labels and annotations - Node conditions (Ready, MemoryPressure, etc.) - IP addresses - TTL information for burst workers **Safety Features:** - Read-only operation - No cluster modifications ## Architecture ### Components 1. **worker_manager.py** - Core worker management logic - WorkerManager class - Kubernetes API integration via kubectl - MCP server integration placeholders - Safety checks and validation 2. **server.py** - MCP server integration - Tool registration - Request handling - Error handling - JSON response formatting 3. **config/worker-config.yaml** - Configuration - Kubernetes settings - MCP server endpoints - Worker size templates - Drain configuration - Safety settings 4. **tests/test_worker_manager.py** - Unit tests - Comprehensive test coverage - Mocked Kubernetes API - Safety feature validation ## Safety Mechanisms ### Permanent Worker Protection The most critical safety feature prevents accidental deletion of permanent infrastructure: ```python # SAFETY CHECK: Verify this is a burst worker worker_type = self._get_node_type(node) if worker_type != WorkerType.BURST: raise WorkerManagerError( f"SAFETY VIOLATION: Cannot destroy permanent worker {worker_id}. " f"Only burst workers can be destroyed." ) ``` **How it works:** 1. Check worker labels for `worker-type=burst` 2. Check annotations for `worker-ttl` 3. Permanent workers (no burst label) CANNOT be destroyed 4. Exception raised with clear error message ### Drain Before Destroy Workers must be drained before destruction (unless force=True): ```python # Check if worker is drained (unless force is True) if not force: spec = node.get("spec", {}) if not spec.get("unschedulable", False): raise WorkerManagerError( f"Worker {worker_id} is not drained. " f"Run drain_worker first or use force=True (not recommended)" ) ``` ### Input Validation All inputs are validated before execution: ```python # Validate worker count if count < 1 or count > 10: raise WorkerManagerError("Worker count must be between 1 and 10") # Validate TTL if ttl < 1 or ttl > 168: # Max 1 week raise WorkerManagerError("TTL must be between 1 and 168 hours") # Validate size if size not in WORKER_SIZES: raise WorkerManagerError(f"Invalid size. Must be one of: {list(WORKER_SIZES.keys())}") ``` ### Protected Worker Patterns Configuration supports protected name patterns (regex): ```yaml safety: protected_worker_patterns: - "^master-.*" - "^control-plane-.*" - "^permanent-.*" ``` ## Integration Points ### Kubernetes API Uses kubectl commands for all Kubernetes operations: ```python def _run_kubectl(self, args: List[str]) -> Dict[str, Any]: cmd = ["kubectl"] if self.kubectl_context: cmd.extend(["--context", self.kubectl_context]) cmd.extend(args) result = subprocess.run(cmd, capture_output=True, text=True, check=True) return json.loads(result.stdout) ``` ### MCP Server Integration Designed to integrate with Talos MCP and Proxmox MCP: ```python def _call_mcp_server(self, server: str, method: str, params: Dict[str, Any]) -> Dict[str, Any]: # Placeholder for MCP protocol integration # Will use MCP protocol to communicate with: # - talos-mcp-server for Talos Linux VMs # - proxmox-mcp-server for Proxmox VMs ``` **Integration TODO:** - Implement MCP protocol client - Add VM creation methods - Add VM deletion methods - Add cluster join automation - Add health checking ## Usage Examples ### List All Workers ```python from worker_manager import WorkerManager manager = WorkerManager() workers = manager.list_workers() for worker in workers: print(f"{worker['name']}: {worker['type']} - {worker['status']}") ``` ### Provision Burst Workers ```python # Provision 3 medium workers with 24-hour TTL workers = manager.provision_workers(count=3, ttl=24, size="medium") for worker in workers: print(f"Created: {worker['name']}") print(f" Expires: {worker['ttl_expires']}") ``` ### Safe Worker Removal ```python worker_id = "burst-worker-1234567890-0" # Step 1: Verify it's a burst worker details = manager.get_worker_details(worker_id) if details['type'] != 'burst': raise Exception("Cannot destroy permanent worker!") # Step 2: Drain the worker drain_result = manager.drain_worker(worker_id) print(f"Drained: {drain_result['status']}") # Step 3: Destroy the worker destroy_result = manager.destroy_worker(worker_id) print(f"Destroyed: {destroy_result['status']}") ``` ### MCP Tool Calls Via the MCP server interface: ```json { "tool": "list_workers", "arguments": { "type_filter": "burst" } } ``` ```json { "tool": "provision_workers", "arguments": { "count": 2, "ttl": 48, "size": "large" } } ``` ## Error Handling All functions raise `WorkerManagerError` for operational errors: ```python try: manager.destroy_worker("permanent-worker-1") except WorkerManagerError as e: print(f"Error: {e}") # Output: SAFETY VIOLATION: Cannot destroy permanent worker... ``` ## Configuration ### config/worker-config.yaml Comprehensive configuration including: - Kubernetes context and namespace - MCP server endpoints - Worker size templates - Burst worker limits - Drain configuration - Safety settings - Logging configuration ### Environment Variables Can override config with environment variables: - `KUBECTL_CONTEXT` - Kubernetes context - `TALOS_MCP_ENDPOINT` - Talos MCP endpoint - `PROXMOX_MCP_ENDPOINT` - Proxmox MCP endpoint ## Testing ### Unit Tests Comprehensive test suite in `tests/test_worker_manager.py`: - Worker listing and filtering - Worker type detection - Status detection - Provisioning validation - Drain operations - Destroy operations with safety checks - Error handling - Edge cases ### Running Tests ```bash # Install test dependencies pip install pytest pytest-mock # Run tests pytest tests/test_worker_manager.py -v # Run with coverage pytest tests/test_worker_manager.py --cov=src/worker_manager --cov-report=html ``` ## Files Created 1. `/Users/ryandahlberg/Projects/resource-manager-mcp-server/src/worker_manager.py` - 700+ lines of worker management logic - Complete implementation of all 5 tools - Comprehensive safety checks 2. `/Users/ryandahlberg/Projects/resource-manager-mcp-server/src/server.py` - Updated with worker management tools - Tool registration and handlers - Error handling 3. `/Users/ryandahlberg/Projects/resource-manager-mcp-server/config/worker-config.yaml` - Complete configuration template - All settings documented 4. `/Users/ryandahlberg/Projects/resource-manager-mcp-server/example_worker_usage.py` - Comprehensive usage examples - Safe workflow demonstrations - Error handling examples 5. `/Users/ryandahlberg/Projects/resource-manager-mcp-server/tests/test_worker_manager.py` - 20+ unit tests - Mock Kubernetes API - Safety validation tests 6. `/Users/ryandahlberg/Projects/resource-manager-mcp-server/README.md` - Updated with worker management documentation - API reference - Usage examples ## Next Steps ### MCP Integration (Required) 1. Implement MCP protocol client 2. Integrate with talos-mcp-server: - VM creation - VM deletion - Cluster join automation 3. Integrate with proxmox-mcp-server: - VM creation - VM deletion - Cluster join automation ### Enhanced Features (Optional) 1. Automatic TTL cleanup background task 2. Worker health monitoring 3. Automatic scale-up/scale-down based on load 4. Cost tracking for burst workers 5. Worker usage metrics 6. Notification on worker events ### Production Readiness 1. Add comprehensive logging 2. Add metrics collection (Prometheus) 3. Add alerting for worker issues 4. Add performance benchmarks 5. Load testing 6. Security audit ## Security Considerations 1. **RBAC Permissions**: Service account needs: - `nodes`: get, list, delete, patch - `pods`: get, list, delete (for drain) 2. **Worker Protection**: Multiple layers prevent permanent worker deletion 3. **Input Validation**: All inputs validated before execution 4. **Audit Logging**: All operations should be logged for audit trail 5. **Rate Limiting**: Consider rate limits for provisioning operations ## License MIT License - See LICENSE file for details ## Contributing Contributions welcome! Areas for contribution: - MCP server integration - Additional safety checks - Performance optimizations - Documentation improvements - Test coverage expansion

Latest Blog Posts

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/ry-ops/cortex-resource-manager'

If you have feedback or need assistance with the MCP directory API, please join our Discord server