Manages Kubernetes deployments, services, and worker nodes, providing tools for MCP server lifecycle management (listing, starting, stopping, scaling), worker node provisioning and destruction, and resource allocation tracking within a Kubernetes cluster.
Integrates with Proxmox and Talos Linux for VM provisioning when creating burst workers, enabling automated creation of Talos-based virtual machines that are joined to the Kubernetes cluster as worker nodes.
Cortex Resource Manager
An MCP (Model Context Protocol) server for managing resource allocation, MCP server lifecycle, and Kubernetes workers in the cortex automation system.
Repository: ry-ops/cortex-resource-manager
Features
Resource Allocation (Core Orchestration)
Request resources for jobs (MCP servers + workers)
Release resources after job completion
Track allocations with unique IDs
Get current cluster capacity
Query allocation details
Automatic TTL/expiry handling
In-memory allocation tracking
MCP Server Lifecycle Management
List all registered MCP servers with status
Get detailed status of individual MCP servers
Start MCP servers (scale from 0 to 1)
Stop MCP servers (scale to 0)
Scale MCP servers horizontally (0-10 replicas)
Automatic health checking and readiness waiting
Graceful and forced shutdown options
Worker Management
List Kubernetes workers (permanent and burst) with filtering
Provision burst workers with configurable TTL and size
Drain workers gracefully before destruction
Destroy burst workers safely with protection for permanent workers
Get detailed worker information including resources and status
Integration with Talos MCP and Proxmox MCP for VM provisioning
Overview
The Cortex Resource Manager provides 16 tools organized into 3 categories:
Resource Allocation (5 tools): Core orchestration API for managing cortex job resources
request_resources - Request MCP servers and workers for a job
release_resources - Release allocated resources
get_allocation - Query allocation details
get_capacity - Check cluster capacity
list_allocations - List all active allocations
MCP Server Lifecycle (5 tools): Manage MCP server deployments in Kubernetes
list_mcp_servers - List all MCP servers with status
get_mcp_status - Get detailed server status
start_mcp - Start an MCP server (scale to 1)
stop_mcp - Stop an MCP server (scale to 0)
scale_mcp - Scale an MCP server horizontally (0-10 replicas)
Worker Management (6 tools): Manage Kubernetes workers (permanent and burst)
list_workers - List all workers with filtering
provision_workers - Create burst workers with TTL
drain_worker - Gracefully drain a worker
destroy_worker - Safely destroy burst workers
get_worker_details - Get detailed worker information
get_worker_capacity - Check worker resource capacity
Installation
Requirements
Python 3.8+
Kubernetes cluster access
Properly configured kubeconfig or in-cluster service account (see the connectivity check sketched below)
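As a quick way to confirm the cluster-access requirement is met, the following sketch uses the official kubernetes Python client (assumed to be installed) to load credentials and list nodes:

```python
# Sketch: verify cluster access with the official kubernetes Python client
# (assumed to be installed, e.g. via pip install kubernetes).
from kubernetes import client, config

try:
    # Prefer the in-cluster service account when running inside a pod...
    config.load_incluster_config()
except config.ConfigException:
    # ...otherwise fall back to the local kubeconfig.
    config.load_kube_config()

nodes = client.CoreV1Api().list_node()
print(f"Connected: {len(nodes.items)} nodes visible")
```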
Usage
Resource Allocation Tools
The core orchestration API for cortex job management:
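A minimal sketch of a typical allocation flow using the tools listed in the Overview. The module path, parameter names, and returned field names are illustrative assumptions, not taken verbatim from this repository:

```python
# Sketch only: tool names come from the Overview above; the import path,
# parameter names, and returned fields are assumptions.
from cortex_resource_manager import (  # hypothetical import path
    request_resources, get_allocation, get_capacity, release_resources,
)

# Check what the cluster can currently take on.
print(get_capacity())

# Request MCP servers and workers for a job (parameter names assumed).
allocation = request_resources(
    job_id="job-123",
    mcp_servers=["github-mcp", "talos-mcp"],
    workers=2,
)
allocation_id = allocation["allocation_id"]  # assumed field name

# Inspect the allocation, then release it when the job completes.
print(get_allocation(allocation_id))
release_resources(allocation_id)
```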
MCP Server Lifecycle (Convenience Functions)
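A sketch using the convenience functions documented in the API Reference below. The import path and the server name "github-mcp" are assumptions:

```python
# Sketch: lifecycle convenience functions as documented in the API Reference.
# Import path and server name are assumptions.
from cortex_resource_manager import (  # hypothetical import path
    list_mcp_servers, get_mcp_status, start_mcp, scale_mcp, stop_mcp,
)

# List registered MCP servers in the default namespace.
for server in list_mcp_servers():
    print(server["name"], server["status"], server["ready_replicas"])

# Start a stopped server and wait until it is ready (default timeout 300s).
start_mcp("github-mcp", wait_ready=True)

# Scale it out, check status, then scale back to zero.
scale_mcp("github-mcp", replicas=3)
print(get_mcp_status("github-mcp"))
stop_mcp("github-mcp")
```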
Advanced Usage (Manager Class)
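The manager class is not named in this section, so the sketch below assumes a class called MCPServerManager that wraps the same operations with an explicit namespace and label selector:

```python
# Sketch: the class name MCPServerManager and its constructor arguments are
# assumptions; the operations mirror the convenience functions above.
from cortex_resource_manager import MCPServerManager  # hypothetical

manager = MCPServerManager(
    namespace="cortex",
    label_selector="app.kubernetes.io/component=mcp-server",
)

servers = manager.list_mcp_servers()
manager.start_mcp("github-mcp", wait_ready=True, timeout=120)
manager.scale_mcp("github-mcp", replicas=2)
```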
API Reference
list_mcp_servers()
List all registered MCP servers.
Parameters:
namespace (str): Kubernetes namespace (default: "default")
label_selector (str): Label selector to filter deployments (default: "app.kubernetes.io/component=mcp-server")
Returns: List of dictionaries with:
name: Server name
status: Current status ("running", "stopped", "scaling", "pending")
replicas: Desired replica count
ready_replicas: Number of ready replicas
endpoints: List of service endpoints
get_mcp_status(name)
Get detailed status of one MCP server.
Parameters:
name (str): MCP server name
namespace (str): Kubernetes namespace (default: "default")
Returns: Dictionary with:
name: Server name
status: Current status
replicas: Desired replica count
ready_replicas: Number of ready replicas
available_replicas: Number of available replicas
updated_replicas: Number of updated replicas
endpoints: List of service endpoints
last_activity: Timestamp of last deployment update
conditions: List of deployment conditions
Raises:
ValueError: If server not found
start_mcp(name, wait_ready=True)
Start an MCP server by scaling from 0 to 1 replica.
Parameters:
name (str): MCP server name
wait_ready (bool): Wait for server to be ready (default: True)
timeout (int): Maximum wait time in seconds (default: 300)
namespace (str): Kubernetes namespace (default: "default")
Returns: Dictionary with server status after starting
Raises:
ValueError: If server not found
TimeoutError: If wait_ready=True and server doesn't become ready
stop_mcp(name, force=False)
Stop an MCP server by scaling to 0 replicas.
Parameters:
name (str): MCP server name
force (bool): Force immediate termination (default: False)
namespace (str): Kubernetes namespace (default: "default")
Returns: Dictionary with server status after stopping
Raises:
ValueError: If server not found
scale_mcp(name, replicas)
Scale an MCP server horizontally.
Parameters:
name (str): MCP server name
replicas (int): Desired replica count (0-10)
wait_ready (bool): Wait for all replicas to be ready (default: False)
timeout (int): Maximum wait time in seconds (default: 300)
namespace (str): Kubernetes namespace (default: "default")
Returns: Dictionary with server status after scaling
Raises:
ValueError: If server not found or invalid replica count
Worker Management Tools
list_workers(type_filter=None)
List all Kubernetes workers with their status, type, and resources.
Parameters:
type_filter (str, optional): Filter by worker type ("permanent" or "burst")
Returns: List of dictionaries with:
name: Worker node name
status: Worker status ("ready", "busy", "draining", "not_ready")
type: Worker type ("permanent" or "burst")
resources: Resource capacity and allocatable amounts
labels: Node labels
annotations: Node annotations
created: Node creation timestamp
ttl_expires (burst workers only): TTL expiration timestamp
Example:
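A sketch of listing all workers and then filtering to burst nodes (import path assumed):

```python
# Sketch: import path assumed.
from cortex_resource_manager import list_workers  # hypothetical import path

# All workers, then only burst workers.
for worker in list_workers():
    print(worker["name"], worker["type"], worker["status"])

burst = list_workers(type_filter="burst")
print(f"{len(burst)} burst workers active")
```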
provision_workers(count, ttl, size="medium")
Create burst workers by provisioning VMs and joining them to the Kubernetes cluster.
Parameters:
count (int): Number of workers to provision (1-10)
ttl (int): Time-to-live in hours (1-168, max 1 week)
size (str): Worker size ("small", "medium", or "large")
  small: 2 CPU, 4GB RAM, 50GB disk
  medium: 4 CPU, 8GB RAM, 100GB disk
  large: 8 CPU, 16GB RAM, 200GB disk
Returns: List of provisioned worker information dictionaries
Raises:
WorkerManagerError: If provisioning fails or parameters are invalid
Example:
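A sketch of provisioning two medium burst workers with a 24-hour TTL (import path and returned field names assumed):

```python
# Sketch: import path and returned field names assumed.
from cortex_resource_manager import provision_workers  # hypothetical import path

# Two medium burst workers (4 CPU, 8GB RAM, 100GB disk) that expire after 24 hours.
workers = provision_workers(count=2, ttl=24, size="medium")
for w in workers:
    print(w["name"], w.get("ttl_expires"))
```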
Note: This function integrates with Talos MCP or Proxmox MCP servers to create VMs. The VMs are automatically joined to the Kubernetes cluster and labeled as burst workers.
drain_worker(worker_id)
Gracefully drain a worker node by moving all pods to other nodes and marking it unschedulable.
Parameters:
worker_id (str): Worker node name to drain
Returns: Dictionary with drain operation status:
worker_id: Worker node name
status: Operation status ("draining")
message: Status message
output: kubectl drain command output
Raises:
WorkerManagerError: If worker not found or drain fails
Example:
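A sketch of draining a burst worker before tearing it down (import path and node name assumed):

```python
# Sketch: import path and node name assumed.
from cortex_resource_manager import drain_worker  # hypothetical import path

result = drain_worker("burst-worker-01")
print(result["status"], result["message"])
```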
Note: This operation may take several minutes as pods are gracefully terminated and rescheduled to other nodes. DaemonSets are ignored, and pods with emptyDir volumes are deleted.
destroy_worker(worker_id, force=False)
Destroy a burst worker by removing it from the cluster and deleting the VM.
Parameters:
worker_id (str): Worker node name to destroy
force (bool): Force destroy without draining first (not recommended, default: False)
Returns: Dictionary with destroy operation status:
worker_id: Worker node name
status: Operation status ("destroyed" or "partial_destroy")
message: Status message
removed_from_cluster: Whether the node was removed from the cluster
vm_deleted: Whether the VM was deleted
error (if failed): Error message
Raises:
WorkerManagerError: If worker is permanent (SAFETY VIOLATION), not found, or not drained
SAFETY FEATURES:
Only burst workers can be destroyed - attempting to destroy a permanent worker raises an error
Requires worker to be drained first unless force=True
Protected worker patterns prevent accidental deletion
Example:
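A sketch of the drain-then-destroy flow, checking the worker type first as the warning below advises (import path and node name assumed):

```python
# Sketch: import path and node name assumed.
from cortex_resource_manager import (  # hypothetical import path
    get_worker_details, drain_worker, destroy_worker,
)

worker = get_worker_details("burst-worker-01")
if worker["type"] == "burst":
    drain_worker("burst-worker-01")
    result = destroy_worker("burst-worker-01")
    print(result["status"], result["vm_deleted"])
```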
WARNING: Never destroy permanent workers! The system prevents this, but always verify worker type before destroying.
get_worker_details(worker_id)
Get detailed information about a specific worker.
Parameters:
worker_id (str): Worker node name
Returns: Dictionary with detailed worker information:
name: Worker node name
status: Worker status
type: Worker type
resources: Capacity and allocatable resources
labels: All node labels
annotations: All node annotations
created: Creation timestamp
conditions: Node conditions (Ready, MemoryPressure, DiskPressure, etc.)
addresses: Node IP addresses
ttl_expires (burst workers only): TTL expiration timestamp
Raises:
WorkerManagerError: If worker not found
Example:
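A sketch of inspecting a single worker (import path and node name assumed):

```python
# Sketch: import path and node name assumed.
from cortex_resource_manager import get_worker_details  # hypothetical import path

details = get_worker_details("worker-01")
print(details["status"], details["type"])
print(details["resources"])
if details["type"] == "burst":
    print("expires:", details["ttl_expires"])
```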
Kubernetes Setup
Required Labels
MCP server deployments must carry the label app.kubernetes.io/component: mcp-server (the default label selector used by list_mcp_servers).
Example Deployment
See config/example-mcp-deployment.yaml for a complete example.
Key requirements:
Deployment with the app.kubernetes.io/component: mcp-server label
Service with matching selector
Health and readiness probes configured
Appropriate resource limits
RBAC Permissions
The service account needs, at a minimum, permissions to:
Read and scale Deployments (apps API group)
Read Services and Endpoints
Read, patch (cordon/label), and delete Nodes
List and evict Pods (for draining)
Error Handling
All functions raise appropriate exceptions:
ValueError: Invalid input parameters or resource not found
WorkerManagerError: Worker provisioning, drain, or destroy failures
ApiException: Kubernetes API errors
TimeoutError: Operations that exceed timeout limits
Example error handling:
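A sketch of handling the exceptions listed above; the import path for start_mcp is assumed, while ApiException comes from the official kubernetes client:

```python
# Sketch: start_mcp import path assumed; ApiException is from the kubernetes client.
from kubernetes.client.exceptions import ApiException
from cortex_resource_manager import start_mcp  # hypothetical import path

try:
    start_mcp("github-mcp", wait_ready=True, timeout=60)
except ValueError as exc:
    print(f"Server not found or bad input: {exc}")
except TimeoutError as exc:
    print(f"Server did not become ready in time: {exc}")
except ApiException as exc:
    print(f"Kubernetes API error: {exc.status} {exc.reason}")
```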
Status Values
Deployment Status
running: All replicas are ready and available
stopped: Scaled to 0 replicas
scaling: Replicas are being added or removed
pending: Waiting for replicas to become ready
Development
Running Tests
Project Structure
License
MIT License
Contributing
Contributions welcome! Please submit pull requests or open issues.