# Anti-Idle System for Amicus Synapse Network
## Overview
The anti-idle system prevents Synapse nodes from lingering idle by combining intelligent workload assessment, automatic node lifecycle management, and graceful termination protocols. It enforces a hard maximum of 4 concurrent nodes while keeping resource utilization efficient.
## Architecture
### State Schema Enhancements
The anti-idle system extends the Amicus state with three new top-level structures:
#### 1. Enhanced `cluster_nodes`
Each node now tracks additional lifecycle metadata:
```json
{
"Node-ID": {
"role": "bootstrap_manager | architect | developer",
"model": {"name": "...", "strength": "..."},
"last_heartbeat": 1234567890.0,
"status": "working | idle | waiting | terminated",
"current_task_id": "task-123" | null,
"idle_since": 1234567890.0 | null,
"last_activity": 1234567890.0
}
}
```
**Status States:**
- `working`: Actively executing a task
- `idle`: No tasks available, within grace period
- `waiting`: Short-term wait (< 10s) for state updates
- `terminated`: Gracefully exited, no longer active
#### 2. `cluster_metadata`
Tracks cluster-wide statistics:
```json
{
"max_agents": 4,
"active_agent_count": 2,
"idle_agent_count": 0,
"pending_task_count": 5,
"manager_id": "Node-BM-X1Y2"
}
```
#### 3. `work_distribution`
Provides workload intelligence for Bootstrap Manager:
```json
{
"last_assessment": 1234567890.0,
"workload_status": "overloaded | balanced | underutilized | idle",
"spawn_recommendation": "spawn_developer | spawn_architect | terminate_idle | maintain"
}
```
**Workload Status Definitions:**
- `idle`: No pending or in-progress tasks
- `underutilized`: Few pending tasks, multiple idle nodes
- `balanced`: Optimal task-to-node ratio
- `overloaded`: Many pending tasks, all nodes busy or approaching capacity
## MCP Tools
### `set_agent_status(agent_id, status, current_task_id)`
Updates an agent's status and tracks idle transitions.
**Usage:**
```python
# Mark as idle when no work available
set_agent_status("Node-X9J2", "idle", None)
# Mark as working when claiming a task
set_agent_status("Node-X9J2", "working", "task-123")
# Mark as terminated before exit
set_agent_status("Node-X9J2", "terminated", None)
```
**Behavior:**
- Automatically sets `idle_since` timestamp when transitioning to idle
- Clears `idle_since` when transitioning to any other status
- Updates `last_activity` timestamp on every call
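A minimal sketch of this bookkeeping, assuming the `cluster_nodes` schema above (the helper name and `state` shape are illustrative, not the actual internals):
```python
import time

def _update_agent_status(state: dict, agent_id: str, status: str,
                         current_task_id: str | None) -> None:
    """Sketch of the set_agent_status bookkeeping described above."""
    node = state["cluster_nodes"][agent_id]
    now = time.time()
    if status == "idle" and node["status"] != "idle":
        node["idle_since"] = now       # start the idle clock on transition
    elif status != "idle":
        node["idle_since"] = None      # any other status clears it
    node["status"] = status
    node["current_task_id"] = current_task_id
    node["last_activity"] = now        # refreshed on every call
```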
### `claim_best_task(agent_id, role)`
Intelligently claims the highest-priority task for the agent's role.
**Scoring Algorithm:**
```python
import time

# Illustrative sketch of the scoring rules; the dict and function names are
# not the tool's actual internals. Assumes tasks carry a created_at timestamp.
PRIORITY_POINTS = {"high": 30, "medium": 20, "low": 10}
ROLE_KEYWORDS = {
    "architect": ["design", "architect", "plan"],
    "developer": ["implement", "code", "test"],
}

def score_task(task: dict, role: str) -> float:
    score = PRIORITY_POINTS.get(task.get("priority", "low"), 10)
    if any(kw in task.get("description", "").lower()
           for kw in ROLE_KEYWORDS.get(role, [])):
        score += 15  # role match bonus
    score += (time.time() - task["created_at"]) / 60.0  # +1 per minute of age
    return score
```
**Usage:**
```python
# Returns: "Claimed task 3: Implement user authentication"
result = claim_best_task("Node-X9J2", "developer")
```
**Advantages over manual `claim_task()`:**
- Prevents random task selection
- Prioritizes high-priority tasks
- Matches tasks to agent capabilities
- Ensures older tasks don't starve
### `assess_workload()`
Analyzes cluster state and generates spawn/terminate recommendations.
**Called by:** Bootstrap Manager every 20-30 seconds
**Returns:**
```
📊 Workload Assessment
========================================
Active Agents: 2/4
Idle Agents: 0
Pending Tasks: 5
In Progress: 2
Status: OVERLOADED
Recommendation: spawn_developer
```
**Decision Logic:**
```python
def recommend(pending_tasks: int, in_progress: int, working_count: int,
              idle_count: int, active_count: int, max_agents: int):
    """Return (workload_status, spawn_recommendation).

    active_count = all non-terminated nodes; working_count = nodes on a task.
    """
    if pending_tasks == 0 and in_progress == 0:
        return "idle", "terminate_idle"
    elif pending_tasks >= 3 and active_count < max_agents:
        return "overloaded", "spawn_developer"
    elif pending_tasks > 0 and working_count == 0 and active_count < max_agents:
        return "overloaded", "spawn_developer"
    elif pending_tasks <= 1 and active_count > 1 and idle_count > 0:
        return "underutilized", "terminate_idle"
    else:
        return "balanced", "maintain"
```
## Node Lifecycle
### Registration with Capacity Enforcement
The `register_node()` function now enforces the 4-node maximum:
```python
# Returns error if at capacity
register_node("Node-5TH", "developer", "claude-sonnet-4-5")
# β "ERROR: Cluster at capacity (4 nodes). Registration rejected."
# Successful registration shows capacity
register_node("Node-X9J2", "developer", "claude-sonnet-4-5")
# β "Node Node-X9J2 registered as developer (high). Active nodes: 3/4"
```
**Capacity Calculation:**
- Counts all nodes except those with `status: "terminated"`
- Allows re-registration of existing nodes
- Bootstrap Manager registrations are always allowed (graceful recovery)
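A sketch of that check, assuming the state shape above (the function and constant names are illustrative):
```python
MAX_AGENTS = 4  # config["cluster_settings"]["max_agents"]

def registration_allowed(state: dict, node_id: str, role: str) -> bool:
    """Apply the capacity rules listed above (sketch)."""
    nodes = state.get("cluster_nodes", {})
    if node_id in nodes:                # re-registration of an existing node
        return True
    if role == "bootstrap_manager":     # always allowed for graceful recovery
        return True
    active = sum(1 for n in nodes.values() if n["status"] != "terminated")
    return active < MAX_AGENTS
```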
### Spawning New Nodes
**Bootstrap Manager Protocol:**
1. Call `assess_workload()` every 20-30 seconds
2. Check `spawn_recommendation` field
3. If `spawn_developer` or `spawn_architect`:
- Verify `active_agent_count < max_agents`
- Broadcast spawn request message
- Ask user to open new Claude Code session
- User provides SYNAPSE_PROTOCOL.md prompt to new session
4. New node registers and claims tasks via `claim_best_task()`
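As a loop, the protocol might look like the following sketch; `read_state_dict`, `broadcast`, and `ask_user` are hypothetical wrappers, while `assess_workload()` is the MCP tool above:
```python
import time

ASSESSMENT_INTERVAL = 25  # cluster_settings["workload_assessment_interval"]

def manager_loop():
    while True:
        assess_workload()                          # step 1: refresh assessment
        state = read_state_dict()                  # hypothetical helper
        rec = state["work_distribution"]["spawn_recommendation"]
        meta = state["cluster_metadata"]
        if rec in ("spawn_developer", "spawn_architect") \
                and meta["active_agent_count"] < meta["max_agents"]:  # step 3
            role = rec.removeprefix("spawn_")
            broadcast(f"SPAWN REQUEST: Need {role} node for high workload")
            ask_user(f"Cluster is overloaded. Spawn a new {role} node?")
        time.sleep(ASSESSMENT_INTERVAL)
```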
**Example:**
```
Manager detects: 5 pending tasks, 2 active agents
assess_workload() returns: "overloaded", "spawn_developer"
Manager broadcasts: "SPAWN REQUEST: Need developer node for high workload"
Manager asks user: "Cluster is overloaded (5 pending, 2/4 nodes). Spawn new developer node?"
User opens new session β Provides SYNAPSE_PROTOCOL.md prompt
New node: register_node("Node-A7B3", "developer", "claude-sonnet-4-5")
New node: claim_best_task("Node-A7B3", "developer")
→ Cluster now 3/4 nodes, 4 pending tasks
```
### Idle Detection and Termination
**Worker Node Protocol (Non-Manager):**
After completing a task:
1. Call `catch_up()` to synchronize state
2. Check for pending tasks
3. If no pending tasks:
- Call `set_agent_status(agent_id, "idle", None)`
- Wait 30 seconds (grace period for new tasks)
- Re-check state
4. If still idle after grace period:
- Call `set_agent_status(agent_id, "terminated", None)`
- Broadcast: "Node {id} terminating due to lack of work"
- EXIT gracefully
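A sketch of this loop; `has_pending_tasks` and `broadcast` are hypothetical helpers, while `catch_up`, `set_agent_status`, and `claim_best_task` are the MCP tools described above:
```python
import time

IDLE_TIMEOUT = 30  # cluster_settings["idle_timeout_seconds"]

def after_task_completion(agent_id: str, role: str):
    catch_up()                                      # step 1: sync state
    if has_pending_tasks():                         # step 2
        return claim_best_task(agent_id, role)
    set_agent_status(agent_id, "idle", None)        # step 3: enter grace period
    time.sleep(IDLE_TIMEOUT)
    catch_up()
    if has_pending_tasks():                         # work arrived in time
        return claim_best_task(agent_id, role)
    set_agent_status(agent_id, "terminated", None)  # step 4: graceful exit
    broadcast(f"Node {agent_id} terminating due to lack of work")
    raise SystemExit(0)
```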
**Bootstrap Manager Never Auto-Terminates:**
- Managers coordinate cluster lifecycle
- Only terminate on explicit user command
- Remain active even when cluster is idle
**HeartbeatMonitor Cleanup:**
The background monitor (runs every 5 seconds) performs:
1. Check for nodes with `status: "idle"` and `idle_since` timestamp
2. Calculate idle duration
3. If idle > 60 seconds (extended grace period):
- Mark node as `status: "terminated"`
- Update `cluster_metadata.active_agent_count`
- Update `cluster_metadata.idle_agent_count`
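A sketch of that pass (the monitor's real internals may differ):
```python
import logging
import time

logger = logging.getLogger(__name__)
EXTENDED_GRACE = 60  # seconds of idleness before forced termination

def cleanup_idle_nodes(state: dict) -> None:
    """One monitor tick: terminate long-idle nodes, refresh counters."""
    now = time.time()
    nodes = state["cluster_nodes"]
    for node_id, node in nodes.items():
        if (node["status"] == "idle" and node.get("idle_since")
                and now - node["idle_since"] > EXTENDED_GRACE):
            node["status"] = "terminated"
            logger.info(f"Marked {node_id} as terminated (idle > 60s)")
    meta = state["cluster_metadata"]
    meta["active_agent_count"] = sum(
        n["status"] != "terminated" for n in nodes.values())
    meta["idle_agent_count"] = sum(
        n["status"] == "idle" for n in nodes.values())
```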
**Grace Periods:**
- Initial: 30 seconds (worker self-check)
- Extended: 60 seconds (monitor cleanup)
- Total: up to 60s from first going idle to forced cleanup (worker waits 30s, then the monitor allows another 30s)
### Status Transitions
```
pending    → working     (claim_best_task or claim_task)
working    → idle        (no pending tasks, set_agent_status)
idle       → working     (new task available, claim_best_task)
idle       → terminated  (grace period expired, self-exit or monitor)
working    → terminated  (crash, zombie detection)
terminated → [FINAL]     (no recovery, re-registration creates new entry)
```
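The table above can be enforced with a simple lookup; this sketch encodes exactly the listed transitions:
```python
# Allowed transitions from the table above; terminated is final.
VALID_TRANSITIONS = {
    "pending": {"working"},
    "working": {"idle", "terminated"},
    "idle": {"working", "terminated"},
    "terminated": set(),  # re-registration creates a new entry instead
}

def check_transition(old: str, new: str) -> None:
    if new not in VALID_TRANSITIONS.get(old, set()):
        raise ValueError(f"Illegal status transition: {old} -> {new}")
```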
## Configuration
### `src/amicus/config.py`
New `cluster_settings` block:
```python
"cluster_settings": {
"max_agents": 4, # Hard limit on concurrent nodes
"idle_timeout_seconds": 30, # Worker self-termination threshold
"manager_heartbeat_interval": 20, # Manager heartbeat frequency
"workload_assessment_interval": 25, # Manager assessment frequency
"grace_period_seconds": 30 # Monitor cleanup threshold (extended)
}
```
**Tuning Guidelines:**
- `max_agents`: Adjust based on infrastructure capacity (default: 4)
- `idle_timeout_seconds`: Lower = faster cleanup, higher = more grace for bursty work
- `grace_period_seconds`: Should be ≥ idle_timeout for safety margin
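A quick sanity check for these constraints, assuming the settings are loaded into a `config` dict shaped like the block above:
```python
cs = config["cluster_settings"]
assert cs["grace_period_seconds"] >= cs["idle_timeout_seconds"], \
    "monitor grace period should cover at least the worker idle timeout"
assert cs["max_agents"] <= 10, "file-backed locking scales poorly past ~10 nodes"
```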
## Integration with SYNAPSE_PROTOCOL.md
### STEP 1.5: Idle Detection (Inserted after STEP 1)
```markdown
**STEP 1.5: IDLE DETECTION & SELF-TERMINATION**
After calling `catch_up()`, evaluate cluster state for idle condition.
**Idle Condition:** ALL of the following must be true:
- No tasks with `status: "pending"` in `next_steps`
- You have been idle (no claimed tasks) for > 30 seconds
- `work_distribution.workload_status` is "idle" or "underutilized"
- You are NOT the Bootstrap Manager (role != "bootstrap_manager")
**Action on Idle:**
1. Call `set_agent_status(Your_ID, "idle", null)`
2. Wait 30 seconds (grace period for new tasks)
3. Re-check state with `catch_up()`
4. If still idle:
- Call `set_agent_status(Your_ID, "terminated", null)`
- Broadcast: "Node {Your_ID} terminating due to lack of work"
- EXIT gracefully
**Exception:** Bootstrap Managers NEVER self-terminate.
```
## Integration with BOOTSTRAP_MANAGER.md
### Section 4: Workload Management & Node Lifecycle
New sections added:
- **4.1 Continuous Assessment**: Call `assess_workload()` every 20-30s
- **4.2 Node Spawning Protocol**: When to spawn, how to request from user
- **4.3 Node Termination Protocol**: Signal idle nodes to terminate
- **4.4 Max Agents Enforcement**: Never exceed capacity, reject spawns
- **4.5 Intelligent Task Distribution**: Use `claim_best_task()` for optimal allocation
## Testing
### Unit Tests
Run the test suite:
```bash
python test_anti_idle.py
```
**Tests:**
1. Cluster metadata initialization from config
2. State schema validation (cluster_nodes, cluster_metadata, work_distribution)
3. Idle detection logic (grace periods, role exceptions)
### Integration Testing Scenarios
#### Scenario 1: Workload Scaling Up
```
Initial: 1 Bootstrap Manager, 10 tasks
Manager calls assess_workload()
→ Status: overloaded, Recommendation: spawn_developer
User spawns Developer-1
Developer-1 claims 3 tasks (via claim_best_task)
Manager reassesses
→ Status: balanced, Recommendation: maintain
```
#### Scenario 2: Workload Scaling Down
```
Current: 1 Manager + 3 Developers, 2 tasks remaining
Developer-1 completes task, no pending work
Developer-1 marks status: idle
Developer-1 waits 30 seconds
Developer-1 marks status: terminated, exits
Developer-2 completes task, no pending work
Developer-2 marks status: idle
Developer-2 waits 30 seconds
Developer-2 marks status: terminated, exits
Final: 1 Manager + 1 Developer (last developer finishing final task)
```
#### Scenario 3: Max Capacity Enforcement
```
Current: 4 active nodes (1 Manager + 3 Developers)
User tries to spawn Developer-4
register_node("Node-D4", "developer", "model")
β "ERROR: Cluster at capacity (4 nodes). Registration rejected."
Developer-1 completes work and terminates
β 3 active nodes
User spawns Developer-4
register_node("Node-D4", "developer", "model")
β "Node Node-D4 registered as developer (high). Active nodes: 4/4"
```
## Monitoring and Observability
### Enhanced `read_state()` Output
The `read_state()` MCP tool now displays:
```
📊 Context Bus State
==================================================
**Summary:**
[Project summary]
**Next Steps:**
1. Implement authentication [pending] (priority: high)
2. Write tests [in_progress] (assigned to Node-X9J2)
**Active Cluster Swarm:**
🟢 Node-BM-A1B2: bootstrap_manager [working] (claude-sonnet-4-5 - high)
🟢 Node-X9J2: developer [working] (claude-sonnet-4-5 - high) | Task: task-123
🟢 Node-Z3F7: developer [idle] (claude-haiku - medium)
**Cluster Metadata:**
Active Nodes: 3/4
Idle Nodes: 1
Pending Tasks: 1
Manager: Node-BM-A1B2
**Work Distribution:**
Status: BALANCED
Recommendation: maintain
Last Assessment: 15.2s ago
**Last Updated:** 2.3 seconds ago
**Node Heartbeat:** 🟢 ACTIVE (1.8s ago)
```
### HeartbeatMonitor Logs
The monitor thread logs idle cleanup actions:
```python
logger.info(f"Marked {agent_id} as terminated (idle > 60s)")
```
## Error Handling
### Race Conditions
**Scenario:** Two nodes register simultaneously when count = 3
**Protection:**
- `read_with_lock()` and `write_with_lock()` use file locking
- Atomic read-modify-write cycle
- Second registration sees updated count and is rejected
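A minimal sketch of such an atomic cycle using POSIX `fcntl` locking (the project's actual `read_with_lock`/`write_with_lock` helpers may differ):
```python
import fcntl
import json

def read_modify_write(path: str, mutate) -> None:
    """Hold an exclusive lock across the full read-modify-write cycle."""
    with open(path, "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)   # blocks until the lock is free
        state = json.load(f)
        mutate(state)                   # e.g. add a node if under capacity
        f.seek(0)
        f.truncate()
        json.dump(state, f)
    # lock is released when the file is closed
```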
**Test:**
```python
# In parallel threads:
# Thread-1: register_node("Node-A", ...)  → success, count = 4
# Thread-2: register_node("Node-B", ...)  → rejected, count still 4
```
### Manager Crash Recovery
**Scenario:** Bootstrap Manager crashes during assessment
**Protection:**
- HeartbeatMonitor detects missing manager (zombie detection)
- Creates recovery task in `next_steps`
- Sets `ask_user: True` to prompt human intervention
- User spawns new Bootstrap Manager session
- New manager resumes from last state in context bus
### Premature Termination
**Scenario:** Node terminates while new tasks are being added
**Protection:**
- 30-second grace period (worker) + 30-second monitor delay
- `work_distribution.workload_status` check prevents termination during overload
- Manager broadcasts termination signal, not a hard kill
- Worker can claim task during grace period and cancel termination
## Performance Characteristics
### Resource Efficiency
**Memory:** Minimal overhead (~1KB per node in state file)
**CPU:** Negligible (state reads are O(1), assessments are O(n) where n β€ 4)
**I/O:** File locking prevents contention, typical lock hold time < 10ms
### Latency
**Task Claiming:** < 50ms (scoring + state update)
**Workload Assessment:** < 100ms (count aggregations)
**Idle Detection:** 30-60s grace period (intentional delay for stability)
### Scalability
**Current Limit:** 4 nodes (configurable via `max_agents`)
**Recommended Max:** 10 nodes (file locking scales poorly beyond this)
**Future:** Migrate to Redis or database for > 10 nodes
## Future Enhancements
### Planned Improvements
1. **Dynamic Max Agents**: Adjust capacity based on system load
2. **Priority-Based Eviction**: Terminate low-priority nodes first
3. **Task Affinity**: Pin certain task types to specific nodes
4. **Predictive Spawning**: Spawn nodes based on task pipeline forecasts
5. **Metrics Export**: Prometheus/StatsD integration for monitoring
### Not Planned
1. **Auto-Spawn Without User Confirmation**: Too risky (cost, runaway spawning)
2. **Sub-1s Grace Periods**: Risk of thrashing and false positives
3. **Cross-Machine Coordination**: Out of scope for current architecture
## Troubleshooting
### Nodes Not Terminating
**Symptom:** Idle nodes remain active indefinitely
**Diagnosis:**
1. Check `read_state()` β verify `status: "idle"` and `idle_since` timestamp
2. Check grace period: `time.time() - idle_since > 30`?
3. Check role: Bootstrap Managers never auto-terminate
4. Check HeartbeatMonitor: Is it running? (Should be automatic)
**Fix:**
- Manually mark as terminated: `set_agent_status(node_id, "terminated", None)`
- Restart HeartbeatMonitor if not running
### Cluster at Capacity Prematurely
**Symptom:** Registration rejected but fewer than 4 visible nodes
**Diagnosis:**
1. Check `read_state()` β Count nodes with `status != "terminated"`
2. Zombie nodes (crashed without cleanup) count toward capacity
3. Check `cluster_metadata.active_agent_count`
**Fix:**
- Run HeartbeatMonitor to clean up zombies
- Manually update state to mark zombies as terminated
- Wait for monitor to auto-cleanup (runs every 5s)
### Manager Not Assessing Workload
**Symptom:** No spawn recommendations despite high pending tasks
**Diagnosis:**
1. Check `work_distribution.last_assessment` timestamp
2. Verify Manager is calling `assess_workload()` every 20-30s
3. Check Manager's BOOTSTRAP_MANAGER.md prompt compliance
**Fix:**
- Re-launch Bootstrap Manager with updated BOOTSTRAP_MANAGER.md prompt
- Manually call `assess_workload()` from Manager session
## Best Practices
1. **Always Use `claim_best_task()`**: Prevents inefficient random task selection
2. **Manager Heartbeat Regularity**: Call every 20-30s, not faster (avoid spam)
3. **Grace Period Respect**: Don't terminate before 30s, avoid thrashing
4. **Spawn Conservatively**: Only spawn when `pending_tasks >= 3`
5. **Monitor Logs**: Review HeartbeatMonitor output for cleanup events
6. **Test at Capacity**: Regularly test 4-node limit enforcement
7. **Bootstrap Manager Immortality**: Never code auto-termination for managers
## References
- [SYNAPSE_PROTOCOL.md](../prompts/SYNAPSE_PROTOCOL.md) - Worker node lifecycle
- [BOOTSTRAP_MANAGER.md](../prompts/BOOTSTRAP_MANAGER.md) - Manager coordination
- [ARCHITECTURE.md](./ARCHITECTURE.md) - Overall system design
- [MCP_TOOLS.md](./MCP_TOOLS.md) - Complete tool reference