# Anti-Idle System for Amicus Synapse Network
## Overview
The anti-idle system prevents Synapse nodes from lingering idle by combining intelligent workload assessment, automatic node lifecycle management, and graceful termination protocols. It enforces a hard maximum of 4 concurrent nodes while keeping resource utilization efficient.
## Architecture
### State Schema Enhancements
The anti-idle system extends the Amicus state with three new top-level structures:
#### 1. Enhanced `cluster_nodes`
Each node now tracks additional lifecycle metadata:
```json
{
"Node-ID": {
"role": "bootstrap_manager | architect | developer",
"model": {"name": "...", "strength": "..."},
"last_heartbeat": 1234567890.0,
"status": "working | idle | waiting | terminated",
"current_task_id": "task-123" | null,
"idle_since": 1234567890.0 | null,
"last_activity": 1234567890.0
}
}
```
**Status States:**
- `working`: Actively executing a task
- `idle`: No tasks available, within grace period
- `waiting`: Short-term wait (< 10s) for state updates
- `terminated`: Gracefully exited, no longer active
#### 2. `cluster_metadata`
Tracks cluster-wide statistics:
```json
{
"max_agents": 4,
"active_agent_count": 2,
"idle_agent_count": 0,
"pending_task_count": 5,
"manager_id": "Node-BM-X1Y2"
}
```
#### 3. `work_distribution`
Provides workload intelligence for Bootstrap Manager:
```json
{
"last_assessment": 1234567890.0,
"workload_status": "overloaded | balanced | underutilized | idle",
"spawn_recommendation": "spawn_developer | spawn_architect | terminate_idle | maintain"
}
```
**Workload Status Definitions:**
- `idle`: No pending or in-progress tasks
- `underutilized`: Few pending tasks, multiple idle nodes
- `balanced`: Optimal task-to-node ratio
- `overloaded`: Many pending tasks, all nodes busy or approaching capacity
## MCP Tools
### `set_agent_status(agent_id, status, current_task_id)`
Updates an agent's status and tracks idle transitions.
**Usage:**
```python
# Mark as idle when no work available
set_agent_status("Node-X9J2", "idle", None)
# Mark as working when claiming a task
set_agent_status("Node-X9J2", "working", "task-123")
# Mark as terminated before exit
set_agent_status("Node-X9J2", "terminated", None)
```
**Behavior:**
- Automatically sets `idle_since` timestamp when transitioning to idle
- Clears `idle_since` when transitioning to any other status
- Updates `last_activity` timestamp on every call
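A minimal sketch of this bookkeeping, assuming the `cluster_nodes` schema above (the helper name and `state` shape are illustrative, not the actual internals):
```python
import time

def _update_agent_status(state: dict, agent_id: str, status: str,
                         current_task_id: str | None) -> None:
    """Sketch of the set_agent_status bookkeeping described above."""
    node = state["cluster_nodes"][agent_id]
    now = time.time()
    if status == "idle" and node["status"] != "idle":
        node["idle_since"] = now       # start the idle clock on transition
    elif status != "idle":
        node["idle_since"] = None      # any other status clears it
    node["status"] = status
    node["current_task_id"] = current_task_id
    node["last_activity"] = now        # refreshed on every call
```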
### `claim_best_task(agent_id, role)`
Intelligently claims the highest-priority task for the agent's role.
**Scoring Algorithm:**
```python
import time

# Illustrative sketch of the scoring rules; the dict and function names are
# not the tool's actual internals. Assumes tasks carry a created_at timestamp.
PRIORITY_POINTS = {"high": 30, "medium": 20, "low": 10}
ROLE_KEYWORDS = {
    "architect": ["design", "architect", "plan"],
    "developer": ["implement", "code", "test"],
}

def score_task(task: dict, role: str) -> float:
    score = PRIORITY_POINTS.get(task.get("priority", "low"), 10)
    if any(kw in task.get("description", "").lower()
           for kw in ROLE_KEYWORDS.get(role, [])):
        score += 15  # role match bonus
    score += (time.time() - task["created_at"]) / 60.0  # +1 per minute of age
    return score
```
**Usage:**
```python
# Returns: "Claimed task 3: Implement user authentication"
result = claim_best_task("Node-X9J2", "developer")
```
**Advantages over manual `claim_task()`:**
- Prevents random task selection
- Prioritizes high-priority tasks
- Matches tasks to agent capabilities
- Ensures older tasks don't starve
### `assess_workload()`
Analyzes cluster state and generates spawn/terminate recommendations.
**Called by:** Bootstrap Manager every 20-30 seconds
**Returns:**
```
📊 Workload Assessment
========================================
Active Agents: 2/4
Idle Agents: 0
Pending Tasks: 5
In Progress: 2
Status: OVERLOADED
Recommendation: spawn_developer
```
**Decision Logic:**
```python
def recommend(pending_tasks: int, in_progress: int, working_count: int,
              idle_count: int, active_count: int, max_agents: int):
    """Return (workload_status, spawn_recommendation).

    active_count = all non-terminated nodes; working_count = nodes on a task.
    """
    if pending_tasks == 0 and in_progress == 0:
        return "idle", "terminate_idle"
    elif pending_tasks >= 3 and active_count < max_agents:
        return "overloaded", "spawn_developer"
    elif pending_tasks > 0 and working_count == 0 and active_count < max_agents:
        return "overloaded", "spawn_developer"
    elif pending_tasks <= 1 and active_count > 1 and idle_count > 0:
        return "underutilized", "terminate_idle"
    else:
        return "balanced", "maintain"
```
## Node Lifecycle
### Registration with Capacity Enforcement
The `register_node()` function now enforces the 4-node maximum:
```python
# Returns error if at capacity
register_node("Node-5TH", "developer", "claude-sonnet-4-5")
# β "ERROR: Cluster at capacity (4 nodes). Registration rejected."
# Successful registration shows capacity
register_node("Node-X9J2", "developer", "claude-sonnet-4-5")
# β "Node Node-X9J2 registered as developer (high). Active nodes: 3/4"
```
**Capacity Calculation:**
- Counts all nodes except those with `status: "terminated"`
- Allows re-registration of existing nodes
- Bootstrap Manager registrations are always allowed (graceful recovery)
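A sketch of that check, assuming the state shape above (the function and constant names are illustrative):
```python
MAX_AGENTS = 4  # config["cluster_settings"]["max_agents"]

def registration_allowed(state: dict, node_id: str, role: str) -> bool:
    """Apply the capacity rules listed above (sketch)."""
    nodes = state.get("cluster_nodes", {})
    if node_id in nodes:                # re-registration of an existing node
        return True
    if role == "bootstrap_manager":     # always allowed for graceful recovery
        return True
    active = sum(1 for n in nodes.values() if n["status"] != "terminated")
    return active < MAX_AGENTS
```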
### Spawning New Nodes
**Bootstrap Manager Protocol:**
1. Call `assess_workload()` every 20-30 seconds
2. Check `spawn_recommendation` field
3. If `spawn_developer` or `spawn_architect`:
- Verify `active_agent_count < max_agents`
- Broadcast spawn request message
- Ask user to open new Claude Code session
- User provides SYNAPSE_PROTOCOL.md prompt to new session
4. New node registers and claims tasks via `claim_best_task()`
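As a loop, the protocol might look like the following sketch; `read_state_dict`, `broadcast`, and `ask_user` are hypothetical wrappers, while `assess_workload()` is the MCP tool above:
```python
import time

ASSESSMENT_INTERVAL = 25  # cluster_settings["workload_assessment_interval"]

def manager_loop():
    while True:
        assess_workload()                          # step 1: refresh assessment
        state = read_state_dict()                  # hypothetical helper
        rec = state["work_distribution"]["spawn_recommendation"]
        meta = state["cluster_metadata"]
        if rec in ("spawn_developer", "spawn_architect") \
                and meta["active_agent_count"] < meta["max_agents"]:  # step 3
            role = rec.removeprefix("spawn_")
            broadcast(f"SPAWN REQUEST: Need {role} node for high workload")
            ask_user(f"Cluster is overloaded. Spawn a new {role} node?")
        time.sleep(ASSESSMENT_INTERVAL)
```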
**Example:**
```
Manager detects: 5 pending tasks, 2 active agents
assess_workload() returns: "overloaded", "spawn_developer"
Manager broadcasts: "SPAWN REQUEST: Need developer node for high workload"
Manager asks user: "Cluster is overloaded (5 pending, 2/4 nodes). Spawn new developer node?"
User opens new session β Provides SYNAPSE_PROTOCOL.md prompt
New node: register_node("Node-A7B3", "developer", "claude-sonnet-4-5")
New node: claim_best_task("Node-A7B3", "developer")
→ Cluster now 3/4 nodes, 4 pending tasks
```
### Idle Detection and Termination
**Worker Node Protocol (Non-Manager):**
After completing a task:
1. Call `catch_up()` to synchronize state
2. Check for pending tasks
3. If no pending tasks:
- Call `set_agent_status(agent_id, "idle", None)`
- Wait 30 seconds (grace period for new tasks)
- Re-check state
4. If still idle after grace period:
- Call `set_agent_status(agent_id, "terminated", None)`
- Broadcast: "Node {id} terminating due to lack of work"
- EXIT gracefully
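A sketch of this loop; `has_pending_tasks` and `broadcast` are hypothetical helpers, while `catch_up`, `set_agent_status`, and `claim_best_task` are the MCP tools described above:
```python
import time

IDLE_TIMEOUT = 30  # cluster_settings["idle_timeout_seconds"]

def after_task_completion(agent_id: str, role: str):
    catch_up()                                      # step 1: sync state
    if has_pending_tasks():                         # step 2
        return claim_best_task(agent_id, role)
    set_agent_status(agent_id, "idle", None)        # step 3: enter grace period
    time.sleep(IDLE_TIMEOUT)
    catch_up()
    if has_pending_tasks():                         # work arrived in time
        return claim_best_task(agent_id, role)
    set_agent_status(agent_id, "terminated", None)  # step 4: graceful exit
    broadcast(f"Node {agent_id} terminating due to lack of work")
    raise SystemExit(0)
```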
**Bootstrap Manager Never Auto-Terminates:**
- Managers coordinate cluster lifecycle
- Only terminate on explicit user command
- Remain active even when cluster is idle
**HeartbeatMonitor Cleanup:**
The background monitor (runs every 5 seconds) performs:
1. Check for nodes with `status: "idle"` and `idle_since` timestamp
2. Calculate idle duration
3. If idle > 60 seconds (extended grace period):
- Mark node as `status: "terminated"`
- Update `cluster_metadata.active_agent_count`
- Update `cluster_metadata.idle_agent_count`
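A sketch of that pass (the monitor's real internals may differ):
```python
import logging
import time

logger = logging.getLogger(__name__)
EXTENDED_GRACE = 60  # seconds of idleness before forced termination

def cleanup_idle_nodes(state: dict) -> None:
    """One monitor tick: terminate long-idle nodes, refresh counters."""
    now = time.time()
    nodes = state["cluster_nodes"]
    for node_id, node in nodes.items():
        if (node["status"] == "idle" and node.get("idle_since")
                and now - node["idle_since"] > EXTENDED_GRACE):
            node["status"] = "terminated"
            logger.info(f"Marked {node_id} as terminated (idle > 60s)")
    meta = state["cluster_metadata"]
    meta["active_agent_count"] = sum(
        n["status"] != "terminated" for n in nodes.values())
    meta["idle_agent_count"] = sum(
        n["status"] == "idle" for n in nodes.values())
```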
**Grace Periods:**
- Initial: 30 seconds (worker self-check)
- Extended: 60 seconds (monitor cleanup)
- Total: up to 60s from first going idle to forced cleanup (worker waits 30s, then the monitor allows another 30s)
### Status Transitions
```
pending    → working     (claim_best_task or claim_task)
working    → idle        (no pending tasks, set_agent_status)
idle       → working     (new task available, claim_best_task)
idle       → terminated  (grace period expired, self-exit or monitor)
working    → terminated  (crash, zombie detection)
terminated → [FINAL]     (no recovery, re-registration creates new entry)
```
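The table above can be enforced with a simple lookup; this sketch encodes exactly the listed transitions:
```python
# Allowed transitions from the table above; terminated is final.
VALID_TRANSITIONS = {
    "pending": {"working"},
    "working": {"idle", "terminated"},
    "idle": {"working", "terminated"},
    "terminated": set(),  # re-registration creates a new entry instead
}

def check_transition(old: str, new: str) -> None:
    if new not in VALID_TRANSITIONS.get(old, set()):
        raise ValueError(f"Illegal status transition: {old} -> {new}")
```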
## Configuration
### `src/amicus/config.py`
New `cluster_settings` block:
```python
"cluster_settings": {
"max_agents": 4, # Hard limit on concurrent nodes
"idle_timeout_seconds": 30, # Worker self-termination threshold
"manager_heartbeat_interval": 20, # Manager heartbeat frequency
"workload_assessment_interval": 25, # Manager assessment frequency
"grace_period_seconds": 30 # Monitor cleanup threshold (extended)
}
```
**Tuning Guidelines:**
- `max_agents`: Adjust based on infrastructure capacity (default: 4)
- `idle_timeout_seconds`: Lower = faster cleanup, higher = more grace for bursty work
- `grace_period_seconds`: Should be ≥ idle_timeout for safety margin
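A quick sanity check for these constraints, assuming the settings are loaded into a `config` dict shaped like the block above:
```python
cs = config["cluster_settings"]
assert cs["grace_period_seconds"] >= cs["idle_timeout_seconds"], \
    "monitor grace period should cover at least the worker idle timeout"
assert cs["max_agents"] <= 10, "file-backed locking scales poorly past ~10 nodes"
```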
## Integration with SYNAPSE_PROTOCOL.md
### STEP 1.5: Idle Detection (Inserted after STEP 1)
```markdown
**STEP 1.5: IDLE DETECTION & SELF-TERMINATION**
After calling `catch_up()`, evaluate cluster state for idle condition.
**Idle Condition:** ALL of the following must be true:
- No tasks with `status: "pending"` in `next_steps`
- You have been idle (no claimed tasks) for > 30 seconds
- `work_distribution.workload_status` is "idle" or "underutilized"
- You are NOT the Bootstrap Manager (role != "bootstrap_manager")
**Action on Idle:**
1. Call `set_agent_status(Your_ID, "idle", null)`
2. Wait 30 seconds (grace period for new tasks)
3. Re-check state with `catch_up()`
4. If still idle:
- Call `set_agent_status(Your_ID, "terminated", null)`
- Broadcast: "Node {Your_ID} terminating due to lack of work"
- EXIT gracefully
**Exception:** Bootstrap Managers NEVER self-terminate.
```
## Integration with BOOTSTRAP_MANAGER.md
### Section 4: Workload Management & Node Lifecycle
New sections added:
- **4.1 Continuous Assessment**: Call `assess_workload()` every 20-30s
- **4.2 Node Spawning Protocol**: When to spawn, how to request from user
- **4.3 Node Termination Protocol**: Signal idle nodes to terminate
- **4.4 Max Agents Enforcement**: Never exceed capacity, reject spawns
- **4.5 Intelligent Task Distribution**: Use `claim_best_task()` for optimal allocation
## Testing
### Unit Tests
Run the test suite:
```bash
python test_anti_idle.py
```
**Tests:**
1. Cluster metadata initialization from config
2. State schema validation (cluster_nodes, cluster_metadata, work_distribution)
3. Idle detection logic (grace periods, role exceptions)
### Integration Testing Scenarios
#### Scenario 1: Workload Scaling Up
```
Initial: 1 Bootstrap Manager, 10 tasks
Manager calls assess_workload()
→ Status: overloaded, Recommendation: spawn_developer
User spawns Developer-1
Developer-1 claims 3 tasks (via claim_best_task)
Manager reassesses
→ Status: balanced, Recommendation: maintain
```
#### Scenario 2: Workload Scaling Down
```
Current: 1 Manager + 3 Developers, 2 tasks remaining
Developer-1 completes task, no pending work
Developer-1 marks status: idle
Developer-1 waits 30 seconds
Developer-1 marks status: terminated, exits
Developer-2 completes task, no pending work
Developer-2 marks status: idle
Developer-2 waits 30 seconds
Developer-2 marks status: terminated, exits
Final: 1 Manager + 1 Developer (last developer finishing final task)
```
#### Scenario 3: Max Capacity Enforcement
```
Current: 4 active nodes (1 Manager + 3 Developers)
User tries to spawn Developer-4
register_node("Node-D4", "developer", "model")
β "ERROR: Cluster at capacity (4 nodes). Registration rejected."
Developer-1 completes work and terminates
β 3 active nodes
User spawns Developer-4
register_node("Node-D4", "developer", "model")
β "Node Node-D4 registered as developer (high). Active nodes: 4/4"
```
## Monitoring and Observability
### Enhanced `read_state()` Output
The `read_state()` MCP tool now displays:
```
📊 Context Bus State
==================================================
**Summary:**
[Project summary]
**Next Steps:**
1. Implement authentication [pending] (priority: high)
2. Write tests [in_progress] (assigned to Node-X9J2)
**Active Cluster Swarm:**
🟢 Node-BM-A1B2: bootstrap_manager [working] (claude-sonnet-4-5 - high)
🟢 Node-X9J2: developer [working] (claude-sonnet-4-5 - high) | Task: task-123
🟢 Node-Z3F7: developer [idle] (claude-haiku - medium)
**Cluster Metadata:**
Active Nodes: 3/4
Idle Nodes: 1
Pending Tasks: 1
Manager: Node-BM-A1B2
**Work Distribution:**
Status: BALANCED
Recommendation: maintain
Last Assessment: 15.2s ago
**Last Updated:** 2.3 seconds ago
**Node Heartbeat:** 🟢 ACTIVE (1.8s ago)
```
### HeartbeatMonitor Logs
The monitor thread logs idle cleanup actions:
```python
logger.info(f"Marked {agent_id} as terminated (idle > 60s)")
```
## Error Handling
### Race Conditions
**Scenario:** Two nodes register simultaneously when count = 3
**Protection:**
- `read_with_lock()` and `write_with_lock()` use file locking
- Atomic read-modify-write cycle
- Second registration sees updated count and is rejected
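A minimal sketch of such an atomic cycle using POSIX `fcntl` locking (the project's actual `read_with_lock`/`write_with_lock` helpers may differ):
```python
import fcntl
import json

def read_modify_write(path: str, mutate) -> None:
    """Hold an exclusive lock across the full read-modify-write cycle."""
    with open(path, "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)   # blocks until the lock is free
        state = json.load(f)
        mutate(state)                   # e.g. add a node if under capacity
        f.seek(0)
        f.truncate()
        json.dump(state, f)
    # lock is released when the file is closed
```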
**Test:**
```python
# In parallel threads:
# Thread-1: register_node("Node-A", ...)  → success, count = 4
# Thread-2: register_node("Node-B", ...)  → rejected, count still 4
```
### Manager Crash Recovery
**Scenario:** Bootstrap Manager crashes during assessment
**Protection:**
- HeartbeatMonitor detects missing manager (zombie detection)
- Creates recovery task in `next_steps`
- Sets `ask_user: True` to prompt human intervention
- User spawns new Bootstrap Manager session
- New manager resumes from last state in context bus
### Premature Termination
**Scenario:** Node terminates while new tasks are being added
**Protection:**
- 30-second grace period (worker) + 30-second monitor delay
- `work_distribution.workload_status` check prevents termination during overload
- Manager broadcasts termination signal, not a hard kill
- Worker can claim task during grace period and cancel termination
## Performance Characteristics
### Resource Efficiency
**Memory:** Minimal overhead (~1KB per node in state file)
**CPU:** Negligible (state reads are O(1), assessments are O(n) where n β€ 4)
**I/O:** File locking prevents contention, typical lock hold time < 10ms
### Latency
**Task Claiming:** < 50ms (scoring + state update)
**Workload Assessment:** < 100ms (count aggregations)
**Idle Detection:** 30-60s grace period (intentional delay for stability)
### Scalability
**Current Limit:** 4 nodes (configurable via `max_agents`)
**Recommended Max:** 10 nodes (file locking scales poorly beyond this)
**Future:** Migrate to Redis or database for > 10 nodes
## Future Enhancements
### Planned Improvements
1. **Dynamic Max Agents**: Adjust capacity based on system load
2. **Priority-Based Eviction**: Terminate low-priority nodes first
3. **Task Affinity**: Pin certain task types to specific nodes
4. **Predictive Spawning**: Spawn nodes based on task pipeline forecasts
5. **Metrics Export**: Prometheus/StatsD integration for monitoring
### Not Planned
1. **Auto-Spawn Without User Confirmation**: Too risky (cost, runaway spawning)
2. **Sub-1s Grace Periods**: Risk of thrashing and false positives
3. **Cross-Machine Coordination**: Out of scope for current architecture
## Troubleshooting
### Nodes Not Terminating
**Symptom:** Idle nodes remain active indefinitely
**Diagnosis:**
1. Check `read_state()` β verify `status: "idle"` and `idle_since` timestamp
2. Check grace period: `time.time() - idle_since > 30`?
3. Check role: Bootstrap Managers never auto-terminate
4. Check HeartbeatMonitor: Is it running? (Should be automatic)
**Fix:**
- Manually mark as terminated: `set_agent_status(node_id, "terminated", None)`
- Restart HeartbeatMonitor if not running
### Cluster at Capacity Prematurely
**Symptom:** Registration rejected but fewer than 4 visible nodes
**Diagnosis:**
1. Check `read_state()` β Count nodes with `status != "terminated"`
2. Zombie nodes (crashed without cleanup) count toward capacity
3. Check `cluster_metadata.active_agent_count`
**Fix:**
- Run HeartbeatMonitor to clean up zombies
- Manually update state to mark zombies as terminated
- Wait for monitor to auto-cleanup (runs every 5s)
### Manager Not Assessing Workload
**Symptom:** No spawn recommendations despite high pending tasks
**Diagnosis:**
1. Check `work_distribution.last_assessment` timestamp
2. Verify Manager is calling `assess_workload()` every 20-30s
3. Check Manager's BOOTSTRAP_MANAGER.md prompt compliance
**Fix:**
- Re-launch Bootstrap Manager with updated BOOTSTRAP_MANAGER.md prompt
- Manually call `assess_workload()` from Manager session
## Best Practices
1. **Always Use `claim_best_task()`**: Prevents inefficient random task selection
2. **Manager Heartbeat Regularity**: Call every 20-30s, not faster (avoid spam)
3. **Grace Period Respect**: Don't terminate before 30s, avoid thrashing
4. **Spawn Conservatively**: Only spawn when `pending_tasks >= 3`
5. **Monitor Logs**: Review HeartbeatMonitor output for cleanup events
6. **Test at Capacity**: Regularly test 4-node limit enforcement
7. **Bootstrap Manager Immortality**: Never code auto-termination for managers
## References
- [SYNAPSE_PROTOCOL.md](../prompts/SYNAPSE_PROTOCOL.md) - Worker node lifecycle
- [BOOTSTRAP_MANAGER.md](../prompts/BOOTSTRAP_MANAGER.md) - Manager coordination
- [ARCHITECTURE.md](./ARCHITECTURE.md) - Overall system design
- [MCP_TOOLS.md](./MCP_TOOLS.md) - Complete tool reference