# VRAM Management Guide
## Current Status ✅
**VRAM management is now active and protecting your GPU!**
```
Current Implementation:
- Strategy: Ultra-simple LRU cache eviction
- Method: Load SD1.5 (2GB) to evict FLUX (11GB)
- Monitoring: Every 60 seconds via /system_stats
- Protection: Emergency mode at 95% VRAM usage
- Status: Active and running in production
- FP8 Text Encoder: Enabled (saves ~2.4GB VRAM)
- Reserved VRAM: 1GB for OS/system safety
```
### Current Memory Usage
- **System RAM:** 13GB available (19GB total)
- **GPU VRAM:** ~5GB used, 19GB free (24GB total)
- **Swap:** Minimal usage (healthy)
## Core Principles
**PROTECT THE GPU ABOVE ALL ELSE** - Service disruption is acceptable, GPU damage is not.
1. **Think strategically** - Prevent issues before they occur
2. **Handle edge cases gracefully** - Never crash, always have fallbacks
3. **Plan for the long term** - Track patterns, predict problems
4. **Follow best practices** - Use proven patterns, conservative thresholds
5. **Do no harm** - Refuse operations rather than risk hardware
6. **Trust but verify** - Validate all operations and results
7. **Protect the GPU above all else** - Even at the cost of service disruption
## Implementation Overview
The system uses ComfyUI's built-in LRU (Least Recently Used) cache mechanism to manage VRAM naturally:
1. **ComfyUI Configuration**: `--cache-lru 3` (small cache for aggressive eviction)
2. **Eviction Strategy**: Loading smaller models pushes larger ones out of cache
3. **No Core Modifications**: Works with stock ComfyUI
## Active VRAM Manager
### Simple VRAM Manager (Currently Running)
**Status:** Active in the MCP server
```javascript
import { setupVRAMManagement } from './src/services/simple-vram-manager.js';
const vramManager = setupVRAMManagement(comfyuiClient, {
  warningThreshold: 0.70,    // Start monitoring closely
  criticalThreshold: 0.80,   // Increase monitoring frequency
  cleanupThreshold: 0.85,    // Trigger cleanup
  emergencyThreshold: 0.95,  // Refuse operations
  refuseOnEmergency: true,   // GPU protection active
  verifyCleanup: true,       // Validates cleanup worked
  preRequestCheck: true,     // Checks before FLUX operations
  defaultStrategy: 'sd15',   // Uses SD1.5 for eviction
  idleTimeout: 30 * 60 * 1000 // 30 minutes idle cleanup
});
```
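The thresholds above form an escalation ladder: each level tightens monitoring until operations are refused outright. The sketch below shows that decision logic in isolation; the function name `classifyUsage` and its returned labels are illustrative, not part of the manager's actual API:

```javascript
// Illustrative escalation logic for the thresholds above.
// Label names are assumptions; the real manager's internals may differ.
const thresholds = {
  warning: 0.70,
  critical: 0.80,
  cleanup: 0.85,
  emergency: 0.95,
};

function classifyUsage(usage, t = thresholds) {
  if (usage >= t.emergency) return 'emergency'; // refuse new operations
  if (usage >= t.cleanup) return 'cleanup';     // trigger SD1.5 eviction
  if (usage >= t.critical) return 'critical';   // poll more frequently
  if (usage >= t.warning) return 'warning';     // watch closely
  return 'ok';
}
```

For example, `classifyUsage(0.87)` returns `'cleanup'`, the level at which the manager runs the SD1.5 eviction workflow.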
### GPU-Safe VRAM Manager (Available for High-Risk Environments)
**Use when:** running expensive hardware, dealing with poor cooling, or needing temperature monitoring
**Status:** Available but not currently active
```javascript
import { setupGPUSafeVRAMManagement } from './src/services/gpu-safe-vram-manager.js';
const vramManager = setupGPUSafeVRAMManagement(comfyuiClient, {
  safetyThresholds: {
    VRAM_SAFE: 0.65,      // Even more conservative
    VRAM_WARNING: 0.70,
    VRAM_CRITICAL: 0.80,
    VRAM_EMERGENCY: 0.90,
    VRAM_SHUTDOWN: 0.95,
    TEMP_WARNING: 75,     // Temperature monitoring (°C)
    TEMP_CRITICAL: 83,
    TEMP_EMERGENCY: 87,
    TEMP_SHUTDOWN: 90
  },
  autoProtection: true,
  refuseOnHighVRAM: true,
  verifyCleanup: true,
  historicalTracking: true, // Long-term monitoring
  alerting: true,           // External alerts
  onEmergencyShutdown: (reason, stats) => {
    // Custom shutdown handler
    console.error('GPU EMERGENCY:', reason);
    // Implement: docker-compose down, notifications, etc.
  }
});
```
## Configuration by Environment
### Development Environment
```javascript
{
  warningThreshold: 0.75,
  criticalThreshold: 0.85,
  cleanupThreshold: 0.90,
  emergencyThreshold: 0.95,
  refuseOnEmergency: false, // Allow testing
  logging: true
}
```
### Production Environment
```javascript
{
  warningThreshold: 0.65,  // More conservative
  criticalThreshold: 0.75,
  cleanupThreshold: 0.80,
  emergencyThreshold: 0.90,
  refuseOnEmergency: true, // Protect GPU
  verifyCleanup: true,     // Verify all operations
  logging: true,
  alerting: true           // External monitoring
}
```
### High-Value GPU (RTX 4090, A100, etc.)
```javascript
{
  warningThreshold: 0.60,  // Very conservative
  criticalThreshold: 0.70,
  cleanupThreshold: 0.75,
  emergencyThreshold: 0.85,
  refuseOnEmergency: true,
  verifyCleanup: true,
  historicalTracking: true, // Track long-term patterns
  alerting: true
}
```
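These three profiles differ only in their numbers, so one way to keep them in a single place is a lookup table with shared safety defaults. The `configFor` helper below is a sketch, not part of the shipped code:

```javascript
// Hypothetical helper mapping an environment name to the profiles above.
const profiles = {
  development: { warningThreshold: 0.75, criticalThreshold: 0.85, cleanupThreshold: 0.90, emergencyThreshold: 0.95, refuseOnEmergency: false },
  production:  { warningThreshold: 0.65, criticalThreshold: 0.75, cleanupThreshold: 0.80, emergencyThreshold: 0.90, refuseOnEmergency: true },
  highValue:   { warningThreshold: 0.60, criticalThreshold: 0.70, cleanupThreshold: 0.75, emergencyThreshold: 0.85, refuseOnEmergency: true },
};

function configFor(env) {
  const profile = profiles[env];
  if (!profile) throw new Error(`Unknown environment: ${env}`);
  // Shared safety defaults, overridden per profile where needed.
  return { verifyCleanup: true, logging: true, ...profile };
}
```

Failing loudly on an unknown environment name keeps a typo from silently falling back to permissive development thresholds.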
## Current Docker Configuration
### docker-compose.yml (Active)
```yaml
services:
  comfyui:
    command: [
      "--listen", "0.0.0.0",
      "--port", "8188",
      "--highvram",             # ✅ Keep models loaded when active
      "--disable-metadata",     # ✅ Don't save metadata in images
      "--cache-lru", "3",       # ✅ Small LRU cache for aggressive eviction
      # "--disable-smart-memory", # ❌ REMOVED: Would use too much system RAM
      "--fp8_e4m3fn-text-enc",  # ✅ FP8 text encoder (saves ~2.4GB)
      "--reserve-vram", "1.0"   # ✅ Reserve 1GB for OS/system safety
    ]
    environment:
      PYTORCH_CUDA_ALLOC_CONF: "max_split_size_mb:512,garbage_collection_threshold:0.8"
    deploy:
      resources:
        limits:
          memory: 16G  # Limit container memory
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```
### Health Checks
```yaml
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:8188/system_stats"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 40s
```
## Monitoring
### Current Monitoring (Active)
- **Endpoint:** `/system_stats` on ComfyUI (port 8188)
- **Frequency:** Every 60 seconds
- **Logs:** Available via `docker logs mcp-comfyui-flux-mcp-server-1`
### Check Current Status
```bash
# View VRAM usage
docker exec mcp-comfyui-flux-comfyui-1 nvidia-smi
# View VRAM manager logs
docker logs mcp-comfyui-flux-mcp-server-1 --tail 20 | grep VRAM
# Check both RAM and VRAM
echo "=== System RAM ===" && free -h | head -3 && echo -e "\n=== GPU VRAM ===" && \
docker exec mcp-comfyui-flux-comfyui-1 nvidia-smi --query-gpu=memory.used,memory.free,memory.total --format=csv
```
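The manager derives a usage fraction from the `/system_stats` response. The field names below (`devices[].vram_total` and `vram_free`, in bytes) match current ComfyUI builds, but treat the payload shape as an assumption and verify it against your version:

```javascript
// Compute the VRAM usage fraction from a ComfyUI /system_stats payload.
// Field names are based on current ComfyUI responses; verify against your build.
function vramUsage(stats) {
  const gpu = stats.devices && stats.devices[0];
  if (!gpu || !gpu.vram_total) return null; // endpoint unreachable or CPU-only
  return (gpu.vram_total - gpu.vram_free) / gpu.vram_total;
}

// Example payload: a 24GB card with ~5GB in use (values in bytes).
const sample = {
  devices: [{ name: 'cuda:0', vram_total: 24 * 1024 ** 3, vram_free: 19 * 1024 ** 3 }],
};
```

For the sample payload this returns roughly 0.21, comfortably below the 0.70 warning threshold. Returning `null` on a malformed response lets the caller distinguish "stats unavailable" from "usage is zero".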
### Status Endpoint (Optional Enhancement)
```javascript
// Could be added to MCP server for external monitoring
app.get('/vram-status', async (req, res) => {
  const status = vramManager.getStatus();
  // Return an HTTP status external monitors can act on
  if (status.emergencyMode) {
    res.status(503); // Service Unavailable
  } else if (status.currentUsage > 0.8) {
    res.status(429); // Too Many Requests
  } else {
    res.status(200); // OK
  }
  res.json(status);
});
```
### Prometheus Metrics
```javascript
// Expose metrics for monitoring
app.get('/metrics', async (req, res) => {
  const status = vramManager.getStatus();
  res.set('Content-Type', 'text/plain');
  // Build lines explicitly: the Prometheus text format expects metric
  // lines without leading whitespace.
  res.send([
    '# HELP gpu_vram_usage Current VRAM usage fraction (0-1)',
    '# TYPE gpu_vram_usage gauge',
    `gpu_vram_usage ${status.currentUsage}`,
    '# HELP gpu_emergency_mode GPU in emergency protection mode',
    '# TYPE gpu_emergency_mode gauge',
    `gpu_emergency_mode ${status.emergencyMode ? 1 : 0}`,
    '# HELP gpu_cleanup_failures Consecutive cleanup failures',
    '# TYPE gpu_cleanup_failures gauge', // gauge: resets to 0 on success
    `gpu_cleanup_failures ${status.consecutiveCleanupFailures}`,
    '',
  ].join('\n'));
});
```
## Emergency Procedures
### When Emergency Mode Activates
1. **Immediate Actions:**
- All new operations are refused
- Monitoring increases to every 5 seconds
- Alerts are sent (if configured)
2. **Manual Intervention:**
```bash
# Check GPU status
docker exec mcp-comfyui-flux-comfyui-1 nvidia-smi
# Force cleanup (simplified command)
docker exec mcp-comfyui-flux-mcp-server-1 node -e "
import('./src/services/simple-vram-manager.js').then(m => {
  const client = { serverAddress: 'comfyui:8188' };
  const manager = new m.SimpleVRAMManager(client);
  manager.forceCleanup().then(console.log);
});
"
# Last resort - restart containers
docker-compose -p mcp-comfyui-flux restart
```
3. **Recovery:**
- Emergency mode clears automatically after successful cleanup
- Or manually restart the service
### Temperature Critical
If GPU temperature exceeds 90°C:
1. **IMMEDIATELY** stop all operations
2. Check cooling system
3. Reduce GPU power limit: `nvidia-smi -pl 300`
4. Consider system shutdown if temperature doesn't drop
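The temperature tiers mirror the GPU-safe manager's `TEMP_*` thresholds. A sketch of the mapping from temperature to action (the `tempAction` function and its labels are illustrative, not the manager's actual API):

```javascript
// Illustrative mapping of GPU temperature (°C) to protective action,
// using the TEMP_* thresholds from the GPU-safe manager config.
function tempAction(celsius) {
  if (celsius >= 90) return 'shutdown';  // TEMP_SHUTDOWN: stop everything
  if (celsius >= 87) return 'emergency'; // TEMP_EMERGENCY: refuse work, alert
  if (celsius >= 83) return 'critical';  // TEMP_CRITICAL: reduce power limit
  if (celsius >= 75) return 'warning';   // TEMP_WARNING: increase monitoring
  return 'ok';
}
```

Note the narrow margins at the top of the range: only 3°C separates `critical` from `emergency`, which is why the manual steps above start with stopping all operations immediately.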
## Best Practices
1. **Always use conservative thresholds in production**
2. **Monitor continuously** - Use external monitoring (Prometheus, Grafana)
3. **Test cleanup strategies** during low-usage periods
4. **Document incidents** - Learn from emergencies
5. **Have a runbook** - Clear procedures for emergencies
6. **Regular maintenance** - Schedule cleanup during quiet periods
7. **Capacity planning** - Track usage trends over time
## Maintenance Workflows
### Available Strategies
1. **Empty Workflow** (Instant, 0 VRAM)
- Creates a 64x64 black image
- Execution time: <100ms
- Minimal cache activity
2. **SD1.5 Workflow** (2GB VRAM) - **PRIMARY STRATEGY**
- Loads SD1.5 model (2GB)
- Forces FLUX (11GB) out of LRU cache
- Execution time: 2-3 seconds
- Most effective for freeing VRAM
3. **Noise Workflow** (Minimal VRAM)
- Latent space operations only
- Execution time: <200ms
- Clears working memory
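The "empty" strategy can be expressed as a two-node prompt in ComfyUI's API format. The graph below is a sketch: the node class names (`EmptyImage`, `SaveImage`) and input fields match stock ComfyUI nodes but should be verified against your version before use:

```javascript
// Sketch of a minimal "empty" maintenance prompt in ComfyUI API format.
// Node class names and input fields are assumptions to verify locally.
const emptyWorkflow = {
  '1': {
    class_type: 'EmptyImage',
    inputs: { width: 64, height: 64, batch_size: 1, color: 0 }, // 64x64 black image
  },
  '2': {
    class_type: 'SaveImage',
    // ['1', 0] wires this input to output 0 of node '1'
    inputs: { images: ['1', 0], filename_prefix: 'vram_maintenance' },
  },
};
```

Because the prompt touches no model weights, it completes almost instantly while still occupying a cache slot, which is what makes it useful as a low-cost eviction nudge.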
## Testing Your Configuration
```bash
# Run inside the container
docker exec mcp-comfyui-flux-mcp-server-1 node -e "
import('./src/services/simple-vram-manager.js').then(async (module) => {
  const { SimpleVRAMManager } = module;
  const comfyuiClient = { serverAddress: 'comfyui:8188' };
  const manager = new SimpleVRAMManager(comfyuiClient, {
    // Your configuration
  });
  console.log('Testing VRAM manager configuration...');
  // Check current status
  const status = await manager.getStatus();
  console.log('Current status:', status);
  // Test pre-request check
  const canRun = await manager.preRequestCheck('flux');
  console.log('Can run FLUX:', canRun);
  // Test cleanup
  const cleanup = await manager.forceCleanup();
  console.log('Cleanup result:', cleanup);
});
"
```
## Troubleshooting
### "Emergency mode won't clear"
- Check if cleanup is actually working
- Verify stats endpoint is responding correctly
- Manually reset: `vramManager.state.emergencyMode = false`
### "Cleanup not freeing memory"
- ✅ SD1.5 model confirmed present at: `models/checkpoints/v1-5-pruned-emaonly-fp16.safetensors`
- ✅ ComfyUI running with `--cache-lru 3`
- ✅ FP8 text encoder present: `models/clip/t5xxl_fp8_e4m3fn_scaled.safetensors`
- If still not working, try `empty` strategy instead of `sd15`
- Check if smart memory is keeping models loaded (remove `--disable-smart-memory` if present)
### "False emergencies"
- Adjust thresholds based on your GPU's actual capacity
- Enable `verifyCleanup` to ensure stats are accurate
- Check for stat endpoint reliability
## Key Metrics
### Expected VRAM Usage
- **Idle (with FP8 encoder):** ~5GB (base ComfyUI + FP8 text encoder)
- **FLUX Loaded:** ~11-12GB
- **After SD1.5 Cleanup:** ~2-3GB
- **Emergency Threshold:** 22.8GB (95% of 24GB)
- **Reserved for OS:** 1GB (configurable)
### Typical Cleanup Results
```
Before: FLUX loaded - 11.2GB / 24GB (46.7%)
After: SD1.5 loaded - 2.1GB / 24GB (8.8%)
Freed: ~9GB
```
## Remember
> **The GPU is irreplaceable. The service is not.**
>
> Always err on the side of caution. A refused request is better than a damaged GPU.
## Implementation Files
- **VRAM Manager:** `/src/services/simple-vram-manager.js`
- **GPU-Safe Manager (optional):** `/src/services/gpu-safe-vram-manager.js`
- **Maintenance Workflows:** `/src/workflows/maintenance-workflow.js`
- **MCP Integration:** `/src/index.js` (lines 14, 38, 94-97, 483-520)
- **Configuration Plan:** `/.claude/docs/plans/vram-management-plan.md`
- **CLI Reference:** `/.claude/utilities/comfyui-cli-reference.md`
## Optimization Notes for Limited RAM Systems
### Why NOT to use `--disable-smart-memory`
With limited system RAM (< 20GB available), avoid `--disable-smart-memory` because:
- Forces models to offload from VRAM to RAM aggressively
- Can consume 18GB+ RAM just for models (FLUX 11GB + T5-XXL 5GB + SD1.5 2GB)
- Causes swap thrashing if RAM fills up
- Better to keep models in VRAM with smart memory management
### Recommended Settings for Limited RAM
```yaml
# DO use:
--highvram # Keep in VRAM rather than swapping to limited RAM
--cache-lru 3 # Still triggers eviction via LRU
--fp8_e4m3fn-text-enc # Reduces memory footprint
--reserve-vram 1.0 # Minimal reserve (you need the VRAM)
# DON'T use:
# --disable-smart-memory # Would overflow RAM
# --lowvram # Would use more RAM for splitting
# --reserve-vram 2.0+ # You need the VRAM more than reserve
```
## LRU Cache Explained
### What `--cache-lru N` Does
- Controls how many **node execution results** are cached
- NOT directly about models, but computed outputs from nodes
- Smaller cache (3) = more aggressive eviction = better for VRAM management
- Larger cache (10+) = less eviction = models stay loaded longer
### Why We Use `--cache-lru 3`
```
With only 3 cache slots:
1. Run FLUX → Uses slot 1
2. Run other workflow → Uses slot 2
3. Run another → Uses slot 3
4. Run SD1.5 → Evicts slot 1 (FLUX) → FLUX can unload from VRAM
```
This creates natural model eviction through cache competition.
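The cache competition above can be demonstrated with a tiny LRU simulation. This is illustrative only; ComfyUI's internal cache keys node execution results, not model names:

```javascript
// Minimal LRU simulation of --cache-lru 3. A Map preserves insertion
// order, so the first key is always the least recently used entry.
class LRU {
  constructor(capacity) {
    this.capacity = capacity;
    this.slots = new Map();
  }
  use(key) {
    this.slots.delete(key); // re-inserting moves key to most-recent position
    this.slots.set(key, true);
    if (this.slots.size > this.capacity) {
      const evicted = this.slots.keys().next().value; // oldest entry
      this.slots.delete(evicted);
      return evicted;
    }
    return null;
  }
}

const cache = new LRU(3);
cache.use('FLUX');       // slot 1
cache.use('upscale');    // slot 2
cache.use('controlnet'); // slot 3
const evicted = cache.use('SD1.5'); // 'FLUX' falls out of the cache
```

Running the fourth workflow evicts `'FLUX'`, mirroring how loading SD1.5 lets the 11GB FLUX model unload from VRAM.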