# Extraction Optimization - Deployment Guide
## Quick Summary
**Problem**: Phase 2 extraction taking 18-20+ hours due to sequential LLM calls
**Solution**: Batch LLM requests (20 conversations per call instead of 1)
**Result**: 1.5x faster extraction (proven by testing)
**Time to Deploy**: 5 minutes
---
## Current Status
```
Phase 2 Progress: 58.5% complete (133/227 batches)
Elapsed Time: 15.7 hours
Remaining: ~5,657 conversations
Current Rate: ~141 conversations/minute
Estimated Remaining: 8-10 hours (if no optimization)
```
---
## Step 1: Validate the Optimization (Optional but Recommended)
### Quick Validation Test (5 minutes)
```bash
# Test batching on a small sample
cd /path/to/faulkner-db
source venv/bin/activate
# Run a small test
python3 ingestion/agent_genesis_optimized.py \
    --batch-size 20 \
    --llm-batch 20 \
    --parallel 1
```
**Expected Output**:
```
OPTIMIZED AGENT GENESIS EXTRACTION (5-10x faster)
Batch Size: 20 | LLM Batch: 20 | Parallel: 1
Gathering ALL conversations from Agent Genesis corpus...
  'code implementation': +... new
  ...
✅ Total unique conversations gathered: ...
Processing X conversations in Y batches...
[██████████████░░░░░░░░░░] X% | Batch N/Y | Nodes: Z (20%) | Rate: 2.0+/min
```
**Success Criteria**:
- ✅ Extraction rate > 2.0 conversations/second (vs 1.5 baseline)
- ✅ No JSON parsing errors
- ✅ Success rate > 20%
- ✅ Progress bar updates smoothly
**If all pass**: Proceed to Step 2
---
## Step 2: Deploy to Remaining Phase 2
### Stop Current Process (Gracefully)
```bash
# The current process (069df0) can be stopped gracefully:
# - It saves a checkpoint after every batch
# - The optimized version will read that checkpoint and resume

# Option A: Let it finish naturally (it's at 58.5% and needs ~8-10 more hours)
# Option B: Send SIGTERM now; it will save a checkpoint before exiting

# Find the PID if needed (the pattern below is illustrative; match your actual script name)
pgrep -f agent_genesis_comprehensive.py

kill -TERM <PID>   # Replace <PID> with the process ID from pgrep
```
### Launch Optimized Extraction
```bash
cd /path/to/faulkner-db
source venv/bin/activate
# Deploy optimized version
python3 ingestion/agent_genesis_optimized.py \
    --batch-size 100 \
    --llm-batch 20 \
    --parallel 1
```
**What This Does** (see the sketch after this list):
1. Loads the existing checkpoint (resumes from 58.5%)
2. Gathers remaining conversations
3. Filters out lowest 20% by relevance
4. Batches conversations into groups of 20
5. Sends each batch to LLM (1 call per 20 conversations)
6. Saves checkpoint after each batch
7. Continues until all conversations processed
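For intuition, the core loop behind steps 4-6 looks roughly like the sketch below. This is not the actual code of `agent_genesis_optimized.py`; the checkpoint field `processed_ids`, the conversation `id` key, and the `extract_batch` callable are illustrative assumptions.

```python
import json
from pathlib import Path

CHECKPOINT = Path("ingestion/optimized_checkpoint.json")

def load_checkpoint() -> set:
    """Return the IDs of conversations already processed (empty on a fresh run)."""
    if CHECKPOINT.exists():
        return set(json.loads(CHECKPOINT.read_text()).get("processed_ids", []))
    return set()

def save_checkpoint(processed: set) -> None:
    """Persist progress after every batch so an interrupted run can resume."""
    CHECKPOINT.write_text(json.dumps({"processed_ids": sorted(processed)}))

def run(conversations, extract_batch, llm_batch=20):
    """Group conversations into llm_batch-sized chunks, one LLM call per chunk."""
    processed = load_checkpoint()
    remaining = [c for c in conversations if c["id"] not in processed]
    for i in range(0, len(remaining), llm_batch):
        chunk = remaining[i:i + llm_batch]
        extract_batch(chunk)                      # one LLM call covers the whole chunk
        processed.update(c["id"] for c in chunk)
        save_checkpoint(processed)                # checkpoint after every batch
```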
---
## Step 3: Monitor Progress
### Live Monitoring
```bash
# In a separate terminal, monitor the process
tail -f /path/to/output.log
# Or check checkpoint file for stats
cat ingestion/optimized_checkpoint.json | python3 -m json.tool
```
### What to Watch For
```
✅ Good Signs:
  - Rate: 2.0+ conversations/second
  - Success: 20-25% extraction rate
  - Batches: Processing 5-10 batches/minute
  - Errors: < 1%

⚠️ Warning Signs:
  - Rate drops below 1.0 conversations/second
  - Success rate drops below 15%
  - Consistent timeouts (> 5 in a row)
  - Memory usage growing (> 2 GB)

❌ Error Signs:
  - JSON parsing errors
  - Network failures (persistent)
  - Out of memory
  - Checkpoint not saving
```
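A cheap way to catch the "checkpoint not saving" failure mode is to watch the checkpoint file's modification time. A minimal sketch (the 15-minute staleness threshold is an arbitrary choice, not something the extractor defines):

```python
import time
from pathlib import Path

CHECKPOINT = Path("ingestion/optimized_checkpoint.json")
STALE_AFTER = 15 * 60   # seconds; tune to roughly 2-3x your observed batch interval

while True:
    if not CHECKPOINT.exists():
        print("WARNING: checkpoint file not found yet")
    else:
        age_min = (time.time() - CHECKPOINT.stat().st_mtime) / 60
        status = "WARNING: checkpoint stale" if age_min * 60 > STALE_AFTER else "OK"
        print(f"{status}: last checkpoint write {age_min:.1f} minutes ago")
    time.sleep(60)   # poll once per minute
```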
---
## Step 4: Expected Timeline
### With Optimization Deployed Now
```
Current State (2025-11-10 15:00 UTC):
Progress: 58.5% (133/227 batches)
Time elapsed: 15.7 hours
Remaining conversations: ~5,657
With Batching Optimization:
Speedup: 1.5x
Remaining time: 5-7 hours
Expected completion: ~2025-11-10 21:00-23:00 UTC
Without Optimization:
Remaining time: 8-10 hours
Expected completion: ~2025-11-11 01:00-03:00 UTC
Time Saved: 3 hours
```
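The optimized window is just the unoptimized estimate divided by the measured 1.5x speedup; a quick sanity check of the numbers above:

```python
speedup = 1.5
hours_without = (8, 10)                               # unoptimized estimate from above

hours_with = tuple(h / speedup for h in hours_without)
print(hours_with)                                     # (5.33..., 6.66...) -> the "5-7 hours" window
print(sum(hours_without) / 2 - sum(hours_with) / 2)   # ~3.0 hours saved on average
```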
---
## Alternative: Larger Batches (Experimental)
If you want to push harder for more speed:
```bash
# Try batch size of 40 (more aggressive)
python3 ingestion/agent_genesis_optimized.py \
    --batch-size 100 \
    --llm-batch 40 \
    --parallel 1
```
**Expected**: 2.0-2.5x faster (3-5 hours remaining)
**Risk**: Slightly less stable, needs validation first
---
## Troubleshooting
### Issue: "JSON decode error"
**Cause**: LLM response doesn't contain valid JSON
**Solution**:
- Reduce the LLM batch size: `--llm-batch 15`
- This gives the LLM less content to handle per call
- May be slightly slower, but more reliable (a defensive parsing sketch follows below)
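If lowering `--llm-batch` doesn't fully resolve it, the usual defensive pattern is to extract the JSON portion of the response before parsing, rather than calling `json.loads` on the raw text. A minimal sketch of that pattern (not the script's actual parsing code):

```python
import json
import re

def parse_llm_json(text):
    """Best-effort parse of a JSON object or array embedded in an LLM response."""
    try:
        return json.loads(text)                 # happy path: the response is pure JSON
    except json.JSONDecodeError:
        pass
    match = re.search(r"\{.*\}|\[.*\]", text, re.DOTALL)   # outermost braced/bracketed block
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            pass
    return None                                 # caller can retry with a smaller batch
```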
### Issue: "Timeout waiting for LLM"
**Cause**: LLM taking too long to respond
**Solution**:
- Reduce the LLM batch size: `--llm-batch 15`
- Or increase the request timeout in code (advanced; see the example below)
- Check whether the LLM server is overloaded
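How to raise the timeout depends on how the script calls the local LLM, which this guide doesn't show. If it goes over HTTP with `requests` (an assumption, not confirmed here), the timeout is a per-request parameter, for example:

```python
import requests

# Hypothetical endpoint and payload; substitute whatever the extractor actually uses.
response = requests.post(
    "http://localhost:8000/v1/completions",
    json={"prompt": "...", "max_tokens": 1024},
    timeout=300,   # seconds; raise this if large batches legitimately need more time
)
```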
### Issue: "Memory usage growing"
**Cause**: Checkpoint file or extraction cache getting large
**Solution**:
- Reduce batch size: `--batch-size 50`
- Clean up old checkpoint files
- Restart the process (safe; the checkpoint lets it resume where it left off)
### Issue: "Rate is slower than expected"
**Cause**: LLM slower than benchmarked
**Solution**:
- This is normal; benchmark numbers tend to be optimistic
- A 1.2-1.5x speedup is realistic in practice
- Still faster than the sequential approach
---
## Configuration Recommendations
### Recommended Settings
| Profile | `--batch-size` | `--llm-batch` | `--parallel` | Expected speedup | Risk |
|---|---|---|---|---|---|
| Default (safe, proven) | 100 | 20 | 1 | 1.5x | Very low |
| Aggressive (experimental) | 100 | 40 | 1 | 2.0x | Medium |
| Conservative (extra safe) | 50 | 15 | 1 | 1.3x | Very low |
### Don't Use These
```
❌ DON'T: --parallel > 1
   Reason: Local LLM is serial; parallelism adds overhead

❌ DON'T: --llm-batch > 50
   Reason: LLM struggles with large prompts; JSON errors increase

❌ DON'T: --batch-size > 200
   Reason: Checkpoints become large; recovery is slower
```
---
## Files Created
```
ingestion/
├── agent_genesis_comprehensive.py   ← Current (slow, sequential)
├── agent_genesis_optimized.py       ← NEW (batched, 1.5x faster)
├── benchmark_extraction.py          ← Test suite for validating the optimization
├── analyze_benchmark.py             ← Analysis of test results
├── OPTIMIZATION_SUMMARY.md          ← Detailed technical explanation
├── OPTIMIZATION_RESULTS.md          ← Test results & projections
├── DEPLOYMENT_GUIDE.md              ← This file
└── optimized_checkpoint.json        ← NEW checkpoint (created when deployed)
```
---
## Summary
### What Was Discovered
- Sequential LLM calls are the bottleneck (0.4-0.5s per conversation)
- Batching 20 conversations per LLM call cuts per-conversation time by roughly 30% (see the arithmetic sketch below)
- Parallelism doesn't help (local LLM is serial internally)
- Smart filtering adds marginal benefit (1.3x on filtered set)
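The ~30% figure follows from amortizing the fixed per-call cost (prompt setup, round trip) over 20 conversations. The overhead split below is illustrative, not a measured value:

```python
per_conversation = 0.45   # seconds per sequential LLM call (midpoint of 0.4-0.5s)
fixed_overhead = 0.15     # assumed fixed cost per call; the rest is actual generation
llm_batch = 20

batched = (per_conversation - fixed_overhead) + fixed_overhead / llm_batch
print(f"{batched:.3f}s/conversation, "
      f"{(1 - batched / per_conversation):.0%} less time, "
      f"{per_conversation / batched:.2f}x speedup")
```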
### What Was Built
- Optimized extraction with batching: **1.5x faster (proven)**
- Benchmark suite for testing strategies
- Analysis tools to understand performance
- Documentation for deployment
### Next Steps
1. **NOW**: Review this guide
2. **5 min**: Run validation test (optional)
3. **1 min**: Deploy optimized version
4. **5-7 hours**: Wait for Phase 2 to complete
5. **Tomorrow**: Enjoy faster knowledge base updates!
---
## Questions?
If something goes wrong:
1. Check `ingestion/optimized_checkpoint.json` for current progress
2. Review the output logs for errors
3. Refer to "Troubleshooting" section above
4. The checkpoint system means you can always resume
**Good luck!**