# Batch Operations
## Overview
The Crawl4AI RAG MCP Server provides powerful batch crawling capabilities for efficiently processing large numbers of URLs. The system includes two utilities:
1. **Batch Crawler** (legacy) - Direct batch crawling from URL files
2. **Recrawl Utility** (modern) - API-based recrawling of existing database URLs with improved architecture
## Batch Crawler
### Location
`/home/robiloo/Documents/mcpragcrawl4ai/core/utilities/batch_crawler.py`
### Features
- **Concurrent Processing**: Configurable number of simultaneous requests (default: 10)
- **Async Architecture**: Built on aiohttp for maximum performance
- **Timeout Handling**: Built-in 60-second per-request timeout
- **Progress Tracking**: Real-time progress updates every 50 URLs
- **Error Recovery**: Failed URLs logged and saved for reprocessing
- **Performance Metrics**: Tracks crawl rate, success rate, and timing
- **API Integration**: Direct REST API calls for efficient batch processing
### Architecture
```
┌─────────────────────────────────────────┐
│ URLs File (urls.md)                     │
│  - One URL per line                     │
│  - Comments with #                      │
│  - Plain text format                    │
└──────────────┬──────────────────────────┘
               │
               ▼
┌─────────────────────────────────────────┐
│ BatchCrawler Class                      │
│                                         │
│  ┌───────────────────────────────────┐  │
│  │ load_urls()                       │  │
│  │  - Read and parse URLs file       │  │
│  └───────────────────────────────────┘  │
│                                         │
│  ┌───────────────────────────────────┐  │
│  │ run_batch_crawl_async()           │  │
│  │  - Create aiohttp session         │  │
│  │  - Set up semaphore (concurrency) │  │
│  │  - Schedule all tasks             │  │
│  └───────────────────────────────────┘  │
│                                         │
│  ┌───────────────────────────────────┐  │
│  │ crawl_url_async()                 │  │
│  │  - POST to /api/v1/crawl/store    │  │
│  │  - Handle errors and timeouts     │  │
│  │  - Track metrics                  │  │
│  └───────────────────────────────────┘  │
└──────────────┬──────────────────────────┘
               │
               ▼
┌─────────────────────────────────────────┐
│ REST API Server                         │
│  POST /api/v1/crawl/store               │
│  - Crawl URL                            │
│  - Store in database                    │
│  - Return success/error                 │
└──────────────┬──────────────────────────┘
               │
               ▼
┌─────────────────────────────────────────┐
│ Results & Metrics                       │
│  - Success/failure counts               │
│  - Failed URLs file                     │
│  - Performance statistics               │
└─────────────────────────────────────────┘
```
### BatchCrawler Class
#### Initialization
```python
# From core/utilities/batch_crawler.py
import os
import sys

class BatchCrawler:
    def __init__(self, urls_file="urls.md", api_url="http://localhost:8080",
                 api_key=None, max_concurrent=10):
        self.urls_file = urls_file
        self.api_url = api_url
        self.max_concurrent = max_concurrent
        # Try to get the API key from environment variables
        self.api_key = (api_key
                        or os.getenv("LOCAL_API_KEY")
                        or os.getenv("REMOTE_API_KEY")
                        or os.getenv("RAG_API_KEY"))
        if not self.api_key:
            print("❌ No API key found!")
            sys.exit(1)
        self.results = []
        self.headers = {
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json"
        }
```
#### Loading URLs
```python
# From core/utilities/batch_crawler.py
def load_urls(self):
    """Load URLs from urls.md file"""
    if not os.path.exists(self.urls_file):
        print(f"❌ URLs file '{self.urls_file}' not found!")
        return []
    urls = []
    try:
        with open(self.urls_file, 'r') as f:
            for line_num, line in enumerate(f, 1):
                line = line.strip()
                if line and not line.startswith('#'):
                    urls.append(line)
                    if line_num <= 10:  # Show first 10
                        print(f"📋 Loaded URL {len(urls)}: {line}")
    except Exception as e:
        print(f"❌ Error reading URLs file: {e}")
        return []
    print(f"\n✅ Loaded {len(urls)} URLs to crawl")
    return urls
```
#### Async Crawling
```python
# From core/utilities/batch_crawler.py
import asyncio
import time
from typing import Dict

import aiohttp

async def crawl_url_async(self, session: aiohttp.ClientSession, url: str,
                          index: int, total: int) -> Dict:
    """Crawl a single URL via API asynchronously"""
    print(f"[{index}/{total}] 🔄 Crawling: {url}")
    start_time = time.time()
    try:
        async with session.post(
            f"{self.api_url}/api/v1/crawl/store",
            headers=self.headers,
            json={
                "url": url,
                "retention_policy": "permanent",
                "tags": "batch_recrawl"
            },
            timeout=aiohttp.ClientTimeout(total=60)
        ) as response:
            duration = time.time() - start_time
            if response.status == 200:
                result = await response.json()
                if result.get("success"):
                    print(f"[{index}/{total}] ✅ Success ({duration:.1f}s) - {url}")
                    return {'url': url, 'success': True, 'duration': duration, 'error': None}
                else:
                    error = result.get("error", "Unknown error")
                    print(f"[{index}/{total}] ❌ Failed: {error} - {url}")
                    return {'url': url, 'success': False, 'duration': duration, 'error': error}
            else:
                # Non-200 responses (e.g. "HTTP 404: Not Found" in the sample output)
                error = f"HTTP {response.status}: {response.reason}"
                print(f"[{index}/{total}] ❌ Failed: {error} - {url}")
                return {'url': url, 'success': False, 'duration': duration, 'error': error}
    except asyncio.TimeoutError:
        duration = time.time() - start_time
        print(f"[{index}/{total}] ⏱️ Timeout after {duration:.1f}s - {url}")
        return {'url': url, 'success': False, 'duration': duration, 'error': 'Request timeout'}
    except Exception as e:
        duration = time.time() - start_time
        print(f"[{index}/{total}] 💥 Exception: {e} - {url}")
        return {'url': url, 'success': False, 'duration': duration, 'error': str(e)}
```
#### Concurrency Control
```python
# From core/utilities/batch_crawler.py
async def run_batch_crawl_async(self):
    """Run batch crawl for all URLs with concurrency control"""
    urls = self.load_urls()
    if not urls:
        return
    # Reference point for the progress-rate calculation below
    batch_start_time = time.time()
    # Create aiohttp session
    async with aiohttp.ClientSession() as session:
        # Create semaphore to limit concurrent requests
        semaphore = asyncio.Semaphore(self.max_concurrent)

        async def bounded_crawl(url: str, index: int, total: int):
            async with semaphore:
                result = await self.crawl_url_async(session, url, index, total)
                self.results.append(result)
                # Progress update every 50 URLs
                if len(self.results) % 50 == 0:
                    elapsed = time.time() - batch_start_time
                    rate = len(self.results) / (elapsed / 60) if elapsed > 0 else 0
                    successful = sum(1 for r in self.results if r['success'])
                    print(f"\n📊 Progress: {len(self.results)}/{total} | "
                          f"Success: {successful}/{len(self.results)} | "
                          f"Rate: {rate:.1f} URLs/min\n")
                return result

        # Create all tasks
        tasks = [bounded_crawl(url, index, len(urls))
                 for index, url in enumerate(urls, 1)]
        # Run all tasks concurrently with semaphore limiting concurrency
        await asyncio.gather(*tasks)
```
#### Summary Statistics
```python
# From core/utilities/batch_crawler.py
def print_summary(self, batch_duration):
    """Print final crawl summary"""
    print("\n" + "="*80)
    print("📊 BATCH CRAWL SUMMARY")
    print("="*80)
    successful = [r for r in self.results if r['success']]
    failed = [r for r in self.results if not r['success']]
    print(f"✅ Successful URLs: {len(successful)}/{len(self.results)}")
    print(f"❌ Failed URLs: {len(failed)}/{len(self.results)}")
    print(f"⏱️ Total duration: {batch_duration/60:.1f} minutes")
    print(f"📈 Overall rate: {len(self.results)/(batch_duration/60):.1f} URLs/minute")
    if successful:
        avg_duration = sum(r['duration'] for r in successful) / len(successful)
        print(f"⚡ Average crawl time: {avg_duration:.1f} seconds")
    if failed:
        print(f"\n💥 FAILED URLS ({len(failed)}):")
        # Show first 20 failures
        for result in failed[:20]:
            print(f" ❌ {result['url']}")
            print(f"    Error: {result['error']}")
        # Save failed URLs to file
        failed_file = "failed_urls.txt"
        with open(failed_file, 'w') as f:
            for result in failed:
                f.write(f"{result['url']}\n")
        print(f"\n💾 Failed URLs saved to: {failed_file}")
```
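A driver along these lines ties the pieces together and supplies `batch_duration` to `print_summary()`; a sketch (the actual entry point in `batch_crawler.py` may differ in its argument handling):
```python
import asyncio
import sys
import time

def main():
    # Positional arguments mirror the usage line: [urls_file] [api_url] [max_concurrent]
    urls_file = sys.argv[1] if len(sys.argv) > 1 else "urls.md"
    api_url = sys.argv[2] if len(sys.argv) > 2 else "http://localhost:8080"
    max_concurrent = int(sys.argv[3]) if len(sys.argv) > 3 else 10

    crawler = BatchCrawler(urls_file=urls_file, api_url=api_url,
                           max_concurrent=max_concurrent)
    start = time.time()
    asyncio.run(crawler.run_batch_crawl_async())
    crawler.print_summary(time.time() - start)

if __name__ == "__main__":
    main()
```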
### Usage
#### Basic Usage
```bash
# Run with default settings (urls.md, localhost:8080, 10 concurrent)
python3 core/utilities/batch_crawler.py
# Specify custom URL file
python3 core/utilities/batch_crawler.py /path/to/urls.txt
# Specify custom API URL
python3 core/utilities/batch_crawler.py urls.md http://remote-server.com:8080
# Specify custom concurrency level
python3 core/utilities/batch_crawler.py urls.md http://localhost:8080 20
```
#### URL File Format
Create a `urls.md` file with one URL per line:
```markdown
# Batch Crawl URLs
# Lines starting with # are comments and will be ignored
https://example.com/page1
https://example.com/page2
https://example.com/page3
# Section 2 - Documentation
https://docs.example.com/intro
https://docs.example.com/guide
https://docs.example.com/api
```
#### Environment Variables
The batch crawler automatically detects API keys from the environment:
```bash
# In .env file or environment
LOCAL_API_KEY=your-local-api-key-here
REMOTE_API_KEY=your-remote-api-key-here
RAG_API_KEY=fallback-api-key-here
```
#### Example Output
```
🤖 Crawl4AI Batch URL Crawler
Usage: python3 batch_crawler.py [urls_file] [api_url] [max_concurrent]
✅ Loaded environment from: /path/to/.env
🔑 Using API key: abc123...
⚡ Max concurrent requests: 10
📋 Loaded URL 1: https://example.com/page1
📋 Loaded URL 2: https://example.com/page2
...
✅ Loaded 100 URLs to crawl
🎬 STARTING BATCH CRAWL
📝 URLs: 100
🌐 API: http://localhost:8080
⚡ Concurrent requests: 10
🕐 Started: 2024-01-15 14:30:00
================================================================================
[1/100] 🔄 Crawling: https://example.com/page1
[2/100] 🔄 Crawling: https://example.com/page2
[1/100] ✅ Success (2.3s) - https://example.com/page1
[2/100] ✅ Success (2.1s) - https://example.com/page2
...
📊 Progress: 50/100 | Success: 48/50 | Rate: 15.2 URLs/min
...
================================================================================
📊 BATCH CRAWL SUMMARY
================================================================================
✅ Successful URLs: 95/100
❌ Failed URLs: 5/100
⏱️ Total duration: 6.8 minutes
📈 Overall rate: 14.7 URLs/minute
⚡ Average crawl time: 2.4 seconds
💥 FAILED URLS (5):
❌ https://broken-site.com/page1
Error: Connection timeout
❌ https://another-site.com/missing
Error: HTTP 404: Not Found
...
💾 Failed URLs saved to: failed_urls.txt
🏁 Batch crawl completed at 2024-01-15 14:36:50
```
### Performance Tuning
#### Concurrency Level
Adjust based on your system and network capacity:
```bash
# Low concurrency (safer, slower)
python3 core/utilities/batch_crawler.py urls.md http://localhost:8080 5
# Default concurrency (balanced)
python3 core/utilities/batch_crawler.py urls.md http://localhost:8080 10
# High concurrency (faster, more resource intensive)
python3 core/utilities/batch_crawler.py urls.md http://localhost:8080 20
# Very high concurrency (requires good hardware)
python3 core/utilities/batch_crawler.py urls.md http://localhost:8080 50
```
#### Recommended Settings
| System Type | RAM | CPU Cores | Recommended Concurrency |
|------------|-----|-----------|------------------------|
| Laptop | 8 GB | 4 | 5-10 |
| Desktop | 16 GB | 8 | 10-20 |
| Server | 32 GB | 16 | 20-50 |
| High-End Server | 64+ GB | 32+ | 50-100 |
#### Timeout Configuration
The default timeout is 60 seconds per request; modify it in code if needed:
```python
# In batch_crawler.py
timeout=aiohttp.ClientTimeout(total=60) # Change to desired timeout
```
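If timeouts cluster in one phase (slow connects rather than slow page crawls, say), `aiohttp.ClientTimeout` also accepts per-phase limits; a sketch:
```python
import aiohttp

# Per-phase limits instead of a single total cap (standard aiohttp options)
timeout = aiohttp.ClientTimeout(
    total=120,     # hard ceiling for the whole request
    connect=10,    # connection establishment, including pool wait
    sock_read=60,  # maximum gap between received chunks
)
```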
### Error Handling
#### Common Error Types
1. **Connection Timeout**: Request took longer than 60 seconds
2. **HTTP Errors**: 4xx (client error) or 5xx (server error)
3. **Network Errors**: Connection refused, DNS failure, etc.
4. **API Errors**: Invalid URL, sanitization failure, blocked domain
#### Failed URL Recovery
Failed URLs are automatically saved to `failed_urls.txt`:
```bash
# Retry failed URLs
cp failed_urls.txt urls_retry.md
python3 core/utilities/batch_crawler.py urls_retry.md
```
#### Retry Strategy
For automatic retry with exponential backoff:
```python
# Custom retry logic (add to batch_crawler.py); index and total are the
# same progress counters that crawl_url_async() already takes
async def crawl_with_retry(self, session, url, index, total, max_retries=3):
    for attempt in range(max_retries):
        result = await self.crawl_url_async(session, url, index, total)
        if result['success']:
            return result
        # Exponential backoff: wait 1s, 2s, 4s, ... between attempts
        if attempt < max_retries - 1:
            wait_time = 2 ** attempt
            await asyncio.sleep(wait_time)
    return result
```
### Database Integration
#### Data Flow
```
URLs → Batch Crawler → REST API → RAG System → Database
Each URL:
1. Sent to /api/v1/crawl/store
2. Crawled by Crawl4AI
3. Sanitized and validated
4. Stored in SQLite database
5. Embeddings generated
6. Synced to disk (if RAM mode)
```
#### Storage Details
All crawled content is stored with:
- **URL**: Full URL of the page
- **Title**: Page title extracted from HTML
- **Content**: Full page content
- **Markdown**: Converted markdown representation
- **Timestamp**: When the page was crawled
- **Tags**: Automatically tagged with "batch_recrawl"
- **Retention**: Stored permanently by default
- **Embeddings**: 384-dimensional vector embeddings
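The 384-dimension figure matches compact sentence-transformer models such as `all-MiniLM-L6-v2`; a sketch of producing a comparable vector (the model name is an assumption — the server may use a different encoder):
```python
from sentence_transformers import SentenceTransformer

# Assumed model; any 384-dimensional encoder fits the stored schema
model = SentenceTransformer("all-MiniLM-L6-v2")
vector = model.encode("Example page content to embed")
print(vector.shape)  # (384,)
```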
#### Verification
Check crawled content in database:
```bash
# View database stats
python3 core/utilities/dbstats.py
# Or via API
curl -H "Authorization: Bearer your-api-key" \
http://localhost:8080/api/v1/stats
# List recently crawled content
curl -H "Authorization: Bearer your-api-key" \
"http://localhost:8080/api/v1/memory?limit=100"
# Search for batch crawled content
curl -X POST http://localhost:8080/api/v1/search \
-H "Authorization: Bearer your-api-key" \
-H "Content-Type: application/json" \
-d '{"query": "batch_recrawl", "limit": 100}'
```
### Advanced Usage
#### Custom Tags
Modify the crawler to use custom tags:
```python
# In batch_crawler.py - crawl_url_async()
json={
"url": url,
"retention_policy": "permanent",
"tags": "my_custom_tag,batch_import_2024" # Custom tags
}
```
#### Remote API
Crawl to a remote server:
```bash
# Set remote API details
export REMOTE_API_KEY=your-remote-api-key
python3 core/utilities/batch_crawler.py urls.md https://remote-server.com:8080 10
```
#### Filtered Crawling
Pre-filter URLs before crawling:
```bash
# Extract specific domains
grep "docs.python.org" all_urls.txt > python_docs_urls.md
# Extract by keyword
grep "tutorial" all_urls.txt > tutorial_urls.md
# Crawl filtered set
python3 core/utilities/batch_crawler.py python_docs_urls.md
```
### Monitoring
#### Real-Time Monitoring
Watch progress in real-time:
```bash
# In one terminal
python3 core/utilities/batch_crawler.py urls.md
# In another terminal, monitor API
watch -n 5 'curl -s -H "Authorization: Bearer your-api-key" http://localhost:8080/api/v1/stats | jq'
```
#### Log Analysis
Check application logs for errors:
```bash
# View recent errors
tail -n 100 data/crawl4ai_rag_errors.log | grep "batch_crawler"
# Monitor logs live
tail -f data/crawl4ai_rag_errors.log
```
### Best Practices
1. **Start Small**: Test with 10-20 URLs before running large batches
2. **Monitor Resources**: Watch CPU, RAM, and network usage
3. **Use Appropriate Concurrency**: Don't overwhelm your system or target servers
4. **Respect Rate Limits**: Consider adding delays for specific domains
5. **Save Failed URLs**: Always review and retry failed URLs
6. **Check Database Stats**: Verify content was stored correctly
7. **Regular Backups**: Backup database before large batch operations
8. **Clean URLs**: Remove duplicates and invalid URLs before batch crawling (see the sketch after this list)
9. **Tag Appropriately**: Use descriptive tags for easy filtering later
10. **Document Sources**: Keep notes about where URL lists came from
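For item 8, a minimal pre-flight sketch (a hypothetical helper, not part of the repository) that deduplicates a raw URL list and drops anything that does not parse as http(s):
```python
from urllib.parse import urlparse

def clean_url_file(src="all_urls.txt", dst="urls.md"):
    """Deduplicate and validate URLs, preserving input order."""
    seen, kept = set(), []
    with open(src) as f:
        for line in f:
            url = line.strip()
            if not url or url.startswith("#") or url in seen:
                continue  # skip blanks, comments, and duplicates
            parts = urlparse(url)
            if parts.scheme in ("http", "https") and parts.netloc:
                seen.add(url)
                kept.append(url)
    with open(dst, "w") as f:
        f.write("\n".join(kept) + "\n")
    print(f"Kept {len(kept)} unique, valid URLs")

clean_url_file()
```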
### Troubleshooting
#### Issue: High Memory Usage
**Symptoms**: System running out of memory during batch crawl
**Solutions:**
- Reduce concurrency level
- Process URLs in smaller batches
- Increase system swap space
- Close other applications
#### Issue: Slow Crawl Rate
**Symptoms**: Crawl rate is much lower than expected
**Solutions:**
- Check network bandwidth
- Verify Crawl4AI service is healthy
- Increase concurrency if system has capacity
- Check for rate limiting on target sites
- Monitor API server performance
#### Issue: Many Timeouts
**Symptoms**: High percentage of timeout errors
**Solutions:**
- Increase timeout duration in code
- Check network connectivity
- Verify target sites are responsive
- Reduce concurrency to lower load
- Check Crawl4AI container logs
#### Issue: API Authentication Failures
**Symptoms**: "Unauthorized" errors during batch crawl
**Solutions:**
- Verify API key is correct in environment
- Check .env file is loaded properly
- Restart server after changing API key
- Test the API key with a single request first (see the check below)
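For that last item, a one-off request against the stats endpoint (shown under Verification above) confirms the key before a long run:
```bash
curl -i -H "Authorization: Bearer your-api-key" \
  http://localhost:8080/api/v1/stats
# HTTP/1.1 200 → key is valid; 401/403 → fix the key before batch crawling
```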
### Database Restoration
If a batch crawl is interrupted, you can check what was stored and resume with the remaining URLs:
```bash
# Check what was successfully stored
python3 core/utilities/dbstats.py

# Compare the stored count with your crawl progress, then re-run the
# crawler on only the URLs that are not yet in the database (see below)
```
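One way to build the remaining-URLs list is to diff the original file against what the API reports as stored; a sketch (the response field names are assumptions — adjust them to the actual `/api/v1/memory` payload):
```python
import os
import requests

API = "http://localhost:8080"
HEADERS = {"Authorization": f"Bearer {os.environ['LOCAL_API_KEY']}"}

# Fetch stored URLs; "results" and "url" are hypothetical field names
resp = requests.get(f"{API}/api/v1/memory", headers=HEADERS, params={"limit": 10000})
stored = {item["url"] for item in resp.json().get("results", [])}

with open("urls.md") as f:
    wanted = [line.strip() for line in f
              if line.strip() and not line.startswith("#")]

remaining = [u for u in wanted if u not in stored]
with open("urls_resume.md", "w") as f:
    f.write("\n".join(remaining) + "\n")
print(f"{len(remaining)} URLs left to crawl")
```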
### Integration with Database Stats
View batch crawl results:
```bash
python3 core/utilities/dbstats.py

# Output shows:
# - Total pages crawled
# - Recent activity (including batch crawled pages)
# - Storage breakdown
# - Top tags (including "batch_recrawl")
```
## References
- **Implementation**: `/home/robiloo/Documents/mcpragcrawl4ai/core/utilities/batch_crawler.py`
- **Database Stats**: `/home/robiloo/Documents/mcpragcrawl4ai/core/utilities/dbstats.py`
- **REST API**: `/home/robiloo/Documents/mcpragcrawl4ai/api/api.py`
- **Storage Layer**: `/home/robiloo/Documents/mcpragcrawl4ai/core/data/storage.py`
## Recrawl Utility
### Overview
The Recrawl Utility is a modern tool for batch recrawling existing URLs in the database. Unlike the batch crawler, which processes URLs from files, this utility:
- Reads URLs directly from the database
- Sends crawl requests via API (avoiding sync conflicts)
- Supports concurrent processing with rate limiting
- Provides dry-run mode for previewing operations
- Filters by retention policy, tags, or URL patterns
### Location
`/home/robiloo/Documents/mcpragcrawl4ai/core/utilities/recrawl_utility.py`
### Key Features
- **API-Based Execution**: Uses `/api/v1/crawl/store` instead of direct database access
- **Concurrent Processing**: Asyncio-based with semaphore rate limiting
- **Database Read**: Reads URLs from disk database (no RAM DB conflicts)
- **Flexible Filtering**: Filter by retention policy, tags, or limit
- **Rate Limiting**: Configurable delay and concurrency
- **Dry-Run Mode**: Preview what will be recrawled before executing
- **Progress Tracking**: Real-time statistics with success/failure counts
### Architecture
```
┌─────────────────────────────────────────┐
│ Disk Database (Direct Read)             │
│  - Reads URLs from crawled_content      │
│  - No RAM DB initialization             │
│  - No sync manager conflicts            │
└──────────────┬──────────────────────────┘
               │
               ▼
┌─────────────────────────────────────────┐
│ RecrawlUtility Class                    │
│                                         │
│  ┌───────────────────────────────────┐  │
│  │ get_urls_to_recrawl()             │  │
│  │  - Query disk DB directly         │  │
│  │  - Filter by policy/tags          │  │
│  │  - Return URL list                │  │
│  └───────────────────────────────────┘  │
│                                         │
│  ┌───────────────────────────────────┐  │
│  │ recrawl_batch()                   │  │
│  │  - Concurrent processing          │  │
│  │  - Semaphore rate limiting        │  │
│  │  - Progress tracking              │  │
│  └───────────────────────────────────┘  │
│                                         │
│  ┌───────────────────────────────────┐  │
│  │ recrawl_url()                     │  │
│  │  - POST to /api/v1/crawl/store    │  │
│  │  - aiohttp async requests         │  │
│  │  - Error handling & retry         │  │
│  └───────────────────────────────────┘  │
└──────────────┬──────────────────────────┘
               │
               ▼
┌─────────────────────────────────────────┐
│ REST API Server (Docker)                │
│  POST /api/v1/crawl/store               │
│  - Crawl URL via Crawl4AI               │
│  - Clean content (remove navigation)    │
│  - Filter language (English only)       │
│  - Store in RAM DB                      │
│  - Generate embeddings                  │
│  - Sync to disk automatically           │
└─────────────────────────────────────────┘
```
### Usage
#### Basic Usage
```bash
# Activate virtual environment
cd ~/Documents/mcpragcrawl4ai && source .venv/bin/activate
# Dry run to preview all URLs
python3 core/utilities/recrawl_utility.py --all --dry-run
# Recrawl all URLs with default settings (1 concurrent, 1s delay)
python3 core/utilities/recrawl_utility.py --all
# Recrawl with 10 concurrent requests, 0.6s delay (~100 req/min ceiling)
python3 core/utilities/recrawl_utility.py --all --concurrent 10 --delay 0.6
# Recrawl specific retention policy
python3 core/utilities/recrawl_utility.py --policy permanent --concurrent 5 --delay 1.0
# Recrawl URLs with specific tags
python3 core/utilities/recrawl_utility.py --tags "react,documentation" --limit 100 --concurrent 10 --delay 0.5
# Recrawl single URL
python3 core/utilities/recrawl_utility.py --url https://example.com
```
#### Command-Line Options
| Option | Description | Default |
|--------|-------------|---------|
| `--url URL` | Recrawl a single URL | None |
| `--all` | Recrawl all URLs in database | False |
| `--policy POLICY` | Filter by retention policy | None |
| `--tags TAGS` | Filter by tags (comma-separated) | None |
| `--limit N` | Maximum URLs to recrawl | None |
| `--delay SECONDS` | Delay between requests | 1.0 |
| `--concurrent N` | Number of concurrent requests | 1 |
| `--dry-run` | Preview without making changes | False |
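These options translate into a filtered query against the disk database; a sketch of what `get_urls_to_recrawl()` plausibly runs (the `crawled_content` table comes from the architecture diagram above, but the column names are assumptions about the schema):
```python
import sqlite3

def get_urls_to_recrawl(db_path, policy=None, tags=None, limit=None):
    query = "SELECT url FROM crawled_content WHERE 1=1"
    params = []
    if policy:  # --policy
        query += " AND retention_policy = ?"
        params.append(policy)
    if tags:    # --tags, comma-separated substring match (assumed semantics)
        for tag in tags.split(","):
            query += " AND tags LIKE ?"
            params.append(f"%{tag.strip()}%")
    if limit:   # --limit
        query += " LIMIT ?"
        params.append(limit)
    with sqlite3.connect(db_path) as conn:
        return [row[0] for row in conn.execute(query, params)]
```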
### Rate Limiting Examples
```bash
# 30 requests per minute (2s delay, 1 concurrent)
python3 core/utilities/recrawl_utility.py --all --concurrent 1 --delay 2.0
# ~100 requests per minute (0.6s delay, 10 concurrent)
python3 core/utilities/recrawl_utility.py --all --concurrent 10 --delay 0.6
# 120 requests per minute (0.5s delay, 10 concurrent)
python3 core/utilities/recrawl_utility.py --all --concurrent 10 --delay 0.5
# Conservative (avoid rate limits)
python3 core/utilities/recrawl_utility.py --all --concurrent 5 --delay 1.0
```
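The two knobs combine roughly as follows: new requests are dispatched at most once per `--delay` seconds (a ceiling of about 60/delay requests per minute), while the semaphore caps how many are in flight at once. An illustrative sketch, not the utility's exact code:
```python
import asyncio

async def recrawl_url(url: str) -> None:
    # Stand-in for the real POST to /api/v1/crawl/store
    await asyncio.sleep(2.0)  # simulate a ~2s crawl

async def run(urls, concurrent=10, delay=0.6):
    sem = asyncio.Semaphore(concurrent)

    async def worker(url):
        async with sem:  # at most `concurrent` requests in flight
            await recrawl_url(url)

    tasks = []
    for url in urls:
        tasks.append(asyncio.create_task(worker(url)))
        await asyncio.sleep(delay)  # pace dispatches: ~60/delay requests per minute
    await asyncio.gather(*tasks)

asyncio.run(run([f"https://example.com/page{i}" for i in range(20)]))
```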
### Example Output
```
✅ Loaded environment from /home/user/mcpragcrawl4ai/deployments/server/.env
✅ Adjusted DB_PATH to: ./data/crawl4ai_rag.db
✅ Recrawl utility initialized
Database: ./data/crawl4ai_rag.db
API: http://localhost:8080
Fetching URLs from database...
Found 2295 URLs to recrawl
Recrawl 2295 URLs? This will replace content and embeddings. (y/N): y
================================================================================
Recrawl Batch - 2295 URLs
Concurrency: 10 simultaneous requests
Rate limit: 0.6s delay between requests
================================================================================
[1/2295] 🔄 Recrawling: https://react.dev/blog/2025/02/14/sunsetting-create-react-app
✅ Updated: Sunsetting Create React App – React
[2/2295] 🔄 Recrawling: https://react.dev/reference/react/hooks
✅ Updated: Hooks – React
[3/2295] 🔄 Recrawling: https://tanstack.com/query/latest/docs/framework/react/overview
❌ Failed: https://tanstack.com/query/latest/docs/framework/react/overview - HTTP 429
...
================================================================================
Recrawl Statistics
================================================================================
Total URLs: 2295
✅ Success: 2250
❌ Failed: 45
⊘ Skipped: 0
Errors (10):
• https://example.com/page1
HTTP 429: Too Many Requests
• https://example.com/page2
Non-English content detected: ru
...
Success Rate: 98.0%
================================================================================
```
### Content Processing Pipeline
When recrawling, each URL goes through:
1. **API Request**: POST to `/api/v1/crawl/store`
2. **Crawl4AI Extraction**: Uses `fit_markdown` for cleaner content
3. **Content Cleaning**: Removes navigation, boilerplate (70-80% reduction)
4. **Language Detection**: Filters out non-English content
5. **Database Storage**: Replaces old content and embeddings
6. **RAM DB Sync**: Automatically syncs to disk (if RAM mode enabled)
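Step 4 behaves like a standard language-detection pass; a sketch using `langdetect` (an assumption — the server's actual detector may differ, though its codes match the `ru` seen in the example output above):
```python
from langdetect import detect  # pip install langdetect

def is_english(text: str) -> bool:
    try:
        return detect(text) == "en"
    except Exception:
        return False  # too little text to classify; treat as rejected

print(is_english("The quick brown fox jumps over the lazy dog."))  # True
print(is_english("Быстрая лиса прыгает через ленивую собаку."))    # False
```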
### Filtering Options
#### By Retention Policy
```bash
# Recrawl only permanent content
python3 core/utilities/recrawl_utility.py --policy permanent
# Recrawl only session content
python3 core/utilities/recrawl_utility.py --policy session_only
```
#### By Tags
```bash
# Recrawl React documentation
python3 core/utilities/recrawl_utility.py --tags react
# Recrawl multiple tags
python3 core/utilities/recrawl_utility.py --tags "react,documentation,tutorial"
```
#### By Limit
```bash
# Recrawl first 100 URLs
python3 core/utilities/recrawl_utility.py --all --limit 100
# Test with 10 URLs first
python3 core/utilities/recrawl_utility.py --all --limit 10 --dry-run
```
### Error Handling
#### Common Error Types
1. **HTTP 429 (Too Many Requests)**: Target server rate limiting
- Solution: Reduce concurrency or increase delay
2. **Non-English Content**: Language detection rejected content
- This is expected behavior, not an error
3. **Connection Timeout**: Request exceeded 60s
- Solution: Server may be overloaded, reduce concurrency
4. **HTTP 404/410**: URL no longer exists
- Solution: Consider removing from database
### Performance Tuning
#### Concurrency Recommendations
| URLs to Recrawl | Recommended Concurrency | Delay |
|-----------------|------------------------|-------|
| < 100 | 1-5 | 1.0s |
| 100-500 | 5-10 | 0.6-1.0s |
| 500-2000 | 10-15 | 0.5-0.6s |
| 2000+ | 10-20 | 0.5-0.6s |
**Note**: Higher concurrency with lower delay may trigger rate limits on target servers.
### Best Practices
1. **Always Dry-Run First**: Preview what will be recrawled
```bash
python3 core/utilities/recrawl_utility.py --all --dry-run
```
2. **Start Conservative**: Begin with low concurrency and increase if needed
```bash
python3 core/utilities/recrawl_utility.py --all --concurrent 5 --delay 1.0
```
3. **Monitor Progress**: Watch for rate limit errors (HTTP 429)
4. **Filter Before Recrawling**: Use tags/policies to recrawl specific content
```bash
python3 core/utilities/recrawl_utility.py --tags documentation --limit 500
```
5. **Check Database Stats**: Verify recrawl success
```bash
python3 core/utilities/dbstats.py
```
6. **Docker Must Be Running**: API server must be accessible
```bash
docker compose -f deployments/server/docker-compose.yml ps
```
### Advantages Over Batch Crawler
| Feature | Recrawl Utility | Batch Crawler |
|---------|----------------|---------------|
| Source | Database URLs | File URLs |
| Execution | API-based | Direct DB access |
| Sync Conflicts | None | Potential |
| Filtering | Policy/Tags | None |
| Dry-Run | Yes | No |
| Content Processing | Full pipeline | Full pipeline |
| Rate Limiting | Built-in | Manual |
| Use Case | Update existing | Import new |
### Troubleshooting
#### Issue: "no such module: vec0"
**Cause**: An old version of the recrawl utility tried to open the RAM DB directly
**Solution**: Use the latest version, which queries the disk DB directly and sends requests through the API
#### Issue: High failure rate
**Symptoms**: Many HTTP 429 or timeout errors
**Solutions:**
- Reduce concurrency: `--concurrent 5`
- Increase delay: `--delay 1.5`
- Check API server is running
- Monitor server logs for issues
#### Issue: Non-English content rejected
**Symptoms**: Many "Non-English content detected" errors
**This is expected**: Language filtering is working as designed. These URLs contained non-English content and were intentionally skipped.
### Monitoring
```bash
# Watch recrawl progress in one terminal
python3 core/utilities/recrawl_utility.py --all --concurrent 10 --delay 0.6
# Monitor API server in another terminal
docker logs -f crawl4ai-rag-server
# Check database stats after completion
python3 core/utilities/dbstats.py
```
## See Also
- [Database Statistics](../guides/troubleshooting.md#database-issues)
- [API Endpoints](../api/endpoints.md)
- [Performance Optimization](../guides/deployment.md#performance-considerations)
- [RAM Database Mode](ram-database.md)
- [Content Cleaning Pipeline](../guides/content-processing.md)