# FreeCrawl MCP Server
A production-ready Model Context Protocol (MCP) server for web scraping and document processing, designed as a self-hosted replacement for Firecrawl.
## 🚀 Features
- **JavaScript-enabled web scraping** with Playwright and anti-detection measures
- **Document processing** for PDF, DOCX, and other formats, with OCR and fallback extraction strategies
- **Concurrent batch processing** with configurable limits
- **Intelligent caching** with SQLite backend
- **Rate limiting** per domain
- **Comprehensive error handling** with retry logic
- **Easy installation** via `uvx` or local development setup
- **Health monitoring** and metrics collection
## MCP Config (using `uvx`)
```json
{
  "mcpServers": {
    "freecrawl": {
      "command": "uvx",
      "args": ["freecrawl-mcp"]
    }
  }
}
```
## 📦 Installation & Usage
### Quick Start with uvx (Recommended)
The easiest way to use FreeCrawl is with `uvx`, which automatically manages dependencies:
```bash
# Install browsers on first run
uvx freecrawl-mcp --install-browsers
# Test functionality
uvx freecrawl-mcp --test
```
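The server itself is normally launched by your MCP client using the same command as in the config above, but you can also start it manually to confirm it runs:
```bash
# Start the MCP server on stdio (stop with Ctrl+C)
uvx freecrawl-mcp
```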
### Local Development Setup
For local development or customization:
1. **Clone from GitHub:**
```bash
git clone https://github.com/dylan-gluck/freecrawl-mcp.git
cd freecrawl-mcp
```
2. **Set up environment:**
```bash
# Sync dependencies
uv sync
# Install browser dependencies
uv run freecrawl-mcp --install-browsers
# Run tests
uv run freecrawl-mcp --test
```
3. **Run the server:**
```bash
uv run freecrawl-mcp
```
## 🛠 Configuration
Configure FreeCrawl using environment variables:
### Basic Configuration
```bash
# Transport (stdio for MCP, http for REST API)
export FREECRAWL_TRANSPORT=stdio
# Browser pool settings
export FREECRAWL_MAX_BROWSERS=3
export FREECRAWL_HEADLESS=true
# Concurrency limits
export FREECRAWL_MAX_CONCURRENT=10
export FREECRAWL_MAX_PER_DOMAIN=3
# Cache settings
export FREECRAWL_CACHE=true
export FREECRAWL_CACHE_DIR=/tmp/freecrawl_cache
export FREECRAWL_CACHE_TTL=3600
export FREECRAWL_CACHE_SIZE=536870912 # 512MB
# Rate limiting
export FREECRAWL_RATE_LIMIT=60 # requests per minute
# Logging
export FREECRAWL_LOG_LEVEL=INFO
```
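For example, a minimal setup that switches to the HTTP transport with a single browser and a short-lived cache could combine the variables above like this (host/port settings, if any, are not covered by this snippet):
```bash
export FREECRAWL_TRANSPORT=http
export FREECRAWL_MAX_BROWSERS=1
export FREECRAWL_CACHE=true
export FREECRAWL_CACHE_TTL=600
uvx freecrawl-mcp
```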
### Security Settings
```bash
# API authentication (optional)
export FREECRAWL_REQUIRE_API_KEY=false
export FREECRAWL_API_KEYS=key1,key2,key3
# Domain blocking
export FREECRAWL_BLOCKED_DOMAINS=localhost,127.0.0.1
# Anti-detection
export FREECRAWL_ANTI_DETECT=true
export FREECRAWL_ROTATE_UA=true
```
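As a sketch of a locked-down HTTP deployment (how clients present the API key is outside the scope of this snippet), the security variables can be combined with the transport setting:
```bash
export FREECRAWL_TRANSPORT=http
export FREECRAWL_REQUIRE_API_KEY=true
export FREECRAWL_API_KEYS=prod-key-1,prod-key-2
export FREECRAWL_BLOCKED_DOMAINS=localhost,127.0.0.1
```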
## 🔧 MCP Tools
FreeCrawl provides the following MCP tools:
### `freecrawl_scrape`
Scrape content from a single URL with advanced options.
**Parameters:**
- `url` (string): URL to scrape
- `formats` (array): Output formats - `["markdown", "html", "text", "screenshot", "structured"]`
- `javascript` (boolean): Enable JavaScript execution (default: true)
- `wait_for` (string, optional): CSS selector to wait for, or a delay in milliseconds
- `anti_bot` (boolean): Enable anti-detection measures (default: true)
- `headers` (object, optional): Custom HTTP headers
- `cookies` (object, optional): Custom cookies
- `cache` (boolean): Use cached results if available (default: true)
- `timeout` (number): Total timeout in milliseconds (default: 30000)
**Example:**
```json
{
  "name": "freecrawl_scrape",
  "arguments": {
    "url": "https://example.com",
    "formats": ["markdown", "screenshot"],
    "javascript": true,
    "wait_for": "2000"
  }
}
```
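Because `wait_for` also accepts a CSS selector, a call can wait for a specific element instead of a fixed delay; the URL and selector below are placeholders:
```json
{
  "name": "freecrawl_scrape",
  "arguments": {
    "url": "https://example.com/article",
    "formats": ["markdown"],
    "wait_for": "#main-content",
    "cache": false,
    "timeout": 45000
  }
}
```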
### `freecrawl_batch_scrape`
Scrape multiple URLs concurrently.
**Parameters:**
- `urls` (array): List of URLs to scrape (max 100)
- `concurrency` (number): Maximum concurrent requests (default: 5)
- `formats` (array): Output formats (default: `["markdown"]`)
- `common_options` (object, optional): Options applied to all URLs
- `continue_on_error` (boolean): Continue if individual URLs fail (default: true)
**Example:**
```json
{
  "name": "freecrawl_batch_scrape",
  "arguments": {
    "urls": [
      "https://example.com/page1",
      "https://example.com/page2"
    ],
    "concurrency": 3,
    "formats": ["markdown", "text"]
  }
}
```
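Assuming `common_options` accepts the same fields as `freecrawl_scrape`, a batch that disables JavaScript and shortens the timeout for every URL might look like this:
```json
{
  "name": "freecrawl_batch_scrape",
  "arguments": {
    "urls": [
      "https://example.com/docs/a",
      "https://example.com/docs/b",
      "https://example.com/docs/c"
    ],
    "concurrency": 5,
    "common_options": {
      "javascript": false,
      "timeout": 15000
    }
  }
}
```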
### `freecrawl_extract`
Extract structured data using schema-driven approach.
**Parameters:**
- `url` (string): URL to extract data from
- `schema` (object): JSON Schema or Pydantic model definition
- `prompt` (string, optional): Custom extraction instructions
- `validation` (boolean): Validate against schema (default: true)
- `multiple` (boolean): Extract multiple matching items (default: false)
**Example:**
```json
{
  "name": "freecrawl_extract",
  "arguments": {
    "url": "https://example.com/product",
    "schema": {
      "type": "object",
      "properties": {
        "title": {"type": "string"},
        "price": {"type": "number"}
      }
    }
  }
}
```
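For list-style pages, `multiple` and `prompt` can be combined; the URL, prompt, and schema here are purely illustrative:
```json
{
  "name": "freecrawl_extract",
  "arguments": {
    "url": "https://example.com/jobs",
    "multiple": true,
    "prompt": "Extract every job posting on the page",
    "schema": {
      "type": "object",
      "properties": {
        "title": {"type": "string"},
        "location": {"type": "string"}
      }
    }
  }
}
```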
### `freecrawl_process_document`
Process documents (PDF, DOCX, etc.) with OCR support.
**Parameters:**
- `file_path` (string, optional): Path to document file
- `url` (string, optional): URL to download document from
- `strategy` (string): Processing strategy - `"fast"`, `"hi_res"`, `"ocr_only"` (default: "hi_res")
- `formats` (array): Output formats - `["markdown", "structured", "text"]`
- `languages` (array, optional): OCR languages (e.g., `["eng", "fra"]`)
- `extract_images` (boolean): Extract embedded images (default: false)
- `extract_tables` (boolean): Extract and structure tables (default: true)
**Example:**
```json
{
  "name": "freecrawl_process_document",
  "arguments": {
    "url": "https://example.com/document.pdf",
    "strategy": "hi_res",
    "formats": ["markdown", "structured"]
  }
}
```
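A local scanned document can instead be processed with the OCR-only strategy and explicit languages; the file path is a placeholder:
```json
{
  "name": "freecrawl_process_document",
  "arguments": {
    "file_path": "/path/to/scanned-report.pdf",
    "strategy": "ocr_only",
    "languages": ["eng", "fra"],
    "formats": ["text"],
    "extract_tables": false
  }
}
```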
### `freecrawl_health_check`
Get server health status and metrics.
**Example:**
```json
{
  "name": "freecrawl_health_check",
  "arguments": {}
}
```
## 🔄 Integration with Claude Code
### MCP Configuration
Add FreeCrawl to your MCP configuration:
**Using uvx (Recommended):**
```json
{
  "mcpServers": {
    "freecrawl": {
      "command": "uvx",
      "args": ["freecrawl-mcp"]
    }
  }
}
```
**Using local development setup:**
```json
{
  "mcpServers": {
    "freecrawl": {
      "command": "uv",
      "args": ["run", "freecrawl-mcp"],
      "cwd": "/path/to/freecrawl-mcp"
    }
  }
}
```
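If you prefer registering the server from the command line, recent Claude Code versions ship an `mcp add` subcommand; treat the exact syntax as a sketch and check `claude mcp add --help` for your version:
```bash
# Register FreeCrawl with Claude Code (flags may vary by version)
claude mcp add freecrawl -- uvx freecrawl-mcp
```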
### Usage in Prompts
```
Please scrape the content from https://example.com and extract the main article text in markdown format.
```
Claude Code will automatically use the `freecrawl_scrape` tool to fetch and process the content.
## 🚀 Performance & Scalability
### Resource Usage
- **Memory**: ~100MB base + ~50MB per browser instance
- **CPU**: Moderate usage during active scraping
- **Storage**: Cache grows based on configured limits
### Throughput
- **Single requests**: 2-5 seconds typical response time
- **Batch processing**: 10-50 concurrent requests depending on configuration
- **Cache hit ratio**: 30%+ for repeated content
### Optimization Tips
1. **Enable caching** for frequently accessed content
2. **Adjust concurrency** based on target site rate limits
3. **Use appropriate formats** - markdown is faster than screenshots
4. **Configure rate limiting** to avoid being blocked
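Putting those tips together, a cache-heavy, politely rate-limited configuration could look like this (the values are illustrative starting points, not benchmarks):
```bash
export FREECRAWL_CACHE=true
export FREECRAWL_CACHE_TTL=86400      # keep results for a day
export FREECRAWL_MAX_CONCURRENT=5     # stay under target-site limits
export FREECRAWL_MAX_PER_DOMAIN=2
export FREECRAWL_RATE_LIMIT=30        # requests per minute
```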
## 🛡 Security Considerations
### Anti-Detection
- Rotating user agents
- Realistic browser fingerprints
- Request timing randomization
- JavaScript execution in sandboxed environment
### Input Validation
- URL format validation
- Private IP blocking
- Domain blocklist support
- Request size limits
### Resource Protection
- Memory usage monitoring
- Browser pool size limits
- Request timeout enforcement
- Rate limiting per domain
## 🔧 Troubleshooting
### Common Issues
| Issue | Possible Cause | Solution |
|-------|----------------|----------|
| High memory usage | Too many browser instances | Reduce `FREECRAWL_MAX_BROWSERS` |
| Slow responses | JavaScript-heavy sites | Increase timeout or disable JS |
| Bot detection | Missing anti-detection | Ensure `FREECRAWL_ANTI_DETECT=true` |
| Cache misses | TTL too short | Increase `FREECRAWL_CACHE_TTL` |
| Import errors | Missing dependencies | Run `uvx freecrawl-mcp --test` |
### Debug Mode
**With uvx:**
```bash
export FREECRAWL_LOG_LEVEL=DEBUG
uvx freecrawl-mcp --test
```
**Local development:**
```bash
export FREECRAWL_LOG_LEVEL=DEBUG
uv run freecrawl-mcp --test
```
## 📈 Monitoring & Observability
### Health Metrics
- Browser pool status
- Memory and CPU usage
- Cache hit rates
- Request success rates
- Response times
### Logging
FreeCrawl provides structured logging with configurable levels:
- ERROR: Critical failures
- WARNING: Recoverable issues
- INFO: General operations
- DEBUG: Detailed troubleshooting
## 🔧 Development
### Running Tests
**With uvx:**
```bash
# Basic functionality test
uvx freecrawl-mcp --test
```
**Local development:**
```bash
# Basic functionality test
uv run freecrawl-mcp --test
```
### Code Structure
- **Core server**: `FreeCrawlServer` class
- **Browser management**: `BrowserPool` for resource pooling
- **Content extraction**: `ContentExtractor` with multiple strategies
- **Caching**: `CacheManager` with SQLite backend
- **Rate limiting**: `RateLimiter` with token bucket algorithm
## 📄 License
This project is licensed under the MIT License - see the technical specification for details.
## 🤝 Contributing
1. Fork the repository at https://github.com/dylan-gluck/freecrawl-mcp
2. Create a feature branch
3. Set up local development: `uv sync`
4. Run tests: `uv run freecrawl-mcp --test`
5. Submit a pull request
## 📚 Technical Specification
For detailed technical information, see `ai_docs/FREECRAWL_TECHNICAL_SPEC.md`.
---
**FreeCrawl MCP Server** - Self-hosted web scraping for the modern web 🚀