FreeCrawl MCP Server
A production-ready Model Context Protocol (MCP) server for web scraping and document processing, designed as a self-hosted replacement for Firecrawl.
🚀 Features
- JavaScript-enabled web scraping with Playwright and anti-detection measures
- Document processing with fallback support for various formats
- Concurrent batch processing with configurable limits
- Intelligent caching with SQLite backend
- Rate limiting per domain
- Comprehensive error handling with retry logic
- Easy installation via `uvx` or local development setup
- Health monitoring and metrics collection
📦 Installation & Usage
Quick Start with uvx (Recommended)
The easiest way to use FreeCrawl is with `uvx`, which automatically manages dependencies:
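For example (the package name `freecrawl-mcp` matches the command used in the Troubleshooting section; treat the exact invocation as a sketch):

```bash
# Start the server directly; uvx resolves and installs dependencies on first run
uvx freecrawl-mcp

# Optional: run the built-in self-test to verify the installation
uvx freecrawl-mcp --test
```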
Local Development Setup
For local development or customization (a combined sketch of the steps below follows the list):
- Clone from GitHub:
- Set up environment:
- Run the server:
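A sketch of the three steps above, using the repository URL and `uv` commands referenced in the Contributing section (the final run command is assumed):

```bash
git clone https://github.com/dylan-gluck/freecrawl-mcp
cd freecrawl-mcp

# Install dependencies into a local virtual environment
uv sync

# Start the MCP server
uv run freecrawl-mcp
```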
🛠 Configuration
Configure FreeCrawl using environment variables:
Basic Configuration
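For example (illustrative values; both variable names appear in the Troubleshooting table below, and the TTL unit is assumed to be seconds):

```bash
# Size of the headless browser pool
export FREECRAWL_MAX_BROWSERS=3

# Cache time-to-live (assumed seconds)
export FREECRAWL_CACHE_TTL=3600
```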
Security Settings
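For example (only `FREECRAWL_ANTI_DETECT` is confirmed elsewhere in this README; the domain blocklist and private-IP rules described under Security Considerations are configured separately and their variable names are not documented here):

```bash
# Enable anti-detection measures (rotating user agents, realistic fingerprints, timing randomization)
export FREECRAWL_ANTI_DETECT=true
```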
🔧 MCP Tools
FreeCrawl provides the following MCP tools:
freecrawl_scrape
Scrape content from a single URL with advanced options.
Parameters:
- `url` (string): URL to scrape
- `formats` (array): Output formats - `["markdown", "html", "text", "screenshot", "structured"]`
- `javascript` (boolean): Enable JavaScript execution (default: true)
- `wait_for` (string, optional): CSS selector or time (ms) to wait
- `anti_bot` (boolean): Enable anti-detection measures (default: true)
- `headers` (object, optional): Custom HTTP headers
- `cookies` (object, optional): Custom cookies
- `cache` (boolean): Use cached results if available (default: true)
- `timeout` (number): Total timeout in milliseconds (default: 30000)
Example:
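A representative argument payload (illustrative values; parameter names as documented above, and the exact call syntax depends on your MCP client):

```json
{
  "url": "https://example.com/article",
  "formats": ["markdown", "screenshot"],
  "javascript": true,
  "wait_for": ".article-body",
  "cache": true,
  "timeout": 30000
}
```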
freecrawl_batch_scrape
Scrape multiple URLs concurrently.
Parameters:
- `urls` (array): List of URLs to scrape (max 100)
- `concurrency` (number): Maximum concurrent requests (default: 5)
- `formats` (array): Output formats (default: `["markdown"]`)
- `common_options` (object, optional): Options applied to all URLs
- `continue_on_error` (boolean): Continue if individual URLs fail (default: true)
Example:
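An illustrative payload using the documented parameters:

```json
{
  "urls": ["https://example.com/a", "https://example.com/b", "https://example.com/c"],
  "concurrency": 5,
  "formats": ["markdown"],
  "common_options": {"javascript": true},
  "continue_on_error": true
}
```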
freecrawl_extract
Extract structured data using schema-driven approach.
Parameters:
- `url` (string): URL to extract data from
- `schema` (object): JSON Schema or Pydantic model definition
- `prompt` (string, optional): Custom extraction instructions
- `validation` (boolean): Validate against schema (default: true)
- `multiple` (boolean): Extract multiple matching items (default: false)
Example:
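An illustrative payload with a small JSON Schema (the field names are made up for the example):

```json
{
  "url": "https://example.com/product",
  "schema": {
    "type": "object",
    "properties": {
      "name": {"type": "string"},
      "price": {"type": "number"}
    },
    "required": ["name", "price"]
  },
  "prompt": "Extract the product name and price",
  "validation": true,
  "multiple": false
}
```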
freecrawl_process_document
Process documents (PDF, DOCX, etc.) with OCR support.
Parameters:
- `file_path` (string, optional): Path to document file
- `url` (string, optional): URL to download document from
- `strategy` (string): Processing strategy - `"fast"`, `"hi_res"`, or `"ocr_only"` (default: `"hi_res"`)
- `formats` (array): Output formats - `["markdown", "structured", "text"]`
- `languages` (array, optional): OCR languages (e.g., `["eng", "fra"]`)
- `extract_images` (boolean): Extract embedded images (default: false)
- `extract_tables` (boolean): Extract and structure tables (default: true)
Example:
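An illustrative payload for a remote PDF:

```json
{
  "url": "https://example.com/report.pdf",
  "strategy": "hi_res",
  "formats": ["markdown", "structured"],
  "languages": ["eng"],
  "extract_tables": true
}
```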
freecrawl_health_check
Get server health status and metrics.
Example:
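The tool takes no required parameters, so an empty argument object is enough; the response reports the metrics listed under Monitoring & Observability below:

```json
{}
```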
🔄 Integration with Claude Code
MCP Configuration
Add FreeCrawl to your MCP configuration:
Using uvx (Recommended):
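A sketch assuming the standard `mcpServers` configuration format; the server key name `freecrawl` is arbitrary:

```json
{
  "mcpServers": {
    "freecrawl": {
      "command": "uvx",
      "args": ["freecrawl-mcp"]
    }
  }
}
```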
Using local development setup:
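A sketch under the same assumptions, pointing at a local checkout (replace the path with your clone location):

```json
{
  "mcpServers": {
    "freecrawl": {
      "command": "uv",
      "args": ["run", "--directory", "/path/to/freecrawl-mcp", "freecrawl-mcp"]
    }
  }
}
```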
Usage in Prompts
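For example, a prompt along these lines (illustrative):

```
Scrape https://example.com/blog/latest and summarize the main points in markdown.
```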
Claude Code will automatically use the `freecrawl_scrape` tool to fetch and process the content.
🚀 Performance & Scalability
Resource Usage
- Memory: ~100MB base + ~50MB per browser instance
- CPU: Moderate usage during active scraping
- Storage: Cache grows based on configured limits
Throughput
- Single requests: 2-5 seconds typical response time
- Batch processing: 10-50 concurrent requests depending on configuration
- Cache hit ratio: 30%+ for repeated content
Optimization Tips
- Enable caching for frequently accessed content
- Adjust concurrency based on target site rate limits
- Use appropriate formats - markdown is faster than screenshots
- Configure rate limiting to avoid being blocked
🛡 Security Considerations
Anti-Detection
- Rotating user agents
- Realistic browser fingerprints
- Request timing randomization
- JavaScript execution in sandboxed environment
Input Validation
- URL format validation
- Private IP blocking
- Domain blocklist support
- Request size limits
Resource Protection
- Memory usage monitoring
- Browser pool size limits
- Request timeout enforcement
- Rate limiting per domain
🔧 Troubleshooting
Common Issues
| Issue | Possible Cause | Solution |
|---|---|---|
| High memory usage | Too many browser instances | Reduce `FREECRAWL_MAX_BROWSERS` |
| Slow responses | JavaScript-heavy sites | Increase timeout or disable JS |
| Bot detection | Missing anti-detection | Ensure `FREECRAWL_ANTI_DETECT=true` |
| Cache misses | TTL too short | Increase `FREECRAWL_CACHE_TTL` |
| Import errors | Missing dependencies | Run `uvx freecrawl-mcp --test` |
Debug Mode
With uvx:
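A hedged sketch; the `FREECRAWL_LOG_LEVEL` variable name is an assumption (this README lists a DEBUG log level under Logging but does not name the setting):

```bash
FREECRAWL_LOG_LEVEL=DEBUG uvx freecrawl-mcp   # hypothetical variable name
```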
Local development:
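Under the same assumption, with the local setup:

```bash
FREECRAWL_LOG_LEVEL=DEBUG uv run freecrawl-mcp   # hypothetical variable name
```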
📈 Monitoring & Observability
Health Metrics
- Browser pool status
- Memory and CPU usage
- Cache hit rates
- Request success rates
- Response times
Logging
FreeCrawl provides structured logging with configurable levels:
- ERROR: Critical failures
- WARNING: Recoverable issues
- INFO: General operations
- DEBUG: Detailed troubleshooting
🔧 Development
Running Tests
With uvx:
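This is the same self-test command referenced in the Troubleshooting table:

```bash
uvx freecrawl-mcp --test
```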
Local development:
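As used in the Contributing section:

```bash
uv run freecrawl-mcp --test
```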
Code Structure
- Core server: `FreeCrawlServer` class
- Browser management: `BrowserPool` for resource pooling
- Content extraction: `ContentExtractor` with multiple strategies
- Caching: `CacheManager` with SQLite backend
- Rate limiting: `RateLimiter` with token bucket algorithm
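To illustrate the token-bucket approach named above, here is a minimal sketch of per-domain rate limiting; it is not the actual `RateLimiter` implementation, and the rates shown are arbitrary:

```python
import time


class TokenBucket:
    """Tokens refill at `rate` per second up to `capacity`;
    a request proceeds only if a whole token is available."""

    def __init__(self, rate: float, capacity: float) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill based on elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class DomainRateLimiter:
    """Keeps one bucket per domain so each target site is throttled independently."""

    def __init__(self, rate: float = 1.0, capacity: float = 5.0) -> None:
        self.rate = rate
        self.capacity = capacity
        self.buckets: dict[str, TokenBucket] = {}

    def allow(self, domain: str) -> bool:
        if domain not in self.buckets:
            self.buckets[domain] = TokenBucket(self.rate, self.capacity)
        return self.buckets[domain].allow()


# Usage: skip or delay a request when the domain's bucket is empty
limiter = DomainRateLimiter(rate=1.0, capacity=5.0)
if limiter.allow("example.com"):
    ...  # proceed with the request
```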
📄 License
This project is licensed under the MIT License - see the technical specification for details.
🤝 Contributing
- Fork the repository at https://github.com/dylan-gluck/freecrawl-mcp
- Create a feature branch
- Set up local development: `uv sync`
- Run tests: `uv run freecrawl-mcp --test`
- Submit a pull request
📚 Technical Specification
For detailed technical information, see `ai_docs/FREECRAWL_TECHNICAL_SPEC.md`.
FreeCrawl MCP Server - Self-hosted web scraping for the modern web 🚀