# FreeCrawl MCP Server - Discovery Document

## Executive Summary

Based on comprehensive research across Firecrawl's capabilities, Unstructured Open Source, and modern web scraping alternatives, we recommend building **FreeCrawl** as a hybrid MCP server combining:

1. **FastMCP + Playwright** for modern web scraping with JavaScript support
2. **Unstructured Open Source** for document processing and extraction
3. **uv single-file script** architecture for portability and simplicity
4. **Streamable HTTP transport** for production scalability

## Research Synthesis

### Primary Methods Required

#### Core Web Scraping Methods

```python
# Single URL scraping with format options
async def scrape(url: str, formats: list[str] = ["markdown"], **options) -> dict

# Batch URL processing with concurrency control
async def batch_scrape(urls: list[str], concurrency: int = 10, **options) -> list[dict]

# Recursive website crawling with depth limits
async def crawl(url: str, max_depth: int = 2, limit: int = 100, **options) -> list[dict]

# Website URL discovery and mapping
async def map_site(url: str, limit: int = 1000) -> list[str]

# Web search with optional content extraction
async def search(query: str, limit: int = 5, scrape_results: bool = False) -> list[dict]

# AI-powered structured data extraction
async def extract(url: str, schema: dict, prompt: str | None = None) -> dict
```

#### Document Processing Methods (via Unstructured)

```python
# Process uploaded/local documents
async def process_document(file_path: str, strategy: str = "hi_res") -> list[dict]

# Chunk documents for RAG applications
async def chunk_document(elements: list[dict], strategy: str = "by_title") -> list[dict]
```
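Since Unstructured's partitioning API is synchronous, the async tools above would likely delegate to worker threads (e.g. via `asyncio.to_thread`). A minimal sketch of the underlying calls using Unstructured's `partition` and `chunk_by_title` helpers; the element-to-dict serialization shown here is an assumption about FreeCrawl's internal format:

```python
# Sketch only: assumes unstructured[all-docs] and its system deps are installed.
from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title


def process_document_sync(file_path: str, strategy: str = "hi_res") -> list[dict]:
    # partition() auto-detects the file type and returns a list of Elements
    elements = partition(filename=file_path, strategy=strategy)
    return [el.to_dict() for el in elements]


def chunk_document_sync(file_path: str) -> list[dict]:
    elements = partition(filename=file_path)
    # chunk_by_title() groups elements into title-bounded chunks for RAG
    chunks = chunk_by_title(elements, max_characters=1000)
    return [chunk.to_dict() for chunk in chunks]
```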
### Data Model Architecture

#### Standard Response Format

```typescript
interface ScrapedContent {
  url: string;
  title?: string;
  markdown?: string;
  html?: string;
  screenshot?: string; // base64
  metadata: {
    timestamp: string;
    status_code: number;
    content_type: string;
    page_load_time: number;
    word_count: number;
    language?: string;
  };
  elements?: DocumentElement[]; // For structured extraction
}

interface DocumentElement {
  type: "Title" | "Text" | "List" | "Table" | "Image";
  content: string;
  metadata: {
    page_number?: number;
    coordinates?: BoundingBox;
    confidence?: number;
  };
}
```
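On the Python side, these response shapes could map onto Pydantic models (Pydantic is already in the dependency list). A sketch mirroring the interfaces above; field names are taken directly from `ScrapedContent`:

```python
from pydantic import BaseModel


class ScrapeMetadata(BaseModel):
    timestamp: str
    status_code: int
    content_type: str
    page_load_time: float
    word_count: int
    language: str | None = None


class ScrapedContent(BaseModel):
    url: str
    title: str | None = None
    markdown: str | None = None
    html: str | None = None
    screenshot: str | None = None  # base64-encoded image
    metadata: ScrapeMetadata
    elements: list[dict] | None = None  # structured-extraction elements
```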
#### Tool Schema Definitions

```typescript
// Primary scraping tool
{
  name: "freecrawl_scrape",
  description: "Scrape content from a single URL with anti-detection",
  inputSchema: {
    type: "object",
    properties: {
      url: { type: "string", description: "URL to scrape" },
      formats: {
        type: "array",
        items: { enum: ["markdown", "html", "screenshot", "structured"] },
        default: ["markdown"]
      },
      javascript: { type: "boolean", default: true },
      wait_for: { type: "number", default: 2000 },
      anti_bot: { type: "boolean", default: true },
      extract_schema: { type: "object", description: "Schema for structured extraction" }
    },
    required: ["url"]
  }
}
```
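With FastMCP, an input schema like this is typically generated from the function signature rather than written by hand. A hedged sketch of how `freecrawl_scrape` might be registered; parameter names mirror the schema above, and the body is a placeholder:

```python
from fastmcp import FastMCP

mcp = FastMCP("FreeCrawl")


@mcp.tool()
async def freecrawl_scrape(
    url: str,
    formats: list[str] | None = None,  # defaults to ["markdown"]
    javascript: bool = True,
    wait_for: int = 2000,
    anti_bot: bool = True,
    extract_schema: dict | None = None,
) -> dict:
    """Scrape content from a single URL with anti-detection."""
    formats = formats or ["markdown"]
    ...  # delegate to the Playwright scraping pipeline
```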
### Requirements for MVP

#### Core Dependencies

```toml
# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "fastmcp>=0.3.0",
#     "playwright>=1.40.0",
#     "unstructured[all-docs]>=0.15.0",
#     "aiohttp>=3.9.0",
#     "beautifulsoup4>=4.12.0",
#     "markdownify>=0.11.0",  # HTML to Markdown
#     "pydantic>=2.0.0",
#     "tenacity>=8.0.0",  # Retry logic
# ]
# ///
```

#### System Requirements

- Python 3.12+
- Playwright browser binaries (`playwright install`)
- System dependencies for Unstructured:
  - `libmagic-dev`
  - `poppler-utils`
  - `tesseract-ocr`
  - `libreoffice`

#### Transport Configuration

```python
import os

# Dual transport support; exact transport names depend on the FastMCP version
if os.getenv("FREECRAWL_TRANSPORT") == "http":
    # Streamable HTTP for production
    mcp.run(transport="streamable-http", port=8000)
else:
    # STDIO for development (default)
    mcp.run(transport="stdio")
```

### Architecture Recommendations

#### Single-File uv Script Approach

**Pros:**
- Zero-config deployment
- Self-installing dependencies
- Portable across environments
- Minimal infrastructure requirements

**Cons:**
- Limited to ~2000 lines for maintainability
- No separate test files
- Harder to modularize complex features

#### Multi-File Project Alternative

**When to use:** if the MVP grows beyond single-file limitations, e.g.:
- Complex authentication requirements
- Enterprise features (monitoring, logging)
- Extensive test suite requirements
- Multiple transport protocols

### Performance & Scalability

#### Async Architecture Pattern

```python
import asyncio
from contextlib import asynccontextmanager

import aiohttp


class FreeCrawlServer:
    def __init__(self):
        self.session_pool = aiohttp.ClientSession(
            connector=aiohttp.TCPConnector(limit=100),
            timeout=aiohttp.ClientTimeout(total=30),
        )
        self.browser_pool = None  # Managed Playwright browsers

    @asynccontextmanager
    async def get_browser(self):
        # Pool management for concurrent scraping (checkout/checkin placeholder)
        yield self.browser_pool

    async def _scrape_with_semaphore(self, url: str, semaphore: asyncio.Semaphore) -> dict:
        async with semaphore:
            return await self.scrape(url)  # single-URL scrape, defined elsewhere

    async def scrape_concurrent(self, urls: list[str]) -> list[dict]:
        semaphore = asyncio.Semaphore(10)  # Concurrency limit
        tasks = [self._scrape_with_semaphore(url, semaphore) for url in urls]
        return await asyncio.gather(*tasks, return_exceptions=True)
```

#### Anti-Detection Strategy

```python
# Rotating user agents and proxy support
BROWSER_CONFIG = {
    "user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)...",
    "viewport": {"width": 1920, "height": 1080},
    "locale": "en-US",
    "timezone": "America/New_York",
    "permissions": ["geolocation"],
    "extra_http_headers": {
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
    },
}
```
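A sketch of how `BROWSER_CONFIG` could be applied when creating a Playwright browser context. Note that Playwright's parameter for the timezone is `timezone_id`, so the `timezone` key above needs mapping:

```python
from playwright.async_api import async_playwright


async def new_stealth_page():
    # Sketch only: the caller owns cleanup of the Playwright instance and browser
    p = await async_playwright().start()
    browser = await p.chromium.launch(headless=True)
    # Map BROWSER_CONFIG onto Playwright's context options
    context = await browser.new_context(
        user_agent=BROWSER_CONFIG["user_agent"],
        viewport=BROWSER_CONFIG["viewport"],
        locale=BROWSER_CONFIG["locale"],
        timezone_id=BROWSER_CONFIG["timezone"],  # Playwright calls this timezone_id
        permissions=BROWSER_CONFIG["permissions"],
        extra_http_headers=BROWSER_CONFIG["extra_http_headers"],
    )
    return await context.new_page()
```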
### Error Handling & Reliability

#### Retry Logic with Exponential Backoff

```python
import aiohttp
from tenacity import retry, stop_after_attempt, wait_exponential


@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=10),
)
async def robust_scrape(url: str) -> dict:
    # Minimal body; the full implementation adds comprehensive error handling
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            resp.raise_for_status()  # raising here triggers a retry
            return {"url": url, "html": await resp.text()}
```

#### Status Monitoring

```python
import time

START_TIME = time.time()


# Health check endpoint for production; assumes a module-level
# FreeCrawlServer instance named `server`
@mcp.tool()
def health_check() -> dict:
    return {
        "status": "healthy",
        "browser_pool_size": len(server.browser_pool or []),
        "active_sessions": len(server.session_pool.connector._conns),  # aiohttp internal
        "uptime_seconds": time.time() - START_TIME,
    }
```

## MVP Implementation Plan

### Phase 1: Core Functionality (Weeks 1-2)
- [x] Research completed
- [ ] Single-file uv script structure
- [ ] Basic scrape tool with Playwright
- [ ] Markdown conversion pipeline
- [ ] MCP server registration and transport

### Phase 2: Enhanced Features (Weeks 3-4)
- [ ] Batch scraping with concurrency
- [ ] Unstructured document processing integration
- [ ] Basic anti-detection measures
- [ ] Error handling and retry logic

### Phase 3: Production Features (Weeks 5-6)
- [ ] Advanced anti-bot capabilities
- [ ] Structured data extraction with schemas
- [ ] Performance optimization
- [ ] Comprehensive testing and documentation

### Phase 4: Advanced Capabilities (Optional)
- [ ] Crawling with depth control
- [ ] Web search integration
- [ ] Proxy rotation system
- [ ] Enterprise authentication

## Comparison with Firecrawl

| Feature | Firecrawl | FreeCrawl MVP |
|---------|-----------|---------------|
| **Web Scraping** | ✅ Advanced | ✅ Playwright-based |
| **Document Processing** | ❌ Limited | ✅ Unstructured integration |
| **Anti-Detection** | ✅ Professional | ⚠️ Basic (MVP) |
| **Cost** | 💰 Credit-based | 🆓 Self-hosted |
| **JavaScript Support** | ✅ Advanced | ✅ Full Playwright |
| **Deployment** | ☁️ Cloud only | 🏠 Self-hosted |
| **Customization** | ❌ Limited | ✅ Full control |
| **MCP Integration** | ✅ Official | ✅ FastMCP |

## Risk Assessment

### Technical Risks

- **Anti-bot detection**: requires ongoing maintenance as sites update their defenses
- **Performance scaling**: may need optimization for high-concurrency scenarios
- **Browser management**: Playwright browser lifecycle management adds complexity

### Mitigation Strategies

- Start with basic anti-detection and enhance iteratively
- Implement circuit breakers and rate limiting (see the sketch below)
- Use managed browser pools with health checks
- Plan for containerized deployment with resource limits
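As a concrete instance of the rate-limiting mitigation, a per-host limiter can be built from stdlib primitives alone. A sketch; the one-second interval is an arbitrary placeholder:

```python
import asyncio
import time
from collections import defaultdict
from urllib.parse import urlparse


class HostRateLimiter:
    """Allow at most one request per host every min_interval seconds."""

    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval
        self.last_request: dict[str, float] = defaultdict(float)
        self.locks: dict[str, asyncio.Lock] = defaultdict(asyncio.Lock)

    async def wait(self, url: str) -> None:
        host = urlparse(url).netloc
        async with self.locks[host]:
            elapsed = time.monotonic() - self.last_request[host]
            if elapsed < self.min_interval:
                await asyncio.sleep(self.min_interval - elapsed)
            self.last_request[host] = time.monotonic()
```

Each scrape task would call `await limiter.wait(url)` before fetching, so bursts against a single host are serialized while different hosts proceed concurrently.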
## Success Metrics

### MVP Success Criteria

- Successfully scrape 95% of standard websites
- Process documents with 90%+ accuracy relative to Firecrawl
- Handle 10 concurrent requests without degradation
- Deploy as a single-file script with zero configuration

### Performance Targets

- Response time: <5 seconds for standard pages
- Memory usage: <500 MB under normal load
- Concurrent capacity: 50+ simultaneous scraping tasks
- Error rate: <5% for accessible websites

## Next Steps

1. **Create a detailed technical specification** using the engineering-lead agent
2. **Prototype the single-file implementation** with core scraping functionality
3. **Integrate Unstructured** for the document processing pipeline
4. **Implement an anti-detection baseline** using Playwright stealth mode
5. **Performance testing** with a representative workload
6. **Production deployment** with monitoring and observability

---

*This discovery document synthesizes research from Firecrawl documentation, Unstructured Open Source analysis, and modern web scraping best practices as of August 2025.*