Supports crawling Angular single-page applications through full Playwright integration with JavaScript execution and dynamic content loading capabilities.
Provides containerized deployment options with multiple variants (CPU-optimized, lightweight, standard, and RunPod serverless) including network isolation, health monitoring, and auto-scaling capabilities.
Supports automated CI/CD deployment workflows for RunPod serverless containers.
Provides Google Search integration with 31 search genres, automatic metadata extraction from search results, safe search enabled by default, and batch search capabilities.
Offers complete JavaScript execution support through Playwright integration for crawling JavaScript-heavy websites including SPAs, with custom script execution and DOM element waiting.
Provides installation scripts and Playwright browser dependency support for Linux/WSL environments.
Supports installation and configuration on macOS systems with platform-specific setup scripts and Claude Desktop integration.
Enables content extraction and export in Markdown format with configurable generation options for crawled web content.
Enables AI-powered content extraction and analysis using OpenAI's LLM models through configurable API key integration.
Implements the MCP server in Python with virtual environment support and pip-based dependency management.
Supports crawling React single-page applications through full Playwright integration with JavaScript execution and dynamic content loading capabilities.
Extracts YouTube video transcripts without authentication using youtube-transcript-api, supporting both auto-generated and manual captions with multi-language support, timestamped segments, and batch processing capabilities.
1. Click on "Install Server".
2. Wait a few minutes for the server to deploy. Once ready, it will show a "Started" state.
3. In the chat, type `@` followed by the MCP server name and your instructions, e.g., "@Crawl4AI MCP Server extract the transcript from this YouTube video: https://youtube.com/watch?v=abc123"
4. That's it! The server will respond to your query, and you can continue using it as needed.

Here is a step-by-step guide with screenshots.
Crawl4AI MCP Server
A comprehensive Model Context Protocol (MCP) server that wraps the powerful crawl4ai library. This server provides advanced web crawling, content extraction, and AI-powered analysis capabilities through the standardized MCP interface.
Key Features
Core Capabilities
Complete JavaScript Support
This feature set enables comprehensive JavaScript-heavy website handling:
Full Playwright Integration - React, Vue, Angular SPA sites fully supported
Dynamic Content Loading - Auto-waits for content to load
Custom JavaScript Execution - Run custom scripts on pages
DOM Element Waiting - `wait_for_selector` for specific elements
Human-like Browsing Simulation - Bypass basic anti-bot measures
Recommended settings for JavaScript-heavy sites (a timeout in the 30-60 second range works well):

```json
{
  "wait_for_js": true,
  "simulate_user": true,
  "timeout": 60,
  "generate_markdown": true
}
```

Advanced Web Crawling with complete JavaScript execution support
Deep Crawling with configurable depth and multiple strategies (BFS, DFS, Best-First)
AI-Powered Content Extraction using LLM-based analysis
File Processing with Microsoft MarkItDown integration
PDF, Office documents, ZIP archives, and more
Automatic file format detection and conversion
Batch processing of archive contents
YouTube Transcript Extraction (youtube-transcript-api v1.1.0+)
No authentication required - works out of the box
Stable and reliable transcript extraction
Support for both auto-generated and manual captions
Multi-language support with priority settings
Timestamped segment information and clean text output
Batch processing for multiple videos
Entity Extraction with 9 built-in patterns including emails, phones, URLs, and dates
Intelligent Content Filtering (BM25, pruning, LLM-based)
Content Chunking for large document processing
Screenshot Capture and media extraction
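The chunking step above can be sketched as a simple sliding window over the document text. This is an illustrative stand-in, not the server's actual chunker:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into overlapping chunks so no piece exceeds chunk_size."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # step forward, keeping `overlap` chars of context
    return chunks

document = "x" * 2500
pieces = chunk_text(document, chunk_size=1000, overlap=100)
print(len(pieces))  # 3 chunks cover the 2500-character document
```

The overlap keeps sentence context intact across chunk boundaries, which matters when each chunk is analyzed independently.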
Advanced Features
Google Search Integration with genre-based filtering and metadata extraction
31 search genres (academic, programming, news, etc.)
Automatic title and snippet extraction from search results
Safe search enabled by default for security
Batch search capabilities with result analysis
Multiple Extraction Strategies include CSS selectors, XPath, regex patterns, and LLM-based extraction
Browser Automation supports custom user agents, headers, cookies, and authentication
Caching System with multiple modes for performance optimization
Custom JavaScript Execution for dynamic content interaction
Structured Data Export in multiple formats (JSON, Markdown, HTML)
Installation
Quick Setup
Linux/macOS:

```bash
./setup.sh
```

Windows:

```bat
setup_windows.bat
```

Manual Installation
1. Create and activate a virtual environment:

```bash
python3 -m venv venv
source venv/bin/activate       # Linux/macOS
# or
venv\Scripts\activate.bat      # Windows
```

2. Install Python dependencies:

```bash
pip install -r requirements.txt
```

3. Install Playwright browser dependencies (Linux/WSL):

```bash
sudo apt-get update
sudo apt-get install libnss3 libnspr4 libasound2 libatk-bridge2.0-0 libdrm2 libgtk-3-0 libgbm1
```

Docker Deployment (Recommended)
For production deployments and easy setup, we provide multiple Docker container variants optimized for different use cases:
Quick Start with Docker

```bash
# Create shared network
docker network create shared_net

# CPU-optimized deployment (recommended for most users)
docker-compose -f docker-compose.cpu.yml build
docker-compose -f docker-compose.cpu.yml up -d

# Verify deployment
docker-compose -f docker-compose.cpu.yml ps
```

Container Variants
| Variant | Use Case | Build Time | Image Size | Memory Limit |
|---|---|---|---|---|
| CPU-Optimized | VPS, local development | 6-9 min | ~1.5-2GB | 1GB |
| Lightweight | Resource-constrained environments | 4-6 min | ~1-1.5GB | 512MB |
| Standard | Full features (may include CUDA) | 8-12 min | ~2-3GB | 2GB |
| RunPod Serverless | Cloud auto-scaling deployment | 6-9 min | ~1.5-2GB | 1GB |
Docker Features
Network Isolation: no localhost exposure; only accessible via `shared_net`
CPU Optimization: CUDA-free builds for 50-70% smaller containers
Auto-scaling: RunPod serverless deployment with automated CI/CD
Health Monitoring: built-in health checks and monitoring
Security: non-root user, resource limits, safe mode enabled
Quick Commands
# Build all variants for comparison
./scripts/build-cpu-containers.sh
# Deploy to RunPod (automated via GitHub Actions)
# Image: docker.io/gemneye/crawl4ai-runpod-serverless:latest
# Health check
docker exec crawl4ai-mcp-cpu python -c "from crawl4ai_mcp.server import mcp; print('β
Healthy')"For complete Docker documentation, see Docker Guide.
Usage
Start the MCP Server

STDIO transport (default):

```bash
python -m crawl4ai_mcp.server
```

HTTP transport:

```bash
python -m crawl4ai_mcp.server --transport http --host 127.0.0.1 --port 8000
```

MCP Command Registration (Claude Code CLI)
You can register this MCP server with Claude Code CLI. The following methods are available:
Using .mcp.json Configuration (Recommended)
Create or update `.mcp.json` in your project directory:

```json
{
  "mcpServers": {
    "crawl4ai": {
      "command": "/home/user/prj/crawl/venv/bin/python",
      "args": ["-m", "crawl4ai_mcp.server"],
      "env": {
        "FASTMCP_LOG_LEVEL": "DEBUG"
      }
    }
  }
}
```

Run `claude mcp` or start Claude Code from the project directory.
Alternative: Command Line Registration

```bash
# Register the MCP server with the claude command
claude mcp add crawl4ai "/path/to/your/venv/bin/python -m crawl4ai_mcp.server" \
    --cwd /path/to/your/crawl4ai-mcp-project

# With environment variables
claude mcp add crawl4ai "/path/to/your/venv/bin/python -m crawl4ai_mcp.server" \
    --cwd /path/to/your/crawl4ai-mcp-project \
    -e FASTMCP_LOG_LEVEL=DEBUG

# With project scope (shared with team)
claude mcp add crawl4ai "/path/to/your/venv/bin/python -m crawl4ai_mcp.server" \
    --cwd /path/to/your/crawl4ai-mcp-project \
    --scope project
```

HTTP Transport (For Remote Access)
```bash
# First start the HTTP server
python -m crawl4ai_mcp.server --transport http --host 127.0.0.1 --port 8000

# Then register the HTTP endpoint
claude mcp add crawl4ai-http --transport http --url http://127.0.0.1:8000/mcp

# Or with Pure StreamableHTTP (recommended)
./scripts/start_pure_http_server.sh
claude mcp add crawl4ai-pure-http --transport http --url http://127.0.0.1:8000/mcp
```

Verification
```bash
# List registered MCP servers
claude mcp list

# Test the connection
claude mcp test crawl4ai

# Remove if needed
claude mcp remove crawl4ai
```

Setting API Keys (Optional for LLM Features)
```bash
# Add with environment variables for LLM functionality
claude mcp add crawl4ai "python -m crawl4ai_mcp.server" \
    --cwd /path/to/your/crawl4ai-mcp-project \
    -e OPENAI_API_KEY=your_openai_key \
    -e ANTHROPIC_API_KEY=your_anthropic_key
```

Claude Desktop Integration
Pure StreamableHTTP Usage (Recommended)
1. Start the server by running the startup script:

```bash
./scripts/start_pure_http_server.sh
```

2. Apply the configuration: copy `configs/claude_desktop_config_pure_http.json` to Claude Desktop's config directory, or add the following to your existing config:

```json
{ "mcpServers": { "crawl4ai-pure-http": { "url": "http://127.0.0.1:8000/mcp" } } }
```

3. Restart Claude Desktop to apply the settings.
4. Start using the tools - crawl4ai tools are now available in chat.
Traditional STDIO Usage
1. Copy the configuration:

```bash
cp configs/claude_desktop_config.json ~/.config/claude-desktop/claude_desktop_config.json
```

2. Restart Claude Desktop to enable the crawl4ai tools.
Configuration File Locations
Windows: `%APPDATA%\Claude\claude_desktop_config.json`
macOS: `~/Library/Application Support/Claude/claude_desktop_config.json`
Linux: `~/.config/claude-desktop/claude_desktop_config.json`

HTTP API Access
This MCP server supports multiple HTTP protocols, allowing you to choose the optimal implementation for your use case.
Pure StreamableHTTP (Recommended)
Pure JSON HTTP protocol without Server-Sent Events (SSE)
Server Startup

```bash
# Method 1: Using the startup script
./scripts/start_pure_http_server.sh

# Method 2: Direct startup
python examples/simple_pure_http_server.py --host 127.0.0.1 --port 8000

# Method 3: Background startup
nohup python examples/simple_pure_http_server.py --port 8000 > server.log 2>&1 &
```

Claude Desktop Configuration
```json
{
  "mcpServers": {
    "crawl4ai-pure-http": {
      "url": "http://127.0.0.1:8000/mcp"
    }
  }
}
```

Usage Steps
1. Start the server: `./scripts/start_pure_http_server.sh`
2. Apply the configuration: use `configs/claude_desktop_config_pure_http.json`
3. Restart Claude Desktop to apply the settings.
Verification

```bash
# Health check
curl http://127.0.0.1:8000/health

# Complete test
python examples/pure_http_test.py
```

Legacy HTTP (SSE Implementation)
Traditional FastMCP StreamableHTTP protocol (with SSE)
Server Startup
# Method 1: Command line
python -m crawl4ai_mcp.server --transport http --host 127.0.0.1 --port 8001
# Method 2: Environment variables
export MCP_TRANSPORT=http
export MCP_HOST=127.0.0.1
export MCP_PORT=8001
python -m crawl4ai_mcp.serverClaude Desktop Configuration
```json
{
  "mcpServers": {
    "crawl4ai-legacy-http": {
      "url": "http://127.0.0.1:8001/mcp"
    }
  }
}
```

Protocol Comparison
| Feature | Pure StreamableHTTP | Legacy HTTP (SSE) | STDIO |
|---|---|---|---|
| Response Format | Plain JSON | Server-Sent Events | Binary |
| Configuration Complexity | Low (URL only) | Low (URL only) | High (process management) |
| Debugging Ease | High (curl compatible) | Medium (SSE parser needed) | Low |
| Independence | High | High | Low |
| Performance | High | Medium | High |
HTTP Usage Examples
Pure StreamableHTTP
```bash
# Initialize a session
SESSION_ID=$(curl -s -X POST http://127.0.0.1:8000/mcp/initialize \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":"init","method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"test","version":"1.0.0"}}}' \
  -D- | grep -i mcp-session-id | cut -d' ' -f2 | tr -d '\r')

# Execute a tool
curl -X POST http://127.0.0.1:8000/mcp \
  -H "Content-Type: application/json" \
  -H "mcp-session-id: $SESSION_ID" \
  -d '{"jsonrpc":"2.0","id":"crawl","method":"tools/call","params":{"name":"crawl_url","arguments":{"url":"https://example.com"}}}'
```

Legacy HTTP
```bash
curl -X POST "http://127.0.0.1:8001/tools/crawl_url" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "generate_markdown": true}'
```

Detailed Documentation
Pure StreamableHTTP: docs/PURE_STREAMABLE_HTTP.md
HTTP Server Usage: docs/HTTP_SERVER_USAGE.md
Legacy HTTP API: docs/HTTP_API_GUIDE.md
Starting the HTTP Server

Method 1: Command line

```bash
python -m crawl4ai_mcp.server --transport http --host 127.0.0.1 --port 8000
```

Method 2: Environment variables

```bash
export MCP_TRANSPORT=http
export MCP_HOST=127.0.0.1
export MCP_PORT=8000
python -m crawl4ai_mcp.server
```

Method 3: Docker (if available)

```bash
docker run -p 8000:8000 crawl4ai-mcp --transport http --port 8000
```

Basic Endpoint Information
Once running, the HTTP API provides:
Base URL: `http://127.0.0.1:8000`
OpenAPI Documentation: `http://127.0.0.1:8000/docs`
Tool Endpoints: `http://127.0.0.1:8000/tools/{tool_name}`
Resource Endpoints: `http://127.0.0.1:8000/resources/{resource_uri}`
All MCP tools (crawl_url, intelligent_extract, process_file, etc.) are accessible via HTTP POST requests with JSON payloads matching the tool parameters.
Quick HTTP Example

```bash
curl -X POST "http://127.0.0.1:8000/tools/crawl_url" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "generate_markdown": true}'
```

For detailed HTTP API documentation, examples, and integration guides, see the HTTP API Guide.
Tool Selection Guide
Choose the Right Tool for Your Task

| Use Case | Recommended Tool | Key Features |
|---|---|---|
| Single webpage | `crawl_url` | Basic crawling, JS support |
| Multiple pages (up to 5) | `deep_crawl_site` | Site mapping, link following |
| Search + crawling | `search_and_crawl` | Google search + auto-crawl |
| Difficult sites | `crawl_url_with_fallback` | Multiple retry strategies |
| Extract specific data | `intelligent_extract` | AI-powered extraction |
| Find patterns | `extract_entities` | Emails, phones, URLs, etc. |
| Structured data | `extract_structured_data` | CSS/XPath/LLM schemas |
| File processing | `process_file` | PDF, Office, ZIP conversion |
| YouTube content | `extract_youtube_transcript` | Subtitle extraction |
Performance Guidelines
Deep Crawling: limited to 5 pages max (stability focused)
Batch Processing: concurrent limits enforced
Timeout Calculation: `pages × base_timeout` recommended
Large Files: 100MB maximum size limit
Retry Strategy: manual retry recommended on first failure
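The `pages × base_timeout` guideline can be written as a one-line helper (hypothetical, purely illustrative):

```python
def recommended_timeout(pages: int, base_timeout: int = 30) -> int:
    """Total timeout guideline for a crawl: pages times base_timeout, in seconds."""
    return pages * base_timeout

print(recommended_timeout(5))      # 150 seconds for a 5-page deep crawl
print(recommended_timeout(3, 60))  # 180 seconds with a 60 s base timeout
```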
Best Practices
For JavaScript-heavy sites:
Always use `wait_for_js: true`
Set `simulate_user: true` for better compatibility
Increase the timeout to 30-60 seconds
Use `wait_for_selector` for specific elements

For AI features:
Configure LLM settings with `get_llm_config_info`
Fall back to non-AI tools if the LLM is unavailable
Use `intelligent_extract` for semantic understanding
MCP Tools
crawl_url
Advanced web crawling with deep crawling support and intelligent filtering.
Key Parameters:
`url`: Target URL to crawl
`max_depth`: Maximum crawling depth (None for single page)
`crawl_strategy`: Strategy type ('bfs', 'dfs', 'best_first')
`content_filter`: Filter type ('bm25', 'pruning', 'llm')
`chunk_content`: Enable content chunking for large documents
`execute_js`: Custom JavaScript code execution
`user_agent`: Custom user agent string
`headers`: Custom HTTP headers
`cookies`: Authentication cookies
deep_crawl_site
Dedicated tool for comprehensive site mapping and recursive crawling.
Parameters:
`url`: Starting URL
`max_depth`: Maximum crawling depth (recommended: 1-3)
`max_pages`: Maximum number of pages to crawl
`crawl_strategy`: Crawling strategy ('bfs', 'dfs', 'best_first')
`url_pattern`: URL filter pattern (e.g., 'docs', 'blog')
`score_threshold`: Minimum relevance score (0.0-1.0)
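As a rough sketch of what the 'bfs' strategy with `max_depth` and `max_pages` limits does, here is a minimal breadth-first crawl over a toy link graph. `fetch_links` stands in for real page fetching and link extraction; this is not the server's actual implementation:

```python
from collections import deque

def bfs_crawl(start_url, fetch_links, max_depth=2, max_pages=5):
    """Breadth-first crawl: visit pages level by level, honoring depth and page limits."""
    visited = []
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # do not expand links beyond max_depth
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited

# A toy site graph standing in for real link extraction:
site = {
    "/": ["/docs", "/blog"],
    "/docs": ["/docs/a", "/docs/b"],
    "/blog": ["/blog/1"],
}
print(bfs_crawl("/", lambda u: site.get(u, []), max_depth=2, max_pages=5))
```

BFS visits all pages at one depth before descending, which is why it pairs naturally with a `max_pages` cap: the pages closest to the start URL are crawled first.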
intelligent_extract
AI-powered content extraction with advanced filtering and analysis.
Parameters:
`url`: Target URL
`extraction_goal`: Description of the extraction target
`content_filter`: Filter type for content quality
`use_llm`: Enable LLM-based intelligent extraction
`llm_provider`: LLM provider (openai, claude, etc.)
`custom_instructions`: Detailed extraction instructions
extract_entities
High-speed entity extraction using regex patterns.
Built-in Entity Types:
`emails`: Email addresses
`phones`: Phone numbers
`urls`: URLs and links
`dates`: Date formats
`ips`: IP addresses
`social_media`: Social media handles (@username, #hashtag)
`prices`: Price information
`credit_cards`: Credit card numbers
`coordinates`: Geographic coordinates
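A regex-based extractor of this kind can be sketched as follows. The patterns below are simplified stand-ins for illustration, not the server's actual built-in patterns:

```python
import re

# Simplified stand-ins for three of the built-in entity types.
PATTERNS = {
    "emails": r"[\w.+-]+@[\w-]+\.[\w.-]+",
    "urls": r"https?://[^\s\"'<>]+",
    "phones": r"\+?\d[\d\s().-]{7,}\d",
}

def extract_entities(text: str) -> dict:
    """Return deduplicated matches for each pattern, preserving first-seen order."""
    return {
        name: list(dict.fromkeys(re.findall(pattern, text)))
        for name, pattern in PATTERNS.items()
    }

sample = "Contact sales@example.com or visit https://example.com (tel: +1 555-123-4567)."
print(extract_entities(sample))
```

Because it is pure regex matching, this approach is fast and deterministic, which is why the server offers it alongside the slower LLM-based extraction.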
extract_structured_data
Traditional structured data extraction using CSS/XPath selectors or LLM schemas.
batch_crawl
Parallel processing of multiple URLs with unified reporting.
crawl_url_with_fallback
Robust crawling with multiple fallback strategies for maximum reliability.
process_file
File Processing: Convert various file formats to Markdown using Microsoft MarkItDown.
Parameters:
`url`: File URL (PDF, Office, ZIP, etc.)
`max_size_mb`: Maximum file size limit (default: 100MB)
`extract_all_from_zip`: Extract all files from ZIP archives
`include_metadata`: Include file metadata in the response
Supported Formats:
PDF: .pdf
Microsoft Office: .docx, .pptx, .xlsx, .xls
Archives: .zip
Web/Text: .html, .htm, .txt, .md, .csv, .rtf
eBooks: .epub
get_supported_file_formats
Format Information: Get a comprehensive list of supported file formats and their capabilities.
extract_youtube_transcript
YouTube Processing: Extract transcripts from YouTube videos with language preferences and translation using youtube-transcript-api v1.1.0+.
Stable and reliable - no authentication required!
Parameters:
`url`: YouTube video URL
`languages`: Preferred languages in order of preference (default: ["ja", "en"])
`translate_to`: Target language for translation (optional)
`include_timestamps`: Include timestamps in the transcript
`preserve_formatting`: Preserve original formatting
`include_metadata`: Include video metadata
batch_extract_youtube_transcripts
Batch YouTube Processing: Extract transcripts from multiple YouTube videos in parallel.
Enhanced performance with controlled concurrency for stable batch processing.
Parameters:
`urls`: List of YouTube video URLs
`languages`: Preferred languages list
`translate_to`: Target language for translation (optional)
`include_timestamps`: Include timestamps in the transcript
`max_concurrent`: Maximum concurrent requests (1-5, default: 3)
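A `max_concurrent` limit like this is typically implemented with a semaphore. The sketch below uses a fake worker in place of real transcript extraction, so it illustrates only the concurrency-control pattern, not the server's actual code:

```python
import asyncio

async def process_batch(urls, worker, max_concurrent=3):
    """Run worker(url) for every URL, never more than max_concurrent at once."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(url):
        async with semaphore:  # blocks while max_concurrent workers are active
            return await worker(url)

    return await asyncio.gather(*(bounded(u) for u in urls))

async def fake_transcript(url):
    await asyncio.sleep(0.01)  # stand-in for a network call
    return f"transcript for {url}"

urls = [f"https://youtu.be/video{i}" for i in range(5)]
results = asyncio.run(process_batch(urls, fake_transcript, max_concurrent=3))
print(len(results))  # 5
```

`asyncio.gather` preserves input order, so results line up with the URL list even though the workers finish in arbitrary order.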
get_youtube_video_info
YouTube Info: Get available transcript information for a YouTube video without extracting the full transcript.
Parameters:
video_url: YouTube video URL
Returns:
Available transcript languages
Manual/auto-generated distinction
Translatable language information
search_google
Google Search: Perform a Google search with genre filtering and metadata extraction.
Parameters:
`query`: Search query string
`num_results`: Number of results to return (1-100, default: 10)
`language`: Search language (default: "en")
`region`: Search region (default: "us")
`search_genre`: Content genre filter (optional)
`safe_search`: Safe search enabled (always True for security)
Features:
Automatic title and snippet extraction from search results
31 available search genres for content filtering
URL classification and domain analysis
Safe search enforced by default
batch_search_google
Batch Google Search: Perform multiple Google searches with comprehensive analysis.
Parameters:
`queries`: List of search queries
`num_results_per_query`: Results per query (1-100, default: 10)
`max_concurrent`: Maximum concurrent searches (1-5, default: 3)
`language`: Search language (default: "en")
`region`: Search region (default: "us")
`search_genre`: Content genre filter (optional)
Returns:
Individual search results for each query
Cross-query analysis and statistics
Domain distribution and result type analysis
search_and_crawl
Integrated Search+Crawl: Perform a Google search and automatically crawl the top results.
Parameters:
`search_query`: Google search query
`num_search_results`: Number of search results (1-20, default: 5)
`crawl_top_results`: Number of top results to crawl (1-10, default: 3)
`extract_media`: Extract media from crawled pages
`generate_markdown`: Generate markdown content
`search_genre`: Content genre filter (optional)
Returns:
Complete search metadata and crawled content
Success rates and processing statistics
Integrated analysis of search and crawl results
get_search_genres
Search Genres: Get a comprehensive list of available search genres and their descriptions.
Returns:
31 available search genres with descriptions
Categorized genre lists (Academic, Technical, News, etc.)
Usage examples for each genre type
Resources
`uri://crawl4ai/config`: Default crawler configuration options
`uri://crawl4ai/examples`: Usage examples and sample requests

Prompts
`crawl_website_prompt`: Guided website crawling workflows
`analyze_crawl_results_prompt`: Crawl result analysis
`batch_crawl_setup_prompt`: Batch crawling setup
Configuration Examples
Google Search Examples
Basic Google Search
```json
{
  "query": "python machine learning tutorial",
  "num_results": 10,
  "language": "en",
  "region": "us"
}
```

Genre-Filtered Search

```json
{
  "query": "machine learning research",
  "num_results": 15,
  "search_genre": "academic",
  "language": "en"
}
```

Batch Search with Analysis

```json
{
  "queries": [
    "python programming tutorial",
    "web development guide",
    "data science introduction"
  ],
  "num_results_per_query": 5,
  "max_concurrent": 3,
  "search_genre": "education"
}
```

Integrated Search and Crawl

```json
{
  "search_query": "python official documentation",
  "num_search_results": 10,
  "crawl_top_results": 5,
  "extract_media": false,
  "generate_markdown": true,
  "search_genre": "documentation"
}
```

Basic Deep Crawling

```json
{
  "url": "https://docs.example.com",
  "max_depth": 2,
  "max_pages": 20,
  "crawl_strategy": "bfs"
}
```

AI-Driven Content Extraction

```json
{
  "url": "https://news.example.com",
  "extraction_goal": "article summary and key points",
  "content_filter": "llm",
  "use_llm": true,
  "custom_instructions": "Extract main article content, summarize key points, and identify important quotes"
}
```

File Processing Examples
PDF Document Processing

```json
{
  "url": "https://example.com/document.pdf",
  "max_size_mb": 50,
  "include_metadata": true
}
```

Office Document Processing

```json
{
  "url": "https://example.com/report.docx",
  "max_size_mb": 25,
  "include_metadata": true
}
```

ZIP Archive Processing

```json
{
  "url": "https://example.com/documents.zip",
  "max_size_mb": 100,
  "extract_all_from_zip": true,
  "include_metadata": true
}
```

Automatic File Detection
The crawl_url tool automatically detects file formats and routes them to the appropriate processing:

```json
{
  "url": "https://example.com/mixed-content.pdf",
  "generate_markdown": true
}
```

YouTube Video Processing Examples
Stable youtube-transcript-api v1.1.0+ integration - no setup required!

Basic Transcript Extraction

```json
{
  "url": "https://www.youtube.com/watch?v=VIDEO_ID",
  "languages": ["ja", "en"],
  "include_timestamps": true,
  "include_metadata": true
}
```

Auto-Translation Feature

```json
{
  "url": "https://www.youtube.com/watch?v=VIDEO_ID",
  "languages": ["en"],
  "translate_to": "ja",
  "include_timestamps": false
}
```

Batch Video Processing

```json
{
  "urls": [
    "https://www.youtube.com/watch?v=VIDEO_ID1",
    "https://www.youtube.com/watch?v=VIDEO_ID2",
    "https://youtu.be/VIDEO_ID3"
  ],
  "languages": ["ja", "en"],
  "max_concurrent": 3
}
```

Automatic YouTube Detection
The crawl_url tool automatically detects YouTube URLs and extracts transcripts:

```json
{
  "url": "https://www.youtube.com/watch?v=VIDEO_ID",
  "generate_markdown": true
}
```

Video Information Lookup

```json
{
  "video_url": "https://www.youtube.com/watch?v=VIDEO_ID"
}
```

Entity Extraction

```json
{
  "url": "https://company.com/contact",
  "entity_types": ["emails", "phones", "social_media"],
  "include_context": true,
  "deduplicate": true
}
```

Authenticated Crawling

```json
{
  "url": "https://private.example.com",
  "auth_token": "Bearer your-token",
  "cookies": {"session_id": "abc123"},
  "headers": {"X-API-Key": "your-key"}
}
```

Project Structure
```
crawl4ai_mcp/
├── __init__.py          # Package initialization
├── server.py            # Main MCP server (1,184+ lines)
├── strategies.py        # Additional extraction strategies
└── suppress_output.py   # Output suppression utilities
config/
├── claude_desktop_config_windows.json  # Claude Desktop config (Windows)
├── claude_desktop_config_script.json   # Script-based config
└── claude_desktop_config.json          # Basic config
docs/
├── README_ja.md              # Japanese documentation
├── setup_instructions_ja.md  # Detailed setup guide
└── troubleshooting_ja.md     # Troubleshooting guide
scripts/
├── setup.sh           # Linux/macOS setup
├── setup_windows.bat  # Windows setup
└── run_server.sh      # Server startup script
```

Troubleshooting
Common Issues

ModuleNotFoundError:
Ensure the virtual environment is activated
Verify PYTHONPATH is set correctly
Install dependencies: `pip install -r requirements.txt`

Playwright Browser Errors:
Install system dependencies: `sudo apt-get install libnss3 libnspr4 libasound2`
For WSL: ensure X11 forwarding or headless mode

JSON Parsing Errors:
Resolved: output suppression is implemented in the latest version
All crawl4ai verbose output is now properly suppressed

For detailed troubleshooting, see docs/troubleshooting_ja.md.
Supported Formats & Capabilities
Web Content
Static Sites: HTML, CSS, JavaScript
Dynamic Sites: React, Vue, Angular SPAs
Complex Sites: JavaScript-heavy, async loading
Protected Sites: Basic auth, cookies, custom headers

Media & Files
Videos: YouTube (automatic transcript extraction)
Documents: PDF, Word, Excel, PowerPoint, ZIP
Archives: Automatic extraction and processing
Text: Markdown, CSV, RTF, plain text

Search & Data
Google Search: 31 genre filters available
Entity Extraction: Emails, phones, URLs, dates
Structured Data: CSS/XPath/LLM-based extraction
Batch Processing: Multiple URLs simultaneously
Limitations & Important Notes
Known Limitations
Authentication Sites: Cannot bypass login requirements
reCAPTCHA Protected: Limited success on heavily protected sites
Rate Limiting: Manual interval management recommended
Automatic Retry: Not implemented - manual retry needed
Deep Crawling: 5 page maximum for stability
Regional & Language Support
Multi-language Sites: Full Unicode support
Regional Search: Configurable region settings
Character Encoding: Automatic detection
Japanese Content: Complete support
Error Handling Strategy
First failure → immediate manual retry
Timeout issues → increase timeout settings
Persistent problems → use `crawl_url_with_fallback`
Alternative approach → try a different tool selection
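The ladder above can be sketched as a retry-then-fallback loop. This is illustrative only; `crawl_url_with_fallback`'s real logic and strategy set may differ:

```python
import time

def crawl_with_fallback(url, strategies, max_retries=1, delay=0.0):
    """Try each strategy in order; retry each up to max_retries times before moving on."""
    errors = []
    for strategy in strategies:
        for attempt in range(1 + max_retries):
            try:
                return strategy(url)
            except Exception as exc:
                errors.append(f"{strategy.__name__} attempt {attempt + 1}: {exc}")
                time.sleep(delay)  # back off before the next attempt
    raise RuntimeError("all strategies failed: " + "; ".join(errors))

def fast_strategy(url):
    raise TimeoutError("page did not load")  # simulate a flaky first strategy

def patient_strategy(url):
    return f"content of {url}"

print(crawl_with_fallback("https://example.com", [fast_strategy, patient_strategy]))
```

Collecting the per-attempt errors and surfacing them in the final exception makes it much easier to diagnose which strategy failed and why.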
Common Workflows
Research & Analysis
1. Competitive Analysis: search_and_crawl → intelligent_extract
2. Site Auditing: crawl_url → extract_entities
3. Content Research: search_google → batch_crawl
4. Deep Analysis: deep_crawl_site → structured extraction

Typical Success Patterns
E-commerce Sites: use `simulate_user: true`
News Sites: enable `wait_for_js` for dynamic content
Documentation: use `deep_crawl_site` with URL patterns
Social Media: extract entities for contact information
Performance Features
Intelligent Caching: 15-minute self-cleaning cache with multiple modes
Async Architecture: Built on asyncio for high performance
Memory Management: Adaptive concurrency based on system resources
Rate Limiting: Configurable delays and request throttling
Parallel Processing: Concurrent crawling of multiple URLs
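A 15-minute self-cleaning cache can be sketched as a TTL map that evicts stale entries on access. This is illustrative; the server's cache implementation and eviction policy may differ:

```python
import time

class TTLCache:
    """Minimal self-cleaning cache: entries expire after ttl seconds."""

    def __init__(self, ttl=15 * 60):  # 15 minutes by default
        self.ttl = ttl
        self._store = {}

    def set(self, key, value):
        self._store[key] = (value, time.monotonic())

    def get(self, key, default=None):
        entry = self._store.get(key)
        if entry is None:
            return default
        value, stored_at = entry
        if time.monotonic() - stored_at > self.ttl:
            del self._store[key]  # self-clean: evict the stale entry on access
            return default
        return value

cache = TTLCache(ttl=0.05)  # short TTL just for demonstration
cache.set("https://example.com", "<html>...</html>")
print(cache.get("https://example.com") is not None)  # True while fresh
time.sleep(0.06)
print(cache.get("https://example.com"))              # None after expiry
```

Evicting on access keeps the implementation simple; a background sweep would bound memory more tightly at the cost of an extra task.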
Security Features
Output Suppression provides complete isolation of crawl4ai output from MCP JSON
Authentication Support includes token-based and cookie authentication
Secure Headers offer custom header support for API access
Error Isolation includes comprehensive error handling with helpful suggestions
Dependencies
`crawl4ai>=0.3.0` - Advanced web crawling library
`fastmcp>=0.1.0` - MCP server framework
`pydantic>=2.0.0` - Data validation and serialization
`markitdown>=0.0.1a2` - File processing and conversion (Microsoft)
`googlesearch-python>=1.3.0` - Google search functionality
`aiohttp>=3.8.0` - Asynchronous HTTP client for metadata extraction
`beautifulsoup4>=4.12.0` - HTML parsing for title/snippet extraction
`youtube-transcript-api>=1.1.0` - Stable YouTube transcript extraction
`asyncio` - Asynchronous programming support
`typing-extensions` - Extended type hints
YouTube Features Status:
The following status information applies to YouTube transcript extraction:
YouTube transcript extraction is stable and reliable with v1.1.0+
No authentication or API keys required
Works out of the box after installation
License
MIT License
Contributing
This project implements the Model Context Protocol specification. It is compatible with any MCP-compliant client and built with the FastMCP framework for easy extension and modification.
DXT Package Available
One-click installation for Claude Desktop users
This MCP server is available as a DXT (Desktop Extensions) package for easy installation. The following resources are available:
The DXT package can be found at `dxt-packages/crawl4ai-dxt-correct/`
The installation guide is available at dxt-packages/README_DXT_PACKAGES.md
The creation guide is documented at dxt-packages/DXT_CREATION_GUIDE.md
Troubleshooting information is at dxt-packages/DXT_TROUBLESHOOTING_GUIDE.md
Simply drag and drop the .dxt file into Claude Desktop for instant setup.
Additional Documentation
Infrastructure & Deployment
Docker Guide - Complete Docker containerization guide with multiple variants
Architecture - Technical architecture, design decisions, and container infrastructure
Build & Deployment - Build processes, CI/CD pipeline, and deployment strategies
Configuration - Environment variables, Docker settings, and performance tuning
Deployment Playbook - Production deployment procedures and troubleshooting
Development & Contributing
Contributing Guide - Docker development workflow and contribution guidelines
RunPod Deployment - Serverless cloud deployment on RunPod
GitHub Actions - Automated CI/CD pipeline documentation
API & Integration
Pure StreamableHTTP - Recommended HTTP transport protocol
HTTP Server Usage - HTTP API server configuration
Legacy HTTP API - Detailed HTTP API documentation
Localization & Support
Japanese Documentation - Complete feature documentation in Japanese
Japanese Troubleshooting - Troubleshooting guide in Japanese