Crawl-MCP is a comprehensive web crawling and content extraction MCP server that enables AI assistants to extract and analyze content from websites, documents, YouTube videos, and search results.
Web Crawling & Content Extraction

- Crawl single URLs with JavaScript support, CSS selectors, and screenshots (`crawl_url`)
- Use fallback strategies to bypass anti-bot detection (`crawl_url_with_fallback`)
- Deep-crawl entire sites with configurable depth and BFS/DFS strategies (`deep_crawl_site`)
- Batch crawl up to 3–5 URLs simultaneously (`batch_crawl`, `multi_url_crawl`)
- Extract structured data via CSS selectors or LLM (`extract_structured_data`, `intelligent_extract`)
- Extract entities like emails and phone numbers (`extract_entities`)
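The BFS/DFS distinction in deep crawling determines visit order: BFS covers every page at one depth before going deeper, while DFS follows a branch to its maximum depth first. A toy sketch over a hypothetical link graph (the URLs and graph are illustrative, not Crawl-MCP internals):

```python
from collections import deque

# Toy link graph standing in for a site's internal links (hypothetical URLs).
LINKS = {
    "/": ["/docs", "/blog"],
    "/docs": ["/docs/api", "/docs/cli"],
    "/blog": ["/blog/post1"],
    "/docs/api": [], "/docs/cli": [], "/blog/post1": [],
}

def crawl_order(start, strategy="bfs", max_depth=2):
    """Return pages in the order a BFS or DFS deep crawl would visit them."""
    seen, order = {start}, []
    frontier = deque([(start, 0)])
    while frontier:
        # BFS pops from the front of the frontier, DFS from the back
        url, depth = frontier.popleft() if strategy == "bfs" else frontier.pop()
        order.append(url)
        if depth < max_depth:
            for nxt in LINKS.get(url, []):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, depth + 1))
    return order

print(crawl_order("/", "bfs"))  # all depth-1 pages before any depth-2 page
print(crawl_order("/", "dfs"))  # one branch to the bottom before the next
```

Both strategies visit the same set of pages within the depth limit; only the order (and thus which pages you see first under a page cap) differs.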
Universal File Processing

- Convert PDFs, Word, Excel, PowerPoint, and ZIP archives to markdown (`process_file`)
- Handle large files with chunking and BM25 filtering (`enhanced_process_large_content`)
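BM25 filtering ranks chunks of a large document against a query so that only the relevant ones need to be kept. A minimal, self-contained sketch of that kind of relevance scoring (simplified: whitespace tokenization, no stemming; this is the general technique, not `enhanced_process_large_content`'s exact implementation):

```python
import math

def bm25_scores(query, chunks, k1=1.5, b=0.75):
    """Score text chunks against a query with the BM25 formula."""
    docs = [c.lower().split() for c in chunks]
    q_terms = query.lower().split()
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    scores = []
    for d in docs:
        score = 0.0
        for t in q_terms:
            df = sum(1 for doc in docs if t in doc)  # document frequency
            if df == 0:
                continue
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
            tf = d.count(t)
            # term-frequency saturation plus length normalization
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores

chunks = [
    "playwright installs headless browser binaries",
    "the quarterly report covers revenue and costs",
    "install chromium with playwright install chromium",
]
scores = bm25_scores("playwright install", chunks)
best = max(range(len(chunks)), key=scores.__getitem__)
print(best)  # index of the most relevant chunk
```

Chunks scoring zero (no query terms at all) can be dropped outright, which is how chunk-plus-filter keeps token usage down on large files.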
YouTube Integration

- Extract transcripts with timestamps (`extract_youtube_transcript`); batch process up to 3 videos (`batch_extract_youtube_transcripts`)
- Retrieve video metadata and comment threads, no API key required (`get_youtube_video_info`, `extract_youtube_comments`)
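Timestamped transcripts pair each text segment with its start offset in seconds. A small sketch of rendering such segments as `[MM:SS]` lines; the segment shape (`{"start": seconds, "text": ...}`) is an assumption for illustration, not the exact structure `extract_youtube_transcript` returns:

```python
def format_transcript(segments):
    """Render transcript segments as '[MM:SS] text' lines."""
    lines = []
    for seg in segments:
        minutes, seconds = divmod(int(seg["start"]), 60)
        lines.append(f"[{minutes:02d}:{seconds:02d}] {seg['text']}")
    return "\n".join(lines)

demo = [
    {"start": 0.0, "text": "Welcome to the talk."},
    {"start": 75.4, "text": "Let's look at the architecture."},
]
print(format_transcript(demo))
```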
Google Search Integration

- Search across 7 optimized genres (academic, news, technical, commercial, social, etc.) (`search_google`)
- Run batch searches or automatically crawl top results for full content (`batch_search_google`, `search_and_crawl`)
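Genre filtering works by composing official Google search operators (such as `site:`) onto the user's terms. The mapping below is purely hypothetical, the actual operator sets `search_google` uses per genre may differ, but it sketches the idea:

```python
# Hypothetical genre-to-operator mapping (illustrative, not Crawl-MCP's).
GENRE_OPERATORS = {
    "academic": "site:arxiv.org OR site:scholar.google.com",
    "news": "site:reuters.com OR site:apnews.com",
    "technical": "site:stackoverflow.com OR site:github.com",
}

def build_query(terms, genre=None):
    """Append a genre's operator clause to the user's search terms."""
    ops = GENRE_OPERATORS.get(genre)
    return f"{terms} ({ops})" if ops else terms

print(build_query("transformer inference", "academic"))
print(build_query("transformer inference"))  # no genre: terms pass through
```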
AI-Powered Summarization
Automatically summarize large content, reducing token usage by up to 88.5% while preserving key information, with configurable length (short/medium/long)
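For a sense of what an 88.5% reduction means in practice (the token counts below are assumed for illustration; only the 88.5% headline figure comes from this project's docs):

```python
def reduction_pct(tokens_before, tokens_after):
    """Percent of tokens saved by summarization."""
    return 100 * (1 - tokens_after / tokens_before)

# e.g. a 40,000-token page summarized down to 4,600 tokens
print(round(reduction_pct(40_000, 4_600), 1))  # 88.5
```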
Additional Features
Multi-browser support (Chromium, Firefox, WebKit, Chrome) and Docker support
STDIO and HTTP transport modes
Multilingual support (English and Japanese configurable via environment variables)
1. Click "Install Server".
2. Wait a few minutes for the server to deploy. Once ready, it will show a "Started" state.
3. In the chat, type `@` followed by the MCP server name and your instructions, e.g., "@Crawl-MCP summarize this article about AI advancements from https://example.com/ai-news".
4. That's it! The server will respond to your query, and you can continue using it as needed.
Here is a step-by-step guide with screenshots.
Crawl-MCP: Unofficial MCP Server for crawl4ai
⚠️ Important: This is an unofficial MCP server implementation for the excellent crawl4ai library.
Not affiliated with the original crawl4ai project.
A comprehensive Model Context Protocol (MCP) server that wraps the powerful crawl4ai library with advanced AI capabilities. Extract and analyze content from any source: web pages, PDFs, Office documents, YouTube videos, and more. Features intelligent summarization to dramatically reduce token usage while preserving key information.
🌟 Key Features
🔍 Google Search Integration - 7 optimized search genres with Google official operators
🔍 Advanced Web Crawling: JavaScript support, deep site mapping, entity extraction
🌐 Universal Content Extraction: Web pages, PDFs, Word docs, Excel, PowerPoint, ZIP archives
🤖 AI-Powered Summarization: Smart token reduction (up to 88.5%) while preserving essential information
🎬 YouTube Integration: Extract video transcripts and summaries without API keys
⚡ Production Ready: 19 specialized tools with comprehensive error handling
🚀 Quick Start
Prerequisites (Required First)
Python 3.11 or later (FastMCP requires Python 3.11+)
Install system dependencies for Playwright:
Ubuntu 24.04 LTS (manual setup required):

```bash
# Manual setup required due to the t64 library transition
sudo apt update && sudo apt install -y \
  libnss3 libatk-bridge2.0-0 libxss1 libasound2t64 \
  libgbm1 libgtk-3-0t64 libxshmfence-dev libxrandr2 \
  libxcomposite1 libxcursor1 libxdamage1 libxi6 \
  fonts-noto-color-emoji fonts-unifont python3-venv python3-pip
python3 -m venv venv && source venv/bin/activate
pip install playwright==1.55.0 && playwright install chromium
sudo playwright install-deps
```

Other Linux/macOS:

```bash
sudo bash scripts/prepare_for_uvx_playwright.sh
```

Windows (as Administrator):

```powershell
scripts/prepare_for_uvx_playwright.ps1
```

Installation
UVX (Recommended - Easiest):

```bash
# After system preparation above - that's it!
uvx --from git+https://github.com/walksoda/crawl-mcp crawl-mcp
```

Docker (Production-Ready):

```bash
# Clone the repository
git clone https://github.com/walksoda/crawl-mcp
cd crawl-mcp

# Build and run with Docker Compose (STDIO mode)
docker-compose up --build

# Or build and run HTTP mode on port 8000
docker-compose --profile http up --build crawl4ai-mcp-http

# Or build manually
docker build -t crawl4ai-mcp .
docker run -it crawl4ai-mcp
```

Docker Features:
🔧 Multi-Browser Support: Chromium, Firefox, WebKit headless browsers
🐧 Google Chrome: Additional Chrome Stable for compatibility
⚡ Optimized Performance: Pre-configured browser flags for Docker
🔒 Security: Non-root user execution
📦 Complete Dependencies: All required libraries included
Claude Desktop Setup
UVX Installation:
Add to your claude_desktop_config.json:

```json
{
  "mcpServers": {
    "crawl-mcp": {
      "transport": "stdio",
      "command": "uvx",
      "args": [
        "--from",
        "git+https://github.com/walksoda/crawl-mcp",
        "crawl-mcp"
      ],
      "env": {
        "CRAWL4AI_LANG": "en"
      }
    }
  }
}
```

Docker HTTP Mode:

```json
{
  "mcpServers": {
    "crawl-mcp": {
      "transport": "http",
      "baseUrl": "http://localhost:8000"
    }
  }
}
```

For the Japanese interface, set:

```json
"env": {
  "CRAWL4AI_LANG": "ja"
}
```

📖 Documentation
- Complete installation instructions for all platforms
- Full tool documentation and usage examples
- Platform-specific setup configurations
- HTTP API access and integration methods
- Power user techniques and workflows
- Contributing and development setup
Language-Specific Documentation
🛠️ Tool Overview
Web Crawling (3)
- `crawl_url` - Extract web page content with JavaScript support
- `deep_crawl_site` - Crawl multiple pages from a site with configurable depth
- `crawl_url_with_fallback` - Crawl with fallback strategies for anti-bot sites
Data Extraction (3)
- `intelligent_extract` - Extract specific data from web pages using LLM
- `extract_entities` - Extract entities (emails, phones, etc.) from web pages
- `extract_structured_data` - Extract structured data using CSS selectors or LLM
YouTube (4)
- `extract_youtube_transcript` - Extract YouTube transcripts with timestamps
- `batch_extract_youtube_transcripts` - Extract transcripts from multiple YouTube videos (max 3)
- `get_youtube_video_info` - Get YouTube video metadata and transcript availability
- `extract_youtube_comments` - Extract YouTube video comments with pagination
Search (4)
- `search_google` - Search Google with genre filtering
- `batch_search_google` - Perform multiple Google searches (max 3)
- `search_and_crawl` - Search Google and crawl top results
- `get_search_genres` - Get available search genres
File Processing (3)
- `process_file` - Convert PDF, Word, Excel, PowerPoint, ZIP to markdown
- `get_supported_file_formats` - Get supported file formats and capabilities
- `enhanced_process_large_content` - Process large content with chunking and BM25 filtering
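A `process_file`-style tool typically routes each supported format to its own markdown converter. A hypothetical sketch of that dispatch (the converter names and mapping are illustrative, not Crawl-MCP internals):

```python
from pathlib import Path

# Hypothetical extension-to-converter routing (illustrative names only).
CONVERTERS = {
    ".pdf": "pdf-to-markdown",
    ".docx": "word-to-markdown",
    ".xlsx": "excel-to-markdown",
    ".pptx": "powerpoint-to-markdown",
    ".zip": "unzip-then-recurse",
}

def pick_converter(filename):
    """Choose a converter by file extension, case-insensitively."""
    ext = Path(filename).suffix.lower()
    if ext not in CONVERTERS:
        raise ValueError(f"unsupported format: {ext}")
    return CONVERTERS[ext]

print(pick_converter("report.PDF"))  # pdf-to-markdown
```

ZIP archives get a recursive entry: unpack, then run each member back through the same dispatch.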
Batch Operations (2)
- `batch_crawl` - Crawl multiple URLs with fallback (max 3 URLs)
- `multi_url_crawl` - Multi-URL crawl with pattern-based config (max 5 URL patterns)
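Since `batch_crawl` caps each call at 3 URLs, a longer URL list has to be split into compliant batches before calling the tool repeatedly. A minimal sketch:

```python
def make_batches(urls, max_batch=3):
    """Split a URL list into batches no larger than batch_crawl's 3-URL limit."""
    return [urls[i:i + max_batch] for i in range(0, len(urls), max_batch)]

urls = [f"https://example.com/page{i}" for i in range(7)]
for group in make_batches(urls):
    print(group)  # each group is a valid argument set for one batch_crawl call
```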
🎯 Common Use Cases
Content Research:
search_and_crawl → extract_structured_data → analysisDocumentation Mining:
deep_crawl_site → batch processing → extractionMedia Analysis:
extract_youtube_transcript → summarization workflowSite Mapping:
batch_crawl → multi_url_crawl → comprehensive data🚨 Quick Troubleshooting
Installation Issues:
Re-run setup scripts with proper privileges
Try development installation method
Check browser dependencies are installed
Performance Issues:
- Use `wait_for_js: true` for JavaScript-heavy sites
- Increase timeout for slow-loading pages
- Use `extract_structured_data` for targeted extraction
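One way to apply the "increase timeout for slow-loading pages" advice programmatically is to retry with progressively longer timeouts. A generic sketch (the `fetch` callable and timeout values are assumptions, not a Crawl-MCP API):

```python
def crawl_with_growing_timeout(fetch, url, timeouts=(15, 30, 60)):
    """Retry a slow page with progressively longer timeouts.

    `fetch` is any callable that raises TimeoutError when the page is
    too slow; the first successful result is returned.
    """
    last_error = None
    for t in timeouts:
        try:
            return fetch(url, timeout=t)
        except TimeoutError as exc:
            last_error = exc
    raise last_error

# Demo fetcher that only succeeds when given at least 45 seconds.
def slow_fetch(url, timeout):
    if timeout < 45:
        raise TimeoutError(f"{url} timed out after {timeout}s")
    return "<html>ok</html>"

print(crawl_with_growing_timeout(slow_fetch, "https://example.com/slow"))
```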
Configuration Issues:
- Check JSON syntax in `claude_desktop_config.json`
- Verify file paths are absolute
- Restart Claude Desktop after configuration changes
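The first two configuration checks can be automated. A hypothetical helper (not part of Crawl-MCP) that parses a `claude_desktop_config.json` candidate and flags relative command paths:

```python
import json

def check_mcp_config(text):
    """Validate JSON syntax and the 'paths must be absolute' rule.

    Returns (ok, problems). Bare commands like 'uvx' are resolved via
    PATH, so only path-like commands are required to be absolute.
    """
    try:
        cfg = json.loads(text)
    except json.JSONDecodeError as exc:
        return False, [f"JSON syntax error: {exc}"]
    problems = []
    for name, server in cfg.get("mcpServers", {}).items():
        cmd = server.get("command", "")
        if "/" in cmd and not cmd.startswith("/"):
            problems.append(f"{name}: command path is not absolute: {cmd}")
    return not problems, problems

good = '{"mcpServers": {"crawl-mcp": {"command": "uvx", "args": []}}}'
bad = '{"mcpServers": {"local": {"command": "./run.sh"}}}'
print(check_mcp_config(good))   # (True, [])
print(check_mcp_config(bad)[0])  # False
```

(The path check uses POSIX-style paths; Windows configs would need a `drive:`-letter check as well.)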
🏗️ Project Structure
Original Library: crawl4ai by unclecode
MCP Wrapper: This repository (walksoda)
Implementation: Unofficial third-party integration
📄 License
This project is an unofficial wrapper around the crawl4ai library. Please refer to the original crawl4ai license for the underlying functionality.
🤝 Contributing
See our Development Guide for contribution guidelines and development setup instructions.
🔗 Related Projects
crawl4ai - The underlying web crawling library
Model Context Protocol - The standard this server implements
Claude Desktop - Primary client for MCP servers