Crawl-MCP
Crawl-MCP is a comprehensive web crawling and content extraction MCP server that enables AI assistants to extract and analyze content from websites, documents, YouTube videos, and search results.
Web Crawling & Content Extraction
Crawl single URLs with JavaScript support, CSS selectors, and screenshots (crawl_url)
Use fallback strategies to bypass anti-bot detection (crawl_url_with_fallback)
Deep-crawl entire sites with configurable depth and BFS/DFS strategies (deep_crawl_site)
Batch crawl up to 3–5 URLs simultaneously (batch_crawl, multi_url_crawl)
Extract structured data via CSS selectors or LLM (extract_structured_data, intelligent_extract)
Extract entities like emails and phone numbers (extract_entities)
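Under the hood, an MCP client invokes tools like crawl_url through the protocol's standard tools/call request. A minimal sketch of that request envelope (the JSON-RPC shape follows the MCP specification; the argument names such as css_selector and take_screenshot are illustrative guesses, not this server's documented schema):

```python
import json

# Build the JSON-RPC 2.0 "tools/call" request an MCP client would send.
# The envelope follows the MCP spec; the argument names are assumptions.
def make_tool_call(tool_name, arguments, request_id=1):
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }

request = make_tool_call(
    "crawl_url",
    {"url": "https://example.com", "css_selector": "article", "take_screenshot": False},
)
print(json.dumps(request, indent=2))
```

In practice a client library (e.g. Claude Desktop) constructs this for you; the sketch only shows what travels over the STDIO or HTTP transport.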
Universal File Processing
Convert PDFs, Word, Excel, PowerPoint, and ZIP archives to markdown (process_file)
Handle large files with chunking and BM25 filtering (enhanced_process_large_content)
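BM25 filtering in this context means scoring document chunks against a query and keeping only the most relevant ones. A minimal sketch of that general technique (the server's actual implementation in enhanced_process_large_content may differ):

```python
import math
import re

# Split-and-filter sketch: score each chunk with Okapi BM25 against a
# query and keep the top-k chunks. Parameters k1 and b are the usual
# BM25 defaults.
def bm25_top_chunks(chunks, query, k=2, k1=1.5, b=0.75):
    docs = [re.findall(r"\w+", c.lower()) for c in chunks]
    avgdl = sum(len(d) for d in docs) / len(docs)
    n = len(docs)

    def idf(term):
        df = sum(term in d for d in docs)
        return math.log((n - df + 0.5) / (df + 0.5) + 1)

    def score(doc):
        s = 0.0
        for term in re.findall(r"\w+", query.lower()):
            tf = doc.count(term)
            s += idf(term) * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
        return s

    ranked = sorted(range(n), key=lambda i: score(docs[i]), reverse=True)
    return [chunks[i] for i in ranked[:k]]

chunks = [
    "Playwright launches a headless Chromium browser.",
    "Invoices are due within thirty days of receipt.",
    "Browser automation with Playwright supports Firefox too.",
]
print(bm25_top_chunks(chunks, "playwright browser", k=2))
```

Only the chunks mentioning the query terms survive, which is how a multi-megabyte document can be reduced before it ever reaches the model.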
YouTube Integration
Extract transcripts with timestamps (extract_youtube_transcript); batch process up to 3 videos (batch_extract_youtube_transcripts)
Retrieve video metadata and comment threads, no API key required (get_youtube_video_info, extract_youtube_comments)
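Timestamped transcripts typically arrive as a list of segments with a start time in seconds plus the spoken text. A small sketch of turning that into readable output (the field names "start" and "text" are assumptions about the shape, not this server's documented schema):

```python
# Format hypothetical transcript segments as "[mm:ss] text" lines.
def format_transcript(segments):
    lines = []
    for seg in segments:
        minutes, seconds = divmod(int(seg["start"]), 60)
        lines.append(f"[{minutes:02d}:{seconds:02d}] {seg['text']}")
    return "\n".join(lines)

segments = [
    {"start": 0.0, "text": "Welcome to the channel."},
    {"start": 75.4, "text": "Today we cover MCP servers."},
]
print(format_transcript(segments))
# [00:00] Welcome to the channel.
# [01:15] Today we cover MCP servers.
```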
Google Search Integration
Search across 7 optimized genres (academic, news, technical, commercial, social, etc.) (search_google)
Run batch searches or automatically crawl top results for full content (batch_search_google, search_and_crawl)
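Genre filtering of this kind generally works by appending standard Google search operators (site:, filetype:, intitle:) to the user's query. The mapping below is purely illustrative, a guess at how a genre might bias results, not the server's actual genre configuration:

```python
# Hypothetical genre-to-operator mapping; real operators, invented mapping.
GENRE_OPERATORS = {
    "academic": "site:edu OR filetype:pdf",
    "news": "intitle:news",
    "technical": "site:stackoverflow.com OR site:github.com",
}

def build_query(query, genre=None):
    suffix = GENRE_OPERATORS.get(genre)
    return f"{query} {suffix}" if suffix else query

print(build_query("retrieval augmented generation", genre="academic"))
# retrieval augmented generation site:edu OR filetype:pdf
```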
AI-Powered Summarization
Automatically summarize large content, reducing token usage by up to 88.5% while preserving key information, with configurable length (short/medium/long)
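A figure like "88.5% reduction" is simply one minus the summary-to-original token ratio. A back-of-the-envelope check using a naive whitespace tokenizer (real tokenizers such as tiktoken count differently):

```python
# Compute percentage token reduction between an original and its summary.
def reduction_percent(original, summary):
    orig_tokens = len(original.split())
    summary_tokens = len(summary.split())
    return round(100 * (1 - summary_tokens / orig_tokens), 1)

original = "word " * 1000   # a 1000-token document
summary = "word " * 115     # a 115-token summary
print(reduction_percent(original, summary))  # 88.5
```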
Additional Features
Multi-browser support (Chromium, Firefox, WebKit, Chrome) and Docker support
STDIO and HTTP transport modes
Multilingual support (English and Japanese configurable via environment variables)
Click on "Install Server".
Wait a few minutes for the server to deploy. Once ready, it will show a "Started" state.
In the chat, type @ followed by the MCP server name and your instructions, e.g., "@Crawl-MCP summarize this article about AI advancements from https://example.com/ai-news"
That's it! The server will respond to your query, and you can continue using it as needed.
Crawl-MCP: Unofficial MCP Server for crawl4ai
⚠️ Important: This is an unofficial MCP server implementation for the excellent crawl4ai library.
Not affiliated with the original crawl4ai project.
A comprehensive Model Context Protocol (MCP) server that wraps the powerful crawl4ai library with advanced AI capabilities. Extract and analyze content from any source: web pages, PDFs, Office documents, YouTube videos, and more. Features intelligent summarization to dramatically reduce token usage while preserving key information.
🌟 Key Features
🔍 Google Search Integration - 7 optimized search genres with Google official operators
🔍 Advanced Web Crawling: JavaScript support, deep site mapping, entity extraction
🌐 Universal Content Extraction: Web pages, PDFs, Word docs, Excel, PowerPoint, ZIP archives
🤖 AI-Powered Summarization: Smart token reduction (up to 88.5%) while preserving essential information
🎬 YouTube Integration: Extract video transcripts and summaries without API keys
⚡ Production Ready: 19 specialized tools with comprehensive error handling
🚀 Quick Start
Prerequisites (Required First)
Python 3.11 or later (FastMCP requires Python 3.11+)
Install system dependencies for Playwright:
Ubuntu 24.04 LTS (Manual Required):
# Manual setup required due to t64 library transition
sudo apt update && sudo apt install -y \
libnss3 libatk-bridge2.0-0 libxss1 libasound2t64 \
libgbm1 libgtk-3-0t64 libxshmfence-dev libxrandr2 \
libxcomposite1 libxcursor1 libxdamage1 libxi6 \
fonts-noto-color-emoji fonts-unifont python3-venv python3-pip
python3 -m venv venv && source venv/bin/activate
pip install playwright==1.55.0 && playwright install chromium
sudo playwright install-deps

Other Linux/macOS:
sudo bash scripts/prepare_for_uvx_playwright.sh

Windows (as Administrator):
scripts/prepare_for_uvx_playwright.ps1

Installation
UVX (Recommended - Easiest):
# After system preparation above - that's it!
uvx --from git+https://github.com/walksoda/crawl-mcp crawl-mcp

Docker (Production-Ready):
# Clone the repository
git clone https://github.com/walksoda/crawl-mcp
cd crawl-mcp
# Build and run with Docker Compose (STDIO mode)
docker-compose up --build
# Or build and run HTTP mode on port 8000
docker-compose --profile http up --build crawl4ai-mcp-http
# Or build manually
docker build -t crawl4ai-mcp .
docker run -it crawl4ai-mcp

Docker Features:
🔧 Multi-Browser Support: Chromium, Firefox, WebKit headless browsers
🐧 Google Chrome: Additional Chrome Stable for compatibility
⚡ Optimized Performance: Pre-configured browser flags for Docker
🔒 Security: Non-root user execution
📦 Complete Dependencies: All required libraries included
Claude Desktop Setup
UVX Installation:
Add to your claude_desktop_config.json:
{
"mcpServers": {
"crawl-mcp": {
"transport": "stdio",
"command": "uvx",
"args": [
"--from",
"git+https://github.com/walksoda/crawl-mcp",
"crawl-mcp"
],
"env": {
"CRAWL4AI_LANG": "en"
}
}
}
}

Docker HTTP Mode:
{
"mcpServers": {
"crawl-mcp": {
"transport": "http",
"baseUrl": "http://localhost:8000"
}
}
}

For Japanese interface:
"env": {
"CRAWL4AI_LANG": "ja"
}

📖 Documentation
Complete installation instructions for all platforms
Full tool documentation and usage examples
Platform-specific setup configurations
HTTP API access and integration methods
Power user techniques and workflows
Contributing and development setup
Language-Specific Documentation
🛠️ Tool Overview
Web Crawling (3)
crawl_url - Extract web page content with JavaScript support
deep_crawl_site - Crawl multiple pages from a site with configurable depth
crawl_url_with_fallback - Crawl with fallback strategies for anti-bot sites
Data Extraction (3)
intelligent_extract - Extract specific data from web pages using LLM
extract_entities - Extract entities (emails, phones, etc.) from web pages
extract_structured_data - Extract structured data using CSS selectors or LLM
YouTube (4)
extract_youtube_transcript - Extract YouTube transcripts with timestamps
batch_extract_youtube_transcripts - Extract transcripts from multiple YouTube videos (max 3)
get_youtube_video_info - Get YouTube video metadata and transcript availability
extract_youtube_comments - Extract YouTube video comments with pagination
Search (4)
search_google - Search Google with genre filtering
batch_search_google - Perform multiple Google searches (max 3)
search_and_crawl - Search Google and crawl top results
get_search_genres - Get available search genres
File Processing (3)
process_file - Convert PDF, Word, Excel, PowerPoint, ZIP to markdown
get_supported_file_formats - Get supported file formats and capabilities
enhanced_process_large_content - Process large content with chunking and BM25 filtering
Batch Operations (2)
batch_crawl - Crawl multiple URLs with fallback (max 3 URLs)
multi_url_crawl - Multi-URL crawl with pattern-based config (max 5 URL patterns)
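Typical workflows chain several of these tools together. A stubbed sketch of a search → extract pipeline (each function stands in for an MCP tool call; real usage goes through an MCP client session, and the payload shapes here are assumptions):

```python
# Placeholder for the search_and_crawl tool: search, then crawl top results.
def search_and_crawl(query):
    return [{"url": "https://example.com/a", "content": "Alpha article on MCP."}]

# Placeholder for the extract_structured_data tool.
def extract_structured_data(page):
    return {"url": page["url"], "summary": page["content"][:20]}

# Chain the stubs into a simple research pipeline.
def research(query):
    pages = search_and_crawl(query)
    return [extract_structured_data(p) for p in pages]

print(research("model context protocol"))
```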
🎯 Common Use Cases
Content Research: search_and_crawl → extract_structured_data → analysis
Documentation Mining: deep_crawl_site → batch processing → extraction
Media Analysis: extract_youtube_transcript → summarization workflow
Site Mapping: batch_crawl → multi_url_crawl → comprehensive data

🚨 Quick Troubleshooting
Installation Issues:
Re-run setup scripts with proper privileges
Try development installation method
Check browser dependencies are installed
Performance Issues:
Use wait_for_js: true for JavaScript-heavy sites
Increase timeout for slow-loading pages
Use extract_structured_data for targeted extraction
Configuration Issues:
Check JSON syntax in claude_desktop_config.json
Verify file paths are absolute
Restart Claude Desktop after configuration changes
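A quick way to spot a JSON syntax error before restarting Claude Desktop is to run the config text through Python's standard json parser, which reports the exact line and column of the first problem:

```python
import json

# Validate config text; return "OK" or the location of the first syntax error.
def check_config(text):
    try:
        json.loads(text)
        return "OK"
    except json.JSONDecodeError as e:
        return f"error at line {e.lineno}, column {e.colno}: {e.msg}"

good = '{"mcpServers": {}}'
bad = '{"mcpServers": {,}}'   # stray comma
print(check_config(good))  # OK
print(check_config(bad))
```

The same check works from a shell with `python3 -m json.tool claude_desktop_config.json`.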
🏗️ Project Structure
Original Library: crawl4ai by unclecode
MCP Wrapper: This repository (walksoda)
Implementation: Unofficial third-party integration
📄 License
This project is an unofficial wrapper around the crawl4ai library. Please refer to the original crawl4ai license for the underlying functionality.
🤝 Contributing
See our Development Guide for contribution guidelines and development setup instructions.
🔗 Related Projects
crawl4ai - The underlying web crawling library
Model Context Protocol - The standard this server implements
Claude Desktop - Primary client for MCP servers