Crawl-MCP

README.md•6.85 KiB

# Crawl-MCP: Unofficial MCP Server for crawl4ai > **⚠️ Important**: This is an **unofficial** MCP server implementation for the excellent [crawl4ai](https://github.com/unclecode/crawl4ai) library. > **Not affiliated** with the original crawl4ai project. A comprehensive Model Context Protocol (MCP) server that wraps the powerful crawl4ai library with advanced AI capabilities. Extract and analyze content from **any source**: web pages, PDFs, Office documents, YouTube videos, and more. Features intelligent summarization to dramatically reduce token usage while preserving key information. ## 🌟 Key Features - **🔍 Google Search Integration** - 7 optimized search genres with Google official operators - **🔍 Advanced Web Crawling**: JavaScript support, deep site mapping, entity extraction - **🌐 Universal Content Extraction**: Web pages, PDFs, Word docs, Excel, PowerPoint, ZIP archives - **🤖 AI-Powered Summarization**: Smart token reduction (up to 88.5%) while preserving essential information - **🎬 YouTube Integration**: Extract video transcripts and summaries without API keys - **⚡ Production Ready**: 13 specialized tools with comprehensive error handling ## 🚀 Quick Start ### Prerequisites (Required First) - Python 3.11 以上（FastMCP が Python 3.11+ を要求） **Install system dependencies for Playwright:** **Ubuntu 24.04 LTS (Manual Required):** ```bash # Manual setup required due to t64 library transition sudo apt update && sudo apt install -y \ libnss3 libatk-bridge2.0-0 libxss1 libasound2t64 \ libgbm1 libgtk-3-0t64 libxshmfence-dev libxrandr2 \ libxcomposite1 libxcursor1 libxdamage1 libxi6 \ fonts-noto-color-emoji fonts-unifont python3-venv python3-pip python3 -m venv venv && source venv/bin/activate pip install playwright==1.55.0 && playwright install chromium sudo playwright install-deps ``` **Other Linux/macOS:** ```bash sudo bash scripts/prepare_for_uvx_playwright.sh ``` **Windows (as Administrator):** ```powershell scripts/prepare_for_uvx_playwright.ps1 ``` ### Installation **UVX (Recommended - Easiest):** ```bash # After system preparation above - that's it! uvx --from git+https://github.com/walksoda/crawl-mcp crawl-mcp ``` **Docker (Production-Ready):** ```bash # Clone the repository git clone https://github.com/walksoda/crawl-mcp cd crawl-mcp # Build and run with Docker Compose (STDIO mode) docker-compose up --build # Or build and run HTTP mode on port 8000 docker-compose --profile http up --build crawl4ai-mcp-http # Or build manually docker build -t crawl4ai-mcp . docker run -it crawl4ai-mcp ``` **Docker Features:** - 🔧 **Multi-Browser Support**: Chromium, Firefox, Webkit headless browsers - 🐧 **Google Chrome**: Additional Chrome Stable for compatibility - ⚡ **Optimized Performance**: Pre-configured browser flags for Docker - 🔒 **Security**: Non-root user execution - 📦 **Complete Dependencies**: All required libraries included ### Claude Desktop Setup **UVX Installation:** Add to your `claude_desktop_config.json`: ```json { "mcpServers": { "crawl-mcp": { "transport": "stdio", "command": "uvx", "args": [ "--from", "git+https://github.com/walksoda/crawl-mcp", "crawl-mcp" ], "env": { "CRAWL4AI_LANG": "en" } } } } ``` **Docker HTTP Mode:** ```json { "mcpServers": { "crawl-mcp": { "transport": "http", "baseUrl": "http://localhost:8000" } } } ``` **For Japanese interface:** ```json "env": { "CRAWL4AI_LANG": "ja" } ``` ## 📖 Documentation | Topic | Description | |-------|-------------| | **[Installation Guide](docs/INSTALLATION.md)** | Complete installation instructions for all platforms | | **[API Reference](docs/API_REFERENCE.md)** | Full tool documentation and usage examples | | **[Configuration Examples](docs/CONFIGURATION_EXAMPLES.md)** | Platform-specific setup configurations | | **[HTTP Integration](docs/HTTP_INTEGRATION.md)** | HTTP API access and integration methods | | **[Advanced Usage](docs/ADVANCED_USAGE.md)** | Power user techniques and workflows | | **[Development Guide](docs/DEVELOPMENT.md)** | Contributing and development setup | ### Language-Specific Documentation - **English**: [docs/](docs/) directory - **日本語**: [docs/ja/](docs/ja/) directory ## 🛠️ Tool Overview ### Web Crawling - `crawl_url` - Single page crawling with JavaScript support - `deep_crawl_site` - Multi-page site mapping and exploration - `crawl_url_with_fallback` - Robust crawling with retry strategies - `batch_crawl` - Process multiple URLs (max 5) - `multi_url_crawl` - Advanced multi-URL configuration ### Search Integration - `search_google` - Genre-filtered Google search - `search_and_crawl` - Combined search and content extraction - `batch_search_google` - Multiple search queries (max 3) ### Data Extraction - `extract_structured_data` - CSS/XPath/LLM-based structured extraction ### Media Processing - `process_file` - PDF, Office, ZIP to markdown conversion - `extract_youtube_transcript` - Video transcript extraction - `batch_extract_youtube_transcripts` - Multiple videos (max 3) - `get_youtube_video_info` - Video metadata retrieval ## 🎯 Common Use Cases **Content Research:** ```bash search_and_crawl → extract_structured_data → analysis ``` **Documentation Mining:** ```bash deep_crawl_site → batch processing → extraction ``` **Media Analysis:** ```bash extract_youtube_transcript → summarization workflow ``` **Site Mapping:** ```bash batch_crawl → multi_url_crawl → comprehensive data ``` ## 🚨 Quick Troubleshooting **Installation Issues:** 1. Re-run setup scripts with proper privileges 2. Try development installation method 3. Check browser dependencies are installed **Performance Issues:** - Use `wait_for_js: true` for JavaScript-heavy sites - Increase timeout for slow-loading pages - Use `extract_structured_data` for targeted extraction **Configuration Issues:** - Check JSON syntax in `claude_desktop_config.json` - Verify file paths are absolute - Restart Claude Desktop after configuration changes ## 🏗️ Project Structure - **Original Library**: [crawl4ai](https://github.com/unclecode/crawl4ai) by unclecode - **MCP Wrapper**: This repository (walksoda) - **Implementation**: Unofficial third-party integration ## 📄 License This project is an unofficial wrapper around the crawl4ai library. Please refer to the original [crawl4ai license](https://github.com/unclecode/crawl4ai) for the underlying functionality. ## 🤝 Contributing See our [Development Guide](docs/DEVELOPMENT.md) for contribution guidelines and development setup instructions. ## 🔗 Related Projects - [crawl4ai](https://github.com/unclecode/crawl4ai) - The underlying web crawling library - [Model Context Protocol](https://modelcontextprotocol.io/) - The standard this server implements - [Claude Desktop](https://docs.anthropic.com/claude/docs/claude-desktop) - Primary client for MCP servers

Loading blob content...

Latest Blog Posts

Redis vs ioredis vs valkey-glide
By punkpeye on January 26, 2026.
benchmark
Redis
valkey
Quickstart: Publish an MCP Server to the MCP Registry
By punkpeye on January 24, 2026.
mcp
official reference mirror
Official MCP Registry Server.json Requirements
By punkpeye on January 24, 2026.
mcp
official reference mirror

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/walksoda/crawl-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server

README.md•6.85 KiB