Crawl-MCP: Unofficial MCP Server for crawl4ai

⚠️ Important: This is an unofficial MCP server implementation for the excellent crawl4ai library.
Not affiliated with the original crawl4ai project.

A comprehensive Model Context Protocol (MCP) server that wraps the powerful crawl4ai library with advanced AI capabilities. Extract and analyze content from any source: web pages, PDFs, Office documents, YouTube videos, and more. Features intelligent summarization to dramatically reduce token usage while preserving key information.

🌟 Key Features

  • πŸ” Google Search Integration - 7 optimized search genres with Google official operators

  • πŸ” Advanced Web Crawling: JavaScript support, deep site mapping, entity extraction

  • 🌐 Universal Content Extraction: Web pages, PDFs, Word docs, Excel, PowerPoint, ZIP archives

  • πŸ€– AI-Powered Summarization: Smart token reduction (up to 88.5%) while preserving essential information

  • 🎬 YouTube Integration: Extract video transcripts and summaries without API keys

  • ⚑ Production Ready: 17 specialized tools with comprehensive error handling

🚀 Quick Start

Prerequisites (Required First)

  • Python 3.11 or later (FastMCP requires Python 3.11+)

Install system dependencies for Playwright:

Ubuntu 24.04 LTS (Manual Required):

# Manual setup required due to the t64 library transition
sudo apt update && sudo apt install -y \
  libnss3 libatk-bridge2.0-0 libxss1 libasound2t64 \
  libgbm1 libgtk-3-0t64 libxshmfence-dev libxrandr2 \
  libxcomposite1 libxcursor1 libxdamage1 libxi6 \
  fonts-noto-color-emoji fonts-unifont python3-venv python3-pip

python3 -m venv venv && source venv/bin/activate
pip install playwright==1.55.0 && playwright install chromium
sudo playwright install-deps

Other Linux/macOS:

sudo bash scripts/prepare_for_uvx_playwright.sh

Windows (as Administrator):

scripts/prepare_for_uvx_playwright.ps1

Installation

UVX (Recommended - Easiest):

# After the system preparation above - that's it!
uvx --from git+https://github.com/walksoda/crawl-mcp crawl-mcp

Docker (Production-Ready):

# Clone the repository
git clone https://github.com/walksoda/crawl-mcp
cd crawl-mcp

# Build and run with Docker Compose (STDIO mode)
docker-compose up --build

# Or build and run HTTP mode on port 8000
docker-compose --profile http up --build crawl4ai-mcp-http

# Or build manually
docker build -t crawl4ai-mcp .
docker run -it crawl4ai-mcp

Docker Features:

  • 🔧 Multi-Browser Support: Chromium, Firefox, WebKit headless browsers

  • 🐧 Google Chrome: Additional Chrome Stable for compatibility

  • ⚡ Optimized Performance: Pre-configured browser flags for Docker

  • 🔒 Security: Non-root user execution

  • 📦 Complete Dependencies: All required libraries included

Claude Desktop Setup

UVX Installation: Add to your claude_desktop_config.json:

{
  "mcpServers": {
    "crawl-mcp": {
      "transport": "stdio",
      "command": "uvx",
      "args": [
        "--from",
        "git+https://github.com/walksoda/crawl-mcp",
        "crawl-mcp"
      ],
      "env": {
        "CRAWL4AI_LANG": "en"
      }
    }
  }
}

Docker HTTP Mode:

{
  "mcpServers": {
    "crawl-mcp": {
      "transport": "http",
      "baseUrl": "http://localhost:8000"
    }
  }
}

For Japanese interface:

"env": { "CRAWL4AI_LANG": "ja" }

📖 Documentation

  • Installation Guide: Complete installation instructions for all platforms

  • API Reference: Full tool documentation and usage examples

  • Configuration Examples: Platform-specific setup configurations

  • HTTP Integration: HTTP API access and integration methods

  • Advanced Usage: Power user techniques and workflows

  • Development Guide: Contributing and development setup

  • Language-Specific Documentation

πŸ› οΈ Tool Overview

Web Crawling

  • crawl_url - Single page crawling with JavaScript support

  • deep_crawl_site - Multi-page site mapping and exploration

  • crawl_url_with_fallback - Robust crawling with retry strategies

  • batch_crawl - Process multiple URLs simultaneously

AI-Powered Analysis

  • intelligent_extract - Semantic content extraction with custom instructions

  • auto_summarize - LLM-based summarization for large content

  • extract_entities - Pattern-based entity extraction (emails, phones, URLs, etc.)

Media Processing

  • process_file - Convert PDFs, Office docs, ZIP archives to markdown

  • extract_youtube_transcript - Multi-language transcript extraction

  • batch_extract_youtube_transcripts - Process multiple videos

Search Integration

  • search_google - Genre-filtered Google search with metadata

  • search_and_crawl - Combined search and content extraction

  • batch_search_google - Multiple search queries with analysis
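
All of the tools above are invoked through MCP's standard tools/call request. As a sketch, a crawl_url invocation carried over JSON-RPC 2.0 might look like the following; the argument names ("url", "wait_for_js") are assumptions for illustration, so consult the API Reference for each tool's exact schema:

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "crawl_url",
    "arguments": {
      "url": "https://example.com",
      "wait_for_js": true
    }
  }
}
```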

🎯 Common Use Cases

Content Research:

search_and_crawl → intelligent_extract → structured analysis

Documentation Mining:

deep_crawl_site → batch processing → comprehensive extraction

Media Analysis:

extract_youtube_transcript → auto_summarize → insight generation

Competitive Intelligence:

batch_crawl → extract_entities → comparative analysis

🚨 Quick Troubleshooting

Installation Issues:

  1. Run system diagnostics: Use get_system_diagnostics tool

  2. Re-run setup scripts with proper privileges

  3. Try development installation method

Performance Issues:

  • Use wait_for_js: true for JavaScript-heavy sites

  • Increase timeout for slow-loading pages

  • Enable auto_summarize for large content

Configuration Issues:

  • Check JSON syntax in claude_desktop_config.json

  • Verify file paths are absolute

  • Restart Claude Desktop after configuration changes
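
A quick way to catch JSON syntax errors before restarting Claude Desktop is Python's built-in json.tool module. The sketch below validates a throwaway sample file; substitute the path of your actual claude_desktop_config.json:

```shell
# Write a minimal sample config (stand-in for your real config file)
printf '{"mcpServers": {"crawl-mcp": {"transport": "stdio"}}}' \
  > /tmp/claude_desktop_config.json

# Prints the pretty-printed JSON on success,
# or an error with line/column on a syntax mistake
python3 -m json.tool /tmp/claude_desktop_config.json
```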

πŸ—οΈ Project Structure

  • Original Library: crawl4ai by unclecode

  • MCP Wrapper: This repository (walksoda)

  • Implementation: Unofficial third-party integration

📄 License

This project is an unofficial wrapper around the crawl4ai library. Please refer to the original crawl4ai license for the underlying functionality.

🤝 Contributing

See our Development Guide for contribution guidelines and development setup instructions.

