# Crawl4AI MCP Server
A powerful Model Context Protocol (MCP) server that provides web scraping and crawling capabilities using Crawl4AI. This server acts as the "hands and eyes" for client-side AI, enabling intelligent web content analysis and extraction.
## Features
- **🔍 Page Structure Analysis**: Extract clean HTML or Markdown content from any webpage
- **🎯 Schema-Based Extraction**: Precision data extraction using CSS selectors and AI-generated schemas
- **📸 Screenshot Capture**: Visual webpage representation for analysis
- **⚡ Async Operations**: Non-blocking web crawling with progress reporting
- **🛡️ Error Handling**: Comprehensive error handling and validation
- **🔌 MCP Integration**: Full Model Context Protocol compatibility with logging and progress tracking
## Architecture
```
┌─────────────────┐      ┌───────────────────┐      ┌─────────────────┐
│    Client AI    │      │   Crawl4AI MCP    │      │   Web Content   │
│    ("Brain")    │◄────►│      Server       │◄────►│   (Websites)    │
│                 │      │ ("Hands & Eyes")  │      │                 │
└─────────────────┘      └───────────────────┘      └─────────────────┘
```
- **FastMCP**: Handles MCP protocol and tool registration
- **AsyncWebCrawler**: Provides async web scraping capabilities
- **Stdio Transport**: MCP-compatible communication channel
- **Error-Safe Logging**: All logs directed to stderr to prevent protocol corruption
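Routing every log line to stderr matters because, on the stdio transport, stdout is reserved exclusively for MCP protocol frames; a stray `print` or stdout log would corrupt the stream. A minimal sketch of such a logging setup using only the standard library (the logger name is illustrative):

```python
import logging
import sys

# Send all diagnostics to stderr so stdout stays reserved for
# MCP protocol messages on the stdio transport.
handler = logging.StreamHandler(sys.stderr)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))

logger = logging.getLogger("crawl4ai-mcp")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("server starting")  # appears on stderr, never on stdout
```

The same rule applies to any library the server pulls in: if it logs to stdout by default, redirect it before starting the protocol loop.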
## Installation
### Prerequisites
- Python 3.10 or higher
- pip package manager
### Setup
1. **Clone or download this repository**:
```bash
git clone <repository-url>
cd crawl4ai-mcp
```
2. **Create and activate virtual environment**:
```bash
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
```
3. **Install dependencies**:
```bash
pip install -r requirements.txt
```
4. **Install Playwright browsers** (required for screenshots):
```bash
playwright install
```
## Usage
### Starting the Server
```bash
# Activate virtual environment
source venv/bin/activate
# Start the MCP server
python3 crawl4ai_mcp_server.py
```
### Testing with MCP Inspector
For interactive testing and development:
```bash
# Start MCP Inspector interface
fastmcp dev crawl4ai_mcp_server.py
```
This will start a web interface (usually at http://localhost:6274) where you can test all tools interactively.
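Under the hood, the Inspector (like any MCP client) drives the server with JSON-RPC 2.0 messages over stdio. A hedged sketch of what a `tools/call` request for this server might look like; the exact framing and `id` management are normally handled by the client library:

```python
import json

# A JSON-RPC 2.0 request invoking the get_page_structure tool.
# An MCP client library constructs and frames this for you; shown
# here only to illustrate the message shape.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "get_page_structure",
        "arguments": {"url": "https://example.com", "format": "markdown"},
    },
}

wire_message = json.dumps(request)
```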
## Available Tools
### 1. server_status
**Purpose**: Get server health and capabilities information
**Parameters**: None
**Example Response**:
```json
{
  "server_name": "Crawl4AI-MCP-Server",
  "version": "1.0.0",
  "status": "operational",
  "capabilities": ["web_crawling", "content_extraction", "screenshot_capture", "schema_based_extraction"]
}
```
### 2. get_page_structure
**Purpose**: Extract webpage content for analysis (the "eyes" function)
**Parameters**:
- `url` (string): The webpage URL to analyze
- `format` (string, optional): Output format - "html" or "markdown" (default: "html")
**Example**:
```json
{
  "url": "https://example.com",
  "format": "html"
}
```
### 3. crawl_with_schema
**Purpose**: Precision data extraction using CSS selectors (the "hands" function)
**Parameters**:
- `url` (string): The webpage URL to extract data from
- `extraction_schema` (string): JSON string defining field names and CSS selectors
**Example Schema**:
```json
{
  "title": "h1",
  "description": "p.description",
  "price": ".price-value",
  "author": ".author-name",
  "tags": ".tag"
}
```
**Example Usage**:
```json
{
  "url": "https://example.com/product",
  "extraction_schema": "{\"title\": \"h1\", \"price\": \".price\", \"description\": \"p\"}"
}
```
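Because `extraction_schema` is a JSON *string* rather than a JSON object, the least error-prone approach on the client side is to build the schema as a native dict and serialize it with `json.dumps` instead of hand-escaping quotes; a sketch:

```python
import json

# Field names mapped to the CSS selectors that locate them.
schema = {
    "title": "h1",
    "price": ".price",
    "description": "p",
}

# The tool expects the schema as a JSON string, so serialize it
# before placing it in the request arguments.
arguments = {
    "url": "https://example.com/product",
    "extraction_schema": json.dumps(schema),
}
```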
### 4. take_screenshot
**Purpose**: Capture visual representation of webpage
**Parameters**:
- `url` (string): The webpage URL to screenshot
**Example**:
```json
{
  "url": "https://example.com"
}
```
**Returns**: Base64-encoded PNG image data with metadata
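A client can turn the returned payload back into a PNG file with the standard library's `base64` module. A sketch assuming the image arrives under a `screenshot` key (the key name here is illustrative; check the actual response shape):

```python
import base64

# Stand-in for a tool response; the real value is the full
# base64-encoded PNG returned by take_screenshot.
response = {
    "url": "https://example.com",
    "screenshot": base64.b64encode(b"\x89PNG\r\n\x1a\n").decode("ascii"),
}

# Decode the base64 string back into raw PNG bytes and save them.
png_bytes = base64.b64decode(response["screenshot"])

with open("screenshot.png", "wb") as f:
    f.write(png_bytes)
```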
## Integration with Claude Desktop
To use this server with Claude Desktop, add this configuration to your Claude Desktop settings:
```json
{
  "mcpServers": {
    "crawl4ai": {
      "command": "python3",
      "args": ["/path/to/crawl4ai-mcp/crawl4ai_mcp_server.py"],
      "env": {}
    }
  }
}
```
Replace `/path/to/crawl4ai-mcp/` with the actual path to your installation directory.
## Error Handling
All tools include comprehensive error handling and return structured JSON responses:
```json
{
  "error": "Error description",
  "url": "https://example.com",
  "success": false
}
```
Common error scenarios:
- Invalid URL format
- Network connectivity issues
- Invalid extraction schemas
- Screenshot capture failures
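Since errors come back as structured JSON rather than exceptions, a client should branch on the `success` flag before touching the payload; a minimal sketch of that pattern:

```python
import json

# A structured error response as returned by any of the tools.
raw = '{"error": "Invalid URL format", "url": "not-a-url", "success": false}'

result = json.loads(raw)

# Check the success flag first; only successful responses carry data.
if not result.get("success", True):
    message = f"tool failed for {result['url']}: {result['error']}"
else:
    message = "ok"
```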
## Development
### Project Structure
```
crawl4ai-mcp/
├── crawl4ai_mcp_server.py   # Main server implementation
├── requirements.txt         # Python dependencies
├── pyproject.toml           # Project configuration
├── USAGE_EXAMPLES.md        # Detailed usage examples
└── README.md                # This file
```
### Dependencies
- **fastmcp**: FastMCP framework for MCP server development
- **crawl4ai**: Core web crawling and extraction library
- **pydantic**: Data validation and parsing
- **playwright**: Browser automation for screenshots
### Testing
Run the linter to ensure code quality:
```bash
ruff check .
```
Test server startup:
```bash
python3 crawl4ai_mcp_server.py
```
## Contributing
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Test thoroughly with MCP Inspector
5. Submit a pull request
## License
This project is open source. See the LICENSE file for details.
## Support
For issues and questions:
1. Check the troubleshooting section in USAGE_EXAMPLES.md
2. Test with MCP Inspector to isolate issues
3. Verify all dependencies are correctly installed
4. Ensure virtual environment is activated
## Acknowledgments
- **Crawl4AI**: Powerful web crawling and extraction capabilities
- **FastMCP**: Streamlined MCP server development framework
- **Model Context Protocol**: Standardized AI tool integration