# Crawl4AI MCP Server
A powerful Model Context Protocol (MCP) server that provides web scraping and crawling capabilities using Crawl4AI. This server acts as the "hands and eyes" for client-side AI, enabling intelligent web content analysis and extraction.
## Features
- **🔍 Page Structure Analysis**: Extract clean HTML or Markdown content from any webpage
- **🎯 Schema-Based Extraction**: Precision data extraction using CSS selectors and AI-generated schemas
- **📸 Screenshot Capture**: Visual webpage representation for analysis
- **⚡ Async Operations**: Non-blocking web crawling with progress reporting
- **🛡️ Error Handling**: Comprehensive error handling and validation
- **🔌 MCP Integration**: Full Model Context Protocol compatibility with logging and progress tracking
## Architecture
```
┌─────────────────┐      ┌───────────────────┐      ┌─────────────────┐
│    Client AI    │      │   Crawl4AI MCP    │      │   Web Content   │
│    ("Brain")    │◄────►│      Server       │◄────►│   (Websites)    │
│                 │      │ ("Hands & Eyes")  │      │                 │
└─────────────────┘      └───────────────────┘      └─────────────────┘
```
- **FastMCP**: Handles MCP protocol and tool registration
- **AsyncWebCrawler**: Provides async web scraping capabilities
- **Stdio Transport**: MCP-compatible communication channel
- **Error-Safe Logging**: All logs directed to stderr to prevent protocol corruption
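Routing every log line to stderr matters because, on the stdio transport, stdout is reserved exclusively for MCP protocol frames; a stray `print` or stdout log would corrupt the stream. A minimal sketch of such a logging setup using only the standard library (the logger name is illustrative):

```python
import logging
import sys

# Send all diagnostics to stderr so stdout stays reserved for
# MCP protocol messages on the stdio transport.
handler = logging.StreamHandler(sys.stderr)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))

logger = logging.getLogger("crawl4ai-mcp")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("server starting")  # appears on stderr, never on stdout
```

The same rule applies to any library the server pulls in: if it logs to stdout by default, redirect it before starting the protocol loop.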
## Installation
### Prerequisites
- Python 3.10 or higher
- pip package manager
### Setup
1. **Clone or download this repository**:
```bash
git clone <repository-url>
cd crawl4ai-mcp
```
2. **Create and activate virtual environment**:
```bash
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
```
3. **Install dependencies**:
```bash
pip install -r requirements.txt
```
4. **Install Playwright browsers** (required for screenshots):
```bash
playwright install
```
## Usage
### Starting the Server
```bash
# Activate virtual environment
source venv/bin/activate
# Start the MCP server
python3 crawl4ai_mcp_server.py
```
### Testing with MCP Inspector
For interactive testing and development:
```bash
# Start MCP Inspector interface
fastmcp dev crawl4ai_mcp_server.py
```
This will start a web interface (usually at http://localhost:6274) where you can test all tools interactively.
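Under the hood, the Inspector (like any MCP client) drives the server with JSON-RPC 2.0 messages over stdio. A hedged sketch of what a `tools/call` request for this server might look like; the exact framing and `id` management are normally handled by the client library:

```python
import json

# A JSON-RPC 2.0 request invoking the get_page_structure tool.
# An MCP client library constructs and frames this for you; shown
# here only to illustrate the message shape.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "get_page_structure",
        "arguments": {"url": "https://example.com", "format": "markdown"},
    },
}

wire_message = json.dumps(request)
```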
## Available Tools
### 1. server_status
**Purpose**: Get server health and capabilities information
**Parameters**: None
**Example Response**:
```json
{
  "server_name": "Crawl4AI-MCP-Server",
  "version": "1.0.0",
  "status": "operational",
  "capabilities": ["web_crawling", "content_extraction", "screenshot_capture", "schema_based_extraction"]
}
```
### 2. get_page_structure
**Purpose**: Extract webpage content for analysis (the "eyes" function)
**Parameters**:
- `url` (string): The webpage URL to analyze
- `format` (string, optional): Output format - "html" or "markdown" (default: "html")
**Example**:
```json
{
  "url": "https://example.com",
  "format": "html"
}
```
### 3. crawl_with_schema
**Purpose**: Precision data extraction using CSS selectors (the "hands" function)
**Parameters**:
- `url` (string): The webpage URL to extract data from
- `extraction_schema` (string): JSON string defining field names and CSS selectors
**Example Schema**:
```json
{
  "title": "h1",
  "description": "p.description",
  "price": ".price-value",
  "author": ".author-name",
  "tags": ".tag"
}
```
**Example Usage**:
```json
{
  "url": "https://example.com/product",
  "extraction_schema": "{\"title\": \"h1\", \"price\": \".price\", \"description\": \"p\"}"
}
```
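Because `extraction_schema` is a JSON *string* rather than a JSON object, the least error-prone approach on the client side is to build the schema as a native dict and serialize it with `json.dumps` instead of hand-escaping quotes; a sketch:

```python
import json

# Field names mapped to the CSS selectors that locate them.
schema = {
    "title": "h1",
    "price": ".price",
    "description": "p",
}

# The tool expects the schema as a JSON string, so serialize it
# before placing it in the request arguments.
arguments = {
    "url": "https://example.com/product",
    "extraction_schema": json.dumps(schema),
}
```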
### 4. take_screenshot
**Purpose**: Capture visual representation of webpage
**Parameters**:
- `url` (string): The webpage URL to screenshot
**Example**:
```json
{
  "url": "https://example.com"
}
```
**Returns**: Base64-encoded PNG image data with metadata
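A client can turn the returned payload back into a PNG file with the standard library's `base64` module. A sketch assuming the image arrives under a `screenshot` key (the key name here is illustrative; check the actual response shape):

```python
import base64

# Stand-in for a tool response; the real value is the full
# base64-encoded PNG returned by take_screenshot.
response = {
    "url": "https://example.com",
    "screenshot": base64.b64encode(b"\x89PNG\r\n\x1a\n").decode("ascii"),
}

# Decode the base64 string back into raw PNG bytes and save them.
png_bytes = base64.b64decode(response["screenshot"])

with open("screenshot.png", "wb") as f:
    f.write(png_bytes)
```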
## Integration with Claude Desktop
To use this server with Claude Desktop, add this configuration to your Claude Desktop settings:
```json
{
  "mcpServers": {
    "crawl4ai": {
      "command": "python3",
      "args": ["/path/to/crawl4ai-mcp/crawl4ai_mcp_server.py"],
      "env": {}
    }
  }
}
```
Replace `/path/to/crawl4ai-mcp/` with the actual path to your installation directory.
## Error Handling
All tools include comprehensive error handling and return structured JSON responses:
```json
{
  "error": "Error description",
  "url": "https://example.com",
  "success": false
}
```
Common error scenarios:
- Invalid URL format
- Network connectivity issues
- Invalid extraction schemas
- Screenshot capture failures
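Since errors come back as structured JSON rather than exceptions, a client should branch on the `success` flag before touching the payload; a minimal sketch of that pattern:

```python
import json

# A structured error response as returned by any of the tools.
raw = '{"error": "Invalid URL format", "url": "not-a-url", "success": false}'

result = json.loads(raw)

# Check the success flag first; only successful responses carry data.
if not result.get("success", True):
    message = f"tool failed for {result['url']}: {result['error']}"
else:
    message = "ok"
```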
## Development
### Project Structure
```
crawl4ai-mcp/
├── crawl4ai_mcp_server.py   # Main server implementation
├── requirements.txt         # Python dependencies
├── pyproject.toml           # Project configuration
├── USAGE_EXAMPLES.md        # Detailed usage examples
└── README.md                # This file
```
### Dependencies
- **fastmcp**: FastMCP framework for MCP server development
- **crawl4ai**: Core web crawling and extraction library
- **pydantic**: Data validation and parsing
- **playwright**: Browser automation for screenshots
### Testing
Run the linter to ensure code quality:
```bash
ruff check .
```
Test server startup:
```bash
python3 crawl4ai_mcp_server.py
```
## Contributing
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Test thoroughly with MCP Inspector
5. Submit a pull request
## License
This project is open source. See the LICENSE file for details.
## Support
For issues and questions:
1. Check the troubleshooting section in USAGE_EXAMPLES.md
2. Test with MCP Inspector to isolate issues
3. Verify all dependencies are correctly installed
4. Ensure virtual environment is activated
## Acknowledgments
- **Crawl4AI**: Powerful web crawling and extraction capabilities
- **FastMCP**: Streamlined MCP server development framework
- **Model Context Protocol**: Standardized AI tool integration