# MCP Server for Crawl4AI

> **Note:** Tested with Crawl4AI version 0.7.4

[![npm version](https://img.shields.io/npm/v/mcp-crawl4ai-ts.svg)](https://www.npmjs.com/package/mcp-crawl4ai-ts) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) [![Node.js CI](https://img.shields.io/badge/Node.js-18+-green.svg)](https://nodejs.org/) [![coverage](https://img.shields.io/endpoint?url=https%3A%2F%2Fgist.githubusercontent.com%2Fomgwtfwow%2Fe2abffb0deb25afa2bf9185f440dae81%2Fraw%2Fcoverage.json&cacheSeconds=300)](https://omgwtfwow.github.io/mcp-crawl4ai-ts/coverage/)

TypeScript implementation of an MCP server for Crawl4AI. Provides tools for web crawling, content extraction, and browser automation.

## Table of Contents

- [Prerequisites](#prerequisites)
- [Quick Start](#quick-start)
- [Configuration](#configuration)
- [Client-Specific Instructions](#client-specific-instructions)
- [Available Tools](#available-tools)
  - [1. get_markdown](#1-get_markdown---extract-content-as-markdown-with-filtering)
  - [2. capture_screenshot](#2-capture_screenshot---capture-webpage-screenshot)
  - [3. generate_pdf](#3-generate_pdf---convert-webpage-to-pdf)
  - [4. execute_js](#4-execute_js---execute-javascript-and-get-return-values)
  - [5. batch_crawl](#5-batch_crawl---crawl-multiple-urls-concurrently)
  - [6. smart_crawl](#6-smart_crawl---auto-detect-and-handle-different-content-types)
  - [7. get_html](#7-get_html---get-sanitized-html-for-analysis)
  - [8. extract_links](#8-extract_links---extract-and-categorize-page-links)
  - [9. crawl_recursive](#9-crawl_recursive---deep-crawl-website-following-links)
  - [10. parse_sitemap](#10-parse_sitemap---extract-urls-from-xml-sitemaps)
  - [11. crawl](#11-crawl---advanced-web-crawling-with-full-configuration)
  - [12. manage_session](#12-manage_session---unified-session-management)
  - [13. extract_with_llm](#13-extract_with_llm---extract-structured-data-using-ai)
- [Advanced Configuration](#advanced-configuration)
- [Development](#development)
- [License](#license)

## Prerequisites

- Node.js 18+ and npm
- A running Crawl4AI server

## Quick Start

### 1. Start the Crawl4AI server (for example, local Docker)

```bash
docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:0.7.4
```

### 2. Add to your MCP client

This MCP server works with any MCP-compatible client (Claude Desktop, Claude Code, Cursor, LMStudio, etc.).
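All of the client configurations below launch the server over stdio with `CRAWL4AI_BASE_URL` set in its environment. If you want to call the tools from your own code rather than a desktop client, a minimal sketch using the MCP TypeScript SDK could look like the following (the `@modelcontextprotocol/sdk` import paths, `Client`, `StdioClientTransport`, and `callTool` usage are assumptions about that SDK, not part of this package):

```typescript
// Minimal sketch; verify the API against your installed @modelcontextprotocol/sdk version.
import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { StdioClientTransport } from '@modelcontextprotocol/sdk/client/stdio.js';

async function main() {
  // Launch the server the same way a desktop client would: via npx over stdio,
  // with the required CRAWL4AI_BASE_URL environment variable.
  const transport = new StdioClientTransport({
    command: 'npx',
    args: ['mcp-crawl4ai-ts'],
    env: { CRAWL4AI_BASE_URL: 'http://localhost:11235' },
  });

  const client = new Client({ name: 'example-client', version: '1.0.0' }, { capabilities: {} });
  await client.connect(transport);

  // Call one of the tools documented below; the arguments are illustrative.
  const result = await client.callTool({
    name: 'get_markdown',
    arguments: { url: 'https://example.com', filter: 'fit' },
  });
  console.log(result);

  await client.close();
}

main().catch(console.error);
```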
#### Using npx (Recommended)

```json
{
  "mcpServers": {
    "crawl4ai": {
      "command": "npx",
      "args": ["mcp-crawl4ai-ts"],
      "env": {
        "CRAWL4AI_BASE_URL": "http://localhost:11235"
      }
    }
  }
}
```

#### Using local installation

```json
{
  "mcpServers": {
    "crawl4ai": {
      "command": "node",
      "args": ["/path/to/mcp-crawl4ai-ts/dist/index.js"],
      "env": {
        "CRAWL4AI_BASE_URL": "http://localhost:11235"
      }
    }
  }
}
```

#### With all optional variables

```json
{
  "mcpServers": {
    "crawl4ai": {
      "command": "npx",
      "args": ["mcp-crawl4ai-ts"],
      "env": {
        "CRAWL4AI_BASE_URL": "http://localhost:11235",
        "CRAWL4AI_API_KEY": "your-api-key",
        "SERVER_NAME": "custom-name",
        "SERVER_VERSION": "1.0.0"
      }
    }
  }
}
```

## Configuration

### Environment Variables

```env
# Required
CRAWL4AI_BASE_URL=http://localhost:11235

# Optional - Server Configuration
CRAWL4AI_API_KEY=          # If your server requires auth
SERVER_NAME=crawl4ai-mcp   # Custom name for the MCP server
SERVER_VERSION=1.0.0       # Custom version
```

## Client-Specific Instructions

### Claude Desktop

Add to `~/Library/Application Support/Claude/claude_desktop_config.json`.

### Claude Code

```bash
claude mcp add crawl4ai -e CRAWL4AI_BASE_URL=http://localhost:11235 -- npx mcp-crawl4ai-ts
```

### Other MCP Clients

Consult your client's documentation for MCP server configuration. The key details:

- **Command**: `npx mcp-crawl4ai-ts` or `node /path/to/dist/index.js`
- **Required env**: `CRAWL4AI_BASE_URL`
- **Optional env**: `CRAWL4AI_API_KEY`, `SERVER_NAME`, `SERVER_VERSION`

## Available Tools

### 1. `get_markdown` - Extract content as markdown with filtering

```typescript
{
  url: string,                        // Required: URL to extract markdown from
  filter?: 'raw'|'fit'|'bm25'|'llm',  // Filter type (default: 'fit')
  query?: string,                     // Query for bm25/llm filters
  cache?: string                      // Cache-bust parameter (default: '0')
}
```

Extracts content as markdown with various filtering options. Use the 'bm25' or 'llm' filter with a query for targeted content extraction.

### 2. `capture_screenshot` - Capture webpage screenshot

```typescript
{
  url: string,                  // Required: URL to capture
  screenshot_wait_for?: number  // Seconds to wait before screenshot (default: 2)
}
```

Returns a base64-encoded PNG. Note: this tool is stateless - for screenshots after JS execution, use `crawl` with `screenshot: true`.

### 3. `generate_pdf` - Convert webpage to PDF

```typescript
{
  url: string  // Required: URL to convert to PDF
}
```

Returns a base64-encoded PDF. Stateless tool - for PDFs after JS execution, use `crawl` with `pdf: true`.

### 4. `execute_js` - Execute JavaScript and get return values

```typescript
{
  url: string,                // Required: URL to load
  scripts: string | string[]  // Required: JavaScript to execute
}
```

Executes JavaScript and returns the results. Each script can use 'return' to send values back. Stateless - for persistent JS execution, use `crawl` with `js_code`.

### 5. `batch_crawl` - Crawl multiple URLs concurrently

```typescript
{
  urls: string[],           // Required: List of URLs to crawl
  max_concurrent?: number,  // Parallel request limit (default: 5)
  remove_images?: boolean,  // Remove images from output (default: false)
  bypass_cache?: boolean,   // Bypass cache for all URLs (default: false)
  configs?: Array<{         // Optional: Per-URL configurations (v3.0.0+)
    url: string,
    [key: string]: any      // Any crawl parameters for this specific URL
  }>
}
```

Efficiently crawls multiple URLs in parallel. Each URL gets a fresh browser instance. With the `configs` array, you can specify different parameters for each URL (see the example below).
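A minimal illustration of per-URL configuration (the URLs and parameter values here are made up; any of the `crawl` parameters documented under tool 11 below can appear in a `configs` entry):

```typescript
// Hypothetical batch_crawl arguments: two URLs, each with its own settings
{
  urls: ['https://example.com/docs', 'https://example.com/blog'],
  max_concurrent: 2,
  configs: [
    { url: 'https://example.com/docs', css_selector: 'main' },
    { url: 'https://example.com/blog', screenshot: true, cache_mode: 'BYPASS' }
  ]
}
```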
### 6. `smart_crawl` - Auto-detect and handle different content types

```typescript
{
  url: string,             // Required: URL to crawl
  max_depth?: number,      // Maximum depth for recursive crawling (default: 2)
  follow_links?: boolean,  // Follow links in content (default: true)
  bypass_cache?: boolean   // Bypass cache (default: false)
}
```

Intelligently detects the content type (HTML/sitemap/RSS) and processes it accordingly.

### 7. `get_html` - Get sanitized HTML for analysis

```typescript
{
  url: string  // Required: URL to extract HTML from
}
```

Returns preprocessed HTML optimized for structure analysis. Use it for building schemas or analyzing patterns.

### 8. `extract_links` - Extract and categorize page links

```typescript
{
  url: string,          // Required: URL to extract links from
  categorize?: boolean  // Group by type (default: true)
}
```

Extracts all links and groups them by type: internal, external, social media, documents, images.

### 9. `crawl_recursive` - Deep crawl website following links

```typescript
{
  url: string,               // Required: Starting URL
  max_depth?: number,        // Maximum depth to crawl (default: 3)
  max_pages?: number,        // Maximum pages to crawl (default: 50)
  include_pattern?: string,  // Regex pattern for URLs to include
  exclude_pattern?: string   // Regex pattern for URLs to exclude
}
```

Crawls a website following internal links up to the specified depth. Returns content from all discovered pages.

### 10. `parse_sitemap` - Extract URLs from XML sitemaps

```typescript
{
  url: string,             // Required: Sitemap URL (e.g., /sitemap.xml)
  filter_pattern?: string  // Optional: Regex pattern to filter URLs
}
```

Extracts all URLs from XML sitemaps. Supports regex filtering for specific URL patterns.

### 11. `crawl` - Advanced web crawling with full configuration

```typescript
{
  url: string,  // URL to crawl

  // Browser Configuration
  browser_type?: 'chromium'|'firefox'|'webkit'|'undetected',  // Browser engine (undetected = stealth mode)
  viewport_width?: number,   // Browser width (default: 1080)
  viewport_height?: number,  // Browser height (default: 600)
  user_agent?: string,       // Custom user agent
  proxy_server?: string | {  // Proxy URL (string or object format)
    server: string,
    username?: string,
    password?: string
  },
  proxy_username?: string,   // Proxy auth (if using string format)
  proxy_password?: string,   // Proxy password (if using string format)
  cookies?: Array<{name, value, domain}>,  // Pre-set cookies
  headers?: Record<string,string>,         // Custom headers

  // Crawler Configuration
  word_count_threshold?: number,      // Min words per block (default: 200)
  excluded_tags?: string[],           // HTML tags to exclude
  remove_overlay_elements?: boolean,  // Remove popups/modals
  js_code?: string | string[],        // JavaScript to execute
  wait_for?: string,                  // Wait condition (selector or JS)
  wait_for_timeout?: number,          // Wait timeout (default: 30000)
  delay_before_scroll?: number,       // Pre-scroll delay
  scroll_delay?: number,              // Between-scroll delay
  process_iframes?: boolean,          // Include iframe content
  exclude_external_links?: boolean,   // Remove external links
  screenshot?: boolean,               // Capture screenshot
  pdf?: boolean,                      // Generate PDF
  session_id?: string,                // Reuse browser session (only works with the crawl tool)
  cache_mode?: 'ENABLED'|'BYPASS'|'DISABLED',  // Cache control

  // New in v3.0.0 (Crawl4AI 0.7.3/0.7.4)
  css_selector?: string,              // CSS selector to filter content
  delay_before_return_html?: number,  // Delay in seconds before returning HTML
  include_links?: boolean,            // Include extracted links in response
  resolve_absolute_urls?: boolean,    // Convert relative URLs to absolute
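
  // CSS/XPath extraction strategies are not available through the REST API;
  // for simple structured extraction, the extract_with_llm tool documented
  // below is usually the easier option.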

  // LLM Extraction (REST API only supports the 'llm' type)
  extraction_type?: 'llm',          // Only 'llm' extraction is supported via REST API
  extraction_schema?: object,       // Schema for structured extraction
  extraction_instruction?: string,  // Natural language extraction prompt
  extraction_strategy?: {           // Advanced extraction configuration
    provider?: string,
    api_key?: string,
    model?: string,
    [key: string]: any
  },
  table_extraction_strategy?: {     // Table extraction configuration
    enable_chunking?: boolean,
    thresholds?: object,
    [key: string]: any
  },
  markdown_generator_options?: {    // Markdown generation options
    include_links?: boolean,
    preserve_formatting?: boolean,
    [key: string]: any
  },
  timeout?: number,  // Overall timeout (default: 60000)
  verbose?: boolean  // Detailed logging
}
```

### 12. `manage_session` - Unified session management

```typescript
{
  action: 'create' | 'clear' | 'list',  // Required: Action to perform
  session_id?: string,                  // For 'create' and 'clear' actions
  initial_url?: string,                 // For 'create' action: URL to load
  browser_type?: 'chromium' | 'firefox' | 'webkit' | 'undetected'  // For 'create' action
}
```

Unified tool for managing browser sessions. Supports three actions:

- **create**: Start a persistent browser session
- **clear**: Remove a session from local tracking
- **list**: Show all active sessions

Examples:

```typescript
// Create a new session
{ action: 'create', session_id: 'my-session', initial_url: 'https://example.com' }

// Clear a session
{ action: 'clear', session_id: 'my-session' }

// List all sessions
{ action: 'list' }
```

### 13. `extract_with_llm` - Extract structured data using AI

```typescript
{
  url: string,   // URL to extract data from
  query: string  // Natural language extraction instructions
}
```

Uses AI to extract structured data from webpages. Returns results immediately, without any polling or job management. This is the recommended way to extract specific information, since CSS/XPath extraction is not supported via the REST API.

## Advanced Configuration

For detailed information about all available configuration options, extraction strategies, and advanced features, please refer to the official Crawl4AI documentation:

- [Crawl4AI Documentation](https://docs.crawl4ai.com/)
- [Crawl4AI GitHub Repository](https://github.com/unclecode/crawl4ai)

## Changelog

See [CHANGELOG.md](CHANGELOG.md) for detailed version history and recent updates.

## Development

### Setup

```bash
# 1. Start the Crawl4AI server
docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:latest

# 2. Install the MCP server
git clone https://github.com/omgwtfwow/mcp-crawl4ai-ts.git
cd mcp-crawl4ai-ts
npm install
cp .env.example .env

# 3. Development commands
npm run dev    # Development mode
npm test       # Run tests
npm run lint   # Check code quality
npm run build  # Production build

# 4. Add to your MCP client (see "Using local installation" above)
```

### Running Integration Tests

Integration tests require a running Crawl4AI server.
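The tests exercise the MCP tools against that server. As a rough illustration of the setup they depend on, a hypothetical smoke test (not one of the actual files under `src/__tests__/integration/`) might look like this:

```typescript
// Hypothetical smoke test; the real integration tests live in src/__tests__/integration/
import { describe, expect, it } from '@jest/globals';

describe('crawl4ai server (smoke test)', () => {
  it('responds on /health', async () => {
    const baseUrl = process.env.CRAWL4AI_BASE_URL ?? 'http://localhost:11235';
    const res = await fetch(`${baseUrl}/health`); // global fetch, Node 18+
    expect(res.ok).toBe(true);
  });
});
```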
Configure your environment:

```bash
# Required for integration tests
export CRAWL4AI_BASE_URL=http://localhost:11235
export CRAWL4AI_API_KEY=your-api-key  # If authentication is required

# Optional: For LLM extraction tests
export LLM_PROVIDER=openai/gpt-4o-mini
export LLM_API_TOKEN=your-llm-api-key
export LLM_BASE_URL=https://api.openai.com/v1  # If using a custom endpoint

# Run integration tests (ALWAYS use the npm script; don't call `jest` directly)
npm run test:integration

# Run a single integration test file
npm run test:integration -- src/__tests__/integration/extract-links.integration.test.ts
```

> IMPORTANT: Do NOT run `npx jest` directly for integration tests. The npm script injects `NODE_OPTIONS=--experimental-vm-modules`, which is required for ESM + ts-jest. Running Jest directly will produce `SyntaxError: Cannot use import statement outside a module` and hang.

Integration tests cover:

- Dynamic content and JavaScript execution
- Session management and cookies
- Content extraction (LLM-based only)
- Media handling (screenshots, PDFs)
- Performance and caching
- Content filtering
- Bot detection avoidance
- Error handling

### Integration Test Checklist

1. Docker container healthy:

   ```bash
   docker ps --filter name=crawl4ai --format '{{.Names}} {{.Status}}'
   curl -sf http://localhost:11235/health || echo "Health check failed"
   ```

2. Env vars loaded (either exported or in `.env`): `CRAWL4AI_BASE_URL` (required); optional: `CRAWL4AI_API_KEY`, `LLM_PROVIDER`, `LLM_API_TOKEN`, `LLM_BASE_URL`.
3. Use `npm run test:integration` (never raw `jest`).
4. To target one file, add it after `--` (see the example above).
5. Expect a total runtime of roughly 2-3 minutes; a longer run or an immediate hang usually means the `NODE_OPTIONS` flag is missing or the Jest version is wrong.

### Troubleshooting

| Symptom | Likely Cause | Fix |
|---------|--------------|-----|
| `SyntaxError: Cannot use import statement outside a module` | Ran `jest` directly without script flags | Re-run with `npm run test:integration` |
| Hangs on first test (RUNS ...) | Missing experimental VM modules flag | Use the npm script / ensure `NODE_OPTIONS=--experimental-vm-modules` |
| Network timeouts | Crawl4AI container not healthy / DNS blocked | Restart container: `docker restart <name>` |
| LLM tests skipped | Missing `LLM_PROVIDER` or `LLM_API_TOKEN` | Export the required LLM vars |
| New Jest major upgrade breaks tests | Version mismatch with `ts-jest` | Keep Jest 29.x unless `ts-jest` is upgraded accordingly |

### Version Compatibility Note

Current stack: `jest@29.x` + `ts-jest@29.x` + ESM (`"type": "module"`). Updating Jest to 30+ requires upgrading `ts-jest` and revisiting `jest.config.cjs`. Keep versions aligned to avoid parse errors.

## License

MIT
