MCP Server for Crawl4AI

by omgwtfwow

Note: Tested with Crawl4AI version 0.7.4

TypeScript implementation of an MCP server for Crawl4AI. Provides tools for web crawling, content extraction, and browser automation.

Prerequisites

  • Node.js 18+ and npm
  • A running Crawl4AI server

Quick Start

1. Start the Crawl4AI server (for example, in a local Docker container)

docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:0.7.4

2. Add to your MCP client

This MCP server works with any MCP-compatible client (Claude Desktop, Claude Code, Cursor, LMStudio, etc.).

{
  "mcpServers": {
    "crawl4ai": {
      "command": "npx",
      "args": ["mcp-crawl4ai-ts"],
      "env": {
        "CRAWL4AI_BASE_URL": "http://localhost:11235"
      }
    }
  }
}

Using local installation

{
  "mcpServers": {
    "crawl4ai": {
      "command": "node",
      "args": ["/path/to/mcp-crawl4ai-ts/dist/index.js"],
      "env": {
        "CRAWL4AI_BASE_URL": "http://localhost:11235"
      }
    }
  }
}

With all optional variables

{
  "mcpServers": {
    "crawl4ai": {
      "command": "npx",
      "args": ["mcp-crawl4ai-ts"],
      "env": {
        "CRAWL4AI_BASE_URL": "http://localhost:11235",
        "CRAWL4AI_API_KEY": "your-api-key",
        "SERVER_NAME": "custom-name",
        "SERVER_VERSION": "1.0.0"
      }
    }
  }
}

Configuration

Environment Variables

# Required
CRAWL4AI_BASE_URL=http://localhost:11235

# Optional - Server Configuration
CRAWL4AI_API_KEY=            # If your server requires auth
SERVER_NAME=crawl4ai-mcp     # Custom name for the MCP server
SERVER_VERSION=1.0.0         # Custom version
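As a rough illustration of how these variables might be consumed, the sketch below validates the required URL and fills in fallbacks for the optional ones. `loadConfig` and `ServerConfig` are hypothetical names, not part of the package, and the fallback values simply mirror the sample values in the listing above; the server's actual defaults may differ.

```typescript
// Hypothetical config loader; names and fallbacks are illustrative only.
interface ServerConfig {
  baseUrl: string;
  apiKey?: string;
  serverName: string;
  serverVersion: string;
}

function loadConfig(env: Record<string, string | undefined>): ServerConfig {
  const baseUrl = env.CRAWL4AI_BASE_URL;
  if (!baseUrl) {
    throw new Error("CRAWL4AI_BASE_URL is required");
  }
  // new URL() throws on malformed input, so bad values fail early.
  new URL(baseUrl);
  return {
    baseUrl: baseUrl.replace(/\/+$/, ""), // drop trailing slashes
    apiKey: env.CRAWL4AI_API_KEY || undefined, // empty string counts as unset
    serverName: env.SERVER_NAME || "crawl4ai-mcp",
    serverVersion: env.SERVER_VERSION || "1.0.0",
  };
}
```

In a Node.js process this could be called as `loadConfig(process.env)`.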

Client-Specific Instructions

Claude Desktop

Add the server entry from Quick Start to ~/Library/Application Support/Claude/claude_desktop_config.json (the macOS location).

Claude Code

claude mcp add crawl4ai -e CRAWL4AI_BASE_URL=http://localhost:11235 -- npx mcp-crawl4ai-ts

Other MCP Clients

Consult your client's documentation for MCP server configuration. The key details:

  • Command: npx mcp-crawl4ai-ts or node /path/to/dist/index.js
  • Required env: CRAWL4AI_BASE_URL
  • Optional env: CRAWL4AI_API_KEY, SERVER_NAME, SERVER_VERSION

Available Tools

1. get_markdown - Extract content as markdown with filtering

{
  url: string,                          // Required: URL to extract markdown from
  filter?: 'raw'|'fit'|'bm25'|'llm',    // Filter type (default: 'fit')
  query?: string,                       // Query for bm25/llm filters
  cache?: string                        // Cache-bust parameter (default: '0')
}

Extracts content as markdown with various filtering options. Use 'bm25' or 'llm' filters with a query for specific content extraction.
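The pairing of query-based filters with a query can be checked on the client before the tool is called. The sketch below is one way to do that; `buildGetMarkdownArgs` is a hypothetical helper, and treating a missing query as an error is an assumption rather than documented server behavior.

```typescript
// Argument shape copied from the get_markdown parameters above.
type MarkdownFilter = "raw" | "fit" | "bm25" | "llm";

interface GetMarkdownArgs {
  url: string;
  filter?: MarkdownFilter;
  query?: string;
  cache?: string;
}

// Hypothetical helper: applies the documented default filter and rejects
// query-based filters that were given no query (an assumed rule).
function buildGetMarkdownArgs(args: GetMarkdownArgs): GetMarkdownArgs {
  const filter: MarkdownFilter = args.filter ?? "fit";
  if ((filter === "bm25" || filter === "llm") && !args.query) {
    throw new Error(`filter '${filter}' needs a query`);
  }
  return { ...args, filter };
}
```

For example, `buildGetMarkdownArgs({ url: "https://example.com", filter: "bm25", query: "pricing" })` passes through unchanged, while omitting `query` raises an error.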

2. capture_screenshot - Capture webpage screenshot

{
  url: string,                     // Required: URL to capture
  screenshot_wait_for?: number     // Seconds to wait before screenshot (default: 2)
}

Returns base64-encoded PNG. Note: This is stateless - for screenshots after JS execution, use crawl with screenshot: true.
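Once the base64 string is in hand, it can be turned back into bytes and sanity-checked. The sketch below decodes the string and verifies the standard 8-byte PNG signature; `decodePng` is a hypothetical helper, and how the string is pulled out of the tool result depends on your MCP client and is not shown.

```typescript
// The fixed 8-byte signature that starts every PNG file.
const PNG_SIGNATURE = [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a];

// Hypothetical helper: decodes base64 and rejects data that is not a PNG.
function decodePng(base64: string): Uint8Array {
  const binary = atob(base64); // available as a global in Node.js 18+
  const bytes = Uint8Array.from(binary, (ch) => ch.charCodeAt(0));
  const looksLikePng = PNG_SIGNATURE.every((b, i) => bytes[i] === b);
  if (!looksLikePng) {
    throw new Error("decoded data does not start with the PNG signature");
  }
  return bytes;
}
```

The returned bytes can then be written to disk, for instance with `fs.writeFileSync("shot.png", bytes)` in Node.js.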

3. generate_pdf - Convert webpage to PDF

{
  url: string     // Required: URL to convert to PDF
}

Returns base64-encoded PDF. Stateless tool - for PDFs after JS execution, use crawl with pdf: true.

4. execute_js - Execute JavaScript and get return values

{
  url: string,                  // Required: URL to load
  scripts: string | string[]    // Required: JavaScript to execute
}

Executes JavaScript and returns results. Each script can use 'return' to get values back. Stateless - for persistent JS execution use crawl with js_code.
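A minimal sketch of what an execute_js argument object could look like, with each script using `return` to hand a value back. The `ExecuteJsArgs` type restates the parameters above; `toScriptList` and the example scripts are illustrative assumptions.

```typescript
// Parameter shape restated from the execute_js listing above.
interface ExecuteJsArgs {
  url: string;
  scripts: string | string[];
}

// Hypothetical helper that normalises a single script to a one-element list.
function toScriptList(scripts: string | string[]): string[] {
  return Array.isArray(scripts) ? scripts : [scripts];
}

// Two independent scripts; each returns one value.
const executeJsExample: ExecuteJsArgs = {
  url: "https://example.com",
  scripts: [
    "return document.title",
    "return document.querySelectorAll('a').length",
  ],
};
```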

5. batch_crawl - Crawl multiple URLs concurrently

{
  urls: string[],              // Required: List of URLs to crawl
  max_concurrent?: number,     // Parallel request limit (default: 5)
  remove_images?: boolean,     // Remove images from output (default: false)
  bypass_cache?: boolean,      // Bypass cache for all URLs (default: false)
  configs?: Array<{            // Optional: Per-URL configurations (v3.0.0+)
    url: string,
    [key: string]: any         // Any crawl parameters for this specific URL
  }>
}

Efficiently crawls multiple URLs in parallel. Each URL gets a fresh browser instance. With configs array, you can specify different parameters for each URL.

6. smart_crawl - Auto-detect and handle different content types

{
  url: string,               // Required: URL to crawl
  max_depth?: number,        // Maximum depth for recursive crawling (default: 2)
  follow_links?: boolean,    // Follow links in content (default: true)
  bypass_cache?: boolean     // Bypass cache (default: false)
}

Intelligently detects content type (HTML/sitemap/RSS) and processes accordingly.
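To give a feel for what content-type detection can involve, here is a small heuristic that looks at the start of the response body and, failing that, the URL path. It is a sketch under assumed rules; `detectContentType` is a hypothetical function and the server's real detection logic may differ.

```typescript
type DetectedType = "sitemap" | "rss" | "html";

// Hypothetical heuristic: inspect the first part of the body, then the URL.
function detectContentType(url: string, body: string): DetectedType {
  const head = body.slice(0, 1000).toLowerCase();
  // Root elements used by XML sitemaps and sitemap indexes.
  if (head.includes("<urlset") || head.includes("<sitemapindex")) return "sitemap";
  // Root elements used by RSS and Atom feeds.
  if (head.includes("<rss") || head.includes("<feed")) return "rss";
  // Fall back to the file name, e.g. /sitemap.xml or /sitemap-posts.xml.
  if (/sitemap[^/]*\.xml$/i.test(new URL(url).pathname)) return "sitemap";
  return "html";
}
```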

7. get_html - Get sanitized HTML for analysis

{
  url: string     // Required: URL to extract HTML from
}

Returns preprocessed HTML optimized for structure analysis. Use for building schemas or analyzing patterns.

8. extract_links - Extract and categorize page links

{
  url: string,             // Required: URL to extract links from
  categorize?: boolean     // Group by type (default: true)
}

Extracts all links and groups them by type: internal, external, social media, documents, images.
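The grouping described above can be approximated with a few URL checks. The sketch below sorts links into the same five buckets; the host list and file-extension patterns are assumptions chosen for illustration, not the rules the server applies.

```typescript
type LinkCategory = "internal" | "external" | "social" | "documents" | "images";

// Illustrative host and extension lists; assumptions, not the server's rules.
const SOCIAL_HOSTS = ["twitter.com", "x.com", "facebook.com", "linkedin.com", "instagram.com"];
const DOCUMENT_EXT = /\.(pdf|docx?|xlsx?|pptx?)$/i;
const IMAGE_EXT = /\.(png|jpe?g|gif|svg|webp)$/i;

function categorizeLink(href: string, pageUrl: string): LinkCategory {
  const link = new URL(href, pageUrl); // resolves relative hrefs against the page
  const host = link.hostname.replace(/^www\./, "");
  if (IMAGE_EXT.test(link.pathname)) return "images";
  if (DOCUMENT_EXT.test(link.pathname)) return "documents";
  if (SOCIAL_HOSTS.includes(host)) return "social";
  return link.hostname === new URL(pageUrl).hostname ? "internal" : "external";
}

function groupLinks(hrefs: string[], pageUrl: string): Record<LinkCategory, string[]> {
  const groups: Record<LinkCategory, string[]> = {
    internal: [], external: [], social: [], documents: [], images: [],
  };
  for (const href of hrefs) {
    groups[categorizeLink(href, pageUrl)].push(href);
  }
  return groups;
}
```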

9. crawl_recursive - Deep crawl website following links

{
  url: string,                 // Required: Starting URL
  max_depth?: number,          // Maximum depth to crawl (default: 3)
  max_pages?: number,          // Maximum pages to crawl (default: 50)
  include_pattern?: string,    // Regex pattern for URLs to include
  exclude_pattern?: string     // Regex pattern for URLs to exclude
}

Crawls a website following internal links up to specified depth. Returns content from all discovered pages.
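How the depth, page, and pattern limits interact is easiest to see in a breadth-first walk. The sketch below plans a crawl order over a link graph; `planCrawl` and `getLinks` are hypothetical (the latter stands in for fetching a page and extracting its internal links), and counting the start page as depth 0 is an assumption about the semantics.

```typescript
interface RecursiveOptions {
  max_depth?: number;
  max_pages?: number;
  include_pattern?: string;
  exclude_pattern?: string;
}

// Breadth-first walk; returns URLs in the order they would be visited.
function planCrawl(
  start: string,
  getLinks: (url: string) => string[],
  opts: RecursiveOptions = {},
): string[] {
  const maxDepth = opts.max_depth ?? 3;  // documented default
  const maxPages = opts.max_pages ?? 50; // documented default
  const include = opts.include_pattern ? new RegExp(opts.include_pattern) : null;
  const exclude = opts.exclude_pattern ? new RegExp(opts.exclude_pattern) : null;

  const visited = new Set<string>([start]);
  const order: string[] = [];
  let frontier = [start];

  for (let depth = 0; depth <= maxDepth && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const url of frontier) {
      if (order.length >= maxPages) return order; // page budget exhausted
      order.push(url);
      for (const link of getLinks(url)) {
        if (visited.has(link)) continue;
        if (include && !include.test(link)) continue;
        if (exclude && exclude.test(link)) continue;
        visited.add(link);
        next.push(link);
      }
    }
    frontier = next; // descend one level
  }
  return order;
}
```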

10. parse_sitemap - Extract URLs from XML sitemaps

{
  url: string,                // Required: Sitemap URL (e.g., /sitemap.xml)
  filter_pattern?: string     // Optional: Regex pattern to filter URLs
}

Extracts all URLs from XML sitemaps. Supports regex filtering for specific URL patterns.
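The effect of `filter_pattern` can be illustrated with a few lines of regex work. This sketch pulls `<loc>` values out of sitemap XML and applies an optional filter; `extractSitemapUrls` is a hypothetical function, and a regex is used only to keep the example short (a real parser should use an XML library and decode entities).

```typescript
// Sketch only: collects <loc> values and applies an optional regex filter.
function extractSitemapUrls(xml: string, filterPattern?: string): string[] {
  const locPattern = /<loc>\s*([^<\s]+)\s*<\/loc>/g;
  const urls: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = locPattern.exec(xml)) !== null) {
    urls.push(match[1]);
  }
  if (!filterPattern) return urls;
  const filter = new RegExp(filterPattern);
  return urls.filter((u) => filter.test(u));
}
```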

11. crawl - Advanced web crawling with full configuration

{
  url: string,                               // URL to crawl

  // Browser Configuration
  browser_type?: 'chromium'|'firefox'|'webkit'|'undetected', // Browser engine (undetected = stealth mode)
  viewport_width?: number,                   // Browser width (default: 1080)
  viewport_height?: number,                  // Browser height (default: 600)
  user_agent?: string,                       // Custom user agent
  proxy_server?: string | {                  // Proxy URL (string or object format)
    server: string,
    username?: string,
    password?: string
  },
  proxy_username?: string,                   // Proxy auth (if using string format)
  proxy_password?: string,                   // Proxy password (if using string format)
  cookies?: Array<{name, value, domain}>,    // Pre-set cookies
  headers?: Record<string,string>,           // Custom headers

  // Crawler Configuration
  word_count_threshold?: number,             // Min words per block (default: 200)
  excluded_tags?: string[],                  // HTML tags to exclude
  remove_overlay_elements?: boolean,         // Remove popups/modals
  js_code?: string | string[],               // JavaScript to execute
  wait_for?: string,                         // Wait condition (selector or JS)
  wait_for_timeout?: number,                 // Wait timeout (default: 30000)
  delay_before_scroll?: number,              // Pre-scroll delay
  scroll_delay?: number,                     // Between-scroll delay
  process_iframes?: boolean,                 // Include iframe content
  exclude_external_links?: boolean,          // Remove external links
  screenshot?: boolean,                      // Capture screenshot
  pdf?: boolean,                             // Generate PDF
  session_id?: string,                       // Reuse browser session (only works with crawl tool)
  cache_mode?: 'ENABLED'|'BYPASS'|'DISABLED', // Cache control

  // New in v3.0.0 (Crawl4AI 0.7.3/0.7.4)
  css_selector?: string,                     // CSS selector to filter content
  delay_before_return_html?: number,         // Delay in seconds before returning HTML
  include_links?: boolean,                   // Include extracted links in response
  resolve_absolute_urls?: boolean,           // Convert relative URLs to absolute

  // LLM Extraction (REST API only supports 'llm' type)
  extraction_type?: 'llm',                   // Only 'llm' extraction is supported via REST API
  extraction_schema?: object,                // Schema for structured extraction
  extraction_instruction?: string,           // Natural language extraction prompt
  extraction_strategy?: {                    // Advanced extraction configuration
    provider?: string,
    api_key?: string,
    model?: string,
    [key: string]: any
  },
  table_extraction_strategy?: {              // Table extraction configuration
    enable_chunking?: boolean,
    thresholds?: object,
    [key: string]: any
  },
  markdown_generator_options?: {             // Markdown generation options
    include_links?: boolean,
    preserve_formatting?: boolean,
    [key: string]: any
  },
  timeout?: number,                          // Overall timeout (default: 60000)
  verbose?: boolean                          // Detailed logging
}
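Because `session_id` is what ties separate crawl calls to the same browser state, a two-step flow is a natural example. The sketch below types a small subset of the parameters and builds two argument objects that share a session; the session name, selector, and script are made-up values, and whether the session must first be created with manage_session is not covered here.

```typescript
// A subset of the crawl parameters listed above, typed for illustration.
interface CrawlArgs {
  url: string;
  session_id?: string;
  js_code?: string | string[];
  wait_for?: string;
  screenshot?: boolean;
  cache_mode?: "ENABLED" | "BYPASS" | "DISABLED";
}

const demoSessionId = "demo-session"; // hypothetical session name

// Step 1 loads the page; step 2 scrolls in the same session and captures it.
const crawlSteps: CrawlArgs[] = [
  {
    url: "https://example.com",
    session_id: demoSessionId,
    wait_for: "h1", // CSS selector to wait for
    cache_mode: "BYPASS",
  },
  {
    url: "https://example.com",
    session_id: demoSessionId,
    js_code: "window.scrollTo(0, document.body.scrollHeight);",
    screenshot: true,
  },
];
```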

12. manage_session - Unified session management

{
  action: 'create' | 'clear' | 'list',    // Required: Action to perform
  session_id?: string,                    // For 'create' and 'clear' actions
  initial_url?: string,                   // For 'create' action: URL to load
  browser_type?: 'chromium' | 'firefox' | 'webkit' | 'undetected' // For 'create' action
}

Unified tool for managing browser sessions. Supports three actions:

  • create: Start a persistent browser session
  • clear: Remove a session from local tracking
  • list: Show all active sessions

Examples:

// Create a new session
{ action: 'create', session_id: 'my-session', initial_url: 'https://example.com' }

// Clear a session
{ action: 'clear', session_id: 'my-session' }

// List all sessions
{ action: 'list' }
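To make the idea of "local tracking" concrete, here is a minimal in-memory registry with the same three actions. `SessionRegistry` and `SessionInfo` are hypothetical names and this is not the server's implementation; note that clearing an entry here only forgets it locally and says nothing about browser state on the Crawl4AI side.

```typescript
interface SessionInfo {
  id: string;
  initialUrl?: string;
  createdAt: Date;
}

// Hypothetical in-memory registry mirroring the create/clear/list actions.
class SessionRegistry {
  private sessions = new Map<string, SessionInfo>();

  create(id: string, initialUrl?: string): SessionInfo {
    const info: SessionInfo = { id, initialUrl, createdAt: new Date() };
    this.sessions.set(id, info); // re-creating an id overwrites the old entry
    return info;
  }

  // Returns true if the session was known and has been removed.
  clear(id: string): boolean {
    return this.sessions.delete(id);
  }

  list(): SessionInfo[] {
    return Array.from(this.sessions.values());
  }
}
```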

13. extract_with_llm - Extract structured data using AI

{
  url: string,     // URL to extract data from
  query: string    // Natural language extraction instructions
}

Uses AI to extract structured data from webpages. Returns results immediately without any polling or job management. This is the recommended way to extract specific information since CSS/XPath extraction is not supported via the REST API.

Advanced Configuration

For detailed information about all available configuration options, extraction strategies, and advanced features, refer to the official Crawl4AI documentation.

Changelog

See CHANGELOG.md for detailed version history and recent updates.

Development

Setup

# 1. Start the Crawl4AI server
docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:latest

# 2. Install MCP server
git clone https://github.com/omgwtfwow/mcp-crawl4ai-ts.git
cd mcp-crawl4ai-ts
npm install
cp .env.example .env

# 3. Development commands
npm run dev      # Development mode
npm test         # Run tests
npm run lint     # Check code quality
npm run build    # Production build

# 4. Add to your MCP client (See "Using local installation")

Running Integration Tests

Integration tests require a running Crawl4AI server. Configure your environment:

# Required for integration tests
export CRAWL4AI_BASE_URL=http://localhost:11235
export CRAWL4AI_API_KEY=your-api-key    # If authentication is required

# Optional: For LLM extraction tests
export LLM_PROVIDER=openai/gpt-4o-mini
export LLM_API_TOKEN=your-llm-api-key
export LLM_BASE_URL=https://api.openai.com/v1    # If using custom endpoint

# Run integration tests (ALWAYS use the npm script; don't call `jest` directly)
npm run test:integration

# Run a single integration test file
npm run test:integration -- src/__tests__/integration/extract-links.integration.test.ts

IMPORTANT: Do NOT run `npx jest` directly for integration tests. The npm script injects `NODE_OPTIONS=--experimental-vm-modules`, which is required for ESM + ts-jest. Running Jest directly will produce `SyntaxError: Cannot use import statement outside a module` and hang.

Integration tests cover:

  • Dynamic content and JavaScript execution
  • Session management and cookies
  • Content extraction (LLM-based only)
  • Media handling (screenshots, PDFs)
  • Performance and caching
  • Content filtering
  • Bot detection avoidance
  • Error handling

Integration Test Checklist

  1. Docker container healthy:
       docker ps --filter name=crawl4ai --format '{{.Names}} {{.Status}}'
       curl -sf http://localhost:11235/health || echo "Health check failed"
  2. Env vars loaded (either exported or in .env): CRAWL4AI_BASE_URL (required), optional: CRAWL4AI_API_KEY, LLM_PROVIDER, LLM_API_TOKEN, LLM_BASE_URL.
  3. Use npm run test:integration (never raw jest).
  4. To target one file, add it after -- (see example above).
  5. Expect a total runtime of ~2–3 minutes; a longer run or an immediate hang usually means a missing NODE_OPTIONS or the wrong Jest version.

Troubleshooting

| Symptom | Likely Cause | Fix |
| --- | --- | --- |
| SyntaxError: Cannot use import statement outside a module | Ran jest directly without script flags | Re-run with npm run test:integration |
| Hangs on first test (RUNS ...) | Missing experimental VM modules flag | Use npm script / ensure NODE_OPTIONS=--experimental-vm-modules |
| Network timeouts | Crawl4AI container not healthy / DNS blocked | Restart container: docker restart <name> |
| LLM tests skipped | Missing LLM_PROVIDER or LLM_API_TOKEN | Export required LLM vars |
| New Jest major upgrade breaks tests | Version mismatch with ts-jest | Keep Jest 29.x unless ts-jest upgraded accordingly |

Version Compatibility Note

Current stack: jest@29.x + ts-jest@29.x + ESM ("type": "module"). Updating Jest to 30+ requires upgrading ts-jest and revisiting jest.config.cjs. Keep versions aligned to avoid parse errors.
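The alignment rule can be turned into a quick check, for example in a pre-test script. The sketch below compares the major versions of two semver-like strings; `majorOf` and `majorsAligned` are hypothetical helpers, and "same major version" is used here as a simplified reading of the compatibility note rather than an official ts-jest rule.

```typescript
// Returns the major version from a semver-like string such as "29.7.0" or "^29.1.0".
function majorOf(version: string): number {
  const match = /(\d+)\./.exec(version);
  if (!match) throw new Error(`unrecognised version: ${version}`);
  return Number(match[1]);
}

// True when jest and ts-jest share a major version (assumed rule of thumb).
function majorsAligned(jestVersion: string, tsJestVersion: string): boolean {
  return majorOf(jestVersion) === majorOf(tsJestVersion);
}
```

The version strings could be read from the `devDependencies` section of package.json.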

License

MIT
