- Provides integration with a Crawl4AI server, which can be run as a Docker container for easy deployment.
- Supports Firefox as one of the browser engines for web crawling and content extraction tasks.
- Executes custom JavaScript code on web pages and returns the results, enabling advanced web scraping.
- Extracts web content as markdown with various filtering options, including raw, fit, bm25, and LLM-based filtering.
- Runs on Node.js 18+ to provide web crawling and browser automation capabilities.
- Integrates with OpenAI models for LLM-based content extraction and structured data generation.
- Auto-detects and processes RSS feeds as part of the smart_crawl functionality.
- Supports creation and management of persistent browser sessions to maintain state across multiple requests.
- Handles SVG content during web crawling and includes it in extraction results when appropriate.
- Parses XML sitemaps to extract URLs, with support for regex filtering of specific URL patterns.
MCP Server for Crawl4AI
TypeScript implementation of an MCP server for Crawl4AI. Provides 15 specialized tools for web crawling, content extraction, and browser automation.
Table of Contents
- Prerequisites
- Quick Start
- Configuration
- Client-Specific Instructions
- Available Tools
- Advanced Configuration
- Changelog
- Development
- License
Prerequisites
- Node.js 18+ and npm
- A running Crawl4AI server
Quick Start
1. Start the Crawl4AI server (for example, via local Docker):
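If you do not already have a server running, the upstream Crawl4AI Docker image can be used. A typical invocation (image name, port, and flags follow the Crawl4AI docs; pin the tag to match the version you are targeting, e.g. 0.7.2):

```bash
# Pulls the upstream Crawl4AI image and exposes it on port 11235
docker run -d \
  --name crawl4ai \
  --shm-size=1g \
  -p 11235:11235 \
  unclecode/crawl4ai:latest
```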
Note: Tested with Crawl4AI version 0.7.2
2. Add to your MCP client
This MCP server works with any MCP-compatible client (Claude Desktop, Claude Code, Cursor, LMStudio, etc.).
Using npx (Recommended)
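A typical entry using npx looks like the following (the "crawl4ai" key and the localhost URL are examples; adjust to your setup):

```json
{
  "mcpServers": {
    "crawl4ai": {
      "command": "npx",
      "args": ["mcp-crawl4ai-ts"],
      "env": {
        "CRAWL4AI_BASE_URL": "http://localhost:11235"
      }
    }
  }
}
```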
Using local installation
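If you have cloned and built the server locally, point the client at the compiled entry point instead (the path is a placeholder for your build output):

```json
{
  "mcpServers": {
    "crawl4ai": {
      "command": "node",
      "args": ["/path/to/dist/index.js"],
      "env": {
        "CRAWL4AI_BASE_URL": "http://localhost:11235"
      }
    }
  }
}
```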
With all optional variables
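With every optional variable set (the API key is only needed if your Crawl4AI server enforces authentication; all values shown are examples):

```json
{
  "mcpServers": {
    "crawl4ai": {
      "command": "npx",
      "args": ["mcp-crawl4ai-ts"],
      "env": {
        "CRAWL4AI_BASE_URL": "http://localhost:11235",
        "CRAWL4AI_API_KEY": "your-api-key",
        "SERVER_NAME": "crawl4ai",
        "SERVER_VERSION": "1.0.0"
      }
    }
  }
}
```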
Configuration
Environment Variables
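The variables the server reads are listed below; only the base URL is required. Descriptions of SERVER_NAME and SERVER_VERSION are inferred from their names:

- CRAWL4AI_BASE_URL (required): URL of the running Crawl4AI server, e.g. http://localhost:11235
- CRAWL4AI_API_KEY (optional): API key, if your Crawl4AI server requires authentication
- SERVER_NAME (optional): name this MCP server reports to clients
- SERVER_VERSION (optional): version string this MCP server reports to clients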
Client-Specific Instructions
Claude Desktop
Add to ~/Library/Application Support/Claude/claude_desktop_config.json
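A minimal entry, assuming the npx invocation shown in Quick Start and a Crawl4AI server on localhost:

```json
{
  "mcpServers": {
    "crawl4ai": {
      "command": "npx",
      "args": ["mcp-crawl4ai-ts"],
      "env": {
        "CRAWL4AI_BASE_URL": "http://localhost:11235"
      }
    }
  }
}
```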
Claude Code
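With the Claude Code CLI the server can be registered from the command line; a sketch (check claude mcp add --help for the exact flag syntax of your CLI version):

```bash
claude mcp add crawl4ai \
  -e CRAWL4AI_BASE_URL=http://localhost:11235 \
  -- npx mcp-crawl4ai-ts
```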
Other MCP Clients
Consult your client's documentation for MCP server configuration. The key details:
- Command: npx mcp-crawl4ai-ts or node /path/to/dist/index.js
- Required env: CRAWL4AI_BASE_URL
- Optional env: CRAWL4AI_API_KEY, SERVER_NAME, SERVER_VERSION
Available Tools
1. get_markdown - Extract content as markdown with filtering
Extracts content as markdown with various filtering options. Use 'bm25' or 'llm' filters with a query for specific content extraction.
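For example, a tool call targeting specific content might pass arguments like these (the filter and query parameter names are illustrative assumptions; check the schema the server exposes):

```json
{
  "url": "https://example.com/docs",
  "filter": "bm25",
  "query": "installation requirements"
}
```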
2. capture_screenshot - Capture webpage screenshot
Returns base64-encoded PNG. Note: This is stateless - for screenshots after JS execution, use crawl with screenshot: true.
3. generate_pdf - Convert webpage to PDF
Returns base64-encoded PDF. Stateless tool - for PDFs after JS execution, use crawl with pdf: true.
4. execute_js - Execute JavaScript and get return values
Executes JavaScript and returns the results. Each script can use 'return' to get values back. Stateless - for persistent JS execution, use crawl with js_code.
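An illustrative call, assuming a scripts parameter that accepts one or more snippets (the parameter name is an assumption):

```json
{
  "url": "https://example.com",
  "scripts": ["return document.title"]
}
```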
5. batch_crawl - Crawl multiple URLs concurrently
Efficiently crawls multiple URLs in parallel. Each URL gets a fresh browser instance.
6. smart_crawl - Auto-detect and handle different content types
Intelligently detects content type (HTML/sitemap/RSS) and processes accordingly.
7. get_html - Get sanitized HTML for analysis
Returns preprocessed HTML optimized for structure analysis. Use for building schemas or analyzing patterns.
8. extract_links - Extract and categorize page links
Extracts all links and groups them by type: internal, external, social media, documents, images.
9. crawl_recursive - Deep crawl website following links
Crawls a website following internal links up to specified depth. Returns content from all discovered pages.
10. parse_sitemap - Extract URLs from XML sitemaps
Extracts all URLs from XML sitemaps. Supports regex filtering for specific URL patterns.
11. crawl - Advanced web crawling with full configuration
12. create_session - Create persistent browser session
Creates a persistent browser session for maintaining state across multiple requests. Returns the session_id for use with the crawl tool.
Important: Only the crawl tool supports session_id. Other tools are stateless and create a new browser each time.
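A typical flow, sketched with the parameter names mentioned above (the other field values are illustrative): call create_session to obtain a session_id, then pass it to crawl together with js_code or screenshot options, for example:

```json
{
  "url": "https://example.com/login",
  "session_id": "abc123",
  "js_code": "document.querySelector('#submit')?.click()",
  "screenshot": true
}
```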
13. clear_session - Remove session from tracking
Removes session from local tracking. Note: The actual browser session on the server persists until timeout.
14. list_sessions - List tracked browser sessions
Returns all locally tracked sessions with creation time, last used time, and initial URL. Note: These are session references - actual server state may differ.
15. extract_with_llm - Extract structured data using AI
Uses AI to extract structured data from webpages. Returns results immediately without any polling or job management. This is the recommended way to extract specific information since CSS/XPath extraction is not supported via the REST API.
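An illustrative call (the query parameter name is an assumption):

```json
{
  "url": "https://example.com/products",
  "query": "List each product name with its price"
}
```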
Advanced Configuration
For detailed information about all available configuration options, extraction strategies, and advanced features, refer to the official Crawl4AI documentation.
Changelog
See CHANGELOG.md for detailed version history and recent updates.
Development
Setup
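A typical local setup, assuming standard npm scripts (check package.json for the actual script names):

```bash
npm install       # install dependencies
npm run build     # compile TypeScript to dist/
npm test          # run the unit test suite
```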
Running Integration Tests
Integration tests require a running Crawl4AI server. Configure your environment:
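For example (the variable names match the Configuration section; the integration test script name is an assumption):

```bash
export CRAWL4AI_BASE_URL=http://localhost:11235
# export CRAWL4AI_API_KEY=your-api-key   # only if your server requires it
npm run test:integration
```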
Integration tests cover:
- Dynamic content and JavaScript execution
- Session management and cookies
- Content extraction (LLM-based only)
- Media handling (screenshots, PDFs)
- Performance and caching
- Content filtering
- Bot detection avoidance
- Error handling
License
MIT