## Server Configuration

Environment variables used to configure the server.

Name | Required | Description | Default |
---|---|---|---|
SERVER_NAME | No | Custom name for the MCP server | crawl4ai-mcp |
SERVER_VERSION | No | Custom version for the MCP server | 1.0.0 |
CRAWL4AI_API_KEY | No | API key for authentication if your server requires it | |
CRAWL4AI_BASE_URL | Yes | URL of the Crawl4AI server | http://localhost:11235 |
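
For illustration, here is a minimal Python sketch (not taken from the server's source) of how these variables could be resolved at startup. The defaults mirror the table above; the server's actual configuration handling may differ.

```python
import os

# Resolve the documented environment variables, falling back to the
# defaults listed in the table above (sketch only; the real server's
# variable handling may differ).
config = {
    "server_name": os.environ.get("SERVER_NAME", "crawl4ai-mcp"),
    "server_version": os.environ.get("SERVER_VERSION", "1.0.0"),
    "api_key": os.environ.get("CRAWL4AI_API_KEY", ""),  # optional auth key
    "base_url": os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235"),
}
```
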
## Schema
### Prompts

Interactive templates invoked by user choice.

Name | Description |
---|---|
No prompts |
### Resources

Contextual data attached and managed by the client.

Name | Description |
---|---|
No resources |
### Tools

Functions exposed to the LLM to take actions.

Name | Description |
---|---|
get_markdown | [STATELESS] Extract content as markdown with filtering options. Supports: raw (full content), fit (optimized, default), bm25 (keyword search), llm (AI-powered extraction). Use bm25/llm with query for specific content. Creates new browser each time. For persistence use create_session + crawl. |
capture_screenshot | [STATELESS] Capture webpage screenshot. Returns base64-encoded PNG data. Creates new browser each time. Optionally saves screenshot to local directory. IMPORTANT: Chained calls (execute_js then capture_screenshot) will NOT work - the screenshot won't see JS changes! For JS changes + screenshot use create_session + crawl(session_id, js_code, screenshot:true) in ONE call. |
generate_pdf | [STATELESS] Convert webpage to PDF. Returns base64-encoded PDF data. Creates new browser each time. Cannot capture form fills or JS changes. For persistent PDFs use create_session + crawl(session_id, pdf:true). |
execute_js | [STATELESS] Execute JavaScript and get return values plus page content. Creates new browser each time. Use for: extracting data, triggering dynamic content, checking page state. Scripts with "return" statements return actual values (strings, numbers, objects, arrays); note that null comes back as {"success": true}. Return values are captured, but page state is lost afterwards. For persistent JS execution, use crawl with session_id. |
batch_crawl | [STATELESS] Crawl multiple URLs concurrently for efficiency. Use when: processing URL lists, comparing multiple pages, or bulk data extraction. Faster than sequential crawling. Max 5 concurrent by default. Each URL gets a fresh browser. Cannot maintain state between URLs. For persistent operations use create_session + crawl. |
smart_crawl | [STATELESS] Auto-detect and handle different content types (HTML, sitemap, RSS, text). Use when: URL type is unknown, crawling feeds/sitemaps, or want automatic format handling. Adapts strategy based on content. Creates new browser each time. For persistent operations use create_session + crawl. |
get_html | [STATELESS] Get sanitized/processed HTML for inspection and automation planning. Use when: finding form fields/selectors, analyzing page structure before automation, building schemas. Returns cleaned HTML showing element names, IDs, and classes - perfect for identifying selectors for subsequent crawl operations. Commonly used before crawl to find selectors for automation. Creates new browser each time. |
extract_links | [STATELESS] Extract and categorize all page links. Use when: building sitemaps, analyzing site structure, finding broken links, or discovering resources. Groups by internal/external/social/documents. Creates new browser each time. For persistent operations use create_session + crawl. |
crawl_recursive | [STATELESS] Deep crawl a website following internal links. Use when: mapping entire sites, finding all pages, building comprehensive indexes. Control with max_depth (default 3) and max_pages (default 50). Note: May need JS execution for dynamic sites. Each page gets a fresh browser. For persistent operations use create_session + crawl. |
parse_sitemap | [STATELESS] Extract URLs from XML sitemaps. Use when: discovering all site pages, planning crawl strategies, or checking sitemap validity. Supports regex filtering. Try sitemap.xml or robots.txt first. Creates new browser each time. |
crawl | [SUPPORTS SESSIONS] The only tool with browser persistence. Recommended patterns: an inspect-first workflow (get_html, then crawl) and multi-step operations with state; a client-side sketch follows this table. WITH session_id: maintains browser state (cookies, localStorage, page) across calls. WITHOUT session_id: creates a fresh browser each time, like the other tools. WHEN TO USE SESSIONS vs STATELESS: • Need state between calls? → create_session + crawl • Just extracting data? → use stateless tools • Filling forms? → inspect first, then use sessions • Taking a screenshot after JS? → must use crawl with a session • Unsure if elements exist? → always use get_html first. CRITICAL for js_code: always set screenshot: true when running js_code; this avoids server serialization errors and gives visual confirmation. |
manage_session | [SESSION MANAGEMENT] Unified tool for managing browser sessions. Supports three actions: CREATE - start a persistent browser session that maintains state across calls; CLEAR - remove a session from local tracking; LIST - show all active sessions with age and usage info. Browser sessions maintain ALL state (cookies, localStorage, page) across multiple crawl calls - essential for forms, login flows, multi-step processes, and maintaining state across operations. A usage sketch follows this table. |
extract_with_llm | [STATELESS] Ask questions about webpage content using AI. Returns natural language answers. Crawls fresh each time. For dynamic content or sessions, use crawl with session_id first. |
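
For orientation, below is a hedged client-side sketch of the two usage patterns the tool descriptions refer to: stateless calls versus the create-session-then-crawl workflow. It uses the official MCP Python SDK (`mcp` package); the launch command and the tool argument names (`url`, `filter`, `session_id`, `js_code`, `screenshot`, `action`) are assumptions based on the descriptions above, so verify them against the schema returned by `list_tools()`.

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Hypothetical launch command; adjust to however you actually start the server.
SERVER = StdioServerParameters(
    command="python",
    args=["-m", "crawl4ai_mcp"],
    env={"CRAWL4AI_BASE_URL": "http://localhost:11235"},
)


async def main() -> None:
    async with stdio_client(SERVER) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Confirm the real tool names and argument schemas first.
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])

            # Stateless pattern: every call gets a fresh browser.
            md = await session.call_tool(
                "get_markdown",
                arguments={"url": "https://example.com", "filter": "fit"},
            )
            print(md.content)

            # Inspect-first workflow: look at the HTML to find selectors
            # before attempting any automation.
            await session.call_tool(
                "get_html", arguments={"url": "https://example.com"}
            )

            # Multi-step with state: create a session, then reuse its
            # session_id so cookies, localStorage, and the page persist.
            await session.call_tool(
                "manage_session",
                arguments={"action": "create", "session_id": "demo"},
            )
            await session.call_tool(
                "crawl",
                arguments={
                    "url": "https://example.com/login",
                    "session_id": "demo",
                    "js_code": "document.querySelector('form')?.submit()",
                    "screenshot": True,  # recommended when running js_code
                },
            )


asyncio.run(main())
```
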