crawl
Extract web content with JavaScript execution and browser persistence for multi-step workflows like form filling and dynamic interaction.
Instructions
[SUPPORTS SESSIONS] THE ONLY TOOL WITH BROWSER PERSISTENCE
RECOMMENDED PATTERNS: • Inspect-first workflow:
get_html(url) → find selectors & verify elements exist
create_session() → "session-123"
crawl({url, session_id: "session-123", js_code: ["action 1"]})
crawl({url: "/page2", session_id: "session-123", js_code: ["action 2"]})
• Multi-step with state:
create_session() → "session-123"
crawl({url, session_id: "session-123"}) → inspect current state
crawl({url, session_id: "session-123", js_code: ["verified actions"]})
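A minimal sketch of the multi-step pattern above, written out as crawl argument objects (the URL, session ID, and selector are illustrative, not real endpoints):

```javascript
// Step 1: create_session() returns an ID, e.g. "session-123" (illustrative).
// Step 2: first crawl call navigates and inspects state — no js_only here.
const firstCall = {
  url: "https://example.com/login",  // hypothetical page
  session_id: "session-123",
  screenshot: true                   // recommended whenever js_code runs
};

// Step 3: a later call reuses the same browser; js_only skips re-navigation.
const secondCall = {
  url: "https://example.com/login",
  session_id: "session-123",         // SAME ID keeps cookies/localStorage
  js_only: true,
  js_code: [
    "const btn = document.querySelector('button[type=\"submit\"]');",
    "if (btn) btn.click();"          // verified action: guard before clicking
  ],
  screenshot: true
};
```

The key invariant is that both calls share the same `session_id`; the first call never sets `js_only`.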
WITH session_id: maintains browser state (cookies, localStorage, current page) across calls.
WITHOUT session_id: creates a fresh browser on each call (like the other tools).
WHEN TO USE SESSIONS vs STATELESS:
• Need state between calls? → create_session + crawl
• Just extracting data? → Use stateless tools
• Filling forms? → Inspect first, then use sessions
• Taking a screenshot after JS? → Must use crawl with a session
• Unsure if elements exist? → Always use get_html first
CRITICAL FOR js_code: always set screenshot: true when running js_code. This avoids server serialization errors and gives visual confirmation of the result.
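For example, a single call pairing js_code with screenshot: true (the URL and selector are hypothetical) might look like:

```javascript
// Hypothetical crawl arguments: run one guarded JS action and capture
// a screenshot as visual confirmation.
const crawlArgs = {
  url: "https://example.com/products",
  session_id: "session-123",
  js_code: [
    "const more = document.querySelector('button.load-more');",
    "if (more) more.click();"  // guard: the element may not exist
  ],
  screenshot: true             // avoids server serialization errors
};
```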
Input Schema
Name | Required | Description | Default |
---|---|---|---|
browser_type | No | Browser engine for crawling. Chromium offers the best compatibility; Firefox for Gecko-specific behavior; WebKit for Safari-like behavior | chromium |
cache_mode | No | Cache strategy. ENABLED: Use cache if available. BYPASS: Fetch fresh (recommended). DISABLED: No cache | BYPASS |
cookies | No | Pre-set cookies for authentication or personalization | |
delay_before_scroll | No | Milliseconds to wait before scrolling. Allows initial content to render | |
exclude_domains | No | List of domains to exclude from links (e.g., ["ads.com", "tracker.io"]) | |
exclude_external_images | No | Exclude images from external domains | |
exclude_external_links | No | Remove links pointing to different domains for cleaner content | |
exclude_social_media_links | No | Remove links to social media platforms | |
excluded_selector | No | CSS selector for elements to remove. Comma-separate multiple selectors. SELECTOR STRATEGY: Use get_html first to inspect page structure. Look for: • id attributes (e.g., #cookie-banner) • CSS classes (e.g., .advertisement, .popup) • data-* attributes (e.g., [data-type="ad"]) • Element type + attributes (e.g., div[role="banner"]) Examples: "#cookie-banner, .advertisement, .social-share" | |
excluded_tags | No | HTML tags to remove completely. Common: ["nav", "footer", "aside", "script", "style"]. Cleans up content before extraction | |
headers | No | Custom HTTP headers for API keys, auth tokens, or specific server requirements | |
ignore_body_visibility | No | Skip checking if body element is visible | |
image_description_min_word_threshold | No | Minimum words for image alt text to be considered valid | |
image_score_threshold | No | Minimum relevance score for images (filters low-quality images) | |
js_code | No | JavaScript to execute. Each string runs separately. Use return to get values. IMPORTANT: Always verify elements exist before acting on them! Use get_html first to find correct selectors, then: GOOD: ["if (document.querySelector('input[name=\"email\"]')) { ... }"] BAD: ["document.querySelector('input[name=\"email\"]').value = '...'"] USAGE PATTERNS: 1. WITH screenshot/pdf: {js_code: [...], screenshot: true} ✓ 2. MULTI-STEP: First {js_code: [...], session_id: "x"}, then {js_only: true, session_id: "x"} 3. AVOID: {js_code: [...], js_only: true} on first call ✗ SELECTOR TIPS: Use get_html first to find: • name="..." (best for forms) • id="..." (if unique) • class="..." (careful, may repeat) FORM EXAMPLE WITH VERIFICATION: [ "const emailInput = document.querySelector('input[name=\"email\"]');", "if (emailInput) emailInput.value = 'user@example.com';", "const submitBtn = document.querySelector('button[type=\"submit\"]');", "if (submitBtn) submitBtn.click();" ] | |
js_only | No | FOR SUBSEQUENT CALLS ONLY: reuses the existing session without navigating. First call: use js_code WITHOUT js_only (or with screenshot/pdf). Later calls: use js_only=true to run more JS in the same session. ERROR: using js_only=true on the first call causes server errors | |
keep_data_attributes | No | Preserve data-* attributes in cleaned HTML | |
log_console | No | Capture browser console logs for debugging | |
magic | No | EXPERIMENTAL: auto-handles popups, cookies, and overlays. Use as a LAST RESORT: it can conflict with wait_for and CSS extraction. Try first: remove_overlay_elements, excluded_selector. Avoid with: CSS extraction, precise timing needs | |
only_text | No | Extract only text content, no HTML structure | |
override_navigator | No | Override navigator properties for stealth | |
page_timeout | No | Page navigation timeout in milliseconds | |
pdf | No | Generate PDF as base64, preserving exact layout | |
process_iframes | No | Extract content from embedded iframes including videos and forms | |
proxy_password | No | Proxy authentication password | |
proxy_server | No | Proxy server URL (e.g., "http://proxy.example.com:8080") | |
proxy_username | No | Proxy authentication username | |
remove_forms | No | Remove all form elements from extracted content | |
remove_overlay_elements | No | Automatically remove popups, modals, and overlays that obscure content | |
scan_full_page | No | Auto-scroll entire page to trigger lazy loading. WARNING: Can be slow on long pages. Avoid combining with wait_until:"networkidle" or CSS extraction on dynamic sites. Better to use virtual_scroll_config for infinite feeds | |
screenshot | No | Capture full-page screenshot as base64 PNG | |
screenshot_directory | No | Directory path to save screenshot (e.g., ~/Desktop, /tmp). Do NOT include filename - it will be auto-generated. Large screenshots (>800KB) won't be returned inline when saved. | |
screenshot_wait_for | No | Extra wait time in seconds before taking screenshot | |
scroll_delay | No | Milliseconds between scroll steps for lazy-loaded content | |
session_id | No | ENABLES PERSISTENCE: Use SAME ID across all crawl calls to maintain browser state. • First call with ID: Creates persistent browser • Subsequent calls with SAME ID: Reuses browser with all state intact • Different/no ID: Fresh browser (stateless) WARNING: ONLY works with crawl tool - other tools ignore this parameter | |
simulate_user | No | Mimic human behavior with random mouse movements and delays. Helps bypass bot detection on protected sites. Slows crawling but improves success rate | |
timeout | No | Overall request timeout in milliseconds | |
url | Yes | The URL to crawl | |
user_agent | No | Custom browser identity. Use for: mobile sites (include "Mobile"), avoiding bot detection, or specific browser requirements. Example: "Mozilla/5.0 (iPhone...)" | |
verbose | No | Enable server-side debug logging (not shown in output). Only for troubleshooting. Does not affect extraction results | |
viewport_height | No | Browser window height in pixels. Impacts content loading and screenshot dimensions | |
viewport_width | No | Browser window width in pixels. Affects responsive layouts and content visibility | |
virtual_scroll_config | No | For infinite scroll sites that REPLACE content (Twitter/Instagram feeds). USE when: Content disappears as you scroll (virtual scrolling) DON'T USE when: Content appends (use scan_full_page instead) Example: {container_selector: "#timeline", scroll_count: 10, wait_after_scroll: 1} | |
wait_for | No | Wait for element that loads AFTER initial page load. Format: "css:.selector" or "js:() => condition" WHEN TO USE: • Dynamic content that loads after page (AJAX, lazy load) • Elements that appear after animations/transitions • Content loaded by JavaScript frameworks WHEN NOT TO USE: • Elements already in initial HTML (forms, static content) • Standard page elements (just use wait_until: "load") • Can cause timeouts/errors if element already exists! SELECTOR TIPS: Use get_html first to check if element exists Examples: "css:.ajax-content", "js:() => document.querySelector('.lazy-loaded')" | |
wait_for_images | No | Wait for all images to load before extraction | |
wait_for_timeout | No | Maximum milliseconds to wait for condition | |
wait_until | No | When to consider page loaded (use INSTEAD of wait_for for initial load): • "domcontentloaded" (default): Fast, DOM ready, use for forms/static content • "load": All resources loaded, use if you need images • "networkidle": Wait for network quiet, use for heavy JS apps WARNING: Don't use wait_for for elements in initial HTML! | domcontentloaded |
word_count_threshold | No | Min words per text block. Filters out menus, footers, and short snippets. Lower = more content but more noise. Higher = only substantial paragraphs |
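Putting several of the parameters above together, a hedged sketch of a content-extraction call (all values illustrative) could be:

```javascript
// Hypothetical crawl arguments combining cleanup, waiting, and filtering.
const extractArgs = {
  url: "https://example.com/article",
  cache_mode: "BYPASS",                           // fetch fresh content
  excluded_selector: "#cookie-banner, .advertisement",
  excluded_tags: ["nav", "footer", "aside", "script", "style"],
  wait_for: "css:.ajax-content",                  // only for content loaded after page load
  wait_for_timeout: 10000,                        // give up after 10 s
  word_count_threshold: 10                        // drop short menu/footer snippets
};
```

Because this call only extracts data and runs no js_code, no session_id is needed; a fresh browser per call is fine here.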