Skip to main content
Glama

MCP Server for Crawl4AI

by omgwtfwow

crawl

Extract web content with JavaScript execution and browser persistence for multi-step workflows like form filling and dynamic interaction.

Instructions

[SUPPORTS SESSIONS] THE ONLY TOOL WITH BROWSER PERSISTENCE

RECOMMENDED PATTERNS: • Inspect first workflow:

  1. get_html(url) → find selectors & verify elements exist

  2. create_session() → "session-123"

  3. crawl({url, session_id: "session-123", js_code: ["action 1"]})

  4. crawl({url: "/page2", session_id: "session-123", js_code: ["action 2"]})

• Multi-step with state:

  1. create_session() → "session-123"

  2. crawl({url, session_id: "session-123"}) → inspect current state

  3. crawl({url, session_id: "session-123", js_code: ["verified actions"]})

WITH session_id: Maintains browser state (cookies, localStorage, page) across calls WITHOUT session_id: Creates fresh browser each time (like other tools)

WHEN TO USE SESSIONS vs STATELESS: • Need state between calls? → create_session + crawl • Just extracting data? → Use stateless tools • Filling forms? → Inspect first, then use sessions • Taking screenshot after JS? → Must use crawl with session • Unsure if elements exist? → Always use get_html first

CRITICAL FOR js_code: RECOMMENDED: Always use screenshot: true when running js_code This avoids server serialization errors and gives visual confirmation

Input Schema

NameRequiredDescriptionDefault
browser_typeNoBrowser engine for crawling. Chromium offers best compatibility, Firefox for specific use cases, WebKit for Safari-like behaviorchromium
cache_modeNoCache strategy. ENABLED: Use cache if available. BYPASS: Fetch fresh (recommended). DISABLED: No cacheBYPASS
cookiesNoPre-set cookies for authentication or personalization
delay_before_scrollNoMilliseconds to wait before scrolling. Allows initial content to render
exclude_domainsNoList of domains to exclude from links (e.g., ["ads.com", "tracker.io"])
exclude_external_imagesNoExclude images from external domains
exclude_external_linksNoRemove links pointing to different domains for cleaner content
exclude_social_media_linksNoRemove links to social media platforms
excluded_selectorNoCSS selector for elements to remove. Comma-separate multiple selectors. SELECTOR STRATEGY: Use get_html first to inspect page structure. Look for: • id attributes (e.g., #cookie-banner) • CSS classes (e.g., .advertisement, .popup) • data-* attributes (e.g., [data-type="ad"]) • Element type + attributes (e.g., div[role="banner"]) Examples: "#cookie-banner, .advertisement, .social-share"
excluded_tagsNoHTML tags to remove completely. Common: ["nav", "footer", "aside", "script", "style"]. Cleans up content before extraction
headersNoCustom HTTP headers for API keys, auth tokens, or specific server requirements
ignore_body_visibilityNoSkip checking if body element is visible
image_description_min_word_thresholdNoMinimum words for image alt text to be considered valid
image_score_thresholdNoMinimum relevance score for images (filters low-quality images)
js_codeNoJavaScript to execute. Each string runs separately. Use return to get values. IMPORTANT: Always verify elements exist before acting on them! Use get_html first to find correct selectors, then: GOOD: ["if (document.querySelector('input[name=\"email\"]')) { ... }"] BAD: ["document.querySelector('input[name=\"email\"]').value = '...'"] USAGE PATTERNS: 1. WITH screenshot/pdf: {js_code: [...], screenshot: true} ✓ 2. MULTI-STEP: First {js_code: [...], session_id: "x"}, then {js_only: true, session_id: "x"} 3. AVOID: {js_code: [...], js_only: true} on first call ✗ SELECTOR TIPS: Use get_html first to find: • name="..." (best for forms) • id="..." (if unique) • class="..." (careful, may repeat) FORM EXAMPLE WITH VERIFICATION: [ "const emailInput = document.querySelector('input[name=\"email\"]');", "if (emailInput) emailInput.value = 'user@example.com';", "const submitBtn = document.querySelector('button[type=\"submit\"]');", "if (submitBtn) submitBtn.click();" ]
js_onlyNoFOR SUBSEQUENT CALLS ONLY: Reuse existing session without navigation First call: Use js_code WITHOUT js_only (or with screenshot/pdf) Later calls: Use js_only=true to run more JS in same session ERROR: Using js_only=true on first call causes server errors
keep_data_attributesNoPreserve data-* attributes in cleaned HTML
log_consoleNoCapture browser console logs for debugging
magicNoEXPERIMENTAL: Auto-handles popups, cookies, overlays. Use as LAST RESORT - can conflict with wait_for & CSS extraction Try first: remove_overlay_elements, excluded_selector Avoid with: CSS extraction, precise timing needs
only_textNoExtract only text content, no HTML structure
override_navigatorNoOverride navigator properties for stealth
page_timeoutNoPage navigation timeout in milliseconds
pdfNoGenerate PDF as base64 preserving exact layout
process_iframesNoExtract content from embedded iframes including videos and forms
proxy_passwordNoProxy authentication password
proxy_serverNoProxy server URL (e.g., "http://proxy.example.com:8080")
proxy_usernameNoProxy authentication username
remove_formsNoRemove all form elements from extracted content
remove_overlay_elementsNoAutomatically remove popups, modals, and overlays that obscure content
scan_full_pageNoAuto-scroll entire page to trigger lazy loading. WARNING: Can be slow on long pages. Avoid combining with wait_until:"networkidle" or CSS extraction on dynamic sites. Better to use virtual_scroll_config for infinite feeds
screenshotNoCapture full-page screenshot as base64 PNG
screenshot_directoryNoDirectory path to save screenshot (e.g., ~/Desktop, /tmp). Do NOT include filename - it will be auto-generated. Large screenshots (>800KB) won't be returned inline when saved.
screenshot_wait_forNoExtra wait time in seconds before taking screenshot
scroll_delayNoMilliseconds between scroll steps for lazy-loaded content
session_idNoENABLES PERSISTENCE: Use SAME ID across all crawl calls to maintain browser state. • First call with ID: Creates persistent browser • Subsequent calls with SAME ID: Reuses browser with all state intact • Different/no ID: Fresh browser (stateless) WARNING: ONLY works with crawl tool - other tools ignore this parameter
simulate_userNoMimic human behavior with random mouse movements and delays. Helps bypass bot detection on protected sites. Slows crawling but improves success rate
timeoutNoOverall request timeout in milliseconds
urlYesThe URL to crawl
user_agentNoCustom browser identity. Use for: mobile sites (include "Mobile"), avoiding bot detection, or specific browser requirements. Example: "Mozilla/5.0 (iPhone...)"
verboseNoEnable server-side debug logging (not shown in output). Only for troubleshooting. Does not affect extraction results
viewport_heightNoBrowser window height in pixels. Impacts content loading and screenshot dimensions
viewport_widthNoBrowser window width in pixels. Affects responsive layouts and content visibility
virtual_scroll_configNoFor infinite scroll sites that REPLACE content (Twitter/Instagram feeds). USE when: Content disappears as you scroll (virtual scrolling) DON'T USE when: Content appends (use scan_full_page instead) Example: {container_selector: "#timeline", scroll_count: 10, wait_after_scroll: 1}
wait_forNoWait for element that loads AFTER initial page load. Format: "css:.selector" or "js:() => condition" WHEN TO USE: • Dynamic content that loads after page (AJAX, lazy load) • Elements that appear after animations/transitions • Content loaded by JavaScript frameworks WHEN NOT TO USE: • Elements already in initial HTML (forms, static content) • Standard page elements (just use wait_until: "load") • Can cause timeouts/errors if element already exists! SELECTOR TIPS: Use get_html first to check if element exists Examples: "css:.ajax-content", "js:() => document.querySelector('.lazy-loaded')"
wait_for_imagesNoWait for all images to load before extraction
wait_for_timeoutNoMaximum milliseconds to wait for condition
wait_untilNoWhen to consider page loaded (use INSTEAD of wait_for for initial load): • "domcontentloaded" (default): Fast, DOM ready, use for forms/static content • "load": All resources loaded, use if you need images • "networkidle": Wait for network quiet, use for heavy JS apps WARNING: Don't use wait_for for elements in initial HTML!domcontentloaded
word_count_thresholdNoMin words per text block. Filters out menus, footers, and short snippets. Lower = more content but more noise. Higher = only substantial paragraphs

Input Schema (JSON Schema)

{ "properties": { "browser_type": { "default": "chromium", "description": "Browser engine for crawling. Chromium offers best compatibility, Firefox for specific use cases, WebKit for Safari-like behavior", "enum": [ "chromium", "firefox", "webkit" ], "type": "string" }, "cache_mode": { "default": "BYPASS", "description": "Cache strategy. ENABLED: Use cache if available. BYPASS: Fetch fresh (recommended). DISABLED: No cache", "enum": [ "ENABLED", "BYPASS", "DISABLED" ], "type": "string" }, "cookies": { "description": "Pre-set cookies for authentication or personalization", "items": { "properties": { "domain": { "description": "Domain where cookie is valid", "type": "string" }, "name": { "description": "Cookie name", "type": "string" }, "path": { "description": "URL path scope for cookie", "type": "string" }, "value": { "description": "Cookie value", "type": "string" } }, "required": [ "name", "value", "domain" ], "type": "object" }, "type": "array" }, "delay_before_scroll": { "default": 1000, "description": "Milliseconds to wait before scrolling. Allows initial content to render", "type": "number" }, "exclude_domains": { "description": "List of domains to exclude from links (e.g., [\"ads.com\", \"tracker.io\"])", "items": { "type": "string" }, "type": "array" }, "exclude_external_images": { "default": false, "description": "Exclude images from external domains", "type": "boolean" }, "exclude_external_links": { "default": false, "description": "Remove links pointing to different domains for cleaner content", "type": "boolean" }, "exclude_social_media_links": { "default": false, "description": "Remove links to social media platforms", "type": "boolean" }, "excluded_selector": { "description": "CSS selector for elements to remove. Comma-separate multiple selectors.\n\nSELECTOR STRATEGY: Use get_html first to inspect page structure. Look for:\n • id attributes (e.g., #cookie-banner)\n • CSS classes (e.g., .advertisement, .popup)\n • data-* attributes (e.g., [data-type=\"ad\"])\n • Element type + attributes (e.g., div[role=\"banner\"])\n\nExamples: \"#cookie-banner, .advertisement, .social-share\"", "type": "string" }, "excluded_tags": { "description": "HTML tags to remove completely. Common: [\"nav\", \"footer\", \"aside\", \"script\", \"style\"]. Cleans up content before extraction", "items": { "type": "string" }, "type": "array" }, "headers": { "description": "Custom HTTP headers for API keys, auth tokens, or specific server requirements", "type": "object" }, "ignore_body_visibility": { "default": true, "description": "Skip checking if body element is visible", "type": "boolean" }, "image_description_min_word_threshold": { "default": 50, "description": "Minimum words for image alt text to be considered valid", "type": "number" }, "image_score_threshold": { "default": 3, "description": "Minimum relevance score for images (filters low-quality images)", "type": "number" }, "js_code": { "description": "JavaScript to execute. Each string runs separately. Use return to get values.\n\nIMPORTANT: Always verify elements exist before acting on them!\nUse get_html first to find correct selectors, then:\nGOOD: [\"if (document.querySelector('input[name=\\\"email\\\"]')) { ... }\"]\nBAD: [\"document.querySelector('input[name=\\\"email\\\"]').value = '...'\"]\n\nUSAGE PATTERNS:\n1. WITH screenshot/pdf: {js_code: [...], screenshot: true} ✓\n2. MULTI-STEP: First {js_code: [...], session_id: \"x\"}, then {js_only: true, session_id: \"x\"}\n3. AVOID: {js_code: [...], js_only: true} on first call ✗\n\nSELECTOR TIPS: Use get_html first to find:\n • name=\"...\" (best for forms)\n • id=\"...\" (if unique)\n • class=\"...\" (careful, may repeat)\n\nFORM EXAMPLE WITH VERIFICATION: [\n \"const emailInput = document.querySelector('input[name=\\\"email\\\"]');\",\n \"if (emailInput) emailInput.value = 'user@example.com';\",\n \"const submitBtn = document.querySelector('button[type=\\\"submit\\\"]');\",\n \"if (submitBtn) submitBtn.click();\"\n]", "items": { "type": "string" }, "type": [ "string", "array" ] }, "js_only": { "default": false, "description": "FOR SUBSEQUENT CALLS ONLY: Reuse existing session without navigation\nFirst call: Use js_code WITHOUT js_only (or with screenshot/pdf)\nLater calls: Use js_only=true to run more JS in same session\nERROR: Using js_only=true on first call causes server errors", "type": "boolean" }, "keep_data_attributes": { "default": false, "description": "Preserve data-* attributes in cleaned HTML", "type": "boolean" }, "log_console": { "default": false, "description": "Capture browser console logs for debugging", "type": "boolean" }, "magic": { "default": false, "description": "EXPERIMENTAL: Auto-handles popups, cookies, overlays.\nUse as LAST RESORT - can conflict with wait_for & CSS extraction\nTry first: remove_overlay_elements, excluded_selector\nAvoid with: CSS extraction, precise timing needs", "type": "boolean" }, "only_text": { "default": false, "description": "Extract only text content, no HTML structure", "type": "boolean" }, "override_navigator": { "default": false, "description": "Override navigator properties for stealth", "type": "boolean" }, "page_timeout": { "default": 60000, "description": "Page navigation timeout in milliseconds", "type": "number" }, "pdf": { "default": false, "description": "Generate PDF as base64 preserving exact layout", "type": "boolean" }, "process_iframes": { "default": false, "description": "Extract content from embedded iframes including videos and forms", "type": "boolean" }, "proxy_password": { "description": "Proxy authentication password", "type": "string" }, "proxy_server": { "description": "Proxy server URL (e.g., \"http://proxy.example.com:8080\")", "type": "string" }, "proxy_username": { "description": "Proxy authentication username", "type": "string" }, "remove_forms": { "default": false, "description": "Remove all form elements from extracted content", "type": "boolean" }, "remove_overlay_elements": { "default": false, "description": "Automatically remove popups, modals, and overlays that obscure content", "type": "boolean" }, "scan_full_page": { "default": false, "description": "Auto-scroll entire page to trigger lazy loading. WARNING: Can be slow on long pages. Avoid combining with wait_until:\"networkidle\" or CSS extraction on dynamic sites. Better to use virtual_scroll_config for infinite feeds", "type": "boolean" }, "screenshot": { "default": false, "description": "Capture full-page screenshot as base64 PNG", "type": "boolean" }, "screenshot_directory": { "description": "Directory path to save screenshot (e.g., ~/Desktop, /tmp). Do NOT include filename - it will be auto-generated. Large screenshots (>800KB) won't be returned inline when saved.", "type": "string" }, "screenshot_wait_for": { "description": "Extra wait time in seconds before taking screenshot", "type": "number" }, "scroll_delay": { "default": 500, "description": "Milliseconds between scroll steps for lazy-loaded content", "type": "number" }, "session_id": { "description": "ENABLES PERSISTENCE: Use SAME ID across all crawl calls to maintain browser state.\n• First call with ID: Creates persistent browser\n• Subsequent calls with SAME ID: Reuses browser with all state intact\n• Different/no ID: Fresh browser (stateless)\nWARNING: ONLY works with crawl tool - other tools ignore this parameter", "type": "string" }, "simulate_user": { "default": false, "description": "Mimic human behavior with random mouse movements and delays. Helps bypass bot detection on protected sites. Slows crawling but improves success rate", "type": "boolean" }, "timeout": { "default": 60000, "description": "Overall request timeout in milliseconds", "type": "number" }, "url": { "description": "The URL to crawl", "type": "string" }, "user_agent": { "description": "Custom browser identity. Use for: mobile sites (include \"Mobile\"), avoiding bot detection, or specific browser requirements. Example: \"Mozilla/5.0 (iPhone...)\"", "type": "string" }, "verbose": { "default": false, "description": "Enable server-side debug logging (not shown in output). Only for troubleshooting. Does not affect extraction results", "type": "boolean" }, "viewport_height": { "default": 600, "description": "Browser window height in pixels. Impacts content loading and screenshot dimensions", "type": "number" }, "viewport_width": { "default": 1080, "description": "Browser window width in pixels. Affects responsive layouts and content visibility", "type": "number" }, "virtual_scroll_config": { "description": "For infinite scroll sites that REPLACE content (Twitter/Instagram feeds).\nUSE when: Content disappears as you scroll (virtual scrolling)\nDON'T USE when: Content appends (use scan_full_page instead)\nExample: {container_selector: \"#timeline\", scroll_count: 10, wait_after_scroll: 1}", "properties": { "container_selector": { "description": "CSS selector for the scrollable container.\n\nSELECTOR STRATEGY: Use get_html first to inspect page structure. Look for:\n • id attributes (e.g., #timeline)\n • role attributes (e.g., [role=\"feed\"])\n • CSS classes (e.g., .feed, .timeline)\n • data-* attributes (e.g., [data-testid=\"primaryColumn\"])\n\nCommon: \"#timeline\" (Twitter), \"[role='feed']\" (generic), \".feed\" (Instagram)", "type": "string" }, "scroll_by": { "default": "container_height", "description": "Distance per scroll. \"container_height\": one viewport, \"page_height\": full page, or pixels like 500", "type": [ "string", "number" ] }, "scroll_count": { "default": 10, "description": "How many times to scroll. Each scroll loads new content batch. More = more posts but slower", "type": "number" }, "wait_after_scroll": { "default": 0.5, "description": "Seconds to wait after each scroll", "type": "number" } }, "required": [ "container_selector" ], "type": "object" }, "wait_for": { "description": "Wait for element that loads AFTER initial page load. Format: \"css:.selector\" or \"js:() => condition\"\n\nWHEN TO USE:\n • Dynamic content that loads after page (AJAX, lazy load)\n • Elements that appear after animations/transitions\n • Content loaded by JavaScript frameworks\n\nWHEN NOT TO USE:\n • Elements already in initial HTML (forms, static content)\n • Standard page elements (just use wait_until: \"load\")\n • Can cause timeouts/errors if element already exists!\n\nSELECTOR TIPS: Use get_html first to check if element exists\nExamples: \"css:.ajax-content\", \"js:() => document.querySelector('.lazy-loaded')\"", "type": "string" }, "wait_for_images": { "default": false, "description": "Wait for all images to load before extraction", "type": "boolean" }, "wait_for_timeout": { "default": 30000, "description": "Maximum milliseconds to wait for condition", "type": "number" }, "wait_until": { "default": "domcontentloaded", "description": "When to consider page loaded (use INSTEAD of wait_for for initial load):\n• \"domcontentloaded\" (default): Fast, DOM ready, use for forms/static content\n• \"load\": All resources loaded, use if you need images\n• \"networkidle\": Wait for network quiet, use for heavy JS apps\nWARNING: Don't use wait_for for elements in initial HTML!", "enum": [ "domcontentloaded", "networkidle", "load" ], "type": "string" }, "word_count_threshold": { "default": 200, "description": "Min words per text block. Filters out menus, footers, and short snippets. Lower = more content but more noise. Higher = only substantial paragraphs", "type": "number" } }, "required": [ "url" ], "type": "object" }

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/omgwtfwow/mcp-crawl4ai-ts'

If you have feedback or need assistance with the MCP directory API, please join our Discord server