crawl
Extract web content with JavaScript execution and browser persistence for multi-step workflows like form filling and dynamic interaction.
Instructions
[SUPPORTS SESSIONS] THE ONLY TOOL WITH BROWSER PERSISTENCE
RECOMMENDED PATTERNS: • Inspect-first workflow:
get_html(url) → find selectors & verify elements exist
create_session() → "session-123"
crawl({url, session_id: "session-123", js_code: ["action 1"]})
crawl({url: "/page2", session_id: "session-123", js_code: ["action 2"]})
• Multi-step with state:
create_session() → "session-123"
crawl({url, session_id: "session-123"}) → inspect current state
crawl({url, session_id: "session-123", js_code: ["verified actions"]})
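A minimal sketch of the multi-step pattern above, written out as crawl argument objects (the URL, session ID, and selector are illustrative, not real endpoints):

```javascript
// Step 1: create_session() returns an ID, e.g. "session-123" (illustrative).
// Step 2: first crawl call navigates and inspects state — no js_only here.
const firstCall = {
  url: "https://example.com/login",  // hypothetical page
  session_id: "session-123",
  screenshot: true                   // recommended whenever js_code runs
};

// Step 3: a later call reuses the same browser; js_only skips re-navigation.
const secondCall = {
  url: "https://example.com/login",
  session_id: "session-123",         // SAME ID keeps cookies/localStorage
  js_only: true,
  js_code: [
    "const btn = document.querySelector('button[type=\"submit\"]');",
    "if (btn) btn.click();"          // verified action: guard before clicking
  ],
  screenshot: true
};
```

The key invariant is that both calls share the same `session_id`; the first call never sets `js_only`.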
WITH session_id: maintains browser state (cookies, localStorage, current page) across calls.
WITHOUT session_id: creates a fresh browser on each call (like the other tools).
WHEN TO USE SESSIONS vs STATELESS:
• Need state between calls? → create_session + crawl
• Just extracting data? → Use stateless tools
• Filling forms? → Inspect first, then use sessions
• Taking a screenshot after JS? → Must use crawl with a session
• Unsure if elements exist? → Always use get_html first
CRITICAL FOR js_code: always set screenshot: true when running js_code. This avoids server serialization errors and gives visual confirmation of the result.
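For example, a single call pairing js_code with screenshot: true (the URL and selector are hypothetical) might look like:

```javascript
// Hypothetical crawl arguments: run one guarded JS action and capture
// a screenshot as visual confirmation.
const crawlArgs = {
  url: "https://example.com/products",
  session_id: "session-123",
  js_code: [
    "const more = document.querySelector('button.load-more');",
    "if (more) more.click();"  // guard: the element may not exist
  ],
  screenshot: true             // avoids server serialization errors
};
```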
Input Schema
Name | Required | Description | Default |
---|---|---|---|
browser_type | No | Browser engine for crawling. Chromium offers the best compatibility; Firefox for Gecko-specific behavior; WebKit for Safari-like behavior | chromium |
cache_mode | No | Cache strategy. ENABLED: Use cache if available. BYPASS: Fetch fresh (recommended). DISABLED: No cache | BYPASS |
cookies | No | Pre-set cookies for authentication or personalization | |
delay_before_scroll | No | Milliseconds to wait before scrolling. Allows initial content to render | |
exclude_domains | No | List of domains to exclude from links (e.g., ["ads.com", "tracker.io"]) | |
exclude_external_images | No | Exclude images from external domains | |
exclude_external_links | No | Remove links pointing to different domains for cleaner content | |
exclude_social_media_links | No | Remove links to social media platforms | |
excluded_selector | No | CSS selector for elements to remove. Comma-separate multiple selectors. SELECTOR STRATEGY: Use get_html first to inspect page structure. Look for: • id attributes (e.g., #cookie-banner) • CSS classes (e.g., .advertisement, .popup) • data-* attributes (e.g., [data-type="ad"]) • Element type + attributes (e.g., div[role="banner"]) Examples: "#cookie-banner, .advertisement, .social-share" | |
excluded_tags | No | HTML tags to remove completely. Common: ["nav", "footer", "aside", "script", "style"]. Cleans up content before extraction | |
headers | No | Custom HTTP headers for API keys, auth tokens, or specific server requirements | |
ignore_body_visibility | No | Skip checking if body element is visible | |
image_description_min_word_threshold | No | Minimum words for image alt text to be considered valid | |
image_score_threshold | No | Minimum relevance score for images (filters low-quality images) | |
js_code | No | JavaScript to execute. Each string runs separately. Use return to get values. IMPORTANT: Always verify elements exist before acting on them! Use get_html first to find correct selectors, then: GOOD: ["if (document.querySelector('input[name=\"email\"]')) { ... }"] BAD: ["document.querySelector('input[name=\"email\"]').value = '...'"] USAGE PATTERNS: 1. WITH screenshot/pdf: {js_code: [...], screenshot: true} ✓ 2. MULTI-STEP: First {js_code: [...], session_id: "x"}, then {js_only: true, session_id: "x"} 3. AVOID: {js_code: [...], js_only: true} on first call ✗ SELECTOR TIPS: Use get_html first to find: • name="..." (best for forms) • id="..." (if unique) • class="..." (careful, may repeat) FORM EXAMPLE WITH VERIFICATION: [ "const emailInput = document.querySelector('input[name=\"email\"]');", "if (emailInput) emailInput.value = 'user@example.com';", "const submitBtn = document.querySelector('button[type=\"submit\"]');", "if (submitBtn) submitBtn.click();" ] | |
js_only | No | FOR SUBSEQUENT CALLS ONLY: reuses the existing session without navigating. First call: use js_code WITHOUT js_only (or with screenshot/pdf). Later calls: use js_only=true to run more JS in the same session. ERROR: using js_only=true on the first call causes server errors | |
keep_data_attributes | No | Preserve data-* attributes in cleaned HTML | |
log_console | No | Capture browser console logs for debugging | |
magic | No | EXPERIMENTAL: auto-handles popups, cookies, and overlays. Use as a LAST RESORT: it can conflict with wait_for and CSS extraction. Try first: remove_overlay_elements, excluded_selector. Avoid with: CSS extraction, precise timing needs | |
only_text | No | Extract only text content, no HTML structure | |
override_navigator | No | Override navigator properties for stealth | |
page_timeout | No | Page navigation timeout in milliseconds | |
pdf | No | Generate PDF as base64, preserving exact layout | |
process_iframes | No | Extract content from embedded iframes including videos and forms | |
proxy_password | No | Proxy authentication password | |
proxy_server | No | Proxy server URL (e.g., "http://proxy.example.com:8080") | |
proxy_username | No | Proxy authentication username | |
remove_forms | No | Remove all form elements from extracted content | |
remove_overlay_elements | No | Automatically remove popups, modals, and overlays that obscure content | |
scan_full_page | No | Auto-scroll entire page to trigger lazy loading. WARNING: Can be slow on long pages. Avoid combining with wait_until:"networkidle" or CSS extraction on dynamic sites. Better to use virtual_scroll_config for infinite feeds | |
screenshot | No | Capture full-page screenshot as base64 PNG | |
screenshot_directory | No | Directory path to save screenshot (e.g., ~/Desktop, /tmp). Do NOT include filename - it will be auto-generated. Large screenshots (>800KB) won't be returned inline when saved. | |
screenshot_wait_for | No | Extra wait time in seconds before taking screenshot | |
scroll_delay | No | Milliseconds between scroll steps for lazy-loaded content | |
session_id | No | ENABLES PERSISTENCE: Use SAME ID across all crawl calls to maintain browser state. • First call with ID: Creates persistent browser • Subsequent calls with SAME ID: Reuses browser with all state intact • Different/no ID: Fresh browser (stateless) WARNING: ONLY works with crawl tool - other tools ignore this parameter | |
simulate_user | No | Mimic human behavior with random mouse movements and delays. Helps bypass bot detection on protected sites. Slows crawling but improves success rate | |
timeout | No | Overall request timeout in milliseconds | |
url | Yes | The URL to crawl | |
user_agent | No | Custom browser identity. Use for: mobile sites (include "Mobile"), avoiding bot detection, or specific browser requirements. Example: "Mozilla/5.0 (iPhone...)" | |
verbose | No | Enable server-side debug logging (not shown in output). Only for troubleshooting. Does not affect extraction results | |
viewport_height | No | Browser window height in pixels. Impacts content loading and screenshot dimensions | |
viewport_width | No | Browser window width in pixels. Affects responsive layouts and content visibility | |
virtual_scroll_config | No | For infinite scroll sites that REPLACE content (Twitter/Instagram feeds). USE when: Content disappears as you scroll (virtual scrolling) DON'T USE when: Content appends (use scan_full_page instead) Example: {container_selector: "#timeline", scroll_count: 10, wait_after_scroll: 1} | |
wait_for | No | Wait for element that loads AFTER initial page load. Format: "css:.selector" or "js:() => condition" WHEN TO USE: • Dynamic content that loads after page (AJAX, lazy load) • Elements that appear after animations/transitions • Content loaded by JavaScript frameworks WHEN NOT TO USE: • Elements already in initial HTML (forms, static content) • Standard page elements (just use wait_until: "load") • Can cause timeouts/errors if element already exists! SELECTOR TIPS: Use get_html first to check if element exists Examples: "css:.ajax-content", "js:() => document.querySelector('.lazy-loaded')" | |
wait_for_images | No | Wait for all images to load before extraction | |
wait_for_timeout | No | Maximum milliseconds to wait for condition | |
wait_until | No | When to consider page loaded (use INSTEAD of wait_for for initial load): • "domcontentloaded" (default): Fast, DOM ready, use for forms/static content • "load": All resources loaded, use if you need images • "networkidle": Wait for network quiet, use for heavy JS apps WARNING: Don't use wait_for for elements in initial HTML! | domcontentloaded |
word_count_threshold | No | Min words per text block. Filters out menus, footers, and short snippets. Lower = more content but more noise. Higher = only substantial paragraphs |
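Putting several of the parameters above together, a hedged sketch of a content-extraction call (all values illustrative) could be:

```javascript
// Hypothetical crawl arguments combining cleanup, waiting, and filtering.
const extractArgs = {
  url: "https://example.com/article",
  cache_mode: "BYPASS",                           // fetch fresh content
  excluded_selector: "#cookie-banner, .advertisement",
  excluded_tags: ["nav", "footer", "aside", "script", "style"],
  wait_for: "css:.ajax-content",                  // only for content loaded after page load
  wait_for_timeout: 10000,                        // give up after 10 s
  word_count_threshold: 10                        // drop short menu/footer snippets
};
```

Because this call only extracts data and runs no js_code, no session_id is needed; a fresh browser per call is fine here.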