
MCP Server for Crawl4AI

by omgwtfwow

Server Configuration

Describes the environment variables required to run the server.

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| SERVER_NAME | No | Custom name for the MCP server | crawl4ai-mcp |
| SERVER_VERSION | No | Custom version for the MCP server | 1.0.0 |
| CRAWL4AI_API_KEY | No | API key for authentication, if your server requires it | |
| CRAWL4AI_BASE_URL | Yes | URL of the Crawl4AI server | http://localhost:11235 |
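
For reference, a minimal connection sketch using the official MCP TypeScript SDK, passing these variables through the transport's environment. The launch command, package name, and values are illustrative assumptions; adjust them to your installation.

```ts
// Sketch: connect to the server over stdio and set the variables above.
// The command/args are assumptions, not a documented launch method.
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

const transport = new StdioClientTransport({
  command: "npx",
  args: ["mcp-crawl4ai-ts"], // hypothetical launch command for this server
  env: {
    CRAWL4AI_BASE_URL: "http://localhost:11235", // required
    CRAWL4AI_API_KEY: "",                        // optional
    SERVER_NAME: "crawl4ai-mcp",                 // optional
    SERVER_VERSION: "1.0.0",                     // optional
  },
});

const client = new Client({ name: "example-client", version: "1.0.0" });
await client.connect(transport);
```

The tool sketches below reuse this connected `client`.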

Schema

Prompts

Interactive templates invoked by user choice


No prompts

Resources

Contextual data attached and managed by the client


No resources

Tools

Functions exposed to the LLM to take actions

get_markdown

[STATELESS] Extract content as markdown with filtering options. Supports: raw (full content), fit (optimized, default), bm25 (keyword search), llm (AI-powered extraction). Use bm25/llm with query for specific content. Creates new browser each time. For persistence use create_session + crawl.
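
A hedged call sketch, reusing the connected `client` from the configuration example above; the `filter` and `query` argument names are assumptions inferred from the modes listed here.

```ts
// Assumed arguments: "filter" picks the mode (raw/fit/bm25/llm),
// "query" drives bm25 keyword search or llm extraction.
const md = await client.callTool({
  name: "get_markdown",
  arguments: {
    url: "https://example.com/docs",
    filter: "bm25",
    query: "installation steps",
  },
});
```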

capture_screenshot

[STATELESS] Capture webpage screenshot. Returns base64-encoded PNG data. Creates new browser each time. Optionally saves screenshot to local directory. IMPORTANT: Chained calls (execute_js then capture_screenshot) will NOT work - the screenshot won't see JS changes! For JS changes + screenshot use create_session + crawl(session_id, js_code, screenshot:true) in ONE call.
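
A sketch of decoding the returned base64 PNG to a local file. Where the image data sits in the result is an assumption; inspect the content blocks your server version actually returns.

```ts
import { writeFile } from "node:fs/promises";

const shot = await client.callTool({
  name: "capture_screenshot",
  arguments: { url: "https://example.com" },
});
// Assumption: the PNG arrives as an MCP image content block.
const content = (shot.content ?? []) as Array<{ type: string; data?: string }>;
const image = content.find((c) => c.type === "image");
if (image?.data) {
  await writeFile("page.png", Buffer.from(image.data, "base64"));
}
```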

generate_pdf

[STATELESS] Convert webpage to PDF. Returns base64-encoded PDF data. Creates new browser each time. Cannot capture form fills or JS changes. For persistent PDFs use create_session + crawl(session_id, pdf:true).

execute_js

[STATELESS] Execute JavaScript and get return values + page content. Creates new browser each time. Use for: extracting data, triggering dynamic content, checking page state. Scripts with "return" statements return actual values (strings, numbers, objects, arrays). Note: a null return comes back as {"success": true}. Return values are preserved, but page state is lost afterward. For persistent JS execution, use crawl with session_id.
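
A sketch of a script whose return value comes back in the result; the `scripts` argument name is an assumption.

```ts
// Scripts with a `return` statement send values back to the caller.
const result = await client.callTool({
  name: "execute_js",
  arguments: {
    url: "https://example.com",
    scripts: ["return document.title"], // assumed argument name
  },
});
// Per the note above, a script returning null comes back as {"success": true}.
```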

batch_crawl

[STATELESS] Crawl multiple URLs concurrently for efficiency. Use when: processing URL lists, comparing multiple pages, or bulk data extraction. Faster than sequential crawling. Max 5 concurrent by default. Each URL gets a fresh browser. Cannot maintain state between URLs. For persistent operations use create_session + crawl.
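
A sketch crawling three URLs concurrently; the `urls` and concurrency argument names are assumptions.

```ts
const batch = await client.callTool({
  name: "batch_crawl",
  arguments: {
    urls: [
      "https://example.com/a",
      "https://example.com/b",
      "https://example.com/c",
    ],
    max_concurrent: 5, // assumed name for the documented concurrency cap
  },
});
```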

smart_crawl

[STATELESS] Auto-detect and handle different content types (HTML, sitemap, RSS, text). Use when: URL type is unknown, crawling feeds/sitemaps, or want automatic format handling. Adapts strategy based on content. Creates new browser each time. For persistent operations use create_session + crawl.

get_html

[STATELESS] Get sanitized/processed HTML for inspection and automation planning. Use when: finding form fields/selectors, analyzing page structure before automation, building schemas. Returns cleaned HTML showing element names, IDs, and classes - perfect for identifying selectors for subsequent crawl operations. Commonly used before crawl to find selectors for automation. Creates new browser each time.

extract_links

[STATELESS] Extract and categorize all page links. Use when: building sitemaps, analyzing site structure, finding broken links, or discovering resources. Groups by internal/external/social/documents. Creates new browser each time. For persistent operations use create_session + crawl.

crawl_recursive

[STATELESS] Deep crawl a website following internal links. Use when: mapping entire sites, finding all pages, building comprehensive indexes. Control with max_depth (default 3) and max_pages (default 50). Note: May need JS execution for dynamic sites. Each page gets a fresh browser. For persistent operations use create_session + crawl.
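
A sketch tightening the documented limits for a small site map; the argument names mirror the defaults mentioned above but are assumptions.

```ts
const site = await client.callTool({
  name: "crawl_recursive",
  arguments: {
    url: "https://example.com",
    max_depth: 2,  // documented default: 3
    max_pages: 20, // documented default: 50
  },
});
```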

parse_sitemap

[STATELESS] Extract URLs from XML sitemaps. Use when: discovering all site pages, planning crawl strategies, or checking sitemap validity. Supports regex filtering. Try sitemap.xml or robots.txt first. Creates new browser each time.
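
A sketch using the regex filtering mentioned above; the filter argument name is an assumption.

```ts
const urls = await client.callTool({
  name: "parse_sitemap",
  arguments: {
    url: "https://example.com/sitemap.xml",
    filter: ".*/blog/.*", // assumed name for the regex filter
  },
});
```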

crawl

[SUPPORTS SESSIONS] THE ONLY TOOL WITH BROWSER PERSISTENCE

RECOMMENDED PATTERNS:

• Inspect-first workflow:

  1. get_html(url) → find selectors & verify elements exist

  2. create_session() → "session-123"

  3. crawl({url, session_id: "session-123", js_code: ["action 1"]})

  4. crawl({url: "/page2", session_id: "session-123", js_code: ["action 2"]})

• Multi-step with state:

  1. create_session() → "session-123"

  2. crawl({url, session_id: "session-123"}) → inspect current state

  3. crawl({url, session_id: "session-123", js_code: ["verified actions"]})

WITH session_id: Maintains browser state (cookies, localStorage, page) across calls.
WITHOUT session_id: Creates a fresh browser each time (like other tools).

WHEN TO USE SESSIONS vs STATELESS:

• Need state between calls? → create_session + crawl
• Just extracting data? → Use stateless tools
• Filling forms? → Inspect first, then use sessions
• Taking a screenshot after JS? → Must use crawl with a session
• Unsure if elements exist? → Always use get_html first

CRITICAL FOR js_code: Always use screenshot: true when running js_code; this avoids server serialization errors and gives visual confirmation (a full session workflow is sketched below).
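
The inspect-first pattern above, written as one hedged sketch reusing the connected `client`; the selectors and argument shapes are illustrative assumptions.

```ts
// 1. Inspect: find selectors before automating (stateless call).
const html = await client.callTool({
  name: "get_html",
  arguments: { url: "https://example.com/login" },
});

// 2. Create a persistent session.
await client.callTool({
  name: "manage_session",
  arguments: { action: "create", session_id: "session-123" },
});

// 3. Act in the session; screenshot: true is recommended with js_code.
await client.callTool({
  name: "crawl",
  arguments: {
    url: "https://example.com/login",
    session_id: "session-123",
    js_code: ["document.querySelector('#user').value = 'demo'"], // illustrative selector
    screenshot: true,
  },
});

// 4. Continue in the same browser: cookies and localStorage persist.
await client.callTool({
  name: "crawl",
  arguments: { url: "https://example.com/account", session_id: "session-123" },
});
```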

manage_session

[SESSION MANAGEMENT] Unified tool for managing browser sessions. Supports three actions:

• CREATE: Start a persistent browser session that maintains state across calls
• CLEAR: Remove a session from local tracking
• LIST: Show all active sessions with age and usage info

USAGE EXAMPLES:

  1. Create session: {action: "create", session_id: "my-session", initial_url: "https://example.com"}

  2. Clear session: {action: "clear", session_id: "my-session"}

  3. List sessions: {action: "list"}

Browser sessions maintain ALL state (cookies, localStorage, page) across multiple crawl calls. Essential for: forms, login flows, multi-step processes, maintaining state across operations.
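
The list and clear actions as a sketch, matching the usage examples above.

```ts
// Show active sessions, then remove one from local tracking.
const sessions = await client.callTool({
  name: "manage_session",
  arguments: { action: "list" },
});
await client.callTool({
  name: "manage_session",
  arguments: { action: "clear", session_id: "my-session" },
});
```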

extract_with_llm

[STATELESS] Ask questions about webpage content using AI. Returns natural language answers. Crawls fresh each time. For dynamic content or sessions, use crawl with session_id first.
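
A sketch asking a question of a page; the `query` argument name is an assumption.

```ts
const answer = await client.callTool({
  name: "extract_with_llm",
  arguments: {
    url: "https://example.com/pricing",
    query: "What does the monthly plan cost?", // assumed argument name
  },
});
```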

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/omgwtfwow/mcp-crawl4ai-ts'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.