Skip to main content
Glama

crawl

Recursively crawl a website to extract text and links from multiple pages. Follows site boundaries and respects robots.txt with configurable depth limits.

Instructions

Recursively crawl a website via BFS. Opens pages in new tabs, extracts text and links, follows them up to max_depth. Respects robots.txt and scope constraints.

When to use: Extracting content from multiple pages of a site when the URL structure is not known in advance. When NOT to use: Use crawl_sitemap when the site has a sitemap.xml, or navigate for a single page.

Input Schema

TableJSON Schema
NameRequiredDescriptionDefault
urlYesStarting URL to crawl
max_depthNoMaximum link-follow depth (0 = start page only). Default: 2
max_pagesNoMaximum number of pages to crawl. Default: 20
cursorNoOpaque pagination cursor returned as nextCursor from a prior crawl call. Cursoring paginates returned pages after the crawl completes.
scopeNoURL glob pattern limiting which URLs to follow (e.g. "https://docs.example.com/**"). Default: same origin as start URL.
include_patternsNoURL glob patterns — only follow links matching at least one
exclude_patternsNoURL glob patterns — skip links matching any of these
output_formatNoContent format per page. "markdown-clean" uses cheerio+turndown to strip nav/footer/ads. Default: markdown
onlyMainContentNomarkdown-clean only: strip nav/header/footer/aside/ads. Default: true.
includeLinksNomarkdown-clean only: preserve <a> as markdown links. Default: true.
content_filterNomarkdown-clean only: deterministic fit_markdown filter. Default: none.
return_rawNomarkdown-clean only: include raw_markdown in each page. Default: false.
return_fitNomarkdown-clean only: include fit_markdown and use it as content when filtering. Default: true when filtered.
respect_robotsNoWhether to fetch and obey robots.txt. Default: true
delay_msNoDelay between page fetches in milliseconds. Default: 1000
concurrencyNoMax parallel tab fetches. Default: 3
engineNoFetch engine: "cdp" (default, opens a Chrome tab per page), "static" (Node fetch only, fails closed on insufficient pages), or "auto" (static first, fall back to CDP when static is insufficient).
include_metricsNoWhen true, include approximate output size/token metrics in the JSON result. Default: false.
strategyNoCrawl traversal strategy. Default: bfs. best_first scores discovered URLs by query/url_score and visits highest-scoring URLs first.
queryNoOptional query terms used by strategy=best_first URL scoring.
url_scoreNoOptional strategy=best_first URL scoring hints: keywords, prefer_paths, exclude_paths, same_depth_bias.
dispatcherNoCrawl concurrency dispatcher. Default: fixed. adaptive reduces concurrency on memory/error pressure and records origin backoff for 429/503 responses.
dispatcher_optionsNodispatcher=adaptive options: min_concurrency, max_concurrency, memory_pressure_mb, origin_backoff_ms, rate_limit_statuses.
cache_modeNoOpt-in crawl content cache mode. Default: disabled.
cache_ttl_msNoMaximum age for enabled/read_only cache hits. Negative or omitted means no TTL expiry.
cache_scopeNoCache namespace/safety scope. Default: public; session adds a session fingerprint to the key.
Behavior4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

The description adds behavioral context beyond the sparse annotations: opening new tabs, respecting robots.txt and scope constraints, following links up to max_depth. It does not contradict annotations. Some details like engine behavior or caching are not mentioned, but the core behavior is well-covered.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Three short paragraphs: core functionality, usage guidance, and a trailing note on scope/responsiveness. Every sentence adds value, front-loaded with the most important information. No wasted words.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

For a complex tool with 26 parameters and no output schema, the description provides a solid high-level understanding. It covers the main use case, BFS traversal, robots.txt handling, and links to sibling tools. Advanced features like caching or best-first strategy are not detailed, but the schema handles those. The description is sufficiently complete for an agent to decide when to invoke it.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema coverage is 100%, so each parameter is already described in the schema. The description adds minimal extra semantics for individual parameters, mentioning only max_depth implicitly. Baseline of 3 is appropriate since the schema carries the burden.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose: recursively crawl a website via BFS, extracting text and links. It distinguishes itself from siblings like navigate (single page) and crawl_sitemap (when sitemap exists). The specific verb 'crawl' and resource 'website' with method 'BFS' make it unambiguous.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines5/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

Explicitly provides 'When to use' and 'When NOT to use' sections, with clear alternatives: use crawl_sitemap when a sitemap is available, or navigate for a single page. This helps the agent choose correctly among siblings.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

Install Server

Other Tools

Latest Blog Posts

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/shaun0927/openchrome'

If you have feedback or need assistance with the MCP directory API, please join our Discord server