
x402_crawl_site

Crawl websites via BFS to extract markdown, links, tables, images, and metadata from multiple pages. Configure depth, page limits, and path filters for structured data collection.

Instructions

Crawl a website via BFS and return per-page extraction results (markdown, links, tables, images, metadata). Price: $0.10 USDC per crawl (paid mode) | Free test: returns fixture data.

Crawls up to max_pages pages starting from the seed URL, up to max_depth link hops deep. Same extraction pipeline as x402_scrape_url — each page returns markdown, links, tables, images, metadata. Optional include_paths/exclude_paths glob filters (e.g. '/blog/*') restrict which URLs are followed. Hard limits: max 15 pages, max depth 5. Response includes pages_requested, pages_crawled, pages_skipped. Without X402_PRIVATE_KEY, only the free test endpoint is available.

Returns: seed_url, pages_requested, pages_crawled, pages_skipped, reasons_skipped, results array.
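
For orientation, here is a minimal sketch of invoking this tool from an MCP client using the @modelcontextprotocol/sdk client API. The launch command, file path, and argument values are placeholders, not taken from this page:

    import { Client } from "@modelcontextprotocol/sdk/client/index.js";
    import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

    const client = new Client({ name: "example-client", version: "0.1.0" });

    // How the server is launched is an assumption; adjust command/args for your setup.
    await client.connect(
      new StdioClientTransport({ command: "node", args: ["dist/index.js"] })
    );

    const result = await client.callTool({
      name: "x402_crawl_site",
      arguments: {
        url: "https://example.com",
        max_pages: 5,               // must stay within the hard cap of 15
        max_depth: 2,               // link hops from the seed URL
        include_paths: ["/blog/*"], // only follow blog URLs
        exclude_paths: ["/admin/*"],
      },
    });

    console.log(result.content); // per-page crawl results serialized as text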

Input Schema

Name          | Required | Description                                                                  | Default
url           | Yes      | Seed URL to begin crawling (http/https, max 2048 chars)                     | (none)
max_pages     | No       | Maximum pages to crawl (1-15)                                                | 10
max_depth     | No       | Maximum link depth from seed URL (1-5)                                       | 2
include_paths | No       | Only follow URLs matching these path glob patterns (e.g. '/blog/*', max 20) | (none)
exclude_paths | No       | Skip URLs matching these path glob patterns (e.g. '/admin/*', max 20)       | (none)
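
To make the parameter interactions concrete, the sketch below shows how a breadth-first crawl might apply max_pages, max_depth, include_paths, and exclude_paths. This is illustrative only, not the server's implementation, and the glob matching is simplified to '*' wildcards as an assumption about the filter semantics:

    // Convert a simple path glob ('/blog/*') into a RegExp; '*' matches any run of characters.
    function globToRegExp(glob: string): RegExp {
      const escaped = glob.replace(/[.+^${}()|[\]\\]/g, "\\$&").replace(/\*/g, ".*");
      return new RegExp(`^${escaped}$`);
    }

    // A path is followed if it matches no exclude pattern and, when include patterns
    // are provided, matches at least one of them.
    function pathAllowed(path: string, include: string[], exclude: string[]): boolean {
      if (exclude.some((g) => globToRegExp(g).test(path))) return false;
      if (include.length === 0) return true;
      return include.some((g) => globToRegExp(g).test(path));
    }

    // Breadth-first crawl order over a pre-resolved link graph (a stand-in for fetching pages).
    function crawlOrder(
      seed: string,
      links: Map<string, string[]>,
      maxPages = 10,
      maxDepth = 2,
      include: string[] = [],
      exclude: string[] = [],
    ): string[] {
      const visited = new Set<string>([seed]);
      const order: string[] = [];
      let frontier = [seed];
      for (let depth = 0; depth <= maxDepth && frontier.length > 0; depth++) {
        const next: string[] = [];
        for (const url of frontier) {
          if (order.length >= maxPages) return order;
          order.push(url);
          for (const link of links.get(url) ?? []) {
            const path = new URL(link).pathname;
            if (!visited.has(link) && pathAllowed(path, include, exclude)) {
              visited.add(link);
              next.push(link);
            }
          }
        }
        frontier = next;
      }
      return order;
    }

With include_paths set to ['/blog/*'], only links whose path falls under /blog/ ever enter the frontier, which mirrors the "restrict which URLs are followed" behavior described above.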

Implementation Reference

  • The handler implementation for x402_crawl_site, which delegates to either the scraping API's paid endpoint or a free test endpoint.
    async (params) => {
      const base = APIS.scraping.baseUrl;
      try {
        const usePaid = !!PRIVATE_KEY;
        if (usePaid) {
          const payload: Record<string, unknown> = {
            url: params.url,
            max_pages: params.max_pages,
            max_depth: params.max_depth,
          };
          if (params.include_paths) payload.include_paths = params.include_paths;
          if (params.exclude_paths) payload.exclude_paths = params.exclude_paths;
          const data = await apiPost(base, "/crawl", payload, true);
          return textResult({ mode: "paid", cost: "$0.10", ...data });
        } else {
          const data = await apiGet(base, "/crawl/test");
          return textResult({
            mode: "free_test",
            note: "Free test — returns fixture data. Set X402_PRIVATE_KEY for live crawling.",
            ...data,
          });
        }
      } catch (error) {
        // Error handling is elided in this excerpt; the full handler is in src/index.ts.
        throw error;
      }
    }
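    // textResult (and apiPost/apiGet) are helpers defined elsewhere in src/index.ts and are
    // not shown on this page. A plausible shape for textResult, as an assumption rather than
    // the actual source, is to wrap the payload as MCP text content:
    function textResult(payload: Record<string, unknown>) {
      return {
        content: [{ type: "text" as const, text: JSON.stringify(payload, null, 2) }],
      };
    }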
  • Input validation schema for x402_crawl_site.
    {
      url: z.string().url()
        .describe("Seed URL to begin crawling (http/https, max 2048 chars)"),
      max_pages: z.number().int().min(1).max(15).default(10)
        .describe("Maximum pages to crawl (1-15, default: 10)"),
      max_depth: z.number().int().min(1).max(5).default(2)
        .describe("Maximum link depth from seed URL (1-5, default: 2)"),
      include_paths: z.array(z.string()).max(20).optional()
        .describe("Only follow URLs matching these path glob patterns (e.g. '/blog/*', max 20)"),
      exclude_paths: z.array(z.string()).max(20).optional()
        .describe("Skip URLs matching these path glob patterns (e.g. '/admin/*', max 20)"),
    },
  • src/index.ts:742-753 (registration)
    Registration of the x402_crawl_site tool in the MCP server.
    server.tool(
      "x402_crawl_site",
      `Crawl a website via BFS and return per-page extraction results (markdown, links, tables, images, metadata).
    Price: $0.10 USDC per crawl (paid mode) | Free test: returns fixture data.
    
    Crawls up to max_pages pages starting from the seed URL, up to max_depth link hops deep.
    Same extraction pipeline as x402_scrape_url — each page returns markdown, links, tables, images, metadata.
    Optional include_paths/exclude_paths glob filters (e.g. '/blog/*') restrict which URLs are followed.
    Hard limits: max 15 pages, max depth 5. Response includes pages_requested, pages_crawled, pages_skipped.
    Without X402_PRIVATE_KEY, only the free test endpoint is available.
    
    Returns: seed_url, pages_requested, pages_crawled, pages_skipped, reasons_skipped, results array.`,
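
Taken together, the three excerpts above slot into a single server.tool call. The following is a minimal, self-contained sketch assuming the @modelcontextprotocol/sdk McpServer API, with the description shortened and a placeholder handler standing in for the real one excerpted earlier:

    import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
    import { z } from "zod";

    const server = new McpServer({ name: "x402-mcp-server", version: "0.1.0" });

    server.tool(
      "x402_crawl_site",
      "Crawl a website via BFS and return per-page extraction results (markdown, links, tables, images, metadata).",
      {
        url: z.string().url().describe("Seed URL to begin crawling (http/https, max 2048 chars)"),
        max_pages: z.number().int().min(1).max(15).default(10),
        max_depth: z.number().int().min(1).max(5).default(2),
        include_paths: z.array(z.string()).max(20).optional(),
        exclude_paths: z.array(z.string()).max(20).optional(),
      },
      async (params) => {
        // Placeholder: the real handler (excerpted above) POSTs to /crawl in paid mode
        // or GETs /crawl/test when X402_PRIVATE_KEY is not set.
        return {
          content: [{ type: "text" as const, text: JSON.stringify(params, null, 2) }],
        };
      },
    );
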
Behavior: 5/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries full behavioral disclosure burden and succeeds comprehensively. It discloses pricing ($0.10 USDC), authentication requirements (X402_PRIVATE_KEY), operational limits (max_pages/max_depth caps), algorithm details (BFS), and return structure (pages_requested, pages_crawled, reasons_skipped, results array) including the distinction between requested and actually crawled pages.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is information-dense but well-structured with clear logical flow: purpose → pricing → mechanics → limits → authentication. Each sentence serves a necessary function given the lack of annotations and output schema. Minor deduction for density—the return value enumeration at the end is necessary but makes the text longer than ideal.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's complexity (paid service with auth requirements, crawling constraints, and extraction pipeline) and the absence of both annotations and output schema, the description achieves complete coverage. It compensates for the missing output schema by enumerating return fields and explains behavioral constraints (hard limits, filtering logic) that would typically appear in annotations.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

While the input schema has 100% description coverage (baseline 3), the description adds valuable semantic context explaining how parameters interact: 'Crawls up to max_pages pages starting from the seed URL, up to max_depth link hops deep.' It also clarifies that include/exclude_paths restrict 'which URLs are followed' during the crawl, adding behavioral meaning beyond the schema's syntax documentation.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description opens with a specific verb-resource combination ('Crawl a website via BFS') and clarifies the return format ('per-page extraction results'). It distinguishes from sibling x402_scrape_url by noting they share the 'same extraction pipeline,' implying this tool is for multi-page crawling while the sibling handles single URLs.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description establishes context by contrasting with x402_scrape_url ('Same extraction pipeline') and noting hard limits (15 pages, depth 5). It clearly signals the pricing model and test mode requirements. However, it stops short of explicitly stating when to choose this over x402_scrape_url (e.g., 'use this for site-wide crawling, use scrape_url for single pages').

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
