crawl

Initiate bulk web crawling with optional structured data extraction. Specify a starting URL, a page limit, and an optional JSON schema to gather web content programmatically.

Instructions

Start an async bulk crawl with optional extraction. Returns a job ID to poll with job_status. Costs 1 credit per page.
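To make the crawl-then-poll workflow concrete, here is a minimal sketch using the official MCP TypeScript SDK. Only the two tool names (crawl and job_status) are confirmed above; the server launch command, the job_id field name, and the job_status argument name are assumptions for illustration.

    import { Client } from "@modelcontextprotocol/sdk/client/index.js";
    import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

    const client = new Client({ name: "example-agent", version: "1.0.0" });
    // The command is hypothetical; launch the SearchClaw server however it is installed.
    await client.connect(new StdioClientTransport({ command: "searchclaw-mcp" }));

    // Start the crawl. Each crawled page costs 1 credit, capped here at 10 pages.
    const started = await client.callTool({
      name: "crawl",
      arguments: { url: "https://example.com", max_pages: 10 },
    });

    // The tool returns pretty-printed JSON as text; "job_id" is an assumed field name.
    const content = started.content as Array<{ type: string; text: string }>;
    const { job_id } = JSON.parse(content[0].text);

    // Poll until the job completes; the argument name is also an assumption.
    const status = await client.callTool({ name: "job_status", arguments: { job_id } });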

Input Schema

Name        Required  Description                                      Default
url         Yes       Starting URL to crawl
max_pages   No        Maximum pages to crawl                           10
schema      No        JSON schema for structured extraction per page
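As an illustration, a plausible arguments object for a crawl that also extracts structured fields from each page. The field names inside the extraction schema are made up; any valid JSON Schema describing the per-page fields you want should fit the schema parameter.

    // Hypothetical example input; the properties inside `schema` are illustrative only.
    const crawlArgs = {
      url: "https://example.com/blog",
      max_pages: 25,
      schema: {
        type: "object",
        properties: {
          title: { type: "string" },
          author: { type: "string" },
          published_at: { type: "string" },
        },
      },
    };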

Implementation Reference

  • The handler function for the 'crawl' tool. It constructs the request body with url, max_pages, and optional schema, then calls apiPost('/crawl', body) to start an async bulk crawl job and returns the result formatted as JSON.
    async ({ url, max_pages, schema }) => {
      const body: Record<string, unknown> = { url, max_pages };
      if (schema) body.schema = schema;
      return jsonResult(await apiPost("/crawl", body));
    }
  • Zod schema definition for the 'crawl' tool inputs: url (string, required), max_pages (number, optional with default 10), and schema (record, optional) for structured extraction per page.
    {
      url: z.string().describe("Starting URL to crawl"),
      max_pages: z.number().optional().default(10).describe("Maximum pages to crawl (default: 10)"),
      schema: z.record(z.unknown()).optional().describe("JSON schema for structured extraction per page"),
    },
  • src/index.ts:144-157 (registration)
    Registration of the 'crawl' tool with the MCP server using server.tool(). Defines the tool name, description, input schema, and handler function.
    server.tool(
      "crawl",
      "Start an async bulk crawl with optional extraction. Returns a job ID to poll with job_status. Costs 1 credit per page.",
      {
        url: z.string().describe("Starting URL to crawl"),
        max_pages: z.number().optional().default(10).describe("Maximum pages to crawl (default: 10)"),
        schema: z.record(z.unknown()).optional().describe("JSON schema for structured extraction per page"),
      },
      async ({ url, max_pages, schema }) => {
        const body: Record<string, unknown> = { url, max_pages };
        if (schema) body.schema = schema;
        return jsonResult(await apiPost("/crawl", body));
      }
    );
  • Helper function apiPost that makes HTTP POST requests to the SearchClaw API. Handles JSON serialization, headers, 30-second timeout, and error handling for non-OK responses.
    async function apiPost(path: string, body: Record<string, unknown>) {
      const controller = new AbortController();
      const timeout = setTimeout(() => controller.abort(), 30000);
      try {
        const response = await fetch(`${API_BASE}${path}`, {
          method: "POST",
          headers: { ...headers, "Content-Type": "application/json" },
          body: JSON.stringify(body),
          signal: controller.signal,
        });
        if (!response.ok) {
          const text = await response.text();
          throw new Error(`SearchClaw API error ${response.status}: ${text}`);
        }
        return response.json();
      } finally {
        clearTimeout(timeout);
      }
    }
  • Helper function jsonResult that formats API response data into MCP-compliant content format with type 'text' and JSON stringification with 2-space indentation.
    function jsonResult(data: unknown) {
      return { content: [{ type: "text" as const, text: JSON.stringify(data, null, 2) }] };
    }
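To show what the handler ultimately hands back to the MCP client, here is jsonResult applied to a hypothetical crawl-start payload. The payload field names are made up, not the documented /crawl response shape.

    // Illustrative only: the real /crawl response shape is not documented here.
    const result = jsonResult({ job_id: "abc123", status: "queued" });
    // result wraps pretty-printed JSON in MCP text content:
    // { content: [{ type: "text", text: '{\n  "job_id": "abc123",\n  "status": "queued"\n}' }] }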
Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries the full burden and does well by disclosing key behavioral traits: it's an async operation, returns a job ID for polling, and has a cost implication ('Costs 1 credit per page'). It doesn't cover rate limits, authentication needs, or error handling, but provides substantial operational context.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Two sentences with zero waste: the first states purpose and key behavior, the second adds critical cost information. Every word earns its place, and the most important information (async nature and cost) is front-loaded.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

For a 3-parameter tool with no annotations and no output schema, the description provides good operational context (async, polling, cost). It doesn't explain return values beyond the job ID or error scenarios, but given the schema coverage and clear purpose, it's reasonably complete for an agent to use effectively.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 100%, so the schema already documents all parameters thoroughly. The description adds no additional parameter semantics beyond what's in the schema (e.g., it doesn't explain how 'schema' relates to 'optional extraction'). Baseline 3 is appropriate when the schema does the heavy lifting.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the specific action ('Start an async bulk crawl'), the resource being acted upon (web pages), and the optional capability ('with optional extraction'). It distinguishes the tool from siblings like 'browse' (likely single-page) and 'extract' (likely focused on extraction without crawling).

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides clear context for when to use this tool ('Start an async bulk crawl'), mentions the polling mechanism ('Returns a job ID to poll with job_status'), and implies an alternative to single-page operations. However, it doesn't explicitly state when NOT to use it or compare it to all relevant siblings like 'browse' or 'extract'.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
