
crawl_webpages

Systematically collect content from websites by exploring linked pages for data collection, content indexing, or site mapping.

Instructions

Crawl a website starting from a URL and explore linked pages. This tool allows systematic collection of content from multiple pages within a domain. Use this for larger data collection tasks, content indexing, or site mapping.

Input Schema

url (required): The URL of the webpage to crawl.
sessionOptions (optional): Options for the browser session. Avoid setting these unless explicitly requested.
outputFormat (required): The format of the output.
followLinks (required): Whether to follow links on the crawled webpages.
maxPages (optional, default 10): Maximum number of pages to crawl (1-100).
ignoreSitemap (optional, default false): Whether to ignore the site's sitemap when crawling.
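
For illustration, a call that crawls a documentation site and returns Markdown plus discovered links might pass arguments like the following. The values and the constant name are hypothetical; only the parameter names, constraints, and defaults come from the schema defined below.

    const exampleParams = {
      url: "https://docs.example.com",      // starting page for the crawl
      outputFormat: ["markdown", "links"],  // return Markdown content plus the links found on each page
      followLinks: true,                    // explore pages linked from the start URL
      maxPages: 20,                         // stop after 20 pages (1-100 allowed, default 10)
      // sessionOptions and ignoreSitemap are omitted; ignoreSitemap defaults to false
    };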

Implementation Reference

  • The main handler function that executes the tool logic: it destructures the params, obtains a client (using the caller's API key when one is available), calls client.crawl.startAndWait with the crawl parameters, and assembles the results into a CallToolResult containing text, resource, and image content. A sketch of the returned shape appears after this list.
    export async function crawlWebpagesTool(
      params: crawlWebpagesToolParamSchemaType,
      extra: RequestHandlerExtra<ServerRequest, ServerNotification>
    ): Promise<CallToolResult> {
      const {
        url,
        sessionOptions,
        outputFormat,
        ignoreSitemap,
        followLinks,
        maxPages,
      } = params;
    
      let apiKey: string | undefined = undefined;
      if (extra.authInfo && extra.authInfo.extra?.isSSE) {
        apiKey = extra.authInfo.token;
      }
    
      try {
        const client = await getClient({ hbApiKey: apiKey });
    
        const result = await client.crawl.startAndWait({
          url,
          sessionOptions,
          scrapeOptions: {
            formats: outputFormat,
          },
          maxPages,
          ignoreSitemap,
          followLinks,
        });
    
        if (result.error) {
          return {
            isError: true,
            content: [
              {
                type: "text",
                text: result.error,
              },
            ],
          };
        }
    
        const response: CallToolResult = {
          content: [],
          isError: false,
        };
    
        result.data?.forEach((page) => {
          if (page?.markdown) {
            response.content.push({
              type: "text",
              text: page.markdown,
            });
          }
    
          if (page?.html) {
            response.content.push({
              type: "text",
              text: page.html,
            });
          }
    
          if (page?.links) {
            page.links.forEach((link) => {
              response.content.push({
                type: "resource",
                resource: {
                  uri: link,
                  text: link,
                },
              });
            });
          }
    
          if (page?.screenshot) {
            response.content.push({
              type: "image",
              data: page.screenshot,
              mimeType: "image/webp",
            });
          }
        });
    
        return response;
      } catch (error) {
        return {
          content: [{ type: "text", text: `${error}` }],
          isError: true,
        };
      }
    }
  • Zod schema definition for the input parameters of the crawl_webpages tool, including url, sessionOptions, outputFormat, followLinks, maxPages, and ignoreSitemap. A short sketch of how the defaults apply follows this list.
    export const crawlWebpagesToolParamSchemaRaw = {
      url: z.string().url().describe("The URL of the webpage to crawl."),
      sessionOptions: sessionOptionsSchema,
      outputFormat: z
        .array(z.enum(["markdown", "html", "links", "screenshot"]))
        .min(1)
        .describe("The format of the output"),
      followLinks: z
        .boolean()
        .describe("Whether to follow links on the crawled webpages"),
      maxPages: z
        .number()
        .int()
        .positive()
        .finite()
        .safe()
        .min(1)
        .max(100)
        .default(10),
      ignoreSitemap: z.boolean().default(false),
    };
    
    export const crawlWebpagesToolParamSchema = z.object(
      crawlWebpagesToolParamSchemaRaw
    );
    
    export type crawlWebpagesToolParamSchemaType = z.infer<
      typeof crawlWebpagesToolParamSchema
    >;
  • Registration of the crawl_webpages tool with the MCP server using server.tool, providing name, description, param schema, and handler function.
    server.tool(
      crawlWebpagesToolName,
      crawlWebpagesToolDescription,
      crawlWebpagesToolParamSchemaRaw,
      crawlWebpagesTool
    );
  • Exports the tool name and description constants used for registration.
    export const crawlWebpagesToolName = "crawl_webpages";
    export const crawlWebpagesToolDescription =
      "Crawl a website starting from a URL and explore linked pages. This tool allows systematic collection of content from multiple pages within a domain. Use this for larger data collection tasks, content indexing, or site mapping.";
Behavior: 2/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations, the description carries the full burden but provides minimal behavioral disclosure. It mentions 'systematic collection' and 'explore linked pages' but omits critical details: rate limits, authentication needs, potential destructive effects (e.g., server load), timeouts, or output format specifics. For a complex 6-parameter crawling tool, this is inadequate.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Three concise sentences front-load the core purpose, then provide usage context. No wasted words, though the final sentence could be more tightly integrated. Efficient for the tool's complexity, but not perfectly structured.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 2/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

For a complex crawling tool with 6 parameters, nested objects, no annotations, and no output schema, the description is insufficient. It lacks behavioral warnings (e.g., ethical crawling, rate limits) and output details, and it fails to compensate for the missing annotation coverage. The usage guidelines help but don't address operational constraints.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 67%, providing decent parameter documentation. The description adds marginal value by implying 'starting from a URL' (maps to 'url' parameter) and 'explore linked pages' (hints at 'followLinks'), but doesn't explain other parameters like 'outputFormat' or 'maxPages'. Baseline 3 is appropriate given schema coverage.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 4/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose: 'Crawl a website starting from a URL and explore linked pages' with specific verbs ('crawl', 'explore') and resource ('website', 'linked pages'). It distinguishes from sibling 'scrape_webpage' by emphasizing multi-page collection, though not explicitly named. The 'systematic collection of content from multiple pages' further clarifies scope.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides clear context for when to use: 'for larger data collection tasks, content indexing, or site mapping.' This implicitly distinguishes from single-page scraping tools like 'scrape_webpage' and suggests scale. However, it lacks explicit exclusions or named alternatives, preventing a perfect score.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
