# crawl_webpages
Systematically collect content from a website by exploring its linked pages, for use in data collection, content indexing, or site mapping.
## Instructions
Crawl a website starting from a URL and explore linked pages. This tool allows systematic collection of content from multiple pages within a domain. Use this for larger data collection tasks, content indexing, or site mapping.
## Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | The URL of the webpage to crawl. | |
| sessionOptions | No | Options for the browser session. Avoid setting these if not mentioned explicitly. | |
| outputFormat | Yes | The format of the output: one or more of `markdown`, `html`, `links`, `screenshot`. | |
| followLinks | Yes | Whether to follow links on the crawled webpages. | |
| maxPages | No | The maximum number of pages to crawl (1–100). | 10 |
| ignoreSitemap | No | Whether to ignore the website's sitemap when crawling. | false |
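As an illustration of these parameters, here is a minimal sketch of invoking the tool from a TypeScript MCP client. It assumes `client` is an already-connected `Client` from `@modelcontextprotocol/sdk`, and the argument values are examples only:

```typescript
// Assumes `client` is a connected MCP Client instance.
const result = await client.callTool({
  name: "crawl_webpages",
  arguments: {
    url: "https://example.com",
    outputFormat: ["markdown", "links"],
    followLinks: true,
    maxPages: 25, // optional; defaults to 10
  },
});
```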
## Implementation Reference
- `src/tools/crawl-webpages.ts:10-102` (handler). The main handler function that executes the tool logic: it destructures the params, obtains a client (using the SSE API key when one is available), calls `client.crawl.startAndWait` with the crawl parameters, and assembles the results into a `CallToolResult` of text, resource, and image content.
```typescript
export async function crawlWebpagesTool(
  params: crawlWebpagesToolParamSchemaType,
  extra: RequestHandlerExtra<ServerRequest, ServerNotification>
): Promise<CallToolResult> {
  const {
    url,
    sessionOptions,
    outputFormat,
    ignoreSitemap,
    followLinks,
    maxPages,
  } = params;

  let apiKey: string | undefined = undefined;
  if (extra.authInfo && extra.authInfo.extra?.isSSE) {
    apiKey = extra.authInfo.token;
  }

  try {
    const client = await getClient({ hbApiKey: apiKey });
    const result = await client.crawl.startAndWait({
      url,
      sessionOptions,
      scrapeOptions: {
        formats: outputFormat,
      },
      maxPages,
      ignoreSitemap,
      followLinks,
    });

    if (result.error) {
      return {
        isError: true,
        content: [
          {
            type: "text",
            text: result.error,
          },
        ],
      };
    }

    const response: CallToolResult = {
      content: [],
      isError: false,
    };

    result.data?.forEach((page) => {
      if (page?.markdown) {
        response.content.push({
          type: "text",
          text: page.markdown,
        });
      }
      if (page?.html) {
        response.content.push({
          type: "text",
          text: page.html,
        });
      }
      if (page?.links) {
        page.links.forEach((link) => {
          response.content.push({
            type: "resource",
            resource: {
              uri: link,
              text: link,
            },
          });
        });
      }
      if (page?.screenshot) {
        response.content.push({
          type: "image",
          data: page.screenshot,
          mimeType: "image/webp",
        });
      }
    });

    return response;
  } catch (error) {
    return {
      content: [{ type: "text", text: `${error}` }],
      isError: true,
    };
  }
}
```
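For quick local testing, the handler can be called directly. A minimal sketch, assuming it is acceptable to stub `extra` (a real `RequestHandlerExtra` carries more fields, hence the cast); with no `authInfo`, `getClient` falls back to its default credentials:

```typescript
import { crawlWebpagesTool } from "./crawl-webpages";

// Stubbed `extra`: no authInfo, so no per-request API key is extracted.
// The cast is an assumption for illustration only.
const result = await crawlWebpagesTool(
  {
    url: "https://example.com",
    outputFormat: ["markdown"],
    followLinks: false,
    maxPages: 10,
    ignoreSitemap: false,
  },
  { authInfo: undefined } as any
);

if (result.isError) {
  console.error("Crawl failed:", result.content);
}
```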
- `src/tools/tool-types.ts:127-155` (schema). Zod schema definition for the input parameters of the crawl_webpages tool, including url, sessionOptions, outputFormat, followLinks, maxPages, and ignoreSitemap.

```typescript
export const crawlWebpagesToolParamSchemaRaw = {
  url: z.string().url().describe("The URL of the webpage to crawl."),
  sessionOptions: sessionOptionsSchema,
  outputFormat: z
    .array(z.enum(["markdown", "html", "links", "screenshot"]))
    .min(1)
    .describe("The format of the output"),
  followLinks: z
    .boolean()
    .describe("Whether to follow links on the crawled webpages"),
  maxPages: z
    .number()
    .int()
    .positive()
    .finite()
    .safe()
    .min(1)
    .max(100)
    .default(10),
  ignoreSitemap: z.boolean().default(false),
};

export const crawlWebpagesToolParamSchema = z.object(
  crawlWebpagesToolParamSchemaRaw
);

export type crawlWebpagesToolParamSchemaType = z.infer<
  typeof crawlWebpagesToolParamSchema
>;
```
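Note that `maxPages` and `ignoreSitemap` carry defaults, so they are filled in at parse time. A small sketch (assuming `sessionOptionsSchema` is optional, as the input table indicates):

```typescript
import { crawlWebpagesToolParamSchema } from "./tool-types";

// Only the fields without defaults are supplied; zod applies the rest.
const parsed = crawlWebpagesToolParamSchema.parse({
  url: "https://example.com",
  outputFormat: ["markdown", "links"],
  followLinks: true,
});

console.log(parsed.maxPages); // 10 (default)
console.log(parsed.ignoreSitemap); // false (default)
```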
- `src/transports/setup_server.ts:86-89` (registration). Registration of the crawl_webpages tool with the MCP server using `server.tool`, providing the name, description, param schema, and handler function.

```typescript
crawlWebpagesToolName,
crawlWebpagesToolDescription,
crawlWebpagesToolParamSchemaRaw,
crawlWebpagesTool
```
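Assembled, the call likely reads as below; the surrounding lines of setup_server.ts are not shown in this reference, so treat this as a sketch of the four-argument `server.tool` overload the note describes:

```typescript
server.tool(
  crawlWebpagesToolName,
  crawlWebpagesToolDescription,
  crawlWebpagesToolParamSchemaRaw,
  crawlWebpagesTool
);
```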
- `src/tools/crawl-webpages.ts:104-106` (registration). Exports the tool name and description constants used for registration.

```typescript
export const crawlWebpagesToolName = "crawl_webpages";
export const crawlWebpagesToolDescription =
  "Crawl a website starting from a URL and explore linked pages. This tool allows systematic collection of content from multiple pages within a domain. Use this for larger data collection tasks, content indexing, or site mapping.";
```