crawl-site
Scan and extract all unique URLs from a website by recursively crawling from a given URL up to a specified depth. Designed for web content analysis.
Instructions
Recursively crawls a website starting from a given URL up to a specified maximum depth. It follows links within the same origin and returns a list of all unique URLs found during the crawl.
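For orientation, here is a minimal sketch of what a call to this tool and its result might look like. The argument names follow the input schema below, and the result fields (`crawled_urls`, `total_urls`) mirror the `CrawlResult` shape shown in the implementation reference; the exact client API you use to issue the call depends on your MCP client setup, so treat this as illustrative only.

```typescript
// Illustrative only: argument and result shapes inferred from the input schema
// and the CrawlResult type shown in the implementation reference below.

interface CrawlSiteArgs {
  url: string;       // required; must be a valid HTTP or HTTPS URL
  maxDepth?: number; // optional integer 0-5; the server defaults it to 2
}

interface CrawlResult {
  crawled_urls: string[]; // unique URLs discovered during the crawl
  total_urls: number;     // crawled_urls.length
}

// Example arguments for a shallow crawl of a hypothetical site.
const args: CrawlSiteArgs = { url: "https://example.com", maxDepth: 1 };

// The tool returns its result as JSON text inside an MCP text content block,
// so a client would typically parse that text back into a CrawlResult.
const exampleResponseText = JSON.stringify(
  { crawled_urls: ["https://example.com/", "https://example.com/about"], total_urls: 2 },
  null,
  2
);
const result: CrawlResult = JSON.parse(exampleResponseText);
console.log(args.url, result.total_urls);
```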
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| maxDepth | No | The maximum depth to crawl relative to the starting URL. 0 means only the starting URL is fetched. Max allowed depth is 5 to prevent excessive crawling. Defaults to 2. | |
| url | Yes | The starting URL for the crawl. Must be a valid HTTP or HTTPS URL. | |
Input Schema (JSON Schema)
```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "additionalProperties": false,
  "properties": {
    "maxDepth": {
      "default": 2,
      "description": "The maximum depth to crawl relative to the starting URL. 0 means only the starting URL is fetched. Max allowed depth is 5 to prevent excessive crawling. Defaults to 2.",
      "maximum": 5,
      "minimum": 0,
      "type": "integer"
    },
    "url": {
      "description": "The starting URL for the crawl. Must be a valid HTTP or HTTPS URL.",
      "format": "uri",
      "type": "string"
    }
  },
  "required": [
    "url"
  ],
  "type": "object"
}
```
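As an illustration of what the schema enforces (a required `url` in URI format, an integer `maxDepth` between 0 and 5 defaulting to 2, and no extra properties), the snippet below checks two sample inputs against an equivalent schema using the third-party `ajv` and `ajv-formats` packages. This is only one way a client might exercise the constraints; the server itself validates with Zod, as shown in the implementation reference.

```typescript
// A sketch that validates sample inputs against the JSON Schema above,
// assuming the third-party "ajv" and "ajv-formats" packages are installed.
import Ajv from "ajv";
import addFormats from "ajv-formats";

const schema = {
  $schema: "http://json-schema.org/draft-07/schema#",
  type: "object",
  additionalProperties: false,
  required: ["url"],
  properties: {
    url: { type: "string", format: "uri" },
    maxDepth: { type: "integer", minimum: 0, maximum: 5, default: 2 },
  },
};

const ajv = new Ajv({ useDefaults: true }); // useDefaults fills in maxDepth: 2
addFormats(ajv);                            // enables the "uri" format check
const validate = ajv.compile(schema);

const ok: Record<string, unknown> = { url: "https://example.com" };
console.log(validate(ok), ok);                   // true, { url: ..., maxDepth: 2 }

const tooDeep = { url: "https://example.com", maxDepth: 6 };
console.log(validate(tooDeep), validate.errors); // false, maximum of 5 violated
```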
Implementation Reference
- `src/tools/crawlSiteTool.ts:27-68` (handler). The handler function that executes the core logic for the 'crawl-site' tool. It destructures the input arguments, invokes the CrawlSiteService to perform the crawl, formats the JSON result for the MCP response, and maps errors to appropriate McpError types.

  ```typescript
  const processRequest = async (args: CrawlSiteToolArgs) => {
    // Zod handles default for maxDepth if not provided
    const { url, maxDepth } = args;
    logger.debug(`Received ${TOOL_NAME} request`, { url, maxDepth });
    try {
      // Call the service method
      const result = await serviceInstance.crawlWebsite(url, maxDepth);
      // Format the successful output for MCP
      return {
        content: [{ type: "text" as const, text: JSON.stringify(result, null, 2) }]
      };
    } catch (error) {
      const logContext = {
        args,
        errorDetails: error instanceof Error
          ? { name: error.name, message: error.message, stack: error.stack }
          : String(error)
      };
      logger.error(`Error processing ${TOOL_NAME}`, logContext);

      // Map service-specific errors to McpError
      if (error instanceof ValidationError) {
        throw new McpError(ErrorCode.InvalidParams, `Validation failed: ${error.message}`, error.details);
      }
      if (error instanceof ServiceError) {
        throw new McpError(ErrorCode.InternalError, error.message, error.details);
      }
      if (error instanceof McpError) {
        throw error; // Re-throw existing McpErrors
      }
      // Catch-all for unexpected errors
      throw new McpError(
        ErrorCode.InternalError,
        error instanceof Error
          ? `An unexpected error occurred in ${TOOL_NAME}: ${error.message}`
          : `An unexpected error occurred in ${TOOL_NAME}.`
      );
    }
  };
  ```
- Zod schema definition for the 'crawl-site' tool inputs (TOOL_PARAMS), along with the tool name and description used during registration.

  ```typescript
  export const TOOL_NAME = "crawl-site";

  export const TOOL_DESCRIPTION = `Recursively crawls a website starting from a given URL up to a specified maximum depth. It follows links within the same origin and returns a list of all unique URLs found during the crawl.`;

  export const TOOL_PARAMS = {
    url: z.string().url().describe("The starting URL for the crawl. Must be a valid HTTP or HTTPS URL."),
    maxDepth: z.number().int().min(0).max(5).optional().default(2).describe("The maximum depth to crawl relative to the starting URL. 0 means only the starting URL is fetched. Max allowed depth is 5 to prevent excessive crawling. Defaults to 2."),
  };
  ```
- `src/tools/crawlSiteTool.ts:71-76` (registration). Registers the 'crawl-site' tool with the MCP server by calling server.tool() with the name, description, input schema, and handler function.

  ```typescript
  server.tool(
    TOOL_NAME,
    TOOL_DESCRIPTION,
    TOOL_PARAMS,
    processRequest
  );
  ```
- Supporting service method implementing the recursive website crawling logic using the crawlPage utility, including input validation, visited URL tracking, result formatting, and error handling (a hedged sketch of such a utility follows this list).

  ```typescript
  public async crawlWebsite(startUrl: string, maxDepth: number): Promise<CrawlResult> {
    // Basic validation
    if (!startUrl || typeof startUrl !== 'string') {
      throw new ValidationError('Invalid input: startUrl string is required.');
    }
    if (typeof maxDepth !== 'number' || maxDepth < 0) {
      throw new ValidationError('Invalid input: maxDepth must be a non-negative number.');
    }

    logger.info(`Starting crawl for: ${startUrl} up to depth ${maxDepth}`);

    try {
      const visited = new Set<string>();
      // Call the utility function
      const urls = await crawlPage(startUrl, 0, maxDepth, visited);

      // Ensure uniqueness (though crawlPage should handle it, belt-and-suspenders)
      const uniqueUrls = Array.from(new Set(urls));

      const result: CrawlResult = {
        crawled_urls: uniqueUrls,
        total_urls: uniqueUrls.length,
      };
      logger.info(`Finished crawl for ${startUrl}. Found ${result.total_urls} unique URLs.`);
      return result;
    } catch (error) {
      // Catch errors specifically from crawlPage or its dependencies (like fetchHtml)
      logger.error(`Error during crawlWebsite execution for ${startUrl}`, {
        error: error instanceof Error ? error.message : String(error),
        startUrl,
        maxDepth
      });
      // Wrap unexpected errors in a ServiceError
      throw new ServiceError(`Crawling failed for ${startUrl}: ${error instanceof Error ? error.message : String(error)}`, error);
    }
  }
  ```
- `src/tools/index.ts:29-29` (registration). Invocation of the crawlSiteTool registration function within the central registerTools function.

  ```typescript
  crawlSiteTool(server);
  ```
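The crawlPage utility that crawlWebsite delegates to is not reproduced in this reference. As a rough, hypothetical sketch of the behaviour described above (same-origin link following, a shared visited set, and a depth limit), it might look something like the code below; the actual utility's name, signature, HTML parsing, and error handling may differ.

```typescript
// Hypothetical sketch only: the name and signature are assumptions modelled on
// the call site crawlPage(startUrl, 0, maxDepth, visited) in crawlWebsite.
async function crawlPage(
  url: string,
  depth: number,
  maxDepth: number,
  visited: Set<string>
): Promise<string[]> {
  // Stop if we have passed the depth limit or already visited this URL.
  if (depth > maxDepth || visited.has(url)) return [];
  visited.add(url);

  let html: string;
  try {
    // The real implementation reportedly uses a fetchHtml helper; global fetch
    // (Node 18+) stands in for it here.
    const res = await fetch(url);
    if (!res.ok || !(res.headers.get("content-type") ?? "").includes("text/html")) {
      return [url];
    }
    html = await res.text();
  } catch {
    // Skip unreachable pages instead of failing the whole crawl.
    return [url];
  }

  // Naive href extraction; a real crawler would use a proper HTML parser.
  const origin = new URL(url).origin;
  const links = [...html.matchAll(/href="([^"#]+)"/g)]
    .map((m) => {
      try {
        return new URL(m[1], url).toString(); // resolve relative links
      } catch {
        return null;
      }
    })
    .filter((link): link is string => link !== null && link.startsWith(origin));

  // Recurse into same-origin links one level deeper.
  const found = [url];
  for (const link of links) {
    found.push(...(await crawlPage(link, depth + 1, maxDepth, visited)));
  }
  return found;
}
```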