
MCP Webscan Server

by bsmi021

crawl-site

Scan and extract all unique URLs from a website by recursively crawling from a given URL up to a specified depth. Designed for web content analysis.

Instructions

Recursively crawls a website starting from a given URL up to a specified maximum depth. It follows links within the same origin and returns a list of all unique URLs found during the crawl.

Input Schema

url (required): The starting URL for the crawl. Must be a valid HTTP or HTTPS URL.

maxDepth (optional, default 2): The maximum depth to crawl relative to the starting URL. 0 means only the starting URL is fetched. Max allowed depth is 5 to prevent excessive crawling.
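
For orientation, here is an illustrative call and response. The URLs are made up, and the response shape mirrors the CrawlResult object returned by the service shown in the Implementation Reference below.

    // Illustrative arguments for a crawl-site call (example URLs only).
    const exampleArgs = {
        url: "https://example.com/docs", // required; only same-origin links are followed
        maxDepth: 1                      // optional; 0-5, defaults to 2 when omitted
    };

    // The tool responds with JSON text shaped like the service's CrawlResult:
    const exampleResult = {
        crawled_urls: [
            "https://example.com/docs",
            "https://example.com/docs/getting-started"
        ],
        total_urls: 2
    };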

Implementation Reference

  • The handler function that executes the core logic for the 'crawl-site' tool. It destructures the input arguments, invokes the CrawlSiteService to perform the crawl, formats the JSON result for MCP response, and maps errors to appropriate McpError types.
    const processRequest = async (args: CrawlSiteToolArgs) => {
        // Zod handles default for maxDepth if not provided
        const { url, maxDepth } = args;
        logger.debug(`Received ${TOOL_NAME} request`, { url, maxDepth });
    
        try {
            // Call the service method
            const result = await serviceInstance.crawlWebsite(url, maxDepth);
    
            // Format the successful output for MCP
            return {
                content: [{
                    type: "text" as const,
                    text: JSON.stringify(result, null, 2)
                }]
            };
    
        } catch (error) {
            const logContext = {
                args,
                errorDetails: error instanceof Error ? { name: error.name, message: error.message, stack: error.stack } : String(error)
            };
            logger.error(`Error processing ${TOOL_NAME}`, logContext);
    
            // Map service-specific errors to McpError
            if (error instanceof ValidationError) {
                throw new McpError(ErrorCode.InvalidParams, `Validation failed: ${error.message}`, error.details);
            }
            if (error instanceof ServiceError) {
                throw new McpError(ErrorCode.InternalError, error.message, error.details);
            }
            if (error instanceof McpError) {
                throw error; // Re-throw existing McpErrors
            }
    
            // Catch-all for unexpected errors
            throw new McpError(
                ErrorCode.InternalError,
                error instanceof Error ? `An unexpected error occurred in ${TOOL_NAME}: ${error.message}` : `An unexpected error occurred in ${TOOL_NAME}.`
            );
        }
    };
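
    The ValidationError and ServiceError classes referenced above are not shown on this page. A minimal sketch of what they might look like, assuming each simply carries a message plus an optional details payload (the class names come from the handler; the fields are an assumption):

    // Hypothetical sketches of the custom error classes assumed by the handler above.
    export class ValidationError extends Error {
        constructor(message: string, public readonly details?: unknown) {
            super(message);
            this.name = "ValidationError";
        }
    }

    export class ServiceError extends Error {
        constructor(message: string, public readonly details?: unknown) {
            super(message);
            this.name = "ServiceError";
        }
    }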
  • Zod schema definition for the 'crawl-site' tool inputs (TOOL_PARAMS), along with the tool name and description used during registration.
    export const TOOL_NAME = "crawl-site";
    
    export const TOOL_DESCRIPTION = `Recursively crawls a website starting from a given URL up to a specified maximum depth. It follows links within the same origin and returns a list of all unique URLs found during the crawl.`;
    
    export const TOOL_PARAMS = {
        url: z.string().url().describe("The starting URL for the crawl. Must be a valid HTTP or HTTPS URL."),
        maxDepth: z.number().int().min(0).max(5).optional().default(2).describe("The maximum depth to crawl relative to the starting URL. 0 means only the starting URL is fetched. Max allowed depth is 5 to prevent excessive crawling. Defaults to 2."),
    };
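
    TOOL_PARAMS is a raw Zod shape, passed as-is to server.tool in the registration below. A small illustration of how the declared default and bounds behave, wrapping the shape in z.object() purely for demonstration:

    import { z } from "zod";

    // Illustration only: shows how the default and bounds apply during validation.
    const crawlSiteSchema = z.object(TOOL_PARAMS);

    crawlSiteSchema.parse({ url: "https://example.com" });
    // => { url: "https://example.com", maxDepth: 2 }  (default applied)

    crawlSiteSchema.parse({ url: "https://example.com", maxDepth: 7 });
    // => throws ZodError (maxDepth exceeds the maximum of 5)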
  • Registers the 'crawl-site' tool with the MCP server by calling server.tool() with the name, description, input schema, and handler function.
    server.tool(
        TOOL_NAME,
        TOOL_DESCRIPTION,
        TOOL_PARAMS,
        processRequest
    );
  • Supporting service method implementing the recursive website crawling logic using the crawlPage utility, including input validation, visited URL tracking, result formatting, and error handling.
    public async crawlWebsite(startUrl: string, maxDepth: number): Promise<CrawlResult> {
        // Basic validation
        if (!startUrl || typeof startUrl !== 'string') {
            throw new ValidationError('Invalid input: startUrl string is required.');
        }
        if (typeof maxDepth !== 'number' || maxDepth < 0) {
            throw new ValidationError('Invalid input: maxDepth must be a non-negative number.');
        }
    
        logger.info(`Starting crawl for: ${startUrl} up to depth ${maxDepth}`);
    
        try {
            const visited = new Set<string>();
            // Call the utility function
            const urls = await crawlPage(startUrl, 0, maxDepth, visited);
    
            // Ensure uniqueness (though crawlPage should handle it, belt-and-suspenders)
            const uniqueUrls = Array.from(new Set(urls));
    
            const result: CrawlResult = {
                crawled_urls: uniqueUrls,
                total_urls: uniqueUrls.length,
            };
    
            logger.info(`Finished crawl for ${startUrl}. Found ${result.total_urls} unique URLs.`);
            return result;
    
        } catch (error) {
            // Catch errors specifically from crawlPage or its dependencies (like fetchHtml)
            logger.error(`Error during crawlWebsite execution for ${startUrl}`, { error: error instanceof Error ? error.message : String(error), startUrl, maxDepth });
    
            // Wrap unexpected errors in a ServiceError
            throw new ServiceError(`Crawling failed for ${startUrl}: ${error instanceof Error ? error.message : String(error)}`, error);
        }
    }
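
    The crawlPage utility called above is not shown on this page. Below is a minimal sketch of what such a recursive helper might look like, assuming the fetchHtml dependency mentioned in the error handling and a regex-based link extraction (a real implementation would more likely use an HTML parser); only the signature is confirmed by the call in crawlWebsite.

    // Hypothetical sketch of crawlPage; the signature matches crawlPage(startUrl, 0, maxDepth, visited).
    declare function fetchHtml(url: string): Promise<string>; // assumed helper returning page HTML

    async function crawlPage(
        url: string,
        depth: number,
        maxDepth: number,
        visited: Set<string>
    ): Promise<string[]> {
        if (depth > maxDepth || visited.has(url)) {
            return [];
        }
        visited.add(url);

        const html = await fetchHtml(url);
        const origin = new URL(url).origin;

        // Resolve href values against the current URL and keep same-origin links only.
        const links = Array.from(html.matchAll(/href="([^"#]+)"/g))
            .map((match) => {
                try {
                    return new URL(match[1], url).href;
                } catch {
                    return null;
                }
            })
            .filter((href): href is string => href !== null && href.startsWith(origin));

        const found = [url];
        for (const link of links) {
            found.push(...(await crawlPage(link, depth + 1, maxDepth, visited)));
        }
        return found;
    }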
  • Invocation of the crawlSiteTool registration function within the central registerTools function.
    crawlSiteTool(server);
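
    A hedged sketch of how that central registerTools function might be laid out; only the crawlSiteTool(server) call is confirmed by this page, and the import paths are assumptions:

    import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
    import { crawlSiteTool } from "./tools/crawlSiteTool.js"; // hypothetical path

    export function registerTools(server: McpServer): void {
        crawlSiteTool(server);
        // ...the server's other webscan tools would be registered here in the same way
    }
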
Behavior: 3/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

No annotations are provided, so the description carries the full burden. It discloses key behaviors: recursion, same-origin link following, and returning unique URLs. However, it lacks details on rate limits, timeouts, authentication needs, and error handling, all of which matter for a crawling operation. With no annotations present, there is nothing for the description to contradict.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is two sentences, front-loaded with the core purpose and followed by behavioral details. Every sentence earns its place by explaining the crawling process and output without redundancy or fluff, making it highly efficient and easy to parse.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 3/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's complexity (crawling with recursion and depth limits), no annotations, and no output schema, the description is adequate but incomplete. It covers the basic operation and output format (list of unique URLs), but lacks information on pagination, response structure, or potential side effects, which could hinder an agent's ability to use it effectively in varied contexts.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 100%, so the schema fully documents parameters. The description adds minimal value beyond the schema by mentioning 'starting URL' and 'maximum depth' in context, but does not provide additional semantics like examples or edge cases. With high schema coverage, the baseline is 3, but the description slightly enhances understanding, warranting a 4.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the specific action ('recursively crawls'), resource ('a website'), and scope ('starting from a given URL up to a specified maximum depth'). It distinguishes from siblings by specifying it follows links within the same origin and returns unique URLs, unlike tools like 'fetch-page' (single page) or 'extract-links' (link extraction without crawling).

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 3/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description implies usage for discovering all URLs on a site via crawling, but does not explicitly state when to use this tool versus alternatives like 'check-links' (likely for link validation) or 'generate-site-map' (for structured sitemaps). No exclusions or prerequisites are mentioned, leaving some ambiguity about optimal use cases.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/bsmi021/mcp-server-webscan'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.