
MCP Webscan Server

by bsmi021

extract-links

Extract and analyze hyperlinks from web pages, organizing URLs, anchor text, and contextual information into a structured format. Supports site mapping, SEO analysis, broken link checking, and targeted crawling preparation. Handles relative and absolute URLs with optional base URL and output limits.

Instructions

Extract and analyze all hyperlinks from a web page, organizing them into a structured format with URLs, anchor text, and contextual information. Performance-optimized with stream processing and worker threads for efficient handling of large pages. Works with either a direct URL or raw HTML content. Handles relative and absolute URLs properly by supporting an optional base URL parameter. Results can be limited to prevent overwhelming output for link-dense pages. Returns a comprehensive link inventory that includes destination URLs, link text, titles (if available), and whether links are internal or external to the source domain. Useful for site mapping, content analysis, broken link checking, SEO analysis, and as a preparatory step for targeted crawling operations.

Input Schema

url (required): The fully qualified URL of the web page from which to extract links. Must be a valid HTTP or HTTPS URL.
baseUrl (optional): Base URL to resolve relative links against. If provided, only links starting with this base URL will be returned. Useful for focusing on internal links.
limit (optional, default 100, max 5000): Maximum number of links to return.
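For reference, a call to this tool might pass arguments like the following. This is a sketch based only on the parameter names and constraints in the schema above; the surrounding call syntax depends on the MCP client in use.

```typescript
// Hypothetical argument object for an extract-links call.
// Field names and constraints follow the input schema above.
const args = {
    url: "https://example.com/docs",   // required, valid HTTP/HTTPS URL
    baseUrl: "https://example.com",    // optional internal-link filter
    limit: 50,                         // optional, 1-5000 (default 100)
};
console.log(JSON.stringify(args, null, 2));
```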

Implementation Reference

  • MCP tool handler that destructures arguments, calls ExtractLinksService.extractLinksFromPage, formats results as JSON in content, and handles various errors by throwing appropriate McpErrors.
    const processRequest = async (args: ExtractLinksToolArgs) => {
        // Zod handles default for limit if not provided
        const { url, baseUrl, limit } = args;
        logger.debug(`Received ${TOOL_NAME} request`, { url, baseUrl, limit });
    
        try {
            // Call the service method
            const results = await serviceInstance.extractLinksFromPage(url, baseUrl, limit);
    
            // Format the successful output for MCP
            // Note: The original tool returned a 'links' property alongside content.
            // The standard MCP response only has 'content'. We'll return the JSON string in content.
            // If the 'links' property was specifically needed by the client, this is a breaking change.
            return {
                content: [{
                    type: "text" as const,
                    text: JSON.stringify(results, null, 2)
                }]
                // links: results // This is non-standard for MCP tool responses
            };
    
        } catch (error) {
            const logContext = {
                args,
                errorDetails: error instanceof Error ? { name: error.name, message: error.message, stack: error.stack } : String(error)
            };
            logger.error(`Error processing ${TOOL_NAME}`, logContext);
    
            // Map service-specific errors to McpError
            if (error instanceof ValidationError) {
                throw new McpError(ErrorCode.InvalidParams, `Validation failed: ${error.message}`, error.details);
            }
            if (error instanceof ServiceError) {
                throw new McpError(ErrorCode.InternalError, error.message, error.details);
            }
            if (error instanceof McpError) {
                throw error; // Re-throw existing McpErrors
            }
    
            // Catch-all for unexpected errors
            throw new McpError(
                ErrorCode.InternalError,
                error instanceof Error ? `An unexpected error occurred in ${TOOL_NAME}: ${error.message}` : `An unexpected error occurred in ${TOOL_NAME}.`
            );
        }
    };
  • Zod schema defining the input parameters for the extract-links tool: url (required), baseUrl (optional), limit (optional with default).
    export const TOOL_PARAMS = {
        url: z.string().url().describe("The fully qualified URL of the web page from which to extract links. Must be a valid HTTP or HTTPS URL."),
        baseUrl: z.string().url().optional().describe("Optional base URL to resolve relative links against. If provided, only links starting with this base URL will be returned. Useful for focusing on internal links."),
        limit: z.number().int().min(1).max(5000).optional().default(100).describe("Maximum number of links to return. Defaults to 100. Max allowed is 5000."),
    };
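The bounds and default that Zod applies to `limit` can be mirrored in plain TypeScript for illustration. This is a sketch of the validation behavior only, not the actual Zod runtime:

```typescript
// Sketch of what z.number().int().min(1).max(5000).default(100) enforces.
function normalizeLimit(limit?: number): number {
    if (limit === undefined) return 100; // .default(100) fills in a missing value
    if (!Number.isInteger(limit)) throw new Error("limit must be an integer");
    if (limit < 1 || limit > 5000) throw new Error("limit must be between 1 and 5000");
    return limit;
}

console.log(normalizeLimit());    // 100
console.log(normalizeLimit(500)); // 500
```

Because the default is applied at parse time, the handler always receives a concrete number for `limit`.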
  • Registration of the extract-links tool on the MCP server using server.tool with name, description, params schema, and handler function.
    server.tool(
        TOOL_NAME,
        TOOL_DESCRIPTION,
        TOOL_PARAMS,
        processRequest
    );
  • Core implementation that fetches the page HTML, parses it with cheerio, extracts anchor tags, resolves absolute URLs, filters by baseUrl, deduplicates, limits results, and returns a LinkResult array.
    public async extractLinksFromPage(pageUrl: string, baseUrl?: string, limit: number = 100): Promise<LinkResult[]> {
        // Basic validation
        if (!pageUrl || typeof pageUrl !== 'string') {
            throw new ValidationError('Invalid input: pageUrl string is required.');
        }
        if (baseUrl && typeof baseUrl !== 'string') {
            throw new ValidationError('Invalid input: baseUrl must be a string if provided.');
        }
        if (typeof limit !== 'number' || limit <= 0) {
            throw new ValidationError('Invalid input: limit must be a positive number.');
        }
    
        logger.info(`Starting link extraction for: ${pageUrl}`, { baseUrl, limit });
    
        const results: LinkResult[] = [];
        const foundUrls = new Set<string>(); // Track unique absolute URLs found
    
        try {
            const { $ } = await fetchHtml(pageUrl);
            logger.debug(`Successfully fetched HTML for ${pageUrl}`);
    
            const linkElements = $('a[href]').toArray();
            logger.debug(`Found ${linkElements.length} anchor elements on ${pageUrl}`);
    
            for (const element of linkElements) {
                if (results.length >= limit) {
                    logger.info(`Reached link limit (${limit}) for ${pageUrl}. Stopping extraction.`);
                    break; // Stop processing if limit is reached
                }
    
                const link = $(element);
                const href = link.attr('href');
                const text = link.text().trim() || '[No text]'; // Default text if empty
    
                // Basic filtering
                if (!href || href.startsWith('#') || href.startsWith('mailto:') || href.startsWith('tel:')) {
                    logger.debug(`Skipping invalid or local href: ${href}`);
                    continue;
                }
    
                let absoluteUrl: string;
                try {
                    // Resolve URL relative to the page URL
                    absoluteUrl = new URL(href, pageUrl).toString();
                } catch (e) {
                    logger.warn(`Could not resolve href '${href}' on page ${pageUrl}`, { error: e instanceof Error ? e.message : String(e) });
                    // Optionally include invalid hrefs in results if needed, or just skip
                    continue;
                }
    
                // Apply baseUrl filter if provided
                if (baseUrl && !absoluteUrl.startsWith(baseUrl)) {
                    logger.debug(`Skipping URL not matching baseUrl: ${absoluteUrl}`);
                    continue;
                }
    
                // Add to results if unique
                if (!foundUrls.has(absoluteUrl)) {
                    foundUrls.add(absoluteUrl);
                    results.push({ url: absoluteUrl, text: text });
                }
            }
    
        } catch (fetchError) {
            logger.error(`Failed to fetch or process page ${pageUrl} for link extraction`, { error: fetchError instanceof Error ? fetchError.message : String(fetchError) });
            throw new ServiceError(`Failed to fetch or process page ${pageUrl}: ${fetchError instanceof Error ? fetchError.message : String(fetchError)}`, fetchError);
        }
    
        logger.info(`Finished link extraction for ${pageUrl}. Found ${results.length} unique links (up to limit ${limit}).`);
        return results;
    }
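The resolution and filtering steps above rely on the WHATWG `URL` constructor, which is available globally in Node.js. A small standalone illustration of how relative hrefs resolve against the page URL and how the `baseUrl` prefix check then filters them (sample URLs are invented for the demo):

```typescript
// How `new URL(href, pageUrl)` resolves the hrefs the service encounters.
const pageUrl = "https://example.com/blog/post";
const hrefs = ["/about", "images/pic.png", "https://other.org/x", "../archive"];

const resolved = hrefs.map((href) => new URL(href, pageUrl).toString());
console.log(resolved);

// The baseUrl filter in the service is a plain prefix check:
const baseUrl = "https://example.com";
const internal = resolved.filter((u) => u.startsWith(baseUrl));
console.log(internal); // the three example.com URLs; other.org is dropped
```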
  • TypeScript interface defining the arguments for the extract-links tool.
    export interface ExtractLinksToolArgs {
        url: string;
        baseUrl?: string; // Optional base URL used for resolving and filtering links
        limit: number;    // Required here: Zod's .default(100) guarantees a value after parsing
    }
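The `LinkResult` type returned by the service is not reproduced in this reference. Judging from the `results.push({ url: absoluteUrl, text: text })` call in the implementation above, a plausible minimal shape is the following sketch; the actual interface may carry extra fields (e.g. title or an internal/external flag):

```typescript
// Assumed shape of LinkResult, inferred from the implementation above.
interface LinkResult {
    url: string;  // absolute, resolved URL
    text: string; // anchor text, or "[No text]" when the anchor is empty
}

const example: LinkResult = { url: "https://example.com/about", text: "About us" };
console.log(example);
```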
  • Tool name constant used in registration.
    export const TOOL_NAME = "extract-links";
Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries the full burden of behavioral disclosure. It effectively describes key traits: performance optimization ('stream processing and worker threads for efficient handling of large pages'), input flexibility ('works with either a direct URL or raw HTML content'), URL handling ('handles relative and absolute URLs properly'), output control ('results can be limited to prevent overwhelming output'), and return content ('returns a comprehensive link inventory'). However, it does not mention error handling, rate limits, or authentication needs, leaving some gaps.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is appropriately sized and front-loaded, starting with the core purpose and key features. Most sentences add value, such as performance optimization and use cases, though some details (e.g., 'stream processing and worker threads') could be slightly condensed. Overall, it is efficient with minimal waste.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's moderate complexity (3 parameters, no output schema, no annotations), the description is largely complete. It covers purpose, behavioral traits, and usage context effectively. However, without an output schema, it should ideally describe the return format more explicitly (e.g., structure of the 'comprehensive link inventory'), though it does mention included elements like URLs, anchor text, and internal/external classification.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 100%, providing detailed documentation for all parameters (url, baseUrl, limit). The description adds marginal value by mentioning the optional base URL parameter for focusing on internal links and limiting results for link-dense pages, but does not provide additional syntax, format, or usage details beyond what the schema already covers. This meets the baseline of 3 when schema coverage is high.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the specific action ('extract and analyze all hyperlinks from a web page') and resource ('web page'), distinguishing it from siblings like 'check-links' (which likely verifies link status) or 'crawl-site' (which follows links recursively). It explicitly mentions organizing links into a structured format with URLs, anchor text, and contextual information, making the purpose distinct and well-defined.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides clear context for when to use this tool ('useful for site mapping, content analysis, broken link checking, SEO analysis, and as a preparatory step for targeted crawling operations'), but does not explicitly state when not to use it or name specific alternatives among siblings. It implies usage for extracting links from a single page rather than crawling multiple pages, but lacks direct comparisons to tools like 'crawl-site' or 'generate-site-map'.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
