by bsmi021

extract-links

Extract and analyze hyperlinks from web pages, organizing URLs, anchor text, and contextual information into a structured format. Supports site mapping, SEO analysis, broken link checking, and targeted crawling preparation. Handles relative and absolute URLs with optional base URL and output limits.

Instructions

Extract and analyze all hyperlinks from a web page, organizing them into a structured format with URLs, anchor text, and contextual information. Performance-optimized with stream processing and worker threads for efficient handling of large pages. Works with either a direct URL or raw HTML content. Handles relative and absolute URLs properly by supporting an optional base URL parameter. Results can be limited to prevent overwhelming output for link-dense pages. Returns a comprehensive link inventory that includes destination URLs, link text, titles (if available), and whether links are internal or external to the source domain. Useful for site mapping, content analysis, broken link checking, SEO analysis, and as a preparatory step for targeted crawling operations.
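As a rough illustration of how a client might invoke this tool, the sketch below builds a JSON-RPC `tools/call` request and parses the result. The helper names `buildExtractLinksRequest` and `parseExtractLinksResult` are hypothetical, and the `LinkResult` shape (`url`, `text`) is assumed from the service code in the Implementation Reference, which returns the links as a JSON string inside a text content item.

```typescript
// Shape of one link entry, assumed from the service code in the
// Implementation Reference (only `url` and `text` appear there).
interface LinkResult {
  url: string;
  text: string;
}

// Minimal JSON-RPC 2.0 request for the MCP `tools/call` method.
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: {
    name: string;
    arguments: { url: string; baseUrl?: string; limit?: number };
  };
}

// Hypothetical helper: builds the request body for an extract-links call.
function buildExtractLinksRequest(
  id: number,
  url: string,
  options: { baseUrl?: string; limit?: number } = {}
): ToolCallRequest {
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name: "extract-links", arguments: { url, ...options } },
  };
}

// Hypothetical helper: the handler serializes the links with JSON.stringify
// into a text content item, so the client has to parse that string back.
function parseExtractLinksResult(result: {
  content: Array<{ type: string; text?: string }>;
}): LinkResult[] {
  const first = result.content.find((item) => item.type === "text");
  if (!first || typeof first.text !== "string") {
    throw new Error("Expected a text content item in the tool result");
  }
  const parsed: unknown = JSON.parse(first.text);
  if (!Array.isArray(parsed)) {
    throw new Error("Expected a JSON array of links");
  }
  return parsed as LinkResult[];
}

// Example: request at most 10 links, then parse a simulated response.
const request = buildExtractLinksRequest(1, "https://example.com/docs", { limit: 10 });
const links = parseExtractLinksResult({
  content: [
    { type: "text", text: JSON.stringify([{ url: "https://example.com/a", text: "A" }]) },
  ],
});
console.log(request.params.name, links.length);
```

Since the payload arrives as a string, a client that expects a structured `links` property would need this extra parsing step.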

Input Schema

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| baseUrl | No | Optional base URL to resolve relative links against. If provided, only links starting with this base URL will be returned. Useful for focusing on internal links. | |
| limit | No | Maximum number of links to return. Defaults to 100. Max allowed is 5000. | 100 |
| url | Yes | The fully qualified URL of the web page from which to extract links. Must be a valid HTTP or HTTPS URL. | |
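The constraints in the table can be approximated without any dependencies, as in the sketch below. The tool itself validates with Zod, so this is only an illustration: `normalizeInput` and `isHttpUrl` are hypothetical names, and restricting both `url` and `baseUrl` to HTTP or HTTPS is an assumption taken from the parameter description rather than from the schema.

```typescript
// Arguments as a caller would supply them (limit may be omitted).
interface ExtractLinksInput {
  url: string;
  baseUrl?: string;
  limit?: number;
}

// Arguments after defaults are applied.
interface NormalizedInput {
  url: string;
  baseUrl?: string;
  limit: number;
}

const DEFAULT_LIMIT = 100; // "Defaults to 100"
const MAX_LIMIT = 5000; // "Max allowed is 5000"

// Assumption: only http: and https: URLs are acceptable.
function isHttpUrl(value: string): boolean {
  try {
    const protocol = new URL(value).protocol;
    return protocol === "http:" || protocol === "https:";
  } catch {
    return false; // new URL throws on strings it cannot parse
  }
}

// Hypothetical helper: checks the documented constraints and fills in the default limit.
function normalizeInput(input: ExtractLinksInput): NormalizedInput {
  if (!isHttpUrl(input.url)) {
    throw new Error("url must be a valid HTTP or HTTPS URL");
  }
  if (input.baseUrl !== undefined && !isHttpUrl(input.baseUrl)) {
    throw new Error("baseUrl must be a valid HTTP or HTTPS URL");
  }
  const limit = input.limit ?? DEFAULT_LIMIT;
  if (!Number.isInteger(limit) || limit < 1 || limit > MAX_LIMIT) {
    throw new Error(`limit must be an integer between 1 and ${MAX_LIMIT}`);
  }
  return input.baseUrl === undefined
    ? { url: input.url, limit }
    : { url: input.url, baseUrl: input.baseUrl, limit };
}

// Example: limit is omitted, so the default applies.
const normalized = normalizeInput({ url: "https://example.com/docs" });
console.log(normalized.limit);
```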

Implementation Reference

  • MCP tool handler that destructures arguments, calls ExtractLinksService.extractLinksFromPage, formats results as JSON in content, and handles various errors by throwing appropriate McpErrors.
    const processRequest = async (args: ExtractLinksToolArgs) => {
      // Zod handles default for limit if not provided
      const { url, baseUrl, limit } = args;
      logger.debug(`Received ${TOOL_NAME} request`, { url, baseUrl, limit });

      try {
        // Call the service method
        const results = await serviceInstance.extractLinksFromPage(url, baseUrl, limit);

        // Format the successful output for MCP
        // Note: The original tool returned a 'links' property alongside content.
        // The standard MCP response only has 'content'. We'll return the JSON string in content.
        // If the 'links' property was specifically needed by the client, this is a breaking change.
        return {
          content: [{ type: "text" as const, text: JSON.stringify(results, null, 2) }]
          // links: results // This is non-standard for MCP tool responses
        };
      } catch (error) {
        const logContext = {
          args,
          errorDetails: error instanceof Error
            ? { name: error.name, message: error.message, stack: error.stack }
            : String(error)
        };
        logger.error(`Error processing ${TOOL_NAME}`, logContext);

        // Map service-specific errors to McpError
        if (error instanceof ValidationError) {
          throw new McpError(ErrorCode.InvalidParams, `Validation failed: ${error.message}`, error.details);
        }
        if (error instanceof ServiceError) {
          throw new McpError(ErrorCode.InternalError, error.message, error.details);
        }
        if (error instanceof McpError) {
          throw error; // Re-throw existing McpErrors
        }

        // Catch-all for unexpected errors
        throw new McpError(
          ErrorCode.InternalError,
          error instanceof Error
            ? `An unexpected error occurred in ${TOOL_NAME}: ${error.message}`
            : `An unexpected error occurred in ${TOOL_NAME}.`
        );
      }
    };
  • Zod schema defining the input parameters for the extract-links tool: url (required), baseUrl (optional), limit (optional with default).
    export const TOOL_PARAMS = {
      url: z.string().url().describe("The fully qualified URL of the web page from which to extract links. Must be a valid HTTP or HTTPS URL."),
      baseUrl: z.string().url().optional().describe("Optional base URL to resolve relative links against. If provided, only links starting with this base URL will be returned. Useful for focusing on internal links."),
      limit: z.number().int().min(1).max(5000).optional().default(100).describe("Maximum number of links to return. Defaults to 100. Max allowed is 5000."),
    };
  • Registration of the extract-links tool on the MCP server using server.tool with name, description, params schema, and handler function.
    server.tool(TOOL_NAME, TOOL_DESCRIPTION, TOOL_PARAMS, processRequest);
  • Core implementation that fetches the page HTML and queries it with cheerio, extracts anchor tags, resolves each href to an absolute URL against the page URL, filters by baseUrl prefix, deduplicates, limits results, and returns a LinkResult array.
    public async extractLinksFromPage(pageUrl: string, baseUrl?: string, limit: number = 100): Promise<LinkResult[]> {
      // Basic validation
      if (!pageUrl || typeof pageUrl !== 'string') {
        throw new ValidationError('Invalid input: pageUrl string is required.');
      }
      if (baseUrl && typeof baseUrl !== 'string') {
        throw new ValidationError('Invalid input: baseUrl must be a string if provided.');
      }
      if (typeof limit !== 'number' || limit <= 0) {
        throw new ValidationError('Invalid input: limit must be a positive number.');
      }

      logger.info(`Starting link extraction for: ${pageUrl}`, { baseUrl, limit });
      const results: LinkResult[] = [];
      const foundUrls = new Set<string>(); // Track unique absolute URLs found

      try {
        const { $ } = await fetchHtml(pageUrl);
        logger.debug(`Successfully fetched HTML for ${pageUrl}`);

        const linkElements = $('a[href]').toArray();
        logger.debug(`Found ${linkElements.length} anchor elements on ${pageUrl}`);

        for (const element of linkElements) {
          if (results.length >= limit) {
            logger.info(`Reached link limit (${limit}) for ${pageUrl}. Stopping extraction.`);
            break; // Stop processing if limit is reached
          }

          const link = $(element);
          const href = link.attr('href');
          const text = link.text().trim() || '[No text]'; // Default text if empty

          // Basic filtering
          if (!href || href.startsWith('#') || href.startsWith('mailto:') || href.startsWith('tel:')) {
            logger.debug(`Skipping invalid or local href: ${href}`);
            continue;
          }

          let absoluteUrl: string;
          try {
            // Resolve URL relative to the page URL
            absoluteUrl = new URL(href, pageUrl).toString();
          } catch (e) {
            logger.warn(`Could not resolve href '${href}' on page ${pageUrl}`, { error: e instanceof Error ? e.message : String(e) });
            // Optionally include invalid hrefs in results if needed, or just skip
            continue;
          }

          // Apply baseUrl filter if provided
          if (baseUrl && !absoluteUrl.startsWith(baseUrl)) {
            logger.debug(`Skipping URL not matching baseUrl: ${absoluteUrl}`);
            continue;
          }

          // Add to results if unique
          if (!foundUrls.has(absoluteUrl)) {
            foundUrls.add(absoluteUrl);
            results.push({ url: absoluteUrl, text: text });
          }
        }
      } catch (fetchError) {
        logger.error(`Failed to fetch or process page ${pageUrl} for link extraction`, { error: fetchError instanceof Error ? fetchError.message : String(fetchError) });
        throw new ServiceError(`Failed to fetch or process page ${pageUrl}: ${fetchError instanceof Error ? fetchError.message : String(fetchError)}`, fetchError);
      }

      logger.info(`Finished link extraction for ${pageUrl}. Found ${results.length} unique links (up to limit ${limit}).`);
      return results;
    }
  • TypeScript interface defining the arguments for the extract-links tool.
    export interface ExtractLinksArgs {
      url: string;
      baseUrl?: string; // Optional base URL for filtering/resolving
      limit: number; // Note: Zod default handles optionality, TS interface needs it
    }
  • Tool name constant used in registration.
    export const TOOL_NAME = "extract-links";
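
The resolve, filter, deduplicate, and limit steps of the core implementation can be tried in isolation with the sketch below. It works on plain `{ href, text }` pairs instead of cheerio elements, and the names `collectLinks` and `RawAnchor` are hypothetical. As in the service code above, relative hrefs are resolved against the page URL and `baseUrl` is applied only as a string prefix filter.

```typescript
interface LinkResult {
  url: string;
  text: string;
}

// Hypothetical input shape standing in for a parsed <a href> element.
interface RawAnchor {
  href: string;
  text: string;
}

function collectLinks(
  anchors: RawAnchor[],
  pageUrl: string,
  baseUrl?: string,
  limit: number = 100
): LinkResult[] {
  const results: LinkResult[] = [];
  const seen = new Set<string>(); // absolute URLs already emitted

  for (const { href, text } of anchors) {
    if (results.length >= limit) break;

    // Skip empty hrefs, in-page fragments, and non-web schemes.
    if (!href || href.startsWith("#") || href.startsWith("mailto:") || href.startsWith("tel:")) {
      continue;
    }

    let absolute: string;
    try {
      // Relative hrefs resolve against the page URL, not against baseUrl.
      absolute = new URL(href, pageUrl).toString();
    } catch {
      continue; // unresolvable href
    }

    // Plain prefix check on the resolved URL.
    if (baseUrl && !absolute.startsWith(baseUrl)) continue;

    if (seen.has(absolute)) continue;
    seen.add(absolute);
    results.push({ url: absolute, text: text.trim() || "[No text]" });
  }
  return results;
}

const anchors: RawAnchor[] = [
  { href: "/docs/intro", text: " Intro " },
  { href: "#top", text: "Top" },
  { href: "guide.html", text: "" },
  { href: "/docs/intro", text: "Intro again" },
  { href: "https://example.com.evil.test/x", text: "Lookalike" },
  { href: "https://other.test/", text: "Other" },
];
const pageUrl = "https://example.com/docs/index.html";

// Without a trailing slash, the prefix also matches the lookalike host.
const loose = collectLinks(anchors, pageUrl, "https://example.com");
// With a trailing slash, only URLs on example.com remain.
const strict = collectLinks(anchors, pageUrl, "https://example.com/");
console.log(loose.length, strict.length);
```

Because the filter is a string prefix match, a `baseUrl` ending in `/` is the safer choice when the goal is to keep only same-host links.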

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/bsmi021/mcp-server-webscan'
