MCP Server for Crawl4AI

by omgwtfwow

extract_links

Extract and categorize all links from a webpage to build sitemaps, analyze site structure, find broken links, or discover resources. Groups links by type for efficient web analysis.

Instructions

[STATELESS] Extract and categorize all page links. Use when: building sitemaps, analyzing site structure, finding broken links, or discovering resources. Groups by internal/external/social/documents. Creates new browser each time. For persistent operations use create_session + crawl.

Input Schema

| Name       | Required | Description                                                                                                                  | Default |
| ---------- | -------- | ---------------------------------------------------------------------------------------------------------------------------- | ------- |
| url        | Yes      | The URL to extract links from                                                                                                  |         |
| categorize | No       | Group links by type: internal (same domain), external, social media, documents (PDF/DOC), images. Helpful for link analysis   | true    |
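
For orientation, here is a minimal sketch of calling this tool from the official TypeScript MCP SDK. The launch command and package invocation are assumptions (adjust them to however you run mcp-crawl4ai-ts in your setup); the tool name and argument shape come from the schema above.

    import { Client } from '@modelcontextprotocol/sdk/client/index.js';
    import { StdioClientTransport } from '@modelcontextprotocol/sdk/client/stdio.js';

    // Launch the server over stdio; command/args are assumptions for this sketch.
    const transport = new StdioClientTransport({ command: 'npx', args: ['mcp-crawl4ai-ts'] });
    const client = new Client({ name: 'example-client', version: '1.0.0' });
    await client.connect(transport);

    // categorize may be omitted; the schema defaults it to true.
    const result = await client.callTool({
      name: 'extract_links',
      arguments: { url: 'https://example.com', categorize: true },
    });
    console.log(result.content); // a single text content block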

Implementation Reference

  • Core handler implementing the extract_links tool logic: crawls the URL, extracts internal and external links, optionally categorizes them further (social, documents, images, scripts), handles JSON and other non-HTML responses, and returns the categorized list.
    async extractLinks(options: { url: string; categorize?: boolean }) {
      try {
        // Use crawl endpoint instead of md to get full link data
        const response = await this.axiosClient.post('/crawl', {
          urls: [options.url],
          crawler_config: {
            cache_mode: 'bypass',
          },
        });

        const results = response.data.results || [response.data];
        const result: CrawlResultItem = results[0] || {};

        // Variables for manually extracted links
        let manuallyExtractedInternal: string[] = [];
        let manuallyExtractedExternal: string[] = [];
        let hasManuallyExtractedLinks = false;

        // Check if the response is likely JSON or non-HTML content
        if (!result.links || (result.links.internal.length === 0 && result.links.external.length === 0)) {
          // Try to detect if this might be a JSON endpoint
          const markdownContent = result.markdown?.raw_markdown || result.markdown?.fit_markdown || '';
          const htmlContent = result.html || '';

          // Check for JSON indicators
          if (
            // Check URL pattern
            options.url.includes('/api/') ||
            options.url.includes('/api.') ||
            // Check content type (often shown in markdown conversion)
            markdownContent.includes('application/json') ||
            // Check for JSON structure patterns
            (markdownContent.startsWith('{') && markdownContent.endsWith('}')) ||
            (markdownContent.startsWith('[') && markdownContent.endsWith(']')) ||
            // Check HTML for JSON indicators
            htmlContent.includes('application/json') ||
            // Common JSON patterns
            markdownContent.includes('"links"') ||
            markdownContent.includes('"url"') ||
            markdownContent.includes('"data"')
          ) {
            return {
              content: [
                {
                  type: 'text',
                  text: `Note: ${options.url} appears to return JSON data rather than HTML. The extract_links tool is designed for HTML pages with <a> tags. To extract URLs from JSON, you would need to parse the JSON structure directly.`,
                },
              ],
            };
          }

          // If no links found but it's HTML, let's check the markdown content for href patterns
          if (markdownContent && markdownContent.includes('href=')) {
            // Extract links manually from markdown if server didn't provide them
            const hrefPattern = /href=["']([^"']+)["']/g;
            const foundLinks: string[] = [];
            let match;
            while ((match = hrefPattern.exec(markdownContent)) !== null) {
              foundLinks.push(match[1]);
            }

            if (foundLinks.length > 0) {
              hasManuallyExtractedLinks = true;
              // Categorize found links
              const currentDomain = new URL(options.url).hostname;
              foundLinks.forEach((link) => {
                try {
                  const linkUrl = new URL(link, options.url);
                  if (linkUrl.hostname === currentDomain) {
                    manuallyExtractedInternal.push(linkUrl.href);
                  } else {
                    manuallyExtractedExternal.push(linkUrl.href);
                  }
                } catch {
                  // Relative link
                  manuallyExtractedInternal.push(link);
                }
              });
            }
          }
        }

        // Handle both cases: API-provided links and manually extracted links
        let internalUrls: string[] = [];
        let externalUrls: string[] = [];

        if (result.links && (result.links.internal.length > 0 || result.links.external.length > 0)) {
          // Use API-provided links
          internalUrls = result.links.internal.map((link) => (typeof link === 'string' ? link : link.href));
          externalUrls = result.links.external.map((link) => (typeof link === 'string' ? link : link.href));
        } else if (hasManuallyExtractedLinks) {
          // Use manually extracted links
          internalUrls = manuallyExtractedInternal;
          externalUrls = manuallyExtractedExternal;
        }

        const allUrls = [...internalUrls, ...externalUrls];

        if (!options.categorize) {
          return {
            content: [
              {
                type: 'text',
                text: `All links from ${options.url}:\n${allUrls.join('\n')}`,
              },
            ],
          };
        }

        // Categorize links
        const categorized: Record<string, string[]> = {
          internal: [],
          external: [],
          social: [],
          documents: [],
          images: [],
          scripts: [],
        };

        // Further categorize links
        const socialDomains = ['facebook.com', 'twitter.com', 'linkedin.com', 'instagram.com', 'youtube.com'];
        const docExtensions = ['.pdf', '.doc', '.docx', '.xls', '.xlsx', '.ppt', '.pptx'];
        const imageExtensions = ['.jpg', '.jpeg', '.png', '.gif', '.svg', '.webp'];
        const scriptExtensions = ['.js', '.css'];

        // Categorize internal URLs
        internalUrls.forEach((href: string) => {
          if (docExtensions.some((ext) => href.toLowerCase().endsWith(ext))) {
            categorized.documents.push(href);
          } else if (imageExtensions.some((ext) => href.toLowerCase().endsWith(ext))) {
            categorized.images.push(href);
          } else if (scriptExtensions.some((ext) => href.toLowerCase().endsWith(ext))) {
            categorized.scripts.push(href);
          } else {
            categorized.internal.push(href);
          }
        });

        // Categorize external URLs
        externalUrls.forEach((href: string) => {
          if (socialDomains.some((domain) => href.includes(domain))) {
            categorized.social.push(href);
          } else if (docExtensions.some((ext) => href.toLowerCase().endsWith(ext))) {
            categorized.documents.push(href);
          } else if (imageExtensions.some((ext) => href.toLowerCase().endsWith(ext))) {
            categorized.images.push(href);
          } else if (scriptExtensions.some((ext) => href.toLowerCase().endsWith(ext))) {
            categorized.scripts.push(href);
          } else {
            categorized.external.push(href);
          }
        });

        // Return based on categorize option (defaults to true)
        if (options.categorize) {
          return {
            content: [
              {
                type: 'text',
                text: `Link analysis for ${options.url}:\n\n${Object.entries(categorized)
                  .map(
                    ([category, links]: [string, string[]]) =>
                      `${category} (${links.length}):\n${links.slice(0, 10).join('\n')}${links.length > 10 ? '\n...' : ''}`,
                  )
                  .join('\n\n')}`,
              },
            ],
          };
        } else {
          // Return simple list without categorization
          const allLinks = [...internalUrls, ...externalUrls];
          return {
            content: [
              {
                type: 'text',
                text: `All links from ${options.url} (${allLinks.length} total):\n\n${allLinks.slice(0, 50).join('\n')}${allLinks.length > 50 ? '\n...' : ''}`,
              },
            ],
          };
        }
      } catch (error) {
        throw this.formatError(error, 'extract links');
      }
    }
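
    Two details of the handler above are easy to miss. First, the final else branch ("simple list without categorization") is unreachable, because the !options.categorize case already returned earlier. Second, the categorization precedence matters: for external links the social-domain check runs before the extension checks, so a PDF hosted on a social domain is reported as social, not documents. A standalone sketch of that precedence (the function name and shape are illustrative, not from the source):

    function bucketFor(href: string, isInternal: boolean): string {
      const socialDomains = ['facebook.com', 'twitter.com', 'linkedin.com', 'instagram.com', 'youtube.com'];
      const docExtensions = ['.pdf', '.doc', '.docx', '.xls', '.xlsx', '.ppt', '.pptx'];
      const imageExtensions = ['.jpg', '.jpeg', '.png', '.gif', '.svg', '.webp'];
      const scriptExtensions = ['.js', '.css'];
      const lower = href.toLowerCase();
      // Social-domain matching applies only to external links, and wins outright.
      if (!isInternal && socialDomains.some((d) => href.includes(d))) return 'social';
      if (docExtensions.some((ext) => lower.endsWith(ext))) return 'documents';
      if (imageExtensions.some((ext) => lower.endsWith(ext))) return 'images';
      if (scriptExtensions.some((ext) => lower.endsWith(ext))) return 'scripts';
      return isInternal ? 'internal' : 'external';
    }

    // bucketFor('https://example.com/report.pdf', true)        -> 'documents'
    // bucketFor('https://twitter.com/files/brand.pdf', false)  -> 'social' (domain check runs first)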
  • Zod input schema for extract_links: requires valid url, optional categorize boolean (defaults to true).
    export const ExtractLinksSchema = createStatelessSchema(
      z.object({
        url: z.string().url(),
        categorize: z.boolean().optional().default(true),
      }),
      'extract_links',
    );
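
    createStatelessSchema is project-specific and not shown on this page. Assuming it preserves the wrapped object shape, the default behaves as in this plain Zod sketch:

    import { z } from 'zod';

    const Inner = z.object({
      url: z.string().url(),
      categorize: z.boolean().optional().default(true),
    });

    Inner.parse({ url: 'https://example.com' });
    // => { url: 'https://example.com', categorize: true }  (default applied when omitted)
    Inner.parse({ url: 'not-a-url' }); // throws a ZodError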
  • src/server.ts:862-868 (registration)
    Tool registration in MCP server: switch case handling 'extract_links' calls, validates args with ExtractLinksSchema, delegates to utilityHandlers.extractLinks
    case 'extract_links':
      return await this.validateAndExecute(
        'extract_links',
        args,
        ExtractLinksSchema as z.ZodSchema<z.infer<typeof ExtractLinksSchema>>,
        async (validatedArgs) => this.utilityHandlers.extractLinks(validatedArgs),
      );
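
    The validateAndExecute helper itself is not shown on this page; based on the description above, a plausible, purely illustrative shape (parse with the Zod schema, then invoke the handler) would be:

    import { z } from 'zod';

    // Hypothetical sketch, not the server's actual code; error handling and
    // reporting in the real implementation may differ.
    async function validateAndExecute<T>(
      toolName: string,
      args: unknown,
      schema: z.ZodSchema<T>,
      handler: (validatedArgs: T) => Promise<unknown>,
    ): Promise<unknown> {
      try {
        const validatedArgs = schema.parse(args); // throws on invalid input
        return await handler(validatedArgs);
      } catch (error) {
        throw new Error(`${toolName}: ${error instanceof Error ? error.message : String(error)}`);
      }
    }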
  • src/server.ts:289-307 (registration)
    Tool metadata registration in listTools response: defines name, description, and inputSchema for extract_links.
    name: 'extract_links',
    description:
      '[STATELESS] Extract and categorize all page links. Use when: building sitemaps, analyzing site structure, finding broken links, or discovering resources. Groups by internal/external/social/documents. Creates new browser each time. For persistent operations use create_session + crawl.',
    inputSchema: {
      type: 'object',
      properties: {
        url: {
          type: 'string',
          description: 'The URL to extract links from',
        },
        categorize: {
          type: 'boolean',
          description:
            'Group links by type: internal (same domain), external, social media, documents (PDF/DOC), images. Helpful for link analysis',
          default: true,
        },
      },
      required: ['url'],
    },
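
    Clients discover this metadata via tools/list. For example, continuing the SDK client sketch from the Input Schema section:

    const { tools } = await client.listTools();
    const extractLinks = tools.find((t) => t.name === 'extract_links');
    console.log(extractLinks?.inputSchema); // the JSON Schema registered above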
