
MCP Server for Crawl4AI

by omgwtfwow

extract_links

Extract and categorize all links from any webpage to build sitemaps, analyze site structure, find broken links, or discover resources by grouping them as internal, external, social media, or documents.

Instructions

[STATELESS] Extract and categorize all page links. Use when: building sitemaps, analyzing site structure, finding broken links, or discovering resources. Groups by internal/external/social/documents. Creates new browser each time. For persistent operations use create_session + crawl.
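As a concrete illustration, the params of a `tools/call` request for this tool might look like the following sketch. The tool name and argument keys come from this page; the surrounding MCP client and transport are assumed and not shown.

```typescript
// Hypothetical tools/call params for extract_links.
// Only `url` is required; `categorize` defaults to true when omitted.
const callParams = {
  name: 'extract_links',
  arguments: {
    url: 'https://example.com',
    categorize: true,
  },
};

console.log(JSON.stringify(callParams));
```

Because the tool is stateless, each such call spins up a fresh browser; for repeated crawls against the same session, the page recommends `create_session` + `crawl` instead.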

Input Schema

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| url | Yes | The URL to extract links from | (none) |
| categorize | No | Group links by type: internal (same domain), external, social media, documents (PDF/DOC), images. Helpful for link analysis | true |
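To make the categorization rules concrete, here is a minimal, self-contained sketch of how a single link could be bucketed. `categorizeLink` is an illustrative helper, not part of the server; the domain and extension lists mirror the ones used in the handler below.

```typescript
// Illustrative helper (not an export of the server): bucket one link the way
// extract_links does, using the page URL to decide internal vs. external.
function categorizeLink(href: string, pageUrl: string): string {
  const socialDomains = ['facebook.com', 'twitter.com', 'linkedin.com', 'instagram.com', 'youtube.com'];
  const docExtensions = ['.pdf', '.doc', '.docx', '.xls', '.xlsx', '.ppt', '.pptx'];
  const imageExtensions = ['.jpg', '.jpeg', '.png', '.gif', '.svg', '.webp'];

  const link = new URL(href, pageUrl); // resolves relative hrefs against the page
  const path = link.pathname.toLowerCase();

  if (socialDomains.some((d) => link.hostname.endsWith(d))) return 'social';
  if (docExtensions.some((ext) => path.endsWith(ext))) return 'documents';
  if (imageExtensions.some((ext) => path.endsWith(ext))) return 'images';
  return link.hostname === new URL(pageUrl).hostname ? 'internal' : 'external';
}

console.log(categorizeLink('/docs/guide.pdf', 'https://example.com/')); // documents
console.log(categorizeLink('https://twitter.com/acme', 'https://example.com/')); // social
```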

Implementation Reference

  • Core handler function that implements extract_links: crawls the given URL, extracts internal/external links via the Crawl4AI API, falls back to manual extraction from markdown when needed, detects JSON endpoints, categorizes links (internal/external/social/documents/images/scripts) when requested, and returns a formatted text list.
```typescript
async extractLinks(options: { url: string; categorize?: boolean }) {
  try {
    // Use crawl endpoint instead of md to get full link data
    const response = await this.axiosClient.post('/crawl', {
      urls: [options.url],
      crawler_config: {
        cache_mode: 'bypass',
      },
    });

    const results = response.data.results || [response.data];
    const result: CrawlResultItem = results[0] || {};

    // Variables for manually extracted links
    let manuallyExtractedInternal: string[] = [];
    let manuallyExtractedExternal: string[] = [];
    let hasManuallyExtractedLinks = false;

    // Check if the response is likely JSON or non-HTML content
    if (!result.links || (result.links.internal.length === 0 && result.links.external.length === 0)) {
      // Try to detect if this might be a JSON endpoint
      const markdownContent = result.markdown?.raw_markdown || result.markdown?.fit_markdown || '';
      const htmlContent = result.html || '';

      // Check for JSON indicators
      if (
        // Check URL pattern
        options.url.includes('/api/') ||
        options.url.includes('/api.') ||
        // Check content type (often shown in markdown conversion)
        markdownContent.includes('application/json') ||
        // Check for JSON structure patterns
        (markdownContent.startsWith('{') && markdownContent.endsWith('}')) ||
        (markdownContent.startsWith('[') && markdownContent.endsWith(']')) ||
        // Check HTML for JSON indicators
        htmlContent.includes('application/json') ||
        // Common JSON patterns
        markdownContent.includes('"links"') ||
        markdownContent.includes('"url"') ||
        markdownContent.includes('"data"')
      ) {
        return {
          content: [
            {
              type: 'text',
              text: `Note: ${options.url} appears to return JSON data rather than HTML. The extract_links tool is designed for HTML pages with <a> tags.
To extract URLs from JSON, you would need to parse the JSON structure directly.`,
            },
          ],
        };
      }

      // If no links found but it's HTML, check the markdown content for href patterns
      if (markdownContent && markdownContent.includes('href=')) {
        // Extract links manually from markdown if the server didn't provide them
        const hrefPattern = /href=["']([^"']+)["']/g;
        const foundLinks: string[] = [];
        let match;
        while ((match = hrefPattern.exec(markdownContent)) !== null) {
          foundLinks.push(match[1]);
        }

        if (foundLinks.length > 0) {
          hasManuallyExtractedLinks = true;
          // Categorize found links
          const currentDomain = new URL(options.url).hostname;
          foundLinks.forEach((link) => {
            try {
              const linkUrl = new URL(link, options.url);
              if (linkUrl.hostname === currentDomain) {
                manuallyExtractedInternal.push(linkUrl.href);
              } else {
                manuallyExtractedExternal.push(linkUrl.href);
              }
            } catch {
              // Relative link
              manuallyExtractedInternal.push(link);
            }
          });
        }
      }
    }

    // Handle both cases: API-provided links and manually extracted links
    let internalUrls: string[] = [];
    let externalUrls: string[] = [];

    if (result.links && (result.links.internal.length > 0 || result.links.external.length > 0)) {
      // Use API-provided links
      internalUrls = result.links.internal.map((link) => (typeof link === 'string' ? link : link.href));
      externalUrls = result.links.external.map((link) => (typeof link === 'string' ? link : link.href));
    } else if (hasManuallyExtractedLinks) {
      // Use manually extracted links
      internalUrls = manuallyExtractedInternal;
      externalUrls = manuallyExtractedExternal;
    }

    const allUrls = [...internalUrls, ...externalUrls];

    // categorize defaults to true; a plain list is returned only when it is explicitly false
    if (!options.categorize) {
      return {
        content: [
          {
            type: 'text',
            text: `All links from ${options.url}:\n${allUrls.join('\n')}`,
          },
        ],
      };
    }

    // Categorize links
    const categorized: Record<string, string[]> = {
      internal: [],
      external: [],
      social: [],
      documents: [],
      images: [],
      scripts: [],
    };

    const socialDomains = ['facebook.com', 'twitter.com', 'linkedin.com', 'instagram.com', 'youtube.com'];
    const docExtensions = ['.pdf', '.doc', '.docx', '.xls', '.xlsx', '.ppt', '.pptx'];
    const imageExtensions = ['.jpg', '.jpeg', '.png', '.gif', '.svg', '.webp'];
    const scriptExtensions = ['.js', '.css'];

    // Categorize internal URLs
    internalUrls.forEach((href: string) => {
      if (docExtensions.some((ext) => href.toLowerCase().endsWith(ext))) {
        categorized.documents.push(href);
      } else if (imageExtensions.some((ext) => href.toLowerCase().endsWith(ext))) {
        categorized.images.push(href);
      } else if (scriptExtensions.some((ext) => href.toLowerCase().endsWith(ext))) {
        categorized.scripts.push(href);
      } else {
        categorized.internal.push(href);
      }
    });

    // Categorize external URLs
    externalUrls.forEach((href: string) => {
      if (socialDomains.some((domain) => href.includes(domain))) {
        categorized.social.push(href);
      } else if (docExtensions.some((ext) => href.toLowerCase().endsWith(ext))) {
        categorized.documents.push(href);
      } else if (imageExtensions.some((ext) => href.toLowerCase().endsWith(ext))) {
        categorized.images.push(href);
      } else if (scriptExtensions.some((ext) => href.toLowerCase().endsWith(ext))) {
        categorized.scripts.push(href);
      } else {
        categorized.external.push(href);
      }
    });

    return {
      content: [
        {
          type: 'text',
          text: `Link analysis for ${options.url}:\n\n${Object.entries(categorized)
            .map(
              ([category, links]: [string, string[]]) =>
                `${category} (${links.length}):\n${links.slice(0, 10).join('\n')}${links.length > 10 ? '\n...' : ''}`,
            )
            .join('\n\n')}`,
        },
      ],
    };
  } catch (error) {
    throw this.formatError(error, 'extract links');
  }
}
```
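The manual fallback above (used when the crawl response carries no link data) can be exercised in isolation. This sketch reproduces the `hrefPattern` loop against a sample string; `extractHrefs` is an illustrative name, not an export of the server.

```typescript
// Reproduces the handler's regex fallback: collect href="..." targets from raw content.
function extractHrefs(content: string): string[] {
  const hrefPattern = /href=["']([^"']+)["']/g;
  const found: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = hrefPattern.exec(content)) !== null) {
    found.push(match[1]);
  }
  return found;
}

const sample = `<a href="/docs">Docs</a> <a href='https://example.org/'>Ext</a>`;
console.log(extractHrefs(sample)); // ['/docs', 'https://example.org/']
```

Note that because the pattern is declared with the `g` flag and exec is called in a loop, `lastIndex` advances through the string until no match remains; both double- and single-quoted attributes are captured.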
  • Zod schema defining input validation for extract_links tool: requires 'url' (valid URL string), optional 'categorize' boolean (defaults to true).
```typescript
export const ExtractLinksSchema = createStatelessSchema(
  z.object({
    url: z.string().url(),
    categorize: z.boolean().optional().default(true),
  }),
  'extract_links',
);
```
  • src/server.ts:862-868 (registration)
    Tool registration in server request handler: matches 'extract_links' tool calls, validates args with ExtractLinksSchema, delegates execution to utilityHandlers.extractLinks.
```typescript
case 'extract_links':
  return await this.validateAndExecute(
    'extract_links',
    args,
    ExtractLinksSchema as z.ZodSchema<z.infer<typeof ExtractLinksSchema>>,
    async (validatedArgs) => this.utilityHandlers.extractLinks(validatedArgs),
  );
```
  • src/server.ts:289-307 (registration)
    Tool metadata registration in listTools response: defines name, description, and inputSchema for client/tool discovery.
```typescript
name: 'extract_links',
description:
  '[STATELESS] Extract and categorize all page links. Use when: building sitemaps, analyzing site structure, finding broken links, or discovering resources. Groups by internal/external/social/documents. Creates new browser each time. For persistent operations use create_session + crawl.',
inputSchema: {
  type: 'object',
  properties: {
    url: {
      type: 'string',
      description: 'The URL to extract links from',
    },
    categorize: {
      type: 'boolean',
      description:
        'Group links by type: internal (same domain), external, social media, documents (PDF/DOC), images. Helpful for link analysis',
      default: true,
    },
  },
  required: ['url'],
},
```
