
MCP Server for Crawl4AI

by omgwtfwow

extract_links

Extract and categorize all links from a webpage to build sitemaps, analyze site structure, find broken links, or discover resources. Groups links by type for efficient web analysis.

Instructions

[STATELESS] Extract and categorize all page links. Use when: building sitemaps, analyzing site structure, finding broken links, or discovering resources. Groups by internal/external/social/documents. Creates new browser each time. For persistent operations use create_session + crawl.

Input Schema

Name        Required  Default  Description
----        --------  -------  -----------
url         Yes       -        The URL to extract links from
categorize  No        true     Group links by type: internal (same domain), external, social media, documents (PDF/DOC), images. Helpful for link analysis
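
The defaulting behavior described above can be sketched in TypeScript. The `ExtractLinksArgs` type and `withDefaults` helper are illustrative names invented for this example; in the real server, the Zod schema applies the default.

```typescript
// Shape of the tool arguments described in the schema above.
type ExtractLinksArgs = { url: string; categorize?: boolean };

// Apply the documented default: categorize is true unless the caller sets it.
function withDefaults(args: ExtractLinksArgs): { url: string; categorize: boolean } {
  return { url: args.url, categorize: args.categorize ?? true };
}
```

Calling `withDefaults({ url: 'https://example.com' })` yields `categorize: true`, while an explicit `categorize: false` is preserved.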

Implementation Reference

  • Core handler function implementing the extract_links tool logic: crawls the URL, extracts internal and external links, optionally categorizes them further (social, documents, images, scripts), handles JSON and non-HTML responses, and returns a categorized list.
    async extractLinks(options: { url: string; categorize?: boolean }) {
      try {
        // Use crawl endpoint instead of md to get full link data
        const response = await this.axiosClient.post('/crawl', {
          urls: [options.url],
          crawler_config: {
            cache_mode: 'bypass',
          },
        });
    
        const results = response.data.results || [response.data];
        const result: CrawlResultItem = results[0] || {};
    
        // Collected when links are extracted manually as a fallback
        const manuallyExtractedInternal: string[] = [];
        const manuallyExtractedExternal: string[] = [];
        let hasManuallyExtractedLinks = false;
    
        // Check if the response is likely JSON or non-HTML content
        if (!result.links || (result.links.internal.length === 0 && result.links.external.length === 0)) {
          // Try to detect if this might be a JSON endpoint
          const markdownContent = result.markdown?.raw_markdown || result.markdown?.fit_markdown || '';
          const htmlContent = result.html || '';
    
          // Check for JSON indicators
          if (
            // Check URL pattern
            options.url.includes('/api/') ||
            options.url.includes('/api.') ||
            // Check content type (often shown in markdown conversion)
            markdownContent.includes('application/json') ||
            // Check for JSON structure patterns
            (markdownContent.trim().startsWith('{') && markdownContent.trim().endsWith('}')) ||
            (markdownContent.trim().startsWith('[') && markdownContent.trim().endsWith(']')) ||
            // Check HTML for JSON indicators
            htmlContent.includes('application/json') ||
            // Common JSON patterns
            markdownContent.includes('"links"') ||
            markdownContent.includes('"url"') ||
            markdownContent.includes('"data"')
          ) {
            return {
              content: [
                {
                  type: 'text',
                  text: `Note: ${options.url} appears to return JSON data rather than HTML. The extract_links tool is designed for HTML pages with <a> tags. To extract URLs from JSON, you would need to parse the JSON structure directly.`,
                },
              ],
            };
          }
          // If no links found but it's HTML, let's check the markdown content for href patterns
          if (markdownContent && markdownContent.includes('href=')) {
            // Extract links manually from markdown if server didn't provide them
            const hrefPattern = /href=["']([^"']+)["']/g;
            const foundLinks: string[] = [];
            let match;
            while ((match = hrefPattern.exec(markdownContent)) !== null) {
              foundLinks.push(match[1]);
            }
            if (foundLinks.length > 0) {
              hasManuallyExtractedLinks = true;
              // Categorize found links
              const currentDomain = new URL(options.url).hostname;
    
              foundLinks.forEach((link) => {
                try {
                  const linkUrl = new URL(link, options.url);
                  if (linkUrl.hostname === currentDomain) {
                    manuallyExtractedInternal.push(linkUrl.href);
                  } else {
                    manuallyExtractedExternal.push(linkUrl.href);
                  }
                } catch {
                  // Unparseable href (new URL can throw even with a base); keep the raw value
                  manuallyExtractedInternal.push(link);
                }
              });
            }
          }
        }
    
        // Handle both cases: API-provided links and manually extracted links
        let internalUrls: string[] = [];
        let externalUrls: string[] = [];
    
        if (result.links && (result.links.internal.length > 0 || result.links.external.length > 0)) {
          // Use API-provided links
          internalUrls = result.links.internal.map((link) => (typeof link === 'string' ? link : link.href));
          externalUrls = result.links.external.map((link) => (typeof link === 'string' ? link : link.href));
        } else if (hasManuallyExtractedLinks) {
          // Use manually extracted links
          internalUrls = manuallyExtractedInternal;
          externalUrls = manuallyExtractedExternal;
        }
    
        const allUrls = [...internalUrls, ...externalUrls];
    
        if (!options.categorize) {
          return {
            content: [
              {
                type: 'text',
                text: `All links from ${options.url} (${allUrls.length} total):\n\n${allUrls.slice(0, 50).join('\n')}${allUrls.length > 50 ? '\n...' : ''}`,
              },
            ],
          };
        }
    
        // Categorize links
        const categorized: Record<string, string[]> = {
          internal: [],
          external: [],
          social: [],
          documents: [],
          images: [],
          scripts: [],
        };
    
        // Further categorize links
        const socialDomains = ['facebook.com', 'twitter.com', 'linkedin.com', 'instagram.com', 'youtube.com'];
        const docExtensions = ['.pdf', '.doc', '.docx', '.xls', '.xlsx', '.ppt', '.pptx'];
        const imageExtensions = ['.jpg', '.jpeg', '.png', '.gif', '.svg', '.webp'];
        const scriptExtensions = ['.js', '.css'];
    
        // Categorize internal URLs
        internalUrls.forEach((href: string) => {
          if (docExtensions.some((ext) => href.toLowerCase().endsWith(ext))) {
            categorized.documents.push(href);
          } else if (imageExtensions.some((ext) => href.toLowerCase().endsWith(ext))) {
            categorized.images.push(href);
          } else if (scriptExtensions.some((ext) => href.toLowerCase().endsWith(ext))) {
            categorized.scripts.push(href);
          } else {
            categorized.internal.push(href);
          }
        });
    
        // Categorize external URLs
        externalUrls.forEach((href: string) => {
          if (socialDomains.some((domain) => href.includes(domain))) {
            categorized.social.push(href);
          } else if (docExtensions.some((ext) => href.toLowerCase().endsWith(ext))) {
            categorized.documents.push(href);
          } else if (imageExtensions.some((ext) => href.toLowerCase().endsWith(ext))) {
            categorized.images.push(href);
          } else if (scriptExtensions.some((ext) => href.toLowerCase().endsWith(ext))) {
            categorized.scripts.push(href);
          } else {
            categorized.external.push(href);
          }
        });
    
        // categorize is true here; the non-categorized case returned early above
        return {
          content: [
            {
              type: 'text',
              text: `Link analysis for ${options.url}:\n\n${Object.entries(categorized)
                .map(
                  ([category, links]: [string, string[]]) =>
                    `${category} (${links.length}):\n${links.slice(0, 10).join('\n')}${links.length > 10 ? '\n...' : ''}`,
                )
                .join('\n\n')}`,
            },
          ],
        };
      } catch (error) {
        throw this.formatError(error, 'extract links');
      }
    }
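
    As a standalone illustration of the fallback path above (regex href extraction plus internal/external splitting by hostname), here is a minimal sketch. The helper names `extractHrefs` and `splitByDomain` are invented for this example and are not part of the handler.

```typescript
// Pull href values out of raw markup, mirroring the handler's fallback regex.
function extractHrefs(content: string): string[] {
  const hrefPattern = /href=["']([^"']+)["']/g;
  const found: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = hrefPattern.exec(content)) !== null) {
    found.push(match[1]);
  }
  return found;
}

// Split links into internal vs external by comparing hostnames against the
// page's own domain; relative links resolve against the base URL.
function splitByDomain(links: string[], baseUrl: string): { internal: string[]; external: string[] } {
  const internal: string[] = [];
  const external: string[] = [];
  const currentDomain = new URL(baseUrl).hostname;
  for (const link of links) {
    try {
      const linkUrl = new URL(link, baseUrl);
      (linkUrl.hostname === currentDomain ? internal : external).push(linkUrl.href);
    } catch {
      internal.push(link); // unparseable even with a base; keep as-is
    }
  }
  return { internal, external };
}
```

    For example, given `<a href="/about">` and `<a href="https://other.org/x">` on a page at `https://example.com`, the first resolves to an internal link and the second is classified as external.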
  • Zod input schema for extract_links: requires valid url, optional categorize boolean (defaults to true).
    export const ExtractLinksSchema = createStatelessSchema(
      z.object({
        url: z.string().url(),
        categorize: z.boolean().optional().default(true),
      }),
      'extract_links',
    );
  • src/server.ts:862-868 (registration)
    Tool registration in the MCP server: a switch case for 'extract_links' that validates args with ExtractLinksSchema and delegates to utilityHandlers.extractLinks.
    case 'extract_links':
      return await this.validateAndExecute(
        'extract_links',
        args,
        ExtractLinksSchema as z.ZodSchema<z.infer<typeof ExtractLinksSchema>>,
        async (validatedArgs) => this.utilityHandlers.extractLinks(validatedArgs),
      );
  • src/server.ts:289-307 (registration)
    Tool metadata registration in listTools response: defines name, description, and inputSchema for extract_links.
    name: 'extract_links',
    description:
      '[STATELESS] Extract and categorize all page links. Use when: building sitemaps, analyzing site structure, finding broken links, or discovering resources. Groups by internal/external/social/documents. Creates new browser each time. For persistent operations use create_session + crawl.',
    inputSchema: {
      type: 'object',
      properties: {
        url: {
          type: 'string',
          description: 'The URL to extract links from',
        },
        categorize: {
          type: 'boolean',
          description:
            'Group links by type: internal (same domain), external, social media, documents (PDF/DOC), images. Helpful for link analysis',
          default: true,
        },
      },
      required: ['url'],
    },
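
A client invokes this registration through the MCP tools/call method. A sketch of the JSON-RPC request body follows; the id and URL are placeholders.

```typescript
// Hypothetical JSON-RPC 2.0 request an MCP client sends to invoke the tool;
// params.name and params.arguments follow the registration above.
const request = {
  jsonrpc: '2.0',
  id: 1,
  method: 'tools/call',
  params: {
    name: 'extract_links',
    arguments: { url: 'https://example.com', categorize: true },
  },
};

console.log(JSON.stringify(request, null, 2));
```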
