
MCP Server for Crawl4AI

by omgwtfwow

extract_links

Extract and categorize all links from a webpage to build sitemaps, analyze site structure, find broken links, or discover resources. Groups links by type for efficient web analysis.

Instructions

[STATELESS] Extract and categorize all page links. Use when: building sitemaps, analyzing site structure, finding broken links, or discovering resources. Groups by internal/external/social/documents. Creates new browser each time. For persistent operations use create_session + crawl.

Input Schema

| Name | Required | Description | Default |
| ---- | -------- | ----------- | ------- |
| url | Yes | The URL to extract links from | (none) |
| categorize | No | Group links by type: internal (same domain), external, social media, documents (PDF/DOC), images. Helpful for link analysis | true |
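
For example, a minimal tool call with explicit arguments might look like the following sketch, which assumes an already-connected client from the MCP TypeScript SDK (client setup omitted):

    // Invoking extract_links through a hypothetical, already-connected MCP client
    const result = await client.callTool({
      name: 'extract_links',
      arguments: {
        url: 'https://example.com',
        categorize: true, // optional; defaults to true
      },
    });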

Implementation Reference

  • Core handler function implementing the extract_links tool logic: crawls the URL, extracts internal/external links, optionally categorizes them further (social, documents, images, scripts), handles JSON/non-HTML responses, and returns the categorized list.
    async extractLinks(options: { url: string; categorize?: boolean }) {
      try {
        // Use crawl endpoint instead of md to get full link data
        const response = await this.axiosClient.post('/crawl', {
          urls: [options.url],
          crawler_config: {
            cache_mode: 'bypass',
          },
        });
    
        const results = response.data.results || [response.data];
        const result: CrawlResultItem = results[0] || {};
    
        // Accumulators for manually extracted links (fallback when the API returns none)
        const manuallyExtractedInternal: string[] = [];
        const manuallyExtractedExternal: string[] = [];
        let hasManuallyExtractedLinks = false;
    
        // Check if the response is likely JSON or non-HTML content
        if (!result.links || (result.links.internal.length === 0 && result.links.external.length === 0)) {
          // Try to detect if this might be a JSON endpoint
          const markdownContent = result.markdown?.raw_markdown || result.markdown?.fit_markdown || '';
          const htmlContent = result.html || '';
    
          // Check for JSON indicators
          if (
            // Check URL pattern
            options.url.includes('/api/') ||
            options.url.includes('/api.') ||
            // Check content type (often shown in markdown conversion)
            markdownContent.includes('application/json') ||
            // Check for JSON structure patterns
            (markdownContent.startsWith('{') && markdownContent.endsWith('}')) ||
            (markdownContent.startsWith('[') && markdownContent.endsWith(']')) ||
            // Check HTML for JSON indicators
            htmlContent.includes('application/json') ||
            // Common JSON patterns
            markdownContent.includes('"links"') ||
            markdownContent.includes('"url"') ||
            markdownContent.includes('"data"')
          ) {
            return {
              content: [
                {
                  type: 'text',
                  text: `Note: ${options.url} appears to return JSON data rather than HTML. The extract_links tool is designed for HTML pages with <a> tags. To extract URLs from JSON, you would need to parse the JSON structure directly.`,
                },
              ],
            };
          }
          // If no links found but it's HTML, let's check the markdown content for href patterns
          if (markdownContent && markdownContent.includes('href=')) {
            // Extract links manually from markdown if server didn't provide them
            const hrefPattern = /href=["']([^"']+)["']/g;
            const foundLinks: string[] = [];
            let match;
            while ((match = hrefPattern.exec(markdownContent)) !== null) {
              foundLinks.push(match[1]);
            }
            if (foundLinks.length > 0) {
              hasManuallyExtractedLinks = true;
              // Categorize found links
              const currentDomain = new URL(options.url).hostname;
    
              foundLinks.forEach((link) => {
                try {
                  const linkUrl = new URL(link, options.url);
                  if (linkUrl.hostname === currentDomain) {
                    manuallyExtractedInternal.push(linkUrl.href);
                  } else {
                    manuallyExtractedExternal.push(linkUrl.href);
                  }
                } catch {
              // Unparseable href: relative links resolve against the base URL above, so only malformed values land here
                  manuallyExtractedInternal.push(link);
                }
              });
            }
          }
        }
    
        // Handle both cases: API-provided links and manually extracted links
        let internalUrls: string[] = [];
        let externalUrls: string[] = [];
    
        if (result.links && (result.links.internal.length > 0 || result.links.external.length > 0)) {
          // Use API-provided links
          internalUrls = result.links.internal.map((link) => (typeof link === 'string' ? link : link.href));
          externalUrls = result.links.external.map((link) => (typeof link === 'string' ? link : link.href));
        } else if (hasManuallyExtractedLinks) {
          // Use manually extracted links
          internalUrls = manuallyExtractedInternal;
          externalUrls = manuallyExtractedExternal;
        }
    
        const allUrls = [...internalUrls, ...externalUrls];
    
        if (!options.categorize) {
          // Return a simple list without categorization (capped at 50 links)
          return {
            content: [
              {
                type: 'text',
                text: `All links from ${options.url} (${allUrls.length} total):\n\n${allUrls.slice(0, 50).join('\n')}${allUrls.length > 50 ? '\n...' : ''}`,
              },
            ],
          };
        }
    
        // Categorize links
        const categorized: Record<string, string[]> = {
          internal: [],
          external: [],
          social: [],
          documents: [],
          images: [],
          scripts: [],
        };
    
        // Further categorize links
        const socialDomains = ['facebook.com', 'twitter.com', 'linkedin.com', 'instagram.com', 'youtube.com'];
        const docExtensions = ['.pdf', '.doc', '.docx', '.xls', '.xlsx', '.ppt', '.pptx'];
        const imageExtensions = ['.jpg', '.jpeg', '.png', '.gif', '.svg', '.webp'];
        const scriptExtensions = ['.js', '.css'];
    
        // Categorize internal URLs
        internalUrls.forEach((href: string) => {
          if (docExtensions.some((ext) => href.toLowerCase().endsWith(ext))) {
            categorized.documents.push(href);
          } else if (imageExtensions.some((ext) => href.toLowerCase().endsWith(ext))) {
            categorized.images.push(href);
          } else if (scriptExtensions.some((ext) => href.toLowerCase().endsWith(ext))) {
            categorized.scripts.push(href);
          } else {
            categorized.internal.push(href);
          }
        });
    
        // Categorize external URLs
        externalUrls.forEach((href: string) => {
          if (socialDomains.some((domain) => href.includes(domain))) {
            categorized.social.push(href);
          } else if (docExtensions.some((ext) => href.toLowerCase().endsWith(ext))) {
            categorized.documents.push(href);
          } else if (imageExtensions.some((ext) => href.toLowerCase().endsWith(ext))) {
            categorized.images.push(href);
          } else if (scriptExtensions.some((ext) => href.toLowerCase().endsWith(ext))) {
            categorized.scripts.push(href);
          } else {
            categorized.external.push(href);
          }
        });
    
        // options.categorize defaults to true; the non-categorized case returned above
        return {
          content: [
            {
              type: 'text',
              text: `Link analysis for ${options.url}:\n\n${Object.entries(categorized)
                .map(
                  ([category, links]: [string, string[]]) =>
                    `${category} (${links.length}):\n${links.slice(0, 10).join('\n')}${links.length > 10 ? '\n...' : ''}`,
                )
                .join('\n\n')}`,
            },
          ],
        };
      } catch (error) {
        throw this.formatError(error, 'extract links');
      }
    }
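
    The fallback path above relies on new URL(link, options.url) resolving relative hrefs against the page URL. A standalone sketch of that classification logic (a hypothetical helper, not part of the handler):

    // Illustration of the internal/external split used in the manual-extraction fallback
    function classifyLink(link: string, pageUrl: string): 'internal' | 'external' {
      const pageHost = new URL(pageUrl).hostname;
      try {
        // Relative hrefs like '/pricing' resolve against pageUrl here
        const resolved = new URL(link, pageUrl);
        return resolved.hostname === pageHost ? 'internal' : 'external';
      } catch {
        // Only genuinely unparseable values land here; the handler treats them as internal
        return 'internal';
      }
    }

    classifyLink('/pricing', 'https://example.com/docs');              // 'internal'
    classifyLink('https://github.com/example', 'https://example.com'); // 'external'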
  • Zod input schema for extract_links: requires valid url, optional categorize boolean (defaults to true).
    export const ExtractLinksSchema = createStatelessSchema(
      z.object({
        url: z.string().url(),
        categorize: z.boolean().optional().default(true),
      }),
      'extract_links',
    );
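
    Assuming createStatelessSchema returns a standard Zod schema (the z.ZodSchema cast in server.ts below suggests it does), validation follows the usual Zod semantics; a sketch:

    // Expected validation behavior under standard Zod semantics
    ExtractLinksSchema.parse({ url: 'https://example.com' });
    // => { url: 'https://example.com', categorize: true }  (default applied)

    ExtractLinksSchema.parse({ url: 'not-a-url' });
    // => throws ZodError (invalid url)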
  • src/server.ts:862-868 (registration)
    Tool registration in the MCP server: the 'extract_links' switch case validates args with ExtractLinksSchema and delegates to utilityHandlers.extractLinks.
    case 'extract_links':
      return await this.validateAndExecute(
        'extract_links',
        args,
        ExtractLinksSchema as z.ZodSchema<z.infer<typeof ExtractLinksSchema>>,
        async (validatedArgs) => this.utilityHandlers.extractLinks(validatedArgs),
      );
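
    validateAndExecute itself is not shown here; judging only from this call site, a plausible shape would be the following hypothetical reconstruction (not the repo's actual code):

    // Hypothetical: parse args with the schema, then run the handler on the validated result
    private async validateAndExecute<T>(
      toolName: string,
      args: unknown,
      schema: z.ZodSchema<T>,
      handler: (validatedArgs: T) => Promise<unknown>,
    ) {
      try {
        // Reject invalid input before the handler runs
        return await handler(schema.parse(args));
      } catch (error) {
        throw new Error(`${toolName} failed: ${error instanceof Error ? error.message : String(error)}`);
      }
    }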
  • src/server.ts:289-307 (registration)
    Tool metadata registration in listTools response: defines name, description, and inputSchema for extract_links.
    name: 'extract_links',
    description:
      '[STATELESS] Extract and categorize all page links. Use when: building sitemaps, analyzing site structure, finding broken links, or discovering resources. Groups by internal/external/social/documents. Creates new browser each time. For persistent operations use create_session + crawl.',
    inputSchema: {
      type: 'object',
      properties: {
        url: {
          type: 'string',
          description: 'The URL to extract links from',
        },
        categorize: {
          type: 'boolean',
          description:
            'Group links by type: internal (same domain), external, social media, documents (PDF/DOC), images. Helpful for link analysis',
          default: true,
        },
      },
      required: ['url'],
    },
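  • Example categorized output, following the text template in the handler above (URLs and counts are illustrative; each category lists at most 10 links, appends '...' when more exist, and empty categories still appear):

    Link analysis for https://example.com:

    internal (3):
    https://example.com/
    https://example.com/pricing
    https://example.com/docs

    external (1):
    https://github.com/example/repo

    social (1):
    https://twitter.com/example

    documents (0):

    images (1):
    https://example.com/logo.svg

    scripts (0):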
Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries the full burden of behavioral disclosure. It effectively describes key traits: stateless operation ('[STATELESS]'), creation of a new browser each time, and categorization behavior. However, it lacks details on error handling, performance limits, or authentication needs, leaving some gaps for a tool that interacts with external URLs.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is highly concise and well-structured, with each sentence adding value: it states the tool's purpose, usage contexts, categorization details, behavioral note, and alternative. There is no redundant or unnecessary information, making it efficient and front-loaded for quick comprehension.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's moderate complexity (2 parameters, no output schema, no annotations), the description is mostly complete, covering purpose, usage, behavior, and alternatives. However, it lacks output details (e.g., format of categorized links) and doesn't address potential issues like rate limits or URL validity, which could be helpful for an agent.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 100%, so both parameters are already documented thoroughly at the schema level. The tool description adds no parameter semantics beyond that, such as a fuller explanation of 'categorize' or details of URL validation, so it meets the baseline without enhancing understanding.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the verb ('extract and categorize') and resource ('all page links'), distinguishing it from siblings like 'crawl' or 'get_html' by focusing on link extraction rather than broader crawling or content retrieval. It specifies the output format (grouped by internal/external/social/documents), making the purpose explicit and differentiated.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 5/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides explicit usage scenarios ('building sitemaps, analyzing site structure, finding broken links, or discovering resources') and clear alternatives ('For persistent operations use create_session + crawl'), helping the agent decide when to use this tool versus others like 'crawl' or 'manage_session'. It also notes stateless behavior, guiding against use for ongoing tasks.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
