MCP Server for Crawl4AI

by omgwtfwow

parse_sitemap

Extract URLs from XML sitemaps to discover all site pages, plan crawl strategies, or check sitemap validity with optional regex filtering.

Instructions

[STATELESS] Extract URLs from XML sitemaps. Use when: discovering all site pages, planning crawl strategies, or checking sitemap validity. Supports regex filtering. Try sitemap.xml or robots.txt first. Creates new browser each time.

Input Schema

Name | Required | Description | Default
url | Yes | URL of the sitemap (e.g., https://example.com/sitemap.xml) | -
filter_pattern | No | Optional regex pattern to filter URLs | -
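
As an illustration of how these two inputs might be supplied, here is a minimal sketch that builds a tool call request. It assumes the standard MCP JSON-RPC `tools/call` shape; `ParseSitemapArgs` and `buildToolCall` are hypothetical names introduced for this example and are not part of the server.

```typescript
// Hypothetical type mirroring the input schema: url is required, filter_pattern is optional.
interface ParseSitemapArgs {
  url: string;
  filter_pattern?: string;
}

// Hypothetical helper: wraps the arguments in an MCP-style JSON-RPC "tools/call" request.
function buildToolCall(id: number, args: ParseSitemapArgs) {
  return {
    jsonrpc: '2.0' as const,
    id,
    method: 'tools/call' as const,
    params: { name: 'parse_sitemap' as const, arguments: args },
  };
}

// Example: ask only for URLs whose text contains "/blog/".
const request = buildToolCall(1, {
  url: 'https://example.com/sitemap.xml',
  filter_pattern: '/blog/',
});

// Serialized form, as it would be sent to the server.
const wire = JSON.stringify(request);
console.log(wire);
```

Omitting `filter_pattern` leaves the key out of the serialized arguments entirely, which matches its optional status in the schema.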

Implementation Reference

  • Main handler function that implements the parse_sitemap tool. It fetches the sitemap XML directly with axios (not through the Crawl4AI server), extracts URLs from <loc> tags with a regex, applies the optional filter_pattern, and returns a text response that lists at most the first 100 matching URLs plus a count of any remainder.
    async parseSitemap(options: { url: string; filter_pattern?: string }) {
      try {
        // Fetch the sitemap directly (not through Crawl4AI server)
        const axios = (await import('axios')).default;
        const response = await axios.get(options.url, {
          timeout: 30000,
          headers: {
            'User-Agent': 'Mozilla/5.0 (compatible; MCP-Crawl4AI/1.0)',
          },
        });
        const sitemapContent = response.data;
    
        // Parse XML content - simple regex approach for basic sitemaps
        const urlMatches = sitemapContent.match(/<loc>(.*?)<\/loc>/g) || [];
        const urls = urlMatches.map((match: string) => match.replace(/<\/?loc>/g, ''));
    
        // Apply filter if provided
        let filteredUrls = urls;
        if (options.filter_pattern) {
          const filterRegex = new RegExp(options.filter_pattern);
          filteredUrls = urls.filter((url: string) => filterRegex.test(url));
        }
    
        return {
          content: [
            {
              type: 'text',
              text: `Sitemap parsed successfully:\n\nTotal URLs found: ${urls.length}\nFiltered URLs: ${filteredUrls.length}\n\nURLs:\n${filteredUrls.slice(0, 100).join('\n')}${filteredUrls.length > 100 ? '\n... and ' + (filteredUrls.length - 100) + ' more' : ''}`,
            },
          ],
        };
      } catch (error) {
        throw this.formatError(error, 'parse sitemap');
      }
    }
  • Zod schema defining input validation for the parse_sitemap tool: url must be a valid URL string; filter_pattern is an optional string (the schema itself does not check that it compiles as a regex).
    export const ParseSitemapSchema = createStatelessSchema(
      z.object({
        url: z.string().url(),
        filter_pattern: z.string().optional(),
      }),
      'parse_sitemap',
    );
  • src/server.ts:875-878 (registration)
    Tool call handler registration in the switch statement: validates args with ParseSitemapSchema and delegates to crawlHandlers.parseSitemap.
    case 'parse_sitemap':
      return await this.validateAndExecute('parse_sitemap', args, ParseSitemapSchema, async (validatedArgs) =>
        this.crawlHandlers.parseSitemap(validatedArgs),
      );
  • src/server.ts:344-362 (registration)
    Tool metadata registration in listTools response: defines name, description, and inputSchema for parse_sitemap.
    {
      name: 'parse_sitemap',
      description:
        '[STATELESS] Extract URLs from XML sitemaps. Use when: discovering all site pages, planning crawl strategies, or checking sitemap validity. Supports regex filtering. Try sitemap.xml or robots.txt first. Creates new browser each time.',
      inputSchema: {
        type: 'object',
        properties: {
          url: {
            type: 'string',
            description: 'URL of the sitemap (e.g., https://example.com/sitemap.xml)',
          },
          filter_pattern: {
            type: 'string',
            description: 'Optional regex pattern to filter URLs',
          },
        },
        required: ['url'],
      },
    },
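
The extraction and filtering steps of the handler above can be tried in isolation. The following is a minimal sketch that reuses the same two regular expressions on an inline sample; `extractSitemapUrls` and `sampleXml` are hypothetical names for this example, and no network request is made.

```typescript
// Hypothetical helper: same regex approach as the handler, without the HTTP fetch.
function extractSitemapUrls(
  xml: string,
  filterPattern?: string,
): { all: string[]; filtered: string[] } {
  // Collect every <loc>...</loc> element, then strip the surrounding tags.
  const matches = xml.match(/<loc>(.*?)<\/loc>/g) ?? [];
  const all = matches.map((m) => m.replace(/<\/?loc>/g, ''));

  // Compile the filter once; an unanchored test() keeps any URL containing a match.
  const filterRegex = filterPattern ? new RegExp(filterPattern) : undefined;
  const filtered = filterRegex ? all.filter((u) => filterRegex.test(u)) : all;
  return { all, filtered };
}

// Small sample sitemap (made up for illustration).
const sampleXml = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/blog/post-1</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>`;

const result = extractSitemapUrls(sampleXml, '/blog/');
console.log(result.all.length, result.filtered);
```

Because `.` does not match line breaks, a `<loc>` element whose content spans several lines would be skipped by this pattern. Sitemap index files also use `<loc>` for their child sitemaps, so for those the extracted list would contain sitemap URLs rather than page URLs.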
Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries the full burden and does well by disclosing key behavioral traits: the stateless nature ('[STATELESS]'), that it 'Creates new browser each time' (implying isolated execution), and supports 'regex filtering' for customization. It doesn't mention error handling or rate limits, keeping it from a perfect score.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is front-loaded with key information, uses bullet-like phrasing efficiently, and every sentence earns its place by adding distinct value (purpose, usage guidelines, behavioral notes). No wasted words.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's moderate complexity (2 parameters, no output schema, no annotations), the description is largely complete: it covers purpose, usage, key behavior, and hints at parameters. It lacks details on output format or error cases, but for a stateless extraction tool, this is sufficient though not exhaustive.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
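
Since the tool description does not spell out the output format, here is a minimal sketch of the text layout, reconstructed from the template string in the handler shown in the Implementation Reference. `formatSitemapResult` and `MAX_LISTED` are hypothetical names for this example; the cap of 100 comes from the handler's `slice(0, 100)`.

```typescript
// Cap on listed URLs, taken from the handler's slice(0, 100).
const MAX_LISTED = 100;

// Hypothetical helper: rebuilds the handler's response text from its inputs.
function formatSitemapResult(total: number, filtered: string[]): string {
  const listed = filtered.slice(0, MAX_LISTED).join('\n');
  // A trailer line reports how many URLs were left out, if any.
  const remainder =
    filtered.length > MAX_LISTED ? `\n... and ${filtered.length - MAX_LISTED} more` : '';
  return (
    `Sitemap parsed successfully:\n\n` +
    `Total URLs found: ${total}\n` +
    `Filtered URLs: ${filtered.length}\n\n` +
    `URLs:\n${listed}${remainder}`
  );
}

// 103 made-up URLs, so the last three are summarized rather than listed.
const many = Array.from({ length: 103 }, (_, i) => `https://example.com/page-${i + 1}`);
const text = formatSitemapResult(many.length, many);
const lines = text.split('\n');
console.log(lines[lines.length - 1]);
```

A caller that needs every URL from a large sitemap would therefore have to narrow the result with `filter_pattern`, since the text response never lists more than 100 entries.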

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 100%, so the schema already documents both parameters fully. The description adds minimal value beyond the schema by mentioning 'Supports regex filtering' which loosely relates to 'filter_pattern', but doesn't provide additional syntax or format details. Baseline 3 is appropriate when the schema does the heavy lifting.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
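
To make the `filter_pattern` semantics concrete: the handler shown in the Implementation Reference passes the string to `new RegExp(...)` with no flags and keeps URLs for which `test()` is true. The sketch below illustrates what that implies (case-sensitive, unanchored matching). `applyFilter`, `isValidPattern`, and the sample URLs are hypothetical and exist only for this example.

```typescript
// Made-up URLs for illustration.
const urls = [
  'https://example.com/Blog/Intro',
  'https://example.com/blog/post-1',
  'https://example.com/docs/blog-archive',
  'https://example.com/products/42',
];

// Hypothetical helper: applies a pattern the way the handler does (no flags).
function applyFilter(pattern: string, candidates: string[]): string[] {
  const re = new RegExp(pattern);
  return candidates.filter((u) => re.test(u));
}

// Unanchored and case-sensitive: matches "blog" anywhere, but not "Blog".
const loose = applyFilter('blog', urls);

// Anchoring with ^ restricts the match to a path prefix; dots are escaped.
const pathOnly = applyFilter('^https://example\\.com/blog/', urls);

// Anchoring with $ matches URLs that end in a numeric id.
const numeric = applyFilter('/products/\\d+$', urls);

// Hypothetical pre-check: an invalid pattern makes the RegExp constructor throw.
function isValidPattern(pattern: string): boolean {
  try {
    new RegExp(pattern);
    return true;
  } catch {
    return false;
  }
}

console.log(loose.length, pathOnly, numeric, isValidPattern('('));
```

In the handler itself, a pattern that fails to compile would throw inside the `try` block and be reported through the same error path as a failed fetch.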

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the specific action ('Extract URLs from XML sitemaps') and resource ('XML sitemaps'), distinguishing it from siblings like 'extract_links' or 'crawl' by focusing specifically on sitemap parsing rather than general link extraction or crawling.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 5/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description explicitly states when to use the tool ('Use when: discovering all site pages, planning crawl strategies, or checking sitemap validity') and where to start looking for a sitemap ('Try sitemap.xml or robots.txt first'), which gives practical context for applying it.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/omgwtfwow/mcp-crawl4ai-ts'

If you have feedback or need assistance with the MCP directory API, please join our Discord server