
MCP Server for Crawl4AI

by omgwtfwow

parse_sitemap

Extract URLs from XML sitemaps, with optional regex filtering, to discover a site's pages, plan crawl strategies, or check sitemap validity.

Instructions

[STATELESS] Extract URLs from XML sitemaps. Use when: discovering all site pages, planning crawl strategies, or checking sitemap validity. Supports regex filtering. Try sitemap.xml or robots.txt first. Creates new browser each time.
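
Concretely, a client invokes the tool with a JSON-RPC tools/call request that names the tool and passes arguments matching the input schema below. A minimal sketch of the request payload; the id, URL, and filter values are illustrative:

    // Shape of an MCP tools/call request for this tool (JSON-RPC 2.0).
    // The id, URL, and filter_pattern values are illustrative.
    const request = {
      jsonrpc: '2.0',
      id: 1,
      method: 'tools/call',
      params: {
        name: 'parse_sitemap',
        arguments: {
          url: 'https://example.com/sitemap.xml',
          filter_pattern: '/blog/', // keep only URLs containing /blog/
        },
      },
    };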

Input Schema

Name            Required  Description                                                  Default
url             Yes       URL of the sitemap (e.g., https://example.com/sitemap.xml)  (none)
filter_pattern  No        Optional regex pattern to filter URLs                        (none)

Implementation Reference

  • Main handler function that implements the parse_sitemap tool. It fetches the sitemap XML directly with axios, extracts URLs from <loc> tags using a regex, applies the optional regex filter, limits output to the first 100 URLs, and returns a formatted text response. A standalone sketch of the extraction logic follows the code below.
    async parseSitemap(options: { url: string; filter_pattern?: string }) {
      try {
        // Fetch the sitemap directly (not through Crawl4AI server)
        const axios = (await import('axios')).default;
        const response = await axios.get(options.url, {
          timeout: 30000,
          headers: {
            'User-Agent': 'Mozilla/5.0 (compatible; MCP-Crawl4AI/1.0)',
          },
        });
        const sitemapContent = response.data;

        // Parse XML content - simple regex approach for basic sitemaps
        const urlMatches = sitemapContent.match(/<loc>(.*?)<\/loc>/g) || [];
        const urls = urlMatches.map((match: string) => match.replace(/<\/?loc>/g, ''));

        // Apply filter if provided
        let filteredUrls = urls;
        if (options.filter_pattern) {
          const filterRegex = new RegExp(options.filter_pattern);
          filteredUrls = urls.filter((url: string) => filterRegex.test(url));
        }

        return {
          content: [
            {
              type: 'text',
              text: `Sitemap parsed successfully:\n\nTotal URLs found: ${urls.length}\nFiltered URLs: ${filteredUrls.length}\n\nURLs:\n${filteredUrls.slice(0, 100).join('\n')}${filteredUrls.length > 100 ? '\n... and ' + (filteredUrls.length - 100) + ' more' : ''}`,
            },
          ],
        };
      } catch (error) {
        throw this.formatError(error, 'parse sitemap');
      }
    }
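
    As referenced above, the <loc> extraction and regex filtering can be exercised in isolation. A minimal, self-contained sketch; the sample XML and the /blog/ filter are illustrative:

    // Standalone demonstration of the same <loc> extraction and regex
    // filtering used by parseSitemap; the sample XML is illustrative.
    const sample = `<?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url><loc>https://example.com/</loc></url>
      <url><loc>https://example.com/blog/first-post</loc></url>
      <url><loc>https://example.com/about</loc></url>
    </urlset>`;

    const urls = (sample.match(/<loc>(.*?)<\/loc>/g) || []).map((m) => m.replace(/<\/?loc>/g, ''));
    const blogOnly = urls.filter((u) => /\/blog\//.test(u));

    console.log(urls.length); // 3
    console.log(blogOnly);    // [ 'https://example.com/blog/first-post' ]

    Note that because the regex matches any <loc> element, pointing the tool at a sitemap index (<sitemapindex>) returns the child sitemap URLs rather than page URLs; nested sitemaps are not fetched recursively.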
  • Zod schema defining input validation for the parse_sitemap tool: url must be a valid URL string, and filter_pattern is an optional regex string. A short validation example follows the code below.
    export const ParseSitemapSchema = createStatelessSchema(
      z.object({
        url: z.string().url(),
        filter_pattern: z.string().optional(),
      }),
      'parse_sitemap',
    );
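
    As referenced above, the validation behavior can be illustrated with the inner object schema alone; this assumes createStatelessSchema (project-specific) only wraps it with stateless-tool handling:

    import { z } from 'zod';

    // The inner input schema, restated for illustration.
    const inputSchema = z.object({
      url: z.string().url(),
      filter_pattern: z.string().optional(),
    });

    // A well-formed input parses; filter_pattern may be omitted.
    inputSchema.parse({ url: 'https://example.com/sitemap.xml' });

    // A non-URL string is rejected by z.string().url().
    const result = inputSchema.safeParse({ url: 'not-a-url' });
    console.log(result.success); // false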
  • src/server.ts:875-878 (registration)
    Tool call handler registration in the switch statement: validates args with ParseSitemapSchema and delegates to crawlHandlers.parseSitemap.
    case 'parse_sitemap':
      return await this.validateAndExecute(
        'parse_sitemap',
        args,
        ParseSitemapSchema,
        async (validatedArgs) => this.crawlHandlers.parseSitemap(validatedArgs),
      );
  • src/server.ts:344-362 (registration)
    Tool metadata registration in listTools response: defines name, description, and inputSchema for parse_sitemap.
    {
      name: 'parse_sitemap',
      description:
        '[STATELESS] Extract URLs from XML sitemaps. Use when: discovering all site pages, planning crawl strategies, or checking sitemap validity. Supports regex filtering. Try sitemap.xml or robots.txt first. Creates new browser each time.',
      inputSchema: {
        type: 'object',
        properties: {
          url: {
            type: 'string',
            description: 'URL of the sitemap (e.g., https://example.com/sitemap.xml)',
          },
          filter_pattern: {
            type: 'string',
            description: 'Optional regex pattern to filter URLs',
          },
        },
        required: ['url'],
      },
    },
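
    From the client's perspective, a tool registered this way can be invoked through the official MCP TypeScript SDK. A hedged sketch, assuming a stdio launch; the command, client name, and argument values are placeholders:

    import { Client } from '@modelcontextprotocol/sdk/client/index.js';
    import { StdioClientTransport } from '@modelcontextprotocol/sdk/client/stdio.js';

    // Placeholder launch command; use however you normally start this server.
    const transport = new StdioClientTransport({
      command: 'npx',
      args: ['mcp-crawl4ai-ts'],
    });

    const client = new Client({ name: 'example-client', version: '1.0.0' });
    await client.connect(transport);

    // Invoke parse_sitemap with arguments matching the inputSchema above.
    const result = await client.callTool({
      name: 'parse_sitemap',
      arguments: {
        url: 'https://example.com/sitemap.xml',
        filter_pattern: '^https://example\\.com/blog/',
      },
    });
    console.log(result.content); // formatted text listing the matched URLs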

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/omgwtfwow/mcp-crawl4ai-ts'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.