# parse_sitemap
Extract URLs from XML sitemaps to discover all site pages, plan crawl strategies, or verify sitemap validity. Supports regex filtering for targeted URL extraction.
## Instructions
[STATELESS] Extract URLs from XML sitemaps. Use when: discovering all site pages, planning crawl strategies, or checking sitemap validity. Supports regex filtering. Try sitemap.xml or robots.txt first. Creates new browser each time.
## Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| filter_pattern | No | Optional regex pattern to filter URLs | |
| url | Yes | URL of the sitemap (e.g., https://example.com/sitemap.xml) | |
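For example, a call that extracts only blog URLs might pass arguments like the following (the `filter_pattern` value here is illustrative):

```json
{
  "url": "https://example.com/sitemap.xml",
  "filter_pattern": "/blog/"
}
```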
## Implementation Reference
- `src/handlers/crawl-handlers.ts:295-329` (handler): Core handler function that implements the parse_sitemap tool logic. It fetches the sitemap XML, extracts URLs with a regex over `<loc>` tags, applies the optional regex filter, limits output to the first 100 URLs, and returns a formatted text summary.

  ```typescript
  async parseSitemap(options: { url: string; filter_pattern?: string }) {
    try {
      // Fetch the sitemap directly (not through Crawl4AI server)
      const axios = (await import('axios')).default;
      const response = await axios.get(options.url, {
        timeout: 30000,
        headers: {
          'User-Agent': 'Mozilla/5.0 (compatible; MCP-Crawl4AI/1.0)',
        },
      });
      const sitemapContent = response.data;

      // Parse XML content - simple regex approach for basic sitemaps
      const urlMatches = sitemapContent.match(/<loc>(.*?)<\/loc>/g) || [];
      const urls = urlMatches.map((match: string) => match.replace(/<\/?loc>/g, ''));

      // Apply filter if provided
      let filteredUrls = urls;
      if (options.filter_pattern) {
        const filterRegex = new RegExp(options.filter_pattern);
        filteredUrls = urls.filter((url: string) => filterRegex.test(url));
      }

      return {
        content: [
          {
            type: 'text',
            text: `Sitemap parsed successfully:\n\nTotal URLs found: ${urls.length}\nFiltered URLs: ${filteredUrls.length}\n\nURLs:\n${filteredUrls.slice(0, 100).join('\n')}${filteredUrls.length > 100 ? '\n... and ' + (filteredUrls.length - 100) + ' more' : ''}`,
          },
        ],
      };
    } catch (error) {
      throw this.formatError(error, 'parse sitemap');
    }
  }
  ```
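The extraction technique used by the handler can be illustrated standalone. This sketch applies the same `<loc>` regex and an optional filter pattern to an inline sitemap string; the sample XML and URLs are made up for illustration:

```typescript
// Minimal sketch of the handler's parsing approach: match <loc>…</loc>
// with a regex, strip the tags, then filter with a user-supplied pattern.
// The sitemap content below is illustrative, not from a real site.
const sitemapXml = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/blog/post-1</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>`;

// Same regex strategy as the handler: collect full <loc> matches, then
// remove the opening and closing tags to leave the bare URLs.
const urlMatches = sitemapXml.match(/<loc>(.*?)<\/loc>/g) || [];
const urls = urlMatches.map((m) => m.replace(/<\/?loc>/g, ''));

// Optional regex filter, e.g. keep only blog pages.
const filterRegex = new RegExp('/blog/');
const filteredUrls = urls.filter((u) => filterRegex.test(u));

console.log(urls.length); // 3
console.log(filteredUrls); // [ 'https://example.com/blog/post-1' ]
```

Note that this regex approach handles flat sitemaps only; a sitemap index file (`<sitemapindex>` with nested `<loc>` entries pointing at child sitemaps) would yield the child sitemap URLs rather than page URLs.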
- Zod schema defining input validation for the parse_sitemap tool: requires a valid URL; `filter_pattern` is an optional string.

  ```typescript
  export const ParseSitemapSchema = createStatelessSchema(
    z.object({
      url: z.string().url(),
      filter_pattern: z.string().optional(),
    }),
    'parse_sitemap',
  );
  ```
- `src/server.ts:875-878` (registration): Server switch-statement case that validates args with ParseSitemapSchema and delegates execution to `CrawlHandlers.parseSitemap`.

  ```typescript
  case 'parse_sitemap':
    return await this.validateAndExecute('parse_sitemap', args, ParseSitemapSchema, async (validatedArgs) =>
      this.crawlHandlers.parseSitemap(validatedArgs),
    );
  ```
- `src/server.ts:344-362` (registration): Tool metadata registration in the listTools response, defining the name, description, and inputSchema for the parse_sitemap tool.

  ```typescript
  {
    name: 'parse_sitemap',
    description:
      '[STATELESS] Extract URLs from XML sitemaps. Use when: discovering all site pages, planning crawl strategies, or checking sitemap validity. Supports regex filtering. Try sitemap.xml or robots.txt first. Creates new browser each time.',
    inputSchema: {
      type: 'object',
      properties: {
        url: {
          type: 'string',
          description: 'URL of the sitemap (e.g., https://example.com/sitemap.xml)',
        },
        filter_pattern: {
          type: 'string',
          description: 'Optional regex pattern to filter URLs',
        },
      },
      required: ['url'],
    },
  },
  ```