# parse_sitemap
Extract URLs from XML sitemaps to discover all site pages, plan crawl strategies, or verify sitemap validity. Supports regex filtering for targeted URL extraction.
## Instructions
[STATELESS] Extract URLs from XML sitemaps. Use when: discovering all site pages, planning crawl strategies, or checking sitemap validity. Supports regex filtering. Try sitemap.xml or robots.txt first. Creates new browser each time.
## Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| filter_pattern | No | Optional regex pattern to filter URLs | |
| url | Yes | URL of the sitemap (e.g., https://example.com/sitemap.xml) | |
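For example, a call that extracts only blog URLs might pass arguments like the following (the `filter_pattern` value here is illustrative):

```json
{
  "url": "https://example.com/sitemap.xml",
  "filter_pattern": "/blog/"
}
```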
## Implementation Reference
- `src/handlers/crawl-handlers.ts:295-329` (handler): Core handler function that implements the parse_sitemap tool logic. It fetches the sitemap XML, extracts URLs with a regex over `<loc>` tags, applies the optional regex filter, limits output to the first 100 URLs, and returns a formatted text summary.

  ```typescript
  async parseSitemap(options: { url: string; filter_pattern?: string }) {
    try {
      // Fetch the sitemap directly (not through Crawl4AI server)
      const axios = (await import('axios')).default;
      const response = await axios.get(options.url, {
        timeout: 30000,
        headers: {
          'User-Agent': 'Mozilla/5.0 (compatible; MCP-Crawl4AI/1.0)',
        },
      });
      const sitemapContent = response.data;

      // Parse XML content - simple regex approach for basic sitemaps
      const urlMatches = sitemapContent.match(/<loc>(.*?)<\/loc>/g) || [];
      const urls = urlMatches.map((match: string) => match.replace(/<\/?loc>/g, ''));

      // Apply filter if provided
      let filteredUrls = urls;
      if (options.filter_pattern) {
        const filterRegex = new RegExp(options.filter_pattern);
        filteredUrls = urls.filter((url: string) => filterRegex.test(url));
      }

      return {
        content: [
          {
            type: 'text',
            text: `Sitemap parsed successfully:\n\nTotal URLs found: ${urls.length}\nFiltered URLs: ${filteredUrls.length}\n\nURLs:\n${filteredUrls.slice(0, 100).join('\n')}${filteredUrls.length > 100 ? '\n... and ' + (filteredUrls.length - 100) + ' more' : ''}`,
          },
        ],
      };
    } catch (error) {
      throw this.formatError(error, 'parse sitemap');
    }
  }
  ```
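The extraction technique used by the handler can be illustrated standalone. This sketch applies the same `<loc>` regex and an optional filter pattern to an inline sitemap string; the sample XML and URLs are made up for illustration:

```typescript
// Minimal sketch of the handler's parsing approach: match <loc>…</loc>
// with a regex, strip the tags, then filter with a user-supplied pattern.
// The sitemap content below is illustrative, not from a real site.
const sitemapXml = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/blog/post-1</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>`;

// Same regex strategy as the handler: collect full <loc> matches, then
// remove the opening and closing tags to leave the bare URLs.
const urlMatches = sitemapXml.match(/<loc>(.*?)<\/loc>/g) || [];
const urls = urlMatches.map((m) => m.replace(/<\/?loc>/g, ''));

// Optional regex filter, e.g. keep only blog pages.
const filterRegex = new RegExp('/blog/');
const filteredUrls = urls.filter((u) => filterRegex.test(u));

console.log(urls.length); // 3
console.log(filteredUrls); // [ 'https://example.com/blog/post-1' ]
```

Note that this regex approach handles flat sitemaps only; a sitemap index file (`<sitemapindex>` with nested `<loc>` entries pointing at child sitemaps) would yield the child sitemap URLs rather than page URLs.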
- Zod schema defining input validation for the parse_sitemap tool: requires a valid URL; `filter_pattern` is an optional string.

  ```typescript
  export const ParseSitemapSchema = createStatelessSchema(
    z.object({
      url: z.string().url(),
      filter_pattern: z.string().optional(),
    }),
    'parse_sitemap',
  );
  ```
- `src/server.ts:875-878` (registration): Server switch-statement case that validates args with ParseSitemapSchema and delegates execution to `CrawlHandlers.parseSitemap`.

  ```typescript
  case 'parse_sitemap':
    return await this.validateAndExecute('parse_sitemap', args, ParseSitemapSchema, async (validatedArgs) =>
      this.crawlHandlers.parseSitemap(validatedArgs),
    );
  ```
- `src/server.ts:344-362` (registration): Tool metadata registration in the listTools response, defining the name, description, and inputSchema for the parse_sitemap tool.

  ```typescript
  {
    name: 'parse_sitemap',
    description:
      '[STATELESS] Extract URLs from XML sitemaps. Use when: discovering all site pages, planning crawl strategies, or checking sitemap validity. Supports regex filtering. Try sitemap.xml or robots.txt first. Creates new browser each time.',
    inputSchema: {
      type: 'object',
      properties: {
        url: {
          type: 'string',
          description: 'URL of the sitemap (e.g., https://example.com/sitemap.xml)',
        },
        filter_pattern: {
          type: 'string',
          description: 'Optional regex pattern to filter URLs',
        },
      },
      required: ['url'],
    },
  },
  ```