# fetch_website_nested
Crawls a website through its nested URL structure and converts the fetched content into clean, structured markdown. Depth, page limits, URL patterns, and domain restrictions can be specified for precise content extraction.
## Instructions
Fetch website content with nested URL crawling and convert to clean markdown
## Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| excludePatterns | No | Regex patterns for URLs to exclude | `[]` |
| includePatterns | No | Regex patterns for URLs to include (if specified, only matching URLs will be processed) | `[]` |
| maxDepth | No | Maximum depth to crawl | `2` |
| maxPages | No | Maximum number of pages to fetch | `50` |
| sameDomainOnly | No | Only crawl URLs from the same domain | `true` |
| timeout | No | Request timeout in milliseconds | `10000` |
| url | Yes | The starting URL to fetch and crawl | |
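As an illustration, here is roughly how a client could invoke the tool. This is a minimal sketch assuming the standard MCP TypeScript SDK (`@modelcontextprotocol/sdk`); the server command, URL, and pattern strings are placeholders.

```typescript
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

// Spawn the server over stdio (command and args are placeholders).
const transport = new StdioClientTransport({
  command: "node",
  args: ["dist/server.js"],
});
const client = new Client({ name: "example-client", version: "1.0.0" });
await client.connect(transport);

// Crawl a docs section two levels deep, skipping changelog pages and PDFs.
const result = await client.callTool({
  name: "fetch_website_nested",
  arguments: {
    url: "https://docs.example.com/guide",
    maxDepth: 2,
    maxPages: 20,
    sameDomainOnly: true,
    includePatterns: ["/guide/"],
    excludePatterns: ["/changelog", "\\.pdf$"],
    timeout: 15000,
  },
});
// result.content[0].text holds the combined markdown.
```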
## Implementation Reference
- `src/server.ts:393-431` (handler) — The tool execution handler for `fetch_website_nested`. It extracts the arguments from the request, validates that a URL was supplied, builds a `FetchOptions` object (the type is reconstructed after this list), calls `AdvancedWebScraper.scrapeWebsite`, and returns the resulting markdown as text content.

  ```typescript
  case "fetch_website_nested": {
    const {
      url,
      maxDepth = 2,
      maxPages = 50,
      sameDomainOnly = true,
      excludePatterns = [],
      includePatterns = [],
      timeout = 10000,
    } = args as any;

    if (!url) {
      throw new Error("URL is required");
    }

    try {
      const options: FetchOptions = {
        maxDepth,
        maxPages,
        sameDomainOnly,
        excludePatterns,
        includePatterns,
        timeout,
      };

      const markdown = await scraper.scrapeWebsite(url, options);

      return {
        content: [
          {
            type: "text",
            text: markdown,
          },
        ],
      };
    } catch (error) {
      throw new Error(`Failed to fetch website: ${error}`);
    }
  }
  ```
- `src/server.ts:301-343` (schema) — Tool definition including name, description, and input schema for `fetch_website_nested`.

  ```typescript
  {
    name: "fetch_website_nested",
    description: "Fetch website content with nested URL crawling and convert to clean markdown",
    inputSchema: {
      type: "object",
      properties: {
        url: {
          type: "string",
          description: "The starting URL to fetch and crawl",
        },
        maxDepth: {
          type: "number",
          description: "Maximum depth to crawl (default: 2)",
          default: 2,
        },
        maxPages: {
          type: "number",
          description: "Maximum number of pages to fetch (default: 50)",
          default: 50,
        },
        sameDomainOnly: {
          type: "boolean",
          description: "Only crawl URLs from the same domain (default: true)",
          default: true,
        },
        excludePatterns: {
          type: "array",
          items: { type: "string" },
          description: "Regex patterns for URLs to exclude",
        },
        includePatterns: {
          type: "array",
          items: { type: "string" },
          description: "Regex patterns for URLs to include (if specified, only matching URLs will be processed)",
        },
        timeout: {
          type: "number",
          description: "Request timeout in milliseconds (default: 10000)",
          default: 10000,
        },
      },
      required: ["url"],
    },
  }
  ```
- `src/server.ts:382-386` (registration) — Registers the `listTools` handler that returns the `TOOLS` array containing `fetch_website_nested`.

  ```typescript
  server.setRequestHandler(ListToolsRequestSchema, async () => {
    return {
      tools: TOOLS,
    };
  });
  ```
- `src/server.ts:219-260` (helper) — Core method of `AdvancedWebScraper` implementing the nested crawl: it works through a queue of URLs up to `maxDepth` and `maxPages`, fetches each page, queues the links it discovers, and formats the collected pages as markdown (a sketch of the pattern filtering follows this list).

  ```typescript
  async scrapeWebsite(startUrl: string, options: FetchOptions = {}): Promise<string> {
    const { maxDepth = 2, maxPages = 50, sameDomainOnly = true, timeout = 10000 } = options;

    this.baseUrl = startUrl;
    this.visitedUrls.clear();

    const allContent: PageContent[] = [];
    const urlsToProcess: Array<{ url: string; depth: number }> = [{ url: startUrl, depth: 0 }];

    while (urlsToProcess.length > 0 && allContent.length < maxPages) {
      const { url, depth } = urlsToProcess.shift()!;

      if (depth > maxDepth || this.visitedUrls.has(url)) {
        continue;
      }

      const pageContent = await this.fetchPageContent(url, depth, options);
      if (pageContent) {
        allContent.push(pageContent);

        // Add child URLs for processing
        if (depth < maxDepth) {
          for (const link of pageContent.links) {
            if (!this.visitedUrls.has(link)) {
              urlsToProcess.push({ url: link, depth: depth + 1 });
            }
          }
        }
      }

      // Small delay to be respectful
      await new Promise(resolve => setTimeout(resolve, 500));
    }

    return this.formatAsMarkdown(allContent, startUrl);
  }
  ```
- `src/server.ts:379` (helper) — Instantiates the `AdvancedWebScraper` class used by the tool handler.

  ```typescript
  const scraper = new AdvancedWebScraper();
  ```
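The `FetchOptions` type itself is not included in the excerpts above. Inferred from how the handler populates it and how `scrapeWebsite` destructures it, it plausibly looks like the following; the actual definition in src/server.ts may differ.

```typescript
// Reconstruction inferred from usage above, not copied from src/server.ts.
interface FetchOptions {
  maxDepth?: number;          // crawl depth limit (handler default: 2)
  maxPages?: number;          // page count limit (handler default: 50)
  sameDomainOnly?: boolean;   // restrict the crawl to the start URL's domain
  excludePatterns?: string[]; // regex strings; matching URLs are skipped
  includePatterns?: string[]; // regex strings; if non-empty, only matches are crawled
  timeout?: number;           // per-request timeout in milliseconds
}
```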
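The domain and pattern checks are not visible in the `scrapeWebsite` excerpt, so they presumably live in `fetchPageContent` or a link-filtering helper. The `shouldCrawl` function below is a hypothetical sketch of that logic based on the documented option semantics, not code from the source.

```typescript
// Hypothetical helper, not from src/server.ts: decides whether a
// discovered link should be queued, per the documented option semantics.
function shouldCrawl(url: string, options: FetchOptions, baseUrl: string): boolean {
  const { sameDomainOnly = true, includePatterns = [], excludePatterns = [] } = options;

  // Domain restriction: compare hostnames of the candidate and start URL.
  if (sameDomainOnly && new URL(url).hostname !== new URL(baseUrl).hostname) {
    return false;
  }

  // Any matching exclude pattern rejects the URL.
  if (excludePatterns.some((p) => new RegExp(p).test(url))) {
    return false;
  }

  // If include patterns are given, at least one must match.
  if (includePatterns.length > 0) {
    return includePatterns.some((p) => new RegExp(p).test(url));
  }

  return true;
}
```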