# crawl_docs
Crawls enabled documentation sources, optionally forcing a full re-crawl to ignore previous crawl records.
## Instructions
Start crawling enabled docs
## Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| force | No | Whether to force re-crawl all docs, ignoring previous crawl records | `false` |
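Assuming a standard MCP `tools/call` request, an invocation that forces a full re-crawl could look like the following (request shape shown for illustration):

```json
{
  "method": "tools/call",
  "params": {
    "name": "crawl_docs",
    "arguments": { "force": true }
  }
}
```

Omitting `arguments` entirely, or omitting `force`, yields the default behavior: previously crawled docs are skipped.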
## Implementation Reference
- `src/index.ts:169-322` (handler): The `crawlAndSaveDocs` function that executes the crawl logic for `crawl_docs`. It iterates over enabled docs, uses Puppeteer to navigate to each doc's `crawlerStart` URL, extracts all links matching the `crawlerPrefix`, scrapes content from each page (converting it to Markdown), and saves the results as `.md` files.
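The handler derives a filesystem-safe filename from each page's URL path before saving. That step can be exercised in isolation; a minimal sketch (the `toFileName` helper name is illustrative, not part of the source):

```typescript
// Derive a filesystem-safe Markdown filename from a doc page URL,
// mirroring the logic inside crawlAndSaveDocs: join non-empty path
// segments with underscores and ensure a .md extension.
function toFileName(link: string): string {
  const url = new URL(link);
  const pathParts = url.pathname.split('/').filter(part => part.length > 0);
  let fileName = pathParts.join('_');
  if (!fileName.endsWith('.md')) {
    fileName += '.md';
  }
  return fileName;
}
```

For example, `https://example.com/docs/getting-started/install` maps to `docs_getting-started_install.md`. Note that two URLs differing only in query string or fragment map to the same filename, so the later write wins.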
````typescript
async function crawlAndSaveDocs(force: boolean = false): Promise<void> {
  await fs.ensureDir(docDir);
  console.error('========== START CRAWLING ==========');
  for (const doc of docs) {
    if (!docConfig[doc.name]) {
      console.error(`Skipping doc ${doc.name} - not enabled`);
      continue;
    }
    // Skip if already crawled and not forcing re-crawl
    if (!force && await fs.pathExists(configPath)) {
      const config = await fs.readJson(configPath);
      if (config.crawledDocs && config.crawledDocs[doc.name]) {
        console.error(`Skipping doc ${doc.name} - already crawled at ${config.crawledDocs[doc.name]}`);
        continue;
      }
    }
    try {
      // Create doc directory - FIX: use the correct path from docDir parameter
      const docDirPath = path.join(docDir, doc.name);
      await fs.ensureDir(docDirPath);
      // Launch browser and open new page
      const browser = await puppeteer.launch({
        // WSL-friendly options to avoid GPU issues
        args: ['--no-sandbox', '--disable-setuid-sandbox', '--disable-gpu'],
        headless: true
      });
      try {
        const page = await browser.newPage();
        // Navigate to start page
        console.error(`Processing doc: ${doc.name}`);
        console.error(`Crawler start: ${doc.crawlerStart}, Crawler prefix: ${doc.crawlerPrefix}`);
        await page.goto(doc.crawlerStart, { waitUntil: 'networkidle2' });
        // Extract all links
        const links = Array.from(new Set(
          await page.evaluate((prefix) => {
            const anchors = Array.from(document.querySelectorAll('a[href]'));
            return anchors
              .map(a => {
                const href = a.getAttribute('href');
                if (!href) return null;
                try {
                  const url = new URL(href, window.location.origin);
                  return url.toString();
                } catch (error) {
                  console.error(`Failed to parse href ${href}:`, error);
                  return null;
                }
              })
              .filter(link => link && link.startsWith(prefix));
          }, doc.crawlerPrefix)
        ));
        if (links.length > 0) {
          console.error(`Found ${links.length} valid links to process`);
          for (const link of links) {
            if (!link) continue;
            try {
              console.log(`Processing link: ${link}`);
              const newPage = await browser.newPage();
              await newPage.goto(link, { waitUntil: 'networkidle2' });
              // Extract content as Markdown
              const content = await newPage.evaluate(() => {
                // Get page title
                const title = document.title;
                // Find main content element
                const main = document.querySelector('main')
                  || document.querySelector('article')
                  || document.querySelector('.main-content')
                  || document.body;
                // Convert content to Markdown
                let markdown = `# ${title}\n\n`;
                // Convert headings
                main.querySelectorAll('h1, h2, h3, h4, h5, h6').forEach(heading => {
                  const level = parseInt(heading.tagName[1]);
                  const text = heading.textContent?.trim();
                  if (text) {
                    markdown += '#'.repeat(level) + ' ' + text + '\n\n';
                  }
                });
                // Convert paragraphs
                main.querySelectorAll('p').forEach(p => {
                  const text = p.textContent?.trim();
                  if (text) {
                    markdown += text + '\n\n';
                  }
                });
                // Convert code blocks
                main.querySelectorAll('pre').forEach(pre => {
                  const text = pre.textContent?.trim();
                  if (text) {
                    markdown += '```\n' + text + '\n```\n\n';
                  }
                });
                // Convert lists
                main.querySelectorAll('ul, ol').forEach(list => {
                  const isOrdered = list.tagName === 'OL';
                  list.querySelectorAll('li').forEach((li, index) => {
                    const text = li.textContent?.trim();
                    if (text) {
                      markdown += isOrdered ? `${index + 1}. ` : '- ';
                      markdown += text + '\n';
                    }
                  });
                  markdown += '\n';
                });
                return markdown.trim();
              });
              await newPage.close();
              // Save Markdown file
              // Create safe file name from URL path
              const url = new URL(link);
              const pathParts = url.pathname.split('/').filter(part => part.length > 0);
              let fileName = pathParts.join('_');
              // Add extension if not present
              if (!fileName.endsWith('.md')) {
                fileName += '.md';
              }
              // FIX: Use docDirPath instead of docDir
              const filePath = path.join(docDirPath, fileName);
              await fs.writeFile(filePath, content);
              console.log(`Successfully saved ${filePath}`);
              await updateCrawledDoc(doc.name);
            } catch (error) {
              console.error(`Failed to process page ${link}:`, error);
            }
          }
        } else {
          console.error('No valid links found');
        }
      } finally {
        await browser.close();
      }
    } catch (error) {
      console.error(`Failed to process doc ${doc.name}:`, error);
    }
  }
}
````
- `src/index.ts:435-447` (registration): Tool registration for `crawl_docs` in the `ListToolsRequestSchema` handler. Defines the name, description, and input schema (with an optional `force` boolean parameter).
```typescript
{
  name: "crawl_docs",
  description: "Start crawling enabled docs",
  inputSchema: {
    type: "object",
    properties: {
      force: {
        type: "boolean",
        description: "Whether to force re-crawl all docs, ignoring previous crawl records"
      }
    }
  }
},
```
- `src/index.ts:563-572` (handler): The switch case for `crawl_docs` in the `CallToolRequestSchema` handler. Extracts the `force` argument and calls `crawlAndSaveDocs(force)`.
```typescript
case "crawl_docs": {
  const force = Boolean(request.params.arguments?.force);
  await crawlAndSaveDocs(force);
  return { content: [{ type: "text", text: "Crawling completed" }] };
}
```
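One caveat worth noting about the `Boolean()` coercion above: any non-empty string is truthy in JavaScript, so a client that serializes `force` as the string `"false"` would still trigger a full re-crawl. A small standalone illustration (the `coerceForce` helper name is mine, not from the source):

```typescript
// Mirrors the handler's argument handling: plain Boolean() coercion
// of the optional 'force' argument.
function coerceForce(value: unknown): boolean {
  return Boolean(value);
}
```

A missing argument coerces to `false`, `true` stays `true`, but the string `"false"` coerces to `true`; clients should send a real JSON boolean.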