crawl_docs
Crawl and index the enabled documentation sources so they can be searched through the MCP server. Set the `force` parameter to re-crawl all docs, ignoring previous crawl records, when the source documentation has been updated.
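Concretely, the handler reads a single optional boolean, `force`, from the tool-call arguments. A minimal sketch of the two argument shapes it distinguishes (the surrounding JSON-RPC `tools/call` envelope is added by the MCP transport):

```typescript
// Parameters of the `tools/call` request handled by this tool; the JSON-RPC
// envelope around them is added by the MCP transport.
const fullRecrawl = {
  name: "crawl_docs",
  arguments: { force: true },  // re-crawl everything, ignoring crawl records
};

const incrementalCrawl = {
  name: "crawl_docs",
  arguments: {},               // skip docs that already have a crawl record
};
```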
Instructions
Start crawling enabled docs
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| force | No | Whether to force re-crawl all docs, ignoring previous crawl records | false |
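A hedged usage sketch: the snippet below calls this tool through the official MCP TypeScript SDK. The SDK import paths, the client metadata, and the `node build/index.js` launch command are assumptions about how the server is built and started, not details taken from this repository.

```typescript
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

// Assumed launch command for this server; adjust to the actual build output.
const transport = new StdioClientTransport({
  command: "node",
  args: ["build/index.js"],
});

const client = new Client({ name: "example-client", version: "1.0.0" });
await client.connect(transport);

// Force a full re-crawl, ignoring previous crawl records.
const result = await client.callTool({
  name: "crawl_docs",
  arguments: { force: true },
});

// Expected result content: [{ type: "text", text: "Crawling completed" }]
console.log(result.content);
```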
Implementation Reference
- src/index.ts:563-572 (handler): Handler logic for the `crawl_docs` tool call. It parses the optional `force` boolean from the arguments, invokes the core `crawlAndSaveDocs` function, and returns a success message.

  ```typescript
  case "crawl_docs": {
    const force = Boolean(request.params.arguments?.force);
    await crawlAndSaveDocs(force);
    return {
      content: [{ type: "text", text: "Crawling completed" }]
    };
  }
  ```
- src/index.ts:435-447 (registration): Registration of the `crawl_docs` tool in the `ListToolsRequestSchema` response, including its name, description, and input schema.

  ```typescript
  {
    name: "crawl_docs",
    description: "Start crawling enabled docs",
    inputSchema: {
      type: "object",
      properties: {
        force: {
          type: "boolean",
          description: "Whether to force re-crawl all docs, ignoring previous crawl records"
        }
      }
    }
  },
  ```
- src/index.ts:438-446 (schema): Input schema definition for the `crawl_docs` tool.

  ```typescript
  inputSchema: {
    type: "object",
    properties: {
      force: {
        type: "boolean",
        description: "Whether to force re-crawl all docs, ignoring previous crawl records"
      }
    }
  }
  ```
- src/index.ts:169-322 (helper): Core helper function that performs the crawl. For each enabled doc it launches Puppeteer, collects links matching the configured prefix from the start page, scrapes each page into Markdown, saves the files locally, and records crawl status. It respects the `force` flag and otherwise skips docs that were already crawled.

  ```typescript
  async function crawlAndSaveDocs(force: boolean = false): Promise<void> {
    await fs.ensureDir(docDir);

    console.error('========== START CRAWLING ==========');

    for (const doc of docs) {
      if (!docConfig[doc.name]) {
        console.error(`Skipping doc ${doc.name} - not enabled`);
        continue;
      }

      // Skip if already crawled and not forcing re-crawl
      if (!force && await fs.pathExists(configPath)) {
        const config = await fs.readJson(configPath);
        if (config.crawledDocs && config.crawledDocs[doc.name]) {
          console.error(`Skipping doc ${doc.name} - already crawled at ${config.crawledDocs[doc.name]}`);
          continue;
        }
      }

      try {
        // Create doc directory - FIX: use the correct path from docDir parameter
        const docDirPath = path.join(docDir, doc.name);
        await fs.ensureDir(docDirPath);

        // Launch browser and open new page
        const browser = await puppeteer.launch({
          // WSL-friendly options to avoid GPU issues
          args: ['--no-sandbox', '--disable-setuid-sandbox', '--disable-gpu'],
          headless: true
        });

        try {
          const page = await browser.newPage();

          // Navigate to start page
          console.error(`Processing doc: ${doc.name}`);
          console.error(`Crawler start: ${doc.crawlerStart}, Crawler prefix: ${doc.crawlerPrefix}`);
          await page.goto(doc.crawlerStart, { waitUntil: 'networkidle2' });

          // Extract all links
          const links = Array.from(new Set(
            await page.evaluate((prefix) => {
              const anchors = Array.from(document.querySelectorAll('a[href]'));
              return anchors
                .map(a => {
                  const href = a.getAttribute('href');
                  if (!href) return null;
                  try {
                    const url = new URL(href, window.location.origin);
                    return url.toString();
                  } catch (error) {
                    console.error(`Failed to parse href ${href}:`, error);
                    return null;
                  }
                })
                .filter(link => link && link.startsWith(prefix));
            }, doc.crawlerPrefix)
          ));

          if (links.length > 0) {
            console.error(`Found ${links.length} valid links to process`);

            for (const link of links) {
              if (!link) continue;
              try {
                console.log(`Processing link: ${link}`);
                const newPage = await browser.newPage();
                await newPage.goto(link, { waitUntil: 'networkidle2' });

                // Extract content as Markdown
                const content = await newPage.evaluate(() => {
                  // Get page title
                  const title = document.title;

                  // Find main content element
                  const main = document.querySelector('main') ||
                    document.querySelector('article') ||
                    document.querySelector('.main-content') ||
                    document.body;

                  // Convert content to Markdown
                  let markdown = `# ${title}\n\n`;

                  // Convert headings
                  main.querySelectorAll('h1, h2, h3, h4, h5, h6').forEach(heading => {
                    const level = parseInt(heading.tagName[1]);
                    const text = heading.textContent?.trim();
                    if (text) {
                      markdown += '#'.repeat(level) + ' ' + text + '\n\n';
                    }
                  });

                  // Convert paragraphs
                  main.querySelectorAll('p').forEach(p => {
                    const text = p.textContent?.trim();
                    if (text) {
                      markdown += text + '\n\n';
                    }
                  });

                  // Convert code blocks
                  main.querySelectorAll('pre').forEach(pre => {
                    const text = pre.textContent?.trim();
                    if (text) {
                      markdown += '```\n' + text + '\n```\n\n';
                    }
                  });

                  // Convert lists
                  main.querySelectorAll('ul, ol').forEach(list => {
                    const isOrdered = list.tagName === 'OL';
                    list.querySelectorAll('li').forEach((li, index) => {
                      const text = li.textContent?.trim();
                      if (text) {
                        markdown += isOrdered ? `${index + 1}. ` : '- ';
                        markdown += text + '\n';
                      }
                    });
                    markdown += '\n';
                  });

                  return markdown.trim();
                });

                await newPage.close();

                // Save Markdown file
                // Create safe file name from URL path
                const url = new URL(link);
                const pathParts = url.pathname.split('/').filter(part => part.length > 0);
                let fileName = pathParts.join('_');

                // Add extension if not present
                if (!fileName.endsWith('.md')) {
                  fileName += '.md';
                }

                // FIX: Use docDirPath instead of docDir
                const filePath = path.join(docDirPath, fileName);
                await fs.writeFile(filePath, content);
                console.log(`Successfully saved ${filePath}`);

                await updateCrawledDoc(doc.name);
              } catch (error) {
                console.error(`Failed to process page ${link}:`, error);
              }
            }
          } else {
            console.error('No valid links found');
          }
        } finally {
          await browser.close();
        }
      } catch (error) {
        console.error(`Failed to process doc ${doc.name}:`, error);
      }
    }
  }
  ```
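The helper depends on module-level state that is not shown in this excerpt (`docs`, `docConfig`, `configPath`, `docDir`, `updateCrawledDoc`). The interfaces below are hypothetical, inferred only from how `crawlAndSaveDocs` uses those values; the real definitions live elsewhere in src/index.ts and may differ.

```typescript
// Hypothetical shapes inferred from usage in crawlAndSaveDocs.
interface DocSource {
  name: string;          // doc identifier, also used as the output subdirectory under docDir
  crawlerStart: string;  // URL the crawler opens first
  crawlerPrefix: string; // only links starting with this prefix are followed
}

// Per-doc enable/disable switches checked before crawling (docConfig[doc.name]).
type DocConfig = Record<string, boolean>;

// Crawl records persisted at configPath: doc name -> timestamp of the last crawl,
// consulted when `force` is false and updated by updateCrawledDoc().
interface CrawlConfig {
  crawledDocs?: Record<string, string>;
}
```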