# crawl_docs
Crawls enabled documentation sources, optionally forcing a full re-crawl to ignore previous crawl records.
## Instructions
Start crawling enabled docs
## Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| force | No | Whether to force re-crawl all docs, ignoring previous crawl records | `false` |
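Assuming a standard MCP `tools/call` request, an invocation that forces a full re-crawl could look like the following (request shape shown for illustration):

```json
{
  "method": "tools/call",
  "params": {
    "name": "crawl_docs",
    "arguments": { "force": true }
  }
}
```

Omitting `arguments` entirely, or omitting `force`, yields the default behavior: previously crawled docs are skipped.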
## Implementation Reference
- `src/index.ts:169-322` (handler): The `crawlAndSaveDocs` function that executes the crawl logic for `crawl_docs`. It iterates over enabled docs, uses Puppeteer to navigate to each doc's `crawlerStart` URL, extracts all links matching the `crawlerPrefix`, scrapes content from each page (converting it to Markdown), and saves the results as `.md` files.
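The handler derives a filesystem-safe filename from each page's URL path before saving. That step can be exercised in isolation; a minimal sketch (the `toFileName` helper name is illustrative, not part of the source):

```typescript
// Derive a filesystem-safe Markdown filename from a doc page URL,
// mirroring the logic inside crawlAndSaveDocs: join non-empty path
// segments with underscores and ensure a .md extension.
function toFileName(link: string): string {
  const url = new URL(link);
  const pathParts = url.pathname.split('/').filter(part => part.length > 0);
  let fileName = pathParts.join('_');
  if (!fileName.endsWith('.md')) {
    fileName += '.md';
  }
  return fileName;
}
```

For example, `https://example.com/docs/getting-started/install` maps to `docs_getting-started_install.md`. Note that two URLs differing only in query string or fragment map to the same filename, so the later write wins.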
````typescript
async function crawlAndSaveDocs(force: boolean = false): Promise<void> {
  await fs.ensureDir(docDir);
  console.error('========== START CRAWLING ==========');
  for (const doc of docs) {
    if (!docConfig[doc.name]) {
      console.error(`Skipping doc ${doc.name} - not enabled`);
      continue;
    }
    // Skip if already crawled and not forcing re-crawl
    if (!force && await fs.pathExists(configPath)) {
      const config = await fs.readJson(configPath);
      if (config.crawledDocs && config.crawledDocs[doc.name]) {
        console.error(`Skipping doc ${doc.name} - already crawled at ${config.crawledDocs[doc.name]}`);
        continue;
      }
    }
    try {
      // Create doc directory - FIX: use the correct path from docDir parameter
      const docDirPath = path.join(docDir, doc.name);
      await fs.ensureDir(docDirPath);
      // Launch browser and open new page
      const browser = await puppeteer.launch({
        // WSL-friendly options to avoid GPU issues
        args: ['--no-sandbox', '--disable-setuid-sandbox', '--disable-gpu'],
        headless: true
      });
      try {
        const page = await browser.newPage();
        // Navigate to start page
        console.error(`Processing doc: ${doc.name}`);
        console.error(`Crawler start: ${doc.crawlerStart}, Crawler prefix: ${doc.crawlerPrefix}`);
        await page.goto(doc.crawlerStart, { waitUntil: 'networkidle2' });
        // Extract all links
        const links = Array.from(new Set(
          await page.evaluate((prefix) => {
            const anchors = Array.from(document.querySelectorAll('a[href]'));
            return anchors
              .map(a => {
                const href = a.getAttribute('href');
                if (!href) return null;
                try {
                  const url = new URL(href, window.location.origin);
                  return url.toString();
                } catch (error) {
                  console.error(`Failed to parse href ${href}:`, error);
                  return null;
                }
              })
              .filter(link => link && link.startsWith(prefix));
          }, doc.crawlerPrefix)
        ));
        if (links.length > 0) {
          console.error(`Found ${links.length} valid links to process`);
          for (const link of links) {
            if (!link) continue;
            try {
              console.log(`Processing link: ${link}`);
              const newPage = await browser.newPage();
              await newPage.goto(link, { waitUntil: 'networkidle2' });
              // Extract content as Markdown
              const content = await newPage.evaluate(() => {
                // Get page title
                const title = document.title;
                // Find main content element
                const main = document.querySelector('main')
                  || document.querySelector('article')
                  || document.querySelector('.main-content')
                  || document.body;
                // Convert content to Markdown
                let markdown = `# ${title}\n\n`;
                // Convert headings
                main.querySelectorAll('h1, h2, h3, h4, h5, h6').forEach(heading => {
                  const level = parseInt(heading.tagName[1]);
                  const text = heading.textContent?.trim();
                  if (text) {
                    markdown += '#'.repeat(level) + ' ' + text + '\n\n';
                  }
                });
                // Convert paragraphs
                main.querySelectorAll('p').forEach(p => {
                  const text = p.textContent?.trim();
                  if (text) {
                    markdown += text + '\n\n';
                  }
                });
                // Convert code blocks
                main.querySelectorAll('pre').forEach(pre => {
                  const text = pre.textContent?.trim();
                  if (text) {
                    markdown += '```\n' + text + '\n```\n\n';
                  }
                });
                // Convert lists
                main.querySelectorAll('ul, ol').forEach(list => {
                  const isOrdered = list.tagName === 'OL';
                  list.querySelectorAll('li').forEach((li, index) => {
                    const text = li.textContent?.trim();
                    if (text) {
                      markdown += isOrdered ? `${index + 1}. ` : '- ';
                      markdown += text + '\n';
                    }
                  });
                  markdown += '\n';
                });
                return markdown.trim();
              });
              await newPage.close();
              // Save Markdown file
              // Create safe file name from URL path
              const url = new URL(link);
              const pathParts = url.pathname.split('/').filter(part => part.length > 0);
              let fileName = pathParts.join('_');
              // Add extension if not present
              if (!fileName.endsWith('.md')) {
                fileName += '.md';
              }
              // FIX: Use docDirPath instead of docDir
              const filePath = path.join(docDirPath, fileName);
              await fs.writeFile(filePath, content);
              console.log(`Successfully saved ${filePath}`);
              await updateCrawledDoc(doc.name);
            } catch (error) {
              console.error(`Failed to process page ${link}:`, error);
            }
          }
        } else {
          console.error('No valid links found');
        }
      } finally {
        await browser.close();
      }
    } catch (error) {
      console.error(`Failed to process doc ${doc.name}:`, error);
    }
  }
}
````
- `src/index.ts:435-447` (registration): Tool registration for `crawl_docs` in the `ListToolsRequestSchema` handler. Defines the name, description, and input schema (with an optional `force` boolean parameter).
```typescript
{
  name: "crawl_docs",
  description: "Start crawling enabled docs",
  inputSchema: {
    type: "object",
    properties: {
      force: {
        type: "boolean",
        description: "Whether to force re-crawl all docs, ignoring previous crawl records"
      }
    }
  }
},
```
- `src/index.ts:563-572` (handler): The switch case for `crawl_docs` in the `CallToolRequestSchema` handler. Extracts the `force` argument and calls `crawlAndSaveDocs(force)`.
```typescript
case "crawl_docs": {
  const force = Boolean(request.params.arguments?.force);
  await crawlAndSaveDocs(force);
  return { content: [{ type: "text", text: "Crawling completed" }] };
}
```
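One caveat worth noting about the `Boolean()` coercion above: any non-empty string is truthy in JavaScript, so a client that serializes `force` as the string `"false"` would still trigger a full re-crawl. A small standalone illustration (the `coerceForce` helper name is mine, not from the source):

```typescript
// Mirrors the handler's argument handling: plain Boolean() coercion
// of the optional 'force' argument.
function coerceForce(value: unknown): boolean {
  return Boolean(value);
}
```

A missing argument coerces to `false`, `true` stays `true`, but the string `"false"` coerces to `true`; clients should send a real JSON boolean.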