
crawl_docs

Crawls and indexes the enabled documentation sources so they can be searched through the MCP server. Set the `force` parameter to re-crawl all documents, for example after the source docs have been updated.

Instructions

Start crawling enabled docs

Input Schema

| Name  | Required | Description                                                         | Default |
| ----- | -------- | ------------------------------------------------------------------- | ------- |
| force | No       | Whether to force re-crawl all docs, ignoring previous crawl records |         |
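For illustration, a `tools/call` request invoking this tool with `force: true` could look like the following. This is a minimal sketch of the JSON-RPC envelope defined by the MCP specification; request ids and transport details depend on your client setup:

```typescript
// Sketch of an MCP "tools/call" JSON-RPC request for crawl_docs.
// The id value and how the message is sent (stdio, SSE, ...) are
// client-specific and shown here only as an example.
const request = {
  jsonrpc: "2.0",
  id: 1,
  method: "tools/call",
  params: {
    name: "crawl_docs",
    arguments: {
      force: true, // re-crawl everything, ignoring previous crawl records
    },
  },
};

console.log(JSON.stringify(request, null, 2));
```

Omitting `arguments.force` is equivalent to `force: false`, since the handler coerces a missing value with `Boolean(...)`.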

Implementation Reference

  • The handler logic for the 'crawl_docs' tool call. It parses the optional 'force' boolean from arguments and invokes the core crawlAndSaveDocs function, returning a success message.
    case "crawl_docs": {
      const force = Boolean(request.params.arguments?.force);
      await crawlAndSaveDocs(force);
      return {
        content: [{
          type: "text",
          text: "Crawling completed"
        }]
      };
    }
  • src/index.ts:435-447 (registration)
    Registration of the 'crawl_docs' tool in the ListToolsRequestSchema response. Includes name, description, and input schema.
    {
      name: "crawl_docs",
      description: "Start crawling enabled docs",
      inputSchema: {
        type: "object",
        properties: {
          force: {
            type: "boolean",
            description: "Whether to force re-crawl all docs, ignoring previous crawl records"
          }
        }
      }
    },
  • Core helper function that executes the document crawling. For each enabled doc, launches Puppeteer, finds relevant links from start page, scrapes each page to Markdown, saves files locally, and tracks crawl status. Respects 'force' flag and skips already crawled.
    async function crawlAndSaveDocs(force: boolean = false): Promise<void> {
      await fs.ensureDir(docDir);
      console.error('========== START CRAWLING ==========');
      for (const doc of docs) {
        if (!docConfig[doc.name]) {
          console.error(`Skipping doc ${doc.name} - not enabled`);
          continue;
        }
    
        // Skip if already crawled and not forcing re-crawl
        if (!force && await fs.pathExists(configPath)) {
          const config = await fs.readJson(configPath);
          if (config.crawledDocs && config.crawledDocs[doc.name]) {
            console.error(`Skipping doc ${doc.name} - already crawled at ${config.crawledDocs[doc.name]}`);
            continue;
          }
        }
    
        try {
          // Create the per-doc output directory under docDir
          const docDirPath = path.join(docDir, doc.name);
          await fs.ensureDir(docDirPath);
    
          // Launch browser and open new page
          const browser = await puppeteer.launch({
            // WSL-friendly options to avoid GPU issues
            args: ['--no-sandbox', '--disable-setuid-sandbox', '--disable-gpu'],
            headless: true
          });
          
          try {
            const page = await browser.newPage();
            
            // Navigate to start page
            console.error(`Processing doc: ${doc.name}`);
            console.error(`Crawler start: ${doc.crawlerStart}, Crawler prefix: ${doc.crawlerPrefix}`);
            await page.goto(doc.crawlerStart, { waitUntil: 'networkidle2' });
    
            // Extract all links
            const links = Array.from(new Set(
              await page.evaluate((prefix) => {
                const anchors = Array.from(document.querySelectorAll('a[href]'));
                return anchors
                  .map(a => {
                    const href = a.getAttribute('href');
                    if (!href) return null;
                    try {
                      const url = new URL(href, window.location.origin);
                      return url.toString();
                    } catch (error) {
                      console.error(`Failed to parse href ${href}:`, error);
                      return null;
                    }
                  })
                  .filter(link => link && link.startsWith(prefix));
              }, doc.crawlerPrefix)
            ));
    
            if (links.length > 0) {
              console.error(`Found ${links.length} valid links to process`);
              
              for (const link of links) {
                if (!link) continue;
                
                try {
                  // Log to stderr: stdout is reserved for the MCP stdio transport
                  console.error(`Processing link: ${link}`);
                  const newPage = await browser.newPage();
                  await newPage.goto(link, { waitUntil: 'networkidle2' });
                  // Extract content as Markdown
                  const content = await newPage.evaluate(() => {
                    // Get page title
                    const title = document.title;
                    
                    // Find main content element
                    const main = document.querySelector('main') ||
                               document.querySelector('article') ||
                               document.querySelector('.main-content') ||
                               document.body;
    
                    // Convert content to Markdown
                    let markdown = `# ${title}\n\n`;
                    
                    // Convert headings
                    main.querySelectorAll('h1, h2, h3, h4, h5, h6').forEach(heading => {
                      const level = parseInt(heading.tagName[1]);
                      const text = heading.textContent?.trim();
                      if (text) {
                        markdown += '#'.repeat(level) + ' ' + text + '\n\n';
                      }
                    });
    
                    // Convert paragraphs
                    main.querySelectorAll('p').forEach(p => {
                      const text = p.textContent?.trim();
                      if (text) {
                        markdown += text + '\n\n';
                      }
                    });
    
                    // Convert code blocks
                    main.querySelectorAll('pre').forEach(pre => {
                      const text = pre.textContent?.trim();
                      if (text) {
                        markdown += '```\n' + text + '\n```\n\n';
                      }
                    });
    
                    // Convert lists
                    main.querySelectorAll('ul, ol').forEach(list => {
                      const isOrdered = list.tagName === 'OL';
                      list.querySelectorAll('li').forEach((li, index) => {
                        const text = li.textContent?.trim();
                        if (text) {
                          markdown += isOrdered ? `${index + 1}. ` : '- ';
                          markdown += text + '\n';
                        }
                      });
                      markdown += '\n';
                    });
    
                    return markdown.trim();
                  });
                  await newPage.close();
                  
                  // Save Markdown file
                  // Create safe file name from URL path
                  const url = new URL(link);
                  const pathParts = url.pathname.split('/').filter(part => part.length > 0);
                  let fileName = pathParts.join('_');
                  
                  // Add extension if not present
                  if (!fileName.endsWith('.md')) {
                    fileName += '.md';
                  }
              const filePath = path.join(docDirPath, fileName);
              await fs.writeFile(filePath, content);
              // Log to stderr: stdout is reserved for the MCP stdio transport
              console.error(`Successfully saved ${filePath}`);
                  await updateCrawledDoc(doc.name);
                } catch (error) {
                  console.error(`Failed to process page ${link}:`, error);
                }
              }
            } else {
              console.error('No valid links found');
            }
          } finally {
            await browser.close();
          }
        } catch (error) {
          console.error(`Failed to process doc ${doc.name}:`, error);
        }
      }
    }
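The URL-to-filename mapping used when saving each page (join the non-empty path segments with underscores, then ensure a `.md` extension) can be isolated as a small helper. The function name `urlToFileName` is hypothetical, introduced here only to illustrate the logic from `crawlAndSaveDocs`:

```typescript
// Hypothetical helper mirroring the filename logic in crawlAndSaveDocs:
// join the non-empty URL path segments with underscores and make sure
// the result ends in ".md". (URL is a Node.js/browser global.)
function urlToFileName(link: string): string {
  const url = new URL(link);
  const pathParts = url.pathname.split("/").filter((part) => part.length > 0);
  let fileName = pathParts.join("_");
  if (!fileName.endsWith(".md")) {
    fileName += ".md";
  }
  return fileName;
}

console.log(urlToFileName("https://example.com/docs/guide/intro"));
// → docs_guide_intro.md
```

Note that a bare root path (`/`) would produce just `.md`, and distinct URLs that differ only in query string map to the same file, so collisions are possible with this scheme.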
