
crawl_docs

Crawls the enabled docs and saves each page locally as Markdown. Use the force parameter to re-crawl all docs, ignoring previous crawl records.

Instructions

Start crawling enabled docs

Input Schema

Name  | Required | Description                                                          | Default
force | No       | Whether to force re-crawl all docs, ignoring previous crawl records | (none)
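
For orientation, invoking the tool from an MCP client looks roughly like the snippet below. This is a minimal sketch that assumes an already-connected Client instance from the @modelcontextprotocol/sdk; adapt it to your own client setup.

    // Hypothetical invocation from an MCP client (TypeScript).
    // Assumes `client` is a connected Client from @modelcontextprotocol/sdk.
    const result = await client.callTool({
      name: "crawl_docs",
      arguments: { force: true } // omit force (or pass false) for an incremental crawl
    });
    // The server replies with a single text content item, e.g. "Crawling completed".
    console.log(result.content);

Setting force to true bypasses the crawledDocs records described below, so every enabled doc is fetched again.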

Implementation Reference

  • src/index.ts:435-447 (registration)
    Registration of the 'crawl_docs' tool within the ListToolsRequestSchema handler, including name, description, and input schema definition.
    { name: "crawl_docs", description: "Start crawling enabled docs", inputSchema: { type: "object", properties: { force: { type: "boolean", description: "Whether to force re-crawl all docs, ignoring previous crawl records" } } } },
  • Execution handler for the 'crawl_docs' tool in the CallToolRequestSchema switch statement. Parses the 'force' argument and delegates to the crawlAndSaveDocs function.
    case "crawl_docs": { const force = Boolean(request.params.arguments?.force); await crawlAndSaveDocs(force); return { content: [{ type: "text", text: "Crawling completed" }] }; }
  • Core implementation of the 'crawl_docs' tool logic. Crawls each enabled documentation site with Puppeteer: launches a headless browser, navigates to the crawler start URL, extracts links matching the crawler prefix, scrapes each page's content into Markdown, and saves the files locally in doc-specific directories (a sketch of the crawledDocs bookkeeping it relies on follows this list).
    async function crawlAndSaveDocs(force: boolean = false): Promise<void> {
      await fs.ensureDir(docDir);
      console.error('========== START CRAWLING ==========');
      for (const doc of docs) {
        if (!docConfig[doc.name]) {
          console.error(`Skipping doc ${doc.name} - not enabled`);
          continue;
        }
        // Skip if already crawled and not forcing re-crawl
        if (!force && await fs.pathExists(configPath)) {
          const config = await fs.readJson(configPath);
          if (config.crawledDocs && config.crawledDocs[doc.name]) {
            console.error(`Skipping doc ${doc.name} - already crawled at ${config.crawledDocs[doc.name]}`);
            continue;
          }
        }
        try {
          // Create doc directory - FIX: use the correct path from docDir parameter
          const docDirPath = path.join(docDir, doc.name);
          await fs.ensureDir(docDirPath);
          // Launch browser and open new page
          const browser = await puppeteer.launch({
            // WSL-friendly options to avoid GPU issues
            args: ['--no-sandbox', '--disable-setuid-sandbox', '--disable-gpu'],
            headless: true
          });
          try {
            const page = await browser.newPage();
            // Navigate to start page
            console.error(`Processing doc: ${doc.name}`);
            console.error(`Crawler start: ${doc.crawlerStart}, Crawler prefix: ${doc.crawlerPrefix}`);
            await page.goto(doc.crawlerStart, { waitUntil: 'networkidle2' });
            // Extract all links
            const links = Array.from(new Set(
              await page.evaluate((prefix) => {
                const anchors = Array.from(document.querySelectorAll('a[href]'));
                return anchors
                  .map(a => {
                    const href = a.getAttribute('href');
                    if (!href) return null;
                    try {
                      const url = new URL(href, window.location.origin);
                      return url.toString();
                    } catch (error) {
                      console.error(`Failed to parse href ${href}:`, error);
                      return null;
                    }
                  })
                  .filter(link => link && link.startsWith(prefix));
              }, doc.crawlerPrefix)
            ));
            if (links.length > 0) {
              console.error(`Found ${links.length} valid links to process`);
              for (const link of links) {
                if (!link) continue;
                try {
                  console.log(`Processing link: ${link}`);
                  const newPage = await browser.newPage();
                  await newPage.goto(link, { waitUntil: 'networkidle2' });
                  // Extract content as Markdown
                  const content = await newPage.evaluate(() => {
                    // Get page title
                    const title = document.title;
                    // Find main content element
                    const main = document.querySelector('main')
                      || document.querySelector('article')
                      || document.querySelector('.main-content')
                      || document.body;
                    // Convert content to Markdown
                    let markdown = `# ${title}\n\n`;
                    // Convert headings
                    main.querySelectorAll('h1, h2, h3, h4, h5, h6').forEach(heading => {
                      const level = parseInt(heading.tagName[1]);
                      const text = heading.textContent?.trim();
                      if (text) {
                        markdown += '#'.repeat(level) + ' ' + text + '\n\n';
                      }
                    });
                    // Convert paragraphs
                    main.querySelectorAll('p').forEach(p => {
                      const text = p.textContent?.trim();
                      if (text) {
                        markdown += text + '\n\n';
                      }
                    });
                    // Convert code blocks
                    main.querySelectorAll('pre').forEach(pre => {
                      const text = pre.textContent?.trim();
                      if (text) {
                        markdown += '```\n' + text + '\n```\n\n';
                      }
                    });
                    // Convert lists
                    main.querySelectorAll('ul, ol').forEach(list => {
                      const isOrdered = list.tagName === 'OL';
                      list.querySelectorAll('li').forEach((li, index) => {
                        const text = li.textContent?.trim();
                        if (text) {
                          markdown += isOrdered ? `${index + 1}. ` : '- ';
                          markdown += text + '\n';
                        }
                      });
                      markdown += '\n';
                    });
                    return markdown.trim();
                  });
                  await newPage.close();
                  // Save Markdown file
                  // Create safe file name from URL path
                  const url = new URL(link);
                  const pathParts = url.pathname.split('/').filter(part => part.length > 0);
                  let fileName = pathParts.join('_');
                  // Add extension if not present
                  if (!fileName.endsWith('.md')) {
                    fileName += '.md';
                  }
                  // FIX: Use docDirPath instead of docDir
                  const filePath = path.join(docDirPath, fileName);
                  await fs.writeFile(filePath, content);
                  console.log(`Successfully saved ${filePath}`);
                  await updateCrawledDoc(doc.name);
                } catch (error) {
                  console.error(`Failed to process page ${link}:`, error);
                }
              }
            } else {
              console.error('No valid links found');
            }
          } finally {
            await browser.close();
          }
        } catch (error) {
          console.error(`Failed to process doc ${doc.name}:`, error);
        }
      }
    }
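
The crawl loop above records progress through updateCrawledDoc and skips any doc that already appears under crawledDocs in the config file. That helper is not shown on this page; the sketch below is an assumption inferred from how the loop reads the config (fs-extra's readJson/writeJson), not the actual implementation in src/index.ts.

    // Hypothetical sketch of the crawledDocs bookkeeping the loop relies on.
    // The real updateCrawledDoc is not shown here; this only illustrates the
    // { crawledDocs: { [docName]: timestamp } } shape that the
    // "already crawled" check reads.
    async function updateCrawledDoc(docName: string): Promise<void> {
      const config = (await fs.pathExists(configPath))
        ? await fs.readJson(configPath)
        : {};
      config.crawledDocs = config.crawledDocs ?? {};
      // A timestamp lets the next non-forced run skip this doc and report
      // when it was last crawled.
      config.crawledDocs[docName] = new Date().toISOString();
      await fs.writeJson(configPath, config, { spaces: 2 });
    }

File naming follows the URL path: for a crawled page such as https://example.com/docs/getting-started/install (a made-up URL), the path segments are joined with underscores, so the content is saved as docs_getting-started_install.md inside that doc's directory.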

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/askme765cs/open-docs-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.