scrape_single_page
Scrape a Salesforce documentation page by URL and return its content as markdown.
Instructions
Scrape a single Salesforce documentation page. Returns markdown. If you do not know the exact URL, you should first use a Web Search tool (like Brave or DuckDuckGo) to search for 'site:developer.salesforce.com/docs [topic]' or 'site:help.salesforce.com [topic]', then pass the retrieved URL here.
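The search-first workflow described above can be sketched as a small helper that builds the two suggested queries for a topic. This is only an illustration: `buildSearchQueries` is a hypothetical name and is not part of the server; the query templates are copied from the instructions.

```typescript
// Hypothetical helper (not part of the server): builds the two
// site-restricted search queries suggested in the tool instructions.
function buildSearchQueries(topic: string): string[] {
  const trimmed = topic.trim();
  return [
    `site:developer.salesforce.com/docs ${trimmed}`,
    `site:help.salesforce.com ${trimmed}`,
  ];
}

// Example: queries to run in a web search tool before calling scrape_single_page.
for (const query of buildSearchQueries("Apex triggers")) {
  console.log(query);
}
```

The URL picked from the search results is then passed as the `url` argument of the tool.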
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
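| url | Yes | URL of the Salesforce documentation page to scrape; must be a valid URL. | |
| category | No | Category label saved with the scraped document. | general |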
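
A dependency-free sketch of how these inputs might be checked before a call is shown here. It approximates the behavior described for the tool (a required `url` that must parse as a URL, and an optional `category` that falls back to `general`); `parseScrapePageArgs` and `ScrapePageArgs` are hypothetical names, and the built-in `URL` parser may accept or reject slightly different strings than the server's own validation.

```typescript
// Hypothetical types and helper; the real server validates with its own schema.
interface ScrapePageArgs {
  url: string;
  category: string;
}

function parseScrapePageArgs(args: { url?: unknown; category?: unknown }): ScrapePageArgs {
  const url = args.url;
  if (typeof url !== "string") {
    throw new Error("url is required and must be a string");
  }
  try {
    new URL(url); // throws on strings that do not parse as absolute URLs
  } catch {
    throw new Error(`url is not a valid URL: ${url}`);
  }

  // category is optional; fall back to "general" when it is omitted.
  let category = "general";
  const rawCategory = args.category;
  if (rawCategory !== undefined) {
    if (typeof rawCategory !== "string") {
      throw new Error("category must be a string");
    }
    category = rawCategory;
  }
  return { url, category };
}

// Example with a made-up documentation URL.
console.log(parseScrapePageArgs({ url: "https://developer.salesforce.com/docs/example" }));
```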
Implementation Reference
- src/index.ts:118-133 (handler) The CallToolRequestSchema handler that executes the 'scrape_single_page' tool. It parses the URL and category from args via ScrapePageSchema, calls scrapePage() from scraper.ts, saves the result to the database via saveDocument(), and returns the markdown content.

      if (name === "scrape_single_page") {
        const { url, category } = ScrapePageSchema.parse(args);
        console.error(`Scraping ${url}...`);
        const result = await scrapePage(url);

        if (result.error) {
          return { content: [{ type: "text", text: `Failed to scrape: ${result.error}` }] };
        }

        // Save automatically to local DB
        await saveDocument(url, result.title, result.markdown, result.hash, category);

        return {
          content: [{ type: "text", text: `# ${result.title}\n\n${result.markdown}` }]
        };
      }

- src/index.ts:18-21 (schema) ScrapePageSchema: Zod schema defining the input for 'scrape_single_page': a required, valid 'url' string and an optional 'category' string that defaults to 'general'.

      const ScrapePageSchema = z.object({
        url: z.string().url(),
        category: z.string().optional().default("general")
      });

- src/index.ts:48-59 (registration) Tool registration entry in the ListToolsRequestSchema handler. Declares the tool name 'scrape_single_page', its description, and its JSON Schema input (url required, category optional).

      {
        name: "scrape_single_page",
        description: "Scrape a single Salesforce documentation page. Returns markdown. If you do not know the exact URL, you should first use a Web Search tool (like Brave or DuckDuckGo) to search for 'site:developer.salesforce.com/docs [topic]' or 'site:help.salesforce.com [topic]', then pass the retrieved URL here.",
        inputSchema: {
          type: "object",
          properties: {
            url: { type: "string" },
            category: { type: "string" }
          },
          required: ["url"]
        }
      },

- src/scraper.ts:164-460 (helper) The scrapePage() function in scraper.ts, which is the core scraping logic called by the handler. Uses puppeteer to render pages, extracts content via multiple DOM strategies (iframes, shadow DOM, etc.), converts HTML to markdown using TurndownService, and returns a ScrapedPage object with title, markdown, hash, and childLinks.

      export async function scrapePage(url: string, baseDomain?: string): Promise<ScrapedPage> {
        // 1. Aura SPA Fast-Path directly hitting the backend Salesforce APIs
        const auraResult = await scrapeAuraArticle(url, baseDomain);
        if (auraResult) {
          return auraResult;
        }

        // 1.5 Native PDF Extraction
        if (url.toLowerCase().endsWith('.pdf')) {
          try {
            console.log(`[PDF Extraction] Downloading ${url}...`);
            const pdfResponse = await fetch(url);
            if (!pdfResponse.ok) {
              return { url, title: 'Error', markdown: '', hash: '', error: `PDF HTTP Error ${pdfResponse.status}: ${pdfResponse.statusText}`, childLinks: [] };
            }
            const buffer = await pdfResponse.arrayBuffer();
            const data = await pdf(Buffer.from(buffer));

            // Generate a simple markdown representation
            const title = url.split('/').pop() || 'PDF Document';
            const markdown = `# ${title}\n\n${data.text}`;
            const hash = crypto.createHash('sha256').update(markdown).digest('hex');

            return {
              url,
              title,
              markdown,
              hash,
              childLinks: [] // PDFs don't typically yield crawlable HTML links natively this way
            };
          } catch (e: any) {
            return { url, title: 'Error', markdown: '', hash: '', error: `PDF Parse Error: ${e.message}`, childLinks: [] };
          }
        }

        // 2. Headless Chrome Fallback for everything else (LWC, Standard Web, etc.)
        const browserInstance = await getBrowser();
        const page = await browserInstance.newPage();

        try {
          await page.setViewport({ width: 1280, height: 800 });
          // User agent to look normal
          await page.setUserAgent('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36');

          // Wait until network is idle specifically to handle SPA renders and iframe loads
          const response = await page.goto(url, { waitUntil: 'networkidle0', timeout: 60000 });

          // BUG-04 check: If the page returns an HTTP error code natively, fail fast.
          if (response && !response.ok()) {
            return { url, title: 'Error', markdown: '', hash: '', error: `HTTP Error ${response.status()}: ${response.statusText()}`, childLinks: [] };
          }

          // Wait for specific Salesforce content locators to appear to avoid grabbing 'Loading...' pages
          try {
            if (url.includes('help.salesforce.com')) {
              // Wait for the main body to render something meaningful (avoiding specific legacy classes)
              await page.waitForSelector('body', { timeout: 15000 });
            } else if (url.includes('developer.salesforce.com')) {
              await page.waitForFunction(() => {
                return document.querySelector('doc-content-layout') ||
                  document.querySelector('doc-xml-content') ||
                  document.querySelector('iframe');
              }, { timeout: 10000 });
            }
          } catch (e) {
            console.warn(`Timeout waiting for specific content selectors on ${url}`);
          }

          // Additional wait just in case visual components are still sliding in
          await new Promise(r => setTimeout(r, 2000));

          // Take an opportunistic screenshot if development debugging layout issues
          if (url.includes('help.salesforce.com')) {
            await page.screenshot({ path: 'help_debug_test.png' }).catch(() => { });
          }

          // In-page extraction script
          const extraction = await page.evaluate(() => {
            // Flattened DOM Extraction Logic to bypass TS __name bugs
            let title = document.querySelector('title')?.innerText || 'Untitled';
            let finalHtml = '';
            const childLinks = new Set<string>();

            // Collect all same-site hierarchical links using an iterative deep shadow DOM search
            const rootsToProcess = [document as Document | ShadowRoot | Element];
            while (rootsToProcess.length > 0) {
              const currentRoot = rootsToProcess.pop()!;

              // Grab links
              const aTags = currentRoot.querySelectorAll('a');
              aTags.forEach(a => {
                if (a.href && !a.href.startsWith('java') && !a.href.startsWith('mailto')) {
                  childLinks.add(a.href);
                }
              });

              const allElements = currentRoot.querySelectorAll('*');
              for (let i = 0; i < allElements.length; i++) {
                const el = allElements[i];
                if (el.shadowRoot) {
                  rootsToProcess.push(el.shadowRoot);
                }
              }
            }

            // BUG-04: Catch soft 404s rendered by the SPA
            if (title.includes('404 Error')) {
              return { html: '', title: 'Error', error: 'HTTP 404 - Page Not Found', childLinks: Array.from(childLinks) };
            }

            // Catch SPA shells that failed to load content BEFORE generic tag fallbacks
            const bodyHtml = document.body.innerText;
            if (bodyHtml.includes('Sorry to interrupt')) {
              return { html: '', title: 'Error', error: 'Found no accessible documentation content on this page. It may require authentication, be a soft 404, rendering timed out, or JavaScript rendering is required.', childLinks: [] };
            }

            // --- STRATEGY 1: Iframe (Older Developer Guides) ---
            const iframe = document.querySelector('iframe');
            if (iframe && iframe.contentDocument && iframe.contentDocument.body) {
              const docHtml = iframe.contentDocument.querySelector('#doc')?.innerHTML ||
                iframe.contentDocument.querySelector('body')?.innerHTML || '';
              const docTitle = iframe.contentDocument.querySelector('title')?.innerText ||
                iframe.contentDocument.querySelector('h1')?.innerText || 'Untitled';

              iframe.contentDocument.querySelectorAll('a').forEach(a => {
                if (a.href && !a.href.startsWith('java') && !a.href.startsWith('mailto')) {
                  childLinks.add(a.href);
                }
              });

              if (docHtml.length > 500) {
                return { html: docHtml, title: docTitle, childLinks: Array.from(childLinks) };
              }
            }

            // --- STRATEGY 2: Help.salesforce.com Shadow DOM Search ---
            let sldsText: Element | null = null;
            const searchRoots = [document as Document | ShadowRoot | Element];
            while (searchRoots.length > 0 && !sldsText) {
              const current = searchRoots.pop()!;
              const found = current.querySelector('.slds-text-longform');
              if (found) {
                sldsText = found;
                break;
              }
              const all = current.querySelectorAll('*');
              for (let i = 0; i < all.length; i++) {
                if (all[i].shadowRoot) searchRoots.push(all[i].shadowRoot!);
              }
            }

            if (sldsText) {
              title = title.replace(' | Salesforce', '').trim();
              return { html: sldsText.innerHTML, title, childLinks: Array.from(childLinks) };
            }

            // --- STRATEGY 3: legacy doc-xml-content ---
            const docXmlContent = document.querySelector('doc-xml-content');
            if (docXmlContent?.shadowRoot) {
              const docContent = docXmlContent.shadowRoot.querySelector('doc-content');
              if (docContent?.shadowRoot) {
                const innerHtml = docContent.shadowRoot.innerHTML;
                const h1Match = innerHtml.match(/<h1[^>]*>(.*?)<\/h1>/);
                if (h1Match) title = h1Match[1].replace(/<[^>]*>?/gm, '');

                docContent.shadowRoot.querySelectorAll('a').forEach(a => {
                  if (a.href && !a.href.startsWith('java') && !a.href.startsWith('mailto')) {
                    childLinks.add(a.href);
                  }
                });

                return { html: innerHtml, title, childLinks: Array.from(childLinks) };
              }
            }

            // --- STRATEGY 4: Modern doc-amf-reference ---
            const docRef = document.querySelector('doc-amf-reference');
            if (docRef) {
              const markdownContent = docRef.querySelector('.markdown-content');
              if (markdownContent) {
                // Quick and dirty extraction, bypass complex legacy parser
                const h1 = markdownContent.querySelector('h1');
                if (h1) title = h1.textContent?.trim() || title;
                return { html: markdownContent.innerHTML, title, childLinks: Array.from(childLinks) };
              }
            }

            const docLayout = document.querySelector('doc-content-layout');
            if (docLayout?.shadowRoot) {
              const slot = docLayout.shadowRoot.querySelector('.content-body slot') as HTMLSlotElement | null;
              if (slot) {
                const assignedElements = slot.assignedElements();
                if (assignedElements.length > 0) {
                  let guideHtml = '';
                  for (const el of assignedElements) {
                    if (el.tagName?.toLowerCase() === 'h1') title = el.textContent?.trim() || title;
                    guideHtml += el.outerHTML;
                  }
                  return { html: guideHtml, title, childLinks: Array.from(childLinks) };
                }
              }
            }

            // --- STRATEGY 5: Fallbacks ---
            const container = document.querySelector('article') || document.querySelector('main');
            if (container) {
              title = document.querySelector('h1')?.innerText || title;
              return { html: container.innerHTML, title, childLinks: Array.from(childLinks) };
            }

            // Complete fallback - BUG-01 & BUG-02
            const isHelpSite = window.location.href.includes('help.salesforce.com');
            if (isHelpSite || document.body.innerHTML.length > 100000) {
              return { html: '', title: 'Error', error: 'Found no accessible documentation content on this page. It may require authentication, be a soft 404, rendering timed out, or JavaScript rendering is required.', childLinks: [] };
            }

            return { html: document.body.innerHTML, title, childLinks: Array.from(childLinks) };
          });

          if (!extraction.html || extraction.html.trim() === '') {
            return {
              url,
              title: extraction.title || 'Untitled',
              markdown: '',
              hash: '',
              error: (extraction as any).error || 'No content found on page',
              childLinks: extraction.childLinks || []
            };
          }

          // Convert to markdown
          let markdown = turndownService.turndown(extraction.html);

          // Filter child links to stay within the domain/base if provided, to avoid massive spidering
          let validLinks = extraction.childLinks;
          if (baseDomain) {
            validLinks = validLinks.filter(l => l.startsWith(baseDomain));
          }

          const hash = crypto.createHash('sha256').update(markdown).digest('hex');

          return {
            url,
            title: extraction.title,
            markdown,
            hash,
            childLinks: validLinks,
          };
        } catch (error: any) {
          return {
            url,
            title: 'Error',
            markdown: '',
            hash: '',
            error: error.message,
            childLinks: [],
          };
        } finally {
          await page.close();
        }
      }
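
The last two steps of scrapePage(), filtering child links by an optional base and hashing the markdown, can be tried in isolation. The sketch below is a standalone approximation that needs only Node's built-in crypto module; `filterChildLinks` and `contentHash` are hypothetical names, and the sample URLs are made up for illustration.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: keep only links that start with the given base,
// mirroring the startsWith filter used when a baseDomain is provided.
function filterChildLinks(links: string[], baseDomain?: string): string[] {
  return baseDomain ? links.filter((link) => link.startsWith(baseDomain)) : links;
}

// Hypothetical helper: SHA-256 hex digest of the markdown, usable as a
// fingerprint to detect whether a page changed between scrapes.
function contentHash(markdown: string): string {
  return createHash("sha256").update(markdown).digest("hex");
}

// Made-up sample links for illustration only.
const links = [
  "https://help.salesforce.com/s/articleView?id=a",
  "https://developer.salesforce.com/docs/apex",
  "https://example.com/other",
];

console.log(filterChildLinks(links, "https://developer.salesforce.com/docs"));
console.log(contentHash("# Title\n\nBody") === contentHash("# Title\n\nBody")); // same input, same hash
```

Because the filter is a plain prefix match, a base such as `https://developer.salesforce.com/docs` also matches sibling paths that merely share that prefix; a stricter check would compare parsed URL components instead.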