scrape_single_page

Scrape a Salesforce documentation page by URL and return its content as markdown.

Instructions

Scrape a single Salesforce documentation page. Returns markdown. If you do not know the exact URL, you should first use a Web Search tool (like Brave or DuckDuckGo) to search for 'site:developer.salesforce.com/docs [topic]' or 'site:help.salesforce.com [topic]', then pass the retrieved URL here.

Input Schema

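Name	Required	Description	Default
`url`	Yes	URL of the Salesforce documentation page to scrape; must be a valid URL.	(none)
`category`	No	Category label saved with the scraped document.	`general`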

Implementation Reference

src/index.ts:118-133 (handler)

The branch of the CallToolRequestSchema handler that executes the 'scrape_single_page' tool. It parses the URL and category from args via ScrapePageSchema and calls scrapePage() from scraper.ts. If the scrape reports an error, it returns a 'Failed to scrape' message and saves nothing. Otherwise it saves the result to the database via saveDocument() and returns the title and markdown content.

```
if (name === "scrape_single_page") {
    const { url, category } = ScrapePageSchema.parse(args);
    console.error(`Scraping ${url}...`);
    const result = await scrapePage(url);

    if (result.error) {
        return { content: [{ type: "text", text: `Failed to scrape: ${result.error}` }] };
    }

    // Save automatically to local DB
    await saveDocument(url, result.title, result.markdown, result.hash, category);

    return {
        content: [{ type: "text", text: `# ${result.title}\n\n${result.markdown}` }]
    };
}
```
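
The two return paths of the handler can be illustrated in isolation. The following is a minimal sketch, not the server's code: `ScrapeResult`, `ToolResponse`, and `toToolResponse` are hypothetical names introduced here, and the result shape is trimmed to the fields the handler reads. The sample title and error strings are made up for illustration.

```typescript
// Hypothetical, trimmed-down shapes: only the fields the handler reads.
interface ScrapeResult {
    title: string;
    markdown: string;
    error?: string;
}

interface ToolResponse {
    content: { type: "text"; text: string }[];
}

// Illustrative helper mirroring the handler's two return paths.
function toToolResponse(result: ScrapeResult): ToolResponse {
    if (result.error) {
        // Failure path: the error is reported as ordinary text content.
        return { content: [{ type: "text", text: `Failed to scrape: ${result.error}` }] };
    }
    // Success path: the title becomes a top-level markdown heading above the body.
    return { content: [{ type: "text", text: `# ${result.title}\n\n${result.markdown}` }] };
}

const ok = toToolResponse({ title: "Sample Page", markdown: "Some body text." });
const failed = toToolResponse({ title: "Error", markdown: "", error: "HTTP Error 404: Not Found" });

console.log(ok.content[0].text);
console.log(failed.content[0].text);
```

Because the failure path returns regular text content, a caller that needs to distinguish success from failure would, under this sketch, have to check for the 'Failed to scrape' prefix.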

src/index.ts:18-21 (schema)

ScrapePageSchema: Zod schema defining the input for 'scrape_single_page'. It requires 'url' to be a string containing a valid URL and accepts an optional 'category' string that defaults to 'general'.

```
const ScrapePageSchema = z.object({
    url: z.string().url(),
    category: z.string().optional().default("general")
});
```
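
To make the schema's behaviour concrete without pulling in Zod, here is a dependency-free approximation. It is a sketch under assumptions: `parseScrapePageArgs` and `ScrapePageInput` are hypothetical names, and the built-in `URL` constructor stands in for Zod's URL check, whose exact rules may differ in edge cases.

```typescript
interface ScrapePageInput {
    url: string;
    category: string;
}

// Hypothetical stand-in for ScrapePageSchema.parse(args); not the server's code.
function parseScrapePageArgs(args: unknown): ScrapePageInput {
    if (typeof args !== "object" || args === null) {
        throw new Error("args must be an object");
    }
    const { url, category } = args as { url?: unknown; category?: unknown };

    if (typeof url !== "string") {
        throw new Error("url must be a string");
    }
    try {
        new URL(url); // throws on strings that are not absolute URLs
    } catch {
        throw new Error(`url is not a valid URL: ${url}`);
    }

    // An omitted category falls back to "general", like .optional().default("general").
    let resolvedCategory = "general";
    if (category !== undefined) {
        if (typeof category !== "string") {
            throw new Error("category must be a string");
        }
        resolvedCategory = category;
    }
    return { url, category: resolvedCategory };
}

const withDefault = parseScrapePageArgs({ url: "https://developer.salesforce.com/docs" });
const withCategory = parseScrapePageArgs({ url: "https://help.salesforce.com", category: "admin" });

let rejected = false;
try {
    parseScrapePageArgs({ url: "not a url" });
} catch {
    rejected = true;
}
console.log(withDefault, withCategory, rejected);
```
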

src/index.ts:48-59 (registration)

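Tool registration entry in the ListToolsRequestSchema handler. It declares the tool name 'scrape_single_page', its description, and a JSON Schema for the input (url required, category optional). This JSON Schema only types both fields as strings; URL validation and the 'general' default for category come from ScrapePageSchema when the handler parses the arguments.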

```
{
    name: "scrape_single_page",
    description: "Scrape a single Salesforce documentation page. Returns markdown. If you do not know the exact URL, you should first use a Web Search tool (like Brave or DuckDuckGo) to search for 'site:developer.salesforce.com/docs [topic]' or 'site:help.salesforce.com [topic]', then pass the retrieved URL here.",
    inputSchema: {
        type: "object",
        properties: {
            url: { type: "string" },
            category: { type: "string" }
        },
        required: ["url"]
    }
},
```
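
For orientation, a client invokes a registered tool with a JSON-RPC `tools/call` request whose `arguments` object must satisfy the input schema above. The sketch below only builds that message; `buildScrapeRequest` and `ToolCallRequest` are hypothetical names, and sending the message over a transport is out of scope here.

```typescript
interface ToolCallRequest {
    jsonrpc: "2.0";
    id: number;
    method: "tools/call";
    params: { name: string; arguments: Record<string, unknown> };
}

// Hypothetical helper: builds the request a client would send for this tool.
function buildScrapeRequest(id: number, url: string, category?: string): ToolCallRequest {
    const args: Record<string, unknown> = { url };
    // category is optional in the input schema, so it is only sent when provided.
    if (category !== undefined) {
        args.category = category;
    }
    return {
        jsonrpc: "2.0",
        id,
        method: "tools/call",
        params: { name: "scrape_single_page", arguments: args },
    };
}

const minimalRequest = buildScrapeRequest(1, "https://developer.salesforce.com/docs");
const categorizedRequest = buildScrapeRequest(2, "https://help.salesforce.com", "admin");

console.log(JSON.stringify(minimalRequest, null, 2));
console.log(JSON.stringify(categorizedRequest, null, 2));
```
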

src/scraper.ts:164-460 (helper)

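The scrapePage() function in scraper.ts is the core scraping logic called by the handler. It first tries an Aura fast path via scrapeAuraArticle(), then handles URLs ending in .pdf by downloading the file and extracting its text, and otherwise falls back to rendering the page in headless Chrome with Puppeteer. In the browser path it extracts content via multiple DOM strategies (iframes, shadow DOM, etc.), converts the HTML to markdown using TurndownService, and returns a ScrapedPage object with url, title, markdown, hash, and childLinks. On failure the returned object carries an error message and empty markdown.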

export async function scrapePage(url: string, baseDomain?: string): Promise<ScrapedPage> {
    // 1. Aura SPA Fast-Path directly hitting the backend Salesforce APIs
    const auraResult = await scrapeAuraArticle(url, baseDomain);
    if (auraResult) {
        return auraResult;
    }

    // 1.5 Native PDF Extraction
    if (url.toLowerCase().endsWith('.pdf')) {
        try {
            console.log(`[PDF Extraction] Downloading ${url}...`);
            const pdfResponse = await fetch(url);
            if (!pdfResponse.ok) {
                return {
                    url,
                    title: 'Error',
                    markdown: '',
                    hash: '',
                    error: `PDF HTTP Error ${pdfResponse.status}: ${pdfResponse.statusText}`,
                    childLinks: []
                };
            }
            const buffer = await pdfResponse.arrayBuffer();
            const data = await pdf(Buffer.from(buffer));

            // Generate a simple markdown representation
            const title = url.split('/').pop() || 'PDF Document';
            const markdown = `# ${title}\n\n${data.text}`;
            const hash = crypto.createHash('sha256').update(markdown).digest('hex');

            return {
                url,
                title,
                markdown,
                hash,
                childLinks: [] // PDFs don't typically yield crawlable HTML links natively this way
            };
        } catch (e: any) {
            return {
                url,
                title: 'Error',
                markdown: '',
                hash: '',
                error: `PDF Parse Error: ${e.message}`,
                childLinks: []
            };
        }
    }

    // 2. Headless Chrome Fallback for everything else (LWC, Standard Web, etc.)
    const browserInstance = await getBrowser();
    const page = await browserInstance.newPage();

    try {
        await page.setViewport({ width: 1280, height: 800 });
        // User agent to look normal
        await page.setUserAgent('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36');

        // Wait until network is idle specifically to handle SPA renders and iframe loads
        const response = await page.goto(url, { waitUntil: 'networkidle0', timeout: 60000 });

        // BUG-04 check: If the page returns an HTTP error code natively, fail fast.
        if (response && !response.ok()) {
            return {
                url,
                title: 'Error',
                markdown: '',
                hash: '',
                error: `HTTP Error ${response.status()}: ${response.statusText()}`,
                childLinks: []
            };
        }

        // Wait for specific Salesforce content locators to appear to avoid grabbing 'Loading...' pages
        try {
            if (url.includes('help.salesforce.com')) {
                // Wait for the main body to render something meaningful (avoiding specific legacy classes)
                await page.waitForSelector('body', { timeout: 15000 });
            } else if (url.includes('developer.salesforce.com')) {
                await page.waitForFunction(() => {
                    return document.querySelector('doc-content-layout') ||
                        document.querySelector('doc-xml-content') ||
                        document.querySelector('iframe');
                }, { timeout: 10000 });
            }
        } catch (e) {
            console.warn(`Timeout waiting for specific content selectors on ${url}`);
        }

        // Additional wait just in case visual components are still sliding in
        await new Promise(r => setTimeout(r, 2000));

        // Take an opportunistic screenshot if development debugging layout issues
        if (url.includes('help.salesforce.com')) {
            await page.screenshot({ path: 'help_debug_test.png' }).catch(() => { });
        }

        // In-page extraction script
        const extraction = await page.evaluate(() => {
            // Flattened DOM Extraction Logic to bypass TS __name bugs
            let title = document.querySelector('title')?.innerText || 'Untitled';
            let finalHtml = '';
            const childLinks = new Set<string>();

            // Collect all same-site hierarchical links using an iterative deep shadow DOM search
            const rootsToProcess = [document as Document | ShadowRoot | Element];

            while (rootsToProcess.length > 0) {
                const currentRoot = rootsToProcess.pop()!;

                // Grab links
                const aTags = currentRoot.querySelectorAll('a');
                aTags.forEach(a => {
                    if (a.href && !a.href.startsWith('java') && !a.href.startsWith('mailto')) {
                        childLinks.add(a.href);
                    }
                });

                const allElements = currentRoot.querySelectorAll('*');
                for (let i = 0; i < allElements.length; i++) {
                    const el = allElements[i];
                    if (el.shadowRoot) {
                        rootsToProcess.push(el.shadowRoot);
                    }
                }
            }

            // BUG-04: Catch soft 404s rendered by the SPA
            if (title.includes('404 Error')) {
                return { html: '', title: 'Error', error: 'HTTP 404 - Page Not Found', childLinks: Array.from(childLinks) };
            }

            // Catch SPA shells that failed to load content BEFORE generic tag fallbacks
            const bodyHtml = document.body.innerText;
            if (bodyHtml.includes('Sorry to interrupt')) {
                return {
                    html: '',
                    title: 'Error',
                    error: 'Found no accessible documentation content on this page. It may require authentication, be a soft 404, rendering timed out, or JavaScript rendering is required.',
                    childLinks: []
                };
            }

            // --- STRATEGY 1: Iframe (Older Developer Guides) ---
            const iframe = document.querySelector('iframe');
            if (iframe && iframe.contentDocument && iframe.contentDocument.body) {
                const docHtml = iframe.contentDocument.querySelector('#doc')?.innerHTML ||
                    iframe.contentDocument.querySelector('body')?.innerHTML || '';
                const docTitle = iframe.contentDocument.querySelector('title')?.innerText ||
                    iframe.contentDocument.querySelector('h1')?.innerText || 'Untitled';

                iframe.contentDocument.querySelectorAll('a').forEach(a => {
                    if (a.href && !a.href.startsWith('java') && !a.href.startsWith('mailto')) {
                        childLinks.add(a.href);
                    }
                });

                if (docHtml.length > 500) {
                    return { html: docHtml, title: docTitle, childLinks: Array.from(childLinks) };
                }
            }

            // --- STRATEGY 2: Help.salesforce.com Shadow DOM Search ---
            let sldsText: Element | null = null;
            const searchRoots = [document as Document | ShadowRoot | Element];
            while (searchRoots.length > 0 && !sldsText) {
                const current = searchRoots.pop()!;
                const found = current.querySelector('.slds-text-longform');
                if (found) {
                    sldsText = found;
                    break;
                }
                const all = current.querySelectorAll('*');
                for (let i = 0; i < all.length; i++) {
                    if (all[i].shadowRoot) searchRoots.push(all[i].shadowRoot!);
                }
            }

            if (sldsText) {
                title = title.replace(' | Salesforce', '').trim();
                return { html: sldsText.innerHTML, title, childLinks: Array.from(childLinks) };
            }

            // --- STRATEGY 3: legacy doc-xml-content ---
            const docXmlContent = document.querySelector('doc-xml-content');
            if (docXmlContent?.shadowRoot) {
                const docContent = docXmlContent.shadowRoot.querySelector('doc-content');
                if (docContent?.shadowRoot) {
                    const innerHtml = docContent.shadowRoot.innerHTML;
                    const h1Match = innerHtml.match(/<h1[^>]*>(.*?)<\/h1>/);
                    if (h1Match) title = h1Match[1].replace(/<[^>]*>?/gm, '');

                    docContent.shadowRoot.querySelectorAll('a').forEach(a => {
                        if (a.href && !a.href.startsWith('java') && !a.href.startsWith('mailto')) {
                            childLinks.add(a.href);
                        }
                    });

                    return { html: innerHtml, title, childLinks: Array.from(childLinks) };
                }
            }

            // --- STRATEGY 4: Modern doc-amf-reference ---
            const docRef = document.querySelector('doc-amf-reference');
            if (docRef) {
                const markdownContent = docRef.querySelector('.markdown-content');
                if (markdownContent) {
                    // Quick and dirty extraction, bypass complex legacy parser
                    const h1 = markdownContent.querySelector('h1');
                    if (h1) title = h1.textContent?.trim() || title;
                    return { html: markdownContent.innerHTML, title, childLinks: Array.from(childLinks) };
                }
            }

            const docLayout = document.querySelector('doc-content-layout');
            if (docLayout?.shadowRoot) {
                const slot = docLayout.shadowRoot.querySelector('.content-body slot') as HTMLSlotElement | null;
                if (slot) {
                    const assignedElements = slot.assignedElements();
                    if (assignedElements.length > 0) {
                        let guideHtml = '';
                        for (const el of assignedElements) {
                            if (el.tagName?.toLowerCase() === 'h1') title = el.textContent?.trim() || title;
                            guideHtml += el.outerHTML;
                        }
                        return { html: guideHtml, title, childLinks: Array.from(childLinks) };
                    }
                }
            }

            // --- STRATEGY 5: Fallbacks ---
            const container = document.querySelector('article') || document.querySelector('main');
            if (container) {
                title = document.querySelector('h1')?.innerText || title;
                return { html: container.innerHTML, title, childLinks: Array.from(childLinks) };
            }

            // Complete fallback - BUG-01 & BUG-02
            const isHelpSite = window.location.href.includes('help.salesforce.com');
            if (isHelpSite || document.body.innerHTML.length > 100000) {
                return {
                    html: '',
                    title: 'Error',
                    error: 'Found no accessible documentation content on this page. It may require authentication, be a soft 404, rendering timed out, or JavaScript rendering is required.',
                    childLinks: []
                };
            }

            return {
                html: document.body.innerHTML,
                title,
                childLinks: Array.from(childLinks)
            };
        });

        if (!extraction.html || extraction.html.trim() === '') {
            return {
                url,
                title: extraction.title || 'Untitled',
                markdown: '',
                hash: '',
                error: (extraction as any).error || 'No content found on page',
                childLinks: extraction.childLinks || []
            };
        }

        // Convert to markdown
        let markdown = turndownService.turndown(extraction.html);

        // Filter child links to stay within the domain/base if provided, to avoid massive spidering
        let validLinks = extraction.childLinks;
        if (baseDomain) {
            validLinks = validLinks.filter(l => l.startsWith(baseDomain));
        }

        const hash = crypto.createHash('sha256').update(markdown).digest('hex');

        return {
            url,
            title: extraction.title,
            markdown,
            hash,
            childLinks: validLinks,
        };
    } catch (error: any) {
        return {
            url,
            title: 'Error',
            markdown: '',
            hash: '',
            error: error.message,
            childLinks: [],
        };
    } finally {
        await page.close();
    }
}
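
The link collection in the listing above relies on an explicit stack of roots because `querySelectorAll` does not cross shadow DOM boundaries. The following sketch models that traversal on plain objects so it can run outside a browser. `FakeNode`, `descendants`, and `collectLinks` are hypothetical names, and `descendants` is only a stand-in for `querySelectorAll('*')`.

```typescript
// Minimal stand-in for a DOM node; shadowRoot is a separate tree not reachable via children.
interface FakeNode {
    tag: string;
    href?: string;
    children?: FakeNode[];
    shadowRoot?: FakeNode;
}

// Stand-in for querySelectorAll('*'): all descendants within one tree,
// deliberately NOT descending into shadow roots.
function descendants(root: FakeNode): FakeNode[] {
    const out: FakeNode[] = [];
    const stack: FakeNode[] = [...(root.children ?? [])];
    while (stack.length > 0) {
        const node = stack.pop()!;
        out.push(node);
        stack.push(...(node.children ?? []));
    }
    return out;
}

// Iterative search over the document and every nested shadow root.
function collectLinks(documentRoot: FakeNode): string[] {
    const links = new Set<string>(); // the Set removes duplicate hrefs
    const rootsToProcess: FakeNode[] = [documentRoot];
    while (rootsToProcess.length > 0) {
        const currentRoot = rootsToProcess.pop()!;
        for (const el of descendants(currentRoot)) {
            // Same filter as the listing: skip javascript: and mailto: links.
            if (el.tag === "a" && el.href && !el.href.startsWith("java") && !el.href.startsWith("mailto")) {
                links.add(el.href);
            }
            // Queue each shadow root so its subtree is searched in a later iteration.
            if (el.shadowRoot) {
                rootsToProcess.push(el.shadowRoot);
            }
        }
    }
    return Array.from(links);
}

const fakePage: FakeNode = {
    tag: "document",
    children: [
        { tag: "a", href: "https://developer.salesforce.com/docs" },
        {
            tag: "doc-content-layout",
            shadowRoot: {
                tag: "shadow-root",
                children: [
                    { tag: "a", href: "https://help.salesforce.com/" },
                    { tag: "a", href: "mailto:docs@example.com" },
                    {
                        tag: "doc-content",
                        shadowRoot: {
                            tag: "shadow-root",
                            children: [
                                { tag: "a", href: "https://developer.salesforce.com/docs" },
                                { tag: "a", href: "javascript:void(0)" },
                            ],
                        },
                    },
                ],
            },
        },
    ],
};

const foundLinks = collectLinks(fakePage).sort();
console.log(foundLinks);
```

Using a stack rather than recursion keeps the logic flat, which matches the "flattened" style the in-page script comments mention.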

Unified Salesforce Documentation MCP Server
