Unified Salesforce Documentation MCP Server

by TMTrevisan

scrape_single_page

Scrape a Salesforce documentation page by URL and return its content as markdown.

Instructions

Scrape a single Salesforce documentation page. Returns markdown. If you do not know the exact URL, you should first use a Web Search tool (like Brave or DuckDuckGo) to search for 'site:developer.salesforce.com/docs [topic]' or 'site:help.salesforce.com [topic]', then pass the retrieved URL here.
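
The search-then-scrape workflow described above can be sketched as two small helpers. This is a minimal illustration, not part of the server: `buildDocSearchQueries` and `buildScrapeArgs` are hypothetical names, and the web search step itself is assumed to be handled by whatever search tool the agent has available.

```typescript
// Hypothetical helper: builds the two site-restricted queries that the
// tool instructions suggest running through a web search tool.
function buildDocSearchQueries(topic: string): string[] {
    const trimmed = topic.trim();
    if (trimmed === "") {
        throw new Error("topic must not be empty");
    }
    return [
        `site:developer.salesforce.com/docs ${trimmed}`,
        `site:help.salesforce.com ${trimmed}`,
    ];
}

// Hypothetical helper: shapes the arguments for scrape_single_page once a
// URL has been found. 'category' is only included when the caller sets it.
function buildScrapeArgs(url: string, category?: string): { url: string; category?: string } {
    return category === undefined ? { url } : { url, category };
}
```

An agent would pass one of the returned queries to its search tool, pick a result URL, and send the output of `buildScrapeArgs` as the tool arguments.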

Input Schema

| Name     | Required | Description | Default |
|----------|----------|-------------|---------|
| url      | Yes      |             |         |
| category | No       |             |         |
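
The table leaves the description and default columns empty, but the Zod schema in the Implementation Reference below requires `url` to be a valid URL and defaults `category` to 'general'. The following is a dependency-free sketch of those rules, assuming nothing beyond what that schema states. `parseScrapeArgs` and `ScrapeArgs` are hypothetical names, and this is a stand-in for illustration rather than the server's actual validation code.

```typescript
interface ScrapeArgs {
    url: string;
    category: string;
}

// Hypothetical stand-in for the Zod schema: 'url' must be a string that
// parses as an absolute URL, 'category' is optional and defaults to "general".
function parseScrapeArgs(args: unknown): ScrapeArgs {
    if (typeof args !== "object" || args === null) {
        throw new Error("arguments must be an object");
    }
    const { url, category } = args as { url?: unknown; category?: unknown };
    if (typeof url !== "string") {
        throw new Error("'url' is required and must be a string");
    }
    try {
        new URL(url); // throws when the string is not an absolute URL
    } catch {
        throw new Error(`'url' is not a valid URL: ${url}`);
    }
    let resolvedCategory = "general";
    if (category !== undefined) {
        if (typeof category !== "string") {
            throw new Error("'category' must be a string when provided");
        }
        resolvedCategory = category;
    }
    return { url, category: resolvedCategory };
}
```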

Implementation Reference

  • The CallToolRequestSchema handler that executes the 'scrape_single_page' tool. It parses the URL and category from args via ScrapePageSchema, calls scrapePage() from scraper.ts, saves the result to the database via saveDocument(), and returns the markdown content.
    if (name === "scrape_single_page") {
        const { url, category } = ScrapePageSchema.parse(args);
        console.error(`Scraping ${url}...`);
        const result = await scrapePage(url);
    
        if (result.error) {
            return { content: [{ type: "text", text: `Failed to scrape: ${result.error}` }] };
        }
    
        // Save automatically to local DB
        await saveDocument(url, result.title, result.markdown, result.hash, category);
    
        return {
            content: [{ type: "text", text: `# ${result.title}\n\n${result.markdown}` }]
        };
    }
  • ScrapePageSchema: Zod schema defining the input for 'scrape_single_page' — requires a valid 'url' string and optional 'category' string defaulting to 'general'.
    const ScrapePageSchema = z.object({
        url: z.string().url(),
        category: z.string().optional().default("general")
    });
  • src/index.ts:48-59 (registration)
    Tool registration entry in ListToolsRequestSchema handler. Declares the tool name 'scrape_single_page', its description, and JSON Schema input (url required, category optional).
    {
        name: "scrape_single_page",
        description: "Scrape a single Salesforce documentation page. Returns markdown. If you do not know the exact URL, you should first use a Web Search tool (like Brave or DuckDuckGo) to search for 'site:developer.salesforce.com/docs [topic]' or 'site:help.salesforce.com [topic]', then pass the retrieved URL here.",
        inputSchema: {
            type: "object",
            properties: {
                url: { type: "string" },
                category: { type: "string" }
            },
            required: ["url"]
        }
    },
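  • The scrapePage() function in scraper.ts, the core scraping logic called by the handler. It first tries an Aura API fast path (scrapeAuraArticle()), then native PDF extraction for URLs ending in .pdf, and otherwise falls back to rendering the page with puppeteer. The rendered page is searched with several DOM strategies (iframes, shadow DOM, etc.), the extracted HTML is converted to markdown using TurndownService, and the function returns a ScrapedPage object with title, markdown, hash, and childLinks (plus an error field on failure).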
    export async function scrapePage(url: string, baseDomain?: string): Promise<ScrapedPage> {
        // 1. Aura SPA Fast-Path directly hitting the backend Salesforce APIs
        const auraResult = await scrapeAuraArticle(url, baseDomain);
        if (auraResult) {
            return auraResult;
        }
    
        // 1.5 Native PDF Extraction
        if (url.toLowerCase().endsWith('.pdf')) {
            try {
                // Log to stderr, as the tool handler does, so stdout stays free for protocol output
                console.error(`[PDF Extraction] Downloading ${url}...`);
                const pdfResponse = await fetch(url);
                if (!pdfResponse.ok) {
                    return {
                        url,
                        title: 'Error',
                        markdown: '',
                        hash: '',
                        error: `PDF HTTP Error ${pdfResponse.status}: ${pdfResponse.statusText}`,
                        childLinks: []
                    };
                }
                const buffer = await pdfResponse.arrayBuffer();
                const data = await pdf(Buffer.from(buffer));
    
                // Generate a simple markdown representation
                const title = url.split('/').pop() || 'PDF Document';
                const markdown = `# ${title}\n\n${data.text}`;
                const hash = crypto.createHash('sha256').update(markdown).digest('hex');
    
                return {
                    url,
                    title,
                    markdown,
                    hash,
                    childLinks: [] // PDFs don't typically yield crawlable HTML links natively this way
                };
            } catch (e: any) {
                return {
                    url,
                    title: 'Error',
                    markdown: '',
                    hash: '',
                    error: `PDF Parse Error: ${e.message}`,
                    childLinks: []
                };
            }
        }
    
        // 2. Headless Chrome Fallback for everything else (LWC, Standard Web, etc.)
        const browserInstance = await getBrowser();
        const page = await browserInstance.newPage();
    
        try {
            await page.setViewport({ width: 1280, height: 800 });
            // User agent to look normal
            await page.setUserAgent('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36');
    
            // Wait until network is idle specifically to handle SPA renders and iframe loads
            const response = await page.goto(url, { waitUntil: 'networkidle0', timeout: 60000 });
    
            // BUG-04 check: If the page returns an HTTP error code natively, fail fast.
            if (response && !response.ok()) {
                return {
                    url,
                    title: 'Error',
                    markdown: '',
                    hash: '',
                    error: `HTTP Error ${response.status()}: ${response.statusText()}`,
                    childLinks: []
                };
            }
    
            // Wait for specific Salesforce content locators to appear to avoid grabbing 'Loading...' pages
            try {
                if (url.includes('help.salesforce.com')) {
                    // Wait for the main body to render something meaningful (avoiding specific legacy classes)
                    await page.waitForSelector('body', { timeout: 15000 });
                } else if (url.includes('developer.salesforce.com')) {
                    await page.waitForFunction(() => {
                        return document.querySelector('doc-content-layout') ||
                            document.querySelector('doc-xml-content') ||
                            document.querySelector('iframe');
                    }, { timeout: 10000 });
                }
            } catch (e) {
                console.warn(`Timeout waiting for specific content selectors on ${url}`);
            }
    
            // Additional wait just in case visual components are still sliding in
            await new Promise(r => setTimeout(r, 2000));
    
            // Take an opportunistic screenshot to help debug layout issues during development
            if (url.includes('help.salesforce.com')) {
                await page.screenshot({ path: 'help_debug_test.png' }).catch(() => { });
            }
    
            // In-page extraction script
            const extraction = await page.evaluate(() => {
                // Flattened DOM Extraction Logic to bypass TS __name bugs
                let title = document.querySelector('title')?.innerText || 'Untitled';
                let finalHtml = '';
                const childLinks = new Set<string>();
    
                // Collect all same-site hierarchical links using an iterative deep shadow DOM search
                const rootsToProcess = [document as Document | ShadowRoot | Element];
    
                while (rootsToProcess.length > 0) {
                    const currentRoot = rootsToProcess.pop()!;
    
                    // Grab links
                    const aTags = currentRoot.querySelectorAll('a');
                    aTags.forEach(a => {
                        if (a.href && !a.href.startsWith('java') && !a.href.startsWith('mailto')) {
                            childLinks.add(a.href);
                        }
                    });
    
                    const allElements = currentRoot.querySelectorAll('*');
                    for (let i = 0; i < allElements.length; i++) {
                        const el = allElements[i];
                        if (el.shadowRoot) {
                            rootsToProcess.push(el.shadowRoot);
                        }
                    }
                }
    
                // BUG-04: Catch soft 404s rendered by the SPA
                if (title.includes('404 Error')) {
                    return { html: '', title: 'Error', error: 'HTTP 404 - Page Not Found', childLinks: Array.from(childLinks) };
                }
    
                // Catch SPA shells that failed to load content BEFORE generic tag fallbacks
                const bodyText = document.body.innerText;
                if (bodyText.includes('Sorry to interrupt')) {
                    return {
                        html: '',
                        title: 'Error',
                        error: 'Found no accessible documentation content on this page. It may require authentication, be a soft 404, rendering timed out, or JavaScript rendering is required.',
                        childLinks: []
                    };
                }
    
                // --- STRATEGY 1: Iframe (Older Developer Guides) ---
                const iframe = document.querySelector('iframe');
                if (iframe && iframe.contentDocument && iframe.contentDocument.body) {
                    const docHtml = iframe.contentDocument.querySelector('#doc')?.innerHTML ||
                        iframe.contentDocument.querySelector('body')?.innerHTML || '';
                    const docTitle = iframe.contentDocument.querySelector('title')?.innerText ||
                        iframe.contentDocument.querySelector('h1')?.innerText || 'Untitled';
    
                    iframe.contentDocument.querySelectorAll('a').forEach(a => {
                        if (a.href && !a.href.startsWith('java') && !a.href.startsWith('mailto')) {
                            childLinks.add(a.href);
                        }
                    });
    
                    if (docHtml.length > 500) {
                        return { html: docHtml, title: docTitle, childLinks: Array.from(childLinks) };
                    }
                }
    
                // --- STRATEGY 2: Help.salesforce.com Shadow DOM Search ---
                let sldsText: Element | null = null;
                const searchRoots = [document as Document | ShadowRoot | Element];
                while (searchRoots.length > 0 && !sldsText) {
                    const current = searchRoots.pop()!;
                    const found = current.querySelector('.slds-text-longform');
                    if (found) {
                        sldsText = found;
                        break;
                    }
                    const all = current.querySelectorAll('*');
                    for (let i = 0; i < all.length; i++) {
                        if (all[i].shadowRoot) searchRoots.push(all[i].shadowRoot!);
                    }
                }
    
                if (sldsText) {
                    title = title.replace(' | Salesforce', '').trim();
                    return { html: sldsText.innerHTML, title, childLinks: Array.from(childLinks) };
                }
    
                // --- STRATEGY 3: legacy doc-xml-content ---
                const docXmlContent = document.querySelector('doc-xml-content');
                if (docXmlContent?.shadowRoot) {
                    const docContent = docXmlContent.shadowRoot.querySelector('doc-content');
                    if (docContent?.shadowRoot) {
                        const innerHtml = docContent.shadowRoot.innerHTML;
                        const h1Match = innerHtml.match(/<h1[^>]*>(.*?)<\/h1>/);
                        if (h1Match) title = h1Match[1].replace(/<[^>]*>?/gm, '');
    
                        docContent.shadowRoot.querySelectorAll('a').forEach(a => {
                            if (a.href && !a.href.startsWith('java') && !a.href.startsWith('mailto')) {
                                childLinks.add(a.href);
                            }
                        });
    
                        return { html: innerHtml, title, childLinks: Array.from(childLinks) };
                    }
                }
    
                // --- STRATEGY 4: Modern doc-amf-reference ---
                const docRef = document.querySelector('doc-amf-reference');
                if (docRef) {
                    const markdownContent = docRef.querySelector('.markdown-content');
                    if (markdownContent) {
                        // Quick and dirty extraction, bypass complex legacy parser
                        const h1 = markdownContent.querySelector('h1');
                        if (h1) title = h1.textContent?.trim() || title;
                        return { html: markdownContent.innerHTML, title, childLinks: Array.from(childLinks) };
                    }
                }
    
                // Still part of STRATEGY 4: guide pages that slot their content into doc-content-layout
                const docLayout = document.querySelector('doc-content-layout');
                if (docLayout?.shadowRoot) {
                    const slot = docLayout.shadowRoot.querySelector('.content-body slot') as HTMLSlotElement | null;
                    if (slot) {
                        const assignedElements = slot.assignedElements();
                        if (assignedElements.length > 0) {
                            let guideHtml = '';
                            for (const el of assignedElements) {
                                if (el.tagName?.toLowerCase() === 'h1') title = el.textContent?.trim() || title;
                                guideHtml += el.outerHTML;
                            }
                            return { html: guideHtml, title, childLinks: Array.from(childLinks) };
                        }
                    }
                }
    
                // --- STRATEGY 5: Fallbacks ---
                const container = document.querySelector('article') || document.querySelector('main');
                if (container) {
                    title = document.querySelector('h1')?.innerText || title;
                    return { html: container.innerHTML, title, childLinks: Array.from(childLinks) };
                }
    
                // Complete fallback - BUG-01 & BUG-02
                const isHelpSite = window.location.href.includes('help.salesforce.com');
                if (isHelpSite || document.body.innerHTML.length > 100000) {
                    return {
                        html: '',
                        title: 'Error',
                        error: 'Found no accessible documentation content on this page. It may require authentication, be a soft 404, rendering timed out, or JavaScript rendering is required.',
                        childLinks: []
                    };
                }
    
                return {
                    html: document.body.innerHTML,
                    title,
                    childLinks: Array.from(childLinks)
                };
            });
    
            if (!extraction.html || extraction.html.trim() === '') {
                return {
                    url,
                    title: extraction.title || 'Untitled',
                    markdown: '',
                    hash: '',
                    error: (extraction as any).error || 'No content found on page',
                    childLinks: extraction.childLinks || []
                };
            }
    
            // Convert to markdown
            let markdown = turndownService.turndown(extraction.html);
    
            // Filter child links to stay within the domain/base if provided, to avoid massive spidering
            let validLinks = extraction.childLinks;
            if (baseDomain) {
                validLinks = validLinks.filter(l => l.startsWith(baseDomain));
            }
    
            const hash = crypto.createHash('sha256').update(markdown).digest('hex');
    
            return {
                url,
                title: extraction.title,
                markdown,
                hash,
                childLinks: validLinks,
            };
        } catch (error: any) {
            return {
                url,
                title: 'Error',
                markdown: '',
                hash: '',
                error: error.message,
                childLinks: [],
            };
        } finally {
            await page.close();
        }
    }
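
The link-collection loop in the in-page script reaches links hidden inside nested shadow roots by keeping an explicit stack of roots rather than recursing. The sketch below shows the same idea on a plain object tree so it can run outside a browser. `TreeNode` and `collectLinks` are hypothetical names, and the tree is walked node by node here, whereas the original calls `querySelectorAll` on each root.

```typescript
// Minimal structural stand-in for DOM nodes (an assumption for illustration).
interface TreeNode {
    href?: string;         // set on link-like nodes
    children: TreeNode[];  // light-DOM children
    shadowRoot?: TreeNode; // optional attached shadow tree
}

// Iteratively visits every node, descending into shadow roots, and returns
// the de-duplicated link targets, skipping javascript: and mailto: links.
function collectLinks(root: TreeNode): string[] {
    const links = new Set<string>();
    const stack: TreeNode[] = [root];
    while (stack.length > 0) {
        const node = stack.pop()!;
        if (node.href && !node.href.startsWith("javascript:") && !node.href.startsWith("mailto:")) {
            links.add(node.href);
        }
        if (node.shadowRoot) {
            stack.push(node.shadowRoot);
        }
        for (const child of node.children) {
            stack.push(child);
        }
    }
    return Array.from(links);
}
```

An explicit stack avoids deep call chains on heavily nested component trees, and the `Set` means a link that appears in both the light DOM and a shadow tree is reported once.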
Behavior: 3/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

No annotations are provided, so the description carries full burden. It states the tool scrapes a page and returns markdown, but does not disclose error handling, rate limits, or authentication requirements. This is adequate but lacks richness.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
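
One way to make failures more visible to agents is sketched below. In the handler shown in the Implementation Reference, a failed scrape is returned as ordinary text prefixed with "Failed to scrape:". MCP tool results can also carry an optional `isError` flag, and this sketch sets it on failure. `ScrapeOutcome`, `ToolResult`, and `toToolResult` are hypothetical names, and the server as listed does not do this.

```typescript
// Hypothetical, trimmed-down shapes for illustration only.
interface ScrapeOutcome {
    title: string;
    markdown: string;
    error?: string;
}

interface ToolResult {
    content: { type: "text"; text: string }[];
    isError?: boolean;
}

// Maps a scrape outcome to a tool result, flagging failures explicitly
// instead of relying on the caller to parse the message text.
function toToolResult(outcome: ScrapeOutcome): ToolResult {
    if (outcome.error) {
        return {
            content: [{ type: "text", text: `Failed to scrape: ${outcome.error}` }],
            isError: true,
        };
    }
    return {
        content: [{ type: "text", text: `# ${outcome.title}\n\n${outcome.markdown}` }],
    };
}
```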

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Two sentences cover the function and crucial usage guidance. No filler; every word earns its place.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's simplicity (2 params, no nested objects, no output schema), the description covers the main usage. It could mention the optional 'category' parameter or the expected markdown structure, but overall it is sufficient.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema coverage is 0%; the description adds meaning to 'url' by specifying it must be a Salesforce documentation URL and advising how to find it. However, 'category' is entirely undocumented, leaving a gap.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
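
To illustrate how the schema-coverage gap could be closed, here is a sketch of the same input schema with per-parameter descriptions added. The description wording is an assumption based on how the handler uses each argument (`category` is passed to saveDocument(), and the Zod schema defaults it to 'general'). `describedInputSchema` is a hypothetical name, not the server's registration.

```typescript
// Hypothetical variant of the registered inputSchema with descriptions filled in.
const describedInputSchema = {
    type: "object",
    properties: {
        url: {
            type: "string",
            format: "uri",
            // Assumed wording; the original schema has no description here.
            description: "Absolute URL of the Salesforce documentation page to scrape."
        },
        category: {
            type: "string",
            // Assumed wording, inferred from the saveDocument() call in the handler.
            description: "Optional label stored with the saved document.",
            default: "general"
        }
    },
    required: ["url"]
} as const;
```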

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states that the tool scrapes a single Salesforce documentation page and returns markdown. It specifies the resource (Salesforce docs) and the output format, distinguishing it from siblings which deal with local documents.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 5/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

Explicitly instructs the agent to use a web search tool if the exact URL is unknown, providing specific search queries. This gives clear guidance on when to use this tool versus search tools.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
