
Unified Salesforce Documentation MCP Server

by TMTrevisan

mass_extract_guide

Spider and scrape hierarchical Salesforce documentation from a root page, storing content in a local SQLite database for offline search.

Instructions

Spiders a root Salesforce documentation page, extracts hierarchical links, and scrapes them in bulk. Stores contents in a local SQLite database for later searching.

Input Schema

  • rootUrl (required): The Table of Contents or landing page.
  • maxPages (optional, default 20): Maximum number of pages to extract (max 100).
  • category (optional, default "general"): no description provided in the schema.
  • matchKeyword (optional): If provided, the crawler will prioritize scraping child links containing this string in their URL.
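
For illustration, a client-side invocation might look like the sketch below. It assumes the TypeScript MCP SDK and a locally built server; the client wiring, launch command, and root URL are hypothetical, and only the tool name and argument names come from the schema above.

    // Minimal sketch: the SDK client setup and paths here are assumptions,
    // not part of this server's documentation.
    import { Client } from "@modelcontextprotocol/sdk/client/index.js";
    import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

    const client = new Client({ name: "example-client", version: "1.0.0" });
    await client.connect(new StdioClientTransport({
        command: "node",
        args: ["build/index.js"] // hypothetical path to the server's compiled entry point
    }));

    // Only rootUrl is required; maxPages defaults to 20 and category to "general".
    const result = await client.callTool({
        name: "mass_extract_guide",
        arguments: {
            rootUrl: "https://example.com/salesforce-docs/table-of-contents", // hypothetical TOC page
            maxPages: 30,
            matchKeyword: "flow" // prioritize child links whose URL contains "flow"
        }
    });
    console.log(result.content);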

Implementation Reference

  • Zod schema defining the input validation for the mass_extract_guide tool: rootUrl (required URL), maxPages (default 20, max 100), category (default 'general'), matchKeyword (optional filter).
    const MassExtractSchema = z.object({
        rootUrl: z.string().url(),
        maxPages: z.number().int().min(1).max(100).optional().default(20),
        category: z.string().optional().default("general"),
        matchKeyword: z.string().optional().describe("Optional substring. If provided, the crawler will prioritize scraping child links containing this string.")
    });
  • src/index.ts:60-73 (registration)
    Registration of the 'mass_extract_guide' tool in the ListToolsRequestSchema handler, declaring its name, description, and JSON input schema.
    {
        name: "mass_extract_guide",
        description: "Spiders a root Salesforce documentation page, extracts hierarchical links, and scrapes them in bulk. Stores contents in a local SQLite database for later searching.",
        inputSchema: {
            type: "object",
            properties: {
                rootUrl: { type: "string", description: "The Table of Contents or landing page." },
                maxPages: { type: "number", description: "Maximum number of pages to extract (default 20, max 100)." },
                category: { type: "string" },
                matchKeyword: { type: "string", description: "Optional substring. If provided, the crawler will prioritize scraping child links containing this string in their URL." }
            },
            required: ["rootUrl"]
        }
    },
  • Handler implementation for mass_extract_guide: scrapes the root URL, extracts child links, optionally sorts by keyword match, then scrapes each child link up to maxPages, saving all results to the SQLite database.
    if (name === "mass_extract_guide") {
        const { rootUrl, maxPages, category, matchKeyword } = MassExtractSchema.parse(args);
    
        console.error(`Starting mass extraction at ${rootUrl}`);
    
        // Scrape root to get links
        const rootResult = await scrapePage(rootUrl, new URL(rootUrl).origin);
        if (rootResult.error) {
            return { content: [{ type: "text", text: `Root scrape failed: ${rootResult.error}` }], isError: true };
        }
    
        await saveDocument(rootUrl, rootResult.title, rootResult.markdown, rootResult.hash, category);
    
        // Bug-08: Optional keyword sorting to prioritize relevant pages
        let allLinks = [...new Set(rootResult.childLinks)].filter(l => l !== rootUrl);
        if (matchKeyword) {
            const keywordLower = matchKeyword.toLowerCase();
            allLinks.sort((a, b) => {
                const aMatch = a.toLowerCase().includes(keywordLower) ? -1 : 1;
                const bMatch = b.toLowerCase().includes(keywordLower) ? -1 : 1;
                return aMatch - bMatch;
            });
        }
    
        const queue = allLinks.slice(0, maxPages);
        let successRaw = 1;
        let failureCount = 0;
        const successfulUrls: string[] = [rootUrl];
        const failedUrls: string[] = [];
    
        for (const link of queue) {
            console.error(`Scraping queued link: ${link}`);
            const pg = await scrapePage(link, new URL(rootUrl).origin);
            if (!pg.error) {
                await saveDocument(pg.url, pg.title, pg.markdown, pg.hash, category);
                successRaw++;
                successfulUrls.push(pg.url);
            } else {
                console.error(`Failed on ${link}: ${pg.error}`);
                failureCount++;
                failedUrls.push(link);
            }
        }
    
        let outputText = `Mass extraction complete.\nSuccessfully extracted and saved ${successRaw} pages:\n`;
        for (const u of successfulUrls) {
            outputText += `- ${u}\n`;
        }
        if (failureCount > 0) {
            outputText += `\nFailed to extract ${failureCount} pages:\n`;
            for (const u of failedUrls) {
                outputText += `- ${u}\n`;
            }
        }
        outputText += `\nDatabase updated.`;
    
        return {
            content: [{ type: "text", text: outputText }]
        };
    }
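  • Sketch of the scrapePage and saveDocument helper signatures as inferred from the handler's call sites above; this is a hedged reconstruction, not the repository's actual definitions, and the commented field purposes are assumptions.
    // Inferred from how the handler uses these helpers; real implementations may differ.
    interface ScrapeResult {
        url: string;          // final URL of the scraped page
        title: string;        // page title
        markdown: string;     // page content converted to Markdown
        hash: string;         // content hash (assumed to support change detection)
        childLinks: string[]; // links discovered on the page
        error?: string;       // set when the scrape fails
    }

    // Fetches a page and extracts content plus child links; the second argument is the
    // root origin, presumably used to scope link discovery (an assumption).
    declare function scrapePage(url: string, origin: string): Promise<ScrapeResult>;

    // Persists a scraped document to the local SQLite database under a category
    // (exact write semantics, e.g. insert vs. upsert, are not shown above).
    declare function saveDocument(
        url: string,
        title: string,
        markdown: string,
        hash: string,
        category: string
    ): Promise<void>;
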
Behavior 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

No annotations are provided, so the description carries the full burden. It discloses that the tool stores contents in a local SQLite database for later searching, which is a key behavioral trait beyond the input schema. However, it does not address side effects such as whether existing records are overwritten, nor rate limits or permission requirements.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Two efficient sentences with no waste. Front-loaded with the primary verb 'Spiders'.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

For a spidering and bulk scraping tool without output schema, the description adequately explains the process and storage. It could be more complete by noting the relationship to sibling tools (e.g., search_local_docs for retrieval), but it is still sufficient.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is high (75%). The tool description does not add parameter-specific details beyond the schema; it only explains the overall process. For the undocumented 'category' parameter, neither schema nor description provides guidance, so the description does not compensate.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description uses specific verbs ('spiders', 'extracts', 'scrapes', 'stores') and clearly identifies the resource (Salesforce documentation page) and the bulk hierarchical action. It distinguishes from siblings like scrape_single_page (single page) and export/read/search local docs (different operations).

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines 3/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description implies usage for bulk extraction from a root page but does not explicitly state when to use this tool versus alternatives (e.g., scrape_single_page for a single page). No "when not to use" guidance or exclusion criteria are provided.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

