# mass_extract_guide
Spider and scrape hierarchical Salesforce documentation from a root page, storing content in a local SQLite database for offline search.
## Instructions
Spiders a root Salesforce documentation page, extracts hierarchical links, and scrapes them in bulk. Stores contents in a local SQLite database for later searching.
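For context, a call to this tool from an MCP client might look like the sketch below. This is a minimal, hypothetical example assuming the `@modelcontextprotocol/sdk` client API over stdio; the server command and the `rootUrl` are placeholders, not values from this project.

```typescript
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

// Hypothetical client setup; the server command is a placeholder.
const transport = new StdioClientTransport({ command: "node", args: ["dist/index.js"] });
const client = new Client({ name: "example-client", version: "1.0.0" });
await client.connect(transport);

// Invoke mass_extract_guide with a Table of Contents URL and a keyword filter.
const result = await client.callTool({
  name: "mass_extract_guide",
  arguments: {
    rootUrl: "https://developer.salesforce.com/docs/example-guide", // placeholder URL
    maxPages: 30,
    matchKeyword: "apex",
  },
});
console.log(result.content);
```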
## Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| rootUrl | Yes | The Table of Contents or landing page. | |
| maxPages | No | Maximum number of pages to extract (max 100). | 20 |
| category | No | Category label stored with each saved document. | general |
| matchKeyword | No | Optional substring. If provided, the crawler will prioritize scraping child links containing this string in their URL. | |
## Implementation Reference
- src/index.ts:23-28 (schema): Zod schema defining the input validation for the mass_extract_guide tool: rootUrl (required URL), maxPages (default 20, max 100), category (default 'general'), matchKeyword (optional filter).
```typescript
const MassExtractSchema = z.object({
  rootUrl: z.string().url(),
  maxPages: z.number().int().min(1).max(100).optional().default(20),
  category: z.string().optional().default("general"),
  matchKeyword: z
    .string()
    .optional()
    .describe(
      "Optional substring. If provided, the crawler will prioritize scraping child links containing this string."
    ),
});
```
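To illustrate the defaults, parsing a minimal argument object fills in maxPages and category automatically. A small self-contained sketch (the schema is restated for completeness; the URL is a placeholder):

```typescript
import { z } from "zod";

// Same shape as MassExtractSchema above, restated so this snippet runs on its own.
const MassExtractSchema = z.object({
  rootUrl: z.string().url(),
  maxPages: z.number().int().min(1).max(100).optional().default(20),
  category: z.string().optional().default("general"),
  matchKeyword: z.string().optional(),
});

const parsed = MassExtractSchema.parse({ rootUrl: "https://example.com/docs" });
// parsed => { rootUrl: "https://example.com/docs", maxPages: 20, category: "general" }
console.log(parsed);
```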
- src/index.ts:60-73 (registration): Registration of the mass_extract_guide tool in the ListToolsRequestSchema handler, declaring its name, description, and JSON input schema.

```typescript
{
  name: "mass_extract_guide",
  description:
    "Spiders a root Salesforce documentation page, extracts hierarchical links, and scrapes them in bulk. Stores contents in a local SQLite database for later searching.",
  inputSchema: {
    type: "object",
    properties: {
      rootUrl: { type: "string", description: "The Table of Contents or landing page." },
      maxPages: { type: "number", description: "Maximum number of pages to extract (default 20, max 100)." },
      category: { type: "string" },
      matchKeyword: {
        type: "string",
        description:
          "Optional substring. If provided, the crawler will prioritize scraping child links containing this string in their URL."
      }
    },
    required: ["rootUrl"]
  }
},
```
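For orientation, this object is one entry in the tools array returned by the ListTools handler. A minimal sketch of that surrounding registration, assuming the standard `@modelcontextprotocol/sdk` server API and hypothetical server metadata:

```typescript
import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { ListToolsRequestSchema } from "@modelcontextprotocol/sdk/types.js";

const server = new Server(
  { name: "sf-docs-server", version: "1.0.0" }, // hypothetical metadata, not from this repo
  { capabilities: { tools: {} } }
);

server.setRequestHandler(ListToolsRequestSchema, async () => ({
  tools: [
    {
      name: "mass_extract_guide",
      description: "Spiders a root Salesforce documentation page...", // abridged
      inputSchema: { type: "object" }, // full JSON schema as shown above
    },
  ],
}));

await server.connect(new StdioServerTransport());
```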
- src/index.ts:135-194 (handler): Handler implementation for mass_extract_guide: scrapes the root URL, extracts child links, optionally sorts them by keyword match, then scrapes each child link up to maxPages, saving all results to the SQLite database.

```typescript
if (name === "mass_extract_guide") {
  const { rootUrl, maxPages, category, matchKeyword } = MassExtractSchema.parse(args);
  console.error(`Starting mass extraction at ${rootUrl}`);

  // Scrape root to get links
  const rootResult = await scrapePage(rootUrl, new URL(rootUrl).origin);
  if (rootResult.error) {
    return { content: [{ type: "text", text: `Root scrape failed: ${rootResult.error}` }], isError: true };
  }
  await saveDocument(rootUrl, rootResult.title, rootResult.markdown, rootResult.hash, category);

  // Bug-08: Optional keyword sorting to prioritize relevant pages
  let allLinks = [...new Set(rootResult.childLinks)].filter(l => l !== rootUrl);
  if (matchKeyword) {
    const keywordLower = matchKeyword.toLowerCase();
    allLinks.sort((a, b) => {
      const aMatch = a.toLowerCase().includes(keywordLower) ? -1 : 1;
      const bMatch = b.toLowerCase().includes(keywordLower) ? -1 : 1;
      return aMatch - bMatch;
    });
  }

  const queue = allLinks.slice(0, maxPages);
  let successRaw = 1;
  let failureCount = 0;
  const successfulUrls: string[] = [rootUrl];
  const failedUrls: string[] = [];

  for (const link of queue) {
    console.error(`Scraping queued link: ${link}`);
    const pg = await scrapePage(link, new URL(rootUrl).origin);
    if (!pg.error) {
      await saveDocument(pg.url, pg.title, pg.markdown, pg.hash, category);
      successRaw++;
      successfulUrls.push(pg.url);
    } else {
      console.error(`Failed on ${link}: ${pg.error}`);
      failureCount++;
      failedUrls.push(link);
    }
  }

  let outputText = `Mass extraction complete.\nSuccessfully extracted and saved ${successRaw} pages:\n`;
  for (const u of successfulUrls) {
    outputText += `- ${u}\n`;
  }
  if (failureCount > 0) {
    outputText += `\nFailed to extract ${failureCount} pages:\n`;
    for (const u of failedUrls) {
      outputText += `- ${u}\n`;
    }
  }
  outputText += `\nDatabase updated.`;

  return { content: [{ type: "text", text: outputText }] };
}
```
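A note on the keyword prioritization: the comparator maps each URL to -1 (match) or 1 (no match), and because Array.prototype.sort is stable in modern JavaScript engines, matching links move to the front while each group keeps its original relative order. A standalone sketch of the same comparator with hypothetical URLs:

```typescript
const matchKeyword = "apex";
const links = [
  "https://example.com/docs/visualforce",
  "https://example.com/docs/apex-triggers",
  "https://example.com/docs/lwc",
  "https://example.com/docs/apex-classes",
];

const keywordLower = matchKeyword.toLowerCase();
links.sort((a, b) => {
  const aMatch = a.toLowerCase().includes(keywordLower) ? -1 : 1;
  const bMatch = b.toLowerCase().includes(keywordLower) ? -1 : 1;
  return aMatch - bMatch; // negative when only `a` matches, so matches sort first
});

console.log(links);
// => apex-triggers, apex-classes, visualforce, lwc
```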