scrapeDeep
Extract comprehensive web content, including images, using deep scraping techniques with customizable parameters such as scroll depth, image size, and pagination. Output data to a specified directory for thorough analysis.
Instructions
Maximum extraction web scraping (slower but thorough)
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| downloadImages | No | Whether to download images locally | false |
| imageOutput | No | Output directory for downloaded images | `config.serverOptions.defaultImageOutputDir` |
| maxImages | No | Maximum number of images to extract | 100 |
| maxScrolls | No | Maximum number of scroll attempts | 20 |
| minImageSize | No | Minimum width/height for images in pixels | 100 |
| output | No | Output directory for general results | `config.serverOptions.defaultOutputDir` |
| pages | No | Number of pages to scrape (if pagination is present) | 1 |
| scrapeImages | No | Whether to include images in the scrape result | false |
| scrollDelay | No | Delay between scrolls in ms | 3000 |
| url | Yes | URL of the webpage to scrape | |
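To illustrate how these defaults are applied, here is a hedged sketch mirroring the parameter destructuring in the handler. The `ScrapeDeepParams` interface and `withDefaults` helper are hypothetical names introduced for this example; they are not part of the tool's source.

```typescript
// Hypothetical sketch of how the handler fills in defaults
// (mirrors the destructuring defaults in src/tools/scrapeDeep.ts).
interface ScrapeDeepParams {
  url: string; // required
  maxScrolls?: number;
  scrollDelay?: number;
  pages?: number;
  scrapeImages?: boolean;
  downloadImages?: boolean;
  maxImages?: number;
  minImageSize?: number;
  output?: string;
  imageOutput?: string;
}

function withDefaults(params: ScrapeDeepParams) {
  const {
    url,
    maxScrolls = 20,
    scrollDelay = 3000,
    pages = 1,
    scrapeImages = false,
    downloadImages = false,
    maxImages = 100,
    minImageSize = 100,
  } = params;
  return {
    url,
    maxScrolls,
    scrollDelay,
    pages,
    // downloadImages implies scrapeImages, as in the handler
    scrapeImages: scrapeImages || downloadImages,
    downloadImages,
    maxImages,
    minImageSize,
  };
}

const opts = withDefaults({ url: "https://example.com", downloadImages: true });
console.log(opts.maxScrolls, opts.scrapeImages); // 20 true
```

Note that requesting `downloadImages` alone is enough to turn image scraping on, since the handler forces `scrapeImages` whenever downloads are requested.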
Implementation Reference
- src/tools/scrapeDeep.ts:56-116 (handler): The main handler function for the scrapeDeep tool. It destructures the parameters, sets deep-scraping options, calls prysm.scrape, limits the result to fit MCP constraints, and handles errors.

```typescript
handler: async (params: ScraperBaseParams): Promise<ScraperResponse> => {
  const {
    url,
    maxScrolls = 20,
    scrollDelay = 3000,
    pages = 1,
    scrapeImages = false,
    downloadImages = false,
    maxImages = 100,
    minImageSize = 100,
    output,
    imageOutput
  } = params;

  try {
    // Create options object for the scraper
    const options = {
      maxScrolls,
      scrollDelay,
      pages,
      focused: false,
      standard: false,
      deep: true, // Use deep mode for thorough extraction
      scrapeImages: scrapeImages || downloadImages,
      downloadImages,
      maxImages,
      minImageSize,
      output: output || config.serverOptions.defaultOutputDir, // Use configured default if not provided
      imageOutput: imageOutput || config.serverOptions.defaultImageOutputDir // Use configured default if not provided
    };

    const result = await prysm.scrape(url, options) as ScraperResponse;

    // Limit content size to prevent overwhelming the MCP client
    if (result.content && result.content.length > 0) {
      // Limit the number of content sections
      if (result.content.length > 30) {
        result.content = result.content.slice(0, 30);
        result.content.push("(Content truncated due to size limitations)");
      }

      // Limit the size of each content section
      result.content = result.content.map(section => {
        if (section.length > 10000) {
          return section.substring(0, 10000) + "... (truncated)";
        }
        return section;
      });
    }

    // Limit the number of images to return
    if (result.images && result.images.length > 30) {
      result.images = result.images.slice(0, 30);
    }

    return result;
  } catch (error) {
    console.error(`Error scraping ${url}:`, error);
    // Return a proper error format for MCP
    return {
      title: "Scraping Error",
      content: [`Failed to scrape ${url}: ${error instanceof Error ? error.message : String(error)}`],
      images: [],
      metadata: { error: true },
      url: url,
      structureType: "error",
      paginationType: "none",
      extractionMethod: "none"
    };
  }
}
```

- src/tools/scrapeDeep.ts:10-55 (schema): JSON Schema defining the input parameters for the scrapeDeep tool, including the required `url` and optional scraping options.

```typescript
parameters: {
  type: 'object',
  properties: {
    url: {
      type: 'string',
      description: 'URL of the webpage to scrape'
    },
    maxScrolls: {
      type: 'number',
      description: 'Maximum number of scroll attempts (default: 20)'
    },
    scrollDelay: {
      type: 'number',
      description: 'Delay between scrolls in ms (default: 3000)'
    },
    pages: {
      type: 'number',
      description: 'Number of pages to scrape (if pagination is present)'
    },
    scrapeImages: {
      type: 'boolean',
      description: 'Whether to include images in the scrape result'
    },
    downloadImages: {
      type: 'boolean',
      description: 'Whether to download images locally'
    },
    maxImages: {
      type: 'number',
      description: 'Maximum number of images to extract'
    },
    minImageSize: {
      type: 'number',
      description: 'Minimum width/height for images in pixels'
    },
    output: {
      type: 'string',
      description: 'Output directory for general results'
    },
    imageOutput: {
      type: 'string',
      description: 'Output directory for downloaded images'
    }
  },
  required: ['url']
},
```

- src/config.ts:65-71 (registration): Registration of the scrapeDeep tool in the main MCP server configuration's tools array.

```typescript
tools: [
  scrapeFocused,
  scrapeBalanced,
  scrapeDeep,
  // analyzeUrl,
  formatResult
],
```

- src/tools/index.ts:8-14 (registration): Intermediate registration/export of tool definitions, including scrapeDeep, in tools/index.ts.

```typescript
export const toolDefinitions: ToolDefinition[] = [
  scrapeFocused,
  scrapeBalanced,
  scrapeDeep,
  // analyzeUrl,
  formatResult,
];
```
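The handler's size-limiting step can be isolated for clarity: results are capped at 30 content sections (plus a truncation notice) and 30 images, and each section is trimmed to 10,000 characters. Below is a hedged standalone sketch of that logic; `limitContent` is a hypothetical helper introduced for this example, not a function in the source.

```typescript
// Hypothetical standalone version of the handler's content-limiting step:
// cap the number of sections, append a truncation notice, and trim each
// section to a maximum character count.
function limitContent(
  sections: string[],
  maxSections = 30,
  maxChars = 10000
): string[] {
  let out = sections;
  if (out.length > maxSections) {
    out = out.slice(0, maxSections);
    out.push("(Content truncated due to size limitations)");
  }
  return out.map(section =>
    section.length > maxChars
      ? section.substring(0, maxChars) + "... (truncated)"
      : section
  );
}
```

One detail worth noting: because the notice is pushed after the slice, a truncated result actually contains 31 entries (30 sections plus the notice), matching the handler's behavior.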