# scrapeDeep
Extract comprehensive web content, including images, using deep scraping techniques with customizable parameters such as scroll depth, image size, and pagination. Output data to a specified directory for thorough analysis.
## Instructions
Maximum extraction web scraping (slower but thorough)
## Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| downloadImages | No | Whether to download images locally | `false` |
| imageOutput | No | Output directory for downloaded images | |
| maxImages | No | Maximum number of images to extract | `100` |
| maxScrolls | No | Maximum number of scroll attempts | `20` |
| minImageSize | No | Minimum width/height for images in pixels | `100` |
| output | No | Output directory for general results | |
| pages | No | Number of pages to scrape (if pagination is present) | `1` |
| scrapeImages | No | Whether to include images in the scrape result | `false` |
| scrollDelay | No | Delay between scrolls in ms | `3000` |
| url | Yes | URL of the webpage to scrape | |
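As an illustration, a tool call that scrolls deeply and downloads images might pass arguments like the following (the URL, sizes, and output path are hypothetical):

```json
{
  "url": "https://example.com/gallery",
  "maxScrolls": 30,
  "scrollDelay": 2000,
  "pages": 2,
  "scrapeImages": true,
  "downloadImages": true,
  "maxImages": 50,
  "minImageSize": 200,
  "imageOutput": "./scraped-images"
}
```

Any option left out falls back to the defaults listed above, and `output`/`imageOutput` fall back to the directories configured on the server.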
## Implementation Reference
- **src/tools/scrapeDeep.ts:56-116** (handler) — The main handler function for the scrapeDeep tool. It destructures the parameters, builds deep-mode scraping options, calls `prysm.scrape`, trims the result to fit MCP size constraints, and handles errors.

```typescript
handler: async (params: ScraperBaseParams): Promise<ScraperResponse> => {
  const {
    url,
    maxScrolls = 20,
    scrollDelay = 3000,
    pages = 1,
    scrapeImages = false,
    downloadImages = false,
    maxImages = 100,
    minImageSize = 100,
    output,
    imageOutput
  } = params;

  try {
    // Create options object for the scraper
    const options = {
      maxScrolls,
      scrollDelay,
      pages,
      focused: false,
      standard: false,
      deep: true, // Use deep mode for thorough extraction
      scrapeImages: scrapeImages || downloadImages,
      downloadImages,
      maxImages,
      minImageSize,
      output: output || config.serverOptions.defaultOutputDir, // Use configured default if not provided
      imageOutput: imageOutput || config.serverOptions.defaultImageOutputDir // Use configured default if not provided
    };

    const result = await prysm.scrape(url, options) as ScraperResponse;

    // Limit content size to prevent overwhelming the MCP client
    if (result.content && result.content.length > 0) {
      // Limit the number of content sections
      if (result.content.length > 30) {
        result.content = result.content.slice(0, 30);
        result.content.push("(Content truncated due to size limitations)");
      }

      // Limit the size of each content section
      result.content = result.content.map(section => {
        if (section.length > 10000) {
          return section.substring(0, 10000) + "... (truncated)";
        }
        return section;
      });
    }

    // Limit the number of images to return
    if (result.images && result.images.length > 30) {
      result.images = result.images.slice(0, 30);
    }

    return result;
  } catch (error) {
    console.error(`Error scraping ${url}:`, error);
    // Return a proper error format for MCP
    return {
      title: "Scraping Error",
      content: [`Failed to scrape ${url}: ${error instanceof Error ? error.message : String(error)}`],
      images: [],
      metadata: { error: true },
      url: url,
      structureType: "error",
      paginationType: "none",
      extractionMethod: "none"
    };
  }
}
```
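The size-limiting step in the handler can be isolated as a small pure function. This is a sketch for illustration only; `limitContent` and the constants are names introduced here, not exports of the actual module:

```typescript
// Sketch of the handler's content trimming: cap the number of sections
// at 30 (appending a truncation notice) and cap each section at 10,000
// characters. Mirrors the logic shown above.
const MAX_SECTIONS = 30;
const MAX_SECTION_CHARS = 10000;

function limitContent(content: string[]): string[] {
  let sections = content;
  // Keep only the first MAX_SECTIONS sections and note the truncation
  if (sections.length > MAX_SECTIONS) {
    sections = sections.slice(0, MAX_SECTIONS);
    sections.push("(Content truncated due to size limitations)");
  }
  // Trim any individual section that exceeds the per-section limit
  return sections.map(section =>
    section.length > MAX_SECTION_CHARS
      ? section.substring(0, MAX_SECTION_CHARS) + "... (truncated)"
      : section
  );
}
```

Keeping this logic separate from the scrape call makes the MCP response-size policy easy to test and reuse across the focused, balanced, and deep tools.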
- **src/tools/scrapeDeep.ts:10-55** (schema) — JSON Schema defining the input parameters for the scrapeDeep tool, including the required `url` and the optional scraping options.

```typescript
parameters: {
  type: 'object',
  properties: {
    url: { type: 'string', description: 'URL of the webpage to scrape' },
    maxScrolls: { type: 'number', description: 'Maximum number of scroll attempts (default: 20)' },
    scrollDelay: { type: 'number', description: 'Delay between scrolls in ms (default: 3000)' },
    pages: { type: 'number', description: 'Number of pages to scrape (if pagination is present)' },
    scrapeImages: { type: 'boolean', description: 'Whether to include images in the scrape result' },
    downloadImages: { type: 'boolean', description: 'Whether to download images locally' },
    maxImages: { type: 'number', description: 'Maximum number of images to extract' },
    minImageSize: { type: 'number', description: 'Minimum width/height for images in pixels' },
    output: { type: 'string', description: 'Output directory for general results' },
    imageOutput: { type: 'string', description: 'Output directory for downloaded images' }
  },
  required: ['url']
},
```
- **src/config.ts:65-71** (registration) — Registration of the scrapeDeep tool in the main MCP server configuration's tools array.

```typescript
tools: [
  scrapeFocused,
  scrapeBalanced,
  scrapeDeep,
  // analyzeUrl,
  formatResult
],
```
- **src/tools/index.ts:8-14** (registration) — Export of the tool definitions, including scrapeDeep, from tools/index.ts.

```typescript
export const toolDefinitions: ToolDefinition[] = [
  scrapeFocused,
  scrapeBalanced,
  scrapeDeep,
  // analyzeUrl,
  formatResult,
];
```