
fetch_webpage

Extract text content from web pages using customizable parameters such as URL, resource blocking, and character limits for efficient data retrieval.

Instructions

Retrieve text content from a web page

Input Schema

blockResources (optional): Whether to block images, stylesheets, and fonts to improve performance (default: true)
headers (optional): Custom headers to include in the request
maxLength (optional): Maximum number of characters to return for content extraction (default: 2000)
password (optional): Password for basic authentication
resourceTypesToBlock (optional): List of resource types to block (e.g., "image", "stylesheet", "font")
startIndex (optional): Start character index for content extraction (default: 0)
timeout (optional): Navigation timeout in milliseconds (default: 60000)
url (required): The URL of the webpage to fetch
username (optional): Username for basic authentication
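For illustration, here is how these parameters combine across two paginated calls. The URL and values below are made up, and the response tells the caller the returned `content.length`, which is what the next `startIndex` should be advanced by:

```typescript
// Hypothetical first call: fetch the first 2000 characters of a page.
const firstCall = {
  url: "https://example.com/docs",
  blockResources: true,
  maxLength: 2000,
  startIndex: 0,
  timeout: 60000,
};

// Suppose the first response reported content.length === 2000. To continue
// where that slice ended, advance startIndex by the returned length.
const returnedLength = 2000;
const secondCall = { ...firstCall, startIndex: firstCall.startIndex + returnedLength };
```

Repeating this until the response reports `remainingCharacters: 0` walks through the whole cleaned page.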

Implementation Reference

  • Core handler implementation using Puppeteer: launches a headless browser, sets headers and basic-auth credentials, blocks resources if requested, navigates to the URL, extracts the full HTML, strips whitespace, slices the content by startIndex/maxLength, and optionally paginates via nextPageSelector.
    private async fetchWebpage(url: string, options: {
      blockResources: boolean;
      resourceTypesToBlock?: string[];
      timeout: number;
      maxLength: number;
      startIndex: number;
      headers?: Record<string, string>;
      username?: string;
      password?: string;
      nextPageSelector?: string;
      maxPages: number;
    }) {
      const browser = await puppeteer.launch({
        headless: true,
        args: ['--incognito', '--no-sandbox', '--disable-setuid-sandbox', '--disable-dev-shm-usage'],
      });
      let page;
      try {
        page = await browser.newPage();
        await page.setDefaultNavigationTimeout(options.timeout);
        if (options.headers) {
          await page.setExtraHTTPHeaders(options.headers);
        }
        if (options.username && options.password) {
          await page.authenticate({ username: options.username, password: options.password });
        }
        if (options.blockResources || options.resourceTypesToBlock) {
          await page.setRequestInterception(true);
          // When blockResources is set without an explicit list, block the
          // documented defaults (images, stylesheets, fonts).
          const typesToBlock = options.resourceTypesToBlock ??
            (options.blockResources ? ['image', 'stylesheet', 'font'] : []);
          page.on('request', req => {
            if (typesToBlock.includes(req.resourceType())) {
              req.abort();
            } else {
              req.continue();
            }
          });
        }
        const results: any[] = [];
        let currentUrl = url;
        let currentPageNum = 1;
        while (currentPageNum <= options.maxPages) {
          console.error(`Fetching content from: ${currentUrl} (page ${currentPageNum})`);
          await page.goto(currentUrl, { waitUntil: 'domcontentloaded', timeout: options.timeout });
          const title = await page.title();
          // 1. Get the full HTML of the page.
          const html = await page.content();
          // 2. Remove all whitespace from the raw HTML.
          const cleanedHtml = html.replace(/\s/g, '');
          // 3. Slice the cleaned HTML by startIndex/maxLength.
          const totalLength = cleanedHtml.length;
          const slicedContent = cleanedHtml.substring(options.startIndex, options.startIndex + options.maxLength);
          results.push({
            url: currentUrl,
            title,
            content: slicedContent,
            contentLength: totalLength,
            fetchedAt: new Date().toISOString(),
          });
          if (!options.nextPageSelector || currentPageNum >= options.maxPages) {
            break;
          }
          // Pagination: follow the next-page link, or click non-anchor elements.
          const nextHref = await page.evaluate((sel) => {
            const el = document.querySelector(sel) as HTMLAnchorElement | HTMLElement;
            if (el) {
              if ('href' in el && (el as HTMLAnchorElement).href) {
                return (el as HTMLAnchorElement).href;
              }
              el.click(); // Fallback for non-anchor elements
            }
            return null;
          }, options.nextPageSelector);
          if (nextHref && nextHref !== currentUrl) {
            currentUrl = nextHref;
            currentPageNum++;
          } else {
            break; // No more pages
          }
        }
        return results.length === 1 ? results[0] : results;
      } finally {
        // Always close the browser to prevent temp-file leaks.
        try { await browser.close(); } catch (e) {}
      }
    }
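    Outside the browser context, the strip-and-slice step above reduces to a small pure function. stripAndSlice is an illustrative name, not part of the server; note that `\s` removal also deletes spaces between words, which is the handler's documented behavior:

    ```typescript
    // Mirror of the handler's extraction step: remove all whitespace from the
    // raw HTML, then slice by startIndex/maxLength.
    function stripAndSlice(html: string, startIndex: number, maxLength: number) {
      const cleaned = html.replace(/\s/g, ""); // removes spaces between words too
      return {
        content: cleaned.substring(startIndex, startIndex + maxLength),
        contentLength: cleaned.length,
      };
    }

    const pageHtml = "<html>\n  <body>\n    <p>Hello world</p>\n  </body>\n</html>";
    const slice = stripAndSlice(pageHtml, 0, 12);
    // slice.content === "<html><body>", slice.contentLength === 43
    ```

    contentLength always reports the full cleaned length, so callers can tell whether another slice is worth fetching.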
  • src/index.ts:104-162 (registration)
    Tool registration in the ListTools handler: defines name, description, and detailed inputSchema with parameters like url, startIndex (required), maxLength, headers, pagination options.
    {
      name: 'fetch_webpage',
      description: 'Retrieve text content from a web page',
      inputSchema: {
        type: 'object',
        properties: {
          url: { type: 'string', description: 'The URL of the webpage to fetch' },
          blockResources: { type: 'boolean', description: 'Whether to block images, stylesheets, and fonts to improve performance (default: true)' },
          resourceTypesToBlock: { type: 'array', items: { type: 'string' }, description: 'List of resource types to block (e.g., "image", "stylesheet", "font")' },
          timeout: { type: 'number', description: 'Navigation timeout in milliseconds (default: 120000)' },
          maxLength: { type: 'number', description: 'Maximum number of characters to return (default: 10000).' },
          startIndex: { type: 'number', description: 'Start character index for content extraction (required; default: 0).' },
          headers: { type: 'object', description: 'Custom headers to include in the request' },
          username: { type: 'string', description: 'Username for basic authentication' },
          password: { type: 'string', description: 'Password for basic authentication' },
          nextPageSelector: { type: 'string', description: 'CSS selector for next page button/link (for auto-pagination, optional)' },
          maxPages: { type: 'number', description: 'Maximum number of pages to crawl (for auto-pagination, optional, default: 1)' },
        },
        required: ['url', 'startIndex'],
        additionalProperties: false,
        description: 'Fetch web content (text, html, mainContent, metadata, supports multi-page crawling, and AI-friendly regex extraction). Debug option for verbose output/logging.',
      },
    },
  • MCP CallTool handler for fetch_webpage: validates arguments, handles parameter aliases/defaults, calls the core fetchWebpage method, processes results with pagination instructions and remaining characters info, handles errors.
    if (toolName === 'fetch_webpage') {
      if (!isValidFetchWebpageArgs(args)) {
        throw new McpError(ErrorCode.InvalidParams, 'Invalid fetch_webpage arguments');
      }
      // Map legacy/alias params to current names: 'index' -> startIndex.
      if (typeof (args as any).index === 'number' && typeof (args as any).startIndex !== 'number') {
        (args as any).startIndex = (args as any).index;
      }
      // Enforce required parameters (runtime validation).
      if (typeof (args as any).startIndex !== 'number') {
        throw new McpError(ErrorCode.InvalidParams, "fetch_webpage: required parameter 'startIndex' (or alias 'index') is missing or not a number");
      }
      if (typeof (args as any).maxLength !== 'number') {
        throw new McpError(ErrorCode.InvalidParams, "fetch_webpage: required parameter 'maxLength' is missing or not a number");
      }
      const validatedArgs = args as FetchWebpageArgs & { nextPageSelector?: string; maxPages?: number };
      // Clamp the navigation timeout to at most 5 minutes.
      const timeout = Math.min(validatedArgs.timeout || 60000, 300000);

      // Perform the fetch and shape the response for the client.
      const performFetchAndProcess = async (currentArgs: typeof validatedArgs) => {
        const result: any = await this.fetchWebpage(currentArgs.url, {
          blockResources: currentArgs.blockResources ?? false,
          resourceTypesToBlock: currentArgs.resourceTypesToBlock,
          timeout, // use the clamped timeout computed above
          maxLength: currentArgs.maxLength ?? 4000,
          startIndex: currentArgs.startIndex ?? 0,
          headers: currentArgs.headers,
          username: currentArgs.username,
          password: currentArgs.password,
          nextPageSelector: currentArgs.nextPageSelector,
          maxPages: currentArgs.maxPages ?? 1,
        });
        // Total characters of the whitespace-removed HTML vs. the slice returned.
        const totalCharacters = result.contentLength;
        const slicedContentLength = result.content.length;
        // The next start index is right after the current sliced content.
        const nextStartIndex = currentArgs.startIndex + slicedContentLength;
        const remainingCharacters = Math.max(0, totalCharacters - nextStartIndex);
        const instruction = remainingCharacters === 0
          ? 'All content has been fetched.'
          : `To fetch more content, call fetch_webpage again with startIndex=${nextStartIndex}. If you have enough information, you can stop.`;
        const finalResponse = {
          url: result.url,
          title: result.title,
          content: result.content, // the sliced, whitespace-removed HTML
          fetchedAt: result.fetchedAt,
          startIndex: currentArgs.startIndex,
          maxLength: currentArgs.maxLength,
          remainingCharacters,
          instruction,
        };
        return {
          content: [{ type: 'text', text: JSON.stringify(finalResponse, null, 2) }],
        };
      };

      try {
        return await performFetchAndProcess(validatedArgs);
      } catch (error: any) {
        // blockResources now defaults to false, so no automatic retry with
        // resource blocking disabled is attempted here.
        console.error('Error fetching webpage:', error);
        return {
          content: [{ type: 'text', text: `Error fetching webpage: ${error.message}` }],
          isError: true,
        };
      }
    }
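    The pagination bookkeeping in this handler reduces to a little arithmetic. paginationInfo below is an illustrative helper, not part of the server:

    ```typescript
    // Given the total cleaned length and the slice just returned, compute how
    // much is left and where the next call should start.
    function paginationInfo(totalCharacters: number, startIndex: number, slicedLength: number) {
      const nextStartIndex = startIndex + slicedLength;
      const remainingCharacters = Math.max(0, totalCharacters - nextStartIndex);
      return { nextStartIndex, remainingCharacters };
    }

    // A 5000-character page fetched with startIndex=0 and a 2000-character slice:
    const info = paginationInfo(5000, 0, 2000);
    // info.nextStartIndex === 2000, info.remainingCharacters === 3000
    ```

    When `remainingCharacters` reaches 0, the instruction field switches to "All content has been fetched."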
  • TypeScript interface defining the structure of fetch_webpage tool arguments.
    // Define the interface for the fetch_webpage tool arguments.
    interface FetchWebpageArgs {
      url: string;
      blockResources?: boolean;
      resourceTypesToBlock?: string[];
      timeout?: number;
      maxLength?: number;
      startIndex?: number;
      nextPageSelector?: string;
      maxPages?: number;
      headers?: Record<string, string>;
      username?: string;
      password?: string;
    }
  • Type guard function to validate fetch_webpage arguments against the FetchWebpageArgs interface.
    // Validate the arguments for the fetch_webpage tool.
    const isValidFetchWebpageArgs = (args: any): args is FetchWebpageArgs =>
      typeof args === 'object' &&
      args !== null &&
      typeof args.url === 'string' &&
      (args.blockResources === undefined || typeof args.blockResources === 'boolean') &&
      (args.resourceTypesToBlock === undefined ||
        (Array.isArray(args.resourceTypesToBlock) &&
          args.resourceTypesToBlock.every((item: any) => typeof item === 'string'))) &&
      (args.timeout === undefined || typeof args.timeout === 'number') &&
      (args.maxLength === undefined || typeof args.maxLength === 'number') &&
      (args.startIndex === undefined || typeof args.startIndex === 'number') &&
      (args.nextPageSelector === undefined || typeof args.nextPageSelector === 'string') &&
      (args.maxPages === undefined || typeof args.maxPages === 'number') &&
      (args.headers === undefined ||
        (typeof args.headers === 'object' && args.headers !== null && !Array.isArray(args.headers))) &&
      (args.username === undefined || typeof args.username === 'string') &&
      (args.password === undefined || typeof args.password === 'string');
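    For illustration, here is how this style of guard behaves on a few inputs; the version below is abridged to the url/maxLength/startIndex checks so it stays self-contained:

    ```typescript
    interface FetchWebpageArgs { url: string; maxLength?: number; startIndex?: number; }

    // Abridged guard: same shape-checking pattern as the full version above.
    const isValidFetchWebpageArgs = (args: any): args is FetchWebpageArgs =>
      typeof args === "object" && args !== null &&
      typeof args.url === "string" &&
      (args.maxLength === undefined || typeof args.maxLength === "number") &&
      (args.startIndex === undefined || typeof args.startIndex === "number");

    const ok = isValidFetchWebpageArgs({ url: "https://example.com", startIndex: 0 });
    const bad = isValidFetchWebpageArgs({ url: 42 });  // url must be a string
    const alsoBad = isValidFetchWebpageArgs(null);     // typeof null is "object", hence the null check
    // ok === true, bad === false, alsoBad === false
    ```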


MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/rayss868/MCP-Web-Curl'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.