web_content
Extract fully rendered HTML content from dynamic web pages, SPAs, and infinite scroll sites by executing JavaScript and handling AJAX loading.
Instructions
Fetch fully rendered DOM content using browserless.io. Handles AJAX/JavaScript dynamic loading. Optimized for SPAs and infinite scroll pages. Returns the complete rendered HTML after all JavaScript execution, including dynamically loaded content. Each scroll waits for page height changes and network activity to settle.
Input Schema
TableJSON Schema
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | The URL to fetch | |
| initialWaitTime | No | Time to wait (in milliseconds) after loading the page before scrolling | |
| scrolls | No | Number of times to scroll down the page | |
| scrollWaitTime | No | Time to wait (in milliseconds) between each scroll action | |
| cleanup | No | Whether to clean up HTML (remove scripts, styles, SVG, forms, etc.) and keep only meaningful text content |
Implementation Reference
- index.ts:82-100 (registration)Registration of the 'web_content' tool including input and output schemas using Zod and the handler function.this.server.registerTool( 'web_content', { title: 'Fetch Web Content', description: 'Fetch fully rendered DOM content using browserless.io. Handles AJAX/JavaScript dynamic loading. Optimized for SPAs and infinite scroll pages. Returns the complete rendered HTML after all JavaScript execution, including dynamically loaded content. Each scroll waits for page height changes and network activity to settle.', inputSchema: { url: z.string().describe('The URL to fetch'), initialWaitTime: z.number().optional().default(DEFAULTS.INITIAL_WAIT).describe('Time to wait (in milliseconds) after loading the page before scrolling'), scrolls: z.number().optional().default(DEFAULTS.SCROLL_COUNT).describe('Number of times to scroll down the page'), scrollWaitTime: z.number().optional().default(DEFAULTS.SCROLL_WAIT).describe('Time to wait (in milliseconds) between each scroll action'), cleanup: z.boolean().optional().default(DEFAULTS.CLEANUP_HTML).describe('Whether to clean up HTML (remove scripts, styles, SVG, forms, etc.) and keep only meaningful text content'), }, outputSchema: { size: z.number().describe('Size of the content in bytes'), content: z.string().describe('The fetched HTML content'), }, }, async (args) => this.handleWebContentRequest(args) );
- index.ts:103-124 (handler)The main handler function registered for the 'web_content' tool. It validates input, fetches content via fetchWebContent, computes size, and returns formatted response.private async handleWebContentRequest(args: FetchWebContentArgs) { if (!args.url) { throw new McpError(ErrorCode.InvalidParams, 'URL is required'); } try { const content = await this.fetchWebContent(args); const size = Buffer.byteLength(content, 'utf8'); return { content: [{ type: 'text' as const, text: content }], structuredContent: { size, content, }, }; } catch (error) { const errorMessage = this.formatError(error); log.error('Tool Error:', errorMessage); throw new McpError(ErrorCode.InternalError, `Failed to fetch web content: ${errorMessage}`); } }
- index.ts:48-54 (schema)TypeScript interface defining the input arguments for the web_content tool handler.interface FetchWebContentArgs { url: string; initialWaitTime?: number; scrolls?: number; scrollWaitTime?: number; cleanup?: boolean; }
- index.ts:136-167 (handler)Core implementation of web content fetching: connects to browserless.io, creates page, navigates, scrolls for dynamic content, waits for stability, extracts HTML, optionally cleans up.private async fetchWebContent(args: FetchWebContentArgs): Promise<string> { const { url, initialWaitTime = DEFAULTS.INITIAL_WAIT, scrolls = DEFAULTS.SCROLL_COUNT, scrollWaitTime = DEFAULTS.SCROLL_WAIT, cleanup = DEFAULTS.CLEANUP_HTML, } = args; log.info(`Fetching: ${url}, initialWait: ${initialWaitTime}ms, scrolls: ${scrolls}, scrollWait: ${scrollWaitTime}ms, cleanup: ${cleanup}`); let page: Page | null = null; try { await this.ensureBrowserConnection(); page = await this.createPage(); await this.navigateToUrl(page, url); await this.waitForInitialLoad(initialWaitTime); await this.performScrolling(page, scrolls, scrollWaitTime); await this.waitForNetworkAndRendering(page, scrolls, scrollWaitTime); const rawContent = await this.extractPageContent(page); const finalContent = cleanup ? this.cleanupHtml(rawContent) : rawContent; await this.closePage(page); log.info('Content fetched successfully'); return finalContent; } catch (error) { await this.closePage(page); throw error; } }
- index.ts:386-461 (helper)Helper function to clean up extracted HTML: removes head, scripts, styles, forms, images, SVGs, unnecessary attributes to leave meaningful content.private cleanupHtml(html: string): string { log.info('Cleaning up HTML content'); const $ = cheerio.load(html); // Remove <head> entirely $('head').remove(); // Remove comments $('*').contents().each((_, elem) => { if (elem.type === 'comment') { $(elem).remove(); } }); // Remove script tags $('script').remove(); // Remove style tags $('style').remove(); // Remove noscript tags $('noscript').remove(); // Remove frame, iframe, form, img, picture, and source elements $('frame').remove(); $('iframe').remove(); $('form').remove(); $('img').remove(); $('picture').remove(); $('source').remove(); // Remove inline styles $('[style]').removeAttr('style'); // Remove class attributes $('[class]').removeAttr('class'); // Remove rel attributes $('[rel]').removeAttr('rel'); // Remove tabindex attributes $('[tabindex]').removeAttr('tabindex'); // Remove SVG and path elements $('svg').remove(); $('path').remove(); $('circle').remove(); $('rect').remove(); $('polygon').remove(); $('polyline').remove(); $('line').remove(); $('ellipse').remove(); $('g').remove(); $('defs').remove(); $('clipPath').remove(); $('mask').remove(); // Remove data-*, aria-*, and on* attributes $('*').each((_, elem) => { const $elem = $(elem); if (elem.type === 'tag' && 'attribs' in elem) { const attrs = elem.attribs; Object.keys(attrs).forEach(attr => { if (attr.startsWith('data-') || attr.startsWith('aria-') || attr.startsWith('on')) { $elem.removeAttr(attr); } }); } }); const cleanedHtml = $.html(); log.info(`Cleaned HTML: ${html.length} -> ${cleanedHtml.length} characters`); return cleanedHtml; }