web_content

Extract fully rendered web content from dynamic pages, SPAs, and infinite scroll sites by handling JavaScript execution and capturing dynamically loaded elements with configurable scrolling options.

Instructions

Fetch fully rendered DOM content using browserless.io. Handles AJAX/JavaScript dynamic loading. Optimized for SPAs and infinite scroll pages. Returns the complete rendered HTML after all JavaScript execution, including dynamically loaded content. Each scroll waits for page height changes and network activity to settle.
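The settle-after-each-scroll behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the server's actual code: the `ScrollablePage` interface, the `scrollAndSettle` function, and the `pollInterval` parameter are hypothetical names introduced here, and network-idle detection is left out for brevity.

```typescript
// Hypothetical minimal page abstraction; a real implementation would wrap a browser page.
interface ScrollablePage {
  scrollToBottom(): Promise<void>;
  getHeight(): Promise<number>;
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Scrolls `scrolls` times. After each scroll, polls the page height until two
// consecutive readings match or `scrollWaitTime` milliseconds have elapsed.
async function scrollAndSettle(
  page: ScrollablePage,
  scrolls: number,
  scrollWaitTime: number,
  pollInterval = 50, // assumed polling interval, not taken from the source
): Promise<number> {
  let height = await page.getHeight();
  for (let i = 0; i < scrolls; i++) {
    await page.scrollToBottom();
    const deadline = Date.now() + scrollWaitTime;
    let stable = false;
    while (!stable && Date.now() < deadline) {
      await sleep(pollInterval);
      const next = await page.getHeight();
      stable = next === height; // settled when the height stopped changing
      height = next;
    }
  }
  return height; // last observed page height
}
```

Comparing consecutive height readings is only one possible stability signal. Content that arrives later than one poll interval after a scroll would be missed by this sketch, which is presumably why the tool also waits for network activity to settle.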

Input Schema

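| Name            | Required | Description                                                            | Default |
|-----------------|----------|------------------------------------------------------------------------|---------|
| url             | Yes      | The URL to fetch                                                       |         |
| initialWaitTime | No       | Time to wait (in milliseconds) after loading the page before scrolling |         |
| scrollCount     | No       | Number of times to scroll down the page                                |         |
| scrollWaitTime  | No       | Time to wait (in milliseconds) between each scroll action              |         |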
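For illustration, a client could pass these parameters in an MCP `tools/call` request. The helper below is a hypothetical sketch (`buildWebContentCall` and both interfaces are names introduced here), and the numeric values are arbitrary examples rather than the server's defaults. The scroll-count parameter is omitted because the table lists it as `scrollCount` while the registration code under Implementation Reference names it `scrolls`.

```typescript
// Argument names taken from the schema table; all but `url` are optional.
interface WebContentArgs {
  url: string;
  initialWaitTime?: number; // milliseconds
  scrollWaitTime?: number; // milliseconds
}

// Shape of a JSON-RPC 2.0 request for the MCP `tools/call` method.
interface ToolCallRequest {
  jsonrpc: '2.0';
  id: number;
  method: 'tools/call';
  params: { name: string; arguments: WebContentArgs };
}

// Hypothetical helper: checks the URL and wraps the arguments in a request.
function buildWebContentCall(id: number, args: WebContentArgs): ToolCallRequest {
  new URL(args.url); // throws a TypeError if the URL is malformed
  return {
    jsonrpc: '2.0',
    id,
    method: 'tools/call',
    params: { name: 'web_content', arguments: args },
  };
}

const request = buildWebContentCall(1, {
  url: 'https://example.com/feed',
  initialWaitTime: 2000, // arbitrary example value
  scrollWaitTime: 1000, // arbitrary example value
});
console.log(JSON.stringify(request, null, 2));
```

Validating the URL on the client side is optional; the server-side handler only rejects a missing `url`.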

Implementation Reference

  • index.ts:82-100 (registration)
    Registers the 'web_content' MCP tool with detailed input/output schemas using Zod and points to the handler function.
    this.server.registerTool(
      'web_content',
      {
        title: 'Fetch Web Content',
        description: 'Fetch fully rendered DOM content using browserless.io. Handles AJAX/JavaScript dynamic loading. Optimized for SPAs and infinite scroll pages. Returns the complete rendered HTML after all JavaScript execution, including dynamically loaded content. Each scroll waits for page height changes and network activity to settle.',
        inputSchema: {
          url: z.string().describe('The URL to fetch'),
          initialWaitTime: z.number().optional().default(DEFAULTS.INITIAL_WAIT)
            .describe('Time to wait (in milliseconds) after loading the page before scrolling'),
          scrolls: z.number().optional().default(DEFAULTS.SCROLL_COUNT)
            .describe('Number of times to scroll down the page'),
          scrollWaitTime: z.number().optional().default(DEFAULTS.SCROLL_WAIT)
            .describe('Time to wait (in milliseconds) between each scroll action'),
          cleanup: z.boolean().optional().default(DEFAULTS.CLEANUP_HTML)
            .describe('Whether to clean up HTML (remove scripts, styles, SVG, forms, etc.) and keep only meaningful text content'),
        },
        outputSchema: {
          size: z.number().describe('Size of the content in bytes'),
          content: z.string().describe('The fetched HTML content'),
        },
      },
      async (args) => this.handleWebContentRequest(args)
    );
  • The direct handler function called by the tool registration, which fetches web content and returns it in MCP format with size.
    private async handleWebContentRequest(args: FetchWebContentArgs) {
      if (!args.url) {
        throw new McpError(ErrorCode.InvalidParams, 'URL is required');
      }

      try {
        const content = await this.fetchWebContent(args);
        const size = Buffer.byteLength(content, 'utf8');

        return {
          content: [{ type: 'text' as const, text: content }],
          structuredContent: {
            size,
            content,
          },
        };
      } catch (error) {
        const errorMessage = this.formatError(error);
        log.error('Tool Error:', errorMessage);
        throw new McpError(ErrorCode.InternalError, `Failed to fetch web content: ${errorMessage}`);
      }
    }
  • Core helper function implementing the fetching logic: it connects to browserless.io, creates a page, navigates to the URL, scrolls to load dynamic content, waits for the network and rendering to settle, extracts the HTML, and optionally cleans it.
    private async fetchWebContent(args: FetchWebContentArgs): Promise<string> {
      const {
        url,
        initialWaitTime = DEFAULTS.INITIAL_WAIT,
        scrolls = DEFAULTS.SCROLL_COUNT,
        scrollWaitTime = DEFAULTS.SCROLL_WAIT,
        cleanup = DEFAULTS.CLEANUP_HTML,
      } = args;

      log.info(`Fetching: ${url}, initialWait: ${initialWaitTime}ms, scrolls: ${scrolls}, scrollWait: ${scrollWaitTime}ms, cleanup: ${cleanup}`);

      let page: Page | null = null;

      try {
        await this.ensureBrowserConnection();
        page = await this.createPage();
        await this.navigateToUrl(page, url);
        await this.waitForInitialLoad(initialWaitTime);
        await this.performScrolling(page, scrolls, scrollWaitTime);
        await this.waitForNetworkAndRendering(page, scrolls, scrollWaitTime);

        const rawContent = await this.extractPageContent(page);
        const finalContent = cleanup ? this.cleanupHtml(rawContent) : rawContent;

        await this.closePage(page);
        log.info('Content fetched successfully');
        return finalContent;
      } catch (error) {
        await this.closePage(page);
        throw error;
      }
    }
  • TypeScript interface defining the input parameters for the web_content tool handler.
    interface FetchWebContentArgs {
      url: string;
      initialWaitTime?: number;
      scrolls?: number;
      scrollWaitTime?: number;
      cleanup?: boolean;
    }
  • Helper function to clean up extracted HTML by removing scripts, styles, unnecessary elements and attributes using Cheerio, preserving meaningful content.
    private cleanupHtml(html: string): string {
      log.info('Cleaning up HTML content');
      const $ = cheerio.load(html);

      // Remove <head> entirely
      $('head').remove();

      // Remove comments
      $('*').contents().each((_, elem) => {
        if (elem.type === 'comment') {
          $(elem).remove();
        }
      });

      // Remove script tags
      $('script').remove();

      // Remove style tags
      $('style').remove();

      // Remove noscript tags
      $('noscript').remove();

      // Remove frame, iframe, form, img, picture, and source elements
      $('frame').remove();
      $('iframe').remove();
      $('form').remove();
      $('img').remove();
      $('picture').remove();
      $('source').remove();

      // Remove inline styles
      $('[style]').removeAttr('style');

      // Remove class attributes
      $('[class]').removeAttr('class');

      // Remove rel attributes
      $('[rel]').removeAttr('rel');

      // Remove tabindex attributes
      $('[tabindex]').removeAttr('tabindex');

      // Remove SVG and path elements
      $('svg').remove();
      $('path').remove();
      $('circle').remove();
      $('rect').remove();
      $('polygon').remove();
      $('polyline').remove();
      $('line').remove();
      $('ellipse').remove();
      $('g').remove();
      $('defs').remove();
      $('clipPath').remove();
      $('mask').remove();

      // Remove data-*, aria-*, and on* attributes
      $('*').each((_, elem) => {
        const $elem = $(elem);
        if (elem.type === 'tag' && 'attribs' in elem) {
          const attrs = elem.attribs;
          Object.keys(attrs).forEach(attr => {
            if (attr.startsWith('data-') || attr.startsWith('aria-') || attr.startsWith('on')) {
              $elem.removeAttr(attr);
            }
          });
        }
      });

      const cleanedHtml = $.html();
      log.info(`Cleaned HTML: ${html.length} -> ${cleanedHtml.length} characters`);
      return cleanedHtml;
    }
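The attribute rules applied by `cleanupHtml` can be isolated as a small predicate, which makes their effect easy to check without a DOM. The sketch below is illustrative only: `shouldStripAttribute` and `filterAttributes` are hypothetical names that do not appear in the source, and the plain object stands in for the attribute map of a parsed element.

```typescript
// Attributes that cleanupHtml removes by exact name.
const STRIPPED_BY_NAME = new Set(['style', 'class', 'rel', 'tabindex']);

// Returns true for attributes that cleanupHtml would remove.
function shouldStripAttribute(name: string): boolean {
  return (
    STRIPPED_BY_NAME.has(name) ||
    name.startsWith('data-') ||
    name.startsWith('aria-') ||
    name.startsWith('on') // matches every attribute beginning with "on"
  );
}

// Returns a copy of the attribute map without the stripped attributes.
function filterAttributes(attribs: Record<string, string>): Record<string, string> {
  const kept: Record<string, string> = {};
  for (const [name, value] of Object.entries(attribs)) {
    if (!shouldStripAttribute(name)) {
      kept[name] = value;
    }
  }
  return kept;
}

// Only `href` survives the filter in this example.
console.log(filterAttributes({ href: '/a', class: 'btn', 'data-id': '7', onclick: 'go()' }));
```

Note that the `on` prefix check is broader than event handlers: any attribute whose name starts with `on` is dropped, whether or not it is an event handler.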

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/bakhtiyork/digest-mcp'
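The same request can be issued from TypeScript. This is a rough sketch under stated assumptions: `serverUrl` and `fetchServerInfo` are hypothetical helper names, the response is assumed to be JSON, and its schema is not shown on this page, so the result is typed as `unknown`.

```typescript
const API_BASE = 'https://glama.ai/api/mcp/v1/servers';

// Builds the URL that the curl command above requests.
function serverUrl(owner: string, name: string): string {
  return `${API_BASE}/${encodeURIComponent(owner)}/${encodeURIComponent(name)}`;
}

// Performs the GET request; assumes a JSON body (not confirmed by this page).
async function fetchServerInfo(owner: string, name: string): Promise<unknown> {
  const response = await fetch(serverUrl(owner, name));
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.json();
}
```

Calling `fetchServerInfo('bakhtiyork', 'digest-mcp')` would request the same resource as the curl example.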

If you have feedback or need assistance with the MCP directory API, please join our Discord server.