Digest MCP Server

by bakhtiyork

Fetch Web Content

web_content

Extract fully rendered HTML content from dynamic web pages, SPAs, and infinite scroll sites by executing JavaScript and handling AJAX loading.

Instructions

Fetch fully rendered DOM content using browserless.io. Handles AJAX/JavaScript dynamic loading. Optimized for SPAs and infinite scroll pages. Returns the complete rendered HTML after all JavaScript execution, including dynamically loaded content. Each scroll waits for page height changes and network activity to settle.

Input Schema

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| url | Yes | The URL to fetch | |
| initialWaitTime | No | Time to wait (in milliseconds) after loading the page before scrolling | |
| scrolls | No | Number of times to scroll down the page | |
| scrollWaitTime | No | Time to wait (in milliseconds) between each scroll action | |
| cleanup | No | Whether to clean up HTML (remove scripts, styles, SVG, forms, etc.) and keep only meaningful text content | |

Output Schema

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| size | Yes | Size of the content in bytes | |
| content | Yes | The fetched HTML content | |
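A call to this tool might pass arguments shaped like the following sketch. The values here are purely illustrative examples, not the server's actual defaults (which come from the `DEFAULTS` constants and are not shown on this page):

```typescript
// Illustrative arguments for the web_content tool.
// The numeric values are examples, not the server's real defaults.
interface FetchWebContentArgs {
  url: string;
  initialWaitTime?: number;
  scrolls?: number;
  scrollWaitTime?: number;
  cleanup?: boolean;
}

const args: FetchWebContentArgs = {
  url: 'https://example.com',
  initialWaitTime: 2000, // wait 2s for the initial render
  scrolls: 3,            // scroll three times to trigger lazy loading
  scrollWaitTime: 1000,  // wait 1s between scrolls
  cleanup: true,         // strip scripts/styles/SVG, keep text content
};
```

Only `url` is required; every other field falls back to a server-side default when omitted.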

Implementation Reference

  • index.ts:82-100 (registration)
    Registration of the 'web_content' tool including input and output schemas using Zod and the handler function.
    this.server.registerTool(
      'web_content',
      {
        title: 'Fetch Web Content',
        description: 'Fetch fully rendered DOM content using browserless.io. Handles AJAX/JavaScript dynamic loading. Optimized for SPAs and infinite scroll pages. Returns the complete rendered HTML after all JavaScript execution, including dynamically loaded content. Each scroll waits for page height changes and network activity to settle.',
        inputSchema: {
          url: z.string().describe('The URL to fetch'),
          initialWaitTime: z.number().optional().default(DEFAULTS.INITIAL_WAIT).describe('Time to wait (in milliseconds) after loading the page before scrolling'),
          scrolls: z.number().optional().default(DEFAULTS.SCROLL_COUNT).describe('Number of times to scroll down the page'),
          scrollWaitTime: z.number().optional().default(DEFAULTS.SCROLL_WAIT).describe('Time to wait (in milliseconds) between each scroll action'),
          cleanup: z.boolean().optional().default(DEFAULTS.CLEANUP_HTML).describe('Whether to clean up HTML (remove scripts, styles, SVG, forms, etc.) and keep only meaningful text content'),
        },
        outputSchema: {
          size: z.number().describe('Size of the content in bytes'),
          content: z.string().describe('The fetched HTML content'),
        },
      },
      async (args) => this.handleWebContentRequest(args)
    );
  • The main handler function registered for the 'web_content' tool. It validates input, fetches content via fetchWebContent, computes size, and returns formatted response.
    private async handleWebContentRequest(args: FetchWebContentArgs) {
      if (!args.url) {
        throw new McpError(ErrorCode.InvalidParams, 'URL is required');
      }
    
      try {
        const content = await this.fetchWebContent(args);
        const size = Buffer.byteLength(content, 'utf8');
        
        return {
          content: [{ type: 'text' as const, text: content }],
          structuredContent: {
            size,
            content,
          },
        };
      } catch (error) {
        const errorMessage = this.formatError(error);
        log.error('Tool Error:', errorMessage);
        throw new McpError(ErrorCode.InternalError, `Failed to fetch web content: ${errorMessage}`);
      }
    }
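Note that the `size` field reports UTF-8 bytes via `Buffer.byteLength`, not character count, so for non-ASCII content it will exceed the string length. A small standalone illustration:

```typescript
// `size` counts UTF-8 bytes, not characters: multi-byte characters
// make it larger than `content.length`.
const content = '<p>héllo</p>';
const size = Buffer.byteLength(content, 'utf8');
// 'é' encodes as two bytes in UTF-8, so size is content.length + 1
```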
  • TypeScript interface defining the input arguments for the web_content tool handler.
    interface FetchWebContentArgs {
      url: string;
      initialWaitTime?: number;
      scrolls?: number;
      scrollWaitTime?: number;
      cleanup?: boolean;
    }
  • Core implementation of web content fetching: connects to browserless.io, creates page, navigates, scrolls for dynamic content, waits for stability, extracts HTML, optionally cleans up.
    private async fetchWebContent(args: FetchWebContentArgs): Promise<string> {
      const {
        url,
        initialWaitTime = DEFAULTS.INITIAL_WAIT,
        scrolls = DEFAULTS.SCROLL_COUNT,
        scrollWaitTime = DEFAULTS.SCROLL_WAIT,
        cleanup = DEFAULTS.CLEANUP_HTML,
      } = args;
    
      log.info(`Fetching: ${url}, initialWait: ${initialWaitTime}ms, scrolls: ${scrolls}, scrollWait: ${scrollWaitTime}ms, cleanup: ${cleanup}`);
    
      let page: Page | null = null;
    
      try {
        await this.ensureBrowserConnection();
        page = await this.createPage();
        await this.navigateToUrl(page, url);
        await this.waitForInitialLoad(initialWaitTime);
        await this.performScrolling(page, scrolls, scrollWaitTime);
        await this.waitForNetworkAndRendering(page, scrolls, scrollWaitTime);
        
        const rawContent = await this.extractPageContent(page);
        const finalContent = cleanup ? this.cleanupHtml(rawContent) : rawContent;
        await this.closePage(page);
        
        log.info('Content fetched successfully');
        return finalContent;
      } catch (error) {
        await this.closePage(page);
        throw error;
      }
    }
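The `performScrolling` helper itself is not shown on this page. A minimal sketch of what such a loop might look like, with the scroll and wait steps injected so the control flow stands alone (the function name, signature, and structure are assumptions, not the actual implementation):

```typescript
// Hypothetical sketch of a scroll loop like performScrolling.
// scrollOnce and wait are injected; in the real tool these would
// drive the browserless.io page instead.
async function scrollLoop(
  scrolls: number,
  scrollWaitTime: number,
  scrollOnce: () => Promise<void>,
  wait: (ms: number) => Promise<void>,
): Promise<number> {
  let completed = 0;
  for (let i = 0; i < scrolls; i++) {
    await scrollOnce();          // e.g. scroll to the bottom of the page
    await wait(scrollWaitTime);  // let lazy content load before the next scroll
    completed += 1;
  }
  return completed;
}
```

Injecting the browser-dependent steps keeps the retry/wait logic visible and unit-testable without a live browser connection.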
  • Helper function to clean up extracted HTML: removes head, scripts, styles, forms, images, SVGs, unnecessary attributes to leave meaningful content.
    private cleanupHtml(html: string): string {
      log.info('Cleaning up HTML content');
      
      const $ = cheerio.load(html);
    
      // Remove <head> entirely
      $('head').remove();
    
      // Remove comments
      $('*').contents().each((_, elem) => {
        if (elem.type === 'comment') {
          $(elem).remove();
        }
      });
    
      // Remove script, style, noscript, frame, iframe, form, img, picture, and source elements
      $('script, style, noscript, frame, iframe, form, img, picture, source').remove();
    
      // Remove inline style, class, rel, and tabindex attributes
      $('[style]').removeAttr('style');
      $('[class]').removeAttr('class');
      $('[rel]').removeAttr('rel');
      $('[tabindex]').removeAttr('tabindex');
    
      // Remove SVG and related elements
      $('svg, path, circle, rect, polygon, polyline, line, ellipse, g, defs, clipPath, mask').remove();
    
      // Remove data-*, aria-*, and on* attributes
      $('*').each((_, elem) => {
        const $elem = $(elem);
        if (elem.type === 'tag' && 'attribs' in elem) {
          const attrs = elem.attribs;
          Object.keys(attrs).forEach(attr => {
            if (attr.startsWith('data-') || attr.startsWith('aria-') || attr.startsWith('on')) {
              $elem.removeAttr(attr);
            }
          });
        }
      });
    
      const cleanedHtml = $.html();
      log.info(`Cleaned HTML: ${html.length} -> ${cleanedHtml.length} characters`);
      
      return cleanedHtml;
    }
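The `data-*`/`aria-*`/`on*` attribute check above reduces to a simple prefix test. Shown standalone:

```typescript
// Standalone version of the attribute filter used in cleanupHtml:
// an attribute is dropped when its name starts with data-, aria-, or on.
const STRIP_PREFIXES = ['data-', 'aria-', 'on'] as const;

function shouldStripAttr(name: string): boolean {
  return STRIP_PREFIXES.some((p) => name.startsWith(p));
}
```

Note that the bare `on` prefix matches any attribute name beginning with those two letters, not only event handlers like `onclick`; this mirrors the original code's behavior.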
Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries the full burden of behavioral disclosure. It effectively describes key behavioral traits: it handles dynamic loading, waits for JavaScript execution, includes scrolling behavior with wait times, and mentions cleanup options. However, it lacks details on error handling, rate limits, or authentication needs, which would be beneficial for a tool with external dependencies.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is appropriately sized and front-loaded, starting with the core purpose and progressively detailing capabilities and behavior. Each sentence adds value without redundancy, making it efficient and easy to understand quickly.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's complexity (external service integration, dynamic content handling) and rich schema (100% coverage, output schema exists), the description is mostly complete. It covers the tool's purpose, key behaviors, and context, but could improve by addressing potential limitations or error scenarios. The existence of an output schema means return values need not be explained.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

The schema description coverage is 100%, so the schema already documents all parameters thoroughly. The description adds some context by mentioning scrolling and cleanup in general terms, but it does not provide additional semantic meaning beyond what the schema specifies (e.g., explaining why certain defaults are chosen or how parameters interact).

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose with specific verbs ('Fetch fully rendered DOM content') and resource ('using browserless.io'), and distinguishes its capabilities from basic web scraping by mentioning AJAX/JavaScript handling, SPAs, infinite scroll pages, and complete rendered HTML after JavaScript execution. It provides a comprehensive overview of what the tool does beyond just fetching content.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 3/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description implies usage scenarios (e.g., for SPAs, infinite scroll pages, dynamic content) but does not explicitly state when to use this tool versus alternatives or provide any exclusions. With no sibling tools mentioned, the lack of explicit guidance is less critical, but it still relies on implication rather than clear directives.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
