Development Tools MCP Server

scrape_dynamic_content

Extract JavaScript-rendered content from web pages by simulating browser interaction, enabling developers to access dynamically loaded data for analysis or integration.

Instructions

Scrape JavaScript-rendered content using browser

Input Schema

Name             Required  Description               Default
url              Yes       URL to scrape             -
waitForSelector  No        CSS selector to wait for  -
waitForTimeout   No        Timeout in milliseconds   -
timeout          No        Page load timeout         30000
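
As a usage illustration, a call to this tool might pass arguments along these lines; the URL and selector below are placeholders rather than values taken from the server:

    // Hypothetical arguments for a scrape_dynamic_content call.
    const args = {
      url: 'https://example.com/products',  // required: page to scrape
      waitForSelector: '.product-card',     // wait until this selector appears
      waitForTimeout: 10000,                // give the selector up to 10 seconds
      timeout: 30000,                       // page-load timeout (schema default)
    };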

Implementation Reference

  • Core implementation of the scrape_dynamic_content tool handler, using Playwright to scrape JavaScript-rendered web content and extract the page title, text, HTML, links, images, and tables; a sketch of the inferred result shape follows this reference list.
    async scrapeDynamicContent(config: ScrapingConfig): Promise<ScrapedData> {
      const validation = Validators.validateScrapingConfig(config);
      if (!validation.valid) {
        throw new Error(`Invalid scraping config: ${validation.errors.join(', ')}`);
      }
    
      const browser = await this.getBrowser();
      const page = await browser.newPage();
    
      try {
        // Set headers if provided
        if (config.headers) {
          await page.setExtraHTTPHeaders(config.headers);
        }
    
        // Navigate to URL
        await page.goto(config.url, {
          waitUntil: 'networkidle',
          timeout: config.timeout || 30000,
        });
    
        // Wait for selector if specified
        if (config.waitForSelector) {
          await page.waitForSelector(config.waitForSelector, {
            timeout: config.waitForTimeout || 10000,
          });
        }
    
        // Wait for additional time if specified
        if (config.waitFor) {
          await page.waitForTimeout(parseInt(config.waitFor) || 1000);
        }
    
        // Extract content
        const title = await page.title();
        const text = await page.evaluate(() => {
          return document.body.innerText.replace(/\s+/g, ' ').trim();
        });
        const html = await page.content();
    
        // Extract links
        const links = await page.evaluate(() => {
          const linkElements = Array.from(document.querySelectorAll('a[href]'));
          return linkElements
            .map((el) => {
              try {
                return new URL((el as HTMLAnchorElement).href).href;
              } catch {
                return null;
              }
            })
            .filter((url): url is string => url !== null);
        });
    
        // Extract images
        const images = await page.evaluate(() => {
          const imgElements = Array.from(document.querySelectorAll('img[src]'));
          return imgElements
            .map((el) => {
              try {
                return new URL((el as HTMLImageElement).src).href;
              } catch {
                return null;
              }
            })
            .filter((url): url is string => url !== null);
        });
    
        // Extract tables
        const tables = await page.evaluate((): TableData[] => {
          const tableElements = Array.from(document.querySelectorAll('table'));
          return tableElements.map((table: Element) => {
            const tableData: TableData = {
              headers: [],
              rows: [],
            };
    
            // Extract caption
            const caption = table.querySelector('caption');
            if (caption) {
              tableData.caption = caption.textContent?.trim() || '';
            }
    
            // Extract headers
            const headerCells = table.querySelectorAll('thead th, thead td, tr:first-child th, tr:first-child td');
            headerCells.forEach((cell: Element) => {
              tableData.headers.push(cell.textContent?.trim() || '');
            });
    
            // Extract rows
            const rows = table.querySelectorAll('tbody tr, tr');
            rows.forEach((row: Element, index: number) => {
              // Skip first row if it's used as headers
              if (index === 0 && tableData.headers.length > 0) {
                return;
              }
              const rowData: string[] = [];
              row.querySelectorAll('td, th').forEach((cell: Element) => {
                rowData.push(cell.textContent?.trim() || '');
              });
              if (rowData.length > 0) {
                tableData.rows.push(rowData);
              }
            });
    
            return tableData;
          });
        });
    
        return {
          url: config.url,
          title,
          text,
          html,
          links: [...new Set(links)],
          images: [...new Set(images)],
          tables,
          scrapedAt: new Date(),
        };
      } finally {
        await page.close();
      }
    }
  • Registration of the 'scrape_dynamic_content' tool within the webScrapingTools array, including its name, description, and input schema.
    {
      name: 'scrape_dynamic_content',
      description: 'Scrape JavaScript-rendered content using browser',
      inputSchema: {
        type: 'object',
        properties: {
          url: {
            type: 'string',
            description: 'URL to scrape',
          },
          waitForSelector: {
            type: 'string',
            description: 'CSS selector to wait for',
          },
          waitForTimeout: {
            type: 'number',
            description: 'Timeout in milliseconds',
          },
          timeout: {
            type: 'number',
            description: 'Page load timeout',
            default: 30000,
          },
        },
        required: ['url'],
      },
    },
  • Dispatch handler in the handleWebScrapingTool function that calls the DynamicScraper instance's scrapeDynamicContent method and formats the output.
    case 'scrape_dynamic_content': {
      const data = await dynamicScraper.scrapeDynamicContent(config);
      return Formatters.formatScrapedData(data);
    }
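
The core implementation above returns a plain result object whose shape can be read off the return statement and the table-extraction code. Below is a minimal sketch of those types as inferred from the handler; the server's own ScrapedData and TableData definitions live elsewhere and may carry additional fields.

    // Sketch of the result types inferred from scrapeDynamicContent;
    // the actual ScrapedData/TableData interfaces may differ in detail.
    interface TableData {
      caption?: string;   // <caption> text, when present
      headers: string[];  // header cells from <thead> or the first row
      rows: string[][];   // remaining rows, one string per cell
    }

    interface ScrapedData {
      url: string;          // the requested URL
      title: string;        // document title
      text: string;         // whitespace-collapsed body text
      html: string;         // full rendered HTML
      links: string[];      // de-duplicated absolute link URLs
      images: string[];     // de-duplicated absolute image URLs
      tables: TableData[];  // extracted table data
      scrapedAt: Date;      // timestamp of the scrape
    }

On dispatch, the tool arguments are passed through as the ScrapingConfig, and the scraped result is run through Formatters.formatScrapedData before being returned to the caller.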

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/code-alchemist01/development-tools-mcp-Server'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.