Glama

get_webpage_content

Extract webpage content and convert it to Markdown, HTML, or plain text format for analysis and processing.

Instructions

Fetch webpage content and convert to specified format. Supports Markdown, HTML, and plain text.

Input Schema

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| url | Yes | The URL of the webpage to scrape. Must be a valid HTTP/HTTPS link. | |
| format | No | Output format: markdown (default), html, text | markdown |
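
As an illustration of the schema above, a call to this tool could be wrapped in an MCP `tools/call` request roughly as follows. This is a sketch under assumptions: the helper name `buildToolCall`, the request id, and the example URL are hypothetical and not part of the server's code; only the tool name, argument names, and allowed formats come from the schema.

```javascript
// Allowed values for `format`, as listed in the input schema.
const ALLOWED_FORMATS = ['markdown', 'html', 'text'];

// Hypothetical helper: builds a JSON-RPC 2.0 `tools/call` request for this tool.
// `format` is left out when not given, so the server applies its default.
function buildToolCall(id, url, format) {
  if (format !== undefined && !ALLOWED_FORMATS.includes(format)) {
    throw new Error('format must be one of: markdown, html, text');
  }
  const args = { url };
  if (format !== undefined) {
    args.format = format;
  }
  return {
    jsonrpc: '2.0',
    id,
    method: 'tools/call',
    params: { name: 'get_webpage_content', arguments: args },
  };
}

// Example values below are illustrative only.
console.log(JSON.stringify(buildToolCall(1, 'https://example.com/', 'text'), null, 2));
```

Omitting `format` rather than sending `undefined` keeps the payload minimal and leaves the default to the server.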

Implementation Reference

  • Main handler function executing the tool logic: validates input, fetches content via service, formats response.
    async function handleGetWebpageContent(args) {
      // `format` falls back to the schema default when omitted.
      const { url, format = 'markdown' } = args;

      if (!url || typeof url !== 'string') {
        throw new Error('URL parameter is required and must be a string');
      }

      // Syntax check only: `new URL()` accepts any scheme, not just http/https.
      try {
        new URL(url);
      } catch (error) {
        throw new Error('Invalid URL format');
      }

      if (!['markdown', 'html', 'text'].includes(format)) {
        throw new Error('format must be one of: markdown, html, text');
      }

      // The service module is imported dynamically at call time.
      const searchService = (await import('../services/searchService.js')).default;

      // 'html' and 'text' both go through scrapeWebpage.
      let result;
      if (format === 'markdown') {
        result = await searchService.getWebpageMarkdown(url);
      } else {
        result = await searchService.scrapeWebpage(url);
      }

      return {
        tool: 'get_webpage_content',
        url,
        format,
        title: result.title,
        description: result.description,
        content: format === 'markdown' ? result.markdown : result.content,
        timestamp: result.timestamp
      };
    }
  • Input schema and metadata definition for the get_webpage_content tool, used for registration and validation.
    {
      name: 'get_webpage_content',
      description: 'Fetch webpage content and convert to specified format. Supports Markdown, HTML, and plain text.',
      inputSchema: {
        type: 'object',
        properties: {
          url: {
            type: 'string',
            description: 'The URL of the webpage to scrape. Must be a valid HTTP/HTTPS link.'
          },
          format: {
            type: 'string',
            enum: ['markdown', 'html', 'text'],
            description: 'Output format: markdown (default), html, text',
            default: 'markdown'
          }
        },
        required: ['url']
      }
    },
  • Tool dispatch registration in the CallToolRequestHandler switch statement.
    case 'get_webpage_content':
      result = await handleGetWebpageContent(args);
      break;
  • Core helper that fetches a webpage and converts it to Markdown, the default format. Uses axios, cheerio, and turndown; scripts, styles, images, iframes, and navigation elements are removed before conversion.
    async getWebpageMarkdown(url) {
      try {
        const response = await axios.get(url, {
          headers: {
            'User-Agent': this.getRandomUserAgent(),
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.5',
            'Accept-Encoding': 'gzip, deflate',
            'Connection': 'keep-alive'
          },
          timeout: 15000
        });
    
        const $ = cheerio.load(response.data);
    
        // Extract page info
        const title = $('title').text().trim();
        const description = $('meta[name="description"]').attr('content') || '';
    
        // Clean HTML, remove unwanted elements
        $('script, style, noscript, iframe, img').remove();
        $('nav, header, footer, aside').remove();
    
        // Get main content area
        let mainContent = $('main, article, .content, .main, #content, #main');
        if (mainContent.length === 0) {
          mainContent = $('body');
        }
    
        // Convert to Markdown
        const TurndownService = (await import('turndown')).default;
        const turndownService = new TurndownService({
          headingStyle: 'atx',
          codeBlockStyle: 'fenced',
          emDelimiter: '*',
          bulletListMarker: '-'
        });
    
        // Custom rule for links
        turndownService.addRule('links', {
          filter: 'a',
          replacement: function(content, node) {
            const href = node.getAttribute('href');
            const text = content.trim();
            if (href && text) {
              return `[${text}](${href})`;
            }
            return content;
          }
        });
    
        const markdown = turndownService.turndown(mainContent.html());
    
        logger.info(`Webpage converted to Markdown successfully: ${url}`);
    
        return {
          url,
          title,
          description,
          markdown,
          // htmlSource: response.data,
          timestamp: new Date().toISOString()
        };
    
      } catch (error) {
        logger.error(`Markdown conversion error for ${url}:`, error);
        throw new Error(`Failed to convert webpage to Markdown: ${error.message}`);
      }
    }
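  • Core helper for basic webpage scraping, used for both the html and text formats. Uses axios and cheerio to extract the title, description, keywords, body text (whitespace-collapsed and truncated to 2,000 characters), and absolute links from the first 50 anchors. Because only plain text is returned, format 'html' yields the same content as 'text'.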
    async scrapeWebpage(url) {
      try {
        const response = await axios.get(url, {
          headers: {
            'User-Agent': this.getRandomUserAgent(),
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.5',
            'Accept-Encoding': 'gzip, deflate',
            'Connection': 'keep-alive'
          },
          timeout: 15000
        });
    
        const $ = cheerio.load(response.data);
    
        // Extract page info
        const title = $('title').text().trim();
        const description = $('meta[name="description"]').attr('content') || '';
        const keywords = $('meta[name="keywords"]').attr('content') || '';
    
        // Extract main content
        const content = $('body').text()
          .replace(/\s+/g, ' ')
          .trim()
          .substring(0, 2000); // limit content length
    
        // Extract links
        const links = [];
        $('a[href]').each((index, element) => {
          if (index < 50) { // limit number of links
            const href = $(element).attr('href');
            const text = $(element).text().trim();
            if (href && text && href.startsWith('http')) {
              links.push({ url: href, text });
            }
          }
        });
    
        logger.info(`Webpage scraped successfully: ${url}`);
    
        return {
          url,
          title,
          description,
          keywords,
          content,
          links,
          timestamp: new Date().toISOString()
        };
    
      } catch (error) {
        logger.error(`Webpage scraping error for ${url}:`, error);
        throw new Error(`Failed to scrape webpage: ${error.message}`);
      }
    }
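
The schema says `url` must be an HTTP/HTTPS link, but the handler's `new URL(url)` check only verifies syntax, so a value such as `ftp://...` or `file:///...` would pass validation. A minimal sketch of a stricter check might look like this; the function name `assertHttpUrl` and the third error message are hypothetical and not taken from the project.

```javascript
// Hypothetical helper, not from the project: rejects anything that is not
// an absolute http: or https: URL and returns the parsed URL object.
function assertHttpUrl(value) {
  if (!value || typeof value !== 'string') {
    throw new Error('URL parameter is required and must be a string');
  }
  let parsed;
  try {
    // WHATWG URL parser, available as a global in Node.js and browsers.
    parsed = new URL(value);
  } catch (error) {
    throw new Error('Invalid URL format');
  }
  // `protocol` includes the trailing colon, e.g. 'https:'.
  if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
    throw new Error('URL must use http or https');
  }
  return parsed;
}
```

Returning the parsed `URL` object instead of a boolean lets a caller reuse the normalized `href` without parsing the string a second time.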

Behavior: 2/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries the full burden of behavioral disclosure. It mentions fetching and converting content but lacks critical details such as authentication requirements, rate limits, error handling, or whether it performs web scraping with potential restrictions. This leaves significant gaps in understanding the tool's behavior.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is extremely concise with two sentences that directly state the tool's function and supported formats, with no wasted words. It's front-loaded and efficiently communicates the core purpose without unnecessary elaboration.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 2/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the lack of annotations and output schema, the description is incomplete for a web scraping tool. It doesn't address behavioral aspects like permissions, limitations, or response structure, which are crucial for proper usage. The high schema coverage doesn't compensate for these missing contextual elements.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
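
For reference, the response structure can be read off the handler in the Implementation Reference: it returns an object with the string fields `tool`, `url`, `format`, `title`, `description`, `content`, and `timestamp`. A consumer-side shape check could be sketched as below. The function name `isWebpageContentResult` and the example values are hypothetical, and how the server wraps this object into the final MCP response is not shown on this page.

```javascript
// Field names taken from the object returned by handleGetWebpageContent.
const RESULT_FIELDS = ['tool', 'url', 'format', 'title', 'description', 'content', 'timestamp'];

// Hypothetical consumer-side check, not part of the project.
function isWebpageContentResult(value) {
  if (value === null || typeof value !== 'object') {
    return false;
  }
  return (
    RESULT_FIELDS.every((field) => typeof value[field] === 'string') &&
    value.tool === 'get_webpage_content' &&
    ['markdown', 'html', 'text'].includes(value.format)
  );
}

// Illustrative values only; real titles and content depend on the page fetched.
const example = {
  tool: 'get_webpage_content',
  url: 'https://example.com/',
  format: 'markdown',
  title: 'Example Domain',
  description: '',
  content: '# Example Domain',
  timestamp: new Date().toISOString(),
};
console.log(isWebpageContentResult(example)); // true
```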

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 100%, so the schema already documents both parameters thoroughly. The description adds minimal value by listing the supported formats but doesn't provide additional semantic context beyond what's in the schema, such as examples or edge cases for the URL parameter.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 4/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose with specific verbs ('fetch', 'convert') and resources ('webpage content'), and identifies the supported output formats. However, it doesn't explicitly differentiate from sibling tools like 'get_webpage_source' or 'web_search', which prevents a perfect score.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 2/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides no guidance on when to use this tool versus alternatives like 'batch_webpage_scrape' or 'get_webpage_source'. It mentions supported formats but doesn't explain scenarios where one format might be preferred over another or when to choose this tool over siblings.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/yc9yc/spider-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.