scrapeBalanced

Extract web content efficiently with a balanced approach, including images and paginated data, while controlling parameters like scroll attempts, timeouts, and image size for precise results.

Instructions

Balanced web scraping approach with good coverage and reasonable speed

Input Schema

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| downloadImages | No | Whether to download images locally | false |
| imageOutput | No | Output directory for downloaded images | server default |
| maxImages | No | Maximum number of images to extract | 50 |
| maxScrolls | No | Maximum number of scroll attempts | 10 |
| minImageSize | No | Minimum width/height for images in pixels | 100 |
| output | No | Output directory for general results | server default |
| pages | No | Number of pages to scrape (if pagination is present) | 1 |
| scrapeImages | No | Whether to include images in the scrape result | false |
| scrollDelay | No | Delay between scrolls in ms | 2000 |
| timeout | No | Maximum time in ms for the scrape operation | 30000 |
| url | Yes | URL of the webpage to scrape | (none) |
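A typical call might pass arguments like the following. The values here are illustrative, not from the source; only `url` is required, and everything else falls back to the defaults above.

```typescript
// Hypothetical arguments for a scrapeBalanced call; only `url` is required.
const args = {
  url: "https://example.com/articles",
  maxScrolls: 5,      // stop after 5 scroll attempts instead of the default 10
  pages: 2,           // follow pagination for two pages
  scrapeImages: true, // include image URLs in the result
  maxImages: 10,      // cap extraction well below the default of 50
  timeout: 45000      // allow 45s instead of the default 30s
};
```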

Implementation Reference

  • The handler function executes balanced web scraping using the prysm library. It configures scraping options, implements timeout handling via Promise.race, limits content and images to prevent overwhelming the client, and handles errors by returning a structured error response.
    handler: async (params: ScraperBaseParams & { timeout?: number }): Promise<ScraperResponse> => {
      const { url, maxScrolls = 10, scrollDelay = 2000, pages = 1, scrapeImages = false, 
              downloadImages = false, maxImages = 50, minImageSize = 100, timeout = 30000, 
              output, imageOutput } = params;
      
      try {
        // Create options object for the scraper
        const options = {
          maxScrolls,
          scrollDelay,
          pages,
          focused: false,
          standard: true, // Use standard mode for balanced extraction
          deep: false,
          scrapeImages: scrapeImages || downloadImages,
          downloadImages,
          maxImages,
          minImageSize,
          timeout, // Add timeout option
          output: output || config.serverOptions.defaultOutputDir, // Use configured default if not provided
          imageOutput: imageOutput || config.serverOptions.defaultImageOutputDir // Use configured default if not provided
        };
        
        // Create a promise with timeout
        const scrapePromise = prysm.scrape(url, options);
        
        // Add timeout
        const timeoutPromise = new Promise<never>((_, reject) => {
          setTimeout(() => reject(new Error(`Scraping timed out after ${timeout}ms`)), timeout);
        });
        
        // Race the scraping against the timeout
        const result = await Promise.race([scrapePromise, timeoutPromise]) as ScraperResponse;
        
        // Limit content size to prevent overwhelming the MCP client
        if (result.content && result.content.length > 0) {
          // Limit the number of content sections
          if (result.content.length > 20) {
            result.content = result.content.slice(0, 20);
            result.content.push("(Content truncated due to size limitations)");
          }
          
          // Limit the size of each content section
          result.content = result.content.map(section => {
            if (section.length > 5000) {
              return section.substring(0, 5000) + "... (truncated)";
            }
            return section;
          });
        }
        
        // Limit the number of images to return
        if (result.images && result.images.length > 20) {
          result.images = result.images.slice(0, 20);
        }
        
        return result;
      } catch (error) {
        console.error(`Error scraping ${url}:`, error);
        // Return a proper error format for MCP
        return {
          title: "Scraping Error",
          content: [`Failed to scrape ${url}: ${error instanceof Error ? error.message : String(error)}`],
          images: [],
          metadata: { error: true },
          url: url,
          structureType: "error",
          paginationType: "none",
          extractionMethod: "none"
        };
      }
    }
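The timeout handling above is the standard Promise.race pattern. A self-contained sketch of the same idea follows; `withTimeout` is an illustrative name of mine, not part of the actual codebase.

```typescript
// Generic timeout wrapper mirroring the Promise.race pattern in the handler:
// whichever promise settles first wins, and the timer is cleaned up either way.
async function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeoutPromise = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms}ms`)), ms);
  });
  try {
    return await Promise.race([work, timeoutPromise]);
  } finally {
    clearTimeout(timer); // avoid a dangling timer keeping the process alive
  }
}
```

Note that `Promise.race` does not cancel the losing promise: if the timeout fires first, the underlying scrape may keep running in the background until it settles on its own.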
  • JSON Schema defining the input parameters for the scrapeBalanced tool, including required 'url' and optional parameters for scrolling, pagination, images, timeouts, and outputs.
    parameters: {
      type: 'object',
      properties: {
        url: {
          type: 'string',
          description: 'URL of the webpage to scrape'
        },
        maxScrolls: {
          type: 'number',
          description: 'Maximum number of scroll attempts (default: 10)'
        },
        scrollDelay: {
          type: 'number',
          description: 'Delay between scrolls in ms (default: 2000)'
        },
        pages: {
          type: 'number',
          description: 'Number of pages to scrape (if pagination is present)'
        },
        scrapeImages: {
          type: 'boolean',
          description: 'Whether to include images in the scrape result'
        },
        downloadImages: {
          type: 'boolean',
          description: 'Whether to download images locally'
        },
        maxImages: {
          type: 'number',
          description: 'Maximum number of images to extract'
        },
        minImageSize: {
          type: 'number',
          description: 'Minimum width/height for images in pixels'
        },
        timeout: {
          type: 'number',
          description: 'Maximum time in ms for the scrape operation (default: 30000)'
        },
        output: {
          type: 'string',
          description: 'Output directory for general results'
        },
        imageOutput: {
          type: 'string',
          description: 'Output directory for downloaded images'
        }
      },
      required: ['url']
    },
  • src/config.ts:65-71 (registration)
    Registration of the scrapeBalanced tool in the main server configuration array.
    tools: [
      scrapeFocused,
      scrapeBalanced, 
      scrapeDeep,
      // analyzeUrl,
      formatResult
    ],
  • Export of toolDefinitions array including scrapeBalanced for use in the MCP server.
    export const toolDefinitions: ToolDefinition[] = [
      scrapeFocused,
      scrapeBalanced,
      scrapeDeep,
      // analyzeUrl,
      formatResult,
    ]; 
  • Definition and export of the scrapeBalanced ToolDefinition object.
    export const scrapeBalanced: ToolDefinition = {
      name: 'scrapeBalanced',
Behavior: 2/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries the full burden of behavioral disclosure but provides minimal information. It mentions 'good coverage and reasonable speed' which hints at performance characteristics, but doesn't disclose important behavioral traits like whether it respects robots.txt, what authentication might be needed, rate limiting considerations, error handling, or what the output format looks like. For a scraping tool with 10 parameters, this is inadequate behavioral transparency.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is appropriately concise - a single sentence that gets straight to the point without unnecessary words. However, while it's structurally efficient, it's under-specified rather than truly concise. Every word earns its place, but there aren't enough words to be truly helpful. The front-loading is good but the content is insufficient.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 2/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

For a complex scraping tool with 10 parameters, no annotations, and no output schema, the description is incomplete. It doesn't explain what 'balanced' means operationally, what gets returned (structured data? HTML? images?), error conditions, or performance guarantees. The context signals indicate significant complexity that the description fails to address adequately.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

The description adds no parameter-specific information beyond what's already in the schema (which has 100% coverage). While the schema thoroughly documents all 10 parameters with clear descriptions, the tool description doesn't provide additional context about how parameters interact (e.g., relationship between downloadImages and scrapeImages) or usage patterns. With high schema coverage, the baseline is 3, but the description doesn't enhance parameter understanding.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
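One interaction the schema leaves implicit is visible in the handler: `scrapeImages: scrapeImages || downloadImages`, i.e. requesting downloads implicitly enables image scraping. A minimal sketch of that coupling (the function name is mine, for illustration only):

```typescript
// downloadImages implies scrapeImages, mirroring the handler's
// `scrapeImages: scrapeImages || downloadImages` line.
function resolveImageFlags(scrapeImages = false, downloadImages = false) {
  return {
    scrapeImages: scrapeImages || downloadImages,
    downloadImages,
  };
}
```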

Purpose: 2/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description 'Balanced web scraping approach with good coverage and reasonable speed' is vague and tautological - it restates the tool name 'scrapeBalanced' without specifying what it actually does. It doesn't clearly state what resource it operates on (web pages) or what specific scraping approach it implements. Compared to siblings like 'scrapeDeep' and 'scrapeFocused', it fails to distinguish itself meaningfully.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 2/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides no guidance on when to use this tool versus alternatives. With sibling tools like 'scrapeDeep' and 'scrapeFocused' available, there's no indication of what 'balanced' means in comparison - whether it's a middle ground between depth and speed, or some other trade-off. No explicit when/when-not instructions or alternative recommendations are provided.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
