
scrape

Read-only

Extract webpage content for analysis by fetching clean markdown, HTML, or structured JSON. Handles JavaScript-rendered pages and bypasses anti-bot protection.

Instructions

Scrape any webpage and return its content using ZenRows.

Use this tool to fetch webpage content for analysis. By default it returns clean markdown, which is ideal for LLM processing.

When to enable options:

  • js_render: page uses React/Vue/Angular, loads content dynamically, or content appears missing on the first attempt

  • premium_proxy: site returns 403/blocked errors even with js_render enabled

  • wait_for: specific content loads after initial render (requires js_render)

  • css_extractor: you only need specific elements, not the whole page

  • autoparse: structured data pages like products or articles

Examples:
  Basic:     { url: "https://example.com" }
  Dynamic:   { url: "https://spa.com", js_render: true }
  Protected: { url: "https://protected.com", js_render: true, premium_proxy: true }
  Extract:   { url: "https://shop.com", css_extractor: '{"title":"h1","price":".price"}' }

Input Schema

  • url (required): The webpage URL to scrape
  • js_render: Enable JavaScript rendering via headless browser. Required for SPAs (React, Vue, Angular) and pages that load content dynamically.
  • premium_proxy: Use premium residential proxies to bypass anti-bot protection. Required for heavily protected sites. Implies higher credit cost.
  • proxy_country: Country for geo-targeted scraping. ISO 3166-1 alpha-2 code (e.g. 'US', 'GB', 'DE'). Requires premium_proxy=true.
  • response_type (default: markdown): Output format. 'markdown' preserves structure and is ideal for LLMs. 'plaintext' strips all formatting for pure text extraction. 'pdf' returns a PDF of the page. 'html' returns the raw HTML source (omits the response_type param; ZenRows default). Ignored when autoparse, css_extractor, outputs, or screenshot params are set.
  • autoparse: Automatically extract structured data from the page into JSON. Best for product pages, articles, and listings.
  • css_extractor: Extract specific elements using CSS selectors. JSON object mapping names to selectors, e.g. '{"title":"h1","price":".price-tag"}'. Returns JSON instead of full page content.
  • wait_for: CSS selector to wait for before capturing. Use when key content loads after the initial page render. Requires js_render=true.
  • wait: Milliseconds to wait after page load before capturing content. Max 30000 (30s). Requires js_render=true.
  • js_instructions: JSON array of browser interactions to run before scraping. Requires js_render=true. Example: [{"click":"#load-more"},{"wait":1000},{"wait_for":".results"}]
  • outputs: Comma-separated list of data types to extract as structured JSON. Available: emails, headings, links, menus, images, videos, audios. Use '*' for all types. Returns JSON instead of full page content.
  • screenshot: Capture an above-the-fold screenshot of the page. Returns an image instead of text content. Useful for visual verification or debugging.
  • screenshot_fullpage: Capture a full-page screenshot including content below the fold. Returns an image instead of text content.
  • screenshot_selector: Capture a screenshot of a specific element using a CSS selector. Example: ".product-card". Returns an image instead of text content.
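
As an illustration of how these parameters interact (the URL and selector values below are hypothetical, not from the source), a geo-targeted scrape of a JavaScript-heavy page would combine js_render, premium_proxy, proxy_country, and wait_for. A minimal sketch:

```typescript
// Sketch: assembling scrape parameters into a query string, mirroring the
// constraints listed above. All concrete values are hypothetical examples.
const params = new URLSearchParams({
  url: "https://spa.example.com", // hypothetical target
  js_render: "true",              // SPA: content is rendered client-side
  premium_proxy: "true",          // prerequisite for proxy_country
  proxy_country: "US",            // ISO 3166-1 alpha-2 code
  wait_for: ".results",           // capture only after this selector appears
});

console.log(params.toString());
```

Note that wait_for and proxy_country are only honored alongside their prerequisites (js_render=true and premium_proxy=true, respectively), per the schema descriptions.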

Implementation Reference

  • The implementation of the 'scrape' tool handler, which constructs API parameters, performs a fetch request to the ZenRows API, and processes the response (text or image).
    async (params) => {
      const searchParams = new URLSearchParams({
        apikey: apiKey,
        url: params.url,
      });
    
      if (
        params.js_render ||
        params.screenshot ||
        params.screenshot_fullpage ||
        params.screenshot_selector
      )
        searchParams.set("js_render", "true");
      if (params.premium_proxy) searchParams.set("premium_proxy", "true");
      if (params.proxy_country)
        searchParams.set("proxy_country", params.proxy_country.toUpperCase());
      if (params.autoparse) searchParams.set("autoparse", "true");
      if (params.css_extractor) searchParams.set("css_extractor", params.css_extractor);
      if (params.wait_for) searchParams.set("wait_for", params.wait_for);
      if (params.wait != null) searchParams.set("wait", String(params.wait));
      if (params.js_instructions) searchParams.set("js_instructions", params.js_instructions);
      if (params.outputs) searchParams.set("outputs", params.outputs);
      if (params.screenshot || params.screenshot_fullpage || params.screenshot_selector)
        searchParams.set("screenshot", "true");
      if (params.screenshot_fullpage) searchParams.set("screenshot_fullpage", "true");
      if (params.screenshot_selector)
        searchParams.set("screenshot_selector", params.screenshot_selector);
    
      // response_type is mutually exclusive with autoparse, css_extractor, outputs, and screenshot params.
      // 'html' is the ZenRows default (no param); all other values are passed through.
      const isScreenshot =
        params.screenshot || params.screenshot_fullpage || params.screenshot_selector;
      const effectiveType = params.response_type ?? "markdown";
      if (
        !params.autoparse &&
        !params.css_extractor &&
        !params.outputs &&
        !isScreenshot &&
        effectiveType !== "html"
      ) {
        searchParams.set("response_type", effectiveType);
      }
    
      let response: Response;
      try {
        response = await fetch(`${ZENROWS_API_URL}?${searchParams}`, {
          headers: { "User-Agent": `zenrows/mcp ${pkg.version}` },
        });
      } catch (err) {
        return {
          content: [
            {
              type: "text" as const,
              text: `Network error contacting ZenRows: ${err instanceof Error ? err.message : String(err)}`,
            },
          ],
          isError: true,
        };
      }
    
      if (!response.ok) {
        const body = await response.text();
        return {
          content: [{ type: "text" as const, text: `ZenRows error ${response.status}: ${body}` }],
          isError: true,
        };
      }
    
      const contentType = response.headers.get("content-type") ?? "";
      const buffer = await response.arrayBuffer();
      const bytes = new Uint8Array(buffer);
      const isPng =
        bytes[0] === 0x89 && bytes[1] === 0x50 && bytes[2] === 0x4e && bytes[3] === 0x47;
      const isJpeg = bytes[0] === 0xff && bytes[1] === 0xd8;
      if (contentType.startsWith("image/") || isPng || isJpeg) {
        const mimeType = isPng
          ? "image/png"
          : isJpeg
            ? "image/jpeg"
            : (contentType.split(";")[0].trim() as "image/png" | "image/jpeg");
        const base64 = Buffer.from(buffer).toString("base64");
        return {
          content: [{ type: "image" as const, data: base64, mimeType }],
        };
      }
    
      return {
        content: [{ type: "text" as const, text: new TextDecoder().decode(buffer) }],
      };
    }
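
The handler's magic-byte sniffing can be exercised in isolation. A standalone sketch of the same check (the `sniffImage` helper name is hypothetical; the byte signatures match the handler above):

```typescript
// Sketch: detect PNG vs JPEG from leading magic bytes, as the handler does
// before deciding whether to return an image content block.
function sniffImage(bytes: Uint8Array): "image/png" | "image/jpeg" | null {
  const isPng =
    bytes[0] === 0x89 && bytes[1] === 0x50 && bytes[2] === 0x4e && bytes[3] === 0x47;
  const isJpeg = bytes[0] === 0xff && bytes[1] === 0xd8;
  return isPng ? "image/png" : isJpeg ? "image/jpeg" : null;
}

console.log(sniffImage(new Uint8Array([0x89, 0x50, 0x4e, 0x47]))); // "image/png"
```

This byte-level check is what lets the handler return a correct image block even when the Content-Type header is missing or generic.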
  • src/server.ts:18-162 (registration)
    Registration of the 'scrape' tool with its input schema and configuration.
      server.registerTool(
        "scrape",
        {
          annotations: {
            title: "Scrape Webpage",
            readOnlyHint: true,
            destructiveHint: false,
          },
          description: `Scrape any webpage and return its content using ZenRows.
    
    Use this tool to fetch webpage content for analysis. By default it returns clean
    markdown, which is ideal for LLM processing.
    
    When to enable options:
    - js_render: page uses React/Vue/Angular, loads content dynamically, or content
      appears missing on the first attempt
    - premium_proxy: site returns 403/blocked errors even with js_render enabled
    - wait_for: specific content loads after initial render (requires js_render)
    - css_extractor: you only need specific elements, not the whole page
    - autoparse: structured data pages like products or articles
    
    Examples:
      Basic:    { url: "https://example.com" }
      Dynamic:  { url: "https://spa.com", js_render: true }
      Protected:{ url: "https://protected.com", js_render: true, premium_proxy: true }
      Extract:  { url: "https://shop.com", css_extractor: '{"title":"h1","price":".price"}' }`,
          inputSchema: {
            url: z.string().url().describe("The webpage URL to scrape"),
    
            js_render: z
              .boolean()
              .optional()
              .default(false)
              .describe(
                "Enable JavaScript rendering via headless browser. Required for SPAs " +
                  "(React, Vue, Angular) and pages that load content dynamically."
              ),
    
            premium_proxy: z
              .boolean()
              .optional()
              .default(false)
              .describe(
                "Use premium residential proxies to bypass anti-bot protection. " +
                  "Required for heavily protected sites. Implies higher credit cost."
              ),
    
            proxy_country: z
              .string()
              .optional()
              .describe(
                "Country for geo-targeted scraping. ISO 3166-1 alpha-2 code (e.g. 'US', 'GB', 'DE'). " +
                  "Requires premium_proxy=true."
              ),
    
            response_type: z
              .enum(["markdown", "plaintext", "pdf", "html"])
              .optional()
              .default("markdown")
              .describe(
                "Output format. 'markdown' (default) preserves structure and is ideal for LLMs. " +
                  "'plaintext' strips all formatting for pure text extraction. " +
                  "'pdf' returns a PDF of the page. " +
                  "'html' returns the raw HTML source (omits the response_type param; ZenRows default). " +
                  "Ignored when autoparse, css_extractor, outputs, or screenshot params are set."
              ),
    
            autoparse: z
              .boolean()
              .optional()
              .describe(
                "Automatically extract structured data from the page into JSON. " +
                  "Best for product pages, articles, and listings."
              ),
    
            css_extractor: z
              .string()
              .optional()
              .describe(
                "Extract specific elements using CSS selectors. " +
                  'JSON object mapping names to selectors, e.g. \'{"title":"h1","price":".price-tag"}\'. ' +
                  "Returns JSON instead of full page content."
              ),
    
            wait_for: z
              .string()
              .optional()
              .describe(
                "CSS selector to wait for before capturing. Use when key content loads " +
                  "after the initial page render. Requires js_render=true."
              ),
    
            wait: z
              .number()
              .int()
              .min(0)
              .max(30000)
              .optional()
              .describe(
                "Milliseconds to wait after page load before capturing content. " +
                  "Max 30000 (30s). Requires js_render=true."
              ),
    
            js_instructions: z
              .string()
              .optional()
              .describe(
                "JSON array of browser interactions to run before scraping. Requires js_render=true. " +
                  'Example: [{"click":"#load-more"},{"wait":1000},{"wait_for":".results"}]'
              ),
    
            outputs: z
              .string()
              .optional()
              .describe(
                "Comma-separated list of data types to extract as structured JSON. " +
                  "Available: emails, headings, links, menus, images, videos, audios. " +
                  "Use '*' for all types. Returns JSON instead of full page content."
              ),
    
            screenshot: z
              .boolean()
              .optional()
              .describe(
                "Capture an above-the-fold screenshot of the page. " +
                  "Returns an image instead of text content. Useful for visual verification or debugging."
              ),
    
            screenshot_fullpage: z
              .boolean()
              .optional()
              .describe(
                "Capture a full-page screenshot including content below the fold. " +
                  "Returns an image instead of text content."
              ),
    
            screenshot_selector: z
              .string()
              .optional()
              .describe(
                "Capture a screenshot of a specific element using a CSS selector. " +
                  'Example: ".product-card". Returns an image instead of text content.'
              ),
          },
        },
Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations already declare readOnlyHint=true and destructiveHint=false. Description adds valuable behavioral context: default markdown output ideal for LLMs, and crucially explains that certain parameters (css_extractor, autoparse, outputs, screenshot) change the return type from text to JSON or images. This output-switching behavior is not captured in annotations.
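
To make this output-switching concrete, a caller can branch on the type of each content block. A minimal sketch (the `Block` type and `summarize` helper are hypothetical, written to match the handler's return shapes above):

```typescript
// Sketch: handling the tool's variable return type. Content blocks are
// either text or base64-encoded images, per the handler implementation.
type Block =
  | { type: "text"; text: string }
  | { type: "image"; data: string; mimeType: string };

function summarize(content: Block[]): string {
  return content
    .map((b) =>
      b.type === "text" ? `text(${b.text.length} chars)` : `image(${b.mimeType})`
    )
    .join(", ");
}

console.log(summarize([{ type: "text", text: "# Title" }])); // "text(7 chars)"
```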

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Well-structured with clear information hierarchy: purpose statement, default behavior, conditional options guide, and examples. Every section earns its place. Examples section is slightly verbose but appropriate for a 14-parameter tool where syntax matters. Good use of formatting (bullet points, code blocks).

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

For a complex tool with 14 parameters and no output schema, description adequately explains return value variations (markdown default vs JSON vs images depending on params). Covers the ZenRows-specific options (premium_proxy credit cost mentioned in schema, wait_for interactions explained). Could mention error handling or rate limits, but sufficient for invocation.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema coverage is 100%, establishing baseline 3. Description adds significant value via the 'When to enable options' section which provides contextual semantics for when to use parameters (e.g., 'page uses React/Vue/Angular' triggers js_render). The concrete examples demonstrate parameter interactions and valid value formats (e.g., CSS selector JSON syntax).

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

Opens with specific verb+resource ('Scrape any webpage') and identifies the underlying service ('using ZenRows'). Clearly states default output format ('clean markdown') and primary use case ('fetch webpage content for analysis'). No siblings to differentiate from, but scope is precisely defined.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 5/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

Contains explicit 'When to enable options' section that maps specific technical conditions (React/Vue/Angular, 403 errors, delayed content loading) to parameter usage. Provides concrete decision trees for selecting js_render, premium_proxy, and other options. Includes practical JSON examples showing parameter combinations.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
