extract_data
Extract structured data like products, articles, or prices from web pages using JSON-LD, Microdata, OpenGraph, or CSS. Supports semantic queries and listing extraction.
Instructions
Extract data that matches a JSON Schema from JSON-LD, Microdata, OpenGraph, or CSS. Set multiple: true for listings. Set mode="semantic" together with query to receive bounded markdown chunks on the host. To restrict extraction to one region, pass exactly one scope parameter: selector, ref_id, or backendNodeId.
When to use: typed products, articles, prices, or semantic facts.
When NOT to use: for raw content use read_page; for ad-hoc scraping use javascript_tool.
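The scoping rule above can be sketched as a small argument builder. This is a minimal illustration, assuming the tool receives its arguments as a plain JSON object; build_extract_args is a hypothetical helper rather than part of the tool, the tab ID and selector are placeholder values, and how the arguments are sent depends on your client.

```python
from typing import Any, Optional

# Scope parameters named in the Input Schema; only one may be used per call.
SCOPE_KEYS = ("selector", "ref_id", "backendNodeId")


def build_extract_args(
    tab_id: Any,
    schema: dict,
    *,
    multiple: bool = False,
    instruction: Optional[str] = None,
    **scope: Any,
) -> dict:
    """Hypothetical helper: assemble an argument object for extract_data."""
    unknown = set(scope) - set(SCOPE_KEYS)
    if unknown:
        raise ValueError(f"unknown scope parameter(s): {sorted(unknown)}")
    given = {k: v for k, v in scope.items() if v is not None}
    if len(given) > 1:
        raise ValueError("pass only one of selector, ref_id, backendNodeId")

    args = {"tabId": tab_id, "schema": schema}
    if multiple:
        args["multiple"] = True  # return an array of items (listings/tables)
    if instruction:
        args["instruction"] = instruction
    args.update(given)
    return args


# Schema taken from the example in the Input Schema table.
product_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "price": {"type": "number"},
    },
}

# Listing extraction scoped to one region (placeholder tab ID and selector).
listing_args = build_extract_args(
    123, product_schema, multiple=True, selector="ul.products"
)
```

Rejecting a second scope parameter before the call keeps the mistake on the caller's side, where it is cheaper to diagnose.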
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| tabId | Yes | Tab ID to extract from | |
| schema | Yes | JSON Schema defining output structure. Example: { "type": "object", "properties": { "title": { "type": "string" }, "price": { "type": "number" } } } | |
| instruction | No | Optional natural language hint (e.g., "product details") | |
| query | No | Required for mode="semantic": query describing the information to extract from a bounded markdown chunk | |
| maxChars | No | Semantic mode only: max chunk chars returned to the host. Hard cap 50000. | 12000 |
| startFromChar | No | Semantic mode only: continuation offset into filtered markdown. | 0 |
| includeLinks | No | Semantic mode only: preserve markdown links. | true |
| includeImages | No | Semantic mode only: reserved for image markdown inclusion. | false |
| alreadyCollected | No | Semantic mode only: values already collected by the host, used for simple chunk dedupe hints. | |
| selector | No | CSS selector to scope extraction region | |
| ref_id | No | Element ref_id from read_page or oc_observe to scope extraction region | |
| backendNodeId | No | Chrome backend DOM node id to scope extraction region | |
| multiple | No | Extract array of items (for listings/tables). | false |
| output_mode | No | "inline": return the full payload in-band, byte-identical to v1.11.0. "handle": write payload to the handle store and return a small descriptor; redeem with oc_output_fetch. "auto": inline if payload ≤ output_inline_limit_bytes, otherwise handle. | "inline" |
| output_inline_limit_bytes | No | Only honored when output_mode="auto". If the serialized payload exceeds this byte count the response spills to a handle. | 32768 |
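
The output_mode rule and the semantic-mode limits in the table can be illustrated with two pure functions. This is a sketch only: resolve_output_mode and semantic_args are hypothetical names, the real decision is made by the tool itself, and clamping maxChars to the hard cap is an assumption about how an over-limit value would be treated.

```python
DEFAULT_INLINE_LIMIT_BYTES = 32768  # documented default for output_inline_limit_bytes
DEFAULT_MAX_CHARS = 12000           # documented default for maxChars
MAX_CHARS_HARD_CAP = 50000          # documented hard cap for maxChars


def resolve_output_mode(
    output_mode: str,
    payload_bytes: int,
    inline_limit_bytes: int = DEFAULT_INLINE_LIMIT_BYTES,
) -> str:
    """Return 'inline' or 'handle' following the rule in the table."""
    if output_mode in ("inline", "handle"):
        # The byte limit is only honored when output_mode="auto".
        return output_mode
    if output_mode == "auto":
        return "inline" if payload_bytes <= inline_limit_bytes else "handle"
    raise ValueError(f"unsupported output_mode: {output_mode!r}")


def semantic_args(
    tab_id,
    schema: dict,
    query: str,
    start_from_char: int = 0,
    max_chars: int = DEFAULT_MAX_CHARS,
) -> dict:
    """Hypothetical helper: arguments for one semantic-mode chunk request."""
    if not query:
        raise ValueError('query is required for mode="semantic"')
    return {
        "tabId": tab_id,
        "schema": schema,
        "mode": "semantic",
        "query": query,
        # Assumption: values above the hard cap are clamped, not rejected.
        "maxChars": min(max_chars, MAX_CHARS_HARD_CAP),
        "startFromChar": start_from_char,
    }
```

To read past the first chunk, the host would issue another request with a larger startFromChar. How the next offset is reported back is not specified in this section, so the sketch leaves that to the caller.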