extract_data
Pull structured fields from crawled pages using LLM analysis. Extract specific data like pricing or API endpoints, or use auto-discovery to build datasets from unstructured web content for competitive research.
Instructions
Extract structured fields from crawled pages using an LLM.
Analyzes each crawled page and pulls out specific data fields you define
(e.g. company_name, pricing, features, api_endpoints). If no fields are
specified, the LLM automatically discovers relevant fields by sampling
pages from the crawl.
This tool makes external API calls to OpenAI (requires OPENAI_API_KEY
environment variable). Results are saved to extracted.jsonl and include
LLM attribution metadata.
Use this for competitive research, API documentation analysis, or building
structured datasets from unstructured web content.
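The per-page flow described above can be sketched as follows. This is a minimal illustration, not the tool's actual implementation: the page schema (`url`, `text` keys), the record layout, and the `llm_call` stand-in for the OpenAI request are all assumptions made for the example.

```python
import json

def build_prompt(page_text, fields):
    # Ask the model to return only the requested fields as JSON.
    field_list = ", ".join(fields)
    return (
        f"Extract the following fields as JSON: {field_list}.\n"
        f"Page content:\n{page_text[:4000]}"
    )

def extract_pages(pages, fields, llm_call):
    # llm_call stands in for the OpenAI request; it should return
    # a dict mapping field name -> extracted value for one page.
    records = []
    for page in pages:
        prompt = build_prompt(page.get("text", ""), fields)
        extracted = llm_call(prompt)
        records.append({
            "url": page.get("url"),
            "fields": extracted,
            "llm": {"provider": "openai"},  # attribution metadata
        })
    return records

# Usage with a stubbed model call:
pages = [{"url": "https://example.com/pricing", "text": "Pro plan: $49/mo"}]
records = extract_pages(pages, ["pricing"], lambda p: {"pricing": "$49/mo"})
line = json.dumps(records[0])  # one line of extracted.jsonl
```

With real credentials, the stub would be replaced by an OpenAI API call; the surrounding loop and jsonl serialization stay the same.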
Args:
jsonl_path: Full path to the pages.jsonl file. If empty, defaults to
<WEBCRAWLER_OUTPUT_DIR>/pages.jsonl.
fields: Comma-separated field names to extract. Example:
"company_name,pricing,features,api_endpoints". Leave empty to
let the LLM auto-discover the most relevant fields.
context: Description of your analysis goal. Improves auto-field
discovery quality. Example: "competitor pricing analysis" or
"API documentation review". Ignored when fields are specified.
sample_size: Number of pages to sample for auto-field discovery.
Default: 3. Higher values give better field suggestions but
cost more tokens.Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| jsonl_path | No | Full path to the pages.jsonl file | <WEBCRAWLER_OUTPUT_DIR>/pages.jsonl |
| fields | No | Comma-separated field names to extract; empty triggers auto-discovery | |
| context | No | Description of the analysis goal; used only for auto-discovery | |
| sample_size | No | Number of pages to sample for auto-field discovery | 3 |
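For example, a call targeting pricing pages might pass arguments like this (the field names are illustrative; `jsonl_path` is left empty to use the default location, and `context` is omitted because it is ignored when fields are given):

```json
{
  "jsonl_path": "",
  "fields": "company_name,pricing,features",
  "sample_size": 3
}
```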
Output Schema
| Name | Required | Description | Default |
|---|---|---|---|
| result | Yes | | |
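A line of extracted.jsonl might look like the record below. The exact key names, including the shape of the attribution metadata, are assumptions for illustration, not a guaranteed schema:

```json
{"url": "https://example.com/pricing", "pricing": "$49/mo", "llm": {"provider": "openai", "model": "gpt-4o-mini"}}
```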