extract_data
Use an LLM to extract structured fields from crawled pages. Define the fields yourself or let the LLM auto-discover them by sampling pages. Results are saved to extracted.jsonl. Ideal for competitive research, API analysis, and dataset creation.
Instructions
Extract structured fields from crawled pages using an LLM.
Analyzes each crawled page and pulls out specific data fields you define
(e.g. company_name, pricing, features, api_endpoints). If no fields are
specified, the LLM automatically discovers relevant fields by sampling
pages from the crawl.
This tool makes external API calls to OpenAI (requires OPENAI_API_KEY
environment variable). Results are saved to extracted.jsonl and include
LLM attribution metadata.
Use this for competitive research, API documentation analysis, or building
structured datasets from unstructured web content.
Args:
jsonl_path: Full path to the pages.jsonl file. If empty, defaults to
<MARKCRAWL_OUTPUT_DIR>/pages.jsonl.
fields: Comma-separated field names to extract. Example:
"company_name,pricing,features,api_endpoints". Leave empty to
let the LLM auto-discover the most relevant fields.
context: Description of your analysis goal. Improves auto-field
discovery quality. Example: "competitor pricing analysis" or
"API documentation review". Ignored when fields are specified.
sample_size: Number of pages to sample for auto-field discovery.
Default: 3. Higher values give better field suggestions but
cost more tokens.
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
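| jsonl_path | No | Full path to the pages.jsonl file. | `<MARKCRAWL_OUTPUT_DIR>/pages.jsonl` |
| fields | No | Comma-separated field names to extract. Leave empty to let the LLM auto-discover fields. | |
| context | No | Description of your analysis goal. Ignored when fields are specified. | |
| sample_size | No | Number of pages to sample for auto-field discovery. | 3 |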
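As a rough illustration of how the four arguments fit together, the Python sketch below assembles an argument dictionary for extract_data. The helper name build_extract_args is hypothetical, and the way it mirrors the documented defaults (falling back to <MARKCRAWL_OUTPUT_DIR>/pages.jsonl, dropping context when fields are given) is an assumption based on the argument descriptions, not the server's actual code.

```python
import os


def build_extract_args(fields=None, context="", sample_size=3, jsonl_path=""):
    """Assemble arguments for extract_data (hypothetical helper, not part of MarkCrawl)."""
    if sample_size < 1:
        raise ValueError("sample_size must be at least 1")

    # Assumption: mirror the documented fallback to <MARKCRAWL_OUTPUT_DIR>/pages.jsonl.
    # If the variable is unset, leave the path empty so the tool applies its own default.
    if not jsonl_path:
        output_dir = os.environ.get("MARKCRAWL_OUTPUT_DIR", "")
        if output_dir:
            jsonl_path = os.path.join(output_dir, "pages.jsonl")

    # The tool expects a single comma-separated string of field names.
    fields_str = ",".join(name.strip() for name in (fields or []) if name.strip())

    return {
        "jsonl_path": jsonl_path,
        "fields": fields_str,
        # context is documented as ignored when fields are specified.
        "context": "" if fields_str else context,
        "sample_size": sample_size,
    }
```

For example, build_extract_args(fields=["company_name", "pricing"]) yields an explicit field list, while build_extract_args(context="competitor pricing analysis", sample_size=5) leaves fields empty so that auto-discovery applies.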
Output Schema
| Name | Required | Description | Default |
|---|---|---|---|
| result | Yes | | |
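Because results are written to extracted.jsonl, a downstream script can read them as one JSON object per line. The sketch below is a minimal reader with a small coverage summary; the function names are hypothetical, and since the exact shape of each record is not specified here, it treats every record as a plain dictionary without assuming particular keys.

```python
import json
from pathlib import Path


def load_extracted(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def field_coverage(records):
    """Count how many records carry a non-empty value for each key."""
    counts = {}
    for record in records:
        for key, value in record.items():
            # Treat None, empty strings, and empty containers as "not filled in".
            if value not in (None, "", [], {}):
                counts[key] = counts.get(key, 0) + 1
    return counts
```

Calling field_coverage(load_extracted("extracted.jsonl")) gives a quick view of which fields were filled in across the crawled pages.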