alterlab_extract
Extract structured data from raw HTML, text, or markdown content using predefined extraction profiles or natural language prompts. Returns JSON with extracted fields like price, title, author, and more.
Instructions
Extract structured data from raw HTML, text, or markdown content WITHOUT scraping: you supply the pre-fetched content. Use this tool when you already have the page content and want to run AlterLab's extraction pipeline on it. To scrape and extract in one step, use alterlab_scrape with formats=['json'] instead.

Profiles: 'product' (price, title, reviews), 'article' (title, author, body), 'job_posting', 'faq', 'recipe', 'event', 'ecommerce_homepage', 'directory_listing'. For natural language extraction (LLM-powered), use extraction_prompt. To retrieve a previously cached result without calling the LLM, use cache='only'. The tool returns JSON.
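
To make the two extraction modes concrete, here is a minimal sketch that builds argument payloads for this tool as plain Python dicts. The helper names (`profile_args`, `prompt_args`) and the sample HTML are hypothetical, and passing `formats` as a list is an assumption based on the `formats=['json']` usage mentioned for alterlab_scrape. Only the parameter names and values come from this page. How the arguments are actually sent depends on your MCP client and is not shown.

```python
import json

# Hypothetical sample content; in practice this is a page you already fetched.
SAMPLE_HTML = "<html><body><h1>Blue Widget</h1><span class='price'>$19.99</span></body></html>"


def profile_args(content: str, profile: str) -> dict:
    """Build arguments for extraction with a predefined profile (e.g. 'product')."""
    return {
        "content": content,
        "content_type": "html",
        "extraction_profile": profile,
        "formats": ["json"],  # assumption: formats is passed as a list
    }


def prompt_args(content: str, prompt: str, cached_only: bool = False) -> dict:
    """Build arguments for natural language (LLM-powered) extraction."""
    args = {
        "content": content,
        "content_type": "html",
        "extraction_prompt": prompt,
        "formats": ["json"],  # assumption: formats is passed as a list
    }
    if cached_only:
        # cache='only' asks for a previously cached result without calling the LLM.
        args["cache"] = "only"
    return args


product_call = profile_args(SAMPLE_HTML, "product")
prompt_call = prompt_args(
    SAMPLE_HTML, "Extract all product prices and ratings", cached_only=True
)
print(json.dumps(product_call, indent=2))
```

Keeping the payload builders separate from the transport makes it easy to log or inspect exactly what will be sent before any LLM-charged call is made.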
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| content | Yes | Raw content to extract from — HTML, text, or markdown. Bring your own pre-fetched content; this endpoint does NOT scrape a URL. | |
| content_type | No | Type of the provided content: HTML, text, or markdown. | html |
| extraction_profile | No | Pre-defined extraction profile. 'product' extracts price/title/reviews, 'article' extracts title/author/body, etc. 'auto' detects the page type. Mutually exclusive with extraction_template. | |
| extraction_template | No | Shorthand alias for extraction_profile — selects the same pre-built schema template. Mutually exclusive with extraction_profile. | |
| extraction_schema | No | Custom JSON Schema for extraction. Fields are populated from the content. Overrides extraction_profile when provided. | |
| extraction_prompt | No | Natural language instructions for LLM extraction (e.g., 'Extract all product prices and ratings'). Charged at LLM extraction rate when provided. | |
| extraction_model | No | Per-request LLM model override in provider-specific format (e.g. 'gpt-4o', 'claude-opus-4-5-20251101', 'llama3-70b-8192'). Overrides the model saved in your BYOK key settings for this request only. | |
| extraction_provider | No | LLM provider to use for extraction. Selects the matching BYOK key registered at /dashboard/settings/llm-keys. When omitted, the most recently used registered key is used. | |
| formats | No | Output formats for content transformation. 'json' is best for structured extraction. 'content' returns filtered/cleaned content. 'raw' returns the unprocessed response body. | |
| source_url | No | Original URL of the content (for context only — not fetched). Helps the extractor understand the content's domain. | |
| evidence | No | Include provenance/evidence for extracted fields (which part of the content each field came from). | |
| cache | No | Cache control for LLM extraction results. 'auto': return cached result if available (default). 'skip': bypass cache lookup, always call LLM (result is still stored). 'only': return cached result or 404 if not cached — never calls the LLM. | auto |
| cache_ttl | No | TTL for caching this extraction result, in seconds. Defaults to server setting (3600s). Max 86400s (24 hours). | 3600 |
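
Several constraints in the table above (the required `content` field, the mutual exclusivity of extraction_profile and extraction_template, the allowed cache values, and the cache_ttl ceiling) can be checked client-side before a call is made. The following is a minimal sketch of such a check; `validate_args` is a hypothetical helper, not part of AlterLab, and the requirement that cache_ttl be positive is an assumption, since the table only states a maximum.

```python
MAX_CACHE_TTL = 86400  # 24 hours, the maximum stated in the table
CACHE_MODES = {"auto", "skip", "only"}


def validate_args(args: dict) -> list:
    """Return a list of problems found in an alterlab_extract argument dict."""
    problems = []

    # content is the only required parameter.
    if not args.get("content"):
        problems.append("content is required")

    # extraction_template is an alias for extraction_profile; only one may be set.
    if "extraction_profile" in args and "extraction_template" in args:
        problems.append(
            "extraction_profile and extraction_template are mutually exclusive"
        )

    # cache defaults to 'auto' when omitted.
    cache = args.get("cache", "auto")
    if cache not in CACHE_MODES:
        problems.append("cache must be one of: " + ", ".join(sorted(CACHE_MODES)))

    # Assumption: a TTL must be positive; the table only documents the maximum.
    ttl = args.get("cache_ttl")
    if ttl is not None and not (0 < ttl <= MAX_CACHE_TTL):
        problems.append("cache_ttl must be between 1 and %d seconds" % MAX_CACHE_TTL)

    return problems


ok = validate_args({"content": "<p>hello</p>", "extraction_profile": "article"})
bad = validate_args(
    {
        "content": "<p>hello</p>",
        "extraction_profile": "product",
        "extraction_template": "product",
        "cache_ttl": 90000,
    }
)
print(ok)
print(bad)
```

Returning a list of problems rather than raising on the first one lets a caller report every issue at once; the server remains the authority on what is actually accepted.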