alterlab_extract
Extract structured data from HTML, text, or markdown content using pre-defined profiles (product, article, etc.) or custom schemas. Returns JSON with optional evidence and caching.
Instructions
Extract structured data from raw HTML, text, or markdown content WITHOUT scraping. Bring your own pre-fetched content. Use this when you already have the page content and want to run AlterLab's extraction pipeline on it. For scraping + extraction in one step, use alterlab_scrape with formats=['json'] instead.

Profiles: 'product' (price, title, reviews), 'article' (title, author, body), 'job_posting', 'faq', 'recipe', 'event', 'ecommerce_homepage', 'directory_listing'.

Returns JSON data. Use extraction_prompt for natural language extraction (LLM-powered). Use cache='only' to retrieve a previously cached result without calling the LLM.
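As a rough illustration of the usage described above, the sketch below builds two argument payloads for alterlab_extract: one that runs the 'product' profile on pre-fetched HTML, and one that only accepts a cached result. The parameter names come from the input schema; the sample HTML, the example.com URL, the list form of formats, and the boolean form of evidence are assumptions made for illustration, and how the payload is actually sent depends on your MCP client.

```python
import json

# Pre-fetched HTML (sample data); alterlab_extract does not fetch URLs itself.
html = (
    "<html><body><h1>Blue Widget</h1>"
    "<span class='price'>$19.99</span></body></html>"
)

# Arguments using parameter names from the input schema.
extract_args = {
    "content": html,
    "content_type": "html",
    "extraction_profile": "product",
    "formats": ["json"],   # assumed to be a list, as in formats=['json']
    "evidence": True,      # assumed to be a boolean flag
    "source_url": "https://example.com/widgets/blue",  # context only, not fetched
}

# Same request, but return only a previously cached result (no LLM call).
cached_only_args = {**extract_args, "cache": "only"}

print(json.dumps(cached_only_args, indent=2))
```

Keeping the cached-only variant as a copy of the base payload means both requests describe the same content and profile, differing only in the cache field.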
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| cache | No | Cache control for LLM extraction results. 'auto': return cached result if available (default). 'skip': bypass cache lookup, always call LLM (result is still stored). 'only': return cached result or 404 if not cached — never calls the LLM. | auto |
| content | Yes | Raw content to extract from — HTML, text, or markdown. Bring your own pre-fetched content; this endpoint does NOT scrape a URL. | |
| formats | No | Output formats for content transformation. 'json' is best for structured extraction. 'content' returns filtered/cleaned content. 'raw' returns the unprocessed response body. | |
| evidence | No | Include provenance/evidence for each extracted field (which part of the content the field came from). | |
| cache_ttl | No | TTL for caching this extraction result, in seconds. Defaults to server setting (3600s). Max 86400s (24 hours). | |
| source_url | No | Original URL of the content (for context only — not fetched). Helps the extractor understand the content's domain. | |
| content_type | No | Type of the provided content (HTML, text, or markdown). | html |
| extraction_model | No | Per-request LLM model override in provider-specific format (e.g. 'gpt-4o', 'claude-opus-4-5-20251101', 'llama3-70b-8192'). Overrides the model saved in your BYOK key settings for this request only. | |
| extraction_prompt | No | Natural language instructions for LLM extraction (e.g., 'Extract all product prices and ratings'). Charged at the LLM extraction rate when provided. | |
| extraction_schema | No | Custom JSON Schema for extraction. Fields are mapped from the content. Overrides extraction_profile when provided. | |
| extraction_profile | No | Pre-defined extraction profile. 'product' extracts price/title/reviews, 'article' extracts title/author/body, etc. 'auto' detects the page type. Mutually exclusive with extraction_template. | |
| extraction_provider | No | LLM provider to use for extraction. Selects the matching BYOK key registered at /dashboard/settings/llm-keys. When omitted, the most recently used registered key is used. | |
| extraction_template | No | Shorthand alias for extraction_profile; selects the same pre-built schema template. Mutually exclusive with extraction_profile. | |
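The constraints stated in the table (content is required, cache takes one of three values, cache_ttl is capped at 86400 seconds, and extraction_profile and extraction_template are mutually exclusive) can be checked client-side before calling the tool. The following is a minimal sketch of such a check; check_extract_args is a hypothetical helper, not part of AlterLab, and the server remains the authority on what it accepts.

```python
MAX_CACHE_TTL = 86400                    # 24 hours, per the cache_ttl row
CACHE_MODES = {"auto", "skip", "only"}   # values listed in the cache row


def check_extract_args(args: dict) -> list:
    """Return a list of problems found in alterlab_extract arguments.

    Hypothetical client-side helper; an empty list means no problem
    was detected against the constraints documented in the table.
    """
    problems = []
    if not args.get("content"):
        problems.append("content is required")
    if args.get("cache", "auto") not in CACHE_MODES:
        problems.append("cache must be 'auto', 'skip', or 'only'")
    ttl = args.get("cache_ttl")
    if ttl is not None and ttl > MAX_CACHE_TTL:
        problems.append("cache_ttl must not exceed 86400 seconds")
    if "extraction_profile" in args and "extraction_template" in args:
        problems.append(
            "extraction_profile and extraction_template are mutually exclusive"
        )
    return problems


# Example: both selectors given and a TTL above the documented maximum.
bad_args = {
    "content": "<p>Hello</p>",
    "extraction_profile": "article",
    "extraction_template": "article",
    "cache_ttl": 100000,
}
for problem in check_extract_args(bad_args):
    print(problem)
```

Returning a list of messages rather than raising on the first failure lets a caller report every problem in one pass.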