extract_url
Extract structured data from public web pages by specifying a URL and the fields you need. Returns clean JSON containing the requested fields, or Markdown or raw HTML on request.
Instructions
Extract structured data from permitted public web pages by providing a URL and describing what you want. By default, returns clean JSON with exactly the fields you asked for. Can also return clean Markdown or raw HTML when response_format is set. Uses supported fetch paths for JavaScript-heavy pages and returns explicit error signals when blocked. It does not solve CAPTCHAs, access login-only or paywalled pages, or circumvent anti-bot controls. This is the general-purpose extraction tool. For LLM/RAG-ready Markdown, full article content, or page meta tags, use extract_markdown, extract_article, or extract_metadata respectively; they are optimised shortcuts. Read-only: makes no changes to any external system. Requires the HAUNT_API_KEY environment variable. Free tier: 1,000 credits/month. Returns an error if the rate limit is exceeded, the credit quota is exhausted, or the API key is invalid.
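The choice between this tool and its shortcuts, plus the API key requirement, can be sketched as a small pre-flight helper. This is only an illustration: the goal labels (`"markdown"`, `"article"`, `"metadata"`) and the function names `choose_tool` and `api_key_present` are hypothetical and not part of the tool itself; only the tool names and the `HAUNT_API_KEY` variable come from the description above.

```python
import os

# Hypothetical goal labels mapped to the shortcut tools named in the description.
TOOL_FOR_GOAL = {
    "markdown": "extract_markdown",
    "article": "extract_article",
    "metadata": "extract_metadata",
}


def choose_tool(goal: str) -> str:
    """Return a shortcut tool for a known goal, else the general-purpose tool."""
    return TOOL_FOR_GOAL.get(goal, "extract_url")


def api_key_present(env=None) -> bool:
    """Check that HAUNT_API_KEY is set to a non-empty value.

    `env` defaults to the process environment; a plain dict can be passed
    in for testing.
    """
    env = os.environ if env is None else env
    return bool(env.get("HAUNT_API_KEY", "").strip())
```

For example, `choose_tool("article")` selects `extract_article`, while any goal without a dedicated shortcut, such as pulling prices from a product page, falls back to `extract_url`.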
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | The full URL of the page to extract data from. Must be a valid HTTP or HTTPS URL. Supports permitted public pages, including some JavaScript-heavy SPAs. Human-verification, login-required, CAPTCHA-gated, paywalled, and blocked pages return explicit errors rather than fabricated data. | |
| prompt | Yes | A plain-English description of what data to extract from the page. Be specific about which fields you want. Examples: 'product name, price, and availability', 'all email addresses and phone numbers', 'the main heading, first paragraph, and all image URLs'. The more specific, the more accurate the extraction. | |
| response_format | No | Optional output mode. Leave blank or use json for structured extraction. Use markdown/md when you want clean page text for an agent, RAG pipeline, or .md file. Use raw_html/html only when you need the fetched HTML. | json |
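
As a rough sketch of how a client might assemble arguments that satisfy the schema above, the helper below checks each field before a call is made. The function name `build_extract_url_args` and the client-side validation are assumptions for illustration; the tool performs its own checks and may accept or reject inputs differently.

```python
from urllib.parse import urlparse

# Output modes listed in the schema table above.
ALLOWED_FORMATS = {"json", "markdown", "md", "raw_html", "html"}


def build_extract_url_args(url, prompt, response_format=None):
    """Build an arguments dict for extract_url (hypothetical helper)."""
    parsed = urlparse(url)
    # The schema requires a full HTTP or HTTPS URL.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError("url must be a full HTTP or HTTPS URL")
    # The prompt is required and should name the fields to extract.
    if not prompt or not prompt.strip():
        raise ValueError("prompt is required")

    args = {"url": url, "prompt": prompt.strip()}

    # response_format is optional; omitting it leaves the default JSON mode.
    if response_format:
        fmt = response_format.strip().lower()
        if fmt not in ALLOWED_FORMATS:
            raise ValueError(f"unsupported response_format: {response_format!r}")
        args["response_format"] = fmt
    return args


if __name__ == "__main__":
    print(build_extract_url_args(
        "https://example.com/product/42",
        "product name, price, and availability",
    ))
```

Leaving `response_format` out of the dict, rather than sending an empty string, mirrors the "leave blank" wording in the table and keeps the default JSON behaviour.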