fetch_url_as_markdown
Fetches a URL and extracts the main content as clean Markdown, using headless Chromium for JavaScript-rendered or bot-protected sites.
Instructions
Fetch a URL and return the main content as Markdown.
First tries a plain HTTP request with an Accept: text/markdown header. If the server responds with Content-Type: text/markdown (e.g. Cloudflare Markdown for Agents sites), the body is returned immediately without launching a browser.
Otherwise, uses patchright (a Playwright fork with anti-detection patches) to drive real Chromium, which clears most Cloudflare bot challenges and renders JavaScript-required pages. A single headless Chromium instance is kept alive across calls so subsequent fetches avoid the browser cold-start cost (~2-5s). After navigation, polls the page DOM, runs trafilatura on each poll, and returns as soon as the extracted Markdown is identical across two consecutive polls. This typically happens within a few hundred milliseconds of the DOM being built, even if trackers, ads, and analytics are still loading in the background.
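The stabilization rule (stop once two consecutive polls produce the same Markdown) can be sketched as below. This is an illustrative approximation under stated assumptions, not the tool's code: `poll_until_stable` is a hypothetical name, and `extract` stands in for whatever reads the current page HTML and runs trafilatura on it. The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.

```python
import time
from typing import Callable, Optional


def poll_until_stable(
    extract: Callable[[], Optional[str]],
    poll_budget_ms: int = 5000,
    poll_interval_ms: int = 250,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[str]:
    """Poll `extract` until two consecutive non-empty results match.

    Returns the stable result, or the last non-empty result seen if the
    budget runs out first, or None if nothing was ever extracted.
    """
    deadline = clock() + poll_budget_ms / 1000
    previous: Optional[str] = None   # result of the immediately preceding poll
    last_good: Optional[str] = None  # most recent non-empty result, for fallback
    while True:
        current = extract()
        if current:
            if current == previous:
                return current  # unchanged across two consecutive polls
            last_good = current
        previous = current
        if clock() >= deadline:
            return last_good
        sleep(poll_interval_ms / 1000)
```

Returning the last non-empty result on budget exhaustion is a design choice of this sketch; the source does not say what the tool does when content never stabilizes.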
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | The URL to fetch. | |
| wait_until | No | When the navigation step is considered complete. "domcontentloaded" (default) returns when the HTML is parsed and the DOM is built. "load" waits for all subresources (images, scripts, stylesheets) — slower and rarely needed since content-stabilization polling runs after this. "networkidle" waits for network to quiet — best for SPAs but sometimes hangs on pages with persistent connections. "commit" returns as soon as the response starts. | domcontentloaded |
| timeout_ms | No | Navigation timeout in milliseconds. This is the budget for the navigation step only; content extraction has its own separate budget (poll_budget_ms). | 60000 |
| headless | No | Whether to run Chromium headless. Set to False to use a visible browser window: slower and pops a Chromium window on screen, but clears bot-detection challenges (Cloudflare, etc.) that block headless mode. If a fetch returns "ERROR: navigation timed out" or "ERROR: no extractable content" on a site that likely has bot protection, retry with headless=False. Requires a display, so headless=False fails on servers without a graphical environment unless a virtual display like Xvfb is configured. | True |
| poll_budget_ms | No | Maximum time in milliseconds after navigation to wait for content extraction to stabilize. Increase for slow SPAs that progressively render content over many seconds, or when using headless=False on bot-protected sites where the challenge takes time to resolve; 10000-15000 is reasonable for the latter. | 5000 |
| poll_interval_ms | No | How often, in milliseconds, to re-attempt extraction during polling. | 250 |
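
The retry guidance in the headless and poll_budget_ms rows can be expressed as a small wrapper on the caller's side. This is a sketch only: `call_tool` is a hypothetical callable that sends an argument dictionary to fetch_url_as_markdown and returns its result as a string, and the choice of 15000 ms is taken from the upper end of the range suggested above.

```python
from typing import Any, Callable, Dict

# Error prefixes that the parameter table says may indicate bot protection.
RETRYABLE_PREFIXES = (
    "ERROR: navigation timed out",
    "ERROR: no extractable content",
)


def fetch_with_headed_retry(
    call_tool: Callable[[Dict[str, Any]], str],
    url: str,
) -> str:
    """Fetch with defaults first; on a retryable error, retry with a visible browser."""
    result = call_tool({"url": url})
    if not result.startswith(RETRYABLE_PREFIXES):
        return result
    # A visible browser needs a display (or a virtual one such as Xvfb),
    # and challenges can take a while, so allow a larger polling budget.
    return call_tool({"url": url, "headless": False, "poll_budget_ms": 15000})
```

The wrapper retries at most once, so a site that still fails with a visible browser returns the second error message unchanged.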
Output Schema
| Name | Required | Description | Default |
|---|---|---|---|
| result | Yes | | |