extract_url_content
Extracts clean main text from URLs using browser automation and readability tools. Handles dynamic rendering and GitHub repos, and pre-checks for non-HTML content. Ideal for articles, blogs, and structured data extraction.
Instructions
Uses browser automation (Puppeteer) and Mozilla's Readability library to extract the main article text from a given URL. It handles dynamic JavaScript rendering and includes fallback logic. For GitHub repository URLs, it attempts to fetch structured content via gitingest.com. It pre-checks for non-HTML content types before extraction and checks the HTTP status after navigation. Best suited to getting clean text from articles and blog posts. Note: on complex homepages or dashboards it may struggle to isolate the core content and can include UI elements in the output.
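The routing described above (GitHub repositories go to gitingest.com, non-HTML content is caught by the pre-check, and the HTTP status is checked after navigation) can be pictured as a small decision function. The following is a minimal sketch only, not the tool's actual implementation: the helper names, the returned labels, the list of HTML media types, and the `gitingest.com/<owner>/<repo>` mapping are all assumptions made for illustration.

```python
from typing import Optional
from urllib.parse import urlparse

# Assumption: media types treated as HTML by the pre-check.
HTML_TYPES = ("text/html", "application/xhtml+xml")


def is_github_repo_url(url: str) -> bool:
    """True for URLs shaped like https://github.com/<owner>/<repo>[/...]."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    parts = [p for p in parsed.path.split("/") if p]
    return host in ("github.com", "www.github.com") and len(parts) >= 2


def gitingest_url(url: str) -> str:
    """Assumed mapping: keep only owner and repo, swap the host."""
    owner, repo = [p for p in urlparse(url).path.split("/") if p][:2]
    return f"https://gitingest.com/{owner}/{repo}"


def is_html(content_type: Optional[str]) -> bool:
    """Compare the media type only, ignoring parameters such as charset."""
    if not content_type:
        return True  # assumption: a missing header is treated as HTML
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in HTML_TYPES


def plan_extraction(url: str, content_type: Optional[str] = None,
                    status: int = 200) -> str:
    """Return a hypothetical label for the strategy that would be used."""
    if is_github_repo_url(url):
        return "gitingest"
    if not is_html(content_type):
        return "skip-non-html"
    if status >= 400:
        return "http-error"
    return "browser-readability"
```

For example, `plan_extraction("https://example.com/report.pdf", "application/pdf")` returns `"skip-non-html"`, while an ordinary article URL served as `text/html` falls through to `"browser-readability"`. Keeping the decision separate from the browser work makes the fallback order easy to test without launching a browser.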
Input Schema
Name | Required | Description | Default |
---|---|---|---|
depth | No | Optional: Maximum depth for recursive link exploration (1-5). Default is 1 (no recursion). | |
url | Yes | The URL of the website to extract content from. | |
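As an illustration of the schema above, the arguments could be validated and defaulted along these lines. This is a hedged sketch: `normalize_args` is a hypothetical helper, and the restriction to `http`/`https` URLs is an assumption that the schema itself does not state. Only the required `url`, the optional `depth`, its 1-5 range, and its default of 1 come from the table.

```python
from urllib.parse import urlparse

DEFAULT_DEPTH = 1            # from the schema: default is 1 (no recursion)
MIN_DEPTH, MAX_DEPTH = 1, 5  # from the schema: allowed range is 1-5


def normalize_args(args: dict) -> dict:
    """Validate tool arguments and fill in the default depth."""
    url = args.get("url")
    if not isinstance(url, str) or not url.strip():
        raise ValueError("'url' is required and must be a non-empty string")
    url = url.strip()
    # Assumption: only web URLs make sense for browser-based extraction.
    if urlparse(url).scheme not in ("http", "https"):
        raise ValueError("'url' must start with http:// or https://")

    depth = args.get("depth", DEFAULT_DEPTH)
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(depth, bool) or not isinstance(depth, int):
        raise ValueError("'depth' must be an integer")
    if not MIN_DEPTH <= depth <= MAX_DEPTH:
        raise ValueError("'depth' must be between 1 and 5")

    return {"url": url, "depth": depth}
```

Calling `normalize_args({"url": "https://example.com/post"})` yields `{"url": "https://example.com/post", "depth": 1}`, and a `depth` of 6 raises `ValueError`.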