extract_url_content
Extract clean main article text from any URL using browser automation and fallback logic. Handles dynamic JavaScript rendering and includes structured content retrieval for GitHub repositories. Ideal for articles and blog posts.
Instructions
Uses browser automation (Puppeteer) and Mozilla's Readability library to extract the main article text from a given URL. Handles dynamic JavaScript rendering and includes fallback logic. For GitHub repository URLs, it attempts to fetch structured content via gitingest.com. Performs a pre-check for non-HTML content types and checks the HTTP status after navigation. Ideal for getting clean text from articles and blog posts. Note: On complex homepages or dashboards, it may fail to isolate the core content and include UI elements in the output.
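The routing described above (GitHub repositories to gitingest.com, a content-type pre-check, then browser rendering with Readability) can be pictured with a small dependency-free sketch. This is not the tool's actual code: the names `Strategy`, `isGitHubRepoUrl`, `chooseStrategy`, and `isSuccessStatus` are hypothetical, the order of the checks is assumed, and treating only 2xx responses as successful is also an assumption.

```typescript
// Hypothetical sketch of how a request might be routed; not the tool's real implementation.
type Strategy = "gitingest" | "skip-non-html" | "browser-readability";

// Rough heuristic (assumption): github.com URLs with at least /owner/repo in the path.
function isGitHubRepoUrl(raw: string): boolean {
  let parsed: URL;
  try {
    parsed = new URL(raw);
  } catch {
    return false; // not a parseable absolute URL
  }
  if (parsed.hostname !== "github.com" && parsed.hostname !== "www.github.com") {
    return false;
  }
  const segments = parsed.pathname.split("/").filter((s) => s.length > 0);
  return segments.length >= 2;
}

// contentType is the Content-Type header from a pre-check request, or null if unknown.
function chooseStrategy(url: string, contentType: string | null): Strategy {
  if (isGitHubRepoUrl(url)) {
    return "gitingest";
  }
  if (contentType !== null) {
    // Drop parameters such as "; charset=utf-8" before comparing.
    const mime = (contentType.split(";")[0] ?? "").trim().toLowerCase();
    if (mime !== "text/html" && mime !== "application/xhtml+xml") {
      return "skip-non-html";
    }
  }
  return "browser-readability";
}

// Assumed meaning of the post-navigation status check: accept 2xx only.
function isSuccessStatus(status: number): boolean {
  return status >= 200 && status < 300;
}
```

In a real implementation, the `browser-readability` branch would load the page in Puppeteer and pass the rendered DOM to Readability; that step is omitted here to keep the sketch self-contained.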
Input Schema
Name | Required | Description | Default |
---|---|---|---|
depth | No | Maximum depth for recursive link exploration (1-5). A value of 1 means no recursion. | 1 |
url | Yes | The URL of the website to extract content from. | |
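As an illustration of the schema above, the following sketch validates a set of arguments and applies the default depth. The interface name `ExtractUrlContentInput` and the function `normalizeInput` are hypothetical, and rejecting non-integer depths is an assumption; the tool's own validation may differ.

```typescript
// Hypothetical shape of the tool's arguments, taken from the schema table.
interface ExtractUrlContentInput {
  url: string;    // required
  depth?: number; // optional, 1-5, defaults to 1 (no recursion)
}

// Validates the arguments and fills in the default depth.
function normalizeInput(input: ExtractUrlContentInput): { url: string; depth: number } {
  // new URL() throws a TypeError when the string is not a valid absolute URL.
  const parsed = new URL(input.url);
  const depth = input.depth ?? 1;
  if (!Number.isInteger(depth) || depth < 1 || depth > 5) {
    throw new RangeError("depth must be an integer from 1 to 5");
  }
  return { url: parsed.href, depth };
}
```

A call with only `url` set yields a depth of 1, so a single page is extracted; passing `depth: 3` would request link exploration up to three levels deep.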