extract_content
Extract clean text and markdown content from web pages using hybrid extraction strategies, with optional JavaScript rendering support for dynamic sites.
Instructions
Extract clean text/markdown content from a URL using trafilatura (fast) with optional Playwright fallback (JS-rendered pages).
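The fast-path-then-fallback idea can be sketched roughly as follows. This is an illustration only, not the server's actual logic: `extract_with_fallback`, the `Extractor` type, the stand-in extractors, and the `min_chars` threshold are all hypothetical names introduced here. The two extractors are passed in as plain callables so the sketch runs without trafilatura or Playwright installed.

```python
from typing import Callable, Optional

# Hypothetical type: takes a URL and a timeout, returns extracted text or None.
Extractor = Callable[[str, float], Optional[str]]


def extract_with_fallback(
    url: str,
    fast_extract: Extractor,
    rendered_extract: Optional[Extractor] = None,
    timeout: float = 10,
    min_chars: int = 200,  # assumed threshold, not taken from the source
) -> dict:
    """Try the fast extractor first; use the rendering extractor only if needed."""
    text = fast_extract(url, timeout)
    if text and len(text.strip()) >= min_chars:
        return {"url": url, "method": "fast", "content": text}

    # The fast path returned nothing usable, which often means the page
    # builds its content with JavaScript.
    if rendered_extract is not None:
        rendered = rendered_extract(url, timeout)
        if rendered:
            return {"url": url, "method": "rendered", "content": rendered}

    # No fallback available (or it also failed): return whatever we have.
    return {"url": url, "method": "fast", "content": text or ""}


# Stand-in extractors so the example runs offline.
def static_page(url: str, timeout: float) -> Optional[str]:
    return "word " * 100  # 500 characters of placeholder content


def empty_page(url: str, timeout: float) -> Optional[str]:
    return None  # simulates a JS-only page with no static text


def browser_render(url: str, timeout: float) -> Optional[str]:
    return "rendered content"


print(extract_with_fallback("https://example.com", static_page)["method"])
print(extract_with_fallback("https://example.com", empty_page, browser_render)["method"])
```

Keeping the rendering step optional means the slower browser path is only paid for when the fast path comes back empty or too short.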
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | The URL to extract content from. | |
| timeout | No | Fetch timeout in seconds. | 10 |
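As a rough illustration of how a caller might apply this schema before invoking the tool, the sketch below checks the required `url` and fills in the `timeout` default. The helper name `normalize_arguments` is hypothetical, and the positive-number check on `timeout` is an added assumption; the schema itself only states the default.

```python
def normalize_arguments(arguments: dict) -> dict:
    """Apply the input schema: `url` is required, `timeout` defaults to 10."""
    url = arguments.get("url", "")
    if not isinstance(url, str) or not url:
        raise ValueError("url is required")

    timeout = arguments.get("timeout", 10)
    # Assumption: reject booleans, non-numbers, and non-positive values.
    if isinstance(timeout, bool) or not isinstance(timeout, (int, float)) or timeout <= 0:
        raise ValueError("timeout must be a positive number of seconds")

    return {"url": url, "timeout": timeout}


print(normalize_arguments({"url": "https://example.com"}))
# → {'url': 'https://example.com', 'timeout': 10}
```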
Implementation Reference
- src/interdeep/server.py:162-172 (handler) The _handle_extract_content function executes the extraction logic using extract_hybrid_async.

      async def _handle_extract_content(arguments: dict) -> list[TextContent]:
          url = arguments.get("url", "")
          if not url:
              return _err("url is required")
          timeout = arguments.get("timeout", 10)
          try:
              result = await extract_hybrid_async(url=url, timeout=timeout)
              return _ok(_result_to_dict(result))
          except Exception as e:
              logger.exception("extract_content failed for %s", url)
              return _err(f"Extraction failed: {e}")

- src/interdeep/server.py:61-68 (registration) The Tool definition for 'extract_content' is registered in the list_tools handler.

      Tool(
          name="extract_content",
          description="Extract clean text/markdown content from a URL using trafilatura (fast) with optional Playwright fallback (JS-rendered pages).",
          inputSchema={
              "type": "object",
              "properties": {
                  "url": {
                      "type": "string",

- src/interdeep/server.py:242-242 (registration) The 'extract_content' tool is mapped to its handler function in the _HANDLERS dictionary.

      "extract_content": _handle_extract_content,