url_to_markdown_tool
Convert web page content into clean markdown by scraping, removing unnecessary elements, and ranking content for clarity. Ideal for RAG applications and structured data extraction.
Instructions
Extract and convert web page content to markdown format.
This tool scrapes a web page, removes unnecessary elements, ranks content by importance using a custom algorithm, and returns clean markdown. Perfect for RAG applications.
Args:
- `url`: The web page URL to analyze and convert

Returns:
- `str`: Clean markdown representation of the web page content
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | The web page URL to analyze and convert | - |
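
For orientation, a minimal client-side call might look like the sketch below. It assumes the server is launched over stdio with `python -m web_analyzer_mcp.server` (the actual entry point may differ) and uses the MCP Python SDK's `ClientSession`; only the required `url` argument is passed.

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Assumed launch command; adjust to however the server is actually started.
server = StdioServerParameters(command="python", args=["-m", "web_analyzer_mcp.server"])

async def main() -> None:
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Call the tool with its single required argument.
            result = await session.call_tool(
                "url_to_markdown_tool",
                arguments={"url": "https://example.com/article"},
            )
            # The markdown comes back as a text content item.
            print(result.content[0].text)

asyncio.run(main())
```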
Implementation Reference
- `web_analyzer_mcp/server.py:20-36` (handler): The MCP tool handler and registration for `url_to_markdown_tool`. It defines the tool function with input/output types and a docstring schema, delegating to the core helper.

  ```python
  @mcp.tool()
  def url_to_markdown_tool(url: str) -> str:
      """
      Extract and convert web page content to markdown format.

      This tool scrapes a web page, removes unnecessary elements, ranks content
      by importance using a custom algorithm, and returns clean markdown.
      Perfect for RAG applications.

      Args:
          url: The web page URL to analyze and convert

      Returns:
          str: Clean markdown representation of the web page content
      """
      return url_to_markdown(url)
  ```
- Core implementation of URL to markdown conversion. Orchestrates URL validation, HTML extraction with Selenium, content cleaning, importance ranking, special-elements parsing, and markdown conversion.

  ```python
  def url_to_markdown(url: str) -> str:
      """
      Convert a URL to markdown format using advanced content extraction.

      This is the main function that replaces the original build_output function.
      It extracts HTML, analyzes content importance, and converts to markdown.

      Args:
          url: The URL to analyze and convert

      Returns:
          str: Markdown formatted content
      """
      try:
          # Ensure valid URL
          clean_url = ensure_url_scheme(url)

          # Extract HTML content
          html_content = extract_html_content(clean_url)

          # Parse HTML
          soup = BeautifulSoup(html_content, 'html.parser')

          # Extract special elements before cleaning
          special_elements = parse_special_elements(soup)

          # Clean HTML content
          cleaned_soup = clean_html_content(soup)

          # Rank content by importance
          main_content = rank_content_by_importance(cleaned_soup)

          # Convert to markdown
          markdown_result = convert_to_markdown(special_elements, main_content)

          return markdown_result

      except Exception as e:
          return f"Error processing URL {url}: {str(e)}"
  ```
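
The helpers called above (`ensure_url_scheme`, `extract_html_content`, `clean_html_content`, `rank_content_by_importance`, `parse_special_elements`, `convert_to_markdown`) are not reproduced here. As a rough illustration of the kind of work two of them do, the following is a minimal sketch of what `ensure_url_scheme` and `clean_html_content` could look like; it is not the project's actual code, and the list of stripped tags is an assumption.

```python
from bs4 import BeautifulSoup, Comment

# Tags that are typically noise for content extraction
# (assumed list, not the project's actual configuration).
NOISE_TAGS = ["script", "style", "nav", "header", "footer", "aside", "form", "iframe"]

def ensure_url_scheme(url: str) -> str:
    """Prepend https:// when the URL has no scheme (illustrative sketch)."""
    if not url.startswith(("http://", "https://")):
        return f"https://{url}"
    return url

def clean_html_content(soup: BeautifulSoup) -> BeautifulSoup:
    """Remove boilerplate elements in place and return the soup (illustrative sketch)."""
    for tag_name in NOISE_TAGS:
        for tag in soup.find_all(tag_name):
            tag.decompose()
    # Drop HTML comments as well.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()
    return soup
```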