url_to_markdown_tool
Convert web page content into clean markdown by scraping, removing unnecessary elements, and ranking content for clarity. Ideal for RAG applications and structured data extraction.
Instructions
Extract and convert web page content to markdown format.
This tool scrapes a web page, removes unnecessary elements, ranks content by importance using a custom algorithm, and returns clean markdown. Perfect for RAG applications.
Args:
- `url`: The web page URL to analyze and convert

Returns:
- `str`: Clean markdown representation of the web page content
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | The web page URL to analyze and convert | - |
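
For orientation, a minimal client-side call might look like the sketch below. It assumes the server is launched over stdio with `python -m web_analyzer_mcp.server` (the actual entry point may differ) and uses the MCP Python SDK's `ClientSession`; only the required `url` argument is passed.

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Assumed launch command; adjust to however the server is actually started.
server = StdioServerParameters(command="python", args=["-m", "web_analyzer_mcp.server"])

async def main() -> None:
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Call the tool with its single required argument.
            result = await session.call_tool(
                "url_to_markdown_tool",
                arguments={"url": "https://example.com/article"},
            )
            # The markdown comes back as a text content item.
            print(result.content[0].text)

asyncio.run(main())
```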
Implementation Reference
- `web_analyzer_mcp/server.py:20-36` (handler): The MCP tool handler and registration for `url_to_markdown_tool`. It defines the tool function with input/output types and a docstring schema, delegating to the core helper.

  ```python
  @mcp.tool()
  def url_to_markdown_tool(url: str) -> str:
      """
      Extract and convert web page content to markdown format.

      This tool scrapes a web page, removes unnecessary elements, ranks content
      by importance using a custom algorithm, and returns clean markdown.
      Perfect for RAG applications.

      Args:
          url: The web page URL to analyze and convert

      Returns:
          str: Clean markdown representation of the web page content
      """
      return url_to_markdown(url)
  ```
- Core implementation of URL to markdown conversion. Orchestrates URL validation, HTML extraction with Selenium, content cleaning, importance ranking, special-elements parsing, and markdown conversion.

  ```python
  def url_to_markdown(url: str) -> str:
      """
      Convert a URL to markdown format using advanced content extraction.

      This is the main function that replaces the original build_output function.
      It extracts HTML, analyzes content importance, and converts to markdown.

      Args:
          url: The URL to analyze and convert

      Returns:
          str: Markdown formatted content
      """
      try:
          # Ensure valid URL
          clean_url = ensure_url_scheme(url)

          # Extract HTML content
          html_content = extract_html_content(clean_url)

          # Parse HTML
          soup = BeautifulSoup(html_content, 'html.parser')

          # Extract special elements before cleaning
          special_elements = parse_special_elements(soup)

          # Clean HTML content
          cleaned_soup = clean_html_content(soup)

          # Rank content by importance
          main_content = rank_content_by_importance(cleaned_soup)

          # Convert to markdown
          markdown_result = convert_to_markdown(special_elements, main_content)

          return markdown_result

      except Exception as e:
          return f"Error processing URL {url}: {str(e)}"
  ```
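
The helpers called above (`ensure_url_scheme`, `extract_html_content`, `clean_html_content`, `rank_content_by_importance`, `parse_special_elements`, `convert_to_markdown`) are not reproduced here. As a rough illustration of the kind of work two of them do, the following is a minimal sketch of what `ensure_url_scheme` and `clean_html_content` could look like; it is not the project's actual code, and the list of stripped tags is an assumption.

```python
from bs4 import BeautifulSoup, Comment

# Tags that are typically noise for content extraction
# (assumed list, not the project's actual configuration).
NOISE_TAGS = ["script", "style", "nav", "header", "footer", "aside", "form", "iframe"]

def ensure_url_scheme(url: str) -> str:
    """Prepend https:// when the URL has no scheme (illustrative sketch)."""
    if not url.startswith(("http://", "https://")):
        return f"https://{url}"
    return url

def clean_html_content(soup: BeautifulSoup) -> BeautifulSoup:
    """Remove boilerplate elements in place and return the soup (illustrative sketch)."""
    for tag_name in NOISE_TAGS:
        for tag in soup.find_all(tag_name):
            tag.decompose()
    # Drop HTML comments as well.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()
    return soup
```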