Trafilatura MCP Server

by fvanevski

fetch_and_extract

Extract main content, metadata, and optional comments from web pages by providing a URL. Returns structured JSON data for web scraping and content analysis.

Instructions

Fetches a URL and extracts the main content, metadata, and comments. Returns a JSON object with the extracted data.

Input Schema

Name              Required  Description                                                     Default
url               Yes       The URL of the web page to process.
include_comments  No        Whether to include comment sections at the bottom of articles.  false
include_tables    No        Extract text from HTML <table> elements.                        false
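
As a rough illustration of how a client might use this schema, the sketch below builds a JSON-RPC 2.0 `tools/call` request for `fetch_and_extract` and unwraps the text item that comes back. The helper names `build_call_request` and `parse_call_result` are hypothetical and not part of the server; the sketch assumes the response carries a single text content item, and it treats the `{"main_content": null, "metadata": {}}` object from the server's source as an empty result.

```python
import json
from typing import Any, Optional


def build_call_request(
    url: str,
    include_comments: bool = False,
    include_tables: bool = False,
    request_id: int = 1,
) -> str:
    """Serialize a JSON-RPC 2.0 `tools/call` request for fetch_and_extract."""
    payload = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "fetch_and_extract",
            "arguments": {
                "url": url,
                "include_comments": include_comments,
                "include_tables": include_tables,
            },
        },
    }
    return json.dumps(payload)


def parse_call_result(response_text: str) -> Optional[dict[str, Any]]:
    """Return the extracted document, or None for the empty fallback (hypothetical helper)."""
    response = json.loads(response_text)
    # Assumption: the tool returns one text item whose body is itself a JSON string.
    document = json.loads(response["result"]["content"][0]["text"])
    # Fallback object the server emits when trafilatura extracts nothing.
    if document == {"main_content": None, "metadata": {}}:
        return None
    return document
```

Only `url` has to be supplied; the two flags are spelled out here so the request is explicit about what it asks for.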

Implementation Reference

  • Core handler function for the fetch_and_extract tool: fetches URL content using trafilatura.fetch_url and extracts JSON-formatted main content and metadata with options for comments and tables.
    def perform_trafilatura(args: TrafilaturaInput) -> str:
        """
        Fetches and extracts content from a URL using Trafilatura.
    
        Args:
            args: A TrafilaturaInput object containing the URL and extraction options.
    
        Returns:
            A JSON string containing the extracted content and metadata.
        """
        logging.info(f"Executing fetch_and_extract for URL: '{args.url}'")
        try:
            # Fetch and extract the content from the given URL
            downloaded = trafilatura.fetch_url(args.url)
            if downloaded is None:
                raise McpError(
                    ErrorData(
                        code=INTERNAL_ERROR,
                        message=f"Failed to download content from URL: {args.url}",
                    )
                )
    
            # Extract the main content and metadata as a JSON string
            json_output = trafilatura.extract(
                downloaded,
                include_comments=args.include_comments,
                include_tables=args.include_tables,
                output_format="json",
                with_metadata=True,
                url=args.url
            )
    
            if json_output is None:
                # If trafilatura returns nothing, build a minimal JSON response
                return json.dumps({"main_content": None, "metadata": {}}, indent=4)
    
            return json_output
    
        except McpError:
            # Let the download failure raised above propagate unchanged
            raise
        except Exception as e:
            logging.error(
                f"An unexpected error occurred during Trafilatura processing: {e}",
                exc_info=True,
            )
            raise McpError(
                ErrorData(code=INTERNAL_ERROR, message=f"Unexpected error: {e}")
            ) from e
  • Pydantic input schema for the fetch_and_extract tool, defining required URL and optional flags for including comments and tables.
    class TrafilaturaInput(BaseModel):
        """Input model for the fetch_and_extract tool."""
    
        url: str = Field(..., description="The URL of the web page to process.")
        include_comments: bool = Field(
            default=False, description="Whether to include comment sections at the bottom of articles."
        )
        include_tables: bool = Field(
            default=False, description="Extract text from HTML <table> elements."
        )
  • Registration of the fetch_and_extract tool in the MCP server's list_tools method, providing name, description, and reference to the input schema.
    Tool(
        name="fetch_and_extract",
        description=(
            "Fetches a URL and extracts the main content, metadata, and comments. "
            "Returns a JSON object with the extracted data."
        ),
        inputSchema=TrafilaturaInput.model_json_schema(),
    )
  • MCP call_tool handler that dispatches to fetch_and_extract: validates tool name, parses input arguments using the schema, invokes the core perform_trafilatura function, and returns the JSON result as TextContent.
    @server.call_tool()
    async def call_tool(name: str, arguments: dict) -> list[TextContent]:
        if name != "fetch_and_extract":
            raise McpError(
                ErrorData(code=INVALID_PARAMS, message=f"Unknown tool: {name}")
            )
        try:
            args = TrafilaturaInput(**arguments)
        except ValueError as e:
            raise McpError(ErrorData(code=INVALID_PARAMS, message=str(e)))
    
        # Perform the extraction
        result_json_string = perform_trafilatura(args)
    
        # Wrap the JSON string in a single TextContent item
        return [TextContent(type="text", text=result_json_string)]
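
One thing worth noting about the handler above: `perform_trafilatura` is a synchronous function that performs a network download, so calling it directly inside the async `call_tool` keeps the event loop busy until the fetch finishes. A possible variation, which is not in the original source, is to hand the blocking call to a worker thread with `asyncio.to_thread`. The sketch below uses a hypothetical stand-in, `blocking_extract`, in place of the real function so that it runs without network access.

```python
import asyncio
import json
import time


def blocking_extract(url: str) -> str:
    """Hypothetical stand-in for perform_trafilatura: blocks briefly, returns JSON."""
    time.sleep(0.05)  # simulates the time spent downloading the page
    return json.dumps({"url": url})


async def call_tool_offloaded(url: str) -> str:
    # Run the blocking call in a worker thread so the event loop stays responsive.
    return await asyncio.to_thread(blocking_extract, url)


async def main() -> list[str]:
    # The two calls overlap in separate threads instead of running back to back.
    return await asyncio.gather(
        call_tool_offloaded("https://example.com/a"),
        call_tool_offloaded("https://example.com/b"),
    )


if __name__ == "__main__":
    print(asyncio.run(main()))
```

In the real handler the equivalent change would be awaiting `asyncio.to_thread(perform_trafilatura, args)`; whether that is worthwhile depends on how many concurrent requests the server is expected to handle.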

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/fvanevski/trafilatura_mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.