Trafilatura MCP Server

by fvanevski

fetch_and_extract

Extract main content, metadata, and optional comments from web pages by providing a URL. Returns structured JSON data for web scraping and content analysis.

Instructions

Fetches a URL and extracts the main content, metadata, and comments. Returns a JSON object with the extracted data.

Input Schema

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| url | Yes | The URL of the web page to process. | (none) |
| include_comments | No | Whether to include comment sections at the bottom of articles. | False |
| include_tables | No | Extract text from HTML <table> elements. | False |
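
To make the schema concrete, here is a minimal sketch of how a client might assemble a call to this tool. It assumes the standard MCP JSON-RPC 2.0 `tools/call` request shape; `build_call_request` and the example URL are hypothetical names used only for illustration and are not part of the server.

```python
import json


def build_call_request(url, include_comments=False, include_tables=False, request_id=1):
    """Build a JSON-RPC 2.0 `tools/call` request for fetch_and_extract.

    Hypothetical helper for illustration; not part of the server itself.
    """
    # `url` is the only required field in the input schema.
    if not isinstance(url, str) or not url:
        raise ValueError("url is required and must be a non-empty string")
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "fetch_and_extract",
            "arguments": {
                "url": url,
                # Both flags are optional and default to False in the schema.
                "include_comments": bool(include_comments),
                "include_tables": bool(include_tables),
            },
        },
    }


if __name__ == "__main__":
    # Placeholder URL; substitute the page you want to process.
    request = build_call_request("https://example.com/article", include_tables=True)
    print(json.dumps(request, indent=2))
```

In practice an MCP client library would normally build and send this message for you; the sketch only shows which fields map to the schema above.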

Implementation Reference

  • Core handler function for the fetch_and_extract tool: fetches URL content using trafilatura.fetch_url and extracts JSON-formatted main content and metadata with options for comments and tables.
    def perform_trafilatura(args: TrafilaturaInput) -> str:
        """
        Fetches and extracts content from a URL using Trafilatura.
    
        Args:
            args: A TrafilaturaInput object containing the URL and extraction options.
    
        Returns:
            A JSON string containing the extracted content and metadata.
        """
        logging.info(f"Executing fetch_and_extract for URL: '{args.url}'")
        try:
            # Fetch and extract the content from the given URL
            downloaded = trafilatura.fetch_url(args.url)
            if downloaded is None:
                raise McpError(
                    ErrorData(
                        code=INTERNAL_ERROR,
                        message=f"Failed to download content from URL: {args.url}",
                    )
                )
    
            # Extract the main content and metadata as a JSON string
            json_output = trafilatura.extract(
                downloaded,
                include_comments=args.include_comments,
                include_tables=args.include_tables,
                output_format="json",
                with_metadata=True,
                url=args.url
            )
    
            if json_output is None:
                # If trafilatura returns nothing, build a minimal JSON response
                return json.dumps({"main_content": None, "metadata": {}}, indent=4)
    
            return json_output
    
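        except McpError:
            # Re-raise errors created above (e.g. the download failure) unchanged,
            # so they are not rewrapped as "Unexpected error".
            raise
        except Exception as e:
            logging.error(
                f"An unexpected error occurred during Trafilatura processing: {e}",
                exc_info=True,
            )
            raise McpError(
                ErrorData(code=INTERNAL_ERROR, message=f"Unexpected error: {e}")
            ) from e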
  • Pydantic input schema for the fetch_and_extract tool, defining required URL and optional flags for including comments and tables.
    class TrafilaturaInput(BaseModel):
        """Input model for the fetch_and_extract tool."""
    
        url: str = Field(..., description="The URL of the web page to process.")
        include_comments: bool = Field(
            default=False, description="Whether to include comment sections at the bottom of articles."
        )
        include_tables: bool = Field(
            default=False, description="Extract text from HTML <table> elements."
        )
  • Registration of the fetch_and_extract tool in the MCP server's list_tools method, providing name, description, and reference to the input schema.
    Tool(
        name="fetch_and_extract",
        description=(
            "Fetches a URL and extracts the main content, metadata, and comments. "
            "Returns a JSON object with the extracted data."
        ),
        inputSchema=TrafilaturaInput.model_json_schema(),
    )
  • MCP call_tool handler that dispatches to fetch_and_extract: validates tool name, parses input arguments using the schema, invokes the core perform_trafilatura function, and returns the JSON result as TextContent.
    @server.call_tool()
    async def call_tool(name: str, arguments: dict) -> list[TextContent]:
        if name != "fetch_and_extract":
            raise McpError(
                ErrorData(code=INVALID_PARAMS, message=f"Unknown tool: {name}")
            )
        try:
            args = TrafilaturaInput(**arguments)
        except ValueError as e:
            raise McpError(ErrorData(code=INVALID_PARAMS, message=str(e)))
    
        # Perform the extraction
        result_json_string = perform_trafilatura(args)
    
        # Return the result as a JSON string
        return [TextContent(type="text", text=result_json_string)]
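
Because the handler returns a JSON string in one of two shapes, a caller may want to normalize the result before using it. The sketch below handles the minimal fallback object built in `perform_trafilatura` and, as an assumption about Trafilatura's JSON output that is not confirmed by this page, reads the `title` and `text` keys from a successful extraction. `summarize_result` is a hypothetical helper.

```python
import json


def summarize_result(result_json):
    """Return (title, text) from the tool's JSON string, or (None, None) if empty.

    Hypothetical helper; the "title" and "text" key names are assumptions
    about Trafilatura's JSON output and should be checked against real output.
    """
    data = json.loads(result_json)
    # Fallback shape produced when extraction yields nothing:
    # {"main_content": None, "metadata": {}}
    if "main_content" in data and data["main_content"] is None:
        return None, None
    # Successful extraction: use .get() so a missing key yields None
    # instead of raising KeyError.
    return data.get("title"), data.get("text")


if __name__ == "__main__":
    fallback = json.dumps({"main_content": None, "metadata": {}}, indent=4)
    print(summarize_result(fallback))
```

Using `.get()` keeps the caller tolerant of pages where a field such as the title could not be extracted.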

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/fvanevski/trafilatura_mcp'
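
The same request can be made from Python with the standard library alone. This is a minimal sketch: `build_server_request` and `fetch_server_info` are hypothetical helper names, and it assumes the endpoint responds with a JSON body, which this page does not state explicitly.

```python
import json
import urllib.request

API_BASE = "https://glama.ai/api/mcp/v1/servers"


def build_server_request(owner, name):
    """Build a GET request for one server's directory entry (hypothetical helper)."""
    url = f"{API_BASE}/{owner}/{name}"
    return urllib.request.Request(
        url, method="GET", headers={"Accept": "application/json"}
    )


def fetch_server_info(owner, name, timeout=10.0):
    """Send the request and decode the body, assuming it is JSON."""
    request = build_server_request(owner, name)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Performs a real network call when run as a script.
    info = fetch_server_info("fvanevski", "trafilatura_mcp")
    print(json.dumps(info, indent=2))
```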
