Trafilatura MCP Server

by fvanevski

fetch_and_extract

Extract main content, metadata, and optional comments from web pages by providing a URL. Returns structured JSON data for web scraping and content analysis.

Instructions

Fetches a URL and extracts the main content, metadata, and comments. Returns a JSON object with the extracted data.

Input Schema

Name              Required  Description                                                      Default
url               Yes       The URL of the web page to process.                              (none)
include_comments  No        Whether to include comment sections at the bottom of articles.   false
include_tables    No        Extract text from HTML <table> elements.                         false
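
As a quick illustration, a client call might pass arguments such as the following. This is a sketch only: the URL is a placeholder, and the two flags may be omitted entirely, in which case they fall back to false per the Pydantic model shown below.

    # Hypothetical arguments for a fetch_and_extract call; only "url" is required.
    arguments = {
        "url": "https://example.com/article",  # placeholder URL
        "include_comments": True,              # optional, defaults to false
        "include_tables": False,               # optional, defaults to false
    }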

Implementation Reference

  • Core handler function for the fetch_and_extract tool: fetches URL content using trafilatura.fetch_url and extracts JSON-formatted main content and metadata, with options for comments and tables (a standalone usage sketch follows this list).
    def perform_trafilatura(args: TrafilaturaInput) -> str:
        """
        Fetches and extracts content from a URL using Trafilatura.
    
        Args:
            args: A TrafilaturaInput object containing the URL and extraction options.
    
        Returns:
            A JSON string containing the extracted content and metadata.
        """
        logging.info(f"Executing fetch_and_extract for URL: '{args.url}'")
        try:
            # Fetch and extract the content from the given URL
            downloaded = trafilatura.fetch_url(args.url)
            if downloaded is None:
                raise McpError(
                    ErrorData(
                        code=INTERNAL_ERROR,
                        message=f"Failed to download content from URL: {args.url}",
                    )
                )
    
            # Extract the main content and metadata as a JSON string
            json_output = trafilatura.extract(
                downloaded,
                include_comments=args.include_comments,
                include_tables=args.include_tables,
                output_format="json",
                with_metadata=True,
                url=args.url
            )
    
            if json_output is None:
                # If trafilatura returns nothing, build a minimal JSON response
                return json.dumps({"main_content": None, "metadata": {}}, indent=4)
    
            return json_output
    
        except McpError:
            # Re-raise MCP errors (such as the download failure above) with
            # their original code and message intact, rather than re-wrapping
            # them as unexpected errors
            raise
        except Exception as e:
            logging.error(
                f"An unexpected error occurred during Trafilatura processing: {e}",
                exc_info=True,
            )
            raise McpError(
                ErrorData(code=INTERNAL_ERROR, message=f"Unexpected error: {e}")
            ) from e
  • Pydantic input schema for the fetch_and_extract tool, defining required URL and optional flags for including comments and tables.
    class TrafilaturaInput(BaseModel):
        """Input model for the fetch_and_extract tool."""
    
        url: str = Field(..., description="The URL of the web page to process.")
        include_comments: bool = Field(
            default=False, description="Whether to include comment sections at the bottom of articles."
        )
        include_tables: bool = Field(
            default=False, description="Extract text from HTML <table> elements."
        )
  • Registration of the fetch_and_extract tool in the MCP server's list_tools method, providing name, description, and reference to the input schema.
    Tool(
        name="fetch_and_extract",
        description=(
            "Fetches a URL and extracts the main content, metadata, and comments. "
            "Returns a JSON object with the extracted data."
        ),
        inputSchema=TrafilaturaInput.model_json_schema(),
    )
  • MCP call_tool handler that dispatches to fetch_and_extract: validates tool name, parses input arguments using the schema, invokes the core perform_trafilatura function, and returns the JSON result as TextContent.
    @server.call_tool()
    async def call_tool(name: str, arguments: dict) -> list[TextContent]:
        if name != "fetch_and_extract":
            raise McpError(
                ErrorData(code=INVALID_PARAMS, message=f"Unknown tool: {name}")
            )
        try:
            args = TrafilaturaInput(**arguments)
        except ValueError as e:
            raise McpError(ErrorData(code=INVALID_PARAMS, message=str(e)))
    
        # Perform the extraction
        result_json_string = perform_trafilatura(args)
    
        # Wrap the JSON string in a TextContent response
        return [TextContent(type="text", text=result_json_string)]
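
Taken together, these pieces can also be exercised outside the MCP request loop, which is handy for debugging. The sketch below is illustrative only: it assumes the module's own imports (json, logging, trafilatura, and the MCP error types) are in scope, the URL is a placeholder, and the printed keys ("title", "date") reflect trafilatura's usual JSON output rather than guaranteed fields.

    # A minimal standalone sketch (not part of the server): call the handler
    # directly with the Pydantic input model and parse the JSON it returns.
    args = TrafilaturaInput(
        url="https://example.com/article",  # placeholder URL
        include_comments=False,
        include_tables=True,
    )
    result_json = perform_trafilatura(args)  # raises McpError on failure
    data = json.loads(result_json)

    # trafilatura's JSON output typically includes fields such as "title",
    # "author", "date", and the main "text".
    print(data.get("title"), data.get("date"))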
Behavior: 2/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

No annotations are provided, so the description carries the full burden of behavioral disclosure. It mentions that the tool returns a JSON object with extracted data, but does not cover important aspects like error handling, rate limits, authentication needs, or what happens if the URL is inaccessible. For a tool that fetches external content, this is a significant gap.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is concise and front-loaded, consisting of two sentences that directly state the tool's function and output. There is no wasted language, and every sentence earns its place by providing essential information efficiently.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 3/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's complexity (fetching and extracting web content) and the lack of annotations and output schema, the description is somewhat incomplete. It covers the basic purpose and output format but misses behavioral details like error handling or performance considerations. However, it is adequate as a minimum viable description for a tool with no siblings.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 100%: all three parameters (url, include_comments, include_tables) already carry clear descriptions. The tool description adds no meaning beyond the schema, such as examples or edge cases. A baseline of 3 is appropriate when the schema does the heavy lifting.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 4/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose: fetching a URL and extracting main content, metadata, and comments. It names specific verbs ('fetches', 'extracts') and a resource ('URL'); since there are no sibling tools, there is nothing to distinguish it from. The description is not tautological and conveys a clear action.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 2/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides no guidance on when to use this tool versus alternatives. It does not mention any prerequisites, exclusions, or specific contexts for usage. Without sibling tools, there is no explicit comparison, but it lacks any usage context or limitations.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

