
Trafilatura MCP Server

by fvanevski

fetch_and_extract

Extract main content, metadata, and optional comments from web pages by providing a URL. Returns structured JSON data for web scraping and content analysis.

Instructions

Fetches a URL and extracts the main content, metadata, and comments. Returns a JSON object with the extracted data.

Input Schema

Name              Required  Description                                                     Default
url               Yes       The URL of the web page to process.                             (none)
include_comments  No        Whether to include comment sections at the bottom of articles.  false
include_tables    No        Extract text from HTML <table> elements.                        false
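For illustration, a tool call satisfying this schema might carry the following arguments (the URL and flag values here are made up, not taken from the server's documentation):

```python
import json

# Hypothetical arguments for a fetch_and_extract call.
arguments = {
    "url": "https://example.com/article",  # required
    "include_comments": True,              # optional, defaults to false
    "include_tables": False,               # optional, defaults to false
}

# Serialized roughly as it would appear in a tools/call request body.
payload = json.dumps({"name": "fetch_and_extract", "arguments": arguments})
print(payload)
```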

Implementation Reference

  • Core handler function for the fetch_and_extract tool: fetches URL content using trafilatura.fetch_url and extracts JSON-formatted main content and metadata with options for comments and tables.
    def perform_trafilatura(args: TrafilaturaInput) -> str:
        """
        Fetches and extracts content from a URL using Trafilatura.

        Args:
            args: A TrafilaturaInput object containing the URL and extraction options.

        Returns:
            A JSON string containing the extracted content and metadata.
        """
        logging.info(f"Executing fetch_and_extract for URL: '{args.url}'")
        try:
            # Fetch and extract the content from the given URL
            downloaded = trafilatura.fetch_url(args.url)
            if downloaded is None:
                raise McpError(
                    ErrorData(
                        code=INTERNAL_ERROR,
                        message=f"Failed to download content from URL: {args.url}",
                    )
                )
            # Extract the main content and metadata as a JSON string
            json_output = trafilatura.extract(
                downloaded,
                include_comments=args.include_comments,
                include_tables=args.include_tables,
                output_format="json",
                with_metadata=True,
                url=args.url,
            )
            if json_output is None:
                # If trafilatura returns nothing, build a minimal JSON response
                return json.dumps({"main_content": None, "metadata": {}}, indent=4)
            return json_output
        except Exception as e:
            logging.error(
                f"An unexpected error occurred during Trafilatura processing: {e}",
                exc_info=True,
            )
            raise McpError(
                ErrorData(code=INTERNAL_ERROR, message=f"Unexpected error: {e}")
            )
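When trafilatura.extract returns None, the handler falls back to a minimal JSON object rather than raising. A quick standard-library sketch of that fallback shape, so callers know what to check for:

```python
import json

# Mirror of the handler's fallback when extraction yields nothing.
fallback = json.dumps({"main_content": None, "metadata": {}}, indent=4)

# Clients can detect the "nothing extracted" case by testing for a null
# main_content rather than treating every response as a full article.
parsed = json.loads(fallback)
print(parsed["main_content"] is None)
```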
  • Pydantic input schema for the fetch_and_extract tool, defining required URL and optional flags for including comments and tables.
    class TrafilaturaInput(BaseModel):
        """Input model for the fetch_and_extract tool."""

        url: str = Field(..., description="The URL of the web page to process.")
        include_comments: bool = Field(
            default=False,
            description="Whether to include comment sections at the bottom of articles.",
        )
        include_tables: bool = Field(
            default=False,
            description="Extract text from HTML <table> elements.",
        )
  • Registration of the fetch_and_extract tool in the MCP server's list_tools method, providing name, description, and reference to the input schema.
    Tool(
        name="fetch_and_extract",
        description=(
            "Fetches a URL and extracts the main content, metadata, and comments. "
            "Returns a JSON object with the extracted data."
        ),
        inputSchema=TrafilaturaInput.model_json_schema(),
    )
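model_json_schema() serializes the Pydantic model into standard JSON Schema for the client. A hand-written approximation of what that schema contains (field names and defaults come from the model above; Pydantic's real output also includes keys such as "title"):

```python
# Approximation of TrafilaturaInput.model_json_schema() output, for
# illustration only — not the library's verbatim result.
input_schema = {
    "type": "object",
    "properties": {
        "url": {
            "type": "string",
            "description": "The URL of the web page to process.",
        },
        "include_comments": {"type": "boolean", "default": False},
        "include_tables": {"type": "boolean", "default": False},
    },
    # Only url lacks a default, so it is the sole required field.
    "required": ["url"],
}
```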
  • MCP call_tool handler that dispatches to fetch_and_extract: validates tool name, parses input arguments using the schema, invokes the core perform_trafilatura function, and returns the JSON result as TextContent.
    @server.call_tool()
    async def call_tool(name: str, arguments: dict) -> list[TextContent]:
        if name != "fetch_and_extract":
            raise McpError(
                ErrorData(code=INVALID_PARAMS, message=f"Unknown tool: {name}")
            )
        try:
            args = TrafilaturaInput(**arguments)
        except ValueError as e:
            raise McpError(ErrorData(code=INVALID_PARAMS, message=str(e)))
        # Perform the extraction
        result_json_string = perform_trafilatura(args)
        # Return the result as a JSON string
        return [TextContent(type="text", text=result_json_string)]
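The validate-dispatch-wrap flow above can be sketched without the MCP runtime. In this stand-in, ValueError replaces McpError, a plain dict replaces TextContent, and a stub replaces perform_trafilatura — all three are substitutions for illustration, not the server's real types:

```python
import json

def fake_extract(arguments: dict) -> str:
    # Stub for perform_trafilatura: returns a JSON string, as the real one does.
    return json.dumps({"main_content": "stub", "metadata": {"url": arguments["url"]}})

def call_tool(name: str, arguments: dict) -> list[dict]:
    # 1. Reject unknown tool names, as the real handler does.
    if name != "fetch_and_extract":
        raise ValueError(f"Unknown tool: {name}")
    # 2. Validate required arguments (Pydantic handles this in the real server).
    if "url" not in arguments:
        raise ValueError("Missing required field: url")
    # 3. Run the extraction and wrap the JSON string as text content.
    return [{"type": "text", "text": fake_extract(arguments)}]

result = call_tool("fetch_and_extract", {"url": "https://example.com"})
```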

MCP directory API

We provide all the information about MCP servers via our MCP directory API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/fvanevski/trafilatura_mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.