Trafilatura MCP Server

by fvanevski
# Trafilatura MCP Server

This repository contains a Model Context Protocol (MCP) server that provides a tool-based interface to the [Trafilatura](https://trafilatura.readthedocs.io/) library, a powerful tool for web scraping. It is designed for use with MCP-compatible clients, allowing developers and models to extract main content and metadata from web pages programmatically.

## Features

- **Web Scraping:** Utilizes Trafilatura to extract the main text content from a given URL.
- **Metadata Extraction:** Retrieves metadata such as title, author, date, and more.
- **Configurable Extraction:** Options to include or exclude comments and tables from the output.
- **Simple Tool:** Exposes a single, easy-to-use `fetch_and_extract` tool.
- **Asynchronous:** Built with an asynchronous architecture for efficient I/O operations.
- **MCP Standard:** Communicates over standard I/O, making it compatible with various MCP clients.

## Prerequisites

Before running the server, you need to have Python 3.12+ and `uv` installed. You will also need Node.js and `npx` to run the MCP Inspector tool for testing.

## Installation

1. **Clone the repository.**

   ```bash
   git clone <repository-url>
   cd trafilatura_mcp
   ```

2. **Create a virtual environment and install the required dependencies:**

   ```bash
   # Create a virtual environment
   uv venv

   # Activate the virtual environment
   source .venv/bin/activate

   # Install the dependencies
   uv sync
   ```

## Running and Testing the Server

The MCP server is a command-line application that communicates over standard I/O. To use it, a client (like an IDE, a coding agent, or an inspector tool) must launch the server process.

### Running for Diagnostics

You can run the script directly from your terminal to see if it starts without errors. This is a quick way to validate your Python environment and the script's basic syntax.

```bash
python3 trafilatura_mcp.py
```

The server will start and wait for input, but you won't be able to interact with it directly from your terminal.

### Testing with MCP Inspector

The recommended way to test the server interactively is with **MCP Inspector**. It provides an interactive shell for sending requests to your server.

1. **Launch the Inspector:** You can run the inspector without a permanent installation using `npx`. The inspector will launch your MCP server script for you. From your project directory, run:

   ```bash
   npx @modelcontextprotocol/inspector uv run -- python3 trafilatura_mcp.py
   ```

2. **Interact with the Server:** Once the inspector starts, you can connect to the server and use commands like `list_tools` and `call_tool`.

   **Example session:**

   ```bash
   # List the available tool
   > list_tools

   # Call the 'fetch_and_extract' tool with a URL
   > call_tool fetch_and_extract '''{"url": "https://www.theguardian.com/us-news/2025/sep/28/mass-shootings-north-carolina-texas-new-orleans"}'''
   ```

## Configuration

This server does not require any external API keys or configuration files.

## Usage with an MCP Client (VS Code Example)

You can connect to this server from any standard MCP client. Here’s how to do it in a VS Code environment that supports MCP:

1. **Configure Your MCP Client:** In your IDE's MCP client settings (e.g., in `mcp.json` for VS Code), configure a new MCP server that points to the script.

   **Example `mcp.json` entry:**

   ```json
   {
     "servers": {
       "trafilatura_scraper": {
         "command": "uv",
         "args": [
           "run",
           "python3",
           "trafilatura_mcp.py"
         ],
         "cwd": "/path/to/your/project/trafilatura_mcp"
       }
     }
   }
   ```

   *Note: Replace `/path/to/your/project/trafilatura_mcp` with the absolute path to the project directory.*

2. **Use the Tool:** Once connected, you can use the exposed tool in your chat or agent interactions. For example, to extract content from a news article, you could send the following structured tool call:

   ```json
   {
     "tool": "fetch_and_extract",
     "arguments": {
       "url": "https://apnews.com/article/elon-musk-x-twitter-hate-speech-antisemitism-0d35c5a69fd5c6183b729f7f3c87064a",
       "include_comments": false,
       "include_tables": true
     }
   }
   ```

   The server will fetch the URL, extract the main content and metadata, and return it as a JSON object.
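The structured tool call above is a client-level view. On the wire, MCP clients wrap it in a JSON-RPC 2.0 `tools/call` request and, over standard I/O, send it as a single line of JSON. The following is a minimal sketch of how such a message could be built. The helper name `build_tool_call`, the placeholder URL, and the default argument values (copied from the example above, not from the server's source) are assumptions for illustration. A real client must also complete the MCP `initialize` handshake before sending this request.

```python
import json


def build_tool_call(request_id, url, include_comments=False, include_tables=True):
    """Return one newline-terminated JSON-RPC 2.0 `tools/call` message.

    Hypothetical helper: the defaults mirror the README example and are
    not necessarily the server's own defaults.
    """
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "fetch_and_extract",
            "arguments": {
                "url": url,
                "include_comments": include_comments,
                "include_tables": include_tables,
            },
        },
    }
    # The stdio transport carries one JSON message per line, so serialize
    # compactly (no embedded newlines) and end with a single newline.
    return json.dumps(message, separators=(",", ":")) + "\n"


if __name__ == "__main__":
    # Placeholder URL; substitute the page you want to extract.
    print(build_tool_call(1, "https://example.com/article"), end="")
```

In practice an MCP client library or your IDE produces this message for you; the sketch only shows what the `tool` and `arguments` fields map to in the underlying request.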
