# Trafilatura MCP Server
This repository contains a Model Context Protocol (MCP) server that provides a tool-based interface to [Trafilatura](https://trafilatura.readthedocs.io/), a Python library for web scraping and text extraction. It is designed for use with MCP-compatible clients, allowing developers and models to extract the main content and metadata of web pages programmatically.
## Features
- **Web Scraping:** Utilizes Trafilatura to extract the main text content from a given URL.
- **Metadata Extraction:** Retrieves metadata such as title, author, date, and more.
- **Configurable Extraction:** Options to include or exclude comments and tables from the output.
- **Simple Tool:** Exposes a single, easy-to-use `fetch_and_extract` tool.
- **Asynchronous:** Built with an asynchronous architecture for efficient I/O operations.
- **MCP Standard:** Communicates over standard I/O, making it compatible with various MCP clients.
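For orientation, a tool like `fetch_and_extract` could be built from two Trafilatura calls, `fetch_url` and `extract`. The following is a minimal sketch, not the server's actual implementation: the function names, the default option values, and the shape of the error result are assumptions made for illustration.

```python
import json
from typing import Any


def build_extract_options(include_comments: bool = False,
                          include_tables: bool = True) -> dict[str, Any]:
    """Keyword arguments for trafilatura.extract (defaults are assumptions)."""
    return {
        "include_comments": include_comments,
        "include_tables": include_tables,
        "output_format": "json",  # ask Trafilatura for a JSON string
        "with_metadata": True,    # include title, author, date, and so on
    }


def fetch_and_extract_sketch(url: str,
                             include_comments: bool = False,
                             include_tables: bool = True) -> dict[str, Any]:
    """Hypothetical stand-in for the server's tool; performs a network request."""
    import trafilatura  # third-party dependency, imported lazily

    downloaded = trafilatura.fetch_url(url)  # returns None on download failure
    if downloaded is None:
        return {"error": f"Could not download {url}"}

    extracted = trafilatura.extract(
        downloaded,
        url=url,
        **build_extract_options(include_comments, include_tables),
    )
    if extracted is None:  # no main content could be identified
        return {"error": f"No main content found at {url}"}
    return json.loads(extracted)
```

Keeping the option-building separate from the network call makes the configurable part easy to inspect without fetching anything.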
## Prerequisites
Before running the server, you need to have Python 3.12+ and `uv` installed. You will also need Node.js and `npx` to run the MCP Inspector tool for testing.
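If you want to verify these prerequisites from a script, something like the sketch below could work. The helper names are hypothetical; the check only confirms that the executables are on your `PATH` and that the interpreter running the script is new enough.

```python
import shutil
import sys
from typing import Callable, Iterable, Optional


def missing_tools(names: Iterable[str],
                  which: Callable[[str], Optional[str]] = shutil.which) -> list[str]:
    """Return the names of executables that cannot be found on PATH."""
    return [name for name in names if which(name) is None]


def python_is_recent_enough(version: tuple[int, int] = sys.version_info[:2],
                            minimum: tuple[int, int] = (3, 12)) -> bool:
    """Check the interpreter version against the minimum stated in this README."""
    return tuple(version) >= minimum


# Example usage:
#   print(missing_tools(["uv", "node", "npx"]))  # an empty list means all were found
#   print(python_is_recent_enough())
```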
## Installation
1. **Clone the repository:**
```bash
git clone <repository-url>
cd trafilatura_mcp
```
2. **Create a virtual environment and install the required dependencies:**
```bash
# Create a virtual environment
uv venv
# Activate the virtual environment
source .venv/bin/activate
# Install the dependencies
uv sync
```
## Running and Testing the Server
The MCP server is a command-line application that communicates over standard I/O. To use it, a client (like an IDE, a coding agent, or an inspector tool) must launch the server process.
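To make "communicates over standard I/O" concrete: in MCP's stdio transport, the client writes JSON-RPC 2.0 messages to the server's stdin and reads replies from its stdout, one JSON message per line. The sketch below shows only that framing. The helper names are hypothetical, and a real client would normally rely on an MCP SDK rather than hand-rolled code.

```python
import json
from typing import Any


def encode_message(message: dict[str, Any]) -> bytes:
    """Serialize one JSON-RPC message as a single newline-terminated line."""
    # json.dumps escapes newlines inside strings, so the message stays on one line.
    return (json.dumps(message, separators=(",", ":")) + "\n").encode("utf-8")


def decode_message(line: bytes) -> dict[str, Any]:
    """Parse one line read from the server's stdout."""
    return json.loads(line.decode("utf-8"))


def list_tools_request(request_id: int) -> dict[str, Any]:
    """Build a JSON-RPC request asking the server which tools it exposes."""
    return {"jsonrpc": "2.0", "id": request_id, "method": "tools/list"}
```

A real session starts with an `initialize` handshake before any `tools/list` request; that step is omitted here for brevity.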
### Running for Diagnostics
You can run the script directly from your terminal to see if it starts without errors. This is a quick way to validate your Python environment and the script's basic syntax.
```bash
python3 trafilatura_mcp.py
```
The server will start and wait for input, but you won't be able to interact with it directly from your terminal.
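The same check can be automated. The sketch below starts a command with stdin held open and reports whether the process is still running after a short wait, which is roughly what "starts without errors" means for a stdio server. The function name and the one-second default are assumptions, not part of this project.

```python
import subprocess
import time


def starts_cleanly(command: list[str], wait_seconds: float = 1.0) -> bool:
    """Return True if the process is still alive after wait_seconds."""
    proc = subprocess.Popen(
        command,
        stdin=subprocess.PIPE,       # keep stdin open so the server keeps waiting
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        deadline = time.monotonic() + wait_seconds
        while time.monotonic() < deadline:
            if proc.poll() is not None:  # exited early: import error, crash, ...
                return False
            time.sleep(0.05)
        return proc.poll() is None
    finally:
        # Always clean up the child process, whatever the outcome.
        if proc.poll() is None:
            proc.terminate()
        try:
            proc.wait(timeout=5)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()
        proc.stdin.close()


# Example usage, from the project directory with the virtual environment active:
#   print(starts_cleanly(["python3", "trafilatura_mcp.py"]))
```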
### Testing with MCP Inspector
The recommended way to test the server interactively is with **MCP Inspector**. It launches your server and provides an interactive interface for sending requests to it.
1. **Launch the Inspector:**
You can run the inspector without a permanent installation using `npx`. The inspector will launch your MCP server script for you. From your project directory, run:
```bash
npx @modelcontextprotocol/inspector uv run -- python3 trafilatura_mcp.py
```
2. **Interact with the Server:**
Once the inspector starts, you can connect to the server and use commands like `list_tools` and `call_tool`.
**Example session:**
```bash
# List the available tool
> list_tools
# Call the 'fetch_and_extract' tool with a URL
> call_tool fetch_and_extract '''{"url": "https://www.theguardian.com/us-news/2025/sep/28/mass-shootings-north-carolina-texas-new-orleans"}'''
```
## Configuration
This server does not require any external API keys or configuration files.
## Usage with an MCP Client (VS Code Example)
You can connect to this server from any standard MCP client. Here’s how to do it in a VS Code environment that supports MCP:
1. **Configure Your MCP Client:**
In your IDE's MCP client settings (e.g., in `mcp.json` for VS Code), configure a new MCP server that points to the script.
**Example `mcp.json` entry:**
```json
{
  "servers": {
    "trafilatura_scraper": {
      "command": "uv",
      "args": [
        "run",
        "python3",
        "trafilatura_mcp.py"
      ],
      "cwd": "/path/to/your/project/trafilatura_mcp"
    }
  }
}
```
*Note: Replace `/path/to/your/project/trafilatura_mcp` with the absolute path to the project directory.*
2. **Use the Tool:**
Once connected, you can use the exposed tool in your chat or agent interactions. For example, to extract content from a news article, you could send the following structured tool call:
```json
{
  "tool": "fetch_and_extract",
  "arguments": {
    "url": "https://apnews.com/article/elon-musk-x-twitter-hate-speech-antisemitism-0d35c5a69fd5c6183b729f7f3c87064a",
    "include_comments": false,
    "include_tables": true
  }
}
```
The server will fetch the URL, extract the main content and metadata, and return it as a JSON object.
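At the protocol level, an MCP client sends a tool call like the one above as a JSON-RPC `tools/call` request and receives a result containing a list of content items. The sketch below builds such a request and unpacks a reply. It assumes the server returns its JSON payload as text content, which this README does not confirm; the helper names and default argument values are hypothetical.

```python
import json
from typing import Any


def tool_call_request(request_id: int, url: str,
                      include_comments: bool = False,
                      include_tables: bool = True) -> dict[str, Any]:
    """Build a JSON-RPC 'tools/call' request for fetch_and_extract."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "fetch_and_extract",
            "arguments": {
                "url": url,
                "include_comments": include_comments,
                "include_tables": include_tables,
            },
        },
    }


def parse_tool_result(response: dict[str, Any]) -> dict[str, Any]:
    """Extract the JSON payload from a tool result (assumes text content)."""
    result = response["result"]
    if result.get("isError"):
        raise RuntimeError("tool call reported an error")
    text_parts = [item["text"] for item in result["content"]
                  if item.get("type") == "text"]
    return json.loads("".join(text_parts))
```

Most MCP clients and SDKs handle this wrapping for you, so the sketch is mainly useful for debugging or for reading raw protocol logs.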