# 📂 MCP Data Fetch Server
**MCP Data Fetch Server** is a secure, sandboxed server that fetches web content and extracts data via the **Model Context Protocol (MCP)**, without executing JavaScript.
---
## Table of Contents
- [Features](#features)
- [Installation & Quick Start](#installation--quick-start)
- [Command‑Line Options](#commandline-options)
- [Integration with LM Studio](#integration-with-lm-studio)
- [MCP API Overview](#mcp-api-overview)
- [`initialize`](#initialize)
- [`tools/list`](#toolslist)
- [`tools/call`](#toolscall)
- [Available Tools](#available-tools)
- [`fetch_webpage`](#fetch_webpage)
- [`extract_links`](#extract_links)
- [`download_file`](#download_file)
- [`get_page_metadata`](#get_page_metadata)
- [`check_url`](#check_url)
- [Security Features](#security-features)
---
## 🎯 Features
- **Secure web page fetching** – strips scripts, iframes and cookie banners; no JavaScript execution.
- **Rich data extraction** – retrieve links, metadata, Open Graph/Twitter cards, and downloadable resources.
- **Safe file downloads** – size limits, filename sanitisation, and path‑traversal protection within a sandboxed cache.
- **Built‑in caching** – optional cache directory reduces repeated network calls.
- **Prompt‑injection detection** – validates URLs and fetched content for malicious instructions.
---
## 📦 Installation & Quick Start
```bash
# Clone the repository (or copy the MCPDataFetchServer.1 folder)
git clone https://github.com/undici77/MCPDataFetchServer.git
cd MCPDataFetchServer
# Make the startup script executable
chmod +x run.sh
# Run the server, pointing to a sandboxed working directory
./run.sh -d /path/to/working/directory
```
> 📌 **Three‑step overview**
> 1️⃣ The script creates a virtual environment and installs dependencies.
> 2️⃣ It prepares a cache folder (`.fetch_cache`) inside the project root.
> 3️⃣ `main.py` launches the MCP server, listening on *stdin/stdout* for JSON‑RPC requests.
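
For a quick smoke test before wiring the server into a client, you can drive it from a short Python script. The sketch below assumes the server exchanges newline-delimited JSON‑RPC messages over stdin/stdout (matching the request examples later in this README); adjust the paths to your checkout.
```python
# Minimal smoke test: spawn the server and send an `initialize` request.
import json
import subprocess

proc = subprocess.Popen(
    ["./run.sh", "-d", "/path/to/working/directory"],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    text=True,
)

request = {"jsonrpc": "2.0", "id": 1, "method": "initialize", "params": {}}
proc.stdin.write(json.dumps(request) + "\n")
proc.stdin.flush()

# Read one response line and pretty-print it.
print(json.dumps(json.loads(proc.stdout.readline()), indent=2))

proc.terminate()
```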
---
## ⚙️ Command‑Line Options
| Option | Description |
|--------|-------------|
| `-d`, `--working-dir` | Path to the **sandboxed working directory** where all file operations are confined (default: `~/.mcp_datafetch`). |
| `-c`, `--cache-dir` | Name of the **cache subdirectory** relative to the working directory (default: `cache`). |
| `-h`, `--help` | Show help message and exit. |
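
If it helps to see the defaults spelled out, the options map onto a conventional `argparse` setup. This is only a hypothetical sketch equivalent to the table above, not the actual parser in `main.py`:
```python
# Hypothetical argparse equivalent of the options table above (illustrative only).
import argparse
from pathlib import Path

parser = argparse.ArgumentParser(description="MCP Data Fetch Server")
parser.add_argument(
    "-d", "--working-dir",
    default=str(Path.home() / ".mcp_datafetch"),
    help="Sandboxed working directory for all file operations",
)
parser.add_argument(
    "-c", "--cache-dir",
    default="cache",
    help="Cache subdirectory, relative to the working directory",
)
args = parser.parse_args()
```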
---
## 🤝 Integration with LM Studio *(or any MCP‑compatible client)*
Add an entry to your `mcp.json` configuration so that LM Studio can launch the server automatically.
```json
{
  "mcpServers": {
    "datafetch": {
      "command": "/absolute/path/to/MCPDataFetchServer.1/run.sh",
      "args": [
        "-d",
        "/absolute/path/to/working/directory"
      ],
      "env": { "WORKING_DIR": "." }
    }
  }
}
```
> 📌 **Tip:** Ensure `run.sh` is executable (`chmod +x …`) and that the virtual environment can install the required Python packages on first launch.
---
## 📡 MCP API Overview
All communication follows **JSON‑RPC 2.0** over *stdin/stdout*.
### `initialize`
Request:
```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "initialize",
  "params": {}
}
```
The response contains the protocol version, server capabilities, and basic server metadata (e.g., name = `mcp-datafetch-server`, version = `2.1.0`).
### `tools/list`
Request:
```json
{
  "jsonrpc": "2.0",
  "id": 2,
  "method": "tools/list",
  "params": {}
}
```
Response: `{ "tools": [ …tool definitions… ] }`. Each definition includes `name`, `description` and an **input schema** (JSON Schema).
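To make the shape concrete, here is a pared-down, hypothetical definition for `fetch_webpage` written as a Python dict; field names follow common MCP conventions, and the server's actual schema may differ:
```python
# Hypothetical, pared-down entry from the tools/list response.
fetch_webpage_tool = {
    "name": "fetch_webpage",
    "description": "Securely fetch a web page and return clean content",
    "inputSchema": {
        "type": "object",
        "properties": {
            "url": {"type": "string", "description": "URL to fetch (http/https only)"},
            "format": {"type": "string", "enum": ["markdown", "text", "html"]},
            "include_links": {"type": "boolean"},
        },
        "required": ["url"],
    },
}
```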
### `tools/call`
Generic request shape (replace `<tool_name>` and arguments as needed):
```json
{
  "jsonrpc": "2.0",
  "id": 3,
  "method": "tools/call",
  "params": {
    "name": "<tool_name>",
    "arguments": { … }
  }
}
```
The server validates the request against the tool’s schema, executes the operation, and returns a `ToolResult` containing one or more **content blocks**.
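A rough sketch of what that validation step amounts to, using the third-party `jsonschema` package (illustrative only; the server may validate requests differently):
```python
# Validate tools/call arguments against a tool's input schema (illustrative).
from jsonschema import ValidationError, validate

INPUT_SCHEMA = {
    "type": "object",
    "properties": {"url": {"type": "string"}},
    "required": ["url"],
}

def validate_arguments(arguments: dict, schema: dict) -> str | None:
    """Return an error message if the arguments violate the schema, else None."""
    try:
        validate(instance=arguments, schema=schema)
        return None
    except ValidationError as exc:
        return exc.message

# A call missing the required "url" is rejected before the tool runs.
print(validate_arguments({"format": "markdown"}, INPUT_SCHEMA))
```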
---
## 🛠️ Available Tools
### `fetch_webpage`
- **Securely fetches a web page and returns clean content in the requested format.**
| Name | Type | Required | Description |
|------|------|----------|-------------|
| `url` | string | ✅ (no default) | URL to fetch (**http/https only**). |
| `format` | string | ❌ (`markdown`) | Output format – one of `markdown`, `text`, or `html`. |
| `include_links` | boolean | ❌ (`true`) | Whether to append an extracted links list. |
| `include_images` | boolean | ❌ (`false`) | Whether to list image URLs in the output. |
| `remove_banners` | boolean | ❌ (`true`) | Attempt to strip cookie banners & pop‑ups. |
**Example**
```json
{
  "jsonrpc": "2.0",
  "id": 10,
  "method": "tools/call",
  "params": {
    "name": "fetch_webpage",
    "arguments": {
      "url": "https://example.com/article",
      "format": "markdown",
      "include_links": true,
      "include_images": false,
      "remove_banners": true
    }
  }
}
```
*Note:* The tool sanitises HTML, removes scripts/iframes, and checks for prompt‑injection patterns before returning content.
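
The snippet below sketches that kind of sanitisation with BeautifulSoup; it is illustrative only and not the server's actual pipeline:
```python
# Illustrative HTML sanitisation: drop scripts/iframes and inline event handlers.
from bs4 import BeautifulSoup

def sanitize_html(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Remove elements that can execute or embed active content.
    for tag in soup(["script", "iframe", "object", "embed"]):
        tag.decompose()
    # Strip inline event handlers such as onclick/onload.
    for tag in soup.find_all(True):
        for attr in [a for a in tag.attrs if a.lower().startswith("on")]:
            del tag[attr]
    return str(soup)

print(sanitize_html('<p onclick="x()">Hello</p><script>alert(1)</script>'))
```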
---
### `extract_links`
- **Extracts and categorises all hyperlinks from a page.**
| Name | Type | Required | Description |
|------|------|----------|-------------|
| `url` | string | ✅ (no default) | URL of the page to analyse. |
| `filter` | string | ❌ (`all`) | Return only `all`, `internal`, `external`, or `resources`. |
**Example**
```json
{
  "jsonrpc": "2.0",
  "id": 11,
  "method": "tools/call",
  "params": {
    "name": "extract_links",
    "arguments": {
      "url": "https://example.com/blog",
      "filter": "internal"
    }
  }
}
```
*Note:* Links are classified as **internal** (same domain) or **external**; resource links (images, PDFs…) can be filtered with `resources`.
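
Conceptually, the internal/external split is a domain comparison after resolving relative links; a rough sketch (not the server's code):
```python
# Rough sketch of internal/external link classification by domain.
from urllib.parse import urljoin, urlparse

def classify_link(page_url: str, href: str) -> str:
    absolute = urljoin(page_url, href)  # resolve relative links against the page
    page_host = urlparse(page_url).netloc.lower()
    link_host = urlparse(absolute).netloc.lower()
    return "internal" if link_host == page_host else "external"

print(classify_link("https://example.com/blog", "/about"))                         # internal
print(classify_link("https://example.com/blog", "https://cdn.example.net/a.pdf"))  # external
```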
---
### `download_file`
- **Safely downloads a remote file into the sandboxed cache directory.**
| Name | Type | Required | Description |
|------|------|----------|-------------|
| `url` | string | ✅ (no default) | Direct URL to the file. |
| `filename` | string | ❌ (auto‑generated) | Desired filename; will be sanitised and forced into the cache directory. |
**Example**
```json
{
  "jsonrpc": "2.0",
  "id": 12,
  "method": "tools/call",
  "params": {
    "name": "download_file",
    "arguments": {
      "url": "https://example.com/files/report.pdf",
      "filename": "report_latest.pdf"
    }
  }
}
```
*Note:* The server enforces a **100 MB** download limit, validates the URL against blocked domains/extensions, and returns the relative path inside the working directory for cross‑agent access.
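
The path-traversal protection boils down to sanitising the filename and confirming the resolved target stays inside the cache directory. A hedged sketch of that idea (not the server's code; the size-limit and URL checks are omitted):
```python
# Sketch of filename sanitisation and path confinement inside a cache directory.
import re
from pathlib import Path

def safe_target(cache_dir: str, filename: str) -> Path:
    # Drop any directory components and keep a conservative character set.
    clean = re.sub(r"[^A-Za-z0-9._-]", "_", Path(filename).name)
    cache = Path(cache_dir).resolve()
    target = (cache / clean).resolve()
    # Refuse anything that would escape the cache directory.
    if cache not in target.parents:
        raise ValueError("path escapes the cache directory")
    return target

print(safe_target("/tmp/cache", "../../etc/passwd"))  # directory parts stripped -> /tmp/cache/passwd
```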
---
### `get_page_metadata`
- **Extracts structured metadata (title, description, Open Graph, Twitter Cards) from a web page.**
| Name | Type | Required | Description |
|------|------|----------|-------------|
| `url` | string | ✅ (no default) | URL of the page to inspect. |
**Example**
```json
{
  "jsonrpc": "2.0",
  "id": 13,
  "method": "tools/call",
  "params": {
    "name": "get_page_metadata",
    "arguments": { "url": "https://example.com/product/42" }
  }
}
```
*Note:* The tool returns a formatted text block with title, description, keywords, Open Graph properties and Twitter Card fields.
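
A rough sketch of that kind of extraction with BeautifulSoup (illustrative, not the server's implementation):
```python
# Illustrative metadata extraction: title, description, Open Graph and Twitter tags.
from bs4 import BeautifulSoup

def extract_metadata(html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    meta = {"title": soup.title.string if soup.title else None}
    for tag in soup.find_all("meta"):
        key = tag.get("property") or tag.get("name")
        if key and (key in ("description", "keywords")
                    or key.startswith(("og:", "twitter:"))):
            meta[key] = tag.get("content")
    return meta

print(extract_metadata('<title>Demo</title><meta property="og:title" content="Demo page">'))
```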
---
### `check_url`
- **Performs a lightweight HEAD request to report status code, headers and size without downloading the body.**
| Name | Type | Required | Description |
|------|------|----------|-------------|
| `url` | string | ✅ (no default) | URL to probe. |
**Example**
```json
{
  "jsonrpc": "2.0",
  "id": 14,
  "method": "tools/call",
  "params": {
    "name": "check_url",
    "arguments": { "url": "https://example.com/resource.zip" }
  }
}
```
*Note:* The response includes the final URL after redirects, a concise status summary (✅ OK or ⚠️ Error), and selected HTTP headers such as `Content-Type` and `Content-Length`.
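
Functionally this is close to a `requests.head` call with redirects enabled; a minimal sketch (not the server's code):
```python
# Minimal sketch of a HEAD probe with redirect following (requests library).
import requests

def probe(url: str) -> dict:
    resp = requests.head(url, allow_redirects=True, timeout=10)
    return {
        "final_url": resp.url,  # URL after any redirects
        "status": resp.status_code,
        "ok": resp.ok,
        "content_type": resp.headers.get("Content-Type"),
        "content_length": resp.headers.get("Content-Length"),
    }

print(probe("https://example.com/resource.zip"))
```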
---
## 🔐 Security Features
- **Path‑traversal protection** – all file operations are confined to the sandboxed *working directory*.
- **Prompt‑injection detection** in URLs, fetched HTML and generated content.
- **Blocked domains & extensions** (localhost, private IP ranges, executable/script files); a sketch of the address check appears after this list.
- **Content‑size limits** – max 50 MB for page fetches, max 100 MB for file downloads.
- **HTML sanitisation** – removes `<script>`, `<iframe>`, event handlers and other risky elements before processing.
- **Cookie/banner handling** – optional removal of consent banners and pop‑ups during fetch.
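
As an illustration of the blocked-host rule mentioned above, the sketch below rejects localhost, loopback, link-local and private addresses after DNS resolution; it is a simplified example, not the server's actual blocklist:
```python
# Sketch of a localhost / private-address guard (illustrative only).
import ipaddress
import socket
from urllib.parse import urlparse

def is_blocked_host(url: str) -> bool:
    host = urlparse(url).hostname or ""
    if host.lower() in ("", "localhost"):
        return True
    try:
        addr = ipaddress.ip_address(socket.gethostbyname(host))
    except (socket.gaierror, ValueError):
        return True  # treat unresolvable hosts as blocked
    return addr.is_private or addr.is_loopback or addr.is_link_local

print(is_blocked_host("http://127.0.0.1:8080/"))  # True
print(is_blocked_host("https://example.com/"))    # False (resolves to a public address)
```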
---
*© 2025 Undici77 – All rights reserved.*