📂 MCP Data Fetch Server

by undici77

MCP Data Fetch Server is a secure, sandboxed server that fetches web content and extracts data via the Model Context Protocol (MCP), without executing JavaScript.


Table of Contents

  • 🎯 Features

  • 📦 Installation & Quick Start

  • ⚙️ Command‑Line Options

  • 🤝 Integration with LM Studio (or any MCP‑compatible client)

  • 📡 MCP API Overview

  • 🛠️ Available Tools

  • 🔐 Security Features


🎯 Features

  • Secure web page fetching – strips scripts, iframes and cookie banners; no JavaScript execution.

  • Rich data extraction – retrieve links, metadata, Open Graph/Twitter cards, and downloadable resources.

  • Safe file downloads – size limits, filename sanitisation, and path‑traversal protection within a sandboxed cache.

  • Built‑in caching – optional cache directory reduces repeated network calls.

  • Prompt‑injection detection – validates URLs and fetched content for malicious instructions.


📦 Installation & Quick Start

# Clone the repository (or copy the MCPDataFetchServer.1 folder)
git clone https://github.com/undici77/MCPDataFetchServer.git
cd MCPDataFetchServer

# Make the startup script executable
chmod +x run.sh

# Run the server, pointing to a sandboxed working directory
./run.sh -d /path/to/working/directory

📌 Three‑step overview
1️⃣ The script creates a virtual environment and installs dependencies.
2️⃣ It prepares a cache folder (.fetch_cache) inside the project root.
3️⃣ main.py launches the MCP server, listening on stdin/stdout for JSON‑RPC requests.


⚙️ Command‑Line Options

  • -d, --working-dir – Path to the sandboxed working directory where all file operations are confined (default: ~/.mcp_datafetch).

  • -c, --cache-dir – Name of the cache subdirectory relative to the working directory (default: cache).

  • -h, --help – Show help message and exit.


🤝 Integration with LM Studio (or any MCP‑compatible client)

Add an entry to your mcp.json configuration so that LM Studio can launch the server automatically.

{ "mcpServers": { "datafetch": { "command": "/absolute/path/to/MCPDataFetchServer.1/run.sh", "args": [ "-d", "/absolute/path/to/working/directory" ], "env": { "WORKING_DIR": "." } } } }

📌 Tip: Ensure run.sh is executable (chmod +x …) and that the virtual environment can install the required Python packages on first launch.


📡 MCP API Overview

All communication follows JSON‑RPC 2.0 over stdin/stdout.

initialize

Request:

{ "jsonrpc": "2.0", "id": 1, "method": "initialize", "params": {} }

The response contains the protocol version, server capabilities and basic metadata (e.g., name = mcp-datafetch-server, version = 2.1.0).
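
A minimal Python sketch of this handshake over stdio (the subprocess launch and newline‑delimited message framing are assumptions for illustration; an MCP client library normally handles the transport for you):

import json
import subprocess

# Launch the server; paths mirror the Quick Start section above.
proc = subprocess.Popen(
    ["./run.sh", "-d", "/path/to/working/directory"],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    text=True,
)

def rpc(request):
    """Send one JSON-RPC request and return the parsed response."""
    proc.stdin.write(json.dumps(request) + "\n")
    proc.stdin.flush()
    return json.loads(proc.stdout.readline())

init = rpc({"jsonrpc": "2.0", "id": 1, "method": "initialize", "params": {}})
print(init["result"])  # protocol version, capabilities and server metadata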

tools/list

Request:

{ "jsonrpc": "2.0", "id": 2, "method": "tools/list", "params": {} }

Response: { "tools": [ …tool definitions… ] }. Each definition includes name, description and an input schema (JSON Schema).

tools/call

Generic request shape (replace <tool_name> and arguments as needed):

{ "jsonrpc": "2.0", "id": 3, "method": "tools/call", "params": { "name": "<tool_name>", "arguments": { … } } }

The server validates the request against the tool’s schema, executes the operation, and returns a ToolResult containing one or more content blocks.
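
A small wrapper around this request shape, again building on the rpc() helper sketched above (the type/text layout of the ToolResult content blocks is assumed to follow the standard MCP format):

def call_tool(name, arguments, request_id):
    """Invoke a tool and print any text content blocks from the ToolResult."""
    response = rpc({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    })
    for block in response["result"]["content"]:
        if block.get("type") == "text":
            print(block["text"])

call_tool("fetch_webpage", {"url": "https://example.com/article"}, 3)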


🛠️ Available Tools

fetch_webpage

  • Securely fetches a web page and returns clean content in the requested format.

  • url (string, required) – URL to fetch (http/https only).

  • format (string, optional, default: markdown) – Output format – one of markdown, text, or html.

  • include_links (boolean, optional, default: true) – Whether to append an extracted links list.

  • include_images (boolean, optional, default: false) – Whether to list image URLs in the output.

  • remove_banners (boolean, optional, default: true) – Attempt to strip cookie banners & pop‑ups.

Example

{ "jsonrpc": "2.0", "id": 10, "method": "tools/call", "params": { "name": "fetch_webpage", "arguments": { "url": "https://example.com/article", "format": "markdown", "include_links": true, "include_images": false, "remove_banners": true } } }

Note: The tool sanitises HTML, removes scripts/iframes, and checks for prompt‑injection patterns before returning content.
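
The kind of sanitisation described can be illustrated with BeautifulSoup (an assumption; the server's actual implementation may differ):

from bs4 import BeautifulSoup

def sanitise(html):
    soup = BeautifulSoup(html, "html.parser")
    # Drop script and iframe elements entirely.
    for tag in soup(["script", "iframe"]):
        tag.decompose()
    # Strip inline event handlers such as onclick/onload.
    for tag in soup.find_all(True):
        for attr in [a for a in tag.attrs if a.lower().startswith("on")]:
            del tag[attr]
    return str(soup)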


extract_links

  • Extracts and categorises all hyperlinks from a page.

  • url (string, required) – URL of the page to analyse.

  • filter (string, optional, default: all) – Return only all, internal, external, or resources.

Example

{ "jsonrpc": "2.0", "id": 11, "method": "tools/call", "params": { "name": "extract_links", "arguments": { "url": "https://example.com/blog", "filter": "internal" } } }

Note: Links are classified as internal (same domain) or external; resource links (images, PDFs…) can be filtered with resources.
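
A rough sketch of this classification using only the standard library (the set of "resource" extensions is an assumption):

from urllib.parse import urljoin, urlparse

RESOURCE_EXTENSIONS = (".pdf", ".zip", ".png", ".jpg", ".jpeg", ".gif")

def classify_link(page_url, href):
    absolute = urljoin(page_url, href)
    if urlparse(absolute).path.lower().endswith(RESOURCE_EXTENSIONS):
        return "resources"
    same_domain = urlparse(absolute).netloc == urlparse(page_url).netloc
    return "internal" if same_domain else "external"

print(classify_link("https://example.com/blog", "/about"))         # internal
print(classify_link("https://example.com/blog", "https://x.org"))  # external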


download_file

  • Safely downloads a remote file into the sandboxed cache directory.

Name

Type

Required

Description

url

string

✅ (no default)

Direct URL to the file.

filename

string

❌ (auto‑generated)

Desired filename; will be sanitised and forced into the cache directory.

Example

{ "jsonrpc": "2.0", "id": 12, "method": "tools/call", "params": { "name": "download_file", "arguments": { "url": "https://example.com/files/report.pdf", "filename": "report_latest.pdf" } } }

Note: The server enforces a 100 MB download limit, validates the URL against blocked domains/extensions, and returns the relative path inside the working directory for cross‑agent access.
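
The path‑traversal guard can be sketched as follows (directory layout and helper names are illustrative only):

from pathlib import Path

def safe_target(cache_dir, filename):
    cache = Path(cache_dir).resolve()
    # Keep only the final path component, then confirm the resolved
    # destination still lives inside the cache directory.
    candidate = (cache / Path(filename).name).resolve()
    if candidate != cache and cache not in candidate.parents:
        raise ValueError("path escapes the sandboxed cache directory")
    return candidate

print(safe_target("/tmp/cache", "../../etc/passwd"))  # resolves to /tmp/cache/passwd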


get_page_metadata

  • Extracts structured metadata (title, description, Open Graph, Twitter Cards) from a web page.

  • url (string, required) – URL of the page to inspect.

Example

{ "jsonrpc": "2.0", "id": 13, "method": "tools/call", "params": { "name": "get_page_metadata", "arguments": { "url": "https://example.com/product/42" } } }

Note: The tool returns a formatted text block with title, description, keywords, Open Graph properties and Twitter Card fields.
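
The same fields can be pulled out of raw HTML with BeautifulSoup, as a rough illustration (the server may implement this differently):

from bs4 import BeautifulSoup

def page_metadata(html):
    soup = BeautifulSoup(html, "html.parser")
    meta = {"title": soup.title.string if soup.title else None}
    for tag in soup.find_all("meta"):
        key = tag.get("property") or tag.get("name")
        if key and (key.startswith(("og:", "twitter:")) or key in ("description", "keywords")):
            meta[key] = tag.get("content")
    return meta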


check_url

  • Performs a lightweight HEAD request to report status code, headers and size without downloading the body.

  • url (string, required) – URL to probe.

Example

{ "jsonrpc": "2.0", "id": 14, "method": "tools/call", "params": { "name": "check_url", "arguments": { "url": "https://example.com/resource.zip" } } }

Note: The response includes the final URL after redirects, a concise status summary (✅ OK or ⚠️ Error), and selected HTTP headers such as Content‑Type and Content‑Length.
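
An equivalent probe using the requests library, as a sketch (the server may use a different HTTP client):

import requests

def check_url(url):
    response = requests.head(url, allow_redirects=True, timeout=10)
    return {
        "final_url": response.url,  # after redirects
        "status": "✅ OK" if response.ok else "⚠️ Error",
        "content_type": response.headers.get("Content-Type"),
        "content_length": response.headers.get("Content-Length"),
    }

print(check_url("https://example.com/resource.zip"))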


🔐 Security Features

  • Path‑traversal protection – all file operations are confined to the sandboxed working directory.

  • Prompt‑injection detection in URLs, fetched HTML and generated content.

  • Blocked domains & extensions (localhost, private IP ranges, executable/script files); see the sketch after this list.

  • Content‑size limits – max 50 MB for page fetches, max 100 MB for file downloads.

  • HTML sanitisation – removes <script>, <iframe>, event handlers and other risky elements before processing.

  • Cookie/banner handling – optional removal of consent banners and pop‑ups during fetch.
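
The blocked‑domain rule mentioned in the list above can be sketched with the standard library (a simplified illustration; the server's real checks may be broader):

import ipaddress
import socket
from urllib.parse import urlparse

def is_blocked(url):
    host = urlparse(url).hostname or ""
    if host in ("localhost", ""):
        return True
    try:
        address = ipaddress.ip_address(socket.gethostbyname(host))
    except (socket.gaierror, ValueError):
        return True  # unresolvable hosts are rejected
    return address.is_private or address.is_loopback or address.is_link_local

print(is_blocked("http://127.0.0.1:8080/"))  # True
print(is_blocked("https://example.com/"))    # typically False (public address)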


© 2025 Undici77 – All rights reserved.

