📂 MCP Data Fetch Server

by undici77

MCP Data Fetch Server is a secure, sandboxed server that fetches web content and extracts data via the Model Context Protocol (MCP), without executing JavaScript.


Table of Contents

  • 🎯 Features

  • 📦 Installation & Quick Start

  • ⚙️ Command‑Line Options

  • 🤝 Integration with LM Studio (or any MCP‑compatible client)

  • 📡 MCP API Overview

  • 🛠️ Available Tools

  • 🔐 Security Features


🎯 Features

  • Secure web page fetching – strips scripts, iframes and cookie banners; no JavaScript execution.

  • Rich data extraction – retrieve links, metadata, Open Graph/Twitter cards, and downloadable resources.

  • Safe file downloads – size limits, filename sanitisation, and path‑traversal protection within a sandboxed cache.

  • Built‑in caching – optional cache directory reduces repeated network calls.

  • Prompt‑injection detection – validates URLs and fetched content for malicious instructions.


📦 Installation & Quick Start

# Clone the repository (or copy the MCPDataFetchServer.1 folder)
git clone https://github.com/undici77/MCPDataFetchServer.git
cd MCPDataFetchServer

# Make the startup script executable
chmod +x run.sh

# Run the server, pointing to a sandboxed working directory
./run.sh -d /path/to/working/directory

📌 Three‑step overview
1️⃣ The script creates a virtual environment and installs dependencies.
2️⃣ It prepares a cache folder (.fetch_cache) inside the project root.
3️⃣ main.py launches the MCP server, listening on stdin/stdout for JSON‑RPC requests.


⚙️ Command‑Line Options

  • -d, --working-dir – Path to the sandboxed working directory where all file operations are confined (default: ~/.mcp_datafetch).

  • -c, --cache-dir – Name of the cache subdirectory relative to the working directory (default: cache).

  • -h, --help – Show help message and exit.


🤝 Integration with LM Studio (or any MCP‑compatible client)

Add an entry to your mcp.json configuration so that LM Studio can launch the server automatically.

{ "mcpServers": { "datafetch": { "command": "/absolute/path/to/MCPDataFetchServer.1/run.sh", "args": [ "-d", "/absolute/path/to/working/directory" ], "env": { "WORKING_DIR": "." } } } }

📌 Tip: Ensure run.sh is executable (chmod +x …) and that the virtual environment can install the required Python packages on first launch.


📡 MCP API Overview

All communication follows JSON‑RPC 2.0 over stdin/stdout.

initialize

Request:

{ "jsonrpc": "2.0", "id": 1, "method": "initialize", "params": {} }

The response contains the protocol version, server capabilities and basic metadata (e.g., name = mcp-datafetch-server, version = 2.1.0).
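
A minimal Python sketch of this handshake over stdio (the subprocess launch and newline‑delimited message framing are assumptions for illustration; an MCP client library normally handles the transport for you):

import json
import subprocess

# Launch the server; paths mirror the Quick Start section above.
proc = subprocess.Popen(
    ["./run.sh", "-d", "/path/to/working/directory"],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    text=True,
)

def rpc(request):
    """Send one JSON-RPC request and return the parsed response."""
    proc.stdin.write(json.dumps(request) + "\n")
    proc.stdin.flush()
    return json.loads(proc.stdout.readline())

init = rpc({"jsonrpc": "2.0", "id": 1, "method": "initialize", "params": {}})
print(init["result"])  # protocol version, capabilities and server metadata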

tools/list

Request:

{ "jsonrpc": "2.0", "id": 2, "method": "tools/list", "params": {} }

Response: { "tools": [ …tool definitions… ] }. Each definition includes name, description and an input schema (JSON Schema).

tools/call

Generic request shape (replace <tool_name> and arguments as needed):

{ "jsonrpc": "2.0", "id": 3, "method": "tools/call", "params": { "name": "<tool_name>", "arguments": { … } } }

The server validates the request against the tool’s schema, executes the operation, and returns a ToolResult containing one or more content blocks.
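
A small wrapper around this request shape, again building on the rpc() helper sketched above (the type/text layout of the ToolResult content blocks is assumed to follow the standard MCP format):

def call_tool(name, arguments, request_id):
    """Invoke a tool and print any text content blocks from the ToolResult."""
    response = rpc({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    })
    for block in response["result"]["content"]:
        if block.get("type") == "text":
            print(block["text"])

call_tool("fetch_webpage", {"url": "https://example.com/article"}, 3)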


🛠️ Available Tools

fetch_webpage

  • Securely fetches a web page and returns clean content in the requested format.

  • url (string, required) – URL to fetch (http/https only).

  • format (string, optional, default: markdown) – Output format – one of markdown, text, or html.

  • include_links (boolean, optional, default: true) – Whether to append an extracted links list.

  • include_images (boolean, optional, default: false) – Whether to list image URLs in the output.

  • remove_banners (boolean, optional, default: true) – Attempt to strip cookie banners & pop‑ups.

Example

{ "jsonrpc": "2.0", "id": 10, "method": "tools/call", "params": { "name": "fetch_webpage", "arguments": { "url": "https://example.com/article", "format": "markdown", "include_links": true, "include_images": false, "remove_banners": true } } }

Note: The tool sanitises HTML, removes scripts/iframes, and checks for prompt‑injection patterns before returning content.
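
The kind of sanitisation described can be illustrated with BeautifulSoup (an assumption; the server's actual implementation may differ):

from bs4 import BeautifulSoup

def sanitise(html):
    soup = BeautifulSoup(html, "html.parser")
    # Drop script and iframe elements entirely.
    for tag in soup(["script", "iframe"]):
        tag.decompose()
    # Strip inline event handlers such as onclick/onload.
    for tag in soup.find_all(True):
        for attr in [a for a in tag.attrs if a.lower().startswith("on")]:
            del tag[attr]
    return str(soup)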


extract_links

  • Extracts and categorises all hyperlinks from a page.

  • url (string, required) – URL of the page to analyse.

  • filter (string, optional, default: all) – Return only all, internal, external, or resources.

Example

{ "jsonrpc": "2.0", "id": 11, "method": "tools/call", "params": { "name": "extract_links", "arguments": { "url": "https://example.com/blog", "filter": "internal" } } }

Note: Links are classified as internal (same domain) or external; resource links (images, PDFs…) can be filtered with resources.
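
A rough sketch of this classification using only the standard library (the set of "resource" extensions is an assumption):

from urllib.parse import urljoin, urlparse

RESOURCE_EXTENSIONS = (".pdf", ".zip", ".png", ".jpg", ".jpeg", ".gif")

def classify_link(page_url, href):
    absolute = urljoin(page_url, href)
    if urlparse(absolute).path.lower().endswith(RESOURCE_EXTENSIONS):
        return "resources"
    same_domain = urlparse(absolute).netloc == urlparse(page_url).netloc
    return "internal" if same_domain else "external"

print(classify_link("https://example.com/blog", "/about"))         # internal
print(classify_link("https://example.com/blog", "https://x.org"))  # external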


download_file

  • Safely downloads a remote file into the sandboxed cache directory.

Name

Type

Required

Description

url

string

✅ (no default)

Direct URL to the file.

filename

string

❌ (auto‑generated)

Desired filename; will be sanitised and forced into the cache directory.

Example

{ "jsonrpc": "2.0", "id": 12, "method": "tools/call", "params": { "name": "download_file", "arguments": { "url": "https://example.com/files/report.pdf", "filename": "report_latest.pdf" } } }

Note: The server enforces a 100 MB download limit, validates the URL against blocked domains/extensions, and returns the relative path inside the working directory for cross‑agent access.
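
The path‑traversal guard can be sketched as follows (directory layout and helper names are illustrative only):

from pathlib import Path

def safe_target(cache_dir, filename):
    cache = Path(cache_dir).resolve()
    # Keep only the final path component, then confirm the resolved
    # destination still lives inside the cache directory.
    candidate = (cache / Path(filename).name).resolve()
    if candidate != cache and cache not in candidate.parents:
        raise ValueError("path escapes the sandboxed cache directory")
    return candidate

print(safe_target("/tmp/cache", "../../etc/passwd"))  # resolves to /tmp/cache/passwd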


get_page_metadata

  • Extracts structured metadata (title, description, Open Graph, Twitter Cards) from a web page.

  • url (string, required) – URL of the page to inspect.

Example

{ "jsonrpc": "2.0", "id": 13, "method": "tools/call", "params": { "name": "get_page_metadata", "arguments": { "url": "https://example.com/product/42" } } }

Note: The tool returns a formatted text block with title, description, keywords, Open Graph properties and Twitter Card fields.
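
The same fields can be pulled out of raw HTML with BeautifulSoup, as a rough illustration (the server may implement this differently):

from bs4 import BeautifulSoup

def page_metadata(html):
    soup = BeautifulSoup(html, "html.parser")
    meta = {"title": soup.title.string if soup.title else None}
    for tag in soup.find_all("meta"):
        key = tag.get("property") or tag.get("name")
        if key and (key.startswith(("og:", "twitter:")) or key in ("description", "keywords")):
            meta[key] = tag.get("content")
    return meta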


check_url

  • Performs a lightweight HEAD request to report status code, headers and size without downloading the body.

  • url (string, required) – URL to probe.

Example

{ "jsonrpc": "2.0", "id": 14, "method": "tools/call", "params": { "name": "check_url", "arguments": { "url": "https://example.com/resource.zip" } } }

Note: The response includes the final URL after redirects, a concise status summary (✅ OK or ⚠️ Error), and selected HTTP headers such as Content‑Type and Content‑Length.
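
An equivalent probe using the requests library, as a sketch (the server may use a different HTTP client):

import requests

def check_url(url):
    response = requests.head(url, allow_redirects=True, timeout=10)
    return {
        "final_url": response.url,  # after redirects
        "status": "✅ OK" if response.ok else "⚠️ Error",
        "content_type": response.headers.get("Content-Type"),
        "content_length": response.headers.get("Content-Length"),
    }

print(check_url("https://example.com/resource.zip"))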


🔐 Security Features

  • Path‑traversal protection – all file operations are confined to the sandboxed working directory.

  • Prompt‑injection detection in URLs, fetched HTML and generated content.

  • Blocked domains & extensions (localhost, private IP ranges, executable/script files); see the sketch after this list.

  • Content‑size limits – max 50 MB for page fetches, max 100 MB for file downloads.

  • HTML sanitisation – removes <script>, <iframe>, event handlers and other risky elements before processing.

  • Cookie/banner handling – optional removal of consent banners and pop‑ups during fetch.
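
The blocked‑domain rule mentioned in the list above can be sketched with the standard library (a simplified illustration; the server's real checks may be broader):

import ipaddress
import socket
from urllib.parse import urlparse

def is_blocked(url):
    host = urlparse(url).hostname or ""
    if host in ("localhost", ""):
        return True
    try:
        address = ipaddress.ip_address(socket.gethostbyname(host))
    except (socket.gaierror, ValueError):
        return True  # unresolvable hosts are rejected
    return address.is_private or address.is_loopback or address.is_link_local

print(is_blocked("http://127.0.0.1:8080/"))  # True
print(is_blocked("https://example.com/"))    # typically False (public address)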


© 2025 Undici77 – All rights reserved.

