
MCP Data Fetch Server

by undici77
# 📂 MCP Data Fetch Server

**MCP Data Fetch Server** is a secure, sandboxed server that fetches web content and extracts data via the **Model Context Protocol (MCP)**, without executing JavaScript.

---

## Table of Contents

- [Features](#features)
- [Installation & Quick Start](#installation--quick-start)
- [Command‑Line Options](#commandline-options)
- [Integration with LM Studio](#integration-with-lm-studio)
- [MCP API Overview](#mcp-api-overview)
  - [`initialize`](#initialize)
  - [`tools/list`](#toolslist)
  - [`tools/call`](#toolscall)
- [Available Tools](#available-tools)
  - [`fetch_webpage`](#fetch_webpage)
  - [`extract_links`](#extract_links)
  - [`download_file`](#download_file)
  - [`get_page_metadata`](#get_page_metadata)
  - [`check_url`](#check_url)
- [Security Features](#security-features)

---

## 🎯 Features

- **Secure web page fetching** – strips scripts, iframes and cookie banners; no JavaScript execution.
- **Rich data extraction** – retrieve links, metadata, Open Graph/Twitter cards, and downloadable resources.
- **Safe file downloads** – size limits, filename sanitisation, and path‑traversal protection within a sandboxed cache.
- **Built‑in caching** – optional cache directory reduces repeated network calls.
- **Prompt‑injection detection** – validates URLs and fetched content for malicious instructions.

---

## 📦 Installation & Quick Start

```bash
# Clone the repository (or copy the MCPDataFetchServer.1 folder)
git clone https://github.com/undici77/MCPDataFetchServer.git
cd MCPDataFetchServer

# Make the startup script executable
chmod +x run.sh

# Run the server, pointing to a sandboxed working directory
./run.sh -d /path/to/working/directory
```

> 📌 **Three‑step overview**
> 1️⃣ The script creates a virtual environment and installs dependencies.
> 2️⃣ It prepares a cache folder (`.fetch_cache`) inside the project root.
> 3️⃣ `main.py` launches the MCP server, listening on *stdin/stdout* for JSON‑RPC requests.

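
The Features list mentions filename sanitisation and path‑traversal protection for downloads. The sketch below illustrates one common way to implement that idea in Python. It is not taken from the project's source: the helper names `sanitise_filename` and `safe_cache_path`, the allowed‑character rule and the `download` fallback name are all assumptions made for illustration.

```python
import re
from pathlib import Path


def sanitise_filename(name: str) -> str:
    """Reduce an untrusted name to a plain basename (hypothetical rules)."""
    # Keep only the last path component, treating both / and \ as separators.
    base = name.replace("\\", "/").split("/")[-1]
    # Replace every character outside a conservative whitelist.
    base = re.sub(r"[^A-Za-z0-9._-]", "_", base)
    # Drop leading dots so the result is never ".", ".." or a hidden file.
    base = base.lstrip(".")
    # Fall back to a fixed name if nothing usable is left.
    return base or "download"


def safe_cache_path(cache_dir: Path, requested_name: str) -> Path:
    """Return a path inside cache_dir, or raise if it would escape it."""
    cache_root = cache_dir.resolve()
    candidate = (cache_root / sanitise_filename(requested_name)).resolve()
    # Defence in depth: verify containment even after sanitising.
    # Path.is_relative_to requires Python 3.9 or newer.
    if not candidate.is_relative_to(cache_root):
        raise ValueError(f"refusing to write outside {cache_root}")
    return candidate


if __name__ == "__main__":
    print(safe_cache_path(Path("/tmp/cache"), "../../etc/passwd"))
```

Sanitising first and checking containment second means a mistake in either step is still caught by the other.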

---

## ⚙️ Command‑Line Options

| Option | Description |
|--------|-------------|
| `-d`, `--working-dir` | Path to the **sandboxed working directory** where all file operations are confined (default: `~/.mcp_datafetch`). |
| `-c`, `--cache-dir` | Name of the **cache subdirectory** relative to the working directory (default: `cache`). |
| `-h`, `--help` | Show help message and exit. |

---

## 🤝 Integration with LM Studio

*(or any MCP‑compatible client)*

Add an entry to your `mcp.json` configuration so that LM Studio can launch the server automatically.

```json
{
  "mcpServers": {
    "datafetch": {
      "command": "/absolute/path/to/MCPDataFetchServer.1/run.sh",
      "args": [
        "-d",
        "/absolute/path/to/working/directory"
      ],
      "env": {
        "WORKING_DIR": "."
      }
    }
  }
}
```

> 📌 **Tip:** Ensure `run.sh` is executable (`chmod +x …`) and that the virtual environment can install the required Python packages on first launch.

---

## 📡 MCP API Overview

All communication follows **JSON‑RPC 2.0** over *stdin/stdout*.

### `initialize`

Request:

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "initialize",
  "params": {}
}
```

Response contains the protocol version, server capabilities and basic metadata (e.g., name = `mcp-datafetch-server`, version = `2.1.0`).

### `tools/list`

Request:

```json
{
  "jsonrpc": "2.0",
  "id": 2,
  "method": "tools/list",
  "params": {}
}
```

Response: `{ "tools": [ …tool definitions… ] }`. Each definition includes `name`, `description` and an **input schema** (JSON Schema).

### `tools/call`

Generic request shape (replace `<tool_name>` and arguments as needed):

```json
{
  "jsonrpc": "2.0",
  "id": 3,
  "method": "tools/call",
  "params": {
    "name": "<tool_name>",
    "arguments": { … }
  }
}
```

The server validates the request against the tool’s schema, executes the operation, and returns a `ToolResult` containing one or more **content blocks**.

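
The three request shapes above differ only in `method` and `params`, so a client can build them with one small helper. The following is a minimal sketch rather than part of the project: the function names `make_request`, `make_tool_call` and `encode` are hypothetical, and the one‑JSON‑object‑per‑line framing is an assumption about how the server reads *stdin*.

```python
import json
from typing import Any, Dict, Optional


def make_request(request_id: int, method: str,
                 params: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Build a JSON-RPC 2.0 request object."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": method,
        "params": params if params is not None else {},
    }


def make_tool_call(request_id: int, name: str,
                   arguments: Dict[str, Any]) -> Dict[str, Any]:
    """Build a tools/call request for the named tool."""
    return make_request(request_id, "tools/call",
                        {"name": name, "arguments": arguments})


def encode(message: Dict[str, Any]) -> bytes:
    """Serialise a message as one line of UTF-8 JSON.

    Assumption: the server expects newline-delimited messages on stdin.
    """
    return (json.dumps(message, separators=(",", ":")) + "\n").encode("utf-8")


if __name__ == "__main__":
    messages = [
        make_request(1, "initialize"),
        make_request(2, "tools/list"),
        make_tool_call(3, "check_url",
                       {"url": "https://example.com/resource.zip"}),
    ]
    for message in messages:
        print(encode(message).decode("utf-8"), end="")
```

Under the same framing assumption, each encoded message could be written to the standard input of the process started by `run.sh`, with one response line read back from its standard output per request.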

---

## 🛠️ Available Tools

### `fetch_webpage`

- **Securely fetches a web page and returns clean content in the requested format.**

| Name | Type | Required | Description |
|------|------|----------|-------------|
| `url` | string | ✅ (no default) | URL to fetch (**http/https only**). |
| `format` | string | ❌ (`markdown`) | Output format – one of `markdown`, `text`, or `html`. |
| `include_links` | boolean | ❌ (`true`) | Whether to append an extracted links list. |
| `include_images` | boolean | ❌ (`false`) | Whether to list image URLs in the output. |
| `remove_banners` | boolean | ❌ (`true`) | Attempt to strip cookie banners & pop‑ups. |

**Example**

```json
{
  "jsonrpc": "2.0",
  "id": 10,
  "method": "tools/call",
  "params": {
    "name": "fetch_webpage",
    "arguments": {
      "url": "https://example.com/article",
      "format": "markdown",
      "include_links": true,
      "include_images": false,
      "remove_banners": true
    }
  }
}
```

*Note:* The tool sanitises HTML, removes scripts/iframes, and checks for prompt‑injection patterns before returning content.

---

### `extract_links`

- **Extracts and categorises all hyperlinks from a page.**

| Name | Type | Required | Description |
|------|------|----------|-------------|
| `url` | string | ✅ (no default) | URL of the page to analyse. |
| `filter` | string | ❌ (`all`) | Which links to return – one of `all`, `internal`, `external`, or `resources`. |

**Example**

```json
{
  "jsonrpc": "2.0",
  "id": 11,
  "method": "tools/call",
  "params": {
    "name": "extract_links",
    "arguments": {
      "url": "https://example.com/blog",
      "filter": "internal"
    }
  }
}
```

*Note:* Links are classified as **internal** (same domain) or **external**; resource links (images, PDFs…) can be filtered with `resources`.

---

### `download_file`

- **Safely downloads a remote file into the sandboxed cache directory.**

| Name | Type | Required | Description |
|------|------|----------|-------------|
| `url` | string | ✅ (no default) | Direct URL to the file. |
| `filename` | string | ❌ (auto‑generated) | Desired filename; will be sanitised and forced into the cache directory. |

**Example**

```json
{
  "jsonrpc": "2.0",
  "id": 12,
  "method": "tools/call",
  "params": {
    "name": "download_file",
    "arguments": {
      "url": "https://example.com/files/report.pdf",
      "filename": "report_latest.pdf"
    }
  }
}
```

*Note:* The server enforces a **100 MB** download limit, validates the URL against blocked domains/extensions, and returns the relative path inside the working directory for cross‑agent access.

---

### `get_page_metadata`

- **Extracts structured metadata (title, description, Open Graph, Twitter Cards) from a web page.**

| Name | Type | Required | Description |
|------|------|----------|-------------|
| `url` | string | ✅ (no default) | URL of the page to inspect. |

**Example**

```json
{
  "jsonrpc": "2.0",
  "id": 13,
  "method": "tools/call",
  "params": {
    "name": "get_page_metadata",
    "arguments": {
      "url": "https://example.com/product/42"
    }
  }
}
```

*Note:* The tool returns a formatted text block with title, description, keywords, Open Graph properties and Twitter Card fields.

---

### `check_url`

- **Performs a lightweight HEAD request to report status code, headers and size without downloading the body.**

| Name | Type | Required | Description |
|------|------|----------|-------------|
| `url` | string | ✅ (no default) | URL to probe. |

**Example**

```json
{
  "jsonrpc": "2.0",
  "id": 14,
  "method": "tools/call",
  "params": {
    "name": "check_url",
    "arguments": {
      "url": "https://example.com/resource.zip"
    }
  }
}
```

*Note:* The response includes the final URL after redirects, a concise status summary (✅ OK or ⚠️ Error), and selected HTTP headers such as `Content-Type` and `Content-Length`.

---

## 🔐 Security Features

- **Path‑traversal protection** – all file operations are confined to the sandboxed *working directory*.
- **Prompt‑injection detection** in URLs, fetched HTML and generated content.
- **Blocked domains & extensions** (localhost, private IP ranges, executable/script files).
- **Content‑size limits** – max 50 MB for page fetches, max 100 MB for file downloads.
- **HTML sanitisation** – removes `<script>`, `<iframe>`, event handlers and other risky elements before processing.
- **Cookie/banner handling** – optional removal of consent banners and pop‑ups during fetch.

---

*© 2025 Undici77 – All rights reserved.*

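
The URL rules listed under Security Features (http/https only, no localhost or private IP ranges, no executable/script files) can be approximated with the Python standard library. This is an illustrative sketch, not the server's actual validation code: the function name `is_url_allowed` and the contents of `BLOCKED_EXTENSIONS` are assumptions, and the real block list may differ.

```python
import ipaddress
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# Hypothetical block list; the server's real list is not documented here.
BLOCKED_EXTENSIONS = {".exe", ".bat", ".cmd", ".sh", ".ps1", ".msi"}
BLOCKED_HOSTS = {"localhost"}


def is_url_allowed(url: str) -> bool:
    """Return True if the URL passes the scheme, host and extension checks."""
    try:
        parts = urlsplit(url)
        host = parts.hostname
    except ValueError:
        return False  # malformed URL, e.g. an unterminated IPv6 literal

    # Only plain web schemes are accepted.
    if parts.scheme not in ("http", "https"):
        return False
    if not host:
        return False

    host = host.rstrip(".")
    if host in BLOCKED_HOSTS or host.endswith(".localhost"):
        return False

    # If the host is an IP literal, reject internal address ranges.
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        ip = None  # a DNS name, not an IP literal
    if ip is not None and (ip.is_private or ip.is_loopback or ip.is_link_local):
        return False

    # Reject paths that end in a blocked file extension.
    suffix = PurePosixPath(parts.path).suffix.lower()
    return suffix not in BLOCKED_EXTENSIONS


if __name__ == "__main__":
    for candidate in ("https://example.com/article",
                      "http://192.168.1.10/admin",
                      "https://example.com/setup.exe"):
        print(candidate, is_url_allowed(candidate))
```

This sketch inspects only the URL text. A DNS name that resolves to a private address would still pass, so a stricter check would also validate the resolved IP before connecting.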
