mcp-webgate

Web search that doesn't wreck your AI's memory.

mcp-webgate is an MCP server that gives your AI clean, bounded web content across all major AI clients:

  • IDEs: Claude Desktop, Claude Code, Zed, Cursor, Windsurf, VSCode

  • CLI Agents: Gemini CLI, Claude CLI, custom agents

🌱 A Gentle Introduction

What is mcp-webgate? When your AI uses a standard "fetch URL" tool, it gets the raw HTML of the page: ads, menus, scripts, cookie banners and all. A single news article can dump 200,000 tokens of garbage into the AI's memory, wiping out your entire conversation.

mcp-webgate is a protective filter that sits between your AI and the web:

  1. Strips the junk: menus, scripts, ads, footers are removed with surgical HTML parsing; only readable text passes through

  2. Hard-caps every response: no page can ever blow up your context window, no matter how big the original was

  3. Optionally summarizes: route results through a secondary local LLM that produces a compact Markdown report with citations; your primary AI gets a polished briefing instead of a wall of text

The result: clean, bounded, useful web content, always.

🔬 Real example: what happens under the hood

Searching for "mcp model context protocol" with LLM features on:

Query → LLM expands to 5 search variants → 20 pages found, 13 fetched in parallel

Raw HTML downloaded     5.16 MB   (~1,290,000 tokens)
After cleaning          52.1 KB   (   ~13,000 tokens)   99% noise stripped
After LLM summary        5.8 KB   (    ~1,450 tokens)   structured report with citations

13 sources distilled into ~1,450 tokens. A single naive fetch of just one of those pages (e.g. a security blog at 563 KB) would dump ~140,000 tokens of raw HTML into your AI's context. webgate processes all 13 and delivers a clean briefing that fits in a footnote.

This is an intensive case (5 queries × 5 results). A typical search with 3–5 results still saves 95%+ of context compared to raw fetching, and your AI gets structured, ranked content instead of a wall of HTML soup.

🚀 Quick Start

1. Make sure you have uvx

pip install uv

uvx runs Python tools without installing them permanently. You only need to do this once.

2. Set up a search backend

The easiest option is SearXNG (free, no account, runs locally):

docker run -d -p 8080:8080 --name searxng searxng/searxng

No Docker? Use a cloud backend instead (Brave, Tavily, Exa, SerpAPI); see Backends.

3. Add webgate to your AI client

See the Integrations table for your specific client. As a quick example, for Claude Desktop:

Open the config file:

  • macOS: ~/Library/Application Support/Claude/claude_desktop_config.json

  • Windows: %APPDATA%\Claude\claude_desktop_config.json

Add this:

{
  "mcpServers": {
    "webgate": {
      "command": "uvx",
      "args": ["mcp-webgate"],
      "env": {
        "WEBGATE_DEFAULT_BACKEND": "searxng",
        "WEBGATE_SEARXNG_URL": "http://localhost:8080"
      }
    }
  }
}

Restart the client after editing.

4. Try it

Ask your AI something like:

Search the web for: latest news on AI regulation

The AI will use webgate_query automatically. You're done.

🔍 How it works

Your question
    ↓
Search backend  (SearXNG / Brave / Tavily / Exa / SerpAPI)
    ↓  [deduplicate URLs, block binary files, filter domains]
Fetch pages in parallel  (streaming, hard size cap per page)
    ↓  [optional: retry failed pages from reserve pool]
Strip HTML junk  (menus, ads, scripts, footers; lxml)
    ↓
Clean up text  (invisible chars, unicode junk, BiDi tricks)
    ↓
BM25 reranking  (best-matching results first, always active)
    ↓  [optional: LLM reranking]
Cap total output to budget
    ↓  [optional: LLM summarization → compact Markdown report]
Clean result lands in your AI's context
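To make the "strip HTML junk" and "cap" stages concrete, here is a minimal sketch using only the Python standard library. It is not webgate's code: webgate uses lxml, while this stand-in uses html.parser, and the NOISE_TAGS list and the clean_and_cap name are assumptions chosen for illustration.

```python
from html.parser import HTMLParser

# Tags whose entire contents are treated as noise.
# Illustrative list only; webgate's real rules are not shown in this README.
NOISE_TAGS = {"script", "style", "nav", "header", "footer", "aside", "noscript"}


class TextExtractor(HTMLParser):
    """Collects visible text while skipping everything nested inside noise tags."""

    def __init__(self) -> None:
        super().__init__()
        self.skip_depth = 0          # > 0 while inside at least one noise tag
        self.chunks: list[str] = []  # visible text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def clean_and_cap(html: str, max_chars: int) -> tuple[str, bool]:
    """Return (text, truncated): visible text joined by spaces, cut at max_chars."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    return text[:max_chars], len(text) > max_chars
```

The second return value mirrors the "truncated" field that the tools report, so a caller can tell a short page from a page that was cut at the cap.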

🛠️ Tools

webgate gives your AI three tools:

webgate_fetch: read a single page

Use this when you already know the URL you want. The AI passes the URL and gets back the cleaned text, up to max_query_budget characters (default 32,000).

Input:

{ "url": "https://example.com/article", "max_chars": 32000 }

Output:

{
  "url": "https://example.com/article",
  "title": "Article Title",
  "text": "cleaned text...",
  "truncated": true,
  "char_count": 12450
}
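Your MCP client normally builds the tool call for you. For debugging or for custom agents, the sketch below shows roughly what the request looks like on the wire, following the JSON-RPC `tools/call` framing defined by the MCP specification. The tool_call_request helper name is an assumption for illustration.

```python
import json


def tool_call_request(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize an MCP tools/call request as a JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return json.dumps(message)


# Same arguments as the webgate_fetch input shown above.
request = tool_call_request(
    1,
    "webgate_fetch",
    {"url": "https://example.com/article", "max_chars": 32000},
)
print(request)
```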

webgate_query: search + fetch + clean

Runs a full search cycle. Pass one query (or several) and get back cleaned, ranked results.

{ "queries": "how to set up a VPN on Linux", "num_results_per_query": 5 }

Multiple queries run in parallel and are merged:

{
  "queries": ["VPN Linux setup", "best VPN Linux 2024"],
  "num_results_per_query": 5
}

Output without LLM (cleaned page content for each result):

{
  "sources": [
    { "id": 1, "title": "...", "url": "...", "content": "cleaned text...", "truncated": false }
  ],
  "snippet_pool": [ { "id": 6, "title": "...", "url": "...", "snippet": "..." } ],
  "stats": { "fetched": 5, "total_chars": 18200, "per_page_limit": 6400 }
}

Output with LLM summarization (a compact Markdown report):

{
  "summary": "## How to set up a VPN on Linux\n\nTo install...[1][2]",
  "citations": [{ "id": 1, "title": "...", "url": "..." }],
  "stats": { "fetched": 5, "total_chars": 58000 }
}

Output when the LLM fails (error reason shown, full sources returned as fallback):

{
  "llm_summary_error": "ReadTimeout: LLM did not respond in time",
  "sources": [ "..." ],
  "stats": { "..." : "..." }
}

snippet_pool contains extra results from the search that were not fetched (search-engine snippet only). The AI can use these to decide if more fetches are worthwhile.
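If you consume webgate_query results from your own agent code, you need to tell the three output shapes apart. A minimal sketch, assuming only the field names shown in the examples above; describe_result is a hypothetical helper, not part of webgate.

```python
def describe_result(result: dict) -> str:
    """Classify a webgate_query result by the keys it contains."""
    if "summary" in result:
        # LLM summarization succeeded: report plus citations, no raw sources.
        count = len(result.get("citations", []))
        return f"summary with {count} citation(s)"
    if "llm_summary_error" in result:
        # LLM failed: the cleaned sources are returned as a fallback.
        count = len(result.get("sources", []))
        return f"LLM failed ({result['llm_summary_error']}); {count} source(s) as fallback"
    # No LLM: cleaned sources plus the unfetched snippet pool.
    count = len(result.get("sources", []))
    extra = len(result.get("snippet_pool", []))
    return f"{count} cleaned source(s), {extra} unfetched snippet(s)"
```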

webgate_onboarding: how-to guide

Returns a JSON guide explaining how to use webgate effectively. The AI should call this once at the start of a session if in doubt about which tool to use.

🔧 Using webgate with local or smaller models

Most frontier models follow MCP tool instructions automatically. Smaller or local models sometimes ignore the server-provided guidance and fall back to a built-in fetch tool instead, returning raw HTML that floods the context with noise.

If you notice this happening, add an explicit instruction block to your system prompt:

You have access to webgate tools for web search and page retrieval.
Follow these rules in every session:
- To search the web: use webgate_query; never use a built-in fetch, browser, or HTTP tool
- To retrieve a URL: use webgate_fetch; never fetch URLs directly
- Built-in fetch tools return raw HTML that floods your context; webgate returns clean, bounded text
At the start of each session, call webgate_onboarding to read the full operational guide.

This works because models generally give system prompt instructions priority over MCP server-level guidance, so the constraint is stated at the highest-priority layer the model sees.

Tip: if your client supports named system prompts or prompt templates, save the block above as a reusable preset so you don't have to paste it every time.

🎛️ Tuning

This section explains what the key parameters do and when to change them. The defaults work well for most cases; only tweak if you have a specific reason.

What is a "character budget"?

webgate measures text in characters (not tokens). A rough conversion for English text: 4 characters ≈ 1 token.

| Characters | Approximate tokens |
|---|---|
| 8,000 | ~2,000 |
| 32,000 | ~8,000 |
| 96,000 | ~24,000 |
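The conversion is simple enough to keep in a helper when you are sizing budgets. This sketch only encodes the 4-characters-per-token rule of thumb stated above; real tokenizers vary by model and language, and approx_tokens is a name chosen for illustration.

```python
CHARS_PER_TOKEN = 4  # rough rule of thumb for English text


def approx_tokens(chars: int) -> int:
    """Estimate the token count for a given character budget."""
    return chars // CHARS_PER_TOKEN


for chars in (8_000, 32_000, 96_000):
    print(f"{chars:>6,} chars ~ {approx_tokens(chars):,} tokens")
```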

webgate_fetch budget

When you fetch a single URL, the ceiling is max_query_budget (default 32,000 chars). The tool parameter max_chars can request less, but never more than this ceiling.

Why max_query_budget and not max_result_length? Because you're fetching one page, the "total output" IS that one page, so the right limit is the overall context budget, not the per-page cap designed for multi-source queries.

webgate_query budget (without LLM)

With no LLM, the cleaned sources go directly to your AI's context. webgate distributes max_query_budget across all fetched pages so the total never exceeds the budget:

Per-page limit = max_query_budget ÷ number of results (capped at max_result_length)

| Results fetched | Per-page limit | Total output |
|---|---|---|
| 1 | 8,000 (cap) | ≤ 8,000 |
| 5 | 6,400 | ≤ 32,000 |
| 10 | 3,200 | ≤ 32,000 |
| 20 | 1,600 | ≤ 32,000 |

The total output is always at most max_query_budget, regardless of how many results you request; the per-page share automatically shrinks to compensate.
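The per-page rule can be checked with a few lines of arithmetic. This is a sketch of the formula as described above, using the default values from this README; the function name and the use of integer division are assumptions.

```python
def per_page_limit(
    num_results: int,
    max_query_budget: int = 32_000,
    max_result_length: int = 8_000,
) -> int:
    """Split the total budget evenly across results, capped per page."""
    if num_results < 1:
        raise ValueError("num_results must be at least 1")
    return min(max_query_budget // num_results, max_result_length)


for n in (1, 5, 10, 20):
    limit = per_page_limit(n)
    print(f"{n:>2} results: {limit:,} chars per page, {n * limit:,} total")
```

With a single result the cap wins (8,000 chars); from five results up, the even split is the binding limit and the total stays at 32,000.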

webgate_query budget (with LLM summarization)

When a secondary LLM is summarizing, it compresses the content before passing the result to your primary AI. This means it's safe, and beneficial, to give it more raw material to work from.

webgate scales up the input using input_budget_factor (default 3):

LLM input budget = max_query_budget × input_budget_factor

Default: 32,000 × 3 = 96,000 chars

| Results fetched | LLM input / page | Total LLM input | Output to your AI |
|---|---|---|---|
| 1 | 96,000 | 96,000 | compact report |
| 5 | 19,200 | 96,000 | compact report |
| 10 | 9,600 | 96,000 | compact report |
| 20 | 4,800 | 96,000 | compact report |

The secondary LLM sees much more content per page. Your primary AI sees only the final report, typically 1,000–3,000 tokens, regardless of how many sources were processed. This is the main efficiency advantage of LLM mode.
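The same arithmetic for LLM mode, as a sketch. It assumes, based on the table above, that max_result_length does not apply when the LLM is summarizing; the function name is illustrative.

```python
def llm_input_per_page(
    num_results: int,
    max_query_budget: int = 32_000,
    input_budget_factor: int = 3,
) -> int:
    """Scale the budget by the factor, then split it evenly across results."""
    if num_results < 1:
        raise ValueError("num_results must be at least 1")
    total_input = max_query_budget * input_budget_factor
    return total_input // num_results


for n in (1, 5, 10, 20):
    print(f"{n:>2} results: {llm_input_per_page(n):,} chars of LLM input per page")
```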

Quick tuning guide

| Symptom | Fix |
|---|---|
| AI responses feel slow, too much text | Reduce max_query_budget (e.g. 16000) |
| AI answers are shallow or miss details | Increase max_query_budget (e.g. 48000) |
| LLM summary is thin or misses things | Increase input_budget_factor (e.g. 5) |
| LLM summary times out or is very slow | Reduce input_budget_factor (e.g. 2) or reduce results_per_query |
| fetch returns too little of a long page | Increase max_query_budget (e.g. 64000) |
| Pages are slow to download | Reduce max_download_mb (e.g. 1, already default) |
| Server downloads too much garbage | Reduce max_download_mb (e.g. 1) |

🤖 LLM Features

Optional, opt-in. When llm.enabled = false (the default), webgate is fully deterministic. Enable the [llm] block to unlock three extra capabilities.

🤔 When to enable LLM features

| Situation | Recommended setup | Typical latency overhead |
|---|---|---|
| Fast answers, general research | LLM disabled (default): BM25-ranked clean sources, zero latency overhead | none |
| Deep research on a complex topic | Summarization on: get a cited Markdown report instead of raw pages | +5–30s |
| Broad topic, one query isn't enough | Expansion + Summarization: LLM generates variants and synthesizes all results | +6–35s |
| Result order matters more than speed | LLM reranking on: semantic ordering at the cost of one extra LLM call per query | +1–5s |

Privacy: with LLM disabled, no data leaves your machine except web requests. With LLM enabled, cleaned search results (not raw HTML) are sent to the configured base_url. Point it at a local Ollama instance to keep everything on-device.

Latency trade-off: each enabled feature adds one LLM round-trip per query. Expansion adds ~1–5s; summarization adds ~5–30s depending on model and content volume. For interactive use, summarization with a fast local model (e.g. Gemma 3 4B) is a good starting point.

Setup

[llm]
enabled  = true
base_url = "http://localhost:11434/v1"   # Ollama, OpenAI, LM Studio, vLLM, Groq...
api_key  = ""                            # empty for local models
model    = "gemma3:27b"
timeout  = 60                            # local 27B+ models may need up to 60s

Or with env vars:

"env": {
  "WEBGATE_LLM_ENABLED": "true",
  "WEBGATE_LLM_BASE_URL": "http://localhost:11434/v1",
  "WEBGATE_LLM_MODEL": "gemma3:27b",
  "WEBGATE_LLM_TIMEOUT": "60"
}

base_url accepts any OpenAI-compatible endpoint: OpenAI, Ollama, LM Studio, vLLM, Together AI, Groq, and others.

Query expansion

When you send a single query and expansion_enabled = true, the LLM automatically generates complementary search variants before hitting the backend. If you already pass multiple queries, this step is skipped.

"best laptop for programming"
    ↓ expansion
["best laptop for programming 2024", "developer laptop recommendations", "laptop specs for coding"]
    ↓ all search in parallel

Falls back silently to your original query if the LLM fails.
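The silent fallback can be sketched as a thin wrapper around whatever function calls the LLM. Everything here is illustrative: generate_variants stands in for the real LLM call, and both keeping the original query first and capping the list at 5 (mirroring max_search_queries) are assumptions.

```python
from typing import Callable


def expand_query(
    query: str,
    generate_variants: Callable[[str], list[str]],
    max_queries: int = 5,
) -> list[str]:
    """Return the query plus LLM-generated variants, or just the query on failure."""
    try:
        variants = generate_variants(query)
    except Exception:
        # Any LLM failure (timeout, bad response) falls back to the original query.
        return [query]

    unique: list[str] = []
    for candidate in [query, *variants]:
        candidate = candidate.strip()
        if candidate and candidate not in unique:
            unique.append(candidate)
    return unique[:max_queries]
```

Because the fallback returns a one-element list, the rest of the pipeline can treat expanded and non-expanded queries the same way.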

Summarization

When summarization_enabled = true, the LLM reads all fetched pages and writes a structured Markdown report with inline citations. Your AI receives the report instead of the raw text.

  • Success: summary + citations (lean output; no raw content passed to your AI)

  • Failure: llm_summary_error with the reason + full sources as fallback (your AI can still work with the cleaned content)

The report length target is max_summary_words. When 0 (default), it is derived from max_query_budget / 5; e.g. with a 32k budget, the target is ~6,400 words.
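The word-target rule fits in one function. A sketch of the rule as stated above, with a hypothetical function name and integer division assumed:

```python
def summary_word_target(max_summary_words: int = 0, max_query_budget: int = 32_000) -> int:
    """Use the explicit target if set; otherwise derive it from the budget."""
    if max_summary_words > 0:
        return max_summary_words
    return max_query_budget // 5


print(summary_word_target())                       # derived from the default budget
print(summary_word_target(max_summary_words=800))  # explicit target wins
```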

Reranking

Results are always reranked by BM25 (keyword overlap, zero cost) before being returned. Optionally, the LLM can do a second pass for semantic relevance:

| Tier | When | Cost |
|---|---|---|
| BM25 (deterministic) | Always | Zero (pure math) |
| LLM-assisted | llm_rerank_enabled = true | One LLM call per query |

LLM reranking adds latency proportional to your LLM response time. Enable it only if result ordering matters more than speed.

Pipeline: clean → BM25 rerank → (LLM rerank) → (LLM summarize) → output
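For readers unfamiliar with BM25, here is a minimal pure-Python sketch of Okapi BM25 scoring. It shows why the deterministic tier costs nothing beyond arithmetic. The tokenizer, the k1 and b values, and the function names are assumptions for illustration; webgate's actual implementation may differ.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with Okapi BM25."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs if n_docs else 0.0

    # Document frequency: in how many documents each term appears.
    doc_freq: Counter = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    query_terms = set(tokenize(query))
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue
            # Non-negative IDF variant: rarer terms weigh more.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Length normalization: long documents need more hits to score the same.
            norm = term_freq[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def rerank(query: str, docs: list[str]) -> list[str]:
    """Return docs ordered from best to worst BM25 match."""
    scores = bm25_scores(query, docs)
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [docs[i] for i in order]
```

BM25 only sees shared keywords, which is why an optional LLM pass can still help when the best page uses different wording from the query.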

🔗 Integrations

mcp-webgate works with all major AI clients:

| Platform | Configuration Guide | Notes |
|---|---|---|
| Claude Desktop | IDE Integration | Desktop application |
| Claude Code | IDE Integration | CLI coding agent |
| Zed Editor | IDE Integration | Native MCP support |
| Cursor | IDE Integration | Requires Agent mode |
| Windsurf | IDE Integration | Global config only |
| VSCode | IDE Integration | Via Copilot or MCP extension |
| Gemini CLI | Agent Integration | Google's CLI agent |
| Claude CLI | Agent Integration | Anthropic's CLI agent |

📦 Installation

uvx mcp-webgate

Via pip / uv

pip install mcp-webgate
# or
uv add mcp-webgate

⚙️ Full Configuration

Ready-to-use config files are in examples/.

Resolution order

CLI args  >  env vars  >  webgate.toml  >  defaults

Config is read once at startup; restart the server to apply changes.

You can configure webgate in three ways (mix and match as needed):

  • webgate.toml: checked at startup in ./webgate.toml then ~/webgate.toml

  • Env vars: WEBGATE_* prefix, always strings (MCP JSON requirement)

  • CLI args: --kebab-case, integers stay integers, ideal for multi-instance setups
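The resolution order above can be modeled with a ChainMap, where earlier mappings win. This is a sketch of the precedence only, not webgate's loader: the DEFAULTS subset, the coerce rules for string env values, and the assumption that env keys have already been mapped from WEBGATE_* names to setting names are all illustrative.

```python
from collections import ChainMap

# A small illustrative subset of the defaults listed in this README.
DEFAULTS = {"max_query_budget": 32000, "results_per_query": 5, "llm_enabled": False}


def coerce(value: str, default):
    """Convert a string env value to the type of the corresponding default."""
    if isinstance(default, bool):  # check bool first: bool is a subclass of int
        return value.strip().lower() in {"1", "true", "yes", "on"}
    return type(default)(value)


def resolve(cli: dict, env: dict, toml: dict, defaults: dict = DEFAULTS) -> dict:
    """Merge settings with precedence: CLI args > env vars > webgate.toml > defaults."""
    typed_env = {k: coerce(v, defaults[k]) for k, v in env.items() if k in defaults}
    return dict(ChainMap(cli, typed_env, toml, defaults))


settings = resolve(
    cli={"results_per_query": 3},
    env={"max_query_budget": "16000", "results_per_query": "10", "llm_enabled": "true"},
    toml={"max_query_budget": 48000},
)
print(settings)
```

Here the CLI value for results_per_query beats the env var, and the env var for max_query_budget beats the TOML file.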

Config file (webgate.toml)

[server]
max_download_mb    = 1        # how many MB to download per page before cutting off
max_result_length  = 8000     # max chars per page in multi-source queries (no LLM)
max_query_budget   = 32000    # total char budget for a fetch, or input pool for a query
max_search_queries = 5        # max parallel queries per call
results_per_query  = 5        # results to fetch per query
search_timeout     = 8        # seconds before giving up on a page
oversampling_factor = 2       # fetch 2× more candidates than needed (dedup reserve)
auto_recovery_fetch = false   # retry failed fetches from reserve pool
max_total_results  = 20       # hard cap: never fetch more than this many pages total
blocked_domains    = ["reddit.com", "pinterest.com"]
allowed_domains    = []       # if non-empty, only these domains are allowed
adaptive_budget    = false   # [EXPERIMENTAL] proportional char allocation based on BM25 rank
adaptive_budget_fetch_factor = 3  # generous pre-rank fetch multiplier

[backends]
default = "searxng"

[backends.searxng]
url = "http://localhost:8080"

[backends.brave]
api_key = "BSA..."

[backends.tavily]
api_key = "tvly-..."
search_depth = "basic"

[llm]
enabled  = true
base_url = "http://localhost:11434/v1"
api_key  = ""
model    = "llama3.2"
timeout  = 60
expansion_enabled     = true
summarization_enabled = true
llm_rerank_enabled    = false
max_summary_words     = 0     # 0 = max_query_budget / 5 (e.g. 6400 with budget 32000)
input_budget_factor   = 3     # LLM input = max_query_budget × factor (default: 96000)

MCP client config examples

With env vars (all values must be strings):

{
  "mcpServers": {
    "webgate": {
      "command": "uvx",
      "args": ["mcp-webgate"],
      "env": {
        "WEBGATE_DEFAULT_BACKEND": "searxng",
        "WEBGATE_SEARXNG_URL": "http://localhost:8080",
        "WEBGATE_LLM_ENABLED": "true",
        "WEBGATE_LLM_TIMEOUT": "60"
      }
    }
  }
}

With CLI args (integers stay integers; ideal for running independent instances in Zed, Cursor, etc.):

{
  "mcpServers": {
    "webgate": {
      "command": "uvx",
      "args": [
        "mcp-webgate",
        "--searxng-url", "http://localhost:8080",
        "--llm-enabled",
        "--llm-model", "gemma3:27b",
        "--llm-timeout", "60"
      ]
    }
  }
}

Boolean flags support --flag / --no-flag syntax (e.g. --llm-enabled, --no-llm-rerank-enabled).
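The --flag / --no-flag pattern is what Python's argparse.BooleanOptionalAction (Python 3.9+) provides. The sketch below reproduces the behavior with three of the flags from this README; it is an illustration of the syntax, not webgate's actual argument parser.

```python
import argparse

parser = argparse.ArgumentParser(prog="mcp-webgate-sketch")
# BooleanOptionalAction registers both --llm-enabled and --no-llm-enabled.
parser.add_argument("--llm-enabled", action=argparse.BooleanOptionalAction, default=False)
parser.add_argument("--llm-rerank-enabled", action=argparse.BooleanOptionalAction, default=False)
# type=int keeps integers as integers, unlike env vars, which are always strings.
parser.add_argument("--llm-timeout", type=int, default=30)

args = parser.parse_args(["--llm-enabled", "--no-llm-rerank-enabled", "--llm-timeout", "60"])
print(args.llm_enabled, args.llm_rerank_enabled, args.llm_timeout)  # True False 60
```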

Full reference

| CLI argument | Env var | Default | Description |
|---|---|---|---|
| --default-backend | WEBGATE_DEFAULT_BACKEND | searxng | Active backend |
| --searxng-url | WEBGATE_SEARXNG_URL | http://localhost:8080 | SearXNG instance URL |
| --brave-api-key | WEBGATE_BRAVE_API_KEY | (empty) | Brave Search API key |
| --tavily-api-key | WEBGATE_TAVILY_API_KEY | (empty) | Tavily API key |
| --exa-api-key | WEBGATE_EXA_API_KEY | (empty) | Exa API key |
| --serpapi-api-key | WEBGATE_SERPAPI_API_KEY | (empty) | SerpAPI key |
| --serpapi-engine | WEBGATE_SERPAPI_ENGINE | google | SerpAPI engine (google, bing, ...) |
| --serpapi-gl | WEBGATE_SERPAPI_GL | us | SerpAPI country code |
| --serpapi-hl | WEBGATE_SERPAPI_HL | en | SerpAPI language |
| --max-download-mb | WEBGATE_MAX_DOWNLOAD_MB | 1 | Per-page download size cap (MB) |
| --max-result-length | WEBGATE_MAX_RESULT_LENGTH | 8000 | Per-page char cap (no-LLM queries) |
| --max-query-budget | WEBGATE_MAX_QUERY_BUDGET | 32000 | Total char budget for fetch and query |
| --max-search-queries | WEBGATE_MAX_SEARCH_QUERIES | 5 | Max queries per call |
| --results-per-query | WEBGATE_RESULTS_PER_QUERY | 5 | Default results fetched per query |
| --search-timeout | WEBGATE_SEARCH_TIMEOUT | 8 | HTTP request timeout (seconds) |
| --oversampling-factor | WEBGATE_OVERSAMPLING_FACTOR | 2 | Search result multiplier for dedup reserve |
| --auto-recovery-fetch | WEBGATE_AUTO_RECOVERY_FETCH | false | Enable gap-filler (Round 2 fetch) |
| --max-total-results | WEBGATE_MAX_TOTAL_RESULTS | 20 | Hard cap on total results per call |
| --debug | WEBGATE_DEBUG | false | Enable structured debug logging |
| --log-file | WEBGATE_LOG_FILE | (empty) | Log file path (empty = stderr) |
| --trace | WEBGATE_TRACE | false | Include content in summarized citations; also activates debug logging |
| --adaptive-budget | WEBGATE_ADAPTIVE_BUDGET | false | [EXPERIMENTAL] Proportional char allocation based on BM25 rank |
| --adaptive-budget-fetch-factor | WEBGATE_ADAPTIVE_BUDGET_FETCH_FACTOR | 3 | [EXPERIMENTAL] Generous pre-rank fetch multiplier |
| --llm-enabled | WEBGATE_LLM_ENABLED | false | Enable LLM features |
| --llm-base-url | WEBGATE_LLM_BASE_URL | http://localhost:11434/v1 | OpenAI-compatible endpoint |
| --llm-api-key | WEBGATE_LLM_API_KEY | (empty) | API key (empty for local models) |
| --llm-model | WEBGATE_LLM_MODEL | llama3.2 | Model name |
| --llm-timeout | WEBGATE_LLM_TIMEOUT | 30 | LLM request timeout (seconds) |
| --llm-expansion-enabled | WEBGATE_LLM_EXPANSION_ENABLED | true | Auto-expand queries into variants |
| --llm-summarization-enabled | WEBGATE_LLM_SUMMARIZATION_ENABLED | true | LLM summary with citations |
| --llm-rerank-enabled | WEBGATE_LLM_RERANK_ENABLED | false | LLM-assisted reranking |
| --llm-max-summary-words | WEBGATE_LLM_MAX_SUMMARY_WORDS | 0 | Summary word target (0 = auto) |
| --llm-input-budget-factor | WEBGATE_LLM_INPUT_BUDGET_FACTOR | 3 | LLM input budget multiplier |

🔌 Backends

| Backend | Auth | Notes |
|---|---|---|
| SearXNG | none | Self-hosted, recommended |
| Brave Search | API key | High quality, free tier available |
| Tavily | API key | AI-oriented snippets, free tier available |
| Exa | API key | Neural/semantic search, free tier available |
| SerpAPI | API key | Proxy for Google, Bing, DuckDuckGo and more, free tier available |

SearXNG quickstart (Docker)

docker run -d -p 8080:8080 --name searxng searxng/searxng

Then set WEBGATE_SEARXNG_URL=http://localhost:8080.

Exa notes

Exa uses neural (semantic) search by default, which is the primary reason to use it over keyword backends. use_autoprompt is hardcoded to false (not user-configurable) because mcp-webgate handles query expansion via its own LLM expander.

SerpAPI notes

engine selects the underlying search engine (google, bing, duckduckgo, yandex, yahoo). gl and hl significantly affect result quality for non-English queries.

🐛 Debug mode

When enabled, every tool call logs a structured entry:

  • fetch: URL, raw KB downloaded, clean KB returned, elapsed ms

  • query: queries used, results requested/fetched/failed, raw MB, clean KB, total elapsed ms

export WEBGATE_DEBUG=true             # log to stderr
export WEBGATE_LOG_FILE=/tmp/wg.log  # or log to file

🛡️ Protections summary

These protections are always active: they are the core value proposition and cannot be disabled.

| What could go wrong | How webgate stops it |
|---|---|
| Page dumps 2 MB of HTML | max_download_mb hard cap: download stops mid-stream, never buffered |
| Cleaned text is still huge | max_result_length char cap per page |
| Many results flood the context | max_query_budget distributes a fixed total across all results |
| Too many pages fetched | max_total_results hard cap |
| PDF / ZIP / DOCX requested | Binary extension filter runs before any network request |
| Slow or hanging connections | search_timeout + 5s connect timeout |
| Invisible Unicode tricks in content | Full regex sterilization pipeline (zero-width, BiDi, etc.) |
| Rate limiting (429 / 502 / 503) | Exponential retry backoff, respects Retry-After header |
| Unwanted domains | blocked_domains / allowed_domains filter |
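As an illustration of the "invisible Unicode tricks" protection, the sketch below removes zero-width and bidirectional control characters with a single regex. The character ranges and the sterilize name are assumptions; webgate's real pipeline is described only as a "full regex sterilization pipeline" and likely covers more cases.

```python
import re

# Zero-width characters and LRM/RLM (U+200B..U+200F), bidi embeddings and
# overrides (U+202A..U+202E), word joiner and invisible operators
# (U+2060..U+2064), bidi isolates (U+2066..U+2069), BOM (U+FEFF), and the
# soft hyphen (U+00AD). Illustrative selection only.
INVISIBLE = re.compile("[\u200b-\u200f\u202a-\u202e\u2060-\u2064\u2066-\u2069\ufeff\u00ad]")


def sterilize(text: str) -> str:
    """Strip invisible control characters, then collapse runs of spaces and tabs."""
    text = INVISIBLE.sub("", text)
    return re.sub(r"[ \t]+", " ", text).strip()


print(sterilize("pay\u200bpal\u202e.com"))
```

One trade-off of a blanket removal like this: zero-width joiners are legitimate in some scripts and emoji sequences, so a production filter may need to be more selective.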

📚 Documentation Structure

Integration Guides

🧪 Beta Status

mcp-webgate is in beta. Core functionality is stable and the server is used in production, but the configuration API may still change before 1.0.

Feedback is very welcome. If something doesn't work as expected, behaves oddly, or you have a use case that isn't covered:

→ Open an issue on GitHub

Bug reports, configuration questions, and feature requests all help shape the roadmap.

🤝 Contributing

Contributions are welcome! Please see CONTRIBUTING.md for detailed guidelines on:

  • Development setup and workflow

  • Code style and conventions

  • Testing requirements

  • Documentation standards

  • Pull request process

📄 License

MIT License. See LICENSE for details.


Need help? Check the documentation or open an issue on GitHub.
