intercept-mcp
intercept-mcp is an MCP server that gives AI assistants the ability to reliably read and search the web, requiring no API keys for basic usage.
Fetch any URL as clean markdown, with native handling for special content types:
Twitter/X tweets (text, author, media, engagement stats)
YouTube videos (title, channel, duration, views, description)
arXiv papers (metadata, authors, abstract, categories)
PDFs (text extraction)
Bypass access barriers (403s, paywalls, bot blocks) via a multi-tier fallback pipeline:
Tier 1: Jina Reader
Tier 2: Wayback Machine + Codetabs CORS proxy (parallel)
Tier 3: Raw fetch with browser headers
Tier 4: RSS, CrossRef, Semantic Scholar, HackerNews, Reddit metadata
Tier 5: Open Graph meta tags (last resort)
Control fetch depth via maxTier (1–5) to trade speed for thoroughness
Search the web using Brave Search API (with optional API key) or SearXNG fallback, returning 1–20 results per query
Automatic URL cleaning — strips 60+ tracking parameters, upgrades to HTTPS, removes AMP artifacts
Quality filtering — auto-rejects CAPTCHA pages, login walls, error pages, and content under 200 characters
Session-level LRU caching (up to 100 entries) to avoid redundant fetches
Compatible with Claude Code, Claude Desktop, Cursor, Windsurf, Cline, and any stdio MCP client
Provides a specialized handler for arXiv URLs to extract paper metadata, authors, abstracts, and categories.
Powers the search tool to perform web searches and return results when a Brave Search API key is provided.
Acts as a metadata and discussion fallback tier to enhance content extraction for specific URLs.
Utilizes RSS feeds as a fallback strategy to retrieve content and metadata when primary fetching methods fail.
Functions as a search backend for the search tool, enabling web querying through self-hosted SearXNG instances.
Serves as a fallback source to gather academic metadata and discussion details for processed URLs.
Includes a dedicated handler for YouTube links to retrieve video titles, channel information, duration, and descriptions.
intercept-mcp
Give your AI the ability to read the web. One command, no API keys required.
Without it, your AI hits a URL and gets a 403, a wall, or a wall of raw HTML. With intercept, it almost always gets the content — clean markdown, ready to use.
Handles tweets, YouTube videos (with transcripts when available), arXiv papers, PDFs, Wikipedia articles, and GitHub repos. If the first strategy fails, it tries up to 14 more before giving up.
Works with any MCP client: Claude Code, Claude Desktop, Codex, Cursor, Windsurf, Cline, and more.
Install
Claude Code
claude mcp add intercept -s user -- npx -y intercept-mcp
Codex
codex mcp add intercept -- npx -y intercept-mcp
Cursor
Settings → MCP → Add Server:
{
"mcpServers": {
"intercept": {
"command": "npx",
"args": ["-y", "intercept-mcp"]
}
}
}
Windsurf
Settings → MCP → Add Server → same JSON config as above.
Claude Desktop
Add to your claude_desktop_config.json:
{
"mcpServers": {
"intercept": {
"command": "npx",
"args": ["-y", "intercept-mcp"]
}
}
}
Other MCP clients
Any client that supports stdio MCP servers can run npx -y intercept-mcp.
No API keys needed for the fetch tool.
How it works
URLs are processed in four stages:
1. Site-specific handlers
Known URL patterns are routed to dedicated handlers before the fallback pipeline:
Pattern | Handler | What you get |
| Twitter/X | Tweet text, author, media, engagement stats (via third-party APIs) |
| YouTube | Title, channel, duration, views, description, transcript (when captions available) |
| arXiv | Paper metadata, authors, abstract, categories |
| PDF | Extracted text (text-layer PDFs only) |
| Wikipedia | Clean article content via Wikimedia REST API |
| GitHub | Raw README.md content |
| GitHub | Raw file content, code-fenced by language |
| GitHub | Issue/PR title, state, body, diff stats, comments (via GitHub API) |
| GitHub | Release notes (via GitHub API) |
The GitHub API endpoints work unauthenticated (60 requests/hour). Set GITHUB_TOKEN to raise the limit.
2. Shared cache (agentsweb.org)
Before hitting any fetcher, every request checks agentsweb.org — a global shared markdown cache for AI agents backed by a 9-source parallel fetch pipeline with JS/SPA rendering (React, Vue, Angular via Cloudflare Browser Run). If another agent already fetched this URL, you get the result in under 50ms.
Every successful fetch contributes back automatically. Entries gain trust through a self-healing consensus model: when independent instances fetch the same URL and confirm the same content, confidence increases.
Opt out entirely with INTERCEPT_SHARED_CACHE=false, or use read-only mode (consume but never contribute) with INTERCEPT_CACHE_READ_ONLY=true.
agentsweb.org API
agentsweb.org also exposes standalone endpoints for direct use:
/web?q= - search the web
/research?q= - search + fetch + cache in one call
/fetch?url= - fetch on demand, auto-cached
See agentsweb.org/docs for full API documentation.
3. Fallback pipeline
If no handler matches (or the handler returns nothing), the URL enters the multi-tier pipeline:
Tier | Fetcher | Strategy |
0 | agentsweb.org | Global shared markdown cache — instant if another agent already fetched this URL |
1 | Cloudflare Browser Run | JS/SPA rendering + markdown extraction — also powers agentsweb.org (optional, needs API token) |
1 | Jina Reader | Clean markdown extraction service |
2 | Wayback Machine | Archived version from archive.org |
2 | Arquivo.pt | Portuguese web archive (broad international coverage) |
2 | Common Crawl | Petabyte web archive read from Common Crawl's index + S3 — not subject to the origin's rate limits, bot detection, or paywall |
2 | Codetabs | CORS proxy |
3 | Markdown endpoint | Asks the site for a native markdown version ( |
3 | archive.ph | Archived snapshots via timemap API + stealth TLS fetch |
3 | Raw fetch | Direct GET with browser headers + Turndown markdown conversion |
3 | Stealth fetch | Browser TLS fingerprint impersonation via got-scraping (opt-in, see below) |
3 | FlareSolverr | Real-browser challenge solver for Cloudflare/DDoS-Guard (opt-in, needs a FlareSolverr instance) |
3 | Web unlocker | Commercial unlocker API — residential rotation + rendering + CAPTCHA (opt-in, BYO key, paid per request) |
4 | RSS, CrossRef, Semantic Scholar, HN, Reddit | Metadata / discussion fallbacks |
5 | OG Meta | Open Graph tags (guaranteed fallback) |
Tier 2 fetchers run in parallel. When multiple succeed, the highest quality result wins. All other tiers run sequentially.
All fetchers return proper Markdown (headings, links, bold, tables, code blocks) via Turndown — not plain text.
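The tier ordering described above can be sketched roughly as follows. This is a minimal illustration, not intercept-mcp's actual implementation: `Result`, `run_pipeline`, the numeric `quality` score, and the fetcher signature are all hypothetical names chosen for the example.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Dict, FrozenSet, List, Optional


@dataclass
class Result:
    source: str     # which fetcher produced the content
    markdown: str   # extracted content
    quality: float  # hypothetical score, higher is better


# Hypothetical fetcher shape: takes a URL, returns a Result or None on a miss.
Fetcher = Callable[[str], Awaitable[Optional[Result]]]


async def run_pipeline(
    url: str,
    tiers: Dict[int, List[Fetcher]],
    max_tier: int = 5,
    parallel_tiers: FrozenSet[int] = frozenset({2}),
) -> Optional[Result]:
    """Walk tiers in order; parallel tiers race and keep the best result."""
    for tier in sorted(tiers):
        if tier > max_tier:
            break
        fetchers = tiers[tier]
        if tier in parallel_tiers:
            # Run every fetcher in the tier at once; exceptions count as misses.
            outcomes = await asyncio.gather(
                *(fetch(url) for fetch in fetchers), return_exceptions=True
            )
            hits = [r for r in outcomes if isinstance(r, Result)]
            if hits:
                return max(hits, key=lambda r: r.quality)
        else:
            for fetch in fetchers:
                try:
                    result = await fetch(url)
                except Exception:
                    continue  # treat a crashed fetcher like a miss
                if result is not None:
                    return result
    return None
```

Passing a lower `max_tier` stops the walk early, which is the speed-for-thoroughness trade that the `maxTier` parameter exposes.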
4. Caching
Results are cached in-memory with TTL (60 min for successes, 5 min for failures). Max 250 entries with LRU eviction. Failed URLs are cached to prevent re-attempting known-dead URLs. All three knobs are configurable via INTERCEPT_CACHE_TTL_MS, INTERCEPT_CACHE_FAILURE_TTL_MS, and INTERCEPT_CACHE_SIZE.
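A cache with these properties (separate TTLs for successes and failures, plus LRU eviction) might look like the sketch below. The class and method names are hypothetical; only the default values (60 min, 5 min, 250 entries) come from the description above.

```python
import time
from collections import OrderedDict
from typing import Callable, Optional, Tuple


class FetchCache:
    """Illustrative TTL + LRU cache; failures are stored as None."""

    def __init__(
        self,
        max_entries: int = 250,
        success_ttl: float = 60 * 60,  # seconds, i.e. 60 min
        failure_ttl: float = 5 * 60,   # seconds, i.e. 5 min
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_entries = max_entries
        self.success_ttl = success_ttl
        self.failure_ttl = failure_ttl
        self._clock = clock
        self._entries = OrderedDict()  # url -> (expires_at, content or None)

    def put(self, url: str, content: Optional[str]) -> None:
        ttl = self.success_ttl if content is not None else self.failure_ttl
        self._entries[url] = (self._clock() + ttl, content)
        self._entries.move_to_end(url)
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict least recently used

    def get(self, url: str) -> Tuple[bool, Optional[str]]:
        """Return (hit, content). A hit with content None is a cached failure."""
        entry = self._entries.get(url)
        if entry is None:
            return (False, None)
        expires_at, content = entry
        if self._clock() >= expires_at:
            del self._entries[url]
            return (False, None)
        self._entries.move_to_end(url)  # mark as recently used
        return (True, content)
```

Returning a `(hit, content)` pair lets the caller tell "never fetched" apart from "fetched recently and known to fail", which is what makes the short failure TTL useful.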
Tools
fetch
Fetch a URL and return its content as clean markdown.
url (string, required): URL to fetch
maxTier (number, optional, 1-5): Stop at this tier for speed-sensitive cases
maxLength (number, optional, default 50000): Maximum characters to return
startIndex (number, optional, default 0): Character offset for paginating long content
noCache (boolean, optional): Skip session and shared caches and fetch live
Long pages are truncated at maxLength with a notice telling the agent which startIndex continues the content. Structured output reports source, quality, contentLength, truncated, nextStartIndex, and cacheAgeSeconds so agents can branch on them programmatically.
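The truncation and continuation behaviour can be pictured with the following sketch. It models only the `maxLength` / `startIndex` arithmetic. The function names are hypothetical, and treating `contentLength` as the full document length is an assumption.

```python
from typing import Callable, Dict, List, Optional


def paginate(content: str, start_index: int = 0, max_length: int = 50000) -> Dict:
    """Return one window of content plus the fields an agent can branch on."""
    chunk = content[start_index:start_index + max_length]
    end = start_index + len(chunk)
    truncated = end < len(content)
    return {
        "text": chunk,
        "contentLength": len(content),  # assumption: length of the whole page
        "truncated": truncated,
        "nextStartIndex": end if truncated else None,
    }


def read_all(fetch_page: Callable[[int, int], Dict], max_length: int = 50000) -> str:
    """Follow nextStartIndex until the page is exhausted."""
    parts: List[str] = []
    start: Optional[int] = 0
    while start is not None:
        page = fetch_page(start, max_length)
        parts.append(page["text"])
        start = page["nextStartIndex"]
    return "".join(parts)
```

An agent that only needs the first screen of a page can stop after one call; one that needs everything keeps passing `nextStartIndex` back as `startIndex`.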
Direct image URLs (.png, .jpg, .gif, .webp, up to 5 MB) are returned as an MCP image block instead of text, so the agent's own vision model can read charts, diagrams, screenshots, and scanned documents. The structured output reports source: "image", mimeType, and bytes.
fetch_batch
Fetch up to 10 URLs in parallel, each through the same handler/fallback chain.
urls (string[], required, 1-10): URLs to fetch
maxTier, noCache: as in fetch
maxLength (number, optional, default 20000): Per-URL character budget
research
Search the web and fetch the top results in one call — replaces a search followed by several fetches.
query (string, required): Search query
count (number, optional, 1-5, default 3): Results to fetch
maxLength (number, optional, default 20000): Per-result character budget
site (string, optional): Restrict to a domain
freshness (string, optional): day, week, month, or year
search
Search the web and return results.
query (string, required): Search query
count (number, optional, 1-20, default 5): Number of results
site (string, optional): Restrict results to a domain
freshness (string, optional): day, week, month, or year
page (number, optional, 1-10): Results page for pagination
Uses Brave Search API if BRAVE_API_KEY is set, then SearXNG if SEARXNG_URL is set, then DuckDuckGo as an unreliable last resort. freshness and page are ignored by the DuckDuckGo fallback.
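The backend precedence reads as a simple chain of checks. The sketch below is one way to express it. The function names and the backend labels are illustrative, not part of intercept-mcp.

```python
from typing import Dict, Mapping


def pick_search_backend(env: Mapping[str, str]) -> str:
    """Brave if a key is set, else SearXNG if a URL is set, else DuckDuckGo."""
    if env.get("BRAVE_API_KEY"):
        return "brave"
    if env.get("SEARXNG_URL"):
        return "searxng"
    return "duckduckgo"


def effective_params(backend: str, params: Dict) -> Dict:
    """Drop the parameters that the DuckDuckGo fallback ignores."""
    if backend == "duckduckgo":
        return {k: v for k, v in params.items() if k not in ("freshness", "page")}
    return dict(params)
```

In practice the first argument would be `os.environ`. When both variables are set, Brave wins.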
extract
Extract specific values from a page as JSON instead of markdown prose — for when you need particular data, not the whole page. Honors per-domain auth and proxies.
url (string, required): The URL to extract from
selectors (object, optional): Map of field name → CSS selector. Each value is either a selector string (returns the first match's text) or { selector, attr?, all? }, where attr extracts an attribute (e.g. href) and all: true returns every match as an array.
tables (boolean, optional): Convert every HTML table to an array of row objects (defaults to true when no selectors are given).
{
"url": "https://shop.example.com/item",
"selectors": {
"title": "h1",
"price": ".price",
"images": { "selector": "img.gallery", "attr": "src", "all": true }
}
}
Returns the extracted fields and/or tables as structured output.
Resources
intercept://session/recent
Markdown list of URLs fetched and cached in this session, most recent first. Re-fetching any of them is instant.
Prompts
research-topic
Search for a topic and fetch the top results for a multi-source summary.
topic (string): The topic to research
depth (string, default "3"): Number of top results to fetch
extract-article
Fetch a URL and extract the key points from the content.
url (string): The URL to fetch and summarize
Environment variables
Variable | Required | Description |
| No | Brave Search API key for search |
| No | Self-hosted SearXNG instance URL (recommended) |
| No | GitHub token raising API rate limits for the issue/PR/release handler |
| No | JSON map of domain → headers/cookies, to fetch content you're logged in to (see Per-domain authentication) |
| No | Cloudflare API token with "Browser Rendering - Edit" permission |
| No | Cloudflare account ID (required if |
| No | Set to |
| No | URL of a FlareSolverr instance (e.g. |
| No | GET template (with a |
| No | Set to |
| No | Set to |
| No | In-memory cache TTL for successful fetches in ms (default |
| No | In-memory cache TTL for failed fetches in ms (default |
| No | Max in-memory cache entries (default |
| No | Standard proxy passthrough — routes all outbound fetches (including stealth) through the proxy. Honors |
| No | Comma/space-separated list of HTTP(S) proxies to rotate across, with automatic retry through the next proxy on a blocked response. Takes precedence over |
Search: Has a DuckDuckGo fallback but it's rate-limited and unreliable. For production use, self-host SearXNG and set SEARXNG_URL (see below), or get a Brave Search API key.
Fetch: Works without any keys. Set CF_API_TOKEN + CF_ACCOUNT_ID to enable Cloudflare Browser Run (formerly Browser Rendering) for JavaScript-heavy pages (SPAs, React sites).
Stealth fetch (USE_STEALTH_FETCH)
Use at your own risk. When enabled, this adds a fetcher that impersonates real browser TLS fingerprints (Chrome/Firefox cipher suites, HTTP/2 settings, header ordering) using got-scraping. This can bypass bot detection and CAPTCHA triggers on sites that would otherwise block automated requests.
This fetcher runs at tier 3 after the regular raw fetch. If the raw fetch gets blocked (CAPTCHA, Cloudflare challenge, 403), the stealth fetcher retries with browser impersonation.
This may violate the terms of service of some websites. The authors of intercept-mcp take no responsibility for how this feature is used. It is disabled by default and must be explicitly opted into.
Challenge solving (FLARESOLVERR_URL)
The stealth fetcher impersonates a browser's TLS fingerprint, but it can't execute a JavaScript challenge — so sites protected by a Cloudflare "Checking your browser" / DDoS-Guard interstitial still block it. FlareSolverr runs a real headless browser that solves the challenge and returns the page HTML.
Run it (Docker):
docker run -d -p 8191:8191 ghcr.io/flaresolverr/flaresolverr:latest
Then set FLARESOLVERR_URL=http://localhost:8191. It runs at tier 3 as a last resort after the raw and stealth fetchers, and only when this variable is set. Solving a challenge can take 30–60s, so it's the slowest fetcher, but it recovers pages nothing else can.
Commercial web unlocker (WEB_UNLOCKER_URL)
For the hardest targets (sites that need residential IP rotation, real-browser rendering, and CAPTCHA handling together), a commercial unlocker is the pragmatic answer. intercept-mcp supports any unlocker that exposes a "GET this URL, return the HTML" endpoint, via a template that contains your API key and a {url} placeholder for the target:
# ScrapingBee
WEB_UNLOCKER_URL='https://app.scrapingbee.com/api/v1/?api_key=KEY&render_js=true&url={url}'
# ScraperAPI
WEB_UNLOCKER_URL='https://api.scraperapi.com/?api_key=KEY&render=true&url={url}'
# ZenRows
WEB_UNLOCKER_URL='https://api.zenrows.com/v1/?apikey=KEY&js_render=true&url={url}'
intercept substitutes the (URL-encoded) target for {url} and converts the returned HTML (or JSON wrapping it) to markdown. It runs at tier 3 as a paid last resort after the free fetchers, only when this variable is set, and the credentials in your template are only ever sent to the unlocker, never to the target. Bright Data's proxy-based Web Unlocker is just an authenticated proxy, so use HTTPS_PROXY / INTERCEPT_PROXIES for that instead. This bills per request.
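As a rough illustration of the {url} substitution, the sketch below percent-encodes the target and drops it into the template. `build_unlocker_request` is a hypothetical helper, and the exact encoding intercept-mcp applies is an assumption; here every reserved character is encoded so the target's own query string cannot leak into the unlocker's parameters.

```python
from urllib.parse import quote


def build_unlocker_request(template: str, target_url: str) -> str:
    """Fill the {url} placeholder with the percent-encoded target URL."""
    if "{url}" not in template:
        raise ValueError("template must contain a {url} placeholder")
    # safe="" also encodes ':' '/' '?' '&' '=' in the target.
    return template.replace("{url}", quote(target_url, safe=""))
```

Without the encoding step, a target such as `https://shop.example.com/item?id=1&ref=a` would hand `ref=a` to the unlocker as if it were one of its own options.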
Bring-your-own proxy (HTTPS_PROXY)
If raw fetches start getting flagged, the most effective fix is usually a clean outbound IP — not a fancier fingerprint. intercept-mcp honors the standard HTTPS_PROXY / HTTP_PROXY / NO_PROXY env vars, so you can route all outbound traffic through whatever proxy you already have:
HTTPS_PROXY=http://user:pass@proxy.example.com:8080 npx intercept-mcp
This works with any HTTP(S) proxy: a self-hosted Squid, a Tailscale exit node, a $5 VPS running 3proxy, or commercial residential proxies (Bright Data, Oxylabs, etc.). The stealth fetcher and got-scraping calls also pick this up automatically.
Proxy rotation (INTERCEPT_PROXIES)
A single proxy still presents a single IP, which can itself get flagged under load. Set INTERCEPT_PROXIES to a comma- or space-separated list and intercept-mcp round-robins across them, automatically retrying through the next proxy when a request comes back blocked (HTTP 403, 429, 451, 503) or errors:
INTERCEPT_PROXIES="http://user:pass@p1.example.com:8080,http://user:pass@p2.example.com:8080,http://p3.example.com:8080" npx intercept-mcp
Requests spread across the list, and a blocked response is retried through a different egress (up to 3 attempts) before giving up, so a handful of cheap proxies, or a rotating residential endpoint listed multiple times, behave like a pool. INTERCEPT_PROXIES takes precedence over HTTPS_PROXY, applies per request (so the stealth and archive.ph got-scraping calls rotate too), and accepts HTTP(S) proxies. Invalid entries are ignored.
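The rotation and retry behaviour could be modelled like this. It is a sketch under stated assumptions: `ProxyRotator`, `fetch_with_rotation`, and the `send` callback are hypothetical, and only the blocked status codes and the 3-attempt limit are taken from the text above.

```python
import itertools
from typing import Callable, Optional, Tuple

BLOCKED_STATUSES = {403, 429, 451, 503}


class ProxyRotator:
    """Round-robin over a comma- or space-separated proxy list."""

    def __init__(self, proxies_env: str) -> None:
        candidates = proxies_env.replace(",", " ").split()
        # Entries that are not HTTP(S) proxy URLs are ignored.
        self.proxies = [p for p in candidates if p.startswith(("http://", "https://"))]
        self._cycle = itertools.cycle(self.proxies) if self.proxies else None

    def next_proxy(self) -> Optional[str]:
        return next(self._cycle) if self._cycle else None


def fetch_with_rotation(
    url: str,
    rotator: ProxyRotator,
    send: Callable[[str, Optional[str]], int],
    max_attempts: int = 3,
) -> Optional[Tuple[int, Optional[str]]]:
    """Try up to max_attempts egresses; return (status, proxy) or None."""
    for _ in range(max_attempts):
        proxy = rotator.next_proxy()
        try:
            status = send(url, proxy)  # hypothetical transport returning a status code
        except OSError:
            continue  # connection errors also move on to the next proxy
        if status not in BLOCKED_STATUSES:
            return (status, proxy)
    return None
```

Because the rotator is shared across requests, consecutive fetches start from different proxies even when none of them is blocked.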
Per-domain authentication (INTERCEPT_AUTH)
Most of the web is behind a login. INTERCEPT_AUTH lets you attach your own headers or cookies to requests for a specific origin, so the fetch tools can read content you're legitimately signed in to — a paid subscription, a private dashboard, an intranet, an authenticated API.
It's a JSON object mapping a domain to a header map. A domain also matches its subdomains:
INTERCEPT_AUTH='{
"nytimes.com": { "Cookie": "nyt-s=...; nyt-a=..." },
"api.acme.com": { "Authorization": "Bearer eyJ..." }
}' npx intercept-mcp
To get a cookie: open the site while logged in, open DevTools → Network, and copy the Cookie request header from any request to that domain.
Security model — read this before using it
Credentials only ever go to the configured origin. Headers are keyed on the actual host being contacted. When intercept fetches a page through Jina, a web archive, a CORS proxy, FlareSolverr, or the shared cache, those intermediaries connect to a different host, so your cookie/token is never sent to them — only a direct fetch of the origin carries it.
Authenticated responses never touch the shared cache. When a request matches an INTERCEPT_AUTH entry, intercept does not read from or write to the public agentsweb.org cache for that URL, so your private/paid content is never published, and you always get your authenticated view rather than a stranger's anonymous copy. (The in-process session cache still applies.)
Treat the value as a secret. It contains live session tokens. Environment variables are visible to the process and its children and may be captured in shell history or process listings; prefer a secrets manager or a non-committed env file, and never commit it. Cookies expire, so you'll periodically need to refresh them.
You are responsible for authorized use. Only supply credentials for accounts you own or are permitted to use, and respect each site's terms of service. intercept simply forwards the headers you provide.
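The host-keyed matching described in the security model can be sketched as below. `auth_headers_for` is a hypothetical helper. Picking the most specific matching domain when several match is an assumption, since the text only says that a domain also matches its subdomains.

```python
from typing import Dict
from urllib.parse import urlsplit


def auth_headers_for(url: str, auth_map: Dict[str, Dict[str, str]]) -> Dict[str, str]:
    """Return headers only when the host actually contacted matches an entry."""
    host = (urlsplit(url).hostname or "").lower()
    best = None
    for domain in auth_map:
        d = domain.lower()
        # Match the domain itself or a true subdomain (dot boundary required).
        if host == d or host.endswith("." + d):
            if best is None or len(d) > len(best):
                best = domain  # assumption: the longest match wins
    return dict(auth_map[best]) if best is not None else {}
```

Since the lookup uses the host of the request being made, a fetch routed through an intermediary such as `https://r.jina.ai/https://www.nytimes.com/...` resolves to the intermediary's host and gets no credentials. The dot-boundary check also keeps `evilnytimes.com` from matching `nytimes.com`.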
Self-hosting SearXNG
For reliable search, self-host SearXNG with Docker. A config is included in the repo:
git clone https://github.com/bighippoman/intercept-mcp.git
cd intercept-mcp/searxng && docker compose up -d
Then set SEARXNG_URL=http://localhost:8888. No rate limits, no CAPTCHAs; it aggregates Google + Bing + DuckDuckGo + Wikipedia + Brave.
Or use any existing SearXNG instance — just set SEARXNG_URL to its URL.
URL normalization
Incoming URLs are automatically cleaned:
Strips 60+ tracking params (UTM, click IDs, analytics, A/B testing, etc.)
Removes hash fragments
Upgrades to HTTPS
Cleans AMP artifacts
Preserves functional params (ref, format, page, offset, limit)
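A simplified version of these cleaning steps is sketched below. The tracking-parameter set is a small illustrative subset (the full list of 60+ is not reproduced here), AMP cleanup is left out, and `clean_url` is a hypothetical name.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Illustrative subset only; the real list covers 60+ parameters.
TRACKING_PREFIXES = ("utm_",)
TRACKING_PARAMS = {"fbclid", "gclid", "msclkid"}
# Functional parameters that are kept, as listed above.
PRESERVED = {"ref", "format", "page", "offset", "limit"}


def clean_url(url: str) -> str:
    """Upgrade to HTTPS, drop tracking params, and remove the hash fragment."""
    parts = urlsplit(url)
    scheme = "https" if parts.scheme == "http" else parts.scheme
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key in PRESERVED
        or not (key.lower() in TRACKING_PARAMS or key.lower().startswith(TRACKING_PREFIXES))
    ]
    # An empty final component discards the fragment.
    return urlunsplit((scheme, parts.netloc, parts.path, urlencode(kept), ""))
```

Normalizing before the cache lookup means two links to the same article that differ only in campaign tags resolve to one cache entry.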
SSRF protection
Agents pass URLs taken from untrusted web content, so the fetch tools refuse anything pointing at local or internal infrastructure: loopback and private IPv4/IPv6 ranges, link-local addresses (including the 169.254.169.254 cloud metadata endpoint), CGNAT, multicast/reserved ranges, and local hostnames (localhost, *.local, *.internal, *.home.arpa). Literal IPs are checked, including alternate notations (decimal, hex) normalized by the URL parser; DNS is not resolved, so public hostnames pointing at private IPs are not caught.
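A literal-address check in this spirit might look like the following sketch, built on Python's standard `ipaddress` module. `is_blocked` is a hypothetical name. Like the behaviour described above, it does not resolve DNS. Unlike it, Python's URL parser does not normalize decimal or hex IP notations, so those forms are not covered here.

```python
import ipaddress
from urllib.parse import urlsplit

CGNAT = ipaddress.ip_network("100.64.0.0/10")
LOCAL_SUFFIXES = (".local", ".internal", ".home.arpa")


def is_blocked(url: str) -> bool:
    """True when the URL's host is a local name or a non-public literal IP."""
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    if not host:
        return True  # nothing to fetch
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not a literal IP: only the hostname rules apply (no DNS lookup).
        return host == "localhost" or host.endswith(LOCAL_SUFFIXES)
    return (
        ip.is_loopback
        or ip.is_private
        or ip.is_link_local      # includes 169.254.169.254
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
        or (ip.version == 4 and ip in CGNAT)
    )
```

A stricter variant would resolve the hostname and re-run the IP checks on each returned address, at the cost of an extra DNS round trip per fetch.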
Content quality detection
Each fetcher result is scored for quality. Automatic fail on:
CAPTCHA / Cloudflare challenges
Login walls
HTTP error pages in body
Content under 200 characters
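The automatic-fail rules could be approximated as below. Only the 200-character threshold comes from the list above; the marker phrases, the error pattern, and the function name are hypothetical stand-ins for whatever the real detector uses.

```python
import re
from typing import Optional

MIN_CHARS = 200  # threshold from the list above

# Hypothetical markers; the actual detection rules are not documented here.
CAPTCHA_MARKERS = ("checking your browser", "verify you are human", "captcha")
LOGIN_MARKERS = ("sign in to continue", "log in to continue")
ERROR_PATTERN = re.compile(
    r"\b(403 forbidden|404 not found|502 bad gateway)\b", re.IGNORECASE
)


def rejection_reason(markdown: str) -> Optional[str]:
    """Return why a result should be rejected, or None if it looks usable."""
    text = markdown.strip()
    head = text[:2000].lower()  # only inspect the top of the page
    if any(marker in head for marker in CAPTCHA_MARKERS):
        return "captcha"
    if any(marker in head for marker in LOGIN_MARKERS):
        return "login_wall"
    if ERROR_PATTERN.search(head):
        return "error_page"
    if len(text) < MIN_CHARS:
        return "too_short"
    return None
```

Simple substring markers like these will occasionally misfire on a genuine article that discusses CAPTCHAs near the top, so a production detector would need more context than this sketch uses.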
Requirements
Node.js >= 20
No API keys required for basic use