webpeel
Provides a purpose-built domain extractor to fetch and clean data from Amazon product pages, removing boilerplate and ads for efficient content retrieval.
Offers specialized parsing for arXiv papers to extract clean markdown content and metadata from academic publications.
Includes a domain-specific extractor for GitHub that strips navigation, ads, and sidebars from repositories and READMEs to provide clean markdown.
Features a specialized parser for Reddit threads, enabling agents to extract clean discussion content and metadata without platform-specific noise.
Provides a custom extractor for Wikipedia pages to deliver structured markdown content while significantly reducing token usage.
Automatically detects and extracts YouTube transcripts and video metadata, optimized for use in AI agent research and processing.
The Problem
Every AI agent that touches the web rebuilds the same brittle stack: HTTP fetch → headless browser → anti-bot bypass → HTML cleanup → markdown conversion → token budgeting. Each layer fails differently. Sites change. Cloudflare rotates challenges. Your agent gets empty strings at 2 AM and your pipeline breaks.
WebPeel replaces that entire stack with one function call. It handles engine selection, anti-bot escalation, domain-specific extraction, and token optimization so your agent gets clean, structured data every time — without managing browsers, proxies, or parsing logic.
Quick Start
# Zero-install — just run it
npx webpeel "https://example.com"
# Search the web
npx webpeel search "latest AI agent frameworks"
# Crawl an entire site
npx webpeel crawl docs.example.com --max-pages 50
# Screenshot any page
npx webpeel screenshot "https://stripe.com/pricing" --full-page
# Ask a question about any page
npx webpeel ask "https://arxiv.org/abs/2401.00001" "What is the main contribution?"
Or install globally:
npm install -g webpeel
Use as a library:
import { peel } from 'webpeel';
const result = await peel('https://news.ycombinator.com');
console.log(result.markdown); // Clean markdown, ready for your LLM
console.log(result.metadata); // Title, tokens saved, timing, etc.
Use via API:
curl "https://api.webpeel.dev/v1/fetch?url=https://stripe.com/pricing" \
-H "Authorization: Bearer $WEBPEEL_API_KEY"
{
"url": "https://stripe.com/pricing",
"markdown": "# Stripe Pricing\n\n**Integrated per-transaction fees**...",
"metadata": {
"title": "Pricing & Fees | Stripe",
"tokens": 420,
"tokensOriginal": 8200,
"savingsPct": 94.9
}
}
Get your free API key → · No credit card required · 500 requests/week free
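For callers who prefer typed code over curl, the request and response shown earlier can be sketched in TypeScript. This is a minimal sketch, not the official SDK: the `FetchResponse` interface only mirrors the fields visible in the sample JSON, and `buildFetchUrl` and `fetchPage` are hypothetical helper names. It also percent-encodes the target URL, which matters once the target carries its own query string.

```typescript
// Response shape copied from the sample JSON; any other fields are unknown.
interface FetchResponse {
  url: string;
  markdown: string;
  metadata: {
    title: string;
    tokens: number;
    tokensOriginal: number;
    savingsPct: number;
  };
}

const API_BASE = "https://api.webpeel.dev/v1/fetch";

// Build the request URL, percent-encoding the target so its own
// query string is not mistaken for parameters of the API call.
function buildFetchUrl(target: string): string {
  const endpoint = new URL(API_BASE);
  endpoint.searchParams.set("url", target);
  return endpoint.toString();
}

// Hypothetical helper: performs the same GET as the curl example.
async function fetchPage(target: string, apiKey: string): Promise<FetchResponse> {
  const res = await fetch(buildFetchUrl(target), {
    headers: { Authorization: `Bearer ${apiKey}` },
  });
  if (!res.ok) {
    throw new Error(`WebPeel request failed with HTTP ${res.status}`);
  }
  return (await res.json()) as FetchResponse;
}
```

Calling `fetchPage("https://stripe.com/pricing", apiKey)` would then resolve to an object with the `markdown` and `metadata` fields from the sample response, assuming the API behaves as documented above.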
Why WebPeel
🧠 55+ Domain Extractors — Not Just HTML-to-Markdown
Generic scrapers convert raw HTML to markdown and call it a day. WebPeel has purpose-built extractors for 55+ domains, including Reddit, GitHub, YouTube, Amazon, arXiv, Hacker News, Wikipedia, Stack Overflow, Zillow, Polymarket, and ESPN. Each extractor understands the site's structure and returns clean, structured data without browser rendering.
⚡ 65–98% Token Savings
Domain extractors strip navigation, ads, sidebars, and boilerplate before content reaches your agent. Less context consumed = lower costs, faster inference, and longer agent chains.
Site | Raw HTML tokens | WebPeel tokens | Savings |
News article | 18,000 | 640 | 96% |
Reddit thread | 24,000 | 890 | 96% |
Wikipedia page | 31,000 | 2,100 | 93% |
GitHub README | 5,200 | 1,800 | 65% |
E-commerce product | 14,000 | 310 | 98% |
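The Savings column follows directly from the two token counts in each row. The sketch below recomputes it; `savingsPct` here is an illustrative helper (named after the `savingsPct` field in the API response), not a function exported by the library.

```typescript
// Percentage of tokens removed, rounded to `digits` decimal places.
function savingsPct(tokensOriginal: number, tokens: number, digits = 0): number {
  if (tokensOriginal <= 0) {
    throw new RangeError("tokensOriginal must be positive");
  }
  const pct = (1 - tokens / tokensOriginal) * 100;
  const scale = Math.pow(10, digits);
  return Math.round(pct * scale) / scale;
}

// Rows from the table above: [page type, raw HTML tokens, WebPeel tokens].
const rows: Array<[string, number, number]> = [
  ["News article", 18000, 640],
  ["Reddit thread", 24000, 890],
  ["Wikipedia page", 31000, 2100],
  ["GitHub README", 5200, 1800],
  ["E-commerce product", 14000, 310],
];

for (const [page, raw, peeled] of rows) {
  console.log(`${page}: ${savingsPct(raw, peeled)}%`);
}
```

With one decimal place, the same formula reproduces the API sample: `savingsPct(8200, 420, 1)` gives 94.9.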
🔄 6-Layer Engine Escalation
WebPeel doesn't just try one method — it automatically escalates through 6 engines until it gets a good result:
Simple HTTP → Domain API → Browser render → Stealth browser → Cloaked browser → Search cache fallback
No manual --render flags for most sites. WebPeel knows which sites need JavaScript, which need stealth, and which have anti-bot protection, and picks the right engine automatically.
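Conceptually, the escalation chain is a loop that tries each engine in order and stops at the first acceptable result. The sketch below illustrates that pattern only. It is not WebPeel's internal code: the `Engine` interface, the `looksGood` quality check, and its 50-character threshold are all assumptions made for the example.

```typescript
// Engine names follow the chain listed above; everything else is illustrative.
type EngineName = "http" | "domain-api" | "browser" | "stealth" | "cloaked" | "search-cache";

interface Engine {
  name: EngineName;
  run: (url: string) => Promise<string>; // resolves to extracted content
}

interface Attempt {
  engine: EngineName;
  ok: boolean;
  reason?: string;
}

interface EscalationResult {
  content: string;
  engine: EngineName;
  attempts: Attempt[];
}

// Assumed quality check: treat very short output as a failed fetch.
function looksGood(content: string, minChars = 50): boolean {
  return content.trim().length >= minChars;
}

// Try engines in order; return the first good result along with the attempt log.
async function escalate(url: string, engines: Engine[]): Promise<EscalationResult> {
  const attempts: Attempt[] = [];
  for (const engine of engines) {
    try {
      const content = await engine.run(url);
      if (looksGood(content)) {
        attempts.push({ engine: engine.name, ok: true });
        return { content, engine: engine.name, attempts };
      }
      attempts.push({ engine: engine.name, ok: false, reason: "content too short" });
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      attempts.push({ engine: engine.name, ok: false, reason });
    }
  }
  throw new Error(`All engines failed for ${url}`);
}
```

Ordering the engines from cheapest to most expensive means the costly browser-based engines only run when the lighter ones return nothing usable.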
🔌 Firecrawl-Compatible Migration Path
Already using Firecrawl-style workflows? WebPeel supports compatible /v1/scrape, /v2/scrape, /v1/crawl, /v1/search, and /v1/map endpoints, which makes migration dramatically easier than rebuilding your pipeline from scratch.
Agent-Native Integrations
MCP Server (Claude, Cursor, Windsurf, VS Code)
Give any MCP-compatible AI the ability to browse, search, and extract from the web.
{
"mcpServers": {
"webpeel": {
"command": "npx",
"args": ["-y", "webpeel", "mcp"],
"env": { "WEBPEEL_API_KEY": "wp_your_key_here" }
}
}
}
7 MCP tools exposed: webpeel_read · webpeel_find · webpeel_see · webpeel_extract · webpeel_monitor · webpeel_act · webpeel_crawl
LangChain
import { WebPeelLoader } from 'webpeel/integrations/langchain';
const loader = new WebPeelLoader({ url: 'https://example.com', render: true });
const docs = await loader.load();
LlamaIndex
import { WebPeelReader } from 'webpeel/integrations/llamaindex';
const reader = new WebPeelReader();
const docs = await reader.loadData('https://example.com');
Python SDK
pip install webpeel
from webpeel import WebPeel
wp = WebPeel(api_key="wp_...")
result = wp.fetch("https://example.com")
print(result.markdown)
Full Feature Set
Capability | CLI | API | Details |
Fetch & extract | | | Clean markdown from any URL |
Web search | | | DuckDuckGo (free) or Brave (BYOK) |
Smart search | — | | AI-powered structured results |
Crawl sites | | | Depth/page limits, rate control |
Screenshots | | | Full-page, multi-viewport, visual diff, filmstrip |
Structured extraction | | | JSON schema → structured data |
Q&A | | | Answer questions about any page |
Deep research | — | | Multi-query autonomous research |
Content monitoring | | | Change detection with webhooks |
Browser sessions | — | | Persistent sessions for login flows |
Browser actions | | actions field | Click, type, scroll, wait |
Batch scrape | | | Parallel multi-URL processing |
URL discovery | | | Sitemap and link discovery |
YouTube transcripts | auto-detected | auto-detected | Multiple export formats |
PDF extraction | auto-detected | auto-detected | Text, tables, structure |
Research agent | — | | Autonomous multi-step research |
Use Cases for Agent Builders
RAG pipelines — Fetch docs, articles, or entire sites as clean markdown ready for chunking, embedding, and retrieval.
Price monitoring — Track product pages across major commerce sites with structured extraction and change detection.
Competitive intel — Monitor competitor pages, pricing tables, and job boards. Visual diff screenshots catch layout changes CSS selectors would miss.
Research agents — Give Claude, Codex, Cursor, or your own agent grounded web access through the API or MCP server.
Lead enrichment — Pull company details, public links, and page structure from business sites without writing per-site parsers.
Content aggregation — Crawl and extract from communities, docs sites, and publications with domain-native extractors that understand each site's structure.
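For the RAG use case, clean markdown still has to be split before embedding. One simple approach is to cut at headings and pack sections into chunks under a token budget. The sketch below assumes roughly 4 characters per token instead of using a real tokenizer, and `chunkMarkdown` is a hypothetical helper, not part of the WebPeel API.

```typescript
// Rough token estimate: about 4 characters per token (an assumption, not a real tokenizer).
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Split markdown into sections at heading lines, then pack adjacent
// sections into chunks that stay under the token budget.
// A single section larger than the budget is emitted as its own oversized chunk.
function chunkMarkdown(markdown: string, maxTokens: number): string[] {
  const sections: string[] = [];
  let current: string[] = [];
  for (const line of markdown.split("\n")) {
    // A heading starts a new section, unless nothing has been collected yet.
    if (/^#{1,6}\s/.test(line) && current.length > 0) {
      sections.push(current.join("\n"));
      current = [];
    }
    current.push(line);
  }
  if (current.length > 0) sections.push(current.join("\n"));

  const chunks: string[] = [];
  let buffer = "";
  for (const section of sections) {
    const candidate = buffer ? `${buffer}\n${section}` : section;
    if (buffer && estimateTokens(candidate) > maxTokens) {
      chunks.push(buffer); // flush what fits, start a new chunk
      buffer = section;
    } else {
      buffer = candidate;
    }
  }
  if (buffer) chunks.push(buffer);
  return chunks;
}
```

Passing `result.markdown` from `peel()` into `chunkMarkdown(result.markdown, 512)` would yield heading-aligned chunks ready for an embedding step. Heading-like lines inside code fences are not treated specially here, so a production splitter would need to track fence state.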
Architecture
Your Agent
↓
WebPeel (npm / API / MCP)
↓
┌─────────────────────────────────┐
│ Engine Ranker │
│ HTTP → Domain API → Browser │
│ → Stealth → Cloaked → Cache │
├─────────────────────────────────┤
│ 55+ Domain Extractors │
│ reddit · github · youtube │
│ amazon · arxiv · zillow · ... │
├─────────────────────────────────┤
│ Content Pipeline │
│ Readability → Turndown → │
│ Token budgeting → Chunking │
└─────────────────────────────────┘
↓
Clean markdown / structured JSON
Reliability
WebPeel is built for production agent workflows, not just one-off demos.
Automated evals in-repo — smart search and fetch eval suites ship with the codebase
Post-deploy gate — critical checks run before calling a deploy healthy
Engine fallback chain — when one fetch method fails, WebPeel escalates instead of giving up
Multiple surfaces, one core — CLI, API, SDK, and MCP all ride the same extraction pipeline
Security
SSRF protection — blocks localhost, private IPs, metadata endpoints, file:// schemes
Helmet.js — HSTS, X-Frame-Options, nosniff, XSS protection on all responses
Webhook signing — HMAC-SHA256 on all outbound webhooks
API key hashing — SHA-256 with granular scopes
Rate limiting — sliding window, per-tier
Audit logging — every API call logged with IP, key, and action
GDPR compliant — DELETE /v1/account for full data erasure
Security policy → · SLA (99.9% uptime) →
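Because outbound webhooks are signed with HMAC-SHA256, a receiver can reject forged deliveries by recomputing the signature. The sketch below shows the general verification pattern using Node's built-in crypto module. It assumes the signature is a hex digest of the raw request body; the actual header name, encoding, and signed payload format are not stated here and should be checked against the WebPeel docs.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Compute the hex-encoded HMAC-SHA256 of the raw request body.
function signPayload(rawBody: string, secret: string): string {
  return createHmac("sha256", secret).update(rawBody, "utf8").digest("hex");
}

// Compare the received signature with the expected one in constant time.
// Assumption: the sender transmits a hex digest of the unmodified body.
function verifyWebhook(rawBody: string, receivedSignature: string, secret: string): boolean {
  const expected = Buffer.from(signPayload(rawBody, secret), "hex");
  const received = Buffer.from(receivedSignature, "hex");
  // timingSafeEqual throws on length mismatch, so check lengths first.
  if (expected.length !== received.length) return false;
  return timingSafeEqual(expected, received);
}
```

Verification must run on the raw bytes of the body before any JSON parsing, since re-serializing the payload can change whitespace or key order and invalidate the signature.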
Why teams choose WebPeel instead of stitching a stack together
Approach | What it gives you | Where it breaks down |
Raw HTTP + HTML parsing | Cheap, simple fetches | Falls apart on JS-heavy sites, anti-bot pages, and noisy HTML |
Pure browser automation | Maximum control | Expensive, slow, fragile, and high-maintenance for large-scale use |
Search-only APIs | Great discovery | Weak page extraction, limited structured output, limited downstream actions |
Single-purpose scrapers | Fast on one job | You end up composing 4–6 tools for real agent workflows |
WebPeel | Fetch + search + crawl + extraction + screenshots + monitoring in one layer | Opinionated toward agent workflows rather than generic scraping |
Links
📖 Documentation · 💰 Pricing · 🎮 Playground · 📝 Blog · 💬 Discussions · 🚀 Releases · 📊 Status · 🔒 Security · 📋 Changelog
Contributing
Pull requests welcome. Please open an issue first to discuss major changes.
git clone https://github.com/webpeel/webpeel.git
cd webpeel && npm install
npm run build && npm test
License
WebPeel SDK License — free for personal and commercial use with attribution.