Webustler
Click on "Install Server".
Wait a few minutes for the server to deploy. Once ready, it will show a "Started" state.
In the chat, type @ followed by the MCP server name and your instructions, e.g., "@Webustler Scrape https://example.com and return markdown".
That's it! The server will respond to your query, and you can continue using it as needed.
Why Webustler?
Most scraping tools fail on protected sites. Webustler doesn't.
❌ Other Tools
Block on Cloudflare
Require API keys
Charge per request
Return messy HTML
No retry logic
✅ Webustler
Bypasses protection automatically
100% free & self-hosted
Unlimited requests
Clean, LLM-ready markdown
Smart retry with fallback
Comparison
Feature | Webustler | Firecrawl | ScrapeGraphAI | Crawl4AI | Deepcrawl |
Anti-bot bypass | ✅ | ⚠️ | ? | ⚠️ | ❌ |
Cloudflare support | ✅ | ⚠️ | ? | ⚠️ | ? |
No API key needed | ✅ | ❌ | ❌ | ? | ⚠️ |
Self-hosted | ✅ | ? | ? | ? | ✅ |
MCP native | ✅ | ? | ? | ? | ❌ |
Token optimized | ✅ | ? | ? | ? | ? |
Rich metadata | ✅ | ? | ⚠️ | ⚠️ | ? |
Link categorization | ✅ | ? | ? | ? | ? |
File detection | ✅ | ⚠️ | ? | ? | ? |
Reading time | ✅ | ? | ? | ? | ? |
Zero config | ✅ | ❌ | ? | ❌ | ? |
Free forever | ✅ | ? | ? | ? | ? |
Features
Smart Fallback System
Primary method fails? Automatically retries with anti-bot bypass. No manual intervention needed.
Rich Metadata Extraction
Title, description, author
Open Graph & Twitter Cards
Published/modified time
Language, keywords, robots
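As an illustration of the fields listed above, here is a minimal sketch of metadata extraction using only Python's standard library. It is not Webustler's actual implementation: the `MetaExtractor` class, the `extract_metadata` helper, and the key names are assumptions made for this example.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collect <title>, <html lang>, and <meta> name/property pairs (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "html" and attrs.get("lang"):
            self.meta["language"] = attrs["lang"]
        elif tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Standard tags use name=..., Open Graph tags use property=...
            key = attrs.get("name") or attrs.get("property")
            if key and attrs.get("content") is not None:
                self.meta[key.lower()] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.meta["title"] = self.meta.get("title", "") + data.strip()


def extract_metadata(html: str) -> dict:
    """Return a flat dict such as {'title': ..., 'author': ..., 'og:title': ...}."""
    parser = MetaExtractor()
    parser.feed(html)
    return parser.meta
```

Open Graph and Twitter Card values end up under their original keys (`og:title`, `twitter:card`), so grouping them into nested sections is a simple post-processing step.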
Link Categorization
Separates internal links (same domain) from external links. Perfect for crawling workflows.
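The internal/external split can be sketched as follows. This is an assumption about how such a split might work, not Webustler's code: `categorize_links` is a hypothetical helper, and it treats "same domain" as an exact host match, so subdomains count as external.

```python
from urllib.parse import urljoin, urlparse


def categorize_links(page_url: str, hrefs: list[str]) -> tuple[list[str], list[str]]:
    """Split hrefs into (internal, external) relative to page_url's host."""
    base_host = urlparse(page_url).netloc.lower()
    internal, external = [], []
    for href in hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar
        target = internal if parsed.netloc.lower() == base_host else external
        if absolute not in target:  # de-duplicate while keeping order
            target.append(absolute)
    return internal, external
```

A crawler would typically queue the internal list for further fetching and keep the external list as references.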
File Download Detection
Detects PDFs, images, archives, and other file types. Returns structured info instead of garbled binary.
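One way to picture this check is the sketch below, which classifies a URL by its file extension and lets an HTML `Content-Type` header override the guess. The `FILE_CATEGORIES` map, the `detect_file` helper, and the returned field names are all hypothetical; Webustler's real detection rules may differ.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import urlparse

# Hypothetical category map for illustration; extend as needed.
FILE_CATEGORIES = {
    "document": {".pdf", ".doc", ".docx", ".xls", ".xlsx"},
    "image": {".png", ".jpg", ".jpeg", ".gif", ".webp"},
    "archive": {".zip", ".tar", ".gz", ".7z", ".rar"},
}


def detect_file(url: str, content_type: str = "") -> Optional[dict]:
    """Return structured file info, or None when the URL looks like a web page."""
    mime = content_type.split(";")[0].strip().lower()
    if mime in ("text/html", "application/xhtml+xml"):
        return None  # the server says this is a page, whatever the extension
    path = PurePosixPath(urlparse(url).path)
    ext = path.suffix.lower()
    for category, extensions in FILE_CATEGORIES.items():
        if ext in extensions:
            return {
                "url": url,
                "filename": path.name,
                "category": category,
                "contentType": mime or None,
            }
    return None
```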
Token-Optimized Output
Removes ads, sidebars, popups, base64 images, cookie banners, and all the junk LLMs don't need.
Table Preservation
Data tables stay intact in markdown. No more broken layouts.
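To show what keeping a table intact means in practice, here is a small sketch that converts a simple HTML table into a markdown table with the standard library. It is an illustration under simplifying assumptions (one flat table, first row used as the header, no `colspan`/`rowspan`), and `table_to_markdown` is a hypothetical helper rather than part of Webustler.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect table rows as lists of cell text."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # text buffer while inside <td>/<th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            text = " ".join("".join(self._cell).split())  # normalize whitespace
            if not self.rows:
                self.rows.append([])  # tolerate cells without an enclosing <tr>
            self.rows[-1].append(text.replace("|", "\\|"))  # escape pipes
            self._cell = None


def table_to_markdown(html: str) -> str:
    """Render the collected rows as a markdown table; the first row is the header."""
    parser = TableCollector()
    parser.feed(html)
    if not parser.rows:
        return ""
    header, *body = parser.rows
    lines = ["| " + " | ".join(header) + " |", "|" + "|".join("---" for _ in header) + "|"]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)
```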
Content Analysis
Word count and reading time calculated automatically. Know your content at a glance.
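A rough sketch of how such numbers can be derived is shown below. The reading speed of 200 words per minute and the rounding-up rule are assumptions for illustration only; the source does not state which values Webustler uses. The `analyze` helper is hypothetical.

```python
import math

WORDS_PER_MINUTE = 200  # assumed reading speed, not taken from Webustler


def analyze(text: str) -> dict:
    """Count whitespace-separated words and estimate reading time in minutes."""
    word_count = len(text.split())
    minutes = max(1, math.ceil(word_count / WORDS_PER_MINUTE))
    unit = "min" if minutes == 1 else "mins"
    return {"wordCount": word_count, "readingTime": f"{minutes} {unit}"}
```

Under these assumptions, a 1542-word article comes out at 8 mins, which happens to match the sample output later in this README, although other speeds and rounding rules would match it too.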
Installation
git clone https://github.com/drruin/webustler.git
cd webustler
docker build -t webustler .
MCP Configuration
Claude Desktop
Add to your claude_desktop_config.json:
{
  "mcpServers": {
    "webustler": {
      "command": "docker",
      "args": ["run", "-i", "--rm", "webustler"]
    }
  }
}
Claude Code
claude mcp add webustler -- docker run -i --rm webustler
Cursor
Add to your Cursor MCP settings:
{
  "mcpServers": {
    "webustler": {
      "command": "docker",
      "args": ["run", "-i", "--rm", "webustler"]
    }
  }
}
Windsurf
Add to your Windsurf MCP config:
{
  "mcpServers": {
    "webustler": {
      "command": "docker",
      "args": ["run", "-i", "--rm", "webustler"]
    }
  }
}
With Custom Timeout
Pass the TIMEOUT environment variable (in seconds):
{
  "mcpServers": {
    "webustler": {
      "command": "docker",
      "args": ["run", "-i", "--rm", "-e", "TIMEOUT=180", "webustler"]
    }
  }
}
Usage
Once configured, the scrape tool is available to your MCP client:
Scrape https://example.com and summarize the content
Extract all links from https://news.ycombinator.com
Get the article from https://protected-site.com/article
Webustler handles everything automatically, including Cloudflare challenges.
Output Format
Returns clean markdown with YAML frontmatter:
---
sourceURL: https://example.com/article
statusCode: 200
title: Article Title
description: Meta description here
author: John Doe
language: en
wordCount: 1542
readingTime: 8 mins
publishedTime: 2025-01-01
openGraph:
  title: OG Title
  image: https://example.com/og.png
twitter:
  card: summary_large_image
internalLinksCount: 42
externalLinksCount: 15
imagesCount: 8
---
# Article Title
Clean markdown content here with **formatting** preserved...
| Column 1 | Column 2 |
|----------|----------|
| Tables | Work too |
---
## Internal Links
- https://example.com/page1
- https://example.com/page2
---
## External Links
- https://other-site.com/reference
---
## Images
- https://example.com/image1.jpg
How It Works
┌─────────────────────────────────────────────────────────────────┐
│                                                                 │
│         URL ──► Primary Fetch ──► Blocked? ──► Fallback Fetch   │
│                       │                              │          │
│                       ▼                              ▼          │
│                    Success ◄─────────────────────────┘          │
│                       │                                         │
│                       ▼                                         │
│                   Clean HTML                                    │
│                       │                                         │
│                       ▼                                         │
│         ┌─────────────┼─────────────┐                           │
│         ▼             ▼             ▼                           │
│     Metadata      Markdown        Links                         │
│         │             │             │                           │
│         └─────────────┼─────────────┘                           │
│                       ▼                                         │
│                 Format Output                                   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
Retry Logic
Method | Attempts | Delay | Purpose |
Primary | 2 | 5s | Fast extraction |
Fallback | 3 | 5s | Anti-bot bypass |
Total: Up to 5 attempts before failure. Handles timeouts, rate limits, and challenges.
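The retry schedule in the table can be sketched as below. This is a minimal illustration, not Webustler's code: `fetch_with_fallback`, the `primary` and `fallback` callables, and the injectable `sleep` parameter are hypothetical names introduced here. The attempt counts and the 5-second delay come from the table above.

```python
import time
from typing import Callable, Optional


def fetch_with_fallback(
    url: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
    primary_attempts: int = 2,
    fallback_attempts: int = 3,
    delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try the primary fetcher, then the fallback, pausing between failed attempts."""
    schedule = [primary] * primary_attempts + [fallback] * fallback_attempts
    last_error: Optional[Exception] = None
    for index, fetch in enumerate(schedule):
        try:
            return fetch(url)
        except Exception as exc:  # e.g. timeouts, rate limits, challenges
            last_error = exc
            if index < len(schedule) - 1:
                sleep(delay)  # wait before the next attempt
    raise RuntimeError(f"all {len(schedule)} attempts failed for {url}") from last_error
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; in normal use the default `time.sleep` applies.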
Content Cleaning
Tags Removed
Category | Elements |
Scripts | |
Styles | |
Navigation | |
Interactive | |
Media | |
Selectors Removed
Sidebars ([class*='sidebar'], [id*='sidebar'])
Comments ([class*='comment'])
Ads ([class*='ad-'], [class*='advertisement'])
Social ([class*='social'], [class*='share'])
Popups ([class*='popup'], [class*='modal'])
Cookie banners ([class*='cookie'])
Newsletters ([class*='newsletter'])
Promos ([class*='banner'], [class*='promo'])
Also Removed
Base64 inline images (massive token savings)
Empty elements
Excessive newlines (max 3 consecutive)
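Two of the text-level steps above (dropping base64 inline images and capping newline runs) can be approximated with regular expressions. The sketch below is illustrative only: `tidy_markdown` is a hypothetical helper, it handles markdown-style image syntax only, and it assumes "max 3 consecutive" means at most three newline characters in a row.

```python
import re

# Markdown images whose target is an inline data: URI,
# e.g. ![alt](data:image/png;base64,iVBORw0...)
_BASE64_MD_IMAGE = re.compile(r"!\[[^\]]*\]\(data:image/[^)]*\)")
# Runs of four or more newlines
_EXTRA_NEWLINES = re.compile(r"\n{4,}")


def tidy_markdown(markdown: str) -> str:
    """Remove base64 inline images, then cap newline runs at three."""
    without_images = _BASE64_MD_IMAGE.sub("", markdown)
    return _EXTRA_NEWLINES.sub("\n\n\n", without_images).strip()
```

Images that point at normal URLs are left untouched, since only `data:` targets carry the large payloads that waste tokens.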
Configuration
Variable | Default | Description |
TIMEOUT | | Request timeout in seconds |
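For readers wondering how a server might consume this variable, here is a hedged sketch of reading `TIMEOUT` from the environment. The `read_timeout` helper is hypothetical, and the `fallback` argument is a caller-supplied placeholder, not Webustler's actual default.

```python
import os
from typing import Mapping


def read_timeout(fallback: float, env: Mapping[str, str] = os.environ) -> float:
    """Return TIMEOUT (in seconds) from env, or fallback when unset or invalid."""
    raw = env.get("TIMEOUT", "").strip()
    try:
        value = float(raw)
    except ValueError:
        return fallback  # unset or not a number
    return value if value > 0 else fallback  # reject zero and negative values
```

With the Docker flag `-e TIMEOUT=180` shown earlier, a call such as `read_timeout(60.0)` inside the container would return `180.0`.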
Why Not Just Use...
Firecrawl is excellent but:
Requires API key and paid plans for serious usage
Limited anti-bot capabilities
More complex setup with environment variables
ScrapeGraphAI uses LLMs to parse pages:
Requires LLM API keys (OpenAI, etc.) for all operations
Adds latency (LLM calls) and cost (token usage)
Webustler is deterministic: faster, cheaper, predictable
Crawl4AI is a powerful open-source crawler but:
Requires more configuration to get started
LLM features require additional API keys
Webustler works out of the box with zero config
Deepcrawl is a great Firecrawl alternative but:
Hosted API requires API key (self-host is free)
No anti-bot bypass capabilities
REST API only, not an MCP server
Project Structure
webustler/
├── server.py          # MCP server
├── Dockerfile         # Docker image
├── requirements.txt   # Dependencies
├── LICENSE            # MIT License
├── images/            # Assets
│   └── image.png
└── README.md          # Documentation
Ethical Use & Disclaimer
Webustler is provided as a tool for security research, data interoperability, and educational purposes.
Responsibility: I, the developer of Webustler, do not condone unauthorized scraping or the violation of any website's Terms of Service (ToS).
Compliance: Users are solely responsible for ensuring that their use of this tool complies with local laws (such as the CFAA or GDPR) and the intellectual property rights of the content owners.
Respect robots.txt: I encourage all users to respect robots.txt files and implement reasonable crawl delays to avoid putting undue stress on web servers.
This project is an exploration of web technologies and challenge-response mechanisms. Use it responsibly.
License
MIT License: use it however you want.