DrRuin
by DrRuin

🤔 Why Webustler?

Most scraping tools fail on protected sites. Webustler doesn't.

❌ Other Tools

  • Get blocked by Cloudflare

  • Require API keys

  • Charge per request

  • Return messy HTML

  • No retry logic

✅ Webustler

  • Bypasses protection automatically

  • 100% free & self-hosted

  • Unlimited requests

  • Clean, LLM-ready markdown

  • Smart retry with fallback


📊 Comparison

| Feature | Webustler | Firecrawl | ScrapeGraphAI | Crawl4AI | Deepcrawl |
|---------|-----------|-----------|---------------|----------|-----------|
| Anti-bot bypass | ✅ | ⚠️ | ❌ | ⚠️ | ❌ |
| Cloudflare support | ✅ | ⚠️ | ❌ | ⚠️ | ❌ |
| No API key needed | ✅ | ❌ | ❌ | ✅ | ⚠️ |
| Self-hosted | ✅ | ✅ | ✅ | ✅ | ✅ |
| MCP native | ✅ | ✅ | ✅ | ✅ | ❌ |
| Token optimized | ✅ | ✅ | ❌ | ✅ | ✅ |
| Rich metadata | ✅ | ✅ | ⚠️ | ⚠️ | ✅ |
| Link categorization | ✅ | ❌ | ❌ | ❌ | ✅ |
| File detection | ✅ | ⚠️ | ❌ | ❌ | ❌ |
| Reading time | ✅ | ❌ | ❌ | ❌ | ❌ |
| Zero config | ✅ | ❌ | ❌ | ❌ | ❌ |
| Free forever | ✅ | ❌ | ❌ | ✅ | ✅ |


✨ Features

🛡️ Smart Fallback System

Primary method fails? Automatically retries with anti-bot bypass. No manual intervention needed.

📋 Rich Metadata Extraction

  • Title, description, author

  • Open Graph & Twitter Cards

  • Published/modified time

  • Language, keywords, robots

🔗 Link Categorization

Separates internal links (same domain) from external links. Perfect for crawling workflows.
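
The internal/external split can be pictured with a minimal sketch like the one below. It is not taken from Webustler's `server.py`; the function name `categorize_links` and the rule "same host means internal" are assumptions made for illustration.

```python
from urllib.parse import urljoin, urlparse


def categorize_links(page_url, hrefs):
    """Split hrefs into (internal, external) lists relative to page_url."""
    base_host = urlparse(page_url).netloc.lower()
    internal, external = [], []
    for href in hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar schemes
        # Assumption: a link is internal when its host equals the page's host.
        target = internal if parsed.netloc.lower() == base_host else external
        if absolute not in target:  # keep order, drop duplicates
            target.append(absolute)
    return internal, external


internal, external = categorize_links(
    "https://example.com/article",
    ["/page1", "page2", "https://other-site.com/reference", "mailto:me@example.com"],
)
print(internal)
print(external)
```

Under this rule, subdomains such as `blog.example.com` would count as external; a real implementation may choose to treat them differently.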

📁 File Download Detection

Detects PDFs, images, archives, and other file types. Returns structured info instead of garbled binary.
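
One plausible way to detect a file download is to look at the response's `Content-Type` header and fall back to the URL's extension. The sketch below only illustrates that idea; `detect_file`, `PAGE_TYPES`, and the returned field names are hypothetical and may not match what Webustler actually returns.

```python
import mimetypes
from urllib.parse import urlparse

# Hypothetical set of content types treated as regular pages rather than files.
PAGE_TYPES = {"text/html", "application/xhtml+xml"}


def detect_file(url, content_type=None):
    """Return a dict describing a non-HTML resource, or None for regular pages."""
    # Prefer the server-provided type, ignoring parameters such as charset.
    mime = (content_type or "").split(";")[0].strip().lower()
    if not mime:
        # Fall back to guessing from the path's extension.
        mime = mimetypes.guess_type(urlparse(url).path)[0] or ""
    if not mime or mime in PAGE_TYPES:
        return None
    return {"url": url, "mimeType": mime, "category": mime.split("/")[0]}


print(detect_file("https://example.com/report.pdf"))
print(detect_file("https://example.com/", "text/html; charset=utf-8"))
```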

🧹 Token-Optimized Output

Removes ads, sidebars, popups, base64 images, cookie banners, and all the junk LLMs don't need.

📊 Table Preservation

Data tables stay intact in markdown. No more broken layouts.

⏱️ Content Analysis

Word count and reading time calculated automatically. Know your content at a glance.
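
As a rough illustration of how these two values could be derived, the sketch below counts whitespace-separated words and divides by an assumed reading speed of 200 words per minute. Both the speed and the function name `analyze` are assumptions, not details confirmed by the project.

```python
import math

WORDS_PER_MINUTE = 200  # assumed reading speed, not taken from the project


def analyze(markdown_text):
    """Return word count and a human-readable reading time estimate."""
    word_count = len(markdown_text.split())
    # Round up, and never report less than one minute.
    minutes = max(1, math.ceil(word_count / WORDS_PER_MINUTE))
    unit = "min" if minutes == 1 else "mins"
    return {"wordCount": word_count, "readingTime": f"{minutes} {unit}"}


print(analyze("word " * 1542))
```

With this assumed speed, a 1542-word article rounds up to 8 minutes.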


📦 Installation

git clone https://github.com/drruin/webustler.git
cd webustler
docker build -t webustler .

🔧 MCP Configuration

Claude Desktop

Add to your claude_desktop_config.json:

{
  "mcpServers": {
    "webustler": {
      "command": "docker",
      "args": ["run", "-i", "--rm", "webustler"]
    }
  }
}

Claude Code

claude mcp add webustler -- docker run -i --rm webustler

Cursor

Add to your Cursor MCP settings:

{
  "mcpServers": {
    "webustler": {
      "command": "docker",
      "args": ["run", "-i", "--rm", "webustler"]
    }
  }
}

Windsurf

Add to your Windsurf MCP config:

{
  "mcpServers": {
    "webustler": {
      "command": "docker",
      "args": ["run", "-i", "--rm", "webustler"]
    }
  }
}

With Custom Timeout

Pass the TIMEOUT environment variable (in seconds):

{
  "mcpServers": {
    "webustler": {
      "command": "docker",
      "args": ["run", "-i", "--rm", "-e", "TIMEOUT=180", "webustler"]
    }
  }
}

🚀 Usage

Once configured, the scrape tool is available to your MCP client:

Scrape https://example.com and summarize the content
Extract all links from https://news.ycombinator.com
Get the article from https://protected-site.com/article

Webustler handles everything automatically, including Cloudflare challenges.


📄 Output Format

Returns clean markdown with YAML frontmatter:

---
sourceURL: https://example.com/article
statusCode: 200
title: Article Title
description: Meta description here
author: John Doe
language: en
wordCount: 1542
readingTime: 8 mins
publishedTime: 2025-01-01
openGraph:
  title: OG Title
  image: https://example.com/og.png
twitter:
  card: summary_large_image
internalLinksCount: 42
externalLinksCount: 15
imagesCount: 8
---

# Article Title

Clean markdown content here with **formatting** preserved...

| Column 1 | Column 2 |
|----------|----------|
| Tables   | Work too |

---
## Internal Links

- https://example.com/page1
- https://example.com/page2

---
## External Links

- https://other-site.com/reference

---
## Images

- https://example.com/image1.jpg
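
If you post-process this output yourself, the frontmatter can be separated from the body with a few lines of Python. The helper below is a hypothetical consumer-side sketch, not part of Webustler. It reads only top-level `key: value` pairs and skips nested blocks such as `openGraph`; a YAML parser would be the better choice when nested fields matter.

```python
def split_frontmatter(document):
    """Return (metadata, body) for text that starts with a '---' delimited header."""
    lines = document.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, document
    try:
        end = lines.index("---", 1)  # first closing delimiter after the opener
    except ValueError:
        return {}, document
    metadata = {}
    for line in lines[1:end]:
        # Skip indented (nested) lines and anything that is not "key: value".
        if line.startswith((" ", "\t")) or ":" not in line:
            continue
        key, _, value = line.partition(":")
        if value.strip():
            metadata[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:]).strip()
    return metadata, body


sample = (
    "---\n"
    "sourceURL: https://example.com/article\n"
    "statusCode: 200\n"
    "openGraph:\n"
    "  title: OG Title\n"
    "---\n"
    "\n"
    "# Article Title\n"
    "\n"
    "Body text\n"
)
metadata, body = split_frontmatter(sample)
print(metadata)
print(body)
```

All values come back as strings in this sketch, so `statusCode` would need an explicit `int()` conversion.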

⚙️ How It Works

┌─────────────────────────────────────────────────────────────────┐
│                                                                 │
│    URL ──► Primary Fetch ──► Blocked? ──► Fallback Fetch       │
│                                  │              │               │
│                                  ▼              ▼               │
│                              Success ◄──────────┘               │
│                                  │                              │
│                                  ▼                              │
│                          Clean HTML                             │
│                                  │                              │
│                                  ▼                              │
│              ┌───────────────────┼───────────────────┐          │
│              ▼                   ▼                   ▼          │
│         Metadata            Markdown             Links          │
│              │                   │                   │          │
│              └───────────────────┼───────────────────┘          │
│                                  ▼                              │
│                          Format Output                          │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

🔄 Retry Logic

| Method | Attempts | Delay | Purpose |
|--------|----------|-------|---------|
| Primary | 2 | 5s | Fast extraction |
| Fallback | 3 | 5s | Anti-bot bypass |

Total: Up to 5 attempts before failure. Handles timeouts, rate limits, and challenges.
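
The schedule in the table can be sketched as a simple loop: two tries with the primary fetcher, then three with the fallback, waiting between attempts. This is an illustration under stated assumptions, not the project's code. `fetch_with_fallback` and the `primary`/`fallback` callables are hypothetical stand-ins for whatever fetchers the server uses.

```python
import time


def fetch_with_fallback(url, primary, fallback, delay=5.0, sleep=time.sleep):
    """Try primary twice, then fallback three times (5 attempts total)."""
    plan = [primary] * 2 + [fallback] * 3
    last_error = None
    for index, fetch in enumerate(plan):
        try:
            return fetch(url)
        except Exception as exc:  # timeouts, rate limits, challenges, ...
            last_error = exc
            if index < len(plan) - 1:
                sleep(delay)  # wait before the next attempt
    raise RuntimeError(f"all {len(plan)} attempts failed for {url}") from last_error
```

Passing `sleep` as a parameter keeps the sketch easy to test without real delays, for example with `sleep=lambda seconds: None`.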


🧹 Content Cleaning

Tags Removed

| Category | Elements |
|----------|----------|
| Scripts | <script>, <noscript> |
| Styles | <style> |
| Navigation | <nav>, <header>, <footer>, <aside> |
| Interactive | <form>, <button>, <input>, <select>, <textarea> |
| Media | <svg>, <canvas>, <video>, <audio>, <iframe>, <object>, <embed> |

Selectors Removed

  • Sidebars ([class*='sidebar'], [id*='sidebar'])

  • Comments ([class*='comment'])

  • Ads ([class*='ad-'], [class*='advertisement'])

  • Social ([class*='social'], [class*='share'])

  • Popups ([class*='popup'], [class*='modal'])

  • Cookie banners ([class*='cookie'])

  • Newsletters ([class*='newsletter'])

  • Promos ([class*='banner'], [class*='promo'])
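
The `[class*='...']` form is a CSS substring match: an element qualifies when the given text appears anywhere in its `class` attribute. The sketch below mimics that check in plain Python to show what would be caught. The function name is hypothetical, and the real server presumably applies these selectors through an HTML parser instead.

```python
# Substrings taken from the selector list above.
CLASS_SUBSTRINGS = (
    "sidebar", "comment", "ad-", "advertisement", "social", "share",
    "popup", "modal", "cookie", "newsletter", "banner", "promo",
)
ID_SUBSTRINGS = ("sidebar",)


def matches_removed_selector(class_attr="", id_attr=""):
    """Mimic [class*='x'] and [id*='x']: true if any substring occurs in the attribute."""
    return (
        any(s in class_attr for s in CLASS_SUBSTRINGS)
        or any(s in id_attr for s in ID_SUBSTRINGS)
    )


print(matches_removed_selector(class_attr="cookie-consent"))  # True
print(matches_removed_selector(class_attr="article-body"))    # False
print(matches_removed_selector(class_attr="road-map"))        # True: contains "ad-"
```

The last call shows the trade-off of substring selectors: short patterns such as `ad-` can also match unrelated class names.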

Also Removed

  • Base64 inline images (massive token savings)

  • Empty elements

  • Excessive newlines (max 3 consecutive)
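
Two of these clean-up steps are easy to picture as regular expressions applied to the generated markdown. The patterns below are illustrative assumptions (the project may do this at the HTML stage instead): one drops markdown images whose source is a `data:image/...` URI, the other caps runs of newlines at three.

```python
import re

# Markdown image whose target is an inline data URI, e.g. ![logo](data:image/png;base64,...)
BASE64_IMAGE = re.compile(r"!\[[^\]]*\]\(data:image/[^)]*\)")
# Four or more consecutive newlines.
EXTRA_NEWLINES = re.compile(r"\n{4,}")


def tidy_markdown(text):
    """Remove inline base64 images and limit consecutive newlines to three."""
    text = BASE64_IMAGE.sub("", text)
    return EXTRA_NEWLINES.sub("\n\n\n", text)


print(repr(tidy_markdown("intro\n\n\n\n\n\nnext ![logo](data:image/png;base64,AAAA) end")))
```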


🔧 Configuration

| Variable | Default | Description |
|----------|---------|-------------|
| TIMEOUT | 120 | Request timeout in seconds |
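
For reference, reading such a variable inside a Python server usually looks like the sketch below. The helper name `read_timeout` and the choice to fall back to the default on invalid values are assumptions; only the variable name `TIMEOUT` and the default of 120 seconds come from the table above.

```python
import os

DEFAULT_TIMEOUT = 120  # seconds, the documented default


def read_timeout(env=os.environ):
    """Return the TIMEOUT value in seconds, falling back to the default."""
    raw = env.get("TIMEOUT", "")
    try:
        value = int(raw)
    except ValueError:
        return DEFAULT_TIMEOUT  # unset or not a number
    return value if value > 0 else DEFAULT_TIMEOUT


print(read_timeout({"TIMEOUT": "180"}))
```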


🏆 Why Not Just Use...

Firecrawl is excellent but:

  • Requires API key and paid plans for serious usage

  • Limited anti-bot capabilities

  • More complex setup with environment variables

ScrapeGraphAI uses LLMs to parse pages:

  • Requires LLM API keys (OpenAI, etc.) for all operations

  • Adds latency (LLM calls) and cost (token usage)

  • Webustler is deterministic: faster, cheaper, predictable

Crawl4AI is a powerful open-source crawler but:

  • Requires more configuration to get started

  • LLM features require additional API keys

  • Webustler works out of the box with zero config

Deepcrawl is a great Firecrawl alternative but:

  • Hosted API requires API key (self-host is free)

  • No anti-bot bypass capabilities

  • REST API only, not an MCP server


📁 Project Structure

webustler/
├── server.py           # MCP server
├── Dockerfile          # Docker image
├── requirements.txt    # Dependencies
├── LICENSE             # MIT License
├── images/             # Assets
│   └── image.png
└── README.md           # Documentation

⚖️ Ethical Use & Disclaimer

Webustler is provided as a tool for security research, data interoperability, and educational purposes.

  • Responsibility: I, the developer of Webustler, do not condone unauthorized scraping or the violation of any website's Terms of Service (TOS).

  • Compliance: Users are solely responsible for ensuring that their use of this tool complies with local laws (such as the CFAA or GDPR) and the intellectual property rights of the content owners.

  • Respect Robots.txt: I encourage all users to respect robots.txt files and implement reasonable crawl delays to avoid putting undue stress on web servers.

This project is an exploration of web technologies and challenge-response mechanisms. Use it responsibly.


📜 License

MIT License. Use it however you want.

