Webustler


Enables clean, LLM-ready markdown extraction from any URL with automatic anti-bot bypass.

by DrRuin



MCP server for web scraping that actually works.
Extracts clean, LLM-ready markdown from any URL — even Cloudflare-protected sites.

Why Webustler? • Features • Installation • Usage • Output

🤔 Why Webustler?

Most scraping tools fail on protected sites. Webustler doesn't.

❌ Other Tools

Block on Cloudflare
Require API keys
Charge per request
Return messy HTML
No retry logic

✅ Webustler

Bypasses protection automatically
100% free & self-hosted
Unlimited requests
Clean, LLM-ready markdown
Smart retry with fallback

📊 Comparison

Feature	Webustler	Firecrawl	ScrapeGraphAI	Crawl4AI	Deepcrawl
Anti-bot bypass	✅	⚠️	❌	⚠️	❌
Cloudflare support	✅	⚠️	❌	⚠️	❌
No API key needed	✅	❌	❌	✅	⚠️
Self-hosted	✅	✅	✅	✅	✅
MCP native	✅	✅	✅	✅	❌
Token optimized	✅	✅	❌	✅	✅
Rich metadata	✅	✅	⚠️	⚠️	✅
Link categorization	✅	❌	❌	❌	✅
File detection	✅	⚠️	❌	❌	❌
Reading time	✅	❌	❌	❌	❌
Zero config	✅	❌	❌	❌	❌
Free forever	✅	❌	❌	✅	✅

✅ Full support · ⚠️ Partial/Limited · ❌ Not supported

✨ Features

🛡️ Smart Fallback System

Primary method fails? Automatically retries with anti-bot bypass. No manual intervention needed.

📋 Rich Metadata Extraction

Title, description, author
Open Graph & Twitter Cards
Published/modified time
Language, keywords, robots

🔗 Link Categorization

Separates internal links (same domain) from external links. Perfect for crawling workflows.
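
As a rough illustration of the internal/external split, here is a minimal Python sketch. It is not taken from `server.py`; the function name `categorize_links` is hypothetical, and it assumes "same domain" means an exact host match, so subdomains count as external.

```python
from urllib.parse import urljoin, urlparse


def categorize_links(page_url, hrefs):
    """Split hrefs found on page_url into (internal, external) absolute URLs."""
    base_host = urlparse(page_url).netloc.lower()
    internal, external = [], []
    for href in hrefs:
        # Resolve relative links such as "/page1" against the page URL.
        absolute = urljoin(page_url, href)
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar schemes
        # Assumption: an exact host match is what counts as "same domain".
        target = internal if parsed.netloc.lower() == base_host else external
        if absolute not in target:
            target.append(absolute)  # keep first occurrence, drop duplicates
    return internal, external
```

A crawler could feed the internal list back into its queue and keep the external list for reference only.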

📁 File Download Detection

Detects PDFs, images, archives, and other file types. Returns structured info instead of garbled binary.
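
One way such detection could work is sketched below with the Python standard library. This is an assumption about the approach, not Webustler's actual code: the function `describe_file` and the returned field names are hypothetical.

```python
import mimetypes
from pathlib import PurePosixPath
from urllib.parse import urlparse

HTML_TYPES = {"text/html", "application/xhtml+xml"}


def describe_file(url, content_type=None):
    """Return structured info for a non-HTML resource, or None for a web page."""
    path = PurePosixPath(urlparse(url).path)
    # Prefer the server's Content-Type header when the caller has one.
    mime = (content_type or "").split(";")[0].strip().lower()
    if not mime:
        # Otherwise guess from the file extension in the URL path.
        mime = mimetypes.guess_type(path.name)[0] or ""
    if not mime or mime in HTML_TYPES:
        return None  # treat as a normal page and scrape it
    return {
        "url": url,
        "filename": path.name,
        "mimeType": mime,
        "category": mime.split("/")[0],  # e.g. "image", "application"
    }
```

Returning `None` for pages keeps the caller's logic simple: scrape when the result is `None`, report the file info otherwise.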

🧹 Token-Optimized Output

Removes ads, sidebars, popups, base64 images, cookie banners, and all the junk LLMs don't need.

📊 Table Preservation

Data tables stay intact in markdown. No more broken layouts.

⏱️ Content Analysis

Word count and reading time calculated automatically. Know your content at a glance.
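
The calculation can be sketched as follows. The 200 words-per-minute rate and the rounding up are assumptions, chosen because they reproduce the sample in the Output Format section (1542 words, 8 mins); the actual constants in Webustler may differ.

```python
import math

WORDS_PER_MINUTE = 200  # assumed reading speed, not confirmed by the source


def analyze(markdown):
    """Return word count and an estimated reading time for extracted text."""
    word_count = len(markdown.split())  # whitespace-separated tokens
    minutes = max(1, math.ceil(word_count / WORDS_PER_MINUTE))
    unit = "min" if minutes == 1 else "mins"
    return {"wordCount": word_count, "readingTime": f"{minutes} {unit}"}
```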

📦 Installation

git clone https://github.com/drruin/webustler.git
cd webustler
docker build -t webustler .

🔧 MCP Configuration

Claude Desktop

Add to your claude_desktop_config.json:

{
  "mcpServers": {
    "webustler": {
      "command": "docker",
      "args": ["run", "-i", "--rm", "webustler"]
    }
  }
}

Claude Code

claude mcp add webustler -- docker run -i --rm webustler

Cursor

Add to your Cursor MCP settings:

{
  "mcpServers": {
    "webustler": {
      "command": "docker",
      "args": ["run", "-i", "--rm", "webustler"]
    }
  }
}

Windsurf

Add to your Windsurf MCP config:

{
  "mcpServers": {
    "webustler": {
      "command": "docker",
      "args": ["run", "-i", "--rm", "webustler"]
    }
  }
}

With Custom Timeout

Pass the TIMEOUT environment variable (in seconds):

{
  "mcpServers": {
    "webustler": {
      "command": "docker",
      "args": ["run", "-i", "--rm", "-e", "TIMEOUT=180", "webustler"]
    }
  }
}

🚀 Usage

Once configured, the scrape tool is available to your MCP client:

Scrape https://example.com and summarize the content

Extract all links from https://news.ycombinator.com

Get the article from https://protected-site.com/article

Webustler handles everything automatically — including Cloudflare challenges.

📄 Output Format

Returns clean markdown with YAML frontmatter:

---
sourceURL: https://example.com/article
statusCode: 200
title: Article Title
description: Meta description here
author: John Doe
language: en
wordCount: 1542
readingTime: 8 mins
publishedTime: 2025-01-01
openGraph:
  title: OG Title
  image: https://example.com/og.png
twitter:
  card: summary_large_image
internalLinksCount: 42
externalLinksCount: 15
imagesCount: 8
---

# Article Title

Clean markdown content here with **formatting** preserved...

| Column 1 | Column 2 |
|----------|----------|
| Tables   | Work too |

---
## Internal Links

- https://example.com/page1
- https://example.com/page2

---
## External Links

- https://other-site.com/reference

---
## Images

- https://example.com/image1.jpg
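
A client that wants the metadata separately from the article body could split the output along these lines. This is a minimal standard-library sketch with a hypothetical helper name, `split_frontmatter`; it reads only top-level scalar keys and skips nested blocks such as `openGraph`, so a real YAML parser is the better choice when the nested fields matter.

```python
def split_frontmatter(document):
    """Split scraped output into (top-level metadata dict, markdown body)."""
    lines = document.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, document  # no frontmatter present
    try:
        end = lines.index("---", 1)  # first closing delimiter after the opener
    except ValueError:
        return {}, document
    meta = {}
    for line in lines[1:end]:
        # Indented lines belong to nested blocks; this sketch ignores them.
        if line.startswith((" ", "\t")) or ":" not in line:
            continue
        key, _, value = line.partition(":")
        if value.strip():
            meta[key.strip()] = value.strip()  # values stay as strings
    body = "\n".join(lines[end + 1:]).lstrip("\n")
    return meta, body
```

Splitting on the first `:` only keeps URL values such as `sourceURL` intact.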

⚙️ How It Works

┌─────────────────────────────────────────────────────────────────┐
│                                                                 │
│    URL ──► Primary Fetch ──► Blocked? ──► Fallback Fetch       │
│                                  │              │               │
│                                  ▼              ▼               │
│                              Success ◄──────────┘               │
│                                  │                              │
│                                  ▼                              │
│                          Clean HTML                             │
│                                  │                              │
│                                  ▼                              │
│              ┌───────────────────┼───────────────────┐          │
│              ▼                   ▼                   ▼          │
│         Metadata            Markdown             Links          │
│              │                   │                   │          │
│              └───────────────────┼───────────────────┘          │
│                                  ▼                              │
│                          Format Output                          │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

🔄 Retry Logic

Method	Attempts	Delay	Purpose
Primary	2	5s	Fast extraction
Fallback	3	5s	Anti-bot bypass

Total: Up to 5 attempts before failure. Handles timeouts, rate limits, and challenges.
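
The table above can be read as a two-stage retry loop. The sketch below shows that pattern under stated assumptions: the name `fetch_with_fallback` and the fetcher callables are hypothetical stand-ins, and nothing here is taken from Webustler's source.

```python
import time


def fetch_with_fallback(url, stages, delay=5.0, sleep=time.sleep):
    """Try each (name, fetcher, attempts) stage in order.

    Returns (stage_name, html) from the first fetcher call that succeeds.
    Each fetcher is a hypothetical callable that takes a URL and either
    returns HTML or raises on timeouts, rate limits, or challenge pages.
    """
    last_error = None
    first_attempt = True
    for name, fetcher, attempts in stages:
        for _ in range(attempts):
            if not first_attempt:
                sleep(delay)  # wait between attempts, not before the first
            first_attempt = False
            try:
                return name, fetcher(url)
            except Exception as exc:
                last_error = exc
    raise RuntimeError(f"all attempts failed for {url}") from last_error
```

With `stages=[("primary", fast_fetch, 2), ("fallback", bypass_fetch, 3)]`, where both fetchers are placeholders, this makes at most 5 calls, matching the total above.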

🧹 Content Cleaning


Tags Removed

Category	Elements
Scripts	`<script>`, `<noscript>`
Styles	`<style>`
Navigation	`<nav>`, `<header>`, `<footer>`, `<aside>`
Interactive	`<form>`, `<button>`, `<input>`, `<select>`, `<textarea>`
Media	`<svg>`, `<canvas>`, `<video>`, `<audio>`, `<iframe>`, `<object>`, `<embed>`

Selectors Removed

Sidebars ([class*='sidebar'], [id*='sidebar'])
Comments ([class*='comment'])
Ads ([class*='ad-'], [class*='advertisement'])
Social ([class*='social'], [class*='share'])
Popups ([class*='popup'], [class*='modal'])
Cookie banners ([class*='cookie'])
Newsletters ([class*='newsletter'])
Promos ([class*='banner'], [class*='promo'])

Also Removed

Base64 inline images (massive token savings)
Empty elements
Excessive newlines (max 3 consecutive)
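
Two of the text-level steps listed above can be approximated with regular expressions. This is a sketch, not the project's implementation: the patterns and the function name `tidy_markdown` are assumptions, and the tag and selector removal would normally be done with an HTML parser rather than regex.

```python
import re

# Markdown image whose target is an inline data: URI,
# e.g. ![logo](data:image/png;base64,iVBORw0...)
BASE64_IMAGE = re.compile(r"!\[[^\]]*\]\(data:image/[^)]*\)")

# Four or more consecutive newlines.
EXTRA_NEWLINES = re.compile(r"\n{4,}")


def tidy_markdown(text):
    """Drop inline base64 images and cap newline runs at three."""
    text = BASE64_IMAGE.sub("", text)
    return EXTRA_NEWLINES.sub("\n\n\n", text)
```

Images that point at ordinary URLs are left alone, so they can still be listed in the Images section of the output.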

🔧 Configuration

Variable	Default	Description
`TIMEOUT`	`120`	Request timeout in seconds
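
For readers curious how a server might consume this variable, here is a minimal sketch. Only the variable name `TIMEOUT` and the default of `120` come from the table above; the helper `read_timeout` and the choice to fall back to the default on invalid values are assumptions.

```python
import os

DEFAULT_TIMEOUT = 120  # seconds, per the configuration table


def read_timeout(env=None):
    """Read TIMEOUT from the environment, using the default when unset or invalid."""
    env = os.environ if env is None else env
    raw = env.get("TIMEOUT", "")
    try:
        value = int(raw)
    except ValueError:
        return DEFAULT_TIMEOUT  # unset or not a whole number
    return value if value > 0 else DEFAULT_TIMEOUT
```

Running the container with `-e TIMEOUT=180`, as in the custom timeout example, would make this helper return `180`.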

🏆 Why Not Just Use...

Firecrawl?

Firecrawl is excellent but:

Requires API key and paid plans for serious usage
Limited anti-bot capabilities
More complex setup with environment variables

ScrapeGraphAI?

ScrapeGraphAI uses LLMs to parse pages:

Requires LLM API keys (OpenAI, etc.) for all operations
Adds latency (LLM calls) and cost (token usage)
Webustler is deterministic — faster, cheaper, predictable

Crawl4AI?

Crawl4AI is a powerful open-source crawler but:

Requires more configuration to get started
LLM features require additional API keys
Webustler works out of the box with zero config

Deepcrawl?

Deepcrawl is a great Firecrawl alternative but:

Hosted API requires API key (self-host is free)
No anti-bot bypass capabilities
REST API only, not an MCP server

📁 Project Structure

webustler/
├── server.py           # MCP server
├── Dockerfile          # Docker image
├── requirements.txt    # Dependencies
├── LICENSE             # MIT License
├── images/             # Assets
│   └── image.png
└── README.md           # Documentation

⚖️ Ethical Use & Disclaimer

Webustler is provided as a tool for security research, data interoperability, and educational purposes.

Responsibility: I, the developer of Webustler, do not condone unauthorized scraping or the violation of any website's Terms of Service (TOS).
Compliance: Users are solely responsible for ensuring that their use of this tool complies with local laws (such as the CFAA or GDPR) and the intellectual property rights of the content owners.
Respect Robots.txt: I encourage all users to respect robots.txt files and implement reasonable crawl delays to avoid putting undue stress on web servers.

This project is an exploration of web technologies and challenge-response mechanisms. Use it responsibly.

📜 License

MIT License — use it however you want.

MCP server for LLMs. Works everywhere. No API keys. No limits.

Made with care for the AI community

from github.com/DrRuin/webustler

Installing Webustler

This server has no published package — it is built from source. Open the repository and follow its README.

▸ github.com/DrRuin/webustler

FAQ

Is Webustler MCP free?

Yes, Webustler MCP is free — one-click install via Unyly at no cost.

Does Webustler need an API key?

No, Webustler runs without API keys or environment variables.

Is Webustler hosted or self-hosted?

A hosted option is available: Unyly runs the server in the cloud, no local setup required.

How do I install Webustler in Claude Desktop, Claude Code or Cursor?

Open Webustler on unyly.org, pick your client tab (Claude Desktop, Claude Code, Cursor) and press Install — the config is generated automatically, no JSON editing.
