Enables clean, LLM-ready markdown extraction from any URL with automatic anti-bot bypass.
MCP server for web scraping that actually works.
Extracts clean, LLM-ready markdown from any URL — even Cloudflare-protected sites.
Most scraping tools fail on protected sites. Webustler doesn't.
| Feature | Webustler | Firecrawl | ScrapeGraphAI | Crawl4AI | Deepcrawl |
|---|---|---|---|---|---|
| Anti-bot bypass | ✅ | ⚠️ | ❌ | ⚠️ | ❌ |
| Cloudflare support | ✅ | ⚠️ | ❌ | ⚠️ | ❌ |
| No API key needed | ✅ | ❌ | ❌ | ✅ | ⚠️ |
| Self-hosted | ✅ | ✅ | ✅ | ✅ | ✅ |
| MCP native | ✅ | ✅ | ✅ | ✅ | ❌ |
| Token optimized | ✅ | ✅ | ❌ | ✅ | ✅ |
| Rich metadata | ✅ | ✅ | ⚠️ | ⚠️ | ✅ |
| Link categorization | ✅ | ❌ | ❌ | ❌ | ✅ |
| File detection | ✅ | ⚠️ | ❌ | ❌ | ❌ |
| Reading time | ✅ | ❌ | ❌ | ❌ | ❌ |
| Zero config | ✅ | ❌ | ❌ | ❌ | ❌ |
| Free forever | ✅ | ❌ | ❌ | ✅ | ✅ |
✅ Full support · ⚠️ Partial/Limited · ❌ Not supported
- 🛡️ Smart Fallback System: Primary method fails? Automatically retries with anti-bot bypass. No manual intervention needed.
- 📋 Rich Metadata Extraction
- 🔗 Link Categorization: Separates internal links (same domain) from external links. Perfect for crawling workflows.
- 📁 File Download Detection: Detects PDFs, images, archives, and other file types. Returns structured info instead of garbled binary.
- 🧹 Token-Optimized Output: Removes ads, sidebars, popups, base64 images, cookie banners, and all the junk LLMs don't need.
- 📊 Table Preservation: Data tables stay intact in markdown. No more broken layouts.
- ⏱️ Content Analysis: Word count and reading time calculated automatically. Know your content at a glance.
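As a rough illustration of the link categorization and content analysis features, here is a minimal Python sketch. It is not taken from server.py: the function names, the exact-host comparison, and the 200 words-per-minute reading speed are all assumptions made for the example.

```python
import math
from urllib.parse import urljoin, urlparse

WORDS_PER_MINUTE = 200  # assumed reading speed; not stated in this README


def categorize_links(page_url: str, hrefs: list[str]) -> dict[str, list[str]]:
    """Split hrefs into internal (same host as page_url) and external links."""
    page_host = urlparse(page_url).netloc.lower()
    result: dict[str, list[str]] = {"internal": [], "external": []}
    for href in hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar schemes
        bucket = "internal" if parsed.netloc.lower() == page_host else "external"
        if absolute not in result[bucket]:  # de-duplicate, keep first-seen order
            result[bucket].append(absolute)
    return result


def reading_time_minutes(text: str) -> tuple[int, int]:
    """Return (word count, estimated minutes), rounding minutes up."""
    words = len(text.split())
    return words, max(1, math.ceil(words / WORDS_PER_MINUTE))
```

Comparing full hostnames treats `www.example.com` and `example.com` as different sites; a real implementation may prefer to compare registered domains instead.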
git clone https://github.com/drruin/webustler.git
cd webustler
docker build -t webustler .
Add to your claude_desktop_config.json:
{
  "mcpServers": {
    "webustler": {
      "command": "docker",
      "args": ["run", "-i", "--rm", "webustler"]
    }
  }
}
For Claude Code, run:
claude mcp add webustler -- docker run -i --rm webustler
Add to your Cursor MCP settings:
{
  "mcpServers": {
    "webustler": {
      "command": "docker",
      "args": ["run", "-i", "--rm", "webustler"]
    }
  }
}
Add to your Windsurf MCP config:
{
  "mcpServers": {
    "webustler": {
      "command": "docker",
      "args": ["run", "-i", "--rm", "webustler"]
    }
  }
}
Pass the TIMEOUT environment variable (in seconds):
{
  "mcpServers": {
    "webustler": {
      "command": "docker",
      "args": ["run", "-i", "--rm", "-e", "TIMEOUT=180", "webustler"]
    }
  }
}
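For readers curious how a server process might pick up this variable, here is a small Python sketch. It is an assumption about the general pattern, not the code in server.py: `read_timeout` and `DEFAULT_TIMEOUT` are hypothetical names, and falling back to the default on invalid input is a design choice of this example.

```python
import os
from typing import Mapping, Optional

DEFAULT_TIMEOUT = 120  # seconds; the documented default for TIMEOUT


def read_timeout(env: Optional[Mapping[str, str]] = None) -> int:
    """Read TIMEOUT from the environment, falling back to the default."""
    env = os.environ if env is None else env
    raw = env.get("TIMEOUT", "")
    try:
        value = int(raw)
    except ValueError:
        return DEFAULT_TIMEOUT  # unset or not a number
    return value if value > 0 else DEFAULT_TIMEOUT
```

With the Docker arguments above (`-e TIMEOUT=180`), a function like this would return 180 inside the container.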
Once configured, the scrape tool is available to your MCP client:
Scrape https://example.com and summarize the content
Extract all links from https://news.ycombinator.com
Get the article from https://protected-site.com/article
Webustler handles everything automatically — including Cloudflare challenges.
Returns clean markdown with YAML frontmatter:
---
sourceURL: https://example.com/article
statusCode: 200
title: Article Title
description: Meta description here
author: John Doe
language: en
wordCount: 1542
readingTime: 8 mins
publishedTime: 2025-01-01
openGraph:
  title: OG Title
  image: https://example.com/og.png
twitter:
  card: summary_large_image
internalLinksCount: 42
externalLinksCount: 15
imagesCount: 8
---
# Article Title
Clean markdown content here with **formatting** preserved...
| Column 1 | Column 2 |
|----------|----------|
| Tables | Work too |
---
## Internal Links
- https://example.com/page1
- https://example.com/page2
---
## External Links
- https://other-site.com/reference
---
## Images
- https://example.com/image1.jpg
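If you want to post-process this output in your own scripts, the frontmatter can be separated from the markdown body with a few lines of Python. The sketch below uses only the standard library and reads top-level keys only; `split_frontmatter` is a hypothetical helper, not part of Webustler. For nested blocks such as openGraph, a full YAML parser (for example PyYAML's `yaml.safe_load`) is the safer choice.

```python
def split_frontmatter(document: str) -> tuple[dict[str, str], str]:
    """Return (top-level metadata, markdown body) from a scrape result."""
    lines = document.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, document  # no frontmatter present
    # Find the closing '---' of the frontmatter block.
    end = next(
        (i for i, line in enumerate(lines[1:], start=1) if line.strip() == "---"),
        None,
    )
    if end is None:
        return {}, document
    meta: dict[str, str] = {}
    for line in lines[1:end]:
        if line.startswith((" ", "\t")) or ":" not in line:
            continue  # skip indented (nested) entries in this simple sketch
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:]).strip()
    return meta, body
```

All values come back as strings, so a field like statusCode needs an explicit `int(...)` conversion before numeric comparison.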
┌─────────────────────────────────────────────────────────────────┐
│                                                                 │
│   URL ──► Primary Fetch ──► Blocked? ──► Fallback Fetch         │
│                                │               │                │
│                                ▼               ▼                │
│                             Success ◄──────────┘                │
│                                │                                │
│                                ▼                                │
│                            Clean HTML                           │
│                                │                                │
│                                ▼                                │
│            ┌───────────────────┼───────────────────┐            │
│            ▼                   ▼                   ▼            │
│        Metadata            Markdown              Links          │
│            │                   │                   │            │
│            └───────────────────┼───────────────────┘            │
│                                ▼                                │
│                          Format Output                          │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
| Method | Attempts | Delay | Purpose |
|---|---|---|---|
| Primary | 2 | 5s | Fast extraction |
| Fallback | 3 | 5s | Anti-bot bypass |
Total: Up to 5 attempts before failure. Handles timeouts, rate limits, and challenges.
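The retry schedule in the table can be sketched in Python as follows. This is an illustration under stated assumptions, not the logic in server.py: `fetch_with_fallback` and the `fetchers` mapping are hypothetical, an empty or missing result is treated as "blocked", and the 5-second delay is applied only between attempts of the same method.

```python
import time
from typing import Callable, Optional

# (method name, attempts, delay in seconds), taken from the table above
STRATEGIES = [("primary", 2, 5.0), ("fallback", 3, 5.0)]


def fetch_with_fallback(
    url: str,
    fetchers: dict[str, Callable[[str], Optional[str]]],
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try each method in order; return the first non-empty HTML."""
    last_error: Optional[Exception] = None
    for name, attempts, delay in STRATEGIES:
        for attempt in range(1, attempts + 1):
            try:
                html = fetchers[name](url)
                if html:
                    return html
            except Exception as exc:  # timeouts, rate limits, challenges
                last_error = exc
            if attempt < attempts:
                sleep(delay)  # wait before retrying the same method
    raise RuntimeError(f"all attempts failed for {url}") from last_error
```

Passing `sleep` as a parameter keeps the function easy to test: a no-op can be injected so a test run does not actually wait between attempts.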
| Category | Elements |
|---|---|
| Scripts | <script>, <noscript> |
| Styles | <style> |
| Navigation | <nav>, <header>, <footer>, <aside> |
| Interactive | <form>, <button>, <input>, <select>, <textarea> |
| Media | <svg>, <canvas>, <video>, <audio>, <iframe>, <object>, <embed> |
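To make the cleanup rules concrete, here is a standard-library Python sketch that drops the tags from the table along with elements whose class or id contains one of the noise patterns listed in this section. It is an assumption about how such filtering could work, not Webustler's implementation; `Cleaner` and `clean_html` are hypothetical names.

```python
from html.parser import HTMLParser
from typing import Optional

REMOVED_TAGS = {
    "script", "noscript", "style", "nav", "header", "footer", "aside",
    "form", "button", "input", "select", "textarea",
    "svg", "canvas", "video", "audio", "iframe", "object", "embed",
}
# Substring patterns, mirroring CSS [class*='...'] and [id*='...'] selectors.
CLASS_PATTERNS = ("sidebar", "comment", "ad-", "advertisement", "social", "share",
                  "popup", "modal", "cookie", "newsletter", "banner", "promo")
ID_PATTERNS = ("sidebar",)
# HTML void elements never have a closing tag, so they cannot open a skipped region.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Cleaner(HTMLParser):
    def __init__(self) -> None:
        super().__init__(convert_charrefs=False)  # keep entities like &amp; as-is
        self.out: list[str] = []
        self._skip_tag: Optional[str] = None  # tag that opened the skipped region
        self._depth = 0  # nesting depth of that same tag inside the region

    def _should_remove(self, tag, attrs) -> bool:
        if tag in REMOVED_TAGS:
            return True
        attr = dict(attrs)
        cls = (attr.get("class") or "").lower()
        el_id = (attr.get("id") or "").lower()
        return any(p in cls for p in CLASS_PATTERNS) or any(p in el_id for p in ID_PATTERNS)

    def handle_starttag(self, tag, attrs):
        if self._skip_tag is not None:
            if tag == self._skip_tag:
                self._depth += 1
            return
        if self._should_remove(tag, attrs):
            if tag not in VOID_TAGS:
                self._skip_tag, self._depth = tag, 1
            return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if self._skip_tag is None and not self._should_remove(tag, attrs):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self._skip_tag is not None:
            if tag == self._skip_tag:
                self._depth -= 1
                if self._depth == 0:
                    self._skip_tag = None
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self._skip_tag is None:
            self.out.append(data)

    def handle_entityref(self, name):
        if self._skip_tag is None:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if self._skip_tag is None:
            self.out.append(f"&#{name};")


def clean_html(html: str) -> str:
    cleaner = Cleaner()
    cleaner.feed(html)
    cleaner.close()
    return "".join(cleaner.out)
```

Substring matching is deliberately broad: a pattern like `ad-` also matches a class such as `load-more`, so a production cleaner may need an allowlist for false positives.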
Elements matching these selectors are also removed:
- [class*='sidebar'], [id*='sidebar']
- [class*='comment']
- [class*='ad-'], [class*='advertisement']
- [class*='social'], [class*='share']
- [class*='popup'], [class*='modal']
- [class*='cookie']
- [class*='newsletter']
- [class*='banner'], [class*='promo']

| Variable | Default | Description |
|---|---|---|
| TIMEOUT | 120 | Request timeout in seconds |
Firecrawl is excellent but:
ScrapeGraphAI uses LLMs to parse pages:
Crawl4AI is a powerful open-source crawler but:
Deepcrawl is a great Firecrawl alternative but:
webustler/
├── server.py # MCP server
├── Dockerfile # Docker image
├── requirements.txt # Dependencies
├── LICENSE # MIT License
├── images/ # Assets
│ └── image.png
└── README.md # Documentation
Webustler is provided as a tool for security research, data interoperability, and educational purposes.
Respect robots.txt files and implement reasonable crawl delays to avoid putting undue stress on web servers.
This project is an exploration of web technologies and challenge-response mechanisms. Use it responsibly.
MIT License — use it however you want.
MCP server for LLMs. Works everywhere. No API keys. No limits.
Made with care for the AI community