The unified web layer for AI agents. Search, browse, crawl, extract, and act on platforms — one package, self-hosted.
5,000 free searches/month via Gemini Grounded Search. Full page scraping, stealth browsing, multi-page crawling, structured extraction, AI browser agent, 24 platform adapters.
AI agents need to interact with the web — searching, browsing pages, crawling sites, logging into platforms, posting content. Today you wire together Playwright + a search API + cookie managers + platform-specific scripts. Spectrawl is one package that does all of it.
npm install spectrawl
Spectrawl searches via Gemini Grounded Search (Google-quality results), scrapes the top pages for full content, and returns everything to your agent. Your agent's LLM reads the actual sources and forms its own answer — no pre-chewed summaries.
npm install spectrawl
export GEMINI_API_KEY=your-free-key # Get one at aistudio.google.com
const { Spectrawl } = require('spectrawl')
const web = new Spectrawl()
// Deep search — returns sources for your agent/LLM to process
const result = await web.deepSearch('how to build an MCP server in Node.js')
console.log(result.sources) // [{ title, url, content, score }]
// With AI summary (opt-in — uses extra Gemini call)
const withAnswer = await web.deepSearch('query', { summarize: true })
console.log(withAnswer.answer) // AI-generated answer with [1] [2] citations
// Fast mode — snippets only, skip scraping
const fast = await web.deepSearch('query', { mode: 'fast' })
// Basic search — raw results
const basic = await web.search('query')
Why no summary by default? Your agent already has an LLM. If we summarize AND your agent summarizes, you're paying two LLMs for one answer. We return rich sources — your agent does the rest.
| | Tavily | Crawl4AI | Firecrawl | Stagehand | Spectrawl |
|---|---|---|---|---|---|
| Speed | ~2s | ~5s | ~3s | ~3s | ~6-10s |
| Free tier | 1,000/mo | Unlimited | 500/mo | None | 5,000/mo |
| Returns | Snippets + AI | Markdown | Markdown/JSON | Structured | Full page + structured |
| Self-hosted | No | Yes | Yes | Yes | Yes |
| Anti-detect | No | No | No | No | Yes (Camoufox) |
| Block detection | No | No | No | No | 8 services |
| CAPTCHA solving | No | No | No | No | Yes (Gemini Vision) |
| Structured extraction | No | No | No | Yes | Yes |
| NL browser agent | No | No | No | Yes | Yes |
| Network capturing | No | Yes | No | No | Yes |
| Multi-page crawl | No | Yes | Yes | No | Yes (+ sitemap) |
| Platform posting | No | No | No | No | 24 adapters |
| Auth management | No | No | No | No | Cookie store + refresh |
Two modes: basic search and deep search.
const results = await web.search('query')
Returns raw search results from the engine cascade. Fast, lightweight.
const results = await web.deepSearch('query', { summarize: true })
Full pipeline: query expansion → parallel search → merge/dedup → rerank → scrape top N → optional AI summary with citations.
Default cascade: Gemini Grounded → Tavily → Brave
Gemini Grounded Search gives you Google-quality results through the Gemini API. Free tier: 5,000 grounded queries/month.
| Engine | Free Tier | Key Required | Default |
|---|---|---|---|
| Gemini Grounded | 5,000/month | `GEMINI_API_KEY` | ✅ Primary |
| Tavily | 1,000/month | `TAVILY_API_KEY` | ✅ 1st fallback |
| Brave | 2,000/month | `BRAVE_API_KEY` | ✅ 2nd fallback |
| DuckDuckGo | Unlimited | None | Available |
| Bing | Unlimited | None | Available |
| Serper | 2,500 trial | `SERPER_API_KEY` | Available |
| Google CSE | 100/day | `GOOGLE_CSE_KEY` | Available |
| Jina Reader | Unlimited | None | Available |
| SearXNG | Unlimited | Self-hosted | Available |
Query → Gemini Grounded + DDG (parallel)
→ Merge & deduplicate (12-19 results)
→ Source quality ranking (boost GitHub/SO/Reddit, penalize SEO spam)
→ Parallel scraping (Jina → readability → Playwright fallback)
→ Returns sources to your agent (AI summary opt-in with summarize: true)
With no API keys configured, Spectrawl falls back to DDG-only search: raw results, no AI answer. This works from home IPs, but DDG rate-limits datacenter IPs, so we recommend at minimum a free Gemini key.
Stealth browsing with anti-detection. Three tiers (auto-escalation):
1. Playwright with stealth patches (default)
2. Camoufox anti-detect browser (install with `npx spectrawl install-stealth`)
3. Residential proxy routing (configured via `browse.proxy`)

If tier 1 gets blocked, Spectrawl automatically escalates to tier 2 (if installed) or tier 3 (if configured). No manual intervention needed.
const page = await web.browse('https://example.com', {
screenshot: true, // Take a PNG screenshot
fullPage: true, // Full page screenshot (not just viewport)
html: true, // Return raw HTML alongside markdown
stealth: true, // Force stealth mode
camoufox: true, // Force Camoufox engine
noCache: true, // Bypass cache
auth: 'reddit' // Use stored auth cookies for this platform
})
{
content: "# Page Title\n\nExtracted markdown content...",
url: "https://example.com",
title: "Page Title",
statusCode: 200,
cached: false,
engine: "camoufox", // which engine was used
screenshot: Buffer<png>, // PNG buffer (JS) or base64 (HTTP)
html: "<html>...</html>", // raw HTML (if html: true)
blocked: false, // true if block page detected
blockInfo: null // { type: 'cloudflare', detail: '...' }
}
Spectrawl detects block/challenge pages from 8 anti-bot services and reports them in the response instead of returning garbage HTML.
When a block is detected, the response includes blocked: true and blockInfo: { type, detail }.
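A minimal handling sketch using the `blocked` and `blockInfo` fields from the response shape above (the URL is a placeholder):

```js
const page = await web.browse('https://heavily-protected.example')

if (page.blocked) {
  // blockInfo is e.g. { type: 'cloudflare', detail: '...' }
  console.warn(`Blocked by ${page.blockInfo.type}: ${page.blockInfo.detail}`)
} else {
  console.log(page.content)
}
```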
Some sites block all datacenter IPs regardless of stealth. Spectrawl automatically routes these through alternative APIs:
| Site | Problem | Fallback | Cost |
|---|---|---|---|
| Reddit | Blocks all datacenter IPs | PullPush API — Reddit archive | Free |
| Amazon | CAPTCHA wall on product pages | Jina Reader — server-side rendering | Free |
| X/Twitter | Login wall on posts | xAI Responses API with `x_search` | ~$0.06/post |
| LinkedIn | HTTP 999, IP fingerprinting | Requires residential proxy (see below) | ~$7/GB |
These fallbacks activate automatically — just browse() the URL and Spectrawl picks the right path. No config needed for Reddit and Amazon. X requires XAI_API_KEY env var. LinkedIn requires a residential proxy.
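For example (placeholder URLs; the routing decision happens inside `browse()`):

```js
// Reddit URL → served via the PullPush archive, no config needed
const post = await web.browse('https://www.reddit.com/r/node/comments/abc123/')

// X URL → routed through the xAI Responses API (requires XAI_API_KEY)
const tweet = await web.browse('https://x.com/someuser/status/123456789')
```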
LinkedIn fingerprints the IP where cookies were created, so even valid cookies get rejected from a different IP. Every free approach fails from datacenter servers.
The only working solution is a residential proxy. We recommend Bright Data (72M+ residential IPs, ~99.7% success rate, dedicated social media unlockers) for best results. For budget use, Smartproxy ($7/GB, 55M IPs, 3-day free trial) works well.
Setup:
# Bright Data (recommended)
npx spectrawl config set proxy '{"host":"brd.superproxy.io","port":22225,"username":"YOUR_ZONE_USER","password":"YOUR_PASS"}'
# Smartproxy (budget alternative)
npx spectrawl config set proxy '{"host":"gate.smartproxy.com","port":10001,"username":"YOUR_USER","password":"YOUR_PASS"}'
# Store your LinkedIn cookies (export from browser)
npx spectrawl login linkedin --account yourname --cookies ./linkedin-cookies.json
# Now browse LinkedIn normally
curl localhost:3900/browse -d '{"url":"https://www.linkedin.com/in/someone"}'
Other residential proxy providers also work; see the recommended providers table in the proxy section below.
⚠️ Avoid WebShare — recycled datacenter IPs marketed as residential, no HTTPS support.
Built-in CAPTCHA solver using Gemini Vision (free tier: 1,500 requests/day).
The solver automatically detects CAPTCHA type and attempts resolution before returning the page.
Pull structured data from any page using LLM + optional CSS/XPath selectors. Like Stagehand's extract() but self-hosted and integrated with Spectrawl's anti-detect browsing.
const result = await web.extract('https://news.ycombinator.com', {
instruction: 'Extract the top 3 story titles and their point counts',
schema: {
type: 'object',
properties: {
stories: {
type: 'array',
items: {
type: 'object',
properties: {
title: { type: 'string' },
points: { type: 'number' }
}
}
}
}
}
})
// result.data = { stories: [{ title: "...", points: 210 }, ...] }
curl -X POST http://localhost:3900/extract \
-H 'Content-Type: application/json' \
-d '{
"url": "https://example.com",
"instruction": "Extract the page title and main heading",
"schema": {"type": "object", "properties": {"title": {"type": "string"}, "heading": {"type": "string"}}}
}'
Response:
{
"data": { "title": "Example Domain", "heading": "Example Domain" },
"url": "https://example.com",
"title": "Example Domain",
"contentLength": 129,
"duration": 679
}
Narrow extraction scope using CSS or XPath selectors — reduces tokens and improves accuracy:
const result = await web.extract('https://news.ycombinator.com', {
instruction: 'Extract all story titles',
selector: '.titleline', // CSS selector
// or: selector: 'xpath=//table[@class="itemlist"]'
schema: { type: 'object', properties: { titles: { type: 'array', items: { type: 'string' } } } }
})
For large pages, filter content by relevance before sending to the LLM — saves tokens:
const result = await web.extract('https://en.wikipedia.org/wiki/Node.js', {
instruction: 'Extract the creator and release date',
relevanceFilter: true // BM25 scoring keeps only relevant sections
})
// Content reduced from 50K+ chars to ~2K relevant chars
Already have the content? Skip the browse step:
const result = await web.extractFromContent(markdownContent, {
instruction: 'Extract all email addresses',
schema: { type: 'object', properties: { emails: { type: 'array', items: { type: 'string' } } } }
})
Uses Gemini Flash (free) by default. Falls back to OpenAI if configured.
Control a browser with natural language. Navigate, click, type, scroll — the LLM interprets the page and decides what to do.
const result = await web.agent('https://example.com', 'click the More Information link', {
maxSteps: 5, // max actions to take
screenshot: true // screenshot after completion
})
// result.success = true
// result.url = "https://www.iana.org/domains/reserved" (navigated!)
// result.steps = [{ step: 1, action: "click", elementIdx: 0, result: "clicked" }, ...]
curl -X POST http://localhost:3900/agent \
-H 'Content-Type: application/json' \
-d '{"url": "https://example.com", "instruction": "click the More Information link", "maxSteps": 3}'
Response:
{
"success": true,
"url": "https://www.iana.org/domains/reserved",
"title": "IANA — Reserved Domains",
"steps": [
{ "step": 1, "action": "click", "elementIdx": 0, "reason": "clicking the More Information link", "result": "clicked" }
],
"content": "...",
"duration": 5200
}
The agent can: click, type (fill inputs), select (dropdowns), press (keyboard keys), scroll (up/down).
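For instance, a hypothetical instruction that exercises `type` and `press` (same `agent()` signature as above):

```js
const result = await web.agent(
  'https://duckduckgo.com',
  'type "spectrawl" into the search box and press Enter',
  { maxSteps: 4 }
)
// result.steps would contain "type" and "press" actions
```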
Capture XHR/fetch requests made by a page during browsing — useful for discovering hidden APIs:
const result = await web.browse('https://example.com', {
captureNetwork: true,
captureNetworkHeaders: true, // include request headers
captureNetworkBody: true // include response bodies (<50KB)
})
// result.networkRequests = [
// { url: "https://api.example.com/data", method: "GET", status: 200, contentType: "application/json", body: "..." }
// ]
curl -X POST http://localhost:3900/browse \
-d '{"url": "https://example.com", "captureNetwork": true, "captureNetworkBody": true}'
Take screenshots of any page via browse:
const result = await web.browse('https://example.com', {
screenshot: true,
fullPage: true // optional: capture entire page, not just viewport
})
// result.screenshot is a PNG Buffer
fs.writeFileSync('screenshot.png', result.screenshot)
curl -X POST http://localhost:3900/browse \
-H 'Content-Type: application/json' \
-d '{"url": "https://example.com", "screenshot": true, "fullPage": true}'
Response:
{
"content": "# Page Title\n\nExtracted markdown...",
"url": "https://example.com",
"title": "Page Title",
"screenshot": "iVBORw0KGgo...base64-encoded-png...",
"cached": false
}
Note: Screenshots bypass the cache — each request renders a fresh page.
Multi-page website crawler with automatic RAM-based parallelization.
const result = await web.crawl('https://docs.example.com', {
depth: 2, // how many link levels to follow
maxPages: 50, // stop after N pages
format: 'markdown', // 'markdown' or 'html'
scope: 'domain', // 'domain' | 'subdomain' | 'path'
concurrency: 'auto', // auto-detect from available RAM, or set a number
merge: true, // merge all pages into one document
includePatterns: [], // regex patterns to include
excludePatterns: [], // regex patterns to skip
delay: 300, // ms between batch launches (politeness)
stealth: true // use anti-detect browsing
})
{
pages: [
{ url: 'https://docs.example.com/', content: '...', title: '...', statusCode: 200 },
{ url: 'https://docs.example.com/guide', content: '...', title: '...', statusCode: 200 },
// ...
],
stats: {
pagesScraped: 23,
duration: 45000,
concurrency: 4
}
}
Spectrawl auto-discovers sitemap.xml and pre-seeds the crawl queue — much faster than link-following for documentation sites:
const result = await web.crawl('https://docs.example.com', {
useSitemap: true, // enabled by default
maxPages: 20
})
// [crawl] Found sitemap at https://docs.example.com/sitemap.xml with 82 URLs
// [crawl] Pre-seeded 20 URLs from sitemap
Set useSitemap: false to disable and rely only on link discovery.
Get notified when a crawl completes:
curl -X POST http://localhost:3900/crawl \
-d '{"url": "https://docs.example.com", "maxPages": 50, "webhook": "https://your-server.com/webhook"}'
Spectrawl will POST the full crawl result to your webhook URL when finished.
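A minimal receiver sketch, assuming an Express server on your side and that the POSTed payload matches the crawl response shape shown earlier:

```js
const express = require('express')
const app = express()
app.use(express.json({ limit: '50mb' })) // crawl results can be large

app.post('/webhook', (req, res) => {
  const { pages, stats } = req.body
  console.log(`Crawl finished: ${stats.pagesScraped} pages in ${stats.duration}ms`)
  res.sendStatus(200) // acknowledge receipt
})

app.listen(8080)
```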
For large sites, use async mode to avoid HTTP timeouts:
# Start a crawl job (returns immediately)
curl -X POST http://localhost:3900/crawl \
-d '{"url": "https://docs.example.com", "depth": 3, "maxPages": 100, "async": true}'
# Response: { "jobId": "abc123", "status": "running" }
# Check job status
curl http://localhost:3900/crawl/abc123
# List all jobs
curl http://localhost:3900/crawl/jobs
# Check system capacity
curl http://localhost:3900/crawl/capacity
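The same flow from Node, polling until the job leaves the running state (a sketch; assumes `status` stays `"running"` until completion, as in the response above):

```js
const base = 'http://localhost:3900'

// Start the job (returns immediately with a jobId)
const { jobId } = await fetch(`${base}/crawl`, {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ url: 'https://docs.example.com', maxPages: 100, async: true })
}).then(r => r.json())

// Poll every 5 seconds
let job
do {
  await new Promise(resolve => setTimeout(resolve, 5000))
  job = await fetch(`${base}/crawl/${jobId}`).then(r => r.json())
} while (job.status === 'running')
```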
Spectrawl estimates ~250MB per browser tab and calculates safe concurrency from available system RAM.
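The exact heuristic is internal, but the stated inputs (free RAM, ~250MB per tab) imply a calculation roughly like this sketch; the 0.8 headroom factor is an assumption:

```js
const os = require('os')

const BYTES_PER_TAB = 250 * 1024 * 1024 // Spectrawl's stated per-tab estimate
const HEADROOM = 0.8                    // assumed safety margin, not documented

const safeConcurrency = Math.max(
  1,
  Math.floor((os.freemem() * HEADROOM) / BYTES_PER_TAB)
)
// e.g. 8GB free → floor(8192 * 0.8 / 250) = 26 tabs
```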
Persistent cookie storage (SQLite), multi-account management, automatic expiry detection.
// Add account
await web.auth.add('x', { account: '@myhandle', method: 'cookie', cookies })
// Check health
const accounts = await web.auth.getStatus()
// [{ platform: 'x', account: '@myhandle', status: 'valid', expiresAt: '...' }]
Cookie refresh cron fires events before accounts go stale (see Events).
Spectrawl emits events for auth state changes, rate limits, and action results. Subscribe to stay informed:
const { EVENTS } = require('spectrawl')
web.on(EVENTS.COOKIE_EXPIRING, (data) => {
console.log(`Cookie expiring for ${data.platform}:${data.account}`)
})
web.on(EVENTS.RATE_LIMITED, (data) => {
console.log(`Rate limited on ${data.platform}`)
})
// Wildcard — catch everything
web.on('*', ({ event, ...data }) => {
console.log(`Event: ${event}`, data)
})
| Event | When |
|---|---|
| `cookie_expiring` | Cookie approaching expiry |
| `cookie_expired` | Cookie has expired |
| `auth_failed` | Authentication attempt failed |
| `auth_refreshed` | Cookie successfully refreshed |
| `rate_limited` | Platform rate limit hit |
| `action_failed` | Platform action failed |
| `action_success` | Platform action succeeded |
| `health_check` | Periodic health check result |
Post to 24+ platforms with one API:
await web.act('github', 'create-issue', { repo: 'user/repo', title: 'Bug report', body: '...' })
await web.act('reddit', 'post', { subreddit: 'node', title: '...', text: '...' })
await web.act('devto', 'post', { title: '...', body: '...', tags: ['ai'] })
await web.act('huggingface', 'create-repo', { name: 'my-model', type: 'model' })
Live tested: GitHub ✅, Reddit ✅, Dev.to ✅, HuggingFace ✅, X (reads) ✅
| Platform | Auth Method | Actions |
|---|---|---|
| X/Twitter | Cookie + OAuth 1.0a | post |
| Reddit | Cookie API | post, comment, delete |
| Dev.to | REST API key | post, update |
| Hashnode | GraphQL API | post |
| LinkedIn | Cookie API (Voyager) | post |
| IndieHackers | Browser automation | post, comment |
| Medium | REST API | post |
| GitHub | REST v3 | repo, file, issue, release |
| Discord | Bot API | send, thread |
| Product Hunt | GraphQL v2 | launch, comment |
| Hacker News | Cookie API | submit, comment |
| YouTube | Data API v3 | comment |
| Quora | Browser automation | answer |
| HuggingFace | Hub API | repo, model card, upload |
| BetaList | REST API | submit |
| 14 Directories | Generic adapter | submit |
Built-in rate limiting, content dedup (MD5, 24h window), and dead letter queue for retries.
Spectrawl ranks results by domain trust — something most search tools don't do:
const web = new Spectrawl({
sourceRanker: {
boost: ['github.com', 'news.ycombinator.com'],
block: ['spamsite.com']
}
})
npx spectrawl serve --port 3900
| Method | Path | Description |
|---|---|---|
| POST | `/search` | Search the web |
| POST | `/browse` | Stealth browse a URL |
| POST | `/crawl` | Crawl a website (sync or async) |
| POST | `/extract` | Structured data extraction with LLM |
| POST | `/agent` | Natural language browser actions |
| POST | `/act` | Platform actions |
| GET | `/status` | Auth account health |
| GET | `/health` | Server health |
| GET | `/crawl/jobs` | List async crawl jobs |
| GET | `/crawl/:jobId` | Get crawl job status/results |
| GET | `/crawl/capacity` | System crawl capacity |
curl -X POST http://localhost:3900/search \
-H 'Content-Type: application/json' \
-d '{"query": "best headless browsers 2026", "summarize": true}'
Response:
{
"sources": [
{
"title": "Top Headless Browsers in 2026",
"url": "https://example.com/article",
"snippet": "Short snippet from search...",
"content": "Full page markdown content (if scraped)...",
"source": "gemini-grounded",
"confidence": 0.95
}
],
"answer": "AI-generated summary with [1] citations... (only if summarize: true)",
"cached": false
}
curl -X POST http://localhost:3900/browse \
-H 'Content-Type: application/json' \
-d '{"url": "https://example.com", "screenshot": true, "fullPage": true}'
Response:
{
"content": "# Example Domain\n\nThis domain is for use in illustrative examples...",
"url": "https://example.com",
"title": "Example Domain",
"statusCode": 200,
"screenshot": "iVBORw0KGgoAAAANSUhEUg...base64...",
"cached": false,
"engine": "playwright"
}
curl -X POST http://localhost:3900/crawl \
-H 'Content-Type: application/json' \
-d '{"url": "https://docs.example.com", "depth": 2, "maxPages": 10}'
Response:
{
"pages": [
{
"url": "https://docs.example.com/",
"content": "# Docs Home\n\n...",
"title": "Documentation",
"statusCode": 200
}
],
"stats": {
"pagesScraped": 8,
"duration": 12000,
"concurrency": 4
}
}
curl -X POST http://localhost:3900/act \
-H 'Content-Type: application/json' \
-d '{"platform": "github", "action": "create-issue", "repo": "user/repo", "title": "Bug", "body": "Details..."}'
All errors follow RFC 9457 Problem Details format:
{
"type": "https://spectrawl.dev/errors/rate-limited",
"status": 429,
"title": "rate limited",
"detail": "Reddit rate limit: max 3 posts per hour",
"retryable": true
}
Error types: bad-request (400), unauthorized (401), forbidden (403), not-found (404), rate-limited (429), internal-error (500), upstream-error (502), service-unavailable (503).
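Because every error carries `status`, `title`, `detail`, and `retryable`, client-side handling can be uniform. A sketch against the HTTP API:

```js
const res = await fetch('http://localhost:3900/act', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ platform: 'reddit', action: 'post', subreddit: 'node', title: '...', text: '...' })
})

if (!res.ok) {
  const problem = await res.json() // RFC 9457 Problem Details
  if (problem.retryable) {
    // e.g. rate-limited (429): back off and retry later
  } else {
    throw new Error(`${problem.title}: ${problem.detail}`)
  }
}
```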
Route browsing through residential or datacenter proxies. Required for LinkedIn — see Site-Specific Fallbacks for why.
{
"browse": {
"proxy": {
"host": "gate.smartproxy.com",
"port": 10001,
"username": "YOUR_USER",
"password": "YOUR_PASS"
}
}
}
The proxy is used for all Playwright and Camoufox browsing sessions. You can also start a local rotating proxy server that cycles through multiple upstream proxies:
npx spectrawl proxy --port 8080
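One way to use the rotator is to point the browse proxy at it; note this loopback wiring is an assumption, only the proxy config format itself is documented:

```json
{
  "browse": {
    "proxy": { "host": "127.0.0.1", "port": 8080 }
  }
}
```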
Recommended providers:
| Provider | Price | IPs | Best For |
|---|---|---|---|
| Bright Data | $12+/GB | 72M | ⭐ Best quality, ~99.7% success, social unlockers |
| Smartproxy | $7/GB | 55M | Best budget option, 3-day free trial |
| IPRoyal | $7/GB | 32M | Good alternative |
| Oxylabs | $10+/GB | 100M+ | Enterprise-grade |
Works with any MCP-compatible agent (Claude, Cursor, OpenClaw, LangChain):
npx spectrawl mcp
| Tool | Description | Key Parameters |
|---|---|---|
| `web_search` | Search the web | query, summarize, scrapeTop, minResults |
| `web_browse` | Stealth browse a URL | url, auth, screenshot, html |
| `web_act` | Platform action | platform, action, account, text, title |
| `web_auth` | Manage auth | action (list/add/remove), platform, account |
| `web_status` | Check auth health | — |
npx spectrawl init # create spectrawl.json
npx spectrawl search "query" # search from terminal
npx spectrawl status # check auth health
npx spectrawl serve # start HTTP server
npx spectrawl mcp # start MCP server
npx spectrawl proxy # start rotating proxy server
npx spectrawl install-stealth # download Camoufox browser
npx spectrawl version # show version
spectrawl.json — full defaults:
{
"port": 3900,
"concurrency": 3,
"search": {
"cascade": ["gemini-grounded", "tavily", "brave"],
"scrapeTop": 5
},
"browse": {
"defaultEngine": "playwright",
"proxy": null,
"humanlike": {
"minDelay": 500,
"maxDelay": 2000,
"scrollBehavior": true
}
},
"auth": {
"refreshInterval": "4h",
"cookieStore": "./data/cookies.db"
},
"cache": {
"path": "./data/cache.db",
"searchTtl": 3600,
"scrapeTtl": 86400,
"screenshotTtl": 3600
},
"rateLimit": {
"x": { "postsPerHour": 5, "minDelayMs": 30000 },
"reddit": { "postsPerHour": 3, "minDelayMs": 600000 }
}
}
Spectrawl simulates human browsing patterns by default: randomized delays between actions (500–2000ms) and natural scrolling. Tune it via `browse.humanlike` in `spectrawl.json`.

Environment variables:

| Variable | Purpose |
|---|---|
| `GEMINI_API_KEY` | Free — primary search + summarization (aistudio.google.com) |
| `BRAVE_API_KEY` | Brave Search (2,000 free/month) |
| `TAVILY_API_KEY` | Tavily Search (1,000 free/month) |
| `SERPER_API_KEY` | Serper.dev (2,500 trial queries) |
| `GITHUB_TOKEN` | GitHub adapter |
| `DEVTO_API_KEY` | Dev.to adapter |
| `HF_TOKEN` | HuggingFace adapter |
| `OPENAI_API_KEY` | Alternative LLM for summarization |
| `ANTHROPIC_API_KEY` | Alternative LLM for summarization |
MIT
Add this to claude_desktop_config.json and restart Claude Desktop:
{
"mcpServers": {
"spectrawl": {
"command": "npx",
"args": []
}
}
}