Provides web search and page fetch capabilities using a browser-based approach, enabling LLMs to search DuckDuckGo, Google, or Yandex and retrieve rendered HTML from URLs.
Browser-backed web search and page fetch for local LLMs, exposed as MCP tools and a CLI.
The design history is tracked in docs/IMPLEMENTATION_PLAN.md.
This project is licensed under the MIT License.
Package and entry points: crawly-mcp (distribution), crawly_mcp (Python package), and two console commands, crawly-cli and crawly-mcp. The server exposes three MCP tools:

- search(provider, context) runs a browser-backed search on duckduckgo (default), google, or yandex and returns up to 5 organic result URLs. context is intentionally the search query string for caller compatibility.
- fetch(urls, content_format) fetches 1..5 URLs and returns browser-rendered page content with per-URL pages, errors, and truncated fields. Use content_format="html" for raw HTML or content_format="text" for extracted readable text.
- page_search(url, query) searches for content on a single page. It tries known site-search facilities first (Algolia DocSearch, OpenSearch descriptor, Readthedocs API), then generic GET form detection, then find-in-page text as a fallback. Returns a mode discriminator plus up to 5 results with snippets and optional result URLs.

Set up the environment:

uv sync

Verify a host Chromium is available:

chromium --version
For host usage, crawly defaults to launching a system Chromium binary. If Chromium is installed in a non-standard location, set:
PLAYWRIGHT_CHROMIUM_EXECUTABLE=/path/to/chromium
To force Playwright-managed Chromium instead of a host browser:
PLAYWRIGHT_BROWSER_SOURCE=bundled
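If you opt into the bundled source, fetch Patchright's Chromium first. This is a sketch, assuming patchright is installed in the project environment; the install command is the one referenced in the design notes below:

# Download Patchright-managed Chromium, then run with the bundled source.
uv run patchright install chromium
PLAYWRIGHT_BROWSER_SOURCE=bundled uv run crawly-mcp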
Run the CLI directly:
uv run crawly-cli search --context "python async playwright"
uv run crawly-cli fetch https://example.com
uv run crawly-cli page-search --url https://docs.example.com/guide --query "authentication"
The page-search subcommand prints a JSON PageSearchResponse with mode, attempted, results_url, and results[].
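For quick inspection, that JSON can be piped through jq. The field names come from the PageSearchResponse description above; jq itself is an assumed extra dependency:

# Summarize the response: mode discriminator, attempted facilities, hit count.
uv run crawly-cli page-search --url https://docs.example.com/guide --query "authentication" \
  | jq '{mode, attempted, hits: (.results | length)}'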
Run the MCP server over stdio:
uv run crawly-mcp
Expose HTTP transport instead of stdio:
uv run crawly-mcp --transport streamable-http --host 127.0.0.1 --port 8000
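As a rough reachability check, you can POST a generic MCP initialize request to the endpoint. This is a hedged sketch: the JSON-RPC shape below is the standard MCP handshake, not anything crawly-specific, and the repo's scripts/http_mcp_smoke.py is the supported client for real verification:

# Minimal MCP initialize over streamable HTTP; expect a JSON or SSE reply.
curl -s http://127.0.0.1:8000/mcp \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json, text/event-stream' \
  -d '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"smoke","version":"0.0.1"}}}'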
The MCP server also reads CRAWLY_HOST and CRAWLY_PORT.

The container image uses Patchright-managed Chromium on a plain Python Debian base and defaults to HTTP MCP on port 8000.
Build locally:
docker build -t crawly-mcp:local .
Run locally:
docker run --rm --init -p 8000:8000 crawly-mcp:local
Override the transport to stdio:
docker run --rm --init -i crawly-mcp:local crawly-mcp --transport stdio
Launch the stdio MCP server from the current checkout with an auto-build step:
./scripts/run_crawly_mcp_stdio_container.sh
Launch the HTTP MCP server from the current checkout:
./scripts/run_crawly_mcp_http_container.sh
The container defaults to:
- PLAYWRIGHT_BROWSER_SOURCE=bundled
- CRAWLY_HOST=0.0.0.0
- CRAWLY_PORT=8000
- CRAWLY_FETCH_MAX_SIZE=1048576
- CRAWLY_PROFILE_DIR=/data/profiles
- CRAWLY_PROFILE_CLEANUP_ON_START=true

For local LLMs with smaller context windows, call fetch(..., content_format="text") and lower the payload cap:
CRAWLY_FETCH_MAX_SIZE=16384 ./scripts/run_crawly_mcp_stdio_container.sh
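The same cap should work with the HTTP launcher, assuming it forwards environment variables the same way as the stdio script:

# Lower the per-URL payload cap for the HTTP container as well.
CRAWLY_FETCH_MAX_SIZE=16384 ./scripts/run_crawly_mcp_http_container.sh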
The HTTP MCP endpoint is unauthenticated in v1. Keep it bound to localhost or a private network, or front it with an auth/TLS reverse proxy.
Published images are intended to be:
- ghcr.io/<owner>/crawly-mcp
- <dockerhub-namespace>/crawly-mcp

The first GHCR publish may need a one-time manual visibility change to make the package public.
Run the published GHCR image directly:
docker run --rm --init \
-p 8000:8000 \
-e CRAWLY_HOST=0.0.0.0 \
-e CRAWLY_PORT=8000 \
-e CRAWLY_FETCH_MAX_SIZE=16384 \
-e CRAWLY_BROWSER_LANG=en-US \
-e CRAWLY_BROWSER_LOCATION=America/New_York \
ghcr.io/dshein-alt/crawly-mcp:latest
The most important runtime overrides are:
- CRAWLY_FETCH_MAX_SIZE: caps returned fetch payload size for both content_format="html" and content_format="text".
- CRAWLY_BROWSER_LANG: sets browser locale and primary Accept-Language.
- CRAWLY_BROWSER_LOCATION: sets browser timezone / location persona.

For MCP clients that can launch a local command, point them at the project script so the server comes from the current checkout:
mcpServers:
- name: Crawly MCP
command: /path/to/crawly/scripts/run_crawly_mcp_stdio_container.sh
args: []
env:
CRAWLY_CONTAINER_ENGINE: docker
Replace /path/to/crawly with your checkout path. The launcher rebuilds
crawly-mcp:local before starting the stdio server so container contents stay aligned
with local source changes. Set CRAWLY_MCP_SKIP_BUILD=1 if you want to skip that build
when the local image is already current.
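For example, to start the launcher without the rebuild when the image is already current:

# Skip the auto-build and start the stdio server from the existing image.
CRAWLY_MCP_SKIP_BUILD=1 /path/to/crawly/scripts/run_crawly_mcp_stdio_container.sh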
For clients that support HTTP MCP, start a local or published crawly-mcp HTTP server first,
then point the client at the running instance:
http://127.0.0.1:8000/mcp
For Continue, an HTTP MCP config looks like:
name: New MCP server
version: 0.0.1
schema: v1
mcpServers:
- name: Crawly
type: streamable-http
url: http://127.0.0.1:8000/mcp
The url must match an actually running crawly-mcp HTTP instance.
If your client's MCP config accepts direct URLs, the entry is typically shaped like:
mcpServers:
- name: Crawly MCP
url: http://127.0.0.1:8000/mcp
Set CRAWLY_HTTP_BIND_HOST or CRAWLY_HTTP_BIND_PORT before launching if you need the
listener on a different interface or port.
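For example (the port value is illustrative, and this assumes the HTTP launcher script reads both variables as the note above implies):

# Bind the HTTP listener to all interfaces on an alternate port.
CRAWLY_HTTP_BIND_HOST=0.0.0.0 CRAWLY_HTTP_BIND_PORT=8800 ./scripts/run_crawly_mcp_http_container.sh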
This repo includes two reusable instruction files for small-context web workflows:
One of them is the web-search skill (skill name: Web Search). Use them when a small local model must search, fetch, and synthesize across multiple pages without overflowing context.
crawly uses patchright (a Playwright fork with bundled fingerprint patches) and keeps a small set of per-search-provider persistent profiles on disk. The following env vars tune the browser persona and search trace capture:
| Env var | Default | Purpose |
|---|---|---|
| CRAWLY_BROWSER_LANG | ru-RU | Browser locale and primary Accept-Language value passed to Playwright. |
| CRAWLY_BROWSER_LOCATION | Europe/Moscow | Browser timezone id. TZ is used only as a fallback when this env var is unset. |
| CRAWLY_BROWSER_VIEWPORT | 1366x768 | Browser viewport in WIDTHxHEIGHT form. Invalid values fall back to the default. |
| CRAWLY_FETCH_MAX_SIZE | 1048576 | Max bytes returned per fetched URL after rendering the configured content format. Applies to both raw HTML and extracted text. Lower it for local LLMs with small context windows. |
| CRAWLY_USE_PERSISTENT_PROFILES | true | Toggle per-provider persistent search profiles. Set to false to make search() use a fresh incognito context per request (warm-up still runs). Useful for A/B-testing the persistence feature or for stateless deployments. |
| CRAWLY_PROFILE_DIR | ~/.cache/crawly/profiles | Parent directory for per-provider persistent profiles. Must be a writable mount in containers. Ignored when CRAWLY_USE_PERSISTENT_PROFILES=false. |
| CRAWLY_PROFILE_CLEANUP_ON_START | false | Enable age-based profile cleanup at startup. Set to true in the Dockerfile entrypoint. Unsafe when multiple processes share the profile dir. |
| CRAWLY_PROFILE_MAX_AGE_DAYS | 14 | Age threshold for profile cleanup. |
| CRAWLY_SEARCH_JITTER_MS | 500,1500 | Min/max ms delay between warm-up and real query. Two-int CSV. |
| CRAWLY_TRACE_DIR | unset | Opt-in per-search artifact dump directory. When set, each search() writes meta.json, fingerprint.json, network.jsonl, page.html, and screenshot.png. |
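Putting a few of these together, an English-persona host run might look like this (variable names are from the table above; the viewport value is illustrative):

# Override the default ru-RU/Moscow persona for a US-English run.
CRAWLY_BROWSER_LANG=en-US \
CRAWLY_BROWSER_LOCATION=America/New_York \
CRAWLY_BROWSER_VIEWPORT=1280x720 \
uv run crawly-mcp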
Each provider (duckduckgo, google, yandex) keeps its own subdirectory under CRAWLY_PROFILE_DIR with cookies, localStorage, and session state. In Docker, mount a named volume at whatever path CRAWLY_PROFILE_DIR points to (default in the image: /data/profiles):
docker run -v crawly-profiles:/data/profiles crawly-mcp
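A fuller invocation, consistent with the earlier local-build examples (the volume name is arbitrary):

# Create a named volume once, then reuse it so search profiles survive restarts.
docker volume create crawly-profiles
docker run --rm --init -p 8000:8000 -v crawly-profiles:/data/profiles crawly-mcp:local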
scripts/fingerprint_check.py runs a set of JS assertions against a blank page to verify the browser's JS-visible fingerprint looks like real Chrome:
uv run python scripts/fingerprint_check.py --verbose
Exits non-zero if any check fails. CI runs this on release tags.
Tracing is disabled by default. Set CRAWLY_TRACE_DIR only when you want to compare an automated run with manually collected artifacts:
CRAWLY_TRACE_DIR=./dump/trace uv run crawly-mcp --transport streamable-http
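The same opt-in presumably applies to one-off CLI searches, since the variable is read from the process environment (an assumption; confirm against the source):

# Capture trace artifacts for a single CLI search.
CRAWLY_TRACE_DIR=./dump/trace uv run crawly-cli search --context "python async playwright"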
Each traced search() call writes one directory containing:

- meta.json with provider, query, warm-up/jitter data, final URL/title, and parsed result URLs
- fingerprint.json with JS-visible browser properties
- network.jsonl with request/response/failure events
- page.html and screenshot.png from the terminal page state

Design notes for v1:

- Contexts: fetch() uses a fresh context per request; search() uses per-provider persistent contexts with on-disk profiles keyed by provider.
- Browser source: PLAYWRIGHT_BROWSER_SOURCE=system uses a host Chromium binary (driven by patchright); PLAYWRIGHT_BROWSER_SOURCE=bundled uses patchright-managed Chromium (patchright install chromium).
- Timeouts: 15s per page, 20s total for search, 35s total for fetch.
- URL policy: http/https only, no embedded credentials; blocks loopback/private/link-local/reserved IPs before navigation and on browser subrequests.
- Rendering: 10s settle window.
- Stealth: patchright provides fingerprint patches against common bot-detection checks; provider-specific warm-up hops and synthetic client-hint headers keep the browser identity stable across requests. No CAPTCHA solving or site-specific bypass logic.
- Output: fetch() returns raw HTML by default, or extracted readable text when the request sets content_format="text".
- Size cap: 1 MiB per URL by default; set CRAWLY_FETCH_MAX_SIZE lower when you need smaller MCP payloads. This applies to both content_format="html" and content_format="text". Oversized responses are truncated and reported in truncated.
- robots.txt is not consulted in v1.

Lint and test from the project virtualenv:

source .venv/bin/activate
ruff check .
pytest
Smoke checks:
rg -n "web-search|web_search_mcp" README.md AGENTS.md CHANGELOG.md pyproject.toml src tests
.venv/bin/python scripts/http_mcp_smoke.py --url http://127.0.0.1:8000/mcp
Parser tests run against saved HTML fixtures; selector drift is an expected maintenance cost.