Provides web search and page fetch capabilities using a browser-based approach, enabling LLMs to search DuckDuckGo, Google, or Yandex and retrieve rendered HTML from URLs.
Browser-backed web search and page fetch for local LLMs, exposed as MCP tools and a CLI.
The design history is tracked in docs/IMPLEMENTATION_PLAN.md.
This project is licensed under the MIT License.
Package and entry points: crawly-mcp (distribution), crawly_mcp (Python package), and two console commands, crawly-cli and crawly-mcp. The server exposes three MCP tools:

- search(provider, context) runs a browser-backed search on duckduckgo (default), google, or yandex and returns up to 5 organic result URLs. context is intentionally the search query string for caller compatibility.
- fetch(urls, content_format) fetches 1..5 URLs and returns browser-rendered page content with per-URL pages, errors, and truncated fields. Use content_format="html" for raw HTML or content_format="text" for extracted readable text.
- page_search(url, query) searches for content on a single page. It tries known site-search facilities first (Algolia DocSearch, OpenSearch descriptor, Readthedocs API), then generic GET form detection, then find-in-page text as a fallback. Returns a mode discriminator plus up to 5 results with snippets and optional result URLs.

Set up the environment:

uv sync

Verify a host Chromium is available:

chromium --version
For host usage, crawly defaults to launching a system Chromium binary. If Chromium is installed in a non-standard location, set:
PLAYWRIGHT_CHROMIUM_EXECUTABLE=/path/to/chromium
To force Playwright-managed Chromium instead of a host browser:
PLAYWRIGHT_BROWSER_SOURCE=bundled
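If you opt into the bundled source, fetch Patchright's Chromium first. This is a sketch, assuming patchright is installed in the project environment; the install command is the one referenced in the design notes below:

# Download Patchright-managed Chromium, then run with the bundled source.
uv run patchright install chromium
PLAYWRIGHT_BROWSER_SOURCE=bundled uv run crawly-mcp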
Run the CLI directly:
uv run crawly-cli search --context "python async playwright"
uv run crawly-cli fetch https://example.com
uv run crawly-cli page-search --url https://docs.example.com/guide --query "authentication"
The page-search subcommand prints a JSON PageSearchResponse with mode, attempted, results_url, and results[].
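For quick inspection, that JSON can be piped through jq. The field names come from the PageSearchResponse description above; jq itself is an assumed extra dependency:

# Summarize the response: mode discriminator, attempted facilities, hit count.
uv run crawly-cli page-search --url https://docs.example.com/guide --query "authentication" \
  | jq '{mode, attempted, hits: (.results | length)}'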
Run the MCP server over stdio:
uv run crawly-mcp
Expose HTTP transport instead of stdio:
uv run crawly-mcp --transport streamable-http --host 127.0.0.1 --port 8000
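As a rough reachability check, you can POST a generic MCP initialize request to the endpoint. This is a hedged sketch: the JSON-RPC shape below is the standard MCP handshake, not anything crawly-specific, and the repo's scripts/http_mcp_smoke.py is the supported client for real verification:

# Minimal MCP initialize over streamable HTTP; expect a JSON or SSE reply.
curl -s http://127.0.0.1:8000/mcp \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json, text/event-stream' \
  -d '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"smoke","version":"0.0.1"}}}'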
The MCP server also reads CRAWLY_HOST and CRAWLY_PORT.

The container image uses Patchright-managed Chromium on a plain Python Debian base and defaults to HTTP MCP on port 8000.
Build locally:
docker build -t crawly-mcp:local .
Run locally:
docker run --rm --init -p 8000:8000 crawly-mcp:local
Override the transport to stdio:
docker run --rm --init -i crawly-mcp:local crawly-mcp --transport stdio
Launch the stdio MCP server from the current checkout with an auto-build step:
./scripts/run_crawly_mcp_stdio_container.sh
Launch the HTTP MCP server from the current checkout:
./scripts/run_crawly_mcp_http_container.sh
The container defaults to:
- PLAYWRIGHT_BROWSER_SOURCE=bundled
- CRAWLY_HOST=0.0.0.0
- CRAWLY_PORT=8000
- CRAWLY_FETCH_MAX_SIZE=1048576
- CRAWLY_PROFILE_DIR=/data/profiles
- CRAWLY_PROFILE_CLEANUP_ON_START=true

For local LLMs with smaller context windows, call fetch(..., content_format="text") and lower the payload cap:
CRAWLY_FETCH_MAX_SIZE=16384 ./scripts/run_crawly_mcp_stdio_container.sh
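The same cap should work with the HTTP launcher, assuming it forwards environment variables the same way as the stdio script:

# Lower the per-URL payload cap for the HTTP container as well.
CRAWLY_FETCH_MAX_SIZE=16384 ./scripts/run_crawly_mcp_http_container.sh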
The HTTP MCP endpoint is unauthenticated in v1. Keep it bound to localhost or a private network, or front it with an auth/TLS reverse proxy.
Published images are intended to be:
- ghcr.io/<owner>/crawly-mcp
- <dockerhub-namespace>/crawly-mcp

The first GHCR publish may need a one-time manual visibility change to make the package public.
Run the published GHCR image directly:
docker run --rm --init \
-p 8000:8000 \
-e CRAWLY_HOST=0.0.0.0 \
-e CRAWLY_PORT=8000 \
-e CRAWLY_FETCH_MAX_SIZE=16384 \
-e CRAWLY_BROWSER_LANG=en-US \
-e CRAWLY_BROWSER_LOCATION=America/New_York \
ghcr.io/dshein-alt/crawly-mcp:latest
The most important runtime overrides are:
- CRAWLY_FETCH_MAX_SIZE: caps returned fetch payload size for both content_format="html" and content_format="text".
- CRAWLY_BROWSER_LANG: sets browser locale and primary Accept-Language.
- CRAWLY_BROWSER_LOCATION: sets browser timezone / location persona.

For MCP clients that can launch a local command, point them at the project script so the server comes from the current checkout:
mcpServers:
- name: Crawly MCP
command: /path/to/crawly/scripts/run_crawly_mcp_stdio_container.sh
args: []
env:
CRAWLY_CONTAINER_ENGINE: docker
Replace /path/to/crawly with your checkout path. The launcher rebuilds
crawly-mcp:local before starting the stdio server so container contents stay aligned
with local source changes. Set CRAWLY_MCP_SKIP_BUILD=1 if you want to skip that build
when the local image is already current.
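For example, to start the launcher without the rebuild when the image is already current:

# Skip the auto-build and start the stdio server from the existing image.
CRAWLY_MCP_SKIP_BUILD=1 /path/to/crawly/scripts/run_crawly_mcp_stdio_container.sh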
For clients that support HTTP MCP, start a local or published crawly-mcp HTTP server first,
then point the client at the running instance:
http://127.0.0.1:8000/mcp
For Continue, an HTTP MCP config looks like:
name: New MCP server
version: 0.0.1
schema: v1
mcpServers:
- name: Crawly
type: streamable-http
url: http://127.0.0.1:8000/mcp
The url must match an actually running crawly-mcp HTTP instance.
If your client's MCP config accepts direct URLs, the entry is typically shaped like:
mcpServers:
- name: Crawly MCP
url: http://127.0.0.1:8000/mcp
Set CRAWLY_HTTP_BIND_HOST or CRAWLY_HTTP_BIND_PORT before launching if you need the
listener on a different interface or port.
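For example (the port value is illustrative, and this assumes the HTTP launcher script reads both variables as the note above implies):

# Bind the HTTP listener to all interfaces on an alternate port.
CRAWLY_HTTP_BIND_HOST=0.0.0.0 CRAWLY_HTTP_BIND_PORT=8800 ./scripts/run_crawly_mcp_http_container.sh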
This repo includes two reusable instruction files for small-context web workflows:
One of them is the web-search skill (skill name: Web Search). Use them when a small local model must search, fetch, and synthesize across multiple pages without overflowing context.
crawly uses patchright (a Playwright fork with bundled fingerprint patches) and keeps a small set of per-search-provider persistent profiles on disk. The following env vars tune the browser persona and search trace capture:
| Env var | Default | Purpose |
|---|---|---|
| CRAWLY_BROWSER_LANG | ru-RU | Browser locale and primary Accept-Language value passed to Playwright. |
| CRAWLY_BROWSER_LOCATION | Europe/Moscow | Browser timezone id. TZ is used only as a fallback when this env var is unset. |
| CRAWLY_BROWSER_VIEWPORT | 1366x768 | Browser viewport in WIDTHxHEIGHT form. Invalid values fall back to the default. |
| CRAWLY_FETCH_MAX_SIZE | 1048576 | Max bytes returned per fetched URL after rendering the configured content format. Applies to both raw HTML and extracted text. Lower it for local LLMs with small context windows. |
| CRAWLY_USE_PERSISTENT_PROFILES | true | Toggle per-provider persistent search profiles. Set to false to make search() use a fresh incognito context per request (warm-up still runs). Useful for A/B-testing the persistence feature or for stateless deployments. |
| CRAWLY_PROFILE_DIR | ~/.cache/crawly/profiles | Parent directory for per-provider persistent profiles. Must be a writable mount in containers. Ignored when CRAWLY_USE_PERSISTENT_PROFILES=false. |
| CRAWLY_PROFILE_CLEANUP_ON_START | false | Enable age-based profile cleanup at startup. Set to true in the Dockerfile entrypoint. Unsafe when multiple processes share the profile dir. |
| CRAWLY_PROFILE_MAX_AGE_DAYS | 14 | Age threshold for profile cleanup. |
| CRAWLY_SEARCH_JITTER_MS | 500,1500 | Min/max ms delay between warm-up and real query. Two-int CSV. |
| CRAWLY_TRACE_DIR | unset | Opt-in per-search artifact dump directory. When set, each search() writes meta.json, fingerprint.json, network.jsonl, page.html, and screenshot.png. |
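Putting a few of these together, an English-persona host run might look like this (variable names are from the table above; the viewport value is illustrative):

# Override the default ru-RU/Moscow persona for a US-English run.
CRAWLY_BROWSER_LANG=en-US \
CRAWLY_BROWSER_LOCATION=America/New_York \
CRAWLY_BROWSER_VIEWPORT=1280x720 \
uv run crawly-mcp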
Each provider (duckduckgo, google, yandex) keeps its own subdirectory under CRAWLY_PROFILE_DIR with cookies, localStorage, and session state. In Docker, mount a named volume at whatever path CRAWLY_PROFILE_DIR points to (default in the image: /data/profiles):
docker run -v crawly-profiles:/data/profiles crawly-mcp
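A fuller invocation, consistent with the earlier local-build examples (the volume name is arbitrary):

# Create a named volume once, then reuse it so search profiles survive restarts.
docker volume create crawly-profiles
docker run --rm --init -p 8000:8000 -v crawly-profiles:/data/profiles crawly-mcp:local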
scripts/fingerprint_check.py runs a set of JS assertions against a blank page to verify the browser's JS-visible fingerprint looks like real Chrome:
uv run python scripts/fingerprint_check.py --verbose
Exits non-zero if any check fails. CI runs this on release tags.
Tracing is disabled by default. Set CRAWLY_TRACE_DIR only when you want to compare an automated run with manually collected artifacts:
CRAWLY_TRACE_DIR=./dump/trace uv run crawly-mcp --transport streamable-http
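The same opt-in presumably applies to one-off CLI searches, since the variable is read from the process environment (an assumption; confirm against the source):

# Capture trace artifacts for a single CLI search.
CRAWLY_TRACE_DIR=./dump/trace uv run crawly-cli search --context "python async playwright"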
Each traced search() call writes one directory containing:

- meta.json with provider, query, warm-up/jitter data, final URL/title, and parsed result URLs
- fingerprint.json with JS-visible browser properties
- network.jsonl with request/response/failure events
- page.html and screenshot.png from the terminal page state

Design notes for v1:

- Contexts: fetch() uses a fresh context per request; search() uses per-provider persistent contexts with on-disk profiles keyed by provider.
- Browser source: PLAYWRIGHT_BROWSER_SOURCE=system uses a host Chromium binary (driven by patchright); PLAYWRIGHT_BROWSER_SOURCE=bundled uses patchright-managed Chromium (patchright install chromium).
- Timeouts: 15s per page, 20s total for search, 35s total for fetch.
- URL policy: http/https only, no embedded credentials; blocks loopback/private/link-local/reserved IPs before navigation and on browser subrequests.
- Rendering: 10s settle window.
- Stealth: patchright provides fingerprint patches against common bot-detection checks; provider-specific warm-up hops and synthetic client-hint headers keep the browser identity stable across requests. No CAPTCHA solving or site-specific bypass logic.
- Output: fetch() returns raw HTML by default, or extracted readable text when the request sets content_format="text".
- Size cap: 1 MiB per URL by default; set CRAWLY_FETCH_MAX_SIZE lower when you need smaller MCP payloads. This applies to both content_format="html" and content_format="text". Oversized responses are truncated and reported in truncated.
- robots.txt is not consulted in v1.

Lint and test from the project virtualenv:

source .venv/bin/activate
ruff check .
pytest
Smoke checks:
rg -n "web-search|web_search_mcp" README.md AGENTS.md CHANGELOG.md pyproject.toml src tests
.venv/bin/python scripts/http_mcp_smoke.py --url http://127.0.0.1:8000/mcp
Parser tests run against saved HTML fixtures; selector drift is an expected maintenance cost.