Crawly-MCP

Browser-backed web search and page fetch for local LLMs, exposed as MCP tools and a CLI.

The design history is tracked in docs/IMPLEMENTATION_PLAN.md.

License

This project is licensed under the MIT License.

Naming

  • Python distribution: crawly-mcp

  • Import package: crawly_mcp

  • CLI executable: crawly-cli

  • MCP server executable: crawly-mcp

Tools

  • search(provider, context) runs a browser-backed search on duckduckgo (default), google, or yandex and returns up to 5 organic result URLs. The opt-in searxng provider routes the query through a JSON-API call to a SearXNG instance you supply via CRAWLY_SEARXNG_URL; see below.

  • fetch(urls, content_format) fetches 1..5 URLs and returns browser-rendered page content with per-URL pages, errors, and truncated fields. Use content_format="html" for raw HTML or content_format="text" for extracted readable text.

  • page_search(url, query) searches for content on a single page. Tries known site-search facilities first (Algolia DocSearch, OpenSearch descriptor, Readthedocs API), then generic GET form detection, then find-in-page text as a fallback. Returns a mode discriminator plus up to 5 results with snippets and optional result URLs.
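The last step of that chain, the find-in-page fallback, can be pictured with a minimal sketch like the one below. This is not crawly's implementation: the function name find_in_page, the snippet radius, and the plain-string result shape are assumptions made for illustration. Only the cap of 5 results comes from the tool description.

```python
import re


def find_in_page(text: str, query: str, limit: int = 5, radius: int = 40) -> list[str]:
    """Return up to `limit` snippets around case-insensitive matches of `query`.

    Hypothetical helper: `radius` is the number of characters of context
    kept on each side of a match.
    """
    if not query:
        return []
    # re.escape treats the query as literal text, not as a regex.
    pattern = re.compile(re.escape(query), re.IGNORECASE)
    snippets: list[str] = []
    for match in pattern.finditer(text):
        lo = max(0, match.start() - radius)
        hi = min(len(text), match.end() + radius)
        snippets.append(text[lo:hi].strip())
        if len(snippets) == limit:
            break
    return snippets
```

A real fallback would run over the rendered page text rather than a plain string, and would probably merge overlapping snippets.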

The context parameter carries the search query string; the name is kept intentionally for caller compatibility.
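As a rough illustration of the argument shapes, the sketch below builds the dictionaries an MCP client might send for search and fetch. The helper names search_args and fetch_args are hypothetical and the validation is only a client-side convenience. The parameter names, provider list, 1..5 URL limit, and content_format values are taken from the tool descriptions above.

```python
PROVIDERS = {"duckduckgo", "google", "yandex", "searxng"}
CONTENT_FORMATS = {"html", "text"}


def search_args(query: str, provider: str = "duckduckgo") -> dict:
    """Build arguments for the search tool; the query goes in `context`."""
    if provider not in PROVIDERS:
        raise ValueError(f"unknown provider: {provider!r}")
    if not query.strip():
        raise ValueError("query must not be empty")
    return {"provider": provider, "context": query}


def fetch_args(urls: list[str], content_format: str = "html") -> dict:
    """Build arguments for the fetch tool (1..5 URLs per call)."""
    if not 1 <= len(urls) <= 5:
        raise ValueError("fetch accepts between 1 and 5 URLs")
    if content_format not in CONTENT_FORMATS:
        raise ValueError(f"unknown content_format: {content_format!r}")
    return {"urls": list(urls), "content_format": content_format}
```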

Setup

uv sync
chromium --version

For host usage, crawly defaults to launching a system Chromium binary. If Chromium is installed in a non-standard location, set:

PLAYWRIGHT_CHROMIUM_EXECUTABLE=/path/to/chromium

To force Playwright-managed Chromium instead of a host browser:

PLAYWRIGHT_BROWSER_SOURCE=bundled

SearXNG (opt-in)

The searxng provider is opt-in. To use it, both conditions must hold:

  1. Pass provider="searxng" to the search MCP tool (or --provider searxng on the CLI). It is not the default.

  2. Set CRAWLY_SEARXNG_URL to the URL of a SearXNG instance whose JSON output is enabled (search.formats: [..., json] in its settings.yml).

Without CRAWLY_SEARXNG_URL, the provider returns an invalid_input error.

CRAWLY_SEARXNG_URL=http://127.0.0.1:8080/ \
  uv run crawly-cli search --provider searxng --context "..."

There is no instance registry and no automatic cross-provider fallback — failures from your SearXNG instance surface directly to the caller. This is deliberate: pinning is an explicit choice and crawly respects it.
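For orientation, a SearXNG JSON query is a GET request to /search with q and format=json, and the response carries a results array whose entries have a url field. The sketch below shows that shape under stated assumptions: the helper names are hypothetical, and how crawly itself builds the request or filters results is not documented here.

```python
from urllib.parse import urlencode


def searxng_search_url(base_url: str, query: str) -> str:
    """Build the JSON search URL for a SearXNG instance."""
    params = urlencode({"q": query, "format": "json"})
    return base_url.rstrip("/") + "/search?" + params


def top_result_urls(payload: dict, limit: int = 5) -> list[str]:
    """Pick up to `limit` distinct result URLs from a decoded JSON response."""
    urls: list[str] = []
    for item in payload.get("results", []):
        url = item.get("url")
        if url and url not in urls:
            urls.append(url)
        if len(urls) == limit:
            break
    return urls
```

If the instance has JSON output disabled, SearXNG typically answers such a request with 403, which is why the format must be enabled in settings.yml.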

Run alongside a local SearXNG (compose example)

A self-contained recipe is in examples/searxng-compose/: a docker-compose.yml that runs both searxng and crawly-mcp on a shared network, plus a settings.yml with search.formats: [html, json] and server.limiter: false (the two things crawly's searxng provider needs from any instance). The crawly image is pulled from GHCR by default; ports are controlled via a .env file.

cd examples/searxng-compose
cp .env.example .env        # edit if you want a non-default host port or local image
docker compose up -d
# MCP HTTP at http://127.0.0.1:10000/mcp/   (CRAWLY_HOST_PORT)

SearXNG itself is reachable only on the compose network (http://searxng:8080/ from crawly); no host port is mapped by default.

See examples/searxng-compose/README.md for the env-var reference and a smoke-test snippet.

Why public instances generally won't work

Public SearXNG instances listed on https://searx.space typically respond to programmatic clients with 429, an HTTP redirect to /, or a 200 with an empty results page — driven by SearXNG's built-in botdetection middleware. The community of instance operators actively discourages automated scraping (it burns down their upstream Google/Bing rate-limit budget). This is why the provider expects a SearXNG you control: self-hosted, on your LAN, or otherwise configured to permit JSON access for trusted clients.

Usage

Run the CLI directly:

uv run crawly-cli search --context "python async playwright"
uv run crawly-cli fetch https://example.com
uv run crawly-cli page-search --url https://docs.example.com/guide --query "authentication"

The page-search subcommand prints a JSON PageSearchResponse with mode, attempted, results_url, and results[].

Run the MCP server over stdio:

uv run crawly-mcp

Expose HTTP transport instead of stdio:

uv run crawly-mcp --transport streamable-http --host 127.0.0.1 --port 8000

The MCP server also reads:

  • CRAWLY_HOST

  • CRAWLY_PORT

Container

The container image uses Patchright-managed Chromium on a plain Python Debian base and defaults to HTTP MCP on port 8000.

Build locally:

docker build -t crawly-mcp:local .

Run locally:

docker run --rm --init -p 8000:8000 crawly-mcp:local

Override the transport to stdio:

docker run --rm --init -i crawly-mcp:local crawly-mcp --transport stdio

Launch the stdio MCP server from the current checkout with an auto-build step:

./scripts/run_crawly_mcp_stdio_container.sh

Launch the HTTP MCP server from the current checkout:

./scripts/run_crawly_mcp_http_container.sh

The container defaults to:

  • PLAYWRIGHT_BROWSER_SOURCE=bundled

  • CRAWLY_HOST=0.0.0.0

  • CRAWLY_PORT=8000

  • CRAWLY_FETCH_MAX_SIZE=1048576

  • CRAWLY_PROFILE_DIR=/data/profiles

  • CRAWLY_PROFILE_CLEANUP_ON_START=true

For local LLMs with smaller context windows, call fetch(..., content_format="text") and lower the payload cap:

CRAWLY_FETCH_MAX_SIZE=16384 ./scripts/run_crawly_mcp_stdio_container.sh
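To make the effect of the cap concrete, here is a minimal sketch of a byte-based limit with a truncated flag. It is an illustration, not crawly's code: the function names are hypothetical, and both the fallback to the default on invalid values and the UTF-8 boundary handling are assumptions.

```python
import os

DEFAULT_MAX_SIZE = 1_048_576  # 1 MiB, the documented default


def max_fetch_size(env=None) -> int:
    """Read CRAWLY_FETCH_MAX_SIZE, falling back to the default (assumed behavior)."""
    env = os.environ if env is None else env
    try:
        value = int(env.get("CRAWLY_FETCH_MAX_SIZE", ""))
    except ValueError:
        return DEFAULT_MAX_SIZE
    return value if value > 0 else DEFAULT_MAX_SIZE


def cap_content(content: str, max_bytes: int) -> tuple[str, bool]:
    """Return (content, truncated), cutting at `max_bytes` of UTF-8."""
    data = content.encode("utf-8")
    if len(data) <= max_bytes:
        return content, False
    # errors="ignore" drops a multi-byte character split by the cut.
    return data[:max_bytes].decode("utf-8", errors="ignore"), True
```

With a 16384-byte cap, a model with a small context window receives at most about 16 KiB per URL, and the truncated field tells it that content was cut.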

The HTTP MCP endpoint is unauthenticated in v1. Bind it to localhost, keep it on a private network, or put it behind a reverse proxy that handles auth and TLS.

Published images are intended to be:

  • ghcr.io/<owner>/crawly-mcp

  • <dockerhub-namespace>/crawly-mcp

The first GHCR publish may need a one-time manual visibility change to make the package public.

Integration Setup

Docker Run

Run the published GHCR image directly:

docker run --rm --init \
  -p 8000:8000 \
  -e CRAWLY_HOST=0.0.0.0 \
  -e CRAWLY_PORT=8000 \
  -e CRAWLY_FETCH_MAX_SIZE=16384 \
  -e CRAWLY_BROWSER_LANG=en-US \
  -e CRAWLY_BROWSER_LOCATION=America/New_York \
  ghcr.io/dshein-alt/crawly-mcp:latest

The most important runtime overrides are:

  • CRAWLY_FETCH_MAX_SIZE: caps returned fetch payload size for both content_format="html" and content_format="text".

  • CRAWLY_BROWSER_LANG: sets browser locale and primary Accept-Language.

  • CRAWLY_BROWSER_LOCATION: sets browser timezone / location persona.

MCP Client Config

For MCP clients that can launch a local command, point them at the project script so the server comes from the current checkout:

mcpServers:
  - name: Crawly MCP
    command: /path/to/crawly/scripts/run_crawly_mcp_stdio_container.sh
    args: []
    env:
      CRAWLY_CONTAINER_ENGINE: docker

Replace /path/to/crawly with your checkout path. The launcher rebuilds crawly-mcp:local before starting the stdio server so container contents stay aligned with local source changes. Set CRAWLY_MCP_SKIP_BUILD=1 if you want to skip that build when the local image is already current.

For clients that support HTTP MCP, start a local or published crawly-mcp HTTP server first, then point the client at the running instance:

http://127.0.0.1:8000/mcp

For Continue, an HTTP MCP config looks like:

name: New MCP server
version: 0.0.1
schema: v1
mcpServers:
  - name: Crawly
    type: streamable-http
    url: http://127.0.0.1:8000/mcp

The url must match an actually running crawly-mcp HTTP instance.

If your client's MCP config accepts direct URLs, the entry is typically shaped like:

mcpServers:
  - name: Crawly MCP
    url: http://127.0.0.1:8000/mcp

Set CRAWLY_HTTP_BIND_HOST or CRAWLY_HTTP_BIND_PORT before launching if you need the listener on a different interface or port.

Bundled Skill / Prompt

This repo includes two reusable instruction files for small-context web workflows. Use them when a small local model must search, fetch, and synthesize across multiple pages without overflowing context.

Browser configuration

crawly uses patchright (a Playwright fork with bundled fingerprint patches) and keeps a small set of per-search-provider persistent profiles on disk. The following env vars tune the browser persona and search trace capture:

| Env var | Default | Purpose |
| --- | --- | --- |
| CRAWLY_BROWSER_LANG | ru-RU | Browser locale and primary Accept-Language value passed to Playwright. |
| CRAWLY_BROWSER_LOCATION | Europe/Moscow | Browser timezone id. TZ is used only as a fallback when this env var is unset. |
| CRAWLY_BROWSER_VIEWPORT | 1366x768 | Browser viewport in WIDTHxHEIGHT form. Invalid values fall back to the default. |
| CRAWLY_FETCH_MAX_SIZE | 1048576 | Max bytes returned per fetched URL after rendering the configured content format. This limit applies to both raw HTML and extracted text. Lower it for local LLMs with small context windows. |
| CRAWLY_USE_PERSISTENT_PROFILES | true | Toggle per-provider persistent search profiles. Set to false to make search() use a fresh incognito context per request (warm-up still runs). Useful for A/B-testing the persistence feature or for stateless deployments. |
| CRAWLY_PROFILE_DIR | ~/.cache/crawly/profiles | Parent directory for per-provider persistent profiles. Must be a writable mount in containers. Ignored when CRAWLY_USE_PERSISTENT_PROFILES=false. |
| CRAWLY_PROFILE_CLEANUP_ON_START | false | Enable age-based profile cleanup at startup. Set to true in the Dockerfile entrypoint. Unsafe when multiple processes share the profile dir. |
| CRAWLY_PROFILE_MAX_AGE_DAYS | 14 | Age threshold for profile cleanup. |
| CRAWLY_SEARCH_JITTER_MS | 500,1500 | Min/max ms delay between warm-up and real query. Two-int CSV. |
| CRAWLY_TRACE_DIR | unset | Opt-in per-search artifact dump directory. When set, each search() writes meta.json, fingerprint.json, network.jsonl, page.html, and screenshot.png. |
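The two structured values in this list, the WIDTHxHEIGHT viewport and the two-int jitter CSV, could be parsed roughly as follows. This is a sketch under assumptions: the function names are hypothetical, and falling back to the default for a malformed jitter value is a guess (the fallback is only documented for the viewport).

```python
import re

DEFAULT_VIEWPORT = (1366, 768)
DEFAULT_JITTER_MS = (500, 1500)


def parse_viewport(raw):
    """Parse 'WIDTHxHEIGHT'; invalid values fall back to the default."""
    match = re.fullmatch(r"(\d+)x(\d+)", (raw or "").strip())
    if not match:
        return DEFAULT_VIEWPORT
    width, height = int(match[1]), int(match[2])
    if width == 0 or height == 0:
        return DEFAULT_VIEWPORT
    return width, height


def parse_jitter_ms(raw):
    """Parse 'MIN,MAX' in milliseconds (fallback behavior is assumed)."""
    try:
        # Unpacking raises ValueError unless there are exactly two ints.
        low, high = (int(part) for part in (raw or "").split(","))
    except ValueError:
        return DEFAULT_JITTER_MS
    if low < 0 or high < low:
        return DEFAULT_JITTER_MS
    return low, high
```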

Profile persistence

Each provider (duckduckgo, google, yandex) keeps its own subdirectory under CRAWLY_PROFILE_DIR with cookies, localStorage, and session state. In Docker, mount a named volume at whatever path CRAWLY_PROFILE_DIR points to (default in the image: /data/profiles):

docker run -v crawly-profiles:/data/profiles crawly-mcp
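The layout and the age-based cleanup controlled by CRAWLY_PROFILE_MAX_AGE_DAYS can be sketched as below. The helper names are hypothetical, and using the directory modification time as the age signal is an assumption; crawly may track profile age differently.

```python
import time
from pathlib import Path
from typing import Optional


def profile_dir(root: Path, provider: str) -> Path:
    """One subdirectory per provider under the profile root."""
    return root / provider


def stale_profiles(root: Path, max_age_days: int = 14,
                   now: Optional[float] = None) -> list[Path]:
    """List profile directories older than `max_age_days` (by mtime, assumed)."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    if not root.is_dir():
        return []
    return sorted(
        path for path in root.iterdir()
        if path.is_dir() and path.stat().st_mtime < cutoff
    )
```

The sketch only lists candidates. Deleting them while another process uses the same directory is the unsafe case the cleanup setting warns about.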

Fingerprint canary

scripts/fingerprint_check.py runs a set of JS assertions against a blank page to verify the browser's JS-visible fingerprint looks like real Chrome:

uv run python scripts/fingerprint_check.py --verbose

The script exits non-zero if any check fails. CI runs it on release tags.

Search tracing

Tracing is disabled by default. Set CRAWLY_TRACE_DIR only when you want to compare an automated run with manually collected artifacts:

CRAWLY_TRACE_DIR=./dump/trace uv run crawly-mcp --transport streamable-http

Each traced search() call writes one directory containing:

  • meta.json with provider, query, warm-up/jitter data, final URL/title, and parsed result URLs

  • fingerprint.json with JS-visible browser properties

  • network.jsonl with request/response/failure events

  • page.html and screenshot.png from the terminal page state
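When comparing runs, a small script can summarize one trace directory. The sketch below relies only on the file names listed above; it assumes meta.json holds a JSON object and network.jsonl holds one JSON value per line, and the summarize_trace helper is hypothetical.

```python
import json
from pathlib import Path


def summarize_trace(trace_dir: Path) -> dict:
    """Summarize one traced search() directory without assuming field names."""
    meta = json.loads((trace_dir / "meta.json").read_text(encoding="utf-8"))
    events = []
    with (trace_dir / "network.jsonl").open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # skip blank lines
                events.append(json.loads(line))
    return {
        "meta_keys": sorted(meta),
        "network_events": len(events),
        "has_page_html": (trace_dir / "page.html").is_file(),
        "has_screenshot": (trace_dir / "screenshot.png").is_file(),
    }
```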

Design Notes

  • One shared incognito browser per process for fetch() (fresh context per request). search() uses per-provider persistent contexts with on-disk profiles keyed by provider.

  • PLAYWRIGHT_BROWSER_SOURCE=system uses a host Chromium binary (driven by patchright).

  • PLAYWRIGHT_BROWSER_SOURCE=bundled uses patchright-managed Chromium (patchright install chromium).

  • Global navigation concurrency cap of 3.

  • Timeouts: 15s per page, 20s total for search, 35s total for fetch.

  • SSRF guard: http/https only, no embedded credentials, blocks loopback/private/link-local/reserved IPs before navigation and on browser subrequests.

  • JavaScript challenge pages get a bounded 10s settle window. patchright provides fingerprint patches against common bot-detection checks; provider-specific warm-up hops and synthetic client-hint headers keep the browser identity stable across requests. No CAPTCHA solving or site-specific bypass logic.

  • fetch() returns raw HTML by default, or extracted readable text when the request sets content_format="text".

  • Returned fetch content is capped at 1 MiB per URL by default; set CRAWLY_FETCH_MAX_SIZE lower when you need smaller MCP payloads. This applies to both content_format="html" and content_format="text". Oversized responses are truncated and reported in truncated.

  • robots.txt is not consulted in v1.
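The SSRF guard listed above can be approximated with the standard library. This is a simplified sketch, not crawly's guard: the function name is hypothetical, and it only checks literal IP addresses and localhost names. A real guard also has to resolve hostnames and check every resolved address, and repeat the check on browser subrequests.

```python
import ipaddress
from urllib.parse import urlsplit


def is_url_allowed(url: str) -> bool:
    """Reject non-http(s) URLs, embedded credentials, and internal IPs."""
    try:
        parts = urlsplit(url)
        host = parts.hostname
    except ValueError:  # e.g. malformed IPv6 literal
        return False
    if parts.scheme not in ("http", "https"):
        return False
    if parts.username is not None or parts.password is not None:
        return False
    if not host or host == "localhost" or host.endswith(".localhost"):
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal. DNS resolution is omitted in this sketch.
        return True
    return not (ip.is_loopback or ip.is_private
                or ip.is_link_local or ip.is_reserved)
```

For example, is_url_allowed("http://169.254.169.254/") returns False because the address is link-local, while is_url_allowed("https://example.com/") returns True.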

Development

source .venv/bin/activate
ruff check .
pytest

Smoke checks:

rg -n "web-search|web_search_mcp" README.md AGENTS.md CHANGELOG.md pyproject.toml src tests
.venv/bin/python scripts/http_mcp_smoke.py --url http://127.0.0.1:8000/mcp

Parser tests run against saved HTML fixtures; selector drift is an expected maintenance cost.
