Crawly-MCP
Browser-backed web search and page fetch for local LLMs, exposed as MCP tools and a CLI.
The design history is tracked in docs/IMPLEMENTATION_PLAN.md.
License
This project is licensed under the MIT License.
Naming
Python distribution: crawly-mcp
Import package: crawly_mcp
CLI executable: crawly-cli
MCP server executable: crawly-mcp
Tools
search(provider, context) runs a browser-backed search on duckduckgo (default), google, or yandex and returns up to 5 organic result URLs. The opt-in searxng provider routes the query through a JSON-API call to a SearXNG instance you supply via CRAWLY_SEARXNG_URL; see below.
fetch(urls, content_format) fetches 1..5 URLs and returns browser-rendered page content with per-URL pages, errors, and truncated fields. Use content_format="html" for raw HTML or content_format="text" for extracted readable text.
page_search(url, query) searches for content on a single page. It tries known site-search facilities first (Algolia DocSearch, OpenSearch descriptor, Readthedocs API), then generic GET form detection, then find-in-page text as a fallback. It returns a mode discriminator plus up to 5 results with snippets and optional result URLs.
context is intentionally the search query string for caller compatibility.
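As a rough illustration of the argument shapes described above, the sketch below builds the arguments for search and fetch and wraps them in a JSON-RPC 2.0 tools/call request of the kind an MCP client sends. The helper names and the client-side checks (including the empty-query check) are assumptions made for this example; the server performs its own validation and may differ in detail.

```python
import json

# Option names and limits are taken from the tool descriptions above.
PROVIDERS = {"duckduckgo", "google", "yandex", "searxng"}
CONTENT_FORMATS = {"html", "text"}
MAX_FETCH_URLS = 5


def search_arguments(context: str, provider: str = "duckduckgo") -> dict:
    """Arguments for search(provider, context); context is the query string."""
    if provider not in PROVIDERS:
        raise ValueError(f"unknown provider: {provider!r}")
    if not context.strip():
        # Assumption: an empty query is rejected client-side in this sketch.
        raise ValueError("context (the search query) must not be empty")
    return {"provider": provider, "context": context}


def fetch_arguments(urls: list, content_format: str = "html") -> dict:
    """Arguments for fetch(urls, content_format) with the 1..5 URL limit."""
    if not 1 <= len(urls) <= MAX_FETCH_URLS:
        raise ValueError(f"fetch takes 1..{MAX_FETCH_URLS} URLs, got {len(urls)}")
    if content_format not in CONTENT_FORMATS:
        raise ValueError(f"unknown content_format: {content_format!r}")
    return {"urls": list(urls), "content_format": content_format}


def tools_call_request(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize a JSON-RPC 2.0 tools/call request for the named tool."""
    return json.dumps(
        {
            "jsonrpc": "2.0",
            "id": request_id,
            "method": "tools/call",
            "params": {"name": tool, "arguments": arguments},
        }
    )
```

Most MCP clients build this envelope for you; the sketch is only meant to show which fields each tool expects.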
Setup
uv sync
chromium --version
For host usage, crawly defaults to launching a system Chromium binary. If Chromium is installed in a non-standard location, set:
PLAYWRIGHT_CHROMIUM_EXECUTABLE=/path/to/chromium
To force Playwright-managed Chromium instead of a host browser:
PLAYWRIGHT_BROWSER_SOURCE=bundled
SearXNG (opt-in)
The searxng provider is opt-in. To use it, both conditions must hold:
1. Pass provider="searxng" to the search MCP tool (or --provider searxng on the CLI). It is not the default.
2. Set CRAWLY_SEARXNG_URL to the URL of a SearXNG instance whose JSON output is enabled (search.formats: [..., json] in its settings.yml).
Without CRAWLY_SEARXNG_URL, the provider returns an invalid_input error.
CRAWLY_SEARXNG_URL=http://127.0.0.1:8080/ \
uv run crawly-cli search --provider searxng --context "..."
There is no instance registry and no automatic cross-provider fallback, so failures from your SearXNG instance surface directly to the caller. This is deliberate: pinning is an explicit choice and crawly respects it.
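To make the opt-in behaviour concrete, here is a minimal sketch of how a caller might turn CRAWLY_SEARXNG_URL into a SearXNG JSON search URL and fail fast when the variable is missing. The function name is hypothetical, and the exact request crawly sends is not shown in this README; the sketch relies only on SearXNG's standard /search endpoint with q and format=json.

```python
import os
from urllib.parse import urlencode, urljoin


def searxng_search_url(query: str, env=os.environ) -> str:
    """Build a JSON search URL for the instance named by CRAWLY_SEARXNG_URL."""
    base = env.get("CRAWLY_SEARXNG_URL", "").strip()
    if not base:
        # Mirrors the documented behaviour: no URL means an invalid_input error.
        raise ValueError("invalid_input: CRAWLY_SEARXNG_URL is not set")
    if not base.endswith("/"):
        base += "/"
    # No fallback to another provider: the caller sees whatever this URL returns.
    return urljoin(base, "search") + "?" + urlencode({"q": query, "format": "json"})
```

Passing the environment as a parameter keeps the helper easy to test without touching the real process environment.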
Run alongside a local SearXNG (compose example)
A self-contained recipe is in examples/searxng-compose/: a docker-compose.yml that runs both searxng and crawly-mcp on a shared network, plus a settings.yml with search.formats: [html, json] and server.limiter: false (the two things crawly's searxng provider needs from any instance). The crawly image is pulled from GHCR by default; ports are controlled via a .env file.
cd examples/searxng-compose
cp .env.example .env # edit if you want a non-default host port or local image
docker compose up -d
# MCP HTTP at http://127.0.0.1:10000/mcp/ (CRAWLY_HOST_PORT)
SearXNG itself is reachable only on the compose network (http://searxng:8080/ from crawly); no host port is mapped by default.
See examples/searxng-compose/README.md for the env-var reference and a smoke-test snippet.
Why public instances generally won't work
Public SearXNG instances listed on https://searx.space typically respond to programmatic clients with 429, an HTTP redirect to /, or a 200 with an empty results page — driven by SearXNG's built-in botdetection middleware. The community of instance operators actively discourages automated scraping (it burns down their upstream Google/Bing rate-limit budget). This is why the provider expects a SearXNG you control: self-hosted, on your LAN, or otherwise configured to permit JSON access for trusted clients.
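The three failure shapes listed above can be told apart with a small classifier. This is an illustrative sketch only: the function name and the returned labels are invented for this example and are not part of crawly's error model.

```python
from typing import Optional
from urllib.parse import urlsplit


def classify_searxng_response(
    status: int, location: Optional[str], results: Optional[list]
) -> str:
    """Label a SearXNG response using the symptoms described above."""
    if status == 429:
        return "rate_limited"
    if 300 <= status < 400 and location is not None:
        # A redirect back to the site root is a typical bot-detection response.
        if urlsplit(location).path in ("", "/"):
            return "redirected_to_root"
        return "redirected"
    if status == 200 and not results:
        return "empty_results"
    if status == 200:
        return "ok"
    return "other_error"
```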
Usage
Run the CLI directly:
uv run crawly-cli search --context "python async playwright"
uv run crawly-cli fetch https://example.com
uv run crawly-cli page-search --url https://docs.example.com/guide --query "authentication"
The page-search subcommand prints a JSON PageSearchResponse with mode, attempted, results_url, and results[].
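For scripting, the page-search output can be read back with the standard library. The sketch below assumes the subcommand writes only the JSON document to stdout and uses just the four top-level fields named above; the helper names are hypothetical, and the shape of individual results[] entries is deliberately left untouched because it is not specified here.

```python
import json
import subprocess


def run_page_search(url: str, query: str) -> str:
    """Invoke the CLI shown above and return its stdout (assumed to be JSON)."""
    completed = subprocess.run(
        ["uv", "run", "crawly-cli", "page-search", "--url", url, "--query", query],
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout


def summarize_page_search(raw: str) -> dict:
    """Pick the documented top-level fields out of a PageSearchResponse."""
    response = json.loads(raw)
    results = response.get("results") or []
    return {
        "mode": response.get("mode"),
        "attempted": response.get("attempted"),
        "results_url": response.get("results_url"),
        "result_count": len(results),
    }
```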
Run the MCP server over stdio:
uv run crawly-mcp
Expose HTTP transport instead of stdio:
uv run crawly-mcp --transport streamable-http --host 127.0.0.1 --port 8000
The MCP server also reads:
CRAWLY_HOST
CRAWLY_PORT
Container
The container image uses Patchright-managed Chromium on a plain Python Debian base and defaults to HTTP MCP on port 8000.
Build locally:
docker build -t crawly-mcp:local .
Run locally:
docker run --rm --init -p 8000:8000 crawly-mcp:local
Override the transport to stdio:
docker run --rm --init -i crawly-mcp:local crawly-mcp --transport stdio
Launch the stdio MCP server from the current checkout with an auto-build step:
./scripts/run_crawly_mcp_stdio_container.sh
Launch the HTTP MCP server from the current checkout:
./scripts/run_crawly_mcp_http_container.sh
The container defaults to:
PLAYWRIGHT_BROWSER_SOURCE=bundled
CRAWLY_HOST=0.0.0.0
CRAWLY_PORT=8000
CRAWLY_FETCH_MAX_SIZE=1048576
CRAWLY_PROFILE_DIR=/data/profiles
CRAWLY_PROFILE_CLEANUP_ON_START=true
For local LLMs with smaller context windows, call fetch(..., content_format="text") and lower the payload cap:
CRAWLY_FETCH_MAX_SIZE=16384 ./scripts/run_crawly_mcp_stdio_container.sh
The HTTP MCP endpoint is unauthenticated in v1. Deploy it behind localhost, a private network, or an auth/TLS reverse proxy.
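The effect of the CRAWLY_FETCH_MAX_SIZE cap on a fetched page can be pictured as a byte-bounded truncation that also reports whether anything was cut. This is a sketch under the assumption that the limit is counted in UTF-8 bytes; the function name is hypothetical and crawly's actual truncation logic may differ.

```python
def cap_payload(content: str, max_size: int = 1_048_576) -> tuple:
    """Return (content, truncated) with content limited to max_size bytes."""
    data = content.encode("utf-8")
    if len(data) <= max_size:
        return content, False
    # errors="ignore" drops a multi-byte character that the cut split in half.
    return data[:max_size].decode("utf-8", errors="ignore"), True
```

With a 16384-byte cap, a typical article in content_format="text" fits far more readable content into a small context window than the same budget spent on raw HTML.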
Published images are intended to be:
ghcr.io/<owner>/crawly-mcp
<dockerhub-namespace>/crawly-mcp
The first GHCR publish may need a one-time manual visibility change to make the package public.
Integration Setup
Docker Run
Run the published GHCR image directly:
docker run --rm --init \
-p 8000:8000 \
-e CRAWLY_HOST=0.0.0.0 \
-e CRAWLY_PORT=8000 \
-e CRAWLY_FETCH_MAX_SIZE=16384 \
-e CRAWLY_BROWSER_LANG=en-US \
-e CRAWLY_BROWSER_LOCATION=America/New_York \
ghcr.io/dshein-alt/crawly-mcp:latest
The most important runtime overrides are:
CRAWLY_FETCH_MAX_SIZE: caps returned fetch payload size for both content_format="html" and content_format="text".
CRAWLY_BROWSER_LANG: sets browser locale and primary Accept-Language.
CRAWLY_BROWSER_LOCATION: sets browser timezone / location persona.
MCP Client Config
For MCP clients that can launch a local command, point them at the project script so the server comes from the current checkout:
mcpServers:
  - name: Crawly MCP
    command: /path/to/crawly/scripts/run_crawly_mcp_stdio_container.sh
    args: []
    env:
      CRAWLY_CONTAINER_ENGINE: docker
Replace /path/to/crawly with your checkout path. The launcher rebuilds crawly-mcp:local before starting the stdio server so container contents stay aligned with local source changes. Set CRAWLY_MCP_SKIP_BUILD=1 if you want to skip that build when the local image is already current.
For clients that support HTTP MCP, start a local or published crawly-mcp HTTP server first,
then point the client at the running instance:
http://127.0.0.1:8000/mcp
For Continue, an HTTP MCP config looks like:
name: New MCP server
version: 0.0.1
schema: v1
mcpServers:
  - name: Crawly
    type: streamable-http
    url: http://127.0.0.1:8000/mcp
The url must match an actually running crawly-mcp HTTP instance.
If your client's MCP config accepts direct URLs, the entry is typically shaped like:
mcpServers:
  - name: Crawly MCP
    url: http://127.0.0.1:8000/mcp
Set CRAWLY_HTTP_BIND_HOST or CRAWLY_HTTP_BIND_PORT before launching if you need the listener on a different interface or port.
Bundled Skill / Prompt
This repo includes two reusable instruction files for small-context web workflows:
skills/web-search/SKILL.md: Codex skill guidance under the web-search skill name
skills/continue-web-search.md: Continue-native invokable prompt named Web Search
Use them when a small local model must search, fetch, and synthesize across multiple pages without overflowing context.
Browser configuration
crawly uses patchright (a Playwright fork with bundled fingerprint patches) and keeps a small set of per-search-provider persistent profiles on disk. The following env vars tune the browser persona and search trace capture:
Env var | Default | Purpose |
CRAWLY_BROWSER_LANG | [...] | Browser locale and primary Accept-Language. |
CRAWLY_BROWSER_LOCATION | [...] | Browser timezone id. |
[...] | [...] | Browser viewport in [...] |
CRAWLY_FETCH_MAX_SIZE | 1048576 | Max bytes returned per fetched URL after rendering the configured content format. This limit applies to both raw HTML and extracted text. Lower it for local LLMs with small context windows. |
[...] | [...] | Toggle per-provider persistent search profiles. Set to [...] |
CRAWLY_PROFILE_DIR | [...] | Parent directory for per-provider persistent profiles. Must be a writable mount in containers. Ignored when [...] |
CRAWLY_PROFILE_CLEANUP_ON_START | [...] | Enable age-based profile cleanup at startup. Set to [...] |
[...] | [...] | Age threshold for profile cleanup. |
[...] | [...] | Min/max ms delay between warm-up and real query. Two-int CSV. |
CRAWLY_TRACE_DIR | unset | Opt-in per-search artifact dump directory. When set, each [...] |
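The warm-up delay row above takes a "two-int CSV" of minimum and maximum milliseconds. A minimal sketch of how such a value could be parsed and sampled is shown below; the function names are hypothetical, the example value is invented, and the uniform distribution is an assumption rather than a statement about crawly's internals.

```python
import random


def parse_jitter(csv_value: str) -> tuple:
    """Parse a 'min,max' millisecond pair such as '500,1500'."""
    parts = [part.strip() for part in csv_value.split(",")]
    if len(parts) != 2:
        raise ValueError("expected two comma-separated integers, e.g. '500,1500'")
    low, high = int(parts[0]), int(parts[1])  # int() raises ValueError on junk
    if low < 0 or high < low:
        raise ValueError("expected 0 <= min <= max")
    return low, high


def pick_delay_seconds(bounds: tuple, rng=random) -> float:
    """Sample a delay between the bounds and convert it to seconds."""
    low, high = bounds
    return rng.uniform(low, high) / 1000.0
```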
Profile persistence
Each provider (duckduckgo, google, yandex) keeps its own subdirectory under CRAWLY_PROFILE_DIR with cookies, localStorage, and session state. In Docker, mount a named volume at whatever path CRAWLY_PROFILE_DIR points to (default in the image: /data/profiles):
docker run -v crawly-profiles:/data/profiles crawly-mcp
Fingerprint canary
scripts/fingerprint_check.py runs a set of JS assertions against a blank page to verify the browser's JS-visible fingerprint looks like real Chrome:
uv run python scripts/fingerprint_check.py --verbose
Exits non-zero if any check fails. CI runs this on release tags.
Search tracing
Tracing is disabled by default. Set CRAWLY_TRACE_DIR only when you want to compare an automated run with manually collected artifacts:
CRAWLY_TRACE_DIR=./dump/trace uv run crawly-mcp --transport streamable-http
Each traced search() call writes one directory containing:
meta.json with provider, query, warm-up/jitter data, final URL/title, and parsed result URLs
fingerprint.json with JS-visible browser properties
network.jsonl with request/response/failure events
page.html and screenshot.png from the terminal page state
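When comparing traces, it can help to load one trace directory programmatically. The sketch below reads meta.json as a single JSON document and network.jsonl as one JSON object per line, which is what the file extensions suggest; the helper name is hypothetical, and the key names inside either file are not assumed.

```python
import json
from pathlib import Path


def load_trace(trace_dir):
    """Return (meta, events) from a single traced search() directory."""
    root = Path(trace_dir)
    meta = json.loads((root / "meta.json").read_text(encoding="utf-8"))
    events = []
    with (root / "network.jsonl").open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # skip blank lines between events
                events.append(json.loads(line))
    return meta, events
```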
Design Notes
One shared incognito browser per process for fetch() (fresh context per request).
search() uses per-provider persistent contexts with on-disk profiles keyed by provider.
PLAYWRIGHT_BROWSER_SOURCE=system uses a host Chromium binary (driven by patchright).
PLAYWRIGHT_BROWSER_SOURCE=bundled uses patchright-managed Chromium (patchright install chromium).
Global navigation concurrency cap of 3.
Timeouts: 15s per page, 20s total for search, 35s total for fetch.
SSRF guard: http/https only, no embedded credentials, blocks loopback/private/link-local/reserved IPs before navigation and on browser subrequests.
JavaScript challenge pages get a bounded 10s settle window.
patchright provides fingerprint patches against common bot-detection checks; provider-specific warm-up hops and synthetic client-hint headers keep the browser identity stable across requests. No CAPTCHA solving or site-specific bypass logic.
fetch() returns raw HTML by default, or extracted readable text when the request sets content_format="text".
Returned fetch content is capped at 1 MiB per URL by default; set CRAWLY_FETCH_MAX_SIZE lower when you need smaller MCP payloads. This applies to both content_format="html" and content_format="text". Oversized responses are truncated and reported in truncated.
robots.txt is not consulted in v1.
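The SSRF guard rules listed above can be sketched with the standard library. This is an illustration, not crawly's implementation: the function names are hypothetical, and the sketch only checks IP literals. A real guard also has to resolve hostnames (including names such as localhost) and re-check every browser subrequest, as the notes describe.

```python
import ipaddress
from urllib.parse import urlsplit


def is_blocked_ip(host: str) -> bool:
    """True for loopback, private, link-local, or reserved IP literals."""
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal; a hostname must be resolved before it can be judged.
        return False
    return ip.is_loopback or ip.is_private or ip.is_link_local or ip.is_reserved


def check_url(url: str) -> None:
    """Raise ValueError if the URL breaks one of the guard rules."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        raise ValueError("only http/https URLs are allowed")
    if parts.username is not None or parts.password is not None:
        raise ValueError("embedded credentials are not allowed")
    host = parts.hostname
    if not host:
        raise ValueError("URL has no host")
    if is_blocked_ip(host):
        raise ValueError(f"blocked address: {host}")
```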
Development
source .venv/bin/activate
ruff check .
pytest
Smoke checks:
rg -n "web-search|web_search_mcp" README.md AGENTS.md CHANGELOG.md pyproject.toml src tests
.venv/bin/python scripts/http_mcp_smoke.py --url http://127.0.0.1:8000/mcp
Parser tests run against saved HTML fixtures; selector drift is an expected maintenance cost.
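To illustrate what a fixture-based parser test looks like in general, the sketch below extracts absolute links from a saved HTML string with the standard library and checks them with a plain assertion. The class, function, and inline fixture are all invented for this example; the project's real parsers use provider-specific selectors, which is where the selector drift comes from.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect absolute http(s) hrefs from anchor tags in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.startswith(("http://", "https://")):
            self.links.append(href)


def extract_result_urls(html: str, limit: int = 5) -> list:
    """Return up to `limit` result URLs found in the saved page."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links[:limit]


# Hypothetical inline fixture standing in for a saved results page.
FIXTURE = '<div><a href="https://example.com/a">A</a><a href="/local">skip</a></div>'


def test_extracts_absolute_links_only():
    assert extract_result_urls(FIXTURE) == ["https://example.com/a"]
```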