
SoloCrawl is a small, self-hosted, fully async Python tool that does three things well: it searches the web across many sources at once, scrapes pages into clean markdown that's ready for an LLM, and looks up the latest package version from official registries. It runs on your machine, for free, and works the moment you install it: nothing to sign up for, no keys to paste.

It's built for individual developers and people tinkering with local LLMs. Use it as an MCP server (LM Studio, Claude Desktop, OpenCode, …), straight from the CLI, or as a Python library.

✨ Highlights

🔌 Zero config

Search, scrape, and package lookup all work out of the box, with no accounts and no keys.

🔎 Federated search

Queries up to 11 sources, merges the results with Reciprocal Rank Fusion, and de-duplicates URLs into one clean ranking.

📄 Smart scraping

HTML → tidy markdown via trafilatura/readability, with a Playwright browser fallback for JS-heavy pages that is used only when it's actually needed.

📦 Live package versions

10 ecosystems (PyPI, npm, crates.io, Maven, Go, …) resolved live from official registries, never from a stale local DB.

🤖 MCP-native

Drops straight into local LLM tooling as a stdio MCP server with five ready tools.

⚡ Fully async, bounded

One shared HTTP client, one recycled browser, global + per-domain concurrency limits. Fast without hammering anyone.

🧩 Hackable

Add a search or package provider as a single self-registering file; the core stays untouched.

🔒 Safe by default

Blocks localhost/private/cloud-metadata targets and honours robots.txt.


🚀 Quick start

The recommended way to install SoloCrawl is pipx. It puts the solocrawl and solocrawl-mcp commands on your PATH in their own isolated environment, so you can run them from anywhere without juggling a virtualenv:

git clone https://github.com/hlavacm/solocrawl.git
pipx install ./solocrawl     # or an absolute path: pipx install /path/to/solocrawl

If SoloCrawl is already published on PyPI, pipx install solocrawl is enough and no checkout is needed.

That's it. Now run the three core commands from any directory:

# Scrape a page to markdown
solocrawl scrape https://example.com

# Federated web search (Wikipedia + DuckDuckGo + StackExchange by default)
solocrawl search "python asyncio semaphore" --limit 5

# Live package version lookup
solocrawl package requests --ecosystem pypi

Updating

# Installed from PyPI:
pipx upgrade solocrawl

# Installed from a local checkout: pull the latest changes, then reinstall
cd /path/to/solocrawl && git pull && pipx install --force .   # alias: pipx reinstall solocrawl

After upgrading, restart your MCP client (LM Studio, Cursor, Claude Desktop, …) so it picks up the new solocrawl-mcp binary. Your mcp.json needs no changes as long as it points at solocrawl-mcp on your PATH.

What you can do

🔎 Search the web

One query, many sources, a single merged ranking, with no single provider deciding everything for you.

solocrawl search "python asyncio semaphore" --limit 5

# Pick exactly which sources to hit, and get machine-readable output
solocrawl search "django orm" --sources wikipedia,stackexchange --json
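The merge step can be pictured with a small Reciprocal Rank Fusion sketch. This is not SoloCrawl's actual implementation: the function names, the URL normalisation rule, and the constant k=60 (a commonly used RRF default) are all assumptions made for illustration.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def normalize_url(url: str) -> str:
    """Illustrative de-duplication key: lowercase scheme/host, no fragment, no trailing slash."""
    parts = urlsplit(url)
    key = f"{parts.scheme.lower()}://{parts.netloc.lower()}{parts.path.rstrip('/')}"
    return f"{key}?{parts.query}" if parts.query else key


def rrf_merge(rankings: dict[str, list[str]], k: int = 60, limit: int = 5) -> list[tuple[str, float]]:
    """Fuse per-provider rankings: each URL scores sum(1 / (k + rank)) over providers."""
    scores: dict[str, float] = defaultdict(float)
    for urls in rankings.values():
        seen: set[str] = set()
        rank = 0
        for url in urls:
            key = normalize_url(url)
            if key in seen:  # a provider listing the same page twice counts once
                continue
            seen.add(key)
            rank += 1
            scores[key] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for a stable order.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ordered[:limit]


if __name__ == "__main__":
    merged = rrf_merge(
        {
            "wikipedia": ["https://a.example/x", "https://b.example/"],
            "duckduckgo": ["https://B.example", "https://a.example/x/"],
            "stackexchange": ["https://b.example/#top", "https://c.example"],
        }
    )
    for url, score in merged:
        print(f"{score:.4f}  {url}")
```

A URL that several providers rank highly rises to the top even if no single provider puts it first, which is the property the merged ranking relies on.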

📄 Scrape a page to clean markdown

Turn any URL into LLM-ready markdown with page metadata (title, author, date, …) as front-matter.

solocrawl scrape https://example.com

# Save to a file, or force the browser for a JS-rendered page
solocrawl scrape https://example.com --out page.md
solocrawl scrape https://example.com --force-browser
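To make "metadata as front-matter" concrete, here is a minimal sketch of how extracted metadata and a markdown body could be combined. The exact layout SoloCrawl emits is not documented here, so the YAML-style `---` delimiters, the field names, and the helper name with_front_matter are assumptions.

```python
import json
from typing import Optional


def with_front_matter(metadata: dict[str, Optional[str]], body: str) -> str:
    """Prefix a markdown body with a YAML-style front-matter block (hypothetical layout)."""
    lines = ["---"]
    for key, value in metadata.items():
        if value:  # skip fields the extractor could not find
            # json.dumps yields a double-quoted string, which is also a valid YAML scalar.
            lines.append(f"{key}: {json.dumps(value, ensure_ascii=False)}")
    lines.append("---")
    return "\n".join(lines) + "\n\n" + body.strip() + "\n"


if __name__ == "__main__":
    print(
        with_front_matter(
            {"title": "Example Domain", "author": None, "url": "https://example.com"},
            "# Example Domain\n\nBody.",
        )
    )
```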

📦 Look up package versions

The current version, and the one matching your constraint, straight from the official registry.

solocrawl package react --ecosystem npm --constraint ">=18,<19"
solocrawl package monolog/monolog --ecosystem packagist --json
solocrawl package some-lib --ecosystem pypi --allow-prerelease
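As a rough illustration of what "the one matching your constraint" means, the sketch below picks the highest stable version that satisfies a comma-separated constraint such as ">=18,<19". It is deliberately simplified and is not SoloCrawl's resolver: real ecosystems have their own rules (semver ranges, PEP 440, and so on), pre-release ordering is omitted, and the version list is made-up sample data.

```python
import re
from typing import Optional

_OPS = {
    ">=": lambda a, b: a >= b,
    "<=": lambda a, b: a <= b,
    "==": lambda a, b: a == b,
    ">": lambda a, b: a > b,
    "<": lambda a, b: a < b,
}
_CLAUSE = re.compile(r"^(>=|<=|==|>|<)\s*(\d+(?:\.\d+)*)$")
_STABLE = re.compile(r"^\d+(?:\.\d+)*$")


def parse(version: str, width: int = 4) -> tuple[int, ...]:
    """Turn '18.2' into (18, 2, 0, 0) so short and long versions compare sensibly."""
    parts = [int(part) for part in version.split(".")]
    return tuple(parts + [0] * (width - len(parts)))


def satisfies(version: str, constraint: str) -> bool:
    """True when the version meets every comma-separated clause."""
    for clause in constraint.split(","):
        match = _CLAUSE.match(clause.strip())
        if match is None:
            raise ValueError(f"unsupported clause: {clause!r}")
        op, bound = match.groups()
        if not _OPS[op](parse(version), parse(bound)):
            return False
    return True


def pick_version(versions: list[str], constraint: Optional[str] = None) -> Optional[str]:
    """Highest stable version, optionally restricted by a constraint."""
    stable = [v for v in versions if _STABLE.match(v)]  # drops e.g. '19.1.0-rc.1'
    matching = [v for v in stable if constraint is None or satisfies(v, constraint)]
    return max(matching, key=parse, default=None)


if __name__ == "__main__":
    sample = ["17.0.2", "18.2.0", "18.3.1", "19.0.0", "19.1.0-rc.1"]  # made-up data
    print(pick_version(sample))               # latest stable
    print(pick_version(sample, ">=18,<19"))   # latest within the constraint
```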

🧪 Research in one shot

The classic LLM workflow: search, scrape the top hits, and get back one aggregated, cited report.

solocrawl research "python asyncio semaphore" --depth 3

🗂️ Batch-scrape many URLs

Fetch a whole list at once under the same bounded concurrency; --out-dir writes one file per URL.

solocrawl batch https://example.com https://www.python.org --out-dir /tmp/scrape
solocrawl batch --from-file urls.txt --out-dir /tmp/scrape

…and see what's available

# List every registered provider (search + package), default vs. opt-in
solocrawl providers

🤖 Use it with your local LLM (MCP)

This is where SoloCrawl really shines: give your local model (LM Studio, OpenCode, Claude Desktop, …) the ability to search, scrape, and check versions. The pipx install from the Quick start already put solocrawl-mcp on your PATH, so all that's left is pointing your MCP client at it.

LM Studio / Claude Desktop: a ready-to-use config lives at examples/mcp.json. Drop it into your client's MCP settings (mcp.json):

{
  "mcpServers": {
    "solocrawl": {
      "command": "solocrawl-mcp",
      "args": [],
      "env": {
        "SOLOCRAWL_LOG_LEVEL": "INFO",
        "SOLOCRAWL_LOG_FILE": "~/.local/state/solocrawl/mcp.log"
      }
    }
  }
}

OpenCode uses a different config format. Copy examples/opencode.jsonc into ~/.config/opencode/opencode.jsonc (global) or opencode.jsonc in your project root. OpenCode expects type: "local", command as an array, and environment instead of env.

If your MCP client doesn't inherit your shell PATH, replace "solocrawl-mcp" with the full path printed by which solocrawl-mcp (typically ~/.local/bin/solocrawl-mcp after pipx install). Logs go to stderr (visible in LM Studio Developer Logs) and optionally to SOLOCRAWL_LOG_FILE.

The server exposes five tools:

  • web_search(query, limit=5, sources=None): federated search across enabled providers

  • scrape(url): fetch and extract markdown (with page metadata) from a URL

  • research(query, depth=3): search, scrape the top results, and return an aggregated cited report

  • package_version(name, ecosystem, constraint=None, allow_prerelease=False): live registry lookup

  • list_providers(provider_type="all"): list registered search/package providers (default vs. opt-in)

To check the active version and command path:

pipx list | grep solocrawl
which solocrawl-mcp

Working from a local clone? A pipx-installed solocrawl-mcp is a snapshot: editing the repo does not update the command on your PATH, so your MCP client keeps running the old code. After changing the source, refresh it with pipx install --force . (or install once with pipx install --editable . so future edits are picked up automatically).

🐍 Use it from Python

import asyncio

from solocrawl.config import load_config
from solocrawl.core.search import federated_search, select_providers
from solocrawl.core.search.providers import duckduckgo, stackexchange, wikipedia  # noqa: F401

async def main() -> None:
    providers = select_providers(load_config())
    results = await federated_search(providers, "asyncio python", limit=3)
    for result in results:
        print(result.title, result.url)

asyncio.run(main())

See examples/library_search.py for a runnable example.

Search providers

Default (zero-config, always enabled):

| Provider | Source |
| --- | --- |
| wikipedia | MediaWiki API |
| duckduckgo | ddgs package |
| stackexchange | Stack Exchange API (Stack Overflow) |

Opt-in (enable with SOLOCRAWL_ENABLE_PROVIDERS):

| Provider | Source |
| --- | --- |
| wikidata | Wikidata entity search |
| hackernews | Hacker News (Algolia) |
| arxiv | arXiv Atom API |
| pubmed | PubMed/NCBI E-utilities |
| github | GitHub repository search |
| mdn | MDN Web Docs search |
| reddit | Reddit post search (search.json) |
| searxng | Self-hosted SearXNG (set SOLOCRAWL_SEARXNG_URL) |

SOLOCRAWL_ENABLE_PROVIDERS=arxiv,hackernews solocrawl search "transformer attention" --limit 6
SOLOCRAWL_ENABLE_PROVIDERS=github,mdn solocrawl search "fetch api" --limit 6

Package registries

Default ecosystems: PyPI, npm, Packagist, crates.io, NuGet, Maven Central, RubyGems, Go modules, pub.dev, Swift. Versions are always fetched live from official registries; SoloCrawl does not maintain its own version database. Swift packages have no central registry, so versions come from the repository's git tags (owner/repo on GitHub).

solocrawl package serde --ecosystem crates
solocrawl package Newtonsoft.Json --ecosystem nuget
solocrawl package org.junit.jupiter:junit-jupiter --ecosystem maven
solocrawl package github.com/gorilla/mux --ecosystem go
solocrawl package apple/swift-argument-parser --ecosystem swift

Optional extras

# Browser fallback for JS-heavy pages (Playwright)
pip install -e ".[browser]"
playwright install chromium

solocrawl scrape https://example.com --force-browser

# Install everything
pip install -e ".[all]"

⚙️ Configuration

All defaults work with no configuration. Everything below is optional and uses the SOLOCRAWL_ prefix. For local development, copy .env.dist to .env and uncomment what you need. SoloCrawl loads .env automatically via python-dotenv, and existing shell environment variables take precedence.

| Variable | Default | Purpose |
| --- | --- | --- |
| SOLOCRAWL_ENABLE_PROVIDERS | (empty) | Comma-separated opt-in provider names |
| SOLOCRAWL_SEARXNG_URL | (empty) | Base URL of a self-hosted SearXNG instance (enables the searxng provider) |
| SOLOCRAWL_RESPECT_ROBOTS | true | Honour robots.txt on scrape (fail-open); set false to skip |
| SOLOCRAWL_CACHE_TTL_SECONDS | 0 | In-memory fetch cache TTL in seconds (0 = disabled) |
| SOLOCRAWL_MAX_CONCURRENCY | 10 | Global fetch concurrency limit |
| SOLOCRAWL_PER_DOMAIN_LIMIT | 2 | Per-domain concurrency limit |
| SOLOCRAWL_TIMEOUT_SECONDS | 30 | Per-request timeout in seconds |
| SOLOCRAWL_MAX_RETRIES | 3 | Retries on network errors / rate limits |
| SOLOCRAWL_MAX_RESPONSE_BYTES | 10485760 | Cap on fetched response body size (10 MiB); larger bodies are truncated |
| SOLOCRAWL_PROXY_ENABLED | false | Enable optional proxy layer |
| SOLOCRAWL_PROXY_MODE | list | Proxy mode: list (rotate a pool) or endpoint (single rotating endpoint) |
| SOLOCRAWL_PROXY_LIST | (empty) | Comma-separated proxy URLs |
| SOLOCRAWL_PROXY_ENDPOINT | (empty) | Single rotating proxy endpoint |
| SOLOCRAWL_PROXY_USERNAME | (empty) | Proxy auth username |
| SOLOCRAWL_PROXY_PASSWORD | (empty) | Proxy auth password |
| SOLOCRAWL_ALLOW_INTERNAL_URLS | false | Allow scraping localhost/private IPs (dev only) |
| SOLOCRAWL_USER_AGENT | (SoloCrawl default) | Override HTTP User-Agent for API requests |
| SOLOCRAWL_BROWSER_ALLOWED | true | Allow Playwright fallback when installed |
| SOLOCRAWL_LOG_LEVEL | WARNING | Log level: DEBUG, INFO, WARNING, ERROR |
| SOLOCRAWL_LOG_FILE | (empty) | Optional log file path (also logs to stderr) |
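As a rough picture of how SOLOCRAWL_-prefixed variables could map onto typed settings, the sketch below reads a handful of them with the documented defaults. It is not the project's load_config: the Settings class, the load_settings function, and the accepted spellings of boolean values are assumptions for illustration.

```python
import os
from collections.abc import Mapping
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Settings:  # hypothetical subset of the documented variables
    enable_providers: tuple[str, ...]
    respect_robots: bool
    max_concurrency: int
    per_domain_limit: int
    timeout_seconds: float


def _flag(raw: str) -> bool:
    """Assumed boolean spellings; the real parser may accept a different set."""
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read SOLOCRAWL_* values from env (defaults to os.environ)."""
    source = os.environ if env is None else env

    def get(name: str, default: str) -> str:
        return source.get(f"SOLOCRAWL_{name}", default)

    names = get("ENABLE_PROVIDERS", "")
    return Settings(
        enable_providers=tuple(n.strip() for n in names.split(",") if n.strip()),
        respect_robots=_flag(get("RESPECT_ROBOTS", "true")),
        max_concurrency=int(get("MAX_CONCURRENCY", "10")),
        per_domain_limit=int(get("PER_DOMAIN_LIMIT", "2")),
        timeout_seconds=float(get("TIMEOUT_SECONDS", "30")),
    )


if __name__ == "__main__":
    print(load_settings({"SOLOCRAWL_ENABLE_PROVIDERS": "arxiv, hackernews"}))
```

Passing the environment as a mapping keeps the function easy to test without touching the real process environment.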

🔒 Security note on URL fetching

By default SoloCrawl refuses to fetch localhost, link-local, private, reserved, and cloud-metadata addresses. It checks literal hosts, DNS-resolved A/AAAA records, HTTP redirect targets, and Playwright's final browser URL. SoloCrawl is still a single-user local tool, not a hostile-multi-tenant proxy, so do not expose it to untrusted network callers. SOLOCRAWL_ALLOW_INTERNAL_URLS=true disables these internal-target checks entirely (intended for trusted local development only).
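The address checks described above can be approximated with the standard ipaddress module. This is a simplified sketch under stated assumptions, not SoloCrawl's guard: the function names are hypothetical, DNS resolution is left to the caller, and redirect or browser-URL re-checks are not shown.

```python
import ipaddress


def is_blocked_ip(raw: str) -> bool:
    """True for loopback, private, link-local, reserved, multicast, or unspecified addresses."""
    ip = ipaddress.ip_address(raw)
    # Unwrap IPv4-mapped IPv6 (e.g. ::ffff:127.0.0.1) so it cannot dodge the IPv4 checks.
    if isinstance(ip, ipaddress.IPv6Address) and ip.ipv4_mapped is not None:
        ip = ip.ipv4_mapped
    return (
        ip.is_loopback
        or ip.is_private
        or ip.is_link_local  # covers 169.254.169.254, a common cloud-metadata address
        or ip.is_reserved
        or ip.is_multicast
        or ip.is_unspecified
    )


def is_allowed_target(host: str, resolved: list[str]) -> bool:
    """host is the URL hostname; resolved holds the A/AAAA answers the caller looked up."""
    if host.lower().rstrip(".") == "localhost":
        return False
    # Reject when resolution produced nothing, or when any answer is internal.
    return bool(resolved) and not any(is_blocked_ip(addr) for addr in resolved)


if __name__ == "__main__":
    for addr in ["127.0.0.1", "10.0.0.5", "169.254.169.254", "::1", "8.8.8.8"]:
        print(addr, "blocked" if is_blocked_ip(addr) else "allowed")
```

Checking every resolved address, not just the first, matters because a hostname can return a public and a private record side by side.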

🧩 Extending it

The whole point of the plugin layout is that adding a source is a single self-registering file, and the core never changes. To add a search provider:

  1. Create src/solocrawl/core/search/providers/myprovider.py implementing SearchProvider.

  2. Register with @register("myprovider", zero_config=True) or as opt-in.

  3. Import the module in src/solocrawl/core/search/providers/__init__.py so registration runs.

  4. Add fixture-based tests in tests/.

The same pattern applies to package providers under src/solocrawl/core/packages/providers/.

Development

Work from a checkout in a virtualenv with an editable install. This also drops the solocrawl and solocrawl-mcp scripts into .venv/bin/:

python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

Then run the quality gate:

ruff check . && ruff format --check .
pyright
pytest

# …or all in one line:
ruff check . && ruff format --check . && pyright && pytest

Ethics and terms of use

SoloCrawl is built for individual developers and local LLM tooling. It respects the robots.txt and terms of service of target sites: scrape consults robots.txt and refuses disallowed URLs by default (fail-open on errors; opt out with SOLOCRAWL_RESPECT_ROBOTS=false). The proxy and scraping features are not intended to bypass site rules, captchas, or anti-bot systems. Use responsibly and stay within legitimate access patterns.
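"Fail-open" here means that a robots.txt that cannot be fetched does not block the scrape, while one that is fetched and disallows the URL does. A minimal sketch using the standard library's robot parser, assuming a hypothetical helper name and a placeholder user-agent string:

```python
from typing import Optional
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: Optional[str], url: str, user_agent: str = "SoloCrawl") -> bool:
    """robots_txt is the fetched file body, or None when fetching it failed."""
    if robots_txt is None:
        return True  # fail-open: an unreachable robots.txt does not block the request
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = "User-agent: *\nDisallow: /private/\n"
    print(allowed_by_robots(rules, "https://example.com/public"))
    print(allowed_by_robots(rules, "https://example.com/private/page"))
    print(allowed_by_robots(None, "https://example.com/private/page"))
```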

License

MIT. See LICENSE.
