by eitan3

scrapy-mcp

A headless web-scraping MCP server built on Scrapy. It exposes Scrapy's scraping primitives — polite fetching, CSS/XPath extraction, link and table extraction, sitemap and robots.txt reading, and bounded asynchronous crawls — as MCP tools an agent can call over stdio.

  • Headless, no rendering. Pages are fetched and parsed as HTML; no browser, no JavaScript execution. This keeps the footprint tiny — it runs comfortably on weak machines.

  • Reactor-safe. Every operation runs in a short-lived Scrapy subprocess, so Twisted's reactor never lives inside the asyncio MCP server (no ReactorNotRestartable), and memory is reclaimed after each call.

  • Polite by default. Obeys robots.txt, throttles with AutoThrottle, and enforces hard page/depth caps so a crawl can't run away.
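The reactor-safe point above can be illustrated with a minimal sketch. This is not the project's actual code: `run_isolated` and `CHILD_CODE` are hypothetical names, and the child program here only echoes JSON where a real worker would run Scrapy. It shows the general pattern of launching a fresh interpreter per call from asyncio, so nothing from the child (such as a Twisted reactor) survives in the parent process.

```python
import asyncio
import json
import sys

# Stand-in for a child program that would run Scrapy (assumption: the real
# worker is more involved). It only echoes its argument back as JSON.
CHILD_CODE = "import json, sys; print(json.dumps({'url': sys.argv[1], 'ok': True}))"


async def run_isolated(url: str, timeout: float = 60.0) -> dict:
    """Run one job in a fresh interpreter and parse its JSON output."""
    proc = await asyncio.create_subprocess_exec(
        sys.executable, "-c", CHILD_CODE, url,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        # Wall-clock cap for the whole call.
        stdout, _ = await asyncio.wait_for(proc.communicate(), timeout)
    except asyncio.TimeoutError:
        proc.kill()
        await proc.wait()
        raise
    if proc.returncode != 0:
        raise RuntimeError(f"worker exited with code {proc.returncode}")
    return json.loads(stdout)


result = asyncio.run(run_isolated("https://example.com/"))
print(result)
```

Because the child exits after every call, its memory goes back to the operating system, and the parent never has to restart a reactor.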

Install / run

Run straight from PyPI with uv — no install step:

```shell
uvx scrapy-mcp
```

Or install it:

```shell
uv pip install scrapy-mcp
scrapy-mcp
```

The server speaks MCP over stdio. Point any MCP client at it. For Claude Desktop, add to claude_desktop_config.json:

```json
{
  "mcpServers": {
    "scrapy": {
      "command": "uvx",
      "args": ["scrapy-mcp"]
    }
  }
}
```

Tools

| Tool | What it does |
| --- | --- |
| fetch_page(url, format, max_bytes, obey_robots) | Fetch one page as markdown (default), text, or html. |
| extract(url, selectors, obey_robots) | Pull structured fields with CSS/XPath selectors. |
| extract_tables(url, max_tables, obey_robots) | Extract every HTML <table> as {headers, rows}. |
| extract_links(url, same_domain, pattern, limit, obey_robots) | List de-duplicated links on a page. |
| get_sitemap(url, limit, obey_robots) | Read a sitemap (gzip + sitemap-index aware). |
| check_robots(url, user_agent) | Is a URL crawlable? Returns the crawl-delay and sitemaps. |
| start_crawl(start_url, allow_patterns, deny_patterns, max_pages, max_depth, same_domain, selectors, ...) | Start a bounded BFS crawl; returns a job_id. |
| crawl_status(job_id) | State + pages scraped for a crawl. |
| crawl_results(job_id, cursor, limit) | Page through a crawl's scraped items. |
| cancel_crawl(job_id) | Stop a running crawl; keep results so far. |
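To make the kind of answer check_robots gives more concrete, here is a small offline sketch built on Python's standard `urllib.robotparser`. It is an analogy, not the server's implementation: `check_robots_local`, the sample robots.txt, and the result keys (`allowed`, `crawl_delay`, `sitemaps`) are all assumptions chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt content (made up for this example).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
Sitemap: https://example.com/sitemap.xml
"""


def check_robots_local(robots_txt: str, url: str, user_agent: str = "*") -> dict:
    """Parse robots.txt text and report crawlability for one URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        "allowed": parser.can_fetch(user_agent, url),
        "crawl_delay": parser.crawl_delay(user_agent),
        # site_maps() returns None when no Sitemap lines exist (Python 3.8+).
        "sitemaps": parser.site_maps() or [],
    }


report = check_robots_local(ROBOTS_TXT, "https://example.com/private/page")
print(report)
```

The real tool fetches robots.txt over the network; this sketch parses a string so it can run without network access.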

Selector format (extract / start_crawl)

selectors maps an output field to a selector. Each value is either a CSS string (first match) or an object for more control:

```json
{
  "title": "h1::text",
  "price": "span.price::text",
  "all_links": {"css": "a::attr(href)", "all": true},
  "first_heading": {"xpath": "//h1/text()"}
}
```

"all": true returns every match as a list; otherwise the first match is returned.
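The two selector forms described above can be normalized into one shape before a client sends them. The following is a hedged sketch of client-side validation only: `normalize_selector` is a hypothetical helper, and the rule that an object must carry exactly one of `css` or `xpath` is an assumption, not documented server behavior.

```python
from typing import Any, Dict, Union

SelectorSpec = Union[str, Dict[str, Any]]


def normalize_selector(spec: SelectorSpec) -> Dict[str, Any]:
    """Expand either selector form into {"kind", "query", "all"}."""
    if isinstance(spec, str):
        # A bare string is a CSS selector that returns the first match.
        return {"kind": "css", "query": spec, "all": False}
    if not isinstance(spec, dict):
        raise TypeError("selector must be a string or an object")
    kinds = [k for k in ("css", "xpath") if k in spec]
    if len(kinds) != 1:
        # Assumption: exactly one of the two keys is expected.
        raise ValueError("selector object needs exactly one of 'css' or 'xpath'")
    kind = kinds[0]
    return {"kind": kind, "query": spec[kind], "all": bool(spec.get("all", False))}


selectors = {
    "title": "h1::text",
    "price": "span.price::text",
    "all_links": {"css": "a::attr(href)", "all": True},
    "first_heading": {"xpath": "//h1/text()"},
}
normalized = {field: normalize_selector(spec) for field, spec in selectors.items()}
print(normalized)
```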

Crawls are asynchronous

start_crawl returns immediately with a job_id. The crawl runs as a detached worker that streams results to disk, so it survives a server restart. Poll crawl_status(job_id) until the crawl finishes, then read items with crawl_results(job_id); crawl_results is also safe to call mid-crawl for partial results. Jobs are stored under the system temp dir (SCRAPY_MCP_JOB_DIR) and deleted after 7 days (SCRAPY_MCP_JOB_TTL_DAYS).
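The poll-then-page flow described above might look like the sketch below from a client's point of view. The response shapes are assumptions, since this README does not document them: a `state` field with terminal values `finished`, `cancelled`, and `failed`, and a results page shaped as `{"items": [...], "next_cursor": ...}`. `call_tool` stands in for whatever function your MCP client uses to invoke a tool, and `fake_call_tool` exists only so the example runs offline.

```python
import time
from typing import Any, Callable, Dict, List

ToolCaller = Callable[[str, Dict[str, Any]], Dict[str, Any]]

# Assumption: these state names are illustrative, not documented.
TERMINAL_STATES = {"finished", "cancelled", "failed"}


def wait_and_collect(call_tool: ToolCaller, job_id: str,
                     poll_seconds: float = 2.0, page_size: int = 100) -> List[dict]:
    """Poll until the crawl stops, then read every page of results."""
    while call_tool("crawl_status", {"job_id": job_id}).get("state") not in TERMINAL_STATES:
        time.sleep(poll_seconds)
    items: List[dict] = []
    cursor: Any = 0
    while True:
        page = call_tool("crawl_results",
                         {"job_id": job_id, "cursor": cursor, "limit": page_size})
        items.extend(page.get("items", []))
        cursor = page.get("next_cursor")  # assumed field name
        if cursor is None:
            return items


# Offline stand-in for a real MCP client call.
FAKE_ITEMS = [{"url": f"https://example.com/{i}"} for i in range(3)]
fake_state = {"polls": 0}


def fake_call_tool(name: str, args: Dict[str, Any]) -> Dict[str, Any]:
    if name == "crawl_status":
        fake_state["polls"] += 1
        return {"state": "finished" if fake_state["polls"] >= 2 else "running"}
    start, end = args["cursor"], args["cursor"] + args["limit"]
    return {"items": FAKE_ITEMS[start:end],
            "next_cursor": end if end < len(FAKE_ITEMS) else None}


collected = wait_and_collect(fake_call_tool, "job-1", poll_seconds=0.0, page_size=2)
print(len(collected))
```

Check the actual tool responses before relying on any of these field names.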

Configuration

All settings are optional environment variables; the defaults are polite and tuned for a low-resource host. Set them in the environment that launches the server to tune a uvx scrapy-mcp deployment.

| Variable | Default | Meaning |
| --- | --- | --- |
| SCRAPY_MCP_USER_AGENT | scrapy-mcp/<version> … | User-Agent header. |
| SCRAPY_MCP_OBEY_ROBOTS | true | Obey robots.txt. |
| SCRAPY_MCP_DOWNLOAD_DELAY | 0.5 | Seconds between requests to a host. |
| SCRAPY_MCP_CONCURRENT_REQUESTS | 8 | Global concurrency. |
| SCRAPY_MCP_CONCURRENT_REQUESTS_PER_DOMAIN | 4 | Per-host concurrency. |
| SCRAPY_MCP_DOWNLOAD_TIMEOUT | 30 | Per-request timeout (s). |
| SCRAPY_MCP_RETRY_TIMES | 2 | Retries on transient failures. |
| SCRAPY_MCP_AUTOTHROTTLE | true | Adapt delay to server latency. |
| SCRAPY_MCP_MAX_BYTES | 50000 | Max characters returned per page (then truncated). |
| SCRAPY_MCP_REQUEST_TIMEOUT | 60 | Wall-clock cap for a blocking single fetch (s). |
| SCRAPY_MCP_DEFAULT_MAX_PAGES / _MAX_PAGES_CAP | 50 / 1000 | Crawl page default / hard cap. |
| SCRAPY_MCP_DEFAULT_MAX_DEPTH / _MAX_DEPTH_CAP | 2 / 10 | Crawl depth default / hard cap. |
| SCRAPY_MCP_JOB_DIR | <tmp>/scrapy_mcp_jobs | Where crawl jobs are stored. |
| SCRAPY_MCP_JOB_TTL_DAYS | 7 | Delete crawl jobs older than this (0 disables). |
| SCRAPY_MCP_LOG_LEVEL | ERROR | Scrapy log level (to stderr). |
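One way to apply the variables above is to pass them through the MCP client's server entry. The sketch below generates a Claude Desktop style config with an `env` block. Two assumptions apply: that your client forwards `env` to the launched process (Claude Desktop does, but other clients may differ), and that booleans are written in lowercase to match the defaults in the table. `build_client_config` is a hypothetical helper and the chosen values are only examples.

```python
import json
from typing import Any, Dict


def _to_env_value(value: Any) -> str:
    """Environment values are strings; write booleans as true/false."""
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def build_client_config(env_overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Build an mcpServers entry that launches scrapy-mcp with overrides."""
    unknown = [k for k in env_overrides if not k.startswith("SCRAPY_MCP_")]
    if unknown:
        raise ValueError(f"unexpected variable names: {unknown}")
    return {
        "mcpServers": {
            "scrapy": {
                "command": "uvx",
                "args": ["scrapy-mcp"],
                "env": {k: _to_env_value(v) for k, v in env_overrides.items()},
            }
        }
    }


config = build_client_config({
    "SCRAPY_MCP_DOWNLOAD_DELAY": 1.0,      # slower than the 0.5 default
    "SCRAPY_MCP_DEFAULT_MAX_PAGES": 20,    # smaller crawls by default
    "SCRAPY_MCP_OBEY_ROBOTS": True,
})
print(json.dumps(config, indent=2))
```

The printed JSON can be merged into claude_desktop_config.json by hand.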

Development

```shell
uv venv
uv pip install -e ".[dev]"
uv run pytest          # unit tests (no network)
uv build               # build wheel + sdist into dist/
```

License

MIT © Eitan Hadar
