What can you do with this server?

The scrapedatshi-mcp server enables AI-powered web scraping, crawling, data extraction, and RAG pipeline orchestration directly from Claude Desktop, with no code required.

Scraping & Chunking

scrape_url: Scrape a single URL and return RAG-ready text chunks (with optional JS rendering and RAG 2.0 contextual enrichment).
chunk_file: Upload and chunk local files (PDF, MD, TXT, YAML, JSON) into structured text segments.

Crawling

crawl_site: Crawl an entire website via sitemap or spider mode, returning chunks from all pages with auto-batching for large sites (200+ pages).

Structured Data Extraction

extract_data: Use an LLM to extract specific schema fields (e.g. product name, price, stock) from a single URL.
extract_crawl: Crawl a site and extract structured fields from every page.

Vector DB / RAG Pipelines

sync_to_vectordb: Scrape a URL, embed chunks, and inject into a vector DB in one call.
ingest_file: Upload a local file, embed its chunks, and inject into a vector DB.
autorag: Full AutoRAG pipeline: crawl an entire domain, chunk, embed, and inject all content into your vector DB with automatic batching.
inspect_vectordb: Read vector DB metadata (dimensions, vector count, suggested models).
query_vectordb: Semantically query your vector DB for the most relevant chunks.

Supported embedding providers: OpenAI, Cohere, Gemini, Mistral, Voyage AI, Ollama.
Supported vector databases: Pinecone, Qdrant, ChromaDB, Supabase, Weaviate, MongoDB, Azure Cosmos DB, LanceDB.

Provider & Key Management

verify_provider_key: Validate LLM/embedding API keys and retrieve live model lists.
list_embedding_providers: List all supported embedding providers with notes.
list_vector_db_providers: List all supported vector DBs with required config fields.

Guidance

get_usage_guide: Access a guided workflow wizard to select the right tool and follow the correct pre-flight sequence.

API keys are configured securely via environment variables in the Claude Desktop config, not entered in chat.

Which integrations are available for this server?

Google Gemini: provides both LLM and embedding capabilities, supporting models like gemini-1.5-flash and text-embedding-004.
MongoDB Atlas: allows syncing scraped data to MongoDB Atlas as a vector database for RAG pipelines.
Ollama: provides local embedding models for vector generation, such as nomic-embed-text and mxbai-embed-large.
OpenAI: provides LLM and embedding capabilities, supporting models like gpt-4o-mini and text-embedding-3-small.
Supabase: allows syncing scraped data to Supabase (pgvector) as a vector database.

How do I use scrapedatshi-mcp?

1. Click on "Install Server".
2. Wait a few minutes for the server to deploy. Once ready, it will show a "Started" state.
3. In the chat, type @ followed by the MCP server name and your instructions, e.g., "@scrapedatshi-mcp Scrape https://docs.example.com/getting-started and show me the chunks"

That's it! The server will respond to your query, and you can continue using it as needed.

scrapedatshi-mcp

by mxchris18

Python

Remote

scrapedatshi-mcp

MCP (Model Context Protocol) server for the scrapedatshi RAG pipeline API.

Use scrapedatshi's scraping, crawling, extraction, and vector DB sync tools directly from Claude Desktop — no code required.

What you can do

Just talk to Claude naturally:

"Scrape https://docs.example.com and give me the chunks"
"Chunk this PDF URL: https://my-bucket.s3.amazonaws.com/report.pdf" — PDF URLs are automatically detected and extracted
"Crawl https://example.com/products and extract the title and price from every page"
"Sync https://docs.example.com to my Pinecone index using OpenAI embeddings"
"Crawl the entire docs.stripe.com site (all 800 pages) and inject it into my Pinecone index" — large sites are auto-batched server-side, no manual pagination needed
"What embedding providers does scrapedatshi support?"
"Inspect my Pinecone index and tell me what embedding model was used"
"Query my Pinecone index for information about API authentication"
"Ingest all the JSON files in my ./scrapy_output/ folder into my Pinecone index"

Tools exposed

| Tool | What it does |
| --- | --- |
| `verify_provider_key` | Verify an LLM or embedding API key + get live model list |
| `get_usage_guide` | Returns the guided wizard flow and tool selection reference |
| `scrape_url` | Scrape & chunk a single URL into RAG-ready text segments |
| `chunk_file` | Upload a local file (PDF, MD, TXT, CSV, XLSX, DOCX, IPYNB, HTML, XML, code files, etc.) and chunk it into RAG-ready segments |
| `crawl_site` | Crawl an entire site (sitemap or spider mode) and return all chunks |
| `extract_data` | Extract structured schema fields from a URL using your LLM |
| `extract_crawl` | Multi-page schema extraction via site crawl |
| `sync_to_vectordb` | Full pipeline: scrape URL → embed → inject into your vector DB |
| `ingest_file` | Full pipeline: upload local file → embed → inject into your vector DB |
| `ingest_scraped` | Full pipeline: bulk-ingest a folder of pre-scraped files → embed → inject into your vector DB |
| `autorag` | Full pipeline: crawl entire site → chunk → embed → inject into your vector DB (large sites auto-batched) |
| `inspect_vectordb` | Read vector DB metadata: dimension, vector count, suggested embedding models (free) |
| `query_vectordb` | Semantic search: embed a query and retrieve the most relevant chunks from your vector DB |
| `rag_chat` | RAG Chat: retrieve top-N chunks from your vector DB and generate a grounded LLM answer |
| `list_embedding_providers` | Discover supported embedding providers + model notes |
| `list_vector_db_providers` | Discover supported vector DBs + required config fields |

Prerequisites

scrapedatshi account — Sign up at scrapedatshi.com
Add credits — Billing portal
Get your API key — starts with sds_...
Claude Desktop — Download here
Python 3.10+ — python.org

Installation

Option A — Install from PyPI (recommended, works with `uvx`)

pip install scrapedatshi-mcp

Or use uv for isolated installs:

uv tool install scrapedatshi-mcp

Option B — Install from source (local development)

git clone https://github.com/scrapedatshi/scrapedatshi-mcp.git
cd scrapedatshi-mcp
pip install -e .

Claude Desktop configuration

Easiest way to find your config file: Open Claude Desktop → Settings → Developer → Edit Config

Alternatively, the file is located at:

macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
Windows: %APPDATA%\Claude\claude_desktop_config.json
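
If you prefer to locate the file from a script, the two paths above can be resolved roughly as follows. This is a minimal sketch: the function name `claude_config_path` is hypothetical and not part of scrapedatshi-mcp, and only the two documented platforms are handled.

```python
import os
import sys
from pathlib import Path
from typing import Mapping


def claude_config_path(platform: str = sys.platform,
                       env: Mapping[str, str] = os.environ) -> Path:
    """Return the documented claude_desktop_config.json location for a platform."""
    if platform == "darwin":  # macOS
        return (Path.home() / "Library" / "Application Support"
                / "Claude" / "claude_desktop_config.json")
    if platform.startswith("win"):  # Windows: %APPDATA%\Claude\...
        return Path(env["APPDATA"]) / "Claude" / "claude_desktop_config.json"
    # The README documents no path for other platforms, so fail loudly.
    raise NotImplementedError(f"No documented config path for {platform!r}")


# Show the macOS location explicitly, whatever machine this runs on.
print(claude_config_path("darwin"))
```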

Recommended — `uvx` with all provider SDKs (auto-updates on restart)

{
  "mcpServers": {
    "scrapedatshi": {
      "command": "uvx",
      "args": [
        "--from", "scrapedatshi-mcp[all]",
        "--refresh",
        "scrapedatshi-mcp"
      ],
      "env": {
        "SCRAPEDATSHI_API_KEY": "sds_your_key_here"
      }
    }
  }
}

[all] installs all provider SDKs (OpenAI, Anthropic, Gemini, Voyage AI) so verify_provider_key works for any provider
--refresh checks PyPI for updates every time Claude Desktop starts — no manual reinstalls needed

If installed via pip (using `python`)

{
  "mcpServers": {
    "scrapedatshi": {
      "command": "python",
      "args": ["-m", "scrapedatshi_mcp.server"],
      "env": {
        "SCRAPEDATSHI_API_KEY": "sds_your_key_here"
      }
    }
  }
}

If cloned from source (absolute path)

{
  "mcpServers": {
    "scrapedatshi": {
      "command": "python",
      "args": ["/absolute/path/to/scrapedatshi-mcp/scrapedatshi_mcp/server.py"],
      "env": {
        "SCRAPEDATSHI_API_KEY": "sds_your_key_here"
      }
    }
  }
}

Restart Claude Desktop after saving the config.
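
If your config already lists other MCP servers, the scrapedatshi entry should be merged in rather than pasted over the whole file. The sketch below shows one way to do that with plain dictionaries; `with_scrapedatshi_server` is a hypothetical helper, and the entry it writes mirrors the recommended `uvx` configuration above.

```python
import copy
import json


def with_scrapedatshi_server(config: dict, api_key: str) -> dict:
    """Return a copy of config with the scrapedatshi entry added or replaced."""
    merged = copy.deepcopy(config)  # leave the caller's dict untouched
    servers = merged.setdefault("mcpServers", {})
    servers["scrapedatshi"] = {
        "command": "uvx",
        "args": ["--from", "scrapedatshi-mcp[all]", "--refresh", "scrapedatshi-mcp"],
        "env": {"SCRAPEDATSHI_API_KEY": api_key},
    }
    return merged


# A made-up existing config with one unrelated server.
existing = {"mcpServers": {"other-server": {"command": "node", "args": ["server.js"]}}}
print(json.dumps(with_scrapedatshi_server(existing, "sds_your_key_here"), indent=2))
```

To apply it to the real file, load the file with `json.loads`, pass the result through the helper, and write it back with `json.dumps`. Note that this replaces any existing scrapedatshi entry, including provider keys already stored in its env block.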

Secure key configuration (BYOK)

You bring your own LLM, embedding, and vector DB keys. The server resolves keys in this priority order:

1. Argument passed in the tool call (explicit override)
2. Environment variable in the MCP config (preferred secure path; keys never appear in chat)
3. If neither is found, the server returns a clear error message
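
The priority order can be pictured as a small lookup function. This is an illustrative sketch of the described behaviour, not the server's actual code; `resolve_key` and its error text are assumptions.

```python
import os
from typing import Mapping, Optional


def resolve_key(name: str, explicit: Optional[str] = None,
                env: Mapping[str, str] = os.environ) -> str:
    """Resolve a provider key: tool-call argument first, then environment."""
    if explicit:  # 1. argument passed in the tool call
        return explicit
    value = env.get(name)
    if value:  # 2. environment variable from the MCP config
        return value
    # 3. neither source had the key
    raise ValueError(
        f"{name} not found: pass it in the tool call or add it to the "
        "env block in claude_desktop_config.json"
    )


# Made-up environment to show the fallback without touching os.environ.
fake_env = {"OPENAI_API_KEY": "sk-from-env"}
print(resolve_key("OPENAI_API_KEY", env=fake_env))
print(resolve_key("OPENAI_API_KEY", explicit="sk-from-call", env=fake_env))
```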

Add your provider keys to the env block in claude_desktop_config.json:

{
  "mcpServers": {
    "scrapedatshi": {
      "command": "uvx",
      "args": [
        "--from", "scrapedatshi-mcp[all]",
        "--refresh",
        "scrapedatshi-mcp"
      ],
      "env": {
        "SCRAPEDATSHI_API_KEY": "sds_your_key_here",

        "OPENAI_API_KEY": "sk-...",
        "ANTHROPIC_API_KEY": "sk-ant-...",
        "GEMINI_API_KEY": "AIza...",

        "COHERE_API_KEY": "...",
        "MISTRAL_API_KEY": "...",
        "VOYAGE_API_KEY": "...",

        "PINECONE_API_KEY": "pc-...",
        "QDRANT_API_KEY": "...",
        "WEAVIATE_API_KEY": "..."
      }
    }
  }
}

Once set, Claude will automatically use these keys without asking you to type them in chat.

Fetch Mode

Starting in v0.5.0, the MCP server uses local-fetch mode by default: URLs are fetched on the machine running Claude Desktop (from your IP address), and only the HTML processing runs on our server. This is cheaper, and the sites you scrape see your IP address rather than our server's.

`SCRAPEDATSHI_FETCH_MODE=local` (default)

The MCP server fetches URLs using the machine's own IP address, then submits the raw HTML to our server for processing.

✅ Your IP is used — not our server's
✅ Billed at the standard per-URL rate ($0.0020)
✅ Faster — no double-hop latency

`SCRAPEDATSHI_FETCH_MODE=server`

Our server fetches the URL. Use this if Claude Desktop is running in a restricted environment without outbound HTTP access, or if you need server-managed IP rotation.

⚠️ Our server's IP is used
⚠️ Billed at 2× the standard rate ($0.0040 / URL)
✅ Works from restricted environments

To enable server fetch, add SCRAPEDATSHI_FETCH_MODE to your MCP config:

{
  "mcpServers": {
    "scrapedatshi": {
      "command": "uvx",
      "args": ["--from", "scrapedatshi-mcp[all]", "--refresh", "scrapedatshi-mcp"],
      "env": {
        "SCRAPEDATSHI_API_KEY": "sds_your_key_here",
        "SCRAPEDATSHI_FETCH_MODE": "server"
      }
    }
  }
}

Supported environment variables

| Variable | Used for |
| --- | --- |
| `SCRAPEDATSHI_API_KEY` | scrapedatshi API key (required) |
| `SCRAPEDATSHI_FETCH_MODE` | `local` (default) or `server`; see Fetch Mode above |
| `OPENAI_API_KEY` | OpenAI LLM + embedding |
| `ANTHROPIC_API_KEY` | Anthropic LLM (Claude) |
| `GEMINI_API_KEY` | Google Gemini LLM + embedding |
| `COHERE_API_KEY` | Cohere embedding |
| `MISTRAL_API_KEY` | Mistral embedding |
| `VOYAGE_API_KEY` | Voyage AI embedding |
| `PINECONE_API_KEY` | Pinecone vector DB |
| `QDRANT_API_KEY` | Qdrant vector DB (optional for local) |
| `WEAVIATE_API_KEY` | Weaviate vector DB (optional for local) |

Authenticated Scraping (v0.5.1+)

For pages behind a login wall, you can pass your session cookies and/or custom headers to scrape_url and crawl_site. Credentials are only sent to URLs within the permitted domain scope — they are never leaked to external domains.

Scrape a login-walled page

Just tell Claude:

"Scrape https://internal.company.com/wiki/api-docs — use my session cookie: abc123"

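Claude will call scrape_url with arguments like these (the headers field is optional and appears here only to show where a bearer token would go):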

{
  "url": "https://internal.company.com/wiki/api-docs",
  "cookies": {"session": "abc123"},
  "headers": {"Authorization": "Bearer eyJ..."}
}

Authenticated crawl with subdomain scope

"Crawl https://company.com including wiki.company.com and docs.company.com — use session cookie abc123"

Claude will call crawl_site with:

{
  "url": "https://company.com",
  "cookies": {"session": "abc123"},
  "allow_subdomains": true,
  "max_pages": 20
}

Security model:

Cookies and headers are only sent to URLs within the permitted domain scope — never to external domains discovered during crawling
allow_subdomains: false (default): only the exact hostname receives credentials
allow_subdomains: true: credentials are shared with subdomains of the root domain (e.g. wiki.company.com when root is company.com). Multi-part TLDs (.co.uk, .com.br) are handled safely.
Credentials are never forwarded to the scrapedatshi server — they stay on the machine running Claude Desktop
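
The scoping rule can be approximated with a hostname comparison. The sketch below is an assumption about how such a check might look, not the server's implementation: `may_send_credentials` is a hypothetical name, and it treats the hostname of the starting URL as the root domain instead of consulting a public suffix list, so it does not reproduce the multi-part TLD handling mentioned above.

```python
from urllib.parse import urlsplit


def may_send_credentials(root_url: str, target_url: str,
                         allow_subdomains: bool = False) -> bool:
    """Decide whether cookies/headers for root_url may be sent to target_url."""
    root_host = (urlsplit(root_url).hostname or "").lower()
    target_host = (urlsplit(target_url).hostname or "").lower()
    if not root_host or not target_host:
        return False  # unparseable URL: never attach credentials
    if target_host == root_host:
        return True  # exact hostname always qualifies
    # Subdomains qualify only when explicitly allowed. The leading dot stops
    # look-alike hosts such as "evilcompany.com" from matching.
    return allow_subdomains and target_host.endswith("." + root_host)


root = "https://company.com"
print(may_send_credentials(root, "https://wiki.company.com/page"))
print(may_send_credentials(root, "https://wiki.company.com/page", allow_subdomains=True))
print(may_send_credentials(root, "https://evilcompany.com/", allow_subdomains=True))
```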

Enterprise SSO / MFA — Session Capture (v0.6.4+)

For enterprise portals protected by Okta, Duo, or any SSO/MFA flow that blocks automated login, use the SDK's capture_session() utility to authenticate manually in a real browser, then pass the captured session state to Claude via the storage_state parameter.

Step 1 — Capture the session locally (run once):

pip install scrapedatshi[auth]
playwright install chromium

from scrapedatshi.auth import capture_session

# Opens a real browser window; log in manually, then press Enter to save.
state = capture_session(
    "https://internal.company.com/login",
    save_to="session.auth.json",   # gitignored automatically
)

This opens a real browser window. Log in manually (including any MFA prompts), then press Enter. The session state is saved to session.auth.json.

Step 2 — Tell Claude to use the saved session:

"Crawl https://internal.company.com using the session state in session.auth.json"

Claude will call crawl_site with the storage_state parameter containing the captured session.

⚠ Security: session.auth.json contains live authentication tokens. Never commit it to version control. The SDK's .gitignore template automatically filters *.auth.json files.
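
Captured sessions eventually expire, so it can help to check the file before starting a long crawl. The sketch below assumes session.auth.json follows Playwright's storage-state layout (a top-level "cookies" list whose entries carry "name" and "expires", with -1 marking session cookies); the README does not confirm the file format, and `expired_cookie_names` is a hypothetical helper.

```python
import time


def expired_cookie_names(state: dict, now: float) -> list[str]:
    """List cookies whose expiry timestamp (Unix seconds) is at or before now."""
    expired = []
    for cookie in state.get("cookies", []):
        expires = cookie.get("expires", -1)
        if expires != -1 and expires <= now:  # -1 means "session cookie"
            expired.append(cookie["name"])
    return expired


# Made-up state for illustration. For a real file, load it first:
#   state = json.loads(Path("session.auth.json").read_text())
sample_state = {
    "cookies": [
        {"name": "session", "domain": "internal.company.com", "expires": 1_000.0},
        {"name": "csrf", "domain": "internal.company.com", "expires": -1},
    ],
    "origins": [],
}
print(expired_cookie_names(sample_state, now=time.time()))
```

If the list is non-empty, re-run the capture step before crawling.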

Example conversations

Scrape a single page

You: Scrape https://docs.example.com/getting-started and show me the chunks.

Claude calls scrape_url and returns the chunked content with token counts and credit usage.

Crawl a documentation site

You: Crawl https://docs.example.com — just the first 5 pages.

Claude calls crawl_site with max_pages=5 and returns all chunks from all pages.

Extract structured data from a product page

You: Extract the product name, price, and whether it's in stock from https://example.com/products/widget-pro

Claude calls extract_data with a schema it constructs from your request, using your OpenAI key from the env config.

Extract data from an entire product catalogue

You: Crawl https://example.com/products and extract the title and price from every product page. Limit to 10 pages.

Claude calls extract_crawl with max_pages=10 and returns per-page extraction results.

Sync a page to your vector DB

You: Sync https://docs.example.com to my Pinecone index. The index host is https://my-index-abc123.svc.pinecone.io. Use OpenAI text-embedding-3-small.

Claude calls sync_to_vectordb. If OPENAI_API_KEY and PINECONE_API_KEY are set in your env config, no keys need to be typed in chat.

Discover what's supported

You: What embedding providers does scrapedatshi support?

Claude calls list_embedding_providers and returns a formatted list with model notes.

You: What fields do I need to configure for Qdrant?

Claude calls list_vector_db_providers and returns the required and optional fields for each provider.

Supported providers

Embedding providers

| Key | Provider |
| --- | --- |
| `openai` | OpenAI (text-embedding-3-small, text-embedding-3-large, text-embedding-ada-002) |
| `cohere` | Cohere (embed-english-v3.0, embed-multilingual-v3.0) |
| `gemini` | Google Gemini (text-embedding-004, gemini-embedding-001) |
| `mistral` | Mistral (mistral-embed) |
| `voyage` | Voyage AI (voyage-3, voyage-3-lite, voyage-code-3) |
| `ollama` | Ollama local (nomic-embed-text, mxbai-embed-large, etc.) |

Vector databases

| Key | Provider |
| --- | --- |
| `pinecone` | Pinecone |
| `qdrant` | Qdrant |
| `chroma` | ChromaDB (local) |
| `supabase` | Supabase (pgvector) |
| `weaviate` | Weaviate |
| `mongodb` | MongoDB Atlas |
| `azure_cosmos` | Azure Cosmos DB (NoSQL) |
| `azure_cosmos_mongo` | Azure Cosmos DB (MongoDB API) |
| `lancedb` | LanceDB (local) |

LLM providers (for extraction + contextual retrieval)

| Key | Provider |
| --- | --- |
| `openai` | OpenAI (gpt-4o-mini, gpt-4o, etc.) |
| `anthropic` | Anthropic (claude-3-haiku, claude-3-5-sonnet, etc.) |
| `gemini` | Google Gemini (gemini-1.5-flash, gemini-1.5-pro, etc.) |

Billing

Credits are deducted from your scrapedatshi account after each successful API call
Failed requests are not charged
Every tool response includes credits_used and credits_remaining
LLM, embedding, and vector DB costs are billed directly by your chosen providers — scrapedatshi only charges for scraping and orchestration
Top up at scrapedatshi.com/portal/billing

Per-URL rates

| Mode | Rate | When |
| --- | --- | --- |
| Local fetch (default) | $0.0020 / URL | `SCRAPEDATSHI_FETCH_MODE=local` (default) |
| Server fetch | $0.0040 / URL | `SCRAPEDATSHI_FETCH_MODE=server` |
| Spider crawl (server) | $0.0050 / URL | `/v1/spider` (server-side link-following) |
| Chunk fee | $0.0005 / chunk | All routes |
| Injection fee | $0.0030 / chunk | `sync_to_vectordb`, `ingest_file`, `autorag` |
| Contextual Retrieval | $0.0010 / chunk | When `contextual_retrieval=true` |
| Vector query | $0.0002 / chunk | `query_vectordb`, `rag_chat` |
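
For a rough budget before a large job, the rates can be combined in a few lines. This sketch assumes the per-URL and per-chunk fees simply add up, which the README does not state explicitly; `estimate_cost` is a hypothetical helper, and the chunk count in the example is made up (the real number depends on page length).

```python
from decimal import Decimal

# Rates copied from the table above, in US dollars.
URL_RATES = {
    "local": Decimal("0.0020"),
    "server": Decimal("0.0040"),
    "spider": Decimal("0.0050"),
}
CHUNK_FEE = Decimal("0.0005")
INJECTION_FEE = Decimal("0.0030")
CONTEXTUAL_FEE = Decimal("0.0010")


def estimate_cost(urls: int, chunks: int, mode: str = "local",
                  inject: bool = False, contextual: bool = False) -> Decimal:
    """Estimate scrapedatshi charges, assuming the listed fees are additive."""
    per_chunk = CHUNK_FEE
    if inject:
        per_chunk += INJECTION_FEE
    if contextual:
        per_chunk += CONTEXTUAL_FEE
    return urls * URL_RATES[mode] + chunks * per_chunk


# 600 pages fetched locally, an assumed 3000 chunks, injected into a vector DB.
print(estimate_cost(600, 3000, inject=True))  # prints 11.7000
```

Embedding, LLM, and vector DB charges from your own providers are not included.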

Auto-Batching for Large Sites

When you ask Claude to crawl a large site (more than 200 pages), the autorag and crawl_site tools automatically split the job into sequential batches server-side. You don't need to do anything special — just ask Claude to crawl the site and it handles the rest.

You: Crawl the entire docs.stripe.com site and inject everything into my Pinecone index.

Claude calls autorag with a high max_pages value. If the site has 600 pages, the server processes it as 3 batches of 200 pages each and returns the combined result.

The response will include auto_batched: true and batches_processed: N when batching occurred.
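
The batch arithmetic is easy to reproduce when you want to predict `batches_processed`. The sketch below assumes batches are filled to 200 pages with the remainder in the last one; batching itself happens server-side, and `plan_batches` is only an illustration.

```python
BATCH_SIZE = 200  # pages per batch, as described above


def plan_batches(total_pages: int, batch_size: int = BATCH_SIZE) -> list[int]:
    """Split a page count into sequential batch sizes."""
    return [min(batch_size, total_pages - start)
            for start in range(0, total_pages, batch_size)]


print(plan_batches(600))  # [200, 200, 200]
print(plan_batches(450))  # [200, 200, 50]
```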

Safety limits

To prevent runaway credit usage and client timeouts:

crawl_site: defaults to 10 pages, maximum 200 per batch (auto-batched for larger jobs)
autorag: defaults to 5 pages, no hard upper limit — large jobs are auto-batched
extract_crawl: defaults to 5 pages, maximum 50 per call

Claude will always confirm page limits with you before calling multi-page tools.
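
If you script around these tools, a client-side guard can apply the same defaults and caps before a billed call is made. This is a sketch under assumptions: whether the server rejects or clamps an oversized `max_pages` is not documented here, so the hypothetical `resolve_max_pages` simply raises.

```python
from typing import Optional

# (default pages, hard maximum per call); None means no hard upper limit.
# crawl_site jobs above 200 pages are auto-batched server-side, so no cap here.
PAGE_LIMITS = {
    "crawl_site": (10, None),
    "autorag": (5, None),
    "extract_crawl": (5, 50),
}


def resolve_max_pages(tool: str, requested: Optional[int] = None) -> int:
    """Apply the documented default, and reject values above a hard maximum."""
    default, hard_max = PAGE_LIMITS[tool]
    if requested is None:
        return default
    if requested < 1:
        raise ValueError("max_pages must be at least 1")
    if hard_max is not None and requested > hard_max:
        raise ValueError(f"{tool} accepts at most {hard_max} pages per call")
    return requested


print(resolve_max_pages("crawl_site"))         # 10
print(resolve_max_pages("extract_crawl", 50))  # 50
```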

Troubleshooting

Contextual Retrieval fails — "model no longer available"

LLM providers periodically deprecate older models. If you see an error like "This model is no longer available", run verify_provider_key again to get the current list of available models for your key, then select a current model.

Current recommended models for contextual retrieval:

Gemini: gemini-2.5-flash or gemini-2.0-flash-001 (not gemini-2.0-flash — deprecated)
OpenAI: any current gpt-4o or gpt-4.1 series model
Anthropic: any current claude-3-5 or claude-3-7 series model

Provider model & deprecation pages:

OpenAI: platform.openai.com/docs/deprecations
Anthropic: docs.anthropic.com/en/docs/about-claude/models
Google Gemini: ai.google.dev/gemini-api/docs/models
Cohere: docs.cohere.com/docs/models
Mistral: docs.mistral.ai/getting-started/models
Voyage AI: docs.voyageai.com/docs/embeddings

Contextual Retrieval fails — "quota exceeded"

Your LLM provider API key has no remaining credits. Add credits at your provider's billing page. Note that scrapedatshi credits are separate from your LLM provider credits — you need both.

`verify_provider_key` returns no models

If key verification succeeds but returns an empty model list, your API key may be restricted to specific model families or your account may have limited access. Check your provider's dashboard for account restrictions.

Claude Desktop doesn't show scrapedatshi tools

Make sure you saved claude_desktop_config.json correctly (valid JSON, no trailing commas)
Fully quit and reopen Claude Desktop — a simple window close is not enough
Check that uvx is installed: run uvx --version in your terminal
If using --refresh, the first startup may take a few seconds to download the package
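
The first and third checks can be automated. The sketch below parses the config text with Python's `json` module (which rejects trailing commas) and looks for `uvx` on the PATH; `check_config` is a hypothetical helper, and the "scrapedatshi" entry name matches the examples in this README.

```python
import json
import shutil


def check_config(text: str) -> list[str]:
    """Return a list of problems found in the config text (empty if none)."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON at line {exc.lineno}, column {exc.colno}: {exc.msg}"]
    if not isinstance(config, dict):
        return ["top-level JSON value must be an object"]
    entry = config.get("mcpServers", {}).get("scrapedatshi")
    if entry is None:
        return ['no "scrapedatshi" entry under "mcpServers"']
    problems = []
    if not entry.get("env", {}).get("SCRAPEDATSHI_API_KEY"):
        problems.append("SCRAPEDATSHI_API_KEY is missing from the env block")
    if entry.get("command") == "uvx" and shutil.which("uvx") is None:
        problems.append("uvx is not on PATH")
    return problems


# A trailing comma after the last entry is a common cause of silent failures.
broken = '{"mcpServers": {"scrapedatshi": {"command": "python"},}}'
print(check_config(broken))
```

To check the real file, pass in its contents, for example `check_config(Path(config_file).read_text())`.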

License

MIT — see LICENSE
