scrapedatshi-mcp
MCP (Model Context Protocol) server for the scrapedatshi RAG pipeline API.
Use scrapedatshi's scraping, crawling, extraction, and vector DB sync tools directly from Claude Desktop — no code required.
What you can do
Just talk to Claude naturally:
"Scrape https://docs.example.com and give me the chunks"
"Crawl https://example.com/products and extract the title and price from every page"
"Sync https://docs.example.com to my Pinecone index using OpenAI embeddings"
"What embedding providers does scrapedatshi support?"
Tools exposed
| Tool | What it does |
|---|---|
| scrape_url | Scrape & chunk a single URL into RAG-ready text segments |
| crawl_site | Crawl an entire site (sitemap or spider mode) and return all chunks |
| extract_data | Extract structured schema fields from a URL using your LLM |
| extract_crawl | Multi-page schema extraction via site crawl |
| sync_to_vectordb | Full pipeline: scrape → embed → inject into your vector DB |
| list_embedding_providers | Discover supported embedding providers + model notes |
| list_vector_db_providers | Discover supported vector DBs + required config fields |
Prerequisites
scrapedatshi account — Sign up at scrapedatshi.com
Add credits — Billing portal
Get your API key (starts with sds_...)
Claude Desktop: Download here
Python 3.10+ — python.org
Installation
Option A — Install from PyPI (recommended, works with uvx)
pip install scrapedatshi-mcp
Or use uv for isolated installs:
uv tool install scrapedatshi-mcp
Option B — Install from source (local development)
git clone https://github.com/mxchris18/scrapedatshi-mcp.git
cd scrapedatshi-mcp
pip install -e .
Claude Desktop configuration
Open your Claude Desktop config file:
macOS:
~/Library/Application Support/Claude/claude_desktop_config.json
Windows:
%APPDATA%\Claude\claude_desktop_config.json
If installed via PyPI / pip (using uvx)
{
  "mcpServers": {
    "scrapedatshi": {
      "command": "uvx",
      "args": ["scrapedatshi-mcp"],
      "env": {
        "SCRAPEDATSHI_API_KEY": "sds_your_key_here"
      }
    }
  }
}
If installed via pip (using python)
{
  "mcpServers": {
    "scrapedatshi": {
      "command": "python",
      "args": ["-m", "scrapedatshi_mcp.server"],
      "env": {
        "SCRAPEDATSHI_API_KEY": "sds_your_key_here"
      }
    }
  }
}
If cloned from source (absolute path)
{
  "mcpServers": {
    "scrapedatshi": {
      "command": "python",
      "args": ["/absolute/path/to/scrapedatshi-mcp/scrapedatshi_mcp/server.py"],
      "env": {
        "SCRAPEDATSHI_API_KEY": "sds_your_key_here"
      }
    }
  }
}
Restart Claude Desktop after saving the config.
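If you would rather not edit the file by hand, the merge can be scripted. The following is a minimal sketch and not part of scrapedatshi-mcp: the helper names `config_path`, `add_server`, and `install` are hypothetical, it assumes the uvx variant of the entry shown above, and it assumes any existing config file contains valid JSON. Other servers already listed in the file are preserved.

```python
import json
import os
import sys
from pathlib import Path


def config_path() -> Path:
    """Return the Claude Desktop config location for macOS or Windows."""
    if sys.platform == "darwin":
        return Path.home() / "Library/Application Support/Claude/claude_desktop_config.json"
    if sys.platform == "win32":
        return Path(os.environ["APPDATA"]) / "Claude" / "claude_desktop_config.json"
    raise RuntimeError("Only the macOS and Windows paths are documented here")


def add_server(config: dict, api_key: str) -> dict:
    """Return a copy of config with the scrapedatshi entry added or replaced."""
    servers = dict(config.get("mcpServers", {}))
    servers["scrapedatshi"] = {
        "command": "uvx",
        "args": ["scrapedatshi-mcp"],
        "env": {"SCRAPEDATSHI_API_KEY": api_key},
    }
    return {**config, "mcpServers": servers}


def install(api_key: str) -> None:
    """Read the config file (if any), merge the entry, and write it back."""
    path = config_path()
    existing = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(add_server(existing, api_key), indent=2), encoding="utf-8")
```

Calling `install("sds_your_key_here")` and then restarting Claude Desktop should have the same effect as the manual edit.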
Secure key configuration (BYOK)
You bring your own LLM, embedding, and vector DB keys. The server resolves keys in this priority order:
1. Argument passed in the tool call (explicit override)
2. Environment variable in the MCP config (preferred secure path; keys never appear in chat)
3. If neither is found, the server returns a clear error message
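The priority order above can be sketched as a small lookup function. This is an illustration of the described behaviour, not the server's actual code: the function name `resolve_key`, its parameters, and the error wording are all assumptions.

```python
import os
from typing import Mapping, Optional


def resolve_key(
    env_var: str,
    argument: Optional[str] = None,
    environ: Mapping[str, str] = os.environ,
) -> str:
    """Resolve a provider key: tool-call argument, then environment, else error."""
    if argument:  # 1. explicit override passed in the tool call
        return argument
    value = environ.get(env_var)
    if value:  # 2. environment variable from the MCP config
        return value
    # 3. neither source had a key, so fail with an actionable message
    raise ValueError(
        f"No key found: pass it as a tool argument or set {env_var} "
        "in the env block of claude_desktop_config.json"
    )
```

Passing the environment as a parameter keeps the function easy to test; in normal use the default `os.environ` applies.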
Add your provider keys to the env block in claude_desktop_config.json:
{
  "mcpServers": {
    "scrapedatshi": {
      "command": "uvx",
      "args": ["scrapedatshi-mcp"],
      "env": {
        "SCRAPEDATSHI_API_KEY": "sds_your_key_here",
        "OPENAI_API_KEY": "sk-...",
        "ANTHROPIC_API_KEY": "sk-ant-...",
        "GEMINI_API_KEY": "AIza...",
        "COHERE_API_KEY": "...",
        "MISTRAL_API_KEY": "...",
        "VOYAGE_API_KEY": "...",
        "PINECONE_API_KEY": "pc-...",
        "QDRANT_API_KEY": "...",
        "WEAVIATE_API_KEY": "..."
      }
    }
  }
}
Once set, Claude will automatically use these keys without asking you to type them in chat.
Supported environment variables
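| Variable | Used for |
|---|---|
| SCRAPEDATSHI_API_KEY | scrapedatshi API key (required) |
| OPENAI_API_KEY | OpenAI LLM + embedding |
| ANTHROPIC_API_KEY | Anthropic LLM (Claude) |
| GEMINI_API_KEY | Google Gemini LLM + embedding |
| COHERE_API_KEY | Cohere embedding |
| MISTRAL_API_KEY | Mistral embedding |
| VOYAGE_API_KEY | Voyage AI embedding |
| PINECONE_API_KEY | Pinecone vector DB |
| QDRANT_API_KEY | Qdrant vector DB (optional for local) |
| WEAVIATE_API_KEY | Weaviate vector DB (optional for local) |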
Example conversations
Scrape a single page
You: Scrape https://docs.example.com/getting-started and show me the chunks.
Claude calls scrape_url and returns the chunked content with token counts and credit usage.
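Under the hood, a tool call like this travels as a JSON-RPC 2.0 request using the MCP `tools/call` method. The sketch below builds such a message to show its general shape. The helper `tool_call_request` is hypothetical, and the `url` argument name is an assumption; the server's published tool schema is the authority on argument names.

```python
import json


def tool_call_request(request_id: int, name: str, arguments: dict) -> str:
    """Build an MCP tools/call request as a JSON-RPC 2.0 message."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    })


# The "url" argument name is an assumption made for illustration.
message = tool_call_request(
    1, "scrape_url", {"url": "https://docs.example.com/getting-started"}
)
```

Claude Desktop constructs and sends these messages for you, so you never need to write them by hand.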
Crawl a documentation site
You: Crawl https://docs.example.com — just the first 5 pages.
Claude calls crawl_site with max_pages=5 and returns all chunks from all pages.
Extract structured data from a product page
You: Extract the product name, price, and whether it's in stock from https://example.com/products/widget-pro
Claude calls extract_data with a schema it constructs from your request, using your OpenAI key from the env config.
Extract data from an entire product catalogue
You: Crawl https://example.com/products and extract the title and price from every product page. Limit to 10 pages.
Claude calls extract_crawl with max_pages=10 and returns per-page extraction results.
Sync a page to your vector DB
You: Sync https://docs.example.com to my Pinecone index. The index host is https://my-index-abc123.svc.pinecone.io. Use OpenAI text-embedding-3-small.
Claude calls sync_to_vectordb. If OPENAI_API_KEY and PINECONE_API_KEY are set in your env config, no keys need to be typed in chat.
Discover what's supported
You: What embedding providers does scrapedatshi support?
Claude calls list_embedding_providers and returns a formatted list with model notes.
You: What fields do I need to configure for Qdrant?
Claude calls list_vector_db_providers and returns the required and optional fields for each provider.
Supported providers
Embedding providers
| Key | Provider |
|---|---|
| | OpenAI (text-embedding-3-small, text-embedding-3-large, ada-002) |
| | Cohere (embed-english-v3.0, embed-multilingual-v3.0) |
| | Google Gemini (text-embedding-004, gemini-embedding-001) |
| | Mistral (mistral-embed) |
| | Voyage AI (voyage-3, voyage-3-lite, voyage-code-3) |
| | Ollama local (nomic-embed-text, mxbai-embed-large, etc.) |
Vector databases
| Key | Provider |
|---|---|
| | Pinecone |
| | Qdrant |
| | ChromaDB (local) |
| | Supabase (pgvector) |
| | Weaviate |
| | MongoDB Atlas |
| | Azure Cosmos DB (NoSQL) |
| | Azure Cosmos DB (MongoDB API) |
| | LanceDB (local) |
LLM providers (for extraction + contextual retrieval)
| Key | Provider |
|---|---|
| | OpenAI (gpt-4o-mini, gpt-4o, etc.) |
| | Anthropic (claude-3-haiku, claude-3-5-sonnet, etc.) |
| | Google Gemini (gemini-1.5-flash, gemini-1.5-pro, etc.) |
Billing
Credits are deducted from your scrapedatshi account after each successful API call
Failed requests are not charged
Every tool response includes credits_used and credits_remaining
LLM, embedding, and vector DB costs are billed directly by your chosen providers; scrapedatshi only charges for scraping and orchestration
Top up at scrapedatshi.com/portal/billing
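Because every tool response reports credits_used and credits_remaining, spend across a session can be tallied with a few lines. This is a rough sketch that assumes each response is available as a dict carrying those two fields at the top level; the actual response layout may differ, and `summarize_credits` is a hypothetical helper.

```python
from typing import Iterable, Optional, Tuple


def summarize_credits(responses: Iterable[dict]) -> Tuple[float, Optional[float]]:
    """Return (total credits used, credits remaining after the last response)."""
    total = 0
    remaining = None
    for response in responses:
        total += response["credits_used"]
        # The most recent response carries the current balance.
        remaining = response["credits_remaining"]
    return total, remaining
```

With no responses, the function returns a total of 0 and a remaining balance of None, since no balance has been reported yet.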
Safety limits
To prevent runaway credit usage and client timeouts:
crawl_site: defaults to 10 pages, maximum 200
extract_crawl: defaults to 5 pages, maximum 50 per call
Claude will always confirm page limits with you before calling multi-page tools.
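The limits above can be pictured as a default plus a ceiling per tool. The sketch below is illustrative only: the names `LIMITS` and `effective_max_pages` are hypothetical, and clamping an oversized request to the maximum is an assumption, since the source does not say whether the server clamps or rejects such requests.

```python
from typing import Optional

# (default pages, maximum pages) per multi-page tool, as stated above
LIMITS = {
    "crawl_site": (10, 200),
    "extract_crawl": (5, 50),
}


def effective_max_pages(tool: str, requested: Optional[int] = None) -> int:
    """Return the page limit to apply for a multi-page tool call."""
    default, maximum = LIMITS[tool]
    if requested is None:
        return default  # nothing requested: fall back to the default
    if requested < 1:
        raise ValueError("max_pages must be at least 1")
    # Assumption: oversized requests are clamped rather than rejected.
    return min(requested, maximum)
```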
License
MIT — see LICENSE