Website Scraper MCP Server
Click on "Install Server".
Wait a few minutes for the server to deploy. Once ready, it will show a "Started" state.
In the chat, type
@followed by the MCP server name and your instructions, e.g., "@Website Scraper MCP Serverscrape and index https://example.com into AI Search"
That's it! The server will respond to your query, and you can continue using it as needed.
Here is a step-by-step guide with screenshots.
Website Scraper MCP Server
A production-ready MCP (Model Context Protocol) server that allows any MCP-compatible AI agent to scrape websites, crawl internal pages, clean content, chunk it, and index everything into Azure AI Search — all through a clean, typed tool interface.
Table of Contents
Related MCP server: Crawl4AI+SearXNG MCP Server
Architecture
website_scraper_mcp/
├── app.py ← Entry point (stdio / SSE transport)
├── server.py ← MCP server + tool dispatcher
├── config.py ← Pydantic Settings (env vars)
├── models.py ← Input/Output Pydantic models
└── tools/
├── scrape.py ← Tool 1 – static/dynamic detection + scraping
├── crawl.py ← Tool 2 – BFS crawler, robots.txt aware
├── clean.py ← Tool 3 – Trafilatura + BS4 content cleaning
├── chunk.py ← Tool 4 – sliding window chunking
└── azure_ai_search.py ← Tools 5 & 7 – index + searchTools
# | Tool | Description |
1 |
| Detect static/dynamic, scrape title/content/links |
2 |
| BFS crawl with depth limit + robots.txt |
3 |
| Strip noise HTML, return readable text |
4 |
| Sliding window chunks (~1 000 chars, 200 overlap) |
5 |
| Upload chunks to Azure AI Search |
6 |
| End-to-end pipeline: crawl → clean → chunk → index |
7 |
| Full-text search on the Azure AI Search index |
Installation
Prerequisites
Python 3.11+
Azure AI Search service (free tier works for testing)
Steps
# 1. Clone the repo
git clone https://github.com/your-org/website-scraper-mcp.git
cd website-scraper-mcp
# 2. Create and activate a virtual environment
python -m venv .venv
# Windows
.venv\Scripts\activate
# Linux/Mac
source .venv/bin/activate
# 3. Install dependencies
pip install -r requirements.txt
# 4. Install Playwright browsers (Chromium)
playwright install chromium
# 5. Copy and fill environment variables
cp .env.example .env
# Edit .env with your Azure credentialsRunning Locally
stdio mode (default — for MCP clients / AI agents)
python -m website_scraper_mcp.app
# or
python -m website_scraper_mcp.app --transport stdioSSE mode (HTTP endpoint for browser-based / HTTP clients)
python -m website_scraper_mcp.app --transport sse --port 8000
# Server available at http://localhost:8000/sseRunning with Docker
# Build and start in SSE mode
docker compose up --build
# Stop
docker compose downThe container exposes port 8000 for SSE transport.
Environment Variables
Variable | Default | Description |
| (required) | Azure AI Search service URL |
| (required) | Admin API key |
|
| Target index name |
|
| Playwright page load timeout (ms) |
|
| Run Chromium headless |
|
| Maximum crawl depth |
|
| Hard cap on pages per crawl |
|
| Polite delay between requests |
|
| Characters per chunk |
|
| Overlap between consecutive chunks |
|
| Python logging level |
Sample MCP Client
Run the included example after starting the server in SSE mode:
python examples/mcp_client_example.pyOr configure it in your MCP-compatible agent (e.g. Claude Desktop mcp_config.json):
{
"mcpServers": {
"website-scraper": {
"command": "python",
"args": ["-m", "website_scraper_mcp.app", "--transport", "stdio"],
"cwd": "/path/to/website-scraper-mcp",
"env": {
"AZURE_SEARCH_ENDPOINT": "https://your-service.search.windows.net",
"AZURE_SEARCH_KEY": "your-key",
"AZURE_SEARCH_INDEX_NAME": "website-content"
}
}
}
}Example API Requests
Via MCP client (Python SDK)
import asyncio
from mcp import ClientSession
from mcp.client.sse import sse_client
async def demo():
async with sse_client("http://localhost:8000/sse") as (read, write):
async with ClientSession(read, write) as session:
await session.initialize()
# Scrape a single page
result = await session.call_tool("scrape_website", {"url": "https://example.com"})
print(result)
# Full pipeline
result = await session.call_tool("index_website", {
"url": "https://example.com",
"max_depth": 2
})
print(result)
# Search
result = await session.call_tool("search_index", {
"query": "What services does the company provide?",
"top": 5
})
print(result)
asyncio.run(demo())Tool input/output examples
scrape_website
// Input
{"url": "https://example.com"}
// Output
{
"title": "Example Domain",
"url": "https://example.com",
"content": "This domain is for use in illustrative examples...",
"links": ["https://www.iana.org/domains/example"],
"is_dynamic": false,
"metadata": {"description": "..."}
}index_website
// Input
{"url": "https://example.com", "max_depth": 2}
// Output
{
"url": "https://example.com",
"pages_crawled": 4,
"total_chunks": 38,
"indexed_documents": 38,
"failed_documents": 0,
"status": "success",
"errors": []
}search_index
// Input
{"query": "What services does the company provide?", "top": 5}
// Output
{
"query": "What services does the company provide?",
"total_results": 3,
"hits": [
{
"id": "abc123",
"url": "https://example.com/services",
"title": "Our Services",
"content": "We provide cloud, AI, and data services...",
"chunk_number": 0,
"score": 9.8
}
]
}Error Handling
The server handles all errors gracefully and returns structured JSON error responses:
{
"error": "HTTP 404 when fetching https://example.com/missing",
"tool": "scrape_website"
}Handled errors include: invalid URLs, HTTP 4xx/5xx, timeouts, Playwright failures, Azure Search quota errors, network issues, and duplicate document IDs.
License
MIT
Maintenance
Resources
Unclaimed servers have limited discoverability.
Looking for Admin?
If you are the server author, to access and configure the admin panel.
Latest Blog Posts
MCP directory API
We provide all the information about MCP servers via our MCP API.
curl -X GET 'https://glama.ai/api/mcp/v1/servers/lalit9168/web-scrapping'
If you have feedback or need assistance with the MCP directory API, please join our Discord server