Website Scraper MCP Server
Server Configuration
Environment variables used to configure the server. Only AZURE_SEARCH_ENDPOINT and AZURE_SEARCH_KEY are required; the others fall back to the defaults shown.
| Name | Required | Description | Default |
|---|---|---|---|
| LOG_LEVEL | No | Python logging level | INFO |
| CHUNK_SIZE | No | Characters per chunk | 1000 |
| CHUNK_OVERLAP | No | Overlap between consecutive chunks | 200 |
| MAX_CRAWL_DEPTH | No | Maximum crawl depth | 2 |
| AZURE_SEARCH_KEY | Yes | Admin API key | |
| MAX_PAGES_PER_SITE | No | Hard cap on pages per crawl | 100 |
| CRAWL_DELAY_SECONDS | No | Polite delay between requests | 0.5 |
| PLAYWRIGHT_HEADLESS | No | Run Chromium headless | true |
| AZURE_SEARCH_ENDPOINT | Yes | Azure AI Search service URL | |
| PLAYWRIGHT_TIMEOUT_MS | No | Playwright page load timeout (ms) | 30000 |
| AZURE_SEARCH_INDEX_NAME | No | Target index name | website-content |
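The table above can be read as a small settings loader. The following is a minimal sketch of how such variables might be parsed, not the server's actual code: the `Settings` class, the `load_settings` function, and the set of accepted boolean spellings are all hypothetical names and choices made for illustration. Only the variable names and defaults come from the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Hypothetical helper: the accepted spellings are an assumption,
# the table only shows "true" as the default.
_TRUE_VALUES = {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    """Illustrative container mirroring the variables in the table above."""
    azure_search_endpoint: str
    azure_search_key: str
    azure_search_index_name: str
    log_level: str
    chunk_size: int
    chunk_overlap: int
    max_crawl_depth: int
    max_pages_per_site: int
    crawl_delay_seconds: float
    playwright_headless: bool
    playwright_timeout_ms: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read settings from a mapping (os.environ by default), applying table defaults."""
    env = os.environ if env is None else env

    # The two variables marked "Yes" in the Required column have no default.
    missing = [n for n in ("AZURE_SEARCH_ENDPOINT", "AZURE_SEARCH_KEY") if not env.get(n)]
    if missing:
        raise RuntimeError("Missing required environment variables: " + ", ".join(missing))

    headless_raw = env.get("PLAYWRIGHT_HEADLESS", "true")
    return Settings(
        azure_search_endpoint=env["AZURE_SEARCH_ENDPOINT"],
        azure_search_key=env["AZURE_SEARCH_KEY"],
        azure_search_index_name=env.get("AZURE_SEARCH_INDEX_NAME", "website-content"),
        log_level=env.get("LOG_LEVEL", "INFO"),
        chunk_size=int(env.get("CHUNK_SIZE", "1000")),
        chunk_overlap=int(env.get("CHUNK_OVERLAP", "200")),
        max_crawl_depth=int(env.get("MAX_CRAWL_DEPTH", "2")),
        max_pages_per_site=int(env.get("MAX_PAGES_PER_SITE", "100")),
        crawl_delay_seconds=float(env.get("CRAWL_DELAY_SECONDS", "0.5")),
        playwright_headless=headless_raw.strip().lower() in _TRUE_VALUES,
        playwright_timeout_ms=int(env.get("PLAYWRIGHT_TIMEOUT_MS", "30000")),
    )
```

Passing a plain dictionary instead of relying on `os.environ` makes the loader easy to exercise in tests; calling `load_settings()` with no argument reads the real process environment.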
Capabilities
Features and capabilities supported by this server
| Capability | Details |
|---|---|
| tools | {"listChanged": false} |
| experimental | {} |
Tools
Functions exposed to the LLM to take actions
| Name | Description |
|---|---|
| scrape_website | Scrape a single web page. Automatically detects whether the page is static (uses httpx + BeautifulSoup) or dynamic/JS-rendered (uses Playwright headless Chromium). Returns the page title, clean content, all internal/external links, and page metadata. |
| crawl_website | BFS-crawl an entire website starting from the given root URL. Only follows internal (same-domain) links. Respects robots.txt. Avoids duplicate URLs. Limits crawl depth and total page count. Returns every scraped page with title, content, and links. |
| clean_content | Clean raw HTML by removing scripts, styles, navigation bars, footers, cookie banners, ads, and other noise. Returns readable plain text keeping only main article content, headings, paragraphs, tables, and lists. |
| chunk_content | Split clean text into overlapping chunks (~1,000 characters each, 200-character overlap). Each chunk has a unique deterministic ID derived from the URL and position. Useful for preparing text for vector embedding or search indexing. |
| scrape_full_site | End-to-end pipeline: crawl every internal page of a website, clean the HTML of each page, and optionally split into chunks. Returns a structured result with every page's title, clean content, links, metadata, and (if requested) text chunks. Handles both static and dynamic pages automatically. |
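The chunking behaviour described in the table (overlapping windows with a deterministic ID derived from the URL and position) can be sketched as follows. This is an illustrative approximation, not the server's implementation: the function name `chunk_text`, the returned dictionary keys, and the use of a truncated SHA-256 digest for the ID are all assumptions. Only the 1,000-character size and 200-character overlap come from the listing.

```python
import hashlib
from typing import Dict, List, Union

Chunk = Dict[str, Union[str, int]]


def chunk_text(text: str, url: str, chunk_size: int = 1000, overlap: int = 200) -> List[Chunk]:
    """Split text into fixed-size character windows that overlap by `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each window advances
    chunks: List[Chunk] = []
    start = 0
    index = 0
    while start < len(text):
        piece = text[start:start + chunk_size]
        # Assumed ID scheme: hash of URL plus chunk position, so the same
        # page and position always yield the same ID across runs.
        digest = hashlib.sha256(f"{url}#{index}".encode("utf-8")).hexdigest()
        chunks.append({"id": digest[:16], "index": index, "start": start, "text": piece})
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
        index += 1
    return chunks
```

Because the ID depends only on the URL and the chunk position, re-indexing the same page overwrites existing entries instead of creating duplicates, which is the usual reason for preferring deterministic IDs over random ones in a search index.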
Prompts
Interactive templates invoked by user choice
| Name | Description |
|---|---|
| No prompts | |
Resources
Contextual data attached and managed by the client
| Name | Description |
|---|---|
| No resources | |
MCP directory API
We provide all the information about MCP servers via our MCP API.
curl -X GET 'https://glama.ai/api/mcp/v1/servers/lalit9168/web-scrapping'
If you have feedback or need assistance with the MCP directory API, please join our Discord server.