Schema | Website Scraper MCP Server

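Environment variables used to configure the server. `AZURE_SEARCH_KEY` and `AZURE_SEARCH_ENDPOINT` are required; all others are optional and fall back to the defaults shown.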

Name	Required	Description	Default
`LOG_LEVEL`	No	Python logging level	INFO
`CHUNK_SIZE`	No	Characters per chunk	1000
`CHUNK_OVERLAP`	No	Overlap between consecutive chunks	200
`MAX_CRAWL_DEPTH`	No	Maximum crawl depth	2
`AZURE_SEARCH_KEY`	Yes	Admin API key
`MAX_PAGES_PER_SITE`	No	Hard cap on pages per crawl	100
`CRAWL_DELAY_SECONDS`	No	Polite delay between requests	0.5
`PLAYWRIGHT_HEADLESS`	No	Run Chromium headless	true
`AZURE_SEARCH_ENDPOINT`	Yes	Azure AI Search service URL
`PLAYWRIGHT_TIMEOUT_MS`	No	Playwright page load timeout (ms)	30000
`AZURE_SEARCH_INDEX_NAME`	No	Target index name	website-content
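
The table above can be read into a typed settings object. The following is a minimal sketch, not the server's actual loader: the `Settings` class, the `load_settings` function, and the set of strings accepted as "true" for `PLAYWRIGHT_HEADLESS` are all assumptions made for illustration. Only the variable names and defaults come from the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# The two variables the table marks as required.
REQUIRED = ("AZURE_SEARCH_ENDPOINT", "AZURE_SEARCH_KEY")
# Assumption: which strings count as "true" is not documented by the server.
TRUE_VALUES = {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:  # hypothetical container, not part of the server
    azure_search_endpoint: str
    azure_search_key: str
    azure_search_index_name: str
    log_level: str
    chunk_size: int
    chunk_overlap: int
    max_crawl_depth: int
    max_pages_per_site: int
    crawl_delay_seconds: float
    playwright_headless: bool
    playwright_timeout_ms: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from a mapping (os.environ by default), applying table defaults."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required environment variables: " + ", ".join(missing))
    return Settings(
        azure_search_endpoint=env["AZURE_SEARCH_ENDPOINT"],
        azure_search_key=env["AZURE_SEARCH_KEY"],
        azure_search_index_name=env.get("AZURE_SEARCH_INDEX_NAME", "website-content"),
        log_level=env.get("LOG_LEVEL", "INFO").upper(),
        chunk_size=int(env.get("CHUNK_SIZE", "1000")),
        chunk_overlap=int(env.get("CHUNK_OVERLAP", "200")),
        max_crawl_depth=int(env.get("MAX_CRAWL_DEPTH", "2")),
        max_pages_per_site=int(env.get("MAX_PAGES_PER_SITE", "100")),
        crawl_delay_seconds=float(env.get("CRAWL_DELAY_SECONDS", "0.5")),
        playwright_headless=env.get("PLAYWRIGHT_HEADLESS", "true").strip().lower() in TRUE_VALUES,
        playwright_timeout_ms=int(env.get("PLAYWRIGHT_TIMEOUT_MS", "30000")),
    )
```

Passing a plain dictionary instead of relying on `os.environ` keeps the sketch easy to try out without exporting real credentials.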

Features and capabilities supported by this server

Capability	Details
`tools`	{ "listChanged": false }
`experimental`	{}

Functions exposed to the LLM to take actions

Name	Description
scrape_website	Scrape a single web page. Automatically detects whether the page is static (uses httpx + BeautifulSoup) or dynamic/JS-rendered (uses Playwright headless Chromium). Returns the page title, clean content, all internal/external links, and page metadata.
crawl_website	BFS-crawl an entire website starting from the given root URL. Only follows internal (same-domain) links. Respects robots.txt. Avoids duplicate URLs. Limits crawl depth and total page count. Returns every scraped page with title, content, and links.
clean_content	Clean raw HTML by removing scripts, styles, navigation bars, footers, cookie banners, ads, and other noise. Returns readable plain text keeping only main article content, headings, paragraphs, tables, and lists.
chunk_content	Split clean text into overlapping chunks (~1,000 characters each, 200-character overlap). Each chunk has a unique deterministic ID derived from the URL and position. Useful for preparing text for vector embedding or search indexing.
scrape_full_site	End-to-end pipeline: crawl every internal page of a website, clean the HTML of each page, and optionally split into chunks. Returns a structured result with every page's title, clean content, links, metadata, and (if requested) text chunks. Handles both static and dynamic pages automatically.
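
The chunking behaviour described for the chunk tool (overlapping windows with a deterministic ID per chunk) can be sketched as follows. This is an illustration under stated assumptions, not the server's code: the function name `chunk_text`, the fixed-width sliding window, the returned dictionary keys, and the SHA-256 ID scheme are all hypothetical. The listing only says the ID is derived from the URL and position, and the "~" in the chunk size suggests the real implementation may split at more natural boundaries.

```python
import hashlib
from typing import Dict, List, Union


def chunk_text(
    text: str, url: str, chunk_size: int = 1000, overlap: int = 200
) -> List[Dict[str, Union[str, int]]]:
    """Split text into fixed-width windows that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # distance between consecutive chunk starts
    chunks: List[Dict[str, Union[str, int]]] = []
    for index, start in enumerate(range(0, len(text), step)):
        # Assumed ID scheme: hash of URL plus chunk index, so reruns give the same IDs.
        digest = hashlib.sha256(f"{url}#{index}".encode("utf-8")).hexdigest()
        chunks.append(
            {
                "id": digest[:16],
                "index": index,
                "start": start,
                "text": text[start : start + chunk_size],
            }
        )
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

With the default sizes, a 2,500-character input yields three chunks starting at offsets 0, 800, and 1,600, and the last 200 characters of each chunk repeat as the first 200 of the next.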

Interactive templates invoked by user choice

Name	Description
No prompts

Contextual data attached and managed by the client

Name	Description
No resources

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/lalit9168/web-scrapping'
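
The same request can be made from Python using only the standard library. This is a rough sketch: the helper names `server_url` and `fetch_server` are hypothetical, and it assumes the endpoint returns a JSON body, which the listing does not state.

```python
import json
import urllib.request
from urllib.parse import quote

API_BASE = "https://glama.ai/api/mcp/v1/servers"


def server_url(owner: str, name: str) -> str:
    """Build the directory API URL for one server, escaping each path segment."""
    return f"{API_BASE}/{quote(owner, safe='')}/{quote(name, safe='')}"


def fetch_server(owner: str, name: str, timeout: float = 10.0):
    """GET the server record. Assumes the response body is JSON."""
    request = urllib.request.Request(
        server_url(owner, name),
        method="GET",
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_server("lalit9168", "web-scrapping")` issues the same GET request as the curl command above.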
