Website Scraper MCP Server

by lalit9168

Server Configuration

Environment variables read by the server. Only AZURE_SEARCH_KEY and AZURE_SEARCH_ENDPOINT are required; every other variable is optional and has a default.

Name                     Required  Description                         Default
LOG_LEVEL                No        Python logging level                INFO
CHUNK_SIZE               No        Characters per chunk                1000
CHUNK_OVERLAP            No        Overlap between consecutive chunks  200
MAX_CRAWL_DEPTH          No        Maximum crawl depth                 2
AZURE_SEARCH_KEY         Yes       Admin API key
MAX_PAGES_PER_SITE       No        Hard cap on pages per crawl         100
CRAWL_DELAY_SECONDS      No        Polite delay between requests       0.5
PLAYWRIGHT_HEADLESS      No        Run Chromium headless               true
AZURE_SEARCH_ENDPOINT    Yes       Azure AI Search service URL
PLAYWRIGHT_TIMEOUT_MS    No        Playwright page load timeout (ms)   30000
AZURE_SEARCH_INDEX_NAME  No        Target index name                   website-content
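
The variables above could be read at startup roughly as follows. This is a minimal sketch, not the server's actual code: the `Settings` class and `load_settings` function are hypothetical names, and only the variable names, required flags, and defaults come from the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the documented environment variables."""
    azure_search_endpoint: str
    azure_search_key: str
    azure_search_index_name: str
    log_level: str
    chunk_size: int
    chunk_overlap: int
    max_crawl_depth: int
    max_pages_per_site: int
    crawl_delay_seconds: float
    playwright_headless: bool
    playwright_timeout_ms: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from a mapping (os.environ by default)."""
    env = os.environ if env is None else env

    # The two Azure variables are the only ones marked as required.
    required = ("AZURE_SEARCH_ENDPOINT", "AZURE_SEARCH_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required environment variables: " + ", ".join(missing))

    # Assumption: any of "1", "true", "yes" (case-insensitive) means headless.
    headless_raw = env.get("PLAYWRIGHT_HEADLESS", "true").strip().lower()

    return Settings(
        azure_search_endpoint=env["AZURE_SEARCH_ENDPOINT"],
        azure_search_key=env["AZURE_SEARCH_KEY"],
        azure_search_index_name=env.get("AZURE_SEARCH_INDEX_NAME", "website-content"),
        log_level=env.get("LOG_LEVEL", "INFO"),
        chunk_size=int(env.get("CHUNK_SIZE", "1000")),
        chunk_overlap=int(env.get("CHUNK_OVERLAP", "200")),
        max_crawl_depth=int(env.get("MAX_CRAWL_DEPTH", "2")),
        max_pages_per_site=int(env.get("MAX_PAGES_PER_SITE", "100")),
        crawl_delay_seconds=float(env.get("CRAWL_DELAY_SECONDS", "0.5")),
        playwright_headless=headless_raw in ("1", "true", "yes"),
        playwright_timeout_ms=int(env.get("PLAYWRIGHT_TIMEOUT_MS", "30000")),
    )
```

Passing a plain dictionary instead of relying on `os.environ` keeps the sketch easy to try without exporting real credentials.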

Capabilities

Features and capabilities supported by this server

Capability    Details
tools         { "listChanged": false }
experimental  {}

Tools

Functions exposed to the LLM to take actions

Name    Description
scrape_website

Scrape a single web page. Automatically detects whether the page is static (uses httpx + BeautifulSoup) or dynamic/JS-rendered (uses Playwright headless Chromium). Returns the page title, clean content, all internal/external links, and page metadata.

crawl_website

BFS-crawl an entire website starting from the given root URL. Only follows internal (same-domain) links. Respects robots.txt. Avoids duplicate URLs. Limits crawl depth and total page count. Returns every scraped page with title, content, and links.
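
The crawl described for crawl_website can be sketched as a breadth-first traversal with a visited set, a depth limit, and a page cap. This is an illustration under stated assumptions, not the server's implementation: `crawl` and the injected `fetch_links` callback are hypothetical, and the robots.txt check and polite delay are left out to keep the example free of network access.

```python
from collections import deque
from typing import Callable, Iterable, List
from urllib.parse import urldefrag, urljoin, urlparse


def crawl(
    root: str,
    fetch_links: Callable[[str], Iterable[str]],
    max_depth: int = 2,
    max_pages: int = 100,
) -> List[str]:
    """Return same-domain URLs in BFS order, starting from root.

    fetch_links(url) is a hypothetical callback that returns the hrefs
    found on a page; a real crawler would download and parse the page here.
    """
    root = urldefrag(root).url          # ignore #fragments when comparing URLs
    host = urlparse(root).netloc
    seen = {root}                       # URLs already queued, to avoid duplicates
    queue = deque([(root, 0)])
    visited: List[str] = []

    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue                    # do not expand links beyond the depth limit
        for href in fetch_links(url):
            link = urldefrag(urljoin(url, href)).url
            if urlparse(link).netloc != host or link in seen:
                continue                # skip external links and duplicates
            seen.add(link)
            queue.append((link, depth + 1))
    return visited
```

The defaults mirror MAX_CRAWL_DEPTH and MAX_PAGES_PER_SITE. A fuller version could consult `urllib.robotparser` before each fetch and sleep for CRAWL_DELAY_SECONDS between requests.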

clean_content

Clean raw HTML by removing scripts, styles, navigation bars, footers, cookie banners, ads, and other noise. Returns readable plain text keeping only main article content, headings, paragraphs, tables, and lists.
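
As a rough illustration of this kind of cleaning, the sketch below drops a fixed set of noise tags and keeps the remaining text, one block per line. It uses only the standard library `html.parser` so that it stays self-contained; the tag sets and the `clean_html` name are assumptions, and cookie-banner or ad detection is not attempted.

```python
from html.parser import HTMLParser
from typing import List

# Assumed tag sets for this sketch; the server's real rules are not documented here.
NOISE_TAGS = {"script", "style", "noscript", "nav", "footer", "aside"}
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "ul", "ol", "table",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class _TextExtractor(HTMLParser):
    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0            # > 0 while inside a noise element
        self.parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS and not self._skip_depth:
            self.parts.append("\n")     # block elements start a new line

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS and not self._skip_depth:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def clean_html(html: str) -> str:
    """Return readable text with noise elements removed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace within each line and drop empty lines.
    lines = (" ".join(line.split()) for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)
```

A production cleaner would more likely work on a parsed tree (for example with BeautifulSoup, which the server uses for static pages) so that it can score and select the main content region.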

chunk_content

Split clean text into overlapping chunks (~1 000 characters each, 200-character overlap). Each chunk has a unique deterministic ID derived from the URL and position. Useful for preparing text for vector embedding or search indexing.
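
A sliding window is one straightforward way to produce chunks like these. The sketch below is an assumption about the general approach, not the server's code: `chunk_text` is a hypothetical name, and hashing the URL together with the chunk index is just one way to obtain a deterministic ID.

```python
import hashlib
from typing import Dict, List, Union


def chunk_text(
    text: str,
    url: str,
    chunk_size: int = 1000,
    overlap: int = 200,
) -> List[Dict[str, Union[str, int]]]:
    """Split text into overlapping character windows with stable IDs."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    step = chunk_size - overlap         # each window starts this far after the last
    chunks: List[Dict[str, Union[str, int]]] = []
    for index, start in enumerate(range(0, len(text), step)):
        piece = text[start:start + chunk_size]
        # Assumed ID scheme: the same URL and position always hash to the same ID.
        chunk_id = hashlib.sha256(f"{url}#{index}".encode("utf-8")).hexdigest()[:16]
        chunks.append({"id": chunk_id, "index": index, "start": start, "text": piece})
        if start + chunk_size >= len(text):
            break                       # this window already reaches the end
    return chunks
```

With the defaults, consecutive chunks start 800 characters apart, so the last 200 characters of one chunk are repeated at the start of the next. The defaults mirror CHUNK_SIZE and CHUNK_OVERLAP.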

scrape_full_site

End-to-end pipeline: crawl every internal page of a website, clean the HTML of each page, and optionally split into chunks. Returns a structured result with every page's title, clean content, links, metadata, and (if requested) text chunks. Handles both static and dynamic pages automatically.

Prompts

Interactive templates invoked by user choice

No prompts

Resources

Contextual data attached and managed by the client

No resources

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/lalit9168/web-scrapping'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.