
scrape_webpage

Extract content and metadata from any webpage by providing its URL, returning results in JSON format for structured data analysis and integration.

Instructions

Scrape content and metadata from a single webpage using Crawl4AI.

Args:
    url: The URL of the webpage to scrape

Returns:
    A list containing a TextContent object with the result as JSON.

Input Schema

| Name | Required | Description                      | Default |
| ---- | -------- | -------------------------------- | ------- |
| url  | Yes      | The URL of the webpage to scrape |         |
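
For illustration, here is a minimal sketch of invoking this tool from an MCP client using the official `mcp` Python SDK. It assumes the server is launched as `python main.py` over stdio; the launch command is an assumption, not something this page documents.

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def main() -> None:
    # Assumption: the server starts with `python main.py` and speaks stdio.
    params = StdioServerParameters(command="python", args=["main.py"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # 'url' is the tool's single required argument (see the schema above).
            result = await session.call_tool(
                "scrape_webpage", {"url": "https://example.com"}
            )
            # The result content is the JSON-encoded TextContent described above.
            print(result.content[0].text)


asyncio.run(main())
```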

Implementation Reference

  • main.py:10-24 (handler)

    The MCP tool handler for 'scrape_webpage', decorated with @mcp.tool(). It accepts a URL parameter and delegates execution to the scrape_url helper function from tools/scrape.py, returning the scraped content as typed MCP content.

    ```python
    @mcp.tool()
    async def scrape_webpage(
        url: str,
    ) -> list[types.TextContent | types.ImageContent | types.EmbeddedResource]:
        """
        Scrape content and metadata from a single webpage using Crawl4AI.

        Args:
            url: The URL of the webpage to scrape

        Returns:
            List containing TextContent with the result as JSON.
        """
        return await scrape_url(url)
    ```

  • tools/scrape.py (scrape_url helper)

    Supporting helper function scrape_url that performs the actual webpage scraping using Crawl4AI's AsyncWebCrawler. It handles URL validation, browser configuration, crawling, and error handling, and formats the output as MCP TextContent.

    ```python
    # Imports inferred from usage in this excerpt (defined at the top of tools/scrape.py):
    #   import asyncio, json, re
    #   from typing import Any, List
    #   import mcp.types as types
    #   from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig

    async def scrape_url(url: str) -> List[Any]:
        """Scrape a webpage using crawl4ai with simple implementation.

        Args:
            url: The URL to scrape

        Returns:
            A list containing TextContent object with the result as JSON
        """
        try:
            # Simple validation for domains/subdomains with http(s)
            url_pattern = re.compile(
                r"^(?:https?://)?(?:[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?\.)+[A-Za-z]{2,}(?:/[^/\s]*)*$"
            )
            if not url_pattern.match(url):
                return [
                    types.TextContent(
                        type="text",
                        text=json.dumps(
                            {
                                "success": False,
                                "url": url,
                                "error": "Invalid URL format",
                            }
                        ),
                    )
                ]

            # Add https:// if missing
            if not url.startswith("http://") and not url.startswith("https://"):
                url = f"https://{url}"

            # Use default configurations with minimal customization
            browser_config = BrowserConfig(
                browser_type="chromium",
                headless=True,
                ignore_https_errors=True,
                verbose=False,
                extra_args=[
                    "--no-sandbox",
                    "--disable-setuid-sandbox",
                    "--disable-dev-shm-usage",
                ],
            )
            run_config = CrawlerRunConfig(
                cache_mode=CacheMode.BYPASS,
                verbose=False,
                page_timeout=30 * 1000,  # Convert to milliseconds
            )

            async with AsyncWebCrawler(config=browser_config) as crawler:
                result = await asyncio.wait_for(
                    crawler.arun(
                        url=url,
                        config=run_config,
                    ),
                    timeout=30,
                )

            # Create response in the format requested
            return [
                types.TextContent(
                    type="text", text=json.dumps({"markdown": result.markdown})
                )
            ]
        except Exception as e:
            return [
                types.TextContent(
                    type="text",
                    text=json.dumps({"success": False, "url": url, "error": str(e)}),
                )
            ]
    ```
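
As a usage sketch (not part of the repository), the JSON inside the returned TextContent can be unpacked like this; the two payload shapes follow directly from the code above, and the `demo` wrapper is hypothetical.

```python
import asyncio
import json

from tools.scrape import scrape_url  # per the file path cited above


async def demo() -> None:
    contents = await scrape_url("https://example.com")
    payload = json.loads(contents[0].text)
    if payload.get("success") is False:
        # Error shape emitted by both the validation and exception branches
        print(f"scrape failed for {payload['url']}: {payload['error']}")
    else:
        # Success shape: the page rendered as Markdown
        print(payload["markdown"][:200])


asyncio.run(demo())
```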

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/ritvij14/crawl4ai-mcp'
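
The same lookup in Python, sketched with the third-party httpx client; the response shape is not documented here, so treat it as opaque JSON.

```python
import httpx

# Fetch this server's entry from the Glama MCP directory API.
response = httpx.get("https://glama.ai/api/mcp/v1/servers/ritvij14/crawl4ai-mcp")
response.raise_for_status()
print(response.json())  # assumption: the endpoint returns a JSON document
```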

If you have feedback or need assistance with the MCP directory API, please join our Discord server.