scrape_webpage

Extracts content and metadata from a webpage given its URL, returning the result as JSON for structured data analysis and integration.

Instructions

Scrape content and metadata from a single webpage using Crawl4AI.

Args:
    url: The URL of the webpage to scrape

Returns:
    List containing TextContent with the result as JSON.
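As a sketch of the two result payloads a client can expect (field names taken from the implementation below; the sample values are hypothetical):

```python
import json

# Success shape: the scraped page rendered as Markdown.
success = json.dumps({"markdown": "# Example Domain\n\nSample page text..."})

# Failure shape (invalid URL or crawl error): flag, echoed URL, and message.
failure = json.dumps({
    "success": False,
    "url": "not a url",  # hypothetical bad input
    "error": "Invalid URL format",
})

print(sorted(json.loads(success)), json.loads(failure)["error"])
```

Both payloads arrive wrapped in a single TextContent item, so the client must JSON-decode the `text` field before use.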

Input Schema

Name    Required    Description                         Default
url     Yes         The URL of the webpage to scrape    -
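An MCP client would invoke this tool with a standard JSON-RPC 2.0 `tools/call` request; a minimal sketch of that payload (the target URL is a placeholder):

```python
import json

# Minimal "tools/call" request body an MCP client would send for this tool.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "scrape_webpage",
        "arguments": {"url": "https://example.com"},  # placeholder URL
    },
}
payload = json.dumps(request)
print(payload)
```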

Implementation Reference

  • main.py:10-24 (handler)
    The MCP tool handler for 'scrape_webpage', decorated with @mcp.tool(). It accepts a URL parameter and delegates execution to the scrape_url helper function from tools/scrape.py, returning scraped content as typed MCP content.
    @mcp.tool()
    async def scrape_webpage(
        url: str,
    ) -> list[types.TextContent | types.ImageContent | types.EmbeddedResource]:
        """
        Scrape content and metadata from a single webpage using Crawl4AI.
    
        Args:
            url: The URL of the webpage to scrape
    
        Returns:
            List containing TextContent with the result as JSON.
        """
        return await scrape_url(url)
  • tools/scrape.py (scrape_url helper)
    Supporting helper function that performs the actual webpage scraping using Crawl4AI's AsyncWebCrawler. It handles URL validation, browser configuration, crawling, and error handling, and formats the output as MCP TextContent.
    # Imports needed by this helper (not shown in the original snippet):
    import asyncio
    import json
    import re
    from typing import Any, List

    import mcp.types as types
    from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig
    async def scrape_url(url: str) -> List[Any]:
        """Scrape a webpage using crawl4ai with simple implementation.
    
        Args:
            url: The URL to scrape
    
        Returns:
            A list containing TextContent object with the result as JSON
        """
    
        try:
            # Simple validation for domains/subdomains with http(s)
            url_pattern = re.compile(
                r"^(?:https?://)?(?:[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?\.)+[A-Za-z]{2,}(?:/[^/\s]*)*$"
            )
    
            if not url_pattern.match(url):
                return [
                    types.TextContent(
                        type="text",
                        text=json.dumps(
                            {
                                "success": False,
                                "url": url,
                                "error": "Invalid URL format",
                            }
                        ),
                    )
                ]
    
            # Add https:// if missing
            if not url.startswith("http://") and not url.startswith("https://"):
                url = f"https://{url}"
    
            # Use default configurations with minimal customization
            browser_config = BrowserConfig(
                browser_type="chromium",
                headless=True,
                ignore_https_errors=True,
                verbose=False,
                extra_args=[
                    "--no-sandbox",
                    "--disable-setuid-sandbox",
                    "--disable-dev-shm-usage",
                ],
            )
            run_config = CrawlerRunConfig(
                cache_mode=CacheMode.BYPASS,
                verbose=False,
                page_timeout=30 * 1000,  # Convert to milliseconds
            )
    
            async with AsyncWebCrawler(config=browser_config) as crawler:
                result = await asyncio.wait_for(
                    crawler.arun(
                        url=url,
                        config=run_config,
                    ),
                    timeout=30,
                )
    
                # Create response in the format requested
                return [
                    types.TextContent(
                        type="text", text=json.dumps({"markdown": result.markdown})
                    )
                ]
    
        except Exception as e:
            return [
                types.TextContent(
                    type="text",
                    text=json.dumps({"success": False, "url": url, "error": str(e)}),
                )
            ]
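The validation regex above accepts bare domains as well as full http(s) URLs, but rejects hostnames with no dot-separated TLD and anything containing whitespace. A standalone check of that behavior (the sample URLs are illustrative):

```python
import re

# The same validation pattern used by scrape_url, copied verbatim.
url_pattern = re.compile(
    r"^(?:https?://)?(?:[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?\.)+[A-Za-z]{2,}(?:/[^/\s]*)*$"
)

print(bool(url_pattern.match("example.com")))              # bare domain -> True
print(bool(url_pattern.match("https://example.com/a/b")))  # scheme + path -> True
print(bool(url_pattern.match("localhost")))                # no dot/TLD -> False
print(bool(url_pattern.match("not a url")))                # whitespace -> False
```

Because the scheme is optional in the pattern, the helper prepends `https://` afterwards for bare-domain inputs before crawling.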