scrape_webpage

Extracts content and metadata from a webpage given its URL, returning the result as JSON for structured data analysis and integration.

Instructions

Scrape content and metadata from a single webpage using Crawl4AI.

Args:
    url: The URL of the webpage to scrape

Returns:
    List containing TextContent with the result as JSON.
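As a sketch of the two result payloads a client can expect (field names taken from the implementation below; the sample values are hypothetical):

```python
import json

# Success shape: the scraped page rendered as Markdown.
success = json.dumps({"markdown": "# Example Domain\n\nSample page text..."})

# Failure shape (invalid URL or crawl error): flag, echoed URL, and message.
failure = json.dumps({
    "success": False,
    "url": "not a url",  # hypothetical bad input
    "error": "Invalid URL format",
})

print(sorted(json.loads(success)), json.loads(failure)["error"])
```

Both payloads arrive wrapped in a single TextContent item, so the client must JSON-decode the `text` field before use.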

Input Schema

Name    Required    Description                         Default
url     Yes         The URL of the webpage to scrape    -
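An MCP client would invoke this tool with a standard JSON-RPC 2.0 `tools/call` request; a minimal sketch of that payload (the target URL is a placeholder):

```python
import json

# Minimal "tools/call" request body an MCP client would send for this tool.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "scrape_webpage",
        "arguments": {"url": "https://example.com"},  # placeholder URL
    },
}
payload = json.dumps(request)
print(payload)
```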

Implementation Reference

  • main.py:10-24 (handler)
    The MCP tool handler for 'scrape_webpage', decorated with @mcp.tool(). It accepts a URL parameter and delegates execution to the scrape_url helper function from tools/scrape.py, returning scraped content as typed MCP content.
    @mcp.tool()
    async def scrape_webpage(
        url: str,
    ) -> list[types.TextContent | types.ImageContent | types.EmbeddedResource]:
        """
        Scrape content and metadata from a single webpage using Crawl4AI.
    
        Args:
            url: The URL of the webpage to scrape
    
        Returns:
            List containing TextContent with the result as JSON.
        """
        return await scrape_url(url)
  • tools/scrape.py (scrape_url helper)
    Supporting helper function that performs the actual webpage scraping using Crawl4AI's AsyncWebCrawler. It handles URL validation, browser configuration, crawling, and error handling, and formats the output as MCP TextContent.
    # Imports needed by this helper (not shown in the original snippet):
    import asyncio
    import json
    import re
    from typing import Any, List

    import mcp.types as types
    from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig
    async def scrape_url(url: str) -> List[Any]:
        """Scrape a webpage using crawl4ai with simple implementation.
    
        Args:
            url: The URL to scrape
    
        Returns:
            A list containing TextContent object with the result as JSON
        """
    
        try:
            # Simple validation for domains/subdomains with http(s)
            url_pattern = re.compile(
                r"^(?:https?://)?(?:[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?\.)+[A-Za-z]{2,}(?:/[^/\s]*)*$"
            )
    
            if not url_pattern.match(url):
                return [
                    types.TextContent(
                        type="text",
                        text=json.dumps(
                            {
                                "success": False,
                                "url": url,
                                "error": "Invalid URL format",
                            }
                        ),
                    )
                ]
    
            # Add https:// if missing
            if not url.startswith("http://") and not url.startswith("https://"):
                url = f"https://{url}"
    
            # Use default configurations with minimal customization
            browser_config = BrowserConfig(
                browser_type="chromium",
                headless=True,
                ignore_https_errors=True,
                verbose=False,
                extra_args=[
                    "--no-sandbox",
                    "--disable-setuid-sandbox",
                    "--disable-dev-shm-usage",
                ],
            )
            run_config = CrawlerRunConfig(
                cache_mode=CacheMode.BYPASS,
                verbose=False,
                page_timeout=30 * 1000,  # Convert to milliseconds
            )
    
            async with AsyncWebCrawler(config=browser_config) as crawler:
                result = await asyncio.wait_for(
                    crawler.arun(
                        url=url,
                        config=run_config,
                    ),
                    timeout=30,
                )
    
                # Create response in the format requested
                return [
                    types.TextContent(
                        type="text", text=json.dumps({"markdown": result.markdown})
                    )
                ]
    
        except Exception as e:
            return [
                types.TextContent(
                    type="text",
                    text=json.dumps({"success": False, "url": url, "error": str(e)}),
                )
            ]
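The validation regex above accepts bare domains as well as full http(s) URLs, but rejects hostnames with no dot-separated TLD and anything containing whitespace. A standalone check of that behavior (the sample URLs are illustrative):

```python
import re

# The same validation pattern used by scrape_url, copied verbatim.
url_pattern = re.compile(
    r"^(?:https?://)?(?:[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?\.)+[A-Za-z]{2,}(?:/[^/\s]*)*$"
)

print(bool(url_pattern.match("example.com")))              # bare domain -> True
print(bool(url_pattern.match("https://example.com/a/b")))  # scheme + path -> True
print(bool(url_pattern.match("localhost")))                # no dot/TLD -> False
print(bool(url_pattern.match("not a url")))                # whitespace -> False
```

Because the scheme is optional in the pattern, the helper prepends `https://` afterwards for bare-domain inputs before crawling.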