# webcrawl_scrape

Fetch a web page by URL and extract its main article content as markdown.

## Instructions

Fetch a URL and extract main content as markdown.

## Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | The URL to scrape | |
| timeout | No | Request timeout in seconds | 30 |
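As a sketch, the arguments payload a client would send for this tool looks like the following plain Python dict (the exact wire format depends on the MCP client; the values here are placeholders):

```python
# Hypothetical arguments payload for the webcrawl_scrape tool.
# Only "url" is required; "timeout" falls back to the server-side
# default of 30 seconds when omitted.
args = {
    "url": "https://example.com/article",
    "timeout": 30,  # optional
}

minimal_args = {"url": "https://example.com/article"}

print(sorted(args))
print(sorted(minimal_args))
```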
## Output Schema

| Name | Required | Description | Default |
|---|---|---|---|
| content | Yes | Markdown of the page's main content | |
| source | Yes | Provenance: one of "static_http", "static_http_retry", "firecrawl_transport_fallback", "firecrawl_quality_fallback" | |
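A successful call always returns both fields. As an illustrative check (the allowed source values are those defined by ProvenanceSource in scraper.py; the result dict itself is a made-up example):

```python
# The four provenance values defined by ProvenanceSource in scraper.py.
ALLOWED_SOURCES = {
    "static_http",
    "static_http_retry",
    "firecrawl_transport_fallback",
    "firecrawl_quality_fallback",
}

# Hypothetical result dict, shaped like the handler's return value.
result = {
    "content": "# Example Title\n\nBody text...",
    "source": "static_http",
}

assert set(result) == {"content", "source"}
assert result["source"] in ALLOWED_SOURCES
```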
## Implementation Reference
- src/webcrawl_mcp/server.py:14-30 (handler): The MCP tool handler for 'webcrawl_scrape'. Decorated with @mcp.tool, it calls scrape() and returns content + source.

  ```python
  @mcp.tool
  async def webcrawl_scrape(url: str, timeout: int = DEFAULT_TIMEOUT) -> dict:
      """Fetch a URL and extract main content as markdown.

      Args:
          url: The URL to scrape
          timeout: Request timeout in seconds (default: 30)

      Returns:
          Dict with:
          - content: markdown of the page's main content
          - source: one of "static_http", "static_http_retry",
            "firecrawl_transport_fallback", "firecrawl_quality_fallback"
            (see Issue #1)
      """
      result = await scrape(url, timeout)
      return {"content": result.content, "source": result.source}
  ```

- src/webcrawl_mcp/server.py:11-11 (registration): Tool registration via FastMCP. The 'mcp' FastMCP instance is used with the @mcp.tool decorator to register webcrawl_scrape (and other tools).

  ```python
  mcp = FastMCP("Webcrawl")
  ```

- src/webcrawl_mcp/scraper.py:29-47 (schema): The ProvenanceSource type literal and ScrapeResult dataclass define the output schema returned by webcrawl_scrape.

  ```python
  ProvenanceSource = Literal[
      "static_http",
      "static_http_retry",
      "firecrawl_transport_fallback",
      "firecrawl_quality_fallback",
  ]


  @dataclass(frozen=True)
  class ScrapeResult:
      """Scrape output with provenance.

      Attributes:
          content: Extracted markdown content
          source: How the content was obtained (see ProvenanceSource)
      """

      content: str
      source: ProvenanceSource
  ```

- src/webcrawl_mcp/scraper.py:251-298 (helper): The core scrape() function called by the handler. Fetches HTML, extracts content via trafilatura/markdownify, with fallback to Firecrawl and caching.

  ```python
  async def scrape(url: str, timeout: int = DEFAULT_TIMEOUT) -> ScrapeResult:
      """Fetch URL and extract main content as markdown.

      Dispatch:
      - 2xx → local extraction (trafilatura → markdownify); Firecrawl as a
        quality fallback if the result is below MIN_CONTENT_LENGTH.
      - {403, 429, 503} → polite retry (429 only) and/or Firecrawl transport
        fallback, gated on POLITE_MODE and FALLBACK_ON_TRANSPORT_ERROR.

      See Issue #1 for design rationale.

      Args:
          url: The URL to scrape
          timeout: Request timeout in seconds

      Returns:
          ScrapeResult carrying content and provenance source.
      """
      cached = cache.get(url)
      if cached is not None:
          return cached

      kind, payload, source = await _fetch_html_or_fallback(url, timeout)
      if kind == "firecrawl":
          result = ScrapeResult(content=payload, source=source)
          cache.set(url, result)
          return result

      content = _extract(payload, url)
      if _is_low_quality(content) and firecrawl_configured():
          print(
              f"[webcrawl] content still low quality, trying Firecrawl for {url}",
              file=sys.stderr,
          )
          firecrawl_content = await scrape_with_firecrawl(url, timeout)
          if firecrawl_content and len(firecrawl_content) > len(content or ""):
              result = ScrapeResult(
                  content=firecrawl_content,
                  source="firecrawl_quality_fallback",
              )
              cache.set(url, result)
              return result

      result = ScrapeResult(content=content, source=source)
      cache.set(url, result)
      return result
  ```
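The dispatch described in the scrape() docstring can be sketched as a pure decision function. This is an illustration only: decide_route, its parameters, and its return labels are not part of the actual module; they just mirror the 2xx / {403, 429, 503} branching gated on POLITE_MODE and FALLBACK_ON_TRANSPORT_ERROR.

```python
def decide_route(status: int, polite_mode: bool,
                 fallback_on_transport_error: bool) -> str:
    """Illustrative sketch of scrape()'s dispatch on the HTTP status."""
    if 200 <= status < 300:
        # 2xx: extract locally (trafilatura -> markdownify); Firecrawl
        # only kicks in later as a quality fallback.
        return "local_extraction"
    if status in {403, 429, 503}:
        if status == 429 and polite_mode:
            # Only 429 gets a polite retry, gated on POLITE_MODE.
            return "polite_retry"
        if fallback_on_transport_error:
            return "firecrawl_transport_fallback"
    return "error"


print(decide_route(200, True, True))   # -> local_extraction
print(decide_route(429, True, True))   # -> polite_retry
print(decide_route(403, True, True))   # -> firecrawl_transport_fallback
```

Keeping this branching free of I/O makes it easy to unit-test the policy separately from the fetching code.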