# fetch_content

Extract and parse webpage content from any URL to retrieve readable text for analysis and processing.
## Instructions

Fetch and parse content from a webpage URL.

Args:

- `url`: The webpage URL to fetch content from
- `ctx`: MCP context for logging
## Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | The webpage URL to fetch content from | (none) |
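The generated JSON Schema for this input would be roughly equivalent to the following sketch (field names and description are taken from the table above; the exact schema emitted by the server may differ):

```json
{
  "type": "object",
  "properties": {
    "url": {
      "type": "string",
      "description": "The webpage URL to fetch content from"
    }
  },
  "required": ["url"]
}
```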
## Implementation Reference
- Handler function for the `fetch_content` tool, registered via the `@mcp.tool()` decorator. It accepts a URL and a `Context`, and delegates to `WebContentFetcher.fetch_and_parse` to fetch, parse, and clean the webpage content, returning the processed text.

  ```python
  @mcp.tool()
  async def fetch_content(url: str, ctx: Context) -> str:
      """
      Fetch and parse content from a webpage URL.

      Args:
          url: The webpage URL to fetch content from
          ctx: MCP context for logging
      """
      return await fetcher.fetch_and_parse(url, ctx)
  ```
- Core implementation logic for fetching webpage content. Performs a rate-limited HTTP GET request using `httpx`, parses the HTML with BeautifulSoup, removes unwanted elements, extracts and cleans the text content, truncates it if necessary, and handles various errors gracefully.

  ```python
  async def fetch_and_parse(self, url: str, ctx: Context) -> str:
      """Fetch and parse content from a webpage"""
      try:
          await self.rate_limiter.acquire()
          await ctx.info(f"Fetching content from: {url}")

          async with httpx.AsyncClient() as client:
              response = await client.get(
                  url,
                  headers={
                      "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
                  },
                  follow_redirects=True,
                  timeout=30.0,
              )
              response.raise_for_status()

          # Parse the HTML
          soup = BeautifulSoup(response.text, "html.parser")

          # Remove script and style elements
          for element in soup(["script", "style", "nav", "header", "footer"]):
              element.decompose()

          # Get the text content
          text = soup.get_text()

          # Clean up the text
          lines = (line.strip() for line in text.splitlines())
          chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
          text = " ".join(chunk for chunk in chunks if chunk)

          # Remove extra whitespace
          text = re.sub(r"\s+", " ", text).strip()

          # Truncate if too long
          if len(text) > 8000:
              text = text[:8000] + "... [content truncated]"

          await ctx.info(
              f"Successfully fetched and parsed content ({len(text)} characters)"
          )
          return text

      except httpx.TimeoutException:
          await ctx.error(f"Request timed out for URL: {url}")
          return "Error: The request timed out while trying to fetch the webpage."
      except httpx.HTTPError as e:
          await ctx.error(f"HTTP error occurred while fetching {url}: {str(e)}")
          return f"Error: Could not access the webpage ({str(e)})"
      except Exception as e:
          await ctx.error(f"Error fetching content from {url}: {str(e)}")
          return f"Error: An unexpected error occurred while fetching the webpage ({str(e)})"
  ```
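The text-cleanup and truncation steps can be exercised in isolation. Below is a condensed, stdlib-only sketch approximating those steps (the `clean_text` helper name is hypothetical; the real method operates on live HTML fetched over the network):

```python
import re


def clean_text(raw: str, limit: int = 8000) -> str:
    """Approximate the cleanup pipeline: strip each line, split on double
    spaces, rejoin non-empty chunks, collapse whitespace, then truncate."""
    lines = (line.strip() for line in raw.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    text = " ".join(chunk for chunk in chunks if chunk)
    text = re.sub(r"\s+", " ", text).strip()
    if len(text) > limit:
        text = text[:limit] + "... [content truncated]"
    return text


print(clean_text("  Hello   world \n\n  from   MCP  "))  # → Hello world from MCP
```

Note that the final `re.sub(r"\s+", " ", text)` already collapses any remaining runs of whitespace, so the earlier generator-based cleanup mainly serves to drop blank lines early.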
- Class that initializes the `WebContentFetcher` with a `RateLimiter` (20 requests per minute); it is instantiated globally as `fetcher` for use by the `fetch_content` tool.

  ```python
  class WebContentFetcher:
      def __init__(self):
          self.rate_limiter = RateLimiter(requests_per_minute=20)
  ```
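The `RateLimiter` implementation itself is not shown in this reference. A plausible asyncio sliding-window sketch is below; only the class name and the `requests_per_minute` parameter come from the snippet above, and everything else is an assumption about how such a limiter is typically built:

```python
import asyncio
import time


class RateLimiter:
    """Hypothetical sliding-window limiter: allows at most
    requests_per_minute acquisitions per 60-second window."""

    def __init__(self, requests_per_minute: int = 20):
        self.requests_per_minute = requests_per_minute
        self._timestamps: list[float] = []

    async def acquire(self) -> None:
        now = time.monotonic()
        # Drop timestamps that have aged out of the 60-second window.
        self._timestamps = [t for t in self._timestamps if now - t < 60]
        if len(self._timestamps) >= self.requests_per_minute:
            # Sleep until the oldest request leaves the window.
            await asyncio.sleep(60 - (now - self._timestamps[0]))
        self._timestamps.append(time.monotonic())


async def demo() -> int:
    limiter = RateLimiter(requests_per_minute=20)
    for _ in range(3):
        await limiter.acquire()  # well under the limit, so no sleep occurs
    return len(limiter._timestamps)


print(asyncio.run(demo()))  # → 3
```

Because `fetch_and_parse` awaits `acquire()` before every request, callers that exceed the budget are simply delayed rather than rejected.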
- `src/duckduckgo_mcp_server/server.py:232` (registration): the `@mcp.tool()` decorator registers the `fetch_content` function as an MCP tool with the name `fetch_content`.

  ```python
  @mcp.tool()
  ```