get_page_content
Extract web page content including title, description, and main text from any URL for analysis and information retrieval.
Instructions
Fetch and extract content from a web page.
Returns the page title, description, and main content.
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | URL to fetch content from | — |
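On success, the tool returns a dictionary with the fields produced by the handler's return statement. The shape below is taken from that return statement; the field values themselves are hypothetical examples.

```python
# Illustrative success response from get_page_content;
# the keys match the handler's return dict, the values are made up.
example_response = {
    "url": "https://example.com/article",
    "title": "Example Article",
    "description": "An example page description.",
    "content": "Full extracted article text...",
    "content_preview": "Full extracted article text...",
    "domain": "example.com",
    "status": "success",
}
```

On error, the same keys are returned with empty strings, `"status": "error"`, and an extra `"error"` field carrying the exception message.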
Implementation Reference
- mcp_duckduckgo/tools.py:114-215 (handler)

  The main asynchronous handler for the `get_page_content` tool. It fetches the page with httpx, parses it with BeautifulSoup, extracts the title, meta description, and main content via a cascade of selectors, and returns structured data including the domain extracted via a helper.

  ```python
  @mcp_server.tool()
  async def get_page_content(
      url: str = Field(..., description="URL to fetch content from"),
      ctx: Context = Field(default_factory=Context),
  ) -> Dict[str, Any]:
      """
      Fetch and extract content from a web page.

      Returns the page title, description, and main content.
      """
      logger.info("Fetching content from: %s", url)

      try:
          # Get HTTP client from context
          http_client = getattr(ctx, "http_client", None)
          if not http_client:
              http_client = httpx.AsyncClient(timeout=15.0)
              close_client = True
          else:
              close_client = False

          try:
              headers = {
                  "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
              }
              response = await http_client.get(url, headers=headers, timeout=15)
              response.raise_for_status()

              soup = BeautifulSoup(response.text, "html.parser")

              # Extract title
              title = ""
              title_tag = soup.find("title")
              if title_tag:
                  title = title_tag.get_text().strip()

              # Extract description from meta tags
              description = ""
              meta_desc = soup.find("meta", attrs={"name": "description"})
              if meta_desc:
                  description = meta_desc.get("content", "").strip()  # type: ignore[union-attr]

              # Extract main content (try common content selectors)
              content_text = ""
              content_selectors = [
                  "main article",
                  "article",
                  '[role="main"]',
                  ".content",
                  ".article-content",
                  ".post-content",
                  "#content",
                  "#article",
                  ".entry-content",
              ]

              for selector in content_selectors:
                  main_content = soup.select_one(selector)
                  if main_content:
                      content_text = main_content.get_text().strip()
                      break

              # If no content found, get all paragraphs
              if not content_text:
                  paragraphs = soup.find_all("p")[:5]  # First 5 paragraphs
                  content_text = "\n\n".join(p.get_text().strip() for p in paragraphs)

              # Clean up content (first 500 chars for preview)
              content_preview = (
                  content_text[:500] + "..." if len(content_text) > 500 else content_text
              )

              return {
                  "url": url,
                  "title": title,
                  "description": description,
                  "content": content_text,
                  "content_preview": content_preview,
                  "domain": extract_domain(url),
                  "status": "success",
              }
          finally:
              if close_client:
                  await http_client.aclose()

      except Exception as e:
          logger.error("Failed to fetch content from %s: %s", url, e)
          return {
              "url": url,
              "title": "",
              "description": "",
              "content": "",
              "content_preview": f"Error: {str(e)}",
              "domain": extract_domain(url),
              "status": "error",
              "error": str(e),
          }
  ```
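The `content_preview` field is produced by a simple truncation rule visible in the handler: keep the first 500 characters and append `"..."` only when the text is longer. Factored out for illustration (the function name `make_preview` is a hypothetical stand-in; the handler computes this inline):

```python
def make_preview(text: str, limit: int = 500) -> str:
    # Same truncation rule the handler applies inline for content_preview:
    # append "..." only when the text actually exceeds the limit.
    return text[:limit] + "..." if len(text) > limit else text
```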
- mcp_duckduckgo/server.py:46-47 (registration)

  The tool is registered in `create_mcp_server()` via `register_search_tools(server)`, which defines and registers the `get_page_content` handler using the `@mcp_server.tool()` decorator.

  ```python
  # Register tools directly with the server instance
  register_search_tools(server)
  ```
- mcp_duckduckgo/search.py:61-77 (helper)

  Helper used by `get_page_content` to extract the domain from the URL for the response dictionary.

  ```python
  def extract_domain(url: str) -> str:
      """
      Extract domain from URL.

      Args:
          url: URL string to extract domain from

      Returns:
          Lowercase domain name or empty string if parsing fails
      """
      try:
          parsed = urllib.parse.urlparse(url)
          return parsed.netloc.lower()
      except Exception as e:
          logger.debug("Failed to extract domain from URL %s: %s", url, e)
          return ""
  ```
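A quick standalone check of the helper's behavior: `urllib.parse.urlparse` puts the host in `netloc`, and a string with no `//` scheme separator yields an empty `netloc`. The name `domain_of` below is a local stand-in so the snippet runs on its own; the logic mirrors `extract_domain`.

```python
import urllib.parse

def domain_of(url: str) -> str:
    # Same approach as extract_domain: netloc of the parsed URL, lowercased.
    return urllib.parse.urlparse(url).netloc.lower()
```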
- mcp_duckduckgo/tools.py:115-118 (schema)

  Pydantic-based input schema using `Field` for validation and descriptions; the output type is `Dict[str, Any]`.

  ```python
  async def get_page_content(
      url: str = Field(..., description="URL to fetch content from"),
      ctx: Context = Field(default_factory=Context),
  ) -> Dict[str, Any]:
  ```