gianlucamazza

MCP DuckDuckGo Search Plugin

get_page_content

Extract web page content including title, description, and main text from any URL for analysis and information retrieval.

Instructions

Fetch and extract content from a web page.

Returns the page title, description, and main content.

Input Schema

| Name | Required | Description               | Default |
|------|----------|---------------------------|---------|
| url  | Yes      | URL to fetch content from |         |
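Per the schema, `url` is the only argument. A hypothetical tool-call payload (the exact envelope depends on the MCP client) might look like:

```json
{
  "name": "get_page_content",
  "arguments": {
    "url": "https://example.com/article"
  }
}
```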

Implementation Reference

  • The main asynchronous handler for the 'get_page_content' tool. It fetches the page with httpx, parses it with BeautifulSoup, extracts the title, meta description, and main content by trying multiple CSS selectors, and returns structured data, including the domain extracted via a helper function.
    # Imports required by this handler; Context, mcp_server, logger, and
    # extract_domain are defined elsewhere in the module.
    import httpx
    from bs4 import BeautifulSoup
    from pydantic import Field
    from typing import Any, Dict
    @mcp_server.tool()
    async def get_page_content(
        url: str = Field(..., description="URL to fetch content from"),
        ctx: Context = Field(default_factory=Context),
    ) -> Dict[str, Any]:
        """
        Fetch and extract content from a web page.
    
        Returns the page title, description, and main content.
        """
        logger.info("Fetching content from: %s", url)
    
        try:
            # Get HTTP client from context
            http_client = getattr(ctx, "http_client", None)
            if not http_client:
                http_client = httpx.AsyncClient(timeout=15.0)
                close_client = True
            else:
                close_client = False
    
            try:
                headers = {
                    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
                }
    
                response = await http_client.get(url, headers=headers, timeout=15)
                response.raise_for_status()
    
                soup = BeautifulSoup(response.text, "html.parser")
    
                # Extract title
                title = ""
                title_tag = soup.find("title")
                if title_tag:
                    title = title_tag.get_text().strip()
    
                # Extract description from meta tags
                description = ""
                meta_desc = soup.find("meta", attrs={"name": "description"})
                if meta_desc:
                    description = meta_desc.get("content", "").strip()  # type: ignore[union-attr]
    
                # Extract main content (try common content selectors)
                content_text = ""
                content_selectors = [
                    "main article",
                    "article",
                    '[role="main"]',
                    ".content",
                    ".article-content",
                    ".post-content",
                    "#content",
                    "#article",
                    ".entry-content",
                ]
    
                for selector in content_selectors:
                    main_content = soup.select_one(selector)
                    if main_content:
                        content_text = main_content.get_text().strip()
                        break
    
                # If no content found, get all paragraphs
                if not content_text:
                    paragraphs = soup.find_all("p")[:5]  # First 5 paragraphs
                    content_text = "\n\n".join(p.get_text().strip() for p in paragraphs)
    
                # Clean up content (first 500 chars for preview)
                content_preview = (
                    content_text[:500] + "..."
                    if len(content_text) > 500
                    else content_text
                )
    
                return {
                    "url": url,
                    "title": title,
                    "description": description,
                    "content": content_text,
                    "content_preview": content_preview,
                    "domain": extract_domain(url),
                    "status": "success",
                }
    
            finally:
                if close_client:
                    await http_client.aclose()
    
        except Exception as e:
            logger.error("Failed to fetch content from %s: %s", url, e)
            return {
                "url": url,
                "title": "",
                "description": "",
                "content": "",
                "content_preview": f"Error: {str(e)}",
                "domain": extract_domain(url),
                "status": "error",
                "error": str(e),
            }
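The title, meta-description, and preview steps above can be mirrored with only the standard library. The sketch below is illustrative: the real handler uses BeautifulSoup and also tries the CSS content selectors, which `html.parser` cannot do.

```python
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Minimal stdlib stand-in for the BeautifulSoup extraction steps."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # <meta name="description" content="..."> carries the description
            d = dict(attrs)
            if d.get("name") == "description":
                self.description = (d.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def preview(text: str, limit: int = 500) -> str:
    # Same truncation rule as the handler: cut at `limit` chars, append "..."
    return text[:limit] + "..." if len(text) > limit else text


html = (
    '<html><head><title>Example</title>'
    '<meta name="description" content="A demo page"></head></html>'
)
p = PageExtractor()
p.feed(html)
print(p.title.strip())       # Example
print(p.description)         # A demo page
print(len(preview("x" * 600)))  # 503
```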
  • Registration of the tool occurs here in create_mcp_server() by calling register_search_tools(server), which defines and registers the get_page_content handler using the @mcp_server.tool() decorator.
    # Register tools directly with the server instance
    register_search_tools(server)
  • Helper utility function used by get_page_content to extract the domain from the URL for the response dictionary.
    import urllib.parse  # required by the helper; `logger` is the module-level logger
    def extract_domain(url: str) -> str:
        """
        Extract domain from URL.
    
        Args:
            url: URL string to extract domain from
    
        Returns:
            Lowercase domain name or empty string if parsing fails
        """
        try:
            parsed = urllib.parse.urlparse(url)
            return parsed.netloc.lower()
        except Exception as e:
            logger.debug("Failed to extract domain from URL %s: %s", url, e)
            return ""
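A quick check of the helper's behavior; the version below restates it without logging so it runs standalone, and the results follow directly from urllib.parse.urlparse:

```python
import urllib.parse


def extract_domain(url: str) -> str:
    # Mirrors the helper above: netloc of the parsed URL, lowercased;
    # empty string if parsing fails.
    try:
        return urllib.parse.urlparse(url).netloc.lower()
    except Exception:
        return ""


print(extract_domain("https://Example.COM/path?q=1"))   # example.com
print(extract_domain("http://sub.domain.org:8080/"))    # sub.domain.org:8080
print(repr(extract_domain("not a url")))                # ''
```

Note that the netloc keeps any explicit port, and a string with no scheme or authority yields an empty domain rather than an error.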
  • Pydantic-based input schema definition, using Field for validation and descriptions; the return type is Dict[str, Any].
    async def get_page_content(
        url: str = Field(..., description="URL to fetch content from"),
        ctx: Context = Field(default_factory=Context),
    ) -> Dict[str, Any]:
