extract_links

Extract links from a webpage and filter them by domain: restrict results to whitelisted domains, return internal links only, or exclude unwanted domains with this specialized web-scraping tool.

Instructions

Extract all links from a webpage.

This tool is specialized for link extraction and can filter links by domain, extract only internal links, or exclude specific domains.
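For illustration, a call might pass arguments like the following sketch (the argument names mirror the handler's parameters shown under Implementation Reference; the dict itself and any client wrapper are assumptions, not part of this server):

    # Hypothetical arguments for an extract_links tool call
    arguments = {
        "url": "https://example.com",
        "exclude_domains": ["ads.com", "tracker.net"],
        "internal_only": False,
    }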

Input Schema

Name    | Required | Description | Default
request | Yes      |             |
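The generated JSON schema is not reproduced on this page; the following hand-written Python approximation is inferred from the handler parameters shown below and may differ from the server's actual schema:

    # Approximation inferred from the handler's parameters; not the
    # server's generated schema.
    request_schema = {
        "type": "object",
        "properties": {
            "url": {"type": "string"},
            "filter_domains": {"type": "array", "items": {"type": "string"}},
            "exclude_domains": {"type": "array", "items": {"type": "string"}},
            "internal_only": {"type": "boolean", "default": False},
        },
        "required": ["url"],
    }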

Implementation Reference

  • Core handler for the 'extract_links' MCP tool. It validates the input URL, scrapes the page via web_scraper.scrape_url() with method="simple", reads the raw links from content.links, applies domain-based filtering (internal_only, filter_domains, exclude_domains), categorizes each link as internal or external, and returns a structured LinksResponse; a standalone sketch of the domain-matching rule follows this list.
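    # Assumed module context (not shown in this excerpt): `app` is the FastMCP
    # server instance, `web_scraper` the shared scraper object, and `logger` a
    # module-level logger; the excerpt relies on typing.Annotated/Optional/List,
    # pydantic.Field, and urllib.parse.urlparse being imported at module top.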
    @app.tool()
    async def extract_links(
        url: Annotated[
            str,
            Field(
                ...,
                description="""Target page URL; must include a protocol prefix (http:// or https://). All links will be extracted from this page.
                    Valid http and https URL formats are supported""",
            ),
        ],
        filter_domains: Annotated[
            Optional[List[str]],
            Field(
                default=None,
                description="""Domain whitelist: only links from these domains are included. When set, only links whose domain is in this list are returned.
                    Example: ["example.com", "subdomain.example.com", "blog.example.org"]""",
            ),
        ],
        exclude_domains: Annotated[
            Optional[List[str]],
            Field(
                default=None,
                description="""Domain blacklist: links from these domains are excluded. Useful for filtering out ads, trackers, and other unwanted external links.
                    Example: ["ads.com", "tracker.net", "analytics.google.com"]""",
            ),
        ],
        internal_only: Annotated[
            bool,
            Field(
                default=False,
                description="Whether to extract internal links only (links on the same domain). When True, only links sharing the source page's domain are returned and all external links are ignored",
            ),
        ],
    ) -> LinksResponse:
        """
        Extract all links from a webpage.
    
        This tool is specialized for link extraction and can filter links by domain,
        extract only internal links, or exclude specific domains.
    
        Returns:
            LinksResponse object containing success status, extracted links list, and optional filtering statistics.
            Each link includes url, text, and additional attributes if available.
        """
        try:
            # Validate inputs
            parsed = urlparse(url)
            if not parsed.scheme or not parsed.netloc:
                raise ValueError("Invalid URL format")
    
            logger.info(f"Extracting links from: {url}")
    
            # Scrape the page to get links
            scrape_result = await web_scraper.scrape_url(
                url=url,
                method="simple",  # Use simple method for link extraction
            )
    
            if "error" in scrape_result:
                return LinksResponse(
                    success=False,
                    url=url,
                    total_links=0,
                    links=[],
                    internal_links_count=0,
                    external_links_count=0,
                    error=scrape_result["error"],
                )
    
            # Extract and filter links
            all_links = scrape_result.get("content", {}).get("links", [])
            base_domain = urlparse(url).netloc
    
            filtered_links = []
            for link in all_links:
                link_url = link.get("url", "")
                if not link_url:
                    continue
    
                link_domain = urlparse(link_url).netloc
    
                # Apply filters
                if internal_only and link_domain != base_domain:
                    continue
    
                if filter_domains and link_domain not in filter_domains:
                    continue
    
                if exclude_domains and link_domain in exclude_domains:
                    continue
    
                filtered_links.append(
                    LinkItem(
                        url=link_url,
                        text=link.get("text", "").strip(),
                        is_internal=link_domain == base_domain,
                    )
                )
    
            internal_count = sum(1 for link in filtered_links if link.is_internal)
            external_count = len(filtered_links) - internal_count
    
            return LinksResponse(
                success=True,
                url=url,
                total_links=len(filtered_links),
                links=filtered_links,
                internal_links_count=internal_count,
                external_links_count=external_count,
            )
    
        except Exception as e:
            logger.error(f"Error extracting links from {url}: {str(e)}")
            return LinksResponse(
                success=False,
                url=url,
                total_links=0,
                links=[],
                internal_links_count=0,
                external_links_count=0,
                error=str(e),
            )
  • Pydantic schemas defining the input parameters (via Annotated Fields) and the output structure (a LinksResponse carrying a list of LinkItem objects) for the extract_links tool; an example serialized response follows this list.
    class LinkItem(BaseModel):
        """Individual link item model."""
    
        url: str = Field(..., description="Link URL")
        text: str = Field(..., description="Link text")
        is_internal: bool = Field(..., description="Whether the link is internal")
    
    
    class LinksResponse(BaseModel):
        """Response model for link extraction."""
    
        success: bool = Field(..., description="Whether the operation succeeded")
        url: str = Field(..., description="Source page URL")
        total_links: int = Field(..., description="Total number of links")
        links: List[LinkItem] = Field(..., description="List of extracted links")
        internal_links_count: int = Field(..., description="Number of internal links")
        external_links_count: int = Field(..., description="Number of external links")
        error: Optional[str] = Field(default=None, description="Error message, if any")
  • Core link extraction helper in SimpleScraper.scrape (used by extract_links via method='simple'). It uses BeautifulSoup to find all <a> tags that carry an href attribute, resolves relative URLs with urljoin, extracts the link text, and produces the list of dicts with 'url' and 'text' keys consumed by the main handler.
    result["content"]["links"] = [
        {
            "url": urljoin(url, str(a.get("href", ""))),
            "text": a.get_text(strip=True),
        }
        for a in soup.find_all("a", href=True)
        if hasattr(a, "get")
    ]
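    # For reference, urljoin resolves each href against the page URL, e.g.
    # urljoin("https://example.com/blog/post", "../about") -> "https://example.com/about"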
  • Identical link extraction logic in SeleniumScraper.scrape (the alternative, browser-rendered scraper method), applying BeautifulSoup to the rendered page_source.
    # Default extraction
    result["content"]["text"] = soup.get_text(strip=True)
    result["content"]["links"] = [
        {
            "url": urljoin(url, str(a.get("href", ""))),
            "text": a.get_text(strip=True),
        }
        for a in soup.find_all("a", href=True)
        if hasattr(a, "get")
    ]
    result["content"]["images"] = [
        {
            "src": urljoin(url, str(img.get("src", ""))),
            "alt": str(img.get("alt", "")),
        }
        for img in soup.find_all("img", src=True)
        if hasattr(img, "get")
    ]
  • FastMCP tool registration decorator for the extract_links function.
    @app.tool()
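As referenced above, here is a minimal standalone sketch of the domain-matching rule the handler applies. Matching compares exact netloc values, so a subdomain such as blog.example.com counts as external to example.com; the is_internal helper name is ours, not part of the server:

    from urllib.parse import urlparse

    def is_internal(link_url: str, page_url: str) -> bool:
        # Mirrors the handler's check: exact netloc equality, no subdomain logic
        return urlparse(link_url).netloc == urlparse(page_url).netloc

    print(is_internal("https://example.com/about", "https://example.com/"))      # True
    print(is_internal("https://blog.example.com/post", "https://example.com/"))  # False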
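And, as noted next to the schema definitions, a hypothetical serialized LinksResponse for a page with one internal and one external link; all values are illustrative:

    {
        "success": True,
        "url": "https://example.com/",
        "total_links": 2,
        "links": [
            {"url": "https://example.com/about", "text": "About", "is_internal": True},
            {"url": "https://partner.org/", "text": "Partner site", "is_internal": False},
        ],
        "internal_links_count": 1,
        "external_links_count": 1,
        "error": None,
    }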