extract_links

Extract links from webpages and filter them by domain, restrict results to internal links only, or exclude unwanted domains using this specialized web scraping tool.

Instructions

Extract all links from a webpage.

This tool is specialized for link extraction and can filter links by domain, extract only internal links, or exclude specific domains.
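
As a quick illustration, here is a hypothetical set of call arguments a client could send to this tool. The parameter names (url, filter_domains, exclude_domains, internal_only) are taken from the handler signature in the Implementation Reference below; the URLs and domains are placeholders only.

    # Hypothetical arguments for an MCP tools/call request to extract_links.
    # Parameter names follow the handler signature below; values are examples only.
    import json

    arguments = {
        "url": "https://example.com/blog",       # page to scrape (http/https required)
        "internal_only": False,                  # keep external links as well
        "filter_domains": ["example.com"],       # whitelist: only these domains
        "exclude_domains": ["ads.example.net"],  # blacklist: drop these domains
    }

    print(json.dumps({"name": "extract_links", "arguments": arguments}, indent=2))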

Input Schema

Name      Required  Description  Default
request   Yes       -            -

Output Schema

No arguments
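
No named output fields are listed above; in practice the result follows the LinksResponse model shown in the Implementation Reference below. A hypothetical response, with placeholder values only, might look like this:

    # Hypothetical extract_links response, shaped after the LinksResponse model
    # defined in the Implementation Reference. All values are placeholders.
    example_response = {
        "success": True,
        "url": "https://example.com/blog",
        "total_links": 2,
        "links": [
            {"url": "https://example.com/about", "text": "About", "is_internal": True},
            {"url": "https://docs.example.org/", "text": "Docs", "is_internal": False},
        ],
        "internal_links_count": 1,
        "external_links_count": 1,
        "error": None,
    }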

Implementation Reference

  • Core handler for the 'extract_links' MCP tool. It validates the input URL, scrapes the page via web_scraper.scrape_url with method="simple", reads the raw links from content.links, applies domain-based filtering (internal_only, filter_domains, exclude_domains), categorizes each link as internal or external, and returns a structured LinksResponse. A standalone sketch of the filter semantics appears after this reference list.
    @app.tool()
    async def extract_links(
        url: Annotated[
            str,
            Field(
                ...,
                description="""目标网页 URL,必须包含协议前缀(http://或https://),将从此页面提取所有链接。
                    支持 http 和 https 协议的有效 URL 格式""",
            ),
        ],
        filter_domains: Annotated[
            Optional[List[str]],
            Field(
                default=None,
                description="""白名单域名列表,仅包含这些域名的链接。设置后只返回指定域名的链接。
                    示例:["example.com", "subdomain.example.com", "blog.example.org"]""",
            ),
        ],
        exclude_domains: Annotated[
            Optional[List[str]],
            Field(
                default=None,
                description="""黑名单域名列表,排除这些域名的链接。用于过滤广告、跟踪器等不需要的外部链接。
                    示例:["ads.com", "tracker.net", "analytics.google.com"]""",
            ),
        ],
        internal_only: Annotated[
            bool,
            Field(
                default=False,
                description="是否仅提取内部链接(同域名链接)。设为 True 时只返回与源页面相同域名的链接,忽略所有外部链接",
            ),
        ],
    ) -> LinksResponse:
        """
        Extract all links from a webpage.
    
        This tool is specialized for link extraction and can filter links by domain,
        extract only internal links, or exclude specific domains.
    
        Returns:
            LinksResponse object containing success status, extracted links list, and optional filtering statistics.
            Each link includes url, text, and additional attributes if available.
        """
        try:
            # Validate inputs
            parsed = urlparse(url)
            if not parsed.scheme or not parsed.netloc:
                raise ValueError("Invalid URL format")
    
            logger.info(f"Extracting links from: {url}")
    
            # Scrape the page to get links
            scrape_result = await web_scraper.scrape_url(
                url=url,
                method="simple",  # Use simple method for link extraction
            )
    
            if "error" in scrape_result:
                return LinksResponse(
                    success=False,
                    url=url,
                    total_links=0,
                    links=[],
                    internal_links_count=0,
                    external_links_count=0,
                    error=scrape_result["error"],
                )
    
            # Extract and filter links
            all_links = scrape_result.get("content", {}).get("links", [])
            base_domain = urlparse(url).netloc
    
            filtered_links = []
            for link in all_links:
                link_url = link.get("url", "")
                if not link_url:
                    continue
    
                link_domain = urlparse(link_url).netloc
    
                # Apply filters
                if internal_only and link_domain != base_domain:
                    continue
    
                if filter_domains and link_domain not in filter_domains:
                    continue
    
                if exclude_domains and link_domain in exclude_domains:
                    continue
    
                filtered_links.append(
                    LinkItem(
                        url=link_url,
                        text=link.get("text", "").strip(),
                        is_internal=link_domain == base_domain,
                    )
                )
    
            internal_count = sum(1 for link in filtered_links if link.is_internal)
            external_count = len(filtered_links) - internal_count
    
            return LinksResponse(
                success=True,
                url=url,
                total_links=len(filtered_links),
                links=filtered_links,
                internal_links_count=internal_count,
                external_links_count=external_count,
            )
    
        except Exception as e:
            logger.error(f"Error extracting links from {url}: {str(e)}")
            return LinksResponse(
                success=False,
                url=url,
                total_links=0,
                links=[],
                internal_links_count=0,
                external_links_count=0,
                error=str(e),
            )
  • Pydantic schemas defining the input parameters (via Annotated Fields) and output response structure (LinksResponse with LinkItem list) for the extract_links tool.
    class LinkItem(BaseModel):
        """Individual link item model."""
    
        url: str = Field(..., description="Link URL")
        text: str = Field(..., description="Link text")
        is_internal: bool = Field(..., description="Whether the link is internal")
    
    
    class LinksResponse(BaseModel):
        """Response model for link extraction."""
    
        success: bool = Field(..., description="Whether the operation succeeded")
        url: str = Field(..., description="Source page URL")
        total_links: int = Field(..., description="Total number of links")
        links: List[LinkItem] = Field(..., description="List of extracted links")
        internal_links_count: int = Field(..., description="Number of internal links")
        external_links_count: int = Field(..., description="Number of external links")
        error: Optional[str] = Field(default=None, description="Error message, if any")
  • Core link extraction helper in SimpleScraper.scrape (used by extract_links via method='simple'). Uses BeautifulSoup's find_all("a", href=True) to collect anchor tags with an href attribute, resolves relative URLs with urljoin, extracts the link text, and produces a list of dicts with 'url' and 'text' keys consumed by the main handler.
    result["content"]["links"] = [
        {
            "url": urljoin(url, str(a.get("href", ""))),
            "text": a.get_text(strip=True),
        }
        for a in soup.find_all("a", href=True)
        if hasattr(a, "get")
    ]
  • Identical link extraction logic in SeleniumScraper.scrape (the alternative scraper method), using BeautifulSoup on the rendered page_source.
    # Default extraction
    result["content"]["text"] = soup.get_text(strip=True)
    result["content"]["links"] = [
        {
            "url": urljoin(url, str(a.get("href", ""))),
            "text": a.get_text(strip=True),
        }
        for a in soup.find_all("a", href=True)
        if hasattr(a, "get")
    ]
    result["content"]["images"] = [
        {
            "src": urljoin(url, str(img.get("src", ""))),
            "alt": str(img.get("alt", "")),
        }
        for img in soup.find_all("img", src=True)
        if hasattr(img, "get")
    ]
  • FastMCP tool registration decorator for the extract_links function.
    @app.tool()
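
As a standalone sketch of the domain-filter semantics used by the handler above: matching compares netloc strings exactly, so "example.com" in filter_domains does not match "www.example.com". The sample links and domains below are illustrative only.

    # Standalone sketch of the domain filters used in the handler above
    # (exact netloc matching). Sample data is illustrative only.
    from urllib.parse import urlparse

    base_domain = urlparse("https://example.com/blog").netloc  # "example.com"
    links = [
        {"url": "https://example.com/about", "text": "About"},
        {"url": "https://www.example.com/", "text": "Home"},      # different netloc
        {"url": "https://ads.example.net/banner", "text": "Ad"},
    ]

    filter_domains = ["example.com"]
    exclude_domains = ["ads.example.net"]
    internal_only = False

    kept = []
    for link in links:
        domain = urlparse(link["url"]).netloc
        if internal_only and domain != base_domain:
            continue
        if filter_domains and domain not in filter_domains:
            continue
        if exclude_domains and domain in exclude_domains:
            continue
        kept.append(link)

    # Only https://example.com/about survives: "www.example.com" fails the
    # whitelist because the comparison is an exact match of netloc strings.
    print([item["url"] for item in kept])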

