extract_links
Extract links from a webpage and filter them by domain, restrict the results to internal links only, or exclude unwanted domains with this specialized web-scraping tool.
Instructions
Extract all links from a webpage.
This tool is specialized for link extraction and can filter links by domain, extract only internal links, or exclude specific domains.
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| request | Yes | `ExtractLinksRequest` object: `url` (required), plus optional `filter_domains`, `exclude_domains`, and `internal_only` | — |
Input Schema (JSON Schema)
```json
{
  "$defs": {
    "ExtractLinksRequest": {
      "description": "Request model for extracting links from a page.",
      "properties": {
        "exclude_domains": {
          "anyOf": [
            {
              "items": {
                "type": "string"
              },
              "type": "array"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "description": "Exclude links from these domains",
          "title": "Exclude Domains"
        },
        "filter_domains": {
          "anyOf": [
            {
              "items": {
                "type": "string"
              },
              "type": "array"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "description": "Only include links from these domains",
          "title": "Filter Domains"
        },
        "internal_only": {
          "default": false,
          "description": "Only extract internal links",
          "title": "Internal Only",
          "type": "boolean"
        },
        "url": {
          "description": "URL to extract links from",
          "title": "Url",
          "type": "string"
        }
      },
      "required": [
        "url"
      ],
      "title": "ExtractLinksRequest",
      "type": "object"
    }
  },
  "properties": {
    "request": {
      "$ref": "#/$defs/ExtractLinksRequest",
      "title": "Request"
    }
  },
  "required": [
    "request"
  ],
  "type": "object"
}
```
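A conforming call wraps all parameters in a single `request` object. The payload below is only an illustrative sketch: the URL and domain values are made up, and how the arguments are delivered depends on the MCP client in use.

```python
# Hypothetical arguments for the extract_links tool (shape matches the JSON
# Schema above; the values themselves are examples, not defaults).
example_arguments = {
    "request": {
        "url": "https://example.com/blog",        # required, must include http(s)://
        "filter_domains": None,                   # or e.g. ["example.com"]
        "exclude_domains": ["ads.example.net"],   # drop links from these domains
        "internal_only": False,                   # keep external links as well
    }
}
```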
Implementation Reference
- extractor/server.py:436-558 (handler): Core handler for the `extract_links` MCP tool. Validates the input URL, scrapes the page with WebScraper's simple method, extracts raw links from `content.links`, applies domain-based filtering (internal_only, filter_domains, exclude_domains), categorizes links as internal or external, and returns a structured LinksResponse.

```python
@app.tool()
async def extract_links(
    url: Annotated[
        str,
        Field(
            ...,
            description="""Target page URL; must include a protocol prefix (http:// or https://).
            All links will be extracted from this page. Valid http and https URLs are supported.""",
        ),
    ],
    filter_domains: Annotated[
        Optional[List[str]],
        Field(
            default=None,
            description="""Domain whitelist; only links from these domains are included.
            Example: ["example.com", "subdomain.example.com", "blog.example.org"]""",
        ),
    ],
    exclude_domains: Annotated[
        Optional[List[str]],
        Field(
            default=None,
            description="""Domain blacklist; links from these domains are excluded. Useful for
            filtering out unwanted external links such as ads and trackers.
            Example: ["ads.com", "tracker.net", "analytics.google.com"]""",
        ),
    ],
    internal_only: Annotated[
        bool,
        Field(
            default=False,
            description="Whether to extract only internal (same-domain) links. When True, only links sharing the source page's domain are returned and all external links are ignored.",
        ),
    ],
) -> LinksResponse:
    """
    Extract all links from a webpage.

    This tool is specialized for link extraction and can filter links
    by domain, extract only internal links, or exclude specific domains.

    Returns:
        LinksResponse object containing success status, extracted links list,
        and optional filtering statistics. Each link includes url, text,
        and additional attributes if available.
    """
    try:
        # Validate inputs
        parsed = urlparse(url)
        if not parsed.scheme or not parsed.netloc:
            raise ValueError("Invalid URL format")

        logger.info(f"Extracting links from: {url}")

        # Scrape the page to get links
        scrape_result = await web_scraper.scrape_url(
            url=url,
            method="simple",  # Use simple method for link extraction
        )

        if "error" in scrape_result:
            return LinksResponse(
                success=False,
                url=url,
                total_links=0,
                links=[],
                internal_links_count=0,
                external_links_count=0,
                error=scrape_result["error"],
            )

        # Extract and filter links
        all_links = scrape_result.get("content", {}).get("links", [])
        base_domain = urlparse(url).netloc

        filtered_links = []
        for link in all_links:
            link_url = link.get("url", "")
            if not link_url:
                continue

            link_domain = urlparse(link_url).netloc

            # Apply filters
            if internal_only and link_domain != base_domain:
                continue
            if filter_domains and link_domain not in filter_domains:
                continue
            if exclude_domains and link_domain in exclude_domains:
                continue

            filtered_links.append(
                LinkItem(
                    url=link_url,
                    text=link.get("text", "").strip(),
                    is_internal=link_domain == base_domain,
                )
            )

        internal_count = sum(1 for link in filtered_links if link.is_internal)
        external_count = len(filtered_links) - internal_count

        return LinksResponse(
            success=True,
            url=url,
            total_links=len(filtered_links),
            links=filtered_links,
            internal_links_count=internal_count,
            external_links_count=external_count,
        )

    except Exception as e:
        logger.error(f"Error extracting links from {url}: {str(e)}")
        return LinksResponse(
            success=False,
            url=url,
            total_links=0,
            links=[],
            internal_links_count=0,
            external_links_count=0,
            error=str(e),
        )
```
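The filtering step above reduces to three hostname checks per link. The standalone sketch below is not part of the server code (the `filter_links` name and the sample data are made up); it reproduces the same logic so it can be tested in isolation.

```python
from typing import List, Optional
from urllib.parse import urlparse


def filter_links(
    base_url: str,
    links: List[dict],
    internal_only: bool = False,
    filter_domains: Optional[List[str]] = None,
    exclude_domains: Optional[List[str]] = None,
) -> List[dict]:
    """Apply the handler's filters: internal_only, whitelist, blacklist."""
    base_domain = urlparse(base_url).netloc
    kept = []
    for link in links:
        link_url = link.get("url", "")
        if not link_url:
            continue
        link_domain = urlparse(link_url).netloc
        if internal_only and link_domain != base_domain:
            continue
        if filter_domains and link_domain not in filter_domains:
            continue
        if exclude_domains and link_domain in exclude_domains:
            continue
        kept.append({**link, "is_internal": link_domain == base_domain})
    return kept


# Keep only links hosted on example.com
print(filter_links(
    "https://example.com/",
    [
        {"url": "https://example.com/a", "text": "A"},
        {"url": "https://ads.net/b", "text": "B"},
    ],
    filter_domains=["example.com"],
))
# [{'url': 'https://example.com/a', 'text': 'A', 'is_internal': True}]
```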
- extractor/server.py:85-102 (schema): Pydantic schemas defining the input parameters (via Annotated Fields) and the output response structure (LinksResponse with a list of LinkItem) for the extract_links tool.

```python
class LinkItem(BaseModel):
    """Individual link item model."""

    url: str = Field(..., description="Link URL")
    text: str = Field(..., description="Link text")
    is_internal: bool = Field(..., description="Whether the link is internal")


class LinksResponse(BaseModel):
    """Response model for link extraction."""

    success: bool = Field(..., description="Whether the operation succeeded")
    url: str = Field(..., description="Source page URL")
    total_links: int = Field(..., description="Total number of links")
    links: List[LinkItem] = Field(..., description="List of extracted links")
    internal_links_count: int = Field(..., description="Number of internal links")
    external_links_count: int = Field(..., description="Number of external links")
    error: Optional[str] = Field(default=None, description="Error message (if any)")
```
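To make the response shape concrete, the sketch below re-declares the two models without their field descriptions and builds a small LinksResponse by hand. It assumes Pydantic v2 (`model_dump_json`); on v1 use `.json()` instead.

```python
from typing import List, Optional
from pydantic import BaseModel


class LinkItem(BaseModel):
    url: str
    text: str
    is_internal: bool


class LinksResponse(BaseModel):
    success: bool
    url: str
    total_links: int
    links: List[LinkItem]
    internal_links_count: int
    external_links_count: int
    error: Optional[str] = None


resp = LinksResponse(
    success=True,
    url="https://example.com/",
    total_links=1,
    links=[LinkItem(url="https://example.com/docs", text="Docs", is_internal=True)],
    internal_links_count=1,
    external_links_count=0,
)
print(resp.model_dump_json(indent=2))  # Pydantic v2
```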
- extractor/scraper.py:414-420 (helper): Core link extraction helper in SimpleScraper.scrape (used by extract_links via method="simple"). Uses BeautifulSoup to find all `<a>` tags that carry an href attribute, resolves relative URLs with urljoin, extracts the link text, and produces the list of dicts with "url" and "text" consumed by the main handler.

```python
result["content"]["links"] = [
    {
        "url": urljoin(url, str(a.get("href", ""))),
        "text": a.get_text(strip=True),
    }
    for a in soup.find_all("a", href=True)
    if hasattr(a, "get")
]
```
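The same comprehension can be exercised outside the scraper. The sketch below feeds made-up sample HTML to BeautifulSoup and uses `urljoin` to resolve relative hrefs against the page URL, yielding the `{"url", "text"}` dicts the handler consumes.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = "https://example.com/"
html = '<a href="/docs">Docs</a> <a href="https://other.example.net/page">External</a>'

soup = BeautifulSoup(html, "html.parser")
links = [
    {
        "url": urljoin(base_url, str(a.get("href", ""))),  # relative -> absolute
        "text": a.get_text(strip=True),
    }
    for a in soup.find_all("a", href=True)
]
print(links)
# [{'url': 'https://example.com/docs', 'text': 'Docs'},
#  {'url': 'https://other.example.net/page', 'text': 'External'}]
```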
- extractor/scraper.py:284-301 (helper): Identical link extraction logic in SeleniumScraper.scrape (the alternative scraper method), applied with BeautifulSoup to the rendered page_source.

```python
# Default extraction
result["content"]["text"] = soup.get_text(strip=True)
result["content"]["links"] = [
    {
        "url": urljoin(url, str(a.get("href", ""))),
        "text": a.get_text(strip=True),
    }
    for a in soup.find_all("a", href=True)
    if hasattr(a, "get")
]
result["content"]["images"] = [
    {
        "src": urljoin(url, str(img.get("src", ""))),
        "alt": str(img.get("alt", "")),
    }
    for img in soup.find_all("img", src=True)
    if hasattr(img, "get")
]
```
- extractor/server.py:436-436 (registration): FastMCP tool registration decorator for the extract_links function: `@app.tool()`