extract_links
Extract links from a webpage and filter them by domain, restrict the results to internal links only, or exclude unwanted domains with this specialized web-scraping tool.
Instructions
Extract all links from a webpage.
This tool is specialized for link extraction and can filter links by domain, extract only internal links, or exclude specific domains.
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| request | Yes | `ExtractLinksRequest` object: `url` (required), plus optional `filter_domains`, `exclude_domains`, and `internal_only` | — |
Input Schema (JSON Schema)
```json
{
  "$defs": {
    "ExtractLinksRequest": {
      "description": "Request model for extracting links from a page.",
      "properties": {
        "exclude_domains": {
          "anyOf": [
            {
              "items": {
                "type": "string"
              },
              "type": "array"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "description": "Exclude links from these domains",
          "title": "Exclude Domains"
        },
        "filter_domains": {
          "anyOf": [
            {
              "items": {
                "type": "string"
              },
              "type": "array"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "description": "Only include links from these domains",
          "title": "Filter Domains"
        },
        "internal_only": {
          "default": false,
          "description": "Only extract internal links",
          "title": "Internal Only",
          "type": "boolean"
        },
        "url": {
          "description": "URL to extract links from",
          "title": "Url",
          "type": "string"
        }
      },
      "required": [
        "url"
      ],
      "title": "ExtractLinksRequest",
      "type": "object"
    }
  },
  "properties": {
    "request": {
      "$ref": "#/$defs/ExtractLinksRequest",
      "title": "Request"
    }
  },
  "required": [
    "request"
  ],
  "type": "object"
}
```
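A conforming call wraps all parameters in a single `request` object. The payload below is only an illustrative sketch: the URL and domain values are made up, and how the arguments are delivered depends on the MCP client in use.

```python
# Hypothetical arguments for the extract_links tool (shape matches the JSON
# Schema above; the values themselves are examples, not defaults).
example_arguments = {
    "request": {
        "url": "https://example.com/blog",        # required, must include http(s)://
        "filter_domains": None,                   # or e.g. ["example.com"]
        "exclude_domains": ["ads.example.net"],   # drop links from these domains
        "internal_only": False,                   # keep external links as well
    }
}
```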
Implementation Reference
- extractor/server.py:436-558 (handler): Core handler for the `extract_links` MCP tool. Validates the input URL, scrapes the page with WebScraper's simple method, extracts raw links from `content.links`, applies domain-based filtering (internal_only, filter_domains, exclude_domains), categorizes links as internal or external, and returns a structured LinksResponse.

```python
@app.tool()
async def extract_links(
    url: Annotated[
        str,
        Field(
            ...,
            description="""Target page URL; must include a protocol prefix (http:// or https://).
            All links will be extracted from this page. Valid http and https URLs are supported.""",
        ),
    ],
    filter_domains: Annotated[
        Optional[List[str]],
        Field(
            default=None,
            description="""Domain whitelist; only links from these domains are included.
            Example: ["example.com", "subdomain.example.com", "blog.example.org"]""",
        ),
    ],
    exclude_domains: Annotated[
        Optional[List[str]],
        Field(
            default=None,
            description="""Domain blacklist; links from these domains are excluded. Useful for
            filtering out unwanted external links such as ads and trackers.
            Example: ["ads.com", "tracker.net", "analytics.google.com"]""",
        ),
    ],
    internal_only: Annotated[
        bool,
        Field(
            default=False,
            description="Whether to extract only internal (same-domain) links. When True, only links sharing the source page's domain are returned and all external links are ignored.",
        ),
    ],
) -> LinksResponse:
    """
    Extract all links from a webpage.

    This tool is specialized for link extraction and can filter links
    by domain, extract only internal links, or exclude specific domains.

    Returns:
        LinksResponse object containing success status, extracted links list,
        and optional filtering statistics. Each link includes url, text,
        and additional attributes if available.
    """
    try:
        # Validate inputs
        parsed = urlparse(url)
        if not parsed.scheme or not parsed.netloc:
            raise ValueError("Invalid URL format")

        logger.info(f"Extracting links from: {url}")

        # Scrape the page to get links
        scrape_result = await web_scraper.scrape_url(
            url=url,
            method="simple",  # Use simple method for link extraction
        )

        if "error" in scrape_result:
            return LinksResponse(
                success=False,
                url=url,
                total_links=0,
                links=[],
                internal_links_count=0,
                external_links_count=0,
                error=scrape_result["error"],
            )

        # Extract and filter links
        all_links = scrape_result.get("content", {}).get("links", [])
        base_domain = urlparse(url).netloc

        filtered_links = []
        for link in all_links:
            link_url = link.get("url", "")
            if not link_url:
                continue

            link_domain = urlparse(link_url).netloc

            # Apply filters
            if internal_only and link_domain != base_domain:
                continue
            if filter_domains and link_domain not in filter_domains:
                continue
            if exclude_domains and link_domain in exclude_domains:
                continue

            filtered_links.append(
                LinkItem(
                    url=link_url,
                    text=link.get("text", "").strip(),
                    is_internal=link_domain == base_domain,
                )
            )

        internal_count = sum(1 for link in filtered_links if link.is_internal)
        external_count = len(filtered_links) - internal_count

        return LinksResponse(
            success=True,
            url=url,
            total_links=len(filtered_links),
            links=filtered_links,
            internal_links_count=internal_count,
            external_links_count=external_count,
        )

    except Exception as e:
        logger.error(f"Error extracting links from {url}: {str(e)}")
        return LinksResponse(
            success=False,
            url=url,
            total_links=0,
            links=[],
            internal_links_count=0,
            external_links_count=0,
            error=str(e),
        )
```
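The filtering step above reduces to three hostname checks per link. The standalone sketch below is not part of the server code (the `filter_links` name and the sample data are made up); it reproduces the same logic so it can be tested in isolation.

```python
from typing import List, Optional
from urllib.parse import urlparse


def filter_links(
    base_url: str,
    links: List[dict],
    internal_only: bool = False,
    filter_domains: Optional[List[str]] = None,
    exclude_domains: Optional[List[str]] = None,
) -> List[dict]:
    """Apply the handler's filters: internal_only, whitelist, blacklist."""
    base_domain = urlparse(base_url).netloc
    kept = []
    for link in links:
        link_url = link.get("url", "")
        if not link_url:
            continue
        link_domain = urlparse(link_url).netloc
        if internal_only and link_domain != base_domain:
            continue
        if filter_domains and link_domain not in filter_domains:
            continue
        if exclude_domains and link_domain in exclude_domains:
            continue
        kept.append({**link, "is_internal": link_domain == base_domain})
    return kept


# Keep only links hosted on example.com
print(filter_links(
    "https://example.com/",
    [
        {"url": "https://example.com/a", "text": "A"},
        {"url": "https://ads.net/b", "text": "B"},
    ],
    filter_domains=["example.com"],
))
# [{'url': 'https://example.com/a', 'text': 'A', 'is_internal': True}]
```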
- extractor/server.py:85-102 (schema): Pydantic schemas defining the input parameters (via Annotated Fields) and the output response structure (LinksResponse with a list of LinkItem) for the extract_links tool.

```python
class LinkItem(BaseModel):
    """Individual link item model."""

    url: str = Field(..., description="Link URL")
    text: str = Field(..., description="Link text")
    is_internal: bool = Field(..., description="Whether the link is internal")


class LinksResponse(BaseModel):
    """Response model for link extraction."""

    success: bool = Field(..., description="Whether the operation succeeded")
    url: str = Field(..., description="Source page URL")
    total_links: int = Field(..., description="Total number of links")
    links: List[LinkItem] = Field(..., description="List of extracted links")
    internal_links_count: int = Field(..., description="Number of internal links")
    external_links_count: int = Field(..., description="Number of external links")
    error: Optional[str] = Field(default=None, description="Error message (if any)")
```
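To make the response shape concrete, the sketch below re-declares the two models without their field descriptions and builds a small LinksResponse by hand. It assumes Pydantic v2 (`model_dump_json`); on v1 use `.json()` instead.

```python
from typing import List, Optional
from pydantic import BaseModel


class LinkItem(BaseModel):
    url: str
    text: str
    is_internal: bool


class LinksResponse(BaseModel):
    success: bool
    url: str
    total_links: int
    links: List[LinkItem]
    internal_links_count: int
    external_links_count: int
    error: Optional[str] = None


resp = LinksResponse(
    success=True,
    url="https://example.com/",
    total_links=1,
    links=[LinkItem(url="https://example.com/docs", text="Docs", is_internal=True)],
    internal_links_count=1,
    external_links_count=0,
)
print(resp.model_dump_json(indent=2))  # Pydantic v2
```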
- extractor/scraper.py:414-420 (helper): Core link extraction helper in SimpleScraper.scrape (used by extract_links via method="simple"). Uses BeautifulSoup to find all `<a>` tags that carry an href attribute, resolves relative URLs with urljoin, extracts the link text, and produces the list of dicts with "url" and "text" consumed by the main handler.

```python
result["content"]["links"] = [
    {
        "url": urljoin(url, str(a.get("href", ""))),
        "text": a.get_text(strip=True),
    }
    for a in soup.find_all("a", href=True)
    if hasattr(a, "get")
]
```
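The same comprehension can be exercised outside the scraper. The sketch below feeds made-up sample HTML to BeautifulSoup and uses `urljoin` to resolve relative hrefs against the page URL, yielding the `{"url", "text"}` dicts the handler consumes.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = "https://example.com/"
html = '<a href="/docs">Docs</a> <a href="https://other.example.net/page">External</a>'

soup = BeautifulSoup(html, "html.parser")
links = [
    {
        "url": urljoin(base_url, str(a.get("href", ""))),  # relative -> absolute
        "text": a.get_text(strip=True),
    }
    for a in soup.find_all("a", href=True)
]
print(links)
# [{'url': 'https://example.com/docs', 'text': 'Docs'},
#  {'url': 'https://other.example.net/page', 'text': 'External'}]
```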
- extractor/scraper.py:284-301 (helper): Identical link extraction logic in SeleniumScraper.scrape (the alternative scraper method), applied with BeautifulSoup to the rendered page_source.

```python
# Default extraction
result["content"]["text"] = soup.get_text(strip=True)
result["content"]["links"] = [
    {
        "url": urljoin(url, str(a.get("href", ""))),
        "text": a.get_text(strip=True),
    }
    for a in soup.find_all("a", href=True)
    if hasattr(a, "get")
]
result["content"]["images"] = [
    {
        "src": urljoin(url, str(img.get("src", ""))),
        "alt": str(img.get("alt", "")),
    }
    for img in soup.find_all("img", src=True)
    if hasattr(img, "get")
]
```
- extractor/server.py:436-436 (registration): FastMCP tool registration decorator for the extract_links function: `@app.tool()`