extract_links

Extract links from webpages and filter them by domain, restrict results to internal links only, or exclude unwanted domains with this specialized web scraping tool.

Instructions

Extract all links from a webpage.

This tool is specialized for link extraction and can filter links by domain, extract only internal links, or exclude specific domains.

Input Schema

Name      Required    Description                                      Default
request   Yes         ExtractLinksRequest object (see JSON Schema)     -

Input Schema (JSON Schema)

{ "$defs": { "ExtractLinksRequest": { "description": "Request model for extracting links from a page.", "properties": { "exclude_domains": { "anyOf": [ { "items": { "type": "string" }, "type": "array" }, { "type": "null" } ], "default": null, "description": "Exclude links from these domains", "title": "Exclude Domains" }, "filter_domains": { "anyOf": [ { "items": { "type": "string" }, "type": "array" }, { "type": "null" } ], "default": null, "description": "Only include links from these domains", "title": "Filter Domains" }, "internal_only": { "default": false, "description": "Only extract internal links", "title": "Internal Only", "type": "boolean" }, "url": { "description": "URL to extract links from", "title": "Url", "type": "string" } }, "required": [ "url" ], "title": "ExtractLinksRequest", "type": "object" } }, "properties": { "request": { "$ref": "#/$defs/ExtractLinksRequest", "title": "Request" } }, "required": [ "request" ], "type": "object" }

Implementation Reference

  • Core handler for the 'extract_links' MCP tool. It validates the input URL, scrapes the page with WebScraper's simple method, pulls the raw links from content.links, applies domain-based filtering (internal_only, filter_domains, exclude_domains), categorizes each link as internal or external, and returns a structured LinksResponse. A standalone sketch of the filter chain follows this list.
    @app.tool()
    async def extract_links(
        url: Annotated[
            str,
            Field(
                ...,
                description=(
                    "Target page URL; must include a protocol prefix (http:// or https://). "
                    "All links will be extracted from this page. "
                    "Valid URLs using the http and https protocols are supported."
                ),
            ),
        ],
        filter_domains: Annotated[
            Optional[List[str]],
            Field(
                default=None,
                description=(
                    "Domain whitelist; only links from these domains are included. "
                    "When set, only links from the specified domains are returned. "
                    'Example: ["example.com", "subdomain.example.com", "blog.example.org"]'
                ),
            ),
        ],
        exclude_domains: Annotated[
            Optional[List[str]],
            Field(
                default=None,
                description=(
                    "Domain blacklist; links from these domains are excluded. "
                    "Useful for filtering out ads, trackers, and other unwanted external links. "
                    'Example: ["ads.com", "tracker.net", "analytics.google.com"]'
                ),
            ),
        ],
        internal_only: Annotated[
            bool,
            Field(
                default=False,
                description=(
                    "Whether to extract only internal links (same-domain links). "
                    "When True, only links with the same domain as the source page are returned "
                    "and all external links are ignored."
                ),
            ),
        ],
    ) -> LinksResponse:
        """
        Extract all links from a webpage.

        This tool is specialized for link extraction and can filter links by domain,
        extract only internal links, or exclude specific domains.

        Returns:
            LinksResponse object containing success status, extracted links list, and
            optional filtering statistics. Each link includes url, text, and additional
            attributes if available.
        """
        try:
            # Validate inputs
            parsed = urlparse(url)
            if not parsed.scheme or not parsed.netloc:
                raise ValueError("Invalid URL format")

            logger.info(f"Extracting links from: {url}")

            # Scrape the page to get links
            scrape_result = await web_scraper.scrape_url(
                url=url,
                method="simple",  # Use simple method for link extraction
            )

            if "error" in scrape_result:
                return LinksResponse(
                    success=False,
                    url=url,
                    total_links=0,
                    links=[],
                    internal_links_count=0,
                    external_links_count=0,
                    error=scrape_result["error"],
                )

            # Extract and filter links
            all_links = scrape_result.get("content", {}).get("links", [])
            base_domain = urlparse(url).netloc

            filtered_links = []
            for link in all_links:
                link_url = link.get("url", "")
                if not link_url:
                    continue

                link_domain = urlparse(link_url).netloc

                # Apply filters
                if internal_only and link_domain != base_domain:
                    continue
                if filter_domains and link_domain not in filter_domains:
                    continue
                if exclude_domains and link_domain in exclude_domains:
                    continue

                filtered_links.append(
                    LinkItem(
                        url=link_url,
                        text=link.get("text", "").strip(),
                        is_internal=link_domain == base_domain,
                    )
                )

            internal_count = sum(1 for link in filtered_links if link.is_internal)
            external_count = len(filtered_links) - internal_count

            return LinksResponse(
                success=True,
                url=url,
                total_links=len(filtered_links),
                links=filtered_links,
                internal_links_count=internal_count,
                external_links_count=external_count,
            )
        except Exception as e:
            logger.error(f"Error extracting links from {url}: {str(e)}")
            return LinksResponse(
                success=False,
                url=url,
                total_links=0,
                links=[],
                internal_links_count=0,
                external_links_count=0,
                error=str(e),
            )
  • Pydantic schemas defining the input parameters (via Annotated Fields) and output response structure (LinksResponse with LinkItem list) for the extract_links tool.
    class LinkItem(BaseModel):
        """Individual link item model."""

        url: str = Field(..., description="Link URL")
        text: str = Field(..., description="Link text")
        is_internal: bool = Field(..., description="Whether the link is internal")


    class LinksResponse(BaseModel):
        """Response model for link extraction."""

        success: bool = Field(..., description="Whether the operation succeeded")
        url: str = Field(..., description="Source page URL")
        total_links: int = Field(..., description="Total number of links")
        links: List[LinkItem] = Field(..., description="List of extracted links")
        internal_links_count: int = Field(..., description="Number of internal links")
        external_links_count: int = Field(..., description="Number of external links")
        error: Optional[str] = Field(default=None, description="Error message, if any")
  • Core link-extraction helper in SimpleScraper.scrape (used by extract_links via method='simple'). It uses BeautifulSoup to find all <a> tags that have an href attribute, resolves relative URLs with urljoin, extracts the link text, and produces a list of dicts with 'url' and 'text' keys consumed by the main handler (see the urljoin sketch after this list).
    result["content"]["links"] = [ { "url": urljoin(url, str(a.get("href", ""))), "text": a.get_text(strip=True), } for a in soup.find_all("a", href=True) if hasattr(a, "get")
  • Identical link extraction logic in SeleniumScraper.scrape (alternative scraper method), using BeautifulSoup on rendered page_source.
    # Default extraction
    result["content"]["text"] = soup.get_text(strip=True)
    result["content"]["links"] = [
        {
            "url": urljoin(url, str(a.get("href", ""))),
            "text": a.get_text(strip=True),
        }
        for a in soup.find_all("a", href=True)
        if hasattr(a, "get")
    ]
    result["content"]["images"] = [
        {
            "src": urljoin(url, str(img.get("src", ""))),
            "alt": str(img.get("alt", "")),
        }
        for img in soup.find_all("img", src=True)
        if hasattr(img, "get")
    ]
  • FastMCP tool registration decorator for the extract_links function.
    @app.tool()
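
As a minimal sketch of the filter chain described above, the snippet below applies the same internal_only / filter_domains / exclude_domains checks to a few hard-coded links. The URLs and the standalone keep() helper are illustrative, not part of the server code.

    from urllib.parse import urlparse

    def keep(link_url, base_domain, internal_only=False, filter_domains=None, exclude_domains=None):
        """Mirror of the handler's filter chain: every enabled check must pass."""
        domain = urlparse(link_url).netloc
        if internal_only and domain != base_domain:
            return False
        if filter_domains and domain not in filter_domains:
            return False
        if exclude_domains and domain in exclude_domains:
            return False
        return True

    links = [
        "https://example.com/about",       # internal
        "https://blog.example.org/post",   # external
        "https://ads.com/banner",          # external, blacklisted below
    ]
    base = urlparse("https://example.com/docs").netloc

    # With no filters every link survives; with exclude_domains the ad link is dropped.
    print([u for u in links if keep(u, base, exclude_domains=["ads.com"])])
    # ['https://example.com/about', 'https://blog.example.org/post']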
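
Both scrapers resolve hrefs with the standard library's urljoin before the handler filters them. A small illustrative example (the page URL and hrefs are made up) of how relative and absolute hrefs are normalized:

    from urllib.parse import urljoin

    page_url = "https://example.com/docs/index.html"

    for href in ["/pricing", "guide.html", "https://blog.example.org/post"]:
        print(urljoin(page_url, href))
    # https://example.com/pricing
    # https://example.com/docs/guide.html
    # https://blog.example.org/post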

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/ThreeFish-AI/scrapy-mcp'
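
The same endpoint can also be queried from Python; here is a minimal sketch using the requests library (installed separately), mirroring the curl call above:

import requests

# Fetch this server's entry from the Glama MCP directory API.
resp = requests.get("https://glama.ai/api/mcp/v1/servers/ThreeFish-AI/scrapy-mcp")
resp.raise_for_status()
print(resp.json())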

If you have feedback or need assistance with the MCP directory API, please join our Discord server.