extract_links
Extract all links from a webpage with href values and anchor text, resolving relative URLs while excluding anchors and javascript links.
Instructions
Extract all links from a page with their href and anchor text. Resolves relative URLs. Skips anchors and javascript: links.
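The filtering and resolution rules described above can be illustrated with the standard WHATWG `URL` constructor. This is a minimal sketch, not the server's code: `resolveHref` is a hypothetical helper name and the example.com addresses are placeholders.

```typescript
// Hypothetical helper: applies the skip rules, then resolves against the page URL.
function resolveHref(href: string, pageUrl: string): string | null {
  // Skip in-page anchors and javascript: pseudo-links.
  if (href.startsWith("#") || href.startsWith("javascript:")) return null;
  try {
    // The second argument is the base used to resolve relative references.
    return new URL(href, pageUrl).href;
  } catch {
    // An unparseable href is treated as "no link".
    return null;
  }
}

const page = "https://example.com/docs/guide/index.html";
console.log(resolveHref("../api", page)); // https://example.com/docs/api
console.log(resolveHref("/about", page)); // https://example.com/about
console.log(resolveHref("#top", page)); // null
console.log(resolveHref("javascript:void(0)", page)); // null
```

Note that only the two prefixes named above are skipped, so other schemes such as `mailto:` would pass through this sketch unchanged.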
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | The URL to extract links from | |
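As a rough illustration of the schema above, a client might build a `tools/call` request with the single required `url` argument. This sketch assumes the standard MCP JSON-RPC request shape. `buildExtractLinksCall` is a hypothetical helper, and the `new URL` check only approximates the server-side validation.

```typescript
// Shape of an MCP tools/call request (JSON-RPC 2.0).
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: { url: string } };
}

// Hypothetical helper: rejects strings that are not absolute URLs before sending.
function buildExtractLinksCall(id: number, url: string): ToolCallRequest {
  new URL(url); // throws a TypeError if `url` cannot be parsed as an absolute URL
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name: "extract_links", arguments: { url } },
  };
}

console.log(JSON.stringify(buildExtractLinksCall(1, "https://example.com"), null, 2));
```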
Implementation Reference
- src/scraper.ts:61-79 (handler): The main `extractLinks` handler function. It fetches a page, parses it with linkedom, finds all anchor tags with href attributes, resolves relative URLs, and returns an array of `PageLink` objects with `href` and `text` properties.

  ```typescript
  export async function extractLinks(url: string): Promise<PageLink[]> {
    const html = await fetchPage(url);
    const { document } = parseHTML(html);
    const anchors = Array.from(document.querySelectorAll("a[href]"));
    const links: PageLink[] = [];
    const baseUrl = new URL(url);

    for (const a of anchors) {
      const href = a.getAttribute("href");
      if (!href || href.startsWith("#") || href.startsWith("javascript:")) continue;
      try {
        const resolved = new URL(href, baseUrl).href;
        links.push({ href: resolved, text: (a.textContent ?? "").trim() });
      } catch {
        // skip invalid URLs
      }
    }

    return links;
  }
  ```

- src/index.ts:45-62 (registration): Tool registration for `extract_links` using `server.tool()`. It defines the URL input schema with zod validation and the async handler that formats the output links as markdown.

  ```typescript
  server.tool(
    "extract_links",
    "Extract all links from a page with their href and anchor text. Resolves relative URLs. Skips anchors and javascript: links.",
    {
      url: z.string().url().describe("The URL to extract links from"),
    },
    async ({ url }) => {
      const links = await extractLinks(url);
      const text =
        links.length === 0
          ? "No links found."
          : links
              .map((l, i) => `${i + 1}. [${l.text || l.href}](${l.href})`)
              .join("\n");
      return { content: [{ type: "text", text }] };
    },
  );
  ```

- src/scraper.ts:23-26 (schema): Type definition for the `PageLink` interface, which defines the structure of extracted links with `href` (string) and `text` (string) properties.

  ```typescript
  export interface PageLink {
    href: string;
    text: string;
  }
  ```

- src/scraper.ts:28-40 (helper): The `fetchPage` helper function used by `extractLinks` to fetch HTML content from URLs. It sends a User-Agent and Accept header, follows redirects, and throws on non-OK responses.

  ```typescript
  async function fetchPage(url: string): Promise<string> {
    const response = await fetch(url, {
      headers: {
        "User-Agent": "Mozilla/5.0 (compatible; mcp-server-scraper/1.0)",
        Accept: "text/html,application/xhtml+xml",
      },
      redirect: "follow",
    });
    if (!response.ok) {
      throw new Error(`HTTP ${response.status}: ${response.statusText}`);
    }
    return response.text();
  }
  ```
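To show what the tool's text output looks like, the markdown formatting step from the registration handler can be pulled out into a standalone function. This is a minimal sketch: `formatLinks` is a hypothetical name, and the sample links are invented for illustration.

```typescript
// Same shape as the PageLink interface in the reference above.
interface PageLink {
  href: string;
  text: string;
}

// Hypothetical standalone version of the handler's formatting logic.
function formatLinks(links: PageLink[]): string {
  if (links.length === 0) return "No links found.";
  return links
    // Fall back to the href when the anchor text is empty (e.g. image-only links).
    .map((l, i) => `${i + 1}. [${l.text || l.href}](${l.href})`)
    .join("\n");
}

const sample: PageLink[] = [
  { href: "https://example.com/about", text: "About" },
  { href: "https://example.com/logo.png", text: "" },
];

console.log(formatLinks(sample));
// 1. [About](https://example.com/about)
// 2. [https://example.com/logo.png](https://example.com/logo.png)
```

Falling back to the href keeps every list entry clickable and labeled even when an anchor has no text content.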