mcp-server-scraper

by ofershap

extract_links

Extract all links from a webpage with href values and anchor text, resolving relative URLs while excluding anchors and javascript links.

Instructions

Extract all links from a page with their href and anchor text. Resolves relative URLs. Skips anchors and javascript: links.

Input Schema

| Name | Required | Description                   | Default |
| ---- | -------- | ----------------------------- | ------- |
| url  | Yes      | The URL to extract links from | —       |

Implementation Reference

  • The main extractLinks handler function that fetches a page, parses it with linkedom, finds all anchor tags with href attributes, resolves relative URLs, and returns an array of PageLink objects with href and text properties.
    export async function extractLinks(url: string): Promise<PageLink[]> {
      const html = await fetchPage(url);
      const { document } = parseHTML(html);
      const anchors = Array.from(document.querySelectorAll("a[href]"));
      const links: PageLink[] = [];
      const baseUrl = new URL(url);
      for (const a of anchors) {
        const href = a.getAttribute("href");
        if (!href || href.startsWith("#") || href.startsWith("javascript:"))
          continue;
        try {
          const resolved = new URL(href, baseUrl).href;
          links.push({ href: resolved, text: (a.textContent ?? "").trim() });
        } catch {
          // skip invalid URLs
        }
      }
      return links;
    }
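The skip-and-resolve rules in the handler above can be exercised in isolation. Below is a minimal sketch; `resolveHrefs` is a hypothetical helper, not part of the server, but it applies the same filters (fragment-only and `javascript:` hrefs are dropped, invalid URLs are silently skipped):

```typescript
// Hypothetical helper mirroring extractLinks's filtering/resolution rules.
function resolveHrefs(pageUrl: string, hrefs: string[]): string[] {
  const base = new URL(pageUrl);
  const out: string[] = [];
  for (const href of hrefs) {
    // Skip empty, fragment-only, and javascript: hrefs, as the handler does.
    if (!href || href.startsWith("#") || href.startsWith("javascript:")) continue;
    try {
      // Resolve relative hrefs against the page URL.
      out.push(new URL(href, base).href);
    } catch {
      // Invalid URL – skip, matching the handler above.
    }
  }
  return out;
}

resolveHrefs("https://example.com/docs/", [
  "a.html",
  "#top",
  "javascript:void(0)",
  "/about",
]);
// → ["https://example.com/docs/a.html", "https://example.com/about"]
```

Note that relative paths resolve against the final path segment of the base URL, so a trailing slash on the page URL changes the result.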
  • src/index.ts:45-62 (registration)
    Tool registration for 'extract_links' using server.tool(), defining the URL input schema with zod validation and the async handler that formats the output links as markdown.
    server.tool(
      "extract_links",
      "Extract all links from a page with their href and anchor text. Resolves relative URLs. Skips anchors and javascript: links.",
      {
        url: z.string().url().describe("The URL to extract links from"),
      },
      async ({ url }) => {
        const links = await extractLinks(url);
        const text =
          links.length === 0
            ? "No links found."
            : links
                .map((l, i) => `${i + 1}. [${l.text || l.href}](${l.href})`)
                .join("\n");
    
        return { content: [{ type: "text", text }] };
      },
    );
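The markdown-formatting step inside the handler can be factored into a pure function, which makes the empty-text fallback easy to see. A sketch (`formatLinks` is a hypothetical name, not in the source):

```typescript
interface PageLink {
  href: string;
  text: string;
}

// Hypothetical pure version of the handler's formatting step: a numbered
// markdown list, falling back to the href when the anchor text is empty.
function formatLinks(links: PageLink[]): string {
  return links.length === 0
    ? "No links found."
    : links
        .map((l, i) => `${i + 1}. [${l.text || l.href}](${l.href})`)
        .join("\n");
}

formatLinks([{ href: "https://example.com/", text: "" }]);
// → "1. [https://example.com/](https://example.com/)"
```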
  • Type definition for PageLink interface that defines the structure of extracted links with href (string) and text (string) properties.
    export interface PageLink {
      href: string;
      text: string;
    }
  • The fetchPage helper function used by extractLinks to fetch HTML content from URLs with appropriate User-Agent headers and error handling.
    async function fetchPage(url: string): Promise<string> {
      const response = await fetch(url, {
        headers: {
          "User-Agent": "Mozilla/5.0 (compatible; mcp-server-scraper/1.0)",
          Accept: "text/html,application/xhtml+xml",
        },
        redirect: "follow",
      });
      if (!response.ok) {
        throw new Error(`HTTP ${response.status}: ${response.statusText}`);
      }
      return response.text();
    }
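Because the helper goes through the global `fetch`, it can be unit-tested without network access by stubbing `fetch`. A sketch, assuming Node 18+ (which provides `fetch` and `Response` as globals); `fetchPage` here mirrors the helper shown above:

```typescript
// Mirror of the fetchPage helper above, reproduced so the sketch is
// self-contained (assumption: Node 18+ with built-in fetch/Response).
async function fetchPage(url: string): Promise<string> {
  const response = await fetch(url, {
    headers: {
      "User-Agent": "Mozilla/5.0 (compatible; mcp-server-scraper/1.0)",
      Accept: "text/html,application/xhtml+xml",
    },
    redirect: "follow",
  });
  if (!response.ok) {
    throw new Error(`HTTP ${response.status}: ${response.statusText}`);
  }
  return response.text();
}

// Replace the global fetch with a stub that serves canned HTML,
// so the success and error paths can both be driven deterministically.
(globalThis as any).fetch = async () =>
  new Response("<html><body>stub</body></html>", { status: 200 });
```

With the stub installed, `await fetchPage(...)` resolves to the canned HTML; a stub returning `{ status: 500 }` exercises the thrown `HTTP 500: ...` error path instead.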

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/ofershap/mcp-server-scraper'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.