Glama

clip_url

Fetch an external URL, extract the main article via Readability, convert to markdown, and save it as a new clipping in the knowledge base.

Instructions

SIDE-EFFECTFUL — fetches an EXTERNAL URL and writes a NEW KB file. Downloads the page, extracts the main article via Readability, converts to markdown, and saves it as a new clipping. NOT idempotent / no de-dup — re-clipping the same URL creates a second file. AUTH: anonymous by default; pass headers (e.g. {Cookie: 'session=...'} or {Authorization: 'Bearer ...'}) to clip behind logins. CtxNest does not rate-limit but the upstream may throttle. On auth-required pages returns isError with code: AUTH_REQUIRED, optional login_url, and a hint to retry with headers. Returns {file_id, path, title, source}. Use to ingest external docs into the KB.

Input Schema

| Name | Required | Description | Default |
| ---- | -------- | ----------- | ------- |
| url | Yes | The URL to clip | |
| title | No | Optional title override (defaults to the page's `<title>`) | |
| headers | No | Optional HTTP headers to forward with the fetch (e.g. `{"Cookie": "session=..."}` or `{"Authorization": "Bearer ..."}`). Use to clip pages behind auth walls after `AUTH_REQUIRED`. | |
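A typical invocation and its success payload (all values illustrative, including the path layout) might look like:

```json
{
  "url": "https://example.com/blog/post",
  "headers": { "Cookie": "session=abc123" }
}
```

```json
{
  "file_id": 42,
  "path": "knowledge/urlclips/example-post.md",
  "title": "Example Post",
  "source": "https://example.com/blog/post"
}
```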

Implementation Reference

  • MCP tool registration for 'clip_url' - the server.tool() call that defines the MCP tool interface (zod schema: url, title, headers), its description, and the async handler that calls clipUrl() and handles ClipError exceptions (including AUTH_REQUIRED).
    server.tool(
      "clip_url",
      "SIDE-EFFECTFUL — fetches an EXTERNAL URL and writes a NEW KB file. Downloads the page, extracts the main article via Readability, converts to markdown, and saves it as a new clipping. NOT idempotent / no de-dup — re-clipping the same URL creates a second file. AUTH: anonymous by default; pass `headers` (e.g. `{Cookie: 'session=...'}` or `{Authorization: 'Bearer ...'}`) to clip behind logins. CtxNest does not rate-limit but the upstream may throttle. On auth-required pages returns isError with `code: AUTH_REQUIRED`, optional `login_url`, and a hint to retry with `headers`. Returns `{file_id, path, title, source}`. Use to ingest external docs into the KB.",
      {
        url: z.string().url().describe("The URL to clip"),
        title: z.string().optional().describe("Optional title override (defaults to the page's <title>)"),
        headers: z
          .record(z.string())
          .optional()
          .describe("Optional HTTP headers to forward with the fetch (e.g. {\"Cookie\": \"session=...\"} or {\"Authorization\": \"Bearer ...\"}). Use to clip pages behind auth walls after AUTH_REQUIRED."),
      },
      async ({ url, title, headers }) => {
        try {
          const file = await clipUrl({ url, title, dataDir, headers });
          return {
            content: [
              {
                type: "text",
                text: JSON.stringify(
                  {
                    file_id: file.id,
                    path: file.path,
                    title: file.title,
                    source: file.source_path,
                  },
                  null,
                  2
                ),
              },
            ],
          };
        } catch (e) {
          const code = e instanceof ClipError ? e.code : "INTERNAL_ERROR";
          const message = (e as Error).message ?? String(e);
          const payload: Record<string, unknown> = { code, message };
          if (e instanceof ClipError && code === "AUTH_REQUIRED") {
            payload.auth_required = true;
            if (e.details.loginUrl) payload.login_url = e.details.loginUrl;
            if (e.details.signal) payload.signal = e.details.signal;
            if (e.details.wwwAuthenticate) payload.www_authenticate = e.details.wwwAuthenticate;
            payload.hint = "Page requires authentication. Retry with the optional `headers` param (e.g. {\"Cookie\": \"...\"} or {\"Authorization\": \"Bearer ...\"}).";
          }
          return {
            isError: true,
            content: [{ type: "text", text: JSON.stringify(payload, null, 2) }],
          };
        }
      }
    );
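On an auth-walled page, the catch branch above produces an `isError` payload along these lines (URL and message illustrative; `login_url` is the final URL the fetch landed on):

```json
{
  "code": "AUTH_REQUIRED",
  "message": "HTTP 401 Unauthorized fetching https://wiki.example.com/page",
  "auth_required": true,
  "login_url": "https://wiki.example.com/page",
  "signal": "http_401",
  "hint": "Page requires authentication. Retry with the optional `headers` param (e.g. {\"Cookie\": \"...\"} or {\"Authorization\": \"Bearer ...\"})."
}
```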
  • Core clipUrl() function implementation - fetches HTML, extracts article content via Readability, converts to markdown, builds frontmatter, checks for existing file by source_path, creates or updates the file in the 'urlclips' folder.
    export async function clipUrl(opts: ClipUrlOptions): Promise<FileRecordWithContent> {
      const { url, title: titleOverride, dataDir, headers } = opts;
    
      const { html, finalUrl } = await fetchHtml(url, { headers });
      const { title: extractedTitle, markdown } = await extractFromHtml(html, finalUrl);
      const title = (titleOverride ?? extractedTitle).trim() || "Untitled";
    
      const content = buildFrontmatter({
        source: finalUrl,
        title,
        clippedAt: new Date().toISOString(),
        body: markdown,
      });
    
      const db = getDatabase();
      const existing = db
        .prepare("SELECT id FROM files WHERE source_path = ? LIMIT 1")
        .get(finalUrl) as { id: number } | undefined;
    
      if (existing) {
        const current = readFile(existing.id);
        if (hashBody(current.content) === hashBody(content)) {
          return current;
        }
        return updateFile(existing.id, content, dataDir);
      }
    
      return createFile({
        title,
        content,
        destination: "knowledge",
        folder: URLCLIPS_FOLDER,
        sourcePath: finalUrl,
        dataDir,
      });
    }
  • ClipUrlOptions interface - defines the input schema: url (string), title (optional override), dataDir (string), headers (optional record of string headers).
    export interface ClipUrlOptions {
      url: string;
      title?: string;
      dataDir: string;
      /** Optional headers (cookies, auth) forwarded to the fetch. */
      headers?: Record<string, string>;
    }
  • Extraction utilities - fetchHtml() fetches the URL with SSRF protection (DNS resolution checks, private IP blocking, redirect validation), auth detection (401/403, login page detection), and extractFromHtml() uses Mozilla Readability + Turndown to convert HTML to markdown.
    import { Readability } from "@mozilla/readability";
    import TurndownService from "turndown";
    import { lookup } from "node:dns/promises";
    import { isIP } from "node:net";
    
    export type ClipErrorCode =
      | "INVALID_URL"
      | "FETCH_FAILED"
      | "UNSUPPORTED_CONTENT_TYPE"
      | "EXTRACTION_FAILED"
      | "AUTH_REQUIRED";
    
    export interface ClipErrorDetails {
      loginUrl?: string;
      wwwAuthenticate?: string;
      signal?: AuthSignal;
    }
    
    export type AuthSignal =
      | "http_401"
      | "http_403"
      | "redirect_to_login"
      | "login_page_body";
    
    export class ClipError extends Error {
      public readonly details: ClipErrorDetails;
      constructor(
        public code: ClipErrorCode,
        message: string,
        details: ClipErrorDetails = {}
      ) {
        super(message);
        this.name = "ClipError";
        this.details = details;
      }
    }
    
    const MIN_BODY_CHARS = 200;
    const FETCH_TIMEOUT_MS = 10_000;
    const MAX_RESPONSE_BYTES = 10 * 1024 * 1024; // 10 MiB
    const MAX_REDIRECTS = 5;
    
    function isPrivateIPv4(ip: string): boolean {
      const parts = ip.split(".").map(Number);
      if (parts.length !== 4 || parts.some((p) => Number.isNaN(p) || p < 0 || p > 255)) return false;
      const [a, b] = parts;
      return (
        a === 0 ||                                  // "this" network
        a === 10 ||                                 // RFC1918
        a === 127 ||                                // loopback
        (a === 169 && b === 254) ||                 // link-local (incl. cloud metadata)
        (a === 172 && b >= 16 && b <= 31) ||        // RFC1918
        (a === 192 && b === 168) ||                 // RFC1918
        (a === 100 && b >= 64 && b <= 127) ||       // CGNAT
        a >= 224                                    // multicast / reserved
      );
    }
    
    function isPrivateIPv6(ip: string): boolean {
      const lower = ip.toLowerCase().split("%")[0]; // strip zone id
      if (lower === "::1" || lower === "::") return true;
      if (lower.startsWith("fe80:") || lower.startsWith("fc") || lower.startsWith("fd")) return true;
      // IPv4-mapped IPv6: ::ffff:a.b.c.d
      const m = lower.match(/^::ffff:([0-9.]+)$/);
      if (m) return isPrivateIPv4(m[1]);
      return false;
    }
    
    async function assertPublicHost(parsed: URL): Promise<void> {
      const host = parsed.hostname.replace(/^\[|\]$/g, ""); // strip IPv6 brackets
      // Block obvious local/special TLDs and bare names
      if (
        host === "localhost" ||
        host.endsWith(".localhost") ||
        host.endsWith(".local") ||
        host.endsWith(".internal") ||
        host.endsWith(".lan")
      ) {
        throw new ClipError("INVALID_URL", `Refusing to clip non-public host: ${host}`);
      }
      const kind = isIP(host);
      if (kind === 4 && isPrivateIPv4(host)) {
        throw new ClipError("INVALID_URL", `Refusing to clip private/loopback IPv4: ${host}`);
      }
      if (kind === 6 && isPrivateIPv6(host)) {
        throw new ClipError("INVALID_URL", `Refusing to clip private/loopback IPv6: ${host}`);
      }
      if (kind !== 0) return;
      let resolved;
      try {
        resolved = await lookup(host, { all: true });
      } catch (e) {
        throw new ClipError("FETCH_FAILED", `DNS lookup failed for ${host}: ${(e as Error).message}`);
      }
      for (const r of resolved) {
        if (r.family === 4 && isPrivateIPv4(r.address)) {
          throw new ClipError("INVALID_URL", `Host ${host} resolves to private IPv4 ${r.address}`);
        }
        if (r.family === 6 && isPrivateIPv6(r.address)) {
          throw new ClipError("INVALID_URL", `Host ${host} resolves to private IPv6 ${r.address}`);
        }
      }
    }
    
    async function readBodyWithCap(res: Response): Promise<string> {
      const cl = res.headers.get("content-length");
      if (cl && Number.isFinite(Number(cl)) && Number(cl) > MAX_RESPONSE_BYTES) {
        throw new ClipError("FETCH_FAILED", `Response too large: ${cl} bytes (max ${MAX_RESPONSE_BYTES})`);
      }
      const reader = res.body?.getReader();
      if (!reader) return await res.text();
      const chunks: Uint8Array[] = [];
      let total = 0;
      while (true) {
        const { done, value } = await reader.read();
        if (done) break;
        if (!value) continue;
        total += value.byteLength;
        if (total > MAX_RESPONSE_BYTES) {
          try { await reader.cancel(); } catch {}
          throw new ClipError("FETCH_FAILED", `Response exceeded ${MAX_RESPONSE_BYTES} byte cap`);
        }
        chunks.push(value);
      }
      return Buffer.concat(chunks.map((c) => Buffer.from(c.buffer, c.byteOffset, c.byteLength))).toString("utf-8");
    }
    
    const turndown = new TurndownService({
      headingStyle: "atx",
      codeBlockStyle: "fenced",
    });
    
    export interface ExtractResult {
      title: string;
      markdown: string;
    }
    
    const LOGIN_URL_PATTERNS = [
      /\/login(\b|\/|\?|$)/i,
      /\/signin(\b|\/|\?|$)/i,
      /\/sign-in(\b|\/|\?|$)/i,
      /\/sso\//i,
      /\/oauth\//i,
      /\/auth\//i,
      /\/saml\//i,
      /\/account\/login/i,
      /[?&](redirect_to|return_to|next|destination|os_destination)=/i,
    ];
    
    function looksLikeLoginUrl(url: string): boolean {
      return LOGIN_URL_PATTERNS.some((re) => re.test(url));
    }
    
    function looksLikeLoginPage(html: string): boolean {
      // Cheap pre-check before parsing — most real articles don't have these tokens at all.
      const cheap = /<input[^>]+type=["']?password["']?/i.test(html)
        || /name=["']os_username["']/i.test(html)              // Atlassian / Confluence
        || /name=["']j_username["']/i.test(html)               // Spring Security
        || /<form[^>]+action=["'][^"']*\/(login|signin|j_security_check)/i.test(html)
        || /window\.location[^;]+\/login/i.test(html);
      return cheap;
    }
    
    export async function extractFromHtml(html: string, url: string): Promise<ExtractResult> {
      const { JSDOM } = await import("jsdom");
      const dom = new JSDOM(html, { url });
      const reader = new Readability(dom.window.document);
      const article = reader.parse();
    
      const bodyLen = (article?.textContent ?? "").trim().length;
      if (!article || !article.content || bodyLen < MIN_BODY_CHARS) {
        if (looksLikeLoginPage(html)) {
          throw new ClipError(
            "AUTH_REQUIRED",
            `Page at ${url} appears to require authentication (login form detected)`,
            { loginUrl: url, signal: "login_page_body" }
          );
        }
        throw new ClipError(
          "EXTRACTION_FAILED",
          `Could not extract enough article content from ${url}`
        );
      }
    
      const markdown = turndown.turndown(article.content);
      const title = (article.title || dom.window.document.title || "").trim() || "Untitled";
    
      return { title, markdown };
    }
    
    export interface FetchOptions {
      headers?: Record<string, string>;
    }
    
    export interface FetchResult {
      html: string;
      finalUrl: string;
    }
    
    export async function fetchHtml(url: string, opts: FetchOptions = {}): Promise<FetchResult> {
      const headers: Record<string, string> = {
        "User-Agent": "ctxnest-clipper/1.0 (+https://ctxnest.dev)",
        ...(opts.headers ?? {}),
      };
    
      // Manual redirect handling so we can re-validate the host on every hop —
      // a public URL that 302s to http://169.254.169.254/ must NOT be followed.
      let currentUrl = url;
      let res: Response | null = null;
      const signal = AbortSignal.timeout(FETCH_TIMEOUT_MS);
      for (let hop = 0; hop <= MAX_REDIRECTS; hop++) {
        let parsed: URL;
        try {
          parsed = new URL(currentUrl);
        } catch {
          throw new ClipError("INVALID_URL", `Not a valid URL: ${currentUrl}`);
        }
        if (parsed.protocol !== "http:" && parsed.protocol !== "https:") {
          throw new ClipError("INVALID_URL", `Unsupported protocol: ${parsed.protocol}`);
        }
        await assertPublicHost(parsed);
    
        try {
          res = await fetch(currentUrl, { redirect: "manual", signal, headers });
        } catch (e) {
          throw new ClipError("FETCH_FAILED", `Network error fetching ${currentUrl}: ${(e as Error).message}`);
        }
        if (res.status >= 300 && res.status < 400 && res.status !== 304) {
          const loc = res.headers.get("location");
          if (!loc) break;
          try { await res.body?.cancel(); } catch {}
          currentUrl = new URL(loc, currentUrl).toString();
          continue;
        }
        break;
      }
      if (!res) {
        throw new ClipError("FETCH_FAILED", `No response for ${url}`);
      }
      if (res.status >= 300 && res.status < 400) {
        throw new ClipError("FETCH_FAILED", `Too many redirects (>${MAX_REDIRECTS}) starting from ${url}`);
      }
    
      const finalUrl = res.url || currentUrl;
    
      if (res.status === 401) {
        throw new ClipError(
          "AUTH_REQUIRED",
          `HTTP 401 Unauthorized fetching ${url}`,
          {
            loginUrl: finalUrl,
            wwwAuthenticate: res.headers.get("www-authenticate") ?? undefined,
            signal: "http_401",
          }
        );
      }
      if (res.status === 403) {
        throw new ClipError(
          "AUTH_REQUIRED",
          `HTTP 403 Forbidden fetching ${url} (likely auth required)`,
          { loginUrl: finalUrl, signal: "http_403" }
        );
      }
    
      if (!res.ok) {
        throw new ClipError("FETCH_FAILED", `HTTP ${res.status} fetching ${url}`);
      }
    
      // Redirect-to-login: original wasn't login-shaped, but we landed on a login page.
      if (finalUrl !== url && looksLikeLoginUrl(finalUrl) && !looksLikeLoginUrl(url)) {
        throw new ClipError(
          "AUTH_REQUIRED",
          `Request for ${url} was redirected to login page ${finalUrl}`,
          { loginUrl: finalUrl, signal: "redirect_to_login" }
        );
      }
    
      const contentType = (res.headers.get("content-type") ?? "").toLowerCase();
      if (!contentType.startsWith("text/html") && !contentType.startsWith("application/xhtml+xml")) {
        throw new ClipError("UNSUPPORTED_CONTENT_TYPE", `Expected text/html, got ${contentType || "(none)"} from ${url}`);
      }
    
      const html = await readBodyWithCap(res);
      return { html, finalUrl };
    }
  • Frontmatter builder - buildFrontmatter() creates YAML frontmatter (source, title, clipped_at) around the markdown body; hashBody() computes content hash for dedup detection.
    import { computeHash } from "../files/index.js";
    
    export interface FrontmatterFields {
      source: string;
      title: string;
      clippedAt: string;
      body: string;
    }
    
    function yamlScalar(value: string): string {
      // Quote if contains characters that break a plain YAML scalar on one line.
      // A colon followed by a space/newline or other special context breaks YAML.
      // But colons in URLs (https://) are fine.
      if (/: /.test(value) || /:\n/.test(value) || /[#\n"']/.test(value) || value.trim() !== value) {
        return `"${value.replace(/\\/g, "\\\\").replace(/"/g, '\\"')}"`;
      }
      return value;
    }
    
    export function buildFrontmatter(f: FrontmatterFields): string {
      return [
        "---",
        `source: ${yamlScalar(f.source)}`,
        `title: ${yamlScalar(f.title)}`,
        `clipped_at: ${yamlScalar(f.clippedAt)}`,
        "---",
        "",
        f.body,
      ].join("\n");
    }
    
    export function splitFrontmatter(file: string): string {
      if (!file.startsWith("---\n")) return file;
      const end = file.indexOf("\n---\n", 4);
      if (end === -1) return file;
      return file.slice(end + 5).replace(/^\n+/, "");
    }
    
    export function hashBody(file: string): string {
      return computeHash(splitFrontmatter(file));
    }
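The frontmatter helpers round-trip cleanly: `splitFrontmatter(buildFrontmatter(f))` recovers the original body, which is what lets `hashBody` compare clippings on content alone. A standalone sketch (helpers copied verbatim; `computeHash` omitted):

```typescript
// yamlScalar, buildFrontmatter, and splitFrontmatter copied verbatim from above.
function yamlScalar(value: string): string {
  if (/: /.test(value) || /:\n/.test(value) || /[#\n"']/.test(value) || value.trim() !== value) {
    return `"${value.replace(/\\/g, "\\\\").replace(/"/g, '\\"')}"`;
  }
  return value;
}

function buildFrontmatter(f: { source: string; title: string; clippedAt: string; body: string }): string {
  return [
    "---",
    `source: ${yamlScalar(f.source)}`,
    `title: ${yamlScalar(f.title)}`,
    `clipped_at: ${yamlScalar(f.clippedAt)}`,
    "---",
    "",
    f.body,
  ].join("\n");
}

function splitFrontmatter(file: string): string {
  if (!file.startsWith("---\n")) return file;
  const end = file.indexOf("\n---\n", 4);
  if (end === -1) return file;
  return file.slice(end + 5).replace(/^\n+/, "");
}

const doc = buildFrontmatter({
  source: "https://example.com/post",
  title: "Hello: world", // colon + space forces YAML quoting
  clippedAt: "2024-01-01T00:00:00.000Z",
  body: "# Hello\n\nBody text.",
});

// The URL's "://" is a plain scalar, but the title gets quoted.
const body = splitFrontmatter(doc);
console.log(body === "# Hello\n\nBody text."); // true
```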
  • Re-export of clipUrl, ClipError, ClipErrorCode, and ClipUrlOptions from the core package's public API.
    export { clipUrl, ClipError, type ClipErrorCode, type ClipUrlOptions } from "./clip/index.js";
  • Categorization of clip_url tool under 'Write' category for UI display.
    clip_url: "Write",

Behavior: 5/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

The description fully discloses behavioral traits: side-effectfulness, non-idempotency, auth requirements, rate limiting, and specific error codes (AUTH_REQUIRED). Since no annotations are provided, the description carries the full burden and meets it thoroughly.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is a single paragraph but well-structured and front-loaded with key warnings. Every sentence adds value. Could be slightly more concise, but overall efficient.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the complexity of the tool (external fetch, auth, error handling) and the absence of an output schema, the description covers all necessary aspects: process, return shape, errors. Complete and self-contained.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Input schema coverage is 100%, so baseline is 3. The description adds significant value by explaining the 'headers' parameter with examples and its use for auth, and notes the 'title' parameter as an override. This goes beyond the schema descriptions.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the verb ('fetches', 'writes') and resource ('external URL', 'NEW KB file'), and specifies the process (download, readability, markdown conversion). It distinguishes from siblings by noting it ingests external docs, unlike file creation tools.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

Explicitly states when to use: for ingesting external docs into the KB. Provides clear usage guidance including non-idempotency warning and auth handling. Does not explicitly list alternatives but the context is clear.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
