# clip_url
Fetch an external URL, extract the main article via Readability, convert to markdown, and save it as a new clipping in the knowledge base.
## Instructions
SIDE-EFFECTFUL — fetches an EXTERNAL URL and writes to the knowledge base. Downloads the page, extracts the main article via Readability, converts it to markdown, and saves it as a clipping. If a clipping with the same final (post-redirect) URL already exists, it is updated in place (a no-op when the extracted content is unchanged); otherwise a new file is created in the `urlclips` folder. AUTH: anonymous by default; pass `headers` (e.g. `{"Cookie": "session=..."}` or `{"Authorization": "Bearer ..."}`) to clip pages behind logins. CtxNest does not rate-limit, but the upstream site may throttle. On auth-required pages the tool returns `isError` with `code: AUTH_REQUIRED`, an optional `login_url`, and a hint to retry with `headers`. On success it returns `{file_id, path, title, source}`. Use it to ingest external docs into the KB.
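For instance, clipping a login-walled page might fail with a payload like the one below (the URL and values are illustrative; the field names match those emitted by the handler):

```json
{
  "code": "AUTH_REQUIRED",
  "message": "Request for https://wiki.example.com/spec was redirected to login page https://wiki.example.com/login?next=%2Fspec",
  "auth_required": true,
  "login_url": "https://wiki.example.com/login?next=%2Fspec",
  "signal": "redirect_to_login",
  "hint": "Page requires authentication. Retry with the optional `headers` param (e.g. {\"Cookie\": \"...\"} or {\"Authorization\": \"Bearer ...\"})."
}
```

A retry that forwards a session cookie would then pass:

```json
{
  "url": "https://wiki.example.com/spec",
  "headers": { "Cookie": "session=abc123" }
}
```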
## Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | The URL to clip | |
| title | No | Optional title override (defaults to the page's <title>) | |
| headers | No | Optional HTTP headers to forward with the fetch (e.g. `{"Cookie": "session=..."}` or `{"Authorization": "Bearer ..."}`). Use to clip pages behind auth walls after `AUTH_REQUIRED`. | |
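On success the tool returns a single JSON text item. A sketch with hypothetical values (the `path` shown is an assumption; only the four field names are guaranteed by the handler):

```json
{
  "file_id": 42,
  "path": "urlclips/example-article.md",
  "title": "Example Article",
  "source": "https://example.com/article"
}
```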
## Implementation Reference
- `apps/mcp/src/index.ts:816-863` (handler) — MCP tool registration for `clip_url`: the `server.tool()` call that defines the MCP tool interface (zod schema: `url`, `title`, `headers`), its description, and the async handler that calls `clipUrl()` and handles `ClipError` exceptions (including `AUTH_REQUIRED`).
"clip_url", "SIDE-EFFECTFUL — fetches an EXTERNAL URL and writes a NEW KB file. Downloads the page, extracts the main article via Readability, converts to markdown, and saves it as a new clipping. NOT idempotent / no de-dup — re-clipping the same URL creates a second file. AUTH: anonymous by default; pass `headers` (e.g. `{Cookie: 'session=...'}` or `{Authorization: 'Bearer ...'}`) to clip behind logins. CtxNest does not rate-limit but the upstream may throttle. On auth-required pages returns isError with `code: AUTH_REQUIRED`, optional `login_url`, and a hint to retry with `headers`. Returns `{file_id, path, title, source}`. Use to ingest external docs into the KB.", { url: z.string().url().describe("The URL to clip"), title: z.string().optional().describe("Optional title override (defaults to the page's <title>)"), headers: z .record(z.string()) .optional() .describe("Optional HTTP headers to forward with the fetch (e.g. {\"Cookie\": \"session=...\"} or {\"Authorization\": \"Bearer ...\"}). Use to clip pages behind auth walls after AUTH_REQUIRED."), }, async ({ url, title, headers }) => { try { const file = await clipUrl({ url, title, dataDir, headers }); return { content: [ { type: "text", text: JSON.stringify( { file_id: file.id, path: file.path, title: file.title, source: file.source_path, }, null, 2 ), }, ], }; } catch (e) { const code = e instanceof ClipError ? e.code : "INTERNAL_ERROR"; const message = (e as Error).message ?? String(e); const payload: Record<string, unknown> = { code, message }; if (e instanceof ClipError && code === "AUTH_REQUIRED") { payload.auth_required = true; if (e.details.loginUrl) payload.login_url = e.details.loginUrl; if (e.details.signal) payload.signal = e.details.signal; if (e.details.wwwAuthenticate) payload.www_authenticate = e.details.wwwAuthenticate; payload.hint = "Page requires authentication. Retry with the optional `headers` param (e.g. 
{\"Cookie\": \"...\"} or {\"Authorization\": \"Bearer ...\"})."; } return { isError: true, content: [{ type: "text", text: JSON.stringify(payload, null, 2) }], }; } } ); - packages/core/src/clip/index.ts:19-54 (handler)Core clipUrl() function implementation - fetches HTML, extracts article content via Readability, converts to markdown, builds frontmatter, checks for existing file by source_path, creates or updates the file in the 'urlclips' folder.
```ts
export async function clipUrl(opts: ClipUrlOptions): Promise<FileRecordWithContent> {
  const { url, title: titleOverride, dataDir, headers } = opts;

  const { html, finalUrl } = await fetchHtml(url, { headers });
  const { title: extractedTitle, markdown } = await extractFromHtml(html, finalUrl);

  const title = (titleOverride ?? extractedTitle).trim() || "Untitled";
  const content = buildFrontmatter({
    source: finalUrl,
    title,
    clippedAt: new Date().toISOString(),
    body: markdown,
  });

  const db = getDatabase();
  const existing = db
    .prepare("SELECT id FROM files WHERE source_path = ? LIMIT 1")
    .get(finalUrl) as { id: number } | undefined;

  if (existing) {
    const current = readFile(existing.id);
    if (hashBody(current.content) === hashBody(content)) {
      return current;
    }
    return updateFile(existing.id, content, dataDir);
  }

  return createFile({
    title,
    content,
    destination: "knowledge",
    folder: URLCLIPS_FOLDER,
    sourcePath: finalUrl,
    dataDir,
  });
}
```

- `packages/core/src/clip/index.ts:9-15` (schema) — `ClipUrlOptions` interface: defines the `clipUrl()` options: `url` (string), `title` (optional override), `dataDir` (string), `headers` (optional record of string headers).
```ts
export interface ClipUrlOptions {
  url: string;
  title?: string;
  dataDir: string;
  /** Optional headers (cookies, auth) forwarded to the fetch. */
  headers?: Record<string, string>;
}
```

- Extraction utilities — `fetchHtml()` fetches the URL with SSRF protection (DNS resolution checks, private-IP blocking, redirect validation) and auth detection (401/403, login-page detection); `extractFromHtml()` uses Mozilla Readability + Turndown to convert HTML to markdown.
```ts
import { Readability } from "@mozilla/readability";
import TurndownService from "turndown";
import { lookup } from "node:dns/promises";
import { isIP } from "node:net";

export type ClipErrorCode =
  | "INVALID_URL"
  | "FETCH_FAILED"
  | "UNSUPPORTED_CONTENT_TYPE"
  | "EXTRACTION_FAILED"
  | "AUTH_REQUIRED";

export interface ClipErrorDetails {
  loginUrl?: string;
  wwwAuthenticate?: string;
  signal?: AuthSignal;
}

export type AuthSignal =
  | "http_401"
  | "http_403"
  | "redirect_to_login"
  | "login_page_body";

export class ClipError extends Error {
  public readonly details: ClipErrorDetails;
  constructor(
    public code: ClipErrorCode,
    message: string,
    details: ClipErrorDetails = {}
  ) {
    super(message);
    this.name = "ClipError";
    this.details = details;
  }
}

const MIN_BODY_CHARS = 200;
const FETCH_TIMEOUT_MS = 10_000;
const MAX_RESPONSE_BYTES = 10 * 1024 * 1024; // 10 MiB
const MAX_REDIRECTS = 5;

function isPrivateIPv4(ip: string): boolean {
  const parts = ip.split(".").map(Number);
  if (parts.length !== 4 || parts.some((p) => Number.isNaN(p) || p < 0 || p > 255)) return false;
  const [a, b] = parts;
  return (
    a === 0 || // "this" network
    a === 10 || // RFC1918
    a === 127 || // loopback
    (a === 169 && b === 254) || // link-local (incl. cloud metadata)
    (a === 172 && b >= 16 && b <= 31) || // RFC1918
    (a === 192 && b === 168) || // RFC1918
    (a === 100 && b >= 64 && b <= 127) || // CGNAT
    a >= 224 // multicast / reserved
  );
}

function isPrivateIPv6(ip: string): boolean {
  const lower = ip.toLowerCase().split("%")[0]; // strip zone id
  if (lower === "::1" || lower === "::") return true;
  if (lower.startsWith("fe80:") || lower.startsWith("fc") || lower.startsWith("fd")) return true;
  // IPv4-mapped IPv6: ::ffff:a.b.c.d
  const m = lower.match(/^::ffff:([0-9.]+)$/);
  if (m) return isPrivateIPv4(m[1]);
  return false;
}

async function assertPublicHost(parsed: URL): Promise<void> {
  const host = parsed.hostname.replace(/^\[|\]$/g, ""); // strip IPv6 brackets
  // Block obvious local/special TLDs and bare names
  if (
    host === "localhost" ||
    host.endsWith(".localhost") ||
    host.endsWith(".local") ||
    host.endsWith(".internal") ||
    host.endsWith(".lan")
  ) {
    throw new ClipError("INVALID_URL", `Refusing to clip non-public host: ${host}`);
  }
  const kind = isIP(host);
  if (kind === 4 && isPrivateIPv4(host)) {
    throw new ClipError("INVALID_URL", `Refusing to clip private/loopback IPv4: ${host}`);
  }
  if (kind === 6 && isPrivateIPv6(host)) {
    throw new ClipError("INVALID_URL", `Refusing to clip private/loopback IPv6: ${host}`);
  }
  if (kind !== 0) return;
  let resolved;
  try {
    resolved = await lookup(host, { all: true });
  } catch (e) {
    throw new ClipError("FETCH_FAILED", `DNS lookup failed for ${host}: ${(e as Error).message}`);
  }
  for (const r of resolved) {
    if (r.family === 4 && isPrivateIPv4(r.address)) {
      throw new ClipError("INVALID_URL", `Host ${host} resolves to private IPv4 ${r.address}`);
    }
    if (r.family === 6 && isPrivateIPv6(r.address)) {
      throw new ClipError("INVALID_URL", `Host ${host} resolves to private IPv6 ${r.address}`);
    }
  }
}

async function readBodyWithCap(res: Response): Promise<string> {
  const cl = res.headers.get("content-length");
  if (cl && Number.isFinite(Number(cl)) && Number(cl) > MAX_RESPONSE_BYTES) {
    throw new ClipError("FETCH_FAILED", `Response too large: ${cl} bytes (max ${MAX_RESPONSE_BYTES})`);
  }
  const reader = res.body?.getReader();
  if (!reader) return await res.text();
  const chunks: Uint8Array[] = [];
  let total = 0;
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    if (!value) continue;
    total += value.byteLength;
    if (total > MAX_RESPONSE_BYTES) {
      try {
        await reader.cancel();
      } catch {}
      throw new ClipError("FETCH_FAILED", `Response exceeded ${MAX_RESPONSE_BYTES} byte cap`);
    }
    chunks.push(value);
  }
  return Buffer.concat(
    chunks.map((c) => Buffer.from(c.buffer, c.byteOffset, c.byteLength))
  ).toString("utf-8");
}

const turndown = new TurndownService({
  headingStyle: "atx",
  codeBlockStyle: "fenced",
});

export interface ExtractResult {
  title: string;
  markdown: string;
}

const LOGIN_URL_PATTERNS = [
  /\/login(\b|\/|\?|$)/i,
  /\/signin(\b|\/|\?|$)/i,
  /\/sign-in(\b|\/|\?|$)/i,
  /\/sso\//i,
  /\/oauth\//i,
  /\/auth\//i,
  /\/saml\//i,
  /\/account\/login/i,
  /[?&](redirect_to|return_to|next|destination|os_destination)=/i,
];

function looksLikeLoginUrl(url: string): boolean {
  return LOGIN_URL_PATTERNS.some((re) => re.test(url));
}

function looksLikeLoginPage(html: string): boolean {
  // Cheap pre-check before parsing — most real articles don't have these tokens at all.
  const cheap =
    /<input[^>]+type=["']?password["']?/i.test(html)
    || /name=["']os_username["']/i.test(html) // Atlassian / Confluence
    || /name=["']j_username["']/i.test(html) // Spring Security
    || /<form[^>]+action=["'][^"']*\/(login|signin|j_security_check)/i.test(html)
    || /window\.location[^;]+\/login/i.test(html);
  return cheap;
}

export async function extractFromHtml(html: string, url: string): Promise<ExtractResult> {
  const { JSDOM } = await import("jsdom");
  const dom = new JSDOM(html, { url });
  const reader = new Readability(dom.window.document);
  const article = reader.parse();
  const bodyLen = (article?.textContent ?? "").trim().length;
  if (!article || !article.content || bodyLen < MIN_BODY_CHARS) {
    if (looksLikeLoginPage(html)) {
      throw new ClipError(
        "AUTH_REQUIRED",
        `Page at ${url} appears to require authentication (login form detected)`,
        { loginUrl: url, signal: "login_page_body" }
      );
    }
    throw new ClipError(
      "EXTRACTION_FAILED",
      `Could not extract enough article content from ${url}`
    );
  }
  const markdown = turndown.turndown(article.content);
  const title = (article.title || dom.window.document.title || "").trim() || "Untitled";
  return { title, markdown };
}

export interface FetchOptions {
  headers?: Record<string, string>;
}

export interface FetchResult {
  html: string;
  finalUrl: string;
}

export async function fetchHtml(url: string, opts: FetchOptions = {}): Promise<FetchResult> {
  const headers: Record<string, string> = {
    "User-Agent": "ctxnest-clipper/1.0 (+https://ctxnest.dev)",
    ...(opts.headers ?? {}),
  };
  // Manual redirect handling so we can re-validate the host on every hop —
  // a public URL that 302s to http://169.254.169.254/ must NOT be followed.
  let currentUrl = url;
  let res: Response | null = null;
  const signal = AbortSignal.timeout(FETCH_TIMEOUT_MS);
  for (let hop = 0; hop <= MAX_REDIRECTS; hop++) {
    let parsed: URL;
    try {
      parsed = new URL(currentUrl);
    } catch {
      throw new ClipError("INVALID_URL", `Not a valid URL: ${currentUrl}`);
    }
    if (parsed.protocol !== "http:" && parsed.protocol !== "https:") {
      throw new ClipError("INVALID_URL", `Unsupported protocol: ${parsed.protocol}`);
    }
    await assertPublicHost(parsed);
    try {
      res = await fetch(currentUrl, { redirect: "manual", signal, headers });
    } catch (e) {
      throw new ClipError("FETCH_FAILED", `Network error fetching ${currentUrl}: ${(e as Error).message}`);
    }
    if (res.status >= 300 && res.status < 400 && res.status !== 304) {
      const loc = res.headers.get("location");
      if (!loc) break;
      try {
        await res.body?.cancel();
      } catch {}
      currentUrl = new URL(loc, currentUrl).toString();
      continue;
    }
    break;
  }
  if (!res) {
    throw new ClipError("FETCH_FAILED", `No response for ${url}`);
  }
  if (res.status >= 300 && res.status < 400) {
    throw new ClipError("FETCH_FAILED", `Too many redirects (>${MAX_REDIRECTS}) starting from ${url}`);
  }
  const finalUrl = res.url || currentUrl;
  if (res.status === 401) {
    throw new ClipError("AUTH_REQUIRED", `HTTP 401 Unauthorized fetching ${url}`, {
      loginUrl: finalUrl,
      wwwAuthenticate: res.headers.get("www-authenticate") ?? undefined,
      signal: "http_401",
    });
  }
  if (res.status === 403) {
    throw new ClipError("AUTH_REQUIRED", `HTTP 403 Forbidden fetching ${url} (likely auth required)`, {
      loginUrl: finalUrl,
      signal: "http_403",
    });
  }
  if (!res.ok) {
    throw new ClipError("FETCH_FAILED", `HTTP ${res.status} fetching ${url}`);
  }
  // Redirect-to-login: original wasn't login-shaped, but we landed on a login page.
  if (finalUrl !== url && looksLikeLoginUrl(finalUrl) && !looksLikeLoginUrl(url)) {
    throw new ClipError(
      "AUTH_REQUIRED",
      `Request for ${url} was redirected to login page ${finalUrl}`,
      { loginUrl: finalUrl, signal: "redirect_to_login" }
    );
  }
  const contentType = (res.headers.get("content-type") ?? "").toLowerCase();
  if (!contentType.startsWith("text/html") && !contentType.startsWith("application/xhtml+xml")) {
    throw new ClipError("UNSUPPORTED_CONTENT_TYPE", `Expected text/html, got ${contentType || "(none)"} from ${url}`);
  }
  const html = await readBodyWithCap(res);
  return { html, finalUrl };
}
```

- Frontmatter builder — `buildFrontmatter()` wraps the markdown body in YAML frontmatter (`source`, `title`, `clipped_at`); `hashBody()` computes a content hash over the body (frontmatter stripped) for dedup detection.
```ts
import { computeHash } from "../files/index.js";

export interface FrontmatterFields {
  source: string;
  title: string;
  clippedAt: string;
  body: string;
}

function yamlScalar(value: string): string {
  // Quote if contains characters that break a plain YAML scalar on one line.
  // A colon followed by a space/newline or other special context breaks YAML.
  // But colons in URLs (https://) are fine.
  if (/: /.test(value) || /:\n/.test(value) || /[#\n"']/.test(value) || value.trim() !== value) {
    return `"${value.replace(/\\/g, "\\\\").replace(/"/g, '\\"')}"`;
  }
  return value;
}

export function buildFrontmatter(f: FrontmatterFields): string {
  return [
    "---",
    `source: ${yamlScalar(f.source)}`,
    `title: ${yamlScalar(f.title)}`,
    `clipped_at: ${yamlScalar(f.clippedAt)}`,
    "---",
    "",
    f.body,
  ].join("\n");
}

export function splitFrontmatter(file: string): string {
  if (!file.startsWith("---\n")) return file;
  const end = file.indexOf("\n---\n", 4);
  if (end === -1) return file;
  return file.slice(end + 5).replace(/^\n+/, "");
}

export function hashBody(file: string): string {
  return computeHash(splitFrontmatter(file));
}
```

- `packages/core/src/index.ts:17` (registration) — Re-export of `clipUrl`, `ClipError`, `ClipErrorCode`, and `ClipUrlOptions` from the core package's public API.
```ts
export { clipUrl, ClipError, type ClipErrorCode, type ClipUrlOptions } from "./clip/index.js";
```

- `apps/web/src/lib/mcp-tool-categories.ts:47` (registration) — Categorization of the `clip_url` tool under the 'Write' category for UI display.
```ts
clip_url: "Write",
```