Read PDF from URL
`read_url`: Fetch a PDF from a URL and extract its text content, with options for column reordering and whitespace compaction.
Instructions
Fetch a PDF from a URL and extract its text content.
Downloads the PDF from the specified URL, then extracts text with Y-coordinate-based reading order. Supports HTTP and HTTPS. Maximum file size: 50MB. Timeout: 30 seconds.
Like read_text, accepts split_columns: 2 | 3 for untagged multi-column PDFs and compact_whitespace: true to collapse U+3000 / ASCII whitespace runs. Tagged PDFs should use extract_tables instead.
Args:
url (string): URL pointing to a PDF file (HTTP or HTTPS)
pages (string, optional): Page range to extract. Format: "1-5", "3", or "1,3,5-7". Omit for all pages.
response_format ('markdown' | 'json'): Output format (default: 'markdown')
split_columns (1 | 2 | 3, optional): Column-aware reordering. Default 1 = existing Y-sort.
compact_whitespace (boolean, optional): Collapse whitespace runs (incl. U+3000) to one ASCII space. Default false.
Returns: Extracted text organized by page number, same format as read_text.
Examples:
Read remote PDF: { url: "https://example.com/document.pdf" }
Untagged 2-column PDF: { url: "https://...", split_columns: 2 }
Japanese form: { url: "https://...", compact_whitespace: true }
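The `compact_whitespace` option's documented behavior (collapse runs of ASCII whitespace and fullwidth U+3000 spaces to one ASCII space, trim each line) can be sketched as follows. This is an illustrative re-implementation of the described rule, not the server's actual code:

```typescript
// Sketch of the documented compact_whitespace rule: collapse runs of
// whitespace (including fullwidth space U+3000) to a single ASCII space,
// then trim each line. Illustrative only.
function compactWhitespace(text: string): string {
  return text
    .split('\n')
    .map((line) => line.replace(/[\s\u3000]+/g, ' ').trim())
    .join('\n');
}

// A Japanese form-style line padded with fullwidth spaces:
const input = '氏名\u3000\u3000\u3000山田\u3000太郎  ';
console.log(compactWhitespace(input)); // "氏名 山田 太郎"
```

On form-style PDFs, where labels and values are aligned with long runs of U+3000, this normalization is what cuts the token count.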
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | URL pointing to a PDF file (HTTP or HTTPS) | |
| pages | No | Page range to process. Format: "1-5", "3", or "1,3,5-7". Omit for all pages. | |
| response_format | No | Output format: "markdown" for human-readable, "json" for structured data | markdown |
| split_columns | No | Number of columns to use when reordering text. 1 (default) = existing Y-sort. 2 or 3 = bucket by X-coordinate left-to-right. Use for untagged comparison-table (新旧対照表) / two-column PDFs where Y-sort would interleave columns. Tagged PDFs with proper `<Table>` markup should use extract_tables instead. | 1 |
| compact_whitespace | No | When true, collapse runs of whitespace (incl. fullwidth space U+3000) to a single ASCII space and trim each line. Reduces token consumption on Japanese form-style PDFs. | false |
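The `split_columns` idea, bucketing items by X-coordinate left-to-right, can be sketched with a minimal example. The real extractor is more involved; this only shows why a plain Y-sort interleaves columns and how bucketing fixes it (all names and coordinates here are hypothetical):

```typescript
// Illustrative sketch: bucket text items into N equal-width columns by
// X-coordinate, read each column top-to-bottom, concatenate left-to-right.
interface TextItem {
  x: number;
  y: number;
  str: string;
}

function splitColumns(items: TextItem[], columns: number, pageWidth: number): string {
  const buckets: TextItem[][] = Array.from({ length: columns }, () => []);
  for (const item of items) {
    const col = Math.min(columns - 1, Math.floor((item.x / pageWidth) * columns));
    buckets[col].push(item);
  }
  // Within each column, sort by descending y (PDF origin is bottom-left).
  return buckets
    .map((b) => b.sort((a, c) => c.y - a.y).map((i) => i.str).join('\n'))
    .join('\n');
}

// Two-column page: a plain Y-sort would interleave as "L1 R1 L2 R2".
const items: TextItem[] = [
  { x: 50, y: 700, str: 'L1' }, { x: 350, y: 700, str: 'R1' },
  { x: 50, y: 650, str: 'L2' }, { x: 350, y: 650, str: 'R2' },
];
console.log(splitColumns(items, 2, 600)); // "L1\nL2\nR1\nR2"
```

Equal-width bucketing is the simplest possible heuristic; it works for the symmetric two- and three-column layouts the option targets, but not for pages with uneven gutters.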
Implementation Reference
- src/tools/tier1/read-url.ts:13-80 (handler): The main handler function `registerReadUrl` that registers the `read_url` tool. It fetches a PDF from a URL via `fetchPdfFromUrl`, loads it with pdfjs-service, extracts text with column/whitespace options, formats the output as markdown or JSON, truncates it if needed, and returns the result.
```typescript
export function registerReadUrl(server: McpServer): void {
  server.registerTool(
    'read_url',
    {
      title: 'Read PDF from URL',
      description: `Fetch a PDF from a URL and extract its text content.

Downloads the PDF from the specified URL, then extracts text with Y-coordinate-based reading order. Supports HTTP and HTTPS. Maximum file size: 50MB. Timeout: 30 seconds.

Like \`read_text\`, accepts \`split_columns: 2 | 3\` for **untagged** multi-column PDFs and \`compact_whitespace: true\` to collapse U+3000 / ASCII whitespace runs. Tagged PDFs should use \`extract_tables\` instead.

Args:
- url (string): URL pointing to a PDF file (HTTP or HTTPS)
- pages (string, optional): Page range to extract. Format: "1-5", "3", or "1,3,5-7". Omit for all pages.
- response_format ('markdown' | 'json'): Output format (default: 'markdown')
- split_columns (1 | 2 | 3, optional): Column-aware reordering. Default 1 = existing Y-sort.
- compact_whitespace (boolean, optional): Collapse whitespace runs (incl. U+3000) to one ASCII space. Default false.

Returns: Extracted text organized by page number, same format as read_text.

Examples:
- Read remote PDF: { url: "https://example.com/document.pdf" }
- Untagged 2-column PDF: { url: "https://...", split_columns: 2 }
- Japanese form: { url: "https://...", compact_whitespace: true }`,
      inputSchema: ReadUrlSchema,
      annotations: {
        readOnlyHint: true,
        destructiveHint: false,
        idempotentHint: false,
        openWorldHint: true,
      },
    },
    async (params: ReadUrlInput) => {
      try {
        const data = await fetchPdfFromUrl(params.url);
        const doc = await loadDocumentFromData(data);
        try {
          const results = await extractTextFromDoc(doc, params.pages, {
            splitColumns: params.split_columns,
            compactWhitespace: params.compact_whitespace,
          });
          let text: string;
          if (params.response_format === ResponseFormat.JSON) {
            text = JSON.stringify(results, null, 2);
          } else {
            text = formatPageTextsMarkdown(results);
          }
          const { text: finalText } = truncateIfNeeded(text);
          return {
            content: [{ type: 'text' as const, text: finalText }],
          };
        } finally {
          await doc.destroy();
        }
      } catch (error) {
        const err = handleStructuredError(error);
        return {
          content: [{ type: 'text' as const, text: JSON.stringify(err, null, 2) }],
          isError: true,
        };
      }
    },
  );
}
```

- src/schemas/tier1.ts:114-123 (schema): Zod schema `ReadUrlSchema` defining the input validation for read_url: url (valid URL), pages (page range), response_format (markdown/json), split_columns (1-3), compact_whitespace (boolean).
```typescript
/** read_url */
export const ReadUrlSchema = z
  .object({
    url: UrlSchema,
    pages: PagesSchema,
    response_format: ResponseFormatSchema,
    split_columns: SplitColumnsSchema,
    compact_whitespace: CompactWhitespaceSchema,
  })
  .strict();
```

- src/tools/index.ts:12-37 (registration): Import and registration call of `registerReadUrl` in the central tool registration entry point (`registerAllTools`).
```typescript
import { registerReadUrl } from './tier1/read-url.js';
import { registerSearchText } from './tier1/search-text.js';
import { registerSummarize } from './tier1/summarize.js';
// Tier 2: Structure analysis
import { registerExtractTables } from './tier2/extract-tables.js';
import { registerInspectAnnotations } from './tier2/inspect-annotations.js';
import { registerInspectFonts } from './tier2/inspect-fonts.js';
import { registerInspectSignatures } from './tier2/inspect-signatures.js';
import { registerInspectStructure } from './tier2/inspect-structure.js';
import { registerInspectTags } from './tier2/inspect-tags.js';
// Tier 3: Validation & analysis
import { registerCompareStructure } from './tier3/compare-structure.js';
import { registerValidateMetadata } from './tier3/validate-metadata.js';
import { registerValidateTagged } from './tier3/validate-tagged.js';

/**
 * Register all tools with the MCP server.
 */
export function registerAllTools(server: McpServer): void {
  // Tier 1: Basic PDF operations
  registerGetPageCount(server);
  registerGetMetadata(server);
  registerReadText(server);
  registerSearchText(server);
  registerReadImages(server);
  registerReadUrl(server);
```

- src/services/url-fetcher.ts:23-86 (helper): The `fetchPdfFromUrl` helper that actually downloads the PDF from a URL, with a 30s timeout, 50MB size checks, protocol validation (HTTP/HTTPS), and concurrency limiting.
```typescript
export async function fetchPdfFromUrl(url: string): Promise<Uint8Array> {
  return fetchLimit(() => fetchPdfFromUrlUnlimited(url));
}

/**
 * Raw fetch implementation that bypasses the concurrency limit.
 * For tests, or for callers that already guarantee the limit is applied.
 */
export async function fetchPdfFromUrlUnlimited(url: string): Promise<Uint8Array> {
  let parsedUrl: URL;
  try {
    parsedUrl = new URL(url);
  } catch {
    throw new PdfReaderError(
      `Invalid URL: ${url}`,
      'INVALID_URL',
      'Provide a valid HTTP or HTTPS URL pointing to a PDF file.',
    );
  }

  if (!['http:', 'https:'].includes(parsedUrl.protocol)) {
    throw new PdfReaderError(
      `Unsupported protocol: ${parsedUrl.protocol}`,
      'UNSUPPORTED_PROTOCOL',
      'Only HTTP and HTTPS URLs are supported.',
    );
  }

  const response = await fetch(url, {
    headers: {
      Accept: 'application/pdf',
      'User-Agent': `${SERVER_NAME}/${SERVER_VERSION}`,
    },
    signal: AbortSignal.timeout(30_000),
  });

  if (!response.ok) {
    throw new PdfReaderError(
      `Failed to fetch PDF: HTTP ${response.status} ${response.statusText}`,
      'FETCH_FAILED',
      'Check the URL is correct and accessible.',
    );
  }

  const contentLength = response.headers.get('content-length');
  if (contentLength && parseInt(contentLength, 10) > MAX_FILE_SIZE) {
    throw new PdfReaderError(
      `Remote PDF size (${(parseInt(contentLength, 10) / 1024 / 1024).toFixed(1)}MB) exceeds the 50MB limit.`,
      'FILE_TOO_LARGE',
      'Try a smaller PDF file.',
    );
  }

  const buffer = await response.arrayBuffer();
  if (buffer.byteLength > MAX_FILE_SIZE) {
    throw new PdfReaderError(
      `Downloaded PDF size (${(buffer.byteLength / 1024 / 1024).toFixed(1)}MB) exceeds the 50MB limit.`,
      'FILE_TOO_LARGE',
      'Try a smaller PDF file.',
    );
  }

  return new Uint8Array(buffer);
}
```

- src/utils/concurrency.ts:26-57 (helper): Concurrency limiter `createLimit` used by url-fetcher to cap parallel read_url fetches (default 4), preventing remote host overload.
```typescript
export function createLimit(concurrency: number): LimitFn {
  if (!Number.isFinite(concurrency) || concurrency < 1) {
    throw new Error(`createLimit: concurrency must be >= 1, got ${concurrency}`);
  }

  let active = 0;
  const queue: Array<() => void> = [];

  const next = () => {
    if (active >= concurrency) return;
    const run = queue.shift();
    if (run) {
      active++;
      run();
    }
  };

  return <T>(task: () => Promise<T>): Promise<T> => {
    return new Promise<T>((resolve, reject) => {
      const exec = () => {
        task()
          .then(resolve, reject)
          .finally(() => {
            active--;
            next();
          });
      };
      queue.push(exec);
      next();
    });
  };
}
```
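A minimal usage sketch of the limiter: cap four fake downloads at two in flight and observe that the peak concurrency never exceeds the limit. A trimmed copy of `createLimit` is inlined so the example is self-contained; the task bodies are hypothetical stand-ins for real fetches.

```typescript
type LimitFn = <T>(task: () => Promise<T>) => Promise<T>;

// Trimmed copy of the createLimit helper above, inlined for self-containment.
function createLimit(concurrency: number): LimitFn {
  let active = 0;
  const queue: Array<() => void> = [];
  const next = () => {
    if (active >= concurrency) return;
    const run = queue.shift();
    if (run) {
      active++;
      run();
    }
  };
  return <T>(task: () => Promise<T>): Promise<T> =>
    new Promise<T>((resolve, reject) => {
      queue.push(() => {
        task()
          .then(resolve, reject)
          .finally(() => {
            active--;
            next();
          });
      });
      next();
    });
}

const limit = createLimit(2);
let running = 0;
let peak = 0;

// Four fake downloads; at most two run at any moment.
const results = await Promise.all(
  [1, 2, 3, 4].map((id) =>
    limit(async () => {
      running++;
      peak = Math.max(peak, running);
      await new Promise((r) => setTimeout(r, 10));
      running--;
      return id;
    }),
  ),
);
console.log(results, peak); // [ 1, 2, 3, 4 ] 2
```

`Promise.all` preserves input order regardless of completion order, so callers see results in the order they queued tasks even though execution interleaves.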