
Read PDF from URL

read_url
Read-only

Fetch a PDF from a URL and extract its text content with options for column reordering and whitespace compaction.

Instructions

Fetch a PDF from a URL and extract its text content.

Downloads the PDF from the specified URL, then extracts text with Y-coordinate-based reading order. Supports HTTP and HTTPS. Maximum file size: 50MB. Timeout: 30 seconds.

Like read_text, accepts split_columns: 2 | 3 for untagged multi-column PDFs and compact_whitespace: true to collapse U+3000 / ASCII whitespace runs. Tagged PDFs should use extract_tables instead.
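The two options can be pictured with a small sketch, assuming the usual pdf.js-style page coordinates (origin at bottom-left, Y increasing upward). `compactWhitespace` and `bucketByColumn` are illustrative names, not the server's actual internals:

```typescript
// Collapse runs of whitespace (including fullwidth space U+3000) to one
// ASCII space and trim -- mirrors what compact_whitespace: true does.
function compactWhitespace(line: string): string {
  return line.replace(/[\s\u3000]+/g, ' ').trim();
}

// A positioned text fragment, as pdf.js-style extraction would yield.
interface TextItem {
  x: number;
  y: number;
  str: string;
}

// Bucket items into N columns by X coordinate, then Y-sort within each
// bucket -- the idea behind split_columns: 2 | 3. Plain Y-sort would
// interleave the columns; bucketing reads each column top to bottom.
function bucketByColumn(items: TextItem[], columns: number, pageWidth: number): string {
  const buckets: TextItem[][] = Array.from({ length: columns }, () => []);
  for (const item of items) {
    const col = Math.min(columns - 1, Math.floor((item.x / pageWidth) * columns));
    buckets[col].push(item);
  }
  // Descending Y = top-to-bottom, since PDF Y grows upward.
  return buckets
    .map((b) => b.sort((a, c) => c.y - a.y).map((i) => i.str).join(' '))
    .join('\n');
}
```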

Args:

  • url (string): URL pointing to a PDF file (HTTP or HTTPS)

  • pages (string, optional): Page range to extract. Format: "1-5", "3", or "1,3,5-7". Omit for all pages.

  • response_format ('markdown' | 'json'): Output format (default: 'markdown')

  • split_columns (1 | 2 | 3, optional): Column-aware reordering. Default 1 = existing Y-sort.

  • compact_whitespace (boolean, optional): Collapse whitespace runs (incl. U+3000) to one ASCII space. Default false.

Returns: Extracted text organized by page number, same format as read_text.

Examples:

  • Read remote PDF: { url: "https://example.com/document.pdf" }

  • Untagged 2-column PDF: { url: "https://...", split_columns: 2 }

  • Japanese form: { url: "https://...", compact_whitespace: true }
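The pages syntax above ("1-5", "3", "1,3,5-7") can be parsed in a few lines. `parsePages` is a hypothetical helper for illustration, not the server's implementation:

```typescript
// Parse a page-range string like "1,3,5-7" into a sorted, de-duplicated
// list of page numbers.
function parsePages(spec: string): number[] {
  const pages = new Set<number>();
  for (const part of spec.split(',')) {
    const [start, end] = part.split('-').map((n) => parseInt(n.trim(), 10));
    if (Number.isNaN(start)) throw new Error(`Invalid page range: "${part}"`);
    // A bare "3" has no end, so the range collapses to a single page.
    for (let p = start; p <= (end ?? start); p++) pages.add(p);
  }
  return [...pages].sort((a, b) => a - b);
}
```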

Input Schema

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| url | Yes | URL pointing to a PDF file (HTTP or HTTPS) | |
| pages | No | Page range to process. Format: "1-5", "3", or "1,3,5-7". Omit for all pages. | |
| response_format | No | Output format: "markdown" for human-readable, "json" for structured data | markdown |
| split_columns | No | Number of columns to use when reordering text. 1 (default) = existing Y-sort. 2 or 3 = bucket by X-coordinate left-to-right. Use for untagged old-vs-new comparison tables (新旧対照表) or other multi-column PDFs where Y-sort would interleave columns. Tagged PDFs with proper `<Table>` markup should use extract_tables instead. | 1 |
| compact_whitespace | No | When true, collapse runs of whitespace (incl. fullwidth space U+3000) to a single ASCII space and trim each line. Reduces token consumption on Japanese form-style PDFs. | false |

Implementation Reference

  • The main handler function 'registerReadUrl' registers the 'read_url' tool. It fetches a PDF from a URL via fetchPdfFromUrl, loads it with pdfjs-service, extracts text with column/whitespace options, formats the output as markdown or JSON, truncates it if needed, and returns the result.
    export function registerReadUrl(server: McpServer): void {
      server.registerTool(
        'read_url',
        {
          title: 'Read PDF from URL',
          description: `Fetch a PDF from a URL and extract its text content.
    
    Downloads the PDF from the specified URL, then extracts text with Y-coordinate-based reading order. Supports HTTP and HTTPS. Maximum file size: 50MB. Timeout: 30 seconds.
    
    Like \`read_text\`, accepts \`split_columns: 2 | 3\` for **untagged** multi-column PDFs and \`compact_whitespace: true\` to collapse U+3000 / ASCII whitespace runs. Tagged PDFs should use \`extract_tables\` instead.
    
    Args:
      - url (string): URL pointing to a PDF file (HTTP or HTTPS)
      - pages (string, optional): Page range to extract. Format: "1-5", "3", or "1,3,5-7". Omit for all pages.
      - response_format ('markdown' | 'json'): Output format (default: 'markdown')
      - split_columns (1 | 2 | 3, optional): Column-aware reordering. Default 1 = existing Y-sort.
      - compact_whitespace (boolean, optional): Collapse whitespace runs (incl. U+3000) to one ASCII space. Default false.
    
    Returns:
      Extracted text organized by page number, same format as read_text.
    
    Examples:
      - Read remote PDF: { url: "https://example.com/document.pdf" }
      - Untagged 2-column PDF: { url: "https://...", split_columns: 2 }
      - Japanese form: { url: "https://...", compact_whitespace: true }`,
          inputSchema: ReadUrlSchema,
          annotations: {
            readOnlyHint: true,
            destructiveHint: false,
            idempotentHint: false,
            openWorldHint: true,
          },
        },
        async (params: ReadUrlInput) => {
          try {
            const data = await fetchPdfFromUrl(params.url);
            const doc = await loadDocumentFromData(data);
    
            try {
              const results = await extractTextFromDoc(doc, params.pages, {
                splitColumns: params.split_columns,
                compactWhitespace: params.compact_whitespace,
              });
    
              let text: string;
              if (params.response_format === ResponseFormat.JSON) {
                text = JSON.stringify(results, null, 2);
              } else {
                text = formatPageTextsMarkdown(results);
              }
    
              const { text: finalText } = truncateIfNeeded(text);
              return {
                content: [{ type: 'text' as const, text: finalText }],
              };
            } finally {
              await doc.destroy();
            }
          } catch (error) {
            const err = handleStructuredError(error);
            return {
              content: [{ type: 'text' as const, text: JSON.stringify(err, null, 2) }],
              isError: true,
            };
          }
        },
      );
    }
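The `truncateIfNeeded` helper used in the handler is not shown on this page. A minimal sketch of what it likely does; the character budget and truncation notice are assumptions, not the server's actual values:

```typescript
// Assumed output budget -- the real server's limit may differ.
const MAX_CHARS = 100_000;

// Cap output at a character budget and flag when text was cut off,
// so oversized extractions don't blow past the client's context window.
function truncateIfNeeded(text: string): { text: string; truncated: boolean } {
  if (text.length <= MAX_CHARS) return { text, truncated: false };
  return {
    text: text.slice(0, MAX_CHARS) + '\n\n[Output truncated]',
    truncated: true,
  };
}
```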
  • Zod schema 'ReadUrlSchema' defining the input validation for read_url: url (valid URL), pages (page range), response_format (markdown/json), split_columns (1-3), compact_whitespace (boolean).
    /** read_url */
    export const ReadUrlSchema = z
      .object({
        url: UrlSchema,
        pages: PagesSchema,
        response_format: ResponseFormatSchema,
        split_columns: SplitColumnsSchema,
        compact_whitespace: CompactWhitespaceSchema,
      })
      .strict();
  • Import and registration call of 'registerReadUrl' in the central tool registration entry point (registerAllTools).
    import { registerReadUrl } from './tier1/read-url.js';
    import { registerSearchText } from './tier1/search-text.js';
    import { registerSummarize } from './tier1/summarize.js';
    // Tier 2: Structure analysis
    import { registerExtractTables } from './tier2/extract-tables.js';
    import { registerInspectAnnotations } from './tier2/inspect-annotations.js';
    import { registerInspectFonts } from './tier2/inspect-fonts.js';
    import { registerInspectSignatures } from './tier2/inspect-signatures.js';
    import { registerInspectStructure } from './tier2/inspect-structure.js';
    import { registerInspectTags } from './tier2/inspect-tags.js';
    // Tier 3: Validation & analysis
    import { registerCompareStructure } from './tier3/compare-structure.js';
    import { registerValidateMetadata } from './tier3/validate-metadata.js';
    import { registerValidateTagged } from './tier3/validate-tagged.js';
    
    /**
     * Register all tools with the MCP server.
     */
    export function registerAllTools(server: McpServer): void {
      // Tier 1: Basic PDF operations
      registerGetPageCount(server);
      registerGetMetadata(server);
      registerReadText(server);
      registerSearchText(server);
      registerReadImages(server);
      registerReadUrl(server);
  • The 'fetchPdfFromUrl' helper downloads the PDF from a URL with a 30-second timeout, a 50MB size check, protocol validation (HTTP/HTTPS), and concurrency limiting.
    export async function fetchPdfFromUrl(url: string): Promise<Uint8Array> {
      return fetchLimit(() => fetchPdfFromUrlUnlimited(url));
    }
    
    /**
     * Raw fetch implementation that bypasses the limiter.
     * For tests, or for callers where limiting is already guaranteed.
     */
    export async function fetchPdfFromUrlUnlimited(url: string): Promise<Uint8Array> {
      let parsedUrl: URL;
      try {
        parsedUrl = new URL(url);
      } catch {
        throw new PdfReaderError(
          `Invalid URL: ${url}`,
          'INVALID_URL',
          'Provide a valid HTTP or HTTPS URL pointing to a PDF file.',
        );
      }
    
      if (!['http:', 'https:'].includes(parsedUrl.protocol)) {
        throw new PdfReaderError(
          `Unsupported protocol: ${parsedUrl.protocol}`,
          'UNSUPPORTED_PROTOCOL',
          'Only HTTP and HTTPS URLs are supported.',
        );
      }
    
      const response = await fetch(url, {
        headers: {
          Accept: 'application/pdf',
          'User-Agent': `${SERVER_NAME}/${SERVER_VERSION}`,
        },
        signal: AbortSignal.timeout(30_000),
      });
    
      if (!response.ok) {
        throw new PdfReaderError(
          `Failed to fetch PDF: HTTP ${response.status} ${response.statusText}`,
          'FETCH_FAILED',
          'Check the URL is correct and accessible.',
        );
      }
    
      const contentLength = response.headers.get('content-length');
      if (contentLength && parseInt(contentLength, 10) > MAX_FILE_SIZE) {
        throw new PdfReaderError(
          `Remote PDF size (${(parseInt(contentLength, 10) / 1024 / 1024).toFixed(1)}MB) exceeds the 50MB limit.`,
          'FILE_TOO_LARGE',
          'Try a smaller PDF file.',
        );
      }
    
      const buffer = await response.arrayBuffer();
      if (buffer.byteLength > MAX_FILE_SIZE) {
        throw new PdfReaderError(
          `Downloaded PDF size (${(buffer.byteLength / 1024 / 1024).toFixed(1)}MB) exceeds the 50MB limit.`,
          'FILE_TOO_LARGE',
          'Try a smaller PDF file.',
        );
      }
    
      return new Uint8Array(buffer);
    }
  • Concurrency limiter 'createLimit' used by url-fetcher to cap parallel read_url fetches (default 4), preventing remote host overload.
    export function createLimit(concurrency: number): LimitFn {
      if (!Number.isFinite(concurrency) || concurrency < 1) {
        throw new Error(`createLimit: concurrency must be >= 1, got ${concurrency}`);
      }
    
      let active = 0;
      const queue: Array<() => void> = [];
    
      const next = () => {
        if (active >= concurrency) return;
        const run = queue.shift();
        if (run) {
          active++;
          run();
        }
      };
    
      return <T>(task: () => Promise<T>): Promise<T> => {
        return new Promise<T>((resolve, reject) => {
          const exec = () => {
            task()
              .then(resolve, reject)
              .finally(() => {
                active--;
                next();
              });
          };
          queue.push(exec);
          next();
        });
      };
    }
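Pairing the limiter with a toy workload shows the cap in action. `createLimit` is repeated here in condensed form (argument validation elided) so the snippet stands alone; the workload is purely illustrative:

```typescript
type LimitFn = <T>(task: () => Promise<T>) => Promise<T>;

// Condensed copy of createLimit above (validation elided).
function createLimit(concurrency: number): LimitFn {
  let active = 0;
  const queue: Array<() => void> = [];
  const next = () => {
    if (active >= concurrency) return;
    const run = queue.shift();
    if (run) {
      active++;
      run();
    }
  };
  return <T>(task: () => Promise<T>): Promise<T> =>
    new Promise<T>((resolve, reject) => {
      queue.push(() => {
        task().then(resolve, reject).finally(() => {
          active--;
          next();
        });
      });
      next();
    });
}

// Run five 10ms tasks under a limit of 2 and record peak concurrency.
async function demo(): Promise<number> {
  const limit = createLimit(2);
  let running = 0;
  let peak = 0;
  const task = () =>
    limit(async () => {
      running++;
      peak = Math.max(peak, running);
      await new Promise((r) => setTimeout(r, 10));
      running--;
    });
  await Promise.all([task(), task(), task(), task(), task()]);
  return peak; // never exceeds the configured limit of 2
}
```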
Behavior: 5/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations already mark it as read-only and non-destructive. The description adds details: downloads PDF, Y-coordinate-based reading order, max 50MB, 30s timeout, and parameter behaviors (e.g., split_columns for column reordering, compact_whitespace for Japanese PDFs). No contradiction.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is succinct yet comprehensive: 4 short paragraphs, bullet-point args, examples. Front-loaded with main purpose. Every sentence adds value.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Despite no output schema, the description covers all 5 parameters, limits, timeout, format, and provides examples. It references read_text for return format, which is sufficient given the shared format. Highly complete.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 5/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

With 100% schema coverage, baseline is 3, but description adds significant meaning: explains split_columns' use for untagged multi-column PDFs, compact_whitespace for reducing tokens in Japanese forms, and response_format options. Examples illustrate usage.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose: 'Fetch a PDF from a URL and extract its text content.' It distinguishes from siblings by mentioning 'Like read_text' and explicitly directing tagged PDFs to 'extract_tables'.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 5/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides explicit guidance on when to use this tool vs alternatives: 'Tagged PDFs should use extract_tables instead.' It also implies use for URL-based PDFs and reference to read_text for similar functionality.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
