
Read PDF from URL

read_url
Read-only

Fetch a PDF from a URL and extract its text content with options for column reordering and whitespace compaction.

Instructions

Fetch a PDF from a URL and extract its text content.

Downloads the PDF from the specified URL, then extracts text with Y-coordinate-based reading order. Supports HTTP and HTTPS. Maximum file size: 50MB. Timeout: 30 seconds.

Like read_text, accepts split_columns: 2 | 3 for untagged multi-column PDFs and compact_whitespace: true to collapse U+3000 / ASCII whitespace runs. Tagged PDFs should use extract_tables instead.
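The two options can be pictured with a small sketch, assuming the usual pdf.js-style page coordinates (origin at bottom-left, Y increasing upward). `compactWhitespace` and `bucketByColumn` are illustrative names, not the server's actual internals:

```typescript
// Collapse runs of whitespace (including fullwidth space U+3000) to one
// ASCII space and trim -- mirrors what compact_whitespace: true does.
function compactWhitespace(line: string): string {
  return line.replace(/[\s\u3000]+/g, ' ').trim();
}

// A positioned text fragment, as pdf.js-style extraction would yield.
interface TextItem {
  x: number;
  y: number;
  str: string;
}

// Bucket items into N columns by X coordinate, then Y-sort within each
// bucket -- the idea behind split_columns: 2 | 3. Plain Y-sort would
// interleave the columns; bucketing reads each column top to bottom.
function bucketByColumn(items: TextItem[], columns: number, pageWidth: number): string {
  const buckets: TextItem[][] = Array.from({ length: columns }, () => []);
  for (const item of items) {
    const col = Math.min(columns - 1, Math.floor((item.x / pageWidth) * columns));
    buckets[col].push(item);
  }
  // Descending Y = top-to-bottom, since PDF Y grows upward.
  return buckets
    .map((b) => b.sort((a, c) => c.y - a.y).map((i) => i.str).join(' '))
    .join('\n');
}
```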

Args:

  • url (string): URL pointing to a PDF file (HTTP or HTTPS)

  • pages (string, optional): Page range to extract. Format: "1-5", "3", or "1,3,5-7". Omit for all pages.

  • response_format ('markdown' | 'json'): Output format (default: 'markdown')

  • split_columns (1 | 2 | 3, optional): Column-aware reordering. Default 1 = existing Y-sort.

  • compact_whitespace (boolean, optional): Collapse whitespace runs (incl. U+3000) to one ASCII space. Default false.

Returns: Extracted text organized by page number, same format as read_text.

Examples:

  • Read remote PDF: { url: "https://example.com/document.pdf" }

  • Untagged 2-column PDF: { url: "https://...", split_columns: 2 }

  • Japanese form: { url: "https://...", compact_whitespace: true }
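The pages syntax above ("1-5", "3", "1,3,5-7") can be parsed in a few lines. `parsePages` is a hypothetical helper for illustration, not the server's implementation:

```typescript
// Parse a page-range string like "1,3,5-7" into a sorted, de-duplicated
// list of page numbers.
function parsePages(spec: string): number[] {
  const pages = new Set<number>();
  for (const part of spec.split(',')) {
    const [start, end] = part.split('-').map((n) => parseInt(n.trim(), 10));
    if (Number.isNaN(start)) throw new Error(`Invalid page range: "${part}"`);
    // A bare "3" has no end, so the range collapses to a single page.
    for (let p = start; p <= (end ?? start); p++) pages.add(p);
  }
  return [...pages].sort((a, b) => a - b);
}
```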

Input Schema

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| url | Yes | URL pointing to a PDF file (HTTP or HTTPS) | |
| pages | No | Page range to process. Format: "1-5", "3", or "1,3,5-7". Omit for all pages. | |
| response_format | No | Output format: "markdown" for human-readable, "json" for structured data | markdown |
| split_columns | No | Number of columns to use when reordering text. 1 (default) = existing Y-sort. 2 or 3 = bucket by X-coordinate left-to-right. Use for untagged old-vs-new comparison tables (新旧対照表) or other multi-column PDFs where Y-sort would interleave columns. Tagged PDFs with proper `<Table>` markup should use extract_tables instead. | 1 |
| compact_whitespace | No | When true, collapse runs of whitespace (incl. fullwidth space U+3000) to a single ASCII space and trim each line. Reduces token consumption on Japanese form-style PDFs. | false |

Implementation Reference

  • The main handler function 'registerReadUrl' registers the 'read_url' tool. It fetches a PDF from a URL via fetchPdfFromUrl, loads it with pdfjs-service, extracts text with column/whitespace options, formats the output as markdown or JSON, truncates it if needed, and returns the result.
    export function registerReadUrl(server: McpServer): void {
      server.registerTool(
        'read_url',
        {
          title: 'Read PDF from URL',
          description: `Fetch a PDF from a URL and extract its text content.
    
    Downloads the PDF from the specified URL, then extracts text with Y-coordinate-based reading order. Supports HTTP and HTTPS. Maximum file size: 50MB. Timeout: 30 seconds.
    
    Like \`read_text\`, accepts \`split_columns: 2 | 3\` for **untagged** multi-column PDFs and \`compact_whitespace: true\` to collapse U+3000 / ASCII whitespace runs. Tagged PDFs should use \`extract_tables\` instead.
    
    Args:
      - url (string): URL pointing to a PDF file (HTTP or HTTPS)
      - pages (string, optional): Page range to extract. Format: "1-5", "3", or "1,3,5-7". Omit for all pages.
      - response_format ('markdown' | 'json'): Output format (default: 'markdown')
      - split_columns (1 | 2 | 3, optional): Column-aware reordering. Default 1 = existing Y-sort.
      - compact_whitespace (boolean, optional): Collapse whitespace runs (incl. U+3000) to one ASCII space. Default false.
    
    Returns:
      Extracted text organized by page number, same format as read_text.
    
    Examples:
      - Read remote PDF: { url: "https://example.com/document.pdf" }
      - Untagged 2-column PDF: { url: "https://...", split_columns: 2 }
      - Japanese form: { url: "https://...", compact_whitespace: true }`,
          inputSchema: ReadUrlSchema,
          annotations: {
            readOnlyHint: true,
            destructiveHint: false,
            idempotentHint: false,
            openWorldHint: true,
          },
        },
        async (params: ReadUrlInput) => {
          try {
            const data = await fetchPdfFromUrl(params.url);
            const doc = await loadDocumentFromData(data);
    
            try {
              const results = await extractTextFromDoc(doc, params.pages, {
                splitColumns: params.split_columns,
                compactWhitespace: params.compact_whitespace,
              });
    
              let text: string;
              if (params.response_format === ResponseFormat.JSON) {
                text = JSON.stringify(results, null, 2);
              } else {
                text = formatPageTextsMarkdown(results);
              }
    
              const { text: finalText } = truncateIfNeeded(text);
              return {
                content: [{ type: 'text' as const, text: finalText }],
              };
            } finally {
              await doc.destroy();
            }
          } catch (error) {
            const err = handleStructuredError(error);
            return {
              content: [{ type: 'text' as const, text: JSON.stringify(err, null, 2) }],
              isError: true,
            };
          }
        },
      );
    }
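The `truncateIfNeeded` helper used in the handler is not shown on this page. A minimal sketch of what it likely does; the character budget and truncation notice are assumptions, not the server's actual values:

```typescript
// Assumed output budget -- the real server's limit may differ.
const MAX_CHARS = 100_000;

// Cap output at a character budget and flag when text was cut off,
// so oversized extractions don't blow past the client's context window.
function truncateIfNeeded(text: string): { text: string; truncated: boolean } {
  if (text.length <= MAX_CHARS) return { text, truncated: false };
  return {
    text: text.slice(0, MAX_CHARS) + '\n\n[Output truncated]',
    truncated: true,
  };
}
```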
  • Zod schema 'ReadUrlSchema' defining the input validation for read_url: url (valid URL), pages (page range), response_format (markdown/json), split_columns (1-3), compact_whitespace (boolean).
    /** read_url */
    export const ReadUrlSchema = z
      .object({
        url: UrlSchema,
        pages: PagesSchema,
        response_format: ResponseFormatSchema,
        split_columns: SplitColumnsSchema,
        compact_whitespace: CompactWhitespaceSchema,
      })
      .strict();
  • Import and registration call of 'registerReadUrl' in the central tool registration entry point (registerAllTools).
    import { registerReadUrl } from './tier1/read-url.js';
    import { registerSearchText } from './tier1/search-text.js';
    import { registerSummarize } from './tier1/summarize.js';
    // Tier 2: Structure analysis
    import { registerExtractTables } from './tier2/extract-tables.js';
    import { registerInspectAnnotations } from './tier2/inspect-annotations.js';
    import { registerInspectFonts } from './tier2/inspect-fonts.js';
    import { registerInspectSignatures } from './tier2/inspect-signatures.js';
    import { registerInspectStructure } from './tier2/inspect-structure.js';
    import { registerInspectTags } from './tier2/inspect-tags.js';
    // Tier 3: Validation & analysis
    import { registerCompareStructure } from './tier3/compare-structure.js';
    import { registerValidateMetadata } from './tier3/validate-metadata.js';
    import { registerValidateTagged } from './tier3/validate-tagged.js';
    
    /**
     * Register all tools with the MCP server.
     */
    export function registerAllTools(server: McpServer): void {
      // Tier 1: Basic PDF operations
      registerGetPageCount(server);
      registerGetMetadata(server);
      registerReadText(server);
      registerSearchText(server);
      registerReadImages(server);
      registerReadUrl(server);
  • The 'fetchPdfFromUrl' helper downloads the PDF from a URL with a 30-second timeout, a 50MB size check, protocol validation (HTTP/HTTPS), and concurrency limiting.
    export async function fetchPdfFromUrl(url: string): Promise<Uint8Array> {
      return fetchLimit(() => fetchPdfFromUrlUnlimited(url));
    }
    
    /**
     * Raw fetch implementation that bypasses the limiter.
     * For tests, or for callers where limiting is already guaranteed.
     */
    export async function fetchPdfFromUrlUnlimited(url: string): Promise<Uint8Array> {
      let parsedUrl: URL;
      try {
        parsedUrl = new URL(url);
      } catch {
        throw new PdfReaderError(
          `Invalid URL: ${url}`,
          'INVALID_URL',
          'Provide a valid HTTP or HTTPS URL pointing to a PDF file.',
        );
      }
    
      if (!['http:', 'https:'].includes(parsedUrl.protocol)) {
        throw new PdfReaderError(
          `Unsupported protocol: ${parsedUrl.protocol}`,
          'UNSUPPORTED_PROTOCOL',
          'Only HTTP and HTTPS URLs are supported.',
        );
      }
    
      const response = await fetch(url, {
        headers: {
          Accept: 'application/pdf',
          'User-Agent': `${SERVER_NAME}/${SERVER_VERSION}`,
        },
        signal: AbortSignal.timeout(30_000),
      });
    
      if (!response.ok) {
        throw new PdfReaderError(
          `Failed to fetch PDF: HTTP ${response.status} ${response.statusText}`,
          'FETCH_FAILED',
          'Check the URL is correct and accessible.',
        );
      }
    
      const contentLength = response.headers.get('content-length');
      if (contentLength && parseInt(contentLength, 10) > MAX_FILE_SIZE) {
        throw new PdfReaderError(
          `Remote PDF size (${(parseInt(contentLength, 10) / 1024 / 1024).toFixed(1)}MB) exceeds the 50MB limit.`,
          'FILE_TOO_LARGE',
          'Try a smaller PDF file.',
        );
      }
    
      const buffer = await response.arrayBuffer();
      if (buffer.byteLength > MAX_FILE_SIZE) {
        throw new PdfReaderError(
          `Downloaded PDF size (${(buffer.byteLength / 1024 / 1024).toFixed(1)}MB) exceeds the 50MB limit.`,
          'FILE_TOO_LARGE',
          'Try a smaller PDF file.',
        );
      }
    
      return new Uint8Array(buffer);
    }
  • Concurrency limiter 'createLimit' used by url-fetcher to cap parallel read_url fetches (default 4), preventing remote host overload.
    export function createLimit(concurrency: number): LimitFn {
      if (!Number.isFinite(concurrency) || concurrency < 1) {
        throw new Error(`createLimit: concurrency must be >= 1, got ${concurrency}`);
      }
    
      let active = 0;
      const queue: Array<() => void> = [];
    
      const next = () => {
        if (active >= concurrency) return;
        const run = queue.shift();
        if (run) {
          active++;
          run();
        }
      };
    
      return <T>(task: () => Promise<T>): Promise<T> => {
        return new Promise<T>((resolve, reject) => {
          const exec = () => {
            task()
              .then(resolve, reject)
              .finally(() => {
                active--;
                next();
              });
          };
          queue.push(exec);
          next();
        });
      };
    }
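Pairing the limiter with a toy workload shows the cap in action. `createLimit` is repeated here in condensed form (argument validation elided) so the snippet stands alone; the workload is purely illustrative:

```typescript
type LimitFn = <T>(task: () => Promise<T>) => Promise<T>;

// Condensed copy of createLimit above (validation elided).
function createLimit(concurrency: number): LimitFn {
  let active = 0;
  const queue: Array<() => void> = [];
  const next = () => {
    if (active >= concurrency) return;
    const run = queue.shift();
    if (run) {
      active++;
      run();
    }
  };
  return <T>(task: () => Promise<T>): Promise<T> =>
    new Promise<T>((resolve, reject) => {
      queue.push(() => {
        task().then(resolve, reject).finally(() => {
          active--;
          next();
        });
      });
      next();
    });
}

// Run five 10ms tasks under a limit of 2 and record peak concurrency.
async function demo(): Promise<number> {
  const limit = createLimit(2);
  let running = 0;
  let peak = 0;
  const task = () =>
    limit(async () => {
      running++;
      peak = Math.max(peak, running);
      await new Promise((r) => setTimeout(r, 10));
      running--;
    });
  await Promise.all([task(), task(), task(), task(), task()]);
  return peak; // never exceeds the configured limit of 2
}
```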
Behavior: 5/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations already mark it as read-only and non-destructive. The description adds details: downloads PDF, Y-coordinate-based reading order, max 50MB, 30s timeout, and parameter behaviors (e.g., split_columns for column reordering, compact_whitespace for Japanese PDFs). No contradiction.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is succinct yet comprehensive: 4 short paragraphs, bullet-point args, examples. Front-loaded with main purpose. Every sentence adds value.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Despite no output schema, the description covers all 5 parameters, limits, timeout, format, and provides examples. It references read_text for return format, which is sufficient given the shared format. Highly complete.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 5/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

With 100% schema coverage, baseline is 3, but description adds significant meaning: explains split_columns' use for untagged multi-column PDFs, compact_whitespace for reducing tokens in Japanese forms, and response_format options. Examples illustrate usage.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose: 'Fetch a PDF from a URL and extract its text content.' It distinguishes from siblings by mentioning 'Like read_text' and explicitly directing tagged PDFs to 'extract_tables'.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 5/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides explicit guidance on when to use this tool vs alternatives: 'Tagged PDFs should use extract_tables instead.' It also implies use for URL-based PDFs and reference to read_text for similar functionality.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
