Glama

read_pdf

Extract text, metadata, and page count from PDF files or URLs, with options to specify pages or ranges for targeted content retrieval using the PDF Reader MCP Server.

Instructions

Reads content/metadata from one or more PDFs (local/URL). Each source can specify pages to extract.

Input Schema

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| include_full_text | No | Include the full text content of each PDF (only if 'pages' is not specified for that source). | |
| include_metadata | No | Include metadata and info objects for each PDF. | |
| include_page_count | No | Include the total number of pages for each PDF. | |
| sources | Yes | An array of PDF sources to process, each can optionally specify pages. | |
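
As a rough illustration of how these parameters combine, here is a minimal sketch of a `read_pdf` arguments object. The per-source field names `path`, `url`, and `pages` are assumptions (the source schema is not shown on this page), and `fullTextApplies` is a hypothetical helper that only restates the documented rule that full text is included when `pages` is not specified for a source.

```typescript
// Hypothetical shape of one entry in `sources`. The field names `path`, `url`,
// and `pages` are assumptions; the real per-source schema is not shown here.
interface PdfSource {
  path?: string;
  url?: string;
  pages?: string | number[];
}

// Top-level arguments, using the parameter names from the input schema.
interface ReadPdfArgs {
  sources: PdfSource[];
  include_full_text?: boolean;
  include_metadata?: boolean;
  include_page_count?: boolean;
}

// Hypothetical helper: full text is only relevant for a source when
// include_full_text is set and that source does not specify `pages`.
function fullTextApplies(args: ReadPdfArgs, source: PdfSource): boolean {
  return (args.include_full_text ?? false) && source.pages === undefined;
}

const exampleArgs: ReadPdfArgs = {
  sources: [
    { path: './docs/report.pdf', pages: '1-3' }, // targeted page range (assumed syntax)
    { url: 'https://example.com/paper.pdf' }, // whole document
  ],
  include_full_text: true,
  include_metadata: true,
  include_page_count: true,
};

for (const source of exampleArgs.sources) {
  console.log(source.path ?? source.url, fullTextApplies(exampleArgs, source));
}
```

In this sketch the first source requests specific pages, so the full-text flag does not apply to it, while the second source has no `pages` entry and would receive full text.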

Implementation Reference

  • Main implementation of the read_pdf tool handler. Orchestrates PDF loading, page processing, text/image extraction, and returns structured content using MCP SDK.
    export const readPdf = tool()
      .description(
        'Reads content/metadata/images from one or more PDFs (local/URL). Each source can specify pages to extract.'
      )
      .input(readPdfArgsSchema)
      .handler(async ({ input }) => {
        const { sources, include_full_text, include_metadata, include_page_count, include_images } =
          input;
    
        // Process sources with concurrency limit to prevent memory exhaustion
        // Processing large PDFs concurrently can consume significant memory
        const MAX_CONCURRENT_SOURCES = 3;
        const results: PdfSourceResult[] = [];
        const options = {
          includeFullText: include_full_text ?? false,
          includeMetadata: include_metadata ?? true,
          includePageCount: include_page_count ?? true,
          includeImages: include_images ?? false,
        };
    
        for (let i = 0; i < sources.length; i += MAX_CONCURRENT_SOURCES) {
          const batch = sources.slice(i, i + MAX_CONCURRENT_SOURCES);
          const batchResults = await Promise.all(
            batch.map((source) => processSingleSource(source, options))
          );
          results.push(...batchResults);
        }
    
        // Check if all sources failed
        const allFailed = results.every((r) => !r.success);
        if (allFailed) {
          const errorMessages = results.map((r) => r.error).join('; ');
          return toolError(`All PDF sources failed to process: ${errorMessages}`);
        }
    
        // Build content parts - start with structured JSON for backward compatibility
        const content: Array<ReturnType<typeof text> | ReturnType<typeof image>> = [];
    
        // Strip image data and page_contents from JSON to keep it manageable
        const resultsForJson = results.map((result) => {
          if (result.data) {
            const { images, page_contents, ...dataWithoutBinaryContent } = result.data;
            // Include image count and metadata in JSON, but not the base64 data
            if (images) {
              const imageInfo = images.map((img) => ({
                page: img.page,
                index: img.index,
                width: img.width,
                height: img.height,
                format: img.format,
              }));
              return { ...result, data: { ...dataWithoutBinaryContent, image_info: imageInfo } };
            }
            return { ...result, data: dataWithoutBinaryContent };
          }
          return result;
        });
    
        // First content part: Structured JSON results
        content.push(text(JSON.stringify({ results: resultsForJson }, null, 2)));
    
        // Add page content in exact Y-coordinate order
        for (const result of results) {
          if (!result.success || !result.data?.page_contents) continue;
    
          // Process each page's content items in order
          for (const pageContent of result.data.page_contents) {
            for (const item of pageContent.items) {
              if (item.type === 'text' && item.textContent) {
                // Add text content part
                content.push(text(item.textContent));
              } else if (item.type === 'image' && item.imageData) {
                // Add image content part (all images are now encoded as PNG)
                content.push(image(item.imageData.data, 'image/png'));
              }
            }
          }
        }
    
        return content;
      });
  • Vex schema for validating input arguments to the read_pdf tool, including sources, page selections, and inclusion flags.
    export const readPdfArgsSchema = object({
      sources: array(pdfSourceSchema),
      include_full_text: optional(
        bool(
          description(
            "Include the full text content of each PDF (only if 'pages' is not specified for that source)."
          )
        )
      ),
      include_metadata: optional(bool(description('Include metadata and info objects for each PDF.'))),
      include_page_count: optional(
        bool(description('Include the total number of pages for each PDF.'))
      ),
      include_images: optional(
        bool(
          description('Extract and include embedded images from the PDF pages as base64-encoded data.')
        )
      ),
    });
  • src/index.ts:4-13 (registration)
    Import and registration of the read_pdf tool in the MCP server creation.
    import { readPdf } from './handlers/readPdf.js';
    
    const server = createServer({
      name: 'pdf-reader-mcp',
      version: '1.3.0',
      instructions:
        'MCP Server for reading PDF files and extracting text, metadata, images, and page information.',
      tools: { read_pdf: readPdf },
      transport: stdio(),
    });
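
The batching loop in the handler above can be isolated as a small generic helper. The following is a minimal sketch, not the server's actual code: `processInBatches` and `fakeProcess` are hypothetical names, and `fakeProcess` stands in for `processSingleSource` only to record how many calls run at once.

```typescript
// Runs `worker` over `items` in consecutive batches of at most `batchSize`,
// waiting for every item in a batch to finish before starting the next batch.
async function processInBatches<T, R>(
  items: readonly T[],
  batchSize: number,
  worker: (item: T) => Promise<R>
): Promise<R[]> {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError('batchSize must be a positive integer');
  }
  const results: R[] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    // Promise.all preserves input order, so results line up with items.
    const batchResults = await Promise.all(batch.map((item) => worker(item)));
    results.push(...batchResults);
  }
  return results;
}

// Hypothetical stand-in for processSingleSource that tracks peak concurrency.
let active = 0;
let peak = 0;
async function fakeProcess(name: string): Promise<{ source: string; success: boolean }> {
  active += 1;
  peak = Math.max(peak, active);
  await new Promise<void>((resolve) => setTimeout(resolve, 5));
  active -= 1;
  return { source: name, success: true };
}

const demo = processInBatches(['a.pdf', 'b.pdf', 'c.pdf', 'd.pdf', 'e.pdf'], 3, fakeProcess);
demo.then((results) => {
  console.log(results.map((r) => r.source).join(', '), 'peak =', peak);
});
```

One design consequence of this pattern is that a slow item holds up its whole batch, since the next batch starts only after `Promise.all` resolves. `Promise.all` also rejects as soon as any worker rejects, so this approach fits workers that report failure in their return value (as the `success` flag checked in the handler suggests) rather than by throwing.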

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/SylphxAI/pdf-reader-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.