read_pdf

Extract text, metadata, and page counts from local PDF files or URLs with the PDF Reader MCP Server. Each source can be limited to specific pages or page ranges.

Instructions

Reads content/metadata from one or more PDFs (local/URL). Each source can specify pages to extract.

Input Schema

Name	Required	Description
`include_full_text`	No	Include the full text content of each PDF (only if 'pages' is not specified for that source).
`include_metadata`	No	Include metadata and info objects for each PDF.
`include_page_count`	No	Include the total number of pages for each PDF.
`sources`	Yes	An array of PDF sources to process, each can optionally specify pages.
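
As a rough illustration of how these arguments fit together, the sketch below builds one input object and resolves the optional flags. The defaults (`false` for full text, `true` for metadata and page count) mirror the `??` fallbacks in the handler listed under Implementation Reference. The shape of each `sources` entry is not shown on this page, so the `path`, `url`, and `pages` fields, the `PdfSource` and `ReadPdfArgs` interfaces, and the `resolveOptions` helper are all assumptions made for the example.

```typescript
// Hypothetical shape of one `sources` entry; the real pdfSourceSchema is not shown on this page.
interface PdfSource {
  path?: string; // assumed: local file path
  url?: string; // assumed: remote PDF URL
  pages?: string | number[]; // assumed: page selection such as "1-3" or [1, 2]
}

// The documented top-level arguments of read_pdf.
interface ReadPdfArgs {
  sources: PdfSource[];
  include_full_text?: boolean;
  include_metadata?: boolean;
  include_page_count?: boolean;
}

// Illustrative helper: applies the same fallbacks the handler uses for omitted flags.
function resolveOptions(args: ReadPdfArgs) {
  return {
    includeFullText: args.include_full_text ?? false,
    includeMetadata: args.include_metadata ?? true,
    includePageCount: args.include_page_count ?? true,
  };
}

// Example input: one local file limited to a page range, one remote file read in full.
const exampleArgs: ReadPdfArgs = {
  sources: [
    { path: './docs/report.pdf', pages: '1-3' },
    { url: 'https://example.com/paper.pdf' },
  ],
  include_full_text: true,
};

const resolved = resolveOptions(exampleArgs);
console.log(resolved);
```

Because `include_full_text` only applies to sources without a `pages` selection, the first source in this example would return just the requested pages, while the second would return its full text.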

Implementation Reference

src/handlers/readPdf.ts:145-225 (handler)
Main implementation of the read_pdf tool handler. Orchestrates PDF loading, page processing, text/image extraction, and returns structured content using MCP SDK.
    export const readPdf = tool()
      .description(
        'Reads content/metadata/images from one or more PDFs (local/URL). Each source can specify pages to extract.'
      )
      .input(readPdfArgsSchema)
      .handler(async ({ input }) => {
        const { sources, include_full_text, include_metadata, include_page_count, include_images } =
          input;

        // Process sources with concurrency limit to prevent memory exhaustion
        // Processing large PDFs concurrently can consume significant memory
        const MAX_CONCURRENT_SOURCES = 3;
        const results: PdfSourceResult[] = [];
        const options = {
          includeFullText: include_full_text ?? false,
          includeMetadata: include_metadata ?? true,
          includePageCount: include_page_count ?? true,
          includeImages: include_images ?? false,
        };

        for (let i = 0; i < sources.length; i += MAX_CONCURRENT_SOURCES) {
          const batch = sources.slice(i, i + MAX_CONCURRENT_SOURCES);
          const batchResults = await Promise.all(
            batch.map((source) => processSingleSource(source, options))
          );
          results.push(...batchResults);
        }

        // Check if all sources failed
        const allFailed = results.every((r) => !r.success);
        if (allFailed) {
          const errorMessages = results.map((r) => r.error).join('; ');
          return toolError(`All PDF sources failed to process: ${errorMessages}`);
        }

        // Build content parts - start with structured JSON for backward compatibility
        const content: Array<ReturnType<typeof text> | ReturnType<typeof image>> = [];

        // Strip image data and page_contents from JSON to keep it manageable
        const resultsForJson = results.map((result) => {
          if (result.data) {
            const { images, page_contents, ...dataWithoutBinaryContent } = result.data;
            // Include image count and metadata in JSON, but not the base64 data
            if (images) {
              const imageInfo = images.map((img) => ({
                page: img.page,
                index: img.index,
                width: img.width,
                height: img.height,
                format: img.format,
              }));
              return { ...result, data: { ...dataWithoutBinaryContent, image_info: imageInfo } };
            }
            return { ...result, data: dataWithoutBinaryContent };
          }
          return result;
        });

        // First content part: Structured JSON results
        content.push(text(JSON.stringify({ results: resultsForJson }, null, 2)));

        // Add page content in exact Y-coordinate order
        for (const result of results) {
          if (!result.success || !result.data?.page_contents) continue;
          // Process each page's content items in order
          for (const pageContent of result.data.page_contents) {
            for (const item of pageContent.items) {
              if (item.type === 'text' && item.textContent) {
                // Add text content part
                content.push(text(item.textContent));
              } else if (item.type === 'image' && item.imageData) {
                // Add image content part (all images are now encoded as PNG)
                content.push(image(item.imageData.data, 'image/png'));
              }
            }
          }
        }

        return content;
      });
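
The batching loop in the handler can be read in isolation as a small generic pattern: slice the inputs into groups of `MAX_CONCURRENT_SOURCES`, await each group with `Promise.all`, and only then start the next group. The sketch below is a standalone approximation of that idea, not code from the repository; the `processInBatches` name and its signature are assumptions made for illustration.

```typescript
// Run `worker` over `items` in fixed-size batches.
// All items in a batch run concurrently; the next batch starts only after the current one settles.
async function processInBatches<T, R>(
  items: readonly T[],
  batchSize: number,
  worker: (item: T) => Promise<R>
): Promise<R[]> {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError('batchSize must be a positive integer');
  }
  const results: R[] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    // Promise.all preserves input order, so results line up with `items`.
    const batchResults = await Promise.all(batch.map((item) => worker(item)));
    results.push(...batchResults);
  }
  return results;
}

// Usage: at most 3 simulated "sources" are in flight at any time.
async function demo(): Promise<number[]> {
  const simulateSource = async (n: number): Promise<number> => {
    await new Promise<void>((resolve) => setTimeout(resolve, 5));
    return n * 2;
  };
  return processInBatches([1, 2, 3, 4, 5], 3, simulateSource);
}
```

One consequence of this design is that `Promise.all` rejects as soon as any worker rejects. The handler checks `r.success` and `r.error` on each result afterwards, which suggests `processSingleSource` reports failures as result values rather than throwing, so one bad source does not discard the rest of its batch.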
src/schemas/readPdf.ts:32-50 (schema)
Vex schema for validating input arguments to the read_pdf tool, including sources, page selections, and inclusion flags.
    export const readPdfArgsSchema = object({
      sources: array(pdfSourceSchema),
      include_full_text: optional(
        bool(
          description(
            "Include the full text content of each PDF (only if 'pages' is not specified for that source)."
          )
        )
      ),
      include_metadata: optional(bool(description('Include metadata and info objects for each PDF.'))),
      include_page_count: optional(
        bool(description('Include the total number of pages for each PDF.'))
      ),
      include_images: optional(
        bool(
          description('Extract and include embedded images from the PDF pages as base64-encoded data.')
        )
      ),
    });
src/index.ts:4-13 (registration)
Import and registration of the read_pdf tool in the MCP server creation.
    import { readPdf } from './handlers/readPdf.js';

    const server = createServer({
      name: 'pdf-reader-mcp',
      version: '1.3.0',
      instructions:
        'MCP Server for reading PDF files and extracting text, metadata, images, and page information.',
      tools: { read_pdf: readPdf },
      transport: stdio(),
    });
