read_pdf_text

Extract text from PDF files to access content for editing, analysis, or conversion. Optionally limit extraction to a single page, preserve the original layout, or choose the output encoding.

Instructions

Extract text content from a PDF file using pdftotext from poppler-utils

Input Schema

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| `path` | Yes | Path to the PDF file (relative to the current working directory, or absolute) | |
| `page` | No | Page number to extract (1-based indexing). If not specified, all pages are extracted. | |
| `layout` | No | Preserve the original text layout formatting | false |
| `encoding` | No | Text encoding for output; one of UTF-8, Latin1, ASCII | UTF-8 |
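
To make the defaults in the table concrete, here is a minimal sketch of how a caller's arguments resolve when options are omitted. `resolveArgs` is a hypothetical helper and the file paths are made up for illustration; the sketch only mirrors the default values listed above.

```javascript
// Hypothetical helper: fills in the documented defaults for omitted options.
function resolveArgs(args) {
  const { path, page, layout = false, encoding = 'UTF-8' } = args;
  // 'all' stands in for "no page given", i.e. every page is extracted.
  return { path, page: page ?? 'all', layout, encoding };
}

// Extract page 3 only, keeping the column layout.
const single = resolveArgs({ path: 'reports/q3.pdf', page: 3, layout: true });

// Extract every page with all defaults.
const whole = resolveArgs({ path: '/tmp/manual.pdf' });

console.log(single);
console.log(whole);
```

Only `path` is required; every other field can be left out and falls back to the value in the Default column.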

Implementation Reference

src/server.js:165-277 (handler)
The primary handler function for the 'read_pdf_text' tool. It destructures arguments, validates the PDF file and pdftotext availability, constructs and executes the pdftotext command with options for page, layout, and encoding, computes metadata and statistics, and returns a structured JSON response or error.
    async handleReadPdfText(args) {
      try {
        const { path: filePath, page, layout = false, encoding = 'UTF-8' } = args;

        // Check if pdftotext is available
        if (!this.checkPdftotextAvailable()) {
          throw new Error(
            'pdftotext is not available. Please install poppler-utils:\n' +
              ' Ubuntu/Debian: sudo apt install poppler-utils\n' +
              ' macOS: brew install poppler\n' +
              ' Windows: choco install poppler'
          );
        }

        // Validate the PDF file
        this.validatePdfFile(filePath);

        // Build pdftotext command
        const args_array = ['pdftotext'];

        // Add encoding if specified
        if (encoding !== 'UTF-8') {
          args_array.push('-enc', encoding);
        }

        // Add layout preservation if requested
        if (layout) {
          args_array.push('-layout');
        }

        // Add page specification if provided
        if (page) {
          args_array.push('-f', page.toString(), '-l', page.toString());
        }

        // Add input file and output to stdout
        args_array.push(filePath, '-');

        // Execute pdftotext
        const text = execSync(args_array.join(' '), {
          encoding: 'utf8',
          maxBuffer: 50 * 1024 * 1024, // 50MB buffer for very large PDFs
          timeout: 30000, // 30 second timeout
        });

        // Get file metadata
        const stats = fs.statSync(filePath);
        const fileName = path.basename(filePath);
        const fileDir = path.dirname(path.resolve(filePath));

        // Prepare response
        const response = {
          success: true,
          file: fileName,
          path: path.resolve(filePath),
          directory: fileDir,
          extractedText: text.trim(),
          pageSpecific: page || 'all',
          layoutPreserved: layout,
          encoding: encoding,
          fileSize: stats.size,
          lastModified: stats.mtime.toISOString(),
          extractedAt: new Date().toISOString(),
          textLength: text.trim().length,
          wordCount: text.trim().split(/\s+/).filter(word => word.length > 0).length,
        };

        return {
          content: [
            {
              type: 'text',
              text: JSON.stringify(response, null, 2),
            },
          ],
        };
      } catch (error) {
        // Prepare error response
        const errorResponse = {
          success: false,
          error: error.message,
          file: args.path || 'unknown',
          timestamp: new Date().toISOString(),
        };

        // Add specific error context if available
        if (error.code === 'ENOENT') {
          errorResponse.errorType = 'FILE_NOT_FOUND';
        } else if (error.code === 'EACCES') {
          errorResponse.errorType = 'PERMISSION_DENIED';
        } else if (error.message.includes('pdftotext')) {
          errorResponse.errorType = 'PDFTOTEXT_ERROR';
        } else if (error.message.includes('PDF')) {
          errorResponse.errorType = 'INVALID_PDF';
        } else {
          errorResponse.errorType = 'UNKNOWN_ERROR';
        }

        return {
          content: [
            {
              type: 'text',
              text: JSON.stringify(errorResponse, null, 2),
            },
          ],
        };
      }
    }
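
The handler joins its arguments into one string and passes it to `execSync`, which runs the command through a shell, so a path containing spaces or shell metacharacters would likely be split or interpreted. One alternative, sketched below under the assumption that the same pdftotext flags are wanted, is to keep the arguments as an array and call `execFileSync`. `buildPdftotextArgs` and `extractText` are hypothetical names, not part of the server.

```javascript
const { execFileSync } = require('node:child_process');

// Builds the argument vector for pdftotext (without the program name).
// Flag order mirrors the handler: -enc, -layout, -f/-l, input file, '-'.
function buildPdftotextArgs({ path, page, layout = false, encoding = 'UTF-8' }) {
  const argv = [];
  if (encoding !== 'UTF-8') argv.push('-enc', encoding);
  if (layout) argv.push('-layout');
  if (page) argv.push('-f', String(page), '-l', String(page));
  argv.push(path, '-'); // '-' sends the extracted text to stdout
  return argv;
}

// Hypothetical wrapper: runs pdftotext directly, without a shell,
// using the same buffer and timeout limits as the handler.
function extractText(options) {
  return execFileSync('pdftotext', buildPdftotextArgs(options), {
    encoding: 'utf8',
    maxBuffer: 50 * 1024 * 1024,
    timeout: 30000,
  });
}

// The path stays a single argument even though it contains a space.
console.log(buildPdftotextArgs({ path: 'my report.pdf', page: 2, layout: true }));
```

Because each array element reaches pdftotext as one argument, no quoting or escaping of the file path is needed.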
src/server.js:76-101 (schema)
The input schema for the 'read_pdf_text' tool, defining the expected parameters: path (required string), optional page (number >=1), layout (boolean, default false), encoding (string enum, default UTF-8).
    inputSchema: {
      type: 'object',
      properties: {
        path: {
          type: 'string',
          description: 'Path to the PDF file (relative to current working directory or absolute path)',
        },
        page: {
          type: 'number',
          description: 'Specific page number to extract (1-based indexing). If not specified, extracts all pages.',
          minimum: 1,
        },
        layout: {
          type: 'boolean',
          description: 'Preserve original text layout formatting (default: false)',
          default: false,
        },
        encoding: {
          type: 'string',
          description: 'Text encoding for output (default: UTF-8)',
          default: 'UTF-8',
          enum: ['UTF-8', 'Latin1', 'ASCII'],
        },
      },
      required: ['path'],
    },
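
Whether the schema is enforced before the handler runs is not shown in this reference, so a caller may want to check arguments up front. The following is a minimal hand-rolled sketch of the schema's constraints; `validateArgs` and `ALLOWED_ENCODINGS` are hypothetical names, and a real project might prefer a JSON Schema validator library instead.

```javascript
const ALLOWED_ENCODINGS = ['UTF-8', 'Latin1', 'ASCII'];

// Hypothetical check mirroring the schema's constraints.
// Returns a list of problems; an empty list means the arguments pass.
function validateArgs(args) {
  const problems = [];
  if (typeof args.path !== 'string') {
    problems.push('path must be a string');
  }
  if (args.page !== undefined && (typeof args.page !== 'number' || args.page < 1)) {
    problems.push('page must be a number >= 1');
  }
  if (args.layout !== undefined && typeof args.layout !== 'boolean') {
    problems.push('layout must be a boolean');
  }
  if (args.encoding !== undefined && !ALLOWED_ENCODINGS.includes(args.encoding)) {
    problems.push(`encoding must be one of ${ALLOWED_ENCODINGS.join(', ')}`);
  }
  return problems;
}

console.log(validateArgs({ path: 'a.pdf', page: 0, encoding: 'UTF-16' }));
console.log(validateArgs({ path: 'a.pdf', page: 2 }));
```

Because the schema types `page` as `number` rather than `integer`, a fractional value such as 1.5 satisfies both the schema and this check.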
src/server.js:107-113 (registration)
Registration of the CallToolRequest handler that checks if the tool name is 'read_pdf_text' and dispatches to the handleReadPdfText function.
    this.server.setRequestHandler(CallToolRequestSchema, async (request) => {
      if (request.params.name !== 'read_pdf_text') {
        throw new Error(`Unknown tool: ${request.params.name}`);
      }
      return await this.handleReadPdfText(request.params.arguments);
    });
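
The dispatch above returns whatever `handleReadPdfText` produces: a result with one text item whose body is a JSON string. A minimal sketch of unpacking that on the receiving side follows. `parseToolResult` is a hypothetical client-side helper, and the sample result is hand-written to match the handler's response shape rather than captured from a real run.

```javascript
// Hypothetical client-side helper: unpacks the JSON payload from a tool result.
function parseToolResult(result) {
  const item = result.content.find((c) => c.type === 'text');
  if (!item) throw new Error('Tool result has no text content');
  const payload = JSON.parse(item.text);
  if (!payload.success) {
    // Error responses carry `error` and `errorType` fields.
    throw new Error(`${payload.errorType}: ${payload.error}`);
  }
  return payload;
}

// Hand-written sample shaped like a successful response (values are made up).
const sample = {
  content: [
    {
      type: 'text',
      text: JSON.stringify({
        success: true,
        file: 'a.pdf',
        extractedText: 'Hello world',
        wordCount: 2,
      }),
    },
  ],
};

const payload = parseToolResult(sample);
console.log(payload.extractedText, payload.wordCount);
```

Note that failures are reported inside the payload (`success: false`) rather than as a thrown error, so a client has to inspect the parsed JSON to detect them.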
src/server.js:71-104 (registration)
Registration of the ListToolsRequest handler that provides the tool metadata including name 'read_pdf_text', description, and input schema.
    this.server.setRequestHandler(ListToolsRequestSchema, async () => ({
      tools: [
        {
          name: 'read_pdf_text',
          description: 'Extract text content from a PDF file using pdftotext from poppler-utils',
          inputSchema: {
            type: 'object',
            properties: {
              path: {
                type: 'string',
                description: 'Path to the PDF file (relative to current working directory or absolute path)',
              },
              page: {
                type: 'number',
                description: 'Specific page number to extract (1-based indexing). If not specified, extracts all pages.',
                minimum: 1,
              },
              layout: {
                type: 'boolean',
                description: 'Preserve original text layout formatting (default: false)',
                default: false,
              },
              encoding: {
                type: 'string',
                description: 'Text encoding for output (default: UTF-8)',
                default: 'UTF-8',
                enum: ['UTF-8', 'Latin1', 'ASCII'],
              },
            },
            required: ['path'],
          },
        },
      ],
    }));
src/server.js:131-159 (helper)
Helper function to validate the PDF file: checks existence, readability, .pdf extension, and PDF magic bytes (%PDF header). Called by the handler.
    validatePdfFile(filePath) {
      // Check if file exists
      if (!fs.existsSync(filePath)) {
        throw new Error(`File not found: ${filePath}`);
      }

      // Check if file is readable
      try {
        fs.accessSync(filePath, fs.constants.R_OK);
      } catch (error) {
        throw new Error(`File is not readable: ${filePath}`);
      }

      // Basic PDF file validation (check extension and magic bytes)
      if (!filePath.toLowerCase().endsWith('.pdf')) {
        throw new Error(`File does not appear to be a PDF: ${filePath}`);
      }

      try {
        const buffer = fs.readFileSync(filePath, { start: 0, end: 4 });
        if (!buffer.toString().startsWith('%PDF')) {
          throw new Error(`File is not a valid PDF (missing PDF header): ${filePath}`);
        }
      } catch (error) {
        if (error.message.includes('PDF header')) {
          throw error;
        }
        throw new Error(`Unable to validate PDF file: ${filePath}`);
      }
    }
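
The magic-byte check passes `start` and `end` to `fs.readFileSync`. As far as Node's documented options go, that function accepts `encoding` and `flag`, so those two values appear to be ignored and the whole file is read before the header is compared. A minimal sketch of reading only the first four bytes is shown below; `hasPdfHeader` is a hypothetical helper, not part of the server.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Reads only the first four bytes and compares them with the '%PDF' signature.
function hasPdfHeader(filePath) {
  const fd = fs.openSync(filePath, 'r');
  try {
    const header = Buffer.alloc(4);
    const bytesRead = fs.readSync(fd, header, 0, 4, 0);
    return bytesRead === 4 && header.toString('latin1') === '%PDF';
  } finally {
    fs.closeSync(fd); // always release the descriptor, even on read errors
  }
}

// Demonstration with two throwaway files in a temporary directory.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'pdfhdr-'));
const good = path.join(dir, 'good.pdf');
const bad = path.join(dir, 'bad.pdf');
fs.writeFileSync(good, '%PDF-1.7\n');
fs.writeFileSync(bad, 'plain text');

console.log(hasPdfHeader(good), hasPdfHeader(bad));
```

For large PDFs this keeps validation cost constant regardless of file size, since only the header bytes are loaded.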
