read_pdf_text
Extract text from PDF files with customizable options like page selection, layout preservation, and text encoding using the PDFtotext MCP Server.
Instructions
Extract text content from a PDF file using pdftotext from poppler-utils
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| encoding | No | Text encoding for output (default: UTF-8) | UTF-8 |
| layout | No | Preserve original text layout formatting (default: false) | false |
| page | No | Specific page number to extract (1-based indexing). If not specified, extracts all pages. | |
| path | Yes | Path to the PDF file (relative to current working directory or absolute path) | |
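
The constraints in the table can be checked before a call is sent. The following is a minimal sketch, not part of the server: `normalizeReadPdfTextArgs` and `ALLOWED_ENCODINGS` are hypothetical names, and the allowed encoding values are taken from the `enum` in the input schema quoted under Implementation Reference.

```javascript
// Hypothetical client-side helper (not part of the server): checks an
// arguments object against the documented constraints and fills in defaults.
const ALLOWED_ENCODINGS = ['UTF-8', 'Latin1', 'ASCII'];

function normalizeReadPdfTextArgs(args) {
  const errors = [];

  // path is the only required parameter and must be a string.
  if (typeof args.path !== 'string') {
    errors.push('path is required and must be a string');
  }

  // page is optional; when present it must be a number >= 1 (NaN is rejected).
  if (args.page !== undefined && (typeof args.page !== 'number' || !(args.page >= 1))) {
    errors.push('page must be a number >= 1');
  }

  // layout is optional and defaults to false.
  if (args.layout !== undefined && typeof args.layout !== 'boolean') {
    errors.push('layout must be a boolean');
  }

  // encoding is optional and defaults to UTF-8.
  const encoding = args.encoding === undefined ? 'UTF-8' : args.encoding;
  if (!ALLOWED_ENCODINGS.includes(encoding)) {
    errors.push(`encoding must be one of ${ALLOWED_ENCODINGS.join(', ')}`);
  }

  if (errors.length > 0) {
    return { ok: false, errors };
  }
  return {
    ok: true,
    value: { path: args.path, page: args.page, layout: args.layout ?? false, encoding },
  };
}

// Usage: request page 2 of a file with default layout and encoding.
console.log(normalizeReadPdfTextArgs({ path: 'report.pdf', page: 2 }));
```

Returning a list of errors rather than throwing on the first one lets a caller report every problem with the arguments at once.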
Implementation Reference
- src/server.js:165-277 (handler) Main handler function for the 'read_pdf_text' tool. Destructures arguments, checks for pdftotext availability, validates PDF file, constructs and executes pdftotext command with options (page, layout, encoding), computes metadata and statistics, returns structured JSON response via MCP content array, handles various errors with typed responses.

  ```javascript
  async handleReadPdfText(args) {
    try {
      const { path: filePath, page, layout = false, encoding = 'UTF-8' } = args;

      // Check if pdftotext is available
      if (!this.checkPdftotextAvailable()) {
        throw new Error(
          'pdftotext is not available. Please install poppler-utils:\n' +
            ' Ubuntu/Debian: sudo apt install poppler-utils\n' +
            ' macOS: brew install poppler\n' +
            ' Windows: choco install poppler'
        );
      }

      // Validate the PDF file
      this.validatePdfFile(filePath);

      // Build pdftotext command
      const args_array = ['pdftotext'];

      // Add encoding if specified
      if (encoding !== 'UTF-8') {
        args_array.push('-enc', encoding);
      }

      // Add layout preservation if requested
      if (layout) {
        args_array.push('-layout');
      }

      // Add page specification if provided
      if (page) {
        args_array.push('-f', page.toString(), '-l', page.toString());
      }

      // Add input file and output to stdout
      args_array.push(filePath, '-');

      // Execute pdftotext
      const text = execSync(args_array.join(' '), {
        encoding: 'utf8',
        maxBuffer: 50 * 1024 * 1024, // 50MB buffer for very large PDFs
        timeout: 30000, // 30 second timeout
      });

      // Get file metadata
      const stats = fs.statSync(filePath);
      const fileName = path.basename(filePath);
      const fileDir = path.dirname(path.resolve(filePath));

      // Prepare response
      const response = {
        success: true,
        file: fileName,
        path: path.resolve(filePath),
        directory: fileDir,
        extractedText: text.trim(),
        pageSpecific: page || 'all',
        layoutPreserved: layout,
        encoding: encoding,
        fileSize: stats.size,
        lastModified: stats.mtime.toISOString(),
        extractedAt: new Date().toISOString(),
        textLength: text.trim().length,
        wordCount: text.trim().split(/\s+/).filter(word => word.length > 0).length,
      };

      return {
        content: [
          {
            type: 'text',
            text: JSON.stringify(response, null, 2),
          },
        ],
      };
    } catch (error) {
      // Prepare error response
      const errorResponse = {
        success: false,
        error: error.message,
        file: args.path || 'unknown',
        timestamp: new Date().toISOString(),
      };

      // Add specific error context if available
      if (error.code === 'ENOENT') {
        errorResponse.errorType = 'FILE_NOT_FOUND';
      } else if (error.code === 'EACCES') {
        errorResponse.errorType = 'PERMISSION_DENIED';
      } else if (error.message.includes('pdftotext')) {
        errorResponse.errorType = 'PDFTOTEXT_ERROR';
      } else if (error.message.includes('PDF')) {
        errorResponse.errorType = 'INVALID_PDF';
      } else {
        errorResponse.errorType = 'UNKNOWN_ERROR';
      }

      return {
        content: [
          {
            type: 'text',
            text: JSON.stringify(errorResponse, null, 2),
          },
        ],
      };
    }
  }
  ```
- src/server.js:76-101 (schema) Input schema definition for the 'read_pdf_text' tool, specifying parameters: path (required string), page (optional number >=1), layout (optional boolean, default false), encoding (optional string enum, default UTF-8).

  ```javascript
  inputSchema: {
    type: 'object',
    properties: {
      path: {
        type: 'string',
        description: 'Path to the PDF file (relative to current working directory or absolute path)',
      },
      page: {
        type: 'number',
        description: 'Specific page number to extract (1-based indexing). If not specified, extracts all pages.',
        minimum: 1,
      },
      layout: {
        type: 'boolean',
        description: 'Preserve original text layout formatting (default: false)',
        default: false,
      },
      encoding: {
        type: 'string',
        description: 'Text encoding for output (default: UTF-8)',
        default: 'UTF-8',
        enum: ['UTF-8', 'Latin1', 'ASCII'],
      },
    },
    required: ['path'],
  },
  ```
- src/server.js:107-113 (registration) Tool call request handler registration: checks if the requested tool name is 'read_pdf_text' and dispatches to handleReadPdfText if matched.

  ```javascript
  this.server.setRequestHandler(CallToolRequestSchema, async (request) => {
    if (request.params.name !== 'read_pdf_text') {
      throw new Error(`Unknown tool: ${request.params.name}`);
    }
    return await this.handleReadPdfText(request.params.arguments);
  });
  ```
- src/server.js:71-104 (registration) ListToolsRequestSchema handler registration: advertises the 'read_pdf_text' tool with its name, description, and input schema.

  ```javascript
  this.server.setRequestHandler(ListToolsRequestSchema, async () => ({
    tools: [
      {
        name: 'read_pdf_text',
        description: 'Extract text content from a PDF file using pdftotext from poppler-utils',
        inputSchema: {
          type: 'object',
          properties: {
            path: {
              type: 'string',
              description: 'Path to the PDF file (relative to current working directory or absolute path)',
            },
            page: {
              type: 'number',
              description: 'Specific page number to extract (1-based indexing). If not specified, extracts all pages.',
              minimum: 1,
            },
            layout: {
              type: 'boolean',
              description: 'Preserve original text layout formatting (default: false)',
              default: false,
            },
            encoding: {
              type: 'string',
              description: 'Text encoding for output (default: UTF-8)',
              default: 'UTF-8',
              enum: ['UTF-8', 'Latin1', 'ASCII'],
            },
          },
          required: ['path'],
        },
      },
    ],
  }));
  ```
- src/server.js:131-160 (helper) Helper function to validate the PDF file: checks existence, readability, .pdf extension, and PDF magic bytes (%PDF header).

  ```javascript
  validatePdfFile(filePath) {
    // Check if file exists
    if (!fs.existsSync(filePath)) {
      throw new Error(`File not found: ${filePath}`);
    }

    // Check if file is readable
    try {
      fs.accessSync(filePath, fs.constants.R_OK);
    } catch (error) {
      throw new Error(`File is not readable: ${filePath}`);
    }

    // Basic PDF file validation (check extension and magic bytes)
    if (!filePath.toLowerCase().endsWith('.pdf')) {
      throw new Error(`File does not appear to be a PDF: ${filePath}`);
    }

    try {
      const buffer = fs.readFileSync(filePath, { start: 0, end: 4 });
      if (!buffer.toString().startsWith('%PDF')) {
        throw new Error(`File is not a valid PDF (missing PDF header): ${filePath}`);
      }
    } catch (error) {
      if (error.message.includes('PDF header')) {
        throw error;
      }
      throw new Error(`Unable to validate PDF file: ${filePath}`);
    }
  }
  ```
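
Two details in the quoted handler and helper may be worth tightening. The handler joins its arguments into a single string for `execSync`, which runs through a shell, so a path containing spaces or shell metacharacters would be misparsed. The helper passes `start` and `end` to `fs.readFileSync`, which as far as the Node.js documentation shows are `fs.createReadStream` options, so the whole file is likely read just to inspect the header. The sketch below shows one possible alternative under those assumptions. `buildPdftotextArgs`, `hasPdfHeader`, and `extractText` are hypothetical names and are not part of the server.

```javascript
const fs = require('node:fs');
const { execFileSync } = require('node:child_process');

// Hypothetical helper: builds the pdftotext argument list in the same order
// as the handler, but keeps it as an array so no shell quoting is needed.
function buildPdftotextArgs({ path: filePath, page, layout = false, encoding = 'UTF-8' }) {
  const argv = [];
  if (encoding !== 'UTF-8') {
    argv.push('-enc', encoding);
  }
  if (layout) {
    argv.push('-layout');
  }
  if (page) {
    // -f and -l select the first and last page; equal values select one page.
    argv.push('-f', String(page), '-l', String(page));
  }
  // Input file, then '-' so pdftotext writes to stdout.
  argv.push(filePath, '-');
  return argv;
}

// Hypothetical helper: reads only the first four bytes of the file and
// compares them with the %PDF magic bytes.
function hasPdfHeader(filePath) {
  const fd = fs.openSync(filePath, 'r');
  try {
    const header = Buffer.alloc(4);
    const bytesRead = fs.readSync(fd, header, 0, 4, 0);
    return bytesRead === 4 && header.toString('latin1') === '%PDF';
  } finally {
    fs.closeSync(fd);
  }
}

// Hypothetical wrapper: runs pdftotext without a shell, reusing the buffer
// and timeout limits from the quoted handler.
function extractText(options) {
  return execFileSync('pdftotext', buildPdftotextArgs(options), {
    encoding: 'utf8',
    maxBuffer: 50 * 1024 * 1024,
    timeout: 30000,
  });
}

// Usage: the path stays a single argument even though it contains a space.
console.log(buildPdftotextArgs({ path: 'my file.pdf', page: 3, layout: true }));
```

Because `execFileSync` passes each array element to the process as its own argument, the file path is never interpreted by a shell. A file name beginning with `-` could still be read as an option by pdftotext, so this is a reduction in risk rather than a complete guard.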