read_pdf
Extracts all text from a PDF file given its absolute path. By default the response contains only the extracted text; when includeMetadata is true, the response is a JSON string containing the text, the PDF metadata, and the page count.
Instructions
Extract all text content from a PDF file
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| filePath | Yes | Absolute path to the PDF file | |
| includeMetadata | No | Whether to include PDF metadata in the response | false |
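The table above can be read as a small argument contract: one required absolute path and one optional boolean that defaults to false. The sketch below is one way a caller might check arguments before invoking the tool. The `ReadPdfArgs` type and the `normalizeReadPdfArgs` helper are hypothetical names introduced for illustration; the excerpts on this page do not show the server performing this validation itself.

```typescript
import * as path from 'node:path';

// Hypothetical type mirroring the input schema table (not part of the server code).
interface ReadPdfArgs {
  filePath: string;
  includeMetadata?: boolean;
}

// Hypothetical helper: rejects non-absolute paths and applies the documented default.
function normalizeReadPdfArgs(args: ReadPdfArgs): Required<ReadPdfArgs> {
  if (typeof args.filePath !== 'string' || !path.isAbsolute(args.filePath)) {
    throw new Error('filePath must be an absolute path to a PDF file');
  }
  return {
    filePath: args.filePath,
    // includeMetadata is optional; the schema documents false as its default.
    includeMetadata: args.includeMetadata ?? false,
  };
}

// Example: only the required argument is supplied.
const normalized = normalizeReadPdfArgs({ filePath: '/tmp/report.pdf' });
console.log(normalized);
```

Checking the path on the caller side is a design choice for this sketch: it fails fast with a specific message instead of relying on the generic "Failed to read PDF" error raised when the file cannot be read.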
Implementation Reference
- src/pdf-tools.ts:9-37 (handler) The core handler function 'extractTextFromPDF' reads a PDF file and extracts all text content along with metadata (title, author, subject, etc.) and page count. This is the function called when 'read_pdf' is invoked.
```typescript
export async function extractTextFromPDF(filePath: string): Promise<PDFExtractionResult> {
  try {
    const dataBuffer = await fs.readFile(filePath);
    const parser = new PDFParse({ data: dataBuffer });
    const textResult = await parser.getText();
    const infoResult = await parser.getInfo();
    await parser.destroy();

    const metadata: PDFMetadata = {
      title: infoResult.info?.Title,
      author: infoResult.info?.Author,
      subject: infoResult.info?.Subject,
      creator: infoResult.info?.Creator,
      producer: infoResult.info?.Producer,
      creationDate: infoResult.info?.CreationDate,
      modificationDate: infoResult.info?.ModDate,
      keywords: infoResult.info?.Keywords,
      totalPages: infoResult.total,
    };

    return {
      text: textResult.text,
      metadata,
      pageCount: infoResult.total,
    };
  } catch (error) {
    throw new Error(`Failed to read PDF: ${error instanceof Error ? error.message : String(error)}`);
  }
}
```
- src/types.ts:27-31 (schema) The PDFExtractionResult interface defines the return type of the read_pdf handler, containing text, optional metadata, and pageCount.
```typescript
export interface PDFExtractionResult {
  text: string;
  metadata?: PDFMetadata;
  pageCount: number;
}
```
- src/index.ts:22-41 (registration) Tool registration definition for 'read_pdf' in the TOOLS array, including its name, description, and inputSchema (filePath required, includeMetadata optional).
```typescript
const TOOLS: Tool[] = [
  {
    name: 'read_pdf',
    description: 'Extract all text content from a PDF file',
    inputSchema: {
      type: 'object',
      properties: {
        filePath: {
          type: 'string',
          description: 'Absolute path to the PDF file',
        },
        includeMetadata: {
          type: 'boolean',
          description: 'Whether to include PDF metadata in the response',
          default: false,
        },
      },
      required: ['filePath'],
    },
  },
```
- src/index.ts:175-202 (handler) The 'read_pdf' case in the main server's CallToolRequestSchema handler dispatches to extractTextFromPDF and formats the response based on the includeMetadata flag.
```typescript
      case 'read_pdf': {
        const { filePath, includeMetadata } = args as {
          filePath: string;
          includeMetadata?: boolean;
        };

        const result = await extractTextFromPDF(filePath);

        if (includeMetadata) {
          return {
            content: [
              {
                type: 'text',
                text: JSON.stringify(result, null, 2),
              },
            ],
          };
        }

        return {
          content: [
            {
              type: 'text',
              text: result.text,
            },
          ],
        };
      }
```
- src/types.ts:4-14 (schema) The PDFMetadata interface defines the metadata structure returned as part of the read_pdf result.
```typescript
export interface PDFMetadata {
  title?: string;
  author?: string;
  subject?: string;
  creator?: string;
  producer?: string;
  creationDate?: string;
  modificationDate?: string;
  keywords?: string;
  totalPages?: number;
}
```
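The 'read_pdf' case returns the response in one of two shapes, depending on includeMetadata. The sketch below isolates that branching so it can be run without a PDF or the parser. The function name `formatReadPdfResponse` and the sample data are hypothetical, and the interfaces are trimmed copies of the ones listed on this page.

```typescript
// Trimmed copies of the interfaces from src/types.ts, kept local so the sketch is self-contained.
interface PDFMetadata {
  title?: string;
  author?: string;
  totalPages?: number;
}

interface PDFExtractionResult {
  text: string;
  metadata?: PDFMetadata;
  pageCount: number;
}

// Hypothetical standalone version of the branching in the 'read_pdf' case:
// plain text by default, the whole result serialized as JSON when includeMetadata is true.
function formatReadPdfResponse(result: PDFExtractionResult, includeMetadata = false) {
  const text = includeMetadata ? JSON.stringify(result, null, 2) : result.text;
  return { content: [{ type: 'text' as const, text }] };
}

// Made-up sample result for illustration only.
const sample: PDFExtractionResult = {
  text: 'Hello PDF',
  metadata: { title: 'Demo', totalPages: 1 },
  pageCount: 1,
};

const plain = formatReadPdfResponse(sample);
const full = formatReadPdfResponse(sample, true);

console.log(plain.content[0].text); // prints "Hello PDF"
console.log(JSON.parse(full.content[0].text).pageCount); // prints 1
```

In both cases the payload arrives as a single text content item, so a client that sets includeMetadata to true needs to parse that text as JSON to reach the metadata and pageCount fields.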