extract_pdf_text
Extract text from PDF documents, including optional metadata and formatting preservation. Specify page ranges to refine extraction and retain document structure as needed.
Instructions
Extract text content from PDF documents with optional metadata and formatting preservation
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| file_path | Yes | Path to the PDF file to extract text from | |
| include_metadata | No | Whether to include document metadata in the response | false |
| pages | No | Page range to extract (e.g., "1-5", "1,3,5", or "all") | all |
| preserve_formatting | No | Whether to preserve text formatting and structure | true |
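
The `pages` parameter accepts three forms: a range ("1-5"), a comma-separated list ("1,3,5"), or "all". This page does not show how the server interprets that string, so the following is only a minimal sketch of one plausible reading. `parsePageRange` and its `pageCount` argument are hypothetical names introduced for illustration and are not part of the tool's source.

```typescript
// Hypothetical helper, not taken from the tool's source code.
// Expands a `pages` string ("1-5", "1,3,5", or "all") into sorted, 1-based page numbers.
function parsePageRange(spec: string, pageCount: number): number[] {
  const trimmed = spec.trim().toLowerCase();
  if (trimmed === "all") {
    return Array.from({ length: pageCount }, (_, i) => i + 1);
  }

  const pages = new Set<number>();
  for (const part of trimmed.split(",")) {
    // Each segment is either a single page ("3") or an inclusive range ("1-5").
    const match = /^\s*(\d+)\s*(?:-\s*(\d+)\s*)?$/.exec(part);
    if (!match) {
      throw new Error(`Invalid page range segment: "${part}"`);
    }
    const start = Number(match[1]);
    const end = match[2] !== undefined ? Number(match[2]) : start;
    // Assumption: out-of-bounds or reversed ranges are rejected rather than clamped.
    if (start < 1 || end < start || end > pageCount) {
      throw new Error(`Page range out of bounds: "${part}"`);
    }
    for (let page = start; page <= end; page++) {
      pages.add(page);
    }
  }
  return Array.from(pages).sort((a, b) => a - b);
}

console.log(parsePageRange("1,3,5", 10));
```

Rejecting invalid segments instead of silently clamping them is a design choice made for this sketch; the real server may behave differently.
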
Implementation Reference
- src/tools/extract-text.ts:37-62 (handler)
  Core handler function that validates input using ExtractTextParamsSchema, extracts text using PDFProcessor, conditionally includes metadata, and returns structured result or throws MCP-formatted error.

  ```typescript
  export async function handleExtractText(args: unknown): Promise<ExtractTextResult> {
    try {
      const params = ExtractTextParamsSchema.parse(args);
      const processor = new PDFProcessor();
      const result = await processor.extractText(
        params.file_path,
        params.preserve_formatting
      );

      const response: ExtractTextResult = {
        text: result.text,
        page_count: result.pageCount,
        processing_time_ms: result.processingTimeMs
      };

      if (params.include_metadata) {
        response.metadata = result.metadata;
      }

      return response;
    } catch (error) {
      const mcpError = handleError(error, typeof args === 'object' && args !== null && 'file_path' in args ? String(args.file_path) : undefined);
      throw new Error(JSON.stringify(mcpError));
    }
  }
  ```
- src/tools/extract-text.ts:7-35 (schema)
  Tool specification including name, description, and inputSchema for MCP protocol compliance.

  ```typescript
  export const extractTextTool: Tool = {
    name: 'extract_pdf_text',
    description: 'Extract text content from PDF documents with optional metadata and formatting preservation',
    inputSchema: {
      type: 'object',
      properties: {
        file_path: {
          type: 'string',
          description: 'Path to the PDF file to extract text from'
        },
        pages: {
          type: 'string',
          description: 'Page range to extract (e.g., "1-5", "1,3,5", or "all")',
          default: 'all'
        },
        preserve_formatting: {
          type: 'boolean',
          description: 'Whether to preserve text formatting and structure',
          default: true
        },
        include_metadata: {
          type: 'boolean',
          description: 'Whether to include document metadata in the response',
          default: false
        }
      },
      required: ['file_path']
    }
  };
  ```
- src/types/mcp-types.ts:7-12 (schema)
  Zod schema for runtime input validation of extract_pdf_text parameters, used in the handler.

  ```typescript
  export const ExtractTextParamsSchema = z.object({
    file_path: filePathValidation,
    pages: z.string().default('all'),
    preserve_formatting: z.boolean().default(true),
    include_metadata: z.boolean().default(false)
  });
  ```
- src/index.ts:41-45 (registration)
  Registration of the extract_pdf_text tool (via extractTextTool) in the MCP listTools request handler.

  ```typescript
      extractTextTool,
      extractMetadataTool,
      extractPagesTool,
      validatePDFTool,
    ],
  ```
- src/index.ts:53-61 (registration)
  Dispatch/registration in the MCP callTool request handler switch statement, invoking the tool's handleExtractText function.

  ```typescript
  case 'extract_pdf_text':
    return {
      content: [
        {
          type: 'text',
          text: JSON.stringify(await handleExtractText(args), null, 2),
        },
      ],
    };
  ```
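
Because the dispatcher serializes the handler result with `JSON.stringify` into a single `text` content item, a caller has to parse that string back into an object. The sketch below shows one way to do so. The `ExtractTextResult` fields are inferred from the handler excerpt, `metadata` is typed as `unknown` because its structure is not shown here, and `ToolResponse` and `decodeExtractTextResponse` are hypothetical names introduced for illustration.

```typescript
// Result shape inferred from the handler excerpt; `metadata` stays `unknown`
// because its structure is not documented on this page.
interface ExtractTextResult {
  text: string;
  page_count: number;
  processing_time_ms: number;
  metadata?: unknown;
}

// Assumed shape of the value returned by the 'extract_pdf_text' dispatcher case.
interface ToolResponse {
  content: Array<{ type: string; text: string }>;
}

// Hypothetical client-side helper: locate the first text item and parse its JSON payload.
function decodeExtractTextResponse(response: ToolResponse): ExtractTextResult {
  const item = response.content.find((entry) => entry.type === "text");
  if (!item) {
    throw new Error("Response contains no text content item");
  }

  const parsed: unknown = JSON.parse(item.text);
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("Payload is not a JSON object");
  }

  // Check the required fields before trusting the payload.
  const candidate = parsed as Partial<ExtractTextResult>;
  if (
    typeof candidate.text !== "string" ||
    typeof candidate.page_count !== "number" ||
    typeof candidate.processing_time_ms !== "number"
  ) {
    throw new Error("Payload does not match the expected result shape");
  }
  return candidate as ExtractTextResult;
}

// Usage with a hand-written sample response (values are made up for illustration).
const sample: ToolResponse = {
  content: [
    {
      type: "text",
      text: JSON.stringify({ text: "Hello", page_count: 1, processing_time_ms: 12 }, null, 2),
    },
  ],
};
const decoded = decodeExtractTextResponse(sample);
console.log(decoded.text, decoded.page_count);
```

The handler's error path uses the same convention (`throw new Error(JSON.stringify(mcpError))`), so a caller could apply `JSON.parse` to the error message in a similar way. The shape of `mcpError` is not shown here and would need to be checked against the source.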