search_pdf
Search for text patterns or regex within a PDF file, returning matching pages with context snippets. Use page ranges and early stopping for efficient large document scanning.
Instructions
Search for text patterns (including regex) within a PDF file and return matching pages with context snippets. Supports Python-style page ranges and early stopping for performance. Use /pattern/flags format for regex (e.g., '/budget|forecast/gi') or plain text for literal search.
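The pattern convention above can be illustrated with a small parser. This is only a sketch: the server's own `parseSearchPattern` is not reproduced on this page, so the name `parseSearchPatternSketch`, the accepted flag set, the forced `g` flag, and the case-sensitive literal search are all assumptions made for illustration.

```typescript
// Hypothetical sketch; not the server's actual parseSearchPattern.
interface ParsedPattern {
  regex: RegExp;
  isRegex: boolean;
}

function parseSearchPatternSketch(pattern: string): ParsedPattern {
  // Treat "/body/flags" as a regex; the last "/" separates body from flags.
  const lastSlash = pattern.lastIndexOf("/");
  if (pattern.startsWith("/") && lastSlash > 0) {
    const body = pattern.slice(1, lastSlash);
    const flags = pattern.slice(lastSlash + 1);
    // Assumption: only the common JavaScript flags are accepted.
    if (body.length > 0 && /^[gimsuy]*$/.test(flags)) {
      // Assumption: always search globally so every match on a page is found.
      const withGlobal = flags.includes("g") ? flags : flags + "g";
      return { regex: new RegExp(body, withGlobal), isRegex: true };
    }
  }
  // Anything else is a literal: escape regex metacharacters before compiling.
  const escaped = pattern.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  return { regex: new RegExp(escaped, "g"), isRegex: false };
}

const parsed = parseSearchPatternSketch("/budget|forecast/gi");
console.log(parsed.isRegex, parsed.regex.source, parsed.regex.flags);
```

With this approach a plain string such as `a.b (c)` matches only that exact text, because the dot and parentheses are escaped rather than interpreted as regex syntax.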
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| absolute_path | No | Absolute path to the PDF file (e.g., '/Users/john/documents/report.pdf'). Provide exactly one of absolute_path or relative_path. | |
| relative_path | No | Path relative to the ~/pdf-agent/ directory (e.g., 'reports/annual.pdf'). Provide exactly one of absolute_path or relative_path. | |
| use_pdf_home | No | Resolve relative paths against the PDF agent home directory. | true |
| page_range | No | Page range in enhanced Python-style format: '5' (page 5), '5:10' (pages 5-10), '7:' (page 7 to end), ':5' (start to page 5). Also supports comma-separated combinations: '1,3:5,7' (pages 1, 3-5, and 7), '1-3,7,10:' (pages 1-3, 7, and 10 to end). The default '1:' selects all pages. | 1: |
| search_pattern | Yes | Search pattern: '/regex/flags' format (e.g., '/budget\|forecast/gi') or plain text for literal search. Must be non-empty. | |
| max_results | No | Stop after finding this many total matches (minimum 1). Use for quick searches. | |
| max_pages_scanned | No | Stop after scanning this many pages (minimum 1). Use for quick searches. | |
| context_chars | No | Number of characters to include before and after each match for context (10 to 1000). | 150 |
| search_timeout | No | Timeout for search operations in milliseconds (1000 to 60000). | 10000 |
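The `page_range` syntax in the table can be sketched as a small parser. The server's own `parsePageRange` is not shown on this page, so `parsePageRangeSketch` below is a hypothetical stand-in. It assumes 1-indexed pages, inclusive end bounds (as in the '5:10' means pages 5-10 example), clamping to the document length, and de-duplicated, sorted output.

```typescript
// Hypothetical sketch; not the server's actual parsePageRange.
function parsePageRangeSketch(spec: string, totalPages: number): number[] {
  const pages = new Set<number>();
  for (const rawPart of spec.split(",")) {
    const part = rawPart.trim();
    if (part === "") continue;

    let start: number;
    let end: number;
    // Accept "a:b" and "a-b"; either side of the separator may be empty.
    const m = /^(\d*)\s*[:-]\s*(\d*)$/.exec(part);
    if (m) {
      start = m[1] ? Number(m[1]) : 1;
      end = m[2] ? Number(m[2]) : totalPages;
    } else if (/^\d+$/.test(part)) {
      start = Number(part);
      end = start;
    } else {
      throw new Error(`Invalid page range segment: '${part}'`);
    }
    if (start < 1) {
      throw new Error(`Page numbers start at 1, got '${part}'`);
    }
    // Assumption: ranges are inclusive and silently clamped to totalPages.
    for (let p = start; p <= Math.min(end, totalPages); p++) {
      pages.add(p);
    }
  }
  return Array.from(pages).sort((a, b) => a - b);
}

console.log(parsePageRangeSketch("1,3:5,7", 10));
```

Under these assumptions, '1-3,7,10:' on a 12-page document expands to pages 1, 2, 3, 7, 10, 11, and 12.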
Implementation Reference
- src/index.ts:143-158 (schema) Zod schema for the search_pdf tool input validation, defining fields like absolute_path, relative_path, page_range, search_pattern, max_results, max_pages_scanned, context_chars, and search_timeout.

```typescript
const SearchPdfSchema = z.object({
  absolute_path: z.string().optional(),
  relative_path: z.string().optional(),
  use_pdf_home: z.boolean().default(true),
  page_range: z.string().default("1:"),
  search_pattern: z.string().min(1),
  max_results: z.coerce.number().min(1).optional(),
  max_pages_scanned: z.coerce.number().min(1).optional(),
  context_chars: z.coerce.number().min(10).max(1000).default(150),
  search_timeout: z.coerce.number().min(1000).max(60000).default(10000),
}).refine(
  (data) => (data.absolute_path && !data.relative_path) || (!data.absolute_path && data.relative_path),
  {
    message: "Exactly one of 'absolute_path' or 'relative_path' must be provided",
  }
);
```

- src/index.ts:1574-1628 (registration) Tool registration entry in ListToolsRequestSchema handler, declaring 'search_pdf' name, description, and input schema.

```typescript
{
  name: "search_pdf",
  description: "Search for text patterns (including regex) within a PDF file and return matching pages with context snippets. Supports Python-style page ranges and early stopping for performance. Use /pattern/flags format for regex (e.g., '/budget|forecast/gi') or plain text for literal search.",
  inputSchema: {
    type: "object",
    properties: {
      absolute_path: {
        type: "string",
        description: "Absolute path to the PDF file (e.g., '/Users/john/documents/report.pdf')",
      },
      relative_path: {
        type: "string",
        description: "Path relative to ~/pdf-agent/ directory (e.g., 'reports/annual.pdf')",
      },
      use_pdf_home: {
        type: "boolean",
        description: "Use PDF agent home directory for relative paths (default: true)",
        default: true,
      },
      page_range: {
        type: "string",
        description: "Page range in enhanced Python-style format: '5' (page 5), '5:10' (pages 5-10), '7:' (page 7 to end), ':5' (start to page 5). Also supports comma-separated combinations: '1,3:5,7' (pages 1, 3-5, and 7), '1-3,7,10:' (pages 1-3, 7, and 10 to end). Default: '1:' (all pages)",
        default: "1:",
      },
      search_pattern: {
        type: "string",
        description: "Search pattern: '/regex/flags' format (e.g., '/budget|forecast/gi') or plain text for literal search. Required.",
      },
      max_results: {
        type: "number",
        description: "Stop after finding this many total matches. Optional - use for quick searches.",
        minimum: 1,
      },
      max_pages_scanned: {
        type: "number",
        description: "Stop after scanning this many pages. Optional - use for quick searches.",
        minimum: 1,
      },
      context_chars: {
        type: "number",
        description: "Number of characters to include before/after each match for context. Default: 150",
        minimum: 10,
        maximum: 1000,
        default: 150,
      },
      search_timeout: {
        type: "number",
        description: "Timeout for search operations in milliseconds. Default: 10000 (10 seconds)",
        minimum: 1000,
        maximum: 60000,
        default: 10000,
      },
    },
  },
},
```

- src/index.ts:2217-2368 (handler) Main handler for the search_pdf tool in the CallToolRequestSchema switch case. Resolves the file path, determines the search strategy (page_by_page vs extract_all), calls searchPdfPageByPage or searchPdfComprehensive, and formats the results.

```typescript
case "search_pdf": {
  const {
    absolute_path,
    relative_path,
    use_pdf_home,
    page_range,
    search_pattern,
    max_results,
    max_pages_scanned,
    context_chars,
    search_timeout
  } = SearchPdfSchema.parse(args);

  try {
    // Resolve the final path based on parameters
    let resolvedPath: string;
    if (absolute_path) {
      resolvedPath = resolve(absolute_path);
    } else {
      if (use_pdf_home) {
        const pdfAgentHome = await ensurePdfAgentHome();
        resolvedPath = resolve(pdfAgentHome, relative_path!);
      } else {
        resolvedPath = resolve(relative_path!);
      }
    }

    // Check if file exists
    if (!(await fileExists(resolvedPath))) {
      throw new Error(`PDF file not found at ${resolvedPath}. Please check the file path and ensure the file exists.`);
    }

    // Get PDF metadata to determine page count
    const pdfBuffer = await safeReadFile(resolvedPath);
    const pdfDoc = await PDFDocument.load(pdfBuffer);
    const totalPages = pdfDoc.getPageCount();

    // Parse page range
    const pageNumbers = parsePageRange(page_range, totalPages);

    // Validate search pattern
    let searchRegex: RegExp;
    let isRegexSearch: boolean;
    try {
      const parsed = parseSearchPattern(search_pattern);
      searchRegex = parsed.regex;
      isRegexSearch = parsed.isRegex;
    } catch (regexError) {
      throw new Error(`Invalid search pattern: ${regexError}`);
    }

    // Determine search strategy based on limits
    const hasLimits = max_results !== undefined || max_pages_scanned !== undefined;
    let searchResults: any;
    let searchStrategy: string;

    if (hasLimits) {
      // Use page-by-page search with early stopping
      searchStrategy = "page_by_page";
      searchResults = await searchPdfPageByPage(
        resolvedPath,
        pageNumbers,
        search_pattern,
        context_chars,
        search_timeout,
        max_results,
        max_pages_scanned
      );
    } else {
      // Use comprehensive search (extract all then search)
      searchStrategy = "extract_all";
      const comprehensiveResults = await searchPdfComprehensive(
        resolvedPath,
        pageNumbers,
        search_pattern,
        context_chars,
        search_timeout
      );
      searchResults = {
        ...comprehensiveResults,
        completed: true,
        stoppedReason: 'completed'
      };
    }

    // Create comprehensive summary
    const totalMatches = searchResults.matches.reduce((sum: number, page: any) => sum + page.matchCount, 0);
    const pagesWithMatches = searchResults.matches.length;

    const summary = {
      total_matches: totalMatches,
      pages_with_matches: pagesWithMatches,
      pages_scanned: searchResults.pagesScanned,
      total_pages_in_range: pageNumbers.length,
      search_strategy: searchStrategy,
      search_pattern: search_pattern,
      is_regex: isRegexSearch,
      completed: searchResults.completed,
      stopped_reason: searchResults.stoppedReason,
      context_chars: context_chars,
      timeout_ms: search_timeout,
      errors: searchResults.errors?.length || 0
    };

    // Prepare response content
    const content: any[] = [];

    // Add summary as first item
    content.push({
      type: "text",
      text: JSON.stringify(summary, null, 2)
    });

    // Add detailed results if matches found
    if (searchResults.matches.length > 0) {
      content.push({
        type: "text",
        text: JSON.stringify({
          matches: searchResults.matches,
          errors: searchResults.errors || []
        }, null, 2)
      });
    }

    // Add error details if any
    if (searchResults.errors && searchResults.errors.length > 0) {
      content.push({
        type: "text",
        text: JSON.stringify({ errors: searchResults.errors }, null, 2)
      });
    }

    return { content: content };
  } catch (e) {
    const providedPath = relative_path || absolute_path || 'unknown';
    const pathType = relative_path ? 'relative path' : 'absolute path';
    return {
      content: [
        {
          type: "text",
          text: JSON.stringify({
            error: `Error searching PDF at ${pathType} '${providedPath}': ${e}. Please ensure the file is a valid PDF, check the search pattern format, and verify the page range.`
          }),
        },
      ],
    };
  }
}
```

- src/index.ts:645-727 (helper) Helper function searchPdfComprehensive: extract-all-then-search strategy that extracts text from all requested pages first, then searches each page's extracted text for the pattern.

```typescript
async function searchPdfComprehensive(
  pdfPath: string,
  pageNumbers: number[],
  searchPattern: string,
  contextChars: number,
  searchTimeout: number
): Promise<{
  matches: Array<{
    page: number;
    matchCount: number;
    snippets: Array<{
      text: string;
      matchStart: number;
      matchEnd: number;
    }>;
  }>;
  errors: string[];
  pagesScanned: number;
}> {
  const results: Array<{
    page: number;
    matchCount: number;
    snippets: Array<{
      text: string;
      matchStart: number;
      matchEnd: number;
    }>;
  }> = [];
  const errors: string[] = [];

  try {
    // Extract text from all pages using hybrid approach
    log('info', `Extracting text from ${pageNumbers.length} pages for comprehensive search`);
    const pdfBuffer = await safeReadFile(pdfPath);
    const pageTexts = await extractTextHybrid(pdfBuffer, pdfPath, pageNumbers);

    // Parse search pattern
    const { regex } = parseSearchPattern(searchPattern);

    // Search each page
    for (let i = 0; i < pageNumbers.length; i++) {
      const pageNum = pageNumbers[i];
      const pageText = pageTexts[i];

      if (!pageText || pageText.trim().length === 0) {
        errors.push(`Page ${pageNum}: No text extracted`);
        continue;
      }

      try {
        // Search with timeout protection
        const matches = await searchWithTimeout(pageText, new RegExp(regex.source, regex.flags), searchTimeout);

        if (matches.length > 0) {
          const snippets = matches.map(match => {
            const context = extractContext(pageText, match.index!, match.index! + match[0].length, contextChars);
            return {
              text: context.snippet,
              matchStart: context.matchStartInSnippet,
              matchEnd: context.matchEndInSnippet,
            };
          });

          results.push({
            page: pageNum,
            matchCount: matches.length,
            snippets,
          });
        }
      } catch (searchError) {
        errors.push(`Page ${pageNum}: Search failed - ${searchError}`);
      }
    }

    return {
      matches: results,
      errors,
      pagesScanned: pageNumbers.length,
    };
  } catch (error) {
    throw new Error(`Comprehensive search failed: ${error}`);
  }
}
```

- src/index.ts:732-842 (helper) Helper function searchPdfPageByPage: page-by-page search with early stopping based on max_results or max_pages_scanned limits.

```typescript
async function searchPdfPageByPage(
  pdfPath: string,
  pageNumbers: number[],
  searchPattern: string,
  contextChars: number,
  searchTimeout: number,
  maxResults?: number,
  maxPagesScanned?: number
): Promise<{
  matches: Array<{
    page: number;
    matchCount: number;
    snippets: Array<{
      text: string;
      matchStart: number;
      matchEnd: number;
    }>;
  }>;
  errors: string[];
  pagesScanned: number;
  completed: boolean;
  stoppedReason?: 'max_results' | 'max_pages' | 'completed';
}> {
  const results: Array<{
    page: number;
    matchCount: number;
    snippets: Array<{
      text: string;
      matchStart: number;
      matchEnd: number;
    }>;
  }> = [];
  const errors: string[] = [];
  let totalMatchCount = 0;
  let pagesScanned = 0;

  // Parse search pattern once
  const { regex } = parseSearchPattern(searchPattern);

  log('info', `Starting page-by-page search with limits: max_results=${maxResults}, max_pages=${maxPagesScanned}`);

  for (const pageNum of pageNumbers) {
    // Check if we should stop scanning more pages
    if (maxPagesScanned && pagesScanned >= maxPagesScanned) {
      return {
        matches: results,
        errors,
        pagesScanned,
        completed: false,
        stoppedReason: 'max_pages',
      };
    }

    pagesScanned++;

    try {
      // Extract text from single page using hybrid approach
      const pdfBuffer = await safeReadFile(pdfPath);
      const pageTexts = await extractTextHybrid(pdfBuffer, pdfPath, [pageNum]);
      const pageText = pageTexts[0];

      if (!pageText || pageText.trim().length === 0) {
        errors.push(`Page ${pageNum}: No text extracted`);
        continue;
      }

      // Search with timeout protection
      const matches = await searchWithTimeout(pageText, new RegExp(regex.source, regex.flags), searchTimeout);

      if (matches.length > 0) {
        const snippets = matches.map(match => {
          const context = extractContext(pageText, match.index!, match.index! + match[0].length, contextChars);
          return {
            text: context.snippet,
            matchStart: context.matchStartInSnippet,
            matchEnd: context.matchEndInSnippet,
          };
        });

        results.push({
          page: pageNum,
          matchCount: matches.length,
          snippets,
        });

        totalMatchCount += matches.length;

        // Check if we've reached max results
        if (maxResults && totalMatchCount >= maxResults) {
          return {
            matches: results,
            errors,
            pagesScanned,
            completed: false,
            stoppedReason: 'max_results',
          };
        }
      }
    } catch (searchError) {
      errors.push(`Page ${pageNum}: Search failed - ${searchError}`);
    }
  }

  return {
    matches: results,
    errors,
    pagesScanned,
    completed: true,
    stoppedReason: 'completed',
  };
}
```
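
Both helpers rely on `extractContext` to turn a match into a snippet, but that function is not listed on this page. The sketch below is one plausible way to produce the `snippet`, `matchStartInSnippet`, and `matchEndInSnippet` fields the helpers read. The name `extractContextSketch` and the clamping behavior are assumptions; the real implementation may also trim whitespace or add ellipses.

```typescript
// Hypothetical sketch; the server's actual extractContext is not shown here.
interface ContextSnippet {
  snippet: string;
  matchStartInSnippet: number;
  matchEndInSnippet: number;
}

function extractContextSketch(
  text: string,
  matchStart: number,
  matchEnd: number,
  contextChars: number
): ContextSnippet {
  // Clamp the window so it never runs past either end of the page text.
  const windowStart = Math.max(0, matchStart - contextChars);
  const windowEnd = Math.min(text.length, matchEnd + contextChars);
  return {
    snippet: text.slice(windowStart, windowEnd),
    // Offsets are re-based so callers can highlight the match inside the snippet.
    matchStartInSnippet: matchStart - windowStart,
    matchEndInSnippet: matchEnd - windowStart,
  };
}

const ctx = extractContextSketch("0123456789abcdefghij", 10, 13, 4);
console.log(ctx.snippet.slice(ctx.matchStartInSnippet, ctx.matchEndInSnippet));
```

Re-basing the offsets to the snippet, rather than returning page-level positions, means a client can highlight the match without needing the full page text.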