search_pdf

Search for text patterns or regex within a PDF file, returning matching pages with context snippets. Use page ranges and early stopping for efficient large document scanning.

Instructions

Search for text patterns (including regex) within a PDF file and return matching pages with context snippets. Supports Python-style page ranges and early stopping for performance. Use /pattern/flags format for regex (e.g., '/budget|forecast/gi') or plain text for literal search.
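The two pattern forms described above can be told apart with a small parser. The following is a minimal sketch only: the project's real `parseSearchPattern` is not shown on this page, so the name `parsePatternSketch`, the forced `g` flag, and the case-sensitive literal search are all assumptions made for illustration.

```typescript
// Hypothetical sketch; not the implementation from src/index.ts.
interface ParsedPatternSketch {
  regex: RegExp;
  isRegex: boolean;
}

function parsePatternSketch(pattern: string): ParsedPatternSketch {
  // '/body/flags' form: leading slash, non-empty body, closing slash, optional flags.
  const m = /^\/([\s\S]+)\/([a-z]*)$/.exec(pattern);
  if (m) {
    // Assumption: always add 'g' so every occurrence on a page can be collected.
    const flags = m[2].indexOf("g") === -1 ? m[2] + "g" : m[2];
    return { regex: new RegExp(m[1], flags), isRegex: true };
  }
  // Plain text: escape regex metacharacters so the text is matched literally.
  const escaped = pattern.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  return { regex: new RegExp(escaped, "g"), isRegex: false };
}
```

With this sketch, `'/budget|forecast/gi'` becomes a case-insensitive alternation, while a plain string such as `'a.b'` only matches the literal characters `a.b`.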

Input Schema


Name	Required	Description	Default
`absolute_path`	No	Absolute path to the PDF file (e.g., '/Users/john/documents/report.pdf'). Provide exactly one of `absolute_path` or `relative_path`.
`relative_path`	No	Path relative to ~/pdf-agent/ directory (e.g., 'reports/annual.pdf'). Provide exactly one of `absolute_path` or `relative_path`.
`use_pdf_home`	No	Use PDF agent home directory for relative paths.	true
`page_range`	No	Page range in enhanced Python-style format: '5' (page 5), '5:10' (pages 5-10), '7:' (page 7 to end), ':5' (start to page 5). Also supports comma-separated combinations: '1,3:5,7' (pages 1, 3-5, and 7), '1-3,7,10:' (pages 1-3, 7, and 10 to end).	1: (all pages)
`search_pattern`	Yes	Search pattern: '/regex/flags' format (e.g., '/budget|forecast/gi') or plain text for literal search. Must be non-empty.
`max_results`	No	Stop after finding this many total matches (minimum 1). Use for quick searches.
`max_pages_scanned`	No	Stop after scanning this many pages (minimum 1). Use for quick searches.
`context_chars`	No	Number of characters to include before/after each match for context (10 to 1000).	150
`search_timeout`	No	Timeout for search operations in milliseconds (1000 to 60000).	10000 (10 seconds)
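The `page_range` syntax can be illustrated with a small parser. This is a sketch under stated assumptions, not the project's `parsePageRange` (whose body is not shown on this page): the name `parsePageRangeSketch` is hypothetical, and the clamping to the document length, de-duplication, and ascending sort are guesses at reasonable behavior.

```typescript
// Hypothetical sketch of the page_range grammar; pages are 1-based and ranges inclusive.
function parsePageRangeSketch(range: string, totalPages: number): number[] {
  const pages: number[] = [];
  for (const rawPart of range.split(",")) {
    const part = rawPart.trim();
    if (part === "") continue;
    let start: number;
    let end: number;
    // 'a:b', 'a:', ':b' and the dash form 'a-b' are all treated as ranges.
    const m = /^(\d*)\s*[:-]\s*(\d*)$/.exec(part);
    if (m) {
      start = m[1] === "" ? 1 : Number(m[1]);
      end = m[2] === "" ? totalPages : Number(m[2]);
    } else if (/^\d+$/.test(part)) {
      start = end = Number(part);
    } else {
      throw new Error(`Invalid page range segment: '${part}'`);
    }
    // Assumption: out-of-range bounds are clamped rather than rejected.
    start = Math.max(1, start);
    end = Math.min(totalPages, end);
    for (let p = start; p <= end; p++) {
      if (pages.indexOf(p) === -1) pages.push(p);
    }
  }
  return pages.sort((a, b) => a - b);
}
```

For a 10-page document, `'1,3:5,7'` expands to pages 1, 3, 4, 5, and 7 in this sketch, and the default `'1:'` expands to every page.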

Implementation Reference

src/index.ts:143-158 (schema)

Zod schema for the search_pdf tool input validation, defining fields like absolute_path, relative_path, page_range, search_pattern, max_results, max_pages_scanned, context_chars, and search_timeout.

const SearchPdfSchema = z.object({
  absolute_path: z.string().optional(),
  relative_path: z.string().optional(),
  use_pdf_home: z.boolean().default(true),
  page_range: z.string().default("1:"),
  search_pattern: z.string().min(1),
  max_results: z.coerce.number().min(1).optional(),
  max_pages_scanned: z.coerce.number().min(1).optional(),
  context_chars: z.coerce.number().min(10).max(1000).default(150),
  search_timeout: z.coerce.number().min(1000).max(60000).default(10000),
}).refine(
  (data) => (data.absolute_path && !data.relative_path) || (!data.absolute_path && data.relative_path),
  {
    message: "Exactly one of 'absolute_path' or 'relative_path' must be provided",
  }
);
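The `refine` predicate above accepts input only when exactly one path field is set. A dependency-free sketch of the same rule follows; `PathArgsSketch` and `hasExactlyOnePath` are hypothetical names introduced for illustration. Like the original predicate, it relies on truthiness, so an empty string counts as "not provided".

```typescript
// Hypothetical stand-alone version of the exactly-one-path rule.
interface PathArgsSketch {
  absolute_path?: string;
  relative_path?: string;
}

function hasExactlyOnePath(args: PathArgsSketch): boolean {
  const hasAbsolute = Boolean(args.absolute_path);
  const hasRelative = Boolean(args.relative_path);
  // True only when one flag is set and the other is not (logical XOR).
  return hasAbsolute !== hasRelative;
}
```

Supplying both paths, or neither, fails this check, which corresponds to the "Exactly one of 'absolute_path' or 'relative_path' must be provided" message in the schema.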

src/index.ts:1574-1628 (registration)

Tool registration entry in ListToolsRequestSchema handler, declaring 'search_pdf' name, description, and input schema.

{
  name: "search_pdf",
  description: "Search for text patterns (including regex) within a PDF file and return matching pages with context snippets. Supports Python-style page ranges and early stopping for performance. Use /pattern/flags format for regex (e.g., '/budget|forecast/gi') or plain text for literal search.",
  inputSchema: {
    type: "object",
    properties: {
      absolute_path: {
        type: "string",
        description: "Absolute path to the PDF file (e.g., '/Users/john/documents/report.pdf')",
      },
      relative_path: {
        type: "string",
        description: "Path relative to ~/pdf-agent/ directory (e.g., 'reports/annual.pdf')",
      },
      use_pdf_home: {
        type: "boolean",
        description: "Use PDF agent home directory for relative paths (default: true)",
        default: true,
      },
      page_range: {
        type: "string",
        description: "Page range in enhanced Python-style format: '5' (page 5), '5:10' (pages 5-10), '7:' (page 7 to end), ':5' (start to page 5). Also supports comma-separated combinations: '1,3:5,7' (pages 1, 3-5, and 7), '1-3,7,10:' (pages 1-3, 7, and 10 to end). Default: '1:' (all pages)",
        default: "1:",
      },
      search_pattern: {
        type: "string",
        description: "Search pattern: '/regex/flags' format (e.g., '/budget|forecast/gi') or plain text for literal search. Required.",
      },
      max_results: {
        type: "number",
        description: "Stop after finding this many total matches. Optional - use for quick searches.",
        minimum: 1,
      },
      max_pages_scanned: {
        type: "number",
        description: "Stop after scanning this many pages. Optional - use for quick searches.",
        minimum: 1,
      },
      context_chars: {
        type: "number",
        description: "Number of characters to include before/after each match for context. Default: 150",
        minimum: 10,
        maximum: 1000,
        default: 150,
      },
      search_timeout: {
        type: "number",
        description: "Timeout for search operations in milliseconds. Default: 10000 (10 seconds)",
        minimum: 1000,
        maximum: 60000,
        default: 10000,
      },
    },
  },
},

src/index.ts:2217-2368 (handler)

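Main handler for the search_pdf tool in CallToolRequestSchema switch case. Resolves the file path, parses the page range and search pattern, then picks a search strategy: page_by_page (searchPdfPageByPage) when max_results or max_pages_scanned is set, otherwise extract_all (searchPdfComprehensive). The response contains a JSON summary first, followed by the matches and any per-page errors.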

case "search_pdf": {
  const { 
    absolute_path, 
    relative_path, 
    use_pdf_home, 
    page_range, 
    search_pattern,
    max_results,
    max_pages_scanned,
    context_chars,
    search_timeout
  } = SearchPdfSchema.parse(args);
  
  try {
    // Resolve the final path based on parameters
    let resolvedPath: string;
    if (absolute_path) {
      resolvedPath = resolve(absolute_path);
    } else {
      if (use_pdf_home) {
        const pdfAgentHome = await ensurePdfAgentHome();
        resolvedPath = resolve(pdfAgentHome, relative_path!);
      } else {
        resolvedPath = resolve(relative_path!);
      }
    }
    
    // Check if file exists
    if (!(await fileExists(resolvedPath))) {
      throw new Error(`PDF file not found at ${resolvedPath}. Please check the file path and ensure the file exists.`);
    }
    
    // Get PDF metadata to determine page count
    const pdfBuffer = await safeReadFile(resolvedPath);
    const pdfDoc = await PDFDocument.load(pdfBuffer);
    const totalPages = pdfDoc.getPageCount();
    
    // Parse page range
    const pageNumbers = parsePageRange(page_range, totalPages);
    
    // Validate search pattern
    let searchRegex: RegExp;
    let isRegexSearch: boolean;
    try {
      const parsed = parseSearchPattern(search_pattern);
      searchRegex = parsed.regex;
      isRegexSearch = parsed.isRegex;
    } catch (regexError) {
      throw new Error(`Invalid search pattern: ${regexError}`);
    }
    
    // Determine search strategy based on limits
    const hasLimits = max_results !== undefined || max_pages_scanned !== undefined;
    let searchResults: any;
    let searchStrategy: string;
    
    if (hasLimits) {
      // Use page-by-page search with early stopping
      searchStrategy = "page_by_page";
      searchResults = await searchPdfPageByPage(
        resolvedPath,
        pageNumbers,
        search_pattern,
        context_chars,
        search_timeout,
        max_results,
        max_pages_scanned
      );
    } else {
      // Use comprehensive search (extract all then search)
      searchStrategy = "extract_all";
      const comprehensiveResults = await searchPdfComprehensive(
        resolvedPath,
        pageNumbers,
        search_pattern,
        context_chars,
        search_timeout
      );
      searchResults = {
        ...comprehensiveResults,
        completed: true,
        stoppedReason: 'completed'
      };
    }
    
    // Create comprehensive summary
    const totalMatches = searchResults.matches.reduce((sum: number, page: any) => sum + page.matchCount, 0);
    const pagesWithMatches = searchResults.matches.length;
    
    const summary = {
      total_matches: totalMatches,
      pages_with_matches: pagesWithMatches,
      pages_scanned: searchResults.pagesScanned,
      total_pages_in_range: pageNumbers.length,
      search_strategy: searchStrategy,
      search_pattern: search_pattern,
      is_regex: isRegexSearch,
      completed: searchResults.completed,
      stopped_reason: searchResults.stoppedReason,
      context_chars: context_chars,
      timeout_ms: search_timeout,
      errors: searchResults.errors?.length || 0
    };
    
    // Prepare response content
    const content: any[] = [];
    
    // Add summary as first item
    content.push({
      type: "text",
      text: JSON.stringify(summary, null, 2)
    });
    
    // Add detailed results if matches found
    if (searchResults.matches.length > 0) {
      content.push({
        type: "text", 
        text: JSON.stringify({
          matches: searchResults.matches,
          errors: searchResults.errors || []
        }, null, 2)
      });
    }
    
    // Add error details if any
    if (searchResults.errors && searchResults.errors.length > 0) {
      content.push({
        type: "text",
        text: JSON.stringify({
          errors: searchResults.errors
        }, null, 2)
      });
    }
    
    return {
      content: content
    };
  } catch (e) {
    const providedPath = relative_path || absolute_path || 'unknown';
    const pathType = relative_path ? 'relative path' : 'absolute path';
    return {
      content: [
        {
          type: "text",
          text: JSON.stringify({ 
            error: `Error searching PDF at ${pathType} '${providedPath}': ${e}. Please ensure the file is a valid PDF, check the search pattern format, and verify the page range.` 
          }),
        },
      ],
    };
  }
}
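On the client side, the handler's response can be unpacked by parsing each text item as JSON. The sketch below assumes only the shapes visible in the handler above (a summary item first, an optional `matches` item, and an `error` object on failure); `readSearchResponse` and its types are hypothetical names, and only a few summary fields are typed.

```typescript
// Hypothetical client-side reader for the search_pdf response content.
interface TextItemSketch {
  type: "text";
  text: string;
}

interface SearchSummarySketch {
  total_matches: number;
  pages_with_matches: number;
  completed: boolean;
  stopped_reason?: string;
}

interface ReadResultSketch {
  summary?: SearchSummarySketch;
  error?: string;
  matches: unknown[];
}

function readSearchResponse(content: TextItemSketch[]): ReadResultSketch {
  if (content.length === 0) return { matches: [] };
  const first = JSON.parse(content[0].text);
  // The catch branch of the handler returns a single item of the form { error: "..." }.
  if (typeof first.error === "string") return { error: first.error, matches: [] };
  let matches: unknown[] = [];
  for (const item of content.slice(1)) {
    const parsed = JSON.parse(item.text);
    // Only the item that carries a 'matches' array is of interest here.
    if (Array.isArray(parsed.matches)) matches = parsed.matches;
  }
  return { summary: first as SearchSummarySketch, matches };
}
```

Note that when a search produces both matches and per-page errors, the handler as shown emits the `errors` array twice: once alongside `matches` and once in a separate item.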

src/index.ts:645-727 (helper)

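Helper function searchPdfComprehensive: extract-all-then-search strategy that extracts text from all requested pages in one pass, then searches each page's text separately. Pages that yield no text, or whose search fails, are recorded in errors rather than aborting the whole search.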

async function searchPdfComprehensive(
  pdfPath: string,
  pageNumbers: number[],
  searchPattern: string,
  contextChars: number,
  searchTimeout: number
): Promise<{
  matches: Array<{
    page: number;
    matchCount: number;
    snippets: Array<{
      text: string;
      matchStart: number;
      matchEnd: number;
    }>;
  }>;
  errors: string[];
  pagesScanned: number;
}> {
  const results: Array<{
    page: number;
    matchCount: number;
    snippets: Array<{
      text: string;
      matchStart: number;
      matchEnd: number;
    }>;
  }> = [];
  const errors: string[] = [];
  
  try {
    // Extract text from all pages using hybrid approach
    log('info', `Extracting text from ${pageNumbers.length} pages for comprehensive search`);
    const pdfBuffer = await safeReadFile(pdfPath);
    const pageTexts = await extractTextHybrid(pdfBuffer, pdfPath, pageNumbers);
    
    // Parse search pattern
    const { regex } = parseSearchPattern(searchPattern);
    
    // Search each page
    for (let i = 0; i < pageNumbers.length; i++) {
      const pageNum = pageNumbers[i];
      const pageText = pageTexts[i];
      
      if (!pageText || pageText.trim().length === 0) {
        errors.push(`Page ${pageNum}: No text extracted`);
        continue;
      }
      
      try {
        // Search with timeout protection
        const matches = await searchWithTimeout(pageText, new RegExp(regex.source, regex.flags), searchTimeout);
        
        if (matches.length > 0) {
          const snippets = matches.map(match => {
            const context = extractContext(pageText, match.index!, match.index! + match[0].length, contextChars);
            return {
              text: context.snippet,
              matchStart: context.matchStartInSnippet,
              matchEnd: context.matchEndInSnippet,
            };
          });
          
          results.push({
            page: pageNum,
            matchCount: matches.length,
            snippets,
          });
        }
      } catch (searchError) {
        errors.push(`Page ${pageNum}: Search failed - ${searchError}`);
      }
    }
    
    return {
      matches: results,
      errors,
      pagesScanned: pageNumbers.length,
    };
  } catch (error) {
    throw new Error(`Comprehensive search failed: ${error}`);
  }
}
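Both search helpers rely on `extractContext` to build each snippet, but its body is not shown on this page. A minimal sketch of what such a helper might do follows, assuming it simply slices `contextChars` characters on either side of the match and reports the match offsets relative to the snippet. The name `extractContextSketch` is hypothetical; the returned field names mirror those used by the callers above.

```typescript
// Hypothetical sketch; the real extractContext may trim whitespace or add ellipses.
interface ContextSketch {
  snippet: string;
  matchStartInSnippet: number;
  matchEndInSnippet: number;
}

function extractContextSketch(
  text: string,
  matchStart: number,
  matchEnd: number,
  contextChars: number
): ContextSketch {
  // Clamp the window so it never runs past either end of the page text.
  const from = Math.max(0, matchStart - contextChars);
  const to = Math.min(text.length, matchEnd + contextChars);
  return {
    snippet: text.slice(from, to),
    matchStartInSnippet: matchStart - from,
    matchEndInSnippet: matchEnd - from,
  };
}
```

Under this sketch, `snippet.slice(matchStartInSnippet, matchEndInSnippet)` always recovers the matched text, which lets a client highlight it inside the snippet.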

src/index.ts:732-842 (helper)

Helper function searchPdfPageByPage: page-by-page search with early stopping based on max_results or max_pages_scanned limits.

async function searchPdfPageByPage(
  pdfPath: string,
  pageNumbers: number[],
  searchPattern: string,
  contextChars: number,
  searchTimeout: number,
  maxResults?: number,
  maxPagesScanned?: number
): Promise<{
  matches: Array<{
    page: number;
    matchCount: number;
    snippets: Array<{
      text: string;
      matchStart: number;
      matchEnd: number;
    }>;
  }>;
  errors: string[];
  pagesScanned: number;
  completed: boolean;
  stoppedReason?: 'max_results' | 'max_pages' | 'completed';
}> {
  const results: Array<{
    page: number;
    matchCount: number;
    snippets: Array<{
      text: string;
      matchStart: number;
      matchEnd: number;
    }>;
  }> = [];
  const errors: string[] = [];
  let totalMatchCount = 0;
  let pagesScanned = 0;
  
  // Parse search pattern once
  const { regex } = parseSearchPattern(searchPattern);
  
  log('info', `Starting page-by-page search with limits: max_results=${maxResults}, max_pages=${maxPagesScanned}`);
  
  for (const pageNum of pageNumbers) {
    // Check if we should stop scanning more pages
    if (maxPagesScanned && pagesScanned >= maxPagesScanned) {
      return {
        matches: results,
        errors,
        pagesScanned,
        completed: false,
        stoppedReason: 'max_pages',
      };
    }
    
    pagesScanned++;
    
    try {
      // Extract text from single page using hybrid approach
      const pdfBuffer = await safeReadFile(pdfPath);
      const pageTexts = await extractTextHybrid(pdfBuffer, pdfPath, [pageNum]);
      const pageText = pageTexts[0];
      
      if (!pageText || pageText.trim().length === 0) {
        errors.push(`Page ${pageNum}: No text extracted`);
        continue;
      }
      
      // Search with timeout protection
      const matches = await searchWithTimeout(pageText, new RegExp(regex.source, regex.flags), searchTimeout);
      
      if (matches.length > 0) {
        const snippets = matches.map(match => {
          const context = extractContext(pageText, match.index!, match.index! + match[0].length, contextChars);
          return {
            text: context.snippet,
            matchStart: context.matchStartInSnippet,
            matchEnd: context.matchEndInSnippet,
          };
        });
        
        results.push({
          page: pageNum,
          matchCount: matches.length,
          snippets,
        });
        
        totalMatchCount += matches.length;
        
        // Check if we've reached max results
        if (maxResults && totalMatchCount >= maxResults) {
          return {
            matches: results,
            errors,
            pagesScanned,
            completed: false,
            stoppedReason: 'max_results',
          };
        }
      }
    } catch (searchError) {
      errors.push(`Page ${pageNum}: Search failed - ${searchError}`);
    }
  }
  
  return {
    matches: results,
    errors,
    pagesScanned,
    completed: true,
    stoppedReason: 'completed',
  };
}
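The early-stopping rules in `searchPdfPageByPage` can be reproduced on in-memory strings, which makes their edge cases easier to see. This is a simplified sketch: `scanPagesWithLimits` is a hypothetical name, text extraction and snippets are omitted, and each array element stands in for one page's extracted text.

```typescript
// Hypothetical in-memory model of the early-stopping loop.
type StopReasonSketch = "max_results" | "max_pages" | "completed";

interface ScanOutcomeSketch {
  totalMatches: number;
  pagesScanned: number;
  stoppedReason: StopReasonSketch;
}

function scanPagesWithLimits(
  pageTexts: string[],
  pattern: RegExp,
  maxResults?: number,
  maxPagesScanned?: number
): ScanOutcomeSketch {
  // A global regex is needed so String.prototype.match returns every occurrence.
  const flags = pattern.flags.indexOf("g") === -1 ? pattern.flags + "g" : pattern.flags;
  let totalMatches = 0;
  let pagesScanned = 0;
  for (const text of pageTexts) {
    // The page limit is checked before a page is scanned.
    if (maxPagesScanned && pagesScanned >= maxPagesScanned) {
      return { totalMatches, pagesScanned, stoppedReason: "max_pages" };
    }
    pagesScanned++;
    const found = text.match(new RegExp(pattern.source, flags));
    totalMatches += found ? found.length : 0;
    // The result limit is checked only after the whole page has been counted.
    if (maxResults && totalMatches >= maxResults) {
      return { totalMatches, pagesScanned, stoppedReason: "max_results" };
    }
  }
  return { totalMatches, pagesScanned, stoppedReason: "completed" };
}
```

Two consequences follow from the order of the checks: the total can exceed `max_results` because a page is never cut off midway, and a range that ends exactly at the `max_pages_scanned` limit is reported as `completed` rather than `max_pages`.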
