search_multiple_pdfs

Search for text patterns across multiple PDF files in parallel. Returns matches and errors per file for efficient document analysis.

Instructions

Search for text patterns across multiple PDF files in parallel. Processes files concurrently based on the parallelism factor for optimal performance. Increase parallelism (max: 50) to search more files simultaneously and reduce total search time. For large batches of files, prefer a single call with high parallelism rather than multiple smaller calls (e.g., search 100 files with parallelism=50 in one call instead of multiple calls with 20 files each). Returns matches and errors for each file separately.

Input Schema

TableJSON Schema

Name	Required	Description	Default
`files`	Yes	Array of PDF files to search. Each file must specify either absolute_path or relative_path.
`search_pattern`	Yes	Search pattern: '/regex/flags' format or plain text. Applied to all files.
`parallelism`	No	Number of files to process concurrently. Higher values = faster search. Default: 4, Max: 50
`page_range`	No	Page range to search in each file. Default: '1:' (all pages)	1:
`max_results_per_file`	No	Max matches per file before stopping. Optional.
`max_pages_scanned_per_file`	No	Max pages to scan per file. Optional.
`context_chars`	No	Characters of context around matches. Default: 150
`search_timeout`	No	Timeout per file in milliseconds. Default: 10000

Implementation Reference

src/index.ts:180-196 (schema)

Zod schema for the search_multiple_pdfs tool - defines input validation including files array, search_pattern, parallelism, page_range, max_results_per_file, max_pages_scanned_per_file, context_chars, and search_timeout.

const SearchMultiplePdfsSchema = z.object({
  files: z.array(z.object({
    absolute_path: z.string().optional(),
    relative_path: z.string().optional(),
    use_pdf_home: z.boolean().default(true),
  }).refine(
    (data) => (data.absolute_path && !data.relative_path) || (!data.absolute_path && data.relative_path),
    { message: "Exactly one of 'absolute_path' or 'relative_path' must be provided for each file" }
  )).min(1),
  search_pattern: z.string().min(1),
  parallelism: z.coerce.number().min(1).max(50).default(4),
  page_range: z.string().default("1:"),
  max_results_per_file: z.coerce.number().min(1).optional(),
  max_pages_scanned_per_file: z.coerce.number().min(1).optional(),
  context_chars: z.coerce.number().min(10).max(1000).default(150),
  search_timeout: z.coerce.number().min(1000).max(60000).default(10000),
});

src/index.ts:1691-1763 (registration)

Tool registration in ListToolsRequestSchema handler - defines the tool name, description, and input schema for 'search_multiple_pdfs'.

{
  name: "search_multiple_pdfs",
  description: "Search for text patterns across multiple PDF files in parallel. Processes files concurrently based on the parallelism factor for optimal performance. Increase parallelism (max: 50) to search more files simultaneously and reduce total search time. For large batches of files, prefer a single call with high parallelism rather than multiple smaller calls (e.g., search 100 files with parallelism=50 in one call instead of multiple calls with 20 files each). Returns matches and errors for each file separately.",
  inputSchema: {
    type: "object",
    properties: {
      files: {
        type: "array",
        description: "Array of PDF files to search. Each file must specify either absolute_path or relative_path.",
        items: {
          type: "object",
          properties: {
            absolute_path: {
              type: "string",
              description: "Absolute path to the PDF file"
            },
            relative_path: {
              type: "string",
              description: "Path relative to ~/pdf-agent/ directory"
            },
            use_pdf_home: {
              type: "boolean",
              description: "Use PDF agent home directory for relative paths (default: true)",
              default: true
            }
          }
        },
        minItems: 1
      },
      search_pattern: {
        type: "string",
        description: "Search pattern: '/regex/flags' format or plain text. Applied to all files."
      },
      parallelism: {
        type: "number",
        description: "Number of files to process concurrently. Higher values = faster search. Default: 4, Max: 50",
        minimum: 1,
        maximum: 50,
        default: 4
      },
      page_range: {
        type: "string",
        description: "Page range to search in each file. Default: '1:' (all pages)",
        default: "1:"
      },
      max_results_per_file: {
        type: "number",
        description: "Max matches per file before stopping. Optional.",
        minimum: 1
      },
      max_pages_scanned_per_file: {
        type: "number",
        description: "Max pages to scan per file. Optional.",
        minimum: 1
      },
      context_chars: {
        type: "number",
        description: "Characters of context around matches. Default: 150",
        minimum: 10,
        maximum: 1000,
        default: 150
      },
      search_timeout: {
        type: "number",
        description: "Timeout per file in milliseconds. Default: 10000",
        minimum: 1000,
        maximum: 60000,
        default: 10000
      }
    },
    required: ["files", "search_pattern"]
  }
},

src/index.ts:2506-2621 (handler)

Main handler implementation in CallToolRequestSchema - resolves file paths, validates inputs with Zod schema, calls searchMultiplePdfsWithParallelism, and formats the response with summary statistics.

case "search_multiple_pdfs": {
  // Handle case where files might be passed as JSON string
  let processedArgs = { ...args };
  if (args && typeof args.files === 'string') {
    try {
      processedArgs.files = JSON.parse(args.files);
    } catch (e) {
      throw new Error(`Invalid JSON in files parameter: ${e}`);
    }
  }
  
  const { 
    files, 
    search_pattern,
    parallelism,
    page_range,
    max_results_per_file,
    max_pages_scanned_per_file,
    context_chars,
    search_timeout
  } = SearchMultiplePdfsSchema.parse(processedArgs);
  
  try {
    // Resolve all file paths
    const resolvedFiles = await Promise.all(files.map(async (file) => {
      let resolvedPath: string;
      let originalPath: string;
      
      if (file.use_pdf_home && file.relative_path) {
        const pdfAgentHome = await ensurePdfAgentHome();
        resolvedPath = join(pdfAgentHome, file.relative_path);
        originalPath = file.relative_path;
      } else if (file.absolute_path) {
        if (!isAbsolute(file.absolute_path)) {
          throw new Error(`Path '${file.absolute_path}' is not absolute`);
        }
        resolvedPath = file.absolute_path;
        originalPath = file.absolute_path;
      } else {
        throw new Error('Invalid file specification');
      }
      
      return { path: resolvedPath, originalPath };
    }));
    
    log('info', `Starting parallel search across ${files.length} PDFs with parallelism ${parallelism}`);
    
    // Perform parallel search
    const searchResults = await searchMultiplePdfsWithParallelism(
      resolvedFiles,
      search_pattern,
      {
        parallelism,
        pageRange: page_range,
        maxResultsPerFile: max_results_per_file,
        maxPagesScannedPerFile: max_pages_scanned_per_file,
        contextChars: context_chars,
        searchTimeout: search_timeout
      }
    );
    
    // Calculate summary statistics
    const successfulSearches = searchResults.filter(r => r.success);
    const failedSearches = searchResults.filter(r => !r.success);
    const totalMatches = successfulSearches.reduce((sum, r) => {
      if (r.result?.total_matches) {
        return sum + r.result.total_matches;
      }
      return sum;
    }, 0);
    
    const totalPagesScanned = successfulSearches.reduce((sum, r) => {
      if (r.result?.pages_scanned) {
        return sum + r.result.pages_scanned;
      }
      return sum;
    }, 0);
    
    const summary = {
      files_searched: files.length,
      successful_searches: successfulSearches.length,
      failed_searches: failedSearches.length,
      total_matches_found: totalMatches,
      total_pages_scanned: totalPagesScanned,
      search_pattern: search_pattern,
      parallelism_used: parallelism,
      page_range: page_range
    };
    
    log('info', `Search completed: ${totalMatches} matches found across ${successfulSearches.length} files`);
    
    return {
      content: [
        {
          type: "text",
          text: JSON.stringify({
            summary,
            results: searchResults
          }, null, 2)
        }
      ]
    };
  } catch (e) {
    log('error', 'Error in search_multiple_pdfs', { error: e });
    return {
      content: [
        {
          type: "text",
          text: JSON.stringify({ 
            error: `Error searching multiple PDFs: ${e}`
          })
        }
      ]
    };
  }
}

src/index.ts:1280-1410 (helper)

Helper function searchMultiplePdfsWithParallelism - searches multiple PDFs in parallel batches, handling file existence checks, PDF loading, page range parsing, and delegating to searchPdfPageByPage or searchPdfComprehensive based on limits.

/**
 * Search multiple PDFs with parallelism control
 */
async function searchMultiplePdfsWithParallelism(
  files: Array<{ path: string; originalPath: string }>,
  searchPattern: string,
  options: {
    parallelism: number;
    pageRange: string;
    maxResultsPerFile?: number;
    maxPagesScannedPerFile?: number;
    contextChars: number;
    searchTimeout: number;
  }
): Promise<Array<{
  file: string;
  success: boolean;
  result?: any;
  error?: string;
}>> {
  const results: Array<{ file: string; success: boolean; result?: any; error?: string }> = [];
  
  // Process files in batches based on parallelism
  for (let i = 0; i < files.length; i += options.parallelism) {
    const batch = files.slice(i, i + options.parallelism);
    
    const batchPromises = batch.map(async ({ path, originalPath }) => {
      try {
        // Check if file exists
        if (!(await fileExists(path))) {
          return {
            file: originalPath,
            success: false,
            error: `File not found: ${path}`
          };
        }
        
        // Get PDF metadata
        const pdfBuffer = await safeReadFile(path);
        let pdfDoc: PDFDocument;
        try {
          pdfDoc = await PDFDocument.load(pdfBuffer);
        } catch (error) {
          if (error instanceof Error && error.message.includes('encrypted')) {
            pdfDoc = await PDFDocument.load(pdfBuffer, { ignoreEncryption: true });
          } else {
            throw error;
          }
        }
        const totalPages = pdfDoc.getPageCount();
        
        // Parse page range
        const pageNumbers = parsePageRange(options.pageRange, totalPages);
        
        // Determine search strategy
        const hasLimits = options.maxResultsPerFile !== undefined || options.maxPagesScannedPerFile !== undefined;
        
        let searchResult;
        if (hasLimits) {
          searchResult = await searchPdfPageByPage(
            path,
            pageNumbers,
            searchPattern,
            options.contextChars,
            options.searchTimeout,
            options.maxResultsPerFile,
            options.maxPagesScannedPerFile
          );
        } else {
          const comprehensiveResult = await searchPdfComprehensive(
            path,
            pageNumbers,
            searchPattern,
            options.contextChars,
            options.searchTimeout
          );
          searchResult = {
            ...comprehensiveResult,
            completed: true,
            stoppedReason: 'completed'
          };
        }
        
        // Calculate total matches for this file
        const totalMatches = searchResult.matches.reduce((sum: number, page: any) => 
          sum + page.matchCount, 0);
        
        return {
          file: originalPath,
          success: true,
          result: {
            total_pages: totalPages,
            pages_in_range: pageNumbers.length,
            total_matches: totalMatches,
            pages_with_matches: searchResult.matches.length,
            pages_scanned: searchResult.pagesScanned,
            completed: searchResult.completed,
            stopped_reason: searchResult.stoppedReason,
            matches: searchResult.matches,
            errors: searchResult.errors
          }
        };
      } catch (error) {
        log('error', `Error searching PDF ${originalPath}`, { error });
        return {
          file: originalPath,
          success: false,
          error: error instanceof Error ? error.message : String(error)
        };
      }
    });
    
    // Wait for batch to complete
    const batchResults = await Promise.allSettled(batchPromises);
    
    // Process batch results
    batchResults.forEach((result, index) => {
      if (result.status === 'fulfilled') {
        results.push(result.value);
      } else {
        results.push({
          file: batch[index].originalPath,
          success: false,
          error: `Unexpected error: ${result.reason}`
        });
      }
    });
  }
  
  return results;
}

PDF Agent MCP

search_multiple_pdfs

Instructions

Input Schema

Implementation Reference

Tool Definition Quality

Other Tools

Latest Blog Posts

MCP directory API