
search_pdf

Search for text patterns or regex within a PDF file, returning matching pages with context snippets. Use page ranges and early stopping for efficient large document scanning.

Instructions

Search for text patterns (including regex) within a PDF file and return matching pages with context snippets. Supports Python-style page ranges and early stopping for performance. Use /pattern/flags format for regex (e.g., '/budget|forecast/gi') or plain text for literal search.
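The two pattern forms described above can be told apart with a small parser. The sketch below is illustrative only: the server's actual `parseSearchPattern` is referenced in the implementation but its body is not shown on this page, so the return shape (`regex`, `isRegex`) is taken from how the handler uses it, while `escapeRegExp`, the forced `g` flag, and the case-insensitive literal search are assumptions.

```typescript
// Hypothetical sketch of a '/pattern/flags' parser. Not the server's real code.
interface ParsedPattern {
  regex: RegExp;
  isRegex: boolean;
}

// Escape characters that have special meaning in a regular expression,
// so that plain text is matched literally.
function escapeRegExp(literal: string): string {
  return literal.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function parseSearchPattern(pattern: string): ParsedPattern {
  // Regex form: leading slash, non-empty body, closing slash, optional flags.
  const parts = /^\/([\s\S]+)\/([a-z]*)$/.exec(pattern);
  if (parts) {
    const body = parts[1];
    let flags = parts[2];
    // Assumption: always add 'g' so every occurrence on a page is found.
    if (flags.indexOf("g") === -1) {
      flags += "g";
    }
    // new RegExp throws a SyntaxError for an invalid body or unknown flags.
    return { regex: new RegExp(body, flags), isRegex: true };
  }
  // Assumption: plain text is matched literally and case-insensitively.
  return { regex: new RegExp(escapeRegExp(pattern), "gi"), isRegex: false };
}
```

Under these assumptions, a plain-text pattern that happens to start and end with a slash (for example a path fragment) is treated as a regex, so such literals would need to be written without the surrounding slashes.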

Input Schema

| Name | Required | Description | Default |
|---|---|---|---|
| absolute_path | No | Absolute path to the PDF file (e.g., '/Users/john/documents/report.pdf') | |
| relative_path | No | Path relative to ~/pdf-agent/ directory (e.g., 'reports/annual.pdf') | |
| use_pdf_home | No | Use PDF agent home directory for relative paths (default: true) | |
| page_range | No | Page range in enhanced Python-style format: '5' (page 5), '5:10' (pages 5-10), '7:' (page 7 to end), ':5' (start to page 5). Also supports comma-separated combinations: '1,3:5,7' (pages 1, 3-5, and 7), '1-3,7,10:' (pages 1-3, 7, and 10 to end). Default: '1:' (all pages) | 1: |
| search_pattern | No | Search pattern: '/regex/flags' format (e.g., '/budget\|forecast/gi') or plain text for literal search. Required. | |
| max_results | No | Stop after finding this many total matches. Optional - use for quick searches. | |
| max_pages_scanned | No | Stop after scanning this many pages. Optional - use for quick searches. | |
| context_chars | No | Number of characters to include before/after each match for context. Default: 150 | |
| search_timeout | No | Timeout for search operations in milliseconds. Default: 10000 (10 seconds) | |
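
The `page_range` syntax in the table can be expanded into a concrete list of 1-based page numbers. The following is a minimal sketch, not the server's own `parsePageRange` (which is called in the handler but not shown on this page). It assumes ranges are inclusive on both ends, as the examples suggest, and, as further assumptions, that pages beyond the end of the document are silently dropped and duplicates are removed while preserving first-seen order.

```typescript
// Hypothetical sketch of page-range expansion. Not the server's real code.
function parsePageRange(range: string, totalPages: number): number[] {
  const pages: number[] = [];

  for (const rawPart of range.split(",")) {
    const part = rawPart.trim();
    if (part === "") continue;

    let start: number;
    let end: number;
    // Range form: optional start, ':' or '-' separator, optional end.
    const bounds = /^(\d*)\s*[:-]\s*(\d*)$/.exec(part);
    if (bounds) {
      start = bounds[1] === "" ? 1 : Number(bounds[1]);
      end = bounds[2] === "" ? totalPages : Number(bounds[2]);
      if (bounds[2] !== "" && end < start) {
        throw new Error(`Invalid page range segment: '${part}'`);
      }
    } else if (/^\d+$/.test(part)) {
      // Single page form, e.g. '5'.
      start = Number(part);
      end = start;
    } else {
      throw new Error(`Invalid page range segment: '${part}'`);
    }
    if (start < 1) {
      throw new Error(`Page numbers start at 1: '${part}'`);
    }

    // Assumption: clamp to the document length instead of failing.
    const last = Math.min(end, totalPages);
    for (let page = start; page <= last; page++) {
      if (pages.indexOf(page) === -1) {
        pages.push(page);
      }
    }
  }
  return pages;
}
```

With this sketch, the default `'1:'` expands to every page of the document, and an open-ended start such as `'10:'` on a shorter document yields an empty list rather than an error.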

Implementation Reference

  • Zod schema for the search_pdf tool input validation, defining fields like absolute_path, relative_path, page_range, search_pattern, max_results, max_pages_scanned, context_chars, and search_timeout.
    const SearchPdfSchema = z.object({
      absolute_path: z.string().optional(),
      relative_path: z.string().optional(),
      use_pdf_home: z.boolean().default(true),
      page_range: z.string().default("1:"),
      search_pattern: z.string().min(1),
      max_results: z.coerce.number().min(1).optional(),
      max_pages_scanned: z.coerce.number().min(1).optional(),
      context_chars: z.coerce.number().min(10).max(1000).default(150),
      search_timeout: z.coerce.number().min(1000).max(60000).default(10000),
    }).refine(
      (data) => (data.absolute_path && !data.relative_path) || (!data.absolute_path && data.relative_path),
      {
        message: "Exactly one of 'absolute_path' or 'relative_path' must be provided",
      }
    );
  • src/index.ts:1574-1628 (registration)
    Tool registration entry in ListToolsRequestSchema handler, declaring 'search_pdf' name, description, and input schema.
    {
      name: "search_pdf",
      description: "Search for text patterns (including regex) within a PDF file and return matching pages with context snippets. Supports Python-style page ranges and early stopping for performance. Use /pattern/flags format for regex (e.g., '/budget|forecast/gi') or plain text for literal search.",
      inputSchema: {
        type: "object",
        properties: {
          absolute_path: {
            type: "string",
            description: "Absolute path to the PDF file (e.g., '/Users/john/documents/report.pdf')",
          },
          relative_path: {
            type: "string",
            description: "Path relative to ~/pdf-agent/ directory (e.g., 'reports/annual.pdf')",
          },
          use_pdf_home: {
            type: "boolean",
            description: "Use PDF agent home directory for relative paths (default: true)",
            default: true,
          },
          page_range: {
            type: "string",
            description: "Page range in enhanced Python-style format: '5' (page 5), '5:10' (pages 5-10), '7:' (page 7 to end), ':5' (start to page 5). Also supports comma-separated combinations: '1,3:5,7' (pages 1, 3-5, and 7), '1-3,7,10:' (pages 1-3, 7, and 10 to end). Default: '1:' (all pages)",
            default: "1:",
          },
          search_pattern: {
            type: "string",
            description: "Search pattern: '/regex/flags' format (e.g., '/budget|forecast/gi') or plain text for literal search. Required.",
          },
          max_results: {
            type: "number",
            description: "Stop after finding this many total matches. Optional - use for quick searches.",
            minimum: 1,
          },
          max_pages_scanned: {
            type: "number",
            description: "Stop after scanning this many pages. Optional - use for quick searches.",
            minimum: 1,
          },
          context_chars: {
            type: "number",
            description: "Number of characters to include before/after each match for context. Default: 150",
            minimum: 10,
            maximum: 1000,
            default: 150,
          },
          search_timeout: {
            type: "number",
            description: "Timeout for search operations in milliseconds. Default: 10000 (10 seconds)",
            minimum: 1000,
            maximum: 60000,
            default: 10000,
          },
        },
      },
    },
  • Main handler for the search_pdf tool in CallToolRequestSchema switch case. Resolves file path, determines search strategy (page_by_page vs extract_all), calls searchPdfPageByPage or searchPdfComprehensive, and formats results.
    case "search_pdf": {
      const { 
        absolute_path, 
        relative_path, 
        use_pdf_home, 
        page_range, 
        search_pattern,
        max_results,
        max_pages_scanned,
        context_chars,
        search_timeout
      } = SearchPdfSchema.parse(args);
      
      try {
        // Resolve the final path based on parameters
        let resolvedPath: string;
        if (absolute_path) {
          resolvedPath = resolve(absolute_path);
        } else {
          if (use_pdf_home) {
            const pdfAgentHome = await ensurePdfAgentHome();
            resolvedPath = resolve(pdfAgentHome, relative_path!);
          } else {
            resolvedPath = resolve(relative_path!);
          }
        }
        
        // Check if file exists
        if (!(await fileExists(resolvedPath))) {
          throw new Error(`PDF file not found at ${resolvedPath}. Please check the file path and ensure the file exists.`);
        }
        
        // Get PDF metadata to determine page count
        const pdfBuffer = await safeReadFile(resolvedPath);
        const pdfDoc = await PDFDocument.load(pdfBuffer);
        const totalPages = pdfDoc.getPageCount();
        
        // Parse page range
        const pageNumbers = parsePageRange(page_range, totalPages);
        
        // Validate search pattern
        let searchRegex: RegExp;
        let isRegexSearch: boolean;
        try {
          const parsed = parseSearchPattern(search_pattern);
          searchRegex = parsed.regex;
          isRegexSearch = parsed.isRegex;
        } catch (regexError) {
          throw new Error(`Invalid search pattern: ${regexError}`);
        }
        
        // Determine search strategy based on limits
        const hasLimits = max_results !== undefined || max_pages_scanned !== undefined;
        let searchResults: any;
        let searchStrategy: string;
        
        if (hasLimits) {
          // Use page-by-page search with early stopping
          searchStrategy = "page_by_page";
          searchResults = await searchPdfPageByPage(
            resolvedPath,
            pageNumbers,
            search_pattern,
            context_chars,
            search_timeout,
            max_results,
            max_pages_scanned
          );
        } else {
          // Use comprehensive search (extract all then search)
          searchStrategy = "extract_all";
          const comprehensiveResults = await searchPdfComprehensive(
            resolvedPath,
            pageNumbers,
            search_pattern,
            context_chars,
            search_timeout
          );
          searchResults = {
            ...comprehensiveResults,
            completed: true,
            stoppedReason: 'completed'
          };
        }
        
        // Create comprehensive summary
        const totalMatches = searchResults.matches.reduce((sum: number, page: any) => sum + page.matchCount, 0);
        const pagesWithMatches = searchResults.matches.length;
        
        const summary = {
          total_matches: totalMatches,
          pages_with_matches: pagesWithMatches,
          pages_scanned: searchResults.pagesScanned,
          total_pages_in_range: pageNumbers.length,
          search_strategy: searchStrategy,
          search_pattern: search_pattern,
          is_regex: isRegexSearch,
          completed: searchResults.completed,
          stopped_reason: searchResults.stoppedReason,
          context_chars: context_chars,
          timeout_ms: search_timeout,
          errors: searchResults.errors?.length || 0
        };
        
        // Prepare response content
        const content: any[] = [];
        
        // Add summary as first item
        content.push({
          type: "text",
          text: JSON.stringify(summary, null, 2)
        });
        
        // Add detailed results if matches found
        if (searchResults.matches.length > 0) {
          content.push({
            type: "text", 
            text: JSON.stringify({
              matches: searchResults.matches,
              errors: searchResults.errors || []
            }, null, 2)
          });
        }
        
        // Add error details if any
        if (searchResults.errors && searchResults.errors.length > 0) {
          content.push({
            type: "text",
            text: JSON.stringify({
              errors: searchResults.errors
            }, null, 2)
          });
        }
        
        return {
          content: content
        };
      } catch (e) {
        const providedPath = relative_path || absolute_path || 'unknown';
        const pathType = relative_path ? 'relative path' : 'absolute path';
        return {
          content: [
            {
              type: "text",
              text: JSON.stringify({ 
                error: `Error searching PDF at ${pathType} '${providedPath}': ${e}. Please ensure the file is a valid PDF, check the search pattern format, and verify the page range.` 
              }),
            },
          ],
        };
      }
    }
  • Helper function searchPdfComprehensive: extract-all-then-search strategy that extracts text from all requested pages first, then searches for the pattern across all extracted text.
    async function searchPdfComprehensive(
      pdfPath: string,
      pageNumbers: number[],
      searchPattern: string,
      contextChars: number,
      searchTimeout: number
    ): Promise<{
      matches: Array<{
        page: number;
        matchCount: number;
        snippets: Array<{
          text: string;
          matchStart: number;
          matchEnd: number;
        }>;
      }>;
      errors: string[];
      pagesScanned: number;
    }> {
      const results: Array<{
        page: number;
        matchCount: number;
        snippets: Array<{
          text: string;
          matchStart: number;
          matchEnd: number;
        }>;
      }> = [];
      const errors: string[] = [];
      
      try {
        // Extract text from all pages using hybrid approach
        log('info', `Extracting text from ${pageNumbers.length} pages for comprehensive search`);
        const pdfBuffer = await safeReadFile(pdfPath);
        const pageTexts = await extractTextHybrid(pdfBuffer, pdfPath, pageNumbers);
        
        // Parse search pattern
        const { regex } = parseSearchPattern(searchPattern);
        
        // Search each page
        for (let i = 0; i < pageNumbers.length; i++) {
          const pageNum = pageNumbers[i];
          const pageText = pageTexts[i];
          
          if (!pageText || pageText.trim().length === 0) {
            errors.push(`Page ${pageNum}: No text extracted`);
            continue;
          }
          
          try {
            // Search with timeout protection
            const matches = await searchWithTimeout(pageText, new RegExp(regex.source, regex.flags), searchTimeout);
            
            if (matches.length > 0) {
              const snippets = matches.map(match => {
                const context = extractContext(pageText, match.index!, match.index! + match[0].length, contextChars);
                return {
                  text: context.snippet,
                  matchStart: context.matchStartInSnippet,
                  matchEnd: context.matchEndInSnippet,
                };
              });
              
              results.push({
                page: pageNum,
                matchCount: matches.length,
                snippets,
              });
            }
          } catch (searchError) {
            errors.push(`Page ${pageNum}: Search failed - ${searchError}`);
          }
        }
        
        return {
          matches: results,
          errors,
          pagesScanned: pageNumbers.length,
        };
      } catch (error) {
        throw new Error(`Comprehensive search failed: ${error}`);
      }
    }
  • Helper function searchPdfPageByPage: page-by-page search with early stopping based on max_results or max_pages_scanned limits.
    async function searchPdfPageByPage(
      pdfPath: string,
      pageNumbers: number[],
      searchPattern: string,
      contextChars: number,
      searchTimeout: number,
      maxResults?: number,
      maxPagesScanned?: number
    ): Promise<{
      matches: Array<{
        page: number;
        matchCount: number;
        snippets: Array<{
          text: string;
          matchStart: number;
          matchEnd: number;
        }>;
      }>;
      errors: string[];
      pagesScanned: number;
      completed: boolean;
      stoppedReason?: 'max_results' | 'max_pages' | 'completed';
    }> {
      const results: Array<{
        page: number;
        matchCount: number;
        snippets: Array<{
          text: string;
          matchStart: number;
          matchEnd: number;
        }>;
      }> = [];
      const errors: string[] = [];
      let totalMatchCount = 0;
      let pagesScanned = 0;
      
      // Parse search pattern once
      const { regex } = parseSearchPattern(searchPattern);
      
      log('info', `Starting page-by-page search with limits: max_results=${maxResults}, max_pages=${maxPagesScanned}`);
      
      for (const pageNum of pageNumbers) {
        // Check if we should stop scanning more pages
        if (maxPagesScanned && pagesScanned >= maxPagesScanned) {
          return {
            matches: results,
            errors,
            pagesScanned,
            completed: false,
            stoppedReason: 'max_pages',
          };
        }
        
        pagesScanned++;
        
        try {
          // Extract text from single page using hybrid approach
          const pdfBuffer = await safeReadFile(pdfPath);
          const pageTexts = await extractTextHybrid(pdfBuffer, pdfPath, [pageNum]);
          const pageText = pageTexts[0];
          
          if (!pageText || pageText.trim().length === 0) {
            errors.push(`Page ${pageNum}: No text extracted`);
            continue;
          }
          
          // Search with timeout protection
          const matches = await searchWithTimeout(pageText, new RegExp(regex.source, regex.flags), searchTimeout);
          
          if (matches.length > 0) {
            const snippets = matches.map(match => {
              const context = extractContext(pageText, match.index!, match.index! + match[0].length, contextChars);
              return {
                text: context.snippet,
                matchStart: context.matchStartInSnippet,
                matchEnd: context.matchEndInSnippet,
              };
            });
            
            results.push({
              page: pageNum,
              matchCount: matches.length,
              snippets,
            });
            
            totalMatchCount += matches.length;
            
            // Check if we've reached max results
            if (maxResults && totalMatchCount >= maxResults) {
              return {
                matches: results,
                errors,
                pagesScanned,
                completed: false,
                stoppedReason: 'max_results',
              };
            }
          }
        } catch (searchError) {
          errors.push(`Page ${pageNum}: Search failed - ${searchError}`);
        }
      }
      
      return {
        matches: results,
        errors,
        pagesScanned,
        completed: true,
        stoppedReason: 'completed',
      };
    }
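
Both search helpers above build their snippets through `extractContext`, whose body is not included on this page. The sketch below shows one plausible way such a helper could work, assuming it simply takes up to `contextChars` characters on each side of the match and reports the match offsets relative to the snippet. The returned field names follow the way the helpers read the result; everything else is an assumption.

```typescript
// Hypothetical sketch of context extraction. Not the server's real code.
interface ContextSnippet {
  snippet: string;
  matchStartInSnippet: number;
  matchEndInSnippet: number;
}

function extractContext(
  text: string,
  matchStart: number,
  matchEnd: number,
  contextChars: number
): ContextSnippet {
  // Clamp the window so it never runs past either end of the page text.
  const from = Math.max(0, matchStart - contextChars);
  const to = Math.min(text.length, matchEnd + contextChars);
  return {
    snippet: text.slice(from, to),
    // Offsets are shifted so they index into the snippet, not the page.
    matchStartInSnippet: matchStart - from,
    matchEndInSnippet: matchEnd - from,
  };
}
```

A caller can recover the matched text with `snippet.slice(matchStartInSnippet, matchEndInSnippet)`, which is useful for highlighting the hit inside its context.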
Behavior 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

No annotations are provided, so the description handles transparency. It discloses early stopping behavior and regex format, and implies it is a read-only operation. It could add error handling details but is sufficient for a search tool.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Three sentences, each adding unique value: purpose, features, regex syntax. Front-loaded and no redundancy.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given 9 parameters with full schema descriptions and no output schema, the description covers the main behavior. It could briefly describe the return format, but 'matching pages with context snippets' is adequate.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters 4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema coverage is 100%, so baseline is 3. The description adds value by explaining the regex format (/pattern/flags) and early stopping parameters, enhancing understanding beyond the schema.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool is for searching text patterns (including regex) within a PDF and returning matching pages with context snippets. It distinguishes from siblings like search_multiple_pdfs (multi-PDF) and get_pdf_text (full text extraction).

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description explains when to use the tool (searching for patterns) and mentions performance features (early stopping). However, it does not explicitly state when not to use it or mention alternative tools.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/vlad-ds/pdf-agent-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.