Open Search MCP

by flyanima

analyze_pdf

Search and analyze PDF documents for research purposes, extracting insights from academic papers, reports, and manuals with configurable depth and OCR support.

Instructions

Conduct comprehensive PDF research with document discovery, processing, and analysis

Input Schema

| Name | Required | Description | Default |
|---|---|---|---|
| query | Yes | Research query for PDF document search | |
| maxDocuments | No | Maximum number of documents to find (default: 10) | |
| documentType | No | Type of documents to search for | |
| includeOCR | No | Enable OCR for scanned PDFs (default: false) | |
| forceOCR | No | Force OCR processing even for good quality text - useful for testing OCR functionality (default: false) | |
| sources | No | Sources to search (arxiv, pubmed, web, all) | |
| dateRange | No | Date range filter for documents | |
| analysisDepth | No | Depth of analysis (shallow, medium, deep) | |
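
As a minimal sketch, the parameters listed above can be written as a TypeScript type. The names `AnalyzePdfInput`, `ResolvedAnalyzePdfInput`, and `withDefaults` are hypothetical and do not appear in the server's code; the default values are the ones the `pdfResearch` handler applies when it destructures its arguments (see the implementation reference).

```typescript
type DocumentType = 'academic' | 'report' | 'manual' | 'any';
type Depth = 'shallow' | 'medium' | 'deep';

// Hypothetical type mirroring the input schema; only `query` is required.
interface AnalyzePdfInput {
  query: string;
  maxDocuments?: number;
  documentType?: DocumentType;
  includeOCR?: boolean;
  forceOCR?: boolean;
  sources?: string[];
  dateRange?: { start?: string; end?: string }; // dates as YYYY-MM-DD strings
  analysisDepth?: Depth;
}

// The same input after defaults are filled in; dateRange has no default.
interface ResolvedAnalyzePdfInput {
  query: string;
  maxDocuments: number;
  documentType: DocumentType;
  includeOCR: boolean;
  forceOCR: boolean;
  sources: string[];
  dateRange: { start?: string; end?: string } | undefined;
  analysisDepth: Depth;
}

// Hypothetical helper that applies the defaults used by the handler's destructuring.
function withDefaults(input: AnalyzePdfInput): ResolvedAnalyzePdfInput {
  const {
    query,
    maxDocuments = 10,
    documentType = 'any',
    includeOCR = false,
    forceOCR = false,
    sources = ['all'],
    dateRange,
    analysisDepth = 'medium',
  } = input;
  return { query, maxDocuments, documentType, includeOCR, forceOCR, sources, dateRange, analysisDepth };
}
```

Calling `withDefaults({ query: 'graph neural networks' })` would yield the medium-depth, no-OCR, all-sources configuration, which is what a call that passes only `query` ends up using.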

Implementation Reference

  • The pdfResearch function is the main handler for the 'analyze_pdf' tool. It searches for PDF documents matching the query, processes the top results (with optional OCR), and generates per-document summaries and overall insights according to the analysis depth.
    async function pdfResearch(args: ToolInput): Promise<ToolOutput> {
      const {
        query,
        maxDocuments = 10,
        documentType = 'any',
        includeOCR = false,
        forceOCR = false,
        sources = ['all'],
        dateRange,
        analysisDepth = 'medium'
      } = args;
      
      try {
        logger.info(`Starting PDF research for: ${query}`);
        
        if (!query || typeof query !== 'string') {
          throw new Error('Query parameter is required and must be a string');
        }
        
        const pdfProcessor = new PDFProcessor();
        
        // Configure search options
        const searchOptions: PDFSearchOptions = {
          query,
          maxDocuments,
          documentType: documentType as any,
          includeOCR,
          forceOCR,
          sources: Array.isArray(sources) ? sources : [sources],
          dateRange
        };
        
        // Search for relevant PDFs
        const pdfDocuments = await pdfProcessor.searchPDFs(searchOptions);
        
        if (pdfDocuments.length === 0) {
          return {
            success: true,
            data: {
              query,
              documents: [],
              totalFound: 0,
              message: 'No PDF documents found for the given query',
              searchedAt: new Date().toISOString()
            },
            metadata: {
              sources: ['pdf-research'],
              cached: false
            }
          };
        }
        
        // Process PDFs based on analysis depth
        const processedDocuments = [];
        const maxToProcess = analysisDepth === 'shallow' ? Math.min(3, pdfDocuments.length) :
                           analysisDepth === 'medium' ? Math.min(5, pdfDocuments.length) :
                           Math.min(10, pdfDocuments.length);
        
        for (let i = 0; i < maxToProcess; i++) {
          const pdfDoc = pdfDocuments[i];
          logger.info(`Processing PDF ${i + 1}/${maxToProcess}: ${pdfDoc.title}`);
          
          try {
            const processedPDF = await pdfProcessor.processPDF(pdfDoc, includeOCR, forceOCR);
            
            if (processedPDF) {
              // Create summary based on analysis depth
              const summary = createPDFSummary(processedPDF, analysisDepth as string);
              
              processedDocuments.push({
                id: processedPDF.id,
                title: processedPDF.title,
                url: processedPDF.url,
                source: processedPDF.source,
                summary,
                metadata: {
                  pages: processedPDF.content.pages,
                  author: processedPDF.metadata.author,
                  creationDate: processedPDF.metadata.creationDate,
                  processingMethod: processedPDF.processing.method,
                  ocrConfidence: processedPDF.processing.ocrConfidence
                },
                structure: {
                  sectionsCount: processedPDF.structure.sections.length,
                  referencesCount: processedPDF.structure.references.length,
                  figuresCount: processedPDF.structure.figures.length,
                  tablesCount: processedPDF.structure.tables.length
                },
                relevanceScore: pdfDoc.relevanceScore
              });
            }
          } catch (error) {
            logger.warn(`Failed to process PDF: ${pdfDoc.title}`, error);
            
            // Add basic info even if processing failed
            processedDocuments.push({
              id: pdfDoc.id,
              title: pdfDoc.title,
              url: pdfDoc.url,
              source: pdfDoc.source,
              summary: 'PDF processing failed - document available for manual review',
              metadata: {
                processingError: true
              },
              relevanceScore: pdfDoc.relevanceScore
            });
          }
        }
        
        // Generate research insights
        const insights = generateResearchInsights(processedDocuments, query);
        
        const result: ToolOutput = {
          success: true,
          data: {
            query,
            documents: processedDocuments,
            totalFound: pdfDocuments.length,
            totalProcessed: processedDocuments.length,
            insights,
            searchOptions: {
              documentType,
              includeOCR,
              forceOCR,
              sources: searchOptions.sources,
              analysisDepth
            },
            searchedAt: new Date().toISOString()
          },
          metadata: {
            sources: ['pdf-research'],
            cached: false
          }
        };
    
        logger.info(`PDF research completed: ${processedDocuments.length} documents processed for ${query}`);
        return result;
    
      } catch (error) {
        logger.error(`Failed PDF research for ${query}:`, error);
        
        return {
          success: false,
          error: `Failed to conduct PDF research: ${error instanceof Error ? error.message : 'Unknown error'}`,
          data: null,
          metadata: {
            sources: ['pdf-research'],
            cached: false
          }
        };
      }
    }
  • The registerPDFResearchTools function uses createTool to build the 'analyze_pdf' tool, sets its metadata and input schema, and registers it with the tool registry. The same function also creates and registers the 'pdf_discovery' tool.
    export function registerPDFResearchTools(registry: ToolRegistry): void {
      logger.info('Registering PDF research tools...');
    
      // PDF Research tool
      const pdfResearchTool = createTool(
        'analyze_pdf',
        'Conduct comprehensive PDF research with document discovery, processing, and analysis',
        'pdf',
        'pdf-research',
        pdfResearch,
        {
          cacheTTL: 3600, // 1 hour cache
          rateLimit: 10,  // 10 requests per minute
          requiredParams: ['query'],
          optionalParams: ['maxDocuments', 'documentType', 'includeOCR', 'forceOCR', 'sources', 'dateRange', 'analysisDepth']
        }
      );
    
      pdfResearchTool.inputSchema = {
        type: 'object',
        properties: {
          query: {
            type: 'string',
            description: 'Research query for PDF document search'
          },
          maxDocuments: {
            type: 'number',
            description: 'Maximum number of documents to find (default: 10)'
          },
          documentType: {
            type: 'string',
            description: 'Type of documents to search for',
            enum: ['academic', 'report', 'manual', 'any']
          },
          includeOCR: {
            type: 'boolean',
            description: 'Enable OCR for scanned PDFs (default: false)'
          },
          forceOCR: {
            type: 'boolean',
            description: 'Force OCR processing even for good quality text - useful for testing OCR functionality (default: false)'
          },
          sources: {
            type: 'array',
            items: { type: 'string' },
            description: 'Sources to search (arxiv, pubmed, web, all)'
          },
          dateRange: {
            type: 'object',
            properties: {
              start: { type: 'string', description: 'Start date (YYYY-MM-DD)' },
              end: { type: 'string', description: 'End date (YYYY-MM-DD)' }
            },
            description: 'Date range filter for documents'
          },
          analysisDepth: {
            type: 'string',
            description: 'Depth of analysis (shallow, medium, deep)',
            enum: ['shallow', 'medium', 'deep']
          }
        },
        required: ['query']
      };
    
      registry.registerTool(pdfResearchTool);
    
      // PDF Discovery tool
      const pdfDiscoveryTool = createTool(
        'pdf_discovery',
        'Discover PDF documents without full processing - fast PDF search and listing',
        'pdf',
        'pdf-discovery',
        pdfDiscovery,
        {
          cacheTTL: 1800, // 30 minutes cache
          rateLimit: 15,  // 15 requests per minute
          requiredParams: ['query'],
          optionalParams: ['maxResults', 'sources', 'documentType']
        }
      );
    
      pdfDiscoveryTool.inputSchema = {
        type: 'object',
        properties: {
          query: {
            type: 'string',
            description: 'Search query for PDF document discovery'
          },
          maxResults: {
            type: 'number',
            description: 'Maximum number of results to return (default: 20)'
          },
          sources: {
            type: 'array',
            items: { type: 'string' },
            description: 'Sources to search (arxiv, pubmed, web, all)'
          },
          documentType: {
            type: 'string',
            description: 'Type of documents to search for',
            enum: ['academic', 'report', 'manual', 'any']
          }
        },
        required: ['query']
      };
    
      registry.registerTool(pdfDiscoveryTool);
    
      logger.info('PDF research tools registered successfully');
    }
  • Input schema definition for the 'analyze_pdf' tool, specifying parameters like query (required), maxDocuments, analysisDepth, etc.
    pdfResearchTool.inputSchema = {
      type: 'object',
      properties: {
        query: {
          type: 'string',
          description: 'Research query for PDF document search'
        },
        maxDocuments: {
          type: 'number',
          description: 'Maximum number of documents to find (default: 10)'
        },
        documentType: {
          type: 'string',
          description: 'Type of documents to search for',
          enum: ['academic', 'report', 'manual', 'any']
        },
        includeOCR: {
          type: 'boolean',
          description: 'Enable OCR for scanned PDFs (default: false)'
        },
        forceOCR: {
          type: 'boolean',
          description: 'Force OCR processing even for good quality text - useful for testing OCR functionality (default: false)'
        },
        sources: {
          type: 'array',
          items: { type: 'string' },
          description: 'Sources to search (arxiv, pubmed, web, all)'
        },
        dateRange: {
          type: 'object',
          properties: {
            start: { type: 'string', description: 'Start date (YYYY-MM-DD)' },
            end: { type: 'string', description: 'End date (YYYY-MM-DD)' }
          },
          description: 'Date range filter for documents'
        },
        analysisDepth: {
          type: 'string',
          description: 'Depth of analysis (shallow, medium, deep)',
          enum: ['shallow', 'medium', 'deep']
        }
      },
      required: ['query']
    };
  • src/index.ts:251-252 (registration)
    Call to registerPDFResearchTools in the main server initialization, which registers the 'analyze_pdf' tool among others.
    registerPDFResearchTools(this.toolRegistry);        // 2 tools: analyze_pdf, pdf_discovery
  • Additional validation schema 'pdfAnalysis' in the input validator, associated with 'analyze_pdf'. Note that it validates a file-path-based input (filePath, pageRange), whereas the tool itself primarily takes a query-based input.
    // PDF analysis
    pdfAnalysis: z.object({
      filePath: CommonSchemas.filePath,
      extractText: z.boolean().optional().default(true),
      extractMetadata: z.boolean().optional().default(true),
      pageRange: z.object({
        start: CommonSchemas.positiveInteger,
        end: CommonSchemas.positiveInteger,
      }).optional(),
    }),
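
The handler's depth-based cap on how many documents are processed can be sketched on its own. The helper name `maxToProcess` is hypothetical; the caps (3, 5, 10) and the fallback to the largest cap for any other value are taken from the `pdfResearch` snippet above.

```typescript
// Caps for the two named depths; any other value (including 'deep') falls back to 10,
// matching the chained conditional in the handler.
const DEPTH_CAPS = new Map<string, number>([
  ['shallow', 3],
  ['medium', 5],
]);

// Hypothetical helper: how many of the found documents would be processed.
function maxToProcess(analysisDepth: string, foundCount: number): number {
  const cap = DEPTH_CAPS.get(analysisDepth) ?? 10;
  return Math.min(cap, foundCount);
}
```

One consequence worth noting: the cap depends only on `analysisDepth` and the number of documents found, not directly on `maxDocuments`. With the default depth of 'medium', at most five documents are processed even if the search returns ten, so `totalProcessed` in the result can be smaller than `totalFound`.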
Behavior: 2/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

No annotations are provided, so the description carries the full burden of behavioral disclosure. It mentions 'comprehensive PDF research' but lacks critical details: whether this is a read-only or mutating operation, expected runtime, rate limits, authentication needs, or what 'analysis' entails. For a complex tool with 8 parameters and no output schema, this is a significant gap.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is a single, efficient sentence that front-loads the core purpose. It avoids redundancy and wastes no words, though it could be more structured (e.g., separating discovery, processing, and analysis into bullet points) given the tool's complexity.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 2/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's complexity (8 parameters, no output schema, no annotations), the description is incomplete. It doesn't explain what 'analysis' outputs, how results are returned, error conditions, or performance expectations. For a tool that presumably returns rich data, this leaves the agent with insufficient context to use it effectively.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
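
Although the tool publishes no output schema, the handler in the implementation reference suggests what a caller can expect. The following is a rough sketch of that shape, not an official schema: the type and function names (`AnalyzePdfOutput`, `describeOutput`, and so on) are hypothetical, and the field types are assumptions inferred from how the handler builds its result.

```typescript
// Assumed shape of one entry in data.documents; field types are guesses.
interface AnalyzePdfDocument {
  id: string;
  title: string;
  url: string;
  source: string;
  summary: unknown; // a plain string when processing failed; otherwise whatever createPDFSummary returns
  metadata: Record<string, unknown>; // contains processingError: true when processing failed
  structure?: {
    sectionsCount: number;
    referencesCount: number;
    figuresCount: number;
    tablesCount: number;
  };
  relevanceScore?: number;
}

interface AnalyzePdfSuccess {
  success: true;
  data: {
    query: string;
    documents: AnalyzePdfDocument[];
    totalFound: number;
    totalProcessed?: number; // absent when nothing was found
    insights?: unknown;
    message?: string; // present when nothing was found
    searchedAt: string; // ISO timestamp
  };
  metadata: { sources: string[]; cached: boolean };
}

interface AnalyzePdfFailure {
  success: false;
  error: string;
  data: null;
  metadata: { sources: string[]; cached: boolean };
}

type AnalyzePdfOutput = AnalyzePdfSuccess | AnalyzePdfFailure;

// Hypothetical helper that distinguishes the three outcomes the handler can return.
function describeOutput(out: AnalyzePdfOutput): string {
  if (!out.success) {
    return `failed: ${out.error}`;
  }
  if (out.data.documents.length === 0) {
    return out.data.message ?? 'no documents';
  }
  const failed = out.data.documents.filter(d => d.metadata.processingError === true).length;
  return `${out.data.documents.length} of ${out.data.totalFound} documents returned, ${failed} with processing errors`;
}
```

Note that a search with no hits is still reported with `success: true`, and that a document whose processing failed is still listed, with a placeholder summary and `processingError: true` in its metadata.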

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 100%, so the schema fully documents all 8 parameters. The description adds no specific parameter information beyond implying a research context, which the schema already covers with parameter descriptions like 'Research query for PDF document search'. Baseline 3 is appropriate when the schema does the heavy lifting.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 4/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose with specific verbs ('conduct comprehensive PDF research') and resources ('PDF'), covering discovery, processing, and analysis. It distinguishes itself from simpler sibling tools like 'pdf_discovery' by implying broader functionality, though it doesn't explicitly differentiate from all research-related siblings like 'deep_research' or 'intelligent_research'.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 2/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides no guidance on when to use this tool versus alternatives. With many sibling tools for research (e.g., 'deep_research', 'intelligent_research', 'pdf_discovery'), there's no indication of this tool's specific niche, prerequisites, or exclusions, leaving the agent to guess based on the name alone.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/flyanima/open-search-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.