Open Search MCP

by flyanima

analyze_pdf

Search and analyze PDF documents for research purposes, extracting insights from academic papers, reports, and manuals with configurable depth and OCR support.

Instructions

Conduct comprehensive PDF research with document discovery, processing, and analysis

Input Schema

| Name | Required | Description | Default |
|---|---|---|---|
| query | Yes | Research query for PDF document search | |
| maxDocuments | No | Maximum number of documents to find | 10 |
| documentType | No | Type of documents to search for (academic, report, manual, any) | any |
| includeOCR | No | Enable OCR for scanned PDFs | false |
| forceOCR | No | Force OCR processing even for good quality text; useful for testing OCR functionality | false |
| sources | No | Sources to search (arxiv, pubmed, web, all) | ["all"] |
| dateRange | No | Date range filter for documents, with start and end dates in YYYY-MM-DD format | |
| analysisDepth | No | Depth of analysis (shallow, medium, deep) | medium |
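
To make the schema concrete, here is a minimal TypeScript sketch that models the arguments as a type, builds one example payload, and runs a light sanity check on it. The names `AnalyzePdfArgs`, `checkArgs`, and `exampleArgs` are hypothetical and do not come from the server source; the field names, enum values, and date format are taken from the input schema, and the query text is made up for illustration.

```typescript
// Hypothetical type mirroring the analyze_pdf input schema (not from the server source).
type AnalyzePdfArgs = {
  query: string;
  maxDocuments?: number;
  documentType?: "academic" | "report" | "manual" | "any";
  includeOCR?: boolean;
  forceOCR?: boolean;
  sources?: string[];
  dateRange?: { start?: string; end?: string };
  analysisDepth?: "shallow" | "medium" | "deep";
};

// Enum values as listed in the tool's input schema.
const DOCUMENT_TYPES = ["academic", "report", "manual", "any"];
const ANALYSIS_DEPTHS = ["shallow", "medium", "deep"];

// Illustrative check: returns a list of problems, empty when the args look valid.
function checkArgs(args: { [key: string]: unknown }): string[] {
  const problems: string[] = [];

  const query = args.query;
  if (typeof query !== "string" || query.length === 0) {
    problems.push("query is required and must be a non-empty string");
  }

  const documentType = args.documentType;
  if (
    documentType !== undefined &&
    (typeof documentType !== "string" || DOCUMENT_TYPES.indexOf(documentType) === -1)
  ) {
    problems.push("documentType must be one of: " + DOCUMENT_TYPES.join(", "));
  }

  const analysisDepth = args.analysisDepth;
  if (
    analysisDepth !== undefined &&
    (typeof analysisDepth !== "string" || ANALYSIS_DEPTHS.indexOf(analysisDepth) === -1)
  ) {
    problems.push("analysisDepth must be one of: " + ANALYSIS_DEPTHS.join(", "));
  }

  return problems;
}

// Example payload; every value here is illustrative.
const exampleArgs: AnalyzePdfArgs = {
  query: "transformer attention mechanisms",
  maxDocuments: 5,
  documentType: "academic",
  includeOCR: true,
  sources: ["arxiv"],
  dateRange: { start: "2023-01-01", end: "2023-12-31" },
  analysisDepth: "deep",
};

const exampleProblems = checkArgs(exampleArgs);
```

Only `query` is required; every other field can be omitted, in which case the handler falls back to its own defaults.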

Implementation Reference

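  • The pdfResearch function is the main handler for the 'analyze_pdf' tool. It searches for PDF documents matching the query, processes the found PDFs with optional OCR, and generates per-document summaries and overall research insights. The analysis depth controls how many documents are processed (up to 3 for shallow, 5 for medium, 10 for deep) and is passed to the summary step.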
    async function pdfResearch(args: ToolInput): Promise<ToolOutput> {
      const {
        query,
        maxDocuments = 10,
        documentType = 'any',
        includeOCR = false,
        forceOCR = false,
        sources = ['all'],
        dateRange,
        analysisDepth = 'medium'
      } = args;

      try {
        logger.info(`Starting PDF research for: ${query}`);

        if (!query || typeof query !== 'string') {
          throw new Error('Query parameter is required and must be a string');
        }

        const pdfProcessor = new PDFProcessor();

        // Configure search options
        const searchOptions: PDFSearchOptions = {
          query,
          maxDocuments,
          documentType: documentType as any,
          includeOCR,
          forceOCR,
          sources: Array.isArray(sources) ? sources : [sources],
          dateRange
        };

        // Search for relevant PDFs
        const pdfDocuments = await pdfProcessor.searchPDFs(searchOptions);

        if (pdfDocuments.length === 0) {
          return {
            success: true,
            data: {
              query,
              documents: [],
              totalFound: 0,
              message: 'No PDF documents found for the given query',
              searchedAt: new Date().toISOString()
            },
            metadata: { sources: ['pdf-research'], cached: false }
          };
        }

        // Process PDFs based on analysis depth
        const processedDocuments = [];
        const maxToProcess = analysisDepth === 'shallow' ? Math.min(3, pdfDocuments.length) :
                             analysisDepth === 'medium' ? Math.min(5, pdfDocuments.length) :
                             Math.min(10, pdfDocuments.length);

        for (let i = 0; i < maxToProcess; i++) {
          const pdfDoc = pdfDocuments[i];
          logger.info(`Processing PDF ${i + 1}/${maxToProcess}: ${pdfDoc.title}`);

          try {
            const processedPDF = await pdfProcessor.processPDF(pdfDoc, includeOCR, forceOCR);

            if (processedPDF) {
              // Create summary based on analysis depth
              const summary = createPDFSummary(processedPDF, analysisDepth as string);

              processedDocuments.push({
                id: processedPDF.id,
                title: processedPDF.title,
                url: processedPDF.url,
                source: processedPDF.source,
                summary,
                metadata: {
                  pages: processedPDF.content.pages,
                  author: processedPDF.metadata.author,
                  creationDate: processedPDF.metadata.creationDate,
                  processingMethod: processedPDF.processing.method,
                  ocrConfidence: processedPDF.processing.ocrConfidence
                },
                structure: {
                  sectionsCount: processedPDF.structure.sections.length,
                  referencesCount: processedPDF.structure.references.length,
                  figuresCount: processedPDF.structure.figures.length,
                  tablesCount: processedPDF.structure.tables.length
                },
                relevanceScore: pdfDoc.relevanceScore
              });
            }
          } catch (error) {
            logger.warn(`Failed to process PDF: ${pdfDoc.title}`, error);

            // Add basic info even if processing failed
            processedDocuments.push({
              id: pdfDoc.id,
              title: pdfDoc.title,
              url: pdfDoc.url,
              source: pdfDoc.source,
              summary: 'PDF processing failed - document available for manual review',
              metadata: { processingError: true },
              relevanceScore: pdfDoc.relevanceScore
            });
          }
        }

        // Generate research insights
        const insights = generateResearchInsights(processedDocuments, query);

        const result: ToolOutput = {
          success: true,
          data: {
            query,
            documents: processedDocuments,
            totalFound: pdfDocuments.length,
            totalProcessed: processedDocuments.length,
            insights,
            searchOptions: {
              documentType,
              includeOCR,
              forceOCR,
              sources: searchOptions.sources,
              analysisDepth
            },
            searchedAt: new Date().toISOString()
          },
          metadata: { sources: ['pdf-research'], cached: false }
        };

        logger.info(`PDF research completed: ${processedDocuments.length} documents processed for ${query}`);
        return result;
      } catch (error) {
        logger.error(`Failed PDF research for ${query}:`, error);
        return {
          success: false,
          error: `Failed to conduct PDF research: ${error instanceof Error ? error.message : 'Unknown error'}`,
          data: null,
          metadata: { sources: ['pdf-research'], cached: false }
        };
      }
    }
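
The relationship between `maxDocuments` and `analysisDepth` is easy to miss in the handler: `maxDocuments` is passed to the search step, while `analysisDepth` caps how many of the found documents are actually processed. The sketch below isolates that cap. The helper name `maxToProcess` is hypothetical (the handler computes the value inline); the caps of 3, 5, and 10 are taken from the handler code.

```typescript
// Hypothetical helper that isolates the handler's inline cap calculation.
// Any depth other than 'shallow' or 'medium' falls through to the 'deep' cap of 10,
// which matches the ternary chain in the handler.
function maxToProcess(analysisDepth: string, foundCount: number): number {
  const cap = analysisDepth === "shallow" ? 3 : analysisDepth === "medium" ? 5 : 10;
  return Math.min(cap, foundCount);
}

// With 10 documents found, a shallow run processes 3 of them.
const shallowCount = maxToProcess("shallow", 10);
// With only 2 documents found, the cap has no effect.
const smallCount = maxToProcess("medium", 2);
// A deep run never processes more than 10, however many were found.
const deepCount = maxToProcess("deep", 25);
```

This is why the result reports `totalFound` and `totalProcessed` separately: a search can find more documents than the chosen depth allows the handler to process.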
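  • The registerPDFResearchTools function creates the 'analyze_pdf' tool with createTool, sets its metadata (cache TTL, rate limit, required and optional parameters) and input schema, and registers it with the tool registry. The same function also creates and registers the 'pdf_discovery' tool.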
    export function registerPDFResearchTools(registry: ToolRegistry): void {
      logger.info('Registering PDF research tools...');

      // PDF Research tool
      const pdfResearchTool = createTool(
        'analyze_pdf',
        'Conduct comprehensive PDF research with document discovery, processing, and analysis',
        'pdf',
        'pdf-research',
        pdfResearch,
        {
          cacheTTL: 3600, // 1 hour cache
          rateLimit: 10, // 10 requests per minute
          requiredParams: ['query'],
          optionalParams: ['maxDocuments', 'documentType', 'includeOCR', 'forceOCR', 'sources', 'dateRange', 'analysisDepth']
        }
      );

      pdfResearchTool.inputSchema = {
        type: 'object',
        properties: {
          query: {
            type: 'string',
            description: 'Research query for PDF document search'
          },
          maxDocuments: {
            type: 'number',
            description: 'Maximum number of documents to find (default: 10)'
          },
          documentType: {
            type: 'string',
            description: 'Type of documents to search for',
            enum: ['academic', 'report', 'manual', 'any']
          },
          includeOCR: {
            type: 'boolean',
            description: 'Enable OCR for scanned PDFs (default: false)'
          },
          forceOCR: {
            type: 'boolean',
            description: 'Force OCR processing even for good quality text - useful for testing OCR functionality (default: false)'
          },
          sources: {
            type: 'array',
            items: { type: 'string' },
            description: 'Sources to search (arxiv, pubmed, web, all)'
          },
          dateRange: {
            type: 'object',
            properties: {
              start: { type: 'string', description: 'Start date (YYYY-MM-DD)' },
              end: { type: 'string', description: 'End date (YYYY-MM-DD)' }
            },
            description: 'Date range filter for documents'
          },
          analysisDepth: {
            type: 'string',
            description: 'Depth of analysis (shallow, medium, deep)',
            enum: ['shallow', 'medium', 'deep']
          }
        },
        required: ['query']
      };

      registry.registerTool(pdfResearchTool);

      // PDF Discovery tool
      const pdfDiscoveryTool = createTool(
        'pdf_discovery',
        'Discover PDF documents without full processing - fast PDF search and listing',
        'pdf',
        'pdf-discovery',
        pdfDiscovery,
        {
          cacheTTL: 1800, // 30 minutes cache
          rateLimit: 15, // 15 requests per minute
          requiredParams: ['query'],
          optionalParams: ['maxResults', 'sources', 'documentType']
        }
      );

      pdfDiscoveryTool.inputSchema = {
        type: 'object',
        properties: {
          query: {
            type: 'string',
            description: 'Search query for PDF document discovery'
          },
          maxResults: {
            type: 'number',
            description: 'Maximum number of results to return (default: 20)'
          },
          sources: {
            type: 'array',
            items: { type: 'string' },
            description: 'Sources to search (arxiv, pubmed, web, all)'
          },
          documentType: {
            type: 'string',
            description: 'Type of documents to search for',
            enum: ['academic', 'report', 'manual', 'any']
          }
        },
        required: ['query']
      };

      registry.registerTool(pdfDiscoveryTool);

      logger.info('PDF research tools registered successfully');
    }
  • Input schema definition for the 'analyze_pdf' tool, specifying parameters like query (required), maxDocuments, analysisDepth, etc.
    pdfResearchTool.inputSchema = {
      type: 'object',
      properties: {
        query: {
          type: 'string',
          description: 'Research query for PDF document search'
        },
        maxDocuments: {
          type: 'number',
          description: 'Maximum number of documents to find (default: 10)'
        },
        documentType: {
          type: 'string',
          description: 'Type of documents to search for',
          enum: ['academic', 'report', 'manual', 'any']
        },
        includeOCR: {
          type: 'boolean',
          description: 'Enable OCR for scanned PDFs (default: false)'
        },
        forceOCR: {
          type: 'boolean',
          description: 'Force OCR processing even for good quality text - useful for testing OCR functionality (default: false)'
        },
        sources: {
          type: 'array',
          items: { type: 'string' },
          description: 'Sources to search (arxiv, pubmed, web, all)'
        },
        dateRange: {
          type: 'object',
          properties: {
            start: { type: 'string', description: 'Start date (YYYY-MM-DD)' },
            end: { type: 'string', description: 'End date (YYYY-MM-DD)' }
          },
          description: 'Date range filter for documents'
        },
        analysisDepth: {
          type: 'string',
          description: 'Depth of analysis (shallow, medium, deep)',
          enum: ['shallow', 'medium', 'deep']
        }
      },
      required: ['query']
    };
  • src/index.ts:251-252 (registration)
    Call to registerPDFResearchTools in the main server initialization, which registers the 'analyze_pdf' tool among others.
    registerPDFResearchTools(this.toolRegistry); // 1 tool: analyze_pdf
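  • An additional validation schema, 'pdfAnalysis', defined in the input validator for 'analyze_pdf'. It describes file-based input (filePath, extractText, extractMetadata, and an optional pageRange), whereas the tool's handler primarily takes query-based input.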
    // PDF analysis
    pdfAnalysis: z.object({
      filePath: CommonSchemas.filePath,
      extractText: z.boolean().optional().default(true),
      extractMetadata: z.boolean().optional().default(true),
      pageRange: z.object({
        start: CommonSchemas.positiveInteger,
        end: CommonSchemas.positiveInteger,
      }).optional(),
    }),
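
To show what the 'pdfAnalysis' schema accepts and which defaults it applies, here is a dependency-free TypeScript sketch of roughly equivalent checks. It is not the project's validator. The names `PdfAnalysisInput`, `parsePdfAnalysis`, `readOptionalBoolean`, and `isPositiveInteger` are hypothetical, and because the definitions of `CommonSchemas.filePath` and `CommonSchemas.positiveInteger` are not shown above, the sketch assumes they mean "non-empty string" and "integer greater than zero".

```typescript
// Hypothetical result shape after defaults are applied.
interface PdfAnalysisInput {
  filePath: string;
  extractText: boolean;
  extractMetadata: boolean;
  pageRange?: { start: number; end: number };
}

// Assumption: CommonSchemas.positiveInteger means an integer greater than zero.
function isPositiveInteger(value: unknown): value is number {
  return typeof value === "number" && Number.isInteger(value) && value > 0;
}

// Mirrors z.boolean().optional().default(true): undefined becomes true,
// booleans pass through, anything else is rejected.
function readOptionalBoolean(value: unknown, name: string): boolean {
  if (value === undefined) return true;
  if (typeof value !== "boolean") throw new Error(name + " must be a boolean");
  return value;
}

function parsePdfAnalysis(raw: { [key: string]: unknown }): PdfAnalysisInput {
  // Assumption: CommonSchemas.filePath requires a non-empty string.
  const filePath = raw.filePath;
  if (typeof filePath !== "string" || filePath.length === 0) {
    throw new Error("filePath must be a non-empty string");
  }

  const result: PdfAnalysisInput = {
    filePath,
    extractText: readOptionalBoolean(raw.extractText, "extractText"),
    extractMetadata: readOptionalBoolean(raw.extractMetadata, "extractMetadata"),
  };

  const pageRange = raw.pageRange;
  if (pageRange !== undefined) {
    if (typeof pageRange !== "object" || pageRange === null) {
      throw new Error("pageRange must be an object");
    }
    const range = pageRange as { start?: unknown; end?: unknown };
    const start = range.start;
    const end = range.end;
    if (!isPositiveInteger(start) || !isPositiveInteger(end)) {
      throw new Error("pageRange.start and pageRange.end must be positive integers");
    }
    result.pageRange = { start, end };
  }

  return result;
}

// Only filePath is supplied, so both extraction flags default to true.
const parsed = parsePdfAnalysis({ filePath: "paper.pdf" });
```

Like the schema shown above, this sketch does not check that `start` is less than or equal to `end`; a caller that needs that guarantee would have to add it.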
