M.I.M.I.R - Multi-agent Intelligent Memory & Insight Repository

by orneryd
DocumentParser.md (3.8 kB)
[**mimir v1.0.0**](../README.md)

***

[mimir](../README.md) / indexing/DocumentParser

# indexing/DocumentParser

## Classes

### DocumentParser

Defined in: src/indexing/DocumentParser.ts:8

#### Constructors

##### Constructor

> **new DocumentParser**(): [`DocumentParser`](#documentparser)

###### Returns

[`DocumentParser`](#documentparser)

#### Methods

##### extractText()

> **extractText**(`buffer`, `extension`): `Promise`\<`string`\>

Defined in: src/indexing/DocumentParser.ts:65

Extract plain text from PDF or DOCX files for indexing.

Parses binary document formats and extracts readable text content. Used by FileIndexer to make documents searchable and embeddable. The format is detected from the file extension, and the matching parser is used.

Supported Formats:

- **PDF**: Uses the pdf-parse library for text extraction
- **DOCX**: Uses the mammoth library for text extraction

###### Parameters

###### buffer

`Buffer`

File content as Buffer

###### extension

`string`

File extension (.pdf, .docx)

###### Returns

`Promise`\<`string`\>

Extracted plain text content

###### Throws

If the format is unsupported or extraction fails

###### Examples

```ts
// Extract text from PDF file
// (DocumentParser is defined in src/indexing/DocumentParser.ts)
import { promises as fs } from 'fs';

const parser = new DocumentParser();
const pdfBuffer = await fs.readFile('/path/to/document.pdf');
const text = await parser.extractText(pdfBuffer, '.pdf');
console.log('Extracted', text.length, 'characters');
console.log('First 100 chars:', text.substring(0, 100));
```

```ts
// Extract text from DOCX file
const docxBuffer = await fs.readFile('/path/to/document.docx');
const text = await parser.extractText(docxBuffer, '.docx');
console.log('Document text:', text);
```

```ts
// Handle extraction errors
try {
  const buffer = await fs.readFile('/path/to/doc.pdf');
  const text = await parser.extractText(buffer, '.pdf');
  if (text.length === 0) {
    console.warn('Document is empty');
  }
} catch (error) {
  // A caught value is not guaranteed to be an Error, so narrow it first
  const message = error instanceof Error ? error.message : String(error);
  if (message.includes('no extractable text')) {
    console.log('PDF is image-based or encrypted');
  } else {
    console.error('Extraction failed:', message);
  }
}
```

```ts
// Use in file indexing pipeline
// Assumes the glob package (v9+), whose glob() returns a promise;
// indexDocument stands in for your own indexing step.
import path from 'path';
import { glob } from 'glob';

const files = await glob('docs/*.{pdf,docx}');
for (const file of files) {
  const buffer = await fs.readFile(file);
  const ext = path.extname(file);
  const text = await parser.extractText(buffer, ext);
  await indexDocument(file, text);
}
```

##### isSupportedFormat()

> **isSupportedFormat**(`extension`): `boolean`

Defined in: src/indexing/DocumentParser.ts:160

Check if a file extension is supported for document parsing.

Tests whether the parser can extract text from files with the given extension. Call this before attempting extraction so that unsupported files are skipped instead of causing `extractText()` to throw.

###### Parameters

###### extension

`string`

File extension (e.g., '.pdf', '.docx')

###### Returns

`boolean`

true if format is supported, false otherwise

###### Examples

```ts
// Check before parsing
const parser = new DocumentParser();
const file = '/path/to/document.pdf';
const ext = path.extname(file);

if (parser.isSupportedFormat(ext)) {
  const buffer = await fs.readFile(file);
  const text = await parser.extractText(buffer, ext);
  console.log('Extracted:', text.length, 'chars');
} else {
  console.log('Unsupported format:', ext);
}
```

```ts
// Filter files by supported formats
const allFiles = await glob('documents/*.*');
const supportedFiles = allFiles.filter(file => {
  const ext = path.extname(file);
  return parser.isSupportedFormat(ext);
});
console.log('Can parse', supportedFiles.length, 'files');
```

```ts
// Build supported extensions list
const extensions = ['.pdf', '.docx', '.txt', '.md', '.doc'];
const supported = extensions.filter(ext => parser.isSupportedFormat(ext));
console.log('Supported:', supported.join(', '));
// Output: Supported: .pdf, .docx
```
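
The extension-based format detection described for `extractText()` and `isSupportedFormat()` can be pictured as a lookup table from extension to extractor. The sketch below is illustrative only and is not taken from Mimir's source: `SketchDocumentParser`, `stubPdfExtractor`, and `stubDocxExtractor` are hypothetical names, the stubs simply decode the buffer as UTF-8 in place of real pdf-parse or mammoth calls, and lowercasing the extension is an assumption about how case might be handled.

```typescript
import { Buffer } from 'node:buffer';

// Hypothetical sketch of extension-based dispatch; NOT Mimir's actual source.
type Extractor = (buffer: Buffer) => Promise<string>;

// Stand-in extractors. A real parser would call a PDF or DOCX library here;
// these stubs only decode the bytes so the sketch runs without dependencies.
const stubPdfExtractor: Extractor = async (buffer) => buffer.toString('utf8');
const stubDocxExtractor: Extractor = async (buffer) => buffer.toString('utf8');

class SketchDocumentParser {
  // One entry per supported extension, keyed in lowercase with a leading dot.
  private readonly extractors = new Map<string, Extractor>([
    ['.pdf', stubPdfExtractor],
    ['.docx', stubDocxExtractor],
  ]);

  // True when an extractor is registered for the extension.
  // Lowercasing is an assumption made for this sketch.
  isSupportedFormat(extension: string): boolean {
    return this.extractors.has(extension.toLowerCase());
  }

  // Looks up the extractor for the extension and runs it on the buffer.
  // The returned promise rejects when the extension is not registered.
  async extractText(buffer: Buffer, extension: string): Promise<string> {
    const extractor = this.extractors.get(extension.toLowerCase());
    if (!extractor) {
      throw new Error(`Unsupported document format: ${extension}`);
    }
    return extractor(buffer);
  }
}
```

Keeping both methods on the same table means the check and the extraction cannot disagree about which formats are supported, which is one reason a caller can rely on `isSupportedFormat()` as a guard before `extractText()`.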
