CompanyIQ MCP Server


# 🔍 OCR Solutions for Image-Based PDFs

## 📋 Available Libraries & Solutions

### 1. **Tesseract.js** (Pure JavaScript)

```bash
npm install tesseract.js
```

- ✅ No system dependencies
- ✅ Works in browser and Node.js
- ⚠️ Slower than native Tesseract
- 📦 Downloads language models automatically

### 2. **Google Cloud Vision API** (Best Accuracy)

```javascript
// Requires a Google Cloud account
const vision = require('@google-cloud/vision');
const client = new vision.ImageAnnotatorClient();

// documentTextDetection is meant for images, so render the PDF page to an
// image first (PDF input goes through the file-annotation methods instead)
const [result] = await client.documentTextDetection('page-1.png');
const fullText = result.fullTextAnnotation.text;
```

- ✅ Excellent accuracy
- ✅ Handles complex layouts
- 💰 Paid service ($1.50 per 1000 pages)
- 🔑 Requires API key

### 3. **Azure Computer Vision** (Microsoft)

```javascript
const ComputerVisionClient = require('@azure/cognitiveservices-computervision').ComputerVisionClient;

// credentials and endpoint come from your Azure resource
const client = new ComputerVisionClient(credentials, endpoint);

// read() only starts the operation; poll client.getReadResult(operationId) for the text
const result = await client.read(pdfUrl);
```

- ✅ Good accuracy
- ✅ Handles tables well
- 💰 Paid service
- 🔑 Requires Azure account

### 4. **AWS Textract** (Best for Financial Documents)

```javascript
const AWS = require('aws-sdk');
const textract = new AWS.Textract();

const params = {
  Document: { Bytes: pdfBuffer },
  FeatureTypes: ['TABLES', 'FORMS'] // required by analyzeDocument
};
const result = await textract.analyzeDocument(params).promise();
```

- ✅ **Specialized for financial documents**
- ✅ Extracts tables, forms, key-value pairs
- ✅ High accuracy for invoices/statements
- 💰 Paid service ($1.50 per 1000 pages for plain text detection; table and form analysis is priced higher)
- 🔑 Requires AWS account

### 5. **OCR.space API** (Free Tier Available)

```javascript
const formData = new FormData();
// Wrap the buffer in a Blob so it is uploaded as a file rather than coerced
// to a string ('document.pdf' is a placeholder filename)
formData.append('file', new Blob([pdfBuffer], { type: 'application/pdf' }), 'document.pdf');
formData.append('language', 'nor');
formData.append('isTable', 'true');

const response = await fetch('https://api.ocr.space/parse/image', {
  method: 'POST',
  headers: { 'apikey': 'YOUR_API_KEY' },
  body: formData
});
```

- ✅ Free tier (500 calls/month)
- ✅ No installation required
- ⚠️ File size limits (1MB free, 5MB paid)

---

## 🛠️ Installation Requirements

### For Tesseract.js (Recommended for Free Solution)

```bash
# No system dependencies needed!
npm install tesseract.js

# Optional: Install canvas for better PDF handling
npm install canvas pdfjs-dist
```

### For pdf2pic (Requires ImageMagick)

```bash
# macOS
brew install imagemagick ghostscript

# Then install npm packages
npm install pdf2pic
```

---

## 💡 Working Solution Without System Dependencies

```javascript
import Tesseract from 'tesseract.js';
// In Node.js, pdfjs-dist usually has to be imported from its legacy build;
// the exact path depends on the installed version
import { getDocument } from 'pdfjs-dist';
import { createCanvas } from 'canvas';

async function extractFromImagePDF(pdfPath) {
  // Load PDF
  const pdf = await getDocument(pdfPath).promise;
  const page = await pdf.getPage(1); // First page only

  // Render to canvas (scale 2.0 doubles the resolution, which helps OCR)
  const viewport = page.getViewport({ scale: 2.0 });
  const canvas = createCanvas(viewport.width, viewport.height);
  const context = canvas.getContext('2d');
  await page.render({ canvasContext: context, viewport }).promise;

  // Convert to image buffer
  const imageBuffer = canvas.toBuffer('image/png');

  // Run OCR
  const { data: { text } } = await Tesseract.recognize(
    imageBuffer,
    'nor+eng', // Norwegian + English
    { logger: m => console.log(m.progress * 100 + '%') }
  );

  return text;
}
```

---

## 🎯 Best Approach for Norwegian Financial PDFs

### **Recommended: AWS Textract or Google Vision**

For production use with Norwegian financial documents:

```javascript
// AWS Textract - Best for financial documents
const AWS = require('aws-sdk');

AWS.config.update({
  accessKeyId: process.env.AWS_ACCESS_KEY,
  secretAccessKey: process.env.AWS_SECRET_KEY,
  region: 'eu-north-1' // Stockholm region; confirm Textract is offered there
});

const textract = new AWS.Textract();

async function extractFinancialData(pdfBuffer) {
  const params = {
    Document: { Bytes: pdfBuffer },
    FeatureTypes: ['TABLES', 'FORMS'] // Extract tables and key-value pairs
  };

  const result = await textract.analyzeDocument(params).promise();

  // Parse the structured data
  const financialData = {
    revenue: null,
    profit: null,
    assets: null,
    equity: null
  };

  // Textract returns structured key-value pairs. KEY and VALUE blocks share
  // the KEY_VALUE_SET type, so start from the KEY side only.
  // getKeyText, getValueText and parseNorwegianNumber are helpers not shown here.
  result.Blocks.forEach(block => {
    if (block.BlockType === 'KEY_VALUE_SET' && block.EntityTypes.includes('KEY')) {
      const key = getKeyText(block, result.Blocks);
      const value = getValueText(block, result.Blocks);

      if (key.includes('Driftsinntekter')) {
        financialData.revenue = parseNorwegianNumber(value);
      }
      // ... check other fields
    }
  });

  return financialData;
}
```

---

## 📊 Comparison Table

| Solution | Accuracy (approx.) | Cost | Setup | Speed | Norwegian Support |
|----------|--------------------|------|-------|-------|-------------------|
| Tesseract.js | 70% | Free | Easy | Slow | Yes (with lang pack) |
| Google Vision | 95% | $1.50/1000 | Medium | Fast | Excellent |
| AWS Textract | 95% | From $1.50/1000 | Medium | Fast | Good |
| Azure Vision | 90% | $1/1000 | Medium | Fast | Good |
| OCR.space | 80% | Free/Paid | Easy | Medium | Good |

---

## 🚀 Quick Start

### Option 1: Free (Tesseract.js)

```bash
npm install tesseract.js
# Then use the existing OCRPDFParser class
```

### Option 2: Best Quality (AWS Textract)

```bash
npm install aws-sdk

# Set environment variables
export AWS_ACCESS_KEY=your_key
export AWS_SECRET_KEY=your_secret
```

### Option 3: Hybrid Approach

1. Try regular PDF text extraction first
2. If the extracted text is shorter than 500 characters, use OCR
3. For critical documents, use cloud OCR
4. Cache results in the database

---

## 🔧 Troubleshooting

### Issue: OCR returns 0 characters

- **Cause**: PDF-to-image conversion failed
- **Fix**: Install ImageMagick or use a cloud service

### Issue: Poor accuracy

- **Cause**: Low-quality scans
- **Fix**: Increase the DPI or use cloud OCR

### Issue: Slow processing

- **Cause**: Large PDFs with many pages
- **Fix**: Process only the first few pages or use a cloud service

---

## 📝 Final Recommendation

For the Norwegian financial PDFs from Brønnøysund:

1. **Development/Testing**: Use Tesseract.js (free, no dependencies)
2. **Production**: Use AWS Textract or Google Vision (high accuracy)
3. **Fallback**: Manual data entry with the downloaded PDFs

The current implementation with `OCRPDFParser` provides the foundation. To make it usable, do one of the following:

- Install ImageMagick for local OCR, or
- Switch to a cloud OCR service for better accuracy
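
The hybrid approach from Quick Start Option 3 could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: `extractWithFallback`, `extractText`, `runOcr`, and the Map-like `cache` are hypothetical names, and the caller is assumed to supply the real text extractor, OCR backend, and database wrapper.

```javascript
// Threshold from the guide: less text than this suggests a scanned, image-based PDF.
const MIN_TEXT_LENGTH = 500;

/**
 * Hybrid extraction: text layer first, OCR only when needed, results cached.
 * All three collaborators are hypothetical and injected by the caller:
 *   extractText(pdfPath) -> Promise<string>  regular PDF text extraction
 *   runOcr(pdfPath)      -> Promise<string>  Tesseract.js or a cloud OCR service
 *   cache                -> object with has/get/set (stand-in for the database)
 */
async function extractWithFallback(pdfPath, { extractText, runOcr, cache }) {
  // Step 4 of the list: reuse an earlier result instead of repeating the work
  if (cache.has(pdfPath)) {
    return { ...cache.get(pdfPath), cached: true };
  }

  // Step 1: try the embedded text layer
  const embedded = (await extractText(pdfPath)).trim();

  // Steps 2 and 3: fall back to OCR when the text layer is too short
  const result = embedded.length >= MIN_TEXT_LENGTH
    ? { text: embedded, source: 'text-layer' }
    : { text: await runOcr(pdfPath), source: 'ocr' };

  cache.set(pdfPath, result);
  return { ...result, cached: false };
}
```

Passing the collaborators in as arguments keeps the fallback logic independent of the OCR backend, so `runOcr` can point at Tesseract.js during development and at a cloud service for critical documents without changing this function.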
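
The Textract example in this guide calls `parseNorwegianNumber` without defining it. One possible version is sketched below as an assumption, not the project's actual helper. It assumes Norwegian formatting (space or period as thousands separator, comma as decimal mark) and treats parentheses or a leading minus as a negative amount.

```javascript
// Hypothetical implementation of the parseNorwegianNumber helper.
// Assumes Norwegian number formatting; returns null when the text is not a number.
function parseNorwegianNumber(raw) {
  if (typeof raw !== 'string') return null;

  // Normalize the Unicode minus sign that some PDFs use
  let text = raw.trim().replace(/\u2212/g, '-');
  let negative = false;

  // Accounting-style negative: (1 234)
  if (/^\(.*\)$/.test(text)) {
    negative = true;
    text = text.slice(1, -1).trim();
  }
  if (text.startsWith('-')) {
    negative = true;
    text = text.slice(1);
  }

  // Drop whitespace (including non-breaking spaces) and period thousands
  // separators, then turn the decimal comma into a decimal point
  const normalized = text.replace(/[\s.]/g, '').replace(',', '.');
  if (!/^\d+(\.\d+)?$/.test(normalized)) return null;

  const value = Number(normalized);
  return negative ? -value : value;
}

console.log(parseNorwegianNumber('1 234 567'));   // 1234567
console.log(parseNorwegianNumber('(12.345,50)')); // -12345.5
```

Because a period is always read as a thousands separator here, English-formatted input such as `1.5` would be parsed as 15, so this sketch only suits documents known to use Norwegian formatting.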
