CompanyIQ MCP Server

by josuekongolo
PDF_PARSING_SOLUTION.md
# 📄 PDF Parsing Challenge & Solution

## 🔍 The Problem

The PDFs downloaded from Brønnøysund are **image-based (scanned documents)** without extractable text:

- 32 pages but only 64 characters of metadata
- No searchable text layer
- Standard PDF parsers return empty/N/A values

## 💡 The Solution: Hybrid Approach

### 1. **Automatic PDF Downloads ✅ (Working)**

- Browser automation successfully downloads ALL PDFs
- Files saved to `~/Downloads/` or `data/pdfs/`
- Named format: `aarsregnskap_{org_nr}-{year}.pdf`

### 2. **Data Extraction Strategy**

#### Option A: API + Manual Entry (Recommended)

```text
Automatic for latest year
✅ 2024: API returns complete data (revenue, profit, assets, equity)

Historical years: manual entry after viewing PDFs
📄 2023: PDF downloaded → User views → Enters data
📄 2022: PDF downloaded → User views → Enters data
```

#### Option B: OCR Integration (Future Enhancement)

```bash
# Install OCR tool (e.g., Tesseract) and poppler (provides pdftoppm)
brew install tesseract poppler

# Tesseract does not read PDFs directly, so render each page to a PNG first
pdftoppm -r 300 -png aarsregnskap_999059198-2023.pdf page

# OCR every page image; each run writes <image name>.txt
# (-l nor requires the Norwegian language data to be installed)
for img in page-*.png; do tesseract "$img" "${img%.png}" -l nor; done

# Parse extracted text
```

#### Option C: Use Browser Extraction

Instead of downloading PDFs, scrape the data directly from the webpage tables if available.

---

## 🛠️ Current Working Implementation

### What Works Now:

1. **Browser automation** - Opens Chrome, navigates to company page ✅
2. **Finds all years** - Discovers 2012-2024 for test company ✅
3. **Downloads PDFs** - Clicks download links, saves files ✅
4. **API data extraction** - Gets latest year (2024) with full data ✅
5. **Database storage** - Saves all data for instant access ✅

### What Needs Manual Work:

- **Historical PDF data extraction** - PDFs are scanned images
- User must view PDFs and manually enter data OR use OCR tools

---

## 📊 Practical Workflow

### For Each Company:

```bash
# 1. Run auto-scraper (30 seconds) with the tool prompt:
#    "auto_scrape_financials for 999059198"
#
# Result:
#    ✅ Downloads all PDFs (2012-2024)
#    ✅ Extracts 2024 data from API
#    ⚠️ Historical years need manual entry

# 2. View downloaded PDFs
open ~/Downloads/aarsregnskap_999059198-2023.pdf

# 3. Manual data entry with the tool prompt:
#    "import_financials_from_file data.csv"
```

---

## 🔧 Enhanced Solution with OCR

### Install OCR Dependencies:

```bash
# macOS
brew install tesseract
brew install tesseract-lang  # additional language data, including Norwegian (nor)
brew install poppler  # for pdf2image

# Install Python OCR libraries
pip install pytesseract pdf2image
```

### OCR Script Example:

```python
import pytesseract
from pdf2image import convert_from_path

# Convert each PDF page to an image (requires poppler)
pages = convert_from_path('aarsregnskap_999059198-2023.pdf')

# Extract text from each page using the Norwegian language data
for i, page in enumerate(pages):
    text = pytesseract.image_to_string(page, lang='nor')

    # Look for financial data patterns
    if 'Driftsinntekter' in text:
        # Extract revenue
        pass

    if 'Årsresultat' in text:
        # Extract profit
        pass
```

---

## 📈 Alternative: Direct Web Scraping

Instead of downloading PDFs, scrape data directly from the webpage:

```typescript
// In browser_scraper.ts
const financialTable = await page.evaluate(() => {
  // Find table with financial data
  const rows = document.querySelectorAll('tr');
  const data: Record<string, string> = {};

  rows.forEach(row => {
    const cells = row.querySelectorAll('td');
    if (cells[0]?.textContent?.includes('Driftsinntekter')) {
      data.revenue = cells[1]?.textContent?.trim() ?? '';
    }
    // ... extract more fields
  });

  return data;
});
```

---

## 🎯 Summary

### Current Status:

- ✅ **PDF downloads work perfectly**
- ✅ **Latest year data from API**
- ⚠️ **Historical PDFs are image-based**

### Best Practice Now:

1. Use `auto_scrape_financials` to download all PDFs
2. API provides latest year automatically
3. View historical PDFs manually and use `import_financials` tool

### Future Improvements:

1. Integrate OCR for automatic text extraction
2. Scrape data directly from webpage tables
3. Use AI vision models to read PDF images

---

## 📝 Manual Data Entry Template

After viewing the downloaded PDFs, use this CSV template:

```csv
org_nr,year,revenue,profit,assets,equity,source
999059198,2023,750000000,80000000,400000000,250000000,manual
999059198,2022,700000000,75000000,380000000,240000000,manual
```

Then import:

```
"import_financials_from_file /path/to/data.csv"
```

This hybrid approach ensures you get ALL the data: automatically where possible, manually where necessary!
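
Because manually typed figures are easy to get wrong, it can help to sanity-check the CSV before importing it. The following is a minimal sketch, assuming the seven-column layout from the template above. The function name `validate_financials_csv` is hypothetical and not part of CompanyIQ, and the checks that `import_financials_from_file` itself performs are not documented here.

```python
import csv
import io

# Column layout taken from the manual data entry template.
EXPECTED_COLUMNS = ["org_nr", "year", "revenue", "profit", "assets", "equity", "source"]
INTEGER_COLUMNS = ["year", "revenue", "profit", "assets", "equity"]


def validate_financials_csv(text):
    """Hypothetical helper: return (rows, errors) for CSV text in the template layout."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != EXPECTED_COLUMNS:
        return [], [f"line 1: unexpected header {reader.fieldnames}"]

    rows, errors = [], []
    for raw in reader:
        line_no = reader.line_num  # physical line of the row just read
        # Missing trailing fields come back as None, so normalise to "".
        row = {col: (raw.get(col) or "").strip() for col in EXPECTED_COLUMNS}
        problems = []

        # Norwegian organisation numbers are 9 digits long.
        org_nr = row["org_nr"]
        if not (org_nr.isascii() and org_nr.isdigit() and len(org_nr) == 9):
            problems.append("org_nr must be 9 digits")

        # Year and amounts must be whole numbers (negative profit is allowed).
        for col in INTEGER_COLUMNS:
            try:
                row[col] = int(row[col])
            except ValueError:
                problems.append(f"{col} must be an integer")

        if problems:
            errors.append(f"line {line_no}: " + "; ".join(problems))
        else:
            rows.append(row)
    return rows, errors
```

Passing the two template rows through `validate_financials_csv` returns two parsed rows and an empty error list. A row with a short organisation number or a non-numeric amount is reported with its line number instead of being passed on.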
