# OSS-216: Remove libmagic Dependency

If you are working on this feature, make sure to update this plan.md file as you go.

## PHASE 1: Create Pure Python File Detection Module [Completed ✅]

Build the core file detection system to replace libmagic with pure Python implementation.

### Create file_detector.py with basic structure [Completed ✅]

Create `/src/content_core/content/identification/file_detector.py` with:
- FileDetector class skeleton
- Basic signature mappings for binary formats (PDF, images)
- Simple detect() method that reads first 512 bytes
- Raise UnsupportedTypeException for unknown types

### Implement binary format detection [Completed ✅]

Add detection for binary formats:
- PDF files (magic bytes: `%PDF`)
- Common image formats (JPEG, PNG, GIF, TIFF, BMP)
- Audio formats (MP3, WAV, M4A)
- Video formats (MP4, AVI, MOV)
- Test each format with sample files

### Implement ZIP-based format detection [Completed ✅]

Handle Office and EPUB formats that use ZIP containers:
- Detect ZIP magic bytes (`PK\x03\x04`)
- Use zipfile module to inspect internal structure
- Differentiate DOCX (word/), XLSX (xl/), PPTX (ppt/), EPUB (META-INF/container.xml)
- Handle corrupted or password-protected ZIP files gracefully

### Comments:
- Focus on accurate detection over performance
- Ensure all MIME types match exactly what libmagic returns
- **Implementation notes from Phase 1:**
  - Added comprehensive binary signatures with ordered checking (longer signatures first)
  - Implemented generic ftyp box detection for MP4/MOV files for better compatibility
  - Added FLAC audio format support
  - Special RIFF handling differentiates between WAV and AVI
  - Text detection requires minimum content length to avoid false positives
  - All core file types tested and working correctly

## PHASE 2: Text Format Detection and Fallbacks [Completed ✅]

Implement text-based format detection and extension fallback mechanism.
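
To make the Phase 2 idea concrete, here is a minimal sketch of content sniffing followed by an extension fallback. It is not the project's code: `sniff_text`, `detect_mime`, the tiny `EXTENSION_MIME` table, and the local `UnsupportedTypeException` class are hypothetical stand-ins, and only HTML and JSON are sniffed to keep it short.

```python
import json
import logging
from pathlib import Path
from typing import Optional

logger = logging.getLogger(__name__)


class UnsupportedTypeException(Exception):
    """Stand-in for the project's exception of the same name (assumption)."""


# Hypothetical, deliberately small table; the plan mentions 70+ extensions.
EXTENSION_MIME = {
    ".html": "text/html",
    ".csv": "text/csv",
    ".json": "application/json",
    ".md": "text/plain",
    ".txt": "text/plain",
}


def sniff_text(head: bytes) -> Optional[str]:
    """Guess a MIME type from the first bytes of a file, or return None."""
    # Drop a leading BOM and whitespace before looking at the content.
    text = head.decode("utf-8", errors="replace").lstrip("\ufeff \t\r\n")
    if text.lower().startswith(("<!doctype html", "<html")):
        return "text/html"
    if text.startswith(("{", "[")):
        # A leading bracket is not enough (Markdown links start with "[" too),
        # so require that the buffer parses. Documents longer than the buffer
        # fail here and are left to the extension fallback.
        try:
            json.loads(text)
        except ValueError:
            return None
        return "application/json"
    return None


def detect_mime(path: str) -> str:
    """Content analysis first, file extension as the last resort."""
    file_path = Path(path)
    with file_path.open("rb") as fh:
        head = fh.read(512)
    mime = sniff_text(head)
    if mime is not None:
        return mime
    ext = file_path.suffix.lower()
    if ext in EXTENSION_MIME:
        logger.info("Falling back to extension %s for %s", ext, path)
        return EXTENSION_MIME[ext]
    raise UnsupportedTypeException(f"Cannot determine type of {path}")
```

Logging the fallback keeps it visible when content detection did not decide the result, which is the behaviour the plan asks for.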
### Add text format detection [Completed ✅]

Implement content analysis for text formats:
- HTML detection (DOCTYPE, `<html` tags)
- XML detection (`<?xml` declaration)
- JSON detection (starts with `{` or `[`)
- YAML detection (`---` header)
- Markdown detection (combine multiple indicators)
- CSV detection (analyze structure)
- Plain text as default for unrecognized text

### Implement extension fallback system [Completed ✅]

Create comprehensive extension mapping:
- Map common file extensions to MIME types
- Use as last resort when content detection fails
- Log when falling back to extension
- Maintain compatibility with current behavior

### Add detection method priority logic [Completed ✅]

Implement the agreed priority order:
1. Binary signature detection (most reliable)
2. Content analysis for text formats
3. File extension as final fallback

- Add logging at each detection stage
- Return appropriate MIME type or raise exception

### Comments:
- Text detection needs to be careful to avoid false positives
- Extension fallback ensures graceful degradation
- **Implementation notes from Phase 2:**
  - Enhanced JSON detection with pattern matching and keyword checking
  - Improved YAML detection to avoid conflicts with Markdown
  - Added sophisticated Markdown scoring system (headers, lists, links, etc.)
  - Extended extension mapping to cover more file types (70+ extensions)
  - Fixed YAML/Markdown detection priority to avoid false positives
  - Added minimum content requirements for text detection
  - All text formats tested with edge cases

## PHASE 3: Integration with Existing Code [Completed ✅]

Replace libmagic usage throughout the codebase.
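
As an illustration of what this integration could look like, the sketch below keeps an async `get_file_type()` while delegating to a synchronous detector. The real signature of `get_file_type()` and the body of `FileDetector` are not shown in this plan, so both are assumptions here, as is the choice to offload the file read with `asyncio.to_thread`.

```python
import asyncio


class UnsupportedTypeException(Exception):
    """Stand-in for the project's exception of the same name (assumption)."""


class FileDetector:
    """Hypothetical stand-in that only recognises PDF, for illustration."""

    def detect(self, file_path: str) -> str:
        with open(file_path, "rb") as fh:
            head = fh.read(512)
        if head.startswith(b"%PDF"):
            return "application/pdf"
        raise UnsupportedTypeException(f"Unknown file type: {file_path}")


async def get_file_type(file_path: str) -> str:
    """Async wrapper so existing callers can keep awaiting the result."""
    detector = FileDetector()
    # Run the blocking file read in a worker thread (Python 3.9+).
    return await asyncio.to_thread(detector.detect, file_path)
```

A caller would use it as `mime = await get_file_type(path)`, the same call shape it would have used around `magic.from_file()`.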
### Update identification module [Completed ✅]

Modify `/src/content_core/content/identification/__init__.py`:
- Import FileDetector
- Replace `magic.from_file()` call in `get_file_type()`
- Maintain async interface
- Remove magic import

### Update graph.py file type detection [Completed ✅]

Modify `/src/content_core/content/extraction/graph.py`:
- Replace `magic.from_file()` at line 62
- Import get_file_type from identification module
- Remove direct magic import
- Ensure error handling remains consistent

### Test integration thoroughly [Completed ✅]

Verify all extraction paths work:
- Test each supported file type through full pipeline
- Verify correct processor routing
- Check error messages for unsupported types
- Ensure no regression in functionality

### Comments:
- Must maintain exact same external behavior
- All existing code depending on MIME types should work unchanged
- **Implementation notes from Phase 3:**
  - Successfully replaced all libmagic usage with FileDetector
  - Integration was seamless - no changes needed to downstream processors
  - All file types correctly detected and routed to appropriate processors
  - Tested with PDF, DOCX, MP4, MP3, JSON, HTML, CSV, text files
  - Only test failure was unrelated (OpenAI API issue for MP3 transcription)
  - MIME types match exactly what libmagic returned

## PHASE 4: Cleanup and Final Validation [In Progress 🔄]

Remove dependencies and ensure production readiness.
### Remove libmagic from dependencies [Completed ✅]

Update `/pyproject.toml`:
- Remove `python-magic>=0.4.27`
- Remove `python-magic-bin==0.4.14` for Windows
- Update lock file with `uv sync`
- Verify clean installation works

**Implementation notes:**
- Successfully removed both python-magic dependencies from pyproject.toml
- Lock file updated with `uv sync` - 2 packages uninstalled: python-magic and python-magic-bin

### Add comprehensive test suite [Not Started ⏳]

Create thorough tests:
- Unit tests for FileDetector methods
- Integration tests for full extraction pipeline
- Edge cases (empty files, malformed files)
- Cross-platform compatibility tests
- Performance benchmarks

### Documentation and release preparation [Not Started ⏳]

Final preparations:
- Update README if it mentions libmagic
- Add docstrings to all new code
- Update CHANGELOG
- Test installation on fresh environment
- Run full test suite: `make test`
- Build package: `uv build`

### Comments:
- This is a breaking change for anyone depending on libmagic behavior
- Consider adding migration guide if needed

## Key Technical Details

**Critical MIME Types** (must match exactly):
- `application/pdf` - PDF files
- `application/epub+zip` - EPUB files
- `application/vnd.openxmlformats-officedocument.wordprocessingml.document` - DOCX
- `application/vnd.openxmlformats-officedocument.spreadsheetml.sheet` - XLSX
- `application/vnd.openxmlformats-officedocument.presentationml.presentation` - PPTX
- `text/plain` - Text and Markdown files
- `text/html` - HTML files
- `text/csv` - CSV files
- `application/json` - JSON files
- `image/*` - Various image formats
- `video/*` - Video files (prefix matching)
- `audio/*` - Audio files (prefix matching)

**Implementation Constraints**:
- 512-byte buffer is sufficient (no deep file inspection needed)
- Performance is not critical (small load expected)
- Must raise `UnsupportedTypeException` for unknown types
- Maintain async interface for consistency
- Pure Python only (no C extensions)

**Risk Mitigation**:
- Extensive testing before removing libmagic
- Keep detection logic modular for easy updates
- Log detection decisions for debugging
- Consider feature flag for rollback if needed
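
The binary and ZIP-container detection described in Phase 1, returning the critical MIME types listed above, might be sketched roughly as follows. This is an illustrative sketch and not the contents of `file_detector.py`: the function names, the short signature table, and the exact `audio/x-wav` and `video/x-msvideo` subtypes are assumptions (the plan only requires the `audio/*` and `video/*` prefixes).

```python
import zipfile
from typing import Optional

# Hypothetical signature table, checked longest-first as the Phase 1 notes describe.
SIGNATURES = sorted(
    [
        (b"%PDF", "application/pdf"),
        (b"\x89PNG\r\n\x1a\n", "image/png"),
        (b"\xff\xd8\xff", "image/jpeg"),
    ],
    key=lambda item: len(item[0]),
    reverse=True,
)

# Top-level folders that distinguish the Office Open XML formats.
OOXML_PREFIXES = {
    "word/": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "xl/": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "ppt/": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
}


def detect_zip_container(path: str) -> Optional[str]:
    """Inspect member names of a ZIP file; return None if it cannot be read."""
    try:
        with zipfile.ZipFile(path) as zf:
            names = zf.namelist()
    except zipfile.BadZipFile:
        return None  # corrupted archive: let a later stage decide
    if "META-INF/container.xml" in names:
        return "application/epub+zip"
    for prefix, mime in OOXML_PREFIXES.items():
        if any(name.startswith(prefix) for name in names):
            return mime
    return None


def detect_binary(path: str) -> Optional[str]:
    """Match magic bytes in the first 512 bytes of the file."""
    with open(path, "rb") as fh:
        head = fh.read(512)
    if head.startswith(b"PK\x03\x04"):
        # The signature only says "ZIP"; the member list tells the formats apart.
        return detect_zip_container(path)
    if head.startswith(b"RIFF") and len(head) >= 12:
        form_type = head[8:12]  # RIFF form type follows the 4-byte size field
        if form_type == b"WAVE":
            return "audio/x-wav"
        if form_type == b"AVI ":
            return "video/x-msvideo"
    for magic, mime in SIGNATURES:
        if head.startswith(magic):
            return mime
    return None
```

Note that the 512-byte buffer covers only the signature check here; telling DOCX, XLSX, PPTX, and EPUB apart relies on `zipfile` reading the archive's directory, which sits at the end of the file.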
