RAGStack-Lambda


TEXT_EXTRACTORS.md•10.7 KiB

# Text Extractors Module

Text extraction for non-OCR document types. Auto-detects format and extracts to markdown.

## text_extractors/

```python
from ragstack_common.text_extractors import extract_text, ExtractionResult

def extract_text(content: bytes, filename: str) -> ExtractionResult
```

**ExtractionResult fields:**

- `markdown`: Extracted text as markdown
- `file_type`: Detected file type (html, csv, json, etc.)
- `title`: Document title if found
- `metadata`: Extracted metadata dict

**Supported formats:**

| Format | Extractor | Dependencies |
|--------|-----------|--------------|
| HTML | HtmlExtractor | - |
| TXT | TextExtractor | - |
| CSV | CsvExtractor | - |
| JSON | JsonExtractor | - |
| XML | XmlExtractor | - |
| EML | EmailExtractor | - |
| EPUB | EpubExtractor | ebooklib |
| DOCX | DocxExtractor | python-docx |
| XLSX | XlsxExtractor | openpyxl |

## Overview

The text extractors module provides format-specific extraction for structured documents. Unlike OCR (for images/PDFs), these extractors parse native file formats and convert to clean markdown for embedding and search.
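As a rough illustration of the split between text extraction and OCR described above, a caller could route files by extension before choosing a pipeline. This is only a sketch under assumptions: `TEXT_EXTENSIONS` and `needs_ocr` are hypothetical names that are not part of `ragstack_common`, and the extension set is simply read off the supported-formats table.

```python
from pathlib import PurePath

# Hypothetical set, copied from the "Supported formats" table above.
TEXT_EXTENSIONS = {"html", "txt", "csv", "json", "xml", "eml", "epub", "docx", "xlsx"}


def needs_ocr(filename: str) -> bool:
    """Return True when the extension is not one of the text-extractable formats."""
    # Normalize ".DOCX" -> "docx"; a file with no extension yields "".
    ext = PurePath(filename).suffix.lower().lstrip(".")
    return ext not in TEXT_EXTENSIONS


print(needs_ocr("report.DOCX"))  # False
print(needs_ocr("scan.pdf"))     # True
```

Files without an extension fall through to the OCR branch in this sketch; a real pipeline may prefer to reject them instead.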
## Usage

### Basic Extraction

```python
from ragstack_common.text_extractors import extract_text

# Read file
with open("document.html", "rb") as f:
    content = f.read()

# Extract text
result = extract_text(content, filename="document.html")

# Access extracted data
print(result.markdown)   # Markdown text
print(result.file_type)  # "html"
print(result.title)      # "Document Title"
print(result.metadata)   # {"author": "...", ...}
```

### S3 Integration

```python
from ragstack_common.storage import read_s3_binary
from ragstack_common.text_extractors import extract_text

# Read from S3
content = read_s3_binary("s3://bucket/document.docx")

# Extract text
result = extract_text(content, filename="document.docx")
```

### Format Detection

```python
# Auto-detects from filename extension
result = extract_text(content, filename="data.csv")   # Uses CsvExtractor
result = extract_text(content, filename="email.eml")  # Uses EmailExtractor
result = extract_text(content, filename="book.epub")  # Uses EpubExtractor
```

## Format-Specific Examples

### HTML Extraction

```python
html_content = b"""
<html>
<head><title>Sample Document</title></head>
<body>
    <h1>Heading</h1>
    <p>Paragraph text.</p>
    <ul>
        <li>Item 1</li>
        <li>Item 2</li>
    </ul>
</body>
</html>
"""

result = extract_text(html_content, filename="sample.html")

# result.markdown:
# # Heading
#
# Paragraph text.
#
# - Item 1
# - Item 2

# result.title: "Sample Document"
# result.file_type: "html"
```

**Features:**

- Converts HTML tags to markdown
- Extracts title from `<title>` tag
- Preserves lists, tables, headings
- Strips scripts, styles, navigation

### CSV Extraction

```python
# Content starts right after the opening quotes so the header is the first row
csv_content = b"""Name,Email,Department
John Doe,john@example.com,Engineering
Jane Smith,jane@example.com,Sales
"""

result = extract_text(csv_content, filename="employees.csv")

# result.markdown:
# | Name | Email | Department |
# |------|-------|------------|
# | John Doe | john@example.com | Engineering |
# | Jane Smith | jane@example.com | Sales |

# result.file_type: "csv"
# result.metadata: {"rows": 2, "columns": 3}
```

**Features:**

- Auto-detects delimiter (`,`, `\t`, `;`)
- Converts to markdown table
- Handles quoted values
- Preserves header row

### JSON Extraction

```python
json_content = b"""
{
    "title": "Configuration",
    "settings": {
        "theme": "dark",
        "notifications": true
    },
    "users": ["alice", "bob"]
}
"""

result = extract_text(json_content, filename="config.json")

# result.markdown:
# # Configuration
#
# ## settings
# - **theme**: dark
# - **notifications**: true
#
# ## users
# - alice
# - bob

# result.title: "Configuration"
# result.file_type: "json"
```

**Features:**

- Converts to hierarchical markdown
- Detects title from "title" or "name" fields
- Formats arrays as bullet lists
- Formats objects as key-value pairs

### XML Extraction

```python
xml_content = b"""
<document>
    <title>Sample XML</title>
    <section name="Introduction">
        <paragraph>Text content</paragraph>
    </section>
</document>
"""

result = extract_text(xml_content, filename="doc.xml")

# result.markdown:
# # Sample XML
#
# ## Introduction
# Text content

# result.title: "Sample XML"
# result.file_type: "xml"
```

**Features:**

- Converts to hierarchical markdown
- Preserves structure via headings
- Extracts attributes as metadata
- Handles nested elements

### Email Extraction (EML)

```python
# Headers must start on the first line; a blank line separates them from the body
eml_content = b"""From: sender@example.com
To: recipient@example.com
Subject: Meeting Notes
Date: Mon, 1 Jan 2024 10:00:00 -0500

Meeting notes:
- Discuss Q1 goals
- Review budget
"""

result = extract_text(eml_content, filename="meeting.eml")

# result.markdown:
# From: sender@example.com
# To: recipient@example.com
# Date: Mon, 1 Jan 2024 10:00:00 -0500
#
# Meeting notes:
# - Discuss Q1 goals
# - Review budget

# result.title: "Meeting Notes"
# result.file_type: "eml"
# result.metadata: {
#     "from": "sender@example.com",
#     "to": "recipient@example.com",
#     "date": "Mon, 1 Jan 2024 10:00:00 -0500"
# }
```

**Features:**

- Extracts headers (From, To, Subject, Date)
- Handles multipart messages
- Extracts plaintext parts only
- Preserves email thread structure

### EPUB Extraction

```python
# Read EPUB file
with open("book.epub", "rb") as f:
    epub_content = f.read()

result = extract_text(epub_content, filename="book.epub")

# result.markdown:
# # Book Title
#
# ## Chapter 1: Introduction
# Chapter text...
#
# ## Chapter 2: Main Content
# Chapter text...

# result.title: "Book Title"
# result.file_type: "epub"
# result.metadata: {
#     "author": "Author Name",
#     "publisher": "Publisher",
#     "language": "en"
# }
```

**Features:**

- Extracts all chapters
- Preserves chapter hierarchy
- Extracts metadata (title, author, publisher)
- Handles HTML content within EPUB

**Requires:** `pip install ebooklib`

### DOCX Extraction

```python
# Read DOCX file
with open("report.docx", "rb") as f:
    docx_content = f.read()

result = extract_text(docx_content, filename="report.docx")

# result.markdown:
# # Annual Report 2023
#
# ## Executive Summary
# Text content...
#
# ## Financial Results
# - Revenue: $1.2M
# - Profit: $300K

# result.title: "Annual Report 2023"
# result.file_type: "docx"
# result.metadata: {
#     "author": "Finance Team",
#     "created": "2024-01-15"
# }
```

**Features:**

- Extracts paragraphs and headings
- Preserves text formatting structure
- Extracts document properties
- Handles tables and lists

**Requires:** `pip install python-docx`

### XLSX Extraction

```python
# Read XLSX file
with open("data.xlsx", "rb") as f:
    xlsx_content = f.read()

result = extract_text(xlsx_content, filename="data.xlsx")

# result.markdown:
# # Sheet1
#
# | Name | Value | Status |
# |------|-------|--------|
# | Item A | 100 | Active |
# | Item B | 200 | Pending |
#
# # Sheet2
#
# | Column1 | Column2 |
# |---------|---------|
# | Data1 | Data2 |

# result.file_type: "xlsx"
# result.metadata: {"sheets": 2, "rows": 4}
```

**Features:**

- Extracts all sheets
- Converts to markdown tables
- Preserves cell values (no formulas)
- Handles merged cells

**Requires:** `pip install openpyxl`

## ExtractionResult API

```python
from dataclasses import dataclass

@dataclass
class ExtractionResult:
    markdown: str        # Extracted text as markdown
    file_type: str       # Detected file type
    title: str | None    # Document title (if found)
    metadata: dict       # Format-specific metadata
```

### Attributes

**markdown**

- Type: `str`
- Contains extracted text formatted as markdown
- Always populated (empty string if extraction fails)

**file_type**

- Type: `str`
- Format identifier: `html`, `csv`, `json`, `xml`, `eml`, `epub`, `docx`, `xlsx`, `txt`
- Used for logging and analytics

**title**

- Type: `str | None`
- Document title extracted from format-specific fields
- `None` if no title found

**metadata**

- Type: `dict`
- Format-specific metadata (author, date, row count, etc.)
- Empty dict if no metadata available

## Error Handling

```python
import logging

from ragstack_common.text_extractors import extract_text

logger = logging.getLogger(__name__)

try:
    result = extract_text(content, filename="document.docx")

    if not result.markdown.strip():
        logger.warning("Extraction produced empty text")

except ImportError as e:
    # Missing optional dependency (ebooklib, python-docx, openpyxl)
    logger.error(f"Missing dependency: {e}")

except UnicodeDecodeError as e:
    # Encoding issues
    logger.error(f"Encoding error: {e}")

except Exception as e:
    # Other extraction errors
    logger.error(f"Extraction failed: {e}")
```

## Integration with Document Pipeline

```python
from ragstack_common.storage import read_s3_binary, write_s3_text
from ragstack_common.text_extractors import extract_text
from ragstack_common.models import Document, Status

def process_text_document(document: Document) -> Document:
    """Extract text from non-OCR document formats."""
    # Read input file
    content = read_s3_binary(document.input_s3_uri)

    # Extract text
    result = extract_text(content, filename=document.filename)

    # Save markdown to S3
    output_s3_uri = document.input_s3_uri.replace("/input/", "/output/")
    output_s3_uri = output_s3_uri.rsplit(".", 1)[0] + ".md"
    write_s3_text(output_s3_uri, result.markdown, content_type="text/markdown")

    # Update document
    document.output_s3_uri = output_s3_uri
    document.status = Status.OCR_COMPLETE

    return document
```

## Fallback to OCR

Some file types (DOCX, XLSX) may contain embedded images or scanned content. Consider falling back to OCR when extraction yields little text:

```python
import logging

from ragstack_common.text_extractors import extract_text
from ragstack_common.ocr import OcrService

logger = logging.getLogger(__name__)

# `content` and `document` come from the surrounding pipeline code
result = extract_text(content, filename="document.docx")

# Check if extraction produced meaningful text
if len(result.markdown.strip()) < 100:
    logger.info("Text extraction insufficient, falling back to OCR")
    ocr_service = OcrService(backend="textract")
    document = ocr_service.process_document(document)
```

## Best Practices

1. **Format Detection**: Always provide an accurate filename with extension
2. **Dependencies**: Install optional dependencies only when needed (`ebooklib`, `python-docx`, `openpyxl`)
3. **Validation**: Check `result.markdown.strip()` for empty/insufficient extractions
4. **Fallback**: Have an OCR fallback for documents with embedded images
5. **Encoding**: UTF-8 recommended for text files; extractors handle format-specific encodings
6. **Large Files**: Consider chunking very large CSVs or spreadsheets
7. **Security**: Validate file types before extraction to prevent malicious uploads

## See Also

- [OCR.md](./OCR.md) - OCR processing for PDFs and images
- [STORAGE.md](./STORAGE.md) - S3 storage utilities
- [models.py](./UTILITIES.md#models) - Document data models
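
The "Large Files" practice above (chunking very large CSVs) could look roughly like the sketch below. It is not part of the module: `chunk_csv_bytes` is a hypothetical helper built only on the standard library, and it assumes UTF-8 input with a comma delimiter and a header in the first row.

```python
import csv
import io
from typing import Iterator


def chunk_csv_bytes(content: bytes, rows_per_chunk: int = 1000) -> Iterator[bytes]:
    """Yield CSV chunks of at most rows_per_chunk data rows, repeating the header in each."""
    reader = csv.reader(io.StringIO(content.decode("utf-8"), newline=""))
    header = next(reader, None)
    if header is None:
        return  # empty input: nothing to yield

    def render(rows: list[list[str]]) -> bytes:
        # Serialize the header plus one batch of rows back to CSV bytes.
        buf = io.StringIO(newline="")
        writer = csv.writer(buf, lineterminator="\n")
        writer.writerow(header)
        writer.writerows(rows)
        return buf.getvalue().encode("utf-8")

    batch: list[list[str]] = []
    for row in reader:
        batch.append(row)
        if len(batch) == rows_per_chunk:
            yield render(batch)
            batch = []
    if batch:
        yield render(batch)  # final partial chunk
```

Each chunk is itself a valid CSV, so it can be passed to `extract_text(chunk, filename="part.csv")` on its own. A header-only file yields no chunks in this sketch.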
