ingest_documents
Process PDF files to extract text and images, generate structured Document Manifests, and optionally index into a knowledge graph for precise asset retrieval.
Instructions
Process PDF files and create Document Manifests.
ETL Pipeline:
1. Extract text (to markdown) and images
2. Generate structured Document Manifest
3. Index in LightRAG (if enabled)
Args:
    file_paths: List of absolute paths to PDF files.
    async_mode: Kept for backwards compatibility. PDF ingestion is routed to a background job from the MCP tool layer to keep stdio clients responsive.
    use_marker: If True, use Marker for structured parsing (slower but more accurate). Produces blocks.json with bbox/coordinates for precise source tracking. Default False uses PyMuPDF (faster but less structured).
    marker_max_pages_per_chunk: When using Marker, split PDFs into fixed-size page chunks. Set 0 to use the safe automatic strategy.
    extract_figures: When using Marker, control whether image crops are extracted and saved. Disable this first for image-heavy textbooks to reduce memory pressure.
    page_ranges: 1-indexed inclusive page ranges applied to every input file, e.g. ["1-50", "120-160"].
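The page_ranges format can be illustrated with a small sketch. The helper below is hypothetical (parse_page_ranges is not part of this tool) and only shows how 1-indexed inclusive ranges such as "1-50" map onto 0-indexed page positions. Accepting a bare page number like "7" is an assumption; the tool itself may require the "start-end" form.

```python
def parse_page_ranges(page_ranges):
    """Expand 1-indexed inclusive ranges into sorted 0-indexed page indices.

    Hypothetical helper for illustration; not the tool's own parser.
    """
    pages = set()
    for spec in page_ranges:
        start_text, sep, end_text = spec.partition("-")
        start = int(start_text)
        # Assumption: a bare number such as "7" means the single page 7.
        end = int(end_text) if sep else start
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {spec!r}")
        # Inclusive 1-indexed [start, end] becomes 0-indexed [start - 1, end).
        pages.update(range(start - 1, end))
    return sorted(pages)
```

Overlapping ranges collapse into a single set of pages, so passing ["1-2", "2-3"] selects three pages, not four.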
Returns:
Job ID for tracking progress with get_job_status.
Example:
    # Async (recommended for large files):
    ingest_documents(["/papers/study1.pdf"])
    # Then check status:
    get_job_status("job_xxx")

    # With Marker for precise source tracking:
    ingest_documents(["/papers/textbook.pdf"], use_marker=True)
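Because the tool returns a job ID and not the finished result, a client usually polls until the job completes. The sketch below is one possible way to do that. wait_for_job, fetch_status, and is_finished are hypothetical names, and the shape of the status payload returned by get_job_status is not documented here, so the caller supplies both the fetch function and the completion check.

```python
import time


def wait_for_job(fetch_status, is_finished, interval_s=2.0, timeout_s=600.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until is_finished(status) is true or the timeout passes.

    fetch_status: zero-argument callable wrapping the client's get_job_status call.
    is_finished:  predicate over whatever fetch_status returns (shape is assumed).
    """
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if is_finished(status):
            return status
        if clock() >= deadline:
            raise TimeoutError(f"job did not finish within {timeout_s} seconds")
        sleep(interval_s)
```

Injecting sleep and clock keeps the loop easy to test without real delays. A fixed interval is the simplest choice; a client that polls many jobs might prefer a growing backoff.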
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
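| file_paths | Yes | List of absolute paths to PDF files. | |
| async_mode | No | Kept for backwards compatibility; PDF ingestion is routed to a background job. | |
| use_marker | No | If True, use Marker for structured parsing (slower but more accurate); otherwise use PyMuPDF. | False |
| ocr_enabled | No | | |
| ocr_language | No | | eng |
| rotate_pages | No | | |
| deskew | No | | |
| marker_max_pages_per_chunk | No | When using Marker, split PDFs into fixed-size page chunks; 0 selects the safe automatic strategy. | |
| extract_figures | No | When using Marker, control whether image crops are extracted and saved. | |
| index_knowledge_graph | No | | |
| page_ranges | No | 1-indexed inclusive page ranges applied to every input file, e.g. ["1-50", "120-160"]. | |
| ctx | No | | |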
Output Schema
| Name | Required | Description | Default |
|---|---|---|---|
| result | Yes | Job ID for tracking progress with get_job_status. | |