| read_pdf | Extract text from PDF files with intelligent page handling and chunking. Args:
file_path: Path to the PDF file
pages: Page range (e.g., '1,3,5-10,-1' for pages 1, 3, 5 to 10, and last page)
chunk_size: Maximum size of text chunks
chunk_overlap: Overlap between chunks to preserve context
Returns:
JSON string with extracted text and metadata
|
| split_pdf | Split PDF into multiple files based on page ranges. Args:
file_path: Path to the source PDF file
split_ranges: List of page ranges (e.g., ["1-5", "6-10", "11-15"])
output_dir: Output directory (defaults to source file directory)
prefix: Output file prefix (defaults to source filename)
Returns:
JSON string with split operation results and output file information
|
| extract_pages | Extract specific pages from PDF to a new file. Args:
file_path: Path to the source PDF file
pages: Page range (e.g., "1,3,5-7" for pages 1, 3, and 5 to 7)
output_file: Output filename (optional, auto-generated if not provided)
output_dir: Output directory (defaults to source file directory)
Returns:
JSON string with extraction results and output file information
|
| merge_pdfs | Merge multiple PDF files into a single file. Pages are processed in the order specified in the file_paths list,
preserving the original page sequence in the merged document.
Args:
file_paths: List of PDF file paths to merge
output_file: Output filename (optional, auto-generated if not provided)
output_dir: Output directory (defaults to first file's directory)
Returns:
JSON string with merge results and output file information
|
| ocr_pdf | Perform OCR on PDF pages using Tesseract for scanned documents. Args:
file_path: Path to the PDF file
pages: Page range (e.g., '1,3,5-10,-1' for pages 1, 3, 5 to 10, and last page)
language: OCR language code (default: 'chi_sim' for simplified Chinese)
chunk_size: Maximum size of text chunks
chunk_overlap: Overlap between chunks to preserve context
dpi: DPI for PDF to image conversion (higher = better quality, slower)
Returns:
JSON string with OCR results and metadata
|
| pdf_to_images | Convert PDF pages to images. Args:
file_path: Path to PDF file
pages: Page range (e.g., '1,3,5-10,-1' for pages 1, 3, 5 to 10, and last page)
dpi: Resolution for image conversion (default: 200)
image_format: Output format ('PNG', 'JPEG', etc.)
output_dir: Directory to save images (default: auto-generated)
save_to_disk: Whether to save images to disk or keep in memory
Returns:
JSON string with conversion results and file paths
|
| images_to_pdf | Convert multiple images to a single PDF. Images are processed in the order specified in the image_paths list,
preserving their sequence in the final PDF document.
Args:
image_paths: List of image file paths to convert
output_file: Output PDF file path
page_size: Page size ('A4', 'Letter', 'Legal', or 'auto')
quality: JPEG quality for compression (1-100)
title: PDF document title (optional)
author: PDF document author (optional)
Returns:
JSON string with conversion results
|
| extract_pdf_images | Extract images from PDF pages. Args:
file_path: Path to PDF file
pages: Page range (e.g., '1,3,5-10,-1' for specific pages)
min_size: Minimum image size to extract (format: 'WIDTHxHEIGHT', e.g., '100x100')
output_dir: Directory to save extracted images (default: auto-generated)
Returns:
JSON string with extraction results and file paths
|
| get_pdf_metadata | Read PDF metadata including standard fields and optionally XMP metadata. Args:
file_path: Path to PDF file
include_xmp: Whether to include advanced XMP metadata (default: False)
Returns:
JSON string with comprehensive metadata information
|
| set_pdf_metadata | Write or update PDF metadata fields. Args:
file_path: Path to source PDF file
output_file: Output PDF file path (optional, defaults to overwriting the source file)
title: Document title
author: Document author
subject: Document subject
creator: Creator application name
producer: Producer application name
keywords: Keywords or tags (comma-separated)
preserve_existing: Whether to preserve existing metadata (default: True)
Returns:
JSON string with operation results
|
| remove_pdf_metadata | Remove specific metadata fields or all metadata from PDF. The fields_to_remove and remove_all parameters are mutually exclusive:
use either fields_to_remove for selective removal OR remove_all for complete removal.
Args:
file_path: Path to source PDF file
output_file: Output PDF file path (optional, defaults to overwriting the source file)
fields_to_remove: List of specific fields to remove (e.g., ['title', 'author'])
remove_all: Remove all metadata if True (default: False)
Returns:
JSON string with operation results
|
| search_pdf_text | Search for text content across PDF pages with detailed match information. Args:
file_path: Path to PDF file
query: Text to search for (or regex pattern if regex_search=True)
pages: Page range (e.g., '1,3,5-10,-1') or None for all pages
case_sensitive: Whether search is case-sensitive (default: False)
regex_search: Whether to treat query as regex pattern (default: False)
context_chars: Number of characters to show around matches (default: 100)
max_matches: Maximum number of matches to return (default: 100)
Returns:
JSON string with search results, match locations, and context
|
| extract_page_text | Extract text from a specific PDF page with various extraction options. Args:
file_path: Path to PDF file
page_number: Page number to extract (1-based)
extraction_mode: Text extraction mode ('default', 'layout', 'simple')
Returns:
JSON string with extracted text and statistics
|
| find_and_highlight_text | Find text and return information for highlighting matches. Args:
file_path: Path to PDF file
query: Text to search for
pages: Page range (e.g., '1,3,5-10,-1') or None for all pages
case_sensitive: Whether search is case-sensitive (default: False)
Returns:
JSON string with page highlights and position information
|
| optimize_pdf | Optimize PDF file using various compression techniques. Args:
file_path: Path to source PDF file
output_file: Output PDF file path (optional, defaults to the source filename with an '_optimized' suffix)
optimization_level: Optimization preset ('light', 'medium', 'heavy', 'maximum')
Returns:
JSON string with optimization results and file size statistics
|
| compress_pdf_images | Compress images in PDF while preserving document structure. Args:
file_path: Path to source PDF file
output_file: Output PDF file path (optional, auto-generated)
quality: Image compression quality (1-100, where 100=best quality)
Returns:
JSON string with compression results and statistics
|
| remove_pdf_content | Remove specific content from PDF to reduce file size. Args:
file_path: Path to source PDF file
output_file: Output PDF file path (optional, auto-generated)
remove_images: Whether to remove all images
remove_annotations: Whether to remove annotations
compress_streams: Whether to compress content streams
Returns:
JSON string with content removal results and statistics
|
| analyze_pdf_size | Analyze PDF file to identify optimization opportunities. Provides detailed size breakdown by content type (text, images, metadata, etc.)
and recommends specific optimization strategies for file size reduction.
Args:
file_path: Path to PDF file to analyze
Returns:
JSON string with size analysis breakdown and optimization recommendations
|
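
Several tools in the table accept a `pages` string such as `'1,3,5-10,-1'`, and `read_pdf` and `ocr_pdf` also take `chunk_size` and `chunk_overlap`. The sketch below shows one plausible reading of those parameters. The function names `parse_page_range` and `chunk_text`, the treatment of a negative number as counting from the end, and the character-based chunking are assumptions made for illustration; the server's actual implementation may differ.

```python
from typing import List


def parse_page_range(spec: str, total_pages: int) -> List[int]:
    """Expand a spec like '1,3,5-10,-1' into sorted, unique 1-based page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if part.startswith("-"):
            # Assumption: a leading minus counts from the end, so -1 is the last page.
            start = end = total_pages + 1 + int(part)
        elif "-" in part:
            first, last = part.split("-", 1)
            start, end = int(first), int(last)
        else:
            start = end = int(part)
        if start < 1 or end > total_pages or start > end:
            raise ValueError(f"page range {part!r} is outside 1..{total_pages}")
        pages.update(range(start, end + 1))
    return sorted(pages)


def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into chunks of at most chunk_size characters.

    Assumption: sizes are measured in characters, and each chunk repeats the
    last chunk_overlap characters of the previous chunk.
    """
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current chunk already reaches the end of the text
    return chunks


if __name__ == "__main__":
    print(parse_page_range("1,3,5-10,-1", total_pages=12))
    print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
```

Under these assumptions, a 12-page document with the spec `'1,3,5-10,-1'` selects pages 1, 3, 5 through 10, and 12, and each chunk shares its first `chunk_overlap` characters with the end of the previous chunk, so a sentence cut at a chunk boundary still appears with some of its context in the next chunk.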