apaper_read_pdf_file
Extract text content from PDF files, supporting both local documents and online sources with customizable page ranges for academic research.
Instructions
Read and extract text content from a PDF file (local or online).
Args:
- pdf_source: Path to a local PDF file or URL of an online PDF.
- start_page: Starting page number (1-indexed, inclusive). Defaults to 1.
- end_page: Ending page number (1-indexed, inclusive). Defaults to the last page.
Input Schema
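| Name | Required | Description | Default |
|---|---|---|---|
| pdf_source | Yes | Path to a local PDF file or URL of an online PDF | |
| start_page | No | Starting page number (1-indexed, inclusive) | 1 |
| end_page | No | Ending page number (1-indexed, inclusive) | Last page |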
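As a rough illustration of the schema, the following Python sketch builds two argument payloads and mirrors the string-to-integer coercion that the handler in the Implementation Reference applies to page numbers. The helper name `coerce_page`, the example URL, and the example path are hypothetical and introduced only for illustration; they are not part of apaper.

```python
from __future__ import annotations


def coerce_page(value: int | str | None) -> int | None:
    """Hypothetical helper mirroring the handler's int() coercion of page arguments."""
    if value is None:
        return None  # the tool then falls back to its default (first or last page)
    return int(value)  # raises ValueError for non-numeric strings such as "ten"


# Example argument payloads (the URL and path are placeholders, not real documents).
whole_document = {"pdf_source": "https://example.org/paper.pdf"}
first_three_pages = {"pdf_source": "papers/draft.pdf", "start_page": 1, "end_page": "3"}

start = coerce_page(first_three_pages.get("start_page"))
end = coerce_page(first_three_pages.get("end_page"))
print(start, end)
```

Because the handler accepts `int | str | None` for both page parameters, a client that serializes numbers as strings should still work, while a non-numeric string leads to the handler's "Invalid page number format" error message.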
Implementation Reference
- src/apaper/server.py:456-487 (handler): MCP tool handler 'read_pdf_file' (likely namespaced as 'apaper_read_pdf_file') that reads PDF text from a local file or URL, converting string page parameters to int and calling the pdf_reader utility.

      @mcp.tool()
      def read_pdf_file(
          pdf_source: str,
          start_page: int | str | None = None,
          end_page: int | str | None = None,
      ) -> str:
          """
          Read and extract text content from a PDF file (local or online)

          Args:
              pdf_source: Path to local PDF file or URL to online PDF
              start_page: Starting page number (1-indexed, inclusive). Defaults to 1.
              end_page: Ending page number (1-indexed, inclusive). Defaults to last page.
          """
          try:
              # Convert string parameters to integers if needed
              start_page_int = None
              end_page_int = None

              if start_page is not None:
                  start_page_int = int(start_page)

              if end_page is not None:
                  end_page_int = int(end_page)

              result = read_pdf(pdf_source, start_page=start_page_int, end_page=end_page_int)
              return result
          except ValueError as e:
              return f"Error: Invalid page number format. Please provide valid integers for start_page and end_page."
          except Exception as e:
              return f"Error reading PDF from {pdf_source}: {str(e)}"
- src/apaper/utils/pdf_reader.py:10-43 (helper): Core utility function 'read_pdf' that handles PDF text extraction for both local files and URLs, dispatching to private helpers and normalizing page ranges. Called by the MCP handler.

      def read_pdf(pdf_source: str | Path, start_page: int | None = None, end_page: int | None = None) -> str:
          """
          Extract text content from a PDF file (local or online).

          Args:
              pdf_source: Path to local PDF file or URL to online PDF
              start_page: Starting page number (1-indexed, inclusive). Defaults to 1.
              end_page: Ending page number (1-indexed, inclusive). Defaults to last page.

          Returns:
              str: Extracted text content from the PDF

          Raises:
              FileNotFoundError: If local file doesn't exist
              ValueError: If URL is invalid, PDF cannot be processed, or page range is invalid
              Exception: For other PDF processing errors
          """
          try:
              if isinstance(pdf_source, str | Path):
                  pdf_source_str = str(pdf_source)

                  # Check if it's a URL
                  parsed = urlparse(pdf_source_str)
                  if parsed.scheme in ("http", "https"):
                      # Handle online PDF
                      return _read_pdf_from_url(pdf_source_str, start_page, end_page)
                  else:
                      # Handle local file
                      return _read_pdf_from_file(Path(pdf_source_str), start_page, end_page)
              else:
                  raise ValueError("pdf_source must be a string or Path object")

          except Exception as e:
              raise Exception(f"Failed to read PDF from {pdf_source}: {e!s}") from e
- src/apaper/server.py:22-22 (registration): FastMCP server initialization with namespace 'apaper', which likely prefixes tool names (e.g., 'apaper_read_pdf_file'). All @mcp.tool() decorators register tools here.

      mcp = FastMCP("apaper")
- src/apaper/utils/pdf_reader.py:46-83 (helper): Helper function to normalize and validate PDF page range inputs (1-indexed to 0-indexed). Used by read_pdf.

      def _normalize_page_range(start_page: int | None, end_page: int | None, total_pages: int) -> tuple[int, int]:
          """
          Normalize and validate page range parameters.

          Args:
              start_page: Starting page number (1-indexed, inclusive) or None
              end_page: Ending page number (1-indexed, inclusive) or None
              total_pages: Total number of pages in the PDF

          Returns:
              tuple[int, int]: (start_index, end_index) as 0-indexed values

          Raises:
              ValueError: If page range is invalid
          """
          # Default values
          if start_page is None:
              start_page = 1
          if end_page is None:
              end_page = total_pages

          # Validate page numbers
          if start_page < 1:
              raise ValueError(f"start_page must be >= 1, got {start_page}")
          if end_page < 1:
              raise ValueError(f"end_page must be >= 1, got {end_page}")
          if start_page > end_page:
              raise ValueError(f"start_page ({start_page}) must be <= end_page ({end_page})")
          if start_page > total_pages:
              raise ValueError(f"start_page ({start_page}) exceeds total pages ({total_pages})")

          # Clamp end_page to total_pages
          if end_page > total_pages:
              end_page = total_pages

          # Convert to 0-indexed
          return start_page - 1, end_page - 1
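
To make the range rules and the source dispatch concrete, here is a small self-contained sketch that restates the normalization logic above in condensed form and pairs it with the URL scheme check from 'read_pdf'. The names `normalize` and `is_remote`, the example URL, and the example path are hypothetical stand-ins used for illustration; they are not functions or files in apaper.

```python
from __future__ import annotations

from urllib.parse import urlparse


def is_remote(pdf_source: str) -> bool:
    """Same test read_pdf uses: only http and https schemes count as online sources."""
    return urlparse(pdf_source).scheme in ("http", "https")


def normalize(start_page: int | None, end_page: int | None, total_pages: int) -> tuple[int, int]:
    """Condensed restatement of the validation rules; returns 0-indexed, inclusive bounds."""
    start = 1 if start_page is None else start_page
    end = total_pages if end_page is None else end_page
    if start < 1 or end < 1:
        raise ValueError("page numbers must be >= 1")
    if start > end:
        raise ValueError("start_page must be <= end_page")
    if start > total_pages:
        raise ValueError("start_page exceeds total pages")
    # An end_page past the last page is clamped rather than rejected.
    return start - 1, min(end, total_pages) - 1


print(normalize(None, None, 10))  # (0, 9): defaults cover the whole document
print(normalize(3, 50, 10))       # (2, 9): an oversized end_page is clamped
print(is_remote("https://example.org/paper.pdf"))  # True
print(is_remote("papers/draft.pdf"))               # False
```

Note the asymmetry in the rules: an end_page beyond the last page is silently clamped, whereas a start_page beyond the last page raises ValueError.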