apaper_read_pdf_file

Extract text content from PDF files, supporting both local documents and online sources with customizable page ranges for academic research.

Instructions

Read and extract text content from a PDF file (local or online).

Args:
    pdf_source: Path to local PDF file or URL to online PDF.
    start_page: Starting page number (1-indexed, inclusive). Defaults to 1.
    end_page: Ending page number (1-indexed, inclusive). Defaults to last page.

Input Schema

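Name	Required	Description	Default
`pdf_source`	Yes	Path to local PDF file or URL to online PDF	
`start_page`	No	Starting page number (1-indexed, inclusive)	1
`end_page`	No	Ending page number (1-indexed, inclusive)	Last page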
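
As an illustration of the schema above, the following sketch builds the arguments object a client might send to this tool. The helper name `build_arguments` and the `example.org` URL are hypothetical and not part of the server; only the three field names come from the schema.

```python
from __future__ import annotations

import json


def build_arguments(
    pdf_source: str,
    start_page: int | None = None,
    end_page: int | None = None,
) -> dict:
    """Build the tool arguments, leaving out optional fields that are not set."""
    arguments: dict = {"pdf_source": pdf_source}  # the only required field
    if start_page is not None:
        arguments["start_page"] = start_page
    if end_page is not None:
        arguments["end_page"] = end_page
    return arguments


# Whole document: only the required field is sent.
print(json.dumps(build_arguments("paper.pdf")))

# Pages 2 to 4 (inclusive) of an online PDF; the URL is a placeholder.
print(json.dumps(build_arguments("https://example.org/paper.pdf", 2, 4)))
```

Omitting the optional fields lets the server apply its own defaults (first page to last page).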

Output Schema

Name	Required	Description	Default
`result`	Yes	Extracted text, or a message starting with "Error" if reading fails	

Implementation Reference

src/apaper/server.py:456-487 (handler)

MCP tool handler 'read_pdf_file' (likely namespaced as 'apaper_read_pdf_file') that reads PDF text from local file or URL, converting string page params to int and calling the pdf_reader utility.

@mcp.tool()
def read_pdf_file(
    pdf_source: str,
    start_page: int | str | None = None,
    end_page: int | str | None = None,
) -> str:
    """
    Read and extract text content from a PDF file (local or online)
    
    Args:
        pdf_source: Path to local PDF file or URL to online PDF
        start_page: Starting page number (1-indexed, inclusive). Defaults to 1.
        end_page: Ending page number (1-indexed, inclusive). Defaults to last page.
    """
    try:
        # Convert string parameters to integers if needed
        start_page_int = None
        end_page_int = None
        
        if start_page is not None:
            start_page_int = int(start_page)
        
        if end_page is not None:
            end_page_int = int(end_page)
        
        result = read_pdf(pdf_source, start_page=start_page_int, end_page=end_page_int)
        return result
    except ValueError:
        # Raised by int() when a page argument is not a valid integer string
        return "Error: Invalid page number format. Please provide valid integers for start_page and end_page."
    except Exception as e:
        return f"Error reading PDF from {pdf_source}: {e}"
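
The page-number conversion in the handler can be tried in isolation. This is a minimal sketch; `coerce_page` is a hypothetical name introduced here, and it relies only on the built-in `int()` exactly as the handler does.

```python
from __future__ import annotations


def coerce_page(value: int | str | None) -> int | None:
    """Mirror the handler's conversion: None stays None, anything else goes through int()."""
    if value is None:
        return None
    return int(value)  # raises ValueError for strings such as "three" or "2.5"


print(coerce_page("5"))   # 5
print(coerce_page(7))     # 7
print(coerce_page(None))  # None

try:
    coerce_page("three")
except ValueError:
    # The handler turns this case into its "Invalid page number format" message.
    print("invalid page number")
```

Accepting both `int` and `str` makes the tool tolerant of clients that serialize numeric arguments as strings.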

src/apaper/utils/pdf_reader.py:10-43 (helper)

Core utility function 'read_pdf' that extracts PDF text from local files and URLs by dispatching to the private helpers '_read_pdf_from_url' and '_read_pdf_from_file', passing the page range through unchanged. Called by the MCP handler.

def read_pdf(pdf_source: str | Path, start_page: int | None = None, end_page: int | None = None) -> str:
    """
    Extract text content from a PDF file (local or online).

    Args:
        pdf_source: Path to local PDF file or URL to online PDF
        start_page: Starting page number (1-indexed, inclusive). Defaults to 1.
        end_page: Ending page number (1-indexed, inclusive). Defaults to last page.

    Returns:
        str: Extracted text content from the PDF

    Raises:
        Exception: Wraps any underlying failure, such as a FileNotFoundError for a
            missing local file or a ValueError for an invalid URL, an unreadable PDF,
            or an invalid page range. The original error is chained as __cause__.
    """
    try:
        if isinstance(pdf_source, str | Path):
            pdf_source_str = str(pdf_source)

            # Check if it's a URL
            parsed = urlparse(pdf_source_str)
            if parsed.scheme in ("http", "https"):
                # Handle online PDF
                return _read_pdf_from_url(pdf_source_str, start_page, end_page)
            else:
                # Handle local file
                return _read_pdf_from_file(Path(pdf_source_str), start_page, end_page)
        else:
            raise ValueError("pdf_source must be a string or Path object")

    except Exception as e:
        raise Exception(f"Failed to read PDF from {pdf_source}: {e!s}") from e
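
The URL-versus-local decision above depends only on the scheme reported by `urlparse`. The sketch below isolates that check; `is_remote_pdf` is a hypothetical name, and the sample paths and URLs are placeholders.

```python
from urllib.parse import urlparse


def is_remote_pdf(pdf_source: str) -> bool:
    """Return True when the source would be routed to the URL reader."""
    return urlparse(pdf_source).scheme in ("http", "https")


samples = [
    "https://example.org/paper.pdf",  # remote
    "HTTP://example.org/paper.pdf",   # remote: urlparse lowercases the scheme
    "papers/paper.pdf",               # local: no scheme at all
    "C:\\papers\\paper.pdf",          # local: the drive letter parses as scheme "c"
    "ftp://example.org/paper.pdf",    # treated as a local path: only http(s) is remote
]

for source in samples:
    print(f"{source!r}: {'url' if is_remote_pdf(source) else 'local file'}")
```

Checking against an explicit allow-list of schemes, rather than testing for any scheme, is what keeps Windows drive letters from being mistaken for URLs.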

src/apaper/server.py:22-22 (registration)

FastMCP server initialization with the server name 'apaper', which likely accounts for the prefix on tool names (e.g., 'apaper_read_pdf_file'). All @mcp.tool() decorators register tools on this instance.
```
mcp = FastMCP("apaper")
```

src/apaper/utils/pdf_reader.py:46-83 (helper)

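Helper function that validates PDF page range inputs and converts them from 1-indexed to 0-indexed. Used in the read_pdf call path, presumably by the private file and URL readers, since read_pdf itself passes the page arguments through unchanged.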

def _normalize_page_range(start_page: int | None, end_page: int | None, total_pages: int) -> tuple[int, int]:
    """
    Normalize and validate page range parameters.
    
    Args:
        start_page: Starting page number (1-indexed, inclusive) or None
        end_page: Ending page number (1-indexed, inclusive) or None
        total_pages: Total number of pages in the PDF
        
    Returns:
        tuple[int, int]: (start_index, end_index) as 0-indexed values
        
    Raises:
        ValueError: If page range is invalid
    """
    # Default values
    if start_page is None:
        start_page = 1
    if end_page is None:
        end_page = total_pages
        
    # Validate page numbers
    if start_page < 1:
        raise ValueError(f"start_page must be >= 1, got {start_page}")
    if end_page < 1:
        raise ValueError(f"end_page must be >= 1, got {end_page}")
    if start_page > end_page:
        raise ValueError(f"start_page ({start_page}) must be <= end_page ({end_page})")
    if start_page > total_pages:
        raise ValueError(f"start_page ({start_page}) exceeds total pages ({total_pages})")
        
    # Clamp end_page to total_pages
    if end_page > total_pages:
        end_page = total_pages
        
    # Convert to 0-indexed
    return start_page - 1, end_page - 1
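
A few worked cases make the normalization rules concrete. The block below uses a condensed copy of the helper (renamed `normalize_page_range` so it runs on its own) and a list of strings as a stand-in for a 10-page PDF; how the real readers consume the returned indices is an assumption.

```python
from __future__ import annotations


def normalize_page_range(
    start_page: int | None, end_page: int | None, total_pages: int
) -> tuple[int, int]:
    """Condensed copy of the helper: same checks, same 0-indexed inclusive result."""
    start_page = 1 if start_page is None else start_page
    end_page = total_pages if end_page is None else end_page
    if start_page < 1:
        raise ValueError(f"start_page must be >= 1, got {start_page}")
    if end_page < 1:
        raise ValueError(f"end_page must be >= 1, got {end_page}")
    if start_page > end_page:
        raise ValueError(f"start_page ({start_page}) must be <= end_page ({end_page})")
    if start_page > total_pages:
        raise ValueError(f"start_page ({start_page}) exceeds total pages ({total_pages})")
    # An end_page past the last page is clamped rather than rejected.
    return start_page - 1, min(end_page, total_pages) - 1


pages = [f"page {n}" for n in range(1, 11)]  # stand-in for a 10-page PDF

start, end = normalize_page_range(3, 5, len(pages))
print(start, end)                  # 2 4
print(pages[start:end + 1])        # ['page 3', 'page 4', 'page 5']

print(normalize_page_range(None, None, 10))  # (0, 9)
print(normalize_page_range(8, 50, 10))       # (7, 9)

for bad in [(0, 3), (5, 3), (11, 12)]:
    try:
        normalize_page_range(*bad, 10)
    except ValueError as exc:
        print(f"{bad}: {exc}")
```

Because the returned end index is inclusive, a slice over the pages needs `end + 1` as its upper bound.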

All-in-MCP