extract_text

Extract text from PDF pages by specifying a file path and optional page range to retrieve content for analysis or processing.

Instructions

Extract text from PDF pages

Args:
    pdf_path: Path to the PDF file
    start_page: Page number to start extraction (0-indexed). If None, starts from first page.
    end_page: Page number to end extraction (0-indexed, inclusive). If None, ends at start_page if specified, otherwise extracts all pages.
    
Returns:
    If extracting a single page: string containing the page text
    If extracting multiple pages: dictionary mapping page numbers to page text
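
Because the return type depends on how many pages were requested, callers may find it convenient to normalize both shapes into one. The following is a minimal sketch, not part of the tool itself: `as_page_dict` is a hypothetical helper, and the `int(...)` conversion assumes page-number keys may arrive as strings if the result has passed through JSON.

```python
from typing import Dict, Union


def as_page_dict(result: Union[str, Dict[int, str]], page: int = 0) -> Dict[int, str]:
    """Normalize the two documented return shapes into {page_number: text}.

    `page` is the page number to use when `result` is a single-page string.
    """
    if isinstance(result, str):
        # Single-page extraction returns the bare page text.
        return {page: result}
    # Multi-page extraction returns a mapping; coerce keys to int in case
    # they were serialized as strings along the way (an assumption).
    return {int(number): text for number, text in result.items()}


single = as_page_dict("Hello", page=3)
multi = as_page_dict({0: "first page", 1: "second page"})
```

With this in place, downstream code can always iterate over `result.items()` regardless of how many pages were extracted.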

Input Schema

Name         Required   Description   Default
pdf_path     Yes
start_page   No
end_page     No
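
As an illustration of how these parameters might be supplied, the sketch below builds an MCP `tools/call` JSON-RPC request for this tool. The helper name `build_call`, the file name `form.pdf`, and the request id are illustrative assumptions; only `pdf_path`, `start_page`, and `end_page` come from the schema above.

```python
import json
from typing import Any, Dict, Optional


def build_call(pdf_path: str,
               start_page: Optional[int] = None,
               end_page: Optional[int] = None,
               request_id: int = 1) -> Dict[str, Any]:
    """Build a JSON-RPC 2.0 'tools/call' request for the extract_text tool."""
    arguments: Dict[str, Any] = {"pdf_path": pdf_path}  # the only required field
    # Optional fields are omitted entirely when not set.
    if start_page is not None:
        arguments["start_page"] = start_page
    if end_page is not None:
        arguments["end_page"] = end_page
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "extract_text", "arguments": arguments},
    }


# Request pages 0 through 2 of a hypothetical file.
request = build_call("form.pdf", start_page=0, end_page=2)
print(json.dumps(request, indent=2))
```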

Implementation Reference

  • The handler function for the 'extract_text' MCP tool. It uses PyMuPDF (fitz) to open the PDF and extract text from the specified page range, returning a string for a single page or a dict mapping page numbers to page text for multiple pages. It is registered via the @mcp.tool() decorator.
    from typing import Dict, Optional, Union

    import fitz  # PyMuPDF

    @mcp.tool()
    def extract_text(pdf_path: str, start_page: Optional[int] = None, end_page: Optional[int] = None) -> Union[str, Dict[int, str]]:
        """
        Extract text from PDF pages
        
        Args:
            pdf_path: Path to the PDF file
            start_page: Page number to start extraction (0-indexed). If None, starts from first page.
            end_page: Page number to end extraction (0-indexed, inclusive). If None, ends at start_page if specified, otherwise extracts all pages.
            
        Returns:
            If extracting a single page: string containing the page text
            If extracting multiple pages: dictionary mapping page numbers to page text
        """
        try:
            doc = fitz.open(pdf_path)
            total_pages = len(doc)
            
            # Validate page parameters
            if start_page is not None and (start_page < 0 or start_page >= total_pages):
                raise ValueError(f"Start page {start_page} is out of range (0-{total_pages-1})")
                
            if end_page is not None and (end_page < 0 or end_page >= total_pages):
                raise ValueError(f"End page {end_page} is out of range (0-{total_pages-1})")
                
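            # Set defaults if parameters are None.
            # Resolve end_page first, while start_page still reflects what the caller passed:
            # a lone start_page means that single page; neither means the whole document.
            if end_page is None:
                end_page = start_page if start_page is not None else total_pages - 1

            if start_page is None:
                start_page = 0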
                    
            # Ensure start_page <= end_page
            if start_page > end_page:
                start_page, end_page = end_page, start_page
            
            # Extract text
            if start_page == end_page:
                # Single page extraction
                page = doc[start_page]
                text = page.get_text()
                doc.close()
                return text
            else:
                # Multiple page extraction
                result = {}
                for page_num in range(start_page, end_page + 1):
                    page = doc[page_num]
                    result[page_num] = page.get_text()
                
                doc.close()
                return result
        except Exception as e:
            # Chain the original exception so its traceback is preserved
            raise Exception(f"Error extracting text: {e}") from e
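
The page-range rules described in the docstring can be sketched as a standalone function that needs no PDF library, which makes them easy to check in isolation. This is an illustrative restatement of the documented behavior, not code from the server; the name `resolve_page_range` is hypothetical.

```python
from typing import Optional, Tuple


def resolve_page_range(total_pages: int,
                       start_page: Optional[int] = None,
                       end_page: Optional[int] = None) -> Tuple[int, int]:
    """Return the inclusive, 0-indexed (start, end) range to extract."""
    # Reject explicit page numbers that fall outside the document.
    for label, value in (("Start", start_page), ("End", end_page)):
        if value is not None and not 0 <= value < total_pages:
            raise ValueError(
                f"{label} page {value} is out of range (0-{total_pages - 1})"
            )

    # Resolve end_page before start_page, so "was start_page given?" can
    # still be answered: a lone start_page selects that single page, and
    # omitting both selects every page.
    if end_page is None:
        end_page = start_page if start_page is not None else total_pages - 1
    if start_page is None:
        start_page = 0

    # Accept reversed ranges by swapping the bounds.
    if start_page > end_page:
        start_page, end_page = end_page, start_page
    return start_page, end_page
```

For a 10-page document, calling it with no page arguments yields `(0, 9)`, with only `start_page=3` yields `(3, 3)`, and with `start_page=5, end_page=2` yields `(2, 5)`.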

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/Wildebeest/mcp_pdf_forms'
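
The same request can be made from Python using only the standard library. This is a rough sketch: the helper names `server_url` and `fetch_server` are hypothetical, and it assumes the endpoint responds with a JSON body, which the page does not state explicitly.

```python
import json
import urllib.request
from typing import Any

API_BASE = "https://glama.ai/api/mcp/v1/servers"


def server_url(owner: str, name: str) -> str:
    """Build the directory URL for one MCP server entry."""
    return f"{API_BASE}/{owner}/{name}"


def fetch_server(owner: str, name: str, timeout: float = 10.0) -> Any:
    """GET the server entry and decode it (assumes a JSON response)."""
    request = urllib.request.Request(
        server_url(owner, name),
        method="GET",
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_server("Wildebeest", "mcp_pdf_forms")` would issue the same GET request as the curl command above.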

If you have feedback or need assistance with the MCP directory API, please join our Discord server.