NCCN Guidelines MCP Server

by gscfwid

extract_content

Extract text from specific pages of PDF files containing NCCN clinical guidelines to access precise content without searching entire documents.

Instructions

Extract content from specific pages of a PDF file.

Args:
    pdf_path: Path to the PDF file (relative to the downloads directory or absolute path)
    pages: Comma-separated page numbers or ranges to extract (e.g., "1,3,5-7").
           If not specified, all pages are extracted. Negative indexing is supported (-1 for the last page).

Returns:
    Extracted text content from the specified pages
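
The listing does not show how the `pages` string is parsed, so the following is only a minimal sketch of one way to interpret a spec such as "1,3,5-7". It assumes 1-based page numbers on input and zero-based indices on output (which matches the `p+1` in the server's log line), and it assumes that out-of-range pages and duplicates are silently dropped. The function body and both regular expressions are illustrative, not the project's actual `parse_pages`.

```python
import re
from typing import List, Optional

# A single page, optionally negative ("3", "-1"), or an inclusive range ("5-7").
_SINGLE = re.compile(r"-?\d+")
_RANGE = re.compile(r"(\d+)-(\d+)")


def parse_pages(pages: Optional[str], total_pages: int) -> List[int]:
    """Convert a spec such as "1,3,5-7" into zero-based page indices."""
    if not pages:
        # No spec given: select every page.
        return list(range(total_pages))

    selected: List[int] = []
    for token in pages.split(","):
        token = token.strip()
        if not token:
            continue
        if _SINGLE.fullmatch(token):
            number = int(token)
            # Negative numbers count from the end: -1 is the last page.
            index = total_pages + number if number < 0 else number - 1
            candidates = [index]
        elif (match := _RANGE.fullmatch(token)):
            start, end = int(match.group(1)), int(match.group(2))
            candidates = list(range(start - 1, end))
        else:
            raise ValueError(f"Invalid page specification: {token!r}")
        for index in candidates:
            # Assumption: drop out-of-range pages and duplicates silently.
            if 0 <= index < total_pages and index not in selected:
                selected.append(index)
    return selected
```

Under these assumptions, `parse_pages("1,3,5-7", 10)` selects the first, third, and fifth through seventh pages, and `parse_pages("-1", 10)` selects only the last one.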

Input Schema

Name      Required  Description  Default
pdf_path  Yes
pages     No
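
As an illustration of how these two parameters reach the tool, the sketch below builds the JSON-RPC request an MCP client would send, assuming the standard MCP `tools/call` request shape. The helper name `build_tool_call` and the file name `example.pdf` are hypothetical and only serve the example.

```python
import json
from typing import Any, Dict, Optional


def build_tool_call(pdf_path: str, pages: Optional[str] = None,
                    request_id: int = 1) -> Dict[str, Any]:
    """Build a tools/call request for extract_content (hypothetical helper)."""
    arguments: Dict[str, Any] = {"pdf_path": pdf_path}  # required parameter
    if pages is not None:
        arguments["pages"] = pages  # optional; omitted means all pages
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "extract_content", "arguments": arguments},
    }


# "example.pdf" is a placeholder file name, not one shipped with the server.
print(json.dumps(build_tool_call("example.pdf", "1,3,5-7"), indent=2))
```

Leaving `pages` out of `arguments` entirely, rather than sending an empty string, mirrors the schema above, where only `pdf_path` is required.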

Implementation Reference

  • MCP tool handler for 'extract_content'. This is the main async function registered via @mcp.tool() that handles the tool execution, resolves the PDF path, delegates to PDFReader, and returns the extracted content.
    @mcp.tool()
    async def extract_content(pdf_path: str, pages: Optional[str] = None) -> str:
        """
        Extract content from specific pages of a PDF file.
        
        Args:
            pdf_path: Path to the PDF file (relative to the downloads directory or absolute path)
            pages: Comma-separated page numbers to extract (e.g., "1,3,5-7"). 
                   If not specified, extracts all pages. Supports negative indexing (-1 for last page).
        
        Returns:
            Extracted text content from the specified pages
        """
        try:
            # Resolve PDF path
            if not os.path.isabs(pdf_path):
                # Try relative to downloads directory first
                download_path = current_dir / DOWNLOAD_DIR / pdf_path
                if download_path.exists():
                    pdf_path = str(download_path)
                else:
                    # Try relative to current directory
                    current_path = current_dir / pdf_path
                    if current_path.exists():
                        pdf_path = str(current_path)
                    else:
                        logger.error(f"PDF file not found: {pdf_path}")
                        return f"PDF file not found: {pdf_path}"
            
            # Extract content using PDFReader
            content = pdf_reader.extract_content(pdf_path, pages)
            
            if not content.strip():
                logger.warning(f"No content extracted from {pdf_path} (pages: {pages or 'all'})")
                return f"No content extracted from {pdf_path} (pages: {pages or 'all'})"
            
            logger.info(f"Successfully extracted content from {pdf_path} (pages: {pages or 'all'})")
            return content
        
        except Exception as e:
            logger.error(f"Error extracting content from PDF: {str(e)}")
            return f"Error extracting content from PDF: {str(e)}"
  • Core helper method of the PDFReader class that does the actual PDF parsing: it extracts text (preserving layout) together with internal links, and formats the content of the specified pages.
    def extract_content(self, pdf_path: str, pages: Optional[str] = None) -> str:
        """Main method for extracting PDF content with internal links"""
        if not pdf_path:
            raise ValueError("PDF path cannot be empty")
    
        try:
            logger.info(f"Starting PDF content extraction from: {pdf_path}")
            
            # Open PDF with pypdf
            self.reader = PdfReader(pdf_path)
            total_pages = len(self.reader.pages)
            
            # Build xref to page mapping
            self.build_xref_to_page_mapping(self.reader)
            
            # Build named destinations mapping
            self.build_named_destinations_mapping(self.reader)
            
            selected_pages = self.parse_pages(pages, total_pages)
            
            logger.info(f"PDF has {total_pages} pages, extracting pages: {[p+1 for p in selected_pages]}")
            
            extracted_contents = []
            
            for page_num in selected_pages:
                if page_num < len(self.reader.pages):
                    page = self.reader.pages[page_num]
                    content = self.extract_page_content(page, page_num)
                    formatted_content = self.format_page_content(content)
                    extracted_contents.append(formatted_content)
                    logger.debug(f"Extracted content from page {page_num + 1}")
            
            logger.info(f"Successfully extracted content from {len(extracted_contents)} pages")
            return "\n\n".join(extracted_contents)
                
        except Exception as e:
            logger.error(f"Failed to extract PDF content: {str(e)}")
            # Chain the original exception so its traceback is preserved
            raise ValueError(f"Failed to extract PDF content: {str(e)}") from e
  • server.py:59-59 (registration)
    Initialization of the FastMCP server instance where tools are registered via decorators.
    mcp = FastMCP("nccn-guidelines")
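
The path-resolution order used by the tool handler (downloads directory first, then the current directory, with absolute paths passed through) can be isolated as a small pure function, which makes it easier to test. This is a sketch under stated assumptions: `resolve_pdf_path`, `base_dir`, and the default `download_dir="downloads"` are hypothetical names and values, standing in for the server's `current_dir` and `DOWNLOAD_DIR`.

```python
from pathlib import Path
from typing import Optional


def resolve_pdf_path(pdf_path: str, base_dir: Path,
                     download_dir: str = "downloads") -> Optional[Path]:
    """Resolve pdf_path the way the handler does; None means "not found"."""
    candidate = Path(pdf_path)
    if candidate.is_absolute():
        # Absolute paths are passed through without an existence check,
        # mirroring the handler; the reader reports a missing file later.
        return candidate
    # Try the downloads directory first, then the base directory itself.
    for root in (base_dir / download_dir, base_dir):
        resolved = root / candidate
        if resolved.exists():
            return resolved
    return None
```

Returning `None` instead of an error string keeps the lookup separate from the user-facing message, so a caller can still produce "PDF file not found" exactly as the handler does.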

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/gscfwid/NCCN_guidelines_MCP'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.