NCCN Guidelines MCP Server

by gscfwid

extract_content

Extract text from specific pages of PDF files containing NCCN clinical guidelines to access precise content without searching entire documents.

Instructions

Extract content from specific pages of a PDF file.

Args:
    pdf_path: Path to the PDF file (relative to the downloads directory or absolute path)
    pages: Comma-separated page numbers to extract (e.g., "1,3,5-7"). 
           If not specified, extracts all pages. Supports negative indexing (-1 for last page).

Returns:
    Extracted text content from the specified pages
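
The `pages` format described above can be illustrated with a small parser. This is only a sketch: the server's own `parse_pages` method is not shown on this page, so the function name `parse_page_spec`, the error handling, and the treatment of negative numbers inside ranges are assumptions. The one detail taken from the source is that pages are handled internally as zero-based indices.

```python
from typing import List, Optional


def parse_page_spec(spec: Optional[str], total_pages: int) -> List[int]:
    """Turn a spec such as "1,3,5-7" or "-1" into zero-based page indices.

    Hypothetical helper, not the server's actual implementation.
    """
    if spec is None or not spec.strip():
        return list(range(total_pages))  # no spec: every page

    def resolve(token: str) -> int:
        number = int(token)
        if number < 0:
            number = total_pages + number + 1  # -1 is the last page
        if not 1 <= number <= total_pages:
            raise ValueError(f"page {token} is outside 1..{total_pages}")
        return number - 1  # convert to a zero-based index

    indices: List[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        # Look for a range dash after the first character so that a
        # leading minus sign is still read as a negative page number.
        dash = part.find("-", 1)
        if dash == -1:
            indices.append(resolve(part))
        else:
            start, end = resolve(part[:dash]), resolve(part[dash + 1:])
            if start > end:
                raise ValueError(f"range {part!r} runs backwards")
            indices.extend(range(start, end + 1))
    return indices
```

Under these assumptions, `parse_page_spec("1,3,5-7", 10)` selects the first, third, and fifth through seventh pages, and `parse_page_spec("-1", 10)` selects only the last page.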

Input Schema

Name      Required  Description  Default
pdf_path  Yes
pages     No

Output Schema

Name      Required  Description  Default
result    Yes
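
To show how the two input fields fit together, here is a minimal sketch of the JSON-RPC message an MCP client would send to invoke this tool. The `tools/call` method and the `name`/`arguments` layout come from the MCP protocol; the helper name `build_tool_call` and the file name `guideline.pdf` are made up for illustration.

```python
import json
from typing import Optional


def build_tool_call(pdf_path: str, pages: Optional[str] = None, request_id: int = 1) -> dict:
    """Build a JSON-RPC 2.0 request that calls the extract_content tool."""
    arguments = {"pdf_path": pdf_path}  # required field
    if pages is not None:
        arguments["pages"] = pages  # optional; omit it to extract every page
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "extract_content", "arguments": arguments},
    }


# "guideline.pdf" is a placeholder file name, not one shipped with the server.
print(json.dumps(build_tool_call("guideline.pdf", "1,3,5-7"), indent=2))
```

In practice an MCP client library builds and sends this message for you; the sketch only makes the relationship between the schema and the request visible.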

Implementation Reference

  • MCP tool handler for 'extract_content'. This is the main async function registered via @mcp.tool() that handles the tool execution, resolves the PDF path, delegates to PDFReader, and returns the extracted content.
    @mcp.tool()
    async def extract_content(pdf_path: str, pages: Optional[str] = None) -> str:
        """
        Extract content from specific pages of a PDF file.
        
        Args:
            pdf_path: Path to the PDF file (relative to the downloads directory or absolute path)
            pages: Comma-separated page numbers to extract (e.g., "1,3,5-7"). 
                   If not specified, extracts all pages. Supports negative indexing (-1 for last page).
        
        Returns:
            Extracted text content from the specified pages
        """
        try:
            # Resolve PDF path
            if not os.path.isabs(pdf_path):
                # Try relative to downloads directory first
                download_path = current_dir / DOWNLOAD_DIR / pdf_path
                if download_path.exists():
                    pdf_path = str(download_path)
                else:
                    # Try relative to current directory
                    current_path = current_dir / pdf_path
                    if current_path.exists():
                        pdf_path = str(current_path)
                    else:
                        logger.error(f"PDF file not found: {pdf_path}")
                        return f"PDF file not found: {pdf_path}"
            
            # Extract content using PDFReader
            content = pdf_reader.extract_content(pdf_path, pages)
            
            if not content.strip():
                logger.warning(f"No content extracted from {pdf_path} (pages: {pages or 'all'})")
                return f"No content extracted from {pdf_path} (pages: {pages or 'all'})"
            
            logger.info(f"Successfully extracted content from {pdf_path} (pages: {pages or 'all'})")
            return content
        
        except Exception as e:
            logger.error(f"Error extracting content from PDF: {str(e)}")
            return f"Error extracting content from PDF: {str(e)}"
  • Core helper method in the PDFReader class that parses the PDF, extracts text (preserving layout) together with internal links, and formats the content of the specified pages.
    def extract_content(self, pdf_path: str, pages: Optional[str] = None) -> str:
        """Main method for extracting PDF content with internal links"""
        if not pdf_path:
            raise ValueError("PDF path cannot be empty")
    
        try:
            logger.info(f"Starting PDF content extraction from: {pdf_path}")
            
            # Open PDF with pypdf
            self.reader = PdfReader(pdf_path)
            total_pages = len(self.reader.pages)
            
            # Build xref to page mapping
            self.build_xref_to_page_mapping(self.reader)
            
            # Build named destinations mapping
            self.build_named_destinations_mapping(self.reader)
            
            selected_pages = self.parse_pages(pages, total_pages)
            
            logger.info(f"PDF has {total_pages} pages, extracting pages: {[p+1 for p in selected_pages]}")
            
            extracted_contents = []
            
            for page_num in selected_pages:
                if page_num < len(self.reader.pages):
                    page = self.reader.pages[page_num]
                    content = self.extract_page_content(page, page_num)
                    formatted_content = self.format_page_content(content)
                    extracted_contents.append(formatted_content)
                    logger.debug(f"Extracted content from page {page_num + 1}")
            
            logger.info(f"Successfully extracted content from {len(extracted_contents)} pages")
            return "\n\n".join(extracted_contents)
                
        except Exception as e:
            logger.error(f"Failed to extract PDF content: {str(e)}")
            raise ValueError(f"Failed to extract PDF content: {str(e)}") from e
  • server.py:59-59 (registration)
    Initialization of the FastMCP server instance where tools are registered via decorators.
    mcp = FastMCP("nccn-guidelines")
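
The path lookup performed by the tool handler above (downloads directory first, then the server's own directory) can be isolated into a small function. This is a sketch under stated assumptions, not the server's code: `resolve_pdf_path` and `base_dir` are hypothetical names, and `"downloads"` is a guess at the value of `DOWNLOAD_DIR`, which is not shown on this page.

```python
from pathlib import Path
from typing import Optional


def resolve_pdf_path(pdf_path: str, base_dir: Path, download_dir: str = "downloads") -> Optional[Path]:
    """Return the first existing location for pdf_path, or None if it is not found."""
    candidate = Path(pdf_path)
    if candidate.is_absolute():
        # As in the handler, absolute paths are passed through without an existence check.
        return candidate
    # Try the downloads directory first, then the base directory itself.
    for root in (base_dir / download_dir, base_dir):
        resolved = root / candidate
        if resolved.exists():
            return resolved
    return None
```

Returning `None` instead of an error string leaves the choice of message to the caller, which is one way to keep the lookup testable on its own.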

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/gscfwid/NCCN_guidelines_MCP'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.