extract_content

Instructions

Extract content from specific pages of a PDF file.

Args:
    pdf_path: Path to the PDF file (relative to the downloads directory, or an absolute path).
    pages: Comma-separated page numbers to extract (e.g., "1,3,5-7"). If not specified, all pages are extracted. Supports negative indexing (-1 for the last page).

Returns:
    Extracted text content from the specified pages.

Input Schema


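Name	Required	Description	Default
`pdf_path`	Yes	Path to the PDF file, either absolute or relative to the downloads directory	
`pages`	No	Comma-separated page numbers to extract (e.g., "1,3,5-7"); supports negative indexing (-1 for the last page)	None (all pages)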
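The `pages` format can be illustrated with a small parser. This is only a sketch: `parse_page_spec` is a hypothetical name, and the server's own `parse_pages` method is not shown on this page, so details such as rejecting page 0, dropping out-of-range pages, and returning sorted, de-duplicated indices are assumptions.

```python
from typing import List, Optional


def parse_page_spec(pages: Optional[str], total_pages: int) -> List[int]:
    """Turn a spec such as "1,3,5-7" or "-1" into sorted zero-based indices.

    Hypothetical helper; the server's real parse_pages may behave differently.
    """
    if not pages or not pages.strip():
        return list(range(total_pages))  # no spec means every page

    def to_index(token: str) -> int:
        number = int(token)
        if number == 0:
            raise ValueError("page numbers are 1-based; 0 is not valid")
        # Negative numbers count back from the end: -1 is the last page.
        return number - 1 if number > 0 else total_pages + number

    selected = set()
    for part in pages.split(","):
        part = part.strip()
        if not part:
            continue
        # A dash after the first character marks a range such as "5-7";
        # a leading dash is a negative page number such as "-1".
        dash = part.find("-", 1)
        if dash == -1:
            first = last = to_index(part)
        else:
            first, last = to_index(part[:dash]), to_index(part[dash + 1:])
        for index in range(min(first, last), max(first, last) + 1):
            if 0 <= index < total_pages:  # assumption: skip out-of-range pages
                selected.add(index)
    return sorted(selected)
```

Under these assumptions, `"1,3,5-7"` on a ten-page file selects the first, third, and fifth through seventh pages, and `"-1"` selects only the last one.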

Implementation Reference

server.py:224-265 (handler)
MCP tool handler for 'extract_content'. This is the main async function registered via @mcp.tool() that handles the tool execution, resolves the PDF path, delegates to PDFReader, and returns the extracted content.
    @mcp.tool()
    async def extract_content(pdf_path: str, pages: Optional[str] = None) -> str:
        """
        Extract content from specific pages of a PDF file.

        Args:
            pdf_path: Path to the PDF file (relative to the downloads directory or absolute path)
            pages: Comma-separated page numbers to extract (e.g., "1,3,5-7").
                   If not specified, extracts all pages. Supports negative indexing (-1 for last page).

        Returns:
            Extracted text content from the specified pages
        """
        try:
            # Resolve PDF path
            if not os.path.isabs(pdf_path):
                # Try relative to downloads directory first
                download_path = current_dir / DOWNLOAD_DIR / pdf_path
                if download_path.exists():
                    pdf_path = str(download_path)
                else:
                    # Try relative to current directory
                    current_path = current_dir / pdf_path
                    if current_path.exists():
                        pdf_path = str(current_path)
                    else:
                        logger.error(f"PDF file not found: {pdf_path}")
                        return f"PDF file not found: {pdf_path}"

            # Extract content using PDFReader
            content = pdf_reader.extract_content(pdf_path, pages)

            if not content.strip():
                logger.warning(f"No content extracted from {pdf_path} (pages: {pages or 'all'})")
                return f"No content extracted from {pdf_path} (pages: {pages or 'all'})"

            logger.info(f"Successfully extracted content from {pdf_path} (pages: {pages or 'all'})")
            return content

        except Exception as e:
            logger.error(f"Error extracting content from PDF: {str(e)}")
            return f"Error extracting content from PDF: {str(e)}"
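The handler's path lookup order (downloads directory first, then the current directory) can be isolated into a small function for illustration. This is a sketch, not the server's code: `resolve_pdf_path` and `base_dir` are hypothetical names, and `"downloads"` is an assumed value for `DOWNLOAD_DIR`, which this page does not show.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def resolve_pdf_path(pdf_path: str, base_dir: Path,
                     download_dir: str = "downloads") -> Optional[str]:
    """Return a usable path string, or None when no candidate exists."""
    if os.path.isabs(pdf_path):
        # Absolute paths are passed through without an existence check,
        # matching the handler.
        return pdf_path
    # Same order as the handler: downloads directory, then the base directory.
    for candidate in (base_dir / download_dir / pdf_path, base_dir / pdf_path):
        if candidate.exists():
            return str(candidate)
    return None


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "downloads").mkdir()
    (base / "downloads" / "a.pdf").write_bytes(b"")
    (base / "b.pdf").write_bytes(b"")
    found_in_downloads = resolve_pdf_path("a.pdf", base)
    found_in_base = resolve_pdf_path("b.pdf", base)
    missing = resolve_pdf_path("c.pdf", base)
```

Returning `None` for a missing file leaves it to the caller to build the "PDF file not found" message, whereas the handler returns that message directly.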
read_pdf.py:252-290 (helper)
Core helper method of the PDFReader class that performs the actual PDF parsing: it extracts text with layout preservation and internal links, then formats the content of the specified pages.
    def extract_content(self, pdf_path: str, pages: Optional[str] = None) -> str:
        """Main method for extracting PDF content with internal links"""
        if not pdf_path:
            raise ValueError("PDF path cannot be empty")

        try:
            logger.info(f"Starting PDF content extraction from: {pdf_path}")

            # Open PDF with pypdf
            self.reader = PdfReader(pdf_path)
            total_pages = len(self.reader.pages)

            # Build xref to page mapping
            self.build_xref_to_page_mapping(self.reader)

            # Build named destinations mapping
            self.build_named_destinations_mapping(self.reader)

            selected_pages = self.parse_pages(pages, total_pages)

            logger.info(f"PDF has {total_pages} pages, extracting pages: {[p+1 for p in selected_pages]}")

            extracted_contents = []

            for page_num in selected_pages:
                if page_num < len(self.reader.pages):
                    page = self.reader.pages[page_num]
                    content = self.extract_page_content(page, page_num)
                    formatted_content = self.format_page_content(content)
                    extracted_contents.append(formatted_content)

                    logger.debug(f"Extracted content from page {page_num + 1}")

            logger.info(f"Successfully extracted content from {len(extracted_contents)} pages")
            return "\n\n".join(extracted_contents)

        except Exception as e:
            logger.error(f"Failed to extract PDF content: {str(e)}")
            raise ValueError(f"Failed to extract PDF content: {str(e)}")
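The helper's page loop and the handler's string-only return contract can be shown without pypdf by substituting a list of strings for the PDF pages. This is a simplified sketch: `join_selected_pages` and `safe_extract` are hypothetical names, `str.strip` stands in for `format_page_content`, and the messages are shortened versions of the ones in the source.

```python
from typing import Callable, List, Sequence


def join_selected_pages(page_texts: Sequence[str], selected: List[int],
                        formatter: Callable[[str], str] = str.strip) -> str:
    """Format each selected page and join pages with a blank line between them."""
    extracted = []
    for page_num in selected:
        if page_num < len(page_texts):  # same bounds guard as the helper
            extracted.append(formatter(page_texts[page_num]))
    return "\n\n".join(extracted)


def safe_extract(page_texts: Sequence[str], selected: List[int]) -> str:
    """Mirror the handler's contract: always return a string, never raise."""
    try:
        content = join_selected_pages(page_texts, selected)
    except Exception as e:
        return f"Error extracting content from PDF: {e}"
    if not content.strip():
        return "No content extracted"
    return content
```

The split matters for callers: the helper raises `ValueError` on failure, while the tool handler converts every failure into a readable message so that the MCP client always receives text.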
server.py:59-59 (registration)
Initialization of the FastMCP server instance where tools are registered via decorators.
mcp = FastMCP("nccn-guidelines")
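Decorator-based registration in general can be sketched with a tiny registry. This is not FastMCP's implementation: `ToolRegistry` is a hypothetical stand-in that only shows how calling a `tool()` method can return a decorator that records a function under its name.

```python
from typing import Callable, Dict


class ToolRegistry:
    """Hypothetical stand-in that mimics decorator-based tool registration."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.tools: Dict[str, Callable] = {}

    def tool(self) -> Callable[[Callable], Callable]:
        def decorator(func: Callable) -> Callable:
            self.tools[func.__name__] = func  # register under the function's name
            return func  # hand the function back unchanged
        return decorator


registry = ToolRegistry("nccn-guidelines")


@registry.tool()
def extract_content(pdf_path: str, pages: str = "") -> str:
    # Placeholder body; the real tool delegates to PDFReader.
    return f"would extract {pages or 'all'} from {pdf_path}"
```

After the module is imported, `registry.tools` holds the decorated function, so a server can look tools up by name without a separate registration call.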

NCCN Guidelines MCP Server
