extract_content
Extract text from specific pages of PDF files containing NCCN clinical guidelines to access precise content without searching entire documents.
Instructions
Extract content from specific pages of a PDF file.
Args:
- pdf_path: Path to the PDF file (relative to the downloads directory, or an absolute path).
- pages: Comma-separated page numbers to extract (e.g., "1,3,5-7"). If not specified, all pages are extracted. Negative indexing is supported (-1 for the last page).

Returns:
- Extracted text content from the specified pages.
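
The page specification format described above ("1,3,5-7", with -1 for the last page) could be parsed roughly as follows. This is a minimal sketch, not the server's actual `parse_pages` method, whose body is not shown on this page. The function name `parse_page_spec`, the rejection of out-of-range pages, and the removal of duplicates are all assumptions made for illustration. The only behavior taken from the source is that page numbers are 1-based on input and are converted to zero-based indices.

```python
import re
from typing import List, Optional

# A single page number, optionally negative (e.g. "3" or "-1").
_SINGLE = re.compile(r"^-?\d+$")
# An inclusive range of positive page numbers (e.g. "5-7").
_RANGE = re.compile(r"^(\d+)-(\d+)$")


def parse_page_spec(spec: Optional[str], total_pages: int) -> List[int]:
    """Return zero-based page indices for a spec such as "1,3,5-7".

    Hypothetical helper: it illustrates the documented format only.
    """
    # No spec means every page, matching the documented default.
    if spec is None or not spec.strip():
        return list(range(total_pages))

    indices: List[int] = []
    for token in spec.split(","):
        token = token.strip()
        if not token:
            continue

        range_match = _RANGE.match(token)
        if range_match:
            start, end = int(range_match.group(1)), int(range_match.group(2))
            if start > end:
                raise ValueError(f"Invalid page range: {token}")
            candidates = list(range(start, end + 1))
        elif _SINGLE.match(token):
            number = int(token)
            # Assumption: -1 is the last page, -2 the one before it, and so on.
            candidates = [total_pages + number + 1] if number < 0 else [number]
        else:
            raise ValueError(f"Invalid page token: {token}")

        for page in candidates:
            # Assumption: out-of-range pages are an error rather than skipped.
            if not 1 <= page <= total_pages:
                raise ValueError(f"Page {page} is outside 1..{total_pages}")
            if page - 1 not in indices:
                indices.append(page - 1)

    return indices
```

Under these assumptions, `parse_page_spec("1,3,5-7", 10)` yields the zero-based indices for pages 1, 3, 5, 6, and 7, and `parse_page_spec("-1", 10)` yields the index of page 10.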
Input Schema
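| Name | Required | Description | Default |
|---|---|---|---|
| pdf_path | Yes | Path to the PDF file (relative to the downloads directory, or an absolute path) | |
| pages | No | Comma-separated page numbers to extract (e.g., "1,3,5-7"). If omitted, all pages are extracted. Supports negative indexing (-1 for the last page). | None |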
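
For illustration, the schema above maps onto the `arguments` object of an MCP `tools/call` request. The sketch below builds such a request as a plain dictionary. The helper name `build_extract_content_call` and the file name `guideline.pdf` are hypothetical, and how the request is transported to the server is left out.

```python
import json
from typing import Any, Dict, Optional


def build_extract_content_call(
    pdf_path: str,
    pages: Optional[str] = None,
    request_id: int = 1,
) -> Dict[str, Any]:
    """Build a JSON-RPC 2.0 'tools/call' request for the extract_content tool."""
    arguments: Dict[str, Any] = {"pdf_path": pdf_path}
    # 'pages' is optional; leaving it out requests every page.
    if pages is not None:
        arguments["pages"] = pages

    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "extract_content", "arguments": arguments},
    }


if __name__ == "__main__":
    # 'guideline.pdf' is a placeholder file name, not one from the source.
    request = build_extract_content_call("guideline.pdf", pages="1,3,5-7")
    print(json.dumps(request, indent=2))
```

Omitting `pages` entirely, rather than sending an empty string, keeps the request aligned with the tool's default of `None`.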
Implementation Reference
- server.py:224-265 (handler): MCP tool handler for 'extract_content'. This is the main async function registered via @mcp.tool(). It handles the tool execution, resolves the PDF path, delegates to PDFReader, and returns the extracted content.

      @mcp.tool()
      async def extract_content(pdf_path: str, pages: Optional[str] = None) -> str:
          """
          Extract content from specific pages of a PDF file.

          Args:
              pdf_path: Path to the PDF file (relative to the downloads directory or absolute path)
              pages: Comma-separated page numbers to extract (e.g., "1,3,5-7").
                     If not specified, extracts all pages. Supports negative indexing (-1 for last page).

          Returns:
              Extracted text content from the specified pages
          """
          try:
              # Resolve PDF path
              if not os.path.isabs(pdf_path):
                  # Try relative to downloads directory first
                  download_path = current_dir / DOWNLOAD_DIR / pdf_path
                  if download_path.exists():
                      pdf_path = str(download_path)
                  else:
                      # Try relative to current directory
                      current_path = current_dir / pdf_path
                      if current_path.exists():
                          pdf_path = str(current_path)
                      else:
                          logger.error(f"PDF file not found: {pdf_path}")
                          return f"PDF file not found: {pdf_path}"

              # Extract content using PDFReader
              content = pdf_reader.extract_content(pdf_path, pages)

              if not content.strip():
                  logger.warning(f"No content extracted from {pdf_path} (pages: {pages or 'all'})")
                  return f"No content extracted from {pdf_path} (pages: {pages or 'all'})"

              logger.info(f"Successfully extracted content from {pdf_path} (pages: {pages or 'all'})")
              return content

          except Exception as e:
              logger.error(f"Error extracting content from PDF: {str(e)}")
              return f"Error extracting content from PDF: {str(e)}"
- read_pdf.py:252-290 (helper): Core helper method of the PDFReader class. It performs the actual PDF parsing, extracts text with layout preservation and internal links, and formats the content of the specified pages.

      def extract_content(self, pdf_path: str, pages: Optional[str] = None) -> str:
          """Main method for extracting PDF content with internal links"""
          if not pdf_path:
              raise ValueError("PDF path cannot be empty")

          try:
              logger.info(f"Starting PDF content extraction from: {pdf_path}")

              # Open PDF with pypdf
              self.reader = PdfReader(pdf_path)
              total_pages = len(self.reader.pages)

              # Build xref to page mapping
              self.build_xref_to_page_mapping(self.reader)

              # Build named destinations mapping
              self.build_named_destinations_mapping(self.reader)

              selected_pages = self.parse_pages(pages, total_pages)

              logger.info(f"PDF has {total_pages} pages, extracting pages: {[p+1 for p in selected_pages]}")

              extracted_contents = []

              for page_num in selected_pages:
                  if page_num < len(self.reader.pages):
                      page = self.reader.pages[page_num]
                      content = self.extract_page_content(page, page_num)
                      formatted_content = self.format_page_content(content)
                      extracted_contents.append(formatted_content)
                      logger.debug(f"Extracted content from page {page_num + 1}")

              logger.info(f"Successfully extracted content from {len(extracted_contents)} pages")
              return "\n\n".join(extracted_contents)

          except Exception as e:
              logger.error(f"Failed to extract PDF content: {str(e)}")
              raise ValueError(f"Failed to extract PDF content: {str(e)}")
- server.py:59-59 (registration): Initialization of the FastMCP server instance where tools are registered via decorators.

      mcp = FastMCP("nccn-guidelines")
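
The path resolution order used by the handler (absolute paths are passed through, relative paths are tried under the downloads directory first and then under the current directory) can be isolated into a small standalone function. The following is a sketch under stated assumptions: the name `resolve_pdf_path` and its parameters `base_dir` and `download_dir_name` are hypothetical stand-ins for the handler's `current_dir` and `DOWNLOAD_DIR`, whose values are not shown on this page.

```python
from pathlib import Path
from typing import Optional


def resolve_pdf_path(
    pdf_path: str,
    base_dir: Path,
    download_dir_name: str,
) -> Optional[Path]:
    """Resolve a PDF path in the same order as the handler's lookup.

    Returns None when a relative path matches no existing file.
    """
    candidate = Path(pdf_path)

    # Absolute paths are passed through without an existence check,
    # as in the handler; a missing file surfaces later, at read time.
    if candidate.is_absolute():
        return candidate

    # First choice: the downloads directory under base_dir.
    download_path = base_dir / download_dir_name / candidate
    if download_path.exists():
        return download_path

    # Second choice: directly under base_dir.
    current_path = base_dir / candidate
    if current_path.exists():
        return current_path

    return None
```

Returning `None` instead of an error string is a deliberate difference from the handler: it keeps the lookup separate from the message formatting, so the caller decides how to report a missing file.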