
PDF Extraction MCP Server

by xraywu

extract-pdf-contents

Extract text content from local PDF files with optional page selection, supporting both standard PDF reading and OCR capabilities.

Instructions

Extract contents from a local PDF file, given page numbers separated by commas. Negative page indices are supported.
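The page-selection rule above can be illustrated with a small parser. This is a minimal sketch, not the server's own `parse_pages` implementation, which is not shown on this page. It assumes that positive numbers are 1-based, that negative numbers count back from the last page, and that a missing value selects every page; none of these details are confirmed by the source.

```python
from typing import List, Optional


def parse_pages(pages: Optional[str], total_pages: int) -> List[int]:
    """Turn a spec such as "1,3,-1" into zero-based page indices.

    Assumptions (not confirmed by the source): positive numbers are
    1-based, negative numbers count from the end, and an empty or
    missing spec selects every page.
    """
    if not pages or not pages.strip():
        return list(range(total_pages))

    indices: List[int] = []
    for part in pages.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as "1,,2"
        number = int(part)  # raises ValueError on non-numeric input
        if number == 0:
            raise ValueError("page numbers start at 1; 0 is not valid")
        # Positive values are shifted to zero-based; negative values wrap from the end.
        index = number - 1 if number > 0 else total_pages + number
        if not 0 <= index < total_pages:
            raise ValueError(f"page {number} is out of range for {total_pages} pages")
        indices.append(index)
    return indices
```

Under these assumptions, `parse_pages("1,3,-1", 10)` returns `[0, 2, 9]`: the first page, the third page, and the last page of a ten-page file.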

Input Schema

Name      Required  Description  Default
pdf_path  Yes
pages     No
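To make the shape of a valid call concrete, the following sketch checks an argument payload against the two parameters listed here, both of which the schema in the Implementation Reference declares as strings. The function name `validate_arguments` and the example path are hypothetical and exist only for illustration; the server's actual handler performs a simpler presence check.

```python
from typing import Any, Dict, Optional, Tuple


def validate_arguments(arguments: Optional[Dict[str, Any]]) -> Tuple[str, Optional[str]]:
    """Check a tool-call payload: pdf_path is required, pages is optional."""
    if not arguments:
        raise ValueError("Missing arguments")

    pdf_path = arguments.get("pdf_path")
    if not isinstance(pdf_path, str) or not pdf_path:
        raise ValueError("pdf_path must be a non-empty string")

    pages = arguments.get("pages")
    if pages is not None and not isinstance(pages, str):
        raise ValueError("pages must be a string such as '1,2,-1'")

    return pdf_path, pages


# Hypothetical payload: the path is a placeholder, not a real file.
example = {"pdf_path": "/path/to/report.pdf", "pages": "1,2,-1"}
print(validate_arguments(example))
```

Note that `pages` is passed as a single string rather than a list of integers, so a client that holds page numbers as integers would need to join them with commas before calling the tool.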

Implementation Reference

  • Handler logic for 'extract-pdf-contents' tool: validates arguments, instantiates PDFExtractor, calls extract_content, and returns the extracted text as TextContent.
    if name == "extract-pdf-contents":
        if not arguments:
            raise ValueError("Missing arguments")
    
        pdf_path = arguments.get("pdf_path")
        pages = arguments.get("pages")
    
        if not pdf_path:
            raise ValueError("Missing file path")
    
    
        extractor = PDFExtractor()
        extracted_text = extractor.extract_content(pdf_path, pages)
        return [
            types.TextContent(
                type="text",
                text=extracted_text,
            )
        ]
  • Input schema for 'extract-pdf-contents': requires 'pdf_path' string, optional 'pages' string for comma-separated page numbers.
    inputSchema={
        "type": "object",
        "properties": {
            "pdf_path": {"type": "string"},
            "pages": {"type": "string"},
        },
        "required": ["pdf_path"],
    },
  • Registers the 'extract-pdf-contents' tool in the MCP server's list_tools handler, including name, description, and input schema.
    @server.list_tools()
    async def handle_list_tools() -> list[types.Tool]:
        """
        Tools for PDF contents extraction
        """
        return [
            types.Tool(
                name="extract-pdf-contents",
                description="Extract contents from a local PDF file, given page numbers separated in comma. Negative page index number supported.",
                inputSchema={
                    "type": "object",
                    "properties": {
                        "pdf_path": {"type": "string"},
                        "pages": {"type": "string"},
                    },
                    "required": ["pdf_path"],
                },
            )
        ]
  • Core helper method in PDFExtractor class that performs the actual PDF content extraction, supporting both text-based and scanned (OCR) PDFs, with page parsing.
    def extract_content(self, pdf_path: str, pages: Optional[str]) -> List[str]:
        """Main method for extracting PDF content."""
        if not pdf_path:
            raise ValueError("PDF path must not be empty")

        try:
            # Check whether the PDF is a scanned document
            is_scanned = self.is_scanned_pdf(pdf_path)

            # Parse the page selection
            reader = PdfReader(pdf_path)
            total_pages = len(reader.pages)
            selected_pages = self.parse_pages(pages, total_pages)

            # Choose the extraction method based on the PDF type
            if is_scanned:
                text = self.extract_text_from_scanned(pdf_path, selected_pages)
            else:
                text = self.extract_text_from_normal(pdf_path, selected_pages)

            return text
        except Exception as e:
            # Chain the original exception so the root cause stays in the traceback
            raise ValueError(f"Failed to extract PDF content: {str(e)}") from e

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/xraywu/mcp-pdf-extraction-server'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.