# MCP PDF Flow - AI-Powered PDF Extraction & Conversion for LLMs
[中文](README.md) | **English**
> **The Ultimate MCP Server for RAG & LLM Document Processing** | PDF to Markdown | Format Converter | Intelligent Extraction
**MCP PDF Flow** is a powerful **Model Context Protocol (MCP)** server designed to bridge the gap between Large Language Models (LLMs, like Claude) and your local documents.
It serves as a comprehensive **ETL (Extract, Transform, Load)** tool for your personal knowledge base, optimizing PDFs into **clean, structured Markdown** perfectly suited for RAG pipelines and AI context windows.
## Key Highlights
* **LLM-Ready Output**: Generates clean, structured Markdown optimized for AI reading and RAG indexing.
* **RAG & Search**: Built-in fuzzy search and metadata extraction to quickly locate relevant document sections.
* **Universal Converter**: Seamless conversion between PDF, Word (.docx), and Markdown.
## Features
* **Intelligent Text Extraction**:
    * Precisely extracts text from PDF pages.
    * **Smart Markdown Recognition**: Automatically identifies headers, lists, and paragraph structures, merges text that was split across lines, and outputs clean Markdown.
* **Image Extraction**:
    * Extracts all images within pages.
    * **Auto-save & Reference**: By default, saves images locally to `extracted_images/` and references them by path in the Markdown, so large Base64 payloads do not overflow the context window.
    * **Base64 Preview**: Optionally returns Base64-encoded images for direct preview in MCP clients (suitable for small images).
* **Batch Processing**: Extracts all PDF files in a specified directory, generates Markdown and images automatically, and keeps the output directory structure clean.
* **Enhanced Table Extraction**:
    * **High-Precision Recognition**: Automatically handles multi-line headers, misaligned columns, and complex layouts.
    * **Smart Cleaning**: Automatically removes empty rows/headers and filters pseudo-tables (e.g., text lists).
    * **Batch Export**: Supports one-click extraction of tables from all PDFs in a directory to Markdown files.
* **Fuzzy Search**:
    * **Smart Lookup**: Locates PDF file paths from filename keywords, using fuzzy matching that tolerates spelling errors.
    * **Recursive Search**: Searches nested directories recursively by default.
* **Format Conversion**:
    * **Markdown to Word**: Converts generated Markdown reports into formatted Word (.docx) documents with one click.
    * **Word to PDF**: Converts Word documents to PDF files.
        * *Auto-adapt*: Prefers Microsoft Word; falls back to WPS Office if Word is not installed.
* **Flexible Extraction**: Extracts all pages, specific page ranges (e.g., `1`, `1-5`), or only the pages that match a keyword.
* **Metadata Retrieval**: Retrieves the PDF title, author, page count, and **Table of Contents (TOC)**.
## Requirements
* **OS**: Windows (Recommended for Word/WPS conversion support) / macOS / Linux
* **Python**: >= 3.10
## Installation & Usage
This project uses `uv` for package management.
### 1. Clone the Project
```bash
git clone https://github.com/Dublin1231/PDF_MCP_Flow.git
cd PDF_MCP_Flow
```
### 2. Install Dependencies
```bash
uv sync
```
### 3. Run the Server
This is an MCP server designed to be **started automatically** by clients such as Claude Desktop.
You **do not need** to run a startup command in the terminal. Proceed to the [Claude Desktop Configuration](#claude-desktop-configuration) section below. Once configured, Claude Desktop runs this service in the background.
## Claude Desktop Configuration
To add this tool to Claude Desktop, edit the configuration file:
* **Windows**: `%AppData%\Claude\claude_desktop_config.json`
* **macOS**: `~/Library/Application Support/Claude/claude_desktop_config.json`
Add the following content, replacing the example path with the actual path to your local clone. The `//` comments are explanatory only; remove them if your client expects strict JSON:
```jsonc
{
  "mcpServers": {
    "simple-pdf": {
      "command": "uv",
      "args": [
        "--directory",
        "D:/path/to/your/mcp-pdf-flow", // Update this to your actual local path
        "run",
        "simple-pdf"
      ],
      "env": {
        "PYTHONIOENCODING": "utf-8",
        // (Optional) Extra PDF search paths. Use semicolon (;) on Windows, colon (:) on macOS/Linux.
        // Remove this entry (and the trailing comma above) if you do not need it.
        "PDF_SEARCH_PATHS": "D:\\example;E:\\example"
      }
    }
  }
}
```
*Note:*
* *On Windows, use `/` or `\\` as the path separator inside JSON strings.*
* *`PDF_SEARCH_PATHS` is optional and specifies extra PDF search paths. Separate multiple paths with `;` (Windows) or `:` (macOS/Linux).*
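
The separator rule above matches Python's `os.pathsep`, which is `;` on Windows and `:` on macOS/Linux. The following is a minimal sketch of how such a variable could be split into directories. `parse_search_paths` is a hypothetical helper written for illustration and is not taken from the server's code, so the actual parsing may differ.

```python
import os


def parse_search_paths(raw: str, sep: str = os.pathsep) -> list[str]:
    """Split a PDF_SEARCH_PATHS-style string into a list of directories.

    Hypothetical helper for illustration; the server's real parsing may differ.
    """
    # Strip whitespace around each entry and drop empty entries.
    return [part.strip() for part in raw.split(sep) if part.strip()]


# The separator is passed explicitly here so the example behaves the same on any OS.
print(parse_search_paths("D:\\example;E:\\example", sep=";"))
print(parse_search_paths("/data/pdfs:/home/me/docs", sep=":"))
```

Leaving `sep` at its default makes the helper follow the platform convention automatically.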
## How to Use in Chat
**You don't need to manually fill in JSON arguments!** Just tell Claude what you want to do in the chat, and it will automatically call the tools.
### Scenario Examples
#### 1. Read PDF
* **Specify an absolute path**:
    > **User**: "Read pages 1 to 5 of `D:\Documents\report.pdf` for me."
* **Read a file in the project directory**:
    > **User**: "Read `test.pdf` in the current directory."
    > *(Note: the file is looked up in the directory the server is running from)*
#### 2. Convert Markdown to Word
> **User**: "Save the extracted content as a Word document at `D:\Documents\output.docx`."
>
> **Claude**: (Automatically calls `convert_markdown_to_docx` tool)
#### 3. Word to PDF
> **User**: "Convert `D:\Documents\draft.docx` to PDF."
>
> **Claude**: (Automatically calls `convert_docx_to_pdf` tool)
#### 4. Batch Process PDFs
> **User**: "Batch process all PDFs in the `D:\Docs` directory and export to Markdown."
>
> **Claude**: (Automatically calls the `batch_extract_pdf_content` tool)
#### 5. Batch Process with Custom Output Directories
> **User**: "Batch process PDFs in `D:\Docs`, save Markdown to `D:\Output\MD`."
>
> **Claude**: (Calls `batch_extract_pdf_content` with `custom_output_dir="D:\\Output\\MD"`)
#### 6. Batch Process with Flat Output (No Directory Structure)
> **User**: "Extract all PDFs from subfolders in `D:\Docs` and put them all into `D:\All_MDs` without creating subfolders."
>
> **Claude**: (Calls `batch_extract_pdf_content` with `custom_output_dir="D:\\All_MDs"`, `preserve_structure=False`, `create_folder=False`)
#### 7. Fuzzy Find PDF
> **User**: "Help me find PDF files about 'springboot' under `D:\Study`."
>
> **Claude**: (Automatically calls `search_pdf_files` tool)
#### 8. Get PDF Table of Contents
> **User**: "Extract the outline of `D:\Books\Guide.pdf`."
>
> **Claude**: (Automatically calls `get_pdf_metadata` tool)
#### 9. Batch Extract Tables
> **User**: "Extract tables from all PDFs in `D:\Books` and save them to `D:\Tables`."
>
> **Claude**: (Automatically calls `batch_extract_tables` tool)
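
For readers curious about what happens behind these prompts: an MCP client invokes a tool by sending a JSON-RPC `tools/call` request. The sketch below builds the kind of request that scenario 6 could produce. `build_tool_call` is a hypothetical helper for illustration only; in practice the client builds this message for you, and the exact arguments it chooses may differ.

```python
import json


def build_tool_call(name: str, arguments: dict, request_id: int = 1) -> dict:
    """Build an MCP `tools/call` JSON-RPC request as a plain dict (illustrative)."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


# Arguments mirror scenario 6: flat output, no per-PDF folders.
request = build_tool_call(
    "batch_extract_pdf_content",
    {
        "directory": "D:\\Docs",
        "custom_output_dir": "D:\\All_MDs",
        "preserve_structure": False,
        "create_folder": False,
    },
)
print(json.dumps(request, indent=2))
```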
---
## Tool List
> **Tip: How to get an absolute path**
> * **Windows**: Hold `Shift`, right-click the file, and select "Copy as path". Remove the surrounding quotes when pasting.
>     * Example: `C:\Users\Name\Documents\report.pdf`
> * **macOS**: Select the file, then press `Option + Command + C` to copy the path.
>     * Example: `/Users/name/Documents/report.pdf`
### 1. `extract_pdf_content`
Core tool for extracting PDF content.
**Parameters:**
* `file_path` (Required): Absolute path of the PDF file.
    * *Windows Example*: `D:\Documents\paper.pdf`
    * *macOS Example*: `/Users/username/Documents/paper.pdf`
* `page_range` (Optional): Page range, default is `"all"`.
    * Examples: `"1"`, `"1-5"`, `"1,3,5"`, `"all"`.
* `keyword` (Optional): Keyword search. If provided, ignores page range and extracts only pages containing the keyword.
* `format` (Optional): Output format.
    * `"text"` (Default): Plain text extraction.
    * `"markdown"`: **Recommended**. Smartly identifies headers and paragraphs, suitable for LLM reading.
    * `"json"`: Returns structured JSON data, suitable for programmatic processing.
* `include_text` (Optional): Whether to extract text, default is `true`.
* `include_images` (Optional): Whether to extract images, default is `false`.
* `use_local_images_only` (Optional): Image processing mode, default is `true`.
    * `true` (Default): Saves images locally to the `extracted_images` directory and uses path references in Markdown. **Recommended for large files or PDFs with many images to prevent token overflow**.
    * `false`: Returns Base64 data for images, allowing direct preview but consuming significant tokens.
* `skip_table_detection` (Optional): **Speed Boost Mode** switch, default is `false`.
    * `false` (Default): Intelligently detects and extracts tables, converting them to Markdown table format.
    * `true`: **Skip table detection**. Suitable for scenarios requiring only plain text content. Speed can increase by 3-4x (approx. 400+ pages/sec).
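
The `page_range` formats listed above (`"1"`, `"1-5"`, `"1,3,5"`, `"all"`) can be expanded into concrete page numbers along the following lines. This is a sketch under stated assumptions: `parse_page_range` is a hypothetical name, pages are treated as 1-based, and out-of-range pages are silently dropped. The server's own behavior in those edge cases is not documented here and may differ.

```python
def parse_page_range(spec: str, page_count: int) -> list[int]:
    """Expand a page_range string into sorted, 1-based page numbers.

    Hypothetical helper; dropping out-of-range pages is an assumption.
    """
    spec = spec.strip().lower()
    if spec == "all":
        return list(range(1, page_count + 1))

    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # "1-5" style range, inclusive on both ends.
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            # Single page such as "3".
            start = end = int(part)
        if start > end:
            raise ValueError(f"invalid range: {part!r}")
        pages.update(p for p in range(start, end + 1) if 1 <= p <= page_count)
    return sorted(pages)


print(parse_page_range("1-5", page_count=10))   # [1, 2, 3, 4, 5]
print(parse_page_range("1,3,5", page_count=4))  # [1, 3]
```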
### 2. `batch_extract_pdf_content`
Batch extracts PDF files in a specified directory.
**Parameters:**
* `directory` (Required): Absolute path of the root directory to search.
* `pattern` (Optional): File matching pattern, default is `"**/*.pdf"` (supports recursive search).
* `custom_output_dir` (Optional): Output root directory. If omitted, defaults to the project root. Results are written to an `output` folder inside this directory, in one of the following subdirectories depending on the mode:
    * `output_standard_with_image`: With images, standard mode (includes tables).
    * `output_fast_with_image_no_table`: With images, fast mode (skips table detection).
    * `output_standard_no_image`: Standard mode + plain text (**No images**, includes tables).
    * `output_fast_no_image_and_table`: Fast mode + plain text (**No images, no tables**).
    * `output_only_image`: Image-extraction-only mode (no Markdown file is generated; images are saved in `output/output_only_image/extracted_images`).
* `format` (Optional): Output format, default is `"markdown"`.
* `include_text` (Optional): Whether to extract text, default is `true`.
* `include_images` (Optional): Whether to extract images, default is `false`.
* `use_local_images_only` (Optional): Image processing mode, default is `true`.
* `skip_table_detection` (Optional): **Speed Boost Mode** switch, default is `false`.
    * `false` (Default): Intelligently detects and extracts tables, converting them to Markdown table format.
    * `true`: **Skip table detection**. Suitable for scenarios requiring only plain text content. Speed can increase by 3-4x (approx. 400+ pages/sec).
* `create_folder` (Optional): Toggle for **creating a dedicated folder** for each PDF, default is `false`.
    * `true`: Creates a dedicated folder with the same name as the PDF for each file (e.g., `output/subdir/file/file.md`).
    * `false` (Default): Does not create a dedicated folder (e.g., `output/subdir/file.md`).
* `preserve_structure` (Optional): Toggle for **Directory Structure Preservation**, default is `true`.
    * `true` (Default): Preserves the directory hierarchy of source files (e.g., `output/SourceSubDir/file.md`).
    * `false`: **Flat Output**. Ignores source directory structure and places all results directly in the output root directory (e.g., `output/file.md`).
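
To make the interaction between `preserve_structure` and `create_folder` concrete, here is a small sketch that reproduces the example paths above. Both functions are hypothetical: the mapping from flags to mode folder names is inferred from the descriptions in this list, and the combination `preserve_structure=False` with `create_folder=True` is an assumption, not documented behavior. POSIX-style paths are used so the output is the same on every platform.

```python
from pathlib import PurePosixPath


def mode_folder(include_text: bool, include_images: bool, skip_table_detection: bool) -> str:
    """Pick the mode subfolder name (mapping inferred from the list above)."""
    if not include_text:
        return "output_only_image"
    if include_images:
        return "output_fast_with_image_no_table" if skip_table_detection else "output_standard_with_image"
    return "output_fast_no_image_and_table" if skip_table_detection else "output_standard_no_image"


def markdown_path(source_root, pdf_path, output_root, preserve_structure=True, create_folder=False):
    """Compute where one PDF's Markdown file would land under output_root."""
    pdf = PurePosixPath(pdf_path)
    target = PurePosixPath(output_root)
    if preserve_structure:
        # Mirror the PDF's folder relative to the searched directory.
        target = target / pdf.parent.relative_to(PurePosixPath(source_root))
    if create_folder:
        # One dedicated folder per PDF, named after the file.
        target = target / pdf.stem
    return target / f"{pdf.stem}.md"


print(markdown_path("/docs", "/docs/subdir/file.pdf", "output"))
print(markdown_path("/docs", "/docs/subdir/file.pdf", "output", create_folder=True))
print(markdown_path("/docs", "/docs/subdir/file.pdf", "output", preserve_structure=False))
```

The three calls print `output/subdir/file.md`, `output/subdir/file/file.md`, and `output/file.md`, matching the examples given for each toggle.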
### 3. `get_pdf_metadata`
Quickly retrieves PDF metadata and Table of Contents (TOC).
**Parameters:**
* `file_path` (Required): Absolute path of the PDF file.
### 4. `convert_markdown_to_docx`
Converts Markdown content to a Word document.
**Parameters:**
* `markdown_content` (Required): Markdown text content to convert.
* `output_path` (Required): Absolute path for the output .docx file.
    * *Note*: Enter the **new file path you want to save to**.
    * *Example*: `D:\Documents\report_output.docx`
### 5. `convert_docx_to_pdf`
Converts a Word document to a PDF file.
**Parameters:**
* `docx_path` (Required): Absolute path of the input .docx file.
* `pdf_path` (Optional): Absolute path for the output .pdf file.
    * If not provided, generates a PDF file with the same name in the same directory as the original .docx file.
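
The default for `pdf_path` described above amounts to swapping the file extension. A minimal sketch, assuming a hypothetical helper named `default_pdf_path` (the converter's real code is not shown in this README):

```python
from pathlib import PureWindowsPath


def default_pdf_path(docx_path: str) -> str:
    """Return the same path with a .pdf extension (illustrative helper)."""
    # PureWindowsPath parses Windows-style paths on any OS.
    return str(PureWindowsPath(docx_path).with_suffix(".pdf"))


print(default_pdf_path(r"D:\Documents\draft.docx"))
```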
### 6. `search_pdf_files`
Fuzzy search for PDF file paths by filename.
**Parameters:**
* `query` (Required): Filename keyword. Supports fuzzy matching (tolerates spelling errors) and Chinese filenames.
* `directory` (Optional): Search root directory. Defaults to the current working directory.
* `limit` (Optional): Maximum number of results to return, default is 10.
* `threshold` (Optional): Matching threshold (0.0-1.0), default is 0.45.
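
The `limit` and `threshold` parameters can be understood through a small ranking sketch. It uses the standard library's `difflib.SequenceMatcher` as a stand-in scorer; that choice, the substring shortcut, and the name `fuzzy_rank` are all assumptions for illustration, and the server may use a different matching algorithm with different scores.

```python
from difflib import SequenceMatcher


def fuzzy_rank(query: str, filenames: list[str], limit: int = 10, threshold: float = 0.45) -> list[str]:
    """Rank filenames by similarity to the query (difflib is an assumed scorer)."""
    q = query.lower()
    scored = []
    for name in filenames:
        stem = name.lower().removesuffix(".pdf")
        # Treat a plain substring hit as a perfect match; otherwise use a ratio in [0, 1].
        score = 1.0 if q in stem else SequenceMatcher(None, q, stem).ratio()
        if score >= threshold:
            scored.append((score, name))
    # Highest score first; ties keep their original order.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [name for _, name in scored[:limit]]


files = ["SpringBoot_Guide.pdf", "cooking.pdf", "sprngboot-notes.pdf"]
print(fuzzy_rank("springboot", files))
```

In this sketch the misspelled `sprngboot-notes.pdf` still clears the 0.45 threshold, while `cooking.pdf` does not. Raising `threshold` makes matching stricter; lowering it admits looser matches.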
### 7. `batch_extract_tables`
Batch extracts tables from all PDFs in a directory.
**Default Output Path:**
* `output/output_only_table`: Table extraction only mode (Tables saved in `output/output_only_table` directory).
**Parameters:**
* `directory` (Required): Absolute path of the root directory to search.
* `output_dir` (Optional): Specifies the output root directory. The final table Markdown files will be saved in the `output/output_only_table` folder within this directory. If omitted, defaults to the current working directory.
* `pattern` (Optional): File matching pattern, default is `"**/*.pdf"`.
**Enhanced Table Extraction Features:**
* **Smart Header Merging**: Automatically handles multi-line headers and split column names.
* **Misaligned Column Fix**: Intelligently identifies and merges header and content columns that are misaligned due to formatting issues.
* **Empty Row/Header Optimization**:
    * Automatically removes invalid empty rows.
    * Intelligently identifies key-value (KV) tables, preserving headers only when necessary to avoid creating incorrect empty headers for Chinese KV tables.
* **Pseudo-Table Filtering**:
    * Automatically identifies and filters text lists (e.g., Table of Contents, step instructions).
    * Automatically filters code blocks or long text paragraphs misidentified as tables.
* **Layout Optimization**: Intelligently handles newlines within cells, maintaining clear list structures while allowing normal long text to reflow naturally.
## Output Directory Structure
After running the tool, images will be saved in the following structure:
```
extracted_images/
└── PDF_Filename/
    ├── page_1_img_1.jpeg
    ├── page_1_img_2.png
    └── ...
```
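
The naming pattern in the tree above can be expressed as a short path builder. This is a sketch only: `image_path` and `markdown_image_ref` are hypothetical names, and the exact way the server writes image references into Markdown is an assumption.

```python
from pathlib import PurePosixPath


def image_path(pdf_path: str, page: int, index: int, ext: str, root: str = "extracted_images") -> PurePosixPath:
    """Build extracted_images/<PDF name>/page_<page>_img_<index>.<ext>."""
    stem = PurePosixPath(pdf_path).stem  # PDF filename without its extension
    return PurePosixPath(root) / stem / f"page_{page}_img_{index}.{ext}"


def markdown_image_ref(path: PurePosixPath, alt: str = "") -> str:
    """Format a Markdown image reference that points at a saved file."""
    return f"![{alt}]({path})"


saved = image_path("/docs/PDF_Filename.pdf", page=1, index=2, ext="png")
print(saved)
print(markdown_image_ref(saved, alt="page 1 image 2"))
```

Referencing images by path like this keeps the Markdown small, which is the motivation the README gives for the default `use_local_images_only=true` mode.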
## Development
* **Core Code**: `src/simple_pdf/server.py`
* **Conversion Logic**: `src/simple_pdf/convert.py`