pdfmux


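Universal PDF extraction orchestrator. Routes each page to the best backend, audits the output, and re-extracts failed pages. Includes 5 rule-based extractors plus a BYOK LLM fallback. One CLI. One API. Zero config.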

PDF ──> pdfmux router ──> best extractor per page ──> audit ──> re-extract failures ──> Markdown / JSON / chunks
            |
            ├─ PyMuPDF         (digital text, 0.01s/page)
            ├─ OpenDataLoader  (complex layouts, 0.05s/page)
            ├─ RapidOCR        (scanned pages, CPU-only)
            ├─ Docling          (tables, 97.9% TEDS)
            ├─ Surya            (heavy OCR fallback)
            └─ YOUR LLM        (Gemini / Claude / GPT-4o / Ollama — BYOK via 5-line YAML)

Installation

pip install pdfmux

That's it. Digital PDFs work out of the box. Add backends for harder documents:

pip install "pdfmux[ocr]"             # RapidOCR — scanned/image pages (~200MB, CPU-only)
pip install "pdfmux[tables]"          # Docling — table-heavy docs (~500MB)
pip install "pdfmux[opendataloader]"  # OpenDataLoader — complex layouts (Java 11+)
pip install "pdfmux[llm]"            # LLM fallback — Gemini, Claude, GPT-4o, Ollama
pip install "pdfmux[all]"            # everything

Requires Python 3.11+.

Quick start

CLI

# zero config — just works
pdfmux convert invoice.pdf
# invoice.pdf -> invoice.md (2 pages, 95% confidence, via pymupdf4llm)

# RAG-ready chunks with token limits
pdfmux convert report.pdf --chunk --max-tokens 500

# cost-aware extraction with budget cap
pdfmux convert report.pdf --mode economy --budget 0.50

# schema-guided structured extraction (5 built-in presets)
pdfmux convert invoice.pdf --schema invoice

# BYOK any LLM for hardest pages
pdfmux convert scan.pdf --llm-provider claude

# batch a directory
pdfmux convert ./docs/ -o ./output/

Python

import pdfmux

# text -> markdown
text = pdfmux.extract_text("report.pdf")

# structured data -> dict with tables, key-values, metadata
data = pdfmux.extract_json("report.pdf")

# RAG chunks -> list of dicts with token estimates
chunks = pdfmux.chunk("report.pdf", max_tokens=500)

Architecture

                           ┌─────────────────────────────┐
                           │     Segment Detector         │
                           │  text / tables / images /    │
                           │  formulas / headers per page │
                           └─────────────┬───────────────┘
                                         │
                    ┌────────────────────────────────────────┐
                    │            Router Engine                │
                    │                                        │
                    │   economy ── balanced ── premium        │
                    │   (minimize $)  (default)  (max quality)│
                    │   budget caps: --budget 0.50            │
                    └────────────────────┬───────────────────┘
                                         │
          ┌──────────┬──────────┬────────┴────────┬──────────┐
          │          │          │                  │          │
     PyMuPDF   OpenData    RapidOCR           Docling     LLM
     digital   Loader      scanned            tables    (BYOK)
     0.01s/pg  complex     CPU-only           97.9%    any provider
               layouts                        TEDS
          │          │          │                  │          │
          └──────────┴──────────┴────────┬────────┴──────────┘
                                         │
                    ┌────────────────────────────────────────┐
                    │           Quality Auditor               │
                    │                                        │
                    │   4-signal dynamic confidence scoring   │
                    │   per-page: good / bad / empty          │
                    │   if bad -> re-extract with next backend│
                    └────────────────────┬───────────────────┘
                                         │
                    ┌────────────────────────────────────────┐
                    │           Output Pipeline               │
                    │                                        │
                    │   heading injection (font-size analysis)│
                    │   table extraction + normalization      │
                    │   text cleanup + merge                  │
                    │   confidence score (honest, not inflated)│
                    └────────────────────────────────────────┘
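The routing step in the diagram can be pictured as a lookup from a page's detected content type to an ordered list of candidate backends. The following is a minimal sketch under stated assumptions, not pdfmux's actual router: the `PREFERENCES` table, the `route` function, the content-type labels, and the preference order are all hypothetical. Only the backend names and the rule that economy mode makes no LLM calls come from this document.

```python
# Hypothetical sketch of per-page routing; not pdfmux's real implementation.
# Backend names come from the diagram; the preference order is an assumption.
PREFERENCES = {
    "digital": ["pymupdf", "opendataloader", "llm"],
    "complex_layout": ["opendataloader", "docling", "llm"],
    "scanned": ["rapidocr", "surya", "llm"],
    "tables": ["docling", "opendataloader", "llm"],
}


def route(content_type: str, installed: set[str], mode: str = "balanced") -> list[str]:
    """Return candidate backends for a page, best first."""
    # Unknown content types fall back to the digital-text preferences.
    candidates = PREFERENCES.get(content_type, PREFERENCES["digital"])
    # Economy mode uses rule-based backends only, so the LLM is dropped.
    if mode == "economy":
        candidates = [b for b in candidates if b != "llm"]
    # Keep only the backends that are actually available.
    return [b for b in candidates if b in installed]


if __name__ == "__main__":
    installed = {"pymupdf", "rapidocr", "llm"}
    print(route("scanned", installed))             # ['rapidocr', 'llm']
    print(route("scanned", installed, "economy"))  # ['rapidocr']
```

Returning an ordered list rather than a single backend leaves room for the auditor to fall through to the next candidate when a page scores badly.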

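Key design decisions

A router, not an extractor. pdfmux does not compete with PyMuPDF or Docling. It picks the best tool for each page.
Agentic multi-pass. Extract, audit confidence, then re-extract failed pages with a stronger backend. Bad pages are retried automatically.
Segment-level detection. Before routing, every page is classified by content type (text, tables, images, formulas, headers).
4-signal confidence. Dynamic quality scoring based on character density, OCR noise ratio, table completeness, and heading structure. No hard-coded thresholds.
Document cache. Each PDF is opened once, not once per extractor, and is shared across the whole pipeline.
Data flywheel. Local telemetry tracks which extractors perform better on which document types. Routing improves with use.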
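The multi-pass idea (extract, audit, re-extract with a stronger backend) can be sketched as a short loop. This is an illustration only: `extract_with_retry`, the `Extractor` and `Auditor` type aliases, and the fixed `threshold` of 0.8 are hypothetical and are not taken from pdfmux's code, which this document describes as using dynamic scoring.

```python
from typing import Callable

# Hypothetical types: an extractor maps a page number to text,
# and an auditor maps extracted text to a confidence in [0, 1].
Extractor = Callable[[int], str]
Auditor = Callable[[str], float]


def extract_with_retry(
    page_num: int,
    backends: list[tuple[str, Extractor]],
    audit: Auditor,
    threshold: float = 0.8,  # assumption: a fixed cut-off, for simplicity
) -> tuple[str, str, float]:
    """Try backends in order and stop at the first result that passes the audit.

    Returns (backend_name, text, confidence). If nothing passes, the
    highest-scoring attempt is returned so the caller can flag the page.
    """
    if not backends:
        raise ValueError("at least one backend is required")
    best = ("", "", -1.0)
    for name, extract in backends:
        text = extract(page_num)
        score = audit(text)
        if score > best[2]:
            best = (name, text, score)
        if score >= threshold:
            break  # good enough, no need to try slower backends
    return best


if __name__ == "__main__":
    backends = [
        ("fast", lambda n: ""),                  # pretend the fast path found no text
        ("ocr", lambda n: "Invoice total: 42"),  # pretend OCR recovered the page
    ]
    audit = lambda text: 0.9 if text.strip() else 0.0
    print(extract_with_retry(4, backends, audit))  # ('ocr', 'Invoice total: 42', 0.9)
```

Ordering backends from cheapest to most expensive means clean digital pages never pay for OCR or an LLM call.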

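Features

Feature	What it does	Command
Zero-config extraction	Routes to the best backend automatically	`pdfmux convert file.pdf`
RAG chunking	Section-aware chunking with token estimates	`pdfmux convert file.pdf --chunk --max-tokens 500`
Cost modes	economy / balanced / premium, with budget caps	`pdfmux convert file.pdf --mode economy --budget 0.50`
Schema extraction	5 built-in presets (invoice, receipt, contract, resume, paper)	`pdfmux convert file.pdf --schema invoice`
BYOK LLM	Gemini, Claude, GPT-4o, Ollama, any OpenAI-compatible API	`pdfmux convert file.pdf --llm-provider claude`
Benchmark	Evaluates all installed extractors against ground truth	`pdfmux benchmark`
Doctor	Shows installed backends, coverage gaps, recommendations	`pdfmux doctor`
MCP server	AI agents read PDFs over stdio or HTTP	`pdfmux serve`
Batch processing	Converts an entire directory	`pdfmux convert ./docs/`
Streaming	Bounded-memory page iteration for large files	`for page in ext.extract("500pg.pdf")`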

CLI reference

`pdfmux convert`

pdfmux convert <file-or-dir> [options]

Options:
  -o, --output PATH          Output file or directory
  -f, --format FORMAT        markdown | json | csv | llm (default: markdown)
  -q, --quality QUALITY      fast | standard | high (default: standard)
  -s, --schema SCHEMA        JSON schema file or preset (invoice, receipt, contract, resume, paper)
  --chunk                    Output RAG-ready chunks
  --max-tokens N             Max tokens per chunk (default: 500)
  --mode MODE                economy | balanced | premium (default: balanced)
  --budget AMOUNT            Max spend per document in USD
  --llm-provider PROVIDER    LLM backend: gemini | claude | openai | ollama
  --confidence               Include confidence score in output
  --stdout                   Print to stdout instead of file

`pdfmux serve`

Starts an MCP server for AI agent integration.

pdfmux serve              # stdio mode (Claude Desktop, Cursor)
pdfmux serve --http 8080  # HTTP mode

`pdfmux doctor`

pdfmux doctor
# ┌──────────────────┬─────────────┬─────────┬──────────────────────────────────┐
# │ Extractor        │ Status      │ Version │ Install                          │
# ├──────────────────┼─────────────┼─────────┼──────────────────────────────────┤
# │ PyMuPDF          │ installed   │ 1.25.3  │                                  │
# │ OpenDataLoader   │ installed   │ 0.3.1   │                                  │
# │ RapidOCR         │ installed   │ 3.0.6   │                                  │
# │ Docling          │ missing     │ --      │ pip install pdfmux[tables]       │
# │ Surya            │ missing     │ --      │ pip install pdfmux[ocr-heavy]    │
# │ LLM (Gemini)     │ configured  │ --      │ GEMINI_API_KEY set               │
# └──────────────────┴─────────────┴─────────┴──────────────────────────────────┘

`pdfmux benchmark`

pdfmux benchmark report.pdf
# ┌──────────────────┬────────┬────────────┬─────────────┬──────────────────────┐
# │ Extractor        │   Time │ Confidence │      Output │ Status               │
# ├──────────────────┼────────┼────────────┼─────────────┼──────────────────────┤
# │ PyMuPDF          │  0.02s │        95% │ 3,241 chars │ all pages good       │
# │ Multi-pass       │  0.03s │        95% │ 3,241 chars │ all pages good       │
# │ RapidOCR         │  4.20s │        88% │ 2,891 chars │ ok                   │
# │ OpenDataLoader   │  0.12s │        97% │ 3,310 chars │ best                 │
# └──────────────────┴────────┴────────────┴─────────────┴──────────────────────┘

Python API

Text extraction

import pdfmux

text = pdfmux.extract_text("report.pdf")                    # -> str (markdown)
text = pdfmux.extract_text("report.pdf", quality="fast")    # PyMuPDF only, instant
text = pdfmux.extract_text("report.pdf", quality="high")    # LLM-assisted

Structured extraction

data = pdfmux.extract_json("report.pdf")
# data["page_count"]   -> 12
# data["confidence"]   -> 0.91
# data["ocr_pages"]    -> [2, 5, 8]
# data["pages"][0]["key_values"]  -> [{"key": "Date", "value": "2026-02-28"}]
# data["pages"][0]["tables"]      -> [{"headers": [...], "rows": [...]}]

RAG chunking

chunks = pdfmux.chunk("report.pdf", max_tokens=500)
for c in chunks:
    print(f"{c['title']}: {c['tokens']} tokens (pages {c['page_start']}-{c['page_end']})")
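To make the `max_tokens` limit concrete, here is one way token-limited chunking could work: greedily pack paragraphs until the next one would exceed the limit. This is a sketch, not pdfmux's chunker. The `estimate_tokens` heuristic (about 4 characters per token) and the `pack_chunks` helper are assumptions made for illustration.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic (assumption): about 4 characters per token.
    return max(1, len(text) // 4)


def pack_chunks(paragraphs: list[str], max_tokens: int = 500) -> list[dict]:
    """Greedily pack paragraphs into chunks that stay within max_tokens.

    "tokens" is the sum of per-paragraph estimates. A single paragraph
    larger than max_tokens becomes its own oversized chunk.
    """
    chunks: list[dict] = []
    current: list[str] = []
    current_tokens = 0
    for para in paragraphs:
        n = estimate_tokens(para)
        # Flush the current chunk if adding this paragraph would overflow it.
        if current and current_tokens + n > max_tokens:
            chunks.append({"text": "\n\n".join(current), "tokens": current_tokens})
            current, current_tokens = [], 0
        current.append(para)
        current_tokens += n
    if current:
        chunks.append({"text": "\n\n".join(current), "tokens": current_tokens})
    return chunks


if __name__ == "__main__":
    paras = ["a" * 400, "b" * 400, "c" * 400]  # about 100 tokens each
    for chunk in pack_chunks(paras, max_tokens=200):
        print(chunk["tokens"])  # 200, then 100
```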

Schema-guided extraction

data = pdfmux.extract_json("invoice.pdf", schema="invoice")
# Uses built-in invoice preset: extracts date, vendor, line items, totals
# Also accepts a path to a custom JSON Schema file

Streaming (bounded memory)

from pdfmux.extractors import get_extractor

ext = get_extractor("fast")
for page in ext.extract("large-500-pages.pdf"):  # Iterator[PageResult]
    process(page.text)  # constant memory, even on 500-page PDFs
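The constant-memory claim follows from the iterator pattern: pages are produced one at a time and can be discarded after processing. The sketch below shows that pattern with plain Python generators. `Page` and `iter_pages` are hypothetical stand-ins, not pdfmux types, and the page text is synthetic.

```python
from collections.abc import Iterator
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    """Hypothetical stand-in for a per-page result."""
    page_num: int
    text: str


def iter_pages(page_count: int) -> Iterator[Page]:
    """Yield one page at a time instead of building a list of all pages."""
    for i in range(page_count):
        # A real extractor would read and parse page i here.
        yield Page(page_num=i, text=f"text of page {i}")


def count_chars(pages: Iterator[Page]) -> int:
    total = 0
    for page in pages:  # each Page can be garbage-collected after this step
        total += len(page.text)
    return total


if __name__ == "__main__":
    print(count_chars(iter_pages(3)))  # 42
```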

Types and errors

from pdfmux import (
    # Enums
    Quality,              # FAST, STANDARD, HIGH
    OutputFormat,         # MARKDOWN, JSON, CSV, LLM
    PageQuality,          # GOOD, BAD, EMPTY

    # Data objects (frozen dataclasses)
    PageResult,           # page: text, page_num, confidence, quality, extractor
    DocumentResult,       # document: pages, source, confidence, extractor_used
    Chunk,                # chunk: title, text, page_start, page_end, tokens

    # Errors
    PdfmuxError,          # base -- catch this for all pdfmux errors
    FileError,            # file not found, unreadable, not a PDF
    ExtractionError,      # extraction failed
    ExtractorNotAvailable,# requested backend not installed
    FormatError,          # invalid output format
    AuditError,           # audit could not complete
)

Framework integrations

LangChain

pip install langchain-pdfmux

from langchain_pdfmux import PDFMuxLoader

loader = PDFMuxLoader("report.pdf", quality="standard")
docs = loader.load()  # -> list[Document] with confidence metadata

LlamaIndex

pip install llama-index-readers-pdfmux

from llama_index.readers.pdfmux import PDFMuxReader

reader = PDFMuxReader(quality="standard")
docs = reader.load_data("report.pdf")  # -> list[Document]

MCP server (AI agents)

Listed on mcpservers.org. One-line setup:

{
  "mcpServers": {
    "pdfmux": {
      "command": "npx",
      "args": ["-y", "pdfmux-mcp"]
    }
  }
}

Or via Claude Code:

claude mcp add pdfmux -- npx -y pdfmux-mcp

Tools exposed: convert_pdf, analyze_pdf, extract_structured, get_pdf_metadata, batch_convert.

BYOK LLM configuration

pdfmux supports any LLM through 5 lines of YAML. Bring your own key: no data leaves your machine unless you configure it to.

# ~/.pdfmux/llm.yaml
provider: claude          # gemini | claude | openai | ollama | any OpenAI-compatible
model: claude-sonnet-4-20250514
api_key: ${ANTHROPIC_API_KEY}
base_url: https://api.anthropic.com  # optional, for custom endpoints
max_cost_per_page: 0.02   # budget cap
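The `${ANTHROPIC_API_KEY}` placeholder keeps the secret out of the config file. How pdfmux resolves it is not documented here, so the following is only a sketch of one common approach using the standard library's `string.Template`. The `raw_config` dict stands in for the parsed YAML, and `resolve` is a hypothetical helper.

```python
from string import Template

# Hypothetical in-memory copy of the YAML config, with values as strings.
raw_config = {
    "provider": "claude",
    "api_key": "${ANTHROPIC_API_KEY}",
    "max_cost_per_page": "0.02",
}


def resolve(config: dict[str, str], env: dict[str, str]) -> dict[str, str]:
    """Substitute ${VAR} placeholders from env and fail loudly if one is missing."""
    resolved = {}
    for key, value in config.items():
        try:
            resolved[key] = Template(value).substitute(env)
        except KeyError as exc:
            missing = exc.args[0]
            raise KeyError(f"{missing} is not set (needed by '{key}')") from None
    return resolved


if __name__ == "__main__":
    env = {"ANTHROPIC_API_KEY": "sk-example"}  # in practice: dict(os.environ)
    print(resolve(raw_config, env)["api_key"])  # sk-example
```

Failing on a missing variable, rather than silently passing the literal placeholder to the provider, makes a misconfigured key show up at startup instead of as an authentication error later.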

Supported providers:

Provider	Models	Runs locally?	Cost
Gemini	2.5 Flash, 2.5 Pro	No	~$0.01/page
Claude	Sonnet, Opus	No	~$0.015/page
GPT-4o	GPT-4o, GPT-4o-mini	No	~$0.01/page
Ollama	Any local model	Yes	Free
Custom	Any OpenAI-compatible API	Configurable	Varies

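Benchmarks

Tested on opendataloader-bench: 200 real-world PDFs spanning financial reports, legal documents, academic papers, and scanned documents.

Engine	Overall	Reading order	Tables (TEDS)	Headings	Requirements
opendataloader hybrid	0.909	0.935	0.928	0.828	API calls ($)
pdfmux	0.905	0.920	0.911	0.852	CPU only, $0
docling	0.877	0.900	0.887	0.802	~500MB models
marker	0.861	0.890	0.808	0.796	GPU recommended
opendataloader local	0.844	0.913	0.494	0.761	CPU only
mineru	0.831	0.857	0.873	0.743	GPU + ~2GB models

Ranked #2 overall and #1 among free tools. Reaches 99.5% of the paid leader's overall score at zero cost. Best heading detection of all engines tested. Image-table OCR extracts tables that are embedded as images.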

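Confidence scoring

Every result includes a 4-signal confidence score:

95-100% -- clean digital text, fully extractable
80-95% -- good extraction, minor OCR noise on some pages
50-80% -- partial extraction, some pages unrecoverable
<50% -- significant content loss, warnings included

When confidence drops below 80%, pdfmux tells you exactly what went wrong and how to fix it: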

Page 4: 32% confidence. 0 chars extracted from image-heavy page.
  -> Install pdfmux[ocr] for RapidOCR support on 6 image-heavy pages.
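As a rough mental model, the four signals can be combined into one score and then mapped onto the bands listed above. The sketch below uses a fixed weighted average, which is a simplification: the document describes the real scoring as dynamic, and the `Signals` class, the `WEIGHTS` values, and the band labels are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    """Hypothetical per-page inputs, each normalized to the range 0..1."""
    char_density: float
    ocr_noise_ratio: float  # higher means noisier, so it is inverted below
    table_completeness: float
    heading_structure: float


# Illustrative fixed weights (assumption). They sum to 1.0.
WEIGHTS = (0.4, 0.3, 0.15, 0.15)


def confidence(s: Signals) -> float:
    """Weighted average of the four signals, clamped to [0, 1]."""
    parts = (
        s.char_density,
        1.0 - s.ocr_noise_ratio,
        s.table_completeness,
        s.heading_structure,
    )
    score = sum(w * p for w, p in zip(WEIGHTS, parts))
    return max(0.0, min(1.0, score))


def band(score: float) -> str:
    """Map a score to one of the four confidence bands."""
    if score >= 0.95:
        return "clean digital text"
    if score >= 0.80:
        return "good extraction"
    if score >= 0.50:
        return "partial extraction"
    return "significant content loss"


if __name__ == "__main__":
    page = Signals(char_density=0.0, ocr_noise_ratio=0.0,
                   table_completeness=0.0, heading_structure=0.0)
    print(band(confidence(page)))  # significant content loss
```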

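Cost modes

Mode	Behavior	Typical cost
economy	Rule-based backends only. No LLM calls.	$0/page
balanced	LLM only for pages where rule-based extraction fails.	~$0.002/page on average
premium	LLM on every page for maximum quality.	~$0.01/page

Set a hard budget cap: --budget 0.50 stops LLM calls once spend on a document reaches $0.50.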
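A budget cap of this kind amounts to a running total checked before each LLM call. The sketch below illustrates the idea with `Decimal` to avoid floating-point drift in money arithmetic. `plan_llm_calls` and its flat per-page cost are hypothetical and do not reflect how pdfmux actually meters spend.

```python
from decimal import Decimal


def plan_llm_calls(
    failed_pages: list[int],
    cost_per_page: Decimal,
    budget: Decimal,
) -> tuple[list[int], list[int]]:
    """Split pages into (sent_to_llm, skipped) so spend never exceeds budget."""
    sent: list[int] = []
    skipped: list[int] = []
    spent = Decimal("0")
    for page in failed_pages:
        if spent + cost_per_page <= budget:
            sent.append(page)
            spent += cost_per_page
        else:
            skipped.append(page)  # stays with its best rule-based result
    return sent, skipped


if __name__ == "__main__":
    sent, skipped = plan_llm_calls([1, 2, 3, 4, 5], Decimal("0.01"), Decimal("0.03"))
    print(sent, skipped)  # [1, 2, 3] [4, 5]
```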

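Why pdfmux?

pdfmux is not another PDF extractor. It is the orchestration layer that picks the right extractor for each page, verifies the result, and retries failures.

Tool	Good at	Limitations
PyMuPDF	Fast digital text	Cannot handle scans or image layouts
Docling	Tables (97.9% accuracy)	Slow on non-table documents
Marker	GPU ML extraction	Requires a GPU, overkill for digital PDFs
Unstructured	Enterprise platform	Complex setup, paid tiers
LlamaParse	Cloud-native	Requires an API key, not local
Reducto	High accuracy	$0.015/page, closed source
pdfmux	Orchestrates all of the above	Per-page routing, audit, re-extraction

An open-source alternative to Reducto: what costs $0.015/page elsewhere is free with pdfmux's rule-based backends, or ~$0.002/page on average with the BYOK LLM fallback.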

Development

git clone https://github.com/NameetP/pdfmux.git
cd pdfmux
python3.12 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

pytest              # 151 tests
ruff check src/ tests/
ruff format src/ tests/

Contributing

Fork the repository
Create a branch (git checkout -b feature/your-feature)
Write tests for new functionality
Make sure pytest and ruff check pass
Open a PR

License

MIT
