document-converter-mcp
The document-converter-mcp server is a local-first MCP server for converting documents between Markdown, PDF, DOCX, and HTML formats using Pandoc and MarkItDown engines.
Core Conversion Tools:
Markdown → PDF: Options for table of contents, page size (A4/Letter), themes, PDF engine (pdflatex, xelatex, lualatex, wkhtmltopdf, weasyprint, typst), CJK font support, and source sidecar preservation.
Markdown → DOCX: Options for table of contents and custom reference DOCX templates for styling.
Markdown → HTML: Options for standalone output, custom CSS, and self-contained single-file generation.
DOCX → Markdown: Uses Pandoc or MarkItDown, with image extraction, multiple Markdown flavors (GFM, CommonMark, Pandoc), and LLM-optimized output cleaning.
PDF → Markdown: Uses MarkItDown or Pandoc, with sidecar recovery to restore original Markdown from PDFs generated with preserveSource=true.
Batch Convert: Convert entire directories between formats (md/docx/pdf → md/docx/pdf/html) with recursive traversal, glob include/exclude filters, concurrency control, and dry-run mode.
Additional Capabilities:
Environment Diagnostics (doctor tool): Check availability of Node.js, Pandoc, Python, MarkItDown, and PDF engines.
Workspace isolation: All file operations are confined to a configured directory, with path traversal prevention and sensitive file blocking.
No overwrite by default: Files are protected unless explicitly allowed.
cleanForLLM flag: Produces AI-friendly Markdown output across conversion tools.
Configuration file (.document-converter.json): Set workspace-level defaults overridable by tool arguments.
Structured JSON results: Consistent output format across all tools, including quality reports for PDF-to-Markdown conversions.
@lifeng688/document-converter-mcp
A local-first MCP server for converting documents between Markdown, PDF, DOCX, and HTML, with environment diagnostics and workspace-level configuration.
This project focuses on AI-friendly document conversion, not pixel-perfect layout reconstruction.
Features
7 tools: doctor, five single-file converters (Markdown -> PDF, Markdown -> DOCX, Markdown -> HTML, DOCX -> Markdown, PDF -> Markdown), and batch_convert
doctor tool: Diagnose local environment (Node.js, Pandoc, Python, MarkItDown, PDF engines)
Configuration file: .document-converter.json for workspace-level defaults
Dual engine support: Pandoc (primary) + MarkItDown (enhanced PDF/DOCX extraction)
Safe file access: Workspace-isolated path validation, sensitive file blocking, no-overwrite-by-default
Secure command execution: Spawn-based, no shell injection, structured errors with timeouts
AI-friendly output: Optional cleanForLLM flag for cleaner Markdown
Batch processing: Convert entire directories with concurrency control, dry run, include/exclude filters
PDF style options: Margin, section numbering, syntax highlighting, metadata
HTML style options: Themes, embedded CSS, self-contained output, syntax highlighting
DOCX image extraction: Extract embedded images with metadata reporting
PDF sidecar recovery: Accurate Markdown restoration from PDFs generated with preserveSource: true
Structured results: Consistent JSON response format across all tools
Supported Formats
Source | Targets |
Markdown (.md) | PDF, DOCX, HTML |
DOCX (.docx) | Markdown |
PDF (.pdf) | Markdown |
Installation
Prerequisites
Node.js >= 18.0.0
Pandoc >= 3.0
Python 3 >= 3.8 (optional, for MarkItDown)
PDF Engine (required for Markdown -> PDF)
Pandoc can convert Markdown to PDF, but it requires an external PDF engine.
Engine | Install | Notes |
pdflatex | MiKTeX (Windows), TeX Live (Linux/macOS) | Most common, ~2 GB install |
xelatex | TeX Live / MiKTeX | Recommended for Chinese/CJK documents |
lualatex | TeX Live / MiKTeX | Lua-based LaTeX engine |
wkhtmltopdf | | Lightweight HTML-to-PDF engine |
weasyprint | | Python-based HTML-to-PDF |
typst | | Modern, fast typesetting system |
Chinese documents: Use pdfEngine: "xelatex" with a TeX Live / MiKTeX installation that includes the ctex package.
Windows: cjkMainFont: "Microsoft YaHei"
macOS: cjkMainFont: "Songti SC"
Linux: cjkMainFont: "Noto Sans CJK SC"
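The per-platform font recommendations above can be turned into a small lookup. The sketch below is illustrative only: the function name `cjkPdfDefaults` and the `Platform` type are hypothetical, and the platform identifiers are assumed to follow Node's `process.platform` values. Only the font names and the `xelatex` engine come from this document.

```typescript
// Platform identifiers mirror Node's process.platform values (assumption).
type Platform = "win32" | "darwin" | "linux";

interface CjkPdfDefaults {
  pdfEngine: "xelatex";
  cjkMainFont: string;
}

// Font names are the ones recommended in this README.
const CJK_FONTS: Record<Platform, string> = {
  win32: "Microsoft YaHei",
  darwin: "Songti SC",
  linux: "Noto Sans CJK SC",
};

// Hypothetical helper: returns the "defaults" fragment for .document-converter.json.
function cjkPdfDefaults(platform: Platform): CjkPdfDefaults {
  return { pdfEngine: "xelatex", cjkMainFont: CJK_FONTS[platform] };
}

// Print a config fragment that could be pasted into .document-converter.json.
console.log(JSON.stringify({ defaults: cjkPdfDefaults("darwin") }, null, 2));
```

The fonts must actually be installed on the machine; the lookup only picks a name.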
Install Pandoc
macOS:
brew install pandoc
Ubuntu/Debian:
sudo apt-get update && sudo apt-get install -y pandoc
Windows: Download from https://pandoc.org/installing.html
Verify:
pandoc --version
Install MarkItDown (optional, recommended for PDF -> Markdown)
pip install markitdown
Verify:
python3 -c "import markitdown; print('ok')"
PDF support requires optional dependencies:
# For PDF extraction only:
python -m pip install -U "markitdown[pdf]"
# For DOCX extraction:
python -m pip install -U "markitdown[docx]"
# For all optional converters (PDF, EPUB, HTML, DOCX, etc.):
python -m pip install -U "markitdown[all]"
Having markitdown installed does not guarantee that PDF or DOCX support is available.
Install the Server
npm install -g @lifeng688/document-converter-mcp
Or use directly via npx:
npx @lifeng688/document-converter-mcp
For development, clone the repo and build locally:
git clone https://github.com/guanweiqiang/document-convert-mcp.git
cd document-convert-mcp
npm install
npm run build
MCP Client Configuration
Install the package globally first:
npm install -g @lifeng688/document-converter-mcp
Claude Desktop
Edit your Claude Desktop config (~/Library/Application Support/Claude/claude_desktop_config.json on macOS, or %APPDATA%\Claude\claude_desktop_config.json on Windows):
{
  "mcpServers": {
    "document-converter": {
      "command": "npx",
      "args": ["-y", "@lifeng688/document-converter-mcp"],
      "env": {
        "DOC_CONVERTER_WORKSPACE": "E:/MCPWorkDir"
      }
    }
  }
}
Or, if installed globally, use the installed command directly:
{
  "mcpServers": {
    "document-converter": {
      "command": "document-converter-mcp",
      "env": {
        "DOC_CONVERTER_WORKSPACE": "E:/MCPWorkDir"
      }
    }
  }
}
Sample configs are in examples/:
mcp.json -- MCP Inspector config
claude-desktop-config.json -- Claude Desktop config
Configuration File
Place .document-converter.json in your workspace root to set defaults for all tools.
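Values from this file sit between explicit tool arguments and the built-in defaults (tool args > .document-converter.json > built-in defaults). A minimal sketch of that precedence rule follows. The names `ConversionOptions`, `withoutUndefined`, and `resolveOptions` are hypothetical, and only a few of the documented options are modeled. The server's actual merge logic is not shown in this README and may differ.

```typescript
// A small subset of the documented options; the type name is hypothetical.
interface ConversionOptions {
  pdfEngine?: string;
  cjkMainFont?: string;
  pageSize?: "A4" | "Letter";
  overwrite?: boolean;
}

// Built-in defaults taken from the parameter tables (pageSize A4, overwrite false).
const BUILT_IN_DEFAULTS: ConversionOptions = { pageSize: "A4", overwrite: false };

// Drop keys that are explicitly undefined so they cannot mask a lower-priority value.
function withoutUndefined(options: ConversionOptions): ConversionOptions {
  const result: ConversionOptions = {};
  if (options.pdfEngine !== undefined) result.pdfEngine = options.pdfEngine;
  if (options.cjkMainFont !== undefined) result.cjkMainFont = options.cjkMainFont;
  if (options.pageSize !== undefined) result.pageSize = options.pageSize;
  if (options.overwrite !== undefined) result.overwrite = options.overwrite;
  return result;
}

// Later spreads win: built-in defaults < config file < tool arguments.
function resolveOptions(
  toolArgs: ConversionOptions,
  configDefaults: ConversionOptions
): ConversionOptions {
  return {
    ...BUILT_IN_DEFAULTS,
    ...withoutUndefined(configDefaults),
    ...withoutUndefined(toolArgs),
  };
}

const resolved = resolveOptions(
  { pageSize: "Letter" },
  { pdfEngine: "xelatex", cjkMainFont: "Microsoft YaHei", pageSize: "A4" }
);
console.log(resolved);
```

In this example the tool argument overrides the page size, while the engine and font come from the config file and `overwrite` falls back to the built-in default.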
Example:
{
  "defaults": {
    "pdfEngine": "xelatex",
    "cjkMainFont": "Microsoft YaHei",
    "pageSize": "A4",
    "theme": "github",
    "cleanForLLM": true,
    "overwrite": false
  },
  "batch": {
    "maxConcurrency": 2,
    "continueOnError": true
  },
  "security": {
    "maxFileSizeMB": 50
  }
}
Precedence:
tool args > .document-converter.json > built-in defaults
Notes:
The config file is read from the workspace root only (not nested directories).
Config values cannot bypass pathGuard -- paths must still be within the workspace.
overwrite defaults to false in the config for safety; do not set it to true unless intentional.
For Chinese/CJK PDF generation, recommended config:
Windows: "pdfEngine": "xelatex", "cjkMainFont": "Microsoft YaHei"
macOS: "pdfEngine": "xelatex", "cjkMainFont": "Songti SC"
Linux: "pdfEngine": "xelatex", "cjkMainFont": "Noto Sans CJK SC"
Tools
1. doctor
Check the local environment for document-converter-mcp dependencies.
This tool never fails due to missing dependencies -- missing tools appear as false in the output with warnings.
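Because missing dependencies are reported as false values rather than errors, a client has to inspect the result itself. The sketch below shows one way to do that, assuming the `data` shape from the example output in this section. The interface `DoctorData` and the function `findEnvironmentGaps` are hypothetical names, not part of the server.

```typescript
// Subset of the doctor result's "data" object, as shown in the example output.
interface DoctorData {
  pandoc: { available: boolean };
  markitdown: { available: boolean; pdfSupport: boolean };
  pdfEngines: { [engine: string]: boolean };
}

// Hypothetical helper: list which dependency groups are missing.
function findEnvironmentGaps(data: DoctorData): string[] {
  const gaps: string[] = [];
  if (!data.pandoc.available) gaps.push("pandoc");
  if (!data.markitdown.available) {
    gaps.push("markitdown");
  } else if (!data.markitdown.pdfSupport) {
    gaps.push("markitdown[pdf]");
  }
  // Markdown -> PDF needs at least one engine to be available.
  const available = Object.keys(data.pdfEngines).filter((name) => data.pdfEngines[name]);
  if (available.length === 0) gaps.push("pdf-engine");
  return gaps;
}

const sample: DoctorData = {
  pandoc: { available: true },
  markitdown: { available: true, pdfSupport: false },
  pdfEngines: { pdflatex: false, xelatex: false, typst: false },
};
console.log(findEnvironmentGaps(sample));
```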
Checks:
Node.js version
Workspace path, existence, writability
Pandoc availability and version
Python availability
MarkItDown availability and PDF support
PDF engines: pdflatex, xelatex, lualatex, wkhtmltopdf, weasyprint, typst
Recommendations for missing dependencies
Example output:
{
  "success": true,
  "summary": "Environment check completed.",
  "data": {
    "node": { "available": true, "version": "v22.18.0" },
    "workspace": { "path": "E:/MCPWorkDir", "exists": true, "writable": true },
    "pandoc": { "available": true, "version": "pandoc 3.8.2" },
    "python": { "available": true, "command": "python" },
    "markitdown": { "available": true, "pdfSupport": true },
    "pdfEngines": {
      "pdflatex": true,
      "xelatex": true,
      "lualatex": true,
      "wkhtmltopdf": false,
      "weasyprint": false,
      "typst": false
    },
    "recommendations": []
  },
  "warnings": [],
  "error": null
}
2. markdown_to_pdf
Convert Markdown to PDF using Pandoc.
Note: Pandoc requires an external PDF engine (LaTeX distribution or alternative) to generate PDFs.
Chinese documents: pdflatex does not support Chinese Unicode characters. To convert Chinese Markdown to PDF, use pdfEngine: "xelatex" (recommended) and set cjkMainFont.
Parameter | Type | Required | Default | Description |
inputPath | string | Yes | -- | Input Markdown file path (relative to workspace) |
outputPath | string | No | Auto-derived | Output PDF path |
title | string | No | -- | PDF document title |
toc | boolean | No | false | Include table of contents |
pageSize | enum | No | A4 | Page size: A4 or Letter |
theme | enum | No | default | Theme |
pdfEngine | enum | No | Pandoc default | PDF engine: pdflatex, xelatex, lualatex, wkhtmltopdf, weasyprint, typst |
cjkMainFont | string | No | -- | CJK main font for Chinese/Japanese/Korean documents (e.g. Microsoft YaHei) |
preserveSource | boolean | No | false | Save original Markdown as sidecar files (.source.md and .meta.json) |
| boolean | No | false | Reject input if Markdown has structural issues like unclosed code blocks |
overwrite | boolean | No | false | Allow overwriting existing files |
| string | No | -- | Page margin in safe format |
| boolean | No | false | Number section headings in the PDF |
| string | No | -- | Code highlight theme |
| object | No | -- | Additional metadata key-value pairs |
Sidecar files (when preserveSource=true):
document.pdf.source.md -- Original Markdown content
document.pdf.meta.json -- Conversion metadata
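To make the relationship between these parameters and Pandoc more concrete, here is a hedged sketch of how some of them could map onto a Pandoc argument array. The flags used (`-o`, `--pdf-engine`, `--toc`, `--metadata`, `-V papersize`, `-V CJKmainfont`) are standard Pandoc options, but the mapping itself is an assumption: this README does not show how the server builds its command line. The names `MarkdownToPdfArgs` and `buildPandocPdfArgs` are hypothetical.

```typescript
// Subset of the markdown_to_pdf parameters documented above.
interface MarkdownToPdfArgs {
  inputPath: string;
  outputPath: string;
  title?: string;
  toc?: boolean;
  pageSize?: "A4" | "Letter";
  pdfEngine?: string;
  cjkMainFont?: string;
}

// Hypothetical mapping from tool arguments to a Pandoc argv array.
// Each value is a separate array element, so no shell quoting is involved.
function buildPandocPdfArgs(args: MarkdownToPdfArgs): string[] {
  const argv: string[] = [args.inputPath, "-o", args.outputPath];
  if (args.pdfEngine) argv.push(`--pdf-engine=${args.pdfEngine}`);
  if (args.toc) argv.push("--toc");
  if (args.title) argv.push("--metadata", `title=${args.title}`);
  const pageSize = args.pageSize ?? "A4";
  argv.push("-V", `papersize=${pageSize.toLowerCase()}`);
  // CJKmainfont is a LaTeX template variable used with xelatex/lualatex.
  if (args.cjkMainFont) argv.push("-V", `CJKmainfont=${args.cjkMainFont}`);
  return argv;
}

const argv = buildPandocPdfArgs({
  inputPath: "docs/chinese-report.md",
  outputPath: "docs/chinese-report.pdf",
  title: "Quarterly Report; rm -rf /",
  toc: true,
  pdfEngine: "xelatex",
  cjkMainFont: "Microsoft YaHei",
});
console.log(argv);
```

Passing the array to a spawn-style API (rather than joining it into a shell string) is what keeps a title containing shell metacharacters harmless: it reaches Pandoc as a single literal argument.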
3. markdown_to_docx
Convert Markdown to DOCX using Pandoc.
Parameter | Type | Required | Default | Description |
inputPath | string | Yes | -- | Input Markdown file path |
outputPath | string | No | Auto-derived | Output DOCX path |
| string | No | -- | Word template file |
toc | boolean | No | false | Include table of contents |
| boolean | No | false | Reject input if Markdown has structural issues |
overwrite | boolean | No | false | Allow overwriting existing files |
4. docx_to_markdown
Convert DOCX to Markdown using Pandoc or MarkItDown.
Parameter | Type | Required | Default | Description |
inputPath | string | Yes | -- | Input DOCX file path |
outputPath | string | No | Auto-derived | Output Markdown path |
extractImages | boolean | No | false | Extract embedded images from the DOCX |
imageDir | string | No | Auto-derived | Directory for extracted images (must be within workspace). If omitted, a default directory is derived automatically |
| enum | No | pandoc | Engine: pandoc or markitdown |
| enum | No | gfm | Markdown dialect (GFM, CommonMark, or Pandoc) |
cleanForLLM | boolean | No | false | Clean Markdown for AI consumption |
overwrite | boolean | No | false | Allow overwriting existing files |
Image extraction:
When extractImages=true, the response includes:
{
  "imageCount": 2,
  "imageDir": "out/document_media",
  "images": [
    {
      "filename": "media/image1.png",
      "sizeBytes": 12345
    }
  ]
}
Even if no images are found:
{
  "imageCount": 0,
  "imageDir": "out/document_media",
  "images": []
}
Supported image extensions: .png, .jpg, .jpeg, .gif, .webp, .svg, .bmp, .tif, .tiff.
Path safety: imageDir is validated against path traversal. Values like ../outside-media will be rejected with an error containing "Access denied" and "workspace".
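The image fields in the response can be pictured as a filter over the extracted files followed by a count. The following sketch assumes case-insensitive extension matching, which this README does not confirm. The names `ExtractedImage`, `ImageReport`, and `buildImageReport` are hypothetical; the field names and the extension list come from this section.

```typescript
// Extensions listed in this section.
const IMAGE_EXTENSIONS = [".png", ".jpg", ".jpeg", ".gif", ".webp", ".svg", ".bmp", ".tif", ".tiff"];

interface ExtractedImage {
  filename: string;
  sizeBytes: number;
}

interface ImageReport {
  imageCount: number;
  imageDir: string;
  images: ExtractedImage[];
}

// Case-insensitive suffix check (an assumption about the matching rule).
function hasImageExtension(filename: string): boolean {
  const lower = filename.toLowerCase();
  return IMAGE_EXTENSIONS.some((ext) => lower.slice(-ext.length) === ext);
}

// Hypothetical helper: always returns all three fields, even with zero images.
function buildImageReport(imageDir: string, files: ExtractedImage[]): ImageReport {
  const images = files.filter((file) => hasImageExtension(file.filename));
  return { imageCount: images.length, imageDir, images };
}

const report = buildImageReport("out/document_media", [
  { filename: "media/image1.png", sizeBytes: 12345 },
  { filename: "media/notes.txt", sizeBytes: 10 },
]);
console.log(JSON.stringify(report, null, 2));
```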
5. pdf_to_markdown
Extract text from PDF to Markdown.
Warning: This is content extraction, not layout reconstruction. Scanned PDFs, complex tables, two-column papers, and mathematical formulas may not convert reliably. For scanned PDFs, OCR is required (not included).
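PDF to Markdown is content extraction, not reconstruction of layout or semantic structure.
Ordinary PDFs usually do not store Markdown semantics. Headings, tables, code blocks, lists, and reading order may not be recoverable reliably.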
Parameter | Type | Required | Default | Description |
inputPath | string | Yes | -- | Input PDF file path |
outputPath | string | No | Auto-derived | Output Markdown path |
| enum | No | markitdown | Engine: markitdown or pandoc |
cleanForLLM | boolean | No | false | Clean Markdown for AI consumption |
preferSourceSidecar | boolean | No | true | First check for a source sidecar (.source.md) next to the PDF |
overwrite | boolean | No | false | Allow overwriting existing files |
Sidecar recovery:
If the PDF was generated by this server with preserveSource: true, the original Markdown is available as sidecar files (document.pdf.source.md, document.pdf.meta.json). The default preferSourceSidecar: true will automatically find and return it.
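The decision described above can be sketched as a small pure function: look for `<pdf>.source.md`, and fall back to text extraction if it is absent or sidecar recovery is disabled. This is a sketch under stated assumptions, not the server's code. `planPdfToMarkdown`, `qualityFor`, and the injected `fileExists` callback are hypothetical; the sidecar naming and the quality fields follow this section.

```typescript
type QualityMode = "source-sidecar" | "text-extraction";

interface QualityReport {
  mode: QualityMode;
  layoutPreserved: boolean;
  headingsReliable: boolean;
  tablesReliable: boolean;
  codeBlocksReliable: boolean;
  readingOrderReliable: boolean;
}

// All reliability flags are true for sidecar recovery and false for text extraction.
function qualityFor(mode: QualityMode): QualityReport {
  const reliable = mode === "source-sidecar";
  return {
    mode,
    layoutPreserved: reliable,
    headingsReliable: reliable,
    tablesReliable: reliable,
    codeBlocksReliable: reliable,
    readingOrderReliable: reliable,
  };
}

// Hypothetical planner; fileExists is injected so the logic stays testable.
function planPdfToMarkdown(
  pdfPath: string,
  preferSourceSidecar: boolean,
  fileExists: (path: string) => boolean
): { source: string; quality: QualityReport } {
  const sidecar = pdfPath + ".source.md";
  if (preferSourceSidecar && fileExists(sidecar)) {
    return { source: sidecar, quality: qualityFor("source-sidecar") };
  }
  return { source: pdfPath, quality: qualityFor("text-extraction") };
}

const onlyReportHasSidecar = (path: string): boolean => path === "docs/report.pdf.source.md";
console.log(planPdfToMarkdown("docs/report.pdf", true, onlyReportHasSidecar).quality.mode);
console.log(planPdfToMarkdown("docs/scan.pdf", true, onlyReportHasSidecar).quality.mode);
```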
Quality report:
Sidecar recovery mode:
{
  "quality": {
    "mode": "source-sidecar",
    "layoutPreserved": true,
    "headingsReliable": true,
    "tablesReliable": true,
    "codeBlocksReliable": true,
    "readingOrderReliable": true
  }
}
Plain text extraction mode:
{
  "quality": {
    "mode": "text-extraction",
    "layoutPreserved": false,
    "headingsReliable": false,
    "tablesReliable": false,
    "codeBlocksReliable": false,
    "readingOrderReliable": false
  }
}
6. markdown_to_html
Convert Markdown to HTML using Pandoc.
Parameter | Type | Required | Default | Description |
inputPath | string | Yes | -- | Input Markdown file path |
outputPath | string | No | Auto-derived | Output HTML path |
| string | No | -- | External CSS file path (validated via workspace pathGuard) |
standalone | boolean | No | true | Generate complete HTML document with head/body |
| boolean | No | false | Reject input if Markdown has structural issues |
overwrite | boolean | No | false | Allow overwriting existing files |
theme | string | No | -- | Pandoc HTML theme (e.g. github) |
embedCss | boolean | No | false | Embed CSS and resources into the HTML document |
selfContained | boolean | No | false | Generate a self-contained single-file HTML |
| string | No | -- | Code highlight theme |
theme=github is ideal for README-style documentation.
embedCss=true embeds CSS directly into the HTML.
selfContained=true produces a single HTML file with all resources inline.
7. batch_convert
Convert all matching files in a directory from one format to another.
Parameter | Type | Required | Default | Description |
inputDir | string | Yes | -- | Source directory (relative to workspace) |
outputDir | string | Yes | -- | Destination directory (relative to workspace) |
from | enum | Yes | -- | Source format: md, docx, pdf |
to | enum | Yes | -- | Target format: md, docx, pdf, html |
recursive | boolean | No | false | Traverse subdirectories |
overwrite | boolean | No | false | Overwrite existing files |
cleanForLLM | boolean | No | false | Clean Markdown output for LLM consumption |
dryRun | boolean | No | false | Generate a conversion plan without writing files |
include | string[] | No | -- | Only convert files matching these glob patterns (e.g. report-*.md) |
exclude | string[] | No | -- | Skip files matching these glob patterns (e.g. draft-*) |
maxConcurrency | number | No | 1 | Max concurrent conversions (1-8). Lower values suit low-memory machines. |
continueOnError | boolean | No | true | Continue processing other files when one fails |
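The include and exclude filters can be illustrated with a small dry-run planner. This is a sketch only: it matches patterns against the file's base name using a minimal glob (`*` and `?`), whereas the server may use a full glob library with different semantics. `globToRegExp`, `planBatch`, and the `filteredOut` field are hypothetical; `plannedCount` is the field named in this section.

```typescript
// Convert a minimal glob (* and ?) into an anchored regular expression.
function globToRegExp(pattern: string): RegExp {
  const escaped = pattern.replace(/[.+^${}()|[\]\\]/g, "\\$&");
  const source = escaped.replace(/\*/g, ".*").replace(/\?/g, ".");
  return new RegExp("^" + source + "$");
}

function baseName(path: string): string {
  const index = path.lastIndexOf("/");
  return index === -1 ? path : path.slice(index + 1);
}

interface BatchPlan {
  plannedCount: number;
  filteredOut: number; // hypothetical field, not part of the documented response
  planned: string[];
}

// A file is planned when it matches some include pattern (or none are given)
// and matches no exclude pattern. Nothing is written: this mirrors dryRun.
function planBatch(files: string[], include: string[] = [], exclude: string[] = []): BatchPlan {
  const includes = include.map(globToRegExp);
  const excludes = exclude.map(globToRegExp);
  const planned = files.filter((file) => {
    const name = baseName(file);
    const included = includes.length === 0 || includes.some((re) => re.test(name));
    const excluded = excludes.some((re) => re.test(name));
    return included && !excluded;
  });
  return { plannedCount: planned.length, filteredOut: files.length - planned.length, planned };
}

const plan = planBatch(
  ["docs/articles/report-q1.md", "docs/articles/draft-ideas.md", "docs/articles/notes.md"],
  ["report-*.md"],
  ["draft-*", "internal-*"]
);
console.log(plan);
```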
Dry run example:
{
  "inputDir": "docs/source",
  "outputDir": "docs/published",
  "from": "md",
  "to": "pdf",
  "dryRun": true
}
Returns a plan with plannedCount but does not write any files.
Return structure:
{
  "success": true,
  "summary": "Batch conversion completed: 4 succeeded, 0 failed, 0 skipped.",
  "total": 4,
  "plannedCount": 4,
  "skippedCount": 0,
  "successCount": 4,
  "failedCount": 0,
  "durationMs": 1201,
  "results": [...]
}
Usage Examples
Run doctor
Tool: doctor
Args: {}
Create .document-converter.json
{
  "defaults": {
    "pdfEngine": "xelatex",
    "cjkMainFont": "Microsoft YaHei",
    "overwrite": false
  }
}
Markdown to Chinese PDF using config
With .document-converter.json setting pdfEngine: "xelatex" and cjkMainFont: "Microsoft YaHei":
Tool: markdown_to_pdf
Args: {
  "inputPath": "docs/chinese-report.md",
  "title": "Quarterly Report",
  "toc": true,
  "pageSize": "A4",
  "preserveSource": true,
  "overwrite": true
}
Markdown to PDF with preserveSource
Tool: markdown_to_pdf
Args: {
  "inputPath": "docs/report.md",
  "outputPath": "docs/report.pdf",
  "preserveSource": true,
  "overwrite": true
}
Generates docs/report.pdf.source.md and docs/report.pdf.meta.json for accurate recovery.
PDF to Markdown using source sidecar
Tool: pdf_to_markdown
Args: {
  "inputPath": "docs/report.pdf",
  "preferSourceSidecar": true
}
Automatically finds and returns the original Markdown from the sidecar file.
Markdown to HTML with GitHub theme
Tool: markdown_to_html
Args: {
  "inputPath": "docs/readme.md",
  "theme": "github",
  "standalone": true,
  "selfContained": true
}
Batch convert with dry run
Tool: batch_convert
Args: {
  "inputDir": "docs/articles",
  "outputDir": "docs/html",
  "from": "md",
  "to": "html",
  "dryRun": true
}
Batch convert with include/exclude
Tool: batch_convert
Args: {
  "inputDir": "docs/articles",
  "outputDir": "docs/published",
  "from": "md",
  "to": "pdf",
  "recursive": true,
  "include": ["report-*.md"],
  "exclude": ["draft-*", "internal-*"],
  "maxConcurrency": 2,
  "continueOnError": true,
  "overwrite": true
}
DOCX to Markdown with image extraction
Tool: docx_to_markdown
Args: {
  "inputPath": "docs/presentation.docx",
  "extractImages": true,
  "imageDir": "docs/presentation_media",
  "overwrite": true
}
Returns imageCount, imageDir, and images array in the response.
Security
This server implements strict security measures:
Workspace isolation: All file access is confined to a configured workspace directory (DOC_CONVERTER_WORKSPACE env var)
Path traversal prevention: .. sequences and absolute path escapes are blocked
Sensitive file blocking: .env, .ssh/, .npmrc, etc. are never accessible
File size limits: Input files over 50 MB are rejected by default (configurable via config file)
No shell injection: All commands use spawn() with argument arrays
No overwrite by default: Existing files are protected unless explicitly allowed
Config file cannot bypass pathGuard: Configuration defaults respect the same path safety rules as tool arguments
See docs/security.md for full details.
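The path rules listed above (workspace confinement, traversal prevention, sensitive file blocking) can be sketched as a single resolver. This is a simplified illustration, not the server's pathGuard: it works on plain strings, does not resolve symlinks, and checks only exact segment names from the list above. The function name `resolveInsideWorkspace` is hypothetical. A production version on Node would more likely build on `path.resolve` and `path.relative`.

```typescript
// Segment names taken from the sensitive-file list above (exact matches only).
const SENSITIVE_SEGMENTS = [".env", ".ssh", ".npmrc"];

// Hypothetical resolver: returns a path inside the workspace or throws.
function resolveInsideWorkspace(workspace: string, requested: string): string {
  const normalized = requested.replace(/\\/g, "/");
  // Reject POSIX absolute paths and Windows drive-letter paths.
  if (normalized.charAt(0) === "/" || /^[A-Za-z]:/.test(normalized)) {
    throw new Error("Access denied: absolute paths are outside the workspace");
  }
  const parts: string[] = [];
  for (const segment of normalized.split("/")) {
    if (segment === "" || segment === ".") continue;
    if (segment === "..") {
      // Popping past the workspace root means the path escapes it.
      if (parts.length === 0) {
        throw new Error("Access denied: path escapes the workspace");
      }
      parts.pop();
      continue;
    }
    if (SENSITIVE_SEGMENTS.indexOf(segment) !== -1) {
      throw new Error("Access denied: sensitive file inside the workspace");
    }
    parts.push(segment);
  }
  return workspace.replace(/\/+$/, "") + "/" + parts.join("/");
}

console.log(resolveInsideWorkspace("E:/MCPWorkDir", "docs/./drafts/../report.md"));
```

Normalizing segment by segment, rather than searching the string for "..", is what lets harmless paths such as `docs/drafts/../report.md` through while still rejecting `../outside-media`.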
Recommended Workflows
Good
Markdown -> PDF -- High-quality PDF output with Pandoc
Markdown -> DOCX -- High-quality Word output
Markdown -> HTML -- High-quality HTML output
DOCX -> Markdown -- Good text extraction with image metadata
PDF -> Markdown -- For text extraction only. Use preferSourceSidecar: true for PDFs generated by this server.
Not recommended
Markdown -> PDF -> Markdown for structure recovery
PDFs do not preserve Markdown semantics (headings, tables, code blocks, lists, reading order)
The round-trip will lose structural information
Use preserveSource: true instead when generating the PDF
Conversion Quality
This project focuses on AI-friendly document conversion, not pixel-perfect layout reconstruction.
See docs/conversion-quality.md for format-specific quality notes and engine comparisons.
Development
# Install dependencies
npm install
# Build TypeScript
npm run build
# Run in development mode (hot reload)
npm run dev
# Type check without emitting
npm run typecheck
License
MIT