kordoc
Parses and converts complex Korean document formats including HWP, HWPX, and PDF into structured Markdown, preserving tables, merged cells, and nested structures.
"I will parse them all": parse any Korean document to Markdown.
HWP, HWPX, PDF: if it is a Korean document, kordoc parses it and leaves nothing behind.

Why kordoc?
South Korea's government runs on HWP — a proprietary word processor the rest of the world has never heard of. Every day, 243 local governments and thousands of public institutions produce mountains of .hwp files. Extracting text from them has always been a nightmare: COM automation that only works on Windows, proprietary binary formats with zero documentation, and tables that break every existing parser.
kordoc was born from that document hell. Built by a Korean civil servant who spent 7 years buried under HWP files at a district office. One day he snapped — and decided to parse them all. Its parsers have been battle-tested across 5 real government projects, processing school curriculum plans, facility inspection reports, legal annexes, and municipal newsletters. If a Korean public servant wrote it, kordoc can parse it.
Features
HWP 5.x Binary Parsing: OLE2 container + record stream + UTF-16LE. No Hancom Office needed.
HWPX ZIP Parsing: OPF manifest resolution, multi-section, nested tables.
PDF Text Extraction: Y-coordinate line grouping, table reconstruction, image PDF detection.
2-Pass Table Builder: Correct colSpan/rowSpan via grid algorithm. No broken tables.
Broken ZIP Recovery: Corrupted HWPX? Scans raw Local File Headers.
3 Interfaces: npm library, CLI tool, and MCP server (Claude/Cursor).
Cross-Platform: Pure JavaScript. Runs on Linux, macOS, Windows.
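The grid idea behind a two-pass table builder can be sketched as follows. This is a minimal illustration, not kordoc's actual implementation: the `SpanCell` shape and the `buildGrid` name are hypothetical, and the default `maxCols` of 200 simply mirrors the MAX_COLS value listed under Security.

```typescript
// Hypothetical cell shape for illustration; kordoc's internal types may differ.
interface SpanCell {
  text: string
  colSpan: number
  rowSpan: number
}

// Pass 1 places every cell on an occupancy grid, skipping slots that an
// earlier merged cell already covers. Pass 2 emits a dense grid of strings.
function buildGrid(rows: SpanCell[][], maxCols = 200): string[][] {
  const occupied: boolean[][] = rows.map(() => [])
  const placed: { r: number; c: number; text: string }[] = []
  let width = 0

  rows.forEach((row, r) => {
    let c = 0
    for (const cell of row) {
      while (occupied[r][c]) c++ // slot taken by a rowSpan from a row above
      if (c >= maxCols) break
      // Clamp spans so a crafted value cannot grow the grid past its bounds.
      const colSpan = Math.max(1, Math.min(cell.colSpan, maxCols - c))
      const rowSpan = Math.max(1, Math.min(cell.rowSpan, rows.length - r))
      for (let dr = 0; dr < rowSpan; dr++) {
        for (let dc = 0; dc < colSpan; dc++) occupied[r + dr][c + dc] = true
      }
      placed.push({ r, c, text: cell.text })
      c += colSpan
      width = Math.max(width, c)
    }
  })

  // Pass 2: every row gets the same width; covered slots stay empty.
  const grid = rows.map(() => new Array<string>(width).fill(""))
  for (const { r, c, text } of placed) grid[r][c] = text
  return grid
}
```

Leaving covered slots as empty strings is one possible choice for Markdown output, which has no native merged cells; repeating the anchor cell's text would be another.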
Supported Formats
Format | Engine | Features |
HWPX (Hancom 2020+) | ZIP + XML DOM | Manifest, nested tables, merged cells, broken ZIP recovery |
HWP 5.x (Hancom legacy) | OLE2 + CFB | 21 control chars, zlib decompression, DRM detection |
PDF | pdfjs-dist | Line grouping, table detection, image PDF warning |
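Y-coordinate line grouping, as listed for the PDF engine, can be illustrated with a short sketch. The `TextFragment` shape, the `groupIntoLines` name, and the 2-unit tolerance are assumptions made for this example; pdfjs-dist reports positions in its own structure, and kordoc's real grouping logic is not shown in this document.

```typescript
// Hypothetical, simplified text fragment with page coordinates.
interface TextFragment {
  text: string
  x: number
  y: number
}

// Groups fragments whose y coordinates are within `tolerance` of the line's
// first fragment, then orders each line left to right.
function groupIntoLines(items: TextFragment[], tolerance = 2): string[] {
  // PDF y coordinates grow upward, so descending y reads top to bottom.
  const sorted = [...items].sort((a, b) => b.y - a.y)
  const lines: { y: number; parts: TextFragment[] }[] = []

  for (const item of sorted) {
    const last = lines[lines.length - 1]
    if (last && Math.abs(last.y - item.y) <= tolerance) {
      last.parts.push(item)
    } else {
      lines.push({ y: item.y, parts: [item] })
    }
  }

  return lines.map((line) =>
    line.parts
      .sort((a, b) => a.x - b.x)
      .map((p) => p.text)
      .join(" ")
  )
}
```

A fixed tolerance is the simplest option; scaling it by font height would likely cope better with mixed text sizes.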
Installation
npm install kordoc
# PDF support requires pdfjs-dist (optional peer dependency)
npm install pdfjs-dist
pdfjs-dist is an optional peer dependency. Not needed for HWP/HWPX parsing.
Usage
As a Library
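import { parse } from "kordoc"
import { readFileSync } from "fs"

const buffer = readFileSync("document.hwpx")
// A Node Buffer can be a view into a larger ArrayBuffer, so pass only this file's bytes.
const arrayBuffer = buffer.buffer.slice(buffer.byteOffset, buffer.byteOffset + buffer.byteLength)
const result = await parse(arrayBuffer)
if (result.success) {
  console.log(result.markdown)
}
Format-Specific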
import { parseHwpx, parseHwp, parsePdf } from "kordoc"
const hwpxResult = await parseHwpx(buffer) // HWPX
const hwpResult = await parseHwp(buffer) // HWP 5.x
const pdfResult = await parsePdf(buffer) // PDF
Format Detection
import { detectFormat } from "kordoc"
detectFormat(buffer) // → "hwpx" | "hwp" | "pdf" | "unknown"
As a CLI
npx kordoc document.hwpx # stdout
npx kordoc document.hwp -o output.md # save to file
npx kordoc *.pdf -d ./converted/ # batch convert
npx kordoc report.hwpx --format json # JSON with metadata
As an MCP Server
Works with Claude Desktop, Cursor, Windsurf, and any MCP-compatible client.
{
  "mcpServers": {
    "kordoc": {
      "command": "npx",
      "args": ["-y", "kordoc-mcp"]
    }
  }
}
Tools exposed:
Tool | Description |
| Parse HWP/HWPX/PDF file → Markdown |
| Detect file format via magic bytes |
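Magic-byte detection can be sketched with the well-known container signatures. The `sniffContainer` function and `SniffedFormat` type are hypothetical names for this example, not kordoc's `detectFormat`. Note that a ZIP signature alone does not prove a file is HWPX; how kordoc tells HWPX apart from other ZIP archives is not described here.

```typescript
type SniffedFormat = "zip" | "ole2" | "pdf" | "unknown"

// Compares the first bytes of the input against known file signatures.
function sniffContainer(data: ArrayBuffer): SniffedFormat {
  const head = new Uint8Array(data, 0, Math.min(8, data.byteLength))
  const startsWith = (sig: number[]): boolean =>
    sig.length <= head.length && sig.every((byte, i) => head[i] === byte)

  // "PK\x03\x04": ZIP local file header (the container used by HWPX)
  if (startsWith([0x50, 0x4b, 0x03, 0x04])) return "zip"
  // OLE2 / Compound File Binary header (the container used by HWP 5.x)
  if (startsWith([0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1])) return "ole2"
  // "%PDF"
  if (startsWith([0x25, 0x50, 0x44, 0x46])) return "pdf"
  return "unknown"
}
```

Checking content rather than the file extension means a renamed or mislabeled file is still routed to a parser that matches its actual container.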
API Reference
parse(buffer: ArrayBuffer): Promise<ParseResult>
Auto-detects format and converts to Markdown.
interface ParseResult {
  success: boolean
  markdown?: string
  fileType: "hwpx" | "hwp" | "pdf" | "unknown"
  isImageBased?: boolean // scanned PDF detection
  pageCount?: number // PDF only
  error?: string
}
Types
import type { ParseResult, ParseSuccess, ParseFailure, FileType } from "kordoc"
Internal types (IRBlock, IRTable, IRCell, CellContext) and utilities (KordocError, sanitizeError, isPathTraversal, buildTable, blocksToMarkdown) are not part of the public API.
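A caller might branch on the fields of `ParseResult` roughly as follows. The interface is copied from the API reference so the sketch is self-contained; `describeResult` is a hypothetical helper. The document does not say whether a scanned PDF is reported as a success or a failure, so the sketch checks `isImageBased` first as an assumption.

```typescript
// Copied from the API reference for self-containment;
// real code would import the type from "kordoc" instead.
interface ParseResult {
  success: boolean
  markdown?: string
  fileType: "hwpx" | "hwp" | "pdf" | "unknown"
  isImageBased?: boolean
  pageCount?: number
  error?: string
}

// Hypothetical helper: turns a parse result into a one-line status message.
function describeResult(result: ParseResult): string {
  // Assumption: checked before `success` because the reporting of scanned
  // PDFs is not specified in this document.
  if (result.isImageBased) {
    return "Image-based PDF: no extractable text layer"
  }
  if (!result.success) {
    return `Failed to parse ${result.fileType}: ${result.error ?? "unknown error"}`
  }
  const pages = result.pageCount !== undefined ? ` (${result.pageCount} pages)` : ""
  return `Parsed ${result.fileType}${pages}: ${(result.markdown ?? "").length} chars of Markdown`
}
```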
Requirements
Node.js >= 18
pdfjs-dist >= 4.0.0 — Optional. Only needed for PDF. HWP/HWPX work without it.
Security
Production-grade security hardening:
ZIP bomb protection: Entry count validation, 100MB decompression limit, 500 entry cap
Known limitation: Pre-check reads declared sizes from the ZIP Central Directory, which an attacker can falsify. The primary defense is per-file cumulative size tracking during actual decompression. For fully untrusted input where streaming decompression is required, consider wrapping kordoc behind a size-limited sandbox.
XXE/Billion Laughs prevention: Internal DTD subsets fully stripped from HWPX XML
Decompression bomb guard: maxOutputLength on HWP5 zlib streams, cumulative 100MB limit across sections
PDF resource limits: MAX_PAGES=5,000, cumulative text size 100MB cap, doc.destroy() cleanup
HWP5 record cap: Max 500,000 records per section, prevents memory exhaustion from crafted files
Table dimension clamping: rows/cols read from HWP5 binary clamped to MAX_ROWS/MAX_COLS before allocation
colSpan/rowSpan clamping: Crafted merge values clamped to grid bounds (MAX_COLS=200, MAX_ROWS=10,000)
Path traversal guard: Backslash normalization, .., absolute paths, Windows drive letters all rejected
MCP error sanitization: Allowlist-based error filtering, unknown errors return generic message
MCP path restriction: Only .hwp, .hwpx, .pdf extensions allowed, symlink resolution
File size limit: 500MB max in MCP server and CLI
HWP5 section limit: Max 100 sections in both primary and fallback paths
HWP5 control char fix: Character code 10 (footnote/endnote) now correctly handled
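The path traversal checks listed above (backslash normalization, `..` segments, absolute paths, Windows drive letters) can be sketched in a few lines. `isUnsafeEntryPath` is a hypothetical name for this example; kordoc's internal `isPathTraversal` utility is not public and may behave differently.

```typescript
// Returns true when an archive entry name could escape the extraction root.
function isUnsafeEntryPath(rawPath: string): boolean {
  // Normalize backslashes so Windows-style separators are checked the same way.
  const path = rawPath.replace(/\\/g, "/")
  if (path.startsWith("/")) return true // absolute path
  if (/^[A-Za-z]:/.test(path)) return true // Windows drive letter such as C:
  // Reject any ".." segment, but not names that merely contain two dots.
  return path.split("/").includes("..")
}
```

Matching whole path segments rather than the substring `..` avoids rejecting legitimate entry names such as `notes..v2.xml`.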
How It Works
┌─────────────┐     Magic Bytes      ┌──────────────────┐
│ File Input  │ ──── Detection ────→ │  Format Router   │
└─────────────┘                      └────────┬─────────┘
                                              │
                   ┌──────────────────────────┼──────────────────────────┐
                   │                          │                          │
             ┌─────▼─────┐            ┌───────▼───────┐           ┌──────▼──────┐
             │   HWPX    │            │    HWP 5.x    │           │     PDF     │
             │  ZIP+XML  │            │  OLE2+Record  │           │ pdfjs-dist  │
             └─────┬─────┘            └───────┬───────┘           └──────┬──────┘
                   │                          │                          │
                   │       ┌──────────────────┤                          │
                   │       │                  │                          │
             ┌─────▼───────▼─────┐            │                          │
             │   2-Pass Table    │            │                          │
             │  Builder (Grid)   │            │                          │
             └─────────┬─────────┘            │                          │
                       │                      │                          │
                 ┌─────▼──────────────────────▼──────────────────────────▼─────┐
                 │                          IRBlock[]                          │
                 │                (Intermediate Representation)                │
                 └──────────────────────────────┬──────────────────────────────┘
                                                │
                                         ┌──────▼──────┐
                                         │  Markdown   │
                                         │   Output    │
                                         └─────────────┘
Credits
Production-tested across 5 Korean government technology projects:
School curriculum plans (학교교육과정)
Facility inspection reports (사전기획 보고서)
Legal document annexes (법률 별표)
Municipal newsletters (소식지)
Public data extraction tools (공공데이터)
Thousands of real government documents parsed without breaking a sweat.
License