@shuji-bonji/pdf-reader-mcp
PDF Reader MCP Server
An MCP (Model Context Protocol) server specialized in deciphering PDF internal structures.
While typical PDF MCP servers are thin wrappers for text extraction, this project focuses on reading and analyzing the internal structure of PDF documents. Pair it with pdf-spec-mcp for specification-aware structural analysis and validation.
Features
16 tools organized into three tiers:
Tier 1: Basic Operations
| Tool | Description |
| --- | --- |
| get_page_count | Lightweight page count retrieval |
| | Full metadata extraction (title, author, PDF version...) |
| read_text | Text extraction with Y-coordinate reading order (opt-in split_columns and compact_whitespace) |
| search_text | Full-text search with surrounding context |
| | Image extraction as base64 with metadata |
| | Fetch and process remote PDFs from URLs |
| summarize | Quick overview report (metadata + text + image count) |
Tier 2: Structure Inspection
| Tool | Description |
| --- | --- |
| inspect_structure | Object tree and catalog dictionary analysis |
| inspect_tags | Tagged PDF structure tree visualization |
| | Font inventory (embedded/subset/type detection) |
| | Annotation listing (categorized by subtype) |
| | Digital signature field structure analysis |
| extract_tables | Table extraction from Tagged PDF <Table> markup |
Tier 3: Validation & Analysis
| Tool | Description |
| --- | --- |
| validate_tagged | PDF/UA tag structure validation (8 checks) |
| validate_metadata | Metadata conformance checking (10 checks) |
| compare_structure | Structural diff between two PDFs (properties + fonts) |
Installation
npx (recommended)
npx @shuji-bonji/pdf-reader-mcp
Claude Desktop
Add to your claude_desktop_config.json:
{
  "mcpServers": {
    "pdf-reader-mcp": {
      "command": "npx",
      "args": ["-y", "@shuji-bonji/pdf-reader-mcp"]
    }
  }
}
Claude Code
claude mcp add pdf-reader-mcp -- npx -y @shuji-bonji/pdf-reader-mcp
From Source
git clone https://github.com/shuji-bonji/pdf-reader-mcp.git
cd pdf-reader-mcp
npm install
npm run build
Usage Examples
Get Page Count
get_page_count({ file_path: "/path/to/document.pdf" })
→ 42
Search Text
search_text({
  file_path: "/path/to/spec.pdf",
  query: "digital signature",
  pages: "1-20",
  max_results: 10
})
→ Found 5 matches (page 3, 7, 12, 15, 18)
Summarize
summarize({ file_path: "/path/to/document.pdf" })
→ | Pages | 42 |
| PDF Version | 2.0 |
| Tagged | Yes |
| Signatures | No |
| Images | 15 |
Validate Tagged Structure (PDF/UA)
validate_tagged({ file_path: "/path/to/document.pdf" })
→ ✅ [TAG-001] Document is marked as tagged
✅ [TAG-002] Structure tree root exists
⚠️ [TAG-004] Heading hierarchy has gaps: H1, H3
❌ [TAG-005] Document has 3 image(s) but no Figure tags
Validate Metadata
validate_metadata({ file_path: "/path/to/document.pdf" })
→ ✅ [META-001] Title: "Annual Report 2025"
⚠️ [META-002] Author is missing
✅ [META-006] PDF version: 2.0
Compare Structure
compare_structure({
  file_path_1: "/path/to/v1.pdf",
  file_path_2: "/path/to/v2.pdf"
})
→ | Page Count | 10 | 12 | ❌ |
| PDF Version | 1.7 | 2.0 | ❌ |
| Tagged | true | true | ✅ |
Extract Tables (Tagged PDF)
extract_tables({ file_path: "/path/to/kaisei-tsutatsu.pdf", pages: "1" })
→ # Extracted Tables
- **Tagged**: Yes / **Pages Scanned**: 1 / **Tables Found**: 1
## Page 1 — Table 1
| 改正後 | 改正前 |
| --- | --- |
| …第2条第 16 項《定義》… | …第2条第 15 項《定義》… |
Untagged PDFs return an empty result with a note recommending the
column-aware fallback below.
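The fallback described above could be wired up on the client side roughly as follows. This is a minimal sketch, not part of the server: CallTool is a hypothetical stand-in for whatever MCP client call you use, and detecting an empty result by reading the "Tables Found" count from the Markdown report is an assumption based on the example output shown earlier.

```typescript
// Hypothetical stand-in for an MCP client call: takes a tool name and
// arguments, resolves to the text of the tool result.
type CallTool = (name: string, args: Record<string, unknown>) => Promise<string>;

// Assumption: the extract_tables report contains "**Tables Found**: N",
// as in the example output. Returns 0 when the count cannot be found.
function tablesFound(report: string): number {
  const match = /\*\*Tables Found\*\*:\s*(\d+)/.exec(report);
  return match ? Number(match[1]) : 0;
}

// Try tag-based table extraction first; fall back to column-aware text.
// read_text is called without a page filter here because the source does
// not show one for that tool.
async function readTabularPdf(
  callTool: CallTool,
  filePath: string,
  pages: string,
  columns: 2 | 3 = 2,
): Promise<{ source: "extract_tables" | "read_text"; text: string }> {
  const report = await callTool("extract_tables", { file_path: filePath, pages });
  if (tablesFound(report) > 0) {
    return { source: "extract_tables", text: report };
  }
  const text = await callTool("read_text", {
    file_path: filePath,
    split_columns: columns,
  });
  return { source: "read_text", text };
}
```

Passing the tool call in as a function keeps the sketch independent of any particular MCP client library.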
Read Untagged Multi-Column PDF
read_text({ file_path: "/path/to/older-shinkyu.pdf", split_columns: 2 })
→ // Plain Y-sort would interleave columns:
// "改正後セル1 改正前セル1\n 改正後セル2 改正前セル2..."
//
// With split_columns: 2 the left column is emitted first, then the right:
// "改正後セル1\n改正後セル2\n…\n\n改正前セル1\n改正前セル2\n…"
Use split_columns: 2 | 3 for untagged multi-column PDFs. For Tagged
PDFs with proper <Table> markup, extract_tables (above) is preferred.
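To illustrate why column splitting changes the reading order, here is a minimal sketch of the general technique. It is not the server's implementation: the TextFragment shape, the readByColumns name, and the use of equal-width vertical bands are all assumptions made for this example.

```typescript
// Hypothetical positioned text fragment; pdfjs-dist exposes positions via a
// transform matrix, simplified here to plain x/y in PDF user space.
interface TextFragment {
  str: string;
  x: number; // left edge
  y: number; // baseline; larger y is higher on the page
}

// Split fragments into equal-width vertical bands, then read each band
// top to bottom. Equal-width bands are an assumption of this sketch.
function readByColumns(
  fragments: TextFragment[],
  pageWidth: number,
  columns: 2 | 3,
): string {
  const bandWidth = pageWidth / columns;
  const bands: TextFragment[][] = Array.from({ length: columns }, () => []);
  for (const fragment of fragments) {
    const index = Math.min(columns - 1, Math.max(0, Math.floor(fragment.x / bandWidth)));
    bands[index].push(fragment);
  }
  return bands
    .map((band) =>
      [...band]
        .sort((a, b) => b.y - a.y) // top of page first
        .map((fragment) => fragment.str)
        .join("\n"),
    )
    .join("\n\n");
}

const page: TextFragment[] = [
  { str: "left 1", x: 50, y: 700 },
  { str: "right 1", x: 350, y: 700 },
  { str: "left 2", x: 50, y: 680 },
  { str: "right 2", x: 350, y: 680 },
];
console.log(readByColumns(page, 600, 2));
// → "left 1\nleft 2\n\nright 1\nright 2"
```

A plain Y-sort over the same fragments would pair "left 1" with "right 1" on one line, which is the interleaving the example above warns about.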
Compact Whitespace (Japanese Forms)
read_text({ file_path: "/path/to/form.pdf", compact_whitespace: true })
→ // Original PDF uses U+3000 fullwidth space as visual indentation:
// " ( ) 自 年 月 日 法 有 ( 年 月 日) 有 有"
//
// With compact_whitespace: true:
// "( ) 自 年 月 日 法 有 ( 年 月 日) 有 有"
//
// Empirically reduces character count by ~40% on form PDFs.
compact_whitespace is orthogonal to split_columns; both can be combined.
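The source does not document the exact normalization rule, so the following is only a sketch of one plausible approach: collapse each run of spaces, tabs, and U+3000 into a single space, line by line. The compactWhitespace function name is hypothetical.

```typescript
// Collapse each run of spaces, tabs, and U+3000 (ideographic space) into a
// single ASCII space, line by line, so line breaks are preserved.
// Assumption: this approximates, but may not match, the server's behavior.
function compactWhitespace(text: string): string {
  return text
    .split("\n")
    .map((line) => line.replace(/[ \t\u3000]+/g, " ").trim())
    .join("\n");
}

const raw = "\u3000\u3000(\u3000\u3000)\u3000自\u3000\u3000年\u3000月\u3000日";
console.log(compactWhitespace(raw));
// → "( ) 自 年 月 日"
```

Working line by line keeps the row structure of a form intact while still removing the padding used for visual indentation.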
Tech Stack
TypeScript + MCP TypeScript SDK
pdfjs-dist (Mozilla) — text/image extraction, tag tree, annotations
pdf-lib — low-level object structure analysis
Vitest — unit + E2E testing (171 tests)
Biome — linting + formatting
Zod — input validation
Testing
npm test # Run all tests (unit: 39 tests)
npm run test:e2e # E2E tests only (132 tests)
npm run test:watch # Watch mode
Architecture
pdf-reader-mcp/
├── src/
│   ├── index.ts                  # MCP Server entry point
│   ├── constants.ts              # Shared constants
│   ├── types.ts                  # Type definitions
│   ├── tools/
│   │   ├── tier1/                # Basic tools (7)
│   │   ├── tier2/                # Structure inspection (6)
│   │   ├── tier3/                # Validation & analysis (3)
│   │   └── index.ts              # Tool registration
│   ├── services/
│   │   ├── pdfjs-service.ts      # pdfjs-dist wrapper (parallel page processing)
│   │   ├── pdflib-service.ts     # pdf-lib wrapper
│   │   ├── validation-service.ts # Validation & comparison logic
│   │   └── url-fetcher.ts        # URL fetching
│   ├── schemas/                  # Zod validation schemas
│   └── utils/
│       ├── pdf-helpers.ts        # PDF utilities (page range parsing, file I/O)
│       ├── batch-processor.ts    # Batch processing for large PDFs
│       ├── formatter.ts          # Output formatting
│       └── error-handler.ts      # Error handling
└── tests/
    ├── tier1/                    # Unit tests
    └── e2e/                      # E2E tests (9 suites, 132 tests)
Error Contract (houki-hub family)
Since v0.6.0, this MCP returns structured errors that follow the houki-hub family error contract, sharing a unified code vocabulary across the family. Combined with houki-egov-mcp / houki-nta-mcp, an LLM or Skill layer can interpret errors with consistent logic.
docs/ERROR-CODES.md: error code vocabulary (houki-research-skill)
docs/ERROR-HANDLING.md: handling policy / next_actions templates
Implementation is independent — no dependency on houki-abbreviations or other family packages. The reference implementation is houki-egov-mcp/src/errors.ts; pdf-reader-mcp's local definition is in src/errors.ts.
On error, every tool returns isError: true and the JSON-stringified LawServiceError in content[0].text:
{
  "error": "The file does not appear to be a valid PDF.",
  "code": "INVALID_PDF",
  "hint": "ファイルが破損していないか確認してください。",
  "next_actions": [
    {
      "action": "inspect_structure",
      "reason": "PDF が壊れている可能性があります。Catalog / Pages 等の構造を確認してください"
    }
  ],
  "detail": { "cause": "Invalid PDF structure" }
}
Codes used by pdf-reader-mcp
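| Code | Purpose |
| --- | --- |
| | Invalid client-side arguments (path, URL, page range, etc.) |
| | File not found (ENOENT) |
| INVALID_PDF | Invalid or corrupted PDF |
| | Encrypted PDF (not currently supported) |
| | Unsupported PDF feature |
| | Exceeds the 50 MB limit (specific to pdf-reader) |
| | HTTP error during URL fetch (4xx/5xx) |
| | Remote fetch timeout |
| | DNS or connection failure |
| | Other internal errors, including permission denied |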
Migration note (v0.5.x → v0.6.0)
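Up to v0.5.x, content[0].text held a human-readable string of the form Error: ...\n\nSuggestion: .... In v0.6.0, the same field holds a JSON string. If your LLM-side logic relied on interpreting that text, switch to JSON.parse(content[0].text). The isError: true flag tells you whether the result is a structured error.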
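A client could handle both formats along these lines. This is a sketch under stated assumptions: the field names come from the example payload above, but which fields are optional is a guess, and ToolResult and parseToolError are hypothetical names rather than types exported by this package.

```typescript
// Shape of the structured error, taken from the example payload above.
// Assumption: only "error" and "code" are always present.
interface NextAction {
  action: string;
  reason: string;
}
interface StructuredError {
  error: string;
  code: string;
  hint?: string;
  next_actions?: NextAction[];
  detail?: Record<string, unknown>;
}

// Minimal view of an MCP tool result; only the fields used here.
interface ToolResult {
  isError?: boolean;
  content: { type: string; text?: string }[];
}

// Returns the parsed error for v0.6.0-style results, or null when the
// result is not an error or the text is not the expected JSON
// (for example, a v0.5.x "Error: ..." string).
function parseToolError(result: ToolResult): StructuredError | null {
  if (!result.isError) return null;
  const text = result.content[0]?.text;
  if (typeof text !== "string") return null;
  try {
    const parsed: unknown = JSON.parse(text);
    if (
      typeof parsed === "object" &&
      parsed !== null &&
      typeof (parsed as StructuredError).error === "string" &&
      typeof (parsed as StructuredError).code === "string"
    ) {
      return parsed as StructuredError;
    }
  } catch {
    // Not JSON: fall through and treat the result as unstructured.
  }
  return null;
}

const failure = parseToolError({
  isError: true,
  content: [
    {
      type: "text",
      text: '{"error":"The file does not appear to be a valid PDF.","code":"INVALID_PDF"}',
    },
  ],
});
console.log(failure?.code);
// → "INVALID_PDF"
```

Returning null for unparseable text lets callers keep a legacy path for servers still on v0.5.x.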
Pairing with pdf-spec-mcp
pdf-spec-mcp provides PDF specification knowledge (ISO 32000-2, etc.). With both servers enabled, an LLM can perform specification-aware workflows:
summarize: get a PDF overview
inspect_tags: examine the tag structure
pdf-spec-mcp get_requirements: fetch PDF/UA requirements
validate_tagged: check conformance
compare_structure: diff before/after fixes
License
MIT