Extract Tables (Tagged PDF)
extract_tablesExtract tables from tagged PDFs: walks structure tree to pull cell text, outputting as Markdown or JSON for structured data.
Instructions
Extract every <Table> subtree from a Tagged PDF as a structured row/cell list,
optionally rendered as Markdown tables.
How it works: walks the StructTree and pulls cell text for each <TR> →
<TH>/<TD>, then collapses kerning whitespace (e.g. "消 費 税 法" → "消費税法").
This sidesteps reading-order extraction's failure mode on multi-column tables
(typical of 新旧対照表 PDFs).
Args:
file_path (string): Absolute path to a local PDF file
pages (string, optional): Page range. Format: "1-5", "3", or "1,3,5-7". Omit for all pages.
response_format ('markdown' | 'json'): Output format (default: 'markdown')
Returns:
Markdown — # Extracted Tables summary block followed by one
## Page N — Table M section per table with a GFM table.
JSON — { isTagged, tables: [{ page, index, headerRows, bodyRows, footerRows }],
totalTables, pagesScanned, note? }.
Limitations:
Untagged PDFs return an empty result and a
note.colspan/rowspan are not honoured (cells are listed in source order).
Nested tables are skipped to keep page indices stable.
Examples:
Pull 新旧対照表 from a kaisei tsutatsu PDF for diffing
Convert 帳票 (form template) tables into structured data
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| file_path | Yes | Absolute path to a local PDF file (e.g., "/path/to/document.pdf") | |
| pages | No | Page range to process. Format: "1-5", "3", or "1,3,5-7". Omit for all pages. | |
| response_format | No | Output format: "markdown" for human-readable, "json" for structured data | markdown |
Implementation Reference
- src/services/pdfjs-service.ts:655-665 (handler)Public `extractTables` function - loads the PDF document and delegates to `extractTablesFromDoc`. This is the entry point called by the tool handler.
export async function extractTables( filePath: string, pages?: string, ): Promise<TablesExtractionResult> { const doc = await loadDocument(filePath); try { return await extractTablesFromDoc(doc, pages); } finally { await doc.destroy(); } } - src/services/pdfjs-service.ts:604-653 (handler)Core `extractTablesFromDoc` - checks if document is tagged, resolves page numbers, walks StructTree per page, builds id-to-text map, and collects all Table subtrees into structured ExtractedTable objects.
export async function extractTablesFromDoc( doc: PDFDocumentProxy, pages?: string, ): Promise<TablesExtractionResult> { const markInfo = await getMarkInfo(doc); const isTagged = markInfo?.Marked === true; if (!isTagged) { return { isTagged: false, tables: [], totalTables: 0, pagesScanned: 0, note: 'Document is not tagged. extract_tables relies on /MarkInfo /Marked true ' + 'and a StructTree. For untagged two-column PDFs, fall back to a ' + 'column-aware reading strategy (see pdf-reader-mcp Issue #3).', }; } const pageNumbers = resolvePageNumbers(pages, doc.numPages); const perPage = await Promise.all( pageNumbers.map(async (pageNum) => { const page = await doc.getPage(pageNum); try { const [tree, textContent] = await Promise.all([ page.getStructTree(), page.getTextContent({ includeMarkedContent: true }), ]); if (!tree) return [] as ExtractedTable[]; const idToText = buildIdToTextMap(textContent.items); const tables: ExtractedTable[] = []; collectTables(tree as unknown as StructNode, pageNum, idToText, tables); return tables; } catch { return [] as ExtractedTable[]; } }), ); const tables = perPage.flat(); return { isTagged: true, tables, totalTables: tables.length, pagesScanned: pageNumbers.length, }; } - src/services/pdfjs-service.ts:667-834 (handler)Internal helper functions (StructNode interface, buildIdToTextMap, collectTables, appendTableRowsFromSection, buildRowFromTR, collectTextUnder, compactCellText) that implement the StructTree walking and cell text extraction logic.
// ─── extract_tables internals ────────────────────────────────────────────── /** A node from `page.getStructTree()`. Has either `role`+`children` or `type === 'content'`+`id`. */ interface StructNode { role?: string; children?: StructNode[]; type?: 'content'; id?: string; } /** A `getTextContent({ includeMarkedContent: true })` item. */ interface TextContentItemLike { type?: string; id?: string | null; tag?: string; str?: string; hasEOL?: boolean; } /** * Build a map from a marked-content `id` (e.g. `p715R_mc4`) to the concatenated * raw text inside the corresponding `beginMarkedContentProps`/`endMarkedContent` * pair. Nested marked content is supported via a stack — text counts toward * every active id (so a `<Span>` inside a `<P>` contributes to both). * * Items with `tag === 'Artifact'` are page-level artifacts (page numbers, * running headers, etc.) outside the structure tree, and are skipped. */ function buildIdToTextMap(items: TextContentItemLike[]): Map<string, string> { const map = new Map<string, string[]>(); const stack: { id: string | null; isArtifact: boolean }[] = []; for (const item of items) { const t = item.type; if (t === 'beginMarkedContent' || t === 'beginMarkedContentProps') { const isArtifact = item.tag === 'Artifact'; const id = item.id ?? null; stack.push({ id, isArtifact }); continue; } if (t === 'endMarkedContent') { stack.pop(); continue; } if (t !== undefined) continue; // unknown marker // Text item if (stack.some((s) => s.isArtifact)) continue; const str = item.hasEOL ? ' ' : (item.str ?? ''); if (!str) continue; for (const frame of stack) { if (frame.id) { const buf = map.get(frame.id); if (buf) buf.push(str); else map.set(frame.id, [str]); } } } const out = new Map<string, string>(); for (const [id, parts] of map) out.set(id, parts.join('')); return out; } /** Walk the StructTree and append every `<Table>` subtree as an ExtractedTable. */ function collectTables( node: StructNode, pageNum: number, idToText: Map<string, string>, out: ExtractedTable[], ): void { if (node.type === 'content') return; if (node.role === 'Table') { const headerRows: TableRow[] = []; const bodyRows: TableRow[] = []; const footerRows: TableRow[] = []; for (const child of node.children ?? []) { if (child.type === 'content') continue; if (child.role === 'THead') { appendTableRowsFromSection(child, idToText, headerRows); } else if (child.role === 'TBody') { appendTableRowsFromSection(child, idToText, bodyRows); } else if (child.role === 'TFoot') { appendTableRowsFromSection(child, idToText, footerRows); } else if (child.role === 'TR') { // Tables sometimes omit THead/TBody and place TRs directly under <Table>. const row = buildRowFromTR(child, idToText); if (row) bodyRows.push(row); } } // `out` is a per-page accumulator passed in by the caller, so // `out.length + 1` is the next index within this page (1-based). out.push({ page: pageNum, index: out.length + 1, headerRows, bodyRows, footerRows, }); return; // Don't recurse into a Table — nested tables are uncommon and // would confuse the per-page index. (Add nested-table support later.) } for (const child of node.children ?? []) { collectTables(child, pageNum, idToText, out); } } function appendTableRowsFromSection( section: StructNode, idToText: Map<string, string>, out: TableRow[], ): void { for (const child of section.children ?? []) { if (child.type === 'content') continue; if (child.role === 'TR') { const row = buildRowFromTR(child, idToText); if (row) out.push(row); } } } function buildRowFromTR(tr: StructNode, idToText: Map<string, string>): TableRow | null { const cells: TableCell[] = []; for (const child of tr.children ?? []) { if (child.type === 'content') continue; if (child.role === 'TH' || child.role === 'TD') { const text = compactCellText(collectTextUnder(child, idToText)); cells.push({ text, isHeader: child.role === 'TH' }); } } return cells.length === 0 ? null : { cells }; } function collectTextUnder(node: StructNode, idToText: Map<string, string>): string { if (node.type === 'content') { return node.id ? (idToText.get(node.id) ?? '') : ''; } const parts: string[] = []; for (const child of node.children ?? []) { parts.push(collectTextUnder(child, idToText)); } return parts.join(' '); } /** * Normalise raw cell text: * 1. Collapse any whitespace run (`\s` + U+3000) to a single ASCII space. * 2. Fold per-character kerning runs between CJK characters * (e.g. "消 費 税 法" → "消費税法") — but only when at least three * single CJK chars are separated by single spaces in a row, so that * natural inter-word spacing like "事業者 法人番号" is preserved. * 3. Trim and Markdown-escape pipes / newlines. */ function compactCellText(s: string): string { if (!s) return ''; // Step 1: collapse whitespace runs (incl. U+3000) to one ASCII space. let t = s.replace(/[\s ]+/g, ' ').trim(); // Step 2: fold runs of `CJK + space` repeated at least twice followed by // a final CJK char. Anything shorter is treated as a real word boundary. const cjk = '[\\u3040-\\u30ff\\u3400-\\u9fff\\uff00-\\uffef]'; const kerningRun = new RegExp(`(?:${cjk} ){2,}${cjk}`, 'g'); t = t.replace(kerningRun, (m) => m.replace(/ /g, '')); // Step 3: escape Markdown table delimiters. return t.replace(/\|/g, '\\|').replace(/\n/g, ' '); } - src/tools/tier2/extract-tables.ts:1-77 (handler)MCP tool registration for 'extract_tables'. Defines title, description, input schema, and the async handler that calls extractTables() and formats the result as Markdown or JSON.
/** * extract_tables - Tagged PDF Table → Markdown extraction. * * Walks the structure tree (Tagged PDF only) and emits every `<Table>` * subtree as a Markdown table or a JSON object. Designed for documents * such as 国税庁 新旧対照表 / 帳票, where pure reading-order extraction * collapses two-column tables into ambiguous text. */ import type { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js'; import { ResponseFormat } from '../../constants.js'; import { type ExtractTablesInput, ExtractTablesSchema } from '../../schemas/tier2.js'; import { extractTables } from '../../services/pdfjs-service.js'; import { handleStructuredError } from '../../utils/error-handler.js'; import { formatTablesMarkdown, truncateIfNeeded } from '../../utils/formatter.js'; export function registerExtractTables(server: McpServer): void { server.registerTool( 'extract_tables', { title: 'Extract Tables (Tagged PDF)', description: `Extract every \`<Table>\` subtree from a Tagged PDF as a structured row/cell list, optionally rendered as Markdown tables. How it works: walks the StructTree and pulls cell text for each \`<TR>\` → \`<TH>/<TD>\`, then collapses kerning whitespace (e.g. "消 費 税 法" → "消費税法"). This sidesteps reading-order extraction's failure mode on multi-column tables (typical of 新旧対照表 PDFs). Args: - file_path (string): Absolute path to a local PDF file - pages (string, optional): Page range. Format: "1-5", "3", or "1,3,5-7". Omit for all pages. - response_format ('markdown' | 'json'): Output format (default: 'markdown') Returns: Markdown — \`# Extracted Tables\` summary block followed by one \`## Page N — Table M\` section per table with a GFM table. JSON — \`{ isTagged, tables: [{ page, index, headerRows, bodyRows, footerRows }], totalTables, pagesScanned, note? }\`. Limitations: - Untagged PDFs return an empty result and a \`note\`. - colspan/rowspan are not honoured (cells are listed in source order). - Nested tables are skipped to keep page indices stable. Examples: - Pull 新旧対照表 from a kaisei tsutatsu PDF for diffing - Convert 帳票 (form template) tables into structured data`, inputSchema: ExtractTablesSchema, annotations: { readOnlyHint: true, destructiveHint: false, idempotentHint: true, openWorldHint: false, }, }, async (params: ExtractTablesInput) => { try { const result = await extractTables(params.file_path, params.pages); const raw = params.response_format === ResponseFormat.JSON ? JSON.stringify(result, null, 2) : formatTablesMarkdown(result); const { text } = truncateIfNeeded(raw); return { content: [{ type: 'text' as const, text }] }; } catch (error) { const err = handleStructuredError(error); return { content: [{ type: 'text' as const, text: JSON.stringify(err, null, 2) }], isError: true, }; } }, ); } - src/schemas/tier2.ts:49-56 (schema)Zod schema `ExtractTablesSchema` defining input validation: file_path (string), pages (optional string), response_format ('markdown'|'json').
/** extract_tables — Tagged PDF Table → Markdown */ export const ExtractTablesSchema = z .object({ file_path: FilePathSchema, pages: PagesSchema, response_format: ResponseFormatSchema, }) .strict(); - src/tools/index.ts:30-52 (registration)Central registration: imports `registerExtractTables` and calls it at line 46 within `registerAllTools`.
export function registerAllTools(server: McpServer): void { // Tier 1: Basic PDF operations registerGetPageCount(server); registerGetMetadata(server); registerReadText(server); registerSearchText(server); registerReadImages(server); registerReadUrl(server); registerSummarize(server); // Tier 2: Structure analysis registerInspectStructure(server); registerInspectTags(server); registerInspectFonts(server); registerInspectAnnotations(server); registerInspectSignatures(server); registerExtractTables(server); // Tier 3: Validation & analysis registerValidateTagged(server); registerValidateMetadata(server); registerCompareStructure(server); } - src/types.ts:156-199 (helper)Type definitions: TableCell, TableRow, ExtractedTable, TablesExtractionResult used by the extract_tables tool.
/** A single cell within an extracted table row. */ export interface TableCell { /** Concatenated text content of the cell (whitespace compacted). */ text: string; /** True when the cell came from a `TH` element (header), false for `TD`. */ isHeader: boolean; } /** A table row composed of one or more cells. */ export interface TableRow { cells: TableCell[]; } /** A single table extracted from a Tagged PDF's structure tree. */ export interface ExtractedTable { /** 1-based page number where the table appears. */ page: number; /** 1-based index of the table within the page (1 = first table on the page). */ index: number; /** Rows from `<THead>`. Empty if the table has no explicit header section. */ headerRows: TableRow[]; /** Rows from `<TBody>` (or directly under `<Table>` when no `<TBody>` exists). */ bodyRows: TableRow[]; /** Rows from `<TFoot>`. Rare; usually empty. */ footerRows: TableRow[]; } /** Output of `extract_tables`. */ export interface TablesExtractionResult { /** Whether the PDF claims to be tagged (`/MarkInfo /Marked true`). */ isTagged: boolean; /** All tables extracted from the requested pages. */ tables: ExtractedTable[]; /** Total number of tables across `pagesScanned`. */ totalTables: number; /** Number of pages traversed (subject to the `pages` filter). */ pagesScanned: number; /** * Optional human-readable note. Set when no tables could be extracted * because the PDF is not tagged (in that case callers should fall back to * the planned column-aware extraction). */ note?: string; }