
Extract Tables (Tagged PDF)

extract_tables
Read-only · Idempotent

Extract tables from tagged PDFs: walks the structure tree to pull cell text and outputs Markdown or JSON for structured data.

Instructions

Extract every <Table> subtree from a Tagged PDF as a structured row/cell list, optionally rendered as Markdown tables.

How it works: walks the StructTree and pulls cell text for each `<TR>` → `<TH>`/`<TD>`, then collapses kerning whitespace (e.g. "消 費 税 法" → "消費税法"). This sidesteps reading-order extraction's failure mode on multi-column tables (typical of 新旧対照表, i.e. old/new comparison-table PDFs).

Args:

  • file_path (string): Absolute path to a local PDF file

  • pages (string, optional): Page range. Format: "1-5", "3", or "1,3,5-7". Omit for all pages.

  • response_format ('markdown' | 'json'): Output format (default: 'markdown')

Returns: Markdown — # Extracted Tables summary block followed by one ## Page N — Table M section per table with a GFM table.

JSON — { isTagged, tables: [{ page, index, headerRows, bodyRows, footerRows }], totalTables, pagesScanned, note? }.
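For orientation, a JSON result for a single-table page might look like the following (all cell values are invented for illustration; only the field shape follows the return description above):

```typescript
// Hypothetical extract_tables JSON result: one page, one two-column table.
// Values are made up; the structure mirrors the documented return shape.
const result = {
  isTagged: true,
  tables: [
    {
      page: 3,
      index: 1,
      headerRows: [
        { cells: [{ text: '改正後', isHeader: true }, { text: '改正前', isHeader: true }] },
      ],
      bodyRows: [
        { cells: [{ text: '新条文', isHeader: false }, { text: '旧条文', isHeader: false }] },
      ],
      footerRows: [],
    },
  ],
  totalTables: 1,
  pagesScanned: 1,
};
console.log(result.tables[0].page, result.totalTables); // 3 1
```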

Limitations:

  • Untagged PDFs return an empty result and a note.

  • colspan/rowspan are not honoured (cells are listed in source order).

  • Nested tables are skipped to keep page indices stable.

Examples:

  • Pull 新旧対照表 (old/new comparison tables) from a kaisei tsutatsu (revision directive) PDF for diffing

  • Convert 帳票 (form template) tables into structured data

Input Schema

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| file_path | Yes | Absolute path to a local PDF file (e.g., "/path/to/document.pdf") | |
| pages | No | Page range to process. Format: "1-5", "3", or "1,3,5-7". Omit for all pages. | |
| response_format | No | Output format: "markdown" for human-readable, "json" for structured data | markdown |
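As a sketch of how a `pages` spec like `"1,3,5-7"` can be resolved into concrete page numbers (the server's actual `resolvePageNumbers` is referenced in the implementation but not shown, so the name, error handling, and clamping behaviour here are assumptions):

```typescript
// Hedged sketch, not the server's resolvePageNumbers: parse a pages spec
// ("1-5", "3", "1,3,5-7") into a sorted, de-duplicated list of page numbers,
// clamped to the document's page count. Malformed parts are ignored here.
function parsePages(spec: string | undefined, numPages: number): number[] {
  if (!spec) return Array.from({ length: numPages }, (_, i) => i + 1);
  const out = new Set<number>();
  for (const part of spec.split(',')) {
    const m = part.trim().match(/^(\d+)(?:-(\d+))?$/);
    if (!m) continue; // this sketch skips malformed parts silently
    const start = Number(m[1]);
    const end = m[2] ? Number(m[2]) : start;
    for (let p = start; p <= end && p <= numPages; p++) {
      if (p >= 1) out.add(p);
    }
  }
  return [...out].sort((a, b) => a - b);
}

console.log(parsePages('1,3,5-7', 10)); // pages 1, 3, 5, 6, 7
```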

Implementation Reference

  • Public `extractTables` function - loads the PDF document and delegates to `extractTablesFromDoc`. This is the entry point called by the tool handler.
    export async function extractTables(
      filePath: string,
      pages?: string,
    ): Promise<TablesExtractionResult> {
      const doc = await loadDocument(filePath);
      try {
        return await extractTablesFromDoc(doc, pages);
      } finally {
        await doc.destroy();
      }
    }
  • Core `extractTablesFromDoc` - checks if document is tagged, resolves page numbers, walks StructTree per page, builds id-to-text map, and collects all Table subtrees into structured ExtractedTable objects.
    export async function extractTablesFromDoc(
      doc: PDFDocumentProxy,
      pages?: string,
    ): Promise<TablesExtractionResult> {
      const markInfo = await getMarkInfo(doc);
      const isTagged = markInfo?.Marked === true;
    
      if (!isTagged) {
        return {
          isTagged: false,
          tables: [],
          totalTables: 0,
          pagesScanned: 0,
          note:
            'Document is not tagged. extract_tables relies on /MarkInfo /Marked true ' +
            'and a StructTree. For untagged two-column PDFs, fall back to a ' +
            'column-aware reading strategy (see pdf-reader-mcp Issue #3).',
        };
      }
    
      const pageNumbers = resolvePageNumbers(pages, doc.numPages);
    
      const perPage = await Promise.all(
        pageNumbers.map(async (pageNum) => {
          const page = await doc.getPage(pageNum);
          try {
            const [tree, textContent] = await Promise.all([
              page.getStructTree(),
              page.getTextContent({ includeMarkedContent: true }),
            ]);
            if (!tree) return [] as ExtractedTable[];
    
            const idToText = buildIdToTextMap(textContent.items);
            const tables: ExtractedTable[] = [];
            collectTables(tree as unknown as StructNode, pageNum, idToText, tables);
            return tables;
          } catch {
            return [] as ExtractedTable[];
          }
        }),
      );
    
      const tables = perPage.flat();
      return {
        isTagged: true,
        tables,
        totalTables: tables.length,
        pagesScanned: pageNumbers.length,
      };
    }
  • Internal helper functions (StructNode interface, buildIdToTextMap, collectTables, appendTableRowsFromSection, buildRowFromTR, collectTextUnder, compactCellText) that implement the StructTree walking and cell text extraction logic.
    // ─── extract_tables internals ──────────────────────────────────────────────
    
    /** A node from `page.getStructTree()`. Has either `role`+`children` or `type === 'content'`+`id`. */
    interface StructNode {
      role?: string;
      children?: StructNode[];
      type?: 'content';
      id?: string;
    }
    
    /** A `getTextContent({ includeMarkedContent: true })` item. */
    interface TextContentItemLike {
      type?: string;
      id?: string | null;
      tag?: string;
      str?: string;
      hasEOL?: boolean;
    }
    
    /**
     * Build a map from a marked-content `id` (e.g. `p715R_mc4`) to the concatenated
     * raw text inside the corresponding `beginMarkedContentProps`/`endMarkedContent`
     * pair. Nested marked content is supported via a stack — text counts toward
     * every active id (so a `<Span>` inside a `<P>` contributes to both).
     *
     * Items with `tag === 'Artifact'` are page-level artifacts (page numbers,
     * running headers, etc.) outside the structure tree, and are skipped.
     */
    function buildIdToTextMap(items: TextContentItemLike[]): Map<string, string> {
      const map = new Map<string, string[]>();
      const stack: { id: string | null; isArtifact: boolean }[] = [];
    
      for (const item of items) {
        const t = item.type;
        if (t === 'beginMarkedContent' || t === 'beginMarkedContentProps') {
          const isArtifact = item.tag === 'Artifact';
          const id = item.id ?? null;
          stack.push({ id, isArtifact });
          continue;
        }
        if (t === 'endMarkedContent') {
          stack.pop();
          continue;
        }
        if (t !== undefined) continue; // unknown marker
        // Text item
        if (stack.some((s) => s.isArtifact)) continue;
        const str = item.hasEOL ? ' ' : (item.str ?? '');
        if (!str) continue;
        for (const frame of stack) {
          if (frame.id) {
            const buf = map.get(frame.id);
            if (buf) buf.push(str);
            else map.set(frame.id, [str]);
          }
        }
      }
    
      const out = new Map<string, string>();
      for (const [id, parts] of map) out.set(id, parts.join(''));
      return out;
    }
    
    /** Walk the StructTree and append every `<Table>` subtree as an ExtractedTable. */
    function collectTables(
      node: StructNode,
      pageNum: number,
      idToText: Map<string, string>,
      out: ExtractedTable[],
    ): void {
      if (node.type === 'content') return;
    
      if (node.role === 'Table') {
        const headerRows: TableRow[] = [];
        const bodyRows: TableRow[] = [];
        const footerRows: TableRow[] = [];
    
        for (const child of node.children ?? []) {
          if (child.type === 'content') continue;
          if (child.role === 'THead') {
            appendTableRowsFromSection(child, idToText, headerRows);
          } else if (child.role === 'TBody') {
            appendTableRowsFromSection(child, idToText, bodyRows);
          } else if (child.role === 'TFoot') {
            appendTableRowsFromSection(child, idToText, footerRows);
          } else if (child.role === 'TR') {
            // Tables sometimes omit THead/TBody and place TRs directly under <Table>.
            const row = buildRowFromTR(child, idToText);
            if (row) bodyRows.push(row);
          }
        }
    
        // `out` is a per-page accumulator passed in by the caller, so
        // `out.length + 1` is the next index within this page (1-based).
        out.push({
          page: pageNum,
          index: out.length + 1,
          headerRows,
          bodyRows,
          footerRows,
        });
        return; // Don't recurse into a Table — nested tables are uncommon and
        // would confuse the per-page index. (Add nested-table support later.)
      }
    
      for (const child of node.children ?? []) {
        collectTables(child, pageNum, idToText, out);
      }
    }
    
    function appendTableRowsFromSection(
      section: StructNode,
      idToText: Map<string, string>,
      out: TableRow[],
    ): void {
      for (const child of section.children ?? []) {
        if (child.type === 'content') continue;
        if (child.role === 'TR') {
          const row = buildRowFromTR(child, idToText);
          if (row) out.push(row);
        }
      }
    }
    
    function buildRowFromTR(tr: StructNode, idToText: Map<string, string>): TableRow | null {
      const cells: TableCell[] = [];
      for (const child of tr.children ?? []) {
        if (child.type === 'content') continue;
        if (child.role === 'TH' || child.role === 'TD') {
          const text = compactCellText(collectTextUnder(child, idToText));
          cells.push({ text, isHeader: child.role === 'TH' });
        }
      }
      return cells.length === 0 ? null : { cells };
    }
    
    function collectTextUnder(node: StructNode, idToText: Map<string, string>): string {
      if (node.type === 'content') {
        return node.id ? (idToText.get(node.id) ?? '') : '';
      }
      const parts: string[] = [];
      for (const child of node.children ?? []) {
        parts.push(collectTextUnder(child, idToText));
      }
      return parts.join(' ');
    }
    
    /**
     * Normalise raw cell text:
     *   1. Collapse any whitespace run (`\s` + U+3000) to a single ASCII space.
     *   2. Fold per-character kerning runs between CJK characters
     *      (e.g. "消 費 税 法" → "消費税法") — but only when at least three
     *      single CJK chars are separated by single spaces in a row, so that
     *      natural inter-word spacing like "事業者 法人番号" is preserved.
     *   3. Trim and Markdown-escape pipes / newlines.
     */
    function compactCellText(s: string): string {
      if (!s) return '';
      // Step 1: collapse whitespace runs (incl. U+3000) to one ASCII space.
      let t = s.replace(/[\s\u3000]+/g, ' ').trim();
      // Step 2: fold runs of `CJK + space` repeated at least twice followed by
      // a final CJK char. Anything shorter is treated as a real word boundary.
      const cjk = '[\\u3040-\\u30ff\\u3400-\\u9fff\\uff00-\\uffef]';
      const kerningRun = new RegExp(`(?:${cjk} ){2,}${cjk}`, 'g');
      t = t.replace(kerningRun, (m) => m.replace(/ /g, ''));
      // Step 3: escape Markdown table delimiters.
      return t.replace(/\|/g, '\\|').replace(/\n/g, ' ');
    }
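Reproducing two of the helpers above (with the U+3000 whitespace branch written as an explicit escape), their documented behaviour can be checked directly: `buildIdToTextMap` credits nested marked content to every active id and skips `Artifact` runs, while `compactCellText` folds kerning runs but keeps real word boundaries:

```typescript
// buildIdToTextMap and compactCellText, copied from the implementation above
// so this snippet runs standalone.
interface TextContentItemLike {
  type?: string;
  id?: string | null;
  tag?: string;
  str?: string;
  hasEOL?: boolean;
}

function buildIdToTextMap(items: TextContentItemLike[]): Map<string, string> {
  const map = new Map<string, string[]>();
  const stack: { id: string | null; isArtifact: boolean }[] = [];
  for (const item of items) {
    const t = item.type;
    if (t === 'beginMarkedContent' || t === 'beginMarkedContentProps') {
      stack.push({ id: item.id ?? null, isArtifact: item.tag === 'Artifact' });
      continue;
    }
    if (t === 'endMarkedContent') { stack.pop(); continue; }
    if (t !== undefined) continue; // unknown marker
    if (stack.some((s) => s.isArtifact)) continue; // skip page artifacts
    const str = item.hasEOL ? ' ' : (item.str ?? '');
    if (!str) continue;
    for (const frame of stack) {
      if (frame.id) {
        const buf = map.get(frame.id);
        if (buf) buf.push(str); else map.set(frame.id, [str]);
      }
    }
  }
  const out = new Map<string, string>();
  for (const [id, parts] of map) out.set(id, parts.join(''));
  return out;
}

function compactCellText(s: string): string {
  if (!s) return '';
  let t = s.replace(/[\s\u3000]+/g, ' ').trim();
  const cjk = '[\\u3040-\\u30ff\\u3400-\\u9fff\\uff00-\\uffef]';
  const kerningRun = new RegExp(`(?:${cjk} ){2,}${cjk}`, 'g');
  t = t.replace(kerningRun, (m) => m.replace(/ /g, ''));
  return t.replace(/\|/g, '\\|').replace(/\n/g, ' ');
}

// Nested <Span> text counts toward both ids; the Artifact run is skipped.
const ids = buildIdToTextMap([
  { type: 'beginMarkedContentProps', id: 'p1R_mc0', tag: 'P' },
  { str: '消 費 ' },
  { type: 'beginMarkedContentProps', id: 'p1R_mc1', tag: 'Span' },
  { str: '税 法' },
  { type: 'endMarkedContent' },
  { type: 'endMarkedContent' },
  { type: 'beginMarkedContentProps', id: null, tag: 'Artifact' },
  { str: '- 1 -' }, // page-number artifact, never reaches the map
  { type: 'endMarkedContent' },
]);
console.log(ids.get('p1R_mc0')); // "消 費 税 法"
console.log(ids.get('p1R_mc1')); // "税 法"
console.log(compactCellText(ids.get('p1R_mc0') ?? '')); // "消費税法"
console.log(compactCellText('事業者 法人番号')); // unchanged: real word boundary
```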
  • MCP tool registration for 'extract_tables'. Defines title, description, input schema, and the async handler that calls extractTables() and formats the result as Markdown or JSON.
    /**
     * extract_tables - Tagged PDF Table → Markdown extraction.
     *
     * Walks the structure tree (Tagged PDF only) and emits every `<Table>`
     * subtree as a Markdown table or a JSON object. Designed for documents
     * such as 国税庁 新旧対照表 / 帳票, where pure reading-order extraction
     * collapses two-column tables into ambiguous text.
     */
    
    import type { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js';
    import { ResponseFormat } from '../../constants.js';
    import { type ExtractTablesInput, ExtractTablesSchema } from '../../schemas/tier2.js';
    import { extractTables } from '../../services/pdfjs-service.js';
    import { handleStructuredError } from '../../utils/error-handler.js';
    import { formatTablesMarkdown, truncateIfNeeded } from '../../utils/formatter.js';
    
    export function registerExtractTables(server: McpServer): void {
      server.registerTool(
        'extract_tables',
        {
          title: 'Extract Tables (Tagged PDF)',
          description: `Extract every \`<Table>\` subtree from a Tagged PDF as a structured row/cell list,
    optionally rendered as Markdown tables.
    
    How it works: walks the StructTree and pulls cell text for each \`<TR>\` →
    \`<TH>/<TD>\`, then collapses kerning whitespace (e.g. "消 費 税 法" → "消費税法").
    This sidesteps reading-order extraction's failure mode on multi-column tables
    (typical of 新旧対照表 PDFs).
    
    Args:
      - file_path (string): Absolute path to a local PDF file
      - pages (string, optional): Page range. Format: "1-5", "3", or "1,3,5-7". Omit for all pages.
      - response_format ('markdown' | 'json'): Output format (default: 'markdown')
    
    Returns:
      Markdown — \`# Extracted Tables\` summary block followed by one
      \`## Page N — Table M\` section per table with a GFM table.
    
      JSON — \`{ isTagged, tables: [{ page, index, headerRows, bodyRows, footerRows }],
      totalTables, pagesScanned, note? }\`.
    
    Limitations:
      - Untagged PDFs return an empty result and a \`note\`.
      - colspan/rowspan are not honoured (cells are listed in source order).
      - Nested tables are skipped to keep page indices stable.
    
    Examples:
      - Pull 新旧対照表 from a kaisei tsutatsu PDF for diffing
      - Convert 帳票 (form template) tables into structured data`,
          inputSchema: ExtractTablesSchema,
          annotations: {
            readOnlyHint: true,
            destructiveHint: false,
            idempotentHint: true,
            openWorldHint: false,
          },
        },
        async (params: ExtractTablesInput) => {
          try {
            const result = await extractTables(params.file_path, params.pages);
            const raw =
              params.response_format === ResponseFormat.JSON
                ? JSON.stringify(result, null, 2)
                : formatTablesMarkdown(result);
    
            const { text } = truncateIfNeeded(raw);
            return { content: [{ type: 'text' as const, text }] };
          } catch (error) {
            const err = handleStructuredError(error);
            return {
              content: [{ type: 'text' as const, text: JSON.stringify(err, null, 2) }],
              isError: true,
            };
          }
        },
      );
    }
  • Zod schema `ExtractTablesSchema` defining input validation: file_path (string), pages (optional string), response_format ('markdown'|'json').
    /** extract_tables — Tagged PDF Table → Markdown */
    export const ExtractTablesSchema = z
      .object({
        file_path: FilePathSchema,
        pages: PagesSchema,
        response_format: ResponseFormatSchema,
      })
      .strict();
  • Central registration: imports `registerExtractTables` and calls it at line 46 within `registerAllTools`.
    export function registerAllTools(server: McpServer): void {
      // Tier 1: Basic PDF operations
      registerGetPageCount(server);
      registerGetMetadata(server);
      registerReadText(server);
      registerSearchText(server);
      registerReadImages(server);
      registerReadUrl(server);
      registerSummarize(server);
    
      // Tier 2: Structure analysis
      registerInspectStructure(server);
      registerInspectTags(server);
      registerInspectFonts(server);
      registerInspectAnnotations(server);
      registerInspectSignatures(server);
      registerExtractTables(server);
    
      // Tier 3: Validation & analysis
      registerValidateTagged(server);
      registerValidateMetadata(server);
      registerCompareStructure(server);
    }
  • Type definitions: TableCell, TableRow, ExtractedTable, TablesExtractionResult used by the extract_tables tool.
    /** A single cell within an extracted table row. */
    export interface TableCell {
      /** Concatenated text content of the cell (whitespace compacted). */
      text: string;
      /** True when the cell came from a `TH` element (header), false for `TD`. */
      isHeader: boolean;
    }
    
    /** A table row composed of one or more cells. */
    export interface TableRow {
      cells: TableCell[];
    }
    
    /** A single table extracted from a Tagged PDF's structure tree. */
    export interface ExtractedTable {
      /** 1-based page number where the table appears. */
      page: number;
      /** 1-based index of the table within the page (1 = first table on the page). */
      index: number;
      /** Rows from `<THead>`. Empty if the table has no explicit header section. */
      headerRows: TableRow[];
      /** Rows from `<TBody>` (or directly under `<Table>` when no `<TBody>` exists). */
      bodyRows: TableRow[];
      /** Rows from `<TFoot>`. Rare; usually empty. */
      footerRows: TableRow[];
    }
    
    /** Output of `extract_tables`. */
    export interface TablesExtractionResult {
      /** Whether the PDF claims to be tagged (`/MarkInfo /Marked true`). */
      isTagged: boolean;
      /** All tables extracted from the requested pages. */
      tables: ExtractedTable[];
      /** Total number of tables across `pagesScanned`. */
      totalTables: number;
      /** Number of pages traversed (subject to the `pages` filter). */
      pagesScanned: number;
      /**
       * Optional human-readable note. Set when no tables could be extracted
       * because the PDF is not tagged (in that case callers should fall back to
       * the planned column-aware extraction).
       */
      note?: string;
    }
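The Markdown side (`formatTablesMarkdown`) is not reproduced in this reference. As a rough sketch of how one `ExtractedTable` could be rendered to a GFM table under these types (the function name and layout choices are assumptions, not the server's implementation):

```typescript
// Hedged sketch: render one ExtractedTable as a GFM table. Not the server's
// formatTablesMarkdown; uses the first row as the header line (which may be a
// body row when the table has no explicit <THead>).
interface TableCell { text: string; isHeader: boolean; }
interface TableRow { cells: TableCell[]; }
interface ExtractedTable {
  page: number; index: number;
  headerRows: TableRow[]; bodyRows: TableRow[]; footerRows: TableRow[];
}

function renderGfm(t: ExtractedTable): string {
  const rows = [...t.headerRows, ...t.bodyRows, ...t.footerRows];
  if (rows.length === 0) return '';
  // Pad every row to the widest row so the GFM grid stays rectangular.
  const width = Math.max(...rows.map((r) => r.cells.length));
  const line = (r: TableRow) =>
    '| ' + Array.from({ length: width }, (_, i) => r.cells[i]?.text ?? '').join(' | ') + ' |';
  const sep = '| ' + Array.from({ length: width }, () => '---').join(' | ') + ' |';
  return [line(rows[0]), sep, ...rows.slice(1).map(line)].join('\n');
}

const table: ExtractedTable = {
  page: 1, index: 1,
  headerRows: [{ cells: [{ text: '改正後', isHeader: true }, { text: '改正前', isHeader: true }] }],
  bodyRows: [{ cells: [{ text: 'A', isHeader: false }, { text: 'B', isHeader: false }] }],
  footerRows: [],
};
console.log(renderGfm(table));
// | 改正後 | 改正前 |
// | --- | --- |
// | A | B |
```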
Behavior: 5/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

The description discloses key behavioral traits: how it walks the StructTree, collapses kerning whitespace, and handles unsupported features (colspan, nested tables). These details go beyond the annotations (readOnlyHint, idempotentHint), which already indicate safe, non-destructive usage, providing agents with full context.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is well-structured with clear sections: main purpose, how it works, args, returns, limitations, examples. Every sentence adds value, and the most critical information (what the tool does) is front-loaded, making it easy for an agent to quickly understand and act.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the complexity of table extraction from tagged PDFs, the description covers all necessary aspects: purpose, mechanics, parameter details, return formats (including examples), and limitations. No output schema is provided, but the description compensates by detailing the JSON structure. The examples further clarify use cases.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

The input schema covers all three parameters with descriptions (100% coverage). The description adds value by explaining the page range format, enum options for response_format, and return structure details, going beyond the schema to clarify usage.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool extracts `<Table>` subtrees from tagged PDFs and provides structured output. It distinguishes itself from sibling tools like `read_text` by explicitly mentioning it handles multi-column tables where reading-order extraction fails, making its purpose highly specific.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description explains when to use (tagged PDFs with tables, especially multi-column) and notes limitations (untagged PDFs return empty, no colspan/rowspan support, nested tables skipped). It implies an alternative for non-table text extraction but does not explicitly name sibling tools for comparison, leaving room for slight ambiguity.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
