# 06. Hybrid Pipeline
## Overview
A pipeline that combines Marker's text extraction with pdfplumber's table extraction.
**Key Discovery:** Marker's `meta.json` already contains page information for every table → no need to scan the entire PDF!
```
┌─────────────────────────────────────────────────────────────────┐
│ SMART HYBRID PIPELINE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ [PDF] ──→ [Marker] ──→ [MD + meta.json] │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ 1. BUILD INDEX │ │
│ │ Table→Page map │ │
│ │ from meta.json │ │
│ └────────┬────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ 2. SCANNER │ │
│ │ Extract table │ │
│ │ sections from MD│ │
│ └────────┬────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ 3. JUDGE │ │
│ │ Quality check │ │
│ │ (6 criteria) │ │
│ └────────┬────────┘ │
│ / \ │
│ OK / \ BROKEN │
│ ↓ ↓ │
│ [Keep] ┌──────────────┐ │
│ │ 4. SURGERY │ │
│ │ Get page from │ │
│ │ INDEX → pdfplumber │ │
│ └──────┬───────┘ │
│ │ │
│ ▼ │
│ [5. INJECTION] │
│ │ │
│ ▼ │
│ [Final MD] │
│ │
└─────────────────────────────────────────────────────────────────┘
```
---
## Page Finding Strategy Comparison
| Method | Speed | Accuracy | Manual Work | Recommended |
|--------|-------|----------|-------------|-------------|
| A: Full PDF scan | Slow | High | None | No |
| B: Manual mapping | Fast | High | A lot | No |
| **C: meta.json** | **Fast** | **High** | **None** | **Yes** |
### Why C (meta.json) Is Best
1. **Marker already did the work** - no duplicate effort
2. **Has polygon coordinates** - can be reused for a Fast Mode later
3. **Survives PDF version changes** - just re-run Marker
4. **14 codes × hundreds of tables** - automation is required
```json
// meta.json example
{
"table_of_contents": [
{
"title": "Table 9.10.14.4",
"page_id": 245,
"polygon": [[x1,y1], [x2,y2], ...]
}
]
}
```
---
## Step-by-Step Details
### Step 1: Marker (Full Conversion)
```bash
# Run for each PDF
marker_single "sources/NBC2025p1.pdf" --output_dir "marker/nbc_2025"
```
**Output:**
- `nbc_2025.md` - Full Markdown
- `nbc_2025_meta.json` - Coordinate info (polygon)
- `*.jpeg` - Extracted images
**Expected Time:** 30 minutes to 1 hour per PDF (CPU)
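Running `marker_single` by hand for all 14 codes gets tedious. A minimal batch sketch, assuming the source PDFs sit in `sources/` and each output directory is named after the PDF stem (both naming choices are assumptions here, not the project's actual `batch_process.py`):
```python
# Hypothetical batch driver for Step 1; not the project's batch_process.py.
import subprocess
from pathlib import Path

for pdf in sorted(Path("sources").glob("*.pdf")):
    out_dir = Path("marker") / pdf.stem.lower()   # assumed naming scheme
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["marker_single", str(pdf), "--output_dir", str(out_dir)],
        check=True,  # stop on the first failed conversion
    )
```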
---
### Step 2: Index Builder (meta.json → Table Map)
```python
# scripts/index_builder.py
import json
import re
from pathlib import Path
from typing import Dict
def build_table_index(meta_json_path: str) -> Dict[str, int]:
"""
Create Table ID → Page number index from Marker meta.json
Returns:
{"Table 9.10.14.4": 245, "Table 9.8.2.1": 123, ...}
"""
with open(meta_json_path, 'r', encoding='utf-8') as f:
meta = json.load(f)
index = {}
# Find tables in table_of_contents
for item in meta.get('table_of_contents', []):
title = item.get('title', '')
page = item.get('page_id', 0)
# Match "Table X.X.X.X" pattern
match = re.search(r'Table\s+(\d+\.\d+\.\d+\.\d+[A-Z]?(?:-[A-Z])?)', title)
if match:
table_id = f"Table {match.group(1)}"
index[table_id] = page
return index
```
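A quick sanity check against the running example from this document (the path follows Step 1's output naming):
```python
# Illustrative usage; path follows the Step 1 output naming.
index = build_table_index("marker/nbc_2025/nbc_2025_meta.json")
print(f"{len(index)} tables indexed")
print(index.get("Table 9.10.14.4"))  # e.g. 245, per the meta.json example above
```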
---
### Step 3: Scanner (Table Detection)
```python
# scripts/scanner.py
import re
from pathlib import Path
from dataclasses import dataclass
from typing import List
@dataclass
class TableLocation:
"""Table location info"""
table_id: str
start_line: int
end_line: int
content: str
page_num: int = 0 # Filled later from Index
def scan_tables(md_path: str) -> List[TableLocation]:
"""
Find all table sections in MD file.
"""
with open(md_path, 'r', encoding='utf-8') as f:
lines = f.readlines()
tables = []
i = 0
while i < len(lines):
line = lines[i]
# Find table heading
table_match = re.match(r'^#{2,4}\s+(Table\s+\d+\.\d+\.\d+\.\d+[A-Z]?(?:-[A-Z])?)', line)
if table_match:
table_id = table_match.group(1)
start_line = i
# Find table end
j = i + 1
while j < len(lines):
if lines[j].startswith('#'):
break
j += 1
end_line = j
content = ''.join(lines[start_line:end_line])
tables.append(TableLocation(
table_id=table_id,
start_line=start_line,
end_line=end_line,
content=content
))
i = j
else:
i += 1
return tables
```
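The scanner output joins naturally with the Step 2 index to fill in `page_num` (paths illustrative):
```python
# Illustrative: attach page numbers from the meta.json index to scanned tables.
tables = scan_tables("marker/nbc_2025/nbc_2025.md")
index = build_table_index("marker/nbc_2025/nbc_2025_meta.json")
for t in tables:
    t.page_num = index.get(t.table_id, 0)  # 0 = not found in the index
```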
---
### Step 4: Judge (Quality Assessment) - 6 Criteria
```python
# scripts/judge.py
from enum import Enum
from dataclasses import dataclass, field
from typing import List
class BrokenReason(Enum):
NO_PIPES = "Flat text (no markdown table)"
COL_MISMATCH = "Column count mismatch"
EMPTY_CELLS = "Too many empty cells (>25%)"
ROWSPAN_BROKEN = "Rowspan structure broken"
NO_DATA = "No data rows"
DUPLICATE_HEADER = "Multi-page split"
TRUNCATED = "Content truncated"
@dataclass
class TableQuality:
table_id: str
is_broken: bool
reasons: List[BrokenReason] = field(default_factory=list)
confidence: float = 0.9
def judge_table(content: str, table_id: str = "") -> TableQuality:
"""
Assess markdown table quality (6 criteria)
Broken table signs:
1. No pipes (|) → Flat Text
2. Column count mismatch → Structure broken
3. Empty cells 25%+ → Data loss
4. Consecutive empty first column → Rowspan broken
5. No data rows → Header only
6. Duplicate headers → Multi-page split
"""
result = TableQuality(table_id=table_id, is_broken=False)
lines = content.strip().split('\n')
table_lines = [l for l in lines if '|' in l and not l.strip().startswith('#')]
# 1. Flat Text detection (no pipes)
if len(table_lines) < 2:
result.is_broken = True
result.reasons.append(BrokenReason.NO_PIPES)
return result
# 2. Column count consistency
col_counts = [l.count('|') - 1 for l in table_lines]
if col_counts:
variance = max(col_counts) - min(col_counts)
if variance > 1:
result.is_broken = True
result.reasons.append(BrokenReason.COL_MISMATCH)
# ... (remaining criteria)
return result
```
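The block above elides criteria 3-6. A sketch of how those checks might be completed, using the thresholds stated in the docstring and in the Broken Table Pattern Summary at the end of this document (`_judge_remaining` and `_is_separator_row` are hypothetical helpers, not the project's actual code):
```python
# Sketch of the four elided checks; thresholds come from the docstring
# above and the pattern summary table (25% empty cells, 40% empty first
# column, header-only, duplicate headers). Helper names are illustrative.

def _is_separator_row(cells: list) -> bool:
    # Rows like |---|:---:| that divide header from body
    return all(c and set(c) <= set('-: ') for c in cells)

def _judge_remaining(result: TableQuality, table_lines: list) -> None:
    # Split each pipe row into cells, dropping the outer borders
    rows = [[c.strip() for c in l.strip().strip('|').split('|')]
            for l in table_lines]
    if not rows:
        return
    data_rows = [r for r in rows[1:] if not _is_separator_row(r)]

    # 3. Empty cells > 25% -> data loss
    cells = [c for r in data_rows for c in r]
    if cells and sum(1 for c in cells if not c) / len(cells) > 0.25:
        result.is_broken = True
        result.reasons.append(BrokenReason.EMPTY_CELLS)

    # 4. Empty first column in 40%+ of data rows -> rowspan broken
    if data_rows:
        empty_first = sum(1 for r in data_rows if not r[0])
        if empty_first / len(data_rows) >= 0.4:
            result.is_broken = True
            result.reasons.append(BrokenReason.ROWSPAN_BROKEN)

    # 5. No data rows -> header only
    if not data_rows:
        result.is_broken = True
        result.reasons.append(BrokenReason.NO_DATA)

    # 6. Same header row appears 2+ times -> multi-page split
    header = tuple(rows[0])
    if sum(1 for r in rows if tuple(r) == header) >= 2:
        result.is_broken = True
        result.reasons.append(BrokenReason.DUPLICATE_HEADER)
```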
---
### Step 5: Surgery (pdfplumber)
```python
# scripts/surgery.py
import pdfplumber
from html import escape as escape_html  # stdlib escaping for cell text
from typing import Optional
def extract_table_with_pdfplumber(
pdf_path: str,
page_num: int,
table_id: str
) -> Optional[str]:
"""
Extract table as HTML using pdfplumber.
"""
with pdfplumber.open(pdf_path) as pdf:
if page_num < 1 or page_num > len(pdf.pages):
return None
page = pdf.pages[page_num - 1]
tables = page.extract_tables()
if not tables:
return None
# Select largest table
largest_table = max(tables, key=lambda t: len(t) * len(t[0]) if t and t[0] else 0)
return convert_to_html(largest_table, table_id)
def convert_to_html(table_data: list, table_id: str) -> str:
"""Convert pdfplumber table data to HTML"""
if not table_data or not table_data[0]:
return ""
html_lines = [f'<table class="building-code-table" id="{table_id.replace(" ", "-")}">']
for i, row in enumerate(table_data):
if i == 0:
html_lines.append(' <thead>')
html_lines.append(' <tr>')
for cell in row:
cell_text = escape_html(cell if cell else '')
html_lines.append(f' <th>{cell_text}</th>')
html_lines.append(' </tr>')
html_lines.append(' </thead>')
html_lines.append(' <tbody>')
else:
html_lines.append(' <tr>')
for cell in row:
cell_text = escape_html(cell if cell else '')
html_lines.append(f' <td>{cell_text}</td>')
html_lines.append(' </tr>')
html_lines.append(' </tbody>')
html_lines.append('</table>')
return '\n'.join(html_lines)
```
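Illustrative single-table repair, reusing the document's running example (page 245 comes from the Step 2 index):
```python
# Illustrative: repair one broken table using the page found in the index.
html = extract_table_with_pdfplumber(
    pdf_path="sources/NBC2025p1.pdf",
    page_num=245,                       # from build_table_index()
    table_id="Table 9.10.14.4",
)
if html is None:
    print("pdfplumber found no table on that page")
```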
---
### Step 6: Injection (Insert into MD)
```python
# scripts/injection.py
import re
from typing import Dict, Optional

def inject_tables(
    md_path: str,
    replacements: Dict[str, str],
    output_path: Optional[str] = None
) -> str:
    """Replace broken tables in MD file with HTML"""
    with open(md_path, 'r', encoding='utf-8') as f:
        content = f.read()
    for table_id, html in replacements.items():
        # Replace table heading + entire content up to the next heading.
        # Quantifier braces are doubled ({{2,4}}) because this is an f-string.
        pattern = rf'(#{{2,4}}\s+{re.escape(table_id)}.*?)(?=\n#{{2,4}}\s|\Z)'
        replacement = f'#### {table_id}\n\n{html}\n\n'
        # Use a callable so backslashes in the HTML are not treated as
        # regex group references by re.sub.
        content = re.sub(pattern, lambda m: replacement, content, flags=re.DOTALL)
    output = output_path or md_path
    with open(output, 'w', encoding='utf-8') as f:
        f.write(content)
    return output
```
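Illustrative call that writes the repaired document to a new file so the Marker original stays untouched:
```python
# Illustrative: html comes from Step 5; output_path keeps the original intact.
out = inject_tables(
    "marker/nbc_2025/nbc_2025.md",
    replacements={"Table 9.10.14.4": html},
    output_path="marker/nbc_2025/nbc_2025_fixed.md",
)
print("wrote", out)
```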
---
## Integrated Pipeline
```python
# scripts/smart_hybrid_pipeline.py
# Sibling modules in scripts/ (run from that directory or add it to sys.path)
from index_builder import build_table_index
from scanner import scan_tables
from judge import judge_table
from surgery import extract_table_with_pdfplumber
from injection import inject_tables

def run_smart_pipeline(
    pdf_path: str,
    marker_md_path: str,
    marker_meta_path: str,
    output_path: str
) -> dict:
    """
    Smart Hybrid Pipeline
    1. INDEX:   Build Table→Page map from meta.json
    2. SCAN:    Extract table sections from MD
    3. JUDGE:   Quality check (6 criteria)
    4. SURGERY: Broken tables → pdfplumber
    5. INJECT:  Insert into MD
    """
    stats = {"total": 0, "ok": 0, "broken": 0, "fixed": 0, "failed": 0}
    # STEP 1: BUILD INDEX
    table_index = build_table_index(marker_meta_path)
    # STEP 2: SCAN TABLES
    tables = scan_tables(marker_md_path)
    stats["total"] = len(tables)
    # STEP 3: JUDGE QUALITY
    broken_tables = []
    for table in tables:
        quality = judge_table(table.content, table.table_id)
        if quality.is_broken:
            table.page_num = table_index.get(table.table_id, 0)
            broken_tables.append((table, quality))
    stats["broken"] = len(broken_tables)
    stats["ok"] = stats["total"] - stats["broken"]
    # STEP 4: SURGERY
    replacements = {}
    for table, quality in broken_tables:
        html = extract_table_with_pdfplumber(pdf_path, table.page_num, table.table_id)
        if html:
            replacements[table.table_id] = html
    stats["fixed"] = len(replacements)
    stats["failed"] = stats["broken"] - stats["fixed"]
    # STEP 5: INJECTION
    inject_tables(marker_md_path, replacements, output_path)
    return stats
```
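An end-to-end invocation for one code, with file names following Step 1's output (the `_final.md` name is illustrative):
```python
# Illustrative end-to-end run for a single code.
stats = run_smart_pipeline(
    pdf_path="sources/NBC2025p1.pdf",
    marker_md_path="marker/nbc_2025/nbc_2025.md",
    marker_meta_path="marker/nbc_2025/nbc_2025_meta.json",
    output_path="marker/nbc_2025/nbc_2025_final.md",
)
print(stats)  # counts of total / ok / broken / fixed / failed tables
```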
---
## Execution Order
```bash
# 1. Marker batch execution (6-12 hours)
python scripts/batch_process.py marker
# 2. Hybrid pipeline (1-2 hours)
python scripts/batch_process.py
# 3. Check results
ls -la marker/*/
```
---
## Quality Criteria
| Item | Target |
|------|--------|
| Table detection rate | > 95% |
| Broken table detection accuracy | > 90% |
| pdfplumber extraction success rate | > 85% |
| Final table quality | > 95% |
---
## Broken Table Pattern Summary
| Pattern | Example | Detection Criteria |
|---------|---------|-------------------|
| **Flat Text** | `Row1 Col1 Col2 Row2...` | No pipes |
| **Column Mismatch** | Header 4 cols, body 3 cols | Column count diff > 1 |
| **Excessive Empty Cells** | `\| \| \| \|` | Empty cells 25%+ |
| **Rowspan Broken** | Repeated empty first column | Empty first col 40%+ |
| **Multi-page Split** | Duplicate headers | Same header 2+ times |
| **No Data** | Header only | Data rows < 1 |
---
## Next Document
→ [04_ROADMAP.md](./04_ROADMAP.md) - Development Roadmap