Zotero Chunk RAG

CLAUDE.md•20.5 KiB

# zotero-chunk-rag: Project Rules

## Hard Rules

### Never Optimise for "Minimal Refactoring" or "Minimal API Change"

When designing a fix, pick the approach that solves the problem best. Do not choose a worse approach because it touches fewer files, preserves an existing function signature, or "minimises refactoring". If the right fix requires restructuring, restructure. Code exists to serve correctness, not to avoid diffs.

### No Hard-Coded Thresholds

Every numeric threshold in the extraction pipeline MUST be adaptive: computed from the actual data on the current page or paper. Examples:

- Row clustering tolerance: derive from the actual line spacing on the page, not a fixed 5pt constant.
- Column gap detection: derive from the gap distribution of the words in the table, not a fixed ratio or minimum.
- Fill rate triggers: derive from the table's own statistics relative to other tables on the same page.
- Merge triggers (empty col0 %, sparse row %): derive from the table's structure, not global constants.

If you find yourself writing a literal number as a threshold (e.g., `if ratio > 2.0`, `if fill < 0.55`, `if len(rows) < 6`), STOP. Either:

1. Compute the threshold from the data (preferred), or
2. If you genuinely cannot avoid a fixed number, **present it to the user for approval** with at least two alternatives and an explanation of why adaptive computation isn't feasible.

This applies to ALL extraction code: `pdf_processor.py`, `feature_extraction/methods/`, `feature_extraction/postprocessors/`, `_gap_fill.py`, `section_classifier.py`, and any new modules.

### Tables in Academic Papers Are Not "Structurally Sparse"

Almost no tables in academic papers have genuinely sparse data. When extracted tables show low fill rates (many empty cells), the cause is almost always an extraction error:

- **Over-detected columns**: Gap threshold too low → one real column becomes 3-4 columns, most of which are empty.
- **Inline headers**: Sub-group headings that span the full table width (e.g. "Panel A: Males") appear as rows with content only in column 0.
- **Continuation rows**: Multi-line cell content split across rows, where the continuation row has content only in the column being continued.

Do NOT report low fill rates as "structural sparsity", "inherent to the PDF", or "the table genuinely has many empty cells". Investigate the actual cause. If a table has < 70% fill, something is wrong with the extraction.

### Never Dismiss Problems as "Inherent to the PDF" or "PyMuPDF Limitation"

PDFs are not deficient; if they were, human readers wouldn't use them. PyMuPDF's high-level functions (`find_tables()`, `get_text()`) do have limitations, but PyMuPDF also provides low-level tools (`get_text("words")`, `get_text("dict")`, `get_text("rawdict")`, `get_image_info()`, `get_drawings()`, page geometry, font metadata) that can work around those limitations.

When a `find_tables()` result is wrong:

- Re-extract from word positions (`page.get_text("words")`)
- Use font metadata (`get_text("dict")`) to detect headers, footnotes, captions
- Use drawing/line data (`get_drawings()`) to find actual table rules
- Use block structure to detect boundaries

The correct response to "find_tables() merges two tables" is "detect the merge and split using word positions and caption detection", not "this is a PyMuPDF limitation".

### Test Integrity

Test conditions and assertions are NEVER modified to fit the PDF corpus, perceived limitations of extraction methods, or to make a failing test pass. If a test fails, the code is wrong: fix the code, not the test.

Tests assert 100% functionality or are regarded as a complete failure. There is no "80% correct is good enough." If a table has 10 columns, the extraction must find 10 columns. If a cell contains "0.047", the extraction must return "0.047".

Tests on synthetic data (programmatically generated PDFs, mock objects) are syntax checks only.
They verify that functions accept the right arguments and return the right types. They tell you NOTHING about whether extraction works on real papers. The only data that matters is real-paper performance measured by the stress test.

### Extraction Time Budget

Extra extraction time is acceptable. The pipeline should try every available method and pick the best result, not skip methods to save time. Correctness always beats speed. If running all methods takes 30 seconds per table instead of 3 seconds, that is fine.

### Performance Testing

**The stress test is the ground truth.** Synthetic unit tests validate individual functions but tell you nothing about whether the pipeline actually works on real papers.

#### How to run it

```bash
"./.venv/Scripts/python.exe" tests/stress_test_real_library.py
```

This takes ~2-5 minutes. It:

1. Loads 10 specific papers from the user's live Zotero library (corpus defined in `CORPUS` at top of file)
2. Extracts each through the full pipeline (text, tables, figures, sections)
3. Indexes into a temp ChromaDB with local embeddings
4. Runs ~290 assertions across extraction quality, search accuracy, table/figure search, metadata filtering, and context expansion
5. Produces `STRESS_TEST_REPORT.md` and `_stress_test_debug.db`

#### The debug database

`_stress_test_debug.db` is a SQLite database containing every extracted artifact from the stress test run.
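Because the debug database is plain SQLite, it can be inspected from Python with the standard library alone. The sketch below lists the tables present and their row counts; the helper name `table_row_counts` is hypothetical and not part of the project.

```python
import sqlite3


def table_row_counts(conn: sqlite3.Connection) -> dict[str, int]:
    """Map each user table in the database to its row count.

    Hypothetical helper for a first look at the debug database;
    it is not part of the zotero-chunk-rag code base.
    """
    names = [
        name
        for (name,) in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    # Table names come from sqlite_master, so quoting them is enough here.
    return {
        name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
        for name in names
    }
```

After a stress test run, calling `table_row_counts(sqlite3.connect("_stress_test_debug.db"))` gives a quick overview before writing targeted queries.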
**Core tables** (written by the main extraction pipeline):

| Table | Contents |
|-------|----------|
| `papers` | Per-paper: pages, chunks, grade, completeness fields, full markdown |
| `extracted_tables` | Every table: caption, headers_json, rows_json, fill_rate, bbox, artifact_type, rendered markdown |
| `extracted_figures` | Every figure: caption, bbox, image_path, reference_context |
| `sections` | Every section span: label, heading, char offsets, confidence |
| `chunks` | Every text chunk: section, page, text, char offsets |
| `pages` | Per-page markdown |
| `test_results` | Every assertion: test_name, paper, passed, detail, severity |
| `run_metadata` | Timestamps, counts, timings |

**Extended tables** (written by `_test_pipeline_methods()` for per-method analysis):

| Table | Contents |
|-------|----------|
| `method_results` | Per-table per-method: `table_id`, `method_name` (format: `structure+cell`), `boundary_hypotheses_json`, `cell_grid_json`, `quality_score` (GT accuracy or fill rate), `execution_time_ms` |
| `pipeline_runs` | Per-table pipeline outcome: `table_id`, `pipeline_config_json`, `winning_method` (format: `structure:cell`), `final_score` (GT accuracy) |
| `ground_truth_diffs` | Per-table GT comparison: `table_id`, `run_id`, `diff_json` (full `ComparisonResult`), `cell_accuracy_pct`, `num_splits`, `num_merges`, `num_cell_diffs`, `gt_shape`, `ext_shape` |

**Use this database to audit extraction quality.** Example queries:

```sql
-- Tables with low fill rates (likely broken extraction)
SELECT short_name, caption, fill_rate, num_rows, num_cols
FROM extracted_tables
JOIN papers USING(item_key)
WHERE fill_rate < 0.5 AND artifact_type IS NULL;

-- Find decimal displacement (T1): cells starting with "."
SELECT short_name, caption, rows_json
FROM extracted_tables
JOIN papers USING(item_key)
WHERE rows_json LIKE '%".%' AND artifact_type IS NULL;

-- Papers with unmatched table captions (extraction missed a table)
SELECT short_name, unmatched_table_captions
FROM papers
WHERE unmatched_table_captions != '[]';

-- All MAJOR failures
SELECT test_name, paper, detail
FROM test_results
WHERE passed = 0 AND severity = 'MAJOR';
```

**Per-method analysis queries** (require extended tables):

```sql
-- Which structure method wins most often (highest GT accuracy)?
SELECT SUBSTR(method_name, 1, INSTR(method_name, '+') - 1) AS structure,
       COUNT(*) AS wins
FROM method_results mr
WHERE quality_score = (
    SELECT MAX(quality_score)
    FROM method_results
    WHERE table_id = mr.table_id AND quality_score IS NOT NULL
)
GROUP BY structure
ORDER BY wins DESC;

-- Per-table: best single method vs pipeline consensus accuracy
SELECT mr.table_id,
       MAX(mr.quality_score) AS best_single,
       gtd.cell_accuracy_pct AS pipeline_acc,
       gtd.cell_accuracy_pct - MAX(mr.quality_score) AS delta
FROM method_results mr
LEFT JOIN ground_truth_diffs gtd ON mr.table_id = gtd.table_id
WHERE mr.quality_score IS NOT NULL
GROUP BY mr.table_id
ORDER BY delta;

-- Tables where combination HURTS (best method > pipeline)
SELECT mr.table_id,
       MAX(mr.quality_score) AS best,
       gtd.cell_accuracy_pct AS pipeline
FROM method_results mr
JOIN ground_truth_diffs gtd ON mr.table_id = gtd.table_id
WHERE mr.quality_score IS NOT NULL
GROUP BY mr.table_id
HAVING MAX(mr.quality_score) > gtd.cell_accuracy_pct;

-- All methods tried for a specific table, ranked by quality
SELECT method_name, quality_score
FROM method_results
WHERE table_id = '<TABLE_ID>' AND quality_score IS NOT NULL
ORDER BY quality_score DESC;

-- Pipeline winning method distribution
SELECT winning_method, COUNT(*) AS tables, AVG(final_score) AS avg_accuracy
FROM pipeline_runs
GROUP BY winning_method
ORDER BY tables DESC;

-- GT comparison summary: worst tables
SELECT table_id, cell_accuracy_pct, num_splits, num_merges, num_cell_diffs
FROM ground_truth_diffs
ORDER BY cell_accuracy_pct ASC
LIMIT 20;
```

#### When to run it

Run the stress test after any change to the extraction pipeline. A change that passes unit tests but regresses the stress test is a bad change. The stress test verdict (MAJOR failures = unreliable, minor = rough edges, all pass = reliable) is the acceptance criterion.

#### Severity rules

- **MAJOR**: Orphan tables/figures (extraction missed something the paper contains), search failures (researcher can't find what's in the paper), data corruption (table values wrong enough to mislead).
- **MINOR**: Section detection misses, abstract detection misses, chunk count deviations. Annoying but won't cause a researcher to reach wrong conclusions.

## Architecture Notes

### Key files

| File | Role |
|------|------|
| `src/zotero_chunk_rag/pdf_processor.py` | Document extraction entry point, prose tables, cross-page coordination, stats/completeness |
| `src/zotero_chunk_rag/feature_extraction/pipeline.py` | Pipeline orchestrator: page-level detection, per-table multi-method extraction, scoring/selection |
| `src/zotero_chunk_rag/feature_extraction/methods/` | Structure detection methods (pymupdf, ruled lines, camelot, pdfplumber, hotspot, cliff, header anchor) and cell extraction methods (rawdict, words, pdfminer) |
| `src/zotero_chunk_rag/feature_extraction/postprocessors/` | Post-processors (caption strip, header detection, continuation merge, inline headers, footnotes, cell cleaning) |
| `src/zotero_chunk_rag/feature_extraction/models.py` | Pipeline data models (BoundaryHypothesis, CellGrid, ExtractionResult, TableContext, PipelineConfig) |
| `src/zotero_chunk_rag/feature_extraction/scoring.py` | Quality scoring framework for grid selection |
| `src/zotero_chunk_rag/feature_extraction/combination.py` | Boundary combination engine |
| `src/zotero_chunk_rag/feature_extraction/captions.py` | Unified caption detection (table + figure) |
| `src/zotero_chunk_rag/_gap_fill.py` | Post-extraction recovery pass for orphan captions |
| `src/zotero_chunk_rag/section_classifier.py` | Section heading classification |
| `src/zotero_chunk_rag/models.py` | Dataclasses: ExtractedTable, ExtractedFigure, SectionSpan, etc. |
| `src/zotero_chunk_rag/_reference_matcher.py` | Maps figures/tables to body-text chunks that cite them |
| `tests/stress_test_real_library.py` | 10-paper stress test (run directly, not via pytest) |
| `tests/tune_weights.py` | Data-driven weight tuning from stress test results |
| `tests/pipeline_weights.json` | Confidence multipliers for pipeline methods |
| `spec/pipeline_operators_guide.md` | Operator's guide: debug DB, comparison, scoring, weight tuning, diagnostic workflows, SQL cookbook |

### Architecture overview

The extraction pipeline uses a multi-method approach:

1. **Page-level detection** (`Pipeline.extract_page()`): finds table bboxes via `find_tables()` (3 strategies with dedup), detects figures via unified captions, matches captions to tables, classifies figure-data-table overlaps.
2. **Per-table extraction** (`Pipeline.extract()`): runs multiple structure detection methods in parallel (pymupdf, ruled lines, camelot, pdfplumber, hotspot, cliff, header anchor), combines boundary hypotheses, extracts cell text via multiple methods (rawdict, word assignment, pdfminer), scores/selects best grid, applies post-processors in canonical order.
3. **Cross-page coordination** (in `pdf_processor.py`): gap fill for orphan captions, heading/continuation caption assignment, artifact tagging, completeness grading.
4. **Named configs**: `DEFAULT_CONFIG` (all methods), `FAST_CONFIG` (subset), `RULED_CONFIG` (boosted ruled-line weight), `MINIMAL_CONFIG` (baseline).

### Ground truth comparison framework

`feature_extraction/ground_truth.py` provides `compare_extraction()`, the core comparison function that diffs an extraction attempt against verified ground truth.

**Table IDs**: Generated by `make_table_id(paper_key, caption, page_num, index)`:

- Captioned: `{paper_key}_table_{N}` (e.g., `5SIZVS65_table_1`)
- Continuation: `{paper_key}_table_{N}_p{page}` (e.g., `SCPXVBLY_table_1_p16`)
- Orphan: `{paper_key}_orphan_p{page}_t{index}`

**ComparisonResult fields**:

| Field | Meaning |
|-------|---------|
| `gt_shape`, `ext_shape` | (rows, cols) of ground truth vs extraction |
| `matched_columns` | List of (gt_col, ext_col) index pairs aligned by header text |
| `extra_columns` | Extraction columns with no GT match (over-detection) |
| `missing_columns` | GT columns not found in extraction (under-detection) |
| `column_splits` | One GT column split into multiple extraction columns |
| `column_merges` | Multiple GT columns merged into one extraction column |
| `matched_rows`, `extra_rows`, `missing_rows` | Same as columns but for rows |
| `row_splits`, `row_merges` | Row-level structural mismatches |
| `cell_diffs` | Individual cell mismatches (row, col, expected, actual) |
| `cell_accuracy_pct` | Percentage of comparable cells that match (0-100) |
| `structural_coverage_pct` | Fraction of GT cells that were comparable |
| `header_diffs` | Header-level cell mismatches |
| `footnote_match` | Whether extracted footnotes match GT footnotes |

**Column alignment algorithm** (4-pass):

1. Exact match on normalized header text
2. LCS (longest common substring) fallback for ≥80% similarity
3. Split/merge detection: concatenation of adjacent headers
4. Positional matching for empty-string headers

**Row alignment**: Sequential walk with split/merge detection (up to 3-row spans) and skip-ahead for spurious extra rows.

**Cell normalization**: Whitespace collapse, dash/hyphen unification (unicode minus, en-dash, em-dash → ASCII hyphen), ligature expansion (ff, fi, fl, ffi, ffl).
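
The cell normalization rules can be sketched in a few lines. This is a minimal illustration, not the project's implementation: the name `normalize_cell` is hypothetical, and trimming leading and trailing whitespace is an assumption beyond the "whitespace collapse" described above.

```python
import re

# Unicode minus, en-dash, and em-dash all map to the ASCII hyphen.
_DASHES = {"\u2212": "-", "\u2013": "-", "\u2014": "-"}

# Typographic ligature code points expand to their plain-letter sequences.
_LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}

_TRANSLATION = str.maketrans({**_DASHES, **_LIGATURES})


def normalize_cell(text: str) -> str:
    """Normalize a cell string before comparison (hypothetical helper).

    Applies dash unification and ligature expansion, then collapses
    runs of whitespace to a single space. Stripping the ends is an
    assumption made for this sketch.
    """
    text = text.translate(_TRANSLATION)
    return re.sub(r"\s+", " ", text).strip()
```

With this, a ground-truth cell containing `-0.047` compares equal to an extracted cell that uses a Unicode minus sign, which is the kind of mismatch normalization is meant to hide.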

### Scoring and grid selection

`feature_extraction/scoring.py` uses **rank-based selection**: each grid is ranked across multiple quality metrics, ranks are summed, and the lowest total rank wins. No absolute weights to calibrate.

**Metrics** (4 built-in, optional GT accuracy):

| Metric | Direction | Meaning |
|--------|-----------|---------|
| `fill_rate` | Higher = better | Fraction of non-empty cells |
| `decimal_displacement_count` | Lower = better | Cells matching `^\.\d+` (leading dot without zero) |
| `garbled_text_score` | Lower = better | Fraction of cells with avg word length > 25 |
| `numeric_coherence` | Higher = better | Fraction of numeric columns that are all-numeric or all-text |
| GT accuracy (optional) | Higher = better | Cell accuracy vs ground truth when `ground_truth_fn` is provided |

**Grid identification**: Each grid has a composite key `structure_method:cell_method` (e.g., `single_point_hotspot:rawdict`, `consensus:word_assignment`).

### Pipeline configurations

**Named configs** (defined in `pipeline.py`):

| Config | Structure Methods | Cell Methods | Post-Processors | Notes |
|--------|------------------|-------------|-----------------|-------|
| `DEFAULT_CONFIG` | All 13 | All 3 | All 7 | Full pipeline, activation rules for camelot/cliff |
| `FAST_CONFIG` | PyMuPDFLines, GapSpanHotspot | Rawdict, WordAssignment | All 7 | Speed-focused |
| `RULED_CONFIG` | All 13 | All 3 | All 7 | Boosted `ruled_lines` multiplier (3.0×) |
| `MINIMAL_CONFIG` | PyMuPDFLines only | Rawdict only | AbsorbedCaption + CellCleaning | Baseline |

**Activation rules** gate method execution:

- `camelot_lattice`, `camelot_hybrid`: Only run when `has_ruled_lines(ctx)` is True
- `global_cliff`, `per_row_cliff`: Only run when `has_ruled_lines(ctx)` is False

**Confidence multipliers** (`PipelineConfig.confidence_multipliers`): Scale boundary confidence during combination. Higher multiplier → more influence on consensus.
Loaded from `tests/pipeline_weights.json` at `Pipeline.__init__()`.

### Weight tuning workflow

1. **Run the stress test** to generate `_stress_test_debug.db`
2. **Run weight tuning** to compute win rates and generate multipliers:

   ```bash
   "./.venv/Scripts/python.exe" tests/tune_weights.py
   ```

3. **Output**: `tests/pipeline_weights.json` containing:

   ```json
   { "confidence_multipliers": { "single_point_hotspot": 1.0, "gap_span_hotspot": 0.5, ... } }
   ```

**How it works**: For each table in `method_results`, the structure method whose boundaries produced the highest GT accuracy gets a "win". Win rate = wins / participation count. The best method gets multiplier 1.0; others are proportional. Zero-win methods get a floor of 0.1.

**Pipeline reads this at init**: `Pipeline.__init__()` checks for the weights file and merges its multipliers with the config's defaults (file values override).

### Combination algorithm

`combine_hypotheses()` uses per-divider voting with median method count acceptance:

1. **Confidence scaling**: Each boundary point's confidence is multiplied by its method's multiplier (from `pipeline_weights.json`).
2. **Point expansion**: Narrow boundary ranges (span < `spatial_precision`) are expanded symmetrically around their midpoint. `spatial_precision` is derived adaptively from the first available source, in priority order: ruled line thickness, then median word gap, then median word height.
3. **Overlap merge**: Expanded points are sorted by position and merged into clusters where ranges overlap.
4. **Acceptance**: Compute each cluster's distinct method count. The acceptance threshold is the median of all clusters' method counts. Clusters meeting or exceeding this threshold are accepted. Ruled line boundaries (`provenance == "ruled_lines"`) are unconditionally accepted regardless of method count.
5. **Consensus confidence**: Each accepted boundary's confidence is the mean of its constituent points' scaled confidences.

Single-hypothesis input is passed through unchanged. Empty input returns empty boundaries.
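
The acceptance step (step 4) can be illustrated with a small standalone sketch. It is not the code in `combination.py`: the `Cluster` dataclass and `accept_clusters` function are hypothetical, provenance is simplified to a method name in the cluster's set, and checking the threshold before the ruled-line override is an assumption about ordering.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Cluster:
    """Hypothetical stand-in for a merged cluster of boundary points."""

    position: float
    method_names: frozenset[str]  # distinct methods that voted for this divider


def accept_clusters(clusters: list[Cluster]) -> list[tuple[Cluster, str]]:
    """Label each cluster using median method count acceptance.

    The threshold is computed from the clusters themselves, so no fixed
    number is involved. Ruled-line clusters are always accepted.
    """
    if not clusters:
        return []
    threshold = median(len(c.method_names) for c in clusters)
    labelled = []
    for cluster in clusters:
        if len(cluster.method_names) >= threshold:
            reason = "above_threshold"
        elif "ruled_lines" in cluster.method_names:
            reason = "ruled_line_override"
        else:
            reason = "rejected"
        labelled.append((cluster, reason))
    return labelled
```

For four clusters backed by 3, 1, 2, and 1 methods, the median is 1.5, so the first and third are accepted on votes alone. A single-method cluster survives only if that method is `ruled_lines`.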

### Combination tracing

`combine_hypotheses(hypotheses, ctx, trace=True)` returns a `(BoundaryHypothesis, CombinationTrace)` tuple for diagnostics.

**CombinationTrace** contains:

- `col_trace` / `row_trace` (`AxisTrace`): Per-axis combination details
- `spatial_precision`: Adaptive merge tolerance (derived from ruled line thickness, median word gap, or median word height)
- `source_methods`: List of structure methods that contributed hypotheses

**AxisTrace** contains:

- `input_points`: All `BoundaryPoint`s from all methods
- `expansions`: How each point was expanded (narrow → widened, wide → unchanged)
- `clusters`: `ClusterRecord` objects with `method_names`, `acceptance_reason` (`"above_threshold"`, `"ruled_line_override"`, `"rejected"`), and `distinct_methods`
- `acceptance_threshold`: Median method count used as the acceptance cutoff
- `median_method_count`: The computed median of distinct method counts across clusters
- `accepted_positions`: Final boundary positions

### Stress test report sections

The stress test produces `STRESS_TEST_REPORT.md` with these sections:

1. **Executive Summary**: Total tests, pass rate, verdict
2. **Performance**: Indexing time
3. **Extraction Quality per Paper**: Pages, sections, tables, figures, grade, issues
4. **Failures (Detailed)**: Per-failure explanation with severity
5. **Passes**: Full assertion pass list
6. **Ground Truth Comparison**: Per-table cell accuracy, splits, merges, cell diffs
7. **Pipeline Depth Report**: Per-method win rates, combination value (best-single vs consensus), per-table accuracy chain
8. **Variant Comparison**: All 4 named configs compared on accuracy and speed

### Known shortcomings

The pipeline addresses most historical extraction issues (T1-T8, D1-D7 from the original `table_shortcomings.md`) through adaptive thresholds and multi-method consensus. Remaining edge cases are tracked via the stress test and ground truth database (44 tables across 10 papers).
