Zotero Chunk RAG

progress.md•17.7 KiB

# Implementation Progress

Progress is recorded here by implementation agents. Each completed task appends its status below.

## Baseline (pre-plan)

- Stress test: 250/291 passed (86%), 21 MAJOR failures
- GT cell accuracy: 73.5% (pipeline), 100% (best single method)
- Best config: FAST (84.9%) > DEFAULT (75.6%)
- Root cause: consensus combination creates phantom boundaries
- Agent QA: infrastructure built, never executed

## Phase 1: Pipeline Architecture Fix

### Task 1.1.1: Add `structure_method` field to CellGrid

- **Status**: complete
- `CellGrid.structure_method` field added with default `"consensus"`
- `CellGrid.with_structure_method()` helper returns copy with field replaced
- `CellGrid.to_dict()` includes `structure_method`
- **Files modified**: `feature_extraction/models.py`
- **Tests**: 4 methods in `TestCellGridProvenance` (test_integration.py)

### Task 1.1.2: Restructure `Pipeline.extract()` to score all grids

- **Status**: complete
- `extract()` runs cell extraction per-structure-method + consensus
- `result.cell_grids` contains grids from multiple structure methods
- Composite keys `"structure:cell"` in `grid_scores`
- **Files modified**: `feature_extraction/pipeline.py`, `feature_extraction/scoring.py`
- **Tests**: 5 methods in `TestExtractAllGrids` (test_integration.py)

### Task 1.1.3: Remove `extract_with_all_boundaries()` and update stress test

- **Status**: complete
- `extract_with_all_boundaries()` deleted from `Pipeline`
- Stress test uses `pipeline.extract()` with `grid.structure_method` grouping
- **Files modified**: `feature_extraction/pipeline.py`, `tests/stress_test_real_library.py`
- **Tests**: 1 method in `TestExtractWithAllBoundariesRemoved` (test_integration.py)

### Task 1.2.1: Replace column-count grouping tolerance

- **Status**: complete (superseded by Phase 0 + Phase 2 rewrite)
- **Files modified**: `feature_extraction/combination.py`

### Task 1.2.2: Replace expansion/tolerance/acceptance thresholds

- **Status**: complete (acceptance logic superseded by Phase 2 median method count)
- `_compute_spatial_precision()` priority chain: ruled line thickness -> word gap -> word height (no multiplier constants)
- `_compute_expansion_threshold()` and `_compute_tolerance()` deleted
- `CombinationTrace` has `spatial_precision` field
- **Files modified**: `feature_extraction/combination.py`, `feature_extraction/models.py`
- **Tests**: 4 methods in `TestSpatialPrecision`, 2 methods in `TestAcceptanceThreshold` (test_combination.py)

### Task 1.3.1: Combination trace tests

- **Status**: complete
- 6 test methods in `TestCombinationTrace` (test_combination.py)
- Covers: empty input, single input, multi input, trace=True, trace=False
- **Files modified**: `tests/test_combination.py`

### Task 1.3.2: Pipeline integration tests with real PDF

- **Status**: complete
- 11 test methods across `TestCellGridProvenance`, `TestExtractAllGrids`, `TestExtractWithAllBoundariesRemoved`
- Uses `noname1.pdf` fixture
- **Files modified**: `tests/test_feature_extraction/test_integration.py`

## Phase 1 Review Fixes

- Replaced `pytest.skip` with `pytest.fail` in `_pipeline_result` fixture (V1)
- Switched fixture from `noname2.pdf` to spec-mandated `noname1.pdf` (V2)
- Changed acceptance threshold (later superseded by Phase 2 median method count) (V3)
- Added `TestSpatialPrecision` (4 tests), `TestColumnGrouping` (5 tests) (G1, G4)
- Added `test_percentile_based_acceptance`, `test_single_cluster_always_accepted` (G2, G3)
- Strengthened `test_winning_grid_has_provenance` assertion (WT2)
- Updated `progress.md` to reflect current spec tasks (G5)

## Phase 0: Dead Code Removal

### Task 0.1.1: Remove column-count grouping and fix vacuous accuracy default

- **Status**: complete
- **Agent**: implementer
- **Files modified**: `src/zotero_chunk_rag/feature_extraction/combination.py`, `src/zotero_chunk_rag/feature_extraction/ground_truth.py`, `tests/test_combination.py`
- **Tests**: 16/24 passing (8 fail with expected NotImplementedError per spec -- restored in Phase 2)
- `_select_best_column_group()` and `_group_confidence()` deleted from combination.py
- Multi-hypothesis path in `combine_hypotheses()` raises `NotImplementedError("awaiting per-divider voting rewrite")`
- Empty-list and single-hypothesis passthrough unchanged and working
- `ground_truth.py` cell_accuracy_pct default changed from `else 100.0` to `else 0.0`
- `TestColumnGrouping` class and `_make_hyp` helper deleted from test_combination.py
- `_select_best_column_group` removed from test imports

---

## Wave 0.1 Summary

- **Status**: complete
- **Tasks completed**: 1/1
- **Rounds**: 1

## Phase 1 (new): Accuracy Metric Rewrite

### Task 1.1.1: Fuzzy cell scoring function

- **Status**: complete
- **Agent**: implementer
- **Files modified**: `src/zotero_chunk_rag/feature_extraction/ground_truth.py`, `tests/test_feature_extraction/test_ground_truth.py`
- **Tests**: 12/12 passing
- `_is_numeric_cell()` and `_fuzzy_cell_score()` added after `_normalize_cell()`
- Exact match -> 1.0, both empty -> 1.0, one empty -> 0.0, numeric mismatch -> 0.0, text -> LCS ratio

### Task 1.1.2: Compute fuzzy accuracy (precision, recall, F1)

- **Status**: complete
- **Agent**: implementer
- **Files modified**: `src/zotero_chunk_rag/feature_extraction/ground_truth.py`, `tests/test_feature_extraction/test_ground_truth.py`
- **Tests**: 7/7 passing
- `_compute_fuzzy_accuracy()` added after `_fuzzy_cell_score()`
- Bag-of-cells matching with best-first greedy assignment
- Returns (precision, recall, F1) as fractions

### Task 1.1.3: Add fuzzy fields to ComparisonResult

- **Status**: complete
- **Agent**: implementer
- **Files modified**: `src/zotero_chunk_rag/feature_extraction/ground_truth.py`, `tests/test_feature_extraction/test_ground_truth.py`
- **Tests**: 3/3 passing
- `fuzzy_accuracy_pct`, `fuzzy_precision_pct`, `fuzzy_recall_pct` fields added to ComparisonResult
- `compare_extraction()` computes and populates fuzzy fields
- Fuzzy metric produces meaningful scores even when structural alignment fails
- Also fixed 2 pre-existing test failures (test_split_row_coverage_penalty, test_artifact_table) caused by Phase 0 cell_accuracy_pct default change

---

## Wave 1.1 Summary

- **Status**: complete
- **Tasks completed**: 3/3
- **Rounds**: 1

## Phase 2: Combination Engine Rewrite

### Task 2.1.1: Rewrite combine_hypotheses() with per-divider voting

- **Status**: complete
- **Agent**: implementer
- **Files modified**: `src/zotero_chunk_rag/feature_extraction/combination.py`, `tests/test_combination.py`
- **Tests**: 37/37 passing
- `NotImplementedError` stub replaced with per-divider voting algorithm
- `_scale_point()` helper added for confidence multiplier scaling
- `confidence_multipliers` parameter added to `combine_hypotheses()` signature
- `_combine_axis()` acceptance changed from Q1-confidence to median-method-count threshold
- Ruled line override: clusters with `provenance == "ruled_lines"` always accepted
- Consensus boundary confidence is mean of constituent points (not sum)
- 8 tests broken by Phase 0 now pass; 6 new test classes added (TestMethodCountRejection, TestConfidenceMultipliers, TestRuledLineOverride, TestScalePoint)
- `test_percentile_based_acceptance` renamed to `test_equal_method_agreement_all_pass`, assertion updated to 5 boundaries

### Task 2.1.2: Pass confidence multipliers from Pipeline

- **Status**: complete
- **Agent**: implementer
- **Files modified**: `src/zotero_chunk_rag/feature_extraction/pipeline.py`
- **Tests**: 26/26 integration tests passing
- `Pipeline.extract()` now passes `self._config.confidence_multipliers` to `combine_hypotheses()`

---

## Wave 2.1 Summary

- **Status**: complete
- **Tasks completed**: 2/2
- **Rounds**: 1

### Task 2.2.1: Update trace models for per-divider voting

- **Status**: complete
- **Agent**: implementer
- **Files modified**: `src/zotero_chunk_rag/feature_extraction/models.py`, `src/zotero_chunk_rag/feature_extraction/combination.py`, `tests/test_combination.py`
- **Tests**: 37/37 passing
- `ClusterRecord`: added `method_names: list[str]` and `acceptance_reason: str` fields
- `AxisTrace`: added `median_method_count: float` field
- `_combine_axis()` populates new fields when `collect_trace=True`
- `_make_passthrough_trace()` sets appropriate values for new fields
- 3 new trace tests: `test_trace_cluster_has_method_names`, `test_trace_has_median_method_count`, `test_acceptance_reason_values`

---

## Wave 2.2 Summary

- **Status**: complete
- **Tasks completed**: 1/1
- **Rounds**: 1

### Task 1.2.1: Update stress test to use fuzzy metrics

- **Status**: complete
- **Agent**: implementer
- **Files modified**: `src/zotero_chunk_rag/feature_extraction/debug_db.py`, `tests/stress_test_real_library.py`, `tests/test_feature_extraction/test_debug_db.py`
- **Tests**: 6/6 passing
- Added `fuzzy_accuracy_pct`, `fuzzy_precision_pct`, `fuzzy_recall_pct` columns to `ground_truth_diffs` schema
- Updated `write_ground_truth_diff()` to write fuzzy fields from ComparisonResult
- Changed `quality_score` in `method_results` to use `cmp.fuzzy_accuracy_pct` instead of `cmp.cell_accuracy_pct`
- Changed `final_score` lookup in `pipeline_runs` to use `fuzzy_accuracy_pct` from `ground_truth_diffs`
- Updated GT comparison report section to show fuzzy accuracy, precision, and recall
- Updated pipeline depth report combination-value analysis to use `fuzzy_accuracy_pct`
- Updated variant comparison to use `cmp.fuzzy_accuracy_pct`
- Added `TestExtendedSchema::test_ground_truth_diffs_has_fuzzy_columns` test

---

## Wave 1.2 Summary

- **Status**: complete
- **Tasks completed**: 1/1
- **Rounds**: 1

## Phase 3: LLM Prompt Method

### Task 3.1.1: Generate table PNGs and prompts

- **Status**: complete
- **Agent**: implementer
- **Files created**: `tests/llm_structure/__init__.py`, `tests/llm_structure/generate_prompts.py`
- **Tests**: 2/2 passing (test_script_importable, test_manifest_schema)
- Script generates cropped PNG + prompt for each GT table
- Resolves PDF paths via --corpus-json or --from-stress-test (Zotero library)
- find_table_bbox uses find_tables(), caption search, and word-bbox fallback
- Manifest JSON lists all generated tables with table_id, pdf_path, page_num, bbox

### Task 3.1.2: Prompt template

- **Status**: complete
- **Agent**: implementer
- **Files created**: `tests/llm_structure/prompt_template.md`
- **Tests**: 2/2 passing (test_template_exists, test_template_contains_json_example)
- Coordinate system: fractions of table bbox (0.0-1.0)
- Requests internal dividers only (not outer edges)
- JSON output format with position + confidence fields
- Includes worked 3-column, 4-row example

---

## Wave 3.1 Summary

- **Status**: complete
- **Tasks completed**: 2/2
- **Rounds**: 1

### Task 3.2.1: Parse LLM responses into BoundaryHypothesis objects

- **Status**: complete
- **Agent**: implementer
- **Files created**: `tests/llm_structure/parse_responses.py`
- **Tests**: 4/4 passing (test_valid_json_parsed, test_markdown_fenced_json, test_invalid_json_returns_none, test_confidence_mapping)
- Handles markdown code fences, extra text around JSON
- Validates structure: columns/rows arrays with position and confidence
- Converts fractional positions to absolute PDF coordinates
- Maps confidence labels: high=0.9, medium=0.6, low=0.3
- Invalid input returns None with error message

### Task 3.2.2: Inject LLM boundaries and evaluate against GT

- **Status**: complete
- **Agent**: implementer
- **Files created**: `tests/llm_structure/inject_and_evaluate.py`
- **Tests**: 1/1 passing (test_script_importable)
- Runs all 3 cell extraction methods (rawdict, word_assignment, pdfminer) against LLM boundaries
- Computes fuzzy_accuracy_pct via compare_extraction()
- Writes method_results rows to debug DB (llm_sonnet+rawdict, etc.)
- Prints summary table with per-table per-method accuracy

---

## Wave 3.2 Summary

- **Status**: complete
- **Tasks completed**: 2/2
- **Rounds**: 1

### Task 3.3.1: Comparison report generator

- **Status**: complete
- **Agent**: implementer
- **Files created**: `tests/llm_structure/compare_models.py`
- **Tests**: 1/1 passing (test_script_importable)
- Reads method_results from debug DB for LLM and pipeline entries
- Generates comparison_report.md with: per-table accuracy, LLM wins/losses, win rates, summary statistics
- Compares Sonnet, Haiku, pipeline best, and consensus accuracy

---

## Wave 3.3 Summary

- **Status**: complete
- **Tasks completed**: 1/1
- **Rounds**: 1

## Phase 4: Validation

### Task 4.1.1: Run stress test and verify targets

- **Status**: complete (targets NOT met)
- **Agent**: orchestrator (execution task)
- **Files modified**: None
- **Tests**: Stress test ran successfully (272 tests, 247 passed, 25 failed)
- **Acceptance criteria results**:
  1. **0 MAJOR failures**: FAILED -- 4 MAJOR failures
     - missing-figures (active-inference-tutorial): 1 figure caption with no extracted image
     - unmatched-captions (active-inference-tutorial): appendix captions A.1 (fig), A.1/A.2/A.3 (tables) unmatched
     - table-dimensions-sanity (roland-emg-filter): 1 degenerate 1x1 table
     - table-dimensions-sanity (fortune-impedance): 1 degenerate 1x1 table
  2. **Overall fuzzy GT accuracy >= 95%**: FAILED -- 8.4% (39 tables compared)
  3. **Consensus delta >= -5%**: FAILED -- avg delta is -53.6% (best single method 62.0% vs consensus 8.4%)
     - 37 of 39 tables show consensus worse than best single method
     - Worst deltas: -97.6% (DPYRZTFI_table_1), -93.5% (VP3NJ74M_table_4), -91.2% (Z9X4JVZ5_table_1)
  4. **DEFAULT config >= FAST and MINIMAL**: FAILED
     - DEFAULT: 7.2%, FAST: 7.7%, RULED: 7.2%, MINIMAL: 55.7%
     - MINIMAL (PyMuPDFLines only + rawdict) dramatically outperforms all multi-method configs
- **Root cause diagnosis**: The consensus combination engine is creating phantom boundaries that destroy table structure. When multiple methods contribute boundary hypotheses, the voting/merging produces boundaries that don't correspond to actual table dividers. MINIMAL config avoids this entirely by using only PyMuPDFLines boundaries without combination, achieving 55.7% accuracy. The per-divider voting rewrite (Phase 2) has not fixed the fundamental combination problem -- it may have made it worse by changing acceptance thresholds. The scoring/selection system in Pipeline.extract() is also selecting consensus grids over per-method grids that have higher accuracy.

### Task 4.1.2: Generate tuned weights and verify

- **Status**: complete (no regression from tuned weights, but targets still not met)
- **Agent**: orchestrator (execution task)
- **Files modified**: `tests/pipeline_weights.json` (generated by tune_weights.py)
- **Weight tuning results**:
  - pymupdf_lines: 1.000 (dominant winner, 16% win rate)
  - pymupdf_text: 0.526 (9% win rate)
  - ruled_lines: 0.283 (5% win rate)
  - single_point_hotspot: 0.162 (3% win rate)
  - camelot_hybrid: 0.158 (3% win rate)
  - global_cliff: 0.147 (2% win rate)
  - header_anchor: 0.055 (1% win rate)
  - All others (pdfplumber_text, pymupdf_lines_strict, camelot_lattice, consensus, per_row_cliff, gap_span_hotspot): 0.100 (floor, 0% win rate)
- **Re-run stress test results**: Identical to pre-tuning (8.4% accuracy, 4 MAJOR failures, same variant comparison). Tuned weights did not regress but also did not improve -- the problem is structural (combination engine), not weight-related.
- **Key finding**: The consensus method itself has 0% win rate -- it never produces the best result for any table. Yet Pipeline.extract() appears to be selecting consensus-derived grids over superior per-method grids. This is the core bug.

---

## Wave 4.1 Summary

- **Status**: complete (targets NOT met -- requires iteration on combination engine)
- **Tasks completed**: 2/2
- **Rounds**: 1
- **Critical findings**:
  - Consensus combination destroys accuracy (8.4% vs 62.0% best single method)
  - MINIMAL config (no combination) achieves 55.7% -- 7.7x better than DEFAULT
  - The scoring system is selecting consensus grids over better per-method grids
  - Weight tuning has no effect because the problem is structural, not weight-related
  - To meet targets, Pipeline.extract() must prefer per-method grids when they score better than consensus

## Phase 5: Legacy Reference Review

### Task 5.1.1: Search and remove all stale references

- **Status**: complete
- **Agent**: implementer
- **Files modified**:
  - `src/zotero_chunk_rag/feature_extraction/ground_truth.py` -- removed dead `field` import, replaced historical-provenance comment
  - `tests/test_feature_extraction/test_ground_truth.py` -- replaced historical-provenance docstring, tightened 3 weak assertions (test_headers_included, test_compare_extraction_populates_fuzzy, test_fuzzy_not_vacuous_when_structural_fails)
  - `tests/test_combination.py` -- tightened 3 weak assertions (test_multi_hypothesis_trace_structure, test_trace_cluster_has_method_names, test_trace_has_median_method_count)
  - `tests/diagnose_combination.py` -- renamed "Column count grouping" label to "Methods by column count"
  - `CLAUDE.md` -- added combination algorithm section, updated combination tracing to describe median method count acceptance and ClusterRecord fields
  - `MEMORY.md` -- added combination algorithm design decisions, updated current state
  - `spec/progress.md` -- updated Phase 1 entries that referenced superseded algorithms
- **Tests**: 397/397 passing
- **Verification**:
  - `_select_best_column_group`: 0 matches in src/ and tests/
  - `_group_confidence`: 0 matches in src/ and tests/
  - `column.count.group` (case-insensitive): 0 matches in src/ and tests/
  - `normalize_method_confidence`: 0 matches in src/ and tests/
  - `else 100.0` in ground_truth.py: only in `structural_coverage_pct` (correct default)
  - No dead imports in combination.py or ground_truth.py
  - CLAUDE.md accurately describes per-divider voting with median method count acceptance
  - All existing doc tests (test_claude_md_no_table_extraction_refs, test_claude_md_no_figure_extraction_ref) pass

---

## Wave 5.1 Summary

- **Status**: complete
- **Tasks completed**: 1/1
- **Rounds**: 1
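
## Illustrative sketch: fuzzy cell scoring (Tasks 1.1.1 and 1.1.2)

The scoring rules recorded under Task 1.1.1 and the bag-of-cells matching under Task 1.1.2 can be sketched roughly as follows. This is a minimal illustration, not the code in `ground_truth.py`. The function names here are hypothetical stand-ins, the numeric pattern is an assumption, a pair is assumed to be a "numeric mismatch" when either cell is numeric, the LCS ratio is assumed to be normalized by the longer string, and precision/recall are assumed to divide the matched score by the extracted and ground-truth cell counts respectively.

```python
import re

# Assumed numeric pattern: optional sign, digits, optional decimals/exponent/percent.
_NUMERIC_RE = re.compile(r"^[+-]?(\d+\.?\d*|\.\d+)(e[+-]?\d+)?%?$", re.IGNORECASE)


def is_numeric_cell(text: str) -> bool:
    """Return True if the cell looks like a number (thousands separators ignored)."""
    return bool(_NUMERIC_RE.match(text.strip().replace(",", "")))


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def fuzzy_cell_score(expected: str, actual: str) -> float:
    """Score one cell pair in [0, 1] using the rules listed under Task 1.1.1."""
    e, a = expected.strip(), actual.strip()
    if e == a:
        return 1.0  # exact match, including both empty
    if not e or not a:
        return 0.0  # exactly one side is empty
    if is_numeric_cell(e) or is_numeric_cell(a):
        return 0.0  # numeric mismatch (assumption: either side numeric)
    return lcs_length(e, a) / max(len(e), len(a))  # assumed normalization


def compute_fuzzy_accuracy(gt_cells: list[str], ext_cells: list[str]) -> tuple[float, float, float]:
    """Greedy best-first bag-of-cells matching; returns (precision, recall, F1) as fractions."""
    if not gt_cells or not ext_cells:
        return (0.0, 0.0, 0.0)
    # Score every ground-truth/extracted pair, then take the best pairs first.
    pairs = sorted(
        (
            (fuzzy_cell_score(g, x), gi, xi)
            for gi, g in enumerate(gt_cells)
            for xi, x in enumerate(ext_cells)
        ),
        key=lambda t: -t[0],
    )
    used_gt: set[int] = set()
    used_ext: set[int] = set()
    total = 0.0
    for score, gi, xi in pairs:
        if score <= 0.0:
            break  # remaining pairs contribute nothing
        if gi in used_gt or xi in used_ext:
            continue  # each cell may be matched at most once
        used_gt.add(gi)
        used_ext.add(xi)
        total += score
    precision = total / len(ext_cells)
    recall = total / len(gt_cells)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return (precision, recall, f1)
```

Because cells are matched as a bag rather than by row and column position, this kind of metric still yields a non-trivial score when the extracted grid has the wrong shape, which is the property noted under Task 1.1.3.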
