Zotero Chunk RAG


# Pipeline Accuracy & Combination Fix: Implementation Plan

## Context

Deep exploration of the debug database and source code revealed three critical issues with the extraction pipeline:

1. **Vacuous accuracy metric**: `compare_extraction()` reports 100% on 31/41 tables by comparing 0 cells. When column alignment fails (headers don't match), `comparable_cells = 0` and accuracy defaults to `100.0`. The metric is useless for grid selection or evaluation.

2. **Destructive consensus combination**: The boundary combination engine groups methods by column count, then merges all boundary points in the winning group. When methods in the same group disagree on exact positions, this creates phantom columns, producing 3-8x the correct column count. 10 tables drop from 100% to 0% accuracy because of consensus. The pipeline architecture already scores per-method grids alongside consensus (Tasks 1.1.1-1.1.2 from the previous plan are complete), but the scoring can't distinguish grids effectively because the accuracy metric is vacuous.

3. **No LLM structure method**: Phase 4 of the original plan built a Haiku vision QA layer (verification), but no LLM-based method was ever implemented as a competing structure detection approach. Vision models excel at table structure recognition, and this capability is missing from the pipeline.

### Current state

| Metric | Value |
|--------|-------|
| Pipeline GT accuracy (vacuous metric) | 76.2% |
| Best single-method GT accuracy (vacuous metric) | 100.0% |
| Tables degraded by consensus | 10 (8 at 0%, 2 at 8-17%) |
| MAJOR stress test failures | 18 |
| Tables with vacuous 100% (0 comparable cells) | 31/41 |

The pipeline already scores per-method grids alongside consensus grids (`Pipeline.extract()` runs cell extraction for each structure method individually, then also for consensus). The problem is that: (a) the accuracy metric can't distinguish real quality, so scoring falls back to heuristics like fill_rate which don't always pick the right grid; and (b) consensus boundaries are destructive even when they're just one candidate among many.

### Completed infrastructure

From the previous implementation pass:

- `CellGrid.structure_method` field with provenance tracking (complete)
- `Pipeline.extract()` scores all grids: per-method + consensus (complete)
- `combine_hypotheses()` trace mode with `CombinationTrace` (complete)
- `tests/diagnose_combination.py` + `_combination_diagnosis.md` (diagnostic analysis)
- Agent QA workspace with 42 table PNGs, extraction JSONs, manifest
- `tune_weights.py` infrastructure (data-driven weight computation, never run)
- `confidence_multipliers` in PipelineConfig (loaded from JSON, but `combination.py` never reads them; dead infrastructure)

## Goals

- **Honest accuracy metric**: no vacuous 100% scores; all GT comparisons produce meaningful accuracy reflecting actual cell-level correctness
- **Pipeline GT accuracy >= 95%** (from 76.2% on the vacuous metric)
- **Consensus never degrades**: pipeline accuracy >= best single-method accuracy on every GT table (delta >= -5%)
- **DEFAULT config matches or beats FAST and MINIMAL**
- **LLM vision boundaries**: prompt-based structure detection using Sonnet and Haiku, evaluated against other methods; if the LLM beats the pipeline, it replaces it
- **Data-driven confidence multipliers** generated from stress test results and actually used by the combination engine

## Non-Goals

- API-integrated LLM method (no runtime calls from the pipeline; prompt-based only)
- Figure-data-table overlap detection fix
- Adding new non-LLM structure or cell extraction methods
- Changing the post-processor chain
- Production deployment of agent QA

## Verification

- **Phase 0**: `combination.py` no longer contains `_select_best_column_group` or `_group_confidence`; `ground_truth.py` no longer contains the `else 100.0` default; `combine_hypotheses()` raises `NotImplementedError` for the multi-hypothesis case
- **Phase 1**: All GT tables have meaningful `fuzzy_accuracy_pct`; both precision and recall reported; no vacuous 100% scores
- **Phase 2**: `combine_hypotheses()` works with per-divider voting; combination delta >= -5% on all tables; confidence multipliers actually used
- **Phase 3**: LLM boundaries stored in debug DB; Sonnet vs Haiku comparison report generated with per-table fuzzy accuracy
- **Phase 4**: 0 MAJOR stress test failures; overall fuzzy accuracy meets target; `pipeline_weights.json` generated from data
- **Phase 5**: grep for removed function names returns 0 matches repo-wide

## Dependency Graph

```
Phase 0 (Dead Code Removal) ─── runs first, alone
  │
  ├──→ Phase 1 Wave 1.1 (Fuzzy Metric code) ─── parallel after 0
  ├──→ Phase 2 (Combination Rewrite) ─── parallel after 0
  │      │
  │      └──→ Phase 1 Wave 1.2 (Stress Test Integration) ─── after Phase 2
  │
  ├──→ Phase 3 (LLM Prompt Method) ─── after 1 + 2 (independent of 4)
  ├──→ Phase 4 (Validation) ─── after 1 + 2
  │
Phase 5 (Legacy Reference Review) ─── after 4
```

Phase 0 stubs `combine_hypotheses()` for the multi-hypothesis case, breaking the pipeline. Phase 1 Wave 1.1 (fuzzy metric functions) can run in parallel with Phase 2 since it's pure computation in `ground_truth.py`. Phase 1 Wave 1.2 (stress test integration) needs a working pipeline, so it waits for Phase 2 to restore `combine_hypotheses()`. Phase 3 needs both the fuzzy metric (Phase 1) and working combination (Phase 2) to evaluate LLM boundaries meaningfully, but is independent of Phase 4. Phase 4 (validation) runs the stress test gate after Phases 1+2; it does not need Phase 3's LLM results. Phase 5 is the final audit.

## Spec-Level Design Decisions

Decisions made during spec authoring that override or refine the original plan:

- **Numeric cell scoring**: `_fuzzy_cell_score()` returns 0.0 for ANY mismatch where either cell is numeric. No partial credit for numeric errors. Text mismatches get LCS partial credit. Rationale: "0.047" vs ".047" is extraction failure, not a minor variation.
- **Best-first greedy assignment**: The fuzzy accuracy matching sorts all `(ext_idx, gt_idx, score)` triples by score descending before greedy assignment. This approximates optimal matching without Hungarian algorithm complexity.
- **No reevaluate_accuracy.py script**: Removed Task 1.2.2. The stress test itself computes the new metrics directly. Avoids unnecessary tooling.
- **No normalize_method_confidence toggle**: Removed Task 2.2.2. Normalization adds complexity without demonstrated need. If chatty-method domination becomes a problem, it can be added later.
- **Confidence multipliers via parameter**: `combine_hypotheses()` gets a new `confidence_multipliers: dict[str, float] | None = None` parameter. Pipeline passes `self._config.confidence_multipliers` at the call site. Clean separation of concerns: combination doesn't know about PipelineConfig.
- **Ruled line override**: `provenance == "ruled_lines"` only. No other methods qualify for unconditional acceptance.
- **Consensus boundary confidence = mean**: Each accepted consensus boundary's confidence is the mean (not sum) of its constituent points' scaled confidences. Prevents methods with many boundary points from dominating through volume.

---

## Phase 0: Dead Code Removal

**Depends on**: (none; runs first)

Remove code that will be replaced by this plan. This breaks the pipeline (`combination.py` raises `NotImplementedError` for the multi-hypothesis case); Phase 2 builds the replacement. Tests that call `combine_hypotheses()` with 2+ hypotheses will fail with `NotImplementedError` until Phase 2 restores them.

### Wave 0.1: Remove obsolete combination and metric code

| Task | Description | Complexity | Key Files |
|------|-------------|------------|-----------|
| 0.1.1 | Remove `_select_best_column_group()` and `_group_confidence()` from `combination.py`. Stub the multi-hypothesis path in `combine_hypotheses()` with `NotImplementedError("awaiting per-divider voting rewrite")`. Change the vacuous `else 100.0` default in `ground_truth.py` to `else 0.0`. Delete `TestColumnGrouping` tests from `test_combination.py`. 8 tests that call `combine_hypotheses()` with 2+ hypotheses will fail with `NotImplementedError` (expected; Phase 2 restores them). | M | `combination.py`, `ground_truth.py`, `tests/test_combination.py` |

---

## Phase 1: Accuracy Metric Rewrite

**Depends on**: Phase 0

**Wave 1.1 parallel with**: Phase 2

**Wave 1.2 depends on**: Phase 2

The fix adds an alignment-free symmetric fuzzy accuracy metric using bag-of-cells matching with best-first greedy assignment. Numeric cells are scored strictly (exact match or 0.0). Text cells get LCS partial credit.

### Wave 1.1: Symmetric fuzzy cell accuracy

| Task | Description | Complexity | Key Files |
|------|-------------|------------|-----------|
| 1.1.1 | Implement `_fuzzy_cell_score(a, b) -> float` in `ground_truth.py`. After normalization: exact match -> 1.0; both empty -> 1.0; one empty/one not -> 0.0; either cell numeric and not matching -> 0.0; text -> LCS ratio. Also add `_is_numeric_cell()` helper. | M | `ground_truth.py` |
| 1.1.2 | Implement `_compute_fuzzy_accuracy(gt_headers, gt_rows, ext_headers, ext_rows) -> (precision, recall, f1)`. Headers included in cell bags. Best-first greedy assignment (sort all score triples descending, assign greedily). | L | `ground_truth.py` |
| 1.1.3 | Add `fuzzy_accuracy_pct`, `fuzzy_precision_pct`, `fuzzy_recall_pct` to `ComparisonResult`. Compute in `compare_extraction()`. Keep existing structural analysis unchanged. | M | `ground_truth.py` |

### Wave 1.2: Stress test integration

| Task | Description | Complexity | Key Files |
|------|-------------|------------|-----------|
| 1.2.1 | Update stress test to use `fuzzy_accuracy_pct` as primary `quality_score`. Add fuzzy columns to `ground_truth_diffs` schema. Update GT comparison report and pipeline depth report. | M | `tests/stress_test_real_library.py`, `debug_db.py` |

---

## Phase 2: Combination Engine Rewrite

**Depends on**: Phase 0

**Parallel with**: Phase 1 Wave 1.1

**Must complete before**: Phase 1 Wave 1.2

Replace column-count-based grouping with per-divider confidence voting. All boundary points from all methods are pooled and voted on independently. Acceptance is based on the median method_count, with a ruled line override for physical boundaries.

### Design

**Per-divider voting algorithm:**

1. **Collect** all boundary points from all methods (no grouping by column count).
2. **Scale** each point's confidence by its method's multiplier from the `confidence_multipliers` parameter (passed by Pipeline from PipelineConfig).
3. **Cluster by position**: expand narrow ranges, merge overlapping (existing logic).
4. **Score each cluster**: count distinct methods (method_count), mean weighted confidence.
5. **Accept/reject**: accept if `method_count >= median(all method_counts)`. Ruled line override: accept if any point has `provenance == "ruled_lines"`.

### Wave 2.1: Core per-divider voting

| Task | Description | Complexity | Key Files |
|------|-------------|------------|-----------|
| 2.1.1 | Rewrite `combine_hypotheses()` with per-divider voting. Add `confidence_multipliers: dict[str, float] \| None = None` parameter. Replace NotImplementedError stub. New acceptance: median method_count threshold + ruled line override. Add `_scale_point()` helper. | L | `combination.py` |
| 2.1.2 | Pass `self._config.confidence_multipliers` from `Pipeline.extract()` to `combine_hypotheses()`. | S | `pipeline.py` |

### Wave 2.2: Trace infrastructure

| Task | Description | Complexity | Key Files |
|------|-------------|------------|-----------|
| 2.2.1 | Add `method_names: list[str]` and `acceptance_reason: str` to `ClusterRecord`. Add `median_method_count: float` to `AxisTrace`. Update combination trace output. | M | `models.py`, `combination.py` |

---

## Phase 3: LLM Prompt Method

**Depends on**: Phase 1 + Phase 2

Offline evaluation workflow: generate prompts, user runs them through vision models, parse responses, inject into debug DB, compare against pipeline methods.

### Wave 3.1: Prompt generation

| Task | Description | Complexity | Key Files |
|------|-------------|------------|-----------|
| 3.1.1 | Create `tests/llm_structure/generate_prompts.py`: render GT table PNGs, generate prompts, write manifest. | M | `tests/llm_structure/generate_prompts.py` |
| 3.1.2 | Write prompt template requesting JSON with fractional divider positions and confidence labels. | S | `tests/llm_structure/prompt_template.md` |

### Wave 3.2: Response parsing + injection

| Task | Description | Complexity | Key Files |
|------|-------------|------------|-----------|
| 3.2.1 | Create `tests/llm_structure/parse_responses.py`: parse LLM JSON -> BoundaryHypothesis. Handle markdown fences, validate structure. | M | `tests/llm_structure/parse_responses.py` |
| 3.2.2 | Create `tests/llm_structure/inject_and_evaluate.py`: run cell extraction against LLM boundaries, compute fuzzy accuracy, write to debug DB. | M | `tests/llm_structure/inject_and_evaluate.py` |

### Wave 3.3: Model comparison

| Task | Description | Complexity | Key Files |
|------|-------------|------------|-----------|
| 3.3.1 | Create `tests/llm_structure/compare_models.py`: per-table Sonnet vs Haiku vs pipeline accuracy report. | M | `tests/llm_structure/compare_models.py` |

---

## Phase 4: Validation

**Depends on**: Phase 1 + Phase 2

### Wave 4.1: Full stress test

| Task | Description | Complexity | Key Files |
|------|-------------|------------|-----------|
| 4.1.1 | Run stress test. Verify: 0 MAJOR failures, fuzzy GT accuracy >= 95%, consensus delta >= -5%, DEFAULT >= FAST/MINIMAL. | M | `stress_test_real_library.py` |
| 4.1.2 | Run `tune_weights.py` to generate `pipeline_weights.json` from fuzzy win rates. Re-run stress test to verify no regression. | S | `tune_weights.py`, `pipeline_weights.json` |

---

## Phase 5: Legacy Reference Review

**Depends on**: Phase 4

### Wave 5.1: Full legacy audit

| Task | Description | Complexity | Key Files |
|------|-------------|------------|-----------|
| 5.1.1 | Search and remove all stale references: `_select_best_column_group`, `_group_confidence`, `else 100.0`, column count grouping descriptions, `normalize_method_confidence`, old spec files. Update CLAUDE.md and MEMORY.md. | M | (repo-wide) |
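
## Appendix: Per-divider voting sketch (Phase 2 design)

The five voting steps from the Phase 2 design could be sketched as follows. This is a simplified stand-in, not the planned `combine_hypotheses()` rewrite. The `Point` class, the `vote_dividers` name, the `min_width` expansion rule, the use of the cluster midpoint as the accepted position, and the reason labels are all hypothetical. Only the median threshold, the `"ruled_lines"` override, and mean-of-scaled-confidences come from the plan.

```python
from dataclasses import dataclass
from statistics import mean, median


@dataclass(frozen=True)
class Point:
    """Hypothetical boundary point: a position range plus its source method."""
    lo: float
    hi: float
    confidence: float
    provenance: str


def vote_dividers(points, confidence_multipliers=None, min_width=2.0):
    multipliers = confidence_multipliers or {}

    # Steps 1-2: pool every point and scale confidence by its method's
    # multiplier. Narrow ranges are widened to min_width (assumed rule).
    scaled = []
    for p in points:
        pad = max(0.0, (min_width - (p.hi - p.lo)) / 2)
        scale = multipliers.get(p.provenance, 1.0)
        scaled.append(Point(p.lo - pad, p.hi + pad, p.confidence * scale, p.provenance))

    # Step 3: cluster by position, merging ranges that overlap or touch.
    clusters = []
    for p in sorted(scaled, key=lambda q: q.lo):
        if clusters and p.lo <= clusters[-1]["hi"]:
            clusters[-1]["hi"] = max(clusters[-1]["hi"], p.hi)
            clusters[-1]["points"].append(p)
        else:
            clusters.append({"lo": p.lo, "hi": p.hi, "points": [p]})
    if not clusters:
        return []

    # Step 4: score each cluster by the number of distinct methods in it.
    counts = [len({q.provenance for q in c["points"]}) for c in clusters]
    threshold = median(counts)

    # Step 5: accept on the median vote, or unconditionally for ruled lines.
    accepted = []
    for cluster, count in zip(clusters, counts):
        ruled = any(q.provenance == "ruled_lines" for q in cluster["points"])
        if count >= threshold or ruled:
            accepted.append({
                "position": (cluster["lo"] + cluster["hi"]) / 2,
                # Mean, not sum, so chatty methods cannot win through volume.
                "confidence": mean(q.confidence for q in cluster["points"]),
                "method_count": count,
                "acceptance_reason": "median_vote" if count >= threshold else "ruled_line_override",
            })
    return accepted
```

In this sketch a divider proposed by a single method is dropped whenever most other clusters are backed by several methods, which is the phantom-column case described in the Context section, while a lone ruled line still survives.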
