Zotero Chunk RAG

phase-3-llm-prompt-method.md•9.46 KiB

# Phase 3: LLM Prompt Method

## Overview

Create a prompt-based workflow for vision-model table structure detection. The user or an agent runs cropped table PNGs through a vision model (Sonnet or Haiku) with a structured prompt, and the results are parsed into BoundaryHypothesis objects, run through cell extraction, scored against ground truth, and injected into the debug DB for evaluation alongside other methods.

This is NOT an API-integrated pipeline method. It is an offline evaluation workflow that answers the question: "Can a vision model beat the multi-method pipeline at table structure detection?"

**Depends on**: Phase 1 (fuzzy accuracy metric) + Phase 2 (working combination engine)

**Entry state**: No LLM-based structure detection exists.

**Exit state**: Evaluation infrastructure in `tests/llm_structure/` with per-table accuracy data for Sonnet and Haiku, comparison report against pipeline methods.

---

## Wave 3.1: Prompt generation

### Task 3.1.1: Generate table PNGs and prompts

- **Description**: Create `tests/llm_structure/generate_prompts.py`. For each GT table in the ground truth database:
  1. Open the PDF, navigate to the table's page
  2. Render the table region as a cropped PNG (reuse pymupdf rendering: `page.get_pixmap(clip=bbox, dpi=200)`)
  3. Save to `tests/llm_structure/tables/<table_id>/table.png`
  4. Generate the prompt by reading the template from `tests/llm_structure/prompt_template.md` and appending any table-specific context (bbox dimensions, page number)
  5. Save prompt to `tests/llm_structure/tables/<table_id>/prompt.md`
  6. Generate a manifest JSON at `tests/llm_structure/manifest.json` listing all table IDs, their PDF paths, page numbers, and bbox coordinates

  The script needs to resolve PDF paths. GT entries have `paper_key`; the stress test corpus maps `paper_key` to PDF path via the Zotero library. The script should accept a `--corpus-json` argument pointing to a JSON file that maps `paper_key` to PDF path, OR reuse the CORPUS list from `stress_test_real_library.py` directly.
- **Files to create**:
  - `tests/llm_structure/generate_prompts.py`: main script
  - `tests/llm_structure/__init__.py`: empty package marker
- **Tests**:
  - `tests/test_feature_extraction/test_llm_structure.py::TestPromptGeneration::test_script_importable`: assert `generate_prompts` module imports without error
  - `tests/test_feature_extraction/test_llm_structure.py::TestPromptGeneration::test_manifest_schema`: run generation on 1 table (mock or fixture), verify manifest JSON has expected keys: `table_id`, `pdf_path`, `page_num`, `bbox`
- **Acceptance criteria**:
  - Script generates PNG + prompt for each GT table
  - PNGs are readable images showing the table region
  - Manifest JSON is valid and lists all generated tables
  - Prompt references the correct coordinate system (fractions of bbox)

### Task 3.1.2: Prompt template

- **Description**: Write the prompt template that instructs a vision model to identify table structure. The prompt should:
  - Explain the coordinate system: fractions of the table bbox (0.0 = left/top edge, 1.0 = right/bottom edge)
  - Request ONLY internal dividers (not the outer bbox edges)
  - Request JSON output: `{"columns": [{"position": float, "confidence": "high"|"medium"|"low"}, ...], "rows": [...]}`
  - Include a worked example showing a 3-column, 4-row table's expected output
  - Note that position values should be where the divider LINE would be drawn (between columns/rows), not column/row centers
- **Files to create**:
  - `tests/llm_structure/prompt_template.md`: the prompt template
- **Tests**:
  - `tests/test_feature_extraction/test_llm_structure.py::TestPromptTemplate::test_template_exists`: assert file exists and is non-empty
  - `tests/test_feature_extraction/test_llm_structure.py::TestPromptTemplate::test_template_contains_json_example`: assert template contains `"columns"` and `"rows"` and `"position"`
- **Acceptance criteria**:
  - Template clearly explains the coordinate system
  - Example JSON output is valid JSON
  - Instructions are unambiguous about internal-only dividers

---

## Wave 3.2: Response parsing + injection

### Task 3.2.1: Parse LLM responses into BoundaryHypothesis objects

- **Description**: Create `tests/llm_structure/parse_responses.py`. Reads LLM response JSON files placed by the user at `tests/llm_structure/tables/<table_id>/response_sonnet.json` and/or `response_haiku.json`. For each response:
  1. Parse the JSON (handle common model quirks: markdown code fences wrapping JSON, extra text before/after the JSON block)
  2. Validate structure: must have `columns` and `rows` arrays, each element must have `position` (float) and `confidence` (string)
  3. Convert fractional positions to absolute PDF coordinates using the table's bbox from the manifest: `abs_pos = bbox_min + fraction * (bbox_max - bbox_min)`
  4. Map confidence labels to scores: `high=0.9`, `medium=0.6`, `low=0.3`
  5. Create a `BoundaryHypothesis` with provenance `"llm_sonnet"` or `"llm_haiku"`
  6. Report parsing errors (invalid JSON, missing fields, positions outside 0-1) to stderr without crashing

  Provide a function `parse_response(table_id, model_name, manifest, response_path) -> BoundaryHypothesis | None` for programmatic use, plus a CLI that processes all tables.
- **Files to create**:
  - `tests/llm_structure/parse_responses.py`
- **Tests**:
  - `tests/test_feature_extraction/test_llm_structure.py::TestResponseParsing::test_valid_json_parsed`: create a synthetic response JSON with known positions. Assert returned BoundaryHypothesis has correct col/row boundary count and positions within tolerance of expected absolute coordinates.
  - `tests/test_feature_extraction/test_llm_structure.py::TestResponseParsing::test_markdown_fenced_json`: response wrapped in a Markdown code fence tagged `json`. Assert still parsed correctly.
  - `tests/test_feature_extraction/test_llm_structure.py::TestResponseParsing::test_invalid_json_returns_none`: malformed JSON returns None without crashing.
  - `tests/test_feature_extraction/test_llm_structure.py::TestResponseParsing::test_confidence_mapping`: "high" -> 0.9, "medium" -> 0.6, "low" -> 0.3 on the resulting BoundaryPoints.
- **Acceptance criteria**:
  - Valid response JSON produces a correct BoundaryHypothesis
  - Markdown-fenced JSON is handled
  - Invalid input returns None with error message (no crash)
  - Confidence labels correctly mapped to scores
  - Positions correctly converted from fractions to absolute PDF coordinates

### Task 3.2.2: Inject LLM boundaries and evaluate against GT

- **Description**: Create `tests/llm_structure/inject_and_evaluate.py`. For each parsed BoundaryHypothesis from LLM responses:
  1. Open the PDF, build a `TableContext` for the table region
  2. Resolve col/row boundary midpoints from the BoundaryHypothesis
  3. Run each cell extraction method (rawdict, words, pdfminer) against the LLM boundaries by calling each method's `extract(ctx, col_positions, row_positions)` directly
  4. For each resulting CellGrid, compute `fuzzy_accuracy_pct` against ground truth via `compare_extraction()`
  5. Write `method_results` rows to the debug DB with method_name `"llm_sonnet+rawdict"`, `"llm_sonnet+word_assignment"`, etc.
  6. Print a summary table: table_id, model, cell_method, fuzzy_accuracy

  The script reads the manifest and response files, opens the debug DB (`_stress_test_debug.db`), and appends rows. It should be safe to run multiple times (uses INSERT, not REPLACE, so results accumulate).
- **Files to create**:
  - `tests/llm_structure/inject_and_evaluate.py`
- **Tests**:
  - `tests/test_feature_extraction/test_llm_structure.py::TestInjection::test_script_importable`: assert module imports without error
  - Full integration is tested by actually running the script after obtaining model responses; acceptance is verified via the debug DB contents.
- **Acceptance criteria**:
  - Cell extraction runs successfully against LLM boundaries
  - `method_results` rows written with correct method names (`llm_<model>+<cell_method>`)
  - Fuzzy accuracy scores computed and stored
  - Script handles missing response files gracefully (skips, doesn't crash)

---

## Wave 3.3: Model comparison

### Task 3.3.1: Comparison report generator

- **Description**: Create `tests/llm_structure/compare_models.py`. Reads `method_results` from the debug DB for all `llm_sonnet` and `llm_haiku` entries. Produces a comparison report:
  1. **Per-table accuracy**: table_id, Sonnet best accuracy, Haiku best accuracy, best non-LLM method accuracy, pipeline consensus accuracy
  2. **Tables where LLM wins**: tables where LLM accuracy > best non-LLM method
  3. **Tables where LLM loses**: tables where LLM accuracy < best non-LLM method
  4. **Overall win rates**: Sonnet wins, Haiku wins, pipeline wins
  5. **Summary statistics**: mean/median accuracy per approach

  Output: `tests/llm_structure/comparison_report.md`
- **Files to create**:
  - `tests/llm_structure/compare_models.py`
- **Tests**:
  - `tests/test_feature_extraction/test_llm_structure.py::TestComparison::test_script_importable`: assert module imports without error
  - Full integration tested by running after injection.
- **Acceptance criteria**:
  - Report generated as markdown at the expected path
  - Report shows per-table comparison with all approaches
  - Win rates computed correctly
  - Report is readable and actionable (shows where LLM method adds value)
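
To illustrate the internal-only divider convention from Task 3.1.2, the sketch below builds the expected JSON for a 3-column, 4-row table. It assumes evenly spaced dividers purely for illustration (real tables rarely are), and `even_dividers` and `example_output` are hypothetical helpers, not part of the project. What it shows is that n columns yield n-1 internal dividers, with the outer bbox edges at 0.0 and 1.0 left out.

```python
import json


def even_dividers(n_cells: int) -> list[float]:
    """Fractional positions of the n_cells - 1 internal dividers (outer edges excluded)."""
    return [round(i / n_cells, 3) for i in range(1, n_cells)]


def example_output(n_cols: int, n_rows: int) -> dict:
    """Build a response in the JSON shape the prompt requests (hypothetical helper)."""
    def entries(n: int) -> list[dict]:
        # Every divider is marked "high" here only to keep the example short.
        return [{"position": p, "confidence": "high"} for p in even_dividers(n)]

    return {"columns": entries(n_cols), "rows": entries(n_rows)}


if __name__ == "__main__":
    print(json.dumps(example_output(3, 4), indent=2))
```

A 3-column table therefore produces two column entries and a 4-row table produces three row entries; a response listing 0.0 or 1.0 would be reporting the bbox edges rather than internal dividers.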
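
The parsing steps in Task 3.2.1 can be sketched with the standard library alone. This is a minimal sketch under stated assumptions, not the project's `parse_response`: `ParsedBoundary` is a hypothetical stand-in for the real boundary type (whose constructor is not shown in this plan), `parse_response_text` is an illustrative name, and the bbox is assumed to be `(x0, y0, x1, y1)` in PDF coordinates. Fences and surrounding prose are handled by taking the span from the first `{` to the last `}`.

```python
import json
import sys
from dataclasses import dataclass

# Confidence label -> score mapping, as stated in Task 3.2.1.
CONFIDENCE_SCORES = {"high": 0.9, "medium": 0.6, "low": 0.3}


@dataclass
class ParsedBoundary:
    """Hypothetical stand-in for the project's boundary point type."""
    position: float    # absolute PDF coordinate
    confidence: float  # numeric score derived from the label


def extract_json_block(text: str) -> str:
    """Return the outermost {...} span, dropping code fences and stray prose."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in response")
    return text[start:end + 1]


def to_boundaries(entries, lo: float, hi: float) -> list[ParsedBoundary]:
    """Validate entries and map fractional positions onto [lo, hi]."""
    if not isinstance(entries, list):
        raise ValueError("expected a list of boundary entries")
    out = []
    for entry in entries:
        pos, label = entry["position"], entry["confidence"]
        if isinstance(pos, bool) or not isinstance(pos, (int, float)):
            raise ValueError(f"position is not a number: {pos!r}")
        if not 0.0 <= pos <= 1.0:
            raise ValueError(f"position outside 0-1: {pos!r}")
        # abs_pos = bbox_min + fraction * (bbox_max - bbox_min)
        out.append(ParsedBoundary(lo + pos * (hi - lo), CONFIDENCE_SCORES[label]))
    return out


def parse_response_text(text: str, bbox):
    """Parse one model response; return (columns, rows), or None on any error."""
    x0, y0, x1, y1 = bbox
    try:
        data = json.loads(extract_json_block(text))
        cols = to_boundaries(data["columns"], x0, x1)
        rows = to_boundaries(data["rows"], y0, y1)
    except (ValueError, KeyError, TypeError) as exc:
        # Report to stderr and keep going, as the task requires.
        print(f"parse error: {exc!r}", file=sys.stderr)
        return None
    return cols, rows


if __name__ == "__main__":
    fence = "`" * 3  # built at runtime to avoid a literal fence in this listing
    reply = (
        f"Here is the structure:\n{fence}json\n"
        '{"columns": [{"position": 0.5, "confidence": "high"}], '
        '"rows": [{"position": 0.25, "confidence": "low"}]}\n'
        f"{fence}"
    )
    print(parse_response_text(reply, (100.0, 200.0, 300.0, 400.0)))
```

Catching `ValueError`, `KeyError`, and `TypeError` together covers malformed JSON, missing fields, unknown confidence labels, and wrongly typed entries, so one bad response does not stop a batch run.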
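
The win-rate and summary-statistics items in Task 3.3.1 could be computed along these lines. This is a sketch under assumptions: rows are taken to be `(table_id, method_name, accuracy)` tuples already read from `method_results` (the actual schema is not given here), every method name without an `llm_sonnet` or `llm_haiku` prefix is lumped into a single "pipeline" bucket (the plan distinguishes best non-LLM method from pipeline consensus), and the "tie" counter is an addition of this sketch.

```python
from statistics import mean, median

LLM_APPROACHES = ("llm_sonnet", "llm_haiku")


def best_by_approach(rows):
    """Collapse rows to the best accuracy per table and per approach."""
    best = {}
    for table_id, method_name, accuracy in rows:
        # LLM method names follow `llm_<model>+<cell_method>`.
        prefix = method_name.split("+", 1)[0]
        approach = prefix if prefix in LLM_APPROACHES else "pipeline"
        per_table = best.setdefault(table_id, {})
        per_table[approach] = max(accuracy, per_table.get(approach, accuracy))
    return best


def summarize(best):
    """Return (win counts, {approach: (mean, median)}) across all tables."""
    wins = {"llm_sonnet": 0, "llm_haiku": 0, "pipeline": 0, "tie": 0}
    for scores in best.values():
        top = max(scores.values())
        leaders = [name for name, score in scores.items() if score == top]
        wins[leaders[0] if len(leaders) == 1 else "tie"] += 1

    stats = {}
    for approach in (*LLM_APPROACHES, "pipeline"):
        values = [s[approach] for s in best.values() if approach in s]
        if values:  # skip approaches with no results at all
            stats[approach] = (mean(values), median(values))
    return wins, stats


if __name__ == "__main__":
    # Made-up numbers for illustration only, not measured results.
    demo = [
        ("t1", "llm_sonnet+rawdict", 90.0),
        ("t1", "llm_haiku+rawdict", 70.0),
        ("t1", "some_pipeline_method", 85.0),
    ]
    print(summarize(best_by_approach(demo)))
```

Taking the per-table maximum first matches the report's "best accuracy" columns; counting ties separately avoids crediting a win to whichever approach happens to be listed first.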
