Description
Bug
The extracted table contains misplaced column cells and rows, which then leads to wrong answers from an LLM.
Steps to reproduce
Source PDF: iriesd_enea-operator_wer.2.3.pdf
From page 181 to the end of the document there are several large tables with small gaps between the rows.
In this case, let's focus on the table on page 181.
PAGE 181 There is a big table
Pipeline Options that I used
from docling.datamodel.pipeline_options import EasyOcrOptions, PdfPipelineOptions, TableFormerMode

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.do_cell_matching = True
# accelerator_options: its definition is not shown in this report
pipeline_options.accelerator_options = accelerator_options
pipeline_options.ocr_options = EasyOcrOptions(force_full_page_ocr=True, lang=["pl", "en"])
pipeline_options.generate_page_images = True
pipeline_options.generate_picture_images = True
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE
I also tried with:
pipeline_options.do_ocr = False
pipeline_options.table_structure_options.do_cell_matching = False
Markdown output
Without cell matching the columns look better, but some cell values are still joined together:
I can see that some numbers in the source table overlap, but as you can see, rows and columns are joined incorrectly in the output.
The problem also exists when do_cell_matching is set to False.
This affects all tables from page 181 to the end of the document.
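To make the joined cells easier to spot across many pages, here is a minimal sketch of a check that can be run on the exported Markdown. It is not part of Docling; the function names and the sample table are hypothetical, and the rule "a cell holding two or more numbers is suspicious" is only a heuristic that assumes each cell should contain at most one numeric value.

```python
import re

# Numeric tokens such as "12", "0,4" or "0.25" (comma or dot as decimal mark).
NUMBER = re.compile(r"\d+(?:[.,]\d+)?")


def split_row(line: str) -> list[str]:
    """Split one Markdown pipe-table row into stripped cell strings."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def is_separator(cells: list[str]) -> bool:
    """True for the header separator row, e.g. |---|:---:|."""
    return all(re.fullmatch(r":?-+:?", cell) for cell in cells)


def find_suspect_cells(markdown: str) -> list[tuple[int, int, str]]:
    """Return (line_number, column_index, text) for cells with 2+ numbers.

    Heuristic only: legitimate cells such as ranges ("10-20") are flagged too.
    """
    suspects = []
    for line_no, line in enumerate(markdown.splitlines(), start=1):
        if not line.lstrip().startswith("|"):
            continue  # not a table row
        cells = split_row(line)
        if is_separator(cells):
            continue
        for col, cell in enumerate(cells):
            if len(NUMBER.findall(cell)) >= 2:
                suspects.append((line_no, col, cell))
    return suspects


if __name__ == "__main__":
    # Made-up sample, not taken from the PDF: the last row has two values
    # merged into one cell and an empty neighbouring cell.
    sample = (
        "| A | B | C |\n"
        "|---|---|---|\n"
        "| 1 | 0,4 | 15 |\n"
        "| 2 | 0,4 110 | |\n"
    )
    for line_no, col, text in find_suspect_cells(sample):
        print(f"line {line_no}, column {col}: {text!r}")
```

Running it on the Markdown produced with and without do_cell_matching would give a rough count of merged cells for each setting, which may help when comparing the two outputs.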
Thanks in advance!
Docling version
2.30.0
docling-core 2.32.0
Python version
Python 3.11.0