Incorrect Table Columns #1678

Open
@fifibanana

Description

Bug

The extracted table contains misplaced cells: values end up in the wrong columns and rows, which then leads an LLM to give wrong answers.

Steps to reproduce

Source PDF: iriesd_enea-operator_wer.2.3.pdf
From page 181 to the end of the document there are several large tables with small gaps between the rows.
In this case, let's focus on the table on page 181.
Page 181 contains a large table.

Pipeline Options that I used

pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.do_cell_matching = True
pipeline_options.accelerator_options = accelerator_options
pipeline_options.ocr_options = EasyOcrOptions(force_full_page_ocr=True, lang=["pl", "en"])
pipeline_options.generate_page_images = True
pipeline_options.generate_picture_images = True
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE

I also tried with:

do_ocr  = False
do_cell_matching = False
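
For anyone trying to reproduce this, the settings above can be put into a small comparison script. This is only a sketch under some assumptions: `VARIANTS`, `build_converter` and `compare_variants` are hypothetical helper names, `accelerator_options` is left at its default because its value is not shown above, the image-generation flags are omitted because they should not influence table structure, and all four `do_ocr` / `do_cell_matching` combinations are run because it is not stated whether the two flags were switched off together or separately. It also assumes that `convert()` accepts a 1-based `page_range`.

```python
from itertools import product

# All four (do_ocr, do_cell_matching) combinations to compare.
VARIANTS = list(product([True, False], repeat=2))


def build_converter(do_ocr: bool, do_cell_matching: bool):
    """Build a DocumentConverter using the options listed in this report."""
    # Imported lazily so the module can be loaded without docling installed.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import (
        EasyOcrOptions,
        PdfPipelineOptions,
        TableFormerMode,
    )
    from docling.document_converter import DocumentConverter, PdfFormatOption

    opts = PdfPipelineOptions()
    opts.do_ocr = do_ocr
    opts.do_table_structure = True
    opts.table_structure_options.do_cell_matching = do_cell_matching
    opts.table_structure_options.mode = TableFormerMode.ACCURATE
    opts.ocr_options = EasyOcrOptions(force_full_page_ocr=True, lang=["pl", "en"])
    # accelerator_options is intentionally left at its default (assumption).
    return DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=opts)}
    )


def compare_variants(pdf_path: str = "iriesd_enea-operator_wer.2.3.pdf",
                     page: int = 181) -> None:
    """Convert a single page with every variant and print the extracted tables."""
    for do_ocr, do_cell_matching in VARIANTS:
        converter = build_converter(do_ocr, do_cell_matching)
        # Restrict conversion to the affected page to keep the run short.
        result = converter.convert(pdf_path, page_range=(page, page))
        print(f"--- do_ocr={do_ocr}, do_cell_matching={do_cell_matching} ---")
        for table in result.document.tables:
            print(table.export_to_dataframe())
```

Calling `compare_variants()` with the PDF in the working directory prints the page 181 table once per combination, so the misplaced cells can be compared side by side.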

Markdown output

Image
continuation:

Image

Without cell matching the columns look better, but some cell values are still joined together:

Image

I can see that some numbers in the table are overlapping, but as you can see, rows/columns are joined incorrectly.
The problem also exists when do_cell_matching is set to False.

This affects all tables from page 181 to the end of the document.

Thanks in advance!

Docling version

2.30.0
docling-core 2.32.0

Python version

Python 3.11.0
