Release v0.6.0 · qurator-spk/eynollah

Fixed:

continue processing when no columns detected but text regions exist
convert marginalia to main text if no main text is present
reset deskewing angle to 0° when text covers <30% image area and detected angle >45°
🔥 polygons: avoid invalid paths (use Polygon.buffer() instead of dilation etc.)
return_boxes_of_images_by_order_of_reading_new: avoid Numpy.dtype mismatch, simplify
return_boxes_of_images_by_order_of_reading_new: log any exceptions instead of ignoring
filter_contours_without_textline_inside: avoid removing from duplicate lists twice
get_marginals: exit early if no peaks found to avoid spurious overlap mask
get_smallest_skew: after shifting search range of rotation angle, use overall best result
Dockerfile: fix CUDA installation (cuDNN contested between Torch and TF due to extra OCR)
OCR: re-instate missing methods and fix utils_ocr function calls
mbreorder/enhancement CLIs: missing imports
🔥 writer: SeparatorRegion needs SeparatorRegionType (not ImageRegionType), f458e3
tests: switch from pytest-subtests to parametrize so we can use pytest-isolate
(so CUDA memory gets freed between tests if running on GPU)
Prevent OOM GPU error by avoiding loading the region_fl model, #199
XML output: encoding should be utf-8, not utf8, #196, #197
join_polygons: always return Polygon, not MultiPolygon, #203
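
To illustrate the deskewing fix listed above (reset the angle to 0° when text covers <30% of the image area and the detected angle is >45°), here is a minimal sketch of such a guard. The function name, its parameters, and the use of the absolute angle are assumptions made for illustration; this is not eynollah's actual code.

```python
def effective_skew_angle(
    angle_deg: float,
    text_area: float,
    image_area: float,
    max_angle: float = 45.0,     # threshold from the release note
    min_coverage: float = 0.30,  # threshold from the release note
) -> float:
    """Return the angle to deskew by, or 0.0 if the estimate looks unreliable.

    Hypothetical helper: a large angle estimated from very little text
    is treated as spurious and discarded.
    """
    # Fraction of the page covered by text (guard against empty images).
    coverage = text_area / image_area if image_area > 0 else 0.0
    # Assumption: the sign of the angle does not matter, only its magnitude.
    if coverage < min_coverage and abs(angle_deg) > max_angle:
        return 0.0
    return angle_deg
```

With these thresholds, a 60° estimate on a page with 20% text coverage is reset to 0°, while the same estimate on a page with 50% coverage is kept.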

Added:

🔥 eynollah-training CLI and docs for training the models, #187, #193, https://github.com/qurator-spk/sbb_pixelwise_segmentation/tree/unifying-training-models
🔥 layout CLI: new option --model_version to override default choices
test coverage for OCR options in layout
test coverage for table detection in layout
CI linting with ruff

Changed:

polygons: slightly widen for regions and lines, increase for separators
various refactorings, some code style and identifier improvements
deskewing/multiprocessing: switch back to ProcessPoolExecutor (faster),
but use shared memory if necessary, switch back from loky to stdlib,
and shut down in __del__() instead of atexit
🔥 OCR: switch CNN-RNN model to 20250930 version compatible with TF 2.12 on CPU, too
OCR: allow running -tr without -fl, too
🔥 writer: use @type='heading' instead of 'header' for headings
🔥 performance gains via refactoring (simplification, less duplicated code, vectorization,
avoiding unused calculations, avoiding unnecessary 3-channel image operations)
🔥 heuristic reading order detection: many improvements
- contour vs splitter box matching:
  - contour must be contained in box exactly instead of heuristics
  - as fallback, match by center: center must be contained in box
- original vs deskewed contour matching:
  - same min-area filter on both sides
  - similar area score in addition to center proximity
  - avoid duplicate and missing mappings by allowing N:M
    matches and splitting+joining where necessary
CI: update+improve model caching
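
The contour vs splitter box matching described above (exact containment first, center containment as fallback) can be sketched roughly as follows. This is a simplified illustration, not eynollah's implementation: it assumes axis-aligned boxes given as (x_min, y_min, x_max, y_max), contours given as lists of (x, y) vertices, and uses the mean of the vertices as the center. All names are hypothetical.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
Point = Tuple[float, float]


def contour_bounds(contour: List[Point]) -> Box:
    """Axis-aligned bounding box of a contour."""
    xs = [x for x, _ in contour]
    ys = [y for _, y in contour]
    return min(xs), min(ys), max(xs), max(ys)


def contour_center(contour: List[Point]) -> Point:
    """Mean of the vertices (a simplification of a true centroid)."""
    n = len(contour)
    return sum(x for x, _ in contour) / n, sum(y for _, y in contour) / n


def box_contains_point(box: Box, point: Point) -> bool:
    x0, y0, x1, y1 = box
    x, y = point
    return x0 <= x <= x1 and y0 <= y <= y1


def box_contains_box(outer: Box, inner: Box) -> bool:
    # For an axis-aligned box, containing the bounding box of a contour
    # is equivalent to containing every vertex of that contour.
    return (box_contains_point(outer, (inner[0], inner[1]))
            and box_contains_point(outer, (inner[2], inner[3])))


def match_contour_to_box(contour: List[Point], boxes: List[Box]) -> Optional[int]:
    """Return the index of the matching box, or None if nothing matches."""
    bounds = contour_bounds(contour)
    # First pass: the contour must lie entirely inside the box.
    for i, box in enumerate(boxes):
        if box_contains_box(box, bounds):
            return i
    # Fallback: the contour's center must lie inside the box.
    center = contour_center(contour)
    for i, box in enumerate(boxes):
        if box_contains_point(box, center):
            return i
    return None


boxes = [(0, 0, 10, 10), (10, 0, 20, 10)]
inside = [(1, 1), (4, 1), (4, 4), (1, 4)]          # fully inside box 0
straddling = [(8, 2), (18, 2), (18, 6), (8, 6)]    # crosses both boxes
outside = [(30, 30), (31, 30), (31, 31)]           # touches no box
```

Here `inside` matches box 0 by containment, `straddling` falls back to its center (13, 4) and matches box 1, and `outside` yields no match.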

Merged PRs

CD: master is now main by @bertsky in #185
📝 extend changelog for v0.5.0 by @kba in #186
new attempt at #173 (valid polygons, faster deskewing, various fixes) by @bertsky in #192
XML encoding should be utf-8 not utf8 by @kba in #197
Fix overflow by @bertsky in #199
Prepare v0.6.0rc2 by @kba in #200
Training installation by @kba in #193
Integrate training from sbb pixelwise segmentation by @kba in #187
join_polygons: try to catch rare case of MultiPolygon by @kba in #203

Full Changelog: v0.5.0...v0.6.0