v0.6.0

@kba released this 17 Oct 08:39

Fixed:

  • continue processing when no columns detected but text regions exist
  • convert marginalia to main text if no main text is present
  • reset deskewing angle to 0° when text covers <30% image area and detected angle >45°
  • 🔥 polygons: avoid invalid paths (use Polygon.buffer() instead of dilation etc.)
  • return_boxes_of_images_by_order_of_reading_new: avoid NumPy dtype mismatch, simplify
  • return_boxes_of_images_by_order_of_reading_new: log any exceptions instead of ignoring
  • filter_contours_without_textline_inside: avoid removing from duplicate lists twice
  • get_marginals: exit early if no peaks found to avoid spurious overlap mask
  • get_smallest_skew: after shifting search range of rotation angle, use overall best result
  • Dockerfile: fix CUDA installation (cuDNN contested between Torch and TF due to extra OCR)
  • OCR: re-instate missing methods and fix utils_ocr function calls
  • mbreorder/enhancement CLIs: missing imports
  • 🔥 writer: SeparatorRegion needs SeparatorRegionType (not ImageRegionType), f458e3
  • tests: switch from pytest-subtests to parametrize so we can use pytest-isolate
    (so CUDA memory gets freed between tests if running on GPU)
  • Prevent OOM GPU error by avoiding loading the region_fl model, #199
  • XML output: encoding should be utf-8, not utf8, #196, #197
  • join_polygons always returning Polygon, not MultiPolygon, #203
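
The utf-8 vs. utf8 fix (#196, #197) can be illustrated with the standard library. Python's codec registry accepts `utf8` as an alias, but serializers tend to copy whatever label they are given into the XML declaration, and stricter XML consumers may not recognize the unhyphenated spelling. The following is a minimal sketch, not the project's writer code: `canonical_encoding` and `serialize` are hypothetical helper names, and the `PcGts` element is used only as a placeholder root.

```python
import codecs
import xml.etree.ElementTree as ET


def canonical_encoding(label: str) -> str:
    """Map a Python codec alias such as 'utf8' to Python's canonical codec name.

    Note: for UTF-8 this yields 'utf-8'; for other codecs the canonical Python
    name is not guaranteed to be the IANA-registered one.
    """
    return codecs.lookup(label).name


def serialize(root: ET.Element, encoding: str = "utf8") -> bytes:
    """Serialize with an XML declaration that carries the normalized label."""
    return ET.tostring(
        root,
        encoding=canonical_encoding(encoding),
        xml_declaration=True,  # available since Python 3.8
    )


root = ET.Element("PcGts")  # placeholder element name, for illustration only
xml_bytes = serialize(root)
print(xml_bytes.decode("utf-8"))
```

Normalizing the label once, before it reaches the serializer, keeps the declaration consistent regardless of which alias a caller passes in.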

Added:

Changed:

  • polygons: slightly widen for regions and lines, increase for separators
  • various refactorings, some code style and identifier improvements
  • deskewing/multiprocessing: switch back to ProcessPoolExecutor (faster),
    but use shared memory if necessary, and switch back from loky to stdlib,
    and shutdown in __del__() instead of atexit
  • 🔥 OCR: switch CNN-RNN model to 20250930 version compatible with TF 2.12 on CPU, too
  • OCR: allow running -tr without -fl, too
  • 🔥 writer: use @type='heading' instead of 'header' for headings
  • 🔥 performance gains via refactoring (simplification, less copy-code, vectorization,
    avoiding unused calculations, avoiding unnecessary 3-channel image operations)
  • 🔥 heuristic reading order detection: many improvements
    • contour vs splitter box matching:
      • contour must be contained in box exactly instead of heuristics
      • fall back to center matching, where the center must be contained in the box
    • original vs deskewed contour matching:
      • same min-area filter on both sides
      • similar area score in addition to center proximity
      • avoid duplicate and missing mappings by allowing N:M
        matches and splitting+joining where necessary
  • CI: update+improve model caching
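
The contour vs. splitter box matching rule listed under the reading order improvements (exact containment first, center containment as fallback) can be sketched as follows. This is a simplified illustration under stated assumptions, not the project's implementation: boxes are taken to be axis-aligned `(x0, y0, x1, y1)` tuples, contours are plain vertex lists, the center is approximated by the vertex mean, and all function names are hypothetical.

```python
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed layout


def contains_point(box: Box, p: Point) -> bool:
    """True if point p lies inside the box or on its border."""
    x0, y0, x1, y1 = box
    return x0 <= p[0] <= x1 and y0 <= p[1] <= y1


def center(contour: Sequence[Point]) -> Point:
    """Vertex mean, a simple stand-in for an area-weighted centroid."""
    n = len(contour)
    return (sum(x for x, _ in contour) / n, sum(y for _, y in contour) / n)


def match_contour(contour: Sequence[Point], boxes: Sequence[Box]) -> Optional[int]:
    """Return the index of the box matched to the contour, or None."""
    # First pass: the whole contour must be contained in the box.
    for i, box in enumerate(boxes):
        if all(contains_point(box, p) for p in contour):
            return i
    # Fallback: only the contour's center must be contained in the box.
    c = center(contour)
    for i, box in enumerate(boxes):
        if contains_point(box, c):
            return i
    return None


boxes = [(0, 0, 10, 10), (10, 0, 20, 10)]
inside = [(1, 1), (4, 1), (4, 4), (1, 4)]        # fully inside the first box
straddling = [(8, 2), (18, 2), (18, 6), (8, 6)]  # crosses the box border
outside = [(30, 30), (31, 30), (31, 31)]         # matches no box

print(match_contour(inside, boxes))
print(match_contour(straddling, boxes))
print(match_contour(outside, boxes))
```

Trying strict containment before the center test means a contour that fits one box entirely is never assigned to a neighbor just because its center happens to sit on a shared border.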

Merged PRs

  • CD: master is now main by @bertsky in #185
  • 📝 extend changelog for v0.5.0 by @kba in #186
  • new attempt at #173 (valid polygons, faster deskewing, various fixes) by @bertsky in #192
  • XML encoding should be utf-8 not utf8 by @kba in #197
  • Fix overflow by @bertsky in #199
  • Prepare v0.6.0rc2 by @kba in #200
  • Training installation by @kba in #193
  • Integrate training from sbb pixelwise segmentation by @kba in #187
  • join_polygons: try to catch rare case of MultiPolygon by @kba in #203

Full Changelog: v0.5.0...v0.6.0