added markdown document for ocr engine comparison #13573
Conversation
Note that your PR will not be reviewed/accepted until you have gone through the mandatory checks in the description and marked each of them exactly in the required format.
@trag-bot didn't find any issues in the code! ✅✨
You have removed the "Mandatory Checks" section from your pull request description. Please adhere to our pull request template.
You modified Markdown. You can check the detailed error output by navigating to your pull request, selecting the tab "Checks", section "Tests" (on the left), subsection "Markdown".
## Context and Problem Statement
JabRef requires an OCR engine to extract text from scanned PDFs and image-based academic documents. Tesseract is currently implemented, but accuracy varies significantly with document quality and type. Academic documents present unique challenges: mathematical notation, multiple languages, complex layouts, tables, and mixed handwritten/printed content. Which OCR engine(s) should JabRef adopt to best serve its academic user base while balancing accuracy, cost, privacy, and implementation complexity?
Please follow the one-sentence-per-line rule to ease reviewing.
First sentence: imprecise. "scanned PDFs" and "image-based academic documents". I think it is "only" about PDFs, and they stem from scanned documents containing images?
"Tesseract is currently implemented". No, this document is about "selection", which means that the outcome of this decision is the engine. Maybe you wanted to say that there are different engines out there and their output varies.
I would put the sentence about academic documents second. That would then form the first paragraph.
The second paragraph is about the engines.
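The one-sentence-per-line rule can be checked mechanically. A minimal sketch, assuming a simple regex heuristic (sentence-ending punctuation followed by a capital letter); the function name and heuristic are illustrative, not JabRef's actual tooling:

```python
# Hypothetical checker: flag Markdown lines that appear to hold more than one
# sentence. The regex heuristic is an assumption and will miss edge cases
# (abbreviations such as "e.g." trigger false positives).
import re

def lines_breaking_rule(markdown: str) -> list[int]:
    """Return 1-based line numbers that appear to hold multiple sentences."""
    offenders = []
    for number, line in enumerate(markdown.splitlines(), start=1):
        # Skip headings, list markers, tables, and code fences.
        if line.lstrip().startswith(("#", "*", "-", "|", "```")):
            continue
        if re.search(r"[.!?]\s+[A-Z]", line):
            offenders.append(number)
    return offenders
```

Such a check could run in CI alongside the existing Markdown checks.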
* Accuracy on academic document types (printed papers, scanned books, handwritten notes)
* Privacy requirements for unpublished research materials
* Cost constraints (open-source project with limited funding)
This should come last; maybe be more explicit: "the solution should be available at no cost, preferably open source".
* Offline capability for secure research environments
* Processing speed for batch operations
* Implementation and maintenance complexity
* Table and structure extraction capabilities
Group the requirements on academic papers together.
* Cost constraints (open-source project with limited funding)
* Language support for international academic community
* Support for mathematical and scientific notation
* Offline capability for secure research environments
This is privacy (second point)?
## Considered Options
* Option 1: Tesseract OCR (current implementation)
Normally, MADR does not enumerate options via number.
#### Option 1: Tesseract OCR
Originally developed by Hewlett‑Packard as proprietary software in the 1980s, Tesseract was released as open source in 2005, and its development was sponsored by Google from 2006.
Source should be provided. Wikipedia is OK.
#### Option 4: Microsoft Azure Computer Vision
* **Good**, because [leads Category 1 of the above-mentioned Decision Drivers (digital screenshots) with 99.8% accuracy](https://research.aimultiple.com/ocr-accuracy/)
"Category 1" is not clear: the decision drivers above are not categorized. They could be, for sure.
* https://www.plugger.ai/blog/comparison-of-paddle-ocr-easyocr-kerasocr-and-tesseract-ocr
#### Option 6: PaddleOCR
Please give a link and a short description for each option. You started OK with the first one.
## Decision Outcome
Chosen option: "Option 1: Tesseract OCR", with planned addition of "Option 2: Google Cloud Vision API" as an optional premium feature, because Tesseract provides a solid free foundation while Google Vision offers superior accuracy for users willing to trade privacy for performance.
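The dual-engine outcome described above could be wired as a simple strategy selection: local Tesseract by default, the cloud engine only when the user opts in. A minimal sketch; the class and method names are illustrative assumptions, not JabRef's actual API (JabRef itself is Java; Python is used here for brevity):

```python
# Hypothetical engine-selection sketch: free local default, opt-in cloud premium.
from abc import ABC, abstractmethod

class OcrEngine(ABC):
    @abstractmethod
    def extract_text(self, pdf_path: str) -> str: ...

class TesseractEngine(OcrEngine):
    def extract_text(self, pdf_path: str) -> str:
        # Real code would invoke the tesseract binary or a binding such as tess4j.
        return f"[tesseract output for {pdf_path}]"

class CloudVisionEngine(OcrEngine):
    def extract_text(self, pdf_path: str) -> str:
        # Real code would call the Google Cloud Vision API (network access, API key).
        return f"[cloud output for {pdf_path}]"

def select_engine(premium_enabled: bool) -> OcrEngine:
    """Default to the local engine; use the cloud only if the user opted in."""
    return CloudVisionEngine() if premium_enabled else TesseractEngine()
```

Keeping the default local preserves the offline and privacy drivers; the cloud path stays strictly opt-in.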
Wrong MADR order: please check the template for the ordering. Pros and cons of the options come AFTER the decision outcome.
## Decision Drivers
* Accuracy on academic document types (printed papers, scanned books, handwritten notes)
Please provide example documents. Maybe even do your own test; I think, for a start, two documents are enough.
I don't know why "printed papers" is there - shouldn't it also be "scanned papers"?
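Such a do-it-yourself accuracy test could compare each engine's output against a hand-typed reference transcription using the character error rate (CER). A minimal sketch, pure stdlib; the OCR outputs would come from running each engine on the same scanned page:

```python
# CER = edit distance / reference length; lower is better, 0.0 is perfect.
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming (two rolling rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, ocr_output: str) -> float:
    """Character error rate of an OCR transcription against the reference."""
    return edit_distance(reference, ocr_output) / max(len(reference), 1)
```

Running this over two sample documents per engine, as suggested, would give concrete numbers to cite in the ADR instead of third-party accuracy claims.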
I guess I should have sent you this stuff some months ago, but here are some more options regardless:
ADR for OCR project.
#13313