added markdown document for ocr engine comparison #13573
Conversation
Note that your PR will not be reviewed/accepted until you have gone through the mandatory checks in the description and marked each of them exactly in the required format.
@trag-bot didn't find any issues in the code! ✅✨
You have removed the "Mandatory Checks" section from your pull request description. Please adhere to our pull request template.
You modified Markdown. You can check the detailed error output by navigating to your pull request, selecting the tab "Checks", section "Tests" (on the left), subsection "Markdown".
## Context and Problem Statement
JabRef requires an OCR engine to extract text from scanned PDFs and image-based academic documents. Tesseract is currently implemented, but accuracy varies significantly with document quality and type. Academic documents present unique challenges: mathematical notation, multiple languages, complex layouts, tables, and mixed handwritten/printed content. Which OCR engine(s) should JabRef adopt to best serve its academic user base while balancing accuracy, cost, privacy, and implementation complexity?
Please follow the one-sentence-per-line rule to ease reviewing.
First sentence: imprecise. "scanned PDFs" and "image-based academic documents". I think it is "only" about PDFs, and they stem from scanned documents containing images?
"Tesseract is currently implemented". No, this document is about "selection", which means that the outcome of this decision is the engine. Maybe you wanted to say that there are different engines out there and their output varies.
I would put the sentence about academic documents second. That would then form the first paragraph.
The second paragraph is about the engines.
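The one-sentence-per-line rule can be checked mechanically. A minimal sketch, assuming a simple regex heuristic (sentence-ending punctuation followed by a capital letter); the function name and heuristic are illustrative, not JabRef's actual tooling:

```python
# Hypothetical checker: flag Markdown lines that appear to hold more than one
# sentence. The regex heuristic is an assumption and will miss edge cases
# (abbreviations such as "e.g." trigger false positives).
import re

def lines_breaking_rule(markdown: str) -> list[int]:
    """Return 1-based line numbers that appear to hold multiple sentences."""
    offenders = []
    for number, line in enumerate(markdown.splitlines(), start=1):
        # Skip headings, list markers, tables, and code fences.
        if line.lstrip().startswith(("#", "*", "-", "|", "```")):
            continue
        if re.search(r"[.!?]\s+[A-Z]", line):
            offenders.append(number)
    return offenders
```

Such a check could run in CI alongside the existing Markdown checks.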
* Accuracy on academic document types (printed papers, scanned books, handwritten notes)
* Privacy requirements for unpublished research materials
* Cost constraints (open-source project with limited funding)
This should come last; maybe be more explicit: "the solution should be available at no cost, preferably open source".
* Offline capability for secure research environments
* Processing speed for batch operations
* Implementation and maintenance complexity
* Table and structure extraction capabilities
Group the requirements on academic papers together.
* Cost constraints (open-source project with limited funding)
* Language support for international academic community
* Support for mathematical and scientific notation
* Offline capability for secure research environments
This is privacy (second point)?
## Considered Options
* Option 1: Tesseract OCR (current implementation)
Normally, MADR does not enumerate options via number.
#### Option 1: Tesseract OCR
Originally developed by Hewlett‑Packard as proprietary software in the 1980s, Tesseract was released as open source in 2005, and its development was sponsored by Google from 2006.
Source should be provided. Wikipedia is OK.
#### Option 4: Microsoft Azure Computer Vision
* **Good**, because [leads Category 1 of the above-mentioned Decision Drivers (digital screenshots) with 99.8% accuracy](https://research.aimultiple.com/ocr-accuracy/)
"Category 1" is not clear: the decision drivers above are not categorized. They could be, for sure.
* https://www.plugger.ai/blog/comparison-of-paddle-ocr-easyocr-kerasocr-and-tesseract-ocr
#### Option 6: PaddleOCR
Please give a link and a short description for each option. You started OK with the first one.
## Decision Outcome
Chosen option: "Option 1: Tesseract OCR", with planned addition of "Option 2: Google Cloud Vision API" as an optional premium feature, because Tesseract provides a solid free foundation while Google Vision offers superior accuracy for users willing to trade privacy for performance.
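The dual-engine outcome described above could be wired as a simple strategy selection: local Tesseract by default, the cloud engine only when the user opts in. A minimal sketch; the class and method names are illustrative assumptions, not JabRef's actual API (JabRef itself is Java; Python is used here for brevity):

```python
# Hypothetical engine-selection sketch: free local default, opt-in cloud premium.
from abc import ABC, abstractmethod

class OcrEngine(ABC):
    @abstractmethod
    def extract_text(self, pdf_path: str) -> str: ...

class TesseractEngine(OcrEngine):
    def extract_text(self, pdf_path: str) -> str:
        # Real code would invoke the tesseract binary or a binding such as tess4j.
        return f"[tesseract output for {pdf_path}]"

class CloudVisionEngine(OcrEngine):
    def extract_text(self, pdf_path: str) -> str:
        # Real code would call the Google Cloud Vision API (network access, API key).
        return f"[cloud output for {pdf_path}]"

def select_engine(premium_enabled: bool) -> OcrEngine:
    """Default to the local engine; use the cloud only if the user opted in."""
    return CloudVisionEngine() if premium_enabled else TesseractEngine()
```

Keeping the default local preserves the offline and privacy drivers; the cloud path stays strictly opt-in.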
Wrong MADR order: please check the template for the ordering. Pros and cons of the options come AFTER the decision outcome.
## Decision Drivers
* Accuracy on academic document types (printed papers, scanned books, handwritten notes)
Please provide example documents. Maybe even do your own test; I think, for a start, two documents are enough.
I don't know why "printed papers" is there - shouldn't it also be "scanned papers"?
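Such a do-it-yourself accuracy test could compare each engine's output against a hand-typed reference transcription using the character error rate (CER). A minimal sketch, pure stdlib; the OCR outputs would come from running each engine on the same scanned page:

```python
# CER = edit distance / reference length; lower is better, 0.0 is perfect.
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming (two rolling rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, ocr_output: str) -> float:
    """Character error rate of an OCR transcription against the reference."""
    return edit_distance(reference, ocr_output) / max(len(reference), 1)
```

Running this over two sample documents per engine, as suggested, would give concrete numbers to cite in the ADR instead of third-party accuracy claims.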
I guess I should have sent you this stuff some months ago, but here are some more options regardless:
ADR for OCR project.
#13313