added markdown document for ocr engine comparison #13573

Draft
wants to merge 2 commits into main

Conversation

Kaan0029
Contributor

@Kaan0029 Kaan0029 commented Jul 22, 2025

ADR for OCR project.
#13313

@jabref-machine
Collaborator

Note that your PR will not be reviewed/accepted until you have gone through the mandatory checks in the description and marked each of them exactly in the format of [x] (done), [ ] (not done yet) or [/] (not applicable).


trag-bot bot commented Jul 23, 2025

@trag-bot didn't find any issues in the code! ✅✨

@jabref-machine
Collaborator

You have removed the "Mandatory Checks" section from your pull request description. Please adhere to our pull request template.

@jabref-machine
Collaborator

You modified Markdown (*.md) files and did not meet JabRef's rules for consistently formatted Markdown files. To ensure consistent styling, we have markdown-lint in place. Markdown lint's rules help to keep our Markdown files consistent within this repository and consistent with the Markdown files outside here.

You can check the detailed error output by navigating to your pull request, selecting the tab "Checks", section "Tests" (on the left), subsection "Markdown".

@koppor koppor marked this pull request as draft July 25, 2025 13:37

## Context and Problem Statement

JabRef requires an OCR engine to extract text from scanned PDFs and image-based academic documents. Tesseract is currently implemented, but accuracy varies significantly with document quality and type. Academic documents present unique challenges: mathematical notation, multiple languages, complex layouts, tables, and mixed handwritten/printed content. Which OCR engine(s) should JabRef adopt to best serve its academic user base while balancing accuracy, cost, privacy, and implementation complexity?
Member

Please follow the one-sentence-per-line rule to ease reviewing.

First sentence: imprecise. It mentions "scanned PDFs" and "image-based academic documents". I think it is "only" about PDFs that stem from scanned documents and therefore contain images?

"Tesseract is currently implemented". No, this document is about "selection", which means that the outcome of this decision is the engine. Maybe, you wanted to say that there are different engines out there and their output varies.

I would put the sentence about academic documents second. That then forms the first paragraph.

The second paragraph is about the engines.


* Accuracy on academic document types (printed papers, scanned books, handwritten notes)
* Privacy requirements for unpublished research materials
* Cost constraints (open-source project with limited funding)
Member

This should come last -- maybe be more explicit: "solution should be available at no cost, preferably open-source"

* Offline capability for secure research environments
* Processing speed for batch operations
* Implementation and maintenance complexity
* Table and structure extraction capabilities
Member

Group the requirements on academic papers together.

* Cost constraints (open-source project with limited funding)
* Language support for international academic community
* Support for mathematical and scientific notation
* Offline capability for secure research environments
Member

This is privacy (second point)?


## Considered Options

* Option 1: Tesseract OCR (current implementation)
Member

Normally, MADR does not enumerate options via number.


#### Option 1: Tesseract OCR

Tesseract was originally developed by Hewlett‑Packard as proprietary software in the 1980s and released as open source in 2005; Google began sponsoring its development in 2006.
Member

Source should be provided. Wikipedia is OK.


#### Option 4: Microsoft Azure Computer Vision

* **Good**, because [leads Category 1 of the above mentioned Decision Drivers (digital screenshots) with 99.8 % accuracy](https://research.aimultiple.com/ocr-accuracy/)
Member

"Category 1" is not clear - the decision drivers above are not categorized. They could be, for sure.

* https://www.plugger.ai/blog/comparison-of-paddle-ocr-easyocr-kerasocr-and-tesseract-ocr

#### Option 6: PaddleOCR

Member

Please give a link and a short description for each option. You started OKish with the first one.


## Decision Outcome

Chosen option: "Option 1: Tesseract OCR", with planned addition of "Option 2: Google Cloud Vision API" as an optional premium feature, because Tesseract provides a solid free foundation while Google Vision offers superior accuracy for users willing to trade privacy for performance.
Member

Wrong MADR order - please check the template for the ordering -- the pros and cons of the options come AFTER the decision outcome.


## Decision Drivers

* Accuracy on academic document types (printed papers, scanned books, handwritten notes)
Member

Please provide example documents. Maybe even do your own test. I think, for the start, two documents are enough.

I don't know why "printed papers" is there - shouldn't it also be "scanned papers"?

@ThiloteE
Member

I guess I should have sent you this stuff some months ago, but here are some more options regardless:

4 participants