PDFtoTextConvertor ocr option not working #5058

jais001 · 2023-05-31T16:04:37Z

Hi,

I am working on PDF-to-text conversion. When I pass the ocr argument to PDFToTextConverter, I get the following error:

unexpected keyword argument 'ocr'

I checked the API documentation for PDFToTextConverter and found this argument there. My Haystack version is:

farm-haystack==1.17.0

Does the argument exist in the latest version or is this an error?

Answered by anakin87

anakin87 · 2023-06-01T07:51:57Z

Maintainer

Hello @JaisVJ!

I am on Google Colab and the following commands work for me.
For more information, please take a look at the installation guide.

! pip install farm-haystack[pdf,ocr]==1.17.0    # Select the dependency groups you need: pdf and ocr
! apt-get install tesseract-ocr    # Tesseract is required for OCR

# Set the environment variable that points to the Tesseract data directory.
# This path works on Ubuntu (Colab); other operating systems use different locations.
import os
os.environ["TESSDATA_PREFIX"] = "/usr/share/tesseract-ocr/4.00/tessdata"

from haystack.nodes import PDFToTextConverter

converter = PDFToTextConverter(valid_languages=['en'], ocr='auto')
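
Before constructing the converter, it can help to confirm that the Tesseract prerequisites from the commands above are actually in place. The following is a minimal sketch using only the Python standard library; the helper name check_ocr_prerequisites is hypothetical and not part of Haystack, and the default path is the Ubuntu one from the answer, which may differ on other systems or Tesseract versions.

```python
import os
import shutil


def check_ocr_prerequisites(tessdata_dir):
    """Report whether the Tesseract binary and its data directory can be found.

    Hypothetical helper for illustration; it is not part of Haystack.
    """
    return {
        # shutil.which returns the executable's path, or None if it is not on PATH
        "tesseract_on_path": shutil.which("tesseract") is not None,
        # TESSDATA_PREFIX should point at an existing directory
        "tessdata_dir_exists": os.path.isdir(tessdata_dir),
    }


# Use TESSDATA_PREFIX if it is set, otherwise fall back to the Ubuntu path from the answer.
tessdata_dir = os.environ.get(
    "TESSDATA_PREFIX", "/usr/share/tesseract-ocr/4.00/tessdata"
)
print(check_ocr_prerequisites(tessdata_dir))
```

If either value is False, the apt-get step or the TESSDATA_PREFIX setting is the first thing to revisit.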

2 replies

jlncoelho Jan 25, 2024

Should haystack really be quietly failing to import the correct class and instead importing a different one with the same name?

try:
    with LazyImport() as fitz_import:
        # Try to use PyMuPDF, if not available fall back to xpdf
        from haystack.nodes.file_converter.pdf import PDFToTextConverter  # type: ignore

    fitz_import.check()
except (ModuleNotFoundError, ImportError):
    from haystack.nodes.file_converter.pdf_xpdf import PDFToTextConverter  # type: ignore  # pylint: disable=reimported,ungrouped-imports

anakin87 Jan 25, 2024
Maintainer

Hello...
I understand your concern, but there are two different options for PDF conversion, as explained in the docs, and they have different licenses.

In Haystack 2.0 (now in beta), there are different and more explicit options.
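
Because the fallback import quoted above binds the same name to either implementation, one way to see which class was actually imported is to inspect its defining module and constructor signature. This is a minimal sketch with the standard inspect module; describe_class, WithOcr, and WithoutOcr are hypothetical names used for illustration, and the two stand-in classes only mimic a constructor with and without an ocr parameter.

```python
import inspect


def describe_class(cls):
    """Return the module that defines cls and whether its constructor names ``ocr``.

    Note: a constructor that only takes **kwargs would report False here,
    even though it would accept ocr=... at call time.
    """
    params = inspect.signature(cls.__init__).parameters
    return {"module": cls.__module__, "accepts_ocr": "ocr" in params}


# Hypothetical stand-ins for the two implementations that share one name.
class WithOcr:
    def __init__(self, valid_languages=None, ocr=None):
        self.valid_languages = valid_languages
        self.ocr = ocr


class WithoutOcr:
    def __init__(self, valid_languages=None):
        self.valid_languages = valid_languages


print(describe_class(WithOcr)["accepts_ocr"])
print(describe_class(WithoutOcr)["accepts_ocr"])
```

With Haystack 1.x installed, the same helper can be applied to the class obtained from "from haystack.nodes import PDFToTextConverter". Going by the import snippet quoted above, a module ending in pdf_xpdf would indicate that the fallback class was imported, which would be consistent with the "unexpected keyword argument 'ocr'" error.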
