PDFtoTextConvertor ocr option not working #5058

Answered by anakin87
jais001 asked this question in Questions
Hello @JaisVJ!

I am on Google Colab and the following commands work for me.
For more information, please take a look at the installation guide.

! pip install farm-haystack[pdf,ocr]==1.17.0    # You should select the right dependencies groups.
! apt-get install tesseract-ocr    # Tesseract is needed for OCR

# I set the environment variable that points to the Tesseract data directory.
# This path works on Ubuntu. On other operating systems (or Tesseract versions), the directory may differ.
import os
os.environ["TESSDATA_PREFIX"] = "/usr/share/tesseract-ocr/4.00/tessdata"

from haystack.nodes import PDFToTextConverter

converter = PDFToTextConverter(valid_languages=['en'], ocr='auto')
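
Since the Tesseract data path above is specific to one Ubuntu setup, it may help to check that the directory exists before setting `TESSDATA_PREFIX`. The following is only a minimal sketch: the helpers `find_tessdata_dir` and `configure_tessdata` are hypothetical (not part of Haystack or Tesseract), and every candidate path other than the first is an assumption that you should adjust for your system.

```python
import os
from pathlib import Path
from typing import Iterable, Optional

# Candidate locations. Only the first comes from the answer above;
# the others are assumptions and may not exist on your machine.
DEFAULT_CANDIDATES = (
    "/usr/share/tesseract-ocr/4.00/tessdata",
    "/usr/share/tesseract-ocr/5/tessdata",
    "/usr/share/tessdata",
)


def find_tessdata_dir(candidates: Iterable[str] = DEFAULT_CANDIDATES) -> Optional[str]:
    """Return the first candidate directory holding at least one .traineddata file."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_dir() and any(path.glob("*.traineddata")):
            return str(path)
    return None


def configure_tessdata(candidates: Iterable[str] = DEFAULT_CANDIDATES) -> Optional[str]:
    """Set TESSDATA_PREFIX to the first usable directory, or leave it untouched."""
    found = find_tessdata_dir(candidates)
    if found is not None:
        os.environ["TESSDATA_PREFIX"] = found
    return found
```

If `configure_tessdata()` returns `None`, none of the candidate directories contains language data, so Tesseract is probably not installed or lives elsewhere. In that case, locate the `tessdata` directory manually and pass it in as the only candidate.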

Answer selected by jais001