4.0 with LSTM
Tesseract 4.0 alpha source code is available in the 'master' branch of the repository. It adds a new OCR engine based on LSTM neural networks. It initially works (well) on x86/Linux. Model data for 101 languages is available in the tessdata repository.
- DAS 2016 tutorial slides
  Slides #2, #6, and #7 cover LSTM integration in Tesseract 4.0.
- TrainingTesseract 4.00
- TrainingTesseract 4.00 - Finetuning Example - Arabic
- TrainingTesseract 4.00 - Replace Top Layer Example - Norwegian
- TrainingTesseract 4.00 - Replace Top Layer Example - Devanagari
Box files from version 3.0 can be converted for use with LSTM training by adding a tab character at the end of each line. The Mark EOL function under Edit in the Box Editor tab of the latest version of jTessBoxEditor can do this automatically.
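For those who prefer to script the conversion, the following is a rough sketch only, not the method jTessBoxEditor uses. It reads "line" as a line of text on the page and assumes the end-of-line mark is an extra box entry whose symbol is a tab character. The function names (`parse_box`, `eol_marker`, `mark_eol`), the one-pixel-wide marker coordinates, and the left-to-right heuristic for detecting where a text line ends are all hypothetical choices made for illustration.

```python
from typing import List, Tuple

Box = Tuple[str, int, int, int, int, int]  # symbol, left, bottom, right, top, page


def parse_box(line: str) -> Box:
    """Parse one 3.0-style box entry: '<symbol> <left> <bottom> <right> <top> <page>'."""
    symbol, left, bottom, right, top, page = line.rstrip("\n").rsplit(" ", 5)
    return symbol, int(left), int(bottom), int(right), int(top), int(page)


def eol_marker(box: Box) -> str:
    """Build a tab-symbol entry just right of the given box.

    Assumption: a one-pixel-wide box at the right edge of the last
    character is an acceptable position for the mark.
    """
    _, _, bottom, right, top, page = box
    return f"\t {right} {bottom} {right + 1} {top} {page}"


def mark_eol(box_lines: List[str]) -> List[str]:
    """Return the box entries with a tab entry added after each text line.

    Assumption: text runs left to right, so a text line ends when the
    next box starts further left, when the page changes, or at the end
    of the file. Input that already contains tab entries is not handled.
    """
    boxes = [parse_box(line) for line in box_lines if line.strip()]
    out: List[str] = []
    for i, box in enumerate(boxes):
        out.append("{} {} {} {} {} {}".format(*box))
        nxt = boxes[i + 1] if i + 1 < len(boxes) else None
        if nxt is None or nxt[5] != box[5] or nxt[1] < box[1]:
            out.append(eol_marker(box))
    return out


if __name__ == "__main__":
    sample = ["a 10 50 20 60 0", "b 22 50 30 60 0", "c 10 30 20 40 0"]
    for entry in mark_eol(sample):
        print(repr(entry))
```

The heuristic will misjudge right-to-left scripts and multi-column pages, so the result should be checked in a box editor before training.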
Unofficial Ubuntu PPAs for Tesseract 4.00 & Leptonica 1.74:
Leptonica 1.74.1 package for Debian:
Unofficial experimental binaries of tesseract-ocr 4.0.0-alpha (Jan 30, 2017) are available from the following links:
- Windows Installer made with MinGW-w64 from UB Mannheim
- zip file with cppan-generated .dll and .exe files. You have to install the VC2015 x86 redistributable from microsoft.com in order to run them.
Windows binaries of tesseract-ocr 4.0.0-alpha with GUI interface are available for VietOCR from
-
The Visual C++ Redistributable for Visual Studio 2015 runtime (vc_redist.x86.exe) is REQUIRED for VietOCR to run correctly.
VietOCR can be used to download the appropriate 4.0.0-alpha traineddata for additional languages.
Windows binaries of tesseract-ocr 4.0.0-alpha with GUI interface are available for gImageReader from
- gImageReader_3.2.1_qt5_i686_tesseract4.0.0.git2f10be5.exe
- gImageReader_3.2.1_qt5_x86_64_tesseract4.0.0.git2f10be5.exe
Download the 4.0.0-alpha traineddata to use with the above from the master branch of tessdata. For example, for Hindi, download the following file:
https://github.com/tesseract-ocr/tessdata/blob/master/hin.traineddata
An unofficial installer for Tesseract 3.05-dev for Windows is available from [Tesseract at UB Mannheim](https://github.com/UB-Mannheim/tesseract/wiki). This includes the training tools.
The [3.05 branch on GitHub](https://github.com/tesseract-ocr/tesseract/tree/3.05) can be used by those who want the bug fixes for 3.04 release.
The current official release is [3.04.1](https://github.com/tesseract-ocr/tesseract/releases/tag/3.04.01).
Old wiki - no longer maintained. The pages were moved; see the new documentation.
These wiki pages are no longer maintained.
All pages were moved to tesseract-ocr/tessdoc.
The latest documentation is available at https://tesseract-ocr.github.io/.