|
| 1 | +ocropy |
| 2 | +====== |
| 3 | + |
| 4 | +Python-based OCR package using recurrent neural networks. |
| 5 | + |
| 6 | +To install, use: |
| 7 | + |
| 8 | + $ sudo apt-get install $(cat PACKAGES) |
| 9 | + $ wget -nd http://www.tmbdev.org/en-default.pyrnn.gz |
| 10 | + $ mv en-default.pyrnn.gz models/ |
| 11 | + $ sudo python setup.py install |
| 12 | + |
| 13 | +To test the recognizer, run: |
| 14 | + |
| 15 | + $ ./run-test |
| 16 | + |
| 17 | +OCRopus is really a collection of document analysis programs, not a turn-key OCR system. |
| 18 | + |
| 19 | +In addition to the recognition scripts themselves, there are a number of scripts for |
| 20 | +ground truth editing and correction, measuring error rates, determining confusion matrices, etc. |
| 21 | +When a command fails, OCRopus will generally print a stack trace along with the error message; |
| 22 | +the stack trace by itself does not generally indicate a bug (in a future release, we'll suppress the stack |
| 23 | +trace by default since it seems to confuse too many users). |
| 24 | + |
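As an illustration of what "measuring error rates" usually means for OCR output, here is a minimal sketch of a character error rate: the edit distance between the ground truth and the recognized text, divided by the length of the ground truth. The function names `edit_distance` and `char_error_rate` are hypothetical and are not part of OCRopus; the project's own scripts may normalize text or count errors differently.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def char_error_rate(truth, ocr):
    """Edit distance normalized by the ground-truth length (assumed definition)."""
    if not truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(truth, ocr) / len(truth)
```

For a ground-truth line and its recognized counterpart, `char_error_rate(truth, ocr)` gives the fraction of ground-truth characters that would need to be corrected.
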
| 25 | +To recognize pages of text, you need to run separate commands: binarization, page layout |
| 26 | +analysis, and text line recognition. |
| 27 | + |
| 28 | + # perform binarization |
| 29 | + ./ocropus-nlbin tests/ersch.png -o book |
| 30 | + |
| 31 | + # perform page layout analysis |
| 32 | + ./ocropus-gpageseg 'book/????.bin.png' |
| 33 | + |
| 34 | + # perform text line recognition (on four cores, with a fraktur model) |
| 35 | + ./ocropus-rpred -Q 4 -m models/fraktur.pyrnn.gz 'book/????/??????.bin.png' |
| 36 | + |
| 37 | + # generate HTML output |
| 38 | + ./ocropus-hocr 'book/????.bin.png' -o ersch.html |
| 39 | + |
| 40 | + # display the output |
| 41 | + firefox ersch.html |
| 42 | + |
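The recognition steps above can also be driven from Python. The sketch below only builds the argument lists and executes nothing; the function name `ocropus_pipeline` and its parameters are hypothetical, while the commands, flags, and paths are the ones shown in this README. The README quotes the glob patterns so that the shell passes them through unexpanded, and argument lists run without a shell behave the same way.

```python
import shlex


def ocropus_pipeline(image, workdir="book", model="models/fraktur.pyrnn.gz",
                     cores=4, html_out="ersch.html"):
    """Return the README's four commands as argument lists (nothing is run)."""
    pages = f"{workdir}/????.bin.png"          # binarized pages
    lines = f"{workdir}/????/??????.bin.png"   # segmented text lines
    return [
        ["./ocropus-nlbin", image, "-o", workdir],                  # binarization
        ["./ocropus-gpageseg", pages],                              # page layout analysis
        ["./ocropus-rpred", "-Q", str(cores), "-m", model, lines],  # line recognition
        ["./ocropus-hocr", pages, "-o", html_out],                  # HTML output
    ]


if __name__ == "__main__":
    # Print the commands in copy-pasteable shell form.
    for cmd in ocropus_pipeline("tests/ersch.png"):
        print(" ".join(shlex.quote(arg) for arg in cmd))
```

To actually run the steps, each list could be passed to `subprocess.run(cmd, check=True)` from the repository root, which stops at the first failing command.
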
| 43 | +There are some things the currently trained models for ocropus-rpred |
| 44 | +will not handle well, largely because they are nearly absent from the |
| 45 | +current training data. These include all-caps text, some special symbols |
| 46 | +(including "?"), typewriter fonts, and subscripts/superscripts. These gaps will |
| 47 | +be addressed in a future release, and, of course, you are welcome to contribute |
| 48 | +new, trained models. |