Commit cb550ad

Created README.md
1 parent f2fb3be commit cb550ad

1 file changed: +48 -0 lines changed

README.md

Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
ocropy
======

Python-based OCR package using recurrent neural networks.

To install, use:

    $ sudo apt-get install $(cat PACKAGES)
    $ wget -nd http://www.tmbdev.org/en-default.pyrnn.gz
    $ mv en-default.pyrnn.gz models/
    $ sudo python setup.py install

To test the recognizer, run:

    $ ./run-test

OCRopus is a collection of document analysis programs, not a turn-key OCR system.

In addition to the recognition scripts themselves, there are a number of scripts for
ground truth editing and correction, measuring error rates, determining confusion matrices, etc.
OCRopus commands will generally print a stack trace along with an error message;
the stack trace does not generally indicate a problem (a future release will suppress
it by default, since it seems to confuse too many users).

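To make "measuring error rates" concrete, here is a minimal sketch of a character error rate: the edit distance between a ground-truth line and the recognizer output, divided by the length of the ground truth. This is an illustration under that definition, not the implementation used by the OCRopus scripts; the function names `edit_distance` and `char_error_rate` are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def char_error_rate(truth, ocr):
    """Edit distance normalized by the ground-truth length (hypothetical helper)."""
    if not truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(truth, ocr) / float(len(truth))


if __name__ == "__main__":
    # One substituted character in a four-character line.
    print(char_error_rate("abcd", "abxd"))
```

Normalizing by the ground-truth length means the rate can exceed 1.0 when the output contains many spurious characters.
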
To recognize pages of text, you need to run separate commands for binarization, page layout
analysis, and text line recognition.

    # perform binarization
    ./ocropus-nlbin tests/ersch.png -o book

    # perform page layout analysis
    ./ocropus-gpageseg 'book/????.bin.png'

    # perform text line recognition (on four cores, with a fraktur model)
    ./ocropus-rpred -Q 4 -m models/fraktur.pyrnn.gz 'book/????/??????.bin.png'

    # generate HTML output
    ./ocropus-hocr 'book/????.bin.png' -o ersch.html

    # display the output
    firefox ersch.html

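The steps above can also be chained from a small driver script. The following is a sketch only: it reuses the command names, flags, and wildcard patterns exactly as they appear in this README, while the helper names `build_pipeline` and `run_pipeline` and their parameters are hypothetical. It also assumes that a failing step exits with a non-zero status, which this README does not state.

```python
import subprocess


def build_pipeline(image, book_dir="book", model="models/fraktur.pyrnn.gz",
                   cores=4, html_out="ersch.html"):
    """Return the four README commands as argument lists (hypothetical helper)."""
    # The patterns stay literal strings: no shell is involved, which mirrors
    # the single-quoted patterns in the README commands.
    pages = "{0}/????.bin.png".format(book_dir)
    lines = "{0}/????/??????.bin.png".format(book_dir)
    return [
        ["./ocropus-nlbin", image, "-o", book_dir],                        # binarization
        ["./ocropus-gpageseg", pages],                                     # page layout analysis
        ["./ocropus-rpred", "-Q", str(cores), "-m", model, lines],         # text line recognition
        ["./ocropus-hocr", pages, "-o", html_out],                         # HTML output
    ]


def run_pipeline(commands, dry_run=True):
    """Print each command, or run it and stop at the first non-zero exit status."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check_call raises CalledProcessError on a non-zero exit status.
            subprocess.check_call(cmd)


if __name__ == "__main__":
    # Dry run: only prints the commands, nothing is executed.
    run_pipeline(build_pipeline("tests/ersch.png"))
```

Calling `run_pipeline(..., dry_run=False)` from the repository root would execute the steps in order; keeping the default dry run is a cheap way to check the arguments first.
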
There are some things the currently trained models for ocropus-rpred
will not handle well, largely because they are nearly absent in the
current training data. That includes all-caps text, some special symbols
(including "?"), typewriter fonts, and subscripts/superscripts. This will
be addressed in a future release, and, of course, you are welcome to contribute
new, trained models.
