German - Characters added to result multiple times (aä / AÄ) #1060
Additional example (wW): VW-Werk -> VwW-Werk |
Please test with the latest models available in
https://github.com/tesseract-ocr/tessdata/tree/master/best
ShreeDevi
|
|
The fix for most problems with the LSTM engine is more / better training. See DAS2016 Slides, 6. Modernization Efforts, page 17. I think that for 'in dictionary' words these kinds of duplications would be eliminated. |
deu.traineddata (19.721 KB) from https://github.com/tesseract-ocr/tessdata/tree/master/best: the best training data only delivers empty results. |
Are you using --oem 1? You can see the contents of the traineddata with
combine_tessdata -u deu.traineddata
These are probably LSTM-only models and do not have the legacy engine, which is used via --oem 0. |
@stweil Have you tested the deu model? |
Yes, I'm using --oem 1 I'm just switching deu.traineddata in tessdata |
Looks like you need both deu and frk models:
wget -O ./tess4data-save/deubest.traineddata https://github.com/tesseract-ocr/tessdata/blob/master/best/deu.traineddata?raw=true
sudo cp ./tess4data-save/*.traineddata /usr/share/tesseract-ocr/4.00/tessdata
time tesseract ./tif/phototest.tif stdout --oem 1 -l deu
|
works on linux - looks for frk traineddata, probably listed in deu.config |
The new best one? No, I have not tested it yet. I am currently focused on Fraktur where the new results clearly beat the old ones.
I noticed on Linux that "old" Tesseract executables crash with the new traineddata, so I expect that my current Windows binaries would crash, too. Building new ones is on my list. |
thank you |
The new binaries are now available. |
Thanks!
:-) |
Thank you for the new binaries. There are still similar errors: hitzefrei -> 1 x hitzefreii / 1 x hitzefreil. Suggestion: The results are a lot better with 4.0 LSTM than with 3.05.01, but training seems to be difficult. Maybe it would be a good idea to offer a webpage where people could upload example image files and matching text files to include them in the training process. |
@theraysmith, @TheSeiko, maybe you'd get better results for Antiqua text without that |
Thank you for the tip. Much appreciated! |
@stweil Am I doing something wrong? There's only a version file included in the deu.traineddata when using the binaries from 04.08:
E:\Tesseract-OCR4.0a2>combine_tessdata -u deu.traineddata tmp/deu
Extracting tessdata components from deu.traineddata
Wrote tmp/deu.version
Version string:4.0.0-alpha.20170804
23:version:size=20, offset=192
E:\Tesseract-OCR4.0a2>combine_tessdata -u deu.traineddata tmp/deu.
Extracting tessdata components from deu.traineddata
Wrote tmp/deu.version
Version string:4.0.0-alpha.20170804
23:version:size=20, offset=192
E:\Tesseract-OCR4.0a2>combine_tessdata -d deu.traineddata
Version string:4.0.0-alpha.20170804
23:version:size=20, offset=192
|
Looks like I introduced a bug.
If the traineddata file doesn't exist, it makes an empty one with a version
string in it, instead of complaining about the non-existent file.
--
Ray.
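A minimal sketch of how the `combine_tessdata -d` listing quoted in this thread could be checked automatically for the "version only" symptom. The line format (`index:name:size=N, offset=M`) is assumed from the output pasted above, and the helper names `component_names` and `is_version_only` are hypothetical, not part of Tesseract.

```python
import re
from typing import List

# Assumed entry format, taken from the output quoted in this thread:
#   23:version:size=20, offset=192
ENTRY = re.compile(r"^(\d+):([^:]+):size=(\d+), offset=(\d+)$")


def component_names(listing: str) -> List[str]:
    """Return the component names found in a combine_tessdata -d listing."""
    names = []
    for line in listing.splitlines():
        match = ENTRY.match(line.strip())
        if match:
            names.append(match.group(2))
    return names


def is_version_only(listing: str) -> bool:
    """True if the listing contains a version entry and nothing else."""
    return component_names(listing) == ["version"]


sample = """Version string:4.0.0-alpha.20170804
23:version:size=20, offset=192"""
print(is_version_only(sample))
```

A listing that reports only the version entry would suggest that the file was freshly created as an empty one rather than read from disk, which matches the bug described here.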
|
Ray,
There have been a number of reports of people not being able to run the
English tutorial training.
Missing eng.config etc
Posted in tesseract-ocr forum
|
I just made 3 commits that address some of these issues:
Error message for lack of --traineddata arg referring to wiki.
Emphasis that the lack of config file is just a warning.
Detected non-existent traineddata file in combine_tessdata.
It seems the majority of the problems are lack of sync of code/data. There
are dependencies between code and data that have changed due to moving the
unicharset from the LSTM model to the traineddata file.
--
Ray.
|
Yes. That is the problem. One possible solution that I have been asking for for a while is the tagging of "important" commits. Then it would be easy to say: use tesseract, tessdata, langdata as of 4.0.0alpha-20170807 |
@stweil thank you, removing deu.config helped a lot. With the best deu traineddata and no deu.config, after ~50k test images: great recognition rate. Only problem so far: sometimes i is not recognised properly: sıch - sich. I'm adding a regex to replace ı with i |
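The replacement mentioned above could look roughly like this. It is only a sketch: the function name `fix_dotless_i` is made up for illustration, and it assumes the text is German, where the dotless ı (U+0131) does not normally occur. For Turkish text this substitution would be wrong.

```python
import re

# U+0131 LATIN SMALL LETTER DOTLESS I, which the OCR sometimes emits for "i".
DOTLESS_I = re.compile("\u0131")


def fix_dotless_i(text: str) -> str:
    """Replace every dotless i in OCR output with a regular i."""
    return DOTLESS_I.sub("i", text)


print(fix_dotless_i("s\u0131ch"))
```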
and j -> J
OCR Result <-> Text in image
$$-Jährige <-> $$-jährige
|
Interesting. I remember from learning German at school that all nouns begin
with a capital, so why do yours not?
I would assume from the errors that you describe that the network has
learned that all nouns begin with a capital, so it hallucinates one even
when it is not there.
If you have a lot of non-capital nouns for some reason, it might do better
in 'Latin' than 'deu'
--
Ray.
|
@theraysmith jährige/Jährige can be both: a noun (capital letter) or an adjective (lowercase). Latin is working better with this problem; I had it running yesterday for ~100k frames. I've collected some example images and I'll try to do the "Fine Tuning Training". |
| 10 | 10 | Arzl-Ost | Arzl-0Ost | 2017-08-11 09:34:41 | - LATIN |
| 10 | 1502045216726 | Oberwölz | Oberwõlz |
@stweil |
One thing I've found out is that sometimes the dots of an umlaut, e.g. in Ö, are assigned to the previous line. So the Ö is recognised as an O and the previous line has dots added. Sometimes the reason for this is that the previous line has a different character size than the following paragraph. But this is only one case. |
This looks like a different issue from the original one. |
C:\Tesseract-OCR20200328>tesseract --version
C:\Tesseract-OCR20190314>tesseract --version |
pathTesseract: C://Tesseract-OCR20190314/tesseract
|
@stweil - grófste pathTesseract: C://Tesseract-OCR20200328/tesseract
|
@stweil ÓFB-Legionaàr pathTesseract: C://Tesseract-OCR20200328/tesseract
|
@stweil A^4 pathTesseract: C://Tesseract-OCR20200328/tesseract
|
@stweil 4 comes from nowhere pathTesseract: C://Tesseract-OCR20200328/tesseract
|
tesseract 4.00.00alpha
leptonica-1.74.1
libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.5.0) : libpng 1.6.20 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.3 : libopenjp2 2.1.0
Win10 64bit - built Uni Mannheim
deu.traineddata - Repeating of characters:
Current Behavior:
ÄGYPTEN -> ÄAGYPTEN
Grand-Prix -> Gräand-Prix
AUSTRALIEN -> AUSTRAÄLIEN
GROSSBRITANNIEN -> GROSSBRITAÄANNIEN
Expected Behavior:
ÄGYPTEN -> ÄGYPTEN
Grand-Prix -> Grand-Prix
AUSTRALIEN -> AUSTRALIEN
GROSSBRITANNIEN -> GROSSBRITANNIEN
Suggested Fix:
1 blob / 1 box should only be 1 outcome / 1 result
Additional Info:
Example images are available for posting
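Until the engine itself is fixed, the duplications listed under Current Behavior could be flagged in post-processing. The following is a heuristic sketch, not anything Tesseract provides: it reports adjacent characters that share a base letter but differ in their diacritic (such as ÄA or aä). The function names are hypothetical, a rare legitimate word could trigger it, and case duplications like wW are not covered.

```python
import unicodedata
from typing import List


def base_letter(ch: str) -> str:
    """Strip combining marks from a character, e.g. 'Ä' becomes 'A'."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def suspicious_pairs(word: str) -> List[str]:
    """Return adjacent pairs with the same base letter but different accents."""
    word = unicodedata.normalize("NFC", word)
    hits = []
    for left, right in zip(word, word[1:]):
        # Identical neighbours (e.g. "nn") are normal; only flag variants.
        if left != right and base_letter(left) == base_letter(right):
            hits.append(left + right)
    return hits


for candidate in ["\u00c4AGYPTEN", "Gr\u00e4and-Prix", "AUSTRALIEN"]:
    print(candidate, suspicious_pairs(candidate))
```

Words that come back with a non-empty list could then be sent for manual review or compared against a dictionary before any automatic correction is attempted.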