Environment: tesseract 4.0.0, leptonica-1.76.0, libjpeg 9c

Generated .box files have identical coordinates for every character (tesstrain issue, closed)

tesseract-ocr commented on May 29, 2024

Generated .box files have identical coordinates for every character

from tesstrain.

Comments (31)

jaddoughman commented on May 29, 2024

و 0 0 223 17 0
ن 0 0 223 17 0
ق 0 0 223 17 0
ل 0 0 223 17 0
ت 0 0 223 17 0
0 0 223 17 0
ص 0 0 223 17 0
ح 0 0 223 17 0
ف 0 0 223 17 0
0 0 223 17 0
ه 0 0 223 17 0
ن 0 0 223 17 0
د 0 0 223 17 0
ي 0 0 223 17 0
ة 0 0 223 17 0
0 0 223 17 0
ا 0 0 223 17 0
م 0 0 223 17 0
س 0 0 223 17 0
0 0 223 17 0

amitdo commented on May 29, 2024

Generated .box files have identical coordinates for every character

It's not a bug, it's a feature.

The LSTM engine needs only line boxes. If you give it char boxes, the first thing it will do is build line boxes from the char box info.
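To make the point above concrete, here is a minimal Python sketch of how per-character boxes collapse into one line box. It assumes the usual .box field order (symbol, left, bottom, right, top, page) and assumes the merge is a plain bounding-box union; the helper names `parse_box_line` and `union_box` are hypothetical, and Tesseract's internal logic may differ.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # left, bottom, right, top


def parse_box_line(line: str) -> Tuple[str, Box, int]:
    """Split one .box line into (symbol, box, page).

    The last five fields are numbers; whatever precedes them is the
    symbol, which may itself be a space.
    """
    symbol, *nums = line.rstrip("\n").rsplit(" ", 5)
    left, bottom, right, top, page = (int(n) for n in nums)
    return symbol, (left, bottom, right, top), page


def union_box(boxes: List[Box]) -> Box:
    """Smallest box that contains every box in the list (assumed merge rule)."""
    lefts, bottoms, rights, tops = zip(*boxes)
    return min(lefts), min(bottoms), max(rights), max(tops)
```

When every character already carries the whole-line box, as in the file posted above, the union is simply that same box, which is why identical coordinates are harmless here.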

jaddoughman commented on May 29, 2024

But in the example .box file I gave above, the text is generated as RTL, not LTR. Will this create an issue when fine-tuning? If yes, will inverting the strings to LTR fix my issue?

@amitdo

amitdo commented on May 29, 2024

About the RTL issue.

The ground truth text file needs to be converted from logical order to visual order.

https://www.unix.com/man-page/linux/1/fribidi
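As a rough illustration of what "logical order to visual order" means for a line that is entirely right-to-left, the sketch below reverses the text while keeping combining marks attached to their base character. This is a simplification and not a replacement for fribidi: it does not implement the Unicode bidirectional algorithm, so embedded left-to-right runs such as digits or Latin words would come out wrong. The function name `to_visual_order` is hypothetical.

```python
import unicodedata


def to_visual_order(line: str) -> str:
    """Naively reverse a purely right-to-left line.

    Combining marks (for example Arabic or Hebrew diacritics) stay
    attached to the base character they follow. Assumption: the line
    contains no left-to-right runs.
    """
    clusters = []
    for ch in line:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch  # keep the mark with its base character
        else:
            clusters.append(ch)
    return "".join(reversed(clusters))
```

For real ground truth, running the files through fribidi as suggested above is the safer route.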

wrznr commented on May 29, 2024

Thanks for sharing the example. I will try to test it tomorrow and get back to you.

On 03.12.2018 at 21:55, jaddoughman wrote:

After following your instructions and converting the .gt.txt file and the generated .box file to LTR order, the training looks like this:

Iteration 459: ALIGNED TRUTH : 00000000 0000000 00000 000000" 00000 00000
Iteration 459: BEST OCR TEXT : 00000000 0000000 00000 000000" 00000 00000"
File data/train/line_1_5.lstmf page 0 (Perfect):
Mean rms=4.529%, delta=32.952%, train=72.624%(67.841%), skip ratio=0%
Iteration 460: ALIGNED TRUTH : 00000 0000 00 00000 0000
Iteration 460: BEST OCR TEXT : ل00000 0000 00 00000 0000
File data/train/line_1_7.lstmf page 0 :
Mean rms=4.522%, delta=32.882%, train=72.484%(67.737%), skip ratio=0%
Iteration 461: ALIGNED TRUTH : 000000000 0000000 00000 000000" 00000 00000
Iteration 461: BEST OCR TEXT : 00000000 0000000 00000 000000" 00000 0000
File data/train/line_1_5.lstmf page 0 (Perfect):
Mean rms=4.515%, delta=32.811%, train=72.342%(67.626%), skip ratio=0%
Iteration 462: ALIGNED TRUTH : 00000 0000 00 00000 0000
Iteration 462: BEST OCR TEXT : ل00000 0000 00 00000 0000

Can you explain the reason behind such an error? I will attach the txt and box files below.

line_1_5.gt.txt line_1_7.gt.txt line_1_8.gt.txt

0 67 894 0 0 ﻦ 0 67 894 0 0 ﻴ 0 67 894 0 0 ﺒ 0 67 894 0 0 ﻧ 0 67 894 0 0 ﺬ 0 67 894 0 0 ﻤ 0 67 894 0 0 ﻟ 0 67 894 0 0 ﺍ 0 0 894 67 0 0 67 894 0 0 ﺔ 0 67 894 0 0 ﺒ 0 67 894 0 0 ﻗ 0 67 894 0 0 ﺎ 0 67 894 0 0 ﻌ 0 67 894 0 0 ﻣ 0 67 894 0 0 ﻭ 0 0 894 67 0 0 67 894 0 0 ﻝ 0 67 894 0 0 ﺪ 0 67 894 0 0 ﻌ 0 67 894 0 0 ﻟ 0 67 894 0 0 ﺍ 0 0 894 67 0 0 67 894 0 0 ﻖ 0 67 894 0 0 ﻴ 0 67 894 0 0 ﻘ 0 67 894 0 0 ﺤ 0 67 894 0 0 ﺘ 0 67 894 0 0 ﺑ " 0 0 894 67 0 0 0 894 67 0 0 67 894 0 0 ﺪ 0 67 894 0 0 ﻬ 0 67 894 0 0 ﻌ 0 67 894 0 0 ﺘ 0 67 894 0 0 ﻳ 0 0 894 67 0 0 67 894 0 0 ﻙ 0 67 894 0 0 ﺭ 0 67 894 0 0 ﺎ 0 67 894 0 0 ﺒ 0 67 894 0 0 ﻣ " 0 0 894 67 0 894 67 895 68 0

jaddoughman commented on May 29, 2024

@wrznr

I attached a small dataset of Arabic text lines and their ground truth below. They are in RTL direction. I need to fine-tune the _best Arabic trained data model. Any help in doing so would be greatly appreciated.

Dataset.zip

jaddoughman commented on May 29, 2024

@amitdo

If I follow your instructions and change the ground truth to LTR, wouldn't I have to invert the TIFF images as well? The txt file would be inverted when changed to LTR; wouldn't that be an issue when generating the .lstmf files?

jaddoughman commented on May 29, 2024

@amitdo @wrznr

Should I convert only the .gt.txt files to LTR, or should I also convert the resulting .box files? If so, what should be done with the images? Shouldn't they match the inverted text files?

amitdo commented on May 29, 2024

You should use the regular RTL text for generating the images.

Try using tesseract's text2image and you'll see that the chars are in visual order (reversed).

@wrznr, if you want to support RTL text, you should check that the output of fribidi + splitting the chars to lines matches the output of text2image (the chars order should be the same, not the boxes).

jaddoughman commented on May 29, 2024

@amitdo

You misinterpreted my question. I already have the images of the text lines generated, and I have their ground truth. Both the images (tif) and the text lines (gt.txt) are in RTL. After converting the gt.txt files to LTR using fribidi, should I change the resulting box files and/or the original (tif) files?

What steps are needed to fine-tune using Arabic text line images?

jaddoughman commented on May 29, 2024

@amitdo
Check out my data set below to visualize my issue.

Dataset.zip

amitdo commented on May 29, 2024

Should i change the resultant box files

The chars order in the box files should match the reversed ground truth text.

("Hello everyone" in Hebrew):
שלום לכולם
=>
םלוכל םולש
=>

ם
ל
ו
כ
ל
 
ם
ו
ל
ש

and/or the original (tif) files ?

No.
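The Hebrew example above (reversed text, then one character per box entry) can be sketched in Python as follows. The output format mirrors the box lines posted in this thread: every character shares the full line box `0 0 width height`, and the final entry is assumed to be a tab character with a box one pixel beyond the line, matching the `894 67 895 68 0` entry at the end of the box dump earlier in the thread. The function name `line_box_entries` is hypothetical, and the image width and height have to be supplied by the caller.

```python
from typing import List


def line_box_entries(visual_text: str, width: int, height: int) -> List[str]:
    """Build .box entries for one text line already in visual order.

    Every character gets the same whole-line box; only the ORDER of the
    characters carries information for LSTM training.
    """
    entries = [f"{ch} 0 0 {width} {height} 0" for ch in visual_text]
    # Assumed end-of-line marker: a tab with a box just outside the line.
    entries.append(f"\t {width} {height} {width + 1} {height + 1} 0")
    return entries
```

For instance, `line_box_entries("םלוכל םולש", 223, 17)` yields eleven entries: one per character (including the space) plus the end-of-line marker. The image itself is left untouched.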

jaddoughman commented on May 29, 2024

@amitdo

Okay, the box files will automatically have the same order as the reversed txt file. But does it create an issue if, in the box files, the coordinates come before the letter? Even after changing the txt file to RTL, the resulting box files have the format (0 0 0 0 letter), not (letter 0 0 0 0), where 0 stands for any coordinate.

amitdo commented on May 29, 2024

In which application do you view the box file?

jaddoughman commented on May 29, 2024

@amitdo

I open the box files using "gedit" on Ubuntu 16.04

amitdo commented on May 29, 2024

Please provide an example (just one tif, text, reversed, box).

jaddoughman commented on May 29, 2024

@amitdo

The Sample folder contains the normal tif file with the reversed text file (as you recommended using fribidi) and the generated box files (automatically reversed when generated).

Sample.zip

amitdo commented on May 29, 2024

I also opened it in gedit (Debian 9). It's fine.

jaddoughman commented on May 29, 2024

Okay, great. Does this mean that my dataset is ready for fine-tuning? If yes, how many text lines like the one you saw are recommended for fine-tuning? Also, how many iterations are needed? Thank you for your patience and support.
@amitdo

wrznr commented on May 29, 2024

Also many thanks from my side @amitdo for your support on that matter. I successfully "fine-tuned" tesseract's Fraktur model with the latest version of ocrd-train:

make -j4 training START_MODEL=Fraktur TESSDATA=/home/kmw/built/tessdata_best/script

Place your training images in data/ground-truth, choose the model you want to fine-tune as START_MODEL and the folder the model is located in as TESSDATA, and you should be fine. Please note that I haven't had time to test the procedure with an RTL data set yet. Problems are likely to occur, especially since your gedit shows something else than @amitdo's.

Please get back to us with your experience. Maybe we can even close this issue... ;)
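Before running the make command, it can help to confirm that every line image in the ground-truth folder has a matching transcription. The Python sketch below assumes the naming seen in this thread (`line_1_5.tif` next to `line_1_5.gt.txt`); the function name `unpaired_ground_truth` is hypothetical and not part of the Makefile.

```python
from pathlib import Path
from typing import List, Tuple


def unpaired_ground_truth(gt_dir: str) -> Tuple[List[str], List[str]]:
    """Return (images without text, texts without image) as base names.

    Assumes images end in .tif and transcriptions end in .gt.txt.
    """
    root = Path(gt_dir)
    images = {p.name[: -len(".tif")] for p in root.glob("*.tif")}
    texts = {p.name[: -len(".gt.txt")] for p in root.glob("*.gt.txt")}
    return sorted(images - texts), sorted(texts - images)
```

Calling `unpaired_ground_truth("data/ground-truth")` and getting two empty lists back means every image has a transcription and vice versa.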

amitdo commented on May 29, 2024

Not sure, probably 150-400 for each font.

amitdo commented on May 29, 2024

Problems are likely to occur, especially since your gedit shows something else than @amitdo 's.

I see it as he sees it... but it's still fine :-)

amitdo commented on May 29, 2024

@wrznr, FYI, Tesseract official lstm data was trained with degraded synthetic images.
tesseract-ocr/tesseract#1052

jaddoughman commented on May 29, 2024

@amitdo @wrznr
I used the dataset that @amitdo approved of and attempted to fine-tune the Arabic _best model...

Iteration 700: ALIGNED TRUTH : عامتجا ماتخ يف رداقلا دبع اعدو
Iteration 700: BEST OCR TEXT : َجع
File data/train/line_1_34.lstmf page 0 :
Mean rms=5.762%, delta=74.89%, train=184.604%(99.745%), skip ratio=0%
Iteration 701: ALIGNED TRUTH : عامتجلا 0ا للا 0خ "هنأ ًاحضوم تاعاس
Iteration 701: BEST OCR TEXT : َو
File data/train/line_1_30.lstmf page 0 :
Mean rms=5.76%, delta=74.83%, train=184.479%(99.745%), skip ratio=0%
Iteration 702: ALIGNED TRUTH : عباتي كرابم نا ادكؤم ،هددصلا
Iteration 702: BEST OCR TEXT : ْاةع

What is the issue? I have been attempting every variation of fine-tuning for more than 2 weeks, and the results are very disappointing. Any help would be really appreciated.
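To compare runs like the one above without reading the log by eye, the metric lines can be pulled into numbers. The Python sketch below only parses the line format as it appears in this thread; the key names (`rms`, `delta`, `train`, `train_paren`, `skip_ratio`) and the function name `parse_metrics` are hypothetical labels, since the thread does not define what each figure means.

```python
import re
from typing import Dict, Optional

# Matches lines such as:
# Mean rms=5.762%, delta=74.89%, train=184.604%(99.745%), skip ratio=0%
_METRICS = re.compile(
    r"Mean rms=(?P<rms>[\d.]+)%, delta=(?P<delta>[\d.]+)%, "
    r"train=(?P<train>[\d.]+)%\((?P<train_paren>[\d.]+)%\), "
    r"skip ratio=(?P<skip_ratio>[\d.]+)%"
)


def parse_metrics(line: str) -> Optional[Dict[str, float]]:
    """Return the percentages from one metrics line, or None if absent."""
    match = _METRICS.search(line)
    if match is None:
        return None
    return {key: float(value) for key, value in match.groupdict().items()}
```

Applied to every line of a log, this gives a series that can be plotted per iteration to see whether the values are falling at all.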

wrznr commented on May 29, 2024

Really hard to tell from a distance. Three things I have noticed: 1. Do not expect any good results before, let's say, the 2000th iteration. 2. The TIFs in Dataset.zip are rather small in terms of file size (mostly about 3k, while our sample line images are about 14k). 3. I could not open them with Ubuntu's standard image viewer:

And, as I mentioned above, there is the issue of different gedit behaviors. This is what it looks like in my gedit:

Correct or not?

jaddoughman commented on May 29, 2024

@wrznr

You can open the tif images using the Shotwell viewer (pre-installed with Ubuntu).
Concerning the size, I can generate more than 3k text lines easily, but for fine-tuning I don't think that a large dataset is needed. I believe your Makefile is used for training from scratch.
The Dataset.zip file contains the txt files in RTL order; this was before I used fribidi to convert them to LTR. Do not attempt to train with them.

Can you elaborate on the functionality of your Makefile? Are you fine-tuning or training from scratch?

jaddoughman commented on May 29, 2024

@wrznr

Also, concerning the bidirectional support, I can gladly edit your Python script to support bidirectional text lines. This would be a major upgrade, since training for bidirectional languages is extremely useful.

wrznr commented on May 29, 2024

Your support is very welcome. If you file a pull request for bidi language support, we will gladly merge it. The makefile is supposed to support both training from scratch and starting with a previously built model. With 3k, I referred to individual file size rather than training set size.

jaddoughman commented on May 29, 2024

@wrznr

Yes, the small image size is due to the image extraction from the original image source. The text lines are extracted from a newspaper, so they are cropped to small images. However, tesseract handles these text lines easily with psm 6. I don't see how this creates an issue. Can you elaborate?

jaddoughman commented on May 29, 2024

@wrznr

Also, how big was your training set? Do I need a lot of text lines to successfully fine-tune?

wrznr commented on May 29, 2024

@jaddoughman I do not have much experience with training productive models myself. Sorry. When we set up this repository, our hope was that we could get some of the necessary insights from users like you... But my guess would be that 3k text lines are enough to fine tune an existing model.

With 3k, I referred to individual file size rather than training set size.

File size, not image size (a line in our example data set is typically about 13 kB, while yours are only about 3 kB). It might well be that the resolution of the images is too low. It should be at least 300 dpi. But again, my experience is rather limited.
