
Comments (31)

jaddoughman avatar jaddoughman commented on May 29, 2024

و 0 0 223 17 0
ن 0 0 223 17 0
ق 0 0 223 17 0
ل 0 0 223 17 0
ت 0 0 223 17 0
0 0 223 17 0
ص 0 0 223 17 0
ح 0 0 223 17 0
ف 0 0 223 17 0
0 0 223 17 0
ه 0 0 223 17 0
ن 0 0 223 17 0
د 0 0 223 17 0
ي 0 0 223 17 0
ة 0 0 223 17 0
0 0 223 17 0
ا 0 0 223 17 0
م 0 0 223 17 0
س 0 0 223 17 0
0 0 223 17 0

from tesstrain.

amitdo avatar amitdo commented on May 29, 2024

Generated .box files have identical coordinates for every character

It's not a bug, it's a feature.

The LSTM engine needs only line boxes. If you give it char boxes, the first thing it will do is build line boxes from the char box info.
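
To illustrate the point about line boxes, here is a minimal Python sketch of how per-character boxes collapse into one line box. It assumes the box line layout seen in the example above (`symbol left bottom right top page`); the function names `parse_box_line` and `line_box` are made up for this illustration and are not part of Tesseract or tesstrain.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # left, bottom, right, top


def parse_box_line(line: str) -> Tuple[str, Box, int]:
    """Parse one box-file line of the form 'symbol left bottom right top page'."""
    # Split from the right so that a space used as the symbol survives.
    parts = line.rstrip("\n").rsplit(" ", 5)
    if len(parts) != 6:
        raise ValueError(f"not a box line: {line!r}")
    symbol = parts[0]
    left, bottom, right, top, page = (int(p) for p in parts[1:])
    return symbol, (left, bottom, right, top), page


def line_box(boxes: List[Box]) -> Box:
    """Return the smallest box that encloses all character boxes."""
    lefts, bottoms, rights, tops = zip(*boxes)
    return min(lefts), min(bottoms), max(rights), max(tops)
```

If every character already carries the full line box, as in the generated file above, the union is simply that same box, so nothing is lost.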


jaddoughman avatar jaddoughman commented on May 29, 2024

But in the example .box file I gave above, the characters are generated in RTL order, not LTR. Will this create an issue when fine-tuning? If yes, will inverting the strings to LTR fix my issue?

@amitdo


amitdo avatar amitdo commented on May 29, 2024

About the RTL issue.

The ground truth text file needs to be converted from logical order to visual order.

https://www.unix.com/man-page/linux/1/fribidi


wrznr avatar wrznr commented on May 29, 2024


jaddoughman avatar jaddoughman commented on May 29, 2024

@wrznr

I attached a small dataset of Arabic text lines and their ground truth below. They are in RTL direction. I need to fine-tune the _best Arabic trained data model. Any help in doing so would be extremely appreciated.

Dataset.zip


jaddoughman avatar jaddoughman commented on May 29, 2024

@amitdo

If I follow your instructions and change the ground truth to LTR, wouldn't I have to invert the TIFF images as well? The txt file would be inverted when changed to LTR; wouldn't that be an issue when generating the .lstmf files?


jaddoughman avatar jaddoughman commented on May 29, 2024

@amitdo @wrznr

Should I convert only the .gt.txt files to LTR, or should I also convert the resulting .box files? If so, what should be done with the images? Shouldn't they match the inverted text files?


amitdo avatar amitdo commented on May 29, 2024

You should use the regular RTL text for generating the images.

Try using tesseract's text2image and you'll see that the chars are in visual order (reversed).

@wrznr, if you want to support RTL text, you should check that the output of fribidi + splitting the chars to lines matches the output of text2image (the chars order should be the same, not the boxes).


jaddoughman avatar jaddoughman commented on May 29, 2024

@amitdo

You misinterpreted my question. I already have the images of the text lines generated, and I have their ground truth. Both the images (tif) and the text lines (gt.txt) are in RTL. After converting the gt.txt files to LTR using fribidi, should I change the resulting box files and/or the original (tif) files?

What are the steps needed to fine-tune using Arabic text line images?


jaddoughman avatar jaddoughman commented on May 29, 2024

@amitdo
Check out my data set below to visualize my issue.

Dataset.zip


amitdo avatar amitdo commented on May 29, 2024

"Should i change the resultant box files"

The order of the chars in the box files should match the reversed ground truth text.

("Hello everyone" in Hebrew):
שלום לכולם
=>
םלוכל םולש
=>

ם
ל
ו
כ
ל
 
ם
ו
ל
ש

"and/or the original (tif) files ?"

No.
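
The Hebrew example above can be sketched in Python as follows. This is a deliberately naive illustration: it reverses the string code point by code point, which matches the example but is not a full bidi conversion (digits or embedded Latin text in an RTL line keep their left-to-right order in visual order, which is what a tool like fribidi handles). The function names `to_visual_order` and `box_lines` are hypothetical, and emitting the same full-line box for every character simply mirrors the .box file shown at the top of the thread; it is not taken from the tesstrain script.

```python
from typing import List


def to_visual_order(logical: str) -> str:
    """Naively reverse a pure-RTL line (no digits, no Latin) into visual order."""
    return logical[::-1]


def box_lines(visual: str, width: int, height: int, page: int = 0) -> List[str]:
    """Emit one box line per character, each carrying the whole line's box."""
    return [f"{ch} 0 0 {width} {height} {page}" for ch in visual]


if __name__ == "__main__":
    visual = to_visual_order("שלום לכולם")
    # 223 x 17 is the line size from the example .box file in this thread.
    for line in box_lines(visual, width=223, height=17):
        print(line)
```

The image itself is left untouched; only the text, and therefore the character order in the box file, changes.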


jaddoughman avatar jaddoughman commented on May 29, 2024

@amitdo

Okay, the box files will automatically have the same order as the reversed txt file. But does it create an issue if, in the box files, the coordinates come before the letter? Even after changing the txt file to RTL, the resulting box files have the format (0 0 0 0 letter), not (letter 0 0 0 0), with 0 standing for any coordinate.


amitdo avatar amitdo commented on May 29, 2024

In which application do you view the box file?


jaddoughman avatar jaddoughman commented on May 29, 2024

@amitdo

I open the box files using "gedit" on Ubuntu 16.04


amitdo avatar amitdo commented on May 29, 2024

Please provide an example (just one tif, text, reversed, box).


jaddoughman avatar jaddoughman commented on May 29, 2024

@amitdo

The Sample folder contains the normal tif file with the reversed text file (as you recommended using fribidi) and the generated box files (automatically reversed when generated).

Sample.zip


amitdo avatar amitdo commented on May 29, 2024

I also opened it in gedit (Debian 9). It's fine.


jaddoughman avatar jaddoughman commented on May 29, 2024

Okay, great. Does this mean that my dataset is ready for fine-tuning? If yes, how many text lines like the one you saw are recommended for fine-tuning? Also, how many iterations are needed? Thank you for your patience and support.
@amitdo


wrznr avatar wrznr commented on May 29, 2024

Also many thanks from my side @amitdo for your support on that matter. I successfully "fine-tuned" tesseract's Fraktur model with the latest version of ocrd-train:

make -j4 training START_MODEL=Fraktur TESSDATA=/home/kmw/built/tessdata_best/script

Place your training images in data/ground-truth, set START_MODEL to the model you want to fine-tune and TESSDATA to the folder that model is located in, and you should be fine. Pls. note that I haven't had time to test the procedure with an RTL data set yet. Problems are likely to occur, especially since your gedit shows something different from @amitdo 's.

Pls. get back to us with your experience. Maybe we can even close this issue... ;)


amitdo avatar amitdo commented on May 29, 2024

Not sure, probably 150-400 for each font.


amitdo avatar amitdo commented on May 29, 2024

"Problems are likely to occur, especially since your gedit shows something else than @amitdo 's."

I see it as he sees it... but it's still fine :-)


amitdo avatar amitdo commented on May 29, 2024

@wrznr, FYI, Tesseract's official LSTM data was trained with degraded synthetic images.
tesseract-ocr/tesseract#1052


jaddoughman avatar jaddoughman commented on May 29, 2024

@amitdo @wrznr
I used the dataset that @amitdo approved of and attempted to fine-tune the Arabic _best model...

Iteration 700: ALIGNED TRUTH : عامتجا ماتخ يف رداقلا دبع اعدو
Iteration 700: BEST OCR TEXT : َجع
File data/train/line_1_34.lstmf page 0 :
Mean rms=5.762%, delta=74.89%, train=184.604%(99.745%), skip ratio=0%
Iteration 701: ALIGNED TRUTH : عامتجلا 0ا للا 0خ "هنأ ًاحضوم تاعاس
Iteration 701: BEST OCR TEXT : َو
File data/train/line_1_30.lstmf page 0 :
Mean rms=5.76%, delta=74.83%, train=184.479%(99.745%), skip ratio=0%
Iteration 702: ALIGNED TRUTH : عباتي كرابم نا ادكؤم ،هددصلا
Iteration 702: BEST OCR TEXT : ْاةع

What is the issue? I have been attempting every variation of fine-tuning for more than 2 weeks, and the results are very disappointing. Any help would be really appreciated.
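
For anyone who wants to track these numbers over many iterations rather than read them by eye, here is a small Python sketch that pulls the figures out of the "Mean rms=..." lines quoted above. The regular expression is written against the log lines in this comment only; other Tesseract versions may print a different layout. The thread does not say what the two `train` figures stand for, so the key names `train` and `train_paren` are neutral labels chosen for this example.

```python
import re
from typing import Dict, Optional

# Pattern matched against lines such as:
# Mean rms=5.762%, delta=74.89%, train=184.604%(99.745%), skip ratio=0%
_STATS = re.compile(
    r"Mean rms=(?P<rms>[\d.]+)%, delta=(?P<delta>[\d.]+)%, "
    r"train=(?P<train>[\d.]+)%\((?P<train_paren>[\d.]+)%\), "
    r"skip ratio=(?P<skip_ratio>[\d.]+)%"
)


def parse_stats(line: str) -> Optional[Dict[str, float]]:
    """Return the percentages from one statistics line, or None if it does not match."""
    match = _STATS.search(line)
    if match is None:
        return None
    return {name: float(value) for name, value in match.groupdict().items()}
```

Feeding every line of the training output through `parse_stats` and keeping the non-None results gives a series that shows whether the error figures are actually moving between iterations.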


wrznr avatar wrznr commented on May 29, 2024

Really hard to tell from a distance. Three things I have noticed: 1. Do not expect any good results before, let's say, the 2000th iteration. 2. The TIFs in Dataset.zip are rather small in terms of file size (mostly about 3k, while our sample line images are about 14k). 3. I could not open them with Ubuntu's standard image viewer:
[image]

And, as I mentioned above, there is the issue of different gedit behaviors. This is what it looks like in my gedit:
[image]

Correct or not?


jaddoughman avatar jaddoughman commented on May 29, 2024

@wrznr

  1. You can open the tif images using the Shotwell viewer (pre-installed with Ubuntu).
  2. Concerning the size, I can easily generate more than 3k text lines, but I don't think a large dataset is needed for fine-tuning. I believe your Makefile is meant for training from scratch.
  3. The Dataset.zip file contains the txt files in RTL order; this was before I used fribidi to convert them to LTR. Do not attempt to train with them.

Can you elaborate on what your Makefile does? Are you fine-tuning or training from scratch?


jaddoughman avatar jaddoughman commented on May 29, 2024

@wrznr

Also, concerning bidirectional support, I can gladly edit your Python script so that it supports bidirectional text lines. This would be a major upgrade, since training for bidirectional languages is extremely useful.


wrznr avatar wrznr commented on May 29, 2024


jaddoughman avatar jaddoughman commented on May 29, 2024

@wrznr

Yes, the small image size is due to how the images were extracted from the original source. The text lines are extracted from a newspaper, so they are cropped to small images. However, tesseract handles these text lines easily with psm 6. I don't see how this creates an issue. Can you elaborate?


jaddoughman avatar jaddoughman commented on May 29, 2024

@wrznr

Also, how big was your training set? Do I need a lot of text lines to fine-tune successfully?


wrznr avatar wrznr commented on May 29, 2024

@jaddoughman I do not have much experience with training productive models myself. Sorry. When we set up this repository, our hope was that we could get some of the necessary insights from users like you... But my guess would be that 3k text lines are enough to fine-tune an existing model.

With 3k, I referred to individual file size rather than training set size.

File size, not image size (a line in our example data set is typically about 13kb, while yours are only 3kb). It might well be that the resolution of the images is too small. It should be at least 300dpi. But again, my experience is rather limited.

