Comments (31)
و 0 0 223 17 0
ن 0 0 223 17 0
ق 0 0 223 17 0
ل 0 0 223 17 0
ت 0 0 223 17 0
0 0 223 17 0
ص 0 0 223 17 0
ح 0 0 223 17 0
ف 0 0 223 17 0
0 0 223 17 0
ه 0 0 223 17 0
ن 0 0 223 17 0
د 0 0 223 17 0
ي 0 0 223 17 0
ة 0 0 223 17 0
0 0 223 17 0
ا 0 0 223 17 0
م 0 0 223 17 0
س 0 0 223 17 0
0 0 223 17 0
from tesstrain.
Generated .box files have identical coordinates for every character
It's not a bug, it's a feature.
The LSTM engine needs only line boxes. If you give it char boxes, the first thing it does is build line boxes from the char box info.
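As a rough illustration of why every entry in the file above carries the same coordinates, here is a minimal Python sketch that emits one box entry per character, each spanning the whole line image. The function name `line_boxes` is hypothetical, and the trailing tab entry used as an end-of-line marker is an assumption about the LSTM box format rather than something stated in this thread. It is not the tesstrain script itself.

```python
def line_boxes(text, width, height, page=0):
    """Return box-file lines in which every character gets the full line box.

    Each entry has the form "<char> <left> <bottom> <right> <top> <page>".
    """
    entries = [f"{char} 0 0 {width} {height} {page}" for char in text]
    # Assumption: a tab entry marks the end of the text line.
    entries.append(f"\t 0 0 {width} {height} {page}")
    return entries
```

Calling `line_boxes(text, 223, 17)` would give every character the coordinates `0 0 223 17 0`, as in the file above.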
from tesstrain.
But my example .box file above was generated in RTL order, not LTR. Will this cause a problem when fine-tuning? If so, will inverting the strings to LTR fix it?
from tesstrain.
About the RTL issue.
The ground truth text file needs to be converted from logical order to visual order.
https://www.unix.com/man-page/linux/1/fribidi
from tesstrain.
I attached a small dataset of Arabic text lines and their ground truth below. They are in RTL direction. I need to fine-tune the _best Arabic traineddata model. Any help in doing so would be greatly appreciated.
from tesstrain.
If I follow your instructions and change the ground truth to LTR, wouldn't I have to invert the tiff images as well? The txt file would be inverted when changed to LTR. Wouldn't that be an issue when generating the .lstmf files?
from tesstrain.
Should I convert only the .gt.txt files to LTR, or should I also convert the resultant .box files? If so, what should be done with the images? Shouldn't they match the inverted text files?
from tesstrain.
You should use the regular RTL text for generating the images.
Try using tesseract's text2image and you'll see that the chars are in visual order (reversed).
@wrznr, if you want to support RTL text, you should check that the output of fribidi followed by splitting the chars into lines matches the output of text2image (the order of the chars should be the same, not the boxes).
from tesstrain.
You misinterpreted my question. I already have the images of the text lines, and I have their ground truth. Both the images (tif) and the text lines (gt.txt) are in RTL. After converting the gt.txt files to LTR using fribidi, should I change the resultant box files and/or the original (tif) files?
What are the necessary steps to fine-tune using Arabic text line images?
from tesstrain.
@amitdo
Check out my data set below to visualize my issue.
from tesstrain.
"Should I change the resultant box files"
The order of the chars in the box files should match the reversed ground truth text.
("Hello everyone" in Hebrew):
שלום לכולם
=>
םלוכל םולש
=>
ם
ל
ו
כ
ל
ם
ו
ל
ש
"and/or the original (tif) files?"
No.
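The reordering in the example above can be sketched in Python as follows. This is a naive character reversal, not the Unicode bidirectional algorithm that fribidi implements. It gives the same result only for lines made up purely of RTL letters and spaces, and it would wrongly reverse digits, Latin words, and the relative order of base letters and combining marks. Both function names are hypothetical.

```python
def to_visual_order(logical_text):
    """Naively reverse a purely RTL line from logical to visual order.

    Assumption: the line contains only RTL letters and spaces. Mixed text
    needs a real bidi implementation such as fribidi.
    """
    return logical_text[::-1]


def box_symbols(logical_text):
    """Return the characters in the order they should appear in the box file."""
    return list(to_visual_order(logical_text))
```

For the Hebrew line above, `box_symbols` yields the letters in the listed order, with the space between the two words as its own entry.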
from tesstrain.
Okay, the box files will automatically have the same order as the reversed txt file. But is it a problem if the coordinates come before the letter in the box files? Even after changing the txt file to RTL, the resultant box files have the format (0 0 0 0 letter), not (letter 0 0 0 0), where 0 stands for any coordinate.
from tesstrain.
In which application are you viewing the box file?
from tesstrain.
I open the box files using "gedit" on Ubuntu 16.04
from tesstrain.
Please provide an example (just one tif, text, reversed, box).
from tesstrain.
The Sample folder contains the normal tif file with the reversed text file (as you recommended using fribidi) and the generated box files (automatically reversed when generated).
from tesstrain.
I also opened it in gedit (Debian 9). It's fine.
from tesstrain.
Okay, great. Does this mean that my dataset is ready for fine-tuning? If so, how many text lines like the one you saw are recommended for fine-tuning? Also, how many iterations are needed? Thank you for your patience and support.
@amitdo
from tesstrain.
Also many thanks from my side @amitdo for your support on that matter. I successfully "fine-tuned" tesseract's Fraktur model with the latest version of ocrd-train:
make -j4 training START_MODEL=Fraktur TESSDATA=/home/kmw/built/tessdata_best/script
Place your training images in data/ground-truth, choose the model you want to fine-tune as START_MODEL and the folder the model is located in as TESSDATA, and you should be fine. Please note that I haven't had time to test the procedure with an RTL data set yet. Problems are likely to occur, especially since your gedit shows something different from @amitdo's.
Please get back to us with your experience. Maybe we can even close this issue... ;)
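Before running the make command above, it may help to confirm that every line image has a transcription next to it. The following Python sketch assumes the pairing convention `NAME.tif` with `NAME.gt.txt` in the same folder, and the function name `unpaired_images` is hypothetical. It only inspects file names and does not validate the contents.

```python
from pathlib import Path


def unpaired_images(ground_truth_dir):
    """List .tif files that have no matching .gt.txt transcription."""
    directory = Path(ground_truth_dir)
    missing = []
    for image in sorted(directory.glob("*.tif")):
        # Assumed convention: line_1.tif is paired with line_1.gt.txt.
        transcription = image.with_name(image.stem + ".gt.txt")
        if not transcription.is_file():
            missing.append(image.name)
    return missing
```

An empty list from `unpaired_images("data/ground-truth")` would mean every image has a ground truth file.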
from tesstrain.
Not sure, probably 150-400 for each font.
from tesstrain.
"Problems are likely to occur, especially since your gedit shows something else than @amitdo 's."
I see it the same way he sees it... but it's still fine :-)
from tesstrain.
@wrznr, FYI, Tesseract's official LSTM data was trained with degraded synthetic images.
tesseract-ocr/tesseract#1052
from tesstrain.
@amitdo @wrznr
I used the dataset that @amitdo approved of and attempted to fine-tune the Arabic _best model...
Iteration 700: ALIGNED TRUTH : عامتجا ماتخ يف رداقلا دبع اعدو
Iteration 700: BEST OCR TEXT : َجع
File data/train/line_1_34.lstmf page 0 :
Mean rms=5.762%, delta=74.89%, train=184.604%(99.745%), skip ratio=0%
Iteration 701: ALIGNED TRUTH : عامتجلا 0ا للا 0خ "هنأ ًاحضوم تاعاس
Iteration 701: BEST OCR TEXT : َو
File data/train/line_1_30.lstmf page 0 :
Mean rms=5.76%, delta=74.83%, train=184.479%(99.745%), skip ratio=0%
Iteration 702: ALIGNED TRUTH : عباتي كرابم نا ادكؤم ،هددصلا
Iteration 702: BEST OCR TEXT : ْاةع
What is the issue? I have been attempting every variation of fine-tuning for more than 2 weeks, and the results are very disappointing. Any help would be really appreciated.
from tesstrain.
Really hard to tell from a distance. Three things I have noticed:
1. Do not expect any good results before, let's say, the 2000th iteration.
2. The TIFs in Dataset.zip are rather small in terms of file size (mostly about 3k, while our sample line images are about 14k).
3. I could not open them with Ubuntu's standard image viewer.
And, as I mentioned above, there is the issue of different gedit behaviors. This is what it looks like in my gedit:
Correct or not?
from tesstrain.
- You can open the tif images using the Shotwell viewer (pre-installed with Ubuntu).
- Concerning the size, I can easily generate more than 3k text lines, but I don't think a large dataset is needed for fine-tuning. I believe your Makefile is used for training from scratch.
- The dataset.zip file contains the txt files in RTL order. This was before I used fribidi to convert them to LTR. Do not attempt to train with them.
Can you elaborate on the functionality of your Makefile? Are you fine-tuning or training from scratch?
from tesstrain.
Also, concerning bidirectional support, I would gladly edit your Python script to add support for bidirectional text lines. This would be a major upgrade, since training for bidirectional languages is extremely useful.
from tesstrain.
Yes, the small image size is due to how the images were extracted from the original source. The text lines come from a newspaper, so they are cropped to small images. However, Tesseract handles these text lines easily with psm 6. I don't see how this creates an issue. Can you elaborate?
from tesstrain.
Also, how big was your training set? Do I need a lot of text lines to fine-tune successfully?
from tesstrain.
@jaddoughman I do not have much experience with training productive models myself. Sorry. When we set up this repository, our hope was that we could get some of the necessary insights from users like you... But my guess would be that 3k text lines are enough to fine tune an existing model.
With 3k, I referred to individual file size rather than training set size.
File size, not image size (a line in our example data set is typically about 13kb, while yours are only 3kb). It might well be that the resolution of the images is too small. It should be at least 300dpi. But again, my experience is rather limited.
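To make the 300dpi remark concrete, the relation between font size, resolution, and pixel height can be sketched in Python. This is only back-of-the-envelope arithmetic using 72 points per inch. It treats the font size as the approximate height of a text line, and the function names are hypothetical.

```python
POINTS_PER_INCH = 72


def text_height_px(font_size_pt, dpi):
    """Approximate pixel height of text of a given point size at a given dpi."""
    return font_size_pt * dpi / POINTS_PER_INCH


def implied_font_size_pt(height_px, dpi):
    """Point size that a line of the given pixel height would have at that dpi."""
    return height_px * POINTS_PER_INCH / dpi
```

Under these assumptions, 12pt text at 300dpi is about 50 pixels tall. The box file at the top of this thread reports a line of 223 x 17 pixels. If that line is representative, 17 pixels at 300dpi would correspond to text of roughly 4pt, which suggests the effective resolution of the images is well below 300dpi.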
from tesstrain.