Comments (23)
Is your training text in utf-8?
from tesstrain.
@Shreeshrii I am having the same issue when trying to fine-tune for Tamil (tessdata_best), but it works fine when fine-tuning English. The training text is in utf-8. Is there any workaround to resolve this issue? Thanks.
Please share the files used and commands given so that I can test.
Also try setting PYTHONIOENCODING=utf8 in the shell.
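For anyone following along, the variable just needs to be exported once in the shell session before running make, e.g.:

```shell
# Force Python's stdin/stdout/stderr to UTF-8 so the box-file generator
# reads and writes Tamil text correctly, regardless of the locale.
export PYTHONIOENCODING=utf8
echo "$PYTHONIOENCODING"
```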
@Shreeshrii I tried PYTHONIOENCODING=utf8 and still receive the same error.
Errors:
- Word started with a combiner: 0xbbe
- Normalization failed for string 'ா'
- Can't encode transcription: 'கௌசல்யா' in language ''
- Encoding of string failed! Failure bytes: ffffffe0 ffffffaf ffffff8c ffffffe0 ffffffae ffffff9a ffffffe0 ffffffae ffffffb2 ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffffaf ffffffe0 ffffffae ffffffbe
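For what it's worth, the "Failure bytes" above are just the UTF-8 bytes of the transcription printed as sign-extended integers (0xffffffe0 is really 0xe0); masking with 0xFF recovers readable text. A minimal Python check:

```python
# Decode the "Failure bytes" from the error message by masking off the
# sign extension. The result is the transcription minus its leading க,
# which suggests the encoder stopped at the combining vowel sign ௌ.
raw = ["ffffffe0", "ffffffaf", "ffffff8c", "ffffffe0", "ffffffae", "ffffff9a",
       "ffffffe0", "ffffffae", "ffffffb2", "ffffffe0", "ffffffaf", "ffffff8d",
       "ffffffe0", "ffffffae", "ffffffaf", "ffffffe0", "ffffffae", "ffffffbe"]
data = bytes(int(b, 16) & 0xFF for b in raw)
print(data.decode("utf-8"))  # ௌசல்யா
```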
Additional Info:
- I have installed tesseract and leptonica using the OCRD-train Make script. Additionally, I added tam.traineddata (tessdata_best) to ocrd/usr/share/tessdata.
- Initially I got a lot of 'Warning: properties incomplete for index' errors because Tamil.unicharset and Latin.unicharset could not be found, so I added those files to ocrd/data; these errors are now reduced to the 3 seen in the attached log.
- Command used for training / fine-tuning:
make training MODEL_NAME=tamiltest START_MODEL=tam
- To make it simpler, I have attached a simple names dataset collected from the web for testing purposes, which also reproduces the error being discussed.
- I am facing the same issue even with the script model Tamil.traineddata.
Check the attached training log and sample data. I also tried fine-tuning eng.traineddata with my own data and it worked fine. I only have issues fine-tuning tam.traineddata, where I face multiple errors: 'Word started with a combiner', 'Warning: properties incomplete for index' and 'Can't encode transcription'.
Training_Log.txt
training_data_sample.zip
Edit:
After seeing the error messages 'Can't encode transcription' again: when training the bigger dataset, it only points to 3 words: கௌசல்யா, கௌசிக், கௌரி. Is the issue with the letter கௌ?
Any idea how to solve the remaining errors, 'Word started with a combiner' & 'Warning: properties incomplete for index', as attached in the log file of the previous comment?
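A quick way to see why கௌ might be special: at the code-point level the cluster is a base consonant plus a dependent (combining) vowel sign, which matches the 'Word started with a combiner' message. Illustration only:

```python
# Each of the three failing words starts with க (TAMIL LETTER KA, U+0B95)
# followed by ௌ (TAMIL VOWEL SIGN AU, U+0BCC), a dependent vowel sign --
# the "combiner" that the error message complains about.
for word in ["கௌசல்யா", "கௌசிக்", "கௌரி"]:
    print(word, [f"U+{ord(c):04X}" for c in word[:2]])
```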
@Shreeshrii Thanks for the tip, the error 'Can't encode transcription' has been resolved.
Thanks a lot.
I receive a lot of normalization errors (though repeated). I am using Google Transliterate to create the ground-truth text. Is there a better way to correct the errors made when using the tesseract best data and training with corrected text? Also, can you advise me further on the substitutions as a fix?
@Shreeshrii Just like you said, I commented out the check in the source downloaded from master and compiled; the version is now 4.1.0-rc1.
Now the error changed to 'Invalid start of grapheme sequence'. I also noticed that all the characters shown in the 'Normalization failed for string' errors are present in ocrd/data/Tamil.unicharset and data/tam/tam.lstm-unicharset, but those characters are not extracted while creating ground-truth/my.unicharset.
After merge_unicharsets, the resulting files data/unicharset & data/all-boxes have all the missing characters. Does this mean we can ignore the error and proceed with training?
The box files created by the python script do not meet the normalization rules for the Indic scripts.
I have changed the Makefile to use tesseract's wordstrbox box-file generation. The box files or data/all-boxes need to be reviewed/corrected manually before continuing training.
The Makefile should be modified to output a reminder message, since the box files will NOT be 100% correct.
I have added a PR; see it for the changes made: Use Wordstr box option for images without ground truth (#56)
Here is the console log:
ubuntu@tesseract-ocr:~/ocrd-train$ make clean MODEL_NAME=tam
find data/ground-truth -name '*.box' -delete
find data/ground-truth -name '*.lstmf' -delete
rm -rf data/all-*
rm -rf data/list.*
rm -rf data/tam
rm -rf data/unicharset
rm -rf data/checkpoints
ubuntu@tesseract-ocr:~/ocrd-train$ make training MODEL_NAME=tam START_MODEL=tam TESSDATA=../tessdata_best
tesseract "data/ground-truth/190.tif" "data/ground-truth/190" -l tam --psm 6 wordstrbox > "data/ground-truth/190.box"
Tesseract Open Source OCR Engine v4.1.0-rc1-9-g49ed with Leptonica
Page 1
tesseract "data/ground-truth/191.tif" "data/ground-truth/191" -l tam --psm 6 wordstrbox > "data/ground-truth/191.box"
Tesseract Open Source OCR Engine v4.1.0-rc1-9-g49ed with Leptonica
Page 1
tesseract "data/ground-truth/192.tif" "data/ground-truth/192" -l tam --psm 6 wordstrbox > "data/ground-truth/192.box"
Tesseract Open Source OCR Engine v4.1.0-rc1-9-g49ed with Leptonica
Page 1
tesseract "data/ground-truth/193.tif" "data/ground-truth/193" -l tam --psm 6 wordstrbox > "data/ground-truth/193.box"
Tesseract Open Source OCR Engine v4.1.0-rc1-9-g49ed with Leptonica
Page 1
tesseract "data/ground-truth/194.tif" "data/ground-truth/194" -l tam --psm 6 wordstrbox > "data/ground-truth/194.box"
Tesseract Open Source OCR Engine v4.1.0-rc1-9-g49ed with Leptonica
Page 1
tesseract "data/ground-truth/195.tif" "data/ground-truth/195" -l tam --psm 6 wordstrbox > "data/ground-truth/195.box"
Tesseract Open Source OCR Engine v4.1.0-rc1-9-g49ed with Leptonica
Page 1
tesseract "data/ground-truth/196.tif" "data/ground-truth/196" -l tam --psm 6 wordstrbox > "data/ground-truth/196.box"
Tesseract Open Source OCR Engine v4.1.0-rc1-9-g49ed with Leptonica
Page 1
tesseract "data/ground-truth/197.tif" "data/ground-truth/197" -l tam --psm 6 wordstrbox > "data/ground-truth/197.box"
Tesseract Open Source OCR Engine v4.1.0-rc1-9-g49ed with Leptonica
Page 1
tesseract "data/ground-truth/198.tif" "data/ground-truth/198" -l tam --psm 6 wordstrbox > "data/ground-truth/198.box"
Tesseract Open Source OCR Engine v4.1.0-rc1-9-g49ed with Leptonica
Page 1
tesseract "data/ground-truth/199.tif" "data/ground-truth/199" -l tam --psm 6 wordstrbox > "data/ground-truth/199.box"
Tesseract Open Source OCR Engine v4.1.0-rc1-9-g49ed with Leptonica
Page 1
tesseract "data/ground-truth/200.tif" "data/ground-truth/200" -l tam --psm 6 wordstrbox > "data/ground-truth/200.box"
Tesseract Open Source OCR Engine v4.1.0-rc1-9-g49ed with Leptonica
Page 1
find data/ground-truth -name '*.box' -exec cat {} \; > "data/all-boxes"
mkdir -p data/tam
combine_tessdata -u ../tessdata_best/tam.traineddata data/tam/tam
Extracting tessdata components from ../tessdata_best/tam.traineddata
Wrote data/tam/tam.config
Wrote data/tam/tam.lstm
Wrote data/tam/tam.lstm-punc-dawg
Wrote data/tam/tam.lstm-word-dawg
Wrote data/tam/tam.lstm-number-dawg
Wrote data/tam/tam.lstm-unicharset
Wrote data/tam/tam.lstm-recoder
Wrote data/tam/tam.version
Version string:4.00.00alpha:tam:synth20170629:[1,36,0,1Ct3,3,16Mp3,3Lfys48Lfx96Lrx96Lfx192O1c1]
0:config:size=355, offset=192
17:lstm:size=3068971, offset=547
18:lstm-punc-dawg:size=2442, offset=3069518
19:lstm-word-dawg:size=2943474, offset=3071960
20:lstm-number-dawg:size=818, offset=6015434
21:lstm-unicharset:size=5970, offset=6016252
22:lstm-recoder:size=895, offset=6022222
23:version:size=80, offset=6023117
unicharset_extractor --output_unicharset "data/ground-truth/my.unicharset" --norm_mode 2 "data/all-boxes"
Extracting unicharset from box file data/all-boxes
Wrote unicharset file data/ground-truth/my.unicharset
merge_unicharsets data/tam/tam.lstm-unicharset data/ground-truth/my.unicharset "data/unicharset"
Loaded unicharset of size 99 from file data/tam/tam.lstm-unicharset
Loaded unicharset of size 30 from file data/ground-truth/my.unicharset
Wrote unicharset file data/unicharset.
tesseract data/ground-truth/190.tif data/ground-truth/190 --psm 6 lstm.train
Tesseract Open Source OCR Engine v4.1.0-rc1-9-g49ed with Leptonica
Page 1
tesseract data/ground-truth/191.tif data/ground-truth/191 --psm 6 lstm.train
Tesseract Open Source OCR Engine v4.1.0-rc1-9-g49ed with Leptonica
Page 1
tesseract data/ground-truth/192.tif data/ground-truth/192 --psm 6 lstm.train
Tesseract Open Source OCR Engine v4.1.0-rc1-9-g49ed with Leptonica
Page 1
tesseract data/ground-truth/193.tif data/ground-truth/193 --psm 6 lstm.train
Tesseract Open Source OCR Engine v4.1.0-rc1-9-g49ed with Leptonica
Page 1
tesseract data/ground-truth/194.tif data/ground-truth/194 --psm 6 lstm.train
Tesseract Open Source OCR Engine v4.1.0-rc1-9-g49ed with Leptonica
Page 1
tesseract data/ground-truth/195.tif data/ground-truth/195 --psm 6 lstm.train
Tesseract Open Source OCR Engine v4.1.0-rc1-9-g49ed with Leptonica
Page 1
tesseract data/ground-truth/196.tif data/ground-truth/196 --psm 6 lstm.train
Tesseract Open Source OCR Engine v4.1.0-rc1-9-g49ed with Leptonica
Page 1
tesseract data/ground-truth/197.tif data/ground-truth/197 --psm 6 lstm.train
Tesseract Open Source OCR Engine v4.1.0-rc1-9-g49ed with Leptonica
Page 1
tesseract data/ground-truth/198.tif data/ground-truth/198 --psm 6 lstm.train
Tesseract Open Source OCR Engine v4.1.0-rc1-9-g49ed with Leptonica
Page 1
tesseract data/ground-truth/199.tif data/ground-truth/199 --psm 6 lstm.train
Tesseract Open Source OCR Engine v4.1.0-rc1-9-g49ed with Leptonica
Page 1
tesseract data/ground-truth/200.tif data/ground-truth/200 --psm 6 lstm.train
Tesseract Open Source OCR Engine v4.1.0-rc1-9-g49ed with Leptonica
Page 1
find data/ground-truth -name '*.lstmf' -exec echo {} \; | sort -R -o "data/all-lstmf"
total=`cat data/all-lstmf | wc -l` \
no=`echo "$total * 0.90 / 1" | bc`; \
head -n "$no" data/all-lstmf > "data/list.train"
total=`cat data/all-lstmf | wc -l` \
no=`echo "($total - $total * 0.90) / 1" | bc`; \
tail -n "$no" data/all-lstmf > "data/list.eval"
combine_lang_model \
--input_unicharset data/unicharset \
--script_dir data/ \
--output_dir data/ \
--pass_through_recoder \
--lang tam
Loaded unicharset of size 99 from file data/unicharset
Setting unichar properties
Setting script properties
Failed to load script unicharset from:data//Tamil.unicharset
Failed to load script unicharset from:data//Latin.unicharset
Warning: properties incomplete for index 3 = [
Warning: properties incomplete for index 4 = ]
Warning: properties incomplete for index 5 = 2
Warning: properties incomplete for index 6 = ப
Warning: properties incomplete for index 7 = ொ
Warning: properties incomplete for index 8 = ர
Warning: properties incomplete for index 9 = ு
Warning: properties incomplete for index 10 = ட
Warning: properties incomplete for index 11 = ்
Warning: properties incomplete for index 12 = க
Warning: properties incomplete for index 13 = ள
Warning: properties incomplete for index 14 = ி
Warning: properties incomplete for index 15 = ன
Warning: properties incomplete for index 16 = ்
Warning: properties incomplete for index 17 = 4
Warning: properties incomplete for index 18 = "
Warning: properties incomplete for index 19 = -
Warning: properties incomplete for index 20 = ஐ
Warning: properties incomplete for index 21 = /
Warning: properties incomplete for index 22 = )
Warning: properties incomplete for index 23 = 9
Warning: properties incomplete for index 24 = 8
Warning: properties incomplete for index 25 = வ
Warning: properties incomplete for index 26 = ோ
Warning: properties incomplete for index 27 = ே
Warning: properties incomplete for index 28 = ெ
Warning: properties incomplete for index 29 = ஷ
Warning: properties incomplete for index 30 = ா
Warning: properties incomplete for index 31 = .
Warning: properties incomplete for index 32 = த
Warning: properties incomplete for index 33 = ,
Warning: properties incomplete for index 34 = ண
Warning: properties incomplete for index 35 = ம
Warning: properties incomplete for index 36 = 3
Warning: properties incomplete for index 37 = 6
Warning: properties incomplete for index 38 = 7
Warning: properties incomplete for index 39 = 0
Warning: properties incomplete for index 40 = ீ
Warning: properties incomplete for index 41 = ற
Warning: properties incomplete for index 42 = ஸ
Warning: properties incomplete for index 43 = அ
Warning: properties incomplete for index 44 = ழ
Warning: properties incomplete for index 45 = !
Warning: properties incomplete for index 46 = எ
Warning: properties incomplete for index 47 = ச
Warning: properties incomplete for index 48 = ல
Warning: properties incomplete for index 49 = ூ
Warning: properties incomplete for index 50 = 5
Warning: properties incomplete for index 51 = இ
Warning: properties incomplete for index 52 = 1
Warning: properties incomplete for index 53 = ய
Warning: properties incomplete for index 54 = ந
Warning: properties incomplete for index 55 = ை
Warning: properties incomplete for index 56 = ஞ
Warning: properties incomplete for index 57 = (
Warning: properties incomplete for index 58 = '
Warning: properties incomplete for index 59 = :
Warning: properties incomplete for index 60 = உ
Warning: properties incomplete for index 61 = ஜ
Warning: properties incomplete for index 62 = ங
Warning: properties incomplete for index 63 = ஆ
Warning: properties incomplete for index 64 = ஏ
Warning: properties incomplete for index 65 = ?
Warning: properties incomplete for index 66 = ஹ
Warning: properties incomplete for index 67 = ”
Warning: properties incomplete for index 68 = ஓ
Warning: properties incomplete for index 69 = ;
Warning: properties incomplete for index 70 = ஈ
Warning: properties incomplete for index 71 = “
Warning: properties incomplete for index 72 = *
Warning: properties incomplete for index 73 = ஒ
Warning: properties incomplete for index 74 = ஊ
Warning: properties incomplete for index 75 = ஃ
Warning: properties incomplete for index 76 = %
Warning: properties incomplete for index 77 = ।
Warning: properties incomplete for index 78 = _
Warning: properties incomplete for index 79 = $
Warning: properties incomplete for index 80 = ௦
Warning: properties incomplete for index 81 = ௫
Warning: properties incomplete for index 82 = |
Warning: properties incomplete for index 83 = &
Warning: properties incomplete for index 84 = ௩
Warning: properties incomplete for index 85 = ௨
Warning: properties incomplete for index 86 = ௮
Warning: properties incomplete for index 87 = ௧
Warning: properties incomplete for index 88 = €
Warning: properties incomplete for index 89 = ௪
Warning: properties incomplete for index 90 = ௯
Warning: properties incomplete for index 91 = ௬
Warning: properties incomplete for index 92 = £
Warning: properties incomplete for index 93 = ₹
Warning: properties incomplete for index 94 = ॥
Warning: properties incomplete for index 95 = ௭
Warning: properties incomplete for index 96 = ௰
Warning: properties incomplete for index 97 = ௱
Warning: properties incomplete for index 98 = ௲
Config file is optional, continuing...
mkdir -p data/checkpoints
lstmtraining \
--traineddata data/tam/tam.traineddata \
--old_traineddata ../tessdata_best/tam.traineddata \
--continue_from data/tam/tam.lstm \
--model_output data/checkpoints/tam \
--train_listfile data/list.train \
--eval_listfile data/list.eval \
--max_iterations 400
Loaded file data/tam/tam.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Code range changed from 99 to 99!
Num (Extended) outputs,weights in Series:
1,36,0,1:1, 0
Num (Extended) outputs,weights in Series:
C3,3:9, 0
Ft16:16, 160
Total weights = 160
[C3,3Ft16]:16, 160
Mp3,3:16, 0
Lfys48:48, 12480
Lfx96:96, 55680
Lrx96:96, 74112
Lfx192:192, 221952
Fc99:99, 19107
Total weights = 383491
Previous null char=2 mapped to 2
Continuing from data/tam/tam.lstm
Loaded 1/1 pages (1-1) of document data/ground-truth/192.lstmf
Loaded 1/1 pages (1-1) of document data/ground-truth/193.lstmf
Loaded 1/1 pages (1-1) of document data/ground-truth/198.lstmf
Loaded 1/1 pages (1-1) of document data/ground-truth/191.lstmf
Loaded 1/1 pages (1-1) of document data/ground-truth/199.lstmf
Loaded 1/1 pages (1-1) of document data/ground-truth/195.lstmf
Loaded 1/1 pages (1-1) of document data/ground-truth/196.lstmf
Loaded 1/1 pages (1-1) of document data/ground-truth/200.lstmf
Loaded 1/1 pages (1-1) of document data/ground-truth/194.lstmf
Loaded 1/1 pages (1-1) of document data/ground-truth/190.lstmf
2 Percent improvement time=50, best error was 100 @ 0
At iteration 50/100/100, Mean rms=1.173%, delta=2.113%, char train=9.344%, word train=2%, skip ratio=0%, New best char error = 9.344 Transitioned to stage 1 wrote best model:data/checkpoints/tam9.344_50.checkpoint wrote checkpoint.
2 Percent improvement time=0, best error was 9.344 @ 50
At iteration 50/200/200, Mean rms=0.735%, delta=1.057%, char train=4.836%, word train=1%, skip ratio=0%, New best char error = 4.836 wrote best model:data/checkpoints/tam4.836_50.checkpoint wrote checkpoint.
2 Percent improvement time=0, best error was 9.344 @ 50
At iteration 50/300/300, Mean rms=0.536%, delta=0.704%, char train=3.224%, word train=0.667%, skip ratio=0%, New best char error = 3.224 wrote best model:data/checkpoints/tam3.224_50.checkpoint wrote checkpoint.
2 Percent improvement time=1, best error was 4.836 @ 50
At iteration 51/400/400, Mean rms=0.44%, delta=0.535%, char train=2.449%, word train=0.75%, skip ratio=0%, New best char error = 2.449 wrote best model:data/checkpoints/tam2.449_51.checkpoint wrote checkpoint.
Finished! Error rate = 2.449
lstmtraining \
--stop_training \
--continue_from data/checkpoints/tam_checkpoint \
--traineddata data/tam/tam.traineddata \
--model_output data/tam.traineddata
Loaded file data/checkpoints/tam_checkpoint, unpacking...
@Shreeshrii Thanks for implementing Wordstr Box as an option. During testing I encountered issues such as empty all-boxes and unicharset files. I tried some changes of my own for using the current tif + gt.txt pairs, and I would like feedback on them. I no longer get 'Normalization failed' and 'Can't encode transcription' errors, and all valid characters are added to the unicharset files.
Like how you commented out the validation in validate_grapheme.cpp in this PR, I modified the files below and was able to get a better fine-tuned tam model:
Commented out the lines #L37 #L38 #L39
// tprintf("Invalid start of grapheme sequence:%c=0x%x\n",
// codes_[codes_used_].first, codes_[codes_used_].second);
// }
Commented out the error message and instead added the strings reported as errors to the unicharset file (they are actually valid characters present in tam.lstm-unicharset and Tamil.unicharset):
// tprintf("Normalization failed for string '%s'\n", strings[i].c_str());
unicharset->unichar_insert(strings[i].c_str());
Training with the current Makefile, using the changes suggested in your PR with OCRD-Train:
- Build command:
make training MODEL_NAME=tamtest START_MODEL=tam NORM_MODE=2
- I removed the ₹ symbol to resolve the error 'Warning: properties incomplete for index 97 = ₹' in the START_MODEL's tam/tam.lstm-unicharset, and commented out #L98 #L99 in the Makefile so that it doesn't get overwritten in subsequent trainings.
- I am now able to get valid my.unicharset and unicharset files.
- But I am still facing the errors below while training. All characters were extracted correctly except for the ones combined with ் mentioned in the errors below. The issue only occurs with the generated box and all-boxes files.
Warning: properties incomplete for index 11 = ்
Warning: properties incomplete for index 97 = ்
- For example, க் should be split as க + ், but instead it is added as க் in the box files, which affects training and OCR on real-world data as well: accuracy takes a hit on characters with ். It happens for all characters with ், such as ன், ப், etc.
- ் is referenced as [bcd 200c ] instead of [bcd ] in unicharset, as shown below. I tried changing the same in the Tamil.unicharset and tam.lstm-unicharset files as well, which didn't work.
் 0 0,255,0,255,0,0,0,0,0,0 Tamil 5 17 5 ் # ் [bcd 200c ]
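For reference, the expected split can be verified at the code-point level: க் really is two Unicode code points, so the box-file generator should be able to emit both units. A small sketch:

```python
# க் is two Unicode code points: TAMIL LETTER KA + TAMIL SIGN VIRAMA.
# A box-file generator therefore has both units available and does not
# need to treat க் as one precomposed symbol.
import unicodedata

cluster = "க்"
for c in cluster:
    print(f"U+{ord(c):04X} {unicodedata.name(c)}")
# U+0B95 TAMIL LETTER KA
# U+0BCD TAMIL SIGN VIRAMA
```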
The python script generate_line_box.py is splitting all the characters the right way (பா as ப + ா, மோ as ம + ோ, etc.) except for the characters with ். Can this be resolved in the python script itself? Attaching the files for evaluation.
Thanks.
Training Data
Training_Log.txt
unicharset.txt
my.unicharset.txt
all-boxes.txt
190.box.txt
191.box.txt
192.box.txt
EDIT
With reference to the passage below, taken from the Tesseract 4 Training Wiki, I wonder what the best way is to split up words and generate the corresponding unicharset and box files: for example, whether to split மோ as ம + ோ or leave it as மோ.
Unicharset Compression-recoding
LSTMs are great at learning sequences, but slow down a lot when the number of states is too large. There are empirical results that suggest it is better to ask an LSTM to learn a long sequence than a short sequence of many classes, so for the complex scripts, (Han, Hangul, and the Indic scripts) it is better to recode each symbol as a short sequence of codes from a small number of classes than have a large set of classes. The combine_lang_model command has this feature on by default. It encodes each Han character as a variable-length sequence of 1-5 codes, Hangul using the Jamo encoding as a sequence of 3 codes, and other scripts as a sequence of their unicode components. For the scripts that use a virama character to generate conjunct consonants, (All the Indic scripts plus Myanmar and Khmer) the function NormalizeCleanAndSegmentUTF8 pairs the virama with an appropriate neighbor to generate a more glyph-oriented encoding in the unicharset. To make full use of this improvement, the --pass_through_recoder flag should be set for combine_lang_model for these scripts.
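As a small check of what the recoder actually works with: மோ is already two code points at the Unicode level, so the open question is only whether the recoder emits them as one class or as a sequence of codes. Illustration only:

```python
# மோ = ம (U+0BAE, TAMIL LETTER MA) + ோ (U+0BCB, TAMIL VOWEL SIGN OO).
# With --pass_through_recoder, these unicode components are the codes
# the LSTM learns, per the wiki passage quoted above.
print([f"U+{ord(c):04X}" for c in "மோ"])
# ['U+0BAE', 'U+0BCB']
```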
I usually use tesstrain.sh for testing training. I have made a new PR related to Tamil, not related to ். I will get back with the results soon.
@Shreeshrii Thanks for all the help, looking forward. 👍 Also, I'm looking to understand what to change in generate_line_box.py so that it extracts க் as க and ், etc.
@wrznr Any idea how this can be achieved in the python script? Thanks.
Also, looking to understand what to change in generate_line_box.py, so that it extracts க் as க and ் etc.
Try norm_mode=3.
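If it helps, that would correspond to re-running the extraction step from the log above with the different mode (a sketch; norm_mode 3 is, as far as I understand, pure-unicode normalization, versus the mode 2 used earlier):

```shell
# Same unicharset_extractor call as in the training log, but with
# --norm_mode 3 instead of the --norm_mode 2 used earlier.
unicharset_extractor --output_unicharset "data/ground-truth/my.unicharset" \
    --norm_mode 3 "data/all-boxes"
```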
rm -rf ~/tesstutorial/tamocrd
bash ~/tesseract/src/training/tesstrain.sh \
--fonts_dir ~/.fonts \
--lang tam \
--linedata_only \
--save_box_tiff \
--workspace_dir ~/tmp \
--exposures "0" \
--maxpages 1 \
--noextract_font_properties \
--fontlist \
"Arial Unicode MS" \
"FreeSerif" \
"Karla Tamil Inclined Bold Italic" \
"Karla Tamil Inclined Italic" \
"Karla Tamil Upright Bold" \
"Karla Tamil Upright Regular" \
"Lohit Tamil Classical Regular" \
"Lohit Tamil Regular" \
"Lohit Tamil Regular" \
"Noto Sans Tamil Bold" \
"Noto Sans Tamil Regular" \
"TAMu_Kadambri Regular" \
"TAMu_Kalyani Regular" \
"TAMu_Maduram Normal" \
"TSCu_Comic Normal" \
"TSCu_Paranar Bold" \
"TSCu_Paranar Regular" \
"TSCu_Times Normal" \
--langdata_dir ~/langdata_lstm \
--tessdata_dir ~/tessdata_best \
--training_text /home/ubuntu/ocrd-train/tam.txt \
--output_dir ~/tesstutorial/tamocrd
rm -rf ~/tesstutorial/tamtest
bash ~/tesseract/src/training/tesstrain.sh \
--fonts_dir ~/.fonts \
--lang tam \
--linedata_only \
--save_box_tiff \
--workspace_dir ~/tmp \
--exposures "0" \
--maxpages 5 \
--noextract_font_properties \
--langdata_dir ~/langdata_lstm \
--tessdata_dir ~/tessdata_best \
--fontlist \
"Arial Unicode MS" \
"FreeSerif" \
"Karla Tamil Inclined Bold Italic" \
"Karla Tamil Inclined Italic" \
"Karla Tamil Upright Bold" \
"Karla Tamil Upright Regular" \
"Lohit Tamil Classical Regular" \
"Lohit Tamil Regular" \
"Lohit Tamil Regular" \
"Noto Sans Tamil Bold" \
"Noto Sans Tamil Regular" \
"TAMu_Kadambri Regular" \
"TAMu_Kalyani Regular" \
"TAMu_Maduram Normal" \
"TSCu_Comic Normal" \
"TSCu_Paranar Bold" \
"TSCu_Paranar Regular" \
"TSCu_Times Normal" \
"e-Grantamil" \
"Arima Madurai" \
"Mukta Malar" \
"Catamaran" \
"Hind Madurai" \
"Meera Inimai" \
"Pavanam" \
--training_text ~/langdata/tam/tam.training_text \
--output_dir ~/tesstutorial/tamtest
rm -rf ~/tesstutorial/plusminus_from_tam
mkdir -p ~/tesstutorial/plusminus_from_tam
#
combine_tessdata -e ~/tessdata_best/tam.traineddata \
~/tesstutorial/plusminus_from_tam/tam.lstm
#
for ((num_iterations=4100; num_iterations<=6000; num_iterations+=100)); do
lstmtraining \
--debug_interval 0 \
--model_output ~/tesstutorial/plusminus_from_tam/plusminus \
--continue_from ~/tesstutorial/plusminus_from_tam/tam.lstm \
--old_traineddata ~/tessdata_best/tam.traineddata \
--traineddata ~/tesstutorial/tamtest/tam/tam.traineddata \
--train_listfile ~/tesstutorial/tamtest/tam.training_files.txt \
--eval_listfile ~/tesstutorial/tamocrd/tam.training_files.txt \
--max_image_MB 6000 \
--max_iterations $num_iterations
lstmeval \
--verbosity -1 \
--model ~/tesstutorial/plusminus_from_tam/plusminus_checkpoint \
--traineddata ~/tesstutorial/tamtest/tam/tam.traineddata \
--eval_listfile ~/tesstutorial/tamocrd/tam.training_files.txt
done
time lstmeval \
--verbosity 0 \
--model ~/tessdata_best/tam.traineddata \
--eval_listfile ~/tesstutorial/tamocrd/tam.training_files.txt
time lstmeval \
--verbosity 0 \
--model ~/tessdata_fast/tam.traineddata \
--eval_listfile ~/tesstutorial/tamocrd/tam.training_files.txt
lstmtraining \
--stop_training \
--model_output ~/tesstutorial/plusminus_from_tam/tam_plusminus.traineddata \
--continue_from ~/tesstutorial/plusminus_from_tam/plusminus_checkpoint \
--traineddata ~/tesstutorial/tamtest/tam/tam.traineddata
for i in $(seq -f "%03g" 190 200) ; do
tesseract /home/ubuntu/ocrd-train/data/ground-truth/$i.tif \
/home/ubuntu/ocrd-train/data/ground-truth/$i \
--tessdata-dir ~/tesstutorial/plusminus_from_tam -l tam_plusminus --psm 6
done
for i in $(seq -f "%03g" 190 200) ; do
wdiff --no-common --statistics ~/ocrd-train/data/ground-truth/$i.gt.txt ~/ocrd-train/data/ground-truth/$i.txt
done
======================================================================
[-அக்கம்மாள்-]{+அககம்மாள்+}
======================================================================
/home/ubuntu/ocrd-train/data/ground-truth/190.gt.txt: 1 word 0 0% common 0 0% deleted 1 100% changed
/home/ubuntu/ocrd-train/data/ground-truth/190.txt: 1 word 0 0% common 0 0% inserted 1 100% changed
======================================================================
/home/ubuntu/ocrd-train/data/ground-truth/191.gt.txt: 1 word 1 100% common 0 0% deleted 0 0% changed
/home/ubuntu/ocrd-train/data/ground-truth/191.txt: 1 word 1 100% common 0 0% inserted 0 0% changed
======================================================================
/home/ubuntu/ocrd-train/data/ground-truth/192.gt.txt: 1 word 1 100% common 0 0% deleted 0 0% changed
/home/ubuntu/ocrd-train/data/ground-truth/192.txt: 1 word 1 100% common 0 0% inserted 0 0% changed
======================================================================
/home/ubuntu/ocrd-train/data/ground-truth/193.gt.txt: 1 word 1 100% common 0 0% deleted 0 0% changed
/home/ubuntu/ocrd-train/data/ground-truth/193.txt: 1 word 1 100% common 0 0% inserted 0 0% changed
======================================================================
/home/ubuntu/ocrd-train/data/ground-truth/194.gt.txt: 1 word 1 100% common 0 0% deleted 0 0% changed
/home/ubuntu/ocrd-train/data/ground-truth/194.txt: 1 word 1 100% common 0 0% inserted 0 0% changed
======================================================================
/home/ubuntu/ocrd-train/data/ground-truth/195.gt.txt: 1 word 1 100% common 0 0% deleted 0 0% changed
/home/ubuntu/ocrd-train/data/ground-truth/195.txt: 1 word 1 100% common 0 0% inserted 0 0% changed
======================================================================
[-கௌசல்யா-]{+கள சல்யா+}
======================================================================
/home/ubuntu/ocrd-train/data/ground-truth/196.gt.txt: 1 word 0 0% common 0 0% deleted 1 100% changed
/home/ubuntu/ocrd-train/data/ground-truth/196.txt: 2 words 0 0% common 0 0% inserted 2 100% changed
======================================================================
/home/ubuntu/ocrd-train/data/ground-truth/197.gt.txt: 1 word 1 100% common 0 0% deleted 0 0% changed
/home/ubuntu/ocrd-train/data/ground-truth/197.txt: 1 word 1 100% common 0 0% inserted 0 0% changed
======================================================================
[-பொன்னைய்யா-]{+பொன்னனயயா+}
======================================================================
/home/ubuntu/ocrd-train/data/ground-truth/198.gt.txt: 1 word 0 0% common 0 0% deleted 1 100% changed
/home/ubuntu/ocrd-train/data/ground-truth/198.txt: 1 word 0 0% common 0 0% inserted 1 100% changed
======================================================================
/home/ubuntu/ocrd-train/data/ground-truth/199.gt.txt: 1 word 1 100% common 0 0% deleted 0 0% changed
/home/ubuntu/ocrd-train/data/ground-truth/199.txt: 1 word 1 100% common 0 0% inserted 0 0% changed
======================================================================
/home/ubuntu/ocrd-train/data/ground-truth/200.gt.txt: 1 word 1 100% common 0 0% deleted 0 0% changed
/home/ubuntu/ocrd-train/data/ground-truth/200.txt: 1 word 1 100% common 0 0% inserted 0 0% changed
The zip file has the finetuned traineddata that you can test.
I will check the traineddata. Also, even after using norm_mode 3, it still doesn't split க் as க + ் in the box and all-boxes files.