while we train for sequence of characters in devanagari like क्ष (क ् ष) it giv

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Can't encode transcription: 'क्षक्षाक्षिक्षी क्षु क्षू्क्षे क्षैक्षो क्षौ क्षं क्षः' in language '',about tesseract-ocr/tesstrain

Comments (23)

Shreeshrii commented on May 30, 2024

Is your training text in utf-8?

from tesstrain.

vijayrajasekaran commented on May 30, 2024

@Shreeshrii I am having the same issue when trying to finetune for language Tamil (tessdata_best) but works fine when finetuning English. The training text is in utf-8. Is there any workaround to resolve this issue? Thanks.

from tesstrain.

Shreeshrii commented on May 30, 2024

Please share the files used and commands given so that I can test.

from tesstrain.

Shreeshrii commented on May 30, 2024

also try setting setting PYTHONIOENCODING=utf8 on the shell

from tesstrain.

vijayrajasekaran commented on May 30, 2024

@Shreeshrii I tried PYTHONIOENCODING=utf8 and still receive the same error.

Errors:

Word started with a combiner:0xbbe
Normalization failed for string 'ா'
Can't encode transcription: 'கௌசல்யா' in language ''
Encoding of string failed! Failure bytes: ffffffe0 ffffffaf ffffff8c ffffffe0 ffffffae ffffff9a ffffffe0 ffffffae ffffffb2 ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffffaf ffffffe0 ffffffae ffffffbe

Additional Info:

I have installed tesseract and leptonica using OCRD-train Make Script. Additionally added tam.traineddata (tessdata_best) to ocrd/usr/share/tessdata
Initially I got lot of errors as 'Warning: properties incomplete for index', unable to find Tamil.unicharset and Latin.unicharset, so I added those files in ocrd/data now these errors reduced to 3 seen in the attached log
Command used for Training / Fine tuning: make training MODEL_NAME=tamiltest START_MODEL=tam
To make it simpler, I have attached simple names data collected from web for testing purposes which also reproduces the error being discussed
Facing same issue even with Script: Tamil.trainneddata

Check the attached Training Log and Training Sample Data. Also, I tried fine-tuning eng.traineddata with my own data and it worked fine. I am only having issues fine-tuning tam.trainneddata where I face multiple errors 'Word started with a combiner', 'Warning: properties incomplete for index' and 'Can't encode transcription'.

Training_Log.txt
training_data_sample.zip

from tesstrain.

Shreeshrii commented on May 30, 2024

Try adding a line feed to your ground truth files. Currently it is just one line, make it one line of text followed by empty line. See if that helps

…

On Sun, Feb 24, 2019 at 9:14 PM Vijay Rajasekaran ***@***.***> wrote: @Shreeshrii <https://github.com/Shreeshrii> I tried PYTHONIOENCODING=utf8 and still receive the same error. Errors: 1. Word started with a combiner:0xbbe Normalization failed for string 'ா' 2. Can't encode transcription: 'கௌசல்யா' in language '' Encoding of string failed! Failure bytes: ffffffe0 ffffffaf ffffff8c ffffffe0 ffffffae ffffff9a ffffffe0 ffffffae ffffffb2 ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffffaf ffffffe0 ffffffae ffffffbe - I have installed tesseract and leptonica using OCRD-train Make Script. Additionally added tam.traineddata (tessdata_best) to ocrd/usr/share/tessdata - Initially I got lot of errors as 'Warning: properties incomplete for index', unable to find Tamil.unicharset and Latin.unicharset, so I added those files in ocrd/data now these errors reduced to 3 seen in the attached log - Command used for Training / Fine tuning: make training MODEL_NAME=tamiltest START_MODEL=tam - To make it simpler, I have attached simple names data collected from web for testing purposes which also reproduces the error being discussed Check the attached Training Log and Training Sample Data Training Log <https://pastebin.com/Cjy5rgJ3> Training Sample Data <https://drive.google.com/open?id=1JHLSWbQ3BR12_gRBa5Au0q5El_AK8FjU> — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#52 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AE2_ozoLF8uHr09YhK8ZJ_kJIJWaGvWbks5vQrNYgaJpZM4avRaq> .

____________________________________________________________ भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

from tesstrain.

vijayrajasekaran commented on May 30, 2024

Edit:

After seeing the error messages 'Can't encode transcription' again: it only points to 3 words: கௌசல்யா, கௌசிக், கௌரி training the bigger dataset. Issue with the letter: ** கௌ ** ?
Any idea of how to solve the remaining errors: 'Word started with a combiner' & 'Warning: properties incomplete for index' as attached in the log file of the previous comment.

@Shreeshrii Thanks for the tip, the error 'Can't encode transcription' has been resolved.

Thanks a lot.

from tesstrain.

Shreeshrii commented on May 30, 2024

You can ignore Warning: properties incomplete for index', the program now geneates the required properties. The other errors are related to normalization of the unicode text. It is possible that there are some errors in the words in your training text. If the errors are few you can ignore, else run substitutions to fix the problem.

…

On Sun, 24 Feb 2019, 22:29 Vijay Rajasekaran, ***@***.***> wrote: @Shreeshrii <https://github.com/Shreeshrii> Thanks for the tip, the error 'Can't encode transcription' has been resolved. Any idea of how to solve the remaining errors: 'Word started with a combiner' & 'Warning: properties incomplete for index' Thanks a lot. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#52 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AE2_ozUAYRRwrhSUB9n-Zw2QobdnwEqUks5vQsTygaJpZM4avRaq> .

from tesstrain.

vijayrajasekaran commented on May 30, 2024

@Shreeshrii

I receive lot of Normalization errors (though repeated), I am using Google Transliterate to make the ground-truth text. Is there any better way to correct the errors made when using tesseract best data and training them with corrected text. Also, can advise me more regarding the substitutions as a fix.

Normalization _Error.txt

from tesstrain.

Shreeshrii commented on May 30, 2024

I haven't tested tesseract training with Tamil text, so I am not sure what is causing this. If you think that the text is correct and there is some problem in normalization rules, you can try to comment out the following and see if it helps - https://github.com/tesseract-ocr/tesseract/blob/master/src/training/validate_grapheme.cpp#L18 // Reject easily detected badly formed sequences. // if (prev_cc == CharClass::kWhitespace && is_combiner) { // if (report_errors_) tprintf("Word started with a combiner:0x%x\n", ch); // return false; // }

…

On Sun, Feb 24, 2019 at 11:10 PM Vijay Rajasekaran ***@***.***> wrote: @Shreeshrii <https://github.com/Shreeshrii> I receive lot of Normalization errors (though repeated), I am using Google Transliterate to make the ground-truth text. Is there any better way to correct the errors made when using tesseract best data and training them with corrected text. Also, can advise me more regarding the substitutions as a fix. Normalization _Error.txt <https://github.com/OCR-D/ocrd-train/files/2898176/Normalization._Error.txt> — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#52 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AE2_oyNc_fHOcTQCSGOQ6e73FRgm48z-ks5vQs6ngaJpZM4avRaq> .

____________________________________________________________ भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

from tesstrain.

vijayrajasekaran commented on May 30, 2024

@Shreeshrii Just like you said, I have commented out in the source downloaded from master and compiled. Version updated to 4.1.0-rc1.

Now, the error changed to 'Invalid start of grapheme sequence'. Also, I noticed all the characters showing from the error 'Normalization failed for string' are present in ocrd/data/Tamil.unicharset and data/tam/tam.lstm-unicharset. But, the characters that are shown in error are not extracted while creating ground-truth/my.unicharset.

After merge_unicharsets, the resulting files data/unicharset & data/all-boxes has all the missing characters. Does this mean we can ignore the error and proceed with training?

from tesstrain.

Shreeshrii commented on May 30, 2024

After merge_unicharsets, the resulting files data/unicharset &

data/all-boxes has all the missing characters. Does this mean we can ignore the error and proceed with training? Sure, give it a try. That is the best way to find out if it works :-)

…

On Mon, Feb 25, 2019 at 3:16 AM Vijay Rajasekaran ***@***.***> wrote: @Shreeshrii <https://github.com/Shreeshrii> Just like you said, I have commented out in the source downloaded from master and compiled. Version updated to 4.1.0-rc1. Now, the error changed to 'Invalid start of grapheme sequence'. Also, I noticed all the characters showing from the error 'Normalization failed for string' are present in ocrd/data/Tamil.unicharset and data/tam/tam.lstm-unicharset. But, the characters that are shown in error are not extracted while creating ground-truth/my.unicharset. After merge_unicharsets, the resulting files data/unicharset & data/all-boxes has all the missing characters. Does this mean we can ignore the error and proceed with training? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#52 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AE2_o80GkpHhTUzuWcDwxej75l8bR0QVks5vQwhAgaJpZM4avRaq> .

____________________________________________________________ भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

from tesstrain.

Shreeshrii commented on May 30, 2024

The box files created by the python script do not meet normalization rules for the indic scripts.

I have changed the makefile to use tesseract 's wordstr box file generation. The box files or data/allboxes need to be reviewed/corrected manually before continuing training.

The makefile should be modified to output a message as reminder since the box files will NOT be 100% correct.

from tesstrain.

Shreeshrii commented on May 30, 2024

I have added a PR - you can see it for changes made. Use Wordstr box option for images without ground truth - #56

Here is the console log:

ubuntu@tesseract-ocr:~/ocrd-train$ make clean MODEL_NAME=tam
find data/ground-truth -name '*.box' -delete
find data/ground-truth -name '*.lstmf' -delete
rm -rf data/all-*
rm -rf data/list.*
rm -rf data/tam
rm -rf data/unicharset
rm -rf data/checkpoints

ubuntu@tesseract-ocr:~/ocrd-train$ make training  MODEL_NAME=tam START_MODEL=tam TESSDATA=../tessdata_best
tesseract "data/ground-truth/190.tif" "data/ground-truth/190" -l tam --psm 6 wordstrbox > "data/ground-truth/190.box"
Tesseract Open Source OCR Engine v4.1.0-rc1-9-g49ed with Leptonica
Page 1
tesseract "data/ground-truth/191.tif" "data/ground-truth/191" -l tam --psm 6 wordstrbox > "data/ground-truth/191.box"
Tesseract Open Source OCR Engine v4.1.0-rc1-9-g49ed with Leptonica
Page 1
tesseract "data/ground-truth/192.tif" "data/ground-truth/192" -l tam --psm 6 wordstrbox > "data/ground-truth/192.box"
Tesseract Open Source OCR Engine v4.1.0-rc1-9-g49ed with Leptonica
Page 1
tesseract "data/ground-truth/193.tif" "data/ground-truth/193" -l tam --psm 6 wordstrbox > "data/ground-truth/193.box"
Tesseract Open Source OCR Engine v4.1.0-rc1-9-g49ed with Leptonica
Page 1
tesseract "data/ground-truth/194.tif" "data/ground-truth/194" -l tam --psm 6 wordstrbox > "data/ground-truth/194.box"
Tesseract Open Source OCR Engine v4.1.0-rc1-9-g49ed with Leptonica
Page 1
tesseract "data/ground-truth/195.tif" "data/ground-truth/195" -l tam --psm 6 wordstrbox > "data/ground-truth/195.box"
Tesseract Open Source OCR Engine v4.1.0-rc1-9-g49ed with Leptonica
Page 1
tesseract "data/ground-truth/196.tif" "data/ground-truth/196" -l tam --psm 6 wordstrbox > "data/ground-truth/196.box"
Tesseract Open Source OCR Engine v4.1.0-rc1-9-g49ed with Leptonica
Page 1
tesseract "data/ground-truth/197.tif" "data/ground-truth/197" -l tam --psm 6 wordstrbox > "data/ground-truth/197.box"
Tesseract Open Source OCR Engine v4.1.0-rc1-9-g49ed with Leptonica
Page 1
tesseract "data/ground-truth/198.tif" "data/ground-truth/198" -l tam --psm 6 wordstrbox > "data/ground-truth/198.box"
Tesseract Open Source OCR Engine v4.1.0-rc1-9-g49ed with Leptonica
Page 1
tesseract "data/ground-truth/199.tif" "data/ground-truth/199" -l tam --psm 6 wordstrbox > "data/ground-truth/199.box"
Tesseract Open Source OCR Engine v4.1.0-rc1-9-g49ed with Leptonica
Page 1
tesseract "data/ground-truth/200.tif" "data/ground-truth/200" -l tam --psm 6 wordstrbox > "data/ground-truth/200.box"
Tesseract Open Source OCR Engine v4.1.0-rc1-9-g49ed with Leptonica
Page 1
find data/ground-truth -name '*.box' -exec cat {} \; > "data/all-boxes"
mkdir -p data/tam
combine_tessdata -u ../tessdata_best/tam.traineddata  data/tam/tam
Extracting tessdata components from ../tessdata_best/tam.traineddata
Wrote data/tam/tam.config
Wrote data/tam/tam.lstm
Wrote data/tam/tam.lstm-punc-dawg
Wrote data/tam/tam.lstm-word-dawg
Wrote data/tam/tam.lstm-number-dawg
Wrote data/tam/tam.lstm-unicharset
Wrote data/tam/tam.lstm-recoder
Wrote data/tam/tam.version
Version string:4.00.00alpha:tam:synth20170629:[1,36,0,1Ct3,3,16Mp3,3Lfys48Lfx96Lrx96Lfx192O1c1]
0:config:size=355, offset=192
17:lstm:size=3068971, offset=547
18:lstm-punc-dawg:size=2442, offset=3069518
19:lstm-word-dawg:size=2943474, offset=3071960
20:lstm-number-dawg:size=818, offset=6015434
21:lstm-unicharset:size=5970, offset=6016252
22:lstm-recoder:size=895, offset=6022222
23:version:size=80, offset=6023117
unicharset_extractor --output_unicharset "data/ground-truth/my.unicharset" --norm_mode 2 "data/all-boxes"
Extracting unicharset from box file data/all-boxes
Wrote unicharset file data/ground-truth/my.unicharset
merge_unicharsets data/tam/tam.lstm-unicharset data/ground-truth/my.unicharset  "data/unicharset"
Loaded unicharset of size 99 from file data/tam/tam.lstm-unicharset
Loaded unicharset of size 30 from file data/ground-truth/my.unicharset
Wrote unicharset file data/unicharset.
tesseract data/ground-truth/190.tif data/ground-truth/190 --psm 6 lstm.train
Tesseract Open Source OCR Engine v4.1.0-rc1-9-g49ed with Leptonica
Page 1
tesseract data/ground-truth/191.tif data/ground-truth/191 --psm 6 lstm.train
Tesseract Open Source OCR Engine v4.1.0-rc1-9-g49ed with Leptonica
Page 1
tesseract data/ground-truth/192.tif data/ground-truth/192 --psm 6 lstm.train
Tesseract Open Source OCR Engine v4.1.0-rc1-9-g49ed with Leptonica
Page 1
tesseract data/ground-truth/193.tif data/ground-truth/193 --psm 6 lstm.train
Tesseract Open Source OCR Engine v4.1.0-rc1-9-g49ed with Leptonica
Page 1
tesseract data/ground-truth/194.tif data/ground-truth/194 --psm 6 lstm.train
Tesseract Open Source OCR Engine v4.1.0-rc1-9-g49ed with Leptonica
Page 1
tesseract data/ground-truth/195.tif data/ground-truth/195 --psm 6 lstm.train
Tesseract Open Source OCR Engine v4.1.0-rc1-9-g49ed with Leptonica
Page 1
tesseract data/ground-truth/196.tif data/ground-truth/196 --psm 6 lstm.train
Tesseract Open Source OCR Engine v4.1.0-rc1-9-g49ed with Leptonica
Page 1
tesseract data/ground-truth/197.tif data/ground-truth/197 --psm 6 lstm.train
Tesseract Open Source OCR Engine v4.1.0-rc1-9-g49ed with Leptonica
Page 1
tesseract data/ground-truth/198.tif data/ground-truth/198 --psm 6 lstm.train
Tesseract Open Source OCR Engine v4.1.0-rc1-9-g49ed with Leptonica
Page 1
tesseract data/ground-truth/199.tif data/ground-truth/199 --psm 6 lstm.train
Tesseract Open Source OCR Engine v4.1.0-rc1-9-g49ed with Leptonica
Page 1
tesseract data/ground-truth/200.tif data/ground-truth/200 --psm 6 lstm.train
Tesseract Open Source OCR Engine v4.1.0-rc1-9-g49ed with Leptonica
Page 1
find data/ground-truth -name '*.lstmf' -exec echo {} \; | sort -R -o "data/all-lstmf"
total=`cat data/all-lstmf | wc -l` \
   no=`echo "$total * 0.90 / 1" | bc`; \
   head -n "$no" data/all-lstmf > "data/list.train"
total=`cat data/all-lstmf | wc -l` \
   no=`echo "($total - $total * 0.90) / 1" | bc`; \
   tail -n "$no" data/all-lstmf > "data/list.eval"
combine_lang_model \
  --input_unicharset data/unicharset \
  --script_dir data/ \
  --output_dir data/ \
  --pass_through_recoder \
  --lang tam
Loaded unicharset of size 99 from file data/unicharset
Setting unichar properties
Setting script properties
Failed to load script unicharset from:data//Tamil.unicharset
Failed to load script unicharset from:data//Latin.unicharset
Warning: properties incomplete for index 3 = [
Warning: properties incomplete for index 4 = ]
Warning: properties incomplete for index 5 = 2
Warning: properties incomplete for index 6 = ப
Warning: properties incomplete for index 7 = ொ
Warning: properties incomplete for index 8 = ர
Warning: properties incomplete for index 9 = ு
Warning: properties incomplete for index 10 = ட
Warning: properties incomplete for index 11 = ்
Warning: properties incomplete for index 12 = க
Warning: properties incomplete for index 13 = ள
Warning: properties incomplete for index 14 = ி
Warning: properties incomplete for index 15 = ன
Warning: properties incomplete for index 16 = ்‌
Warning: properties incomplete for index 17 = 4
Warning: properties incomplete for index 18 = "
Warning: properties incomplete for index 19 = -
Warning: properties incomplete for index 20 = ஐ
Warning: properties incomplete for index 21 = /
Warning: properties incomplete for index 22 = )
Warning: properties incomplete for index 23 = 9
Warning: properties incomplete for index 24 = 8
Warning: properties incomplete for index 25 = வ
Warning: properties incomplete for index 26 = ோ
Warning: properties incomplete for index 27 = ே
Warning: properties incomplete for index 28 = ெ
Warning: properties incomplete for index 29 = ஷ
Warning: properties incomplete for index 30 = ா
Warning: properties incomplete for index 31 = .
Warning: properties incomplete for index 32 = த
Warning: properties incomplete for index 33 = ,
Warning: properties incomplete for index 34 = ண
Warning: properties incomplete for index 35 = ம
Warning: properties incomplete for index 36 = 3
Warning: properties incomplete for index 37 = 6
Warning: properties incomplete for index 38 = 7
Warning: properties incomplete for index 39 = 0
Warning: properties incomplete for index 40 = ீ
Warning: properties incomplete for index 41 = ற
Warning: properties incomplete for index 42 = ஸ
Warning: properties incomplete for index 43 = அ
Warning: properties incomplete for index 44 = ழ
Warning: properties incomplete for index 45 = !
Warning: properties incomplete for index 46 = எ
Warning: properties incomplete for index 47 = ச
Warning: properties incomplete for index 48 = ல
Warning: properties incomplete for index 49 = ூ
Warning: properties incomplete for index 50 = 5
Warning: properties incomplete for index 51 = இ
Warning: properties incomplete for index 52 = 1
Warning: properties incomplete for index 53 = ய
Warning: properties incomplete for index 54 = ந
Warning: properties incomplete for index 55 = ை
Warning: properties incomplete for index 56 = ஞ
Warning: properties incomplete for index 57 = (
Warning: properties incomplete for index 58 = '
Warning: properties incomplete for index 59 = :
Warning: properties incomplete for index 60 = உ
Warning: properties incomplete for index 61 = ஜ
Warning: properties incomplete for index 62 = ங
Warning: properties incomplete for index 63 = ஆ
Warning: properties incomplete for index 64 = ஏ
Warning: properties incomplete for index 65 = ?
Warning: properties incomplete for index 66 = ஹ
Warning: properties incomplete for index 67 = ”
Warning: properties incomplete for index 68 = ஓ
Warning: properties incomplete for index 69 = ;
Warning: properties incomplete for index 70 = ஈ
Warning: properties incomplete for index 71 = “
Warning: properties incomplete for index 72 = *
Warning: properties incomplete for index 73 = ஒ
Warning: properties incomplete for index 74 = ஊ
Warning: properties incomplete for index 75 = ஃ
Warning: properties incomplete for index 76 = %
Warning: properties incomplete for index 77 = ।
Warning: properties incomplete for index 78 = _
Warning: properties incomplete for index 79 = $
Warning: properties incomplete for index 80 = ௦
Warning: properties incomplete for index 81 = ௫
Warning: properties incomplete for index 82 = |
Warning: properties incomplete for index 83 = &
Warning: properties incomplete for index 84 = ௩
Warning: properties incomplete for index 85 = ௨
Warning: properties incomplete for index 86 = ௮
Warning: properties incomplete for index 87 = ௧
Warning: properties incomplete for index 88 = €
Warning: properties incomplete for index 89 = ௪
Warning: properties incomplete for index 90 = ௯
Warning: properties incomplete for index 91 = ௬
Warning: properties incomplete for index 92 = £
Warning: properties incomplete for index 93 = ₹
Warning: properties incomplete for index 94 = ॥
Warning: properties incomplete for index 95 = ௭
Warning: properties incomplete for index 96 = ௰
Warning: properties incomplete for index 97 = ௱
Warning: properties incomplete for index 98 = ௲
Config file is optional, continuing...
mkdir -p data/checkpoints
lstmtraining \
  --traineddata data/tam/tam.traineddata \
      --old_traineddata ../tessdata_best/tam.traineddata \
  --continue_from data/tam/tam.lstm \
  --model_output data/checkpoints/tam \
  --train_listfile data/list.train \
  --eval_listfile data/list.eval \
  --max_iterations 400
Loaded file data/tam/tam.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Code range changed from 99 to 99!
Num (Extended) outputs,weights in Series:
  1,36,0,1:1, 0
Num (Extended) outputs,weights in Series:
  C3,3:9, 0
  Ft16:16, 160
Total weights = 160
  [C3,3Ft16]:16, 160
  Mp3,3:16, 0
  Lfys48:48, 12480
  Lfx96:96, 55680
  Lrx96:96, 74112
  Lfx192:192, 221952
  Fc99:99, 19107
Total weights = 383491
Previous null char=2 mapped to 2
Continuing from data/tam/tam.lstm
Loaded 1/1 pages (1-1) of document data/ground-truth/192.lstmf
Loaded 1/1 pages (1-1) of document data/ground-truth/193.lstmf
Loaded 1/1 pages (1-1) of document data/ground-truth/198.lstmf
Loaded 1/1 pages (1-1) of document data/ground-truth/191.lstmf
Loaded 1/1 pages (1-1) of document data/ground-truth/199.lstmf
Loaded 1/1 pages (1-1) of document data/ground-truth/195.lstmf
Loaded 1/1 pages (1-1) of document data/ground-truth/196.lstmf
Loaded 1/1 pages (1-1) of document data/ground-truth/200.lstmf
Loaded 1/1 pages (1-1) of document data/ground-truth/194.lstmf
Loaded 1/1 pages (1-1) of document data/ground-truth/190.lstmf
2 Percent improvement time=50, best error was 100 @ 0
At iteration 50/100/100, Mean rms=1.173%, delta=2.113%, char train=9.344%, word train=2%, skip ratio=0%,  New best char error = 9.344 Transitioned to stage 1 wrote best model:data/checkpoints/tam9.344_50.checkpoint wrote checkpoint.

2 Percent improvement time=0, best error was 9.344 @ 50
At iteration 50/200/200, Mean rms=0.735%, delta=1.057%, char train=4.836%, word train=1%, skip ratio=0%,  New best char error = 4.836 wrote best model:data/checkpoints/tam4.836_50.checkpoint wrote checkpoint.

2 Percent improvement time=0, best error was 9.344 @ 50
At iteration 50/300/300, Mean rms=0.536%, delta=0.704%, char train=3.224%, word train=0.667%, skip ratio=0%,  New best char error = 3.224 wrote best model:data/checkpoints/tam3.224_50.checkpoint wrote checkpoint.

2 Percent improvement time=1, best error was 4.836 @ 50
At iteration 51/400/400, Mean rms=0.44%, delta=0.535%, char train=2.449%, word train=0.75%, skip ratio=0%,  New best char error = 2.449 wrote best model:data/checkpoints/tam2.449_51.checkpoint wrote checkpoint.

Finished! Error rate = 2.449
lstmtraining \
--stop_training \
--continue_from data/checkpoints/tam_checkpoint \
--traineddata data/tam/tam.traineddata \
--model_output data/tam.traineddata
Loaded file data/checkpoints/tam_checkpoint, unpacking...

from tesstrain.

vijayrajasekaran commented on May 30, 2024

@Shreeshrii Thanks for implementing Wordstr Box as an option. During testing I encountered issues such as empty all-boxes, unicharset files. I tried some changes too for using the current tif + gt.txt pairs and I would like to get feedback on them. I no longer get Normalization and Can't encode transcription errors, all characters that are valid are added to the unicharset files.

Like how you commented out the validation in validate_grapheme.cpp in this PR. I modified the below mentioned files and able to get a better fine-tuned tam data:

validate_indic.cpp#L37

Commented out the lines #L37 #L38 #L39

        // tprintf("Invalid start of grapheme sequence:%c=0x%x\n",
                // codes_[codes_used_].first, codes_[codes_used_].second);
      // }

unicharset_extractor.cpp#L59

Commented out the Error message and also adding the strings to the Unicharset file that are shown as errors (which are actually valid characters present in tam.lstm-unicharset and Tamil.unicharset)

// tprintf("Normalization failed for string '%s'\n", strings[i].c_str());
unicharset->unichar_insert(strings[i].c_str());

Training using the current Makefile using changes suggested in your PR with OCRD-Train:

Build Command: make training MODEL_NAME=tamtest START_MODEL=tam NORM_MODE=2
I removed the ₹ symbol to resolve the error Warning: properties incomplete for index 97 = ₹ in START_MODEL tam/tam.lstm-unicharset and commented out #L98 #L99 in Makefile so that it doesn't get overwritten in the subsequent trainings.
I am now able to get valid my.unicharset, unicharset
But, I am still facing the below errors while training. All characters were extracted well except for the characters with ் mentioned in the error below. This issue only occurs with the generated box and all-boxes file.

Warning: properties incomplete for index 11 = ்
Warning: properties incomplete for index 97 = ்‌

For example, க் should be split as க + ் instead it is added as க் in the box files which affects the training and while doing OCR with real world data as well. Accuracy gets a hit on characters with ், It happens for all characters with ் such as ன், ப் etc.
்‌ is referenced as [bcd 200c ] instead of [bcd ] in unicharset as shown below. I had tried changing the same in Tamil.unicharset and tam.lstm-unicharset files as well which didn't work.

்‌ 0 0,255,0,255,0,0,0,0,0,0 Tamil 5 17 5 ்‌	# ்‌ [bcd 200c ]

The python script generate_line_box.py is splitting all the characters in the right way (பா as ப + ா and மோ as ம + ோ etc..) except for the characters with ் . Can this be resolved in the python script itself? Attaching the files for evaluation.

Thanks.

Training Data
Training_Log.txt
unicharset.txt
my.unicharset.txt
all-boxes.txt
190.box.txt
191.box.txt
192.box.txt

EDIT

With reference to the below taken from Tesseract 4 Training Wiki, I wonder what is the best way to split up words and generate the corresponding unicharset box files. For example, whether to split மோ as ம + ோ or leave it as மோ.

Unicharset Compression-recoding

LSTMs are great at learning sequences, but slow down a lot when the number of states is too large. There are empirical results that suggest it is better to ask an LSTM to learn a long sequence than a short sequence of many classes, so for the complex scripts, (Han, Hangul, and the Indic scripts) it is better to recode each symbol as a short sequence of codes from a small number of classes than have a large set of classes. The combine_lang_model command has this feature on by default. It encodes each Han character as a variable-length sequence of 1-5 codes, Hangul using the Jamo encoding as a sequence of 3 codes, and other scripts as a sequence of their unicode components. For the scripts that use a virama character to generate conjunct consonants, (All the Indic scripts plus Myanmar and Khmer) the function NormalizeCleanAndSegmentUTF8 pairs the virama with an appropriate neighbor to generate a more glyph-oriented encoding in the unicharset. To make full use of this improvement, the --pass_through_recoder flag should be set for combine_lang_model for these scripts.

from tesstrain.

Shreeshrii commented on May 30, 2024

I usually use tesstrain.sh for testing training. I have made a new PR related to Tamil not related to
். I will get back with the results soon.

from tesstrain.

vijayrajasekaran commented on May 30, 2024

@Shreeshrii Thanks for all the help, looking forward. 👍 Also, looking to understand what to change in generate_line_box.py, so that it extracts க் as க and ் etc.

@wrznr Any idea how can this be achieved in the python script? Thanks.

from tesstrain.

Shreeshrii commented on May 30, 2024

Also, looking to understand what to change in generate_line_box.py, so that it extracts க் as க and ் etc.

try norm_mode=3

from tesstrain.

Shreeshrii commented on May 30, 2024

tam_plusminus.zip

rm -rf ~/tesstutorial/tamocrd
bash  ~/tesseract/src/training/tesstrain.sh \
  --fonts_dir ~/.fonts \
  --lang tam \
  --linedata_only \
  --save_box_tiff \
  --workspace_dir ~/tmp \
  --exposures "0" \
  --maxpages 1 \
  --noextract_font_properties \
  --fontlist \
"Arial Unicode MS" \
"FreeSerif" \
"Karla Tamil Inclined Bold Italic" \
"Karla Tamil Inclined Italic" \
"Karla Tamil Upright Bold" \
"Karla Tamil Upright Regular" \
"Lohit Tamil Classical Regular" \
"Lohit Tamil Regular" \
"Lohit Tamil Regular" \
"Noto Sans Tamil Bold" \
"Noto Sans Tamil Regular" \
"TAMu_Kadambri Regular" \
"TAMu_Kalyani Regular" \
"TAMu_Maduram Normal" \
"TSCu_Comic Normal" \
"TSCu_Paranar Bold" \
"TSCu_Paranar Regular" \
"TSCu_Times Normal" \
  --langdata_dir ~/langdata_lstm \
  --tessdata_dir ~/tessdata_best  \
  --training_text /home/ubuntu/ocrd-train/tam.txt \
  --output_dir ~/tesstutorial/tamocrd
  
rm -rf ~/tesstutorial/tamtest
bash  ~/tesseract/src/training/tesstrain.sh \
  --fonts_dir ~/.fonts \
  --lang tam \
  --linedata_only \
  --save_box_tiff \
  --workspace_dir ~/tmp \
  --exposures "0" \
  --maxpages 5 \
  --noextract_font_properties \
  --langdata_dir ~/langdata_lstm \
  --tessdata_dir ~/tessdata_best  \
  --fontlist \
"Arial Unicode MS" \
"FreeSerif" \
"Karla Tamil Inclined Bold Italic" \
"Karla Tamil Inclined Italic" \
"Karla Tamil Upright Bold" \
"Karla Tamil Upright Regular" \
"Lohit Tamil Classical Regular" \
"Lohit Tamil Regular" \
"Lohit Tamil Regular" \
"Noto Sans Tamil Bold" \
"Noto Sans Tamil Regular" \
"TAMu_Kadambri Regular" \
"TAMu_Kalyani Regular" \
"TAMu_Maduram Normal" \
"TSCu_Comic Normal" \
"TSCu_Paranar Bold" \
"TSCu_Paranar Regular" \
"TSCu_Times Normal" \
"e-Grantamil" \
"Arima Madurai" \
"Mukta Malar" \
"Catamaran" \
"Hind Madurai" \
"Meera Inimai" \
"Pavanam" \
  --training_text ~/langdata/tam/tam.training_text \
  --output_dir ~/tesstutorial/tamtest

rm -rf ~/tesstutorial/plusminus_from_tam
mkdir -p ~/tesstutorial/plusminus_from_tam
#
combine_tessdata -e ~/tessdata_best/tam.traineddata \
  ~/tesstutorial/plusminus_from_tam/tam.lstm
#
  
for ((num_iterations=4100; num_iterations<=6000; num_iterations+=100)); do

  lstmtraining \
  --debug_interval 0 \
  --model_output ~/tesstutorial/plusminus_from_tam/plusminus \
  --continue_from ~/tesstutorial/plusminus_from_tam/tam.lstm \
  --old_traineddata ~/tessdata_best/tam.traineddata \
  --traineddata ~/tesstutorial/tamtest/tam/tam.traineddata \
  --train_listfile ~/tesstutorial/tamtest/tam.training_files.txt \
  --eval_listfile ~/tesstutorial/tamocrd/tam.training_files.txt \
  --max_image_MB 6000 \
  --max_iterations $num_iterations
  
  lstmeval \
  --verbosity -1 \
  --model ~/tesstutorial/plusminus_from_tam/plusminus_checkpoint \
  --traineddata ~/tesstutorial/tamtest/tam/tam.traineddata  \
  --eval_listfile ~/tesstutorial/tamocrd/tam.training_files.txt

done
  
time lstmeval \
  --verbosity 0 \
  --model ~/tessdata_best/tam.traineddata  \
  --eval_listfile ~/tesstutorial/tamocrd/tam.training_files.txt
  
time lstmeval \
  --verbosity 0 \
  --model ~/tessdata_fast/tam.traineddata  \
  --eval_listfile ~/tesstutorial/tamocrd/tam.training_files.txt
  
lstmtraining \
  --stop_training \
  --model_output ~/tesstutorial/plusminus_from_tam/tam_plusminus.traineddata \
  --continue_from ~/tesstutorial/plusminus_from_tam/plusminus_checkpoint \
  --traineddata ~/tesstutorial/tamtest/tam/tam.traineddata

from tesstrain.

Shreeshrii commented on May 30, 2024

for i in $(seq -f "%03g" 190 200) ; do
    tesseract /home/ubuntu/ocrd-train/data/ground-truth/$i.tif \
      /home/ubuntu/ocrd-train/data/ground-truth/$i \
      --tessdata-dir ~/tesstutorial/plusminus_from_tam -l tam_plusminus --psm 6 
done

for i in $(seq -f "%03g" 190 200) ; do
wdiff --no-common --statistics ~/ocrd-train/data/ground-truth/$i.gt.txt ~/ocrd-train/data/ground-truth/$i.txt
done

======================================================================
[-அக்கம்மாள்-]{+அககம்மாள்+}
======================================================================
/home/ubuntu/ocrd-train/data/ground-truth/190.gt.txt: 1 word  0 0% common  0 0% deleted  1 100% changed
/home/ubuntu/ocrd-train/data/ground-truth/190.txt: 1 word  0 0% common  0 0% inserted  1 100% changed

======================================================================
/home/ubuntu/ocrd-train/data/ground-truth/191.gt.txt: 1 word  1 100% common  0 0% deleted  0 0% changed
/home/ubuntu/ocrd-train/data/ground-truth/191.txt: 1 word  1 100% common  0 0% inserted  0 0% changed

======================================================================
/home/ubuntu/ocrd-train/data/ground-truth/192.gt.txt: 1 word  1 100% common  0 0% deleted  0 0% changed
/home/ubuntu/ocrd-train/data/ground-truth/192.txt: 1 word  1 100% common  0 0% inserted  0 0% changed

======================================================================
/home/ubuntu/ocrd-train/data/ground-truth/193.gt.txt: 1 word  1 100% common  0 0% deleted  0 0% changed
/home/ubuntu/ocrd-train/data/ground-truth/193.txt: 1 word  1 100% common  0 0% inserted  0 0% changed

======================================================================
/home/ubuntu/ocrd-train/data/ground-truth/194.gt.txt: 1 word  1 100% common  0 0% deleted  0 0% changed
/home/ubuntu/ocrd-train/data/ground-truth/194.txt: 1 word  1 100% common  0 0% inserted  0 0% changed

======================================================================
/home/ubuntu/ocrd-train/data/ground-truth/195.gt.txt: 1 word  1 100% common  0 0% deleted  0 0% changed
/home/ubuntu/ocrd-train/data/ground-truth/195.txt: 1 word  1 100% common  0 0% inserted  0 0% changed

======================================================================
[-கௌசல்யா-]{+கள சல்யா+}
======================================================================
/home/ubuntu/ocrd-train/data/ground-truth/196.gt.txt: 1 word  0 0% common  0 0% deleted  1 100% changed
/home/ubuntu/ocrd-train/data/ground-truth/196.txt: 2 words  0 0% common  0 0% inserted  2 100% changed

======================================================================
/home/ubuntu/ocrd-train/data/ground-truth/197.gt.txt: 1 word  1 100% common  0 0% deleted  0 0% changed
/home/ubuntu/ocrd-train/data/ground-truth/197.txt: 1 word  1 100% common  0 0% inserted  0 0% changed

======================================================================
[-பொன்னைய்யா-]{+பொன்னனயயா+}
======================================================================
/home/ubuntu/ocrd-train/data/ground-truth/198.gt.txt: 1 word  0 0% common  0 0% deleted  1 100% changed
/home/ubuntu/ocrd-train/data/ground-truth/198.txt: 1 word  0 0% common  0 0% inserted  1 100% changed

======================================================================
/home/ubuntu/ocrd-train/data/ground-truth/199.gt.txt: 1 word  1 100% common  0 0% deleted  0 0% changed
/home/ubuntu/ocrd-train/data/ground-truth/199.txt: 1 word  1 100% common  0 0% inserted  0 0% changed

======================================================================
/home/ubuntu/ocrd-train/data/ground-truth/200.gt.txt: 1 word  1 100% common  0 0% deleted  0 0% changed
/home/ubuntu/ocrd-train/data/ground-truth/200.txt: 1 word  1 100% common  0 0% inserted  0 0% changed

from tesstrain.

Shreeshrii commented on May 30, 2024

The zip file has the finetuned traineddata that you can test.

from tesstrain.

vijayrajasekaran commented on May 30, 2024

I will check the traineddata. Also, even after using norm_mode as 3 it doesn't split
க் as க and ் in the box and all-boxes files.

from tesstrain.

Shreeshrii commented on May 30, 2024

I don't think that the Warnings regarding ் are the real issue. If you review tam.training_text in langdata_lstm, you will find that it doesn't have any sample for `kau` - that is the reason it is not getting recognized.

…

On Fri, Mar 1, 2019 at 11:42 PM Vijay Rajasekaran ***@***.***> wrote: I will check the traineddata. Also, even after using norm_mode as 3 it doesn't split க் as க and ் in the box and all-boxes files. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#52 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AE2_o6A18YDSlFIz9MdgwDKzFyOtgQ2jks5vSW2CgaJpZM4avRaq> .

____________________________________________________________ भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

from tesstrain.

Can't encode transcription: 'क्षक्षाक्षिक्षी क्षु क्षू्क्षे क्षैक्षो क्षौ क्षं क्षः' in language '' about tesstrain HOT 23 CLOSED

Comments (23)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent