
After training tesseract it dies when trying to create text from an image (tesseract-ocr issue, closed)

gamjaradio commented on September 18, 2024

After training tesseract it dies when trying to create text from an image

from tesseract-ocr.

Comments (12)

GoogleCodeExporter commented on September 18, 2024

Found the problem: the DangAmbigs file is causing the crash. Without it,
tesseract continues (but creates an almost empty output.txt file when I issue
"tesseract 056-10.tif output -l slo"; it seems to contain only a few spaces).
But I don't see anything wrong with DangAmbigs. I copied the English version and
deleted a few lines (they contained characters that should not be used in
Slovene).

Original comment by [email protected] on 22 Jul 2007 at 3:46

Attachments:

output.txt

GoogleCodeExporter commented on September 18, 2024

Have you solved the problem? If so, please explain in detail, for the benefit of
others, the step-by-step procedure you followed to create slo.freq-dawg using
the command line "wordlist2dawg frequent_words_list freq-dawg". It would be nice
if you uploaded copies of the word lists you created for (1) freq-dawg and
(2) word-dawg.

In my case, I could not create freq-dawg for the Kannada language.

Original comment by [email protected] on 4 Aug 2007 at 5:38

GoogleCodeExporter commented on September 18, 2024

I have also tried to teach tesseract the Slovene language and had the same
problem. I solved it by building *.box files with at least one box for every
letter known in Slovene (this at-least-one-sample-of-every-letter is probably
also needed to teach tesseract properly), so that the resulting unicharset list
had all the characters of the Slovene language (and numbers, other symbols ...).
(In my first version of it, and in the previously attached slo.unicharset file,
some of them are missing.)

I think this is still a bug, as it should print some meaningful error message.
For example, at least: "Found a letter not in the unicharset list."
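
The check described above (at least one box for every letter before training continues) can be sketched as a small standalone helper. This is only an illustration, not part of Tesseract: the function names `SymbolsInBoxData` and `MissingSymbols` are hypothetical, the list of required letters is something you would supply yourself, and the box line layout `<symbol> <left> <bottom> <right> <top>` is an assumption about the box files used here.

```cpp
#include <istream>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Collect the symbol (first whitespace-delimited field) of every box line.
// Assumed line layout: "<symbol> <left> <bottom> <right> <top>".
std::set<std::string> SymbolsInBoxData(std::istream& box_data) {
  std::set<std::string> symbols;
  std::string line;
  while (std::getline(box_data, line)) {
    std::istringstream fields(line);
    std::string symbol;
    if (fields >> symbol) symbols.insert(symbol);  // blank lines are skipped
  }
  return symbols;
}

// Return every required symbol that has no box, in the order given.
std::vector<std::string> MissingSymbols(
    const std::vector<std::string>& required,
    const std::set<std::string>& present) {
  std::vector<std::string> missing;
  for (const std::string& symbol : required) {
    if (present.count(symbol) == 0) missing.push_back(symbol);
  }
  return missing;
}
```

Opening a box file with `std::ifstream`, passing it to `SymbolsInBoxData`, and printing whatever `MissingSymbols` returns would give the kind of explicit "letter not covered" report asked for here, before any training step is run.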

The results are just horrible. I will have to iterate the learning process (use
the current version of the learned Slovene language to read some more pages and
repeat).

I am attaching the 1163700-word word_list and the 50-word frequent_words_list I
got from aspell and Wikipedia:

http://sl.wikipedia.org/wiki/Najpogostej%C5%A1e_slovenske_besede

It took around three hours to compile word_list dawg file. :-)

Original comment by [email protected] on 8 Aug 2007 at 9:24

Attachments:

slv-words.tgz

GoogleCodeExporter commented on September 18, 2024

withblessings: there is no step-by-step procedure; put one word per line and
issue the command you specified :)

mmitar: wow, thank you for sharing that :) ... and yes, that you have to teach
it all the letters I unfortunately already know.

It seems kind of a bad move to have to teach a language from scratch. There are
many languages that share the same letters (all of the latin1 charset except
"x", "y" and "z" is present in surely more than ... 30 languages?) so I see it as
a _great_ disadvantage that every single letter is language specific. There
should've been a global stash of letters (like the latin1 charset), and then
each additional language could define its own _additional_ letters.

Original comment by [email protected] on 8 Aug 2007 at 5:57

GoogleCodeExporter commented on September 18, 2024

1. Note added to the TrainingTesseract wiki to confirm that you have to check
the output for errors and fix the box files to make sure there is at least one
sample of each character before continuing.

2. Agreed, it is unfortunate that you have to supply samples of every character.
While it would be possible to take data from existing .tr files and just add a
few new characters, this would lead to a complexity nightmare compared with the
current training process, which you surely agree is complex enough. For one
thing, the risk of the unicharset not matching the set of characters in the .tr
files would be massively increased. For another, the complex sort and merge
operation required would be hard for most Windows users to do, as it would
require heavy use of a Unix shell like Cygwin.

3. It seems that most (if not all) of the people currently training tesseract
are using Windows, except at Google, where we are using Linux. That makes it
harder for us to support the training effort, as many useful things that we
could do for one platform would be useless for the other. However, your
suggestion is a good one, and I can see that it would be possible to build a
small app that could do this sorting and merging on Windows. (Something that
looks a bit like character map) Any volunteers to build it?

Original comment by [email protected] on 17 Aug 2007 at 4:01

Changed state: Accepted

GoogleCodeExporter commented on September 18, 2024

With reference to "(Something that looks a bit like character map)": Character
Map is available by default in MS Windows versions such as XP, for all world
languages; see the uploaded Character Map.png. So a small app would have to be
created to let tesseract call Character Map from an OS like XP and select the
required language.
To view all world languages, they have to be enabled in Control Panel ->
"Regional & Lan..."; see the uploaded Regional & language options.png (which is
self-explanatory).

Original comment by [email protected] on 17 Aug 2007 at 5:17

Attachments:

[character map world languages part-I.bmp](https://storage.googleapis.com/google-code-attachments/tesseract-ocr/issue-47/comment-6/character map world languages part-I.bmp)
[character map2.PNG](https://storage.googleapis.com/google-code-attachments/tesseract-ocr/issue-47/comment-6/character map2.PNG)
[Regional and Language options.PNG](https://storage.googleapis.com/google-code-attachments/tesseract-ocr/issue-47/comment-6/Regional and Language options.PNG)

GoogleCodeExporter commented on September 18, 2024

theraysmith: 3. I don't use Windows, I prefer BSD, so "if not all" is not
likely. :P

Original comment by [email protected] on 20 Aug 2007 at 10:45

GoogleCodeExporter commented on September 18, 2024

I replaced
  assert(length > 0 && length <= UNICHAR_LEN);
  assert(ids.contains(unichar_repr, length));
  return ids.unichar_to_id(unichar_repr, length);

with
  if ( ids.contains(unichar_repr, length) ) {
    return ids.unichar_to_id(unichar_repr, length);
  }
  else { 
    // what a pity.
    return 2;
  }

where 2 is just an arbitrary value. I did not take the time to look at which
value might make more sense; I just assumed that the index "2" exists, and I did
not bother to dig into the details of the inner structures.

I do not care much if a single character is not recognized. There are lots of
others that will not be recognized either when reading fraktur. But asserting
and dumping core just because the config file has some problems is definitely a
bad idea.
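
As an alternative to returning an arbitrary index such as 2, a lookup can report "not found" explicitly and leave the decision to the caller. The sketch below is a self-contained illustration only: `CharTable`, `Add` and `Find` are hypothetical names and are not Tesseract's UNICHARSET interface, and the code assumes a C++17 compiler for `std::optional`.

```cpp
#include <optional>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for a character table; not Tesseract's UNICHARSET.
class CharTable {
 public:
  // Registers a symbol and returns its id; re-adding returns the existing id.
  int Add(const std::string& symbol) {
    auto result = ids_.emplace(symbol, static_cast<int>(ids_.size()));
    return result.first->second;
  }

  // Returns the id, or std::nullopt when the symbol was never registered,
  // so the caller can log a readable error or skip the word instead of
  // aborting or silently mapping to an unrelated character.
  std::optional<int> Find(const std::string& symbol) const {
    auto it = ids_.find(symbol);
    if (it == ids_.end()) return std::nullopt;
    return it->second;
  }

 private:
  std::unordered_map<std::string, int> ids_;
};
```

A caller would test the result of `Find` and, when it is empty, print the offending symbol; that keeps a bad data file from dumping core while also avoiding a wrong character being substituted without notice.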

Original comment by [email protected] on 8 Jan 2008 at 10:43

GoogleCodeExporter commented on September 18, 2024

I'm attempting to train tesseract to work on a dictionary digitization project
for the Salishan language Lillooet. I went through the training, reran the OCR
on the training page to make sure there were no mistakes, and found one. I
corrected it, reran all the necessary commands (tesseract ... box.train,
mftraining, cntraining, unicharset_extractor) and tried again. When I did so, I
started getting the above assertion. I added a print statement to figure out
where it dies, and the following is what shows up:
x̌wəmʼ-c-minʼ!'to
wəmʼ-c-minʼ!'to
x̌wəmʼ-c-minʼ!'to
əmʼ-c-minʼ!'to
əmʼ-c-minʼ!'to
x̌ʷəmʼ-c-minʼ!!
̌ʷəmʼ-c-minʼ!!
For some reason, tesseract is stepping through this string and removing the x
without bringing the caron with it. (There does not appear to be an X WITH CARON
character in Unicode, so the combining character is necessary.) However, it
doesn't do this earlier. The caron alone is nowhere in the repertoire and
shouldn't be, as it never appears in isolation. Any idea what the cause of this
is? (Let me know if I should attach files.)
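
To see which units a string breaks into when a combining mark is kept with its base letter, a splitter along the following lines may help when checking training text against the repertoire. It is a rough sketch and does not reflect how Tesseract itself walks a string: the function names are hypothetical, the input is assumed to be valid UTF-8, and only the Combining Diacritical Marks block (U+0300 to U+036F, which includes the combining caron U+030C) is treated as combining.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Byte length of the UTF-8 sequence that starts with `lead` (1 for ASCII and
// for stray continuation bytes, so the scan always advances).
std::size_t Utf8Length(unsigned char lead) {
  if (lead >= 0xF0) return 4;
  if (lead >= 0xE0) return 3;
  if (lead >= 0xC0) return 2;
  return 1;
}

// True if the bytes at text[pos] encode U+0300..U+036F, i.e. CC 80 .. CD AF.
bool IsCombiningMark(const std::string& text, std::size_t pos) {
  if (pos + 1 >= text.size()) return false;
  const unsigned char b0 = static_cast<unsigned char>(text[pos]);
  const unsigned char b1 = static_cast<unsigned char>(text[pos + 1]);
  return (b0 == 0xCC && b1 >= 0x80 && b1 <= 0xBF) ||
         (b0 == 0xCD && b1 >= 0x80 && b1 <= 0xAF);
}

// Split text so that each unit is one character plus any combining marks
// that immediately follow it.
std::vector<std::string> SplitIntoUnits(const std::string& text) {
  std::vector<std::string> units;
  std::size_t pos = 0;
  while (pos < text.size()) {
    std::size_t end = pos + Utf8Length(static_cast<unsigned char>(text[pos]));
    if (end > text.size()) end = text.size();  // truncated final sequence
    while (IsCombiningMark(text, end)) end += 2;  // marks here are 2 bytes
    units.push_back(text.substr(pos, end - pos));
    pos = end;
  }
  return units;
}
```

With this grouping, an x followed by a combining caron comes out as a single unit, so a bare caron should never be looked up on its own. Modifier letters such as ʷ and ʼ are not in the combining block and remain separate units in this sketch.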

Original comment by [email protected] on 6 Mar 2008 at 7:11

GoogleCodeExporter commented on September 18, 2024

I am receiving this error. My box file did not have any "fatalities". It
recognized and identified all characters. The training process seemed to
complete okay, and I copied the resultant 8 files to a brand new language, named
by the font name FiveLineThinFont. When I feed a .txt file in, I get the assert
and core dump. What am I doing wrong? Is this thread saying that every language
must contain a character for every other language? Doesn't the -l option take
care of this?

Original comment by [email protected] on 21 Jul 2008 at 4:02

GoogleCodeExporter commented on September 18, 2024

Comment 11 - follow on to 10
I am using version 2.01.

Original comment by [email protected] on 21 Jul 2008 at 4:04

GoogleCodeExporter commented on September 18, 2024

These issues were resolved in 2.03.

Original comment by [email protected] on 30 Dec 2008 at 9:36

Changed state: Fixed
