Comments (12)

GoogleCodeExporter avatar GoogleCodeExporter commented on September 18, 2024
Found the problem: the DangAmbigs file is causing the crash. Without it,
tesseract continues, but creates an almost empty output.txt file when I issue
"tesseract 056-10.tif output -l slo" (it seems to contain only a few spaces).
But I don't see anything wrong with DangAmbigs. I copied the English version
and deleted a few lines (those containing characters that should not be used
in Slovene).

Original comment by [email protected] on 22 Jul 2007 at 3:46

Attachments:

from tesseract-ocr.

GoogleCodeExporter avatar GoogleCodeExporter commented on September 18, 2024
Have you solved the problem? If so, please explain in detail, for the benefit
of others, the step-by-step procedure you followed to create slo.freq-dawg
using the command line "wordlist2dawg frequent_words_list freq-dawg". It would
be nice if you uploaded copies of the word lists you created for (1) freq-dawg
and (2) word-dawg.

In my case, I could not create freq-dawg for the Kannada language.

Original comment by [email protected] on 4 Aug 2007 at 5:38

GoogleCodeExporter avatar GoogleCodeExporter commented on September 18, 2024
I have also tried to teach tesseract the Slovene language and had the same
problem. I solved it by building *.box files with at least one box for every
letter used in Slovene (at least one sample of every letter is probably also
needed to train tesseract properly), so that the resulting unicharset list had
all the characters of the Slovene language (plus numbers, other symbols ...).
(In my first version, and in the previously attached slo.unicharset file, some
of them are missing.)
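
The "at least one box for every letter" check can be automated before training.
The following is only a sketch: it assumes each box-file line starts with the
character followed by whitespace-separated coordinates, and the function names
and sample data are hypothetical, not part of tesseract.

```python
def box_chars(box_text):
    """Return the set of characters labelled in box-file text.

    Assumption: each non-empty line starts with the character, followed by
    whitespace-separated coordinates.
    """
    chars = set()
    for line in box_text.splitlines():
        fields = line.split()
        if fields:
            chars.add(fields[0])
    return chars


def missing_samples(alphabet, box_text):
    """Return the characters of `alphabet` that have no box, sorted."""
    return sorted(set(alphabet) - box_chars(box_text))


# Hypothetical sample data: a tiny box file and part of the Slovene alphabet.
sample_box = "a 10 20 30 40\nč 35 20 55 40\nb 60 20 80 40\n"
for ch in missing_samples("abcčsšzž", sample_box):
    print(f"No sample for {ch!r} in the box file")
```

Any character it reports needs at least one box added before the training
commands are rerun.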

I think this is still a bug, as it should print some meaningful error message.
For example, at least: "Found a letter not in the unicharset list."

The results are just horrible. I will have to iterate the learning process (use
the current version of the learned Slovene language to read some more pages and
repeat).

I am attaching the 1163700-word word_list and the 50-word frequent_words_list
I got from aspell and Wikipedia:

http://sl.wikipedia.org/wiki/Najpogostej%C5%A1e_slovenske_besede

It took around three hours to compile word_list dawg file. :-)

Original comment by [email protected] on 8 Aug 2007 at 9:24

Attachments:

GoogleCodeExporter avatar GoogleCodeExporter commented on September 18, 2024
withblessings: there is no step-by-step procedure. Put one word per line and
issue the command you specified :)
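
For anyone preparing such a list, here is a rough sketch of getting to one
word per line. Dropping words that contain characters outside the trained set
is my own assumption (prompted by the unicharset problems in this thread), and
`clean_word_list` is a hypothetical helper, not a tesseract tool.

```python
def clean_word_list(raw_words, allowed):
    """Return unique words, in first-seen order, made only of allowed chars."""
    seen = set()
    result = []
    for word in raw_words:
        word = word.strip()
        if not word or word in seen:
            continue  # skip blanks and duplicates
        if all(ch in allowed for ch in word):
            seen.add(word)
            result.append(word)
    return result


# Hypothetical input; "taxi" is dropped because "x" is not in the allowed set.
allowed = set("abcčdefghijklmnoprsštuvzž")
words = clean_word_list(["in", "je", " da ", "in", "taxi", ""], allowed)
print("\n".join(words))  # one word per line, ready to save to a file
```

The saved file is then what gets passed to wordlist2dawg as in the command
quoted above.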

mmitar: wow, thank you for sharing that :) ... and yes, I unfortunately already
know that you have to teach it all the letters.

It seems like a bad move to have to teach a language from scratch. Many
languages share the same letters (all of the latin1 charset except "x", "y" and
"z" is surely present in more than ... 30 languages?), so I see it as a _great_
disadvantage that every single letter is language specific. There should have
been a global stash of letters (like the latin1 charset), with each additional
language defining its own _additional_ letters.

Original comment by [email protected] on 8 Aug 2007 at 5:57

GoogleCodeExporter avatar GoogleCodeExporter commented on September 18, 2024
1. A note has been added to the TrainingTesseract wiki to confirm that you have
to check the output for errors and fix the box files, to make sure there is at
least one sample of each character before continuing.

2. Agreed, it is unfortunate that you have to supply samples of every
character. While it would be possible to take data from existing .tr files and
just add a few new characters, this would lead to a complexity nightmare
compared with the current training process, which you will surely agree is
complex enough. For one thing, the risk of unicharset not matching the set of
characters in the .tr files would be massively increased. For another, the
complex sort and merge operation required would be hard for most Windows users
to do, as it would require heavy use of a Unix shell like Cygwin.

3. It seems that most (if not all) of the people currently training tesseract
are using Windows, except at Google, where we are using Linux. That makes it
harder for us to support the training effort, as many useful things that we
could do for one platform would be useless for the other. However, your
suggestion is a good one, and I can see that it would be possible to build a
small app that could do this sorting and merging on Windows. (Something that
looks a bit like Character Map.) Any volunteers to build it?

Original comment by [email protected] on 17 Aug 2007 at 4:01

  • Changed state: Accepted

GoogleCodeExporter avatar GoogleCodeExporter commented on September 18, 2024
With reference to "(Something that looks a bit like character map)": Character
Map is available by default in MS Windows versions such as XP, for all world
languages (see the uploaded character Map.png). As such, a small app has to be
created to enable tesseract to call Character Map from an OS like XP and select
the required language.
To view all world languages, they have to be enabled in Control Panel ->
"Regional & Lan..." (see the uploaded Regional & language options.png, which is
self-explanatory).

Original comment by [email protected] on 17 Aug 2007 at 5:17

Attachments:

GoogleCodeExporter avatar GoogleCodeExporter commented on September 18, 2024
theraysmith: 3. I don't use Windows, I prefer BSD, so "if not all" is not
likely. :P

Original comment by [email protected] on 20 Aug 2007 at 10:45

GoogleCodeExporter avatar GoogleCodeExporter commented on September 18, 2024
I replaced
  assert(length > 0 && length <= UNICHAR_LEN);
  assert(ids.contains(unichar_repr, length));
  return ids.unichar_to_id(unichar_repr, length);

with
  if ( ids.contains(unichar_repr, length) ) {
    return ids.unichar_to_id(unichar_repr, length);
  }
  else { 
    // what a pity.
    return 2;
  }

where 2 is just an arbitrary value. I did not take the time to work out which
value might make more sense; I just assumed that index "2" exists and did not
bother to dig into the details of the inner structures.

I do not care much if a single character is not recognized. There are lots of
others that will not be recognized either when reading Fraktur. But asserting
and dumping core just because the config file has some problems is definitely a
bad idea.
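
A variation on the same idea would also report the offending character, which
is the kind of message an earlier comment asked for. This is only an
illustration in Python, not tesseract's C++ code; the table, the ids and the
function signature are hypothetical.

```python
import warnings


def unichar_to_id(table, unichar, fallback_id):
    """Look up `unichar`; warn and return `fallback_id` if it is unknown."""
    if unichar in table:
        return table[unichar]
    # Report the problem instead of asserting and dumping core.
    warnings.warn(f"character {unichar!r} is not in the unicharset")
    return fallback_id


table = {"a": 3, "b": 4}  # hypothetical character ids
print(unichar_to_id(table, "a", fallback_id=2))
```

Unlike the patch above, the fallback id is passed in explicitly, so the caller
has to choose a value that is known to exist.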

Original comment by [email protected] on 8 Jan 2008 at 10:43

GoogleCodeExporter avatar GoogleCodeExporter commented on September 18, 2024
I'm attempting to train tesseract to work on a dictionary digitization project
for the Salishan language Lillooet. I went through the training, reran the OCR
on the training page to make sure there were no mistakes, and found one. I
corrected it, reran all the necessary commands (tesseract ... box.train,
mftraining, cntraining, unicharset_extractor) and tried again. When I did so, I
started getting the above assertion. I added a print statement to figure out
where it dies, and the following is what shows up:
x̌wəmʼ-c-minʼ!'to
wəmʼ-c-minʼ!'to
x̌wəmʼ-c-minʼ!'to
əmʼ-c-minʼ!'to
əmʼ-c-minʼ!'to
x̌ʷəmʼ-c-minʼ!!
̌ʷəmʼ-c-minʼ!!
For some reason, tesseract is stepping through this string and removing the x
without taking the caron with it. (There does not appear to be an X WITH CARON
character in Unicode, so the combining character is necessary.) However, it
doesn't do this earlier. The caron alone is nowhere in the repertoire and
shouldn't be, as it never appears in isolation. Any idea what the cause of this
is? (Let me know if I should attach files.)
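
To illustrate the point about the combining caron: the sketch below uses only
Python's standard unicodedata module and says nothing about how tesseract
itself walks the string. The `clusters` helper is hypothetical and is a
simplification of full grapheme-cluster segmentation.

```python
import unicodedata


def clusters(text):
    """Group each base character with the combining marks that follow it."""
    result = []
    for ch in text:
        if result and unicodedata.combining(ch):
            result[-1] += ch  # attach the mark to the preceding base character
        else:
            result.append(ch)
    return result


word = "x\u030cw\u0259m"  # x + COMBINING CARON, w, schwa, m

# NFC cannot compose x + caron into one code point, so two code points remain.
print(len(unicodedata.normalize("NFC", "x\u030c")))

# Stepping by cluster keeps the caron with its x; stepping by code point
# would leave the bare caron at the front after removing the x.
print(clusters(word))
```

Stepping one code point at a time would produce exactly the kind of remainder
shown in the last line of the output above, starting with a bare caron.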

Original comment by [email protected] on 6 Mar 2008 at 7:11

GoogleCodeExporter avatar GoogleCodeExporter commented on September 18, 2024
I am receiving this error. My box file did not have any "fatalities": it
recognized and identified all characters. The training process seemed to
complete okay, and I copied the resultant 8 files to a brand new language,
named after the font, FiveLineThinFont. When I feed a .txt file in, I get the
assert and core dump. What am I doing wrong? Is this thread saying that every
language must contain a character for every other language? Doesn't the -l
option take care of this?

Original comment by [email protected] on 21 Jul 2008 at 4:02

GoogleCodeExporter avatar GoogleCodeExporter commented on September 18, 2024
Comment 11 - follow on to 10
I am using version 2.01.

Original comment by [email protected] on 21 Jul 2008 at 4:04

GoogleCodeExporter avatar GoogleCodeExporter commented on September 18, 2024
These issues were resolved in 2.03.

Original comment by [email protected] on 30 Dec 2008 at 9:36

  • Changed state: Fixed
