Comments (12)
Found the problem: the DangAmbigs file is causing the crash. Without it,
tesseract continues, but it creates an almost empty output.txt file when I issue
"tesseract 056-10.tif output -l slo" (it seems to contain only a few spaces).
But I don't see anything wrong with DangAmbigs. I copied the English version and
deleted a few lines (those containing characters that should not be used in
Slovene).
Original comment by [email protected]
on 22 Jul 2007 at 3:46
Attachments:
from tesseract-ocr.
Have you solved the problem? If so, please explain in detail, for the benefit of
others, the step-by-step procedure you followed to create slo.freq-dawg using
the command line "wordlist2dawg frequent_words_list freq-dawg". It would be nice
if you uploaded copies of the word lists you created for (1) freq-dawg and
(2) word-dawg.
In my case, I could not create freq-dawg for the Kannada language.
Original comment by [email protected]
on 4 Aug 2007 at 5:38
I have also tried to teach tesseract the Slovene language and had the same
problem. I solved it by building *.box files with at least one box for every
letter of the Slovene alphabet (this at-least-one-sample-of-every-letter is
probably also needed to teach tesseract properly), so that the resulting
unicharset list had all the characters of the Slovene language (plus numbers,
other symbols ...). (In my first version of it, and in the previously attached
slo.unicharset file, some of them are missing.)
I think this is still a bug, as it should print some meaningful error message.
For example, at least: "Found a letter not in the unicharset list."
The results are just horrible. I will have to iterate the learning process (use
the current version of the learned Slovene language to read some more pages and
repeat).
I am attaching the 1163700-word word_list and the 50-word frequent_words_list I
got from aspell and Wikipedia:
http://sl.wikipedia.org/wiki/Najpogostej%C5%A1e_slovenske_besede
It took around three hours to compile the word_list dawg file. :-)
Original comment by [email protected]
on 8 Aug 2007 at 9:24
Attachments:
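The check asked for above (report any character that is missing from the
unicharset instead of crashing) can be sketched outside tesseract. This is only
a minimal sketch, assuming UTF-8 input: `split_utf8` and `missing_characters`
are hypothetical helper names, not tesseract functions, and the `known` set
merely stands in for the characters listed in a unicharset file.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Split a UTF-8 string into one std::string per code point.
// Assumes mostly well-formed UTF-8; an unexpected lead byte is emitted as a
// single byte and a truncated tail is emitted as-is.
std::vector<std::string> split_utf8(const std::string& text) {
    std::vector<std::string> out;
    for (std::size_t i = 0; i < text.size();) {
        unsigned char lead = static_cast<unsigned char>(text[i]);
        std::size_t len = 1;
        if ((lead & 0xE0) == 0xC0) len = 2;
        else if ((lead & 0xF0) == 0xE0) len = 3;
        else if ((lead & 0xF8) == 0xF0) len = 4;
        if (i + len > text.size()) len = text.size() - i;
        out.push_back(text.substr(i, len));
        i += len;
    }
    return out;
}

// Return every character used in `words` that is absent from `known`
// (hypothetical stand-in for the contents of a unicharset file).
std::set<std::string> missing_characters(const std::vector<std::string>& words,
                                         const std::set<std::string>& known) {
    std::set<std::string> missing;
    for (const std::string& word : words) {
        for (const std::string& ch : split_utf8(word)) {
            if (known.count(ch) == 0) missing.insert(ch);
        }
    }
    return missing;
}
```

Running something like this over the word lists before training would show
which letters still need a sample in the box files.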
withblessings: there is no step-by-step procedure. Put one word per line and
issue the command you specified :)
mmitar: wow, thank you for sharing that :) ... and yes, unfortunately I already
know that you have to teach it all the letters.
It seems kind of a bad move to have to teach a language from scratch. There are
many languages that share the same letters (all of the latin1 charset except
"x", "y" and "z" is present in surely more than ... 30 languages?), so I see it
as a _great_ disadvantage that every single letter is language-specific. There
should have been a global stash of letters (like the latin1 charset), so that
each additional language could define its own _additional_ letters.
Original comment by [email protected]
on 8 Aug 2007 at 5:57
1. Note added to the TrainingTesseract wiki to confirm that you have to check
the output for errors and fix the box files to make sure there is at least one
sample of each character before continuing.
2. Agreed, it is unfortunate that you have to supply samples of every character.
While it would be possible to take data from existing .tr files and just add a
few new characters, this would lead to a complexity nightmare compared with the
current training process, which you surely agree is complex enough. For one
thing, the risk of unicharset not matching the set of characters in the .tr
files would be massively increased. For another, the complex sort and merge
operation required would be hard for most Windows users to do, as it would
require heavy use of a Unix shell like Cygwin.
3. It seems that most (if not all) of the people currently training tesseract
are using Windows, except at Google, where we are using Linux. That makes it
harder for us to support the training effort, as many useful things that we
could do for one platform would be useless for the other. However, your
suggestion is a good one, and I can see that it would be possible to build a
small app that could do this sorting and merging on Windows. (Something that
looks a bit like Character Map.) Any volunteers to build it?
Original comment by [email protected]
on 17 Aug 2007 at 4:01
- Changed state: Accepted
With reference to "(Something that looks a bit like character map)": it is
available by default in MS Windows versions such as XP, for all world languages;
see the uploaded Character Map.png. As such, a small app has to be created to
enable tesseract to call Character Map from an OS such as XP and select the
language required.
To view all world languages, this has to be enabled in Control Panel ->
"Regional & Lan..."; see the uploaded Regional & language options.png (which is
self-explanatory).
Original comment by [email protected]
on 17 Aug 2007 at 5:17
Attachments:
- [character map world languages part-I.bmp](https://storage.googleapis.com/google-code-attachments/tesseract-ocr/issue-47/comment-6/character map world languages part-I.bmp)
- [character map2.PNG](https://storage.googleapis.com/google-code-attachments/tesseract-ocr/issue-47/comment-6/character map2.PNG)
- [Regional and Language options.PNG](https://storage.googleapis.com/google-code-attachments/tesseract-ocr/issue-47/comment-6/Regional and Language options.PNG)
theraysmith: 3. I don't use Windows, I prefer BSD, so "if not all" is not
likely. :P
Original comment by [email protected]
on 20 Aug 2007 at 10:45
I replaced

    assert(length > 0 && length <= UNICHAR_LEN);
    assert(ids.contains(unichar_repr, length));
    return ids.unichar_to_id(unichar_repr, length);

with

    if ( ids.contains(unichar_repr, length) ) {
      return ids.unichar_to_id(unichar_repr, length);
    }
    else {
      // what a pity.
      return 2;
    }

where 2 is just an arbitrary value. I did not take the time to look at which
value might make more sense. I just assumed that the index "2" exists, and I did
not bother to dig into the details of the inner structures.
I do not care much if a single character is not recognized. There are lots of
others that will not be recognized either when reading Fraktur. But asserting
and dumping core just because the config file has some problems is definitely a
bad idea.
Original comment by [email protected]
on 8 Jan 2008 at 10:43
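The idea in the comment above (fail softly instead of asserting) can be shown
in isolation. This is only a minimal sketch and not tesseract's code: `CharTable`,
`id_or_abort`, `find_id` and `kInvalidId` are hypothetical names. It assumes a
dedicated sentinel is preferable to an arbitrary valid index such as 2, because
the caller can then tell "unknown character" apart from a real id.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Sentinel meaning "character not in the table" (hypothetical).
const int kInvalidId = -1;

// Hypothetical stand-in for a character table that maps a UTF-8
// character representation to a numeric id.
class CharTable {
public:
    // Ids are assigned in insertion order; duplicates are ignored.
    void insert(const std::string& repr) {
        ids_.emplace(repr, static_cast<int>(ids_.size()));
    }

    // Strict lookup: aborts the program on an unknown character,
    // which is the behavior the thread complains about.
    int id_or_abort(const std::string& repr) const {
        auto it = ids_.find(repr);
        assert(it != ids_.end());
        return it->second;
    }

    // Lenient lookup: returns kInvalidId so the caller can report the
    // unknown character and carry on.
    int find_id(const std::string& repr) const {
        auto it = ids_.find(repr);
        return it == ids_.end() ? kInvalidId : it->second;
    }

private:
    std::unordered_map<std::string, int> ids_;
};
```

A caller would compare the result of `find_id` with `kInvalidId` and print a
message such as "Found a letter not in the unicharset list." before skipping
the character.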
I'm attempting to train tesseract for a dictionary digitization project for the
Salishan language Lillooet. I went through the training, reran the OCR on the
training page to make sure there were no mistakes, and found one. I corrected
it, reran all the necessary commands (tesseract ... box.train, mftraining,
cntraining, unicharset_extractor) and tried again. When I did so, I started
getting the above assertion. I added a print statement to figure out where it
dies, and the following is what shows up:
x̌wəmʼ-c-minʼ!'to
wəmʼ-c-minʼ!'to
x̌wəmʼ-c-minʼ!'to
əmʼ-c-minʼ!'to
əmʼ-c-minʼ!'to
x̌ʷəmʼ-c-minʼ!!
̌ʷəmʼ-c-minʼ!!
For some reason, tesseract steps through this string and removes the x without
taking the caron with it. (There does not appear to be an X WITH CARON character
in Unicode, so the combining character is necessary.) However, it doesn't do
this earlier. The caron alone is nowhere in the repertoire and shouldn't be, as
it never appears in isolation. Any idea what the cause of this is? (Let me know
if I should attach files.)
Original comment by [email protected]
on 6 Mar 2008 at 7:11
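The splitting described above is what happens when a string is consumed one
code point at a time: "x̌" is two code points, x (U+0078) followed by COMBINING
CARON (U+030C). The following is only a minimal sketch of keeping a base
character together with its combining marks. It is not how tesseract handles
the string, `decode_utf8` and `split_clusters` are hypothetical helpers, and
only the Combining Diacritical Marks block (U+0300 to U+036F) is treated as
combining, which is a simplification of full Unicode segmentation.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One decoded code point together with its original UTF-8 bytes.
struct CodePoint {
    char32_t value;
    std::string bytes;
};

// Decode a UTF-8 string into code points. Assumes mostly well-formed input.
std::vector<CodePoint> decode_utf8(const std::string& text) {
    std::vector<CodePoint> out;
    for (std::size_t i = 0; i < text.size();) {
        unsigned char lead = static_cast<unsigned char>(text[i]);
        std::size_t len = 1;
        char32_t value = lead;
        if ((lead & 0xE0) == 0xC0) { len = 2; value = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; value = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; value = lead & 0x07; }
        if (i + len > text.size()) len = text.size() - i;
        for (std::size_t k = 1; k < len; ++k) {
            value = (value << 6) |
                    (static_cast<unsigned char>(text[i + k]) & 0x3F);
        }
        out.push_back({value, text.substr(i, len)});
        i += len;
    }
    return out;
}

// Simplified test: only the Combining Diacritical Marks block.
bool is_combining_mark(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Group each base character with the combining marks that follow it.
// A mark at the very start of the string becomes its own cluster.
std::vector<std::string> split_clusters(const std::string& text) {
    std::vector<std::string> out;
    for (const CodePoint& cp : decode_utf8(text)) {
        if (is_combining_mark(cp.value) && !out.empty()) {
            out.back() += cp.bytes;
        } else {
            out.push_back(cp.bytes);
        }
    }
    return out;
}
```

With this grouping, "x̌w" yields the two units "x̌" and "w", so dropping the
first unit never leaves a bare caron at the front of the remaining string.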
I am receiving this error. My box file did not have any "fatalities". It
recognized and identified all characters. The training process seemed to
complete okay, and I copied the resultant 8 files to a brand new language, named
after the font: FiveLineThinFont. When I feed a .txt file in, I get the assert
and core dump.
What am I doing wrong? Is this thread saying that every language must contain a
character for every other language? Doesn't the -l option take care of this?
Original comment by [email protected]
on 21 Jul 2008 at 4:02
Comment 11 - follow on to 10
I am using version 2.01.
Original comment by [email protected]
on 21 Jul 2008 at 4:04
These issues were resolved in 2.03.
Original comment by [email protected]
on 30 Dec 2008 at 9:36
- Changed state: Fixed