Comments (10)
Is there any workaround for this issue? For instance, by using some training
facility?
Original comment by [email protected]
on 12 Apr 2007 at 9:22
from tesseract-ocr.
Well... you could probably add your own heuristics for these special cases
without training (which does not yet work). For example, see how tess
determines whether a "dot" is just noise or actually part of the letter "i":
http://tesseract-ocr.repairfaq.org/makerow_8cpp.html#90ccf46408d4dc726cb6ad4b7abb731d
Entry point might be here:
http://tesseract-ocr.repairfaq.org/makerow_8cpp.html#fdc4c4b87028fd7aafcb679f92364d21
The "Meanwhile" part of your message is due to the permuter - the dictionary
included with tess simply does not include words that contain certain letters
in the "wrong" places, so it will substitute something that makes "more" sense
to it. See:
http://tesseract-ocr.repairfaq.org/allaboutdawg.html
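To make the substitution effect concrete, here is a toy sketch in Python. It is not how the permuter in tess is implemented (that works on the DAWG described at the link above). The word list, the function name `snap_to_dictionary` and the similarity cutoff are all assumptions made up for illustration. It only shows why a word containing "é" gets pulled toward an accent-free entry when the dictionary holds nothing else.

```python
import difflib

# Hypothetical English-only word list standing in for the bundled dictionary.
ENGLISH_WORDS = ["cafe", "resume", "the", "cat", "eleven"]


def snap_to_dictionary(word, words=ENGLISH_WORDS, cutoff=0.6):
    """Return `word` if it is a dictionary entry, otherwise the most similar
    entry; if nothing is similar enough, return `word` unchanged."""
    if word in words:
        return word
    # difflib ranks candidates by a character-level similarity ratio.
    matches = difflib.get_close_matches(word, words, n=1, cutoff=cutoff)
    return matches[0] if matches else word


if __name__ == "__main__":
    for candidate in ["café", "cat", "été"]:
        print(candidate, "->", snap_to_dictionary(candidate))
```

With an English-only list, "café" is close enough to "cafe" to be replaced, while "été" has no near neighbour and passes through untouched.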
Joke: You can probably include some of the most common words that use é by
using a '6' instead of the "é" just so that later you *know* which letter it
found ;-)
Cheers,
Fil
Original comment by [email protected]
on 16 Apr 2007 at 1:26
To clarify, tesseract will need *formal* language-specific support to work as
well for other languages as it does now for English. This is not only because
it was trained for English fonts but also because the DAWG only has English
words in it. So, right now you have expected problems *recognizing* the
non-English letters *and* expected problems *verifying* that the recognized
letters lead to a *valid* English word. There are ways to shut off the
permuter (with the config file, I forget the option, sorry) but *trust me* you
do not want to do that :-)
http://tesseract-ocr.repairfaq.org/allaboutdawg.html
Cheers,
Fil
Original comment by [email protected]
on 16 Apr 2007 at 1:31
Will be fixed in a future release.
Original comment by [email protected]
on 17 May 2007 at 7:26
- Changed state: Started
In the hope that this can help you, I am hereby attaching samples written in
French, scanned at 600 DPI and cleaned up in the GIMP. The columns could be
parsed fine by OCROpus, but since the big problem here is accents, I figured I
needed to submit these to tesseract instead of OCROpus.
I don't think the 600dpi sample can be of much use: I have 1 GiB of RAM and
trying to OCR it would make me swap to death instantly, whereas OCRing the
300dpi version went fine (using only a few MiB of memory). At the same time, I
have to ask: is that huge memory consumption when using the 600dpi sample
normal at all?
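For a rough sense of scale, the sketch below computes the raw raster size of a scan at both resolutions. The page dimensions (US Letter, 8.5 x 11 in) and the one byte per pixel are assumptions, since the thread does not state the size or bit depth of the samples, and the helper name `raster_size` is made up. It only covers the uncompressed image itself, not whatever working memory tesseract allocates on top of it, so it does not by itself explain the swapping described above.

```python
def raster_size(width_in, height_in, dpi, bytes_per_pixel=1):
    """Return (width_px, height_px, total_bytes) for an uncompressed scan.

    width_in / height_in are the page dimensions in inches (assumed values,
    not taken from the thread); bytes_per_pixel=1 assumes 8-bit grayscale.
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px, height_px, width_px * height_px * bytes_per_pixel


if __name__ == "__main__":
    for dpi in (150, 300, 600):
        w, h, size = raster_size(8.5, 11, dpi)
        print(f"{dpi} dpi: {w} x {h} px, {size / 2**20:.1f} MiB raw")
```

Doubling the resolution quadruples the pixel count, so any per-pixel cost inside the OCR engine grows by at least that factor between the 300dpi and 600dpi samples.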
If you want me to submit those samples to the ocropus project, just ask ;)
Original comment by [email protected]
on 30 May 2007 at 3:31
Attachments:
V2.00 will support English, French, German, Italian, Spanish, Dutch.
Original comment by [email protected]
on 7 Jul 2007 at 1:29
Is there an estimated time of arrival for 2.0? The roadmap on the homepage is
very vague...
Original comment by [email protected]
on 7 Jul 2007 at 2:33
I updated the roadmap. It is almost ready. There are still a few issues to
check and some inconsistencies to resolve. Look for it next week!
Original comment by [email protected]
on 13 Jul 2007 at 2:05
Original comment by [email protected]
on 18 Jul 2007 at 10:26
- Changed state: Fixed
Sorry, but it is not perfect just yet :) Could this be reopened?
I have tested with my favorite samples, and certain characters still get
misrecognized. Namely, in "french.png":
- e is converted to c
- o is converted to 0
- some nn are converted to m, others are converted to 11
- è is converted to é
- « and » are converted to < < and > >
- l is converted to 1
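As a stopgap for some of the confusions listed above, a post-processing pass over the OCR output can undo the ones that are unambiguous from context. The following is only a sketch under assumptions: the function names are made up, it is not part of tesseract, and it handles just the digit-for-letter cases (0 for o, 1 for l) and the split guillemets. The "11" for "nn" case is indistinguishable from "ll" at this level and will come out as "ll". The e/c, nn/m and è/é confusions need a dictionary or a better-trained classifier and are not touched here.

```python
import re

# Digit-for-letter confusions from the report; only applied next to a letter.
DIGIT_TO_LETTER = {"0": "o", "1": "l"}

# A 0 or 1 directly preceded or followed by a letter ([^\W\d_] matches any
# Unicode letter, including accented ones).
_DIGIT_IN_WORD = re.compile(r"(?<=[^\W\d_])[01]|[01](?=[^\W\d_])")


def fix_guillemets(text):
    """Rejoin guillemets that were emitted as pairs of angle brackets."""
    return text.replace("< <", "«").replace("> >", "»")


def fix_digits_in_words(text):
    """Turn 0/1 into o/l when they touch a letter; plain numbers are kept."""
    return _DIGIT_IN_WORD.sub(lambda m: DIGIT_TO_LETTER[m.group(0)], text)


def postprocess(text):
    return fix_digits_in_words(fix_guillemets(text))


if __name__ == "__main__":
    print(postprocess("< < b0nj0ur > > 1e 17 mai 2007"))
```

Restricting the digit rule to positions adjacent to a letter keeps dates and other standalone numbers intact, at the cost of missing a lone "0" or "1" that was meant as a one-letter word.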
In the previous 300dpi.png sample from comment #5, the accents screw up a bit
(a lot) more. Interestingly enough, 150dpi.png is slightly better parsed than
300dpi.png.
Original comment by [email protected]
on 19 Jul 2007 at 12:14
Attachments: