accent support (tesseract-ocr issue, closed)

gamjaradio commented on August 15, 2024

accent support

from tesseract-ocr.

Comments (10)

GoogleCodeExporter commented on August 15, 2024

Is there any workaround for this issue? For instance, by using some training facility?

Original comment by [email protected] on 12 Apr 2007 at 9:22

GoogleCodeExporter commented on August 15, 2024

Well... you could probably add your own heuristics for these special cases without training (which does not yet work). For example, see how tess determines whether a "dot" is just noise or actually part of the letter "i":
http://tesseract-ocr.repairfaq.org/makerow_8cpp.html#90ccf46408d4dc726cb6ad4b7abb731d
Entry point might be here:
http://tesseract-ocr.repairfaq.org/makerow_8cpp.html#fdc4c4b87028fd7aafcb679f92364d21

The "Meanwhile" part of your message is due to the permuter: the dictionary included with tess simply does not include words that contain certain letters in the "wrong" places, so it will substitute something that makes "more" sense to it. See:
http://tesseract-ocr.repairfaq.org/allaboutdawg.html

Joke: You can probably include some of the most common words that use é by using a '6' instead of the "é", just so that later you *know* which letter it found ;-)

Cheers,
Fil

Original comment by [email protected] on 16 Apr 2007 at 1:26

GoogleCodeExporter commented on August 15, 2024

To clarify, tesseract will need *formal* language-specific support to work as well for other languages as it does now for English. This is not only because it was trained on English fonts but also because the DAWG only has English words in it. So, right now you should expect problems *recognizing* the non-English letters *and* problems *verifying* that the recognized letters lead to a *valid* English word. There are ways to shut off the permuter (with the config file, I forget the option, sorry) but *trust me*, you do not want to do that :-)

http://tesseract-ocr.repairfaq.org/allaboutdawg.html

Cheers,
Fil

Original comment by [email protected] on 16 Apr 2007 at 1:31

GoogleCodeExporter commented on August 15, 2024

Will be fixed in a future release.

Original comment by [email protected] on 17 May 2007 at 7:26

Changed state: Started

GoogleCodeExporter commented on August 15, 2024

In the hope that this can help you, I am hereby attaching samples written in French, scanned at 600 DPI and cleaned up in the GIMP. The columns could be parsed fine by OCRopus, but since the big problem here is accents, I figured I needed to submit these to tesseract instead of OCRopus.

I don't think the 600dpi sample can be of much use: I have 1 GiB of RAM, and trying to OCR it made me swap to death instantly, whereas OCRing the 300dpi version went fine (using only a few MiB of memory). At the same time, I have to ask: is that huge memory consumption when using the 600dpi sample normal at all?
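
As a rough illustration of the resolution difference described above, the sketch below estimates the raw raster size of one page at several scan resolutions. The page size (A4) and the pixel format (8-bit grayscale) are assumptions, since the thread does not state them, and the numbers say nothing about how much working memory tesseract itself allocates; they only show that doubling the DPI quadruples the pixel count.

```python
# Rough estimate of raw raster size for one scanned page.
# Assumptions (not from the thread): A4 page, 8-bit grayscale (1 byte/pixel).

A4_INCHES = (8.27, 11.69)  # width, height in inches


def page_pixels(dpi, size_inches=A4_INCHES):
    """Number of pixels in a page scanned at the given resolution."""
    width_in, height_in = size_inches
    return round(width_in * dpi) * round(height_in * dpi)


def raw_megabytes(dpi, bytes_per_pixel=1):
    """Uncompressed image size in MiB for the assumed pixel format."""
    return page_pixels(dpi) * bytes_per_pixel / 2**20


for dpi in (150, 300, 600):
    print(f"{dpi} dpi: {page_pixels(dpi):,} pixels, "
          f"~{raw_megabytes(dpi):.1f} MiB raw")
```

Any intermediate buffers an OCR engine keeps per pixel would scale by the same factor of four between 300 and 600 dpi, which may be part of the difference reported here, though the thread does not confirm the cause.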

If you want me to submit those samples to the ocropus project, just ask ;)

Original comment by [email protected] on 30 May 2007 at 3:31

GoogleCodeExporter commented on August 15, 2024

V2.00 will support English, French, German, Italian, Spanish, Dutch.

Original comment by [email protected] on 7 Jul 2007 at 1:29

GoogleCodeExporter commented on August 15, 2024

Is there an estimated time of arrival for 2.0? The roadmap on the homepage is very vague...

Original comment by [email protected] on 7 Jul 2007 at 2:33

GoogleCodeExporter commented on August 15, 2024

I updated the roadmap. It is almost ready. There are still a few issues to check and some inconsistencies to resolve. Look for it next week!

Original comment by [email protected] on 13 Jul 2007 at 2:05

GoogleCodeExporter commented on August 15, 2024

Original comment by [email protected] on 18 Jul 2007 at 10:26

Changed state: Fixed

GoogleCodeExporter commented on August 15, 2024

Sorry, but it is not perfect just yet :) Could this be reopened?

I have tested with my favorite samples, and certain characters still get screwed up.
Namely, in "french.png":
- e is converted to c
- o is converted to 0
- some nn are converted to m, others are converted to 11
- è is converted to é
- « and » are converted to < < and > >
- l is converted to 1
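
A list of confusions like the one above can be tallied automatically when a hand-typed reference transcription is available. The following is a minimal sketch using Python's standard `difflib`; the function name `substitution_counts` and the sample strings are hypothetical, and this is not part of tesseract or its tooling.

```python
from collections import Counter
from difflib import SequenceMatcher


def substitution_counts(truth, ocr):
    """Count (expected, recognized) substitutions between two strings.

    Only spans that difflib reports as 'replace' are counted; pure
    insertions and deletions are ignored in this sketch.
    """
    counts = Counter()
    matcher = SequenceMatcher(None, truth, ocr, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":
            counts[(truth[i1:i2], ocr[j1:j2])] += 1
    return counts


# Hypothetical sample strings, not taken from the attached images.
for (expected, got), n in substitution_counts("élève", "élévc").items():
    print(f"{expected!r} -> {got!r}: {n}")
```

Because the alignment works on spans rather than single characters, a case such as "nn" read as "m" is reported as one substitution instead of two unrelated errors.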

In the previous 300dpi.png sample from comment #5, the accents get screwed up a bit (a lot) more. Interestingly enough, 150dpi.png is parsed slightly better than 300dpi.png.

Original comment by [email protected] on 19 Jul 2007 at 12:14
