Comments (10)

GoogleCodeExporter avatar GoogleCodeExporter commented on August 15, 2024
Is there any workaround for this issue? For instance, by using some training
facility?

Original comment by [email protected] on 12 Apr 2007 at 9:22

from tesseract-ocr.

GoogleCodeExporter avatar GoogleCodeExporter commented on August 15, 2024
Well... you could probably add your own heuristics for these special cases
without training (which does not yet work). For example, see how tess
determines whether a "dot" is just noise or actually part of the letter "i":
http://tesseract-ocr.repairfaq.org/makerow_8cpp.html#90ccf46408d4dc726cb6ad4b7abb731d
Entry point might be here:
http://tesseract-ocr.repairfaq.org/makerow_8cpp.html#fdc4c4b87028fd7aafcb679f92364d21

The "Meanwhile" part of your message is due to the permuter - the dictionary
included with tess simply does not include words that contain certain letters
in the "wrong" places, so it will substitute something that makes "more" sense
to it. See:
http://tesseract-ocr.repairfaq.org/allaboutdawg.html

Joke: You can probably include some of the most common words that use é by
using a '6' instead of the "é", just so that later you *know* which letter it
found ;-)

Cheers,
Fil

Original comment by [email protected] on 16 Apr 2007 at 1:26

GoogleCodeExporter avatar GoogleCodeExporter commented on August 15, 2024
To clarify, tesseract will need *formal* language-specific support to work as
well for other languages as it does now for English. This is not only because
it was trained for English fonts but also because the DAWG only has English
words in it. So, right now you have expected problems *recognizing* the
non-English letters *and* expected problems *verifying* that the recognized
letters lead to a *valid* English word. There are ways to shut off the permuter
(with the config file, I forget the option, sorry) but *trust me* you do not
want to do that :-)

http://tesseract-ocr.repairfaq.org/allaboutdawg.html
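
To make the "verifying" half concrete, here is a toy sketch of how a word
list can override the best-looking reading. It is not Tesseract's permuter
and does not use the DAWG format; the word list, the scores, and the function
name `pick_reading` are all made up for illustration.

```python
# Toy illustration only: this is NOT Tesseract's permuter or its DAWG format.
ENGLISH_WORDS = {"cafe", "resume", "the"}  # hypothetical, tiny word list


def pick_reading(candidates, dictionary):
    """Return the best-scoring candidate that is a dictionary word.

    `candidates` is a list of (text, score) pairs, where a higher score
    means a better match to the pixels. If no candidate is in the
    dictionary, fall back to the top-scoring raw reading.
    """
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    for text, _score in ranked:
        if text.lower() in dictionary:
            return text
    return ranked[0][0]


# The classifier's best guess is "café", but only "cafe" is in the word list,
# so the accented reading loses to the unaccented one.
candidates = [("café", 0.9), ("cafe", 0.7), ("caf6", 0.4)]
print(pick_reading(candidates, ENGLISH_WORDS))
```

With an English-only word list the accented reading is dropped even though it
scored best. Adding "café" to the set lets it through, which is roughly why a
per-language dictionary matters here.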

Cheers,
Fil

Original comment by [email protected] on 16 Apr 2007 at 1:31

GoogleCodeExporter avatar GoogleCodeExporter commented on August 15, 2024
Will be fixed in a future release.

Original comment by [email protected] on 17 May 2007 at 7:26

  • Changed state: Started

GoogleCodeExporter avatar GoogleCodeExporter commented on August 15, 2024
In the hope that this can help you, I am hereby attaching samples written in
French, scanned at 600 DPI and cleaned up in the GIMP. The columns could be
parsed fine by OCROpus, but since the big problem here is accents, I figured I
needed to submit these to tesseract instead of OCROpus.

I don't think the 600dpi sample can be of much use: I have 1GiB of RAM and
trying to OCR it would make me swap to death instantly, whereas OCRing the
300dpi version went fine (using only a few MiBs of memory). At the same time,
I have to ask, is that huge memory consumption when using the 600dpi sample
normal at all?
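
For scale, a rough back-of-the-envelope sketch: the raw pixel count grows with
the square of the resolution, so a 600dpi scan has four times the pixels of a
300dpi scan of the same page. The A4 page size below is an assumption (the
actual sample dimensions are not stated here), and this says nothing about how
much working memory tesseract itself needs on top of the image.

```python
def raster_megapixels(width_in, height_in, dpi):
    """Pixel count, in millions, of a page scanned at the given resolution."""
    return (width_in * dpi) * (height_in * dpi) / 1e6


# Assumption: an A4 page (8.27 x 11.69 inches); the real samples may differ.
A4 = (8.27, 11.69)

for dpi in (150, 300, 600):
    mp = raster_megapixels(*A4, dpi)
    # At one byte per pixel (8-bit grayscale), megapixels is roughly megabytes.
    print(f"{dpi} dpi: {mp:.1f} megapixels")

ratio = raster_megapixels(*A4, 600) / raster_megapixels(*A4, 300)
print(f"600 dpi vs 300 dpi: {ratio:.1f}x the pixels")
```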

If you want me to submit those samples to the ocropus project, just ask ;)

Original comment by [email protected] on 30 May 2007 at 3:31

Attachments:

GoogleCodeExporter avatar GoogleCodeExporter commented on August 15, 2024
V2.00 will support English, French, German, Italian, Spanish, Dutch.

Original comment by [email protected] on 7 Jul 2007 at 1:29

GoogleCodeExporter avatar GoogleCodeExporter commented on August 15, 2024
Is there an estimated time of arrival for 2.0? The roadmap on the homepage is
very vague...

Original comment by [email protected] on 7 Jul 2007 at 2:33

GoogleCodeExporter avatar GoogleCodeExporter commented on August 15, 2024
I updated the roadmap. It is almost ready. There are still a few issues to
check and some inconsistencies to resolve. Look for it next week!

Original comment by [email protected] on 13 Jul 2007 at 2:05

GoogleCodeExporter avatar GoogleCodeExporter commented on August 15, 2024

Original comment by [email protected] on 18 Jul 2007 at 10:26

  • Changed state: Fixed

GoogleCodeExporter avatar GoogleCodeExporter commented on August 15, 2024
Sorry, but it is not perfect just yet :) Could this be reopened?

I have tested with my favorite samples, and certain characters still get
screwed up. Namely, in "french.png":
- e is converted to c
- o is converted to 0
- some nn are converted to m, others are converted to 11
- è is converted to é
- « and » are converted to < < and > >
- l is converted to 1

In the previous 300dpi.png sample from comment #5, the accents screw up a bit
(a lot) more. Interestingly enough, 150dpi.png is slightly better parsed than
300dpi.png.
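
As a stopgap for the digit-for-letter and guillemet errors in the list above,
the output could be cleaned up after the fact. This is only a sketch of a
heuristic post-processing pass, not anything tesseract provides; the names
`DIGIT_TO_LETTER`, `fix_token` and `postprocess` are made up, and the "mostly
letters" rule is an assumption.

```python
import re

# Hypothetical confusion table, taken from the errors listed above.
DIGIT_TO_LETTER = {"0": "o", "1": "l"}


def fix_token(token):
    """Swap confusable digits for letters in tokens that are mostly letters."""
    letters = sum(ch.isalpha() for ch in token)
    digits = sum(ch.isdigit() for ch in token)
    # Leave real numbers, and tokens not dominated by letters, untouched.
    if digits == 0 or letters <= digits:
        return token
    return "".join(DIGIT_TO_LETTER.get(ch, ch) for ch in token)


def postprocess(text):
    # Rejoin guillemets that came out as pairs of angle brackets.
    text = re.sub(r"<\s?<", "«", text)
    text = re.sub(r">\s?>", "»", text)
    # In Python 3, \w matches accented letters as well as digits.
    return re.sub(r"\w+", lambda m: fix_token(m.group(0)), text)


print(postprocess("< < B0nj0ur > > en 2007"))
```

This cannot tell whether "11" was meant to be "nn" or "ll", and it will not
touch short tokens such as "1e" where digits are not outnumbered, so it does
not replace proper language support in the recognizer.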

Original comment by [email protected] on 19 Jul 2007 at 12:14

Attachments:
