Comments (4)
This is a known problem.
The FindLines code assumes that each rectangle given to it contains text of a
single size and that the baseline, although it may be curved, does not shift
suddenly. It will often succeed when these assumptions are broken, but there is
a much higher probability that the text will be lost or simply wrong.
The problem in the example you give is more or less unique to forms processing.
Although it may be fixed in a future version, that will most likely be a
distant future version. In the meantime, you could try to cut out rectangles of
similar-sized characters...
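One way to act on the "rectangles of similar-sized characters" suggestion is sketched below. It is only an illustration, not part of Tesseract: it assumes you already have character (or word) bounding boxes from some earlier step, and the names `group_by_height`, `bounding_rect` and the `tolerance` parameter are hypothetical. Boxes are grouped by height, and each group's enclosing rectangle could then be cropped and recognized on its own.

```python
from typing import List, Tuple

# (left, top, right, bottom) in pixels; how the boxes are obtained is
# outside this sketch (assumed to come from an earlier segmentation step).
Box = Tuple[int, int, int, int]


def box_height(box: Box) -> int:
    return box[3] - box[1]


def group_by_height(boxes: List[Box], tolerance: float = 0.25) -> List[List[Box]]:
    """Group boxes so that every box in a group is at most `tolerance`
    (relative) taller than the shortest box of that group."""
    groups: List[List[Box]] = []
    for box in sorted(boxes, key=box_height):
        if groups and box_height(box) <= box_height(groups[-1][0]) * (1 + tolerance):
            groups[-1].append(box)
        else:
            groups.append([box])
    return groups


def bounding_rect(group: List[Box]) -> Box:
    """Smallest rectangle that encloses every box in the group."""
    return (
        min(b[0] for b in group),
        min(b[1] for b in group),
        max(b[2] for b in group),
        max(b[3] for b in group),
    )


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (12, 0, 22, 11), (0, 20, 30, 60), (35, 20, 65, 58)]
    for group in group_by_height(boxes):
        print(bounding_rect(group))
```

Note that the enclosing rectangle of a group can still overlap text of another size if the sizes are interleaved on the page; in that case the groups would need to be split by position as well.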
Original comment by [email protected]
on 19 Jul 2007 at 3:34
- Changed state: Accepted
from tesseract-ocr.
I solved this the same way I solved the "digits" problem (issue #164).
Since you're parsing a form, you probably know where each element to recognize
is located on it.
My method is to:
- Rotate the form by an automatically detected angle (I use the black areas
around the scan to determine two corner points, then take the atan of the slope
between them, which gives me the angle).
- Crop the garbage generated by the rotation all around the picture (easy if
you know the angle and at least 3 corners of the document; I first shear it so
that the angle formed by the 3 points is 90°, then crop).
- Determine the "type" of your form, if you're processing many types. I do it
with colorimetry and placemark analysis.
- Then extract each element, not with absolute coordinates but with coordinates
relative to the document size (basically, each x/y pair is a percentage of the
width/height of the document).
- Voilà. You just extract the pieces and parse each one individually with
tesseract.
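Two of the steps above, the angle from two corner points and the percentage-based coordinates, can be sketched as follows. This is a minimal illustration under assumptions, not Pierre's actual code: the function names are hypothetical, the corner points are assumed to be already detected, and `atan2` is used in place of `atan` on the slope so that a vertical edge does not cause a division by zero.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def skew_angle_degrees(p1: Point, p2: Point) -> float:
    """Angle, in degrees, between the horizontal axis and the line p1 -> p2.

    p1 and p2 are assumed to be two detected corners along the top edge
    of the scanned form (image coordinates, y growing downward).
    """
    (x1, y1), (x2, y2) = p1, p2
    return math.degrees(math.atan2(y2 - y1, x2 - x1))


def relative_to_pixels(
    rel_box: Tuple[float, float, float, float], width: int, height: int
) -> Tuple[int, int, int, int]:
    """Convert (left, top, right, bottom) given as percentages of the
    document width/height into pixel coordinates for this scan."""
    left, top, right, bottom = rel_box
    return (
        round(left * width / 100),
        round(top * height / 100),
        round(right * width / 100),
        round(bottom * height / 100),
    )


if __name__ == "__main__":
    # Rotate the image by the negative of this angle to deskew it.
    print(skew_angle_degrees((0, 0), (100, 100)))
    # A field occupying 10%..50% of the width and 20%..25% of the height.
    print(relative_to_pixels((10, 20, 50, 25), 1000, 2000))
```

Because the field boxes are stored as percentages, the same form definition can be reused for scans at different resolutions; each resulting pixel box can then be cropped and passed to tesseract separately.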
Hope that helps,
Pierre.
Original comment by [email protected]
on 4 Apr 2010 at 11:49
Original comment by [email protected]
on 20 May 2010 at 6:53
- Changed state: Look-here-for-help
A reference to this issue was posted in the FAQ.
Original comment by [email protected]
on 2 Jan 2013 at 12:44
- Changed state: No-longer-an-issue