thalesxav / tesseractdotnet Goto Github PK

Automatically exported from code.google.com/p/tesseractdotnet

C++ 58.38% Objective-C 0.01% C 26.90% Makefile 6.65% Shell 3.62% Groff 0.30% HTML 1.46% TeX 0.03% Java 0.94% C# 1.43% NSIS 0.27%

tesseractdotnet's Issues

charlist in wrapper rc2_r552 not proper

What steps will reproduce the problem?
1. Run the wrapper in VS2010 and call the C# method RetriveResultDetail.
2. Inspect the CharList of each returned word; it does not contain the proper word.

What is the expected output? What do you see instead?
The CharList must contain all the characters of each word. In some cases, 
however, two boxes appear for a single word: one box covers the first character 
of the word and the other covers the entire word. In that scenario the last 
character is missing from the CharList.

What version of the product are you using? On what operating system?
VS 2010, C# wrapper rc2_r552, on Windows XP

Please provide any additional information below.
Please note the attached sample OCRed file, which contains two boxes in the word 
Tahsil/Taluk. The CharList returns the ASCII values for "Tahsil/Talu" in one box 
and "k" as a separate word for the second box. The small box actually contains 
the character T, but the value returned for it is the ASCII value of k instead 
of the ASCII value of T.

Original issue reported on code.google.com by [email protected] on 9 Nov 2011 at 7:00

Attachments:

sample.GIF

Svn Source is out of date

The source in svn appears to be out of date. For instance the latest downloads, 
tesseractdotnetwrapper_r590.zip and IPoVn_Release_x86.zip at time of writing, 
have additional methods and functionality compared to what is in the svn 
repository. 

There also appear to be two different versions of the 
'tesseractenginewrapper.h' and 'tesseractenginewrapper.cpp' files, one under 
'.\dotnetwrapper\TesseractEngineWrapper' and another under 
'.\dotnetwrapper\Source\api', where the former appears to be an older version.

Assuming I haven't made some mistake, would you be able to update the svn 
repository so that we can build tesseractdotnetwrapper_r590 ourselves?

Original issue reported on code.google.com by [email protected] on 21 Jul 2011 at 11:34

R552 small errors and solutions

1. Error
1bppIndexed image -> AccessViolationException
Solution: in ccmain->output.cpp->

void Tesseract::write_results(                    //write output
                  ETEXT_DESC *monitor,
                  WERD_RES *word,     //word to do
                  BLOCK *block,       //block it is from
                  ROW_RES *row,       //row it is from
                  const STRING &text, //text to write
                  const STRING &text_lengths) {.....}

This function calls "ocr_append_char" three times, each time with pix_grey_.
If you change "pix_grey_" to "pix_binary_", the error goes away.


2. Error
Large image (more than 127*100 chars in the image) -> AccessViolationException
First solution
In tesseract->tesseractenginewrapper.cpp->void TesseractProcessor::InitializeMonitor(){..}, 
change the value of the "fixed_buffer_factor" variable, for example increase it 
from 100 to 1000.

Second solution
Add a function to the API that lets .NET users set the "fixed_buffer_factor" 
value.

Third solution
The "monitor" variable is an array; change it to a dynamic structure such as a 
linked list.

3. Error: encoding problem
When I have spare time I want to look into this in the C++ code, but there is an 
easy solution on the .NET platform:
 string k= Encoding.UTF8.GetString(Encoding.Default.GetBytes(tesseractProcessor.Apply(bmp)));

I think the RetriveResultDetail function is safer than r590's layout manager.
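
For the first error above (1bppIndexed input), an alternative that avoids patching output.cpp might be to expand the bitmap to 8 bits per pixel before handing it to the wrapper. This is a different workaround from the one proposed in the report, shown only as a sketch. The function name, the bit order (most significant bit is the leftmost pixel) and the mapping of a set bit to white are assumptions, not something taken from the wrapper.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Expand a packed 1-bit-per-pixel image into one byte per pixel.
// Assumptions (not from the report): rows are `stride` bytes apart, the most
// significant bit of each byte is the leftmost pixel, and a set bit maps to
// white (255). Swap 255 and 0 if your palette is inverted.
std::vector<std::uint8_t> Unpack1bppTo8bpp(const std::uint8_t* src,
                                           int width, int height,
                                           std::size_t stride) {
    std::vector<std::uint8_t> out(static_cast<std::size_t>(width) * height);
    for (int y = 0; y < height; ++y) {
        const std::uint8_t* row = src + static_cast<std::size_t>(y) * stride;
        for (int x = 0; x < width; ++x) {
            // Select bit 7 for x % 8 == 0, bit 6 for x % 8 == 1, and so on.
            const std::uint8_t mask = static_cast<std::uint8_t>(0x80u >> (x & 7));
            out[static_cast<std::size_t>(y) * width + x] =
                (row[x >> 3] & mask) ? 255 : 0;
        }
    }
    return out;
}
```

The stride parameter matters because bitmap rows are usually padded, so the row length in bytes is not simply width / 8.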

Original issue reported on code.google.com by [email protected] on 10 Jun 2012 at 10:42

can't find "allheaders.h"

What steps will reproduce the problem?
1. Compile tesseract-ocr-3.02-vs2008


What is the expected output? What do you see instead?

"error  1   error C1083: Can not open include file:“allheaders.h”: No such 
file or directory"

What version of the product are you using? On what operating system?
vs2010 64bit

Please provide any additional information below.
I can't find "allheaders.h" in tesseract-ocr-3.02.02 (the source code).
Why doesn't the source code include "allheaders.h"?

Original issue reported on code.google.com by [email protected] on 29 Jun 2013 at 1:55

How to get character position in tesseract 3.02 r729?

What steps will reproduce the problem?
1.  bool succed = api->Recognize(monitor) >= 0;
succed is true, but in the function RetriveResultDetail,
int nChars = head->count;
nChars is always zero.
2.
3.

What is the expected output? What do you see instead?


What version of the product are you using? On what operating system?
tesseract 3.02 r729,Windows XP, VS2008

Please provide any additional information below.

Original issue reported on code.google.com by [email protected] on 7 Jul 2012 at 1:31

Simple example application - 5 warnings - won't run

Operating System.
   Windows 7 64 bit
   Visual Studio 2010 C#

Simple Example app download from
tesseractdotnet - Revision 41: /trunk/dotnetwrapper/TesseractEngineWrapper

http://tesseractdotnet.googlecode.com/svn/trunk/dotnetwrapper/TesseractEngineWrapper/


What steps will reproduce the problem?
1. Build the simple example application 
2. (5 Warnings) Unreachable code detected
   ImageViewer.cs Line 156
   Histogram.cs line 201 / 210
   GreyImage.cs line 253
   RGBImage.cs  line 309

What is the expected output?
Trying to run the app (Tesseract.OCR.AppEntry) fails under Windows 7 with 
"... stopped working".

Original issue reported on code.google.com by [email protected] on 22 May 2011 at 6:07

whitelist being ignored

var _ocr = new TesseractProcessor();
_ocr.SetPageSegMode(ePageSegMode.PSM_SINGLE_CHAR);
_ocr.SetVariable("tessedit_char_whitelist", "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789");
_ocr.Init(Program.AppPath + "tessdata\\", "eng", (int)Enums.EOcrEngineMode.TesseractOnly);

I'm expecting only alphanumeric output, but I'm getting all kinds of other characters:

!*&() etc

I've read that the blacklist overrides the whitelist if the blacklist is null 
or empty... does this mean that the whitelist is ignored if the blacklist isn't 
specified?
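
As far as I know, the comment on SetVariable in Tesseract's baseapi.h says it must be called after Init(), so moving the SetVariable call below Init may be worth trying first. Independently of that, a post-filter on the recognized text can enforce the whitelist on the caller's side. The sketch below is plain C++ with a hypothetical function name; it is not part of the wrapper or of Tesseract.

```cpp
#include <string>

// Keep only the characters of `text` that appear in `whitelist`.
// This is a post-processing workaround applied to the OCR output, not a
// Tesseract API call; it drops unwanted characters instead of steering
// the recognizer away from them.
std::string FilterToWhitelist(const std::string& text,
                              const std::string& whitelist) {
    std::string kept;
    kept.reserve(text.size());
    for (char c : text) {
        if (whitelist.find(c) != std::string::npos) {
            kept.push_back(c);
        }
    }
    return kept;
}
```

Note the trade-off: a filter only removes characters, whereas a working whitelist lets the engine pick the best allowed candidate, so the two are not equivalent.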

What version of the product are you using? On what operating system?
r591 on windows 7

Please provide any additional information below.

I also need to get the confidence level for the characters.

Original issue reported on code.google.com by [email protected] on 24 Aug 2011 at 5:52

How to add a language in tesseract

Hi,
I'm working with Visual Studio 2010 on Windows 7 and I need to know how to add 
the French language. Should I also know which version of Emgu I have? If so, 
how do I find out?
Thanks

Original issue reported on code.google.com by [email protected] on 9 Aug 2014 at 2:35

.net 3 Confidence is always 0

What steps will reproduce the problem?

I am using the vs 3 .net wrapper.
When I run the function Recognize it ocrs the image fine and I can get
the string.
I need the confidence level of each character, but it is always 0.
What am I doing wrong?



        Dim image As New Bitmap("C:\MyImage.tif")
        Dim ocr As New TesseractProcessor

        ocr.Init(Nothing, "eng", False)
        Console.WriteLine(ocr.Recognize(image))


        ocr.InitForAnalysePage()
        ocr.SetVariable("tessedit_thresholding_method", "1")
        ocr.SetVariable("save_best_choices", "T")


        Dim doc As DocumentLayout = ocr.AnalyseLayout(image)
        For Each blk As OCR.TesseractWrapper.Block In doc.Blocks
            Console.WriteLine("Block Confidence: " & blk.Confidence)


            For Each para As Paragraph In blk.Paragraphs
                Console.WriteLine("para Confidence: " & para.Confidence)

                For Each ln As TextLine In para.Lines
                    Console.WriteLine("ln Confidence: " & ln.Confidence)

                    For Each wrd As Word In ln.Words
                        Console.WriteLine("wrd Confidence: " & wrd.Confidence)
                        Console.WriteLine("wrd Text: " & wrd.Text)

                        For Each ch As Character In wrd.CharList
                            Console.WriteLine("V:" & ch.Value)
                            Console.WriteLine("C:" & ch.Confidence)
                        Next

                    Next

                Next
            Next
        Next



What is the expected output? What do you see instead?
The confidence is always zero.

What version of the product are you using? On what operating system?
tesseract engine 3.x .net wrapper v1.0 RC2

Please provide any additional information below.

Original issue reported on code.google.com by [email protected] on 15 Mar 2012 at 2:19

Broken Characters ? Not able to recognise, but legacy (without wrapper) Tesseract does recognize.

What steps will reproduce the problem?
1. The attached sample files can be OCRed using Tesseract without the .NET wrapper.
2. But they cannot be OCRed using the .NET wrapper; it gives all garbage.

What is the expected output? What do you see instead?
If the characters are not broken, the .NET wrapper works great. But the attached 
images come from dot matrix printouts.
If legacy Tesseract can OCR the sample images, why can't the wrapper?
Also, how can we update the "eng.Traineddata" file for the .NET wrapper, 
especially if it is possible to update "eng.Traineddata" in legacy Tesseract?

What version of the product are you using? On what operating system?

tesseractdotnetwrapper_r590


Please provide any additional information below.

Original issue reported on code.google.com by [email protected] on 27 Jun 2014 at 1:42

Attachments:

[Sample Images.zip](https://storage.googleapis.com/google-code-attachments/tesseractdotnet/issue-30/comment-0/Sample Images.zip)

No auto rotation out of box

What steps will reproduce the problem?
1. Ran it with psm 0/1/6, yet did not see auto rotation of the image

What is the expected output? What do you see instead?
Was expecting the image to be rotated to the correct orientation

What version of the product are you using? On what operating system?
latest build, win 7

Please provide any additional information below.

Original issue reported on code.google.com by [email protected] on 31 Aug 2012 at 7:10

RetriveResultDetail always returning numbers

What steps will reproduce the problem?
1. _ocr.Apply(image)
2. string result = _ocr.Apply(image);
3. List<Word> detectedWords = _ocr.RetriveResultDetail();

What is the expected output? What do you see instead?
If the image contains only letters, the word list should contain words made of 
letters; however, I get words containing numbers.

What version of the product are you using? On what operating system?
Windows XP, Dotnet wrapper 3.02 with tessdata version 3.01

Please provide any additional information below.
tessdata language "eng"

Original issue reported on code.google.com by [email protected] on 6 Jun 2015 at 8:24

I can't use 3.2 version traineddata with 3.1 dll.

What steps will reproduce the problem?
1. Since 3.2, eng.traineddata is large (> 20 MB). If I use it in VS2012, I get a 
System.AccessViolationException at: string iden = ocr.ToCR(bitmap);

The command line displays: actual_tessdata_num_entries_ <= 
TESSDATA_NUM_ENTRIES:Error:Assert failed: in file ..\ccutil\tessdatamanager.cpp, 
line 48

With a smaller traineddata file the problem does not occur, so please build a 
3.2 version soon.

Original issue reported on code.google.com by [email protected] on 12 Jan 2013 at 4:40

System.IO.FileLoadException

What steps will reproduce the problem?
1. Create new WinForms project
2. Add reference to tesseractengine3.dll
3. var x = new TesseractProcessor();

What is the expected output? What do you see instead?
Main form window

What version of the product are you using? On what operating system?
RC

Please provide any additional information below.
Thrown System.IO.FileLoadException
This exception is thrown if the file is not a valid .NET Framework assembly.

Many thanks for this library. I have been searching for something like this for 
the last 2 weeks.

Original issue reported on code.google.com by [email protected] on 28 Feb 2011 at 2:15

Can I do image pre-processing and page layout analysis with the latest Tesseract OCR engine source code?

Hello everybody! Can you help me resolve the problems below?
1. How can I do image pre-processing (such as binarization, noise detection and 
reduction, skew and orientation detection...) using the Tesseract OCR 3.01 
source code?
2. How can I combine "page layout analysis" with character recognition (using 
the Tesseract OCR 3.01 source code)?

Original issue reported on code.google.com by [email protected] on 11 Aug 2011 at 1:28

VietOCR.Net3.2: problem after installing the setup

What steps will reproduce the problem?
1. I downloaded VietOCR.Net3.2 and made the changes I wanted. It runs well when 
I run the project from Visual Studio.

2. I then tried to build a setup file. There were no errors while creating the 
setup.

3. But when I open the program from Start -> Programs after installing, it 
shows the following error

What is the expected output? What do you see instead?

Could not load file or assembly 'tesseract, Version=0.0.0.0, Culture=neutral, 
PublicKeyToken=null' or one of its dependencies. The system cannot find the 
file specified.


What version of the product are you using? On what operating system?
visual studio 2008

Please provide any additional information below.

I have attached the setup file that I generated. 
Can anyone please help with my issue?

Original issue reported on code.google.com by [email protected] on 20 Mar 2012 at 1:18

Attachments:

Release.zip

Tesseract fails to identify characters it was already trained on

I have done the training for the Burmese language as specified on the site.
Instead of using another scanned page, I am using the same image that I used 
for training Tesseract, so this procedure should give maximum accuracy.

What steps will reproduce the problem?
1. Please find attached the trained data and the TIFF file I used for training
   (for testing I used a 300 dpi TIFF scan of a paper page).
2. Run Tesseract on the same image with the attached trained data.
3. Tesseract still confuses the characters. Accuracy is only 60%.

What is the expected output? What do you see instead?
Since the same training image is used for recognition, the accuracy should be 
high.
I am not sure why Tesseract has trouble identifying the characters.
Please help me with how to proceed.

What version of the product are you using? On what operating system?
Tesseract 3.02 on windows 7 64 bit

Original issue reported on code.google.com by [email protected] on 10 Jul 2013 at 5:43

Could not load file; FileLoadException

What steps will reproduce the problem?
1. Checked out the project via svn (svn co 
https://tesseractdotnet.googlecode.com/svn/trunk/dotnetwrapper)
2. Opened in Visual Studio 2010
3. Hit F5 to trigger the error; running it outside the debugger simply closes 
the program

What is the expected output? What do you see instead?
I expected the form application to load, instead it simply closes


What version of the product are you using? On what operating system?
Revision 48, Windows 7, on Visual Studio 2010


Please provide any additional information below.
Just downloaded the project today and figured I'd play around with it and see 
what results I get. However, when I'm trying to run the project I get the below 
error:

Could not load file or assembly 'tesseractengine3, Version=0.0.0.0, 
Culture=neutral, PublicKeyToken=null' or one of its dependencies. The 
application has failed to start because its side-by-side configuration is 
incorrect. Please see the application event log or use the command-line 
sxstrace.exe tool for more detail. (Exception from HRESULT: 0x800736B1)
   at Tesseract.OCR.AppEntry.MainForm..ctor()
   at Tesseract.OCR.AppEntry.Program.Main() in C:\Users\Guest\Desktop\dotnetwrapper\TesseractBasedOCRAnalysis\Tesseract.OCR.AppEntry\Program.cs:line 36

Line 36 refers to the new MainForm I found, but the actual error break occurs 
on line 21 in the Main.Form.Designer.cs where this.end() is called. I feel that 
my issue is simple since I haven't seen it on the forums. Anyway, thanks for 
any help :D

Original issue reported on code.google.com by [email protected] on 9 Aug 2011 at 8:29

Unhandled exception of type 'System.Windows.Markup.XamlParseException'

What steps will reproduce the problem?
1. Create a WPF project in Visual Studio 2012 (C#, target framework .NET 4.0, 
project set to x86).
2. Add a reference to tesseractengine3.dll.
3. Create TesseractProcessor tp = new TesseractProcessor();
4. Compile in Debug or Release; the result is the same.
5. I get "An unhandled exception of type 
'System.Windows.Markup.XamlParseException' occurred in 
PresentationFramework.dll".

Additional information: 'The invocation of the constructor on type 
'ProjectTest.MainWindow' that matches the specified binding constraints threw 
an exception.' Line number '3' and line position '9'.


What is the expected output? What do you see instead?
Just an empty window.



What version of the product are you using? On what operating system?
Visual studio 2012 express with .net framework 4.0


Please provide any additional information below.
I did an OCR project in Visual Studio 2008 and it works very well.
I am trying this project in VS2012 because the Kinect for Windows SDK is 
compatible with VS2010 and newer, and that is where this error occurs.

I have spent 1 week on this error with no solution.
Thanks for your help.

Original issue reported on code.google.com by [email protected] on 30 Aug 2013 at 10:42

Attachments:

Untitled.jpg

x64 version?

What steps will reproduce the problem?
1.
2.
3.

What is the expected output? What do you see instead?
Tesseract Net Wrapper x64

What version of the product are you using? On what operating system?
Win 7 x64

Please provide any additional information below.

Hello all,

Has anyone succeeded in compiling this project as a native x64 dll?
Is it possible?

The unfortunate situation is the current dll is compiled as x86 so
it cannot be included in an x64/any CPU project in windows.

Thanks for any assistance
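
To confirm which architecture a given build of the DLL actually targets, the Machine field of its PE header can be inspected. The sketch below assumes the file has already been read into memory; the function name is hypothetical, and only the two machine codes relevant here (x86 and x64) are recognized.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Report the target machine of a PE file (DLL or EXE) held in memory.
// Returns "x86", "x64", or "unknown" for anything else or for malformed input.
std::string PeMachine(const std::vector<std::uint8_t>& file) {
    if (file.size() < 0x40 || file[0] != 'M' || file[1] != 'Z') return "unknown";
    // e_lfanew at offset 0x3C: little-endian offset of the "PE\0\0" signature.
    const std::size_t pe = static_cast<std::size_t>(file[0x3C]) |
                           (static_cast<std::size_t>(file[0x3D]) << 8) |
                           (static_cast<std::size_t>(file[0x3E]) << 16) |
                           (static_cast<std::size_t>(file[0x3F]) << 24);
    if (pe > file.size() - 6) return "unknown";
    if (file[pe] != 'P' || file[pe + 1] != 'E' ||
        file[pe + 2] != 0 || file[pe + 3] != 0) {
        return "unknown";
    }
    // The 16-bit little-endian Machine field follows the signature.
    const unsigned machine = file[pe + 4] | (file[pe + 5] << 8);
    if (machine == 0x014C) return "x86";  // IMAGE_FILE_MACHINE_I386
    if (machine == 0x8664) return "x64";  // IMAGE_FILE_MACHINE_AMD64
    return "unknown";
}
```

A process can only load native or mixed-mode DLLs of its own bitness, which is why an x86 build of the wrapper cannot be loaded by a process running as x64.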

Original issue reported on code.google.com by [email protected] on 6 Sep 2011 at 12:55

_ocrProcessor.Apply(System.Drawing.Image img) -- choke on corrupted memory

What steps will reproduce the problem?
1. _ocrProcessor.Apply(image_object)
2.
3.

What is the expected output? What do you see instead?
I expect it to accept any image of type System.Drawing.Image. Instead, i'm 
getting a corrupt memory message.

What version of the product are you using? On what operating system?
1.0 on windows vista 32 bit 

Please provide any additional information below.
Whenever I save the image as a TIFF file using FreeImage.NET, the wrapper has no 
problem loading the TIFF file from the path and OCRing it, but when I pass a 
TIFF image object into the Apply method, the wrapper complains about corrupt 
memory. I've also tried bitmaps, with the same results. It would be nice if the 
wrapper took any System.Drawing.Image object and converted it into a format 
that Tesseract will not choke on.

One more thing: I'm also not receiving the IList<Word> results when calling 
RecieveResults. Other than that, I want to thank the author for the time and 
effort put into this library. I really appreciate it.

Original issue reported on code.google.com by [email protected] on 10 May 2011 at 11:57

OCRWord is empty

What steps will reproduce the problem?
1. processor.Recognize(bmp) or processor.AnalyseLayout(bmp)

Tesseract 2 made it possible to retrieve the confidence and position of each 
word through the OCRWord class. When testing with the latest version I noticed 
that this class is empty: it doesn't contain a word, and the confidence is 
always 0, even when I OCRed an image upside down.

How can I access these values?

Original issue reported on code.google.com by [email protected] on 12 Jul 2011 at 12:14

Binary and Grey Image tesseract.dll Functions Not found in VB.NET

What steps will reproduce the problem?
1. Load tesseract.dll into a VS2008 VB.NET project
2. Go to Object Browser or try to use functions in question.


What is the expected output? What do you see instead?
The functions appear when the same DLL is loaded into a C# project. It is 
expected they would appear in a VB.NET project. They do not.

What version of the product are you using? On what operating system?
Version in 7/4/2011 build. tesseract.dll SHA1 is 
146404737CE2D6F1A934BE54FF5A0817BEC82A81.

Please provide any additional information below.
The functions in question (detailed below) appear when using tesseract.dll in a 
C# project. However, when you bring the DLL into a VB.NET project, the 
functions are nowhere to be found.

Functions in question:
AnalyzeLayoutBinaryImage(byte*, int, int)
AnalyzeLayoutGreyImage(byte*, int, int)
AnalyzeLayoutGreyImage(ushort*, int, int)
RecognizeBinaryImage(byte*, int, int)
RecognizeGreyImage(byte*, int, int)
RecognizeBinaryImage(ushort*, int, int)

Am I missing a setting or a tweak in the VS2008 environment that would bring 
them in? Are there functions that aren't supposed to show up in VB.NET? Any 
help or guidance provided would be appreciated.

Original issue reported on code.google.com by [email protected] on 6 Jun 2013 at 10:27

unhandled application error when launching free ocr

What steps will reproduce the problem?
1. Two unexpected power interruptions while using OCR
2. Uninstalling and re-installing did not help
3.

What is the expected output? What do you see instead?


What version of the product are you using? On what operating system?
using win 7 32 bit

Please provide any additional information below.

Original issue reported on code.google.com by [email protected] on 19 Mar 2014 at 5:05

DLL in Release Configuration

What steps will reproduce the problem?
1.
2.
3.

What is the expected output? What do you see instead?
To be able to build the .dll in Release mode.

What version of the product are you using? On what operating system?
r48; Win7 64-bit

Please provide any additional information below.
The .dll is built fine in Debug mode; however, when the Solution/Project is 
switched to Release mode, no .dll is generated.

Original issue reported on code.google.com by [email protected] on 17 Jul 2011 at 1:31

Return Unicode string

What steps will reproduce the problem?
1.
2.
3.

What is the expected output? What do you see instead?
Expect to receive Unicode string; got UTF-8 string instead.

What version of the product are you using? On what operating system?
r42, Win7 64-bit

Please provide any additional information below.

Modify TesseractProcessor::Process(TessBaseAPI* api, Pix* pix) method in 
TesseractRecognizer.cpp as follows:

Old:
String* result = new String(text);

New:
String* result = new String(text, 0, strlen(text), Encoding::UTF8);
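
For readers who want to see what the UTF-8 decoding step amounts to outside of .NET, here is a minimal standard C++ sketch that turns UTF-8 bytes into UTF-16 code units. It is an illustration only, not the wrapper's code: the function name is hypothetical, malformed input is replaced with U+FFFD, and overlong encodings are not rejected.

```cpp
#include <cstddef>
#include <string>

// Decode UTF-8 into UTF-16 code units. Malformed or truncated sequences are
// replaced by U+FFFD; overlong forms are not rejected in this sketch.
std::u16string Utf8ToUtf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        int extra = 0;  // number of continuation bytes expected
        if (b < 0x80)              { cp = b;        extra = 0; }
        else if ((b >> 5) == 0x06) { cp = b & 0x1F; extra = 1; }
        else if ((b >> 4) == 0x0E) { cp = b & 0x0F; extra = 2; }
        else if ((b >> 3) == 0x1E) { cp = b & 0x07; extra = 3; }
        else { out.push_back(0xFFFD); ++i; continue; }  // stray byte
        ++i;
        bool ok = true;
        for (int k = 0; k < extra; ++k) {
            if (i >= in.size() ||
                (static_cast<unsigned char>(in[i]) & 0xC0) != 0x80) {
                ok = false;
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
            ++i;
        }
        if (!ok || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            out.push_back(0xFFFD);
            continue;
        }
        if (cp >= 0x10000) {  // encode as a surrogate pair
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<char16_t>(cp));
        }
    }
    return out;
}
```

For example, the two bytes 0xC3 0xA9 decode to the single code unit U+00E9 (é), which is the kind of character that comes out garbled when UTF-8 bytes are interpreted in a single-byte code page.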

Original issue reported on code.google.com by [email protected] on 3 Jul 2011 at 2:13

Compatibility issue with libtesseract DLL and Windows Server 2012

What steps will reproduce the problem?
1. Use libtesseract303.dll with the JNA wrapper tess4j on a Windows Server 2012 
machine
2. tessAPI1.java should read libtesseract303 from the respective path

What is the expected output? What do you see instead?
 should read the text from the image.

What version of the product are you using? On what operating system?
tess4j 64 bit version 64 bit dlls on windows server 2012

Please provide any additional information below.

JNA looks for libtesseract303.dll in the respective path but is not able to read 
the libtesseract303.dll file. I think it may be due to a compatibility issue 
between the DLLs and Windows Server 2012.

Kindly provide a solution for it.

Original issue reported on code.google.com by [email protected] on 11 Dec 2014 at 10:46

How to change the orientation of the tiff before processing it.

I'm passing the TIFF to tesseract.doOCR(imageFile); but before doing this, I 
want the orientation of my multipage TIFF to be portrait. 
How can I achieve this? I'm attaching a TIFF file. I want each page to be in 
portrait before processing it.
Thanks in advance.
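
One way to think about the request: for each page, compare width and height and rotate by 90 degrees when the page is landscape. The sketch below works on a decoded 8-bit greyscale page in memory; the Page struct and function names are hypothetical, decoding and re-encoding the TIFF is left out, and rotating clockwise (rather than counter-clockwise) is an assumption that depends on how the pages were scanned.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical in-memory page: row-major, one byte per pixel.
struct Page {
    int width;
    int height;
    std::vector<std::uint8_t> pixels;
};

// Rotate 90 degrees clockwise: source pixel (x, y) lands at (height-1-y, x).
Page RotateClockwise(const Page& src) {
    Page dst;
    dst.width = src.height;
    dst.height = src.width;
    dst.pixels.resize(src.pixels.size());
    for (int y = 0; y < src.height; ++y) {
        for (int x = 0; x < src.width; ++x) {
            const int dx = src.height - 1 - y;
            const int dy = x;
            dst.pixels[static_cast<std::size_t>(dy) * dst.width + dx] =
                src.pixels[static_cast<std::size_t>(y) * src.width + x];
        }
    }
    return dst;
}

// Bring a landscape page to portrait; portrait pages are returned unchanged.
Page ToPortrait(const Page& page) {
    return page.width > page.height ? RotateClockwise(page) : page;
}
```

This only fixes the aspect; it cannot tell whether the resulting text is upright or upside down, which would need an orientation detection step.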

Original issue reported on code.google.com by [email protected] on 3 Oct 2011 at 5:15

Attachments:

FaxB27.tif

Can't compile the tesseractengine3.DLL

What steps will reproduce the problem?

1.  Followed Steps outlined in the Wiki

What is the expected output? What do you see instead?

Expected compilation of DLL. Fails with one error

"error C2061: syntax error : identifier 'FILE'"
"d:\tesseract-ocr\api\baseapi.h 134"


What version of the product are you using? On what operating system?
VS2008, 32 bit Vista.

Please provide any additional information below.

After making the changes to the tesseract project

Configuration Type: Dynamic Library (.dll)
Common Language Runtime Support: Old Syntax (/clr:oldSyntax)
Output File: tesseractengine3.dll
Also need to add System, System.Drawing assembly

tesseractengine3.dll DOES compile

After adding in the tesseractenginewrapper.h and tesseractenginewrapper.cpp 
files , the project will not compile

Original issue reported on code.google.com by [email protected] on 2 Mar 2011 at 3:15

Failed to initialize Tesseract Engine 3.01


What steps will reproduce the problem?

This wrapper is great. How can I make it compatible with the latest version, 
r581? I downloaded the demo source you supplied and compiled it; when I start 
it, it displays "Failed to initialize Tesseract Engine 3.01". I noticed that the 
selected Tesseract data path always has a slash appended at the end (see the 
attachment below). I remember that the Tesseract 2.04 data path must not end 
with a slash. I removed the trailing slash, recompiled and ran it, and it works.
Sorry for my bad English.
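
The workaround described above (dropping the trailing slash from the data path) can be sketched as a small helper. The function name is hypothetical, and whether a given engine version expects the slash or not is something to verify against that version; the helper only normalizes the string.

```cpp
#include <string>

// Remove trailing '/' or '\\' characters from a directory path, but keep a
// bare root such as "/" or "C:\\" intact.
std::string StripTrailingSlash(std::string path) {
    while (path.size() > 1 &&
           (path.back() == '/' || path.back() == '\\') &&
           !(path.size() == 3 && path[1] == ':')) {
        path.pop_back();
    }
    return path;
}
```

Normalizing in one place means the rest of the code can append a separator itself when needed, instead of depending on how the user picked the folder.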

What version of the product are you using? On what operating system?
win7 32bit  vs2008

Original issue reported on code.google.com by [email protected] on 12 Apr 2011 at 8:24

Attachments:

fa.jpg

Add support for recognizing a region of the image

What steps will reproduce the problem?
1.
2.
3.

What is the expected output? What do you see instead?


What version of the product are you using? On what operating system?
r42, Win7 64-bit

Please provide any additional information below.

Add a method to recognize a rectangular region of the image. The changes are as 
follows:

Add to TesseractRecognizer.cpp:

String* TesseractProcessor::Recognize(System::Drawing::Image* image, System::Drawing::Rectangle rect)
{
    if (_apiInstance == null || image == null)
        return null;

    String* result = "";
    Pix* pix = null;

    try
    {
        pix = PixConverter::PixFromImage(image);
        if (rect != System::Drawing::Rectangle::Empty)
            this->EngineAPI->SetRectangle(rect.Left, rect.Top, rect.Width, rect.Height);

        result = this->Process(this->EngineAPI, pix);
    }
    catch (System::Exception* exp)
    {
        throw exp;
    }
    __finally
    {
        if (pix != null)
        {
            pixDestroy(&pix);
            pix = null;
        }
    }

    return result;
}

Add to TesseractEngineWrapper.h a declaration:

String* Recognize(System::Drawing::Image* image, System::Drawing::Rectangle rect);
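
A caller of such a method may want to clip the requested region to the image bounds before it reaches SetRectangle. The following is a minimal sketch in standard C++; Rect is a hypothetical stand-in for System::Drawing::Rectangle and ClampToImage is not part of the proposed patch.

```cpp
#include <algorithm>

// Hypothetical stand-in for System::Drawing::Rectangle.
struct Rect {
    int left;
    int top;
    int width;
    int height;
};

// Intersect `r` with the image bounds [0, imgW) x [0, imgH).
// Returns a rectangle with zero width or height if there is no overlap.
Rect ClampToImage(Rect r, int imgW, int imgH) {
    const int x0 = std::max(r.left, 0);
    const int y0 = std::max(r.top, 0);
    const int x1 = std::min(r.left + r.width, imgW);
    const int y1 = std::min(r.top + r.height, imgH);
    return Rect{x0, y0, std::max(x1 - x0, 0), std::max(y1 - y0, 0)};
}
```

An empty result (zero width or height) can then be treated the same way the patch treats Rectangle::Empty, that is, by not restricting recognition at all or by returning early, depending on what the caller prefers.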

Original issue reported on code.google.com by [email protected] on 3 Jul 2011 at 2:21

Problem with french characters

What steps will reproduce the problem?
1. use the application with french text
2.
3.

What is the expected output? What do you see instead?
special characters éèà...  not recognized correctly

What version of the product are you using? On what operating system?
last version - windows

Please provide any additional information below.
The bug could be corrected in tesseractenginewrapper.cpp:

static wchar_t *make_unicode_string(const char *utf8)
{
  int size = 0, out_index = 0;
  wchar_t *out;

  /* first calculate the size of the target string */
  int used = 0;
  int utf8_len = strlen(utf8);
  while (used < utf8_len) {
    int step = UNICHAR::utf8_step(utf8 + used);
    if (step == 0)
      break;
    used += step;
    ++size;
  }

  out = (wchar_t *) malloc((size + 1) * sizeof(wchar_t));
  if (out == NULL)
      return NULL;

  /* now convert to Unicode */
  used = 0;
  while (used < utf8_len) {
    int step = UNICHAR::utf8_step(utf8 + used);
    if (step == 0)
      break;
    UNICHAR ch(utf8 + used, step);
    out[out_index++] = ch.first_uni();
    used += step;
  }
  out[out_index] = 0;

  return out;
}


System::Collections::Generic::List<Word*>* TesseractProcessor::RetriveResultDetail()
{
    if (!_doMonitor || _monitorInstance == null)
        return null;

    System::Collections::Generic::List<Word*>* wordList = null;

    ETEXT_DESC* monitor = null;
    ETEXT_DESC* head = null;
    Word* currentWord = null;

    try
    {
        monitor = (ETEXT_DESC*)_monitorInstance.ToPointer();
        head = &monitor[1];

        int lineIndex=0;        
        int lineIdx = 0;
        int nChars = head->count;
        int i = 0;
        int j;
        while (i < nChars)
        {
            EANYCODE_CHAR* ch = &(head + i)->text[0];

            if (ch->blanks > 0)
            {   /*new word condition meets*/
                if (currentWord != null)
                    wordList = currentWord->UpdateConfidenceAndInsertTo(wordList);

                currentWord = null; // reset current word
            }

            if (currentWord != null && 
                (ch->left <= currentWord->Left || ch->top >= currentWord->Bottom))              
            {   /*new line condition meets*/
                wordList = currentWord->UpdateConfidenceAndInsertTo(wordList);

                lineIdx++;

                currentWord = null; // reset current word
            }

            if (currentWord == null)
            {   /*create new word*/
                currentWord = new Word();

                currentWord->LineIndex = lineIdx;

                currentWord->FontIndex = ch->font_index;
                currentWord->PointSize = ch->point_size;
                currentWord->Formating = ch->formatting;
            }

            unsigned char unistr[24]; 

            for (j = i; j < nChars; j++) 
            { 
                const EANYCODE_CHAR* unich = &(head + j)->text[0]; 
                if (ch->left != unich->left || ch->right != unich->right || 
                    ch->top != unich->top || ch->bottom != unich->bottom) 
                    break; 
                unistr[j - i] = static_cast<unsigned char>(unich->char_code); 
            }
            unistr[j - i] = '\0'; 
            wchar_t *utf16ch=make_unicode_string(reinterpret_cast<const char*>(unistr));

            Character* c = new Character(
                static_cast<char>(*utf16ch), 
                ch->confidence,
                ch->left, ch->top, ch->right, ch->bottom);

            /* update current word */
            currentWord->CharList->Add(c);

            System::String* sc = new String(*utf16ch, 1);
            currentWord->Text = System::String::Format(
                "{0}{1}", currentWord->Text->ToString(), sc);

            free(utf16ch);

            currentWord->Left = Math::Min(currentWord->Left, (int)ch->left);
            currentWord->Top = Math::Min(currentWord->Top, (int)ch->top);
            currentWord->Right = Math::Max(currentWord->Right, (int)ch->right);
            currentWord->Bottom = Math::Max(currentWord->Bottom, (int)ch->bottom);

            currentWord->Confidence += ch->confidence;

            i=j; /*go to next char*/
        } /* end while */

        if (currentWord != null)
            wordList = currentWord->UpdateConfidenceAndInsertTo(wordList);
    }
    catch (System::Exception* exp)
    {
        throw exp;
    }
    __finally
    {
        currentWord = null;
        head = null;
        monitor = null;
    }

    return wordList;
}
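
The word and line splitting rules used in the loop above can be read more easily in isolation. The sketch below restates them in standard C++ with hypothetical Glyph and WordBox types standing in for EANYCODE_CHAR and Word. It assumes that UpdateConfidenceAndInsertTo averages the accumulated confidence over the characters of the word, which the posted code does not show.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical plain-C++ stand-ins for EANYCODE_CHAR and the wrapper's Word.
struct Glyph {
    char value;
    int blanks;                    // spaces preceding this character
    int left, top, right, bottom;  // bounding box, y grows downward
    int confidence;
};

struct WordBox {
    std::string text;
    int left, top, right, bottom;
    int lineIndex;
    double confidence;             // mean of the character confidences (assumed)
};

std::vector<WordBox> GroupIntoWords(const std::vector<Glyph>& glyphs) {
    std::vector<WordBox> words;
    WordBox cur{};
    bool open = false;             // true while `cur` is collecting characters
    int confidenceSum = 0;
    int line = 0;

    auto flush = [&]() {
        if (!open) return;
        cur.confidence = static_cast<double>(confidenceSum) / cur.text.size();
        words.push_back(cur);
        open = false;
    };

    for (const Glyph& g : glyphs) {
        if (g.blanks > 0) flush();  // a preceding blank starts a new word
        // A character that jumps back to the left of, or below, the current
        // word starts a new line. As in the code above, this is only checked
        // while a word is open, so a line break that coincides with a blank
        // does not increase the line index.
        if (open && (g.left <= cur.left || g.top >= cur.bottom)) {
            flush();
            ++line;
        }
        if (!open) {
            cur = WordBox{std::string(), g.left, g.top, g.right, g.bottom, line, 0.0};
            confidenceSum = 0;
            open = true;
        }
        cur.text.push_back(g.value);
        cur.left = std::min(cur.left, g.left);
        cur.top = std::min(cur.top, g.top);
        cur.right = std::max(cur.right, g.right);
        cur.bottom = std::max(cur.bottom, g.bottom);
        confidenceSum += g.confidence;
    }
    flush();
    return words;
}
```

Written this way, it is easier to see that the line index depends on the order of the two checks, which may be worth keeping in mind when comparing line numbers against the image.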

Original issue reported on code.google.com by [email protected] on 25 May 2011 at 4:11

How to OCR using Tesseract in VB.NET

Hi,

I searched a lot of forums for an easy Tesseract example for OCRing an image in 
VB.NET, but I can't find a simple or complete answer covering what I should 
reference, how to initialise Tesseract, and how to process the image.
If anyone has the time to explain this to me, thanks in advance.

Original issue reported on code.google.com by [email protected] on 11 Jun 2015 at 5:02

_ocrProcessor.RetriveResultDetail() always null

What steps will reproduce the problem?
1. Process analyze
2.
3.

What is the expected output? What do you see instead?
List with tesseract.words

What version of the product are you using? On what operating system?
Windows XP x86

Please provide any additional information below.

Original issue reported on code.google.com by [email protected] on 31 Mar 2011 at 6:55
