Comments (15)

Dhirendramohanjha avatar Dhirendramohanjha commented on August 26, 2024

@muralimariyappan, can you share sample input data?

from cutie.

muralimariyappan avatar muralimariyappan commented on August 26, 2024

I used the same format as this file https://github.com/4kssoft/CUTIE/blob/master/invoice_data/Faktura1.pdf_0.json
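
In case it helps others, the linked sample roughly follows the shape sketched below as a Python dict. The key names (`text_boxes`, `bbox`, `text`, `fields`, `field_name`, `value_id`, `value_text`) are my reading of the sample file, not an official spec; verify against Faktura1.pdf_0.json itself before relying on them:

```python
# Minimal sketch of the per-page JSON layout used by CUTIE's invoice_data
# samples. Key names are assumptions from the linked file -- check
# Faktura1.pdf_0.json for the authoritative structure.
sample = {
    "text_boxes": [
        {"id": 1, "bbox": [104, 80, 208, 110], "text": "Faktura"},
        {"id": 2, "bbox": [104, 120, 180, 150], "text": "2019-05-07"},
    ],
    "fields": [
        {"field_name": "invoice_date", "value_id": [2], "value_text": ["2019-05-07"]},
    ],
}
```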


SiddharthGadekar avatar SiddharthGadekar commented on August 26, 2024

Hi Murali,
Were you able to get a solution to this?
I am facing the same issue


muralimariyappan avatar muralimariyappan commented on August 26, 2024

No not yet


B34sTRider avatar B34sTRider commented on August 26, 2024

What is the input format for prediction? I used main_evaluate_json.py to get output with the validation code. The output converts numbers to 0s, and substrings are split with "##" added as a prefix. Is it possible to get the actual text output from the text_boxes key in the input JSON?
I did change update_dict=True for evaluation, then commented out the tokenization code where numbers are converted to 0s and replaced the final string " ##" with "", which gives me the output to some extent. But why are we masking the data?
As suggested, I updated the dict using main_dict_json.py with the training data, and also updated vocab.txt with the new words.
If I change update_dict=False for evaluation, I only get [UNK] for all words for both GT/INF.

@muralimariyappan Can you please tell me which tokenization code you commented out to stop the numbers being converted to 0? I am facing a similar issue.


Neelesh1121 avatar Neelesh1121 commented on August 26, 2024

@muralimariyappan, to avoid the numbers getting converted to 0, you can comment out the script just above the word tokenization. After commenting out both the word tokenization and the number tokenization, I do get some output, but my testing accuracy does not rise above 0.25, and the training accuracy also plateaus after a certain epoch at a low value around 0.5 and then repeats that value for further epochs. Can you share what accuracy you are getting, and if it is good, how many training documents you used?


vsymbol avatar vsymbol commented on August 26, 2024

What is the input format for prediction? I used main_evaluate_json.py to get output with the validation code. The output converts numbers to 0s, and substrings are split with "##" added as a prefix. Is it possible to get the actual text output from the text_boxes key in the input JSON?
I did change update_dict=True for evaluation, then commented out the tokenization code where numbers are converted to 0s and replaced the final string " ##" with "", which gives me the output to some extent. But why are we masking the data?
As suggested, I updated the dict using main_dict_json.py with the training data, and also updated vocab.txt with the new words.
If I change update_dict=False for evaluation, I only get [UNK] for all words for both GT/INF.

This is a good question.

Since numbers carry almost the same meaning in receipts, we convert numbers into zeros so that only the length of the number counts, e.g. "1234"→"0000", "55555"→"00000". Moreover, different combinations of digits would produce a countless number of types for the model to learn, which is difficult.

All in all, the masking helps the model focus on the most important signal for our task, which is the length of the numbers.

If your task requires the actual digits rather than just the length of the numbers, comment out the related line of code that masks the numbers to 0s.
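
The masking described above amounts to a digit-for-digit substitution. A minimal sketch (not the actual loop in data_loader_json.py, which walks characters with an is_number check, but equivalent for ASCII digits):

```python
import re

def mask_digits(text: str) -> str:
    """Replace every digit with '0', preserving the number's length."""
    return re.sub(r"\d", "0", text)

print(mask_digits("1234"))          # -> 0000
print(mask_digits("Total: 55.99"))  # -> Total: 00.00
```

Note that the mapping is lossy by design: "1234" and "9876" both become "0000", so the original digits cannot be recovered from the model output alone.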


Neelesh1121 avatar Neelesh1121 commented on August 26, 2024

@vsymbol I did it the same way and commented out that section for the numbers, but the issue is the accuracy of the result. Can you please suggest how we can improve the accuracy of the model? I have around 500 training documents.
Thanks in Advance!


muralimariyappan avatar muralimariyappan commented on August 26, 2024

What is the input format for prediction? I used main_evaluate_json.py to get output with the validation code. The output converts numbers to 0s, and substrings are split with "##" added as a prefix. Is it possible to get the actual text output from the text_boxes key in the input JSON?
I did change update_dict=True for evaluation, then commented out the tokenization code where numbers are converted to 0s and replaced the final string " ##" with "", which gives me the output to some extent. But why are we masking the data?
As suggested, I updated the dict using main_dict_json.py with the training data, and also updated vocab.txt with the new words.
If I change update_dict=False for evaluation, I only get [UNK] for all words for both GT/INF.

This is a good question.

Since numbers carry almost the same meaning in receipts, we convert numbers into zeros so that only the length of the number counts, e.g. "1234"→"0000", "55555"→"00000". Moreover, different combinations of digits would produce a countless number of types for the model to learn, which is difficult.

All in all, the masking helps the model focus on the most important signal for our task, which is the length of the numbers.

If your task requires the actual digits rather than just the length of the numbers, comment out the related line of code that masks the numbers to 0s.

Thank you for your reply. Can you also explain how to get the original text back after prediction?
The output still contains tokenized words with a ## prefix; I simply replaced it with a space. Special characters are also split out separately, which makes the output differ from the ground truth.
The bounding box is drawn, but the input JSON doesn't have coordinates for the model-generated output.
Do we have to pass the prediction through OCR again to get the original text?
As @Neelesh1121 mentioned, the accuracy is lower because of this. If you have handled the 0s and ## in your code, can you please share your prediction code?
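
For stripping the ## markers, merging each continuation token back into the previous word tends to work better than a plain string replace. A sketch, assuming BERT-style WordPiece output where "##" marks a continuation of the preceding token:

```python
def detokenize(tokens):
    """Merge BERT-style WordPiece tokens back into whole words.
    A token starting with '##' continues the previous word."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]
        else:
            words.append(tok)
    return " ".join(words)

print(detokenize(["fak", "##tura", "total"]))  # -> faktura total
```

This restores word boundaries, but not the masked digits or the coordinates; those have to come from matching the prediction back to the original text_boxes entries in the input JSON.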


NishuKumar avatar NishuKumar commented on August 26, 2024

I commented out the code below in data_loader_json.py to avoid masking while evaluating:

    # digit masking -- commented out
    # for i, c in enumerate(string):
    #     if is_number(c):
    #         string = string[:i] + '0' + string[i+1:]

    strings = [string]

    # WordPiece tokenization -- commented out
    # if self.tokenize:
    #     strings = self.tokenizer.tokenize(strings[0])
    #     print(string, '-->', strings)

    # tokenizer construction -- commented out
    # if self.tokenize:
    #     self.tokenizer = tokenization.FullTokenizer('dict/vocab.txt', do_lower_case=not self.text_case)


Karthik1904 avatar Karthik1904 commented on August 26, 2024

Hi All,
I have trained and evaluated the model,

but how can I predict on a new invoice image? Do I need a JSON file for the new image as well?


mohammedayub44 avatar mohammedayub44 commented on August 26, 2024

@Karthik1904 What data and/or OCR tool did you use for labeling and formatting the input to the right format ?


Ibmaria avatar Ibmaria commented on August 26, 2024

Hi guys, for annotation with bounding boxes and for better accuracy you can use AWS Textract, the Google OCR engine, or this link:
https://www.pyimagesearch.com/2020/05/25/tesseract-ocr-text-localization-and-detection/
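
With plain Tesseract, word-level boxes can be pulled out via pytesseract's image_to_data. A sketch (requires the Tesseract binary plus `pip install pytesseract pillow`; the `to_boxes`/`ocr_boxes` helper names and the output dict shape are my own, not CUTIE's):

```python
def to_boxes(data):
    """Convert pytesseract's image_to_data DICT output into word boxes
    with [left, top, right, bottom] pixel coordinates."""
    boxes = []
    for i, text in enumerate(data["text"]):
        if text.strip():  # skip empty OCR cells
            boxes.append({
                "text": text,
                "bbox": [data["left"][i], data["top"][i],
                         data["left"][i] + data["width"][i],
                         data["top"][i] + data["height"][i]],
            })
    return boxes

def ocr_boxes(image_path):
    """Run Tesseract on an image and return word-level boxes."""
    import pytesseract
    from PIL import Image
    data = pytesseract.image_to_data(Image.open(image_path),
                                     output_type=pytesseract.Output.DICT)
    return to_boxes(data)
```

The resulting list still needs to be reshaped into whatever key names the CUTIE sample JSON uses.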


mohammedayub44 avatar mohammedayub44 commented on August 26, 2024

Thanks @Ibmaria for the link. It looks like Textract returns a JSON, but it needs to be reformatted as suggested earlier.

I checked out your fork of the repo; you had run some tests with the SROIE dataset. How did you end up creating the JSON files for those? I was trying to do the same (raised an issue here: #18)
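
The reformatting step could look roughly like this. The input shape follows Textract's documented response (WORD blocks with a relative Geometry.BoundingBox); the output key names (`id`, `text`, `bbox`) are assumptions about the CUTIE side, so check them against the sample JSON:

```python
def textract_to_text_boxes(blocks, page_w, page_h):
    """Convert AWS Textract WORD blocks into a CUTIE-style text_boxes list.
    Textract bounding boxes are relative (0-1); scale by the page size
    to get pixel coordinates [left, top, right, bottom]."""
    text_boxes = []
    words = (b for b in blocks if b["BlockType"] == "WORD")
    for i, b in enumerate(words):
        box = b["Geometry"]["BoundingBox"]
        left = int(box["Left"] * page_w)
        top = int(box["Top"] * page_h)
        text_boxes.append({
            "id": i,
            "text": b["Text"],
            "bbox": [left, top,
                     left + int(box["Width"] * page_w),
                     top + int(box["Height"] * page_h)],
        })
    return text_boxes
```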


gibotsgithub avatar gibotsgithub commented on August 26, 2024

@Neelesh1121 @NishuKumar @Karthik1904 I have created the JSON files in the required format. I have data for 400 invoices, but main_train_json.py gets killed because it uses all the RAM. Did you face this issue? I have 16 GB of RAM.

