Comments (15)

Dhirendramohanjha avatar Dhirendramohanjha commented on August 26, 2024

@muralimariyappan, can you share sample input data?

from cutie.

muralimariyappan avatar muralimariyappan commented on August 26, 2024

I used the same format as this file https://github.com/4kssoft/CUTIE/blob/master/invoice_data/Faktura1.pdf_0.json
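
In case it helps others, the linked sample roughly follows the shape sketched below as a Python dict. The key names (`text_boxes`, `bbox`, `text`, `fields`, `field_name`, `value_id`, `value_text`) are my reading of the sample file, not an official spec; verify against Faktura1.pdf_0.json itself before relying on them:

```python
# Minimal sketch of the per-page JSON layout used by CUTIE's invoice_data
# samples. Key names are assumptions from the linked file -- check
# Faktura1.pdf_0.json for the authoritative structure.
sample = {
    "text_boxes": [
        {"id": 1, "bbox": [104, 80, 208, 110], "text": "Faktura"},
        {"id": 2, "bbox": [104, 120, 180, 150], "text": "2019-05-07"},
    ],
    "fields": [
        {"field_name": "invoice_date", "value_id": [2], "value_text": ["2019-05-07"]},
    ],
}
```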


SiddharthGadekar avatar SiddharthGadekar commented on August 26, 2024

Hi Murali,
Were you able to get a solution to this?
I am facing the same issue


muralimariyappan avatar muralimariyappan commented on August 26, 2024

No not yet


B34sTRider avatar B34sTRider commented on August 26, 2024

What is the input format for prediction? I used main_evaluate_json.py to get output with the validation code. The output converts numbers to 0s, and substrings are split with "##" added as a prefix. Is it possible to get the actual text output from the text_boxes key in the input JSON?
I did change update_dict=True for evaluation, then commented out the tokenization code where numbers are converted to 0s and replaced the final string " ##" with "", which gives me the output to some extent. But why are we masking the data?
As suggested, I updated the dict using main_dict_json.py with the training data, and also updated vocab.txt with the new words.
If I change update_dict=False for evaluation, I only get [UNK] for all words for both GT/INF.

@muralimariyappan Can you please tell me which tokenization code you commented out to stop the numbers being converted to 0? I am facing a similar issue.


Neelesh1121 avatar Neelesh1121 commented on August 26, 2024

@muralimariyappan, to avoid the numbers getting converted to 0, you can comment out the script just above the word tokenization. After commenting out both the word tokenization and the number tokenization, I do get some output, but my testing accuracy does not rise above 0.25, and the training accuracy also plateaus after a certain epoch at a low value around 0.5 and then repeats that value for further epochs. Can you share what accuracy you are getting, and if it is good, how many training documents you used?


vsymbol avatar vsymbol commented on August 26, 2024

What is the input format for prediction? I used main_evaluate_json.py to get output with the validation code. The output converts numbers to 0s, and substrings are split with "##" added as a prefix. Is it possible to get the actual text output from the text_boxes key in the input JSON?
I did change update_dict=True for evaluation, then commented out the tokenization code where numbers are converted to 0s and replaced the final string " ##" with "", which gives me the output to some extent. But why are we masking the data?
As suggested, I updated the dict using main_dict_json.py with the training data, and also updated vocab.txt with the new words.
If I change update_dict=False for evaluation, I only get [UNK] for all words for both GT/INF.

This is a good question.

Since numbers carry almost the same meaning in receipts, we convert numbers into zeros so that only the length of the number counts, e.g. "1234"→"0000", "55555"→"00000". Moreover, different combinations of digits would produce a countless number of types for the model to learn, which is difficult.

All in all, the masking helps the model focus on the most important signal for our task, which is the length of the numbers.

If your task requires the actual digits rather than just the length of the numbers, comment out the related line of code that masks the numbers to 0s.
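
The masking described above amounts to a digit-for-digit substitution. A minimal sketch (not the actual loop in data_loader_json.py, which walks characters with an is_number check, but equivalent for ASCII digits):

```python
import re

def mask_digits(text: str) -> str:
    """Replace every digit with '0', preserving the number's length."""
    return re.sub(r"\d", "0", text)

print(mask_digits("1234"))          # -> 0000
print(mask_digits("Total: 55.99"))  # -> Total: 00.00
```

Note that the mapping is lossy by design: "1234" and "9876" both become "0000", so the original digits cannot be recovered from the model output alone.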


Neelesh1121 avatar Neelesh1121 commented on August 26, 2024

@vsymbol I did it the same way and commented out that section for the numbers, but the issue is the accuracy of the result. Can you please suggest how we can improve the accuracy of the model? I have around 500 training documents.
Thanks in Advance!


muralimariyappan avatar muralimariyappan commented on August 26, 2024

What is the input format for prediction? I used main_evaluate_json.py to get output with the validation code. The output converts numbers to 0s, and substrings are split with "##" added as a prefix. Is it possible to get the actual text output from the text_boxes key in the input JSON?
I did change update_dict=True for evaluation, then commented out the tokenization code where numbers are converted to 0s and replaced the final string " ##" with "", which gives me the output to some extent. But why are we masking the data?
As suggested, I updated the dict using main_dict_json.py with the training data, and also updated vocab.txt with the new words.
If I change update_dict=False for evaluation, I only get [UNK] for all words for both GT/INF.

This is a good question.

Since numbers carry almost the same meaning in receipts, we convert numbers into zeros so that only the length of the number counts, e.g. "1234"→"0000", "55555"→"00000". Moreover, different combinations of digits would produce a countless number of types for the model to learn, which is difficult.

All in all, the masking helps the model focus on the most important signal for our task, which is the length of the numbers.

If your task requires the actual digits rather than just the length of the numbers, comment out the related line of code that masks the numbers to 0s.

Thank you for your reply. Can you also explain how to get the original text back after prediction?
The output still contains tokenized words with a ## prefix; I simply replaced it with a space. Special characters are also split out separately, which makes the output differ from the ground truth.
The bounding box is drawn, but the input JSON doesn't have coordinates for the model-generated output.
Do we have to pass the prediction through OCR again to get the original text?
As @Neelesh1121 mentioned, the accuracy is lower because of this. If you have handled the 0s and ## in your code, can you please share your prediction code?
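
For stripping the ## markers, merging each continuation token back into the previous word tends to work better than a plain string replace. A sketch, assuming BERT-style WordPiece output where "##" marks a continuation of the preceding token:

```python
def detokenize(tokens):
    """Merge BERT-style WordPiece tokens back into whole words.
    A token starting with '##' continues the previous word."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]
        else:
            words.append(tok)
    return " ".join(words)

print(detokenize(["fak", "##tura", "total"]))  # -> faktura total
```

This restores word boundaries, but not the masked digits or the coordinates; those have to come from matching the prediction back to the original text_boxes entries in the input JSON.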


NishuKumar avatar NishuKumar commented on August 26, 2024

I commented out the code below in data_loader_json.py to avoid masking while evaluating:

    # digit masking -- commented out
    # for i, c in enumerate(string):
    #     if is_number(c):
    #         string = string[:i] + '0' + string[i+1:]

    strings = [string]

    # WordPiece tokenization -- commented out
    # if self.tokenize:
    #     strings = self.tokenizer.tokenize(strings[0])
    #     print(string, '-->', strings)

    # tokenizer construction -- commented out
    # if self.tokenize:
    #     self.tokenizer = tokenization.FullTokenizer('dict/vocab.txt', do_lower_case=not self.text_case)


Karthik1904 avatar Karthik1904 commented on August 26, 2024

Hi All,
I have trained and evaluated the model,

but how can I predict on a new invoice image? Do I need a JSON file for the new image as well?


mohammedayub44 avatar mohammedayub44 commented on August 26, 2024

@Karthik1904 What data and/or OCR tool did you use for labeling and formatting the input to the right format ?


Ibmaria avatar Ibmaria commented on August 26, 2024

Hi guys, for annotation with bounding boxes and for better accuracy you can use AWS Textract, the Google OCR engine, or this link:
https://www.pyimagesearch.com/2020/05/25/tesseract-ocr-text-localization-and-detection/
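
With plain Tesseract, word-level boxes can be pulled out via pytesseract's image_to_data. A sketch (requires the Tesseract binary plus `pip install pytesseract pillow`; the `to_boxes`/`ocr_boxes` helper names and the output dict shape are my own, not CUTIE's):

```python
def to_boxes(data):
    """Convert pytesseract's image_to_data DICT output into word boxes
    with [left, top, right, bottom] pixel coordinates."""
    boxes = []
    for i, text in enumerate(data["text"]):
        if text.strip():  # skip empty OCR cells
            boxes.append({
                "text": text,
                "bbox": [data["left"][i], data["top"][i],
                         data["left"][i] + data["width"][i],
                         data["top"][i] + data["height"][i]],
            })
    return boxes

def ocr_boxes(image_path):
    """Run Tesseract on an image and return word-level boxes."""
    import pytesseract
    from PIL import Image
    data = pytesseract.image_to_data(Image.open(image_path),
                                     output_type=pytesseract.Output.DICT)
    return to_boxes(data)
```

The resulting list still needs to be reshaped into whatever key names the CUTIE sample JSON uses.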


mohammedayub44 avatar mohammedayub44 commented on August 26, 2024

Thanks @Ibmaria for the link. It looks like Textract returns a JSON, but it needs to be reformatted as suggested earlier.

I checked out your fork of the repo; you had run some tests with the SROIE dataset. How did you end up creating the JSON files for those? I was trying to do the same (raised an issue here: #18)
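
The reformatting step could look roughly like this. The input shape follows Textract's documented response (WORD blocks with a relative Geometry.BoundingBox); the output key names (`id`, `text`, `bbox`) are assumptions about the CUTIE side, so check them against the sample JSON:

```python
def textract_to_text_boxes(blocks, page_w, page_h):
    """Convert AWS Textract WORD blocks into a CUTIE-style text_boxes list.
    Textract bounding boxes are relative (0-1); scale by the page size
    to get pixel coordinates [left, top, right, bottom]."""
    text_boxes = []
    words = (b for b in blocks if b["BlockType"] == "WORD")
    for i, b in enumerate(words):
        box = b["Geometry"]["BoundingBox"]
        left = int(box["Left"] * page_w)
        top = int(box["Top"] * page_h)
        text_boxes.append({
            "id": i,
            "text": b["Text"],
            "bbox": [left, top,
                     left + int(box["Width"] * page_w),
                     top + int(box["Height"] * page_h)],
        })
    return text_boxes
```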


gibotsgithub avatar gibotsgithub commented on August 26, 2024

@Neelesh1121 @NishuKumar @Karthik1904 I have created the JSON files in the required format. I have data for 400 invoices, but main_train_json.py gets killed because it uses all the RAM. Did you face this issue? I have 16 GB of RAM.

