naivehobo / invoicenet
Deep neural network to extract intelligent information from invoice documents.
License: MIT License
Hi, excellent job. I have a few questions:
Which resolution is best for scanning invoices?
300 vs. 600 dpi? Black and white vs. grayscale?
The total amount on my invoice is 234.234,90. In the JSON file, do I have to enter it with the same separator characters?
Regards.
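On the separator question, here is a minimal sketch of one way to bring amounts such as 234.234,90 and 850,456.00 into a single form before writing them to a label file. `normalize_amount` is a hypothetical helper, not part of InvoiceNet. It assumes that the right-most of `.` or `,` is the decimal separator, which is wrong for values like 1,000 that contain only a thousands separator.

```python
def normalize_amount(text):
    """Return the amount with thousands separators removed and '.' as decimal mark.

    Assumption: whichever of '.' or ',' appears last is the decimal separator.
    """
    cleaned = text.strip().replace(" ", "")
    last_dot = cleaned.rfind(".")
    last_comma = cleaned.rfind(",")
    if last_comma > last_dot:
        # European style, e.g. 234.234,90: dots group thousands, comma is decimal
        cleaned = cleaned.replace(".", "").replace(",", ".")
    else:
        # US style, e.g. 850,456.00: commas group thousands
        cleaned = cleaned.replace(",", "")
    return cleaned


print(normalize_amount("234.234,90"))
print(normalize_amount("850,456.00"))
```

Whether the project expects the label in the original or the normalized form is not stated here, so treat this as a preprocessing idea rather than a requirement.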
Wouldn't it be the ultimate feature to get a UBL invoice from the input PDF?
File "/home/saurabh/Documents/InvoiceNet-master/env/lib/python3.7/site-packages/tensorflow/python/eager/context.py", line 2102, in execution_mode
yield
File "/home/saurabh/Documents/InvoiceNet-master/env/lib/python3.7/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 758, in _next_internal
output_shapes=self._flat_output_shapes)
File "/home/saurabh/Documents/InvoiceNet-master/env/lib/python3.7/site-packages/tensorflow/python/ops/gen_dataset_ops.py", line 2610, in iterator_get_next
_ops.raise_from_not_ok_status(e, name)
File "/home/saurabh/Documents/InvoiceNet-master/env/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 6843, in raise_from_not_ok_status
six.raise_from(core._status_to_exception(e.code, message), None)
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.OutOfRangeError: End of sequence [Op:IteratorGetNext]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/saurabh/Documents/InvoiceNet-master/env/lib/python3.7/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 772, in next
return self._next_internal()
File "/home/saurabh/Documents/InvoiceNet-master/env/lib/python3.7/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 764, in _next_internal
return structure.from_compatible_tensor_list(self._element_spec, ret)
File "/home/saurabh/anaconda3/lib/python3.7/contextlib.py", line 130, in __exit__
self.gen.throw(type, value, traceback)
File "/home/saurabh/Documents/InvoiceNet-master/env/lib/python3.7/site-packages/tensorflow/python/eager/context.py", line 2105, in execution_mode
executor_new.wait()
File "/home/saurabh/Documents/InvoiceNet-master/env/lib/python3.7/site-packages/tensorflow/python/eager/executor.py", line 67, in wait
pywrap_tfe.TFE_ExecutorWaitForAllPendingNodes(self._handle)
tensorflow.python.framework.errors_impl.OutOfRangeError: End of sequence
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "train.py", line 67, in <module>
main()
File "train.py", line 62, in main
early_stop_steps=args.early_stop_steps
File "/home/saurabh/Documents/InvoiceNet-master/invoicenet/common/trainer.py", line 44, in train
train_loss = model.train_step(next(train_iter))
File "/home/saurabh/Documents/InvoiceNet-master/env/lib/python3.7/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 736, in __next__
return self.next()
File "/home/saurabh/Documents/InvoiceNet-master/env/lib/python3.7/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 774, in next
raise StopIteration
StopIteration
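The last frames show `next(train_iter)` raising `StopIteration`, which is what any Python iterator does once it has no elements left. The traceback does not confirm whether the dataset was empty or simply shorter than the number of training steps. The sketch below reproduces the behaviour in plain Python and shows one way to guard against it. `take_batches` is a hypothetical helper, and a list stands in for the TensorFlow dataset.

```python
def take_batches(dataset, steps):
    """Draw `steps` batches, restarting the iterator whenever it runs out."""
    if len(dataset) == 0:
        # An empty dataset can never yield a batch, so fail with a clear message
        raise ValueError("dataset is empty; check that data preparation produced examples")
    iterator = iter(dataset)
    batches = []
    for _ in range(steps):
        try:
            batch = next(iterator)
        except StopIteration:
            # Exhausted: start another pass over the data
            iterator = iter(dataset)
            batch = next(iterator)
        batches.append(batch)
    return batches


print(take_batches(["a", "b", "c"], 5))
```

If the empty-dataset branch is the one that fires, the thing to check is the output folder of the data preparation step rather than the training code.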
It would be amazing if you could create documentation / a YouTube video / a podcast where you explain on a high level how it works.
I know and understand how CNNs work, but I haven't worked with attention-based models, and I think I am missing a couple of background papers needed to understand "Attend, Copy, Parse: End-to-end information extraction from documents". Maybe you could tell me which resources you think are good for understanding this paper? Or blog posts about it?
Pytesseract struggles with a lot of invoices; some very large, clear text cannot be read.
This is somewhat addressable with preprocessing in cv, such as blurring and thresholding, but it requires so much preprocessing that other invoices might start failing if they go through the same steps.
EasyOCR picks it up fine but is incredibly slow. I'd be happy to provide some working and non-working examples privately.
The text is very clear, but I suspect Tesseract struggles with the surrounding border and background. I'm able to remove most of it with preprocessing, but then, again, pytesseract performs worse on some other invoices.
Hi, how would you use invoicenet with multi-page documents?
When I debugged the code, I found that the pattern regex is not working for me. It converts everything to '-'.
The code is in invoicenet/acp/data.py:
pattern = text
pattern = re.sub(r"\p{Lu}", "X", pattern)
pattern = re.sub(r"\p{Ll}", "x", pattern)
pattern = re.sub(r"\p{N}", "0", pattern)
pattern = re.sub(r"[^Xx0]", "-", pattern)
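Python's built-in `re` module does not support `\p{...}` Unicode property classes, while the third-party `regex` package does, so the result depends on which module is bound to the name `re`. As a sketch that avoids the question entirely, the same X/x/0/- pattern can be built with the standard library's `unicodedata`. `text_pattern` is a hypothetical name and not the project's function.

```python
import unicodedata


def text_pattern(text):
    """Map uppercase letters to X, lowercase to x, numbers to 0, the rest to '-'."""
    out = []
    for ch in text:
        category = unicodedata.category(ch)
        if category == "Lu":            # uppercase letter, like \p{Lu}
            out.append("X")
        elif category == "Ll":          # lowercase letter, like \p{Ll}
            out.append("x")
        elif category.startswith("N"):  # any numeric category, like \p{N}
            out.append("0")
        else:
            out.append("-")
    return "".join(out)


print(text_pattern("Inv-2020 Ab"))
```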
Is there a possibility to train the model on all fields in my dataset?
Something like:
python train.py --field { FIELDS } --batch_size
Now, I can only train one field at a time?
Hello, can you please provide the dataset and the pickle file?
Whenever I train the model, I can prepare the data and it is located in the correct place, but every time I start the training I get an exception:
Exception in thread Thread-35:
Traceback (most recent call last):
File "C:\Users\Michael\Anaconda3\envs\invoicetest\lib\site-packages\tensorflow\python\client\session.py", line 1356, in _do_call
return fn(*args)
File "C:\Users\Michael\Anaconda3\envs\invoicetest\lib\site-packages\tensorflow\python\client\session.py", line 1341, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "C:\Users\Michael\Anaconda3\envs\invoicetest\lib\site-packages\tensorflow\python\client\session.py", line 1429, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.OutOfRangeError: End of sequence [[{{node IteratorGetNext_1}}]]
I've tried everything I could think of, but I'm unable to get it to work.
Any tips would be much appreciated.
Thanks in advance.
python prepare_data.py --data_dir train_data/
In the command above, where should I put my directory? For example, my directory is C:\Users\NITHIN\invoicenet\InvoiceNet. How should I write the command?
Thanks
Hi,
Can you please provide the dataset and supporting files? It looks interesting, and I want to run it and check.
Thanks in advance.
I am following this on GitHub. Interesting project. What is the road map and what do you need to take this project to the next level? Time? Money?
I am thinking that we could have a common repository for training so we can create one experienced training file together.
I am interested in translating this library to .NET but would like to do it at the right point.
I pre-processed all the data as indicated and I got this:
Exception in thread Thread-107:
Traceback (most recent call last):
File "C:\Users\nicol\anaconda3\envs\invoicenet\lib\threading.py", line 926, in _bootstrap_inner
self.run()
File "C:\Users\nicol\anaconda3\envs\invoicenet\lib\threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "C:\Users\nicol\Documents\InvoiceNet\invoicenet\gui\trainer.py", line 253, in _train
train_loss = model.train_step(next(train_iter))
File "C:\Users\nicol\anaconda3\envs\invoicenet\lib\site-packages\tensorflow\python\eager\def_function.py", line 780, in __call__
result = self._call(*args, **kwds)
File "C:\Users\nicol\anaconda3\envs\invoicenet\lib\site-packages\tensorflow\python\eager\def_function.py", line 823, in _call
self._initialize(args, kwds, add_initializers_to=initializers)
File "C:\Users\nicol\anaconda3\envs\invoicenet\lib\site-packages\tensorflow\python\eager\def_function.py", line 697, in _initialize
*args, **kwds))
File "C:\Users\nicol\anaconda3\envs\invoicenet\lib\site-packages\tensorflow\python\eager\function.py", line 2855, in _get_concrete_function_internal_garbage_collected
graph_function, _, _ = self._maybe_define_function(args, kwargs)
File "C:\Users\nicol\anaconda3\envs\invoicenet\lib\site-packages\tensorflow\python\eager\function.py", line 3213, in _maybe_define_function
graph_function = self._create_graph_function(args, kwargs)
File "C:\Users\nicol\anaconda3\envs\invoicenet\lib\site-packages\tensorflow\python\eager\function.py", line 3075, in _create_graph_function
capture_by_value=self._capture_by_value),
File "C:\Users\nicol\anaconda3\envs\invoicenet\lib\site-packages\tensorflow\python\framework\func_graph.py", line 986, in func_graph_from_py_func
func_outputs = python_func(*func_args, **func_kwargs)
File "C:\Users\nicol\anaconda3\envs\invoicenet\lib\site-packages\tensorflow\python\eager\def_function.py", line 600, in wrapped_fn
return weak_wrapped_fn().__wrapped__(*args, **kwds)
File "C:\Users\nicol\anaconda3\envs\invoicenet\lib\site-packages\tensorflow\python\eager\function.py", line 3735, in bound_method_wrapper
return wrapped_fn(*args, **kwargs)
File "C:\Users\nicol\anaconda3\envs\invoicenet\lib\site-packages\tensorflow\python\framework\func_graph.py", line 973, in wrapper
raise e.ag_error_metadata.to_exception(e)
ValueError: in user code:
C:\Users\nicol\Documents\InvoiceNet\invoicenet\acp\acp.py:85 train_step *
predictions = self.model(inputs, training=True)
C:\Users\nicol\Documents\InvoiceNet\invoicenet\acp\model.py:176 call *
parsed = self.parser(inputs=(x, context), training=training)
C:\Users\nicol\Documents\InvoiceNet\invoicenet\parsing\parsers.py:65 call *
empty_answer = tf.constant(InvoiceData.eos_idx, tf.int32, shape=(tf.shape(x)[0], self.seq_out))
C:\Users\nicol\anaconda3\envs\invoicenet\lib\site-packages\tensorflow\python\framework\constant_op.py:264 constant **
allow_broadcast=True)
C:\Users\nicol\anaconda3\envs\invoicenet\lib\site-packages\tensorflow\python\framework\constant_op.py:282 _constant_impl
allow_broadcast=allow_broadcast))
C:\Users\nicol\anaconda3\envs\invoicenet\lib\site-packages\tensorflow\python\framework\tensor_util.py:453 make_tensor_proto
if shape is not None and np.prod(shape, dtype=np.int64) == 0:
<__array_function__ internals>:6 prod
C:\Users\nicol\anaconda3\envs\invoicenet\lib\site-packages\numpy\core\fromnumeric.py:2962 prod
keepdims=keepdims, initial=initial, where=where)
C:\Users\nicol\anaconda3\envs\invoicenet\lib\site-packages\numpy\core\fromnumeric.py:90 _wrapreduction
return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
ValueError: setting an array element with a sequence.
JSON given for Training:
{
"total_amount": "850,456.00",
"amount": ["2,000.00","1,000.00","1,000.00","2,000.00","1,000.00","2,000.00"]
}
Note: Image taken from Google Images
I want to include multiple values for the same field, like "amount" in the example above.
Error:
Exception in Tkinter callback
Traceback (most recent call last):
File "C:\Users\Anshul\Anaconda3\envs\CO\lib\tkinter\__init__.py", line 1705, in __call__
return self.func(*args)
File "C:\Users\Anshul\InvoiceNet\invoicenet\gui\trainer.py", line 363, in _prepare_data
fields[field] = util.normalize(labels[field], key="amount")
File "C:\Users\Anshul\InvoiceNet\invoicenet\common\util.py", line 199, in normalize
text = text.replace(",", '')
AttributeError: 'list' object has no attribute 'replace'
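The `AttributeError` shows `normalize` calling `.replace` on the label value, so that code path expects each label to be a single string. A minimal sketch for spotting list-valued fields in a label file before data preparation follows. `find_non_string_labels` is a hypothetical helper and not part of InvoiceNet.

```python
import json


def find_non_string_labels(label_json):
    """Return {field: type name} for every label whose value is not a string."""
    labels = json.loads(label_json)
    return {
        field: type(value).__name__
        for field, value in labels.items()
        if not isinstance(value, str)
    }


example = '{"total_amount": "850,456.00", "amount": ["2,000.00", "1,000.00"]}'
print(find_non_string_labels(example))
```

This only reports the problem. How repeated values such as line-item amounts should be represented is a question for the maintainers.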
Any plans to integrate with or build an annotation tool to help create the JSON training data? Any suggestions on what to use? Doccano comes to mind. What have you used to label your train/test invoices?
Hi there,
We're currently predicting the amount fields. Does anyone else get weird output? Some of our predictions look like this: 25..5 or 111..4. Does anyone have a clue why this happens?
Since one can easily extract the information and write it to a .json file, I was wondering if anybody has an idea on how to convert that .json data to .xlsx? Or maybe to write the extracted information directly into a .xlsx file through tweaking the code. I don't seem to find a suitable option to do so.
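As a sketch of one option using only the standard library, the extracted JSON objects can be collected into a CSV file, which Excel opens directly. `predictions_to_csv` is a hypothetical helper, and the field names in the example are illustrative. For a real .xlsx file, pandas' `DataFrame.to_excel` is an alternative, assuming pandas and an Excel writer engine such as openpyxl are installed.

```python
import csv
import io
import json


def predictions_to_csv(json_documents, out_file):
    """Write one CSV row per JSON object; missing fields become empty cells."""
    rows = [json.loads(doc) for doc in json_documents]
    # Use the union of all keys so invoices with different fields share one header
    fieldnames = sorted({key for row in rows for key in row})
    writer = csv.DictWriter(out_file, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return fieldnames


buffer = io.StringIO()  # stands in for open("predictions.csv", "w", newline="")
predictions_to_csv(
    ['{"total_amount": "850456.00"}',
     '{"invoice_number": "A-1", "total_amount": "12.50"}'],
    buffer,
)
print(buffer.getvalue())
```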
Hello,
I tried training on some invoices to extract the total. It worked fine only with the NoOp parser. Do I need to pre-train the parser myself?
Thanks for the repository. You include two parser checkpoints in it. Are they trained parsers, or should I still synthesize a dataset and train my own? I don't see it documented anywhere. Thanks.
You might want to mention this in the docs:
To fix the wand.exceptions.PolicyError: not authorized error:
The user needs to change the policy in /etc/ImageMagick-6/policy.xml to allow reading PDFs,
i.e: <policy domain="coder" rights="read" pattern="PDF" />
Hello,
This project looks interesting. In the paper it says that input is in HOCR format. Can you help me to figure out how to input a document to this project?
Have a good day !
I'm not sure what the real issue is, but when I run the code for inference I get this error:
AttributeError: 'Checkpoint' object has no attribute 'read'
It points to this code in acp.py:
self.checkpoint.read(self.restore_all_path).expect_partial()
When I changed it to the following, the error went away:
self.checkpoint.restore(tf.train.latest_checkpoint(self.restore_all_path))
While running the command
!python predict.py --field invoice_number --invoice ../invoice1.pdf
I get this error:
Traceback (most recent call last):
File "predict.py", line 77, in <module>
main()
File "predict.py", line 53, in main
model = AttendCopyParse(field=args.field, restore=True)
File "/content/InvoiceNet/invoicenet/acp/acp.py", line 70, in __init__
raise Exception("No trained model available for the field '{}'".format(self.field))
Exception: No trained model available for the field 'invoice_number'
Can you please share your pretrained weights/model?
The example in readme shows
python train.py --field total_amount --batch_size 8
and
python predict.py --field total_amount --invoice invoices/1.pdf
So is the model for only one field? If not, how do I specify several fields?
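Both README commands take a single `--field`. Assuming the command line really accepts only one field per run, which is not confirmed here, one workaround is to train one model per field in a loop. The sketch below only builds the commands. `build_train_commands` is a hypothetical helper, and the second field name is taken from another report on this page.

```python
import sys


def build_train_commands(fields, batch_size=8):
    """Return one `train.py` argument list per field, mirroring the README command."""
    return [
        [sys.executable, "train.py", "--field", field, "--batch_size", str(batch_size)]
        for field in fields
    ]


commands = build_train_commands(["total_amount", "invoice_number"])
for command in commands:
    # Print the command; pass it to subprocess.run(command, check=True) to launch it
    print("python " + " ".join(command[1:]))
```

Prediction would then be run once per field in the same way, each run loading the model trained for that field.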
Hi, I saw the invoicenet.common.model module. Is it not implemented yet?
When will it be implemented?
And what kind of model is used here?
error: c:\Users\miniconda3\envs\InvoiceNet\lib\site-packages\tensorflow\python\data\ops\iterator_ops.py", line 774, in next
raise StopIteration
StopIteration
Can anyone help me?
Hello,
I have an issue when the field of interest is long. For example, if it has 5, 6 or more words, only the first part is extracted. Can you advise what I can do to fix this? Is the n-gram setting something I can try here?
Hi,
Thank you for sharing your implementation.
I have a question regarding the CloudScan model. Do you implement the proposed architecture: dense - dense - BI-LSTM - dense - dense?
As far as I understood your code, this is not the case.
If I am right, what is your reason to diverge from the proposed architecture?
Cheers, Michael
What is the minimum number of invoice documents required to get a decent generalized model?
Hi, thanks for sharing, it's quite useful. Could you share a sample training file, dftrain.pk? I would like to see its exact format.
I am working with prepare_data.py. Please check the details below. I successfully created the images from the .pdf files, and they are saved in the processed_data folder (train, val).
My code throws an exception from this line: ngrams = util.create_ngrams(page). Please see below.
Skipping C:\Users\Daoud\Documents\ComputerVision\InvoiceNet-master\train_data\9.pdf : tesseract is not installed or it's not in your PATH.
I have already installed Tesseract.
The Tesseract path on my system is "C:\Program Files (x86)\Tesseract-OCR".
I cannot find a line in prepare_data.py that calls Tesseract, so I am a little confused.
Please suggest a solution.
Thank you in advance.
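The "not installed or it's not in your PATH" message is what pytesseract reports when it cannot launch the `tesseract` executable, so having Tesseract installed is not enough if its folder is missing from PATH. A minimal sketch for checking this follows. `find_tesseract` is a hypothetical helper, and the assumption that the executable is named `tesseract.exe` inside the reported folder is mine.

```python
import os
import shutil


def find_tesseract(extra_dirs=()):
    """Return the path to the tesseract executable, or None if it cannot be found."""
    found = shutil.which("tesseract")  # searches the directories listed in PATH
    if found:
        return found
    for directory in extra_dirs:
        for name in ("tesseract.exe", "tesseract"):
            candidate = os.path.join(directory, name)
            if os.path.isfile(candidate):
                return candidate
    return None


path = find_tesseract([r"C:\Program Files (x86)\Tesseract-OCR"])
if path is None:
    print("tesseract not found; add its folder to PATH")
else:
    print("using", path)
    # With pytesseract installed, the executable can also be set explicitly:
    # import pytesseract
    # pytesseract.pytesseract.tesseract_cmd = path
```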
Restoring total_amount parser ./models/parsers/amount/best...
Restoring all ./models/invoicenet/total_amount/best...
invoice-template-us-neat-750px.png 0/0
Filename: invoice-template-us-neat-750px.json
Traceback (most recent call last):
File "predict.py", line 92, in <module>
main()
File "predict.py", line 83, in main
labels[field] = predictions[field][idx]
IndexError: list index out of range
Is this an issue because Tesseract wasn't able to read the data? I have attached the image file.
Hello,
I am very curious about your work. However, I would need a PDF invoice dataset to replicate it.
Do you know of any labelled PDF invoice dataset?
Thank you in advance for your attention.
Salvador.
Does this software support extracting date from multiple types of forms?
Exception in Tkinter callback
Traceback (most recent call last):
File "C:\Users\nicol\anaconda3\envs\tf\lib\tkinter\__init__.py", line 1705, in __call__
return self.func(*args)
File "C:\Users\nicol\Documents\InvoiceNet\invoicenet\gui\trainer.py", line 371, in _prepare_data
ngrams = util.create_ngrams(page)
File "C:\Users\nicol\Documents\InvoiceNet\invoicenet\common\util.py", line 118, in create_ngrams
words = extract_words(img)
File "C:\Users\nicol\Documents\InvoiceNet\invoicenet\common\util.py", line 91, in extract_words
for i in range(n_boxes) if data['text'][i]]
File "C:\Users\nicol\Documents\InvoiceNet\invoicenet\common\util.py", line 91, in <listcomp>
for i in range(n_boxes) if data['text'][i]]
IndexError: list index out of range
I'm guessing this happens when there is no Tesseract result, or when the result is an empty string. When I print data, some are empty strings, which equates to False:
'kdakdlasd', '', 'sdsdasas'
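Empty strings alone would only be filtered out by the `if data['text'][i]` condition, so the `IndexError` suggests that `n_boxes` is larger than the length of `data['text']`. That is a guess from the traceback, not something confirmed here. A sketch that iterates over the columns together, and so cannot index past the shortest one, follows. `non_empty_words` is a hypothetical helper, and the key names assume the dictionary output of pytesseract's `image_to_data`.

```python
def non_empty_words(data):
    """Collect boxes with non-blank text, stopping at the shortest column."""
    keys = ("text", "left", "top", "width", "height")
    columns = [data.get(key, []) for key in keys]
    words = []
    # zip stops at the shortest list, so mismatched lengths cannot raise IndexError
    for text, left, top, width, height in zip(*columns):
        if text and text.strip():
            words.append(
                {"text": text, "left": left, "top": top, "width": width, "height": height}
            )
    return words


sample = {
    "text": ["kdakdlasd", "", "sdsdasas"],
    "left": [0, 5, 10, 20],  # one entry more than "text" on purpose
    "top": [0, 0, 0, 0],
    "width": [4, 4, 4, 4],
    "height": [2, 2, 2, 2],
}
print([word["text"] for word in non_empty_words(sample)])
```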
Did anybody get good results? I trained on 2000 invoices, and I would say the results would be better with key-value extraction.
Is early stopping an option when working from the command line? I keep counting 500 steps and keeping track of the val_loss and it won't stop training. Am I missing something?
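A traceback elsewhere on this page shows train.py passing `early_stop_steps=args.early_stop_steps` to the trainer, which suggests a command-line argument exists, although its exact spelling and semantics are not shown here. As a generic sketch of patience-based early stopping, and not InvoiceNet's implementation, training stops once the validation loss has failed to improve for a given number of consecutive checks. `steps_until_stop` is a hypothetical name.

```python
def steps_until_stop(val_losses, early_stop_steps):
    """Return the 1-based check at which training would stop, or None if it never does."""
    best = float("inf")
    since_best = 0
    for step, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            since_best = 0  # improvement resets the patience counter
        else:
            since_best += 1
        if since_best >= early_stop_steps:
            return step
    return None


print(steps_until_stop([1.0, 0.9, 0.95, 0.97, 0.96], early_stop_steps=3))
```

Note that under this scheme the counter resets on every improvement, however small, so a slowly decreasing validation loss can keep training running well past the first 500 steps.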
Hi,
I am installing the package on Ubuntu 18.04 and getting the error below.
Reading package lists... Done
Building dependency tree
Reading state information... Done
libsm-dev is already the newest version (2:1.2.2-1).
libxext-dev is already the newest version (2:1.3.3-1).
libxrender-dev is already the newest version (1:0.9.10-1).
tesseract-ocr is already the newest version (4.00~git2288-10f4998a-2).
poppler-utils is already the newest version (0.62.0-2ubuntu2.10).
0 upgraded, 0 newly installed, 0 to remove and 4 not upgraded.
./install.sh: line 7: virtualenv: command not found
./install.sh: line 8: env/bin/activate: No such file or directory
Processing /home/vagrant/InvoiceNet
Collecting Keras==2.4.3 (from InvoiceNet==0.1)
Downloading https://files.pythonhosted.org/packages/44/e1/dc0757b20b56c980b5553c1b5c4c32d378c7055ab7bfa92006801ad359ab/Keras-2.4.3-py2.py3-none-any.whl
Collecting Pillow==7.2.0 (from InvoiceNet==0.1)
Could not find a version that satisfies the requirement Pillow==7.2.0 (from InvoiceNet==0.1) (from versions: 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7.0, 1.7.1, 1.7.2, 1.7.3, 1.7.4, 1.7.5, 1.7.6, 1.7.7, 1.7.8, 2.0.0, 2.1.0, 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2, 2.4.0, 2.5.0, 2.5.1, 2.5.2, 2.5.3, 2.6.0, 2.6.1, 2.6.2, 2.7.0, 2.8.0, 2.8.1, 2.8.2, 2.9.0, 3.0.0, 3.1.0rc1, 3.1.0, 3.1.1, 3.1.2, 3.2.0, 3.3.0, 3.3.1, 3.3.2, 3.3.3, 3.4.0, 3.4.1, 3.4.2, 4.0.0, 4.1.0, 4.1.1, 4.2.0, 4.2.1, 4.3.0, 5.0.0, 5.1.0, 5.2.0, 5.3.0, 5.4.0, 5.4.1, 6.0.0, 6.1.0, 6.2.0, 6.2.1, 6.2.2)
No matching distribution found for Pillow==7.2.0 (from InvoiceNet==0.1)
Please help.
I am using Conda with Python 3.8.
I've installed InvoiceNet,
but when I try to run the trainer it throws this error:
python trainer.py
2020-12-02 16:39:44.983352: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-10.0/lib:/usr/local/cuda-10.0/lib64:/usr/local/cuda/extras/CUPTI/lib64
2020-12-02 16:39:44.983372: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
It seems to support only CUDA 10.1, not CUDA 10.0.
Hi guys,
I was wondering whether fields represented in a table (like the line-item fields) are supported.
If yes, how do I set them up?
If not, that would really be nice to have.
Hi,
I need your help because I get the error message below every time I try to run the training. Is there a misstep that can lead to this problem?
Traceback (most recent call last):
File "C:\Users\micha\.conda\envs\invoicenet\lib\site-packages\tensorflow\python\eager\context.py", line 2102, in execution_mode
yield
File "C:\Users\micha\.conda\envs\invoicenet\lib\site-packages\tensorflow\python\data\ops\iterator_ops.py", line 758, in _next_internal
output_shapes=self._flat_output_shapes)
File "C:\Users\micha\.conda\envs\invoicenet\lib\site-packages\tensorflow\python\ops\gen_dataset_ops.py", line 2610, in iterator_get_next
_ops.raise_from_not_ok_status(e, name)
File "C:\Users\micha\.conda\envs\invoicenet\lib\site-packages\tensorflow\python\framework\ops.py", line 6843, in raise_from_not_ok_status
six.raise_from(core._status_to_exception(e.code, message), None)
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.OutOfRangeError: End of sequence [Op:IteratorGetNext]
Will it work for me in identifying invoices in Hebrew?
I assume we need to extract it from the logits in the predict() function in invoicenet/acp/acp.py, but I cannot get it to work. Any help/suggestions are appreciated.