Comments (14)
Good news for the detection of duplicate reimbursements. I wrote a notebook that converts PDF files to PNG and then detects common regions with SIFT.
Receipts used: 5645173 and 5645177.
Match regions:
So, after some experiments, I found that using only SIFT gives us a lot of false positives.
Look at this case mentioned by @weslleymberg
Here are the document_ids: 5886345 and 5886361.
SIFT keypoints:
Common regions between them:
So, I'm still working on the script to predict multiple reimbursements. I will try to combine the SIFT output with the OCR data we have in issue #188 to achieve better results.
As soon as possible I will share more news with you guys :D
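For readers who want to try the SIFT step, here is a minimal sketch of matching two receipt images that were already converted to PNG. It is not the code from the notebook: the function names, the file paths passed in, and the use of Lowe's ratio test (threshold 0.75) to drop weak matches are all assumptions made for illustration, and it requires an OpenCV build that provides `cv2.SIFT_create` (4.4 or later).

```python
def ratio_test(distance_pairs, ratio=0.75):
    """Keep the indices whose best match is clearly closer than the second best.

    distance_pairs is a list of (best_distance, second_best_distance) tuples.
    The 0.75 threshold is an assumed default, not a value from the thread.
    """
    return [
        index
        for index, (best, second) in enumerate(distance_pairs)
        if best < ratio * second
    ]


def count_sift_matches(path_a, path_b, ratio=0.75):
    """Count SIFT matches between two images that survive the ratio test."""
    # Imported here so ratio_test stays usable without OpenCV installed.
    import cv2

    image_a = cv2.imread(path_a, cv2.IMREAD_GRAYSCALE)
    image_b = cv2.imread(path_b, cv2.IMREAD_GRAYSCALE)
    if image_a is None or image_b is None:
        raise FileNotFoundError("could not read one of the receipt images")

    sift = cv2.SIFT_create()
    _, descriptors_a = sift.detectAndCompute(image_a, None)
    _, descriptors_b = sift.detectAndCompute(image_b, None)
    if descriptors_a is None or descriptors_b is None:
        return 0  # one image produced no keypoints at all

    # For every descriptor in A, find its two nearest neighbours in B.
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    neighbours = matcher.knnMatch(descriptors_a, descriptors_b, k=2)
    pairs = [(m[0].distance, m[1].distance) for m in neighbours if len(m) == 2]
    return len(ratio_test(pairs, ratio))
```

A high count only says the two images share many local features, which is also true of two different receipts printed on the same template, so the count is better treated as one signal among several than as a verdict.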
from serenata-de-amor.
@cuducos In this example they have distinct document_ids.
Before going to OCR, I'd try SIFT, which I believe is much faster since it does not depend on a vocabulary of words, just plain linear algebra.
Understood. I didn't know that the number of the receipt can be duplicated just by coincidence.
My thought at the time was that typing the wrong data might be a very common mistake, thus making document_number not very reliable for spotting a possible fraud with a duplicated receipt.
Hi @cuducos
Yes, I can open a PR, just let me play a little with this data this weekend :D
After that I will open the PR.
I spent too much time finding a way to convert the PDFs :/
I only just got to the point where I can start playing with the prediction.
If you remember by heart (otherwise I'll look for it in the .ipynb): do they have exactly the same document_number in the dataset? Or was this number mocked?
Just asking because if the real document_number differs (image vs dataset) we'll have to rely on OCR and stuff. If they are the same I think it's easier to spot.
Sounds great. SIFT is new to me but it looks very effective for this kind of stuff. Awesome!
I feel like SIFT is great for finding similar stuff (e.g., receipts with the same layout), but it is probably not going to be a good option for deciding whether 2 receipts are the same or not.
Check the paper "Region Duplication Forgery Detection Technique Based on SURF and HAC" for references (https://sci-hub.cc/ is your friend). Here's an example of Python code to run SIFT.
Came across 2 examples where 2 distinct reimbursements have the same document_number, but do not have the same receipt.
On the first one the value that is presented as the document_number is actually the congressperson's subscription number with the water company that issued the bills.
Here are the document_ids: 5886345 and 5886361. And the document_number is 0010100910378000. You can see this is the same number that is in the field "Inscrição" on both documents.
A similar thing happens with these other 2 documents: 5780419 and 5880166, where the operator's number (t00408151) of a highway toll is used as the document_number. Note that these two documents also have distinct applicant_ids (3044 and 1133).
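To make the point concrete, here is a small sketch that groups reimbursements by document_number. The four rows are the document_id and document_number values reported above; the function name and the row layout are hypothetical and not part of the project's code. Since both groups turn out to hold different receipts, a shared document_number can only flag candidates that still need an image or OCR comparison.

```python
from collections import defaultdict

# (document_id, document_number) pairs taken from the two cases reported above.
REPORTED_CASES = [
    (5886345, "0010100910378000"),
    (5886361, "0010100910378000"),
    (5780419, "t00408151"),
    (5880166, "t00408151"),
]


def candidates_by_document_number(rows):
    """Map each document_number shared by 2+ reimbursements to their document_ids."""
    groups = defaultdict(list)
    for document_id, document_number in rows:
        groups[document_number].append(document_id)
    # Keep only numbers that appear more than once: these are candidates, not proof.
    return {number: ids for number, ids in groups.items() if len(ids) > 1}


if __name__ == "__main__":
    for number, ids in candidates_by_document_number(REPORTED_CASES).items():
        print(number, ids)
```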
Just sharing Jarbas links of the receipts mentioned by @weslleymberg:
- http://jarbas.datasciencebr.com/#/document/5886345
- http://jarbas.datasciencebr.com/#/document/5886361
- http://jarbas.datasciencebr.com/#/document/5780419
- http://jarbas.datasciencebr.com/#/document/5880166
Came across 2 examples where 2 distinct reimbursements have the same document_number, but do not have the same receipt.
I'm not sure this is a problem per se. I mean, AFAIK the document_number is the number of the receipt, the number controlled by the supplier (each supplier, each company has its own control of sequential receipt numbering). In other words it can be just a coincidence. But… coinciding on both the document_number and the supplier is strange…
That said, it seems to me that it's a matter of typing the wrong data, not sure if it's compromising…
My thought at the time was that typing the wrong data might be a very common mistake, thus making document_number not very reliable for spotting a possible fraud with a duplicated receipt.
Good point!
That's awesome progress @silviodc! Many thanks for that. Even if the results still include lots of false positives, IMHO it would be great to have this notebook of yours in our master branch. Just add to the conclusions the issues your analysis raised for future researchers ; ) Do you fancy opening a PR?
Cheers
Hi everyone,
PR #238, covering the conversion of PDF to image and the use of SIFT, is up.
I also included a plan that I think could be interesting to follow to build the ML approach to detect duplicates.
In the near future I will try to do steps 3 and 4 that I mentioned there. However, if someone feels motivated, just go for it, I want to see it working!!