dedupeio / dedupe
A python library for accurate and scalable fuzzy matching, record deduplication and entity-resolution.
Home Page: https://docs.dedupe.io
License: MIT License
When scoring fields, treat empty fields differently.
Sometimes getting the following error when running the canonical example. This is most likely due to the invertIndex function failing when no TF/IDF canopy is chosen:
Traceback (most recent call last):
  File "test/canonical_test.py", line 123, in
    blocked_data = dedupe.blocking.blockingIndex(data_d, blocker)
  File "/Users/derekeder/projects/open-city/deduplication/dedupe/dedupe/blocking.py", line 173, in blockingIndex
    blocker.invertIndex(data_d.iteritems())
  File "/Users/derekeder/projects/open-city/deduplication/dedupe/dedupe/blocking.py", line 74, in invertIndex
    num_docs = len(self.token_vector[field])
UnboundLocalError: local variable 'field' referenced before assignment
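A minimal sketch of the suspected failure mode (hypothetical code, not the actual blocking.py): if no field uses a TF/IDF predicate, the for loop never runs, `field` is never bound, and the line after the loop raises exactly this error.

def invertIndex(tfidf_fields):
    # token_vector stands in for self.token_vector in the sketch
    token_vector = {}
    for field in tfidf_fields:
        token_vector[field] = {}
    # if tfidf_fields was empty, `field` was never assigned
    num_docs = len(token_vector[field])
    return num_docs

invertIndex([])  # UnboundLocalError: local variable 'field' referenced before assignment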
As mentioned in #60, we should look into a caching strategy that reduces the number of redundant comparisons but doesn't explode memory.
We want to call certain functions when running on smaller datasets that are processed 100% in memory. Abstract these functions into a helper class.
Bloodgood and Vijay Shanker's method looks good
http://www.aclweb.org/anthology/W/W09/W09-1107.pdf
When running on Mountain Lion
python setup.py install
executes successfully, but when the dedupe library is called, the following error occurs:
Traceback (most recent call last):
File "examples/canonical_example.py", line 3, in
import exampleIO
File "/Users/derekeder/projects/open-city/deduplication/dedupe/examples/exampleIO.py", line 3, in
import dedupe.core
File "/Users/derekeder/projects/open-city/deduplication/dedupe/dedupe/init.py", line 14, in
import affinegap
ImportError: dlopen(/Users/derekeder/projects/open-city/deduplication/dedupe/dedupe/affinegap.so, 2): Symbol not found: _newarrayobject
Referenced from: /Users/derekeder/projects/open-city/deduplication/dedupe/dedupe/affinegap.so
Expected in: flat namespace
in /Users/derekeder/projects/open-city/deduplication/dedupe/dedupe/affinegap.so
idea 1: a list of numbers
a list of numbers in the same order as the original dataset, holding the IDs (row numbers) of found duplicates and zeros for the rest
idea 2: 2 files
Fixed with SHA: 968d190
Michael Wick describes a joint approach to deduplication, clustering and canonicalization that makes an enormous amount of sense and seems to perform wonderfully. http://people.cs.umass.edu/~mwick/MikeWeb/Publications_files/wick09entity.pdf This approach is extremely attractive, but would require an understanding of probabilistic graphical models currently beyond my ken.
If we don't go that route, then we should use the approach described by Chaudhuri et al. ftp://ftp.research.microsoft.com/users/datacleaning/dedup_icde05.pdf, also explained in Naumann's /An Introduction to Duplicate Detection/ http://www.morganclaypool.com/doi/pdf/10.2200/S00262ED1V01Y201003DTM003
The big thing is we need better blocking. To get better blocking we need to find better predicates. Part of that is making better predicates available, like tf-idf, but the major part is supplying more positive examples. The major limit to that is the number of records that we can calculate record-distances between. Right now we are around 700.
We should work on that bottleneck. First assignment: consolidate these three near-identical functions into one that we can begin to optimize. https://gist.github.com/3761519
Some rows are showing up multiple times in different clusters. Is this expected?
Interaction terms: provide some kind of hint, perhaps based on calculated weights, as to what fields are good candidates for interaction terms
The current implementation uses a for loop; this could possibly be replaced with matrix multiplication (see the sketch below).
Interaction terms, i.e. affine gap distance of name field * affine gap distance of address field
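As a rough sketch of what that interaction feature could look like with numpy (column names and values are hypothetical; the real field distances live elsewhere):

import numpy

# one entry per candidate pair, one array per field distance
name_dist = numpy.array([0.1, 0.9, 0.4])
address_dist = numpy.array([0.2, 0.8, 0.5])

# the interaction term is the elementwise product of the two columns,
# which also covers the for-loop-to-vectorization idea above
name_x_address = name_dist * address_dist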
core.scoreDuplicates creates a large numpy array in memory which blows up as the number of records increases. Currently trying to reduce this by chunking the candidates using itertools.islice.
However, the memory used by the numpy arrays doesn't seem to be reclaimed by the garbage collector. This may be the issue: numpy/numpy#1601
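A rough sketch of the chunking idea, assuming candidates is a lazy iterator of record pairs (the helper name is hypothetical, not the actual core.scoreDuplicates code):

import itertools

def chunkedCandidates(candidates, chunk_size):
    # yield successive lists of at most chunk_size pairs so scoring
    # never materializes the full candidate set in one numpy array
    iterator = iter(candidates)
    while True:
        chunk = list(itertools.islice(iterator, chunk_size))
        if not chunk:
            break
        yield chunk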
Investigating further with memory_profiler and valgrind:
valgrind --tool=memcheck --suppressions=valgrind-python.supp python -E -tt ./dedupe/numpy_memory_test.py
valgrind --tool=massif /usr/bin/python dedupe/numpy_memory_test.py
Do we want to help the user in setting thresholds for selecting duplicates, near duplicates, or nonduplicates? http://en.wikipedia.org/wiki/Loss_function
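If we do, one possibility is to pick the threshold that minimizes expected loss under user-supplied costs. A sketch, assuming the scores are calibrated match probabilities (all numbers hypothetical):

import numpy

probabilities = numpy.array([0.95, 0.7, 0.4, 0.1])  # hypothetical pair scores
cost_false_positive = 1.0   # cost of merging two distinct records
cost_false_negative = 2.0   # cost of missing a true duplicate

thresholds = numpy.linspace(0, 1, 101)
losses = []
for t in thresholds:
    # pairs at or above the threshold risk being false positives,
    # pairs below it risk being false negatives
    fp = ((probabilities >= t) * (1 - probabilities)).sum()
    fn = ((probabilities < t) * probabilities).sum()
    losses.append(cost_false_positive * fp + cost_false_negative * fn)

best_threshold = thresholds[numpy.argmin(losses)]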
Description of some of the higher-level concepts and how to use the library. @rcackerman said she might help with this back at the ChiPy meeting.
Now that we have a clean, working csv example (https://github.com/open-city/dedupe/blob/master/examples/csv_example.py) it's time to set up an example that can handle much larger datasets.
For this we will create an example that reads, writes and operates over a sqlite database.
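Something along these lines, streaming records instead of building the whole data_d dictionary in memory (table and column names are hypothetical):

import sqlite3

connection = sqlite3.connect('example.db')
connection.row_factory = sqlite3.Row

def records(connection):
    # yield one (record_id, record) pair at a time from a
    # hypothetical `records` table keyed by an `id` column
    for row in connection.execute('SELECT * FROM records'):
        yield row['id'], dict(row)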
Right now we depend upon networkx to calculate the connected components of our dupes. However, we only use a single method, connected_components.
It might be better to just implement that one algorithm and take the dupes numpy array as an input. http://stackoverflow.com/questions/13191010/a-fast-way-to-find-connected-component-in-a-1-nn-graph
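A sketch of what a standalone replacement could look like, using union-find over the dupe pairs (names are hypothetical, not drop-in code):

def connectedComponents(edges):
    # edges is an iterable of (id_1, id_2) duplicate pairs
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)  # union the two components

    components = {}
    for node in list(parent):
        components.setdefault(find(node), []).append(node)
    return list(components.values())

connectedComponents([(1, 2), (2, 3), (4, 5)])  # [[1, 2, 3], [4, 5]] (order may vary)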
Includes learned weights and blocking
How much should we prefer a predicate that covers NO distinct pairs over a predicate that covers ONE distinct pair?
To learn a good blocking, we need to provide a lot of examples of nonduplicates, which active learning does not provide. Perhaps we can provide all pairs that we score as very likely to be nonduplicates.
Ignore tokens that account for more than x% of the total number of tokens when creating TF-IDF canopies.
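A sketch of how that filter might look (the threshold value is hypothetical):

from collections import Counter

def frequentTokens(corpus, threshold=0.05):
    # count every token across the corpus and return the ones that
    # account for more than `threshold` of all token occurrences,
    # so canopy creation can skip them
    counts = Counter(token for document in corpus for token in document.split())
    total = sum(counts.values())
    return set(token for token, n in counts.items() if n / float(total) > threshold)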
Talked to a guy who used to work for NavTec. He said the way they handled address deduplication was by first matching street names to a canonical list, and then checking to see if the address was on the same 'block'. With TIGER or OSM, we could definitely do that.
We currently have two ways of testing each piece of dedupe:
We should think of a way to do tests in a consistent way. Here are some things to read up on:
We should have affinegap.pyx compile to pure Python as well, so we do not have to maintain two diverging affine gap distance functions.
There are a number of approaches to 'normalizing an edit distance'
I haven't tried 1, as it is a more complicated algorithm. There does not seem to be any appreciable difference in performance between 2, 3, and 4. 4 has an advantage over 2 and 3: if the unnormalized edit distance is a metric, so is the normalized edit distance.
Move this block of code into Cython:
field_distances[i] = [stringDistance(record_1[name], record_2[name]) for name in base_fields]
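A rough .pyx sketch of what that could look like, assuming stringDistance (the same function used in the list comprehension above) is importable or cdef'd at the C level; this is a guess at shape, not working dedupe code:

import numpy
cimport numpy

def fieldDistances(record_1, record_2, base_fields):
    cdef int j
    cdef int n = len(base_fields)
    cdef numpy.ndarray[numpy.float64_t, ndim=1] distances = numpy.empty(n)
    for j in range(n):
        name = base_fields[j]
        # stringDistance is the existing affine gap distance function
        distances[j] = stringDistance(record_1[name], record_2[name])
    return distances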
For a cluster of duplicates, return a canonical representation of the entity we hope they all refer to.
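A naive sketch of one way to do this (the Wick paper mentioned earlier argues for doing canonicalization jointly; this is just a per-field majority vote, assuming clusters are lists of field dicts):

from collections import Counter

def canonicalize(cluster):
    # for each field, take the most common non-empty value among
    # the cluster's records as the canonical one
    canonical = {}
    for field in cluster[0].keys():
        values = [record[field] for record in cluster if record[field]]
        if values:
            canonical[field] = Counter(values).most_common(1)[0][0]
    return canonical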
Affine gap should be made more robust to handle this input.
It also raises a bigger question of how we handle empty fields in general. Do we throw them out completely?
Implementing Disjunctive blocking broke the writeSettings function.
We have implemented the Chaudhuri and hierarchical clustering algorithms, neither with the best results. Need to do more research.
Improving pairwise scoring (#3) will help regardless of which clustering approach we take.
option 1: ask Daniel Mullner (fastcluster author) to make SciPy import on line 27 of fastcluster.py optional
option 2: fork fastcluster, remove line, add to dedupe package (GPL license restrictions)
option 3: alternative library to fastcluster
Bilenko pre-processes the data by lowering the case and removing non-alphanumeric characters. Should we do this, or leave it to the user? Having everything be the same case helps our performance, but removing punctuation does not have much effect.
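For reference, the Bilenko-style normalization is tiny (a sketch):

import re

def preprocess(value):
    # lowercase everything and strip non-alphanumeric characters,
    # keeping whitespace between tokens
    return re.sub(r'[^a-z0-9\s]', '', value.lower())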
We will need to implement the Canopies algorithm http://www.kamalnigam.com/papers/canopy-kdd00.pdf
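A sketch of the algorithm from that paper, assuming a cheap distance function and loose > tight thresholds (all names hypothetical):

def canopies(records, distance, loose, tight):
    remaining = set(records)
    canopy_list = []
    while remaining:
        center = remaining.pop()  # the paper picks this point at random
        canopy = set([center])
        for point in list(remaining):
            d = distance(center, point)
            if d < loose:
                canopy.add(point)  # loosely-close points join the canopy
            if d < tight:
                remaining.discard(point)  # tightly-close points can't seed new canopies
        canopy_list.append(canopy)
    return canopy_list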
Our algorithm to learn blocking can only handle a modest number of examples, and we currently have code to reduce the number of examples the code looks at if it is passed too many.
At the same time we sometimes make a pretty expensive call to recordDistances in semiSupervisedLearning to find likely distinct pairs to feed to our blocking training.
This is not a critical performance issue, because in the typical case, when we use active learning, we don't make that call in semiSupervisedLearning.
However, it slows down the dev cycle, as about 20 of the 30 seconds it takes to run canonical_test.py are taken up by this function call. It also seems pretty smelly.
http://ozkatz.github.com/better-python-apis.html