dedupeio / dedupe
A python library for accurate and scalable fuzzy matching, record deduplication and entity-resolution.
Home Page: https://docs.dedupe.io
License: MIT License
When scoring fields, treat empty fields differently.
Sometimes getting the following error when running the canonical example. This is most likely due to the invertIndex function failing when no TF/IDF canopy is chosen:
Traceback (most recent call last):
  File "test/canonical_test.py", line 123, in
    blocked_data = dedupe.blocking.blockingIndex(data_d, blocker)
  File "/Users/derekeder/projects/open-city/deduplication/dedupe/dedupe/blocking.py", line 173, in blockingIndex
    blocker.invertIndex(data_d.iteritems())
  File "/Users/derekeder/projects/open-city/deduplication/dedupe/dedupe/blocking.py", line 74, in invertIndex
    num_docs = len(self.token_vector[field])
UnboundLocalError: local variable 'field' referenced before assignment
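A minimal sketch of the suspected failure mode (hypothetical code, not the actual blocking.py): if no field uses a TF/IDF predicate, the for loop never runs, `field` is never bound, and the line after the loop raises exactly this error.

def invertIndex(tfidf_fields):
    # token_vector stands in for self.token_vector in the sketch
    token_vector = {}
    for field in tfidf_fields:
        token_vector[field] = {}
    # if tfidf_fields was empty, `field` was never assigned
    num_docs = len(token_vector[field])
    return num_docs

invertIndex([])  # UnboundLocalError: local variable 'field' referenced before assignment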
As mentioned in #60, we should look into a caching strategy that reduces the number of redundant comparisons but doesn't explode memory.
We want to call certain functions when running on smaller datasets that are processed 100% in memory. Abstract these functions into a helper class.
Bloodgood and Vijay Shanker's method looks good
http://www.aclweb.org/anthology/W/W09/W09-1107.pdf
When running on Mountain Lion
python setup.py install
executes successfully, but when the dedupe library is called, the following error occurs:
Traceback (most recent call last):
File "examples/canonical_example.py", line 3, in
import exampleIO
File "/Users/derekeder/projects/open-city/deduplication/dedupe/examples/exampleIO.py", line 3, in
import dedupe.core
File "/Users/derekeder/projects/open-city/deduplication/dedupe/dedupe/init.py", line 14, in
import affinegap
ImportError: dlopen(/Users/derekeder/projects/open-city/deduplication/dedupe/dedupe/affinegap.so, 2): Symbol not found: _newarrayobject
Referenced from: /Users/derekeder/projects/open-city/deduplication/dedupe/dedupe/affinegap.so
Expected in: flat namespace
in /Users/derekeder/projects/open-city/deduplication/dedupe/dedupe/affinegap.so
idea 1: a list of numbers
a list of numbers in the same order as the original dataset, holding the IDs (row numbers) of found duplicates and zeros for the rest
idea 2: 2 files
Fixed with SHA: 968d190
Michael Wick describes a joint approach to deduplication, clustering and canonicalization that makes an enormous amount of sense and seems to perform wonderfully. http://people.cs.umass.edu/~mwick/MikeWeb/Publications_files/wick09entity.pdf This approach is extremely attractive, but would require an understanding of probabilistic graphical models currently beyond my ken.
If we don't go that route, then we should use the approach described by Chaudhuri et al. ftp://ftp.research.microsoft.com/users/datacleaning/dedup_icde05.pdf, also explained in Naumann's /An Introduction to Duplicate Detection/ http://www.morganclaypool.com/doi/pdf/10.2200/S00262ED1V01Y201003DTM003
The big thing is we need better blocking. To get better blocking we need to find better predicates. Part of that is making better predicates available, like tf-idf, but the major part is supplying more positive examples. The major limit to that is the number of records that we can calculate record-distances between. Right now we are around 700.
We should work on that bottleneck. First assignment: consolidate these three near-identical functions into one that we can begin to optimize. https://gist.github.com/3761519
Some rows are showing up multiple times in different clusters. Is this expected?
Interaction terms: provide some kind of hint, perhaps based on calculated weights, as to what fields are good candidates for interaction terms
The current implementation uses a for loop; this could possibly be replaced with matrix multiplication (see the sketch below).
Interaction terms, i.e. affine gap distance of name field * affine gap distance of address field
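As a rough sketch of what that interaction feature could look like with numpy (column names and values are hypothetical; the real field distances live elsewhere):

import numpy

# one entry per candidate pair, one array per field distance
name_dist = numpy.array([0.1, 0.9, 0.4])
address_dist = numpy.array([0.2, 0.8, 0.5])

# the interaction term is the elementwise product of the two columns,
# which also covers the for-loop-to-vectorization idea above
name_x_address = name_dist * address_dist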
core.scoreDuplicates creates a large numpy array in memory which blows up as the number of records increases. Currently trying to reduce this by chunking the candidates using itertools.islice.
However, the memory used by the numpy arrays doesn't seem to be reclaimed by the garbage collector. This may be the issue: numpy/numpy#1601
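A rough sketch of the chunking idea, assuming candidates is a lazy iterator of record pairs (the helper name is hypothetical, not the actual core.scoreDuplicates code):

import itertools

def chunkedCandidates(candidates, chunk_size):
    # yield successive lists of at most chunk_size pairs so scoring
    # never materializes the full candidate set in one numpy array
    iterator = iter(candidates)
    while True:
        chunk = list(itertools.islice(iterator, chunk_size))
        if not chunk:
            break
        yield chunk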
Investigating further with memory_profiler and valgrind:
valgrind --tool=memcheck --suppressions=valgrind-python.supp python -E -tt ./dedupe/numpy_memory_test.py
valgrind --tool=massif /usr/bin/python dedupe/numpy_memory_test.py
Do we want to help the user in setting thresholds for selecting duplicates, near duplicates, or nonduplicates? http://en.wikipedia.org/wiki/Loss_function
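If we do, one possibility is to pick the threshold that minimizes expected loss under user-supplied costs. A sketch, assuming the scores are calibrated match probabilities (all numbers hypothetical):

import numpy

probabilities = numpy.array([0.95, 0.7, 0.4, 0.1])  # hypothetical pair scores
cost_false_positive = 1.0   # cost of merging two distinct records
cost_false_negative = 2.0   # cost of missing a true duplicate

thresholds = numpy.linspace(0, 1, 101)
losses = []
for t in thresholds:
    # pairs at or above the threshold risk being false positives,
    # pairs below it risk being false negatives
    fp = ((probabilities >= t) * (1 - probabilities)).sum()
    fn = ((probabilities < t) * probabilities).sum()
    losses.append(cost_false_positive * fp + cost_false_negative * fn)

best_threshold = thresholds[numpy.argmin(losses)]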
Description of some of the higher-level concepts and how to use the library. @rcackerman said she might help with this back at the ChiPy meeting.
Now that we have a clean, working csv example (https://github.com/open-city/dedupe/blob/master/examples/csv_example.py) it's time to set up an example that can handle much larger datasets.
For this we will create an example that reads, writes and operates over a sqlite database.
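Something along these lines, streaming records instead of building the whole data_d dictionary in memory (table and column names are hypothetical):

import sqlite3

connection = sqlite3.connect('example.db')
connection.row_factory = sqlite3.Row

def records(connection):
    # yield one (record_id, record) pair at a time from a
    # hypothetical `records` table keyed by an `id` column
    for row in connection.execute('SELECT * FROM records'):
        yield row['id'], dict(row)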
Right now we depend upon networkx to calculate the connected components of our dupes. However, we only use a single method, connected_components.
It might be better to just implement that one algorithm and take the dupes numpy array as an input. http://stackoverflow.com/questions/13191010/a-fast-way-to-find-connected-component-in-a-1-nn-graph
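A sketch of what a standalone replacement could look like, using union-find over the dupe pairs (names are hypothetical, not drop-in code):

def connectedComponents(edges):
    # edges is an iterable of (id_1, id_2) duplicate pairs
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)  # union the two components

    components = {}
    for node in list(parent):
        components.setdefault(find(node), []).append(node)
    return list(components.values())

connectedComponents([(1, 2), (2, 3), (4, 5)])  # [[1, 2, 3], [4, 5]] (order may vary)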
Includes learned weights and blocking
How much should we prefer a predicate that covers NO distinct pairs over a predicate that covers ONE distinct pair?
To learn a good blocking, we need to provide a lot of examples of nonduplicates, which active learning does not provide. Perhaps we can provide all pairs that we score as very likely to be nonduplicates.
Ignore tokens that account for more than x% of the total number of tokens when creating TF-IDF canopies.
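A sketch of how that filter might look (the threshold value is hypothetical):

from collections import Counter

def frequentTokens(corpus, threshold=0.05):
    # count every token across the corpus and return the ones that
    # account for more than `threshold` of all token occurrences,
    # so canopy creation can skip them
    counts = Counter(token for document in corpus for token in document.split())
    total = sum(counts.values())
    return set(token for token, n in counts.items() if n / float(total) > threshold)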
Talked to a guy who used to work for NavTec. He said the way they handled address deduplication was by first matching street names to a canonical list, and then checking to see if the address was on the same 'block'. With TIGER or OSM, we could definitely do that.
We currently have two ways of testing each piece of dedupe:
We should think of a way to do tests in a consistent way. Here are some things to read up on:
We should have affinegap.pyx compile to pure Python as well, so we do not have to maintain two diverging affine gap distance functions.
There are a number of approaches to 'normalizing an edit distance'
I haven't tried 1, as it is a more complicated algorithm. There does not seem to be any appreciable difference in performance between 2, 3, and 4. 4 has an advantage over 2 and 3: if the unnormalized edit distance is a metric, so is the normalized edit distance.
Move this block of code into Cython:
field_distances[i] = [stringDistance(record_1[name], record_2[name]) for name in base_fields]
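A rough .pyx sketch of what that could look like, assuming stringDistance (the same function used in the list comprehension above) is importable or cdef'd at the C level; this is a guess at shape, not working dedupe code:

import numpy
cimport numpy

def fieldDistances(record_1, record_2, base_fields):
    cdef int j
    cdef int n = len(base_fields)
    cdef numpy.ndarray[numpy.float64_t, ndim=1] distances = numpy.empty(n)
    for j in range(n):
        name = base_fields[j]
        # stringDistance is the existing affine gap distance function
        distances[j] = stringDistance(record_1[name], record_2[name])
    return distances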
For a cluster of duplicates, return a canonical representation of the entity we hope they all refer to.
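A naive sketch of one way to do this (the Wick paper mentioned earlier argues for doing canonicalization jointly; this is just a per-field majority vote, assuming clusters are lists of field dicts):

from collections import Counter

def canonicalize(cluster):
    # for each field, take the most common non-empty value among
    # the cluster's records as the canonical one
    canonical = {}
    for field in cluster[0].keys():
        values = [record[field] for record in cluster if record[field]]
        if values:
            canonical[field] = Counter(values).most_common(1)[0][0]
    return canonical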
Affine gap should be made more robust to handle this input.
It also raises a bigger question of how we handle empty fields in general. Do we throw them out completely?
Implementing Disjunctive blocking broke the writeSettings function.
We have implemented the Chaudhuri and hierarchical clustering algorithms, neither with the best results. Need to do more research.
Improving pairwise scoring (#3) will help regardless of which clustering approach we take.
option 1: ask Daniel Mullner (fastcluster author) to make SciPy import on line 27 of fastcluster.py optional
option 2: fork fastcluster, remove line, add to dedupe package (GPL license restrictions)
option 3: alternative library to fastcluster
Bilenko pre-processes the data by lowering the case and removing non-alphanumeric characters. Should we do this, or leave it to the user? Having everything be the same case helps our performance, but removing punctuation does not have much effect.
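For reference, the Bilenko-style normalization is tiny (a sketch):

import re

def preprocess(value):
    # lowercase everything and strip non-alphanumeric characters,
    # keeping whitespace between tokens
    return re.sub(r'[^a-z0-9\s]', '', value.lower())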
We will need to implement the Canopies algorithm http://www.kamalnigam.com/papers/canopy-kdd00.pdf
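A sketch of the algorithm from that paper, assuming a cheap distance function and loose > tight thresholds (all names hypothetical):

def canopies(records, distance, loose, tight):
    remaining = set(records)
    canopy_list = []
    while remaining:
        center = remaining.pop()  # the paper picks this point at random
        canopy = set([center])
        for point in list(remaining):
            d = distance(center, point)
            if d < loose:
                canopy.add(point)  # loosely-close points join the canopy
            if d < tight:
                remaining.discard(point)  # tightly-close points can't seed new canopies
        canopy_list.append(canopy)
    return canopy_list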
Our algorithm to learn blocking can only handle a modest number of examples, and we currently have code to reduce the number of examples the code looks at if it is passed too many.
At the same time we sometimes make a pretty expensive call to recordDistances in semiSupervisedLearning to find likely distinct pairs to feed to our blocking training.
This is not a critical performance issue, because in the typical case, when we use active learning, we don't make that call in semiSupervisedLearning.
However, it slows down the dev cycle, as about 20 of the 30 seconds it takes to run canonical_test.py are taken up by this function call. It also seems pretty smelly.
http://ozkatz.github.com/better-python-apis.html