dedupe's Issues

invertIndex fails when no TF/IDF canopy is chosen

Sometimes getting the following error when running the canonical example. This is most likely due to the invertIndex function failing when no TF/IDF canopy is chosen:

Traceback (most recent call last):
  File "test/canonical_test.py", line 123, in <module>
    blocked_data = dedupe.blocking.blockingIndex(data_d, blocker)
  File "/Users/derekeder/projects/open-city/deduplication/dedupe/dedupe/blocking.py", line 173, in blockingIndex
    blocker.invertIndex(data_d.iteritems())
  File "/Users/derekeder/projects/open-city/deduplication/dedupe/dedupe/blocking.py", line 74, in invertIndex
    num_docs = len(self.token_vector[field])
UnboundLocalError: local variable 'field' referenced before assignment
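A minimal reproduction of this failure mode (the function and variable names below are illustrative, not dedupe's actual code): the loop variable is only bound if the loop body runs at least once, so with no TF/IDF fields selected the later reference raises UnboundLocalError. A guard fixes it.

```python
def count_docs_buggy(token_vector):
    # 'field' is bound only inside the loop; with an empty token_vector
    # the loop never runs and the reference below raises UnboundLocalError.
    for field in token_vector:
        pass  # ... build the inverted index for each field ...
    return len(token_vector[field])

def count_docs_fixed(token_vector):
    # Guarded version: return early when no canopy fields were chosen.
    if not token_vector:
        return 0
    for field in token_vector:
        pass  # ... build the inverted index for each field ...
    return len(token_vector[field])
```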

Create InMemory helper functions class

We want to call certain functions when running on smaller datasets that are processed entirely in memory. Abstract these functions into a helper class.

  • blocking.blockingIndex
  • sampling for training (not written yet)
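A rough sketch of what such a helper class could look like; the class and method names here are assumptions for illustration, not dedupe's API, and the sampling method stands in for the training sampler that has not been written yet.

```python
import random

class InMemoryHelpers:
    def __init__(self, data_d):
        self.data_d = data_d  # {record_id: record_dict}, fully in memory

    def blocking_index(self, blocker):
        """Group record ids by the block keys the blocker assigns them."""
        index = {}
        for record_id, record in self.data_d.items():
            for key in blocker(record):
                index.setdefault(key, set()).add(record_id)
        return index

    def sample_training_pairs(self, n):
        """Draw n distinct random record pairs for training."""
        ids = list(self.data_d)
        pairs = set()
        while len(pairs) < n:
            a, b = random.sample(ids, 2)
            pairs.add((min(a, b), max(a, b)))
        return list(pairs)
```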

Affine Gap fails to compile in OSX Mountain Lion

When running on Mountain Lion

python setup.py install

executes successfully, but when the dedupe library is called, the following error occurs:

Traceback (most recent call last):
File "examples/canonical_example.py", line 3, in
import exampleIO
File "/Users/derekeder/projects/open-city/deduplication/dedupe/examples/exampleIO.py", line 3, in
import dedupe.core
File "/Users/derekeder/projects/open-city/deduplication/dedupe/dedupe/init.py", line 14, in
import affinegap
ImportError: dlopen(/Users/derekeder/projects/open-city/deduplication/dedupe/dedupe/affinegap.so, 2): Symbol not found: _newarrayobject
Referenced from: /Users/derekeder/projects/open-city/deduplication/dedupe/dedupe/affinegap.so
Expected in: flat namespace
in /Users/derekeder/projects/open-city/deduplication/dedupe/dedupe/affinegap.so

determine useful output

idea 1: a list of numbers
a list of numbers in the same order as the original dataset with IDs (row numbers) of found duplicates and zeros for the rest

idea 2: 2 files

  • one with the entire data_d without the duplicates, solving canonicalization by just picking the first one
  • another file with just the duplicates and what row they were flagged as duplicates of with a score
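Both ideas can be sketched in a few lines (illustrative only; the tie-breaking and score handling are assumptions):

```python
def duplicate_vector(n_rows, duplicate_pairs):
    """Idea 1: a list aligned with the original rows. duplicate_pairs is an
    iterable of (original_row, duplicate_row) with 1-based row numbers, so
    0 can stand for 'no duplicate found'."""
    out = [0] * n_rows
    for original, duplicate in duplicate_pairs:
        out[duplicate - 1] = original
    return out

def split_duplicates(rows, duplicate_of):
    """Idea 2: two outputs. duplicate_of maps a 0-based row index to
    (canonical row index, score); canonicalization just keeps the first."""
    canonical = [r for i, r in enumerate(rows) if i not in duplicate_of]
    dupes = [(rows[i], j, score) for i, (j, score) in duplicate_of.items()]
    return canonical, dupes
```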

Clustering of duplicate pairs

Michael Wick describes an joint approach to deduplication, clustering and canonicalization that makes an enormous amount of sense and seems to perform wonderfully. http://people.cs.umass.edu/~mwick/MikeWeb/Publications_files/wick09entity.pdf This approach is extremely attractive, but would require an understanding of probabilistic graphical models currently beyond my ken.

If we don't go that route, then we should use the approach described by Chaudhuri, et.al. ftp://ftp.research.microsoft.com/users/datacleaning/dedup_icde05.pdf, also explained in Nauman's /An Introduction to Duplicate Detection/ http://www.morganclaypool.com/doi/pdf/10.2200/S00262ED1V01Y201003DTM003

Better Blocking

The big thing is we need better blocking. To get better blocking we need to find better predicates. Part of that is making better predicates available, like tf-idf, but the major part is supplying more positive examples. The major limit to that is the number of records that we can calculate record distances between; right now that limit is around 700.

We should work on that bottleneck. First assignment: consolidate these three near-identical functions into one that we can begin to optimize. https://gist.github.com/3761519
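On the "better predicates" point, a tf-idf-style blocking predicate could work roughly like this sketch, which blocks records on their rare (high-IDF) tokens; all names here are illustrative, not dedupe's API:

```python
from collections import Counter

def tfidf_canopy_blocks(records, max_df=2):
    """records: {record_id: text}. Put each record into a block for every
    token that appears in at most max_df records, so only informative
    (rare) tokens generate candidate pairs."""
    tokenized = {rid: set(text.lower().split()) for rid, text in records.items()}
    df = Counter()  # document frequency per token
    for tokens in tokenized.values():
        df.update(tokens)
    blocks = {}
    for rid, tokens in tokenized.items():
        for tok in tokens:
            if df[tok] <= max_df:
                blocks.setdefault(tok, set()).add(rid)
    return blocks
```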

Guidance for model selection

Interaction terms: provide some kind of hint, perhaps based on calculated weights, as to which fields are good candidates for interaction terms

Interaction terms

Support interaction terms, e.g. the affine gap distance of the name field multiplied by the affine gap distance of the address field
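A sketch of how an interaction term could enter the feature vector; the signature and field names are assumptions, and any per-field distance (affine gap in our case) can be plugged in:

```python
def feature_vector(record_a, record_b, distance,
                   interactions=(('name', 'address'),)):
    """distance(a, b) -> float. Returns per-field distances plus, for each
    (f1, f2) interaction, the product of the two field distances, giving
    the learner a feature that fires only when both fields move together."""
    fields = sorted(set(record_a) & set(record_b))
    feats = {f: distance(record_a[f], record_b[f]) for f in fields}
    for f1, f2 in interactions:
        feats[f1 + '*' + f2] = feats[f1] * feats[f2]
    return feats
```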

Reduce memory usage for larger datasets

core.scoreDuplicates creates a large numpy array in memory, which blows up as the number of records increases. Currently trying to reduce this by chunking the candidates using itertools.islice.

However, the memory used by the numpy arrays doesn't seem to be reclaimed by the garbage collector. This may be the issue: numpy/numpy#1601
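The chunking idea, sketched with a stand-in scoring function (in dedupe the per-chunk scores would be a numpy array rather than a list):

```python
from itertools import islice

def score_in_chunks(candidates, score_pair, chunk_size=10000):
    """Score candidate pairs in fixed-size slices so only one chunk-sized
    score buffer is alive at a time; the caller can persist and drop each
    (chunk, scores) pair as it goes."""
    candidates = iter(candidates)
    while True:
        chunk = list(islice(candidates, chunk_size))
        if not chunk:
            break
        yield chunk, [score_pair(a, b) for a, b in chunk]
```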

Investigating further with memory_profiler and valgrind:

valgrind --tool=memcheck --suppressions=valgrind-python.supp python -E -tt ./dedupe/numpy_memory_test.py
valgrind --tool=massif /usr/bin/python dedupe/numpy_memory_test.py 

Other potential predicates

  • geospatial: within a distance radius of each other (an example of 5 miles was given)
  • phonetic: matching similar-sounding words
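The geospatial predicate could be as simple as a haversine-distance threshold; this is a hedged sketch, with the 5-mile default taken from the example above:

```python
import math

def within_radius(coord_a, coord_b, miles=5.0):
    """coord: (latitude, longitude) in degrees. True when the great-circle
    (haversine) distance between the points is at most `miles`."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*coord_a, *coord_b))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    earth_radius_miles = 3958.8
    return 2 * earth_radius_miles * math.asin(math.sqrt(h)) <= miles
```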

Address Preprocessing

Talked to a guy who used to work for NavTec. He said the way they handled address deduplication was by first matching street names to a canonical list, and then checking whether the address was on the same 'block'. With TIGER or OSM data, we could definitely do that.
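The first step, snapping a street name to a canonical list, can be sketched with the standard library; difflib and the cutoff value here are just one plausible choice, and the canonical list is assumed to be uppercase:

```python
import difflib

def canonical_street(name, canonical_names, cutoff=0.8):
    """Return the closest canonical street name (e.g. drawn from TIGER or
    OSM), or None if nothing is similar enough."""
    matches = difflib.get_close_matches(name.upper(), canonical_names,
                                        n=1, cutoff=cutoff)
    return matches[0] if matches else None
```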

better pattern for testing

We currently have two ways of testing each piece of dedupe:

  • Files like predicates.py and blocking.py are executable, with some tests in the init section. This is now partially broken by the new package directory structure.
  • We have two test files in dedupe/test that run unit tests on affinegap and clustering. These can be executed under the dedupe.test namespace.

We should think of a consistent way to do tests.

Normalization of Affine Gap Edit Distance

There are a number of approaches to normalizing an edit distance:

  1. Amortized edit distance, i.e. the minimum ratio of the sum of the costs of the edit operations divided by the length of the edit sequence
  2. Division by the sum of the lengths of the two strings
  3. Division by the maximum length of the two strings
  4. Li Yujian and Li Bo's procedure: http://ieeexplore.ieee.org.proxy.uchicago.edu/xpl/articleDetails.jsp?arnumber=4160958

I haven't tried 1, as it is a more complicated algorithm. There does not seem to be any appreciable difference in performance between 2, 3, and 4. Option 4 has the advantage over 2 and 3 that if the unnormalized edit distance is a metric, so is the normalized edit distance.
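Options 2 and 3 are one-liners; this sketch assumes `dist` is the already-computed unnormalized (affine gap) edit distance:

```python
def normalize_by_sum(dist, a, b):
    """Option 2: divide by the sum of the two string lengths."""
    total = len(a) + len(b)
    return dist / total if total else 0.0

def normalize_by_max(dist, a, b):
    """Option 3: divide by the length of the longer string."""
    longest = max(len(a), len(b))
    return dist / longest if longest else 0.0
```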

Better clustering

We have implemented the Chaudhuri and hierarchical clustering algorithms, neither with the best results. Need to do more research.

Improving pairwise scoring (#3) will help regardless of which clustering approach we take.

Remove SciPy dependency for fastcluster

option 1: ask Daniel Müllner (fastcluster author) to make the SciPy import on line 27 of fastcluster.py optional

option 2: fork fastcluster, remove the line, and add it to the dedupe package (GPL license restrictions)

option 3: alternative library to fastcluster

Pre-processing the data

Bilenko pre-processes the data by lowercasing it and removing non-alphanumeric characters. Should we do this, or leave it to the user? Having everything in the same case helps our performance, but removing punctuation does not have much effect.
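The Bilenko-style preprocessing described above is a one-liner; this sketch also collapses the runs of whitespace that stripping punctuation leaves behind, which is an assumption beyond what the text specifies:

```python
import re

def preprocess(field):
    """Lowercase and replace runs of non-alphanumeric characters with a
    single space."""
    return re.sub(r'[^a-z0-9]+', ' ', field.lower()).strip()
```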

SemiSupervisedNonDuplicates should provide fewer examples and find them more efficiently

Our algorithm to learn blocking can only handle a modest number of examples, and we currently have code to reduce the number of examples the code looks at if it is passed too many.

At the same time we sometimes make a pretty expensive call to recordDistances in semiSupervisedLearning to find likely distinct pairs to feed to our blocking training.

This is not a critical performance issue, because in the typical case, when we use active learning, we don't make that call in semiSupervisedLearning.

However, it slows down the dev cycle, as about 20 of the 30 seconds it takes to run canonical_test.py are taken up by this function call. It also seems pretty smelly.
