Code Monkey home page Code Monkey logo

ssdedupe's Introduction

ssdedupe

https://img.shields.io/travis/melvin15may/ssdedupe.svg?branch=master Documentation Status Updates

This is a fork from dssg/pgdedupe. This will now be a separate repo for MS SQL Server implementation. (See PR#40)

This packages is for working with Microsoft SQL Server. Please use pgdedupe for working with PostgreSQL

A work-in-progress to provide a standard interface for deduplication of large databases with custom pre-processing and post-processing steps.

Interface

This provides a simple command-line program, ssdedupe. Two configuration files specify the deduplication parameters and database connection settings. To run deduplication on a generated dataset, create a database.yml file that specifies the following parameters:

user:
password:
database:
host:
port:

You can now create a sample CSV file with:

$ python generate_fake_dataset.py --csv people.csv
creating people: 100%|█████████████████████| 9500/9500 [00:21<00:00, 445.38it/s]
adding twins: 100%|█████████████████████████| 500/500 [00:00<00:00, 1854.72it/s]
writing csv:  47%|███████████▋             | 4666/10000 [00:42<00:55, 96.28it/s]

Once complete, store this example dataset in a database with:

$ python test/initialize_db.py --db database.yml --csv people.csv
CREATE SCHEMA
DROP TABLE
CREATE TABLE
COPY 197617
ALTER TABLE
ALTER TABLE
UPDATE 197617

Now you can deduplicate this dataset. This will run dedupe as well as the custom pre-processing and post-processing steps as defined in config.yml:

$ ssdedupe --config config.yml --db database.yml

Custom pre- and post-processing

In addition to running a database-level deduplication with dedupe, this script adds custom pre- and post-processing steps to improve the run-time and results, making this a hybrid between fuzzy matching and record linkage.

  • Pre-processing: Before running dedupe, this script does an exact-match deduplication. Some systems create many identical rows; this can make it challenging for dedupe to create an effective blocking strategy and generally makes the fuzzy matching much harder and time intensive.
  • Post-processing: After running dedupe, this script does an optional exact-match merge across subsets of columns. For example, in some instances an exact match of just the last name and social security number are sufficient evidence that two clusters are indeed the same identity.

Further steps

This script was based upon and extended from the example in dedupe-examples. It would be nice to use this common interface across all database types, and potentially even allow reading from flat CSV files.

ssdedupe's People

Contributors

melvin15may avatar mbauman avatar pyup-bot avatar potash avatar melvinmathew avatar

Watchers

James Cloos avatar  avatar

ssdedupe's Issues

New documentation

New documentation for commands specific to sql server. A lot can be taken from pgdedupe doc.

Read configuration from a table

It will be awesome if we can read configuration from a table (table name can be a parameter). A use case would be when ssdedupe is run on a server. Its always easier to have a SQL command modify a table than updating a file in server.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.