pandas-linker

pandas-linker runs comparison windows over different sortings of a pandas DataFrame and links the rows via assigned UUIDs. This library does not actually do any duplicate detection. Instead it provides a harness to run your own comparison functions on your data.

This library is meant for datasets large enough that comparing every row with every other row is impractical, because the number of pairs grows quadratically with the row count. Instead, you choose a sort order for the DataFrame and compare each row only with the other rows inside a sliding window.
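To make the saving concrete, here is a minimal sketch that counts candidate pairs with and without a sliding window. It is an illustration only and not pandas-linker's code: `window_pairs` is a hypothetical helper, and it assumes each row is compared with the `window_size - 1` rows that follow it in sort order.

```python
from itertools import combinations


def window_pairs(n_rows, window_size):
    """Yield each position pair (i, j), i < j, that falls inside one window.

    Hypothetical helper for illustration; positions refer to the sorted order.
    """
    for i in range(n_rows):
        # Row i is paired with the next window_size - 1 rows in sort order.
        for j in range(i + 1, min(i + window_size, n_rows)):
            yield i, j


n_rows, window_size = 1000, 10
all_pairs = sum(1 for _ in combinations(range(n_rows), 2))
windowed = sum(1 for _ in window_pairs(n_rows, window_size))
print(all_pairs, windowed)  # prints "499500 8955"
```

The windowed count grows roughly linearly with the number of rows, which is why several passes over different sort orders can still be much cheaper than one all-pairs comparison.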

Install

pip install pandas-linker

Example

Let's say you have a DataFrame like this:

[ix]  name  country
0     Pete  Spain
1     Mary  USA
2     Bart  US
3     Mary  US

and you want to detect similar rows and mark them as such. Here's how to do that:

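import pandas as pd
from pandas_linker import get_linker


def compare_rows(a, b):
    '''Example function that decides if two rows represent the same entity.'''
    return a['name'] in b['name'] or b['name'] in a['name']

# df is a pandas.DataFrame with a unique index
df = pd.DataFrame({
    'name': ['Pete', 'Mary', 'Bart', 'Mary'],
    'country': ['Spain', 'USA', 'US', 'US'],
})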

with get_linker(df, field='uid') as linker:

    print('Comparing in 10 row window sorted by name')
    linker(sort_cols=['name'], window_size=10, cmp=compare_rows)

    print('Comparing in 15 row window sorted by country')
    linker(sort_cols=['country'], window_size=15, cmp=compare_rows)

After the linker has run, the DataFrame has a uid column:

[ix]  name  country  uid
0     Pete  Spain    7509781940fc471cad5dc32944652d70
1     Mary  USA      8f8dccd91568472daf740e9160349d6c
2     Bart  US       12b55fbe80f64d378193acd727b0e051
3     Mary  US       8f8dccd91568472daf740e9160349d6c

Note that both "Mary" rows in the DataFrame have been identified as representing the same entity and were assigned the same UUID.
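The idea of linking rows through shared UUIDs over several passes can be sketched in plain Python. This is an assumption-laden illustration, not the library's implementation: `link_pass`, `rows`, and `uid_of` are hypothetical names, rows are plain dicts instead of a DataFrame, and the merge strategy (relabel every row that carries the losing uid) is one simple choice among several.

```python
import uuid


def link_pass(rows, uid_of, sort_key, window_size, cmp):
    """Run one sliding-window pass and merge the uids of matching rows.

    Hypothetical helper: rows maps index -> record, uid_of maps index -> uid.
    """
    order = sorted(rows, key=lambda ix: rows[ix][sort_key])
    for pos, a in enumerate(order):
        # Compare row a with the next window_size - 1 rows in sort order.
        for b in order[pos + 1 : pos + window_size]:
            if uid_of[a] != uid_of[b] and cmp(rows[a], rows[b]):
                old, new = uid_of[b], uid_of[a]
                # Relabel every row with the old uid so earlier links survive.
                for ix in uid_of:
                    if uid_of[ix] == old:
                        uid_of[ix] = new


def compare_rows(a, b):
    return a["name"] in b["name"] or b["name"] in a["name"]


rows = {
    0: {"name": "Pete", "country": "Spain"},
    1: {"name": "Mary", "country": "USA"},
    2: {"name": "Bart", "country": "US"},
    3: {"name": "Mary", "country": "US"},
}
# Every row starts with its own uid; matches collapse them.
uid_of = {ix: uuid.uuid4().hex for ix in rows}

link_pass(rows, uid_of, "name", window_size=2, cmp=compare_rows)
link_pass(rows, uid_of, "country", window_size=2, cmp=compare_rows)

for ix, uid in uid_of.items():
    print(ix, rows[ix]["name"], uid)
```

Because uids persist between passes, a link found under one sort order is kept when a later pass uses a different order, so rows can end up grouped even if no single sort places them all in the same window.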

Contributors

stefanw
