Pandas similarity is a small library that computes and returns a similarity index between the entries of a DataFrame.
This index can be used to remove very similar observations from a DataFrame, which is useful when your dataset combines data from different sources about the same subject.
Install and update using pip:
pip install pandas-similarity
IndexCalculator returns a normalized index between 0 and 1.
import pandas as pd
from pandas_similarity import IndexCalculator
# Create a Pandas Dataframe.
df = pd.DataFrame([
    [1050, "Ixelles", 0, 360000],
    [1050, "Ixelles", 1, 340000]],
    columns=['postal_code', 'city', 'property_type', 'price'])
# Print the index of similarity between all entries of the dataframe.
print(IndexCalculator(df).get_index())
# Index: 0.6667
In this example, the DataFrame has two observations that differ only in property_type and price. The similarity index returned by IndexCalculator is 0.6667.
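To illustrate the use case of removing very similar observations, here is a minimal sketch that uses plain pandas only. It is not this library's algorithm and does not reproduce its index values: the metric (the fraction of columns on which two rows agree), the helper names row_similarity and drop_similar_rows, the threshold value, and the third sample row are all assumptions made for illustration.

```python
import pandas as pd


def row_similarity(a: pd.Series, b: pd.Series) -> float:
    """Fraction of columns on which two rows hold equal values.

    Hypothetical metric for illustration; not the formula used by IndexCalculator.
    """
    return float((a == b).mean())


def drop_similar_rows(df: pd.DataFrame, threshold: float) -> pd.DataFrame:
    """Keep a row only if its similarity to every already kept row is below `threshold`."""
    kept = []  # index labels of the rows retained so far
    for label, row in df.iterrows():
        if all(row_similarity(row, df.loc[k]) < threshold for k in kept):
            kept.append(label)
    return df.loc[kept]


df = pd.DataFrame(
    [
        [1050, "Ixelles", 0, 360000],
        [1050, "Ixelles", 1, 340000],
        [1050, "Ixelles", 0, 365000],  # differs from the first row only in price
    ],
    columns=["postal_code", "city", "property_type", "price"],
)

# Rows that agree with an earlier kept row on at least 75% of the columns are dropped.
deduplicated = drop_similar_rows(df, threshold=0.75)
print(deduplicated)
```

With this assumed metric, the third row matches the first on three of four columns and is dropped, while the second row matches on only two and is kept. The pairwise loop is quadratic in the number of rows, so it is only suitable for small DataFrames.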
Author: Joffrey Bienvenu, Junior developer & ML student @ Becode.