deduplipy's People

Contributors: fritshermans, koaning, sugatoray

deduplipy's Issues

Fitting and Null-Values (NaN)

Having null values in a column that is part of the training df leads to an error in the sklearn package, in feature_extraction/text.py.
I could of course set null values to a default string ("None", for instance), but wouldn't this have an unwanted impact on the training itself?

If we use a default string value instead of NaN, is it possible to exclude certain values in a series of a dataframe, to avoid blocking on them?
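A minimal sketch of the fill-in workaround, with a hypothetical column name 'city'; whether the placeholder skews training is exactly the open question raised above:

import pandas as pd

# Hypothetical example frame; 'city' stands in for any column containing NaNs.
df = pd.DataFrame({'name': ['anna', 'anne'], 'city': ['amsterdam', None]})

# Replace NaN with an empty string so sklearn's text vectorizer
# does not fail on missing values during fitting.
df['city'] = df['city'].fillna('')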

load_data() fails

I just tried installing the library and running the tutorial, per the docs.

import pandas as pd
from deduplipy.datasets import load_data

df = load_data()

This gave the following error:

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
/tmp/ipykernel_3092214/1389887512.py in <module>
----> 1 df = load_data()

~/Development/calm-notebooks/venv/lib/python3.7/site-packages/deduplipy/datasets.py in load_data(kind)
     36         return load_stoxx50()
     37     elif kind == 'voters':
---> 38         return load_voters()

~/Development/calm-notebooks/venv/lib/python3.7/site-packages/deduplipy/datasets.py in load_voters()
     14 def load_voters() -> pd.DataFrame:
     15     file_path = resource_filename('deduplipy', os.path.join('data', 'voter_names.csv'))
---> 16     df = pd.read_csv(file_path)
     17     print("Column names: 'name', 'suburb', 'postcode'")
     18     return df

~/Development/calm-notebooks/venv/lib/python3.7/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    309                     stacklevel=stacklevel,
    310                 )
--> 311             return func(*args, **kwargs)
    312 
    313         return wrapper

~/Development/calm-notebooks/venv/lib/python3.7/site-packages/pandas/io/parsers/readers.py in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, error_bad_lines, warn_bad_lines, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
    584     kwds.update(kwds_defaults)
    585 
--> 586     return _read(filepath_or_buffer, kwds)
    587 
    588 

~/Development/calm-notebooks/venv/lib/python3.7/site-packages/pandas/io/parsers/readers.py in _read(filepath_or_buffer, kwds)
    480 
    481     # Create the parser.
--> 482     parser = TextFileReader(filepath_or_buffer, **kwds)
    483 
    484     if chunksize or iterator:

~/Development/calm-notebooks/venv/lib/python3.7/site-packages/pandas/io/parsers/readers.py in __init__(self, f, engine, **kwds)
    809             self.options["has_index_names"] = kwds["has_index_names"]
    810 
--> 811         self._engine = self._make_engine(self.engine)
    812 
    813     def close(self):

~/Development/calm-notebooks/venv/lib/python3.7/site-packages/pandas/io/parsers/readers.py in _make_engine(self, engine)
   1038             )
   1039         # error: Too many arguments for "ParserBase"
-> 1040         return mapping[engine](self.f, **self.options)  # type: ignore[call-arg]
   1041 
   1042     def _failover_to_python(self):

~/Development/calm-notebooks/venv/lib/python3.7/site-packages/pandas/io/parsers/c_parser_wrapper.py in __init__(self, src, **kwds)
     49 
     50         # open handles
---> 51         self._open_handles(src, kwds)
     52         assert self.handles is not None
     53 

~/Development/calm-notebooks/venv/lib/python3.7/site-packages/pandas/io/parsers/base_parser.py in _open_handles(self, src, kwds)
    227             memory_map=kwds.get("memory_map", False),
    228             storage_options=kwds.get("storage_options", None),
--> 229             errors=kwds.get("encoding_errors", "strict"),
    230         )
    231 

~/Development/calm-notebooks/venv/lib/python3.7/site-packages/pandas/io/common.py in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
    704                 encoding=ioargs.encoding,
    705                 errors=errors,
--> 706                 newline="",
    707             )
    708         else:

FileNotFoundError: [Errno 2] No such file or directory: '/home/vincent/Development/calm-notebooks/venv/lib/python3.7/site-packages/deduplipy/data/voter_names.csv'
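The error points at a missing packaged data file. A diagnostic sketch (not a fix), using the same resource_filename call the traceback shows, to confirm whether voter_names.csv shipped with the installed package at all:

import os
from pkg_resources import resource_filename

# Resolve the path deduplipy uses internally (see the traceback above)
# and check whether the CSV was actually included in the installed package.
path = resource_filename('deduplipy', os.path.join('data', 'voter_names.csv'))
print(path, os.path.exists(path))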

Here's my watermark info:

Python implementation: CPython
Python version       : 3.7.9
IPython version      : 7.27.0

numpy       : 1.20.3
pandas      : 1.3.2
scikit-learn: 0.24.2
deduplipy   : 0.5

Compiler    : GCC 9.3.0
OS          : Linux
Release     : 5.11.0-7614-generic
Machine     : x86_64
Processor   : x86_64
CPU cores   : 12
Architecture: 64bit

Add topics to repo for better GitHub discovery

Hi! I was doing a survey of available record linkage libraries and I missed this one when browsing GitHub topics such as "record linkage" and "entity resolution". I only found it via Google searches and your PyData talk. You should add some of those tags! Great work here, thank you.

Active learning on Databricks

Dear @fritshermans!

Thank you for your excellent library. Is there any way to run the interactive active learning mode on Databricks?
As far as I know, there is no way to read shell input in a Databricks notebook, so I am unable to confirm or reject matches during active learning.

Thank you for any suggestion on how to proceed here.

Cheers,
Andrej
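
One possible workaround, sketched below under the assumption that a fitted Deduplicator pickles cleanly (the names myDedupliPy and df are placeholders from the docs examples): run the interactive fit locally where console input works, then load the pickled model on Databricks for prediction only.

import pickle

# Locally, where console/widget input works, run the interactive fit:
myDedupliPy.fit(df)
with open('deduplicator.pkl', 'wb') as f:
    pickle.dump(myDedupliPy, f)

# On Databricks, load the already-trained model; predict() needs no
# interactive labeling:
with open('deduplicator.pkl', 'rb') as f:
    trained = pickle.load(f)
res = trained.predict(df)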

Handle abbreviations

Maybe it would be a nice enhancement to implement logic for handling abbreviations, e.g. VW = Volkswagen.
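
A minimal sketch of such a pre-processing step applied before deduplication; the lookup table and function name are hypothetical:

# Hypothetical lookup table: expand known abbreviations before feeding
# the strings to the Deduplicator.
ABBREVIATIONS = {'vw': 'volkswagen'}

def expand_abbreviations(text):
    tokens = text.lower().split()
    return ' '.join(ABBREVIATIONS.get(t, t) for t in tokens)

expand_abbreviations('VW Golf')  # -> 'volkswagen golf'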

Error in MinHashSampler ("_create_minhash_pairs") when all word lengths within the compared strings are equal to 1

Hi, and thank you for creating the package!
I am exploring its applicability on a data set and I ran into an error.
The data come from an ERP system, so users can insert whatever they want, including erroneous data.
I identified that when all word lengths within the string to be compared are equal to 1, I get an error in the MinHashSampler when trying to fit.
I reproduced the error on an artificial dataset, please see below:


import pandas as pd
from deduplipy.deduplicator import Deduplicator
# `ratio` is assumed here to be fuzzywuzzy's string-similarity function:
from fuzzywuzzy.fuzz import ratio


def first_two_characters(x):
    return x[:2]


df = pd.DataFrame(
    data=[['george d'], ['andy t'], ['greg b'], ['ret'], ['pam'], ['kos'], ['andy'],
          ['pamela'], ['pamla'], ['kis'], ['paul'], ['paul d'],
          ['geirge d'], ['ndy t'], ['greg'], ['retos'], ['pipo'], ['konstas'], ['grig'], ['gre'],
          ['k i']],
    columns=['name'])

field_info = {'name': [ratio]}
myDedupliPy = Deduplicator(
    field_info=field_info,
    interaction=False,
    rules={'name': [first_two_characters]},
    verbose=1)
myDedupliPy.fit(X=df[['name']], n_samples=100)


If I remove the ['k i'] row, it runs without errors.
The error occurs when the MinHashSampler is called, but I am not sure exactly what the function does or how to correct it.

I suppose I could perform a check before calling fit, for example counting the word lengths of each row and omitting the offending rows, but I wanted to check with you whether you have any suggestions or recipes.
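
A minimal sketch of that pre-filter, assuming the offending rows are exactly those in which every word is a single character:

# Keep only rows containing at least one word longer than one character,
# so minhash shingling does not produce an empty vector for the row.
mask = df['name'].apply(lambda s: any(len(w) > 1 for w in s.split()))
df_filtered = df[mask]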

Thank you very much in advance,


ValueError                                Traceback (most recent call last)

File ~/miniconda2/envs/py395dd/lib/python3.9/site-packages/deduplipy/deduplicator/deduplicator.py:136, in Deduplicator.fit(self, X, n_samples)
    124 def fit(self, X: pd.DataFrame, n_samples: int = 10_000) -> 'Deduplicator':
    125     """
    126     Fit the deduplicator instance
    127
   (...)
    134
    135     """
--> 136     pairs_table = self._create_pairs_table(X, n_samples)
    137     similarities = self._calculate_string_similarities(pairs_table)
    138     self.myActiveLearner.fit(similarities)

File ~/miniconda2/envs/py395dd/lib/python3.9/site-packages/deduplipy/deduplicator/deduplicator.py:105, in Deduplicator._create_pairs_table(self, X, n_samples)
     93     """
     94     Create sample of pairs
     95
   (...)
    102
    103     """
    104     n_samples_minhash = n_samples // 2
--> 105     minhash_pairs = MinHashSampler(self.col_names).sample(X, n_samples_minhash)
    106     # the number of minhash samples can be (much) smaller than n_samples//2, in such case take more random pairs:
    107     n_samples_naive = n_samples - len(minhash_pairs)

File ~/miniconda2/envs/py395dd/lib/python3.9/site-packages/deduplipy/sampling/minhash_sampling.py:128, in MinHashSampler.sample(self, X, n_samples, threshold)
    114 def sample(self, X: pd.DataFrame, n_samples: int, threshold: float = 0.2) -> pd.DataFrame:
    115     """
    116     Method to draw sample of pairs of size n_samples from dataframe X. Note that n_samples cannot be returned if
    117     the number of pairs above the threshold is too low.
   (...)
    126
    127     """
--> 128     minhash_pairs = self._create_minhash_pairs(X, threshold)
    130     stratified_sample = self._get_stratified_sample(minhash_pairs, n_samples)
    132     non_stratified_sample = self._get_non_stratified_sample(minhash_pairs, stratified_sample, n_samples)

File ~/miniconda2/envs/py395dd/lib/python3.9/site-packages/deduplipy/sampling/minhash_sampling.py:49, in MinHashSampler._create_minhash_pairs(self, X, threshold)
     47 minhash_pairs = pd.DataFrame()
     48 for col in self.col_names:
---> 49     minhash_result = self.MinHasher.fit_predict(df, col)
     51     # add other columns than the one used for minhashing
     52     minhash_result = (minhash_result
     53                       .merge(df.drop(columns=[col]), left_on='row_number_1', right_on='row_number')
     54                       .drop(columns=['row_number']))

File ~/miniconda2/envs/py395dd/lib/python3.9/site-packages/pyminhash/pyminhash.py:154, in MinHash.fit_predict(self, df, col_name)
    152 df_['row_number'] = np.arange(len(df_))
    153 df_ = self.sparse_vectorize(df, col_name)
--> 154 df_ = self.create_minhash_signatures(df)
    155 return self.create_pairs(df, col_name)

File ~/miniconda2/envs/py395dd/lib/python3.9/site-packages/pyminhash/pyminhash.py:88, in MinHash._create_minhash_signatures(self, df)
     76 def _create_minhash_signatures(self, df: pd.DataFrame) -> pd.DataFrame:
     77     """
     78     Apply minhashing to the column sparse_vector in Pandas dataframe df in the new column minhash_signature.
     79     In addition, one column (e.g.: 'hash_{0}') per hash table is created.
   (...)
     86
     87     """
---> 88     df['minhash_signature'] = df['sparse_vector'].apply(self._create_minhash)
     89     # the following involved way of creating 'hash_' columns prevents efficiency warnings
     90     hash_df = df['minhash_signature'].apply(pd.Series)

File ~/miniconda2/envs/py395dd/lib/python3.9/site-packages/pandas/core/series.py:4433, in Series.apply(self, func, convert_dtype, args, **kwargs)
   4323 def apply(
   4324     self,
   4325     func: AggFuncType,
   (...)
   4328     **kwargs,
   4329 ) -> DataFrame | Series:
   4330     """
   4331     Invoke function on values of Series.
   4332
   (...)
   4431     dtype: float64
   4432     """
-> 4433     return SeriesApply(self, func, convert_dtype, args, kwargs).apply()

File ~/miniconda2/envs/py395dd/lib/python3.9/site-packages/pandas/core/apply.py:1082, in SeriesApply.apply(self)
   1078 if isinstance(self.f, str):
   1079     # if we are a string, try to dispatch
   1080     return self.apply_str()
-> 1082 return self.apply_standard()

File ~/miniconda2/envs/py395dd/lib/python3.9/site-packages/pandas/core/apply.py:1137, in SeriesApply.apply_standard(self)
   1131 values = obj.astype(object)._values
   1132 # error: Argument 2 to "map_infer" has incompatible type
   1133 # "Union[Callable[..., Any], str, List[Union[Callable[..., Any], str]],
   1134 # Dict[Hashable, Union[Union[Callable[..., Any], str],
   1135 # List[Union[Callable[..., Any], str]]]]]"; expected
   1136 # "Callable[[Any], Any]"
-> 1137 mapped = lib.map_infer(
   1138     values,
   1139     f,  # type: ignore[arg-type]
   1140     convert=self.convert_dtype,
   1141 )
   1143 if len(mapped) and isinstance(mapped[0], ABCSeries):
   1144     # GH#43986 Need to do list(mapped) in order to get treated as nested
   1145     # See also GH#25959 regarding EA support
   1146     return obj._constructor_expanddim(list(mapped), index=obj.index)

File ~/miniconda2/envs/py395dd/lib/python3.9/site-packages/pandas/_libs/lib.pyx:2870, in pandas._libs.lib.map_infer()

File ~/miniconda2/envs/py395dd/lib/python3.9/site-packages/pyminhash/pyminhash.py:73, in MinHash._create_minhash(self, doc)
     71 hashes += self.b
     72 hashes %= self.next_prime
---> 73 minhashes = hashes.min(axis=0)
     74 return minhashes

File ~/miniconda2/envs/py395dd/lib/python3.9/site-packages/numpy/core/_methods.py:44, in _amin(a, axis, out, keepdims, initial, where)
     42 def _amin(a, axis=None, out=None, keepdims=False,
     43           initial=_NoValue, where=True):
---> 44     return umr_minimum(a, axis, None, out, keepdims, initial, where)

ValueError: zero-size array to reduction operation minimum which has no identity

Reshape if single value when having several values

Hi, sometimes when trying to predict some values it gives me the following error:

Expected 2D array, got 1D array instead:
array=[].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

It says my array is 1D, even though I read a .csv with pandas that contains more than 5000 values.

I don't know what could be wrong. My data are names of authors and there's nothing strange about them. I have fit with more than 60000 rows and then tried to predict: it works with some values and not with others. Maybe it is some kind of bug?
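
The "array=[]" in the message suggests an empty feature array is reaching scikit-learn, e.g. when blocking leaves no candidate pairs for the given records. A hedged sketch of a guard one could try before predicting (the column name 'name' and the variable myDedupliPy are placeholders):

# Drop empty/missing names before predicting; otherwise, if blocking
# yields no pairs, an empty array may reach the classifier.
subset = df[df['name'].notna() & (df['name'].str.strip() != '')]
if not subset.empty:
    result = myDedupliPy.predict(subset)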
