deduplipy's People

Contributors: fritshermans, koaning, sugatoray

deduplipy's Issues

Fitting and Null-Values (NaN)

Having null values in a column that is part of the training df leads to an error in the sklearn package, in feature_extraction/text.py.
I could of course set null values to a default string ("None", for instance), but wouldn't this have an unwanted impact on the training itself?

If we use a default string value instead of NaN, is it possible to exclude certain values in a series of a dataframe, to avoid blocking on them?
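A minimal sketch of the fill-in workaround, with a hypothetical column name 'city'; whether the placeholder skews training is exactly the open question raised above:

import pandas as pd

# Hypothetical example frame; 'city' stands in for any column containing NaNs.
df = pd.DataFrame({'name': ['anna', 'anne'], 'city': ['amsterdam', None]})

# Replace NaN with an empty string so sklearn's text vectorizer
# does not fail on missing values during fitting.
df['city'] = df['city'].fillna('')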

load_data() fails

I just tried installing the library and running the tutorial, per the docs.

import pandas as pd
from deduplipy.datasets import load_data

df = load_data()

This gave the following error:

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
/tmp/ipykernel_3092214/1389887512.py in <module>
----> 1 df = load_data()

~/Development/calm-notebooks/venv/lib/python3.7/site-packages/deduplipy/datasets.py in load_data(kind)
     36         return load_stoxx50()
     37     elif kind == 'voters':
---> 38         return load_voters()

~/Development/calm-notebooks/venv/lib/python3.7/site-packages/deduplipy/datasets.py in load_voters()
     14 def load_voters() -> pd.DataFrame:
     15     file_path = resource_filename('deduplipy', os.path.join('data', 'voter_names.csv'))
---> 16     df = pd.read_csv(file_path)
     17     print("Column names: 'name', 'suburb', 'postcode'")
     18     return df

~/Development/calm-notebooks/venv/lib/python3.7/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    309                     stacklevel=stacklevel,
    310                 )
--> 311             return func(*args, **kwargs)
    312 
    313         return wrapper

~/Development/calm-notebooks/venv/lib/python3.7/site-packages/pandas/io/parsers/readers.py in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, error_bad_lines, warn_bad_lines, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
    584     kwds.update(kwds_defaults)
    585 
--> 586     return _read(filepath_or_buffer, kwds)
    587 
    588 

~/Development/calm-notebooks/venv/lib/python3.7/site-packages/pandas/io/parsers/readers.py in _read(filepath_or_buffer, kwds)
    480 
    481     # Create the parser.
--> 482     parser = TextFileReader(filepath_or_buffer, **kwds)
    483 
    484     if chunksize or iterator:

~/Development/calm-notebooks/venv/lib/python3.7/site-packages/pandas/io/parsers/readers.py in __init__(self, f, engine, **kwds)
    809             self.options["has_index_names"] = kwds["has_index_names"]
    810 
--> 811         self._engine = self._make_engine(self.engine)
    812 
    813     def close(self):

~/Development/calm-notebooks/venv/lib/python3.7/site-packages/pandas/io/parsers/readers.py in _make_engine(self, engine)
   1038             )
   1039         # error: Too many arguments for "ParserBase"
-> 1040         return mapping[engine](self.f, **self.options)  # type: ignore[call-arg]
   1041 
   1042     def _failover_to_python(self):

~/Development/calm-notebooks/venv/lib/python3.7/site-packages/pandas/io/parsers/c_parser_wrapper.py in __init__(self, src, **kwds)
     49 
     50         # open handles
---> 51         self._open_handles(src, kwds)
     52         assert self.handles is not None
     53 

~/Development/calm-notebooks/venv/lib/python3.7/site-packages/pandas/io/parsers/base_parser.py in _open_handles(self, src, kwds)
    227             memory_map=kwds.get("memory_map", False),
    228             storage_options=kwds.get("storage_options", None),
--> 229             errors=kwds.get("encoding_errors", "strict"),
    230         )
    231 

~/Development/calm-notebooks/venv/lib/python3.7/site-packages/pandas/io/common.py in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
    704                 encoding=ioargs.encoding,
    705                 errors=errors,
--> 706                 newline="",
    707             )
    708         else:

FileNotFoundError: [Errno 2] No such file or directory: '/home/vincent/Development/calm-notebooks/venv/lib/python3.7/site-packages/deduplipy/data/voter_names.csv'
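The error points at a missing packaged data file. A diagnostic sketch (not a fix), using the same resource_filename call the traceback shows, to confirm whether voter_names.csv shipped with the installed package at all:

import os
from pkg_resources import resource_filename

# Resolve the path deduplipy uses internally (see the traceback above)
# and check whether the CSV was actually included in the installed package.
path = resource_filename('deduplipy', os.path.join('data', 'voter_names.csv'))
print(path, os.path.exists(path))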

Here's my watermark info:

Python implementation: CPython
Python version       : 3.7.9
IPython version      : 7.27.0

numpy       : 1.20.3
pandas      : 1.3.2
scikit-learn: 0.24.2
deduplipy   : 0.5

Compiler    : GCC 9.3.0
OS          : Linux
Release     : 5.11.0-7614-generic
Machine     : x86_64
Processor   : x86_64
CPU cores   : 12
Architecture: 64bit

Add topics to repo for better GitHub discovery

Hi! I was doing a survey of available record linkage libraries and I missed this one when browsing GitHub topics such as "record linkage" and "entity resolution". I only found it via Google searches and your PyData talk. You should add some of those tags! Great work here, thank you.

Active learning on Databricks

Dear @fritshermans!

Thank you for your excellent library. Is there any way to run the interactive active learning mode on Databricks?
As far as I know, there is no way to read shell input in a Databricks notebook, so I am unable to confirm or reject matches during active learning.

Thank you for any suggestion on how to proceed here.

Cheers,
Andrej
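
One possible workaround, sketched below under the assumption that a fitted Deduplicator pickles cleanly (the names myDedupliPy and df are placeholders from the docs examples): run the interactive fit locally where console input works, then load the pickled model on Databricks for prediction only.

import pickle

# Locally, where console/widget input works, run the interactive fit:
myDedupliPy.fit(df)
with open('deduplicator.pkl', 'wb') as f:
    pickle.dump(myDedupliPy, f)

# On Databricks, load the already-trained model; predict() needs no
# interactive labeling:
with open('deduplicator.pkl', 'rb') as f:
    trained = pickle.load(f)
res = trained.predict(df)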

Handle abbreviations

Maybe it would be a nice enhancement to implement logic for handling abbreviations, e.g. VW = Volkswagen.
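
A minimal sketch of such a pre-processing step applied before deduplication; the lookup table and function name are hypothetical:

# Hypothetical lookup table: expand known abbreviations before feeding
# the strings to the Deduplicator.
ABBREVIATIONS = {'vw': 'volkswagen'}

def expand_abbreviations(text):
    tokens = text.lower().split()
    return ' '.join(ABBREVIATIONS.get(t, t) for t in tokens)

expand_abbreviations('VW Golf')  # -> 'volkswagen golf'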

Error in MinHashSampler ("_create_minhash_pairs") when all word lengths within the compared strings are equal to 1

Hi, and thank you for creating the package!
I am exploring its applicability on a data set and I ran into an error.
The data come from an ERP system, so users can insert whatever they want, including erroneous data.
I identified that when all word lengths within the string to be compared are equal to 1, I get an error in the MinHashSampler when trying to fit.
I reproduced the error on an artificial dataset, please see below:


import pandas as pd
from deduplipy.deduplicator import Deduplicator
# `ratio` is assumed here to be fuzzywuzzy's string-similarity function:
from fuzzywuzzy.fuzz import ratio


def first_two_characters(x):
    return x[:2]


df = pd.DataFrame(
    data=[['george d'], ['andy t'], ['greg b'], ['ret'], ['pam'], ['kos'], ['andy'],
          ['pamela'], ['pamla'], ['kis'], ['paul'], ['paul d'],
          ['geirge d'], ['ndy t'], ['greg'], ['retos'], ['pipo'], ['konstas'], ['grig'], ['gre'],
          ['k i']],
    columns=['name'])

field_info = {'name': [ratio]}
myDedupliPy = Deduplicator(
    field_info=field_info,
    interaction=False,
    rules={'name': [first_two_characters]},
    verbose=1)
myDedupliPy.fit(X=df[['name']], n_samples=100)


If I remove the ['k i'] row, it runs without errors.
The error occurs when the MinHashSampler is called, but I am not sure exactly what the function does or how to correct it.

I suppose I could perform a check before calling fit, for example counting the word lengths of each row and omitting the offending rows, but I wanted to check with you whether you have any suggestions or recipes.
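
A minimal sketch of that pre-filter, assuming the offending rows are exactly those in which every word is a single character:

# Keep only rows containing at least one word longer than one character,
# so minhash shingling does not produce an empty vector for the row.
mask = df['name'].apply(lambda s: any(len(w) > 1 for w in s.split()))
df_filtered = df[mask]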

Thank you very much in advance,


ValueError                                Traceback (most recent call last)

File ~/miniconda2/envs/py395dd/lib/python3.9/site-packages/deduplipy/deduplicator/deduplicator.py:136, in Deduplicator.fit(self, X, n_samples)
    124 def fit(self, X: pd.DataFrame, n_samples: int = 10_000) -> 'Deduplicator':
    125     """
    126     Fit the deduplicator instance
    127
   (...)
    134
    135     """
--> 136     pairs_table = self._create_pairs_table(X, n_samples)
    137     similarities = self._calculate_string_similarities(pairs_table)
    138     self.myActiveLearner.fit(similarities)

File ~/miniconda2/envs/py395dd/lib/python3.9/site-packages/deduplipy/deduplicator/deduplicator.py:105, in Deduplicator._create_pairs_table(self, X, n_samples)
     93     """
     94     Create sample of pairs
     95
   (...)
    102
    103     """
    104     n_samples_minhash = n_samples // 2
--> 105     minhash_pairs = MinHashSampler(self.col_names).sample(X, n_samples_minhash)
    106     # the number of minhash samples can be (much) smaller than n_samples//2, in such case take more random pairs:
    107     n_samples_naive = n_samples - len(minhash_pairs)

File ~/miniconda2/envs/py395dd/lib/python3.9/site-packages/deduplipy/sampling/minhash_sampling.py:128, in MinHashSampler.sample(self, X, n_samples, threshold)
    114 def sample(self, X: pd.DataFrame, n_samples: int, threshold: float = 0.2) -> pd.DataFrame:
    115     """
    116     Method to draw sample of pairs of size n_samples from dataframe X. Note that n_samples cannot be returned if
    117     the number of pairs above the threshold is too low.
   (...)
    126
    127     """
--> 128     minhash_pairs = self._create_minhash_pairs(X, threshold)
    130     stratified_sample = self._get_stratified_sample(minhash_pairs, n_samples)
    132     non_stratified_sample = self._get_non_stratified_sample(minhash_pairs, stratified_sample, n_samples)

File ~/miniconda2/envs/py395dd/lib/python3.9/site-packages/deduplipy/sampling/minhash_sampling.py:49, in MinHashSampler._create_minhash_pairs(self, X, threshold)
     47 minhash_pairs = pd.DataFrame()
     48 for col in self.col_names:
---> 49     minhash_result = self.MinHasher.fit_predict(df, col)
     51     # add other columns than the one used for minhashing
     52     minhash_result = (minhash_result
     53                       .merge(df.drop(columns=[col]), left_on='row_number_1', right_on='row_number')
     54                       .drop(columns=['row_number']))

File ~/miniconda2/envs/py395dd/lib/python3.9/site-packages/pyminhash/pyminhash.py:154, in MinHash.fit_predict(self, df, col_name)
    152 df_['row_number'] = np.arange(len(df_))
    153 df_ = self.sparse_vectorize(df, col_name)
--> 154 df_ = self.create_minhash_signatures(df)
    155 return self.create_pairs(df, col_name)

File ~/miniconda2/envs/py395dd/lib/python3.9/site-packages/pyminhash/pyminhash.py:88, in MinHash._create_minhash_signatures(self, df)
     76 def _create_minhash_signatures(self, df: pd.DataFrame) -> pd.DataFrame:
     77     """
     78     Apply minhashing to the column sparse_vector in Pandas dataframe df in the new column minhash_signature.
     79     In addition, one column (e.g.: 'hash_{0}') per hash table is created.
   (...)
     86
     87     """
---> 88     df['minhash_signature'] = df['sparse_vector'].apply(self._create_minhash)
     89     # the following involved way of creating 'hash_' columns prevents efficiency warnings
     90     hash_df = df['minhash_signature'].apply(pd.Series)

File ~/miniconda2/envs/py395dd/lib/python3.9/site-packages/pandas/core/series.py:4433, in Series.apply(self, func, convert_dtype, args, **kwargs)
   4323 def apply(
   4324     self,
   4325     func: AggFuncType,
   (...)
   4328     **kwargs,
   4329 ) -> DataFrame | Series:
   4330     """
   4331     Invoke function on values of Series.
   4332
   (...)
   4431     dtype: float64
   4432     """
-> 4433     return SeriesApply(self, func, convert_dtype, args, kwargs).apply()

File ~/miniconda2/envs/py395dd/lib/python3.9/site-packages/pandas/core/apply.py:1082, in SeriesApply.apply(self)
   1078 if isinstance(self.f, str):
   1079     # if we are a string, try to dispatch
   1080     return self.apply_str()
-> 1082 return self.apply_standard()

File ~/miniconda2/envs/py395dd/lib/python3.9/site-packages/pandas/core/apply.py:1137, in SeriesApply.apply_standard(self)
   1131 values = obj.astype(object)._values
   1132 # error: Argument 2 to "map_infer" has incompatible type
   1133 # "Union[Callable[..., Any], str, List[Union[Callable[..., Any], str]],
   1134 # Dict[Hashable, Union[Union[Callable[..., Any], str],
   1135 # List[Union[Callable[..., Any], str]]]]]"; expected
   1136 # "Callable[[Any], Any]"
-> 1137 mapped = lib.map_infer(
   1138     values,
   1139     f,  # type: ignore[arg-type]
   1140     convert=self.convert_dtype,
   1141 )
   1143 if len(mapped) and isinstance(mapped[0], ABCSeries):
   1144     # GH#43986 Need to do list(mapped) in order to get treated as nested
   1145     # See also GH#25959 regarding EA support
   1146     return obj._constructor_expanddim(list(mapped), index=obj.index)

File ~/miniconda2/envs/py395dd/lib/python3.9/site-packages/pandas/_libs/lib.pyx:2870, in pandas._libs.lib.map_infer()

File ~/miniconda2/envs/py395dd/lib/python3.9/site-packages/pyminhash/pyminhash.py:73, in MinHash._create_minhash(self, doc)
     71 hashes += self.b
     72 hashes %= self.next_prime
---> 73 minhashes = hashes.min(axis=0)
     74 return minhashes

File ~/miniconda2/envs/py395dd/lib/python3.9/site-packages/numpy/core/_methods.py:44, in _amin(a, axis, out, keepdims, initial, where)
     42 def _amin(a, axis=None, out=None, keepdims=False,
     43           initial=_NoValue, where=True):
---> 44     return umr_minimum(a, axis, None, out, keepdims, initial, where)

ValueError: zero-size array to reduction operation minimum which has no identity

Reshape if single value when having several values

Hi, sometimes when trying to predict some values it gives me the following error:

Expected 2D array, got 1D array instead:
array=[].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

It says my array is 1D, even though I read a .csv with pandas that contains more than 5000 values.

I don't know what could be wrong. My data are names of authors and there's nothing strange about them. I have fit with more than 60000 rows and then tried to predict: it works with some values and not with others. Maybe it is some kind of bug?
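
The "array=[]" in the message suggests an empty feature array is reaching scikit-learn, e.g. when blocking leaves no candidate pairs for the given records. A hedged sketch of a guard one could try before predicting (the column name 'name' and the variable myDedupliPy are placeholders):

# Drop empty/missing names before predicting; otherwise, if blocking
# yields no pairs, an empty array may reach the classifier.
subset = df[df['name'].notna() & (df['name'].str.strip() != '')]
if not subset.empty:
    result = myDedupliPy.predict(subset)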
