category_encoders's Introduction

Categorical Encoding Methods

A set of scikit-learn-style transformers for encoding categorical variables as numeric data using a variety of techniques.

Important Links

Documentation: http://contrib.scikit-learn.org/category_encoders/

Encoding Methods

Unsupervised:

  • Backward Difference Contrast [2][3]
  • BaseN [6]
  • Binary [5]
  • Gray [14]
  • Count [10]
  • Hashing [1]
  • Helmert Contrast [2][3]
  • Ordinal [2][3]
  • One-Hot [2][3]
  • Rank Hot [15]
  • Polynomial Contrast [2][3]
  • Sum Contrast [2][3]

Supervised:

  • CatBoost [11]
  • Generalized Linear Mixed Model [12]
  • James-Stein Estimator [9]
  • LeaveOneOut [4]
  • M-estimator [7]
  • Target Encoding [7]
  • Weight of Evidence [8]
  • Quantile Encoder [13]
  • Summary Encoder [13]

Installation

The package requires: numpy, statsmodels, and scipy.

To install the package, execute:

$ python setup.py install

or

pip install category_encoders

or

conda install -c conda-forge category_encoders

To install the development version, you may use:

pip install --upgrade git+https://github.com/scikit-learn-contrib/category_encoders

Usage

All of the encoders are fully sklearn-compatible transformers, so they can be used in pipelines or in your existing scripts. Supported input formats include numpy arrays and pandas dataframes. If the cols parameter isn't passed, all columns with object or pandas categorical data type will be encoded. Please see the docs for transformer-specific configuration options.

Examples

There are two types of encoders: unsupervised and supervised. An unsupervised example:

from category_encoders import *
import pandas as pd
from sklearn.datasets import load_boston

# prepare some data
bunch = load_boston()
y = bunch.target
X = pd.DataFrame(bunch.data, columns=bunch.feature_names)

# use binary encoding to encode two categorical features
enc = BinaryEncoder(cols=['CHAS', 'RAD']).fit(X)

# transform the dataset
numeric_dataset = enc.transform(X)

And a supervised example:

from category_encoders import *
import pandas as pd
from sklearn.datasets import load_boston

# prepare some data
bunch = load_boston()
y_train = bunch.target[0:250]
y_test = bunch.target[250:506]
X_train = pd.DataFrame(bunch.data[0:250], columns=bunch.feature_names)
X_test = pd.DataFrame(bunch.data[250:506], columns=bunch.feature_names)

# use target encoding to encode two categorical features
enc = TargetEncoder(cols=['CHAS', 'RAD'])

# transform the datasets
training_numeric_dataset = enc.fit_transform(X_train, y_train)
testing_numeric_dataset = enc.transform(X_test)

When transforming the training data with supervised methods, use the fit_transform() method rather than fit().transform(), because the two are not guaranteed to produce the same result. The difference can be observed with the LeaveOneOut encoder, which performs a nested cross-validation on the training data in fit_transform() (to reduce over-fitting of the downstream model) but uses all of the training data for scoring in transform() (to get estimates that are as accurate as possible).
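
As an illustration (a minimal sketch on toy data, not taken from the library docs), the two call patterns can produce different encodings of the training set:

import pandas as pd
from category_encoders import LeaveOneOutEncoder

X = pd.DataFrame({'cat': ['a', 'a', 'b', 'b']})
y = pd.Series([1.0, 3.0, 0.0, 2.0])

# fit_transform() excludes each row's own target from its category mean...
loo_train = LeaveOneOutEncoder(cols=['cat']).fit_transform(X, y)

# ...while fit(...).transform() scores every row with the full category mean
full_train = LeaveOneOutEncoder(cols=['cat']).fit(X, y).transform(X)

print(loo_train)   # row 0 is encoded as y[1] alone, i.e. 3.0
print(full_train)  # every 'a' row is encoded as mean(y[0], y[1]) = 2.0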

Furthermore, you may benefit from the following wrappers (a usage sketch follows the list):

  • PolynomialWrapper, which extends supervised encoders to support polynomial targets
  • NestedCVWrapper, which helps to prevent overfitting
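
A minimal sketch of how the wrappers compose with a supervised encoder (assuming the import path category_encoders.wrapper, where the library currently exposes them):

from category_encoders import TargetEncoder
from category_encoders.wrapper import PolynomialWrapper, NestedCVWrapper

# extend a binary-target encoder to a multiclass ("polynomial") target
multiclass_enc = PolynomialWrapper(TargetEncoder(cols=['CHAS', 'RAD']))

# encode training folds out-of-fold to reduce target leakage
cv_enc = NestedCVWrapper(TargetEncoder(cols=['CHAS', 'RAD']), cv=5)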

Additional examples and benchmarks can be found in the examples directory.

Contributing

Category encoders is under active development; if you'd like to be involved, we'd love to have you. Check out the CONTRIBUTING.md file or open an issue on the GitHub project to get started.

References

  1. Kilian Weinberger; Anirban Dasgupta; John Langford; Alex Smola; Josh Attenberg (2009). Feature Hashing for Large Scale Multitask Learning. Proc. ICML.
  2. Contrast Coding Systems for categorical variables. UCLA: Statistical Consulting Group. From https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/.
  3. Gregory Carey (2003). Coding Categorical Variables. From http://psych.colorado.edu/~carey/Courses/PSYC5741/handouts/Coding%20Categorical%20Variables%202006-03-03.pdf
  4. Owen Zhang - Leave One Out Encoding. From https://datascience.stackexchange.com/questions/10839/what-is-difference-between-one-hot-encoding-and-leave-one-out-encoding
  5. Beyond One-Hot: an exploration of categorical variables. From http://www.willmcginnis.com/2015/11/29/beyond-one-hot-an-exploration-of-categorical-variables/
  6. BaseN Encoding and Grid Search in categorical variables. From http://www.willmcginnis.com/2016/12/18/basen-encoding-grid-search-category_encoders/
  7. Daniele Micci-Barreca (2001). A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems. SIGKDD Explor. Newsl. 3, 1. From http://dx.doi.org/10.1145/507533.507538
  8. Weight of Evidence (WOE) and Information Value Explained. From https://www.listendata.com/2015/03/weight-of-evidence-woe-and-information.html
  9. Empirical Bayes for multiple sample sizes. From http://chris-said.io/2017/05/03/empirical-bayes-for-multiple-sample-sizes/
  10. Simple Count or Frequency Encoding. From https://www.datacamp.com/community/tutorials/encoding-methodologies
  11. Transforming categorical features to numerical features. From https://tech.yandex.com/catboost/doc/dg/concepts/algorithm-main-stages_cat-to-numberic-docpage/
  12. Andrew Gelman and Jennifer Hill (2006). Data Analysis Using Regression and Multilevel/Hierarchical Models. From https://faculty.psau.edu.sa/filedownload/doc-12-pdf-a1997d0d31f84d13c1cdc44ac39a8f2c-original.pdf
  13. Carlos Mougan, David Masip, Jordi Nin and Oriol Pujol (2021). Quantile Encoder: Tackling High Cardinality Categorical Features in Regression Problems. Modeling Decisions for Artificial Intelligence, 2021. Springer International Publishing https://link.springer.com/chapter/10.1007%2F978-3-030-85529-1_14
  14. Gray Encoding. From https://en.wikipedia.org/wiki/Gray_code
  15. Jacob Buckman, Aurko Roy, Colin Raffel, Ian Goodfellow: Thermometer Encoding: One Hot Way To Resist Adversarial Examples. From https://openreview.net/forum?id=S18Su--CW
  16. Carlos Mougan, Jose Alvarez, Salvatore Ruggieri, and Steffen Staab (2023). Fairness Implications of Encoding Protected Categorical Attributes. In Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society (AIES '23). https://arxiv.org/abs/2201.11358

category_encoders's Issues

Issue at install with statsmodels v 0.9.0

I encountered an issue with categorical-encoders and a new release of statsmodels. I've created a package, and when calling pip install . on the package to install it in a Docker container, I get the following:

  Downloading https://files.pythonhosted.org/packages/67/68/eb3ec6ab61f97216c257edddb853cc174cd76ea44b365cf4adaedcd44482/statsmodels-0.9.0.tar.gz (12.7MB)
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-e8na0ezb/statsmodels/setup.py", line 347, in <module>
        from numpy.distutils.misc_util import get_info
    ImportError: No module named 'numpy'

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-e8na0ezb/statsmodels/
The command '/bin/sh -c pip install .' returned a non-zero code: 1

This worked prior to the release of statsmodels 0.9.0, which was released at the end of April 2018.
Specifying statsmodels==0.8.0 in the setup.py install_requires list eliminated the issue.

What's the formula for HelmertEncoder

Could someone tell me the detailed processing logic of HelmertEncoder? My code is below:

import pandas as pd
import numpy as np
from category_encoders import HelmertEncoder

sales = [{'account': 'Jones LLC', 'Jan': 150, 'Feb': 200, 'Mar': 140},
         {'account': 'Alpha Co', 'Jan': 200, 'Feb': 210, 'Mar': 215},
         {'account': 'Blue Inc', 'Jan': 50, 'Feb': 90, 'Mar': 95}]
X = pd.DataFrame(sales)
y = np.array([1, 0, 1])

enc = HelmertEncoder(cols=['account']).fit(X, y)
numeric_dataset = enc.transform(X)
print(numeric_dataset)

Actually, how does HelmertEncoder transform [0, 1, 2] into [[1, -1, -1], [1, 1, -1], [1, 0, 2]]?

[Question] Can i get transformed names from encoders?

When I use encoders in a FeatureUnion, transform always returns a numpy.ndarray regardless of what was fed to the input (pd.DataFrame or something else). When I try to get the new feature names with FeatureUnion.get_feature_names(), I get AttributeError: Transformer ohe (type OneHotEncoder) does not provide get_feature_names. So how can I find the transformed feature names?
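
One workaround (a sketch, not an official API of the version in question): used outside a FeatureUnion, the encoders return pandas DataFrames, so the generated names can be read directly off the columns:

import pandas as pd
import category_encoders as ce

df = pd.DataFrame({'col': ['a', 'b', 'a']})
enc = ce.OneHotEncoder(cols=['col'])
encoded = enc.fit_transform(df)
print(list(encoded.columns))  # the transformed feature names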

Binary Encoding problem

Hi there, first time using this binary encoding method, but I think there is a bug:

  • When encoding a column containing 3 possible categories, it encodes to 1 column
  • When encoding a column containing 5 possible categories, it encodes to 2 columns
  • The same happens with 9 possible categories, which are encoded to only 3 columns
  • Also, when encoding a column containing only 2 possible categories, it returns an empty DataFrame

2 categories:

>>> import pandas as pd
>>> import category_encoders as ce
>>> df1 = pd.DataFrame([[1],[2],[2],[1]], columns=['col_a'])
>>> df1
   col_a
0      1
1      2
2      2
3      1
>>> encoder = ce.BinaryEncoder(cols=['col_a'])
>>> encoder.transform(df1)
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]

3 categories:

>>> df2 = pd.DataFrame([[1],[2],[2],[3]], columns=['col_a'])
>>> encoder.transform(df2)
   col_a_0
0        0
1        1
2        1
3        1

Request to add support for pandas categorical types when using OrdinalEncoder

Hi All,

I ran into an issue using OrdinalEncoder where I assumed that it would use the mapping in the categorical types in the passed pandas dataframe. Instead, it created a mapping based on the order of the data passed to fit_transform. I realize that this is user error, but it would be nice for OrdinalEncoder to use the existing mapping information from the categorical dtype. If the maintainers think this would be a nice addition, I'd be happy to make a PR.

Here's a demonstration of the issue.

First I'll create a categorical dtype and a dataframe that uses it.

from category_encoders.ordinal import OrdinalEncoder
import pandas as pd
from pandas.api.types import CategoricalDtype

platforms = ['android', 'ios', 'amazon']
platform_category = CategoricalDtype(categories=platforms, ordered=False)

df = pd.DataFrame([
    {'id': 1, 'platform': 'android'},
    {'id': 2, 'platform': 'ios'},
    {'id': 3, 'platform': 'amazon'},
])
df['platform'] = df['platform'].astype(platform_category)
print(df)
   id platform
0   1  android
1   2      ios
2   3   amazon

The encoding from the categorical dtype looks like this:

[(cat, code) for code, cat in enumerate(df['platform'].cat.categories)]
[('android', 0), ('ios', 1), ('amazon', 2)]

Now I'll make an encoder without a mapping parameter, then transform with the data sorted.

categorical_columns = ['platform']
cat_encoder_a = OrdinalEncoder(cols=categorical_columns)
df_a = cat_encoder_a.fit_transform(df.sort_values(by='platform', ascending=True))
print(df_a)
   id  platform
0   1         0
1   2         1
2   3         2

The category mapping from the encoder happens to match the categorical dtype mapping because of the sort order of the dataframe.

cat_encoder_a.category_mapping
[{'col': 'platform', 'mapping': [('android', 0), ('ios', 1), ('amazon', 2)]}]

But if I reverse the order of the data passed to fit_transform I will get a different mapping.

cat_encoder_b = OrdinalEncoder(cols=categorical_columns)
df_b = cat_encoder_b.fit_transform(df.sort_values(by='platform', ascending=False))
cat_encoder_b.category_mapping
[{'col': 'platform', 'mapping': [('amazon', 0), ('ios', 1), ('android', 2)]}]

I can get a stable mapping from the categorical types in the dataframe itself (instead of relying on the order of the data).

category_mapping = [
    {'col': column_name, 'mapping': [(cat, code) for code, cat in enumerate(df[column_name].cat.categories)]} 
    for column_name in df.select_dtypes(['category']).columns
]
category_mapping
[{'col': 'platform', 'mapping': [('android', 0), ('ios', 1), ('amazon', 2)]}]
cat_encoder_c = OrdinalEncoder(cols=categorical_columns, mapping=category_mapping)
df_c = cat_encoder_c.fit_transform(df).sort_values(by='platform', ascending=False)
cat_encoder_c.category_mapping
[{'col': 'platform', 'mapping': [('android', 0), ('ios', 1), ('amazon', 2)]}]

So while it is not hard to pass a custom mapping, it would be nice for OrdinalEncoder to handle this automatically if no mapping is passed and if any of the columns are pandas categoricals. I think this logic could be handled around here. https://github.com/scikit-learn-contrib/categorical-encoding/blob/1.2.6/category_encoders/ordinal.py#L270

If others think this would be a positive addition, I will make a PR.

thanks,
Dennis

EDIT: I modified the above code to fix a bug in getting the mapping from the pandas categorical dtype. The previous code I had happened to work only because the data was in order, but in general that does not work. The correct way is

[(cat, code) for code, cat in enumerate(df[column_name].cat.categories)]

Moving required number of digits calculation to fit() - BinaryEncoder and BaseNEncoder

In binary.py, the required number of digits is calculated as such, inside the binary() function:

# figure out how many digits we need to represent the classes present
digits = int(np.ceil(np.log2(len(X[col].unique()))))

binary() is called inside the transform() function.

Is there a specific reason why the required number of digits is calculated during transform() instead of fit()?

The reason I am asking is that when performing distributed computation, different partitions of the same array are sent to different machines, and the calculated number of digits would then differ from machine to machine, depending on how many distinct classes each partition happens to contain.
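
A sketch of the proposed change (function names hypothetical, for illustration only): fix the digit count from the full training data at fit time and reuse it for every transform, so all partitions agree.

import numpy as np
import pandas as pd

def fit_digit_counts(X, cols):
    # fit time: one digit count per column, from the full training data
    return {col: int(np.ceil(np.log2(X[col].nunique()))) for col in cols}

def to_binary_columns(X, col, digits):
    # transform time: expand an ordinal column into a fixed number of digit columns
    codes = X[col].astype(int)
    out = {'%s_%d' % (col, d): (codes // 2 ** (digits - 1 - d)) % 2 for d in range(digits)}
    return pd.DataFrame(out, index=X.index)

# digits = fit_digit_counts(X_train, ['col_a'])                  # once, at fit()
# encoded = to_binary_columns(X_part, 'col_a', digits['col_a'])  # any partition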

Error when using pipelines with BinaryEncoder

I adapted this code from the examples/encoding_examples.py file to use a pipeline. cross_val_score fails with the error at the end of this post.

import pandas as pd
import numpy as np
from sklearn import cross_validation, linear_model, model_selection
import category_encoders
from examples.source_data.loaders import get_mushroom_data, get_cars_data, get_splice_data
from sklearn.pipeline import make_pipeline

X, y, mapping = get_mushroom_data()
t = category_encoders.BinaryEncoder(handle_unknown = "ignore")
mypipeline = make_pipeline(t, linear_model.LogisticRegression())

cross_validation.cross_val_score(mypipeline, X, y, n_jobs=1, cv=5)

Abridged List of packages installed

  • numpy 1.14.0 py36h4a99626_1
  • pandas 0.22.0 py36h6538335_0
  • python 3.6.4 h6538335_1
  • scikit-learn 0.19.1 py36h53aea1b_0

ERROR


ValueError Traceback (most recent call last)
in ()
10 mypipeline = make_pipeline(t, linear_model.LogisticRegression())
11
---> 12 cross_validation.cross_val_score(mypipeline, X, y, n_jobs=1, cv=5)
13
14

C:\Anaconda3\lib\site-packages\sklearn\cross_validation.py in cross_val_score(estimator, X, y, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch)
1579 train, test, verbose, None,
1580 fit_params)
-> 1581 for train, test in cv)
1582 return np.array(scores)[:, 0]
1583

C:\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in call(self, iterable)
777 # was dispatched. In particular this covers the edge
778 # case of Parallel used with an exhausted iterator.
--> 779 while self.dispatch_one_batch(iterator):
780 self._iterating = True
781 else:

C:\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in dispatch_one_batch(self, iterator)
623 return False
624 else:
--> 625 self._dispatch(tasks)
626 return True
627

C:\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in _dispatch(self, batch)
586 dispatch_timestamp = time.time()
587 cb = BatchCompletionCallBack(dispatch_timestamp, len(batch), self)
--> 588 job = self._backend.apply_async(batch, callback=cb)
589 self._jobs.append(job)
590

C:\Anaconda3\lib\site-packages\sklearn\externals\joblib_parallel_backends.py in apply_async(self, func, callback)
109 def apply_async(self, func, callback=None):
110 """Schedule a func to be run"""
--> 111 result = ImmediateResult(func)
112 if callback:
113 callback(result)

C:\Anaconda3\lib\site-packages\sklearn\externals\joblib_parallel_backends.py in init(self, batch)
330 # Don't delay the application, to avoid keeping the input
331 # arguments in memory
--> 332 self.results = batch()
333
334 def get(self):

C:\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in call(self)
129
130 def call(self):
--> 131 return [func(*args, **kwargs) for func, args, kwargs in self.items]
132
133 def len(self):

C:\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in (.0)
129
130 def call(self):
--> 131 return [func(*args, **kwargs) for func, args, kwargs in self.items]
132
133 def len(self):

C:\Anaconda3\lib\site-packages\sklearn\cross_validation.py in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, error_score)
1692
1693 else:
-> 1694 test_score = _score(estimator, X_test, y_test, scorer)
1695 if return_train_score:
1696 train_score = _score(estimator, X_train, y_train, scorer)

C:\Anaconda3\lib\site-packages\sklearn\cross_validation.py in _score(estimator, X_test, y_test, scorer)
1749 score = scorer(estimator, X_test)
1750 else:
-> 1751 score = scorer(estimator, X_test, y_test)
1752 if hasattr(score, 'item'):
1753 try:

C:\Anaconda3\lib\site-packages\sklearn\metrics\scorer.py in _passthrough_scorer(estimator, *args, **kwargs)
242 def _passthrough_scorer(estimator, *args, **kwargs):
243 """Function that wraps estimator.score"""
--> 244 return estimator.score(*args, **kwargs)
245
246

C:\Anaconda3\lib\site-packages\sklearn\utils\metaestimators.py in (*args, **kwargs)
113
114 # lambda, but not partial, allows help() to work with update_wrapper
--> 115 out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
116 # update the docstring of the returned function
117 update_wrapper(out, self.fn)

C:\Anaconda3\lib\site-packages\sklearn\pipeline.py in score(self, X, y, sample_weight)
484 for name, transform in self.steps[:-1]:
485 if transform is not None:
--> 486 Xt = transform.transform(Xt)
487 score_params = {}
488 if sample_weight is not None:

C:\Anaconda3\lib\site-packages\category_encoders\binary.py in transform(self, X)
163 X = self.ordinal_encoder.transform(X)
164
--> 165 X = self.binary(X, cols=self.cols)
166
167 if self.drop_invariant:

C:\Anaconda3\lib\site-packages\category_encoders\binary.py in binary(self, X_in, cols)
248
249 # map the ordinal column into a list of these digits, of length digits
--> 250 X[col] = X[col].map(lambda x: self.col_transform(x, digits))
251
252 for dig in range(digits):

C:\Anaconda3\lib\site-packages\pandas\core\series.py in map(self, arg, na_action)
2352 else:
2353 # arg is a function
-> 2354 new_values = map_f(values, arg)
2355
2356 return self._constructor(new_values,

pandas/_libs/src/inference.pyx in pandas._libs.lib.map_infer()

C:\Anaconda3\lib\site-packages\category_encoders\binary.py in (x)
248
249 # map the ordinal column into a list of these digits, of length digits
--> 250 X[col] = X[col].map(lambda x: self.col_transform(x, digits))
251
252 for dig in range(digits):

C:\Anaconda3\lib\site-packages\category_encoders\binary.py in col_transform(col, digits)
309 else:
310
--> 311 col = list("{0:b}".format(int(col)))
312 if len(col) == digits:
313 return col

ValueError: cannot convert float NaN to integer

Leave One Out Encoding Behavior

I've been trying the LeaveOneOutEncoder, but from what I can tell it just takes the mean of the target for the current level without leaving the current example out. Also, setting randomized = True does not seem to add any noise. Am I missing something?

Doing the example in the docs:

from category_encoders import *
import pandas as pd
from sklearn.datasets import load_boston
bunch = load_boston()
y = bunch.target
X = pd.DataFrame(bunch.data, columns=bunch.feature_names)

enc = LeaveOneOutEncoder(cols=['CHAS', 'RAD'], randomized = True).fit(X, y)
numeric_dataset = enc.transform(X)

Just replaces the 'CHAS' and 'RAD' columns with the mean of the target grouped by the corresponding variable.

I'm using:
Python 2.7.13
category-encoders==1.2.4
pandas==0.20.3
numpy==1.13.1
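
For reference, the leave-one-out statistic the encoder is expected to produce for training rows can be written directly in pandas (a sketch of the definition, not the library's code): each row gets the mean of the target over the other rows in the same category.

import pandas as pd

def leave_one_out_means(X, y, col):
    # (group sum - own target) / (group count - 1) for each row
    grp = y.groupby(X[col])
    return (grp.transform('sum') - y) / (grp.transform('count') - 1)

Note that, per the usage section above, fit(X, y).transform(X) intentionally scores with the plain category mean; the leave-one-out statistic is only applied by fit_transform(X, y).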

re-using pickled/saved encoders

I've trained a model, for which I use the one-hot-encoder, and let's say a categorical variable, e.g. device_type, had 20 possible values, the trained model is now expecting to see, among other columns, 20 columns related with the values of device_type, i.e.:

device_type_0, device_type_1, ..., device_type_20

I also saved/pickled this trained model as well as the one-hot-encoder.

I now have new data coming, and the device_type, for this new data, only has 10 possible values. I was assuming that by applying the pickled/saved one-hot-encoder which I used before, I would see the same 20 columns related with the values of device_type, i.e.:

device_type_0, device_type_1, ..., device_type_20

But I only see 10 device_type_X columns, i.e., only for the values present in the new data.

Is there something I'm missing when I instantiate the encoder, or a parameter I'm missing when applying the fit, or is this functionality simply not implemented?
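
For reference, a minimal round-trip sketch of the intended workflow (whether the fitted column set is preserved on new data depends on the library version; see the "Support full transform for OneHotEncoder" issue below):

import pickle
import pandas as pd
import category_encoders as ce

train = pd.DataFrame({'device_type': list('abcde')})
new = pd.DataFrame({'device_type': list('ab')})

enc = ce.OneHotEncoder(cols=['device_type']).fit(train)
blob = pickle.dumps(enc)      # persist the fitted encoder

restored = pickle.loads(blob)
out = restored.transform(new)
print(out.columns)            # ideally one column per category seen at fit time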

Transformers always return pandas dataframes

All of the transformers in this library use pandas dataframes internally, but will accept either numpy arrays or pandas dataframes as inputs. They all return dataframes though. For use in pipelines it may be helpful to optionally return a numpy array (df.values) instead of a dataframe.

If anyone would like to get into contributing, this would be a good starter issue; comment here if you're interested.
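
Current releases expose this via the return_df constructor parameter (used in other issues on this page); a quick sketch:

import pandas as pd
import category_encoders as ce

df = pd.DataFrame({'col': ['a', 'b', 'a']})
enc = ce.OrdinalEncoder(cols=['col'], return_df=False)
arr = enc.fit_transform(df)  # a numpy array instead of a DataFrame
print(type(arr))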

Test cases failing

Hi Will,

Just wanted to bring your attention to test cases failing for the encoders utilizing the patsy dmatrix. This was basically due to the fact that the test data frame had a column 'C', which is also the patsy way of denoting that you want a categorical variable. After renaming the column to something like C1, all the tests passed.

Kind Regards,
Rakesh

========================================================================================= test session starts =========================================================================================
platform darwin -- Python 2.7.10, pytest-2.9.1, py-1.4.31, pluggy-0.3.1
rootdir: categorical_encoding, ini file:
collected 7 items 

tests/test_encoders.py F..F.FF

============================================================================================== FAILURES ===============================================================================================
________________________________________________________________________________ TestEncoders.test_backward_difference ________________________________________________________________________________

self = <categorical_encoding.tests.test_encoders.TestEncoders testMethod=test_backward_difference>

    def test_backward_difference(self):
        """

            :return:
            """

        cols = ['C', 'D', 'E', 'F']
        enc = encoders.BackwardDifferenceEncoder(verbose=1, cols=cols)
        X = self.create_dataset(n_rows=1000)

>       X_test = enc.fit_transform(X, None)

tests/test_encoders.py:77: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/Library/Python/2.7/site-packages/sklearn/base.py:455: in fit_transform
    return self.fit(X, **fit_params).transform(X)
category_encoders/backward_difference.py:83: in transform
    return backward_difference_coding(X, cols=self.cols)
category_encoders/backward_difference.py:32: in backward_difference_coding
    mod = dmatrix("C(%s, Diff)" % (col, ), X)
/Library/Python/2.7/site-packages/patsy/highlevel.py:291: in matrix
    NA_action, return_type)
/Library/Python/2.7/site-packages/patsy/highlevel.py:165: in _do_highlevel_design
    NA_action)
/Library/Python/2.7/site-packages/patsy/highlevel.py:70: in _try_incr_builders
    NA_action)
/Library/Python/2.7/site-packages/patsy/build.py:696: in design_matrix_builders
    NA_action)
/Library/Python/2.7/site-packages/patsy/build.py:443: in _examine_factor_types
    value = factor.eval(factor_states[factor], data)
/Library/Python/2.7/site-packages/patsy/eval.py:566: in eval
    data)
/Library/Python/2.7/site-packages/patsy/eval.py:551: in _eval
    inner_namespace=inner_namespace)
/Library/Python/2.7/site-packages/patsy/compat.py:117: in call_and_wrap_xc
    return f(*args, **kwargs)
/Library/Python/2.7/site-packages/patsy/eval.py:166: in eval
    + self._namespaces))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   TypeError: 'Series' object is not callable

<string>:1: TypeError
______________________________________________________________________________________ TestEncoders.test_helmert ______________________________________________________________________________________

self = <categorical_encoding.tests.test_encoders.TestEncoders testMethod=test_helmert>

    def test_helmert(self):
        """

            :return:
            """

        cols = ['C', 'D', 'E', 'F']
        enc = encoders.HelmertEncoder(verbose=1, cols=cols)
        X = self.create_dataset(n_rows=1000)

>       X_test = enc.fit_transform(X, None)

tests/test_encoders.py:113: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/Library/Python/2.7/site-packages/sklearn/base.py:455: in fit_transform
    return self.fit(X, **fit_params).transform(X)
category_encoders/helmert.py:82: in transform
    return helmert_coding(X, cols=self.cols)
category_encoders/helmert.py:32: in helmet_coding
    mod = dmatrix("C(%s, Helmert)" % (col, ), X)
/Library/Python/2.7/site-packages/patsy/highlevel.py:291: in matrix
    NA_action, return_type)
/Library/Python/2.7/site-packages/patsy/highlevel.py:165: in _do_highlevel_design
    NA_action)
/Library/Python/2.7/site-packages/patsy/highlevel.py:70: in _try_incr_builders
    NA_action)
/Library/Python/2.7/site-packages/patsy/build.py:696: in design_matrix_builders
    NA_action)
/Library/Python/2.7/site-packages/patsy/build.py:443: in _examine_factor_types
    value = factor.eval(factor_states[factor], data)
/Library/Python/2.7/site-packages/patsy/eval.py:566: in eval
    data)
/Library/Python/2.7/site-packages/patsy/eval.py:551: in _eval
    inner_namespace=inner_namespace)
/Library/Python/2.7/site-packages/patsy/compat.py:117: in call_and_wrap_xc
    return f(*args, **kwargs)
/Library/Python/2.7/site-packages/patsy/eval.py:166: in eval
    + self._namespaces))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   TypeError: 'Series' object is not callable

<string>:1: TypeError
____________________________________________________________________________________ TestEncoders.test_polynomial _____________________________________________________________________________________

self = <categorical_encoding.tests.test_encoders.TestEncoders testMethod=test_polynomial>

    def test_polynomial(self):
        """

            :return:
            """

        cols = ['C', 'D', 'E', 'F']
        enc = encoders.PolynomialEncoder(verbose=1, cols=cols)
        X = self.create_dataset(n_rows=1000)

>       X_test = enc.fit_transform(X, None)

tests/test_encoders.py:131: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/Library/Python/2.7/site-packages/sklearn/base.py:455: in fit_transform
    return self.fit(X, **fit_params).transform(X)
category_encoders/polynomial.py:77: in transform
    return polynomial_coding(X, cols=self.cols)
category_encoders/polynomial.py:31: in polynomial_coding
    mod = dmatrix("C(%s, Poly)" % (col, ), X)
/Library/Python/2.7/site-packages/patsy/highlevel.py:291: in matrix
    NA_action, return_type)
/Library/Python/2.7/site-packages/patsy/highlevel.py:165: in _do_highlevel_design
    NA_action)
/Library/Python/2.7/site-packages/patsy/highlevel.py:70: in _try_incr_builders
    NA_action)
/Library/Python/2.7/site-packages/patsy/build.py:696: in design_matrix_builders
    NA_action)
/Library/Python/2.7/site-packages/patsy/build.py:443: in _examine_factor_types
    value = factor.eval(factor_states[factor], data)
/Library/Python/2.7/site-packages/patsy/eval.py:566: in eval
    data)
/Library/Python/2.7/site-packages/patsy/eval.py:551: in _eval
    inner_namespace=inner_namespace)
/Library/Python/2.7/site-packages/patsy/compat.py:117: in call_and_wrap_xc
    return f(*args, **kwargs)
/Library/Python/2.7/site-packages/patsy/eval.py:166: in eval
    + self._namespaces))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   TypeError: 'Series' object is not callable

<string>:1: TypeError
________________________________________________________________________________________ TestEncoders.test_sum ________________________________________________________________________________________

self = <categorical_encoding.tests.test_encoders.TestEncoders testMethod=test_sum>

    def test_sum(self):
        """

            :return:
            """

        cols = ['C', 'D', 'E', 'F']
        enc = encoders.SumEncoder(verbose=1, cols=cols)
        X = self.create_dataset(n_rows=1000)

>       X_test = enc.fit_transform(X, None)

tests/test_encoders.py:149: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/Library/Python/2.7/site-packages/sklearn/base.py:455: in fit_transform
    return self.fit(X, **fit_params).transform(X)
category_encoders/sum_coding.py:83: in transform
    return sum_coding(X, cols=self.cols)
category_encoders/sum_coding.py:32: in sum_coding
    mod = dmatrix("C(%s, Sum)" % (col, ), X)
/Library/Python/2.7/site-packages/patsy/highlevel.py:291: in matrix
    NA_action, return_type)
/Library/Python/2.7/site-packages/patsy/highlevel.py:165: in _do_highlevel_design
    NA_action)
/Library/Python/2.7/site-packages/patsy/highlevel.py:70: in _try_incr_builders
    NA_action)
/Library/Python/2.7/site-packages/patsy/build.py:696: in design_matrix_builders
    NA_action)
/Library/Python/2.7/site-packages/patsy/build.py:443: in _examine_factor_types
    value = factor.eval(factor_states[factor], data)
/Library/Python/2.7/site-packages/patsy/eval.py:566: in eval
    data)
/Library/Python/2.7/site-packages/patsy/eval.py:551: in _eval
    inner_namespace=inner_namespace)
/Library/Python/2.7/site-packages/patsy/compat.py:117: in call_and_wrap_xc
    return f(*args, **kwargs)
/Library/Python/2.7/site-packages/patsy/eval.py:166: in eval
    + self._namespaces))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   TypeError: 'Series' object is not callable

<string>:1: TypeError
================================================================================= 4 failed, 3 passed in 1.70 seconds ==================================================================================

testing new coding methods

Hi - is there a standard metric that's been established to evaluate the performance of the various categorical encoding methods? In my own work I use a generalization of 1-hot encoding I call k-hot encoding, where k denotes the number of active bits. I'd be keen to evaluate it on a standard metric if one exists to see if it's any good for more general tasks.

Try to use hashencoder but fail due to high memory usage

I tried to convert several categorical columns to hash-encoded features; the input data has 2.7 million rows, and one column has 20K categories. But memory usage climbs to 10 GB, which seems very strange. Why does the encoder use so much memory?
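
For what it's worth, the hashing trick is normally used precisely to bound the output width; a sketch of capping the output dimensionality with the n_components parameter (the memory behaviour reported above may still be version-specific):

import pandas as pd
import category_encoders as ce

df = pd.DataFrame({'big_col': ['cat_%d' % (i % 2000) for i in range(10000)]})

# 16 hashed output columns, regardless of the number of distinct input values
enc = ce.HashingEncoder(cols=['big_col'], n_components=16)
print(enc.fit_transform(df).shape)  # (10000, 16)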

OneHotEncoder creates dummy column

In the following example:

>>> import numpy as np
>>> import pandas as pd
>>> import category_encoders as ce
>>> myDf = pd.DataFrame({'col': ['a', 'b', np.NAN]})
>>> enc = ce.OneHotEncoder(cols=['col'])
>>> enc = enc.fit(myDf)
>>> enc.transform(myDf)
  col_0  col_1  col_2  col_-1
0      1      0      0       0
1      0      1      0       0
2      0      0      0       1

I believe col_2 should not be created. I think this is a bug caused by OneHotEncoder internally calling OrdinalEncoder, with both of them imputing an extra category.

Happy to hear if I am wrong?

Update documentation

Update the docs with the new docstrings (in separate issue) and with new introduction and citations (also in separate issues). CircleCI is failing on the push to gh-pages, so that should be fixed as well.

Make the changes from #35 in BaseNEncoder

@andrethrill submitted a PR to move the determination of n_cols to fit() rather than computing it on the fly in transform(); we should make the same change in BaseNEncoder.

Continuous distribution based on probablities

I found interesting approach in paper "The Synthetic Data Vault: Generative Modeling for Relational Databases". It seems like there are no implementations in popular libs.

Steps:

  1. Sort the categories from most frequently occurring to least.
  2. Split the interval [0, 1] into sections based on the cumulative probability of each category.
  3. To convert a category, find the interval [a, b] ⊂ [0, 1] that corresponds to that category.
  4. Choose a value between a and b by sampling from a truncated Gaussian distribution with μ at the center of the interval and σ = (b − a) / 6.


Does it seem reasonable to implement this? I'm ready to contribute this part.
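
A self-contained sketch of those four steps (my own illustration of the paper's idea, not library code), using scipy.stats.truncnorm:

import numpy as np
import pandas as pd
from scipy.stats import truncnorm

def encode_continuous(s, seed=0):
    rng = np.random.default_rng(seed)
    # 1. categories sorted from most to least frequent
    probs = s.value_counts(normalize=True)
    # 2. cumulative-probability interval [a, b] per category
    upper = probs.cumsum()
    lower = upper - probs
    out = pd.Series(index=s.index, dtype=float)
    for cat in probs.index:
        a, b = lower[cat], upper[cat]
        mu, sigma = (a + b) / 2.0, (b - a) / 6.0
        # 3.-4. sample from a Gaussian truncated to [a, b]
        dist = truncnorm((a - mu) / sigma, (b - mu) / sigma, loc=mu, scale=sigma)
        mask = (s == cat)
        out[mask] = dist.rvs(size=mask.sum(), random_state=rng)
    return out

print(encode_continuous(pd.Series(['x', 'x', 'x', 'y', 'y', 'z'])))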

Adding variable names to enconding

Is it possible, for instance, with one-hot encoding to add the variable value to the variable name? E.g. with the variable gender having the values unknown, male, female,

encoded as three variables with names:
gender_unknown gender_male gender_female

instead of:
gender_0 gender_1 gender_2
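
Recent versions of the library support this through the use_cat_names flag on OneHotEncoder (a small sketch; older releases may not have the parameter):

import pandas as pd
import category_encoders as ce

df = pd.DataFrame({'gender': ['unknown', 'male', 'female']})
enc = ce.OneHotEncoder(cols=['gender'], use_cat_names=True)
print(enc.fit_transform(df).columns)
# e.g. ['gender_unknown', 'gender_male', 'gender_female']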

Drop Original Feature when Encoding

Can we set an option to drop the original feature feat_0 for the numeric encoders?

See this code below:

import category_encoders
import pandas as pd
import numpy as np

dd = {'feat1':['a','b','c','d','e'],'feat2':np.random.random(size=(5))}
df = pd.DataFrame(dd)
encoder = category_encoders.BackwardDifferenceEncoder(cols=['feat1'])
encoder.fit_transform(df)

This returns feat_0 as the first feature, and every row element is the number 1. Is this a desired property or should it be removed?
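
Since that first column is constant (it is the contrast-coding intercept, all ones), the existing drop_invariant option should remove it, assuming it drops zero-variance output columns as its name suggests; a sketch:

import numpy as np
import pandas as pd
import category_encoders as ce

dd = {'feat1': ['a', 'b', 'c', 'd', 'e'], 'feat2': np.random.random(size=5)}
df = pd.DataFrame(dd)

# drop_invariant=True removes constant output columns such as the intercept
encoder = ce.BackwardDifferenceEncoder(cols=['feat1'], drop_invariant=True)
print(encoder.fit_transform(df).head())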

category value in column name?

E.g.

>>> df = pd.DataFrame({'col': ['a', 'b']})
>>> encoder = ce.OneHotEncoder()
>>> encoder.fit_transform(df)
   col_0  col_1  col_-1
0      1      0       0
1      0      1       0

Ignoring last column, I'd like to see

   col_a  col_b
0      1      0
1      0      1

So that e.g. if I find col_a is a strong feature in a model, I can interpret what it means easily (as opposed to col_0). I know I can hack this myself, but I'd have thought it'd be a common use case, and natively supported. Unless I'm missing something ...

Make use of multiprocessing module?

Hello,

Thanks for your work on this. Is there any way we can make use of the multiprocessing module here? I have this snippet,

# Specify the columns to encode then fit and transform
encoder = ce.binary.BinaryEncoder(cols=[.......],return_df=True)
encoder.fit(inputdatav1, verbose=1)

inputdatav2 = encoder.transform(inputdatav1)

I am trying to encode 9 categorical columns; I know 7 of them have fewer than 100 levels, but two columns are big: one has 40702 unique values and the other has 48112. My encoding has been running for the past 7 hours. Is there a way to speed up the process?
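
As far as I know, HashingEncoder is the only encoder in this library that exposes multiprocessing, via its max_process parameter; a sketch of swapping it in for the two very-high-cardinality columns (column names hypothetical, and note that hashing is a different encoding, so the output is not equivalent to BinaryEncoder's):

import category_encoders as ce

# hashing bounds the output width and can spread the work over processes
encoder = ce.HashingEncoder(cols=['big_col_1', 'big_col_2'],
                            n_components=32, max_process=4)
# inputdatav2 = encoder.fit_transform(inputdatav1)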

Bug in basen.py

>>> import pandas as pd
>>> import category_encoders as ce
>>> df = pd.DataFrame({'col1': ['a', 'b', 'c'], 'col2': ['d', 'e', 'f']})
  col1 col2
0    a    d
1    b    e
2    c    f
>>> df_1 = pd.DataFrame({'col1': ['a', 'b', 'd'], 'col2': ['d', 'e', 'f']})
  col1 col2
0    a    d
1    b    e
2    d    f
>>> enc = ce.BaseNEncoder(verbose=1)
>>> enc.fit(df)
>>> print enc.transform(df_1)
   col1_0  col1_1  col2_0  col2_1
0     0.0     0.0       0       0
1     0.0     1.0       0       1
2     0.0     0.0       1       0

Is there a bug in basen.py?

Add bases other than binary

The exact methodology used in binary encoding can be easily expanded out to other bases (3, 4, etc). It would be great to have a BaseXEncoder to allow for experiments with different bases.
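
This has since landed as BaseNEncoder (listed under the encoding methods above); a quick sketch with base 3:

import pandas as pd
import category_encoders as ce

df = pd.DataFrame({'col': list('abcdefghi')})
enc = ce.BaseNEncoder(cols=['col'], base=3)
print(enc.fit_transform(df).head())  # ceil(log_3(9)) = 2 base-3 digits cover 9 categories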

Support full transform for OneHotEncoder

Currently, OneHotEncoder.transform only adds columns for the categories that are passed in, not for all the categories present. This breaks sklearn pipelines when applied to a few rows because OneHotEncoder.transform returns the wrong shape. The example from this stackoverflow post shows how fit_transform returns 5 columns, but transform only returns 3 columns when it's called on 2 rows.

Ideally, transform should behave as follows in that stackoverflow example, so it can be used in sklearn pipelines properly:

df_test = pd.DataFrame({'cat1':['A','N',],'cat2':['T','B']})
df_test_trans = enc_ohe.transform(df_test)

print(df_test_trans)

   cat1_0  cat1_1  cat1_2  cat1_3 cat2
0       0       1       0       0    T
1       0       0       0       1    B

Optionally inspect input data to infer cols

In many cases the only things being encoded are non-numeric columns (of course integer columns are sometimes categorical too, but that's a separate case). It would be nice to be able to pass convert_strings=True instead of columns, and then have the fit() method iterate through the input dataframe and set cols to all columns with the pandas dtype object.
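
The inference itself is simple to express with pandas, and the usage section above notes that current releases already apply it when cols is not passed; a sketch of the underlying idea:

import pandas as pd

df = pd.DataFrame({'a': ['x', 'y'], 'b': [1, 2], 'c': pd.Categorical(['u', 'v'])})

# all object or categorical columns, as the encoders pick by default
cols = df.select_dtypes(include=['object', 'category']).columns.tolist()
print(cols)  # ['a', 'c']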

Documentation not clear about the "mapping" parameter of OrdinalEncoder

Hi,

Please look at this stack-overflow question for reference. The mapping parameter usage by the user in this question is wrong.

But I am not able to find documentation of the correct usage of the mapping param in OrdinalEncoder. The documentation only says this:

"a mapping of class to label to use for the encoding, optional."

Implementing default examples and the example in this issue #52, and then looking at the output of the following code:

encoder.category_mapping
# Output: [{'col': 'col1', 'mapping': [(None, 0), ('a', 1), ('b', 2)]}]

makes it clear.

I think the documentation should state clearly that 'mapping' should be a list of dicts, that each dict should contain the keys 'col' and 'mapping', and that the inner mapping should be a list of tuples in the format (original_label, encoded_label), preferably with an example.
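
For instance, something like the following, using the (original_label, encoded_label) tuple format shown above (note that newer releases changed the inner mapping format, so check the docs for your version):

import pandas as pd
import category_encoders as ce

mapping = [{'col': 'col1', 'mapping': [(None, 0), ('a', 1), ('b', 2)]}]
enc = ce.OrdinalEncoder(cols=['col1'], mapping=mapping)

df = pd.DataFrame({'col1': [None, 'a', 'b']})
print(enc.fit_transform(df))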

How use LeaveOneOutEncoder in pipelines

I can't figure out how to correctly use LeaveOneOutEncoder in pipelines.
Consider the following code:

from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names, target):
        self.attribute_names = attribute_names
        self.y_ = target

    def fit(self, X, y):
        return self

    def transform(self, X, y):
        return X[self.attribute_names]

cat_attribs = ["ocean_proximity"]

cat_pipeline = Pipeline([
        ('selector', DataFrameSelector(cat_attribs)),
        ('cat_en', LeaveOneOutEncoder(return_df=False))
    ])

cat_pipeline.fit(housing)
housing_prepared = cat_pipeline.transform(housing)
housing_prepared

I receive this error:
TypeError: fit() missing 1 required positional argument: 'y'

But how do I pass y?
Thanks.
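
For context, a sketch of the usual fix (not an official answer from the maintainers): Pipeline.fit(X, y) forwards y to every step's fit automatically, but each transformer must accept y as an optional argument, and transform must take X alone (housing_labels below is a stand-in for the target):

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from category_encoders import LeaveOneOutEncoder

class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):   # y optional, so fit works with or without a target
        return self

    def transform(self, X):     # transform never receives y inside a pipeline
        return X[self.attribute_names]

cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(['ocean_proximity'])),
    ('cat_en', LeaveOneOutEncoder(return_df=False)),
])

# y is forwarded to LeaveOneOutEncoder.fit by the pipeline:
# housing_prepared = cat_pipeline.fit_transform(housing, housing_labels)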

References hyperlink is deprecated

What should 'None' be converted as ?

>>> import pandas as pd
>>> import category_encoders as ce
>>> df = pd.DataFrame({'col1':[None, 'a', 'b']})
>>> print df
   col1
0  None
1     a
2     b

>>> enc = ce.OrdinalEncoder(verbose=1)
>>> enc.fit(df)
>>> print enc.category_mapping
[{'col': 'col1', 'mapping': [(None, 0), ('a', 1), ('b', 2)]}]

>>> print enc.transform(df)
   col1
0    -1
1     1
2     2

As above, 'None' is converted to -1, but in the mapping 'None' is mapped to 0. What should 'None' be converted to, and why is it not converted to 0?
And I found this

(line 251 in ordinal.py)
X.loc[X[switch.get('col')] == category[0], str(switch.get('col')) + '_tmp'] = str(category[1])   

For example:
>>> print None == None
True
>>> X = pd.DataFrame({'col1':[None, 'a', 'b']})
   col1
0  None
1     a
2     b
>>> print X['col1'] == None
0    False
1    False
2    False

Is there any problem?

For 'None', there is a difference between numpy and pandas: pandas always treats 'None' as NaN, but numpy treats 'None' as NoneType. Is it necessary to modify the line as follows?

X.loc[X[switch.get('col')].values == category[0], str(switch.get('col')) + '_tmp'] = str(category[1])   

Adding .values turns the Series into an array!

convert_objects is deprecated

category_encoders\utils.py:30: FutureWarning: convert_objects is deprecated. Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
  X = X.convert_objects(convert_numeric=True)

I'm using the following:

pandas: 0.18.0
scikit: 0.18.1

reshape is deprecated

category_encoders/ordinal.py:167: FutureWarning: reshape is deprecated and will raise in a subsequent release. Please use .values.reshape(...) instead
  X[switch.get('col')] = X[switch.get('col')].astype(int).reshape(-1, )

I am using the following versions and getting that warning:

pandas (0.19.2)
scikit-learn (0.18)
numpy (1.10.4)

BinaryEncoder does not satisfy distributive law

Look at the following snippet,

In [68]: data
Out[68]:
array(['apple', 'orange', 'peach', 'lemon'],
      dtype='<U6')

In [69]: encoder = ce.BinaryEncoder()

In [70]: encoder.fit(data)
Out[70]:
BinaryEncoder(cols=[0], drop_invariant=False, handle_unknown='impute',
       impute_missing=True, return_df=True, verbose=0)

In [71]: encoder.transform(data)
Out[71]:
   0_0  0_1
0    0    0
1    0    1
2    1    0
3    1    1

In [72]: encoder.transform(data[:1])
Out[72]:
Empty DataFrame
Columns: []
Index: [0]

I would argue that encoder.transform(data[:1]) should be mapped to [0, 0]; otherwise the distributive law does not hold, e.g.,

data == data[:1] + data[1:]
encoder.transform(data) =!= encoder.transform(data[:1]) + encoder.transform(data[1:])
