scikit-learn-contrib / category_encoders Goto Github PK

A library of sklearn compatible categorical variable encoders

Home Page: http://contrib.scikit-learn.org/category_encoders/

License: BSD 3-Clause "New" or "Revised" License

Python 99.23% TeX 0.77%

category_encoders's Issues

Issue at install with statsmodels v 0.9.0

I encountered an issue with categorical-encoders and a new release of statsmodels. I've created a package, and when calling pip install . on the package to install it in a Docker container, I get the following:

  Downloading https://files.pythonhosted.org/packages/67/68/eb3ec6ab61f97216c257edddb853cc174cd76ea44b365cf4adaedcd44482/statsmodels-0.9.0.tar.gz (12.7MB)
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-e8na0ezb/statsmodels/setup.py", line 347, in <module>
        from numpy.distutils.misc_util import get_info
    ImportError: No module named 'numpy'

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-e8na0ezb/statsmodels/
The command '/bin/sh -c pip install .' returned a non-zero code: 1

This worked prior to the release of statsmodels 0.9.0, which was released at the end of April 2018.
Specifying statsmodels==0.8.0 in the setup.py install_requires list eliminated the issue.

Documentation not clear about the "mapping" parameter of OrdinalEncoder

Hi,

Please look at this stack-overflow question for reference. The mapping parameter usage by the user in this question is wrong.

But I am not able to find the documentation of perfect usage of mapping param in OrdinalEncoder. The documentation only says this:

"a mapping of class to label to use for the encoding, optional."

Implementing default examples and the example in this issue #52, and then looking at the output of the following code:

encoder.category_mapping
# Output: [{'col': 'col1', 'mapping': [(None, 0), ('a', 1), ('b', 2)]}]

makes it clear.

I think the documentation should be clear on format that the 'mapping' should be a list of dicts and dicts should contain the keys 'col' and 'mapping' and that the mapping should be a list of tuples of format (original_label, encoded_label), with an example preferably.

Create inverse_transform functionality

Could the inverse_transform functionality that exists in sklearn encoders be introduced?
This would allow a user to obtain the original levels prior to encoding.

See the examples section in:
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html

category value in column name?

E.g.

>>> df = pd.DataFrame({'col': ['a', 'b']})
>>> encoder = ce.OneHotEncoder()
>>> encoder.fit_transform(df)
   col_0  col_1  col_-1
0      1      0       0
1      0      1       0

Ignoring last column, I'd like to see

   col_a  col_b
0      1      0
1      0      1

So that e.g. if I find col_a is a strong feature in a model, I can interpret what it means easily (as opposed to col_0). I know I can hack this myself, but I'd have thought it'd be a common use case, and natively supported. Unless I'm missing something ...

Continuous distribution based on probablities

I found interesting approach in paper "The Synthetic Data Vault: Generative Modeling for Relational Databases". It seems like there are no implementations in popular libs.

Steps:

Sort the categories from most frequently occurring to least.
Split the interval [0, 1] into sections based on the cumulative probability for each category.
To convert a category, find the interval [𝑎, 𝑏] ∈ [0, 1] that corresponds to the category.
Chose value between 𝑎 and 𝑏 by sampling from a truncated Gaussian distribution with 𝜇 at the center of the interval, and 𝜎 = (𝑏−𝑎) / 6.

Visualisation.

Does it seems reasonable to implement it? I'm ready to contribute this part.

testing new coding methods

Hi - is there a standard metric that's been established to evaluate the performance of the various categorical encoding methods? In my own work I use a generalization of 1-hot encoding I call k-hot encoding, where k denotes the number of activate bits. I'd be keen to evaluate it on a standard metric if one exists to see if it's any good for more general tasks.

Make use of multiprocessing module?

Hello,

Thanks for your work on this. Is there any way we can make use of the multiprocessing module here? I have this snippet,

# Specify the columns to encode then fit and transform
encoder = ce.binary.BinaryEncoder(cols=[.......],return_df=True)
encoder.fit(inputdatav1, verbose=1)

inputdatav2 = encoder.transform(inputdatav1)

I am trying to encode 9 categorical columns, I know 7 of them has less than 100 levels in it, however two columns are big, one has 40702 and another one has 48112 unique values respectively. My encoding is running for past 7 hours, is there a way to speed up the process?

References hyperlink is deprecated

http://contrib.scikit-learn.org/categorical-encoding/backward_difference.html
https://github.com/scikit-learn-contrib/categorical-encoding/blob/master/category_encoders/backward_difference.py

[1] Contrast Coding Systems for categorical variables. UCLA: Statistical Consulting Group. from
http://www.ats.ucla.edu/stat/r/library/contrast_coding.

This hyperlink is deprecated and moved to:

https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/

conda install currently breaks when using python 3.6

UnsatisfiableError: The following specifications were found to be in conflict:

category_encoders -> python 2.7*
python 3.6*
Use "conda info " to see the dependencies for each package.

one hot encoder handle unknown

Sklearn offers the "handle unknown flag" http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html with either rise error or ignore.

Do you have any plans to support this? / add a 3rd Option of creating a new category call unknown.

Add citations to readme

To comply with scikit-learn contrib style, add citations for all encoder types to the readme. An example is here: https://github.com/scikit-learn-contrib/imbalanced-learn

What should 'None' be converted as ?

>>> import pandas as pd
>>> import category_encoders as ce
>>> df = pd.DataFrame({'col1':[None, 'a', 'b']})
>>> print df
   col1
0  None
1     a
2     b

>>> enc = ce.OrdinalEncoder(verbose=1)
>>> enc.fit(df)
>>> print enc.category_mapping
[{'col': 'col1', 'mapping': [(None, 0), ('a', 1), ('b', 2)]}]

>>> print enc.transform(df)
   col1
0    -1
1     1
2     2

As above, 'None' is converted as -1, but in the mapping , 'None' is mapped as 0. What should 'None' be converted as? why not converted as 0?
And I find

(line 251 in ordinal.py)
X.loc[X[switch.get('col')] == category[0], str(switch.get('col')) + '_tmp'] = str(category[1])   

for example :
>>> print None == None
True
>>> X = pd.DataFrame({'col1':[None, 'a', 'b']})
   col1
0  None
1     a
2     b
>>> print X['col1'] == None
0    False
1    False
2    False

Is there any problem?

For 'None' , there is difference between numpy and pandas , pandas always treat 'None' as nan, but numpy treat 'None' as NoneType. Is it necessay to modify as follows ?

X.loc[X[switch.get('col')].values == category[0], str(switch.get('col')) + '_tmp'] = str(category[1])   

Add .values turn Series to array!

NaN and None are handled inconsistently

If you pass NoneType in, it will create a new category, which is never used, and then encode all instances of NoneType as unknown category.

See #54 and #52

Support missing value and unknown category handling in other encoders

Currently only onehot and ordinal support this, we need to implement it in the remaining encoders as well.

Support full transform for OneHotEncoder

Currently, OneHotEncoder.transform only adds columns for the categories that are passed in, not for all the categories present. This breaks sklearn pipelines when applied to a few rows because OneHotEncoder.transform returns the wrong shape. The example from this stackoverflow post shows how fit_transform returns 5 columns, but transform only returns 3 columns when it's called on 2 rows.

Ideally, transform should behave as follows in that stackoverflow example, so it can be used in sklearn pipelines properly:

df_test = pd.DataFrame({'cat1':['A','N',],'cat2':['T','B']})
df_test_trans = enc_ohe.transform(df_test)

print(df_test_trans)

   cat1_0  cat1_1  cat1_2  cat1_3 cat2
0       0       1       0       0    T
1       0       0       0       1    B

Encode as integers or strings?

https://github.com/scikit-learn-contrib/categorical-encoding/blob/d165a4957a7f4a8ca0287d8c01d44be6cf21aa3c/category_encoders/binary.py#L228

I was reading the code to understand how the algorithm works and I noticed you the col array contains arrays and the list of zeros you prepend to it contains integers. I think the data type used should be consistent, either use integers or strings.

handle_unknown

Hello!
I have faced smth that looks like a bug for me.

traceback.txt
Same behavior for other encoders that have this check:
https://github.com/scikit-learn-contrib/categorical-encoding/blob/master/category_encoders/ordinal.py#L260
I am not sure, but I haven't managed to find where key 'D' is set in package, so may be this check is bogus

Adding variable names to enconding

Is it possibly, for instance, with one-hot encoding to add to the variable name, the variable value? e.g.: with the variable gender having the following values: unknown, male, female

encoded as three variables with names:
gender_unknown gender_male gender_female

instead of:
gender_0 gender_1 gender_2

Request to add support for pandas categorical types when using OrdinalEncoder

Hi All,

I ran into an issue using OrdinalEncoder where I assumed that it would use the mapping in the categorical types in the passed pandas dataframe. Instead, it created a mapping based on the order of the data passed to fit_transform. I realize that this is user error, but it would be nice for OrdinalEncoder to use the existing mapping information from the categorical dtype. If the maintainers think this would be a nice addition, I'd be happy to make a PR.

Here's a demonstration of the issue.

First I'll create a categorical dtype and a dataframe that uses it.

from category_encoders.ordinal import OrdinalEncoder
import pandas as pd
from pandas.api.types import CategoricalDtype

platforms = ['android', 'ios', 'amazon']
platform_category = CategoricalDtype(categories=platforms, ordered=False)

df = pd.DataFrame([
    {'id': 1, 'platform': 'android'},
    {'id': 2, 'platform': 'ios'},
    {'id': 3, 'platform': 'amazon'},
])
df['platform'] = df['platform'].astype(platform_category)
print(df)

   id platform
0   1  android
1   2      ios
2   3   amazon

The encoding from the categorical dtype looks like this:

[(cat, code) for code, cat in enumerate(df['platform'].cat.categories)]

[('android', 0), ('ios', 1), ('amazon', 2)]

Now I'll make an encoder without a mapping parameter, then transform with the data sorted.

cat_encoder_a = OrdinalEncoder(cols=categorical_columns)
df_a = cat_encoder_a.fit_transform(df.sort_values(by='platform', ascending=True))
print(df_a)

   id  platform
0   1         0
1   2         1
2   3         2

The category mapping from the encoder happens to match the categorical dtype mapping because of the sort order of the dataframe.

cat_encoder_a.category_mapping

[{'col': 'platform', 'mapping': [('android', 0), ('ios', 1), ('amazon', 2)]}]

But if I reverse the order of the data passed to fit_transform I will get a different mapping.

cat_encoder_b = OrdinalEncoder(cols=categorical_columns)
df_b = cat_encoder_b.fit_transform(df.sort_values(by='platform', ascending=False))
cat_encoder_b.category_mapping

[{'col': 'platform', 'mapping': [('amazon', 0), ('ios', 1), ('android', 2)]}]

I can get a stable mapping from the categorical types in the dataframe itself (instead of relying on the order of the data).

category_mapping = [
    {'col': column_name, 'mapping': [(cat, code) for code, cat in enumerate(df[column_name].cat.categories)]} 
    for column_name in df.select_dtypes(['category']).columns
]
category_mapping

[{'col': 'platform', 'mapping': [('android', 0), ('ios', 1), ('amazon', 2)]}]

cat_encoder_c = OrdinalEncoder(cols=categorical_columns, mapping=category_mapping)
df_c = cat_encoder_c.fit_transform(df).sort_values(by='platform', ascending=False)
cat_encoder_c.category_mapping

[{'col': 'platform', 'mapping': [('android', 0), ('ios', 1), ('amazon', 2)]}]

So while it is not hard to pass a custom mapping, it would be nice for OrdinalEncoder to handle this automatically if no mapping is passed and if any of the columns are pandas categoricals. I think this logic could be handled around here. https://github.com/scikit-learn-contrib/categorical-encoding/blob/1.2.6/category_encoders/ordinal.py#L270

If others think this would be a positive addition, I will make a PR.

thanks,
Dennis

EDIT: I modified the above code to fix a bug in getting the mapping from the pandas categorical dtype. The previous code I had happened to work only because the data was in order, but in general that does not work. The correct way is

[(cat, code) for code, cat in enumerate(df[column_name].cat.categories)]

Optionally inspect input data to infer cols

In many cases the only things being encoded are non-numeric columns (of course sometimes integer columns are categorical too, but that's a separate case). If would be nice to be able to pass convert_strings=True instead of columns, and then in the fit() model iterate through the input dataframe and set cols to be all those with the pandas type: object.

re-using pickled/saved encoders

I've trained a model, for which I use the one-hot-encoder, and let's say a categorical variable, e.g. device_type, had 20 possible values, the trained model is now expecting to see, among other columns, 20 columns related with the values of device_type, i.e.:

device_type_0, device_type_1, ..., device_type_20

I also saved/pickled this trained model as well as the one-hot-encoder.

I now have new data coming, and the device_type, for this new data, only has 10 possible values. I was assuming that by applying the pickled/saved one-hot-encoder which I used before, I would see the same 20 columns related with the values of device_type, i.e.:

device_type_0, device_type_1, ..., device_type_20

But I only see 10 device_type_X columns, i.e., only for the values presente in the new data.

Is there something I'm missing when I instantiate the encoder, or a parameter I'm missing when applying the fit, or simply this functionality is not implemented ?

Switch to numpy style docstrings

In order to be more compliant with the scikit-learn styling, we need to switch to numpy style docstrings (examples here: https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt)

Test cases failing

Hi Will,

Just wanted to bring your attention to test cases failing for the encoders utilizing patsy dmatrix. This was basically due to the fact that the test data frame had a column 'C' which is also the patsy way of denoting that you want a categorical variable. By renaming the column to something like C1 all the test passed.

Kind Regards,
Rakesh

========================================================================================= test session starts =========================================================================================
platform darwin -- Python 2.7.10, pytest-2.9.1, py-1.4.31, pluggy-0.3.1
rootdir: categorical_encoding, ini file:
collected 7 items 

tests/test_encoders.py F..F.FF

============================================================================================== FAILURES ===============================================================================================
________________________________________________________________________________ TestEncoders.test_backward_difference ________________________________________________________________________________

self = <categorical_encoding.tests.test_encoders.TestEncoders testMethod=test_backward_difference>

    def test_backward_difference(self):
        """

            :return:
            """

        cols = ['C', 'D', 'E', 'F']
        enc = encoders.BackwardDifferenceEncoder(verbose=1, cols=cols)
        X = self.create_dataset(n_rows=1000)

>       X_test = enc.fit_transform(X, None)

tests/test_encoders.py:77: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/Library/Python/2.7/site-packages/sklearn/base.py:455: in fit_transform
    return self.fit(X, **fit_params).transform(X)
category_encoders/backward_difference.py:83: in transform
    return backward_difference_coding(X, cols=self.cols)
category_encoders/backward_difference.py:32: in backward_difference_coding
    mod = dmatrix("C(%s, Diff)" % (col, ), X)
/Library/Python/2.7/site-packages/patsy/highlevel.py:291: in matrix
    NA_action, return_type)
/Library/Python/2.7/site-packages/patsy/highlevel.py:165: in _do_highlevel_design
    NA_action)
/Library/Python/2.7/site-packages/patsy/highlevel.py:70: in _try_incr_builders
    NA_action)
/Library/Python/2.7/site-packages/patsy/build.py:696: in design_matrix_builders
    NA_action)
/Library/Python/2.7/site-packages/patsy/build.py:443: in _examine_factor_types
    value = factor.eval(factor_states[factor], data)
/Library/Python/2.7/site-packages/patsy/eval.py:566: in eval
    data)
/Library/Python/2.7/site-packages/patsy/eval.py:551: in _eval
    inner_namespace=inner_namespace)
/Library/Python/2.7/site-packages/patsy/compat.py:117: in call_and_wrap_xc
    return f(*args, **kwargs)
/Library/Python/2.7/site-packages/patsy/eval.py:166: in eval
    + self._namespaces))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   TypeError: 'Series' object is not callable

<string>:1: TypeError
______________________________________________________________________________________ TestEncoders.test_helmert ______________________________________________________________________________________

self = <categorical_encoding.tests.test_encoders.TestEncoders testMethod=test_helmert>

    def test_helmert(self):
        """

            :return:
            """

        cols = ['C', 'D', 'E', 'F']
        enc = encoders.HelmertEncoder(verbose=1, cols=cols)
        X = self.create_dataset(n_rows=1000)

>       X_test = enc.fit_transform(X, None)

tests/test_encoders.py:113: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/Library/Python/2.7/site-packages/sklearn/base.py:455: in fit_transform
    return self.fit(X, **fit_params).transform(X)
category_encoders/helmert.py:82: in transform
    return helmert_coding(X, cols=self.cols)
category_encoders/helmert.py:32: in helmet_coding
    mod = dmatrix("C(%s, Helmert)" % (col, ), X)
/Library/Python/2.7/site-packages/patsy/highlevel.py:291: in matrix
    NA_action, return_type)
/Library/Python/2.7/site-packages/patsy/highlevel.py:165: in _do_highlevel_design
    NA_action)
/Library/Python/2.7/site-packages/patsy/highlevel.py:70: in _try_incr_builders
    NA_action)
/Library/Python/2.7/site-packages/patsy/build.py:696: in design_matrix_builders
    NA_action)
/Library/Python/2.7/site-packages/patsy/build.py:443: in _examine_factor_types
    value = factor.eval(factor_states[factor], data)
/Library/Python/2.7/site-packages/patsy/eval.py:566: in eval
    data)
/Library/Python/2.7/site-packages/patsy/eval.py:551: in _eval
    inner_namespace=inner_namespace)
/Library/Python/2.7/site-packages/patsy/compat.py:117: in call_and_wrap_xc
    return f(*args, **kwargs)
/Library/Python/2.7/site-packages/patsy/eval.py:166: in eval
    + self._namespaces))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   TypeError: 'Series' object is not callable

<string>:1: TypeError
____________________________________________________________________________________ TestEncoders.test_polynomial _____________________________________________________________________________________

self = <categorical_encoding.tests.test_encoders.TestEncoders testMethod=test_polynomial>

    def test_polynomial(self):
        """

            :return:
            """

        cols = ['C', 'D', 'E', 'F']
        enc = encoders.PolynomialEncoder(verbose=1, cols=cols)
        X = self.create_dataset(n_rows=1000)

>       X_test = enc.fit_transform(X, None)

tests/test_encoders.py:131: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/Library/Python/2.7/site-packages/sklearn/base.py:455: in fit_transform
    return self.fit(X, **fit_params).transform(X)
category_encoders/polynomial.py:77: in transform
    return polynomial_coding(X, cols=self.cols)
category_encoders/polynomial.py:31: in polynomial_coding
    mod = dmatrix("C(%s, Poly)" % (col, ), X)
/Library/Python/2.7/site-packages/patsy/highlevel.py:291: in matrix
    NA_action, return_type)
/Library/Python/2.7/site-packages/patsy/highlevel.py:165: in _do_highlevel_design
    NA_action)
/Library/Python/2.7/site-packages/patsy/highlevel.py:70: in _try_incr_builders
    NA_action)
/Library/Python/2.7/site-packages/patsy/build.py:696: in design_matrix_builders
    NA_action)
/Library/Python/2.7/site-packages/patsy/build.py:443: in _examine_factor_types
    value = factor.eval(factor_states[factor], data)
/Library/Python/2.7/site-packages/patsy/eval.py:566: in eval
    data)
/Library/Python/2.7/site-packages/patsy/eval.py:551: in _eval
    inner_namespace=inner_namespace)
/Library/Python/2.7/site-packages/patsy/compat.py:117: in call_and_wrap_xc
    return f(*args, **kwargs)
/Library/Python/2.7/site-packages/patsy/eval.py:166: in eval
    + self._namespaces))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   TypeError: 'Series' object is not callable

<string>:1: TypeError
________________________________________________________________________________________ TestEncoders.test_sum ________________________________________________________________________________________

self = <categorical_encoding.tests.test_encoders.TestEncoders testMethod=test_sum>

    def test_sum(self):
        """

            :return:
            """

        cols = ['C', 'D', 'E', 'F']
        enc = encoders.SumEncoder(verbose=1, cols=cols)
        X = self.create_dataset(n_rows=1000)

>       X_test = enc.fit_transform(X, None)

tests/test_encoders.py:149: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/Library/Python/2.7/site-packages/sklearn/base.py:455: in fit_transform
    return self.fit(X, **fit_params).transform(X)
category_encoders/sum_coding.py:83: in transform
    return sum_coding(X, cols=self.cols)
category_encoders/sum_coding.py:32: in sum_coding
    mod = dmatrix("C(%s, Sum)" % (col, ), X)
/Library/Python/2.7/site-packages/patsy/highlevel.py:291: in matrix
    NA_action, return_type)
/Library/Python/2.7/site-packages/patsy/highlevel.py:165: in _do_highlevel_design
    NA_action)
/Library/Python/2.7/site-packages/patsy/highlevel.py:70: in _try_incr_builders
    NA_action)
/Library/Python/2.7/site-packages/patsy/build.py:696: in design_matrix_builders
    NA_action)
/Library/Python/2.7/site-packages/patsy/build.py:443: in _examine_factor_types
    value = factor.eval(factor_states[factor], data)
/Library/Python/2.7/site-packages/patsy/eval.py:566: in eval
    data)
/Library/Python/2.7/site-packages/patsy/eval.py:551: in _eval
    inner_namespace=inner_namespace)
/Library/Python/2.7/site-packages/patsy/compat.py:117: in call_and_wrap_xc
    return f(*args, **kwargs)
/Library/Python/2.7/site-packages/patsy/eval.py:166: in eval
    + self._namespaces))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   TypeError: 'Series' object is not callable

<string>:1: TypeError
================================================================================= 4 failed, 3 passed in 1.70 seconds ==================================================================================

mapping with ordinal breaks if fit is called again

this makes it impossible to use in a pipeline where some things need to be fit and ordinal does not iff it has a mapping

This is demonstrated in the test case failing novilabs-archived@ddf21f7

target encoding

should be nice this one too:

https://www.kaggle.com/ogrellier/python-target-encoding-for-categorical-features

[Question] Can i get transformed names from encoders?

When i use encoders in FeatureUnion after transform i always get numpy.ndarray regardless of what was fed to the input, pd.DataFrame or smth else. When i try get new feature names by FeatureUnion.get_feature_names() i got AttributeError: Transformer ohe (type OneHotEncoder) does not provide get_feature_names.. So how can i know transformed feature names?

Leave One Out Encoding Behavior

I've been trying the LeaveOneOutEncoder but from what I can tell it just takes the mean of the target given the current levels without leaving the current example out. Also it seems like setting randomized = True does not seem to add any noise. Am I missing something?

Doing the example in the docs:

from category_encoders import *
import pandas as pd
from sklearn.datasets import load_boston
bunch = load_boston()
y = bunch.target
X = pd.DataFrame(bunch.data, columns=bunch.feature_names)

enc = LeaveOneOutEncoder(cols=['CHAS', 'RAD'], randomized = True).fit(X, y)
numeric_dataset = enc.transform(X)

Just replaces the 'CHAS' and 'RAD' columns with the mean of the target grouped by the corresponding variable.

I'm using:
Python 2.7.13
category-encoders==1.2.4
pandas==0.20.3
numpy==1.13.1

Make the changes from #35 in BaseNEncoder

@andrethrill submitted a PR to move the determination of n_cols to fit() and not doing it on the fly in transform(), we should make the same change in BaseNEncoder.

Drop Original Feature when Encoding

Can we set an option to drop the original feature feat_0 for the numeric encoders?

See this code below:

import category_encoders
import pandas as pd
import numpy as np

dd = {'feat1':['a','b','c','d','e'],'feat2':np.random.random(size=(5))}
df = pd.DataFrame(dd)
encoder = category_encoders.BackwardDifferenceEncoder(cols=['feat1'])
encoder.fit_transform(df)

this returns feat_0 as the first feaure and every row element is the number 1. Is this a desired property of should it be removed?

Update documentation

Update the docs with the new docstrings (in separate issue) and with new introduction and citations (also in separate issues). CircleCI is failing on the push to gh-pages, so that should be fixed as well.

What's the formula for HelmertEncoder

Could someone tell me the detail processing logic for HelmertEncoder? My Code as below:
sales = [{'account': 'Jones LLC', 'Jan': 150, 'Feb': 200, 'Mar': 140},
{'account': 'Alpha Co', 'Jan': 200, 'Feb': 210, 'Mar': 215},
{'account': 'Blue Inc', 'Jan': 50, 'Feb': 90, 'Mar': 95 }]
X = pd.DataFrame(sales)
y = np.array([1, 0, 1])

enc = HelmertEncoder(cols=['account']).fit(X, y)
numeric_dataset = enc.transform(X)
print(numeric_dataset)

Actually, how does HelmertEncoder tansfer [0, 1, 2] to [[1 -1 -1],[1 1 -1],[1 0 2]]?

Add leave-one-out encoding

A popular kaggle technique for encoding categorical variables is 'leave-one-out', shown here:

http://nycdatascience.com/featured-talk-1-kaggle-data-scientist-owen-zhang/

https://www.kaggle.com/c/caterpillar-tube-pricing/forums/t/15748/strategies-to-encode-categorical-variables-with-many-categories?forumMessageId=143154

It would be cool to add that into the library to evaluate it against other methods.

reshape is deprecated

category_encoders/ordinal.py:167: FutureWarning: reshape is deprecated and will raise in a subsequent release. Please use .values.reshape(...) instead
  X[switch.get('col')] = X[switch.get('col')].astype(int).reshape(-1, )

i am using the following versions and getting that warning

pandas (0.19.2)
scikit-learn (0.18)
numpy (1.10.4)

convert_objects is deprecated

category_encoders\utils.py:30: FutureWarning: convert_objects is deprecated.  Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
 X = X.convert_objects(convert_numeric=True)`

I'm using the following:

pandas: 0.18.0
scikit: 0.18.1

Large scale benchmark

It would be great to use something like this: https://github.com/EpistasisLab/penn-ml-benchmarks
to get a comprehensive view of memory usage, time-to-transform, and end-model accuracy for all encoders.

It'd probably take a very long time to compute, but could be re-run once per release or something like that.

Binary Encoding problem

Hi, there, first time using this binary encoding method, but I think there is a bug:

When encoding a column contained 3 possible categories, it encodes as 1 columns
When encoding a column contained 5 possible categories, it encodes as 2 columns
same thing you can find in 9 possible categories, only encoded as 3 columns
Also when encoding a column contained only 2 possible categories, it returns empty DataFrame

2 categories:

>>> import pandas as pd
>>> import category_encoders as ce
>>> df1 = pd.DataFrame([[1],[2],[2],[1]], columns=['col_a'])
>>> df1
   col_a
0      1
1      2
2      2
3      1
>>> encoder = ce.BinaryEncoder(cols=['col_a'])
>>> encoder.transform(df1)
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]

3 categories:

>>> df2 = pd.DataFrame([[1],[2],[2],[3]], columns=['col_a'])
>>> encoder.transform(df2)
   col_a_0
0        0
1        1
2        1
3        1

Add weight of evidence encoding

Moving required number of digits calculation to fit() - BinaryEncoder and BaseNEncoder

In binary.py, the required number of digits is calculated as such, inside the binary() function:

# figure out how many digits we need to represent the classes present
  digits = int(np.ceil(np.log2(len(X[col].unique()))))

binary() is called inside the transform() function.

Is there a specific reason why the required number of digits are calculated during transform() instead of fit()?

The reason I am asking is because, when performing distributed computation, different partitions of the same array are sent to different machines and the calculated required number of digits would then be different on each machine depending on how many different classes have each partition.

Contributors add yourselves to the JOSS submission.

This repository is currently being reviewed for possible inclusion of the journal of open source software.

If you've committed here before and would like to be included, please add yourself to the authors list here, via PR: https://github.com/scikit-learn-contrib/categorical-encoding/blob/master/joss/paper.md

BinaryEncoder does not satisfy distributive law

Look at the following snippet,

In [68]: data
Out[68]:
array(['apple', 'orange', 'peach', 'lemon'],
      dtype='<U6')

In [69]: encoder = ce.BinaryEncoder()

In [70]: encoder.fit(data)
Out[70]:
BinaryEncoder(cols=[0], drop_invariant=False, handle_unknown='impute',
       impute_missing=True, return_df=True, verbose=0)

In [71]: encoder.transform(data)
Out[71]:
   0_0  0_1
0    0    0
1    0    1
2    1    0
3    1    1

In [72]: encoder.transform(data[:1])
Out[72]:
Empty DataFrame
Columns: []
Index: [0]

I would argue encoder.transform(data[:1]) should be mapped to [0, 0], otherwise the distributive law would not hold, e.g.,

data == data[:1] + data[1:]
encoder.transform(data) =!= encoder.transform(data[:1]) + encoder.transform(data[1:])

Bug in basen.py

>>> import category_encoders as ce
>>> df = pd.DataFrame({'col1': ['a', 'b', 'c'], 'col2': ['d', 'e', 'f']})
  col1 col2
0    a    d
1    b    e
2    c    f
>>>df_1 = pd.DataFrame({'col1': ['a', 'b', 'd'], 'col2': ['d', 'e', 'f']})
  col1 col2
0    a    d
1    b    e
2    d    f
>>> enc = ce.BaseNEncoder(verbose=1)
>>> enc.fit(df)
>>> print enc.transform(df_1)
   col1_0  col1_1  col2_0  col2_1
0     0.0     0.0       0       0
1     0.0     1.0       0       1
2     0.0     0.0       1       0

Is there a bug in basen.py?

Error when using pipelines with BinaryEncoder

I adapted this code from the examples/encoding_examples.py file to use a pipeline. cross_val_score fail with the error at the end of this post

import pandas as pd
import numpy as np
from sklearn import cross_validation, linear_model, model_selection
import category_encoders
from examples.source_data.loaders import get_mushroom_data, get_cars_data, get_splice_data
from sklearn.pipeline import make_pipeline

X, y, mapping = get_mushroom_data()
t = category_encoders.BinaryEncoder(handle_unknown = "ignore")
mypipeline = make_pipeline(t, linear_model.LogisticRegression())

cross_validation.cross_val_score(mypipeline, X, y, n_jobs=1, cv=5)

Abridged List of packages installed

numpy 1.14.0 py36h4a99626_1
pandas 0.22.0 py36h6538335_0
python 3.6.4 h6538335_1
scikit-learn 0.19.1 py36h53aea1b_0

ERROR

ValueError Traceback (most recent call last)
in ()
10 mypipeline = make_pipeline(t, linear_model.LogisticRegression())
11
---> 12 cross_validation.cross_val_score(mypipeline, X, y, n_jobs=1, cv=5)
13
14

C:\Anaconda3\lib\site-packages\sklearn\cross_validation.py in cross_val_score(estimator, X, y, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch)
1579 train, test, verbose, None,
1580 fit_params)
-> 1581 for train, test in cv)
1582 return np.array(scores)[:, 0]
1583

C:\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in call(self, iterable)
777 # was dispatched. In particular this covers the edge
778 # case of Parallel used with an exhausted iterator.
--> 779 while self.dispatch_one_batch(iterator):
780 self._iterating = True
781 else:

C:\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in dispatch_one_batch(self, iterator)
623 return False
624 else:
--> 625 self._dispatch(tasks)
626 return True
627

C:\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in _dispatch(self, batch)
586 dispatch_timestamp = time.time()
587 cb = BatchCompletionCallBack(dispatch_timestamp, len(batch), self)
--> 588 job = self._backend.apply_async(batch, callback=cb)
589 self._jobs.append(job)
590

C:\Anaconda3\lib\site-packages\sklearn\externals\joblib_parallel_backends.py in apply_async(self, func, callback)
109 def apply_async(self, func, callback=None):
110 """Schedule a func to be run"""
--> 111 result = ImmediateResult(func)
112 if callback:
113 callback(result)

C:\Anaconda3\lib\site-packages\sklearn\externals\joblib_parallel_backends.py in init(self, batch)
330 # Don't delay the application, to avoid keeping the input
331 # arguments in memory
--> 332 self.results = batch()
333
334 def get(self):

C:\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in call(self)
129
130 def call(self):
--> 131 return [func(*args, **kwargs) for func, args, kwargs in self.items]
132
133 def len(self):

C:\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in (.0)
129
130 def call(self):
--> 131 return [func(*args, **kwargs) for func, args, kwargs in self.items]
132
133 def len(self):

C:\Anaconda3\lib\site-packages\sklearn\cross_validation.py in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, error_score)
1692
1693 else:
-> 1694 test_score = _score(estimator, X_test, y_test, scorer)
1695 if return_train_score:
1696 train_score = _score(estimator, X_train, y_train, scorer)

C:\Anaconda3\lib\site-packages\sklearn\cross_validation.py in _score(estimator, X_test, y_test, scorer)
1749 score = scorer(estimator, X_test)
1750 else:
-> 1751 score = scorer(estimator, X_test, y_test)
1752 if hasattr(score, 'item'):
1753 try:

C:\Anaconda3\lib\site-packages\sklearn\metrics\scorer.py in _passthrough_scorer(estimator, *args, **kwargs)
242 def _passthrough_scorer(estimator, *args, **kwargs):
243 """Function that wraps estimator.score"""
--> 244 return estimator.score(*args, **kwargs)
245
246

C:\Anaconda3\lib\site-packages\sklearn\utils\metaestimators.py in (*args, **kwargs)
113
114 # lambda, but not partial, allows help() to work with update_wrapper
--> 115 out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
116 # update the docstring of the returned function
117 update_wrapper(out, self.fn)

C:\Anaconda3\lib\site-packages\sklearn\pipeline.py in score(self, X, y, sample_weight)
484 for name, transform in self.steps[:-1]:
485 if transform is not None:
--> 486 Xt = transform.transform(Xt)
487 score_params = {}
488 if sample_weight is not None:

C:\Anaconda3\lib\site-packages\category_encoders\binary.py in transform(self, X)
163 X = self.ordinal_encoder.transform(X)
164
--> 165 X = self.binary(X, cols=self.cols)
166
167 if self.drop_invariant:

C:\Anaconda3\lib\site-packages\category_encoders\binary.py in binary(self, X_in, cols)
248
249 # map the ordinal column into a list of these digits, of length digits
--> 250 X[col] = X[col].map(lambda x: self.col_transform(x, digits))
251
252 for dig in range(digits):

C:\Anaconda3\lib\site-packages\pandas\core\series.py in map(self, arg, na_action)
2352 else:
2353 # arg is a function
-> 2354 new_values = map_f(values, arg)
2355
2356 return self._constructor(new_values,

pandas/_libs/src/inference.pyx in pandas._libs.lib.map_infer()

C:\Anaconda3\lib\site-packages\category_encoders\binary.py in (x)
248
249 # map the ordinal column into a list of these digits, of length digits
--> 250 X[col] = X[col].map(lambda x: self.col_transform(x, digits))
251
252 for dig in range(digits):

C:\Anaconda3\lib\site-packages\category_encoders\binary.py in col_transform(col, digits)
309 else:
310
--> 311 col = list("{0:b}".format(int(col)))
312 if len(col) == digits:
313 return col

ValueError: cannot convert float NaN to integer

Add bases other than binary

The exact methodology used in binary encoding can be easily expanded out to other bases (3, 4, etc). It would be great to have a BaseXEncoder to allow for experiments with different bases.

[Question] Does categorical-encoder work with numpy?

I tried with Leave One Out encoder, I got error when try to input numpy array. Does it work with Pandas dataframe only?

How use LeaveOneOutEncoder in pipelines

I can't figure it out how to correctly use LeaveOneOutEncoder in pipelines.
Consider the following code:

from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
        def __init__(self, attribute_names,target):
            self.attribute_names = attribute_names
            self.y_=target
            
        def fit(self, X, y):
             return self
        def transform(self, X,y):
             return X[self.attribute_names]

cat_attribs = ["ocean_proximity"]

cat_pipeline = Pipeline([
        ('selector', DataFrameSelector(cat_attribs)),
        ('cat_en', LeaveOneOutEncoder(return_df=False))
    ])

cat_pipeline.fit(housing)
housing_prepared = cat_pipeline.transform(housing)
housing_prepared

I receive error:
TypeError: fit() missing 1 required positional argument: 'y'

But how to pass y?
Thanks.

Target Encoding for regression task?

Is it possible to use Target Encoding for regression task?

OneHotEncoder creates dummy column

In the following example:

>>> myDf = pd.DataFrame({'col': ['a', 'b', np.NAN]})
>>> enc = ce.OneHotEncoder(cols=['col'])
>>> enc = enc.fit(myDf)
>>> enc.transform(myDf)
  col_0  col_1  col_2  col_-1
0      1      0      0       0
1      0      1      0       0
2      0      0      0       1

I believe col_2 should not be created. I think this is a bug created due to OneHotEncoder calling internally OrdinalEncoder and both are imputing an extra category.

Happy to hear if I am wrong?

Try to use hashencoder but fail due to high memory usage

I tried to convert several categorical columns to hash-encoding features and the size of input data is 2.7 million, one column has 20K category. But the amount of memory will be up to 10G, it's a very strange situation. Why the encoder used so much memory?

Hashing encoder breaks on NaNs

Hashing encoder fails if a NaN is there, it tries to hash the None which doesn't fly.

Transformers always return pandas dataframes

All of the transformers in this library use pandas dataframes internally, but will accept either numpy arrays or pandas dataframes as inputs. They all return dataframes though. For use in pipelines it may be helpful to optionally return a numpy array (df.values) instead of a dataframe.

If anyone would like to get into contributing, this would be a good starter issue, comment here if you're interested.

scikit-learn-contrib / category_encoders Goto Github PK

category_encoders's Issues

Abridged List of packages installed

ERROR

Recommend Projects

Recommend Topics

Recommend Org