bcg-x-official / sklearndf

DataFrame support for scikit-learn.

Home Page: https://bcg-x-official.github.io/sklearndf/

License: Apache License 2.0

Python 80.35% Jupyter Notebook 19.63% Shell 0.02%

python data-science machine-learning model-selection hyper-parameter-tuning cross-validation pandas-dataframe feature-traceability

sklearndf's Issues

[ColumnTransformerDF] Allow "passthrough" option for remainder

Is your feature request related to a problem? Please describe.
Hi,
I would like to use the "passthrough" option for the remainder argument of ColumnTransformerDF as in the original version of ColumnTransformer of scikit-learn. For now, only "drop" is accepted.

Describe the solution you'd like
Right now, the only possible value for the remainder argument is "drop". This means that if we apply a ColumnTransformerDF to only some columns of a DataFrame df, the unused columns are dropped. The solution I am looking for is the "passthrough" option, which preserves the columns not assigned to any transformer in ColumnTransformerDF and leaves them unmodified in the resulting DataFrame.
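To make the requested behaviour concrete, here is a minimal sketch using the native scikit-learn ColumnTransformer (not ColumnTransformerDF, which rejects "passthrough" in the version discussed here). The toy DataFrame and its column names are made up for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Toy data (assumption): only column "A" is assigned to a transformer,
# so column "B" is the remainder.
df = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [10.0, 20.0, 30.0]})

dropped = ColumnTransformer(
    [("scale", StandardScaler(), ["A"])], remainder="drop"
).fit_transform(df)

kept = ColumnTransformer(
    [("scale", StandardScaler(), ["A"])], remainder="passthrough"
).fit_transform(df)

print(dropped.shape)  # a single output column: scaled "A"
print(kept.shape)     # two output columns: scaled "A", then untouched "B"
```

With the native class, the passed-through column is appended after the transformed ones; the request is for the DF wrapper to keep that column and its name in the same way.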

Describe alternatives you've considered
I tried to implement a custom version of ColumnTransformerWrapperDF with the "passthrough" option. It is working so far but I did not test it extensively and I do not know all your requirements and constraints.

Original Code

class ColumnTransformerWrapperDF(
    TransformerWrapperDF[ColumnTransformer], metaclass=ABCMeta
):
    """
    DF wrapper for :class:`sklearn.compose.ColumnTransformer`.
    Requires all transformers passed as the ``transformers`` parameter to implement
    :class:`.TransformerDF`.
    """

    __DROP = "drop"
    __PASSTHROUGH = "passthrough"

    __SPECIAL_TRANSFORMERS = (__DROP, __PASSTHROUGH)

    def _validate_delegate_estimator(self) -> None:
        column_transformer: ColumnTransformer = self.native_estimator

        if column_transformer.remainder != ColumnTransformerWrapperDF.__DROP:
            raise ValueError(
                f"unsupported value for arg remainder: ({column_transformer.remainder})"
            )

        non_compliant_transformers: List[str] = [
            type(transformer).__name__
            for _, transformer, _ in column_transformer.transformers
            if not (
                isinstance(transformer, TransformerDF)
                or transformer in ColumnTransformerWrapperDF.__SPECIAL_TRANSFORMERS
            )
        ]
        if non_compliant_transformers:
            from .. import ColumnTransformerDF

            raise ValueError(
                f"{ColumnTransformerDF.__name__} only accepts instances of "
                f"{TransformerDF.__name__} or special values "
                f'"{" and ".join(ColumnTransformerWrapperDF.__SPECIAL_TRANSFORMERS)}" '
                "as valid transformers, but "
                f'also got: {", ".join(non_compliant_transformers)}'
            )

    def _get_features_original(self) -> pd.Series:
        """
        Return the series mapping output column names to original columns names.
        :return: the series with index the column names of the output dataframe and
        values the corresponding input column names.
        """

        return reduce(
            lambda x, y: x.append(y),
            (
                (
                    pd.Series(index=columns, data=columns)
                    if df_transformer == ColumnTransformerWrapperDF.__PASSTHROUGH
                    else df_transformer.feature_names_original_
                )
                for _, df_transformer, columns in self.native_estimator.transformers_
                if (
                    len(columns) > 0
                    and df_transformer != ColumnTransformerWrapperDF.__DROP
                )
            ),
        )

My Custom Version

class ColumnTransformerCustomWrapperDF(
    TransformerWrapperDF[ColumnTransformer], metaclass=ABCMeta
):
    """
    DF wrapper for :class:`sklearn.compose.ColumnTransformer`.
    Requires all transformers passed as the ``transformers`` parameter to implement
    :class:`.TransformerDF`.
    """

    __DROP = "drop"
    __PASSTHROUGH = "passthrough"

    __SPECIAL_TRANSFORMERS = (__DROP, __PASSTHROUGH)

    def _validate_delegate_estimator(self) -> None:
        column_transformer: ColumnTransformer = self.native_estimator

        if (
            column_transformer.remainder
            not in ColumnTransformerCustomWrapperDF.__SPECIAL_TRANSFORMERS
        ):
            raise ValueError(
                f"unsupported value for arg remainder: ({column_transformer.remainder})"
            )

        non_compliant_transformers: List[str] = [
            type(transformer).__name__
            for _, transformer, _ in column_transformer.transformers
            if not (
                isinstance(transformer, TransformerDF)
                or transformer
                in ColumnTransformerCustomWrapperDF.__SPECIAL_TRANSFORMERS
            )
        ]
        if non_compliant_transformers:
            authorised_transformers = " and ".join(
                ColumnTransformerCustomWrapperDF.__SPECIAL_TRANSFORMERS
            )
            raise ValueError(
                f"{ColumnTransformerDF.__name__} only accepts instances of "
                f"{TransformerDF.__name__} or special values "
                f'"{authorised_transformers}" as valid transformers, but '
                f'also got: {", ".join(non_compliant_transformers)}'
            )

    def _get_features_original(self) -> pd.Series:
        """
        Return the series mapping output column names to original columns names.
        :return: the series with index the column names of the output dataframe and
        values the corresponding input column names.
        """
        result = reduce(
            lambda x, y: x.append(y),
            (
                (
                    pd.Series(
                        index=self.feature_names_in_[columns],
                        data=self.feature_names_in_[columns],
                    )
                    if df_transformer == ColumnTransformerCustomWrapperDF.__PASSTHROUGH
                    else df_transformer.feature_names_original_
                )
                for _, df_transformer, columns in self.native_estimator.transformers_
                if (
                    len(columns) > 0
                    and df_transformer != ColumnTransformerCustomWrapperDF.__DROP
                )
            ),
        )
        return result

# Instantiation of ColumnCustomTransformerDF
ColumnCustomTransformerDF = make_df_transformer(
    ColumnTransformer,
    name="ColumnTransformerCustomDF",
    base_wrapper=ColumnTransformerCustomWrapperDF,
)

Additional context

OS: ubuntu 20.04 LTS
Python: 3.8.10
sklearndf==1.2.1

Thank you in advance for your help,

[BUG] - Inverse Transform does not work for StandardScaler

Describe the bug
Hi,
First of all, well done for the package. I have been using sklearndf for a couple of months and it is very handy!
However, I noticed an issue with the transformers: I cannot apply inverse_transform.

To Reproduce
For example, for StandardScalerDF, I can fit it, transform my DataFrame without issues. However, if I try to inverse the transformation, it fails.

import pandas as pd
import numpy as np
from sklearndf.transformation import StandardScalerDF

# Initialise a random DataFrame
df = pd.DataFrame(np.random.randint(1, 100, size=(10, 2)), columns=["A", "B"])
print(df)

# Instantiation and Fitting of a Standard Scaler
scaler = StandardScalerDF()
scaler.fit(df)
df = scaler.transform(df)
print(df)

# Inverse the Scaling
df = scaler.inverse_transform(df) ## ERROR --> NotFittedError: StandardScalerDF is not fitted

Did I miss anything?

Expected behavior
Expect the inverse transform to be performed.

First Idea where to look for
I notice the usage of reset_fit() in the method inverse_transform of TransformerWrapperDF. Is it really needed? It is this command which generates the bug as it re-initializes the attribute self._features_in to None.
By commenting it, it works.
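The suspected mechanism can be sketched with a toy class. This is a hypothetical stand-in, not sklearndf's actual TransformerWrapperDF: the class name, the reset_on_inverse flag and the simplified list-based API are all assumptions made for illustration. It only shows how clearing the fitted state at the start of inverse_transform would produce a "not fitted" error.

```python
from typing import List, Optional


class NotFittedError(Exception):
    """Stand-in for sklearn.exceptions.NotFittedError."""


class ToyScalerWrapper:
    """Hypothetical wrapper that tracks its fit state via ``_features_in``."""

    def __init__(self, reset_on_inverse: bool) -> None:
        self.reset_on_inverse = reset_on_inverse
        self._features_in: Optional[List[str]] = None
        self._scale = 1.0

    def _reset_fit(self) -> None:
        # Forget everything learned during fit.
        self._features_in = None

    def _ensure_fitted(self) -> None:
        if self._features_in is None:
            raise NotFittedError("ToyScalerWrapper is not fitted")

    def fit(self, columns: List[str], scale: float) -> "ToyScalerWrapper":
        self._features_in = list(columns)
        self._scale = scale
        return self

    def transform(self, values: List[float]) -> List[float]:
        self._ensure_fitted()
        return [v / self._scale for v in values]

    def inverse_transform(self, values: List[float]) -> List[float]:
        if self.reset_on_inverse:
            self._reset_fit()  # the suspected culprit
        self._ensure_fitted()
        return [v * self._scale for v in values]


# Without the reset, the round trip works.
ok = ToyScalerWrapper(reset_on_inverse=False).fit(["A"], scale=2.0)
round_trip = ok.inverse_transform(ok.transform([4.0]))

# With the reset, the fitted state is gone before it is checked.
buggy = ToyScalerWrapper(reset_on_inverse=True).fit(["A"], scale=2.0)
try:
    buggy.inverse_transform([2.0])
    raised = False
except NotFittedError:
    raised = True

print(round_trip, raised)
```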

Desktop :

OS: ubuntu 20.04 LTS
Python: 3.8.10
sklearndf==1.2.1

Thank you in advance for your help,

OneHotEncoderDF: OneHot encoder wrapper fails for columns reduction options

Describe the bug
The wrapper for the one-hot encoder fails when a column-reduction option is used (drop="if_binary" or drop="first").

The wrapper automatically computes the expected number of columns in the transformed dataset without taking the drop option into account.

To Reproduce
Steps to reproduce the behavior:

open a notebook
Run the following code

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearndf.pipeline import PipelineDF
from sklearndf.transformation import (
    ColumnTransformerDF,
    OneHotEncoderDF,
    SimpleImputerDF,
)
X_churn : pd.DataFrame = ...
y_churn : pd.Series = ...

# For categorical features we will use the mode as the imputation value and also one-hot encode
preprocessing_categorical = PipelineDF(
    steps=[
        ("imputer", SimpleImputerDF(strategy="most_frequent", fill_value="<na>")),
        ("one-hot", OneHotEncoderDF(sparse=False, drop="if_binary")),
    ]
)

# For numeric features we will impute using the median
preprocessing_numerical = SimpleImputerDF(strategy="median")

# Put the pipeline together
preprocessing_features = ColumnTransformerDF(
    transformers=[
        (
            "categorical",
            preprocessing_categorical,
            make_column_selector(dtype_include=object),
        ),
        (
            "numerical",
            preprocessing_numerical,
            make_column_selector(dtype_include=np.number),
        ),
    ]
)

# Run the preprocessing
transformed_features = preprocessing_features.fit_transform(X=X_churn, y=y_churn)
transformed_features.head()

See error

Expected behavior
Expected to see the transformed dataset with only one column for categorical columns that have only 2 unique values
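As a reference point for the expected column count, here is a small sketch with the native scikit-learn OneHotEncoder (not the DF wrapper). The two-column toy frame is an assumption made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data (assumption): one feature with 2 categories, one with 3.
X = pd.DataFrame(
    {
        "binary": ["yes", "no", "yes", "no"],
        "ternary": ["a", "b", "c", "a"],
    }
)

# The default output is a sparse matrix; convert it to inspect the shape.
full = OneHotEncoder().fit_transform(X).toarray()
reduced = OneHotEncoder(drop="if_binary").fit_transform(X).toarray()

print(full.shape[1])     # 2 + 3 columns: one per category
print(reduced.shape[1])  # 1 + 3 columns: the binary feature keeps one column
```

A wrapper that derives its output column names from the full category list would expect one more column than the encoder actually produces here, which is consistent with the failure described above.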

Version: sklearndf==1.0.1

Set up new test table for unit tests

Since the test table was deleted, replace the CSV file referenced in the test config and update the tests.

Support for scikit-learn>=0.24

The current requirement is scikit-learn 0.23.x. scikit-learn 0.24.* has been out since December 2020 and is becoming the default version in most environments. This incompatibility prevents us from using sklearndf.

Cannot use ColumnTransformerDF inside of StackingRegressorDF

Summary:

Using StackingRegressorDF on pipelines containing a ColumnTransformerDF raises an error on .fit.

Using a StackingRegressorDF as the last part of a PipelineDF works as expected. But creating multiple PipelineDF objects with ColumnTransformerDF and then stacking these fails with the following error:

TypeError: StackingRegressorDF.fit: ColumnTransformerDF.fit_transform: arg y must be None, or a pandas Series or DataFrame

Root cause

Most likely the reason is this line in StackingRegressor.fit:

y = column_or_1d(y, warn=True)
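A quick check of what this call does to a pandas target, as a sketch of the suspected root cause (the toy Series is an assumption): column_or_1d returns a NumPy array, so estimators fitted afterwards no longer receive a Series.

```python
import numpy as np
import pandas as pd
from sklearn.utils import column_or_1d

# Toy target (assumption), passed the same way StackingRegressor.fit does.
y = pd.Series([0.1, 0.2, 0.3], name="y")
y_converted = column_or_1d(y, warn=True)

print(type(y).__name__, "->", type(y_converted).__name__)
```

If the inner PipelineDF objects are then fitted with this converted target, a check that requires y to be None, a Series or a DataFrame would reject it, which matches the TypeError above.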

Reproducible example:

from sklearndf.pipeline import PipelineDF
from sklearndf.regression import LinearRegressionDF, ElasticNetDF
from sklearndf.transformation import ColumnTransformerDF, StandardScalerDF
from sklearndf.regression import StackingRegressorDF

import pandas as pd
import numpy as np

# toy data set
np.random.seed(1)
data = pd.DataFrame({
    'x1': np.random.uniform(size=(10,)),
    'x2': np.random.uniform(size=(10,)),
    'y': np.random.uniform(size=(10,)),
})

# basic building blocks
model1 = LinearRegressionDF()
model2 = ElasticNetDF()
preprocessing = ColumnTransformerDF([
    ('x1', StandardScalerDF(), ['x1']),
    ('x2', 'passthrough', ['x2']),
])

# Pipeline with stack works
pipeline = PipelineDF([
    ('preprocessing', preprocessing),
    ('stack', StackingRegressorDF([
        ('model1', model1),
        ('model2', model2),
    ]))
])
pipeline.fit(data, data['y'])
print(pipeline.predict(data))

# Stack of Pipelines doesn't
stack_of_pipelines = StackingRegressorDF([
    ('pipeline1', PipelineDF([
        ('preprocessing', preprocessing),
        ('model1', model1)
    ])),
    ('pipeline2', PipelineDF([
        ('preprocessing', preprocessing),
        ('model2', model2)
    ]))
])
stack_of_pipelines.fit(data, data['y'])

Add Maximum Relevance Minimum Redundancy as feature selection algorithm

Is your feature request related to a problem? Please describe.
Feature is not directly related to a problem, but is rather an enhancement of existing functionality. As suggested by Julian King on the facet Slack channel, we could add Maximum Relevance Minimum Redundancy (MRMR) as a feature selection algorithm.

The algorithm is explained in the following papers:
https://arxiv.org/pdf/1908.05376.pdf
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1423-9

Describe the solution you'd like
Implement MrmrDF in a similar fashion to BorutaDF such that it can be passed into the sklearndf pipeline.
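For discussion, here is a rough sketch of the greedy selection loop behind MRMR. It is a simplification under stated assumptions: absolute Pearson correlation is used for both relevance and redundancy, and the two are combined as a difference. The function name mrmr_select and the toy data are hypothetical, and this is not a proposal for the final MrmrDF API.

```python
from typing import List

import numpy as np


def mrmr_select(X: np.ndarray, y: np.ndarray, n_features: int) -> List[int]:
    """Greedily pick columns with high relevance to y and low redundancy."""
    n_columns = X.shape[1]
    # Relevance: |corr(feature, target)| for each column.
    relevance = np.array(
        [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_columns)]
    )
    # Redundancy: |corr(feature, feature)| for each pair of columns.
    redundancy = np.abs(np.corrcoef(X, rowvar=False))

    # Start with the single most relevant feature.
    selected = [int(np.argmax(relevance))]
    while len(selected) < min(n_features, n_columns):
        candidates = [j for j in range(n_columns) if j not in selected]
        # Score = relevance minus mean redundancy with features chosen so far.
        scores = [
            relevance[j] - redundancy[j, selected].mean() for j in candidates
        ]
        selected.append(candidates[int(np.argmax(scores))])
    return selected


# Toy data (assumption): column 1 is a near-duplicate of column 0,
# column 2 carries an independent but weaker signal.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
X = np.column_stack([a, a + 0.01 * rng.normal(size=200), b])
y = a + 0.5 * b

selected = mrmr_select(X, y, n_features=2)
print(selected)
```

In this toy setup the second pick should be the independent column rather than the near-duplicate, which is the behaviour that distinguishes MRMR from ranking by relevance alone.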

Describe alternatives you've considered
The paper also suggests using a redundancy matrix to shed some light on the feature selection. While this is open for discussion, I would not use this output, to avoid confusion with the SHAP value redundancy calculated as part of the feature selection.

numpy RandomState is now legacy

Is your feature request related to a problem? Please describe.
The RandomState used from numpy is now legacy (https://numpy.org/doc/1.19/reference/random/legacy.html) and has been indirectly replaced by Generator (https://numpy.org/doc/stable/reference/random/generator.html).
This may also impact scikit-learn in the future, and a PR has been opened on the topic (scikit-learn/scikit-learn#16988).

Describe the solution you'd like
Given the dependency on scikit-learn and the fact that RandomState is now legacy, we should be proactive in making the necessary updates in case RandomState is retired and/or scikit-learn replaces it.
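For reference, a short sketch contrasting the two NumPy APIs. The seed and the draws are arbitrary; nothing here is specific to sklearndf or scikit-learn.

```python
import numpy as np

# Legacy API: a RandomState instance.
legacy = np.random.RandomState(42)
legacy_draw = legacy.uniform(size=3)

# Current API: a Generator created through default_rng.
modern = np.random.default_rng(42)
modern_draw = modern.uniform(size=3)

# Each API is reproducible for a given seed, but the two use different
# bit generators, so the same seed does not yield the same numbers.
print(type(legacy).__name__, legacy_draw)
print(type(modern).__name__, modern_draw)
```

Any migration would therefore change the random streams of existing code even when seeds are kept, which is worth keeping in mind for reproducibility of tests.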

Native XGBoost support

Is your feature request related to a problem? Please describe.
XGBoost is currently not natively supported by sklearndf.

Describe the solution you'd like
Currently, users can work around this individually by using the make_df_regressor wrapper:

from xgboost import XGBRegressor
from sklearndf.wrapper import make_df_regressor
XGBRegressorDF = make_df_regressor(XGBRegressor)

It would be desirable to move this directly into sklearndf.regression and sklearndf.classification for the XGBoost regressor and classifier, respectively.

To avoid additional dependencies, we should make an assertion that xgboost must be installed if the XGBRegressorDF/XGBClassifierDF is used.
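One way such a check could look, as a sketch only: the helper name require_optional_package and the error message are hypothetical, and the standard-library importlib lookup is just one possible approach, not necessarily how sklearndf would implement it.

```python
import importlib.util


def require_optional_package(package: str, needed_for: str) -> None:
    """Raise a clear error if an optional top-level package is missing."""
    # find_spec returns None when the package cannot be found.
    if importlib.util.find_spec(package) is None:
        raise ImportError(
            f"{needed_for} requires the optional package '{package}'; "
            f"please install it first"
        )


# Passes silently for a package that is available (here a stdlib module).
require_optional_package("json", needed_for="demo")

# Raises for a package that is not installed.
try:
    require_optional_package("surely_not_installed_pkg", needed_for="XGBRegressorDF")
    missing_detected = False
except ImportError as error:
    missing_detected = True
    print(error)
```

A call like this could guard the creation of XGBRegressorDF/XGBClassifierDF so that xgboost stays an optional dependency and users without it only see the error when they request those classes.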

Describe alternatives you've considered
n/a

Additional context
n/a