bcg-x-official / sklearndf
DataFrame support for scikit-learn.
Home Page: https://bcg-x-official.github.io/sklearndf/
License: Apache License 2.0
Is your feature request related to a problem? Please describe.
Hi,
I would like to use the "passthrough" option for the remainder argument of ColumnTransformerDF, as in the original ColumnTransformer of scikit-learn. For now, only "drop" is accepted.
Describe the solution you'd like
Right now, the only possible value for the remainder argument is "drop". This means that if we apply a ColumnTransformerDF to only some columns of a DataFrame df, the unused columns are dropped. The solution I am looking for is the "passthrough" option: columns not touched by any transformer in ColumnTransformerDF are preserved and left unmodified in the output DataFrame.
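For comparison, the requested behaviour can be seen in scikit-learn's native ColumnTransformer. The following is a minimal sketch on a made-up three-column frame (the column names a, b and c are illustrative and not taken from the issue). It only shows how the output width differs between the two remainder settings, not how sklearndf would implement it.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Illustrative data: only column "a" is handed to a transformer.
df = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0], "c": [7, 8, 9]}
)

drop_ct = ColumnTransformer(
    [("scale", StandardScaler(), ["a"])], remainder="drop"
)
pass_ct = ColumnTransformer(
    [("scale", StandardScaler(), ["a"])], remainder="passthrough"
)

dropped = drop_ct.fit_transform(df)  # columns "b" and "c" are discarded
passed = pass_ct.fit_transform(df)   # "b" and "c" are appended unchanged

print(dropped.shape, passed.shape)
```

With "passthrough", the untouched columns are appended after the transformed ones, so their order in the output can differ from the input frame.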
Describe alternatives you've considered
I tried to implement a custom version of ColumnTransformerWrapperDF with the "passthrough" option. It works so far, but I have not tested it extensively and I do not know all your requirements and constraints.
Original Code
class ColumnTransformerWrapperDF(
    TransformerWrapperDF[ColumnTransformer], metaclass=ABCMeta
):
    """
    DF wrapper for :class:`sklearn.compose.ColumnTransformer`.

    Requires all transformers passed as the ``transformers`` parameter to implement
    :class:`.TransformerDF`.
    """

    __DROP = "drop"
    __PASSTHROUGH = "passthrough"
    __SPECIAL_TRANSFORMERS = (__DROP, __PASSTHROUGH)

    def _validate_delegate_estimator(self) -> None:
        column_transformer: ColumnTransformer = self.native_estimator

        if column_transformer.remainder != ColumnTransformerWrapperDF.__DROP:
            raise ValueError(
                f"unsupported value for arg remainder: ({column_transformer.remainder})"
            )

        non_compliant_transformers: List[str] = [
            type(transformer).__name__
            for _, transformer, _ in column_transformer.transformers
            if not (
                isinstance(transformer, TransformerDF)
                or transformer in ColumnTransformerWrapperDF.__SPECIAL_TRANSFORMERS
            )
        ]
        if non_compliant_transformers:
            from .. import ColumnTransformerDF

            raise ValueError(
                f"{ColumnTransformerDF.__name__} only accepts instances of "
                f"{TransformerDF.__name__} or special values "
                f'"{" and ".join(ColumnTransformerWrapperDF.__SPECIAL_TRANSFORMERS)}" '
                "as valid transformers, but "
                f'also got: {", ".join(non_compliant_transformers)}'
            )

    def _get_features_original(self) -> pd.Series:
        """
        Return the series mapping output column names to original columns names.

        :return: the series with index the column names of the output dataframe and
            values the corresponding input column names.
        """
        return reduce(
            lambda x, y: x.append(y),
            (
                (
                    pd.Series(index=columns, data=columns)
                    if df_transformer == ColumnTransformerWrapperDF.__PASSTHROUGH
                    else df_transformer.feature_names_original_
                )
                for _, df_transformer, columns in self.native_estimator.transformers_
                if (
                    len(columns) > 0
                    and df_transformer != ColumnTransformerWrapperDF.__DROP
                )
            ),
        )
My Custom Version
class ColumnTransformerCustomWrapperDF(
    TransformerWrapperDF[ColumnTransformer], metaclass=ABCMeta
):
    """
    DF wrapper for :class:`sklearn.compose.ColumnTransformer`.

    Requires all transformers passed as the ``transformers`` parameter to implement
    :class:`.TransformerDF`.
    """

    __DROP = "drop"
    __PASSTHROUGH = "passthrough"
    __SPECIAL_TRANSFORMERS = (__DROP, __PASSTHROUGH)

    def _validate_delegate_estimator(self) -> None:
        column_transformer: ColumnTransformer = self.native_estimator

        if (
            column_transformer.remainder
            not in ColumnTransformerCustomWrapperDF.__SPECIAL_TRANSFORMERS
        ):
            raise ValueError(
                f"unsupported value for arg remainder: ({column_transformer.remainder})"
            )

        non_compliant_transformers: List[str] = [
            type(transformer).__name__
            for _, transformer, _ in column_transformer.transformers
            if not (
                isinstance(transformer, TransformerDF)
                or transformer
                in ColumnTransformerCustomWrapperDF.__SPECIAL_TRANSFORMERS
            )
        ]
        if non_compliant_transformers:
            authorised_transformers = " and ".join(
                ColumnTransformerCustomWrapperDF.__SPECIAL_TRANSFORMERS
            )
            raise ValueError(
                f"{ColumnTransformerDF.__name__} only accepts instances of "
                f"{TransformerDF.__name__} or special values "
                f'"{authorised_transformers}" as valid transformers, but '
                f'also got: {", ".join(non_compliant_transformers)}'
            )

    def _get_features_original(self) -> pd.Series:
        """
        Return the series mapping output column names to original columns names.

        :return: the series with index the column names of the output dataframe and
            values the corresponding input column names.
        """
        result = reduce(
            lambda x, y: x.append(y),
            (
                (
                    pd.Series(
                        index=self.feature_names_in_[columns],
                        data=self.feature_names_in_[columns],
                    )
                    if df_transformer == ColumnTransformerCustomWrapperDF.__PASSTHROUGH
                    else df_transformer.feature_names_original_
                )
                for _, df_transformer, columns in self.native_estimator.transformers_
                if (
                    len(columns) > 0
                    and df_transformer != ColumnTransformerCustomWrapperDF.__DROP
                )
            ),
        )
        return result


# Instantiation of ColumnCustomTransformerDF
ColumnCustomTransformerDF = make_df_transformer(
    ColumnTransformer,
    name="ColumnTransformerCustomDF",
    base_wrapper=ColumnTransformerCustomWrapperDF,
)
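One caveat for anyone trying the snippets above on a recent pandas: both versions combine the per-transformer mappings with Series.append, which was deprecated in pandas 1.4 and removed in pandas 2.0. A minimal sketch of the same combination step using pd.concat follows. The helper name combine_feature_maps and the sample column names are hypothetical, and this is not the code sklearndf itself uses.

```python
import pandas as pd


def combine_feature_maps(feature_maps):
    """Concatenate output-name -> original-name mappings, skipping empty ones.

    Hypothetical helper for illustration only; assumes at least one
    non-empty mapping is passed.
    """
    non_empty = [mapping for mapping in feature_maps if len(mapping) > 0]
    return pd.concat(non_empty)


# Illustrative mappings, as three transformers might produce them:
scaled = pd.Series(index=["a"], data=["a"])                 # 1:1 transformer
encoded = pd.Series(index=["c_x", "c_y"], data=["c", "c"])  # one-hot style
passthrough = pd.Series(index=["b"], data=["b"])            # untouched column

combined = combine_feature_maps([scaled, encoded, passthrough])
print(combined)
```

The order of the result follows the order of the transformers, which matches how ColumnTransformer lays out its output columns.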
Additional context
Thank you in advance for your help,
Describe the bug
Hi,
First of all, well done on the package. I have been using sklearndf for a couple of months and it is very handy!
However, I have noticed an issue with the transformers: I cannot apply inverse_transform.
To Reproduce
For example, with StandardScalerDF I can fit the scaler and transform my DataFrame without issues. However, if I try to invert the transformation, it fails.
import pandas as pd
import numpy as np
from sklearndf.transformation import StandardScalerDF
# Initialise a random DataFrame
df = pd.DataFrame(np.random.randint(1, 100, size=(10, 2)), columns=["A", "B"])
print(df)
# Instantiation and Fitting of a Standard Scaler
scaler = StandardScalerDF()
scaler.fit(df)
df = scaler.transform(df)
print(df)
# Invert the scaling
df = scaler.inverse_transform(df)  # ERROR --> NotFittedError: StandardScalerDF is not fitted
Did I miss anything?
Expected behavior
Expect the inverse transform to be performed.
First idea of where to look
I noticed the call to reset_fit() in the inverse_transform method of TransformerWrapperDF. Is it really needed? This call triggers the bug, because it resets the attribute self._features_in to None.
With that call commented out, inverse_transform works.
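As a reference point for the expected behaviour, here is a minimal sketch of the same round trip with scikit-learn's native StandardScaler, which keeps its fitted state across inverse_transform. The seed and the variable names are illustrative. The DataFrame is rebuilt by hand because the native scaler returns a NumPy array by default.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Same shape of data as in the report, with a fixed seed for repeatability.
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.randint(1, 100, size=(10, 2)), columns=["A", "B"])

scaler = StandardScaler().fit(df)
scaled = scaler.transform(df)

# Inverting the scaling should reproduce the original values.
restored = pd.DataFrame(
    scaler.inverse_transform(scaled), index=df.index, columns=df.columns
)

print(np.allclose(restored.to_numpy(), df.to_numpy()))
```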
Thank you in advance for your help,
Describe the bug
The wrapper for the one-hot encoder fails when a column-reduction option is used (drop="if_binary" or drop="first").
The wrapper computes the expected number of columns of the transformed dataset without taking the drop option into account.
To Reproduce
Steps to reproduce the behavior:
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearndf.pipeline import PipelineDF
from sklearndf.transformation import (
    ColumnTransformerDF,
    OneHotEncoderDF,
    SimpleImputerDF,
)

X_churn: pd.DataFrame = ...
y_churn: pd.Series = ...

# For categorical features we will use the mode as the imputation value and also one-hot encode
preprocessing_categorical = PipelineDF(
    steps=[
        ("imputer", SimpleImputerDF(strategy="most_frequent", fill_value="<na>")),
        ("one-hot", OneHotEncoderDF(sparse=False, drop="if_binary")),
    ]
)

# For numeric features we will impute using the median
preprocessing_numerical = SimpleImputerDF(strategy="median")

# Put the pipeline together
preprocessing_features = ColumnTransformerDF(
    transformers=[
        (
            "categorical",
            preprocessing_categorical,
            make_column_selector(dtype_include=object),
        ),
        (
            "numerical",
            preprocessing_numerical,
            make_column_selector(dtype_include=np.number),
        ),
    ]
)

# Run the preprocessing
transformed_features = preprocessing_features.fit_transform(X=X_churn, y=y_churn)
transformed_features.head()
Expected behavior
Expected to see the transformed dataset with only one column for categorical columns that have only 2 unique values
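The column-count mismatch described above can be illustrated with scikit-learn's native OneHotEncoder. This is a minimal sketch on made-up data (the column names binary and multi are illustrative). It compares the actual output width with a count obtained by summing the lengths of categories_, which is one way a wrapper could overestimate the width. Whether sklearndf computes it exactly this way is an assumption.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# One feature with two categories, one with three.
df = pd.DataFrame(
    {"binary": ["yes", "no", "yes", "no"], "multi": ["a", "b", "c", "a"]}
)

full = OneHotEncoder().fit(df)
reduced = OneHotEncoder(drop="if_binary").fit(df)

n_full = full.transform(df).shape[1]        # one column per category
n_reduced = reduced.transform(df).shape[1]  # binary feature collapses to one

# A count that ignores the drop option still reports every category.
naive_count = sum(len(categories) for categories in reduced.categories_)

print(n_full, n_reduced, naive_count)
```

The default sparse output is used here so that the sketch does not depend on the sparse / sparse_output parameter name, which differs between scikit-learn versions.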
Since test table was deleted, replace CSV file as referenced in test config and update tests
The current requirement is scikit-learn 0.23.x. scikit-learn 0.24.* has been out since December 2020 and is becoming the default version in most environments. This incompatibility prevents us from using sklearndf.
Using StackingRegressorDF on pipelines containing a ColumnTransformerDF raises an error on .fit.
Using a StackingRegressorDF as the last step of a PipelineDF works as expected. But creating multiple PipelineDF objects with ColumnTransformerDF and then stacking these fails with the following error:
TypeError: StackingRegressorDF.fit: ColumnTransformerDF.fit_transform: arg y must be None, or a pandas Series or DataFrame
Most likely the reason is this line in StackingRegressor.fit:
y = column_or_1d(y, warn=True)
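The effect of that line can be checked in isolation. The sketch below passes a pandas Series through sklearn.utils.column_or_1d and inspects the type of the result. It suggests why a downstream check for "pandas Series or DataFrame" would fail, although it does not prove that this is the only cause of the error.

```python
import numpy as np
import pandas as pd
from sklearn.utils import column_or_1d

y = pd.Series([0.1, 0.2, 0.3], name="y")

# The same call StackingRegressor.fit makes on the target.
converted = column_or_1d(y, warn=True)

# The values survive, but the pandas container (index and name) does not.
is_plain_array = isinstance(converted, np.ndarray)
print(type(y).__name__, "->", type(converted).__name__)
```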
from sklearndf.pipeline import PipelineDF
from sklearndf.regression import LinearRegressionDF, ElasticNetDF
from sklearndf.transformation import ColumnTransformerDF, StandardScalerDF
from sklearndf.regression import StackingRegressorDF
import pandas as pd
import numpy as np

# toy data set
np.random.seed(1)
data = pd.DataFrame({
    'x1': np.random.uniform(size=(10,)),
    'x2': np.random.uniform(size=(10,)),
    'y': np.random.uniform(size=(10,)),
})

# basic building blocks
model1 = LinearRegressionDF()
model2 = ElasticNetDF()
preprocessing = ColumnTransformerDF([
    ('x1', StandardScalerDF(), ['x1']),
    ('x2', 'passthrough', ['x1']),
])

# Pipeline with stack works
pipeline = PipelineDF([
    ('preprocessing', preprocessing),
    ('stack', StackingRegressorDF([
        ('model1', model1),
        ('model2', model2),
    ]))
])
pipeline.fit(data, data['y'])
print(pipeline.predict(data))

# Stack of Pipelines doesn't
stack_of_pipelines = StackingRegressorDF([
    ('pipeline1', PipelineDF([
        ('preprocessing', preprocessing),
        ('model1', model1)
    ])),
    ('pipeline2', PipelineDF([
        ('preprocessing', preprocessing),
        ('model2', model1)
    ]))
])
stack_of_pipelines.fit(data, data['y'])
Is your feature request related to a problem? Please describe.
Feature is not directly related to a problem, but is rather an enhancement of existing functionality. As suggested by Julian King on the facet Slack channel, we could add Maximum Relevance Minimum Redundancy (MRMR) as a feature selection algorithm.
The algorithm is explained in the following papers:
https://arxiv.org/pdf/1908.05376.pdf
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1423-9
Describe the solution you'd like
Implement MrmrDF in a similar fashion to BorutaDF, such that it can be passed into an sklearndf pipeline.
Describe alternatives you've considered
The paper also suggests using a redundancy matrix to shed some light on the feature selection. While this is open for discussion, I would not use this output, to avoid confusion with the SHAP value redundancy calculated as part of the feature selection.
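To make the idea concrete, here is a simplified greedy sketch of the MRMR principle. It is an assumption-laden illustration, not the algorithm from the linked papers: relevance and redundancy are both measured with absolute Pearson correlation, and the score is their plain difference. The function name mrmr_select and the toy data are hypothetical.

```python
import numpy as np


def mrmr_select(X, y, n_features):
    """Greedy MRMR-style selection (simplified sketch, n_features >= 1).

    score = |corr(feature, target)| - mean |corr(feature, selected features)|
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_columns = X.shape[1]

    # Relevance: absolute correlation of each feature with the target.
    relevance = np.array(
        [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_columns)]
    )
    # Redundancy basis: pairwise absolute correlation between features.
    feature_corr = np.abs(np.corrcoef(X, rowvar=False))

    # Start with the most relevant feature, then add greedily.
    selected = [int(np.argmax(relevance))]
    while len(selected) < min(n_features, n_columns):
        remaining = [j for j in range(n_columns) if j not in selected]
        scores = [
            relevance[j] - feature_corr[j, selected].mean() for j in remaining
        ]
        selected.append(remaining[int(np.argmax(scores))])
    return selected


# Toy data: x1 is a near-duplicate of x0, x2 carries independent signal.
rng = np.random.default_rng(0)
x0 = rng.normal(size=200)
x1 = x0 + 0.01 * rng.normal(size=200)
x2 = rng.normal(size=200)
X = np.column_stack([x0, x1, x2])
y = x0 + 0.5 * x2

selected = mrmr_select(X, y, n_features=2)
print(selected)
```

In this toy setup the near-duplicate column is penalised for redundancy, so the second pick is the weaker but independent feature rather than the second copy of the strongest one.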
Is your feature request related to a problem? Please describe.
The RandomState class from numpy is now legacy (https://numpy.org/doc/1.19/reference/random/legacy.html) and has effectively been replaced by Generator (https://numpy.org/doc/stable/reference/random/generator.html).
This may also affect scikit-learn in the future, and a PR has been opened on the topic (scikit-learn/scikit-learn#16988).
Describe the solution you'd like
Given the dependency on scikit-learn and the fact that RandomState is now legacy, we should be proactive in making the necessary updates should RandomState be retired and/or scikit-learn replace RandomState.
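For readers unfamiliar with the two APIs, the following minimal sketch places them side by side. The seed value and variable names are illustrative. Both are reproducible for a given seed, but they are separate classes, so code that type-checks for RandomState would need to be adapted.

```python
import numpy as np

# Legacy API: still available, but frozen.
legacy = np.random.RandomState(42)
legacy_draw = legacy.uniform(size=3)

# Current API: default_rng returns a Generator instance.
rng = np.random.default_rng(42)
new_draw = rng.uniform(size=3)

# Re-seeding a new Generator reproduces the same draw.
repeat_draw = np.random.default_rng(42).uniform(size=3)

print(type(legacy).__name__, type(rng).__name__)
```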
Is your feature request related to a problem? Please describe.
XGBoost is currently not natively supported by sklearndf.
Describe the solution you'd like
Currently, users can work around this individually by using the make_df_regressor wrapper:
from xgboost import XGBRegressor
from sklearndf.wrapper import make_df_regressor
XGBRegressorDF = make_df_regressor(XGBRegressor)
It would be desirable to move this directly into sklearndf.regression and sklearndf.classification for the XGBoost regressor and classifier, respectively.
To avoid additional dependencies, we should assert that xgboost is installed if XGBRegressorDF or XGBClassifierDF is used.
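One possible shape for such a check is sketched below using only the standard library. The helper name require_optional_package is hypothetical and not part of sklearndf. The idea is to test for the package without importing it and to raise a clear ImportError if it is missing.

```python
import importlib.util


def require_optional_package(package_name: str, feature: str) -> None:
    """Raise ImportError if an optional top-level package is not installed.

    Hypothetical helper for illustration; not part of sklearndf.
    """
    if importlib.util.find_spec(package_name) is None:
        raise ImportError(
            f"{feature} requires the optional package '{package_name}'; "
            "please install it first"
        )


# A package from the standard library passes the check silently.
require_optional_package("json", feature="demo feature")

# A missing package produces a descriptive error.
try:
    require_optional_package(
        "surely_not_an_installed_package_xyz", feature="XGBRegressorDF"
    )
except ImportError as error:
    message = str(error)
    print(message)
```

A wrapper module could call such a check before defining XGBRegressorDF, so that sklearndf itself imports cleanly when xgboost is absent.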
Describe alternatives you've considered
n/a
Additional context
n/a