
fklearn's Introduction

fklearn: Functional Machine Learning


fklearn uses functional programming principles to make it easier to solve real problems with Machine Learning.

The name is a reference to the widely known scikit-learn library.

fklearn Principles

  1. Validation should reflect real-life situations.
  2. Production models should match validated models.
  3. Models should be production-ready with few extra steps.
  4. Reproducibility and in-depth analysis of model results should be easy to achieve.

Documentation | Getting Started | API Docs | Contributing |

Installation

To install via pip:

pip install fklearn

You can also install from the source:

git clone git@github.com:nubank/fklearn.git
cd fklearn
git checkout master
pip install -e .

License

Apache License 2.0

fklearn's People

Contributors

caique-lima, dieggoluis, eduardoburgel, erickisos, fpingas, gbakie, gnsantos, hf-lopes, jessicadesousa, kaiubiferreira, lucasestevam, marcelogdeandrade, marianablaz, matheusfacure, mlikoga, nasyxx, nicolas-behar, otaviocv, pedroig, prcastro, pvcrossi, raphaeldayan-nubank, robotenique, sadikneipp, tamesps, tatasz, tirkarthi, victor-ab, vitorsrg, vultor33


fklearn's Issues

"causal_inference.IPTW_learner" and "fklearn.documentation" missing

Instructions

  • Just run the first cell of casual.ipynb.

Code sample

      4 from fklearn.training.regression import xgb_regression_learner
      5 from fklearn.training.classification import xgb_classification_learner, logistic_classification_learner
----> 6 from fklearn.training.causal_inference import IPTW_learner
      7 from fklearn.validation.evaluators import r2_evaluator, auc_evaluator
      8 from fklearn.data.datasets import make_confounded_data

ModuleNotFoundError: No module named 'fklearn.training.causal_inference'

Just as above, but with the Demos.ipynb:

----> 1 from fklearn.documentation.pd_extractors import evaluator_extractor as pd_evaluator_extractor, extract as pd_extract, \
      2                                         reverse_learning_curve_evaluator_extractor
      3 #from fklearn.documentation.extractors import evaluator_extractor, extract ## different from pd_extractor

ModuleNotFoundError: No module named 'fklearn.documentation'

Problem description

It seems that these modules were deleted before the public release.

Repetitive words in documentation

Instructions

  • Remove the duplicate words

Describe the documentation issue


The section "Working with the code" has duplicate words ("that").

Misleading date interval in splitting.py file

Instructions

The specified date interval code in splitting.py could lead to errors.
Path: .\fklearn\src\fklearn\preprocessing\splitting.py

Code sample

In splitting.py we have:

train_period = dataset[
    (dataset[time_column] >= train_start_date) & (dataset[time_column] < train_end_date)]
outime_inspace_hdout = dataset[
    (dataset[time_column] >= train_end_date) & (dataset[time_column] < holdout_end_date)]

Problem description

We can see from the source that "holdout_end_date" won't be included in the "outime_inspace_hdout" dataset.
This could mislead users.
For example, in your regression.ipynb notebook, the date "2016-12-31" completely vanished when the data was split.

Also, looking at the source, "train_end_date" won't be included in the "train_period" dataset.
It is not wrong, but it is not what we would expect.

Possible solutions

I would suggest redefining these limits:

train_period = dataset[
    (dataset[time_column] >= train_start_date) & (dataset[time_column] <= train_end_date)]
outime_inspace_hdout = dataset[
    (dataset[time_column] > train_end_date) & (dataset[time_column] <= holdout_end_date)]

In the same file (splitting.py), the function "time_split_dataset" has similar behaviour.
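
A minimal standalone illustration of the boundary behavior with made-up dates (pure pandas, not the library's code):

import pandas as pd

df = pd.DataFrame({"date": pd.date_range("2016-12-29", "2016-12-31")})
train_end_date, holdout_end_date = "2016-12-30", "2016-12-31"

# Current behavior: strict upper bounds on both splits, so 2016-12-31 lands in neither.
train = df[(df["date"] >= "2016-12-29") & (df["date"] < train_end_date)]
holdout = df[(df["date"] >= train_end_date) & (df["date"] < holdout_end_date)]
print(len(train) + len(holdout), "of", len(df), "rows kept")  # 2 of 3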

Support for ONNX

Describe the feature and the current state.

Support for the onnx format (https://github.com/onnx), in a form similar to what already exists for sklearn.

Will this change a current behavior? How?

No, but it will increase interoperability with tools that support onnx but not fklearn, e.g. mlflow.
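
For reference, this is roughly how sklearn models are exported with skl2onnx today, which is the kind of interoperability being requested (a sketch; fklearn itself has no ONNX support):

from sklearn.linear_model import LinearRegression
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

model = LinearRegression().fit([[0.0], [1.0], [2.0]], [0.0, 1.0, 2.0])
onnx_model = convert_sklearn(model, initial_types=[("input", FloatTensorType([None, 1]))])
with open("linear_regression.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())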

`lgbm_classification_learner` failing when using multiclass one-vs-all objective

Code sample

import pandas as pd
import numpy as np
from fklearn.training.classification import lgbm_classification_learner

# sample df with three classes
sample_df = pd.DataFrame(
    {
        "a": np.linspace(1, 100, 100),
        "b": np.sin(np.linspace(1, 100, 100)),
    }
)
sample_df["target_raw"] = sample_df["a"]*sample_df["b"]
sample_df["target"] = np.where((sample_df["a"] < 50)&(sample_df["b"] < 0), 0, np.where(sample_df["target_raw"] < 0, 1, 2))

# train the multiclass lightgbm
p, train_scored, logs = lgbm_classification_learner(num_estimators=10, target="target", features=["a", "b"], extra_params={"objective": "multiclassova", "num_class": 3})(sample_df)

Problem description

The current implementation of lgbm_classification_learner breaks when we use multiclassova, or any of its aliases, as the objective function.
The check in the learner only accounts for multiclass, which uses the softmax objective, and raises an error for multiclassova.

The previous code will fail with ValueError: Wrong number of items passed 3, placement implies 1

Expected behavior

The function should check whether the objective is multiclass, multiclassova, or any valid alias of these objectives, so it doesn't break.

Possible solutions

Change the if condition in lgbm_classification_learner that checks for multiclass so it also covers the other multi-output objectives and their aliases ("softmax", "ova", "ovr").
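
A sketch of what such a check could look like; the alias list follows the LightGBM documentation, and the helper name is illustrative rather than fklearn's actual code:

# Objectives that produce one score per class, according to the LightGBM docs.
MULTICLASS_OBJECTIVES = {"multiclass", "softmax", "multiclassova", "multiclass_ova", "ova", "ovr"}

def is_multiclass_objective(params: dict) -> bool:
    return params.get("objective", "") in MULTICLASS_OBJECTIVES

print(is_multiclass_objective({"objective": "multiclassova", "num_class": 3}))  # True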

[Code Style] Do not use asserts to check correct input-args, raise errors

Describe the feature and the current state.

This is more of a suggestion than a feature or bug report, and since it is effectively a subjective matter it may be ignored. However, it would be great to start adopting a coding style more in line with what other great Python packages are doing.

Looking at the source code, for example, here you assert that the function's input args are correct:

assert (proportion > 0.0) & (proportion < 1.0), "proportions must be between 0 and 1"

However, a better and more widespread practice is to raise errors in these cases:
https://github.com/google/styleguide/blob/gh-pages/pyguide.md#244-decision

Make use of built-in exception classes when it makes sense. 
For example, raise a ValueError if you were passed a negative number but were expecting a positive one. 
Do not use assert statements for validating argument values of a public API. 
assert is used to ensure internal correctness, not to enforce correct usage nor to indicate that some unexpected event occurred. 
If an exception is desired in the latter cases, use a raise statement.

More generally, this feature request can be extended to adopting a more widespread code style, such as the one in the aforementioned pyguide.
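
For illustration, the assert above could become an explicit check (a sketch, not a patch against the actual fklearn code):

def check_proportion(proportion: float) -> None:
    # A raise survives `python -O` (which strips asserts) and gives callers a clear exception type.
    if not 0.0 < proportion < 1.0:
        raise ValueError("proportion must be between 0 and 1 (exclusive), got {}".format(proportion))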

Will this change a current behavior? How?

Some fair amount of code would change, mostly in error/exceptions handling.

Additional Information

cc @darlansf1

Some imports from the example notebook don't work with the latest version.

Instructions

Describe the documentation issue

  • If you run the example notebook referenced in the documentation, the following lines result in a ModuleNotFoundError:
In [1]: from fklearn.documentation.pd_extractors import evaluator_extractor as pd_evaluator_extractor, extract as pd_extract, reverse_learning_curve_evaluator_extractor

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-3-90ee9e939b26> in <module>
----> 1 from fklearn.documentation.pd_extractors import evaluator_extractor as pd_evaluator_extractor, extract as pd_extract, reverse_learning_curve_evaluator_extractor

ModuleNotFoundError: No module named 'fklearn.documentation'

And

from fklearn.documentation.extractors import evaluator_extractor, extract ## different from pd_extractor
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-5-a5e7caa37303> in <module>
----> 1 from fklearn.documentation.extractors import evaluator_extractor, extract ## different from pd_extractor

ModuleNotFoundError: No module named 'fklearn.documentation'

Make it possible to have a gap between training and holdout in time splitters

Describe the feature and the current state.

Currently, train_end_date is used as a "holdout_start_date", so it's not possible to define a different date to start the holdout period, while having a gap between the training period and the out-of-time period may be useful in a couple of analyses.

So the proposal is to add a new parameter called holdout_start_date to time_split_dataset and space_time_split_dataset, making it possible to define a different date than train_end_date when needed.

Will this change a current behavior? How?

Since holdout_start_date would default to None and fall back to train_end_date, this shouldn't break these functions.
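
A standalone sketch of the proposed behavior, with a hypothetical holdout_start_date defaulting to train_end_date (pure pandas, not the library's code):

import pandas as pd

def time_split_with_gap(df, time_column, train_start, train_end, holdout_end, holdout_start=None):
    # Proposed default: no gap unless holdout_start is given explicitly.
    holdout_start = train_end if holdout_start is None else holdout_start
    train = df[(df[time_column] >= train_start) & (df[time_column] < train_end)]
    holdout = df[(df[time_column] >= holdout_start) & (df[time_column] < holdout_end)]
    return train, holdout

df = pd.DataFrame({"time": pd.date_range("2019-01-01", "2019-06-30")})
train, holdout = time_split_with_gap(df, "time", "2019-01-01", "2019-04-01", "2019-07-01",
                                     holdout_start="2019-05-01")  # one-month gap: April is dropped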

Missing information and duplicated lines on splitting.py

Instructions

There is some missing information and there are duplicated lines in the splitting.py documentation.
Path: .\fklearn\src\fklearn\preprocessing\splitting.py

Describe the documentation issue

It is in the space_time_split_dataset function:

    Returns
    ----------
    train_set : pandas.DataFrame
        The in ID sample and in time training set.

    intime_outspace_hdout : pandas.DataFrame
        The out of ID sample and in time hold out set.  #duplicated line

    outime_inspace_hdout : pandas.DataFrame
         The out of ID sample and in time hold out set. #duplicated line

    holdout_space : pandas.DataFrame    
         The out of ID sample and in time hold out set. #duplicated line



#Should it return holdout_space?

Possible solutions

The following text is my guess at what this function should return:

   Returns
    ----------
    train_set : pandas.DataFrame
        Samples with timestamp >= train_start_date and timestamp < train_end_date
        All IDs are included except from those selected for validation (holdout_space)

    intime_outspace_hdout : pandas.DataFrame
        Samples with same timestamps of train_set
        IDs are selected in holdout_space array
        All rows with selected ID and in specified timestamps are included

    outime_inspace_hdout : pandas.DataFrame
        Samples with timestamp >= train_end_date and timestamp < holdout_end_date
        All IDs are included

    outime_outspace_hdout : pandas.DataFrame
        Samples with same timestamps of outime_inspace_hdout.
        IDs are selected in holdout_space array 
        All rows with selected ID and in specified timestamps are included

Double/Debiased Reusing Models when Curried

Code sample

A common pattern in fklearn is to partially define a training function to create a fit_fn that will later be reused. For example, let's say we want to do feature selection. We can create a fit_fn by passing all the parameters to a learner except the feature_columns:

fit_fn = non_parametric_double_ml_learner(
    df=train_set_eng.head(10000),
    treatment_column = treatment,
    outcome_column = outcome,
    debias_model = LGBMRegressor(),
    denoise_model = LGBMRegressor(),
    final_model = LGBMRegressor(),
    prediction_column= "dml_elast_prediction",
)

Then, we can try estimating the above model with different feature sets. For example, we can fit a model with only 3 features (m1) and another one with 6 features (m2).

f1 = features[:3]
f2 = features[:6]

m1 = fit_fn(feature_columns=f1)[0]
m2 = fit_fn(feature_columns=f2)[0]

However, once we try to make predictions with the first model, we get the following error

m1(train_set_eng[f1].head())["dml_elast_prediction"]
ValueError: Number of features of the model must match the input. Model n_features_ is 6 and input n_features is 3 

So, m1 is requiring 6 features? But we only trained it with 3? What is happening?

Problem description

The problem is that fklearn reuses the same model instances that were passed as parameters to the learner function:

debias_model = LGBMRegressor(),
denoise_model = LGBMRegressor(),
final_model = LGBMRegressor(),

That way, since we train m2 after m1 and the same instances are reused across both models, m1 ends up holding the model instance from m2, which requires 6 features.

Possible solutions

One easy solution is to use from sklearn.base import clone in the learner here

debias_model = GradientBoostingRegressor() if not debias_model else clone(debias_model)
denoise_model = GradientBoostingRegressor() if not denoise_model else clone(denoise_model)
final_model = GradientBoostingRegressor() if not final_model else clone(final_model)

That ensures we are always creating a new instance when calling this learner function.

This is probably happening here too

cv_pred.iloc[test] = m.predict(train_data[features].iloc[test])
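
A minimal standalone reproduction of the shared-instance behavior and the clone fix, using only scikit-learn (not fklearn code):

import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression

X = np.random.rand(100, 6)
y = X @ np.arange(1, 7)

shared = LinearRegression()
m1 = shared.fit(X[:, :3], y)  # same object...
m2 = shared.fit(X, y)         # ...refit later with 6 features
print(m1 is m2, m1.n_features_in_)  # True 6 -> m1 was silently overwritten

m1 = clone(shared).fit(X[:, :3], y)  # clone gives each fit its own fresh instance
m2 = clone(shared).fit(X, y)
print(m1.n_features_in_, m2.n_features_in_)  # 3 6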

Problems to pickle pipelines in the 1.21.0

Problem description

Users reported that they were able to train models without any big issue, but when they tried to pickle them using cloudpickle==0.8.1 they got the following error:

_pickle.PicklingError: Can't pickle typing.Union[str, float]: it's not the same object as typing.Union

cc: @pedroig

TypeError on regression.ipynb

Instructions

Just run the first two cells of regression.ipynb.
Path: .\fklearn\docs\source\examples\regression.ipynb

Code sample

In the second cell of the regression.ipynb notebook:

import numpy.random as random

random.seed(150)

dates = pd.DataFrame({'score_date': pd.date_range('2016-01-01', '2016-12-31')})
dates['key'] = 1

ids = pd.DataFrame({'id': np.arange(0, 100)})
ids['key'] = 1

data = pd.merge(ids, dates).drop('key', axis=1)

data['x1'] = 23 * random.randn(data.shape[0]) + 500
data['x2'] = 59 * random.randn(data.shape[0]) + 235
data['x3'] = 73 * random.randn(data.shape[0]) + 793  # Noise feature.

data['y'] = 0.37*data['x1'] + 0.97*data['x2'] + 0.32*data['x2']**2 - 5.0*data['id']*0.2 + \
            np.cos(pd.to_datetime(data['score_date']).astype(int)*200)*20.0

nan_idx = np.random.randint(0, data.shape[0], size=100)  # Inject nan in x1.
data.loc[nan_idx, 'x1'] = np.nan

nan_idx = np.random.randint(0, data.shape[0], size=100)  # Inject nan in x2.
data.loc[nan_idx, 'x2'] = np.nan

Problem description

I got the following error:

TypeError                                 Traceback (most recent call last)
<ipython-input-61-af16353aef4c> in <module>
     16 
     17 data['y'] = 0.37*data['x1'] + 0.97*data['x2'] + 0.32*data['x2']**2 - 5.0*data['id']*0.2 + \
---> 18             np.cos(pd.to_datetime(data['score_date']).astype(int)*200)*20.0
     19 
     20 nan_idx = np.random.randint(0, data.shape[0], size=100)  # Inject nan in x1.

TypeError: cannot astype a datetimelike from [datetime64[ns]] to [int32]

Expected behavior

Run without errors.

Possible solutions

I added ".values" on datetime and it worked nicely

BEFORE:
data['y'] = 0.37data['x1'] + 0.97data['x2'] + 0.32data['x2']**2 - 5.0data['id']*0.2 +
np.cos(pd.to_datetime(data['score_date']).astype(int)*200)*20.0

ADDING .values
data['y'] = 0.37data['x1'] + 0.97data['x2'] + 0.32data['x2']**2 - 5.0data['id']*0.2 +
np.cos(pd.to_datetime(data['score_date']).values.astype(int)*200)*20.0

“python_requires” should be set with “>=3.5”, as fklearn 2.0.0 is not compatible with all Python versions.

Currently, the keyword argument python_requires of setup() is not set, so it is assumed that this distribution is compatible with all Python versions.
However, I found that it is not compatible with Python 2. My local Python version is 2.7, and I encountered the following error when executing "pip install fklearn":

Collecting fklearn
  Downloading https://files.pythonhosted.org/packages/1a/aa/b1af28f1d63958e50e82a64f69e5c9cf250b717cdf4a115a82fbc028094e/fklearn-2.0.0.tar.gz (60kB)
     |████████████████████████████████| 61kB 
…
Collecting pandas<2,>=0.24.1
  Downloading https://files.pythonhosted.org/packages/61/57/6c233cc63597c6aa6337e717bdeabf791e8b618e9c893922a223e4e41cf4/pandas-0.24.2-cp27-cp27m-win_amd64.whl (8.3MB)
     |████████████████████████████████| 8.3MB 393kB/s
ERROR: Could not find a version that satisfies the requirement scikit-learn<0.24.0,>=0.21.2 (from fklearn) (from versions: 0.9, 0.10, 0.11, 0.12, 0.12.1, 0.13, 0.13.1, 0.14, 0.14.1, 0.15.0b1, 0.15.0b2, 0.15.0, 0.15.1, 0.15.2, 0.16b1, 0.16.0, 0.16.1, 0.17b1, 0.17, 0.17.1, 0.18rc2, 0.18, 0.18.1, 0.18.2, 0.19b2, 0.19.0, 0.19.1, 0.19.2, 0.20rc1, 0.20.0, 0.20.1, 0.20.2, 0.20.3, 0.20.4, 0.21rc2)
ERROR: No matching distribution found for scikit-learn<0.24.0,>=0.21.2 (from fklearn)

Dependencies of this distribution are listed as follows:

joblib>=0.13.2,<2
numpy>=1.16.4,<2
pandas>=0.24.1,<2
scikit-learn>=0.21.2,<0.24.0
statsmodels>=0.9.0,<1
toolz>=0.9.0,<1

I found that scikit-learn>=0.21.2,<0.24.0 requires Python>=3.5, which results in the installation failure of fklearn on Python 2.7.

Way to fix:
modify setup() in setup.py and add the python_requires keyword argument:

setup(…
     python_requires=">=3.5",
     …)

Thanks for your attention.
Best regards,
PyVCEchecker

Pipelines can't be nested as any other learner

Code sample

This script is complete, it should run "as is"

from fklearn.training.regression import linear_regression_learner
from fklearn.training.pipeline import build_pipeline
from fklearn.training.imputation import imputer

# This example will try to create a linear regressor...
def load_sample_data():
    import pandas as pd
    from sklearn import datasets
    x,y = datasets.load_diabetes(as_frame=True, return_X_y=True)
    return pd.concat([x, y], axis=1)

# ...but pipelines can't be used as individual learners
nested_learner = build_pipeline(
    build_pipeline(
        imputer(columns_to_impute=["age"], impute_strategy='median'),
    ),
    linear_regression_learner(features=["age"], target=["target"])
)
fn, df_result, logs = nested_learner(load_sample_data())
'''
Output:
ValueError: Predict function of learner pipeline contains variable length arguments: kwargs
Make sure no predict function uses arguments like *args or **kwargs.
'''

Problem description

This is a problem because (quoting the docs) "pipelines (should) behave exactly as individual learner functions". That is, pipelines should be consistent with the L (Liskov substitution) in SOLID, but that is not happening.

The main benefit of supporting nested pipelines is that you can produce more maintainable code, because packing complex operations into one big step like:

def custom_learner(feature: str, target: str):
    return build_pipeline(
        imputer(columns_to_impute=[feature], impute_strategy='median'),
        linear_regression_learner(features=[feature], target=[target])
    )

is cleaner and more readable.

Expected behavior

The "Code sample" should produce the same output as if nested_learner was defined as:

nested_learner = build_pipeline(
    imputer(columns_to_impute=["age"], impute_strategy='median'),
    linear_regression_learner(features=["age"], target=["target"])
)

That is, if a pipeline is a type of learner, it should be possible to put it in place of any other learner.

Possible solutions

A workaround is proposed in #145, but it only works if you already have a DataFrame (which doesn't happen in this scenario). I would like to hear if someone has investigated this or knows what we need to change to support it; in the meantime, I propose adding a note to the docs to prevent someone else from trying to do this.

Allow fit params for CatBoost

Describe the feature and the current state.

  • CatBoost has several parameters in its fit method. One possible solution is to add a fit_params argument to the CatBoost functions in fklearn for increased flexibility.
  • This improvement would have to be done for both classification and regression.

This is especially important for CatBoost because the parameter that allows it to treat categorical features by itself lives in the fit method, and this is the main differentiator of the library.

Will this change a current behavior? How?

We would be able to configure the fit function using FkLearn.

Additional Information

https://catboost.ai/docs/concepts/python-reference_catboost_fit.html
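
For context, this is the CatBoost fit-time behavior the proposal wants to expose (pure catboost; the toy data is made up and the fit_params name on the fklearn side is hypothetical):

import pandas as pd
from catboost import CatBoostClassifier

df = pd.DataFrame({"color": ["red", "blue", "red", "green"],
                   "size": [1.0, 2.0, 3.0, 4.0],
                   "target": [0, 1, 0, 1]})

model = CatBoostClassifier(iterations=10, verbose=False)
# cat_features is one of several arguments accepted by fit(); with a DataFrame it can be column names.
model.fit(df[["color", "size"]], df["target"], cat_features=["color"])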

Include kwargs in the evaluator's wrappers

Instructions

  • Include kwargs in the evaluator's functions

from

def precision_evaluator(test_data: pd.DataFrame,
                        threshold: float = 0.5,
                        prediction_column: str = "prediction",
                        target_column: str = "target",
                        eval_name: str = None) -> EvalReturnType:

    eval_fn = generic_sklearn_evaluator("precision_evaluator__", precision_score)
    eval_data = test_data.assign(**{prediction_column: (test_data[prediction_column] > threshold).astype(int)})
    return eval_fn(eval_data, prediction_column, target_column, eval_name)

to

def precision_evaluator(
    test_data: pd.DataFrame,
    threshold: float = 0.5,
    prediction_column: str = "prediction",
    target_column: str = "target",
    eval_name: str = None,
    **kwargs,
) -> EvalReturnType:   

    eval_fn = generic_sklearn_evaluator("precision_evaluator__", precision_score)
    eval_data = test_data.assign(**{prediction_column: (test_data[prediction_column] > threshold).astype(int)})
    return eval_fn(eval_data, prediction_column, target_column, eval_name, **kwargs)

Describe the feature and the current state.

  • Evaluators are wrapped by functions that do not accept **kwargs, so one cannot use parametrizations other than the defaults.

Will this change a current behavior? How?

  • One will be able, as required by my project, to get the precision and recall per label instead of an average over labels, which can only be done by setting the proper parameter, as shown below. Furthermore, with only this change, any kind of extra parametrization of the evaluators becomes possible:

precision_evaluator(target_column=target, average=None, labels=[0, 1])

Extra information

  • Given the structure of the generic evaluator generic_sklearn_evaluator, it seems to me that having **kwargs was the intention from the beginning, but the kwargs were missed in the individual evaluators' wrappers, as can be read in its definition:
def generic_sklearn_evaluator(name_prefix: str, sklearn_metric: Callable[..., float]) -> UncurriedEvalFnType:
    """
    Returns an evaluator build from a metric from sklearn.metrics
    Parameters
    ----------
    name_prefix: str
        The default name of the evaluator will be name_prefix + target_column.
    sklearn_metric: Callable
        Metric function from sklearn.metrics. It should take as parameters y_true, y_score, kwargs.

Out of Time In space split is wrong on space_time_split_dataset function

Issue location

src/fklearn/preprocessing/splitting.py

Function: space_time_split_dataset

Problem description

space_time_split_dataset splits the input DataFrame into 4 parts:

  • train_set
  • intime_outspace_hdout
  • outime_inspace_hdout
  • outime_outspace_hdout

The outime_inspace_hdout split is defined wrongly in the function: it is only filtered by time, not by space.

outime_inspace_hdout = dataset[
        (dataset[time_column] >= train_end_date) & (dataset[time_column] < holdout_end_date)]

Expected behavior

The outime_inspace_hdout split should be out of time and in space, not only out of time.

Possible solutions

We should rename this variable to outime_hdout, and define the right split together with the other 3.

E.g.

train_set = train_period[~train_period[space_column].isin(holdout_space)]
intime_outspace_hdout = train_period[train_period[space_column].isin(holdout_space)]
outime_outspace_hdout = outime_hdout[outime_hdout[space_column].isin(holdout_space)]
outime_inspace_hdout = outime_hdout[~outime_hdout[space_column].isin(holdout_space)]
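
A standalone sketch of the four splits with the corrected definitions (toy data, pure pandas; not the library's code):

import pandas as pd

df = pd.DataFrame({"id": ["a", "a", "b", "b"],
                   "time": pd.to_datetime(["2019-01-01", "2019-05-01", "2019-01-01", "2019-05-01"])})
train_end_date, holdout_end_date, holdout_space = "2019-04-01", "2019-07-01", ["b"]

train_period = df[df["time"] < train_end_date]
outime_hdout = df[(df["time"] >= train_end_date) & (df["time"] < holdout_end_date)]

train_set = train_period[~train_period["id"].isin(holdout_space)]             # in space, in time
intime_outspace_hdout = train_period[train_period["id"].isin(holdout_space)]  # out of space, in time
outime_inspace_hdout = outime_hdout[~outime_hdout["id"].isin(holdout_space)]  # in space, out of time
outime_outspace_hdout = outime_hdout[outime_hdout["id"].isin(holdout_space)]  # out of space, out of time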

Functional that adds a custom model to the pipeline

Instructions

I believe people should be free to use any model they want.
So, I guess there should be some custom model functional that lets people add any model they want to the pipeline.
I have done something for myself here; maybe it could be of some use.

Describe the feature and the current state.

For supervised machine learning there is a limited number of available models.
I couldn't find a simple way to add a new model, like a GaussianProcessRegressor, to the pipeline.

Proposed solution

Here is what I have done. Maybe it could be useful as inspiration.

(You can just copy and paste this code into a Jupyter notebook with fklearn installed and it should work.)

FUNCTIONAL DEFINITION

import numpy as np
import pandas as pd
from typing import List, Any, Dict
from toolz import curry, merge, assoc

from fklearn.types import LearnerReturnType, LogType
from fklearn.common_docstrings import learner_return_docstring, learner_pred_fn_docstring
from fklearn.training.utils import log_learner_time


@curry
@log_learner_time(learner_name='custom_supervised_model_learner')
def custom_supervised_model_learner(df: pd.DataFrame,
                         features: List[str],
                         target: str,
                         model: Any,
                         log: Dict[str,Dict],
                         prediction_column: str = "prediction") -> LearnerReturnType:
    """
    Fits a custom model to the dataset.
    Returns the predict function, the predictions for the input dataset and a log describing the model.

    Parameters
    ----------

    df : pandas.DataFrame
        A Pandas' DataFrame with features and target columns.
        The model will be trained to predict the target column
        from features.

    features : list of str
        A list of column names that are used as features for the model. All these names
        should be in `df`.

    target : str
        The name of the column in `df` that should be used as target for the model.

    model: Object
        model must have a ".fit" method that trains on the data.
        For classification problems it also needs a ".predict_proba" method.
        For regression problems it also needs a ".predict" method.

    log : Dict[str,Dict]
        Log with additional information of the custom model used.
        It must start with just one element with the model name.
        On the second layer, it must define the supervised type.
        Example:
        log = {'GaussianProcessRegressor': {
            'type' : 'regression',  # mandatory (could be classification too)
            'features_used': FEATURES,  # optional
            'parameters_used': 'Default parameters', #optional
            ... # additional optional info
            
    prediction_column : str
        The name of the column with the predictions from the model.
        For classification problems, one probability column is added per class: prediction_column + "_" + str(i) for i in range(0, n_classes).
        For regression, just prediction_column will be added.
    """
    
    model.fit(df[features].values, df[target].values)
    modelName = list(log.keys())[0]
    
    def p(new_df: pd.DataFrame) -> pd.DataFrame:
        if log[modelName]['type'] == 'classification':
            pred = model.predict_proba(new_df[features].values)
            col_dict = {}
            for (key, value) in enumerate(pred.T):
                col_dict.update({prediction_column + "_" + str(key): value})

        elif log[modelName]['type'] == 'regression':
            col_dict = {prediction_column: model.predict(new_df[features].values)}
        
        return new_df.assign(**col_dict)

    p.__doc__ = learner_pred_fn_docstring("custom_supervised_model_learner")

    return p, p(df), log

custom_supervised_model_learner.__doc__ += learner_return_docstring("Custom Supervised Model Learner")

TESTING THE FUNCTIONAL

DEFINING THE TEST DATA

import random
import pandas as pd
column1 = [random.random() for x in range(100)] # dummy feature
column2 = [random.random() for x in range(100)]
target =  [5 * x + 0.1*random.random() for x in column2]  # there is some noise

data = pd.DataFrame({'feat_1' : column1, 
                     'feat_2' : column2,
                     'target' : target})
FEATURES = data.columns.tolist()
TARGET = ['target']
FEATURES.remove(TARGET[0])

TESTING

from sklearn.gaussian_process import GaussianProcessRegressor
from fklearn.training.pipeline import build_pipeline
model = GaussianProcessRegressor()
log = {'GaussianProcessRegressor': {
            'type' : 'regression',  # it must have log[modelName]['type'] = 'classification' or 'regression'
            'features_used': FEATURES,
            'parameters_used': 'Default parameters',
            'why_GaussianProcessRegressor_was_chosen' : 'I just had chosen it randomly'}
      }
my_model = custom_supervised_model_learner(features = FEATURES, target = TARGET[0], model = model, log = log)
my_learner = build_pipeline(my_model)
(prediction_function, data_trained, logs) = my_learner(data)
print(data_trained.head(5))  # good enough

Dependency management

Hi ;)

This project uses requirements.txt as the dependency manager file. Wouldn't it be useful to use Pipenv + Pipfile or Poetry + pyproject.toml as a substitute?

With any of these you can manage all dependencies in one file, with features like locking dependencies with hashes and the like...


More info:

Pipenv: https://pipenv.pypa.io/en/latest/
Pipfile: https://github.com/pypa/pipfile

Poetry: https://python-poetry.org/
Pyproject.toml: https://python-poetry.org/docs/pyproject/

A functional that takes a custom function and applies it to Data

I am afraid I am being a little annoying, but I am just trying to apply fklearn to my problem. I guess with a couple more issues I will be done.

Instructions

I need a functional that takes a custom function and applies it to my data, for example, to create a new column by adding two others.

Describe the feature and the current state.

The functional "custom_transformer" receives a function and applies it to a column. I need a functional that can create and delete columns just like onehot_categorizer do.
Keeping track of feature names could be a nuissance #58. I had done something about that for myself too #68.

Will this change a current behavior? How?

No, just add a functionality.

Proposed solution

Here is what I have done. Maybe it could be useful as inspiration.

(You can just copy and paste this code into a Jupyter notebook with fklearn installed and it should work.)

Generating data

import random
import pandas as pd
column1 = [random.choice(['a','b','c']) for x in range(100)]
column2 = [random.random() for x in range(100)]
column3 = [random.random() for x in range(100)]
training_data = pd.DataFrame({'cat_feature' : column1, 
                              'num_feature_1' : column2,
                              'num_feature_2' : column3})
FEATURES = training_data.columns.tolist()

DEFINING THE CUSTOM FUNCTION THAT ADDS TWO SPECIFIC COLUMNS:

  • 'num_feature_1' + 'num_feature_2'
  • This function has two additional attributes:
    • log : information that will be added to the pipeline log
    • update_features : a function that keeps track of feature names after the application of sum_two_features
import pandas as pd
from typing import Dict, List

def sum_two_features(df: pd.DataFrame) -> pd.DataFrame:
    assert ('num_feature_1' in df.columns) & ('num_feature_2' in df.columns)
    new_df = df.copy()
    new_df['num_sum'] = new_df.loc[:,'num_feature_1'] + new_df.loc[:,'num_feature_2']
    new_df = new_df.drop(['num_feature_1','num_feature_2'], axis=1)
    return new_df

def sum_two_features_log() -> Dict[str,Dict]:
    return {'sum_two_features': {
        'removed_columns': ['num_feature_1','num_feature_2'],
        'added_columns': ['num_sum'],
        'more_info': 'num_sum column is num_feature_1 plus num_feature_2'}}

def sum_two_features_update_features(features: List[str]) -> List[str]:
    new_features = list(features)
    if 'num_feature_1' in new_features:
        new_features.remove('num_feature_1')
    if 'num_feature_2' in new_features:
        new_features.remove('num_feature_2')
    new_features += ['num_sum']
    return new_features

setattr(sum_two_features,'log', sum_two_features_log)
setattr(sum_two_features,'update_features', sum_two_features_update_features)

DEFINING THE FUNCTIONAL TO BE ADDED TO "transformation.py"

from toolz import curry
import pandas as pd
from typing import Callable
from fklearn.training.utils import log_learner_time
from fklearn.types import LearnerReturnType
from fklearn.common_docstrings import learner_return_docstring, learner_pred_fn_docstring

@curry
@log_learner_time(learner_name='custom_data_transformer')
def custom_data_transformer(df: pd.DataFrame,
                       transformation_function: Callable[[pd.DataFrame],pd.DataFrame]) -> LearnerReturnType:
    """
    Applies a custom function to the dataset.

    Parameters
    ----------
    df : pandas.DataFrame
        A Pandas' DataFrame

    transformation_function : function(pandas.DataFrame) -> pandas.DataFrame
        A function that receives a DataFrame as input, performs a transformation
        and returns another DataFrame.
        
    Additional info
    ----------
    transformation_function must have a log attribute generated by:

        setattr(transformation_function, 'log', transformation_function_log_function)

    This log will be saved in the pipeline logs.

    Also, transformation_function should have its own way to update the feature names.
    """

    def p(new_data_set: pd.DataFrame) -> pd.DataFrame:
        new_data_set = transformation_function(new_data_set)

        return new_data_set

    p.__doc__ = learner_pred_fn_docstring("custom_data_transformer")

    log = transformation_function.log()

    return p, p(df), log

custom_data_transformer.__doc__ += learner_return_docstring("Custom Data Transformer")

TESTING THE FUNCTIONAL

my_feature_adder = custom_data_transformer(transformation_function = sum_two_features)
(function_out, data_applied, log) = my_feature_adder(training_data)
print(data_applied.head(3),'\n')
print(log)

Keeping track of feature names after applying onehot_categorizer

PATH: .\fklearn\src\fklearn\training\transformation.py : onehot_categorizer

Instructions

When onehot_categorizer is applied, some columns are deleted and others are created.
However, we need the final column names for machine learning models.
These names can be obtained by hard coding, but I think hard coding is not fkleonic.

Describe the feature and the current state.

onehot function can be created by the following code:

from fklearn.training.transformation import onehot_categorizer
onehot_function = onehot_categorizer(columns_to_categorize = ['categorical_feature'])

Inside the pipeline, "onehot_function" will be applied to the data and new columns will be created.
I couldn't find any simple way to get the new column names, so I created a method to address this. It is enough for me, but I think an official one is needed.

Proposed solution

At "transformation.py" file, I had added the following code:

def update_onehot_feature_names(features: List[str],
                                log: Dict[str, Dict]) -> List[str]:
    """ Update feature names that were created by onehot_categorizer

    Parameters
    ----------
    features : list of str
        A list of column names that are used as features for the model. All these names
        should be in `df`.

    log : dict
        Log of the onehot_categorizer applied to the training data.
        It must be the training data, because test data could have categories that aren't previously known.

    Returns
    ----------
    new_features : list of str
        A list of column names with the original categorical features removed
        and the new onehot column names added.
    """

    new_features = list(features)
    for feature_updated in log['onehot_categorizer']['mapping']:
        if feature_updated not in new_features:
            raise Exception(str(feature_updated) + ' not found in features list')

        if log['onehot_categorizer']['hardcode_nans']:
            new_features += [feature_updated + '==nan']

        new_features.remove(feature_updated)
        for one_hot_column_key in log['onehot_categorizer']['mapping'][feature_updated]:
            new_features += [feature_updated + '==' + one_hot_column_key]
    return new_features

setattr(onehot_categorizer,'update_features', update_onehot_feature_names)

TEST CODE

import random
import pandas as pd
from fklearn.training.transformation import onehot_categorizer

# GENERATE DATA
column1 = [random.choice(['a','b','c']) for x in range(100)]
column2 = [random.random() for x in range(100)]
training_data = pd.DataFrame({'categorical_feature' : column1, 'numerical_feature' : column2})
FEATURES = training_data.columns.tolist()
print(training_data.head(5))

# NEW FUNCTIONALITY TEST
print('Initial FEATURES:  ', FEATURES)
oneHot = onehot_categorizer(columns_to_categorize = ['categorical_feature'],store_mapping =  True)
_, _, log = oneHot(training_data)  # You can only define extra columns if you 'see' the training_data
NEW_FEATURES = onehot_categorizer.update_features(FEATURES, log)
print('Final FEATURES:  ', NEW_FEATURES)

Test expected result

Initial FEATURES: ['categorical_feature', 'numerical_feature']
Final FEATURES: ['numerical_feature', 'categorical_feature==a', 'categorical_feature==b', 'categorical_feature==c']

Will this change a current behavior? How?

This will not change the current behaviour; it will just simplify the use of onehot_categorizer.

split_evaluator_extractor not fully compliant with split_evaluator

Problem description

split_evaluator has an optional parameter called eval_name, with a default value of None. If this parameter is used, split_evaluator_extractor, which is supposed to facilitate the extraction of information from the logs generated by split_evaluator, does not work; it only works for the default case.

Code sample

import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
from fklearn.validation.evaluators import split_evaluator, roc_auc_evaluator
from fklearn.metrics.pd_extractors import split_evaluator_extractor, evaluator_extractor

loaded_data = load_breast_cancer()
size = len(loaded_data['target'])
dataset = pd.DataFrame(loaded_data['data']).assign(
  target=loaded_data['target'], 
  prediction=np.random.rand(1, size)[0],
  split=np.random.randint(2, size=size)
)

eval_fn = roc_auc_evaluator(target_column='target', eval_name='roc_auc')
base_extractor = evaluator_extractor(evaluator_name='roc_auc')

# Default case
split_fn_default = split_evaluator(eval_fn=eval_fn, split_col="split")
default_logs = split_fn_default(dataset)
default_metrics = split_evaluator_extractor(default_logs, split_col="split", base_extractor=base_extractor, split_values=[0,1])
print(default_metrics)

#    roc_auc  split_evaluator__split
#0  0.488234                       0
#0  0.448012                       1

# Bug case - with eval_name
split_fn_named = split_evaluator(eval_fn=eval_fn, split_col="split", eval_name="named_eval")
named_logs = split_fn_named(dataset)
named_metrics = split_evaluator_extractor(named_logs, split_col="split", base_extractor=base_extractor, split_values=[0,1])
print(named_metrics)

#   roc_auc  split_evaluator__split
#0      NaN                       0
#0      NaN                       1

Expected behavior

split_evaluator_extractor should be able to extract logs generated by split_evaluator with an eval_name.

Possible solutions

Include an eval_name parameter in split_evaluator_extractor, just like temporal_split_evaluator_extractor has.

Allow fit params for LightGBM

Describe the feature and the current state.

  • LightGBM has several parameters in its fit method. One possible solution is to add a fit_params argument to the LightGBM functions in fklearn for increased flexibility.
  • This improvement would have to be done for both classification and regression.

Will this change a current behavior? How?

We would be able to configure the fit function using FkLearn.

Additional Information

https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html#lightgbm.LGBMClassifier.fit

Make things like xgboost, lgbm ... extra requirements

Describe the feature and the current state.

Today we choose to install all the packages required by fklearn. For development purposes this sounds good, but in production it could be better to install only what you need. My idea is to use extra requirements, so we can have something like:

pip install fklearn[xgboost]

and only install the common packages + xgboost, instead of installing everything.
We should also include a fklearn[all] extra to install everything.
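
A sketch of what the extras_require wiring in setup.py might look like (the package lists and pins are illustrative, not the project's actual requirements):

# Illustrative extras_require layout; the real lists would come from the requirements files.
extras = {
    "xgboost": ["xgboost>=0.81"],
    "lgbm": ["lightgbm>=2.2"],
    "catboost": ["catboost>=0.14"],
}
extras["all"] = sorted({dep for deps in extras.values() for dep in deps})

# then: setup(..., install_requires=core_requirements, extras_require=extras)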

Will this change a current behavior? How?

Yes, a standalone install will not work with most models.

Additional Information

This needs to be well documented in order to avoid a bad user experience

Nested pipelines

Is it possible to create nested pipelines?

data_transformation_pipe = build_pipeline(...)
modelling_pipe = build_pipeline(...)

full_pipe = build_pipeline(data_transformation_pipe, modelling_pipe)

I could not find a way to do it.

Missing features removal with SimpleImputer

Code sample

In the sample code below, a column is removed from the dataset during the pipeline

>>> from sklearn.impute import SimpleImputer
>>> import numpy as np
>>> imp = SimpleImputer()
>>> imp.fit([[0, np.nan], [1, np.nan]])
>>> imp.transform([[0, np.nan], [1, 1]])
array([[0.],
       [1.]])

Problem description

Currently sklearn.impute.SimpleImputer silently removes features that are np.nan on every training sample.

Therefore

new_cols = pd.DataFrame(data=new_data, columns=columns_to_impute).to_dict('list')

fails, as new_data.shape[1] != len(columns_to_impute).

Possible solutions

For the problematic features, either keep their values if valid or impute a default value during transform.
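
One way to spot the columns SimpleImputer is about to drop, so they can be kept or given a default instead (a sketch using only sklearn and numpy):

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[0.0, np.nan], [1.0, np.nan]])
imp = SimpleImputer().fit(X)

# Per the sklearn docs, features whose fitted statistic is NaN are discarded by transform().
dropped = np.flatnonzero(np.isnan(imp.statistics_))
print(dropped)  # [1] -> the second column would silently disappear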

space_time_split_dataset slowness

def space_time_split_dataset(dataset: pd.DataFrame,

I've been having slowness issues with space_time_split_dataset. It takes about 30-60 minutes to split a 0.5-1 million row dataset, which seems somewhat excessive (I know that, done manually, this would take only a couple of minutes).

Code:

TRAIN_END_DATE = '2019-04-01' 
HOLDOUT_END_DATE = '2019-07-01'

split_fn = space_time_split_dataset(train_start_date=TRAIN_START_DATE,
                                train_end_date=TRAIN_END_DATE,
                                holdout_end_date=HOLDOUT_END_DATE,
                                space_holdout_percentage=.5,
                                split_seed=42, 
                                space_column="id",
                                time_column='time')

df_train, df_test, _, df_holdout = split_fn(data)
df_train.shape, df_test.shape, df_holdout.shape

Migration plan for python 3.8, 3.9 or some "newer tooling"

Hey everyone!

I recently started trying to contribute to this project and I saw PR #182, so I just want to know what the expected support level would be for that version or future versions of the language, and whether it would be helpful to start contributing to something like that.

Also, what kind of features/tooling would be nice to have here, and which ones seem valuable for the project itself? I mean, things like the following:

  • pyproject.toml file for centralized configurations (this includes mypy, flake8, isort, or maybe black(?) in the same config file).
  • Poetry as the default deps and venvs manager(?)
  • Do GitHub Actions instead of CircleCI sound viable? In case that doesn't make sense at all, do you currently have a code quality workflow checking all the points: linting, formatting, code complexity, docstrings, and typing? Are you open to considering options to avoid using custom shell scripts?

Sorry if this is dumb or something, but I saw the gitter room is a little inactive and I don't know whether some of these points could be a real deal or whether I'm suggesting things that other people have already proposed.

Thanks in advance folks 💜

Categorical Features in lgbm_classification_learner not working

Describe the feature and the current state.

I was using lgbm_classification_learner with categorical features, and currently it's not possible to handle them. It's possible to pass "categorical_feature" via "extra_params", but an error is raised because lgbm_classification_learner uses df.values (a numpy array) instead of the original pandas dataframe.

To avoid this behavior, I removed the .values and it worked.

Why is it using "df.values"? Because of performance?

If you agree with the solution of removing ".values" I could do a PR.
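
For reference, this is the native LightGBM behavior that passing the DataFrame (instead of df.values) would unlock (a standalone sketch with made-up toy data, not a patch to the learner):

import pandas as pd
import lightgbm as lgb

df = pd.DataFrame({"color": pd.Categorical(["red", "blue", "red", "green"]),
                   "size": [1.0, 2.0, 3.0, 4.0],
                   "target": [0, 1, 0, 1]})

# Works because the column names and the category dtype survive; a bare numpy array loses both.
train_set = lgb.Dataset(df[["color", "size"]], label=df["target"], categorical_feature=["color"])
booster = lgb.train({"objective": "binary", "min_data_in_leaf": 1, "verbose": -1},
                    train_set, num_boost_round=5)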

Code style preferences

Hey everyone!

I'm looking at the code and I see some inconsistencies in code formatting, mainly in the usage of quotes (" on some strings and ' in other cases).
So what do you think about establishing some code style preferences? I mean, we already have flake8 to check whether the code follows some PEP8 recommendations, but I want to be sure before adding extra elements or anything that could break the current approach.

I know that we have this section and that we are following this guide and PEP8, so I just want to be sure that we are on the same page.

Also, if we don't want to start adding a lot of manual/custom rules, we can work directly with an opinionated formatter like black, which usually follows everything in PEP8 (except for the double quotes).

Thanks in advance!

Improve fklearn build process for local testing/development

My Computer

  • Macbook Pro M1
  • macOS Big Sur 11.6.3
  • conda 4.11.0
  • python 3.9.12

Context

Hey folks, I recently tried to install fklearn on my personal computer and realized that the main installation is not very straightforward.
I will add my detailed steps, but it would be great to know wdyt about migrating some libraries in order to improve this process. The main error that I'm still getting is this one:

error: legacy-install-failure

Instructions

  • Clone the repository on a clean machine
  • Install any conda environment, that just has access to conda repos and pypi.
  • conda create -n fklearn python=3.9
  • conda activate
  • pip install -e ".[all]" (to start working with this repo).

At the last step I started getting the legacy-install-failure error with numpy, so I first tried to install numpy using conda:

conda install numpy

Fortunately, conda installed all the missing dependencies (like the BLAS libraries). However, when I re-ran pip install -e ".[all]" I received the exact same error with numpy. After reviewing the trace, I realized that the version installed by conda was different: conda installed numpy==1.22 (which seems to be valid according to the requirements file), but the pip process was trying to install numpy==1.18, and it seems that version didn't include the bdist/wheels needed, or pip was unable to build them, because some of the other libraries (I'm not sure which ones) are using a pyproject.toml definition now.

So the next step was to retry installing that specific version using conda: conda install 'numpy=1.18', which helped me avoid the problem with numpy.
In the next iteration I had the same error with scikit-learn, so the process was the same: check whether the expected version was in the conda-forge repo and then run the install command.

Expected behavior

It would be great if at some point we were able to just run pip install -e ".[all]" to start testing the library locally. Maybe this is just a problem related to my computer, but I want to be sure that it is not a problem for someone else.
Maybe the M1 chip is not supported yet, or maybe we are in the process of starting to support that chip (without using Rosetta?).

Possible solutions

From what I saw, it seems that some libraries can be installed without any problem directly with conda, so in my understanding we should be able to find a way to do that directly, without having to deal with extra iterations for the compiled packages that we use.

Thanks in advance!

Parallel Split Evaluator

Describe the feature and the current state.

Currently we have the split evaluator, but as you increase the number of evaluations it can take a long time to run. Building a parallel evaluator would improve its performance.

Will this change a current behavior? How?

No. This feature would only make it faster when running on multicore machines

Additional Information

We need to be careful when running it together with other parallel methods, as it may create a deadlock.
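
A minimal sketch of a parallel split evaluation using joblib, which is already an fklearn dependency (illustrative only; not fklearn's actual API):

import pandas as pd
from joblib import Parallel, delayed

def parallel_split_evaluate(df, split_col, eval_fn, n_jobs=-1):
    # Evaluate each split value in its own worker and collect the results by split value.
    groups = list(df.groupby(split_col))
    results = Parallel(n_jobs=n_jobs)(delayed(eval_fn)(group) for _, group in groups)
    return {value: result for (value, _), result in zip(groups, results)}

data = pd.DataFrame({"split": ["a", "a", "b"], "prediction": [0.1, 0.9, 0.5], "target": [0, 1, 1]})
print(parallel_split_evaluate(data, "split", lambda g: (g["prediction"] - g["target"]).abs().mean(), n_jobs=1))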

An example with categorical features

Have a tutorial for dealing with categorical features in a machine learning problem, including the usage of tools inside fklearn.training.transformation.

Output encoded values into a new column

Describe the feature and the current state.

Currently, most fklearn transformations (such as transformation.target_categorizer, transformation.truncate_categorical and transformation.rank_categorical) replace the original values with the encoded values.

Will this change a current behavior? How?

However, for a series of reasons (such as preserving the original value, or applying multiple encodings to the same feature), one may want to save the encoded values as a different feature, and keep the original column.

It would be useful to add an option to save the encoded feature to a new column.

Add an option to order by ascending/descending prediction in cumulative effect curves

Describe the feature and the current state.

In the causal validation module and the curves file, it would be useful to add an ascending parameter for the cumulative effect and cumulative gain curves.

The current state is to order predictions descending:

ordered_df = df.sort_values(prediction, ascending=False).reset_index(drop=True)

If we add an ascending: bool = False argument to the cumulative_effect_curve, cumulative_gain_curve, relative_cumulative_gain_curve, and effect_curves, a user could modify how these effects are computed, whether to do them ascending or descending by the prediction column.

Will this change a current behavior? How?

Not if the user does not explicitly change the argument to ascending=True. If they do, the cumulative effect or cumulative gain curves will be computed using an ascending ordering in the prediction column.

A model could output a prediction that is not necessarily positively related to the effect to be computed, so adding an option to order this relationship differently will allow for effects and gains with negatively related predictions and outcomes to be computed adequately.

One current workaround is to do this:

df["prediction"] = -df["prediction"]

and then the computation will be made adequately. But this seems like a hack and maybe something we want to solve more cleanly.

Additional Information

The new definition of cumulative_effect_curve would look like this:

@curry
def cumulative_effect_curve(df: pd.DataFrame,
                            treatment: str,
                            outcome: str,
                            prediction: str,
                            min_rows: int = 30,
                            steps: int = 100,
                            effect_fn: EffectFnType = linear_effect,
                            ascending: bool = False) -> np.ndarray:
    """
    Orders the dataset by prediction and computes the cumulative effect curve according to that ordering

    Parameters
    ----------
    df : Pandas' DataFrame
        A Pandas' DataFrame with target and prediction scores.

    treatment : Strings
        The name of the treatment column in `df`.

    outcome : Strings
        The name of the outcome column in `df`.

    prediction : Strings
        The name of the prediction column in `df`.

    min_rows : Integer
        Minimum number of observations needed to have a valid result.

    steps : Integer
        The number of cumulative steps to iterate when accumulating the effect

    effect_fn : function (df: pandas.DataFrame, treatment: str, outcome: str) -> int or Array of int
        A function that computes the treatment effect given a dataframe, the name of the treatment column and the name
        of the outcome column.

    ascending : bool
        Whether the prediction column should be ordered ascending or not. Default is False.


    Returns
    ----------
    cumulative effect curve: Numpy's Array
        The cumulative treatment effect according to the predictions ordering.
    """

    size = df.shape[0]
    ordered_df = df.sort_values(prediction, ascending=ascending).reset_index(drop=True)
    n_rows = list(range(min_rows, size, size // steps)) + [size]
    return np.array([effect_fn(ordered_df.head(rows), treatment, outcome) for rows in n_rows])

Potential dependency conflicts between fklearn and pandas

Hi, as shown in the following full dependency graph of fklearn, fklearn requires pandas >=0.24.1,<0.25 and statsmodels >=0.9.0,<1 (statsmodels 0.11.1 will be installed, i.e., the newest version satisfying the version constraint), and the direct dependency statsmodels 0.11.1 transitively introduces pandas >=0.21.

Obviously, there are multiple version constraints set for pandas in this project. However, according to pip's "first found wins" installation strategy, pandas 0.24.2 (i.e., the newest version satisfying the constraint >=0.24.1,<0.25) is the version actually installed.

Although the first found package version, pandas 0.24.2, satisfies the later dependency constraint (pandas >=0.21), the installed version is very close to the upper bound of the version constraint on pandas specified by statsmodels 0.11.1.

Once statsmodels upgrades, its newest version will be installed. Therefore it will easily cause a dependency conflict (build failure) if the upgraded statsmodels version requires a higher version of pandas, violating the other version constraint >=0.24.1,<0.25.

According to the release history of statsmodels, it habitually upgrades its pandas requirement in recent releases. For instance, statsmodels v0.10.0rc1 upgraded the pandas constraint from >=0.15 to >=0.18, statsmodels v0.10.0rc2 upgraded it from >=0.18 to >=0.19, and statsmodels v0.11.0rc1 upgraded it from >=0.19 to >=0.21.

As such, this is a friendly warning about a potential dependency conflict issue in fklearn.

Dependency tree

fklearn  - 1.18.0
| +- cloudpickle(install version:0.8.1 version range:>=0.8.0,<0.9.0)
| +- joblib(install version:0.13.2 version range:>=0.13.2,<0.14.0)
| +- numpy(install version:1.16.6 version range:>=1.16.4,<1.17.0)
| +- pandas(install version:0.24.2 version range:>=0.24.1,<0.25)
| +- scikit-learn(install version:0.21.3 version range:>=0.21.2,<0.22.0)
| +- statsmodels(install version:0.11.1 version range:>=0.9.0,<1)
| | +- numpy(install version:1.16.6 version range:>=1.14)
| | +- pandas(install version:0.24.2 version range:>=0.21)
| | +- patsy(install version:0.5.1 version range:>=0.5)
| | | +- numpy(install version:1.16.6 version range:>=1.4)
| | | +- six(install version:1.14.0 version range:*)
| | +- scipy(install version:1.2.3 version range:>=1.0)
| +- toolz(install version:0.10.0 version range:>=0.9.0,<1)

Thanks for your help.
Best,
Neolith

split_evaluator_extractor throws KeyError if not all values are present

Problem description

Whenever an inner extractor is asked for a split_value that is not defined on a particular leaf (i.e., the value exists for some splits but not all of them), it throws a dict KeyError.

Code sample

from fklearn.data.datasets import make_tutorial_data

from fklearn.metrics.pd_extractors import split_evaluator_extractor, evaluator_extractor
from fklearn.validation.evaluators import split_evaluator, r2_evaluator
import numpy as np

data = make_tutorial_data(50).dropna(subset=["feature3"]).assign(prediction=lambda d: d.target)

ev = split_evaluator(eval_fn=r2_evaluator, split_col="feature3")
ev2 = split_evaluator(eval_fn=ev, split_col="date")

results = ev2(data)

date_values = [
    np.datetime64("2015-01-06T00:00:00.000000000"),
    np.datetime64("2015-01-14T00:00:00.000000000"),
    np.datetime64("2015-01-22T00:00:00.000000000"),
    np.datetime64("2015-01-30T00:00:00.000000000"),
    np.datetime64("2015-03-08T00:00:00.000000000"),
    np.datetime64("2015-03-09T00:00:00.000000000"),
    np.datetime64("2015-04-04T00:00:00.000000000"),
]

e = evaluator_extractor(evaluator_name="r2_evaluator__target")
e1 = split_evaluator_extractor(base_extractor=e, split_col="feature3", split_values=["a", "b"])
e2 = split_evaluator_extractor(base_extractor=e1, split_col="date", split_values=date_values)

results_df = e2(results)  # this line throws a KeyError

Expected behavior

Instead of throwing an error, it should return a pandas dataframe filled with NaNs whenever a value is not defined in a particular split but is defined on its parent.

Possible solutions

evaluator_extractor should return nan when the value is not found.
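A minimal sketch of that idea (a hypothetical helper, not the library's current API): guard the lookup and fall back to a NaN-filled row when a split value is missing from a particular branch of the results.

import numpy as np
import pandas as pd

def extract_or_nan(base_extractor, results, split_value, metric_name):
    # Hypothetical guard: apply the base extractor when the split value is
    # present in this branch of the results, otherwise return a NaN row so
    # the final concatenated dataframe keeps a slot for it.
    if split_value in results:
        return base_extractor(results[split_value])
    return pd.DataFrame({metric_name: [np.nan]})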

Is there a reason why the `object` in learner logs isn't inside the learner key?

Code sample

Taking a look at the return logs of the learners, e.g. the logistic regression one:

    log = {'logistic_classification_learner': {
        'features': features,
        'target': target,
        'parameters': merged_params,
        'prediction_column': prediction_column,
        'package': "sklearn",
        'package_version': sk_version,
        'feature_importance': dict(zip(features, clf.coef_.flatten())),
        'training_samples': len(df)},
        'object': clf}

Problem description

Is there a reason why the object key isn't inside the logistic_classification_learner dictionary? This leads to a problem where, if I have multiple learners in my pipeline, the final object key depends only on the order of the learners inside the pipeline, and I lose the objects of the earlier learners.
E.g.: my pipeline is (logistic_regression, isotonic_calibration). Since the build_pipeline function merges the logs of the two learners, the final object will hold only the isotonic calibration, and I lose the logistic_regression object.

Expected behavior

Access all learner objects of the pipeline, not just the last one.

Possible solutions

Put the learner object inside the dictionary of the logs:

    log = {'logistic_classification_learner': {
        'features': features,
        'target': target,
        'parameters': merged_params,
        'prediction_column': prediction_column,
        'package': "sklearn",
        'package_version': sk_version,
        'feature_importance': dict(zip(features, clf.coef_.flatten())),
        'training_samples': len(df),
        'object': clf}}
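With the object nested under the learner key as sketched above, each fitted model would survive the log merge done by build_pipeline. A minimal sketch of the resulting layout (placeholder values, hypothetical keys):

# Stand-ins for the per-learner logs produced during training.
lr_logs = {"logistic_classification_learner": {"object": "<fitted LogisticRegression>"}}
calib_logs = {"isotonic_calibration_learner": {"object": "<fitted IsotonicRegression>"}}

merged = {**lr_logs, **calib_logs}  # roughly what merging the two logs would produce

print(merged["logistic_classification_learner"]["object"])  # no longer overwritten
print(merged["isotonic_calibration_learner"]["object"])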

Not working with newer SHAP versions

Problem description

When running lgbm with apply_shap=True, it throws a ValueError. Apparently there was a change in SHAP's output (to a list of ndarrays) in its newer versions. (https://github.com/slundberg/shap/blob/master/shap/explainers/tree.py#L194)

Issue location

fklearn/training/classification.py on lgbm_classification_learner

Expected behavior

f(df, apply_shap=True) used to return the original dataframe with a "shap_values" column whose values are arrays

Possible solutions

Adapt lgbm_classification_learner to newer (>=0.31.0) SHAP versions
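A minimal sketch of one possible adaptation (not the library's current code; the LightGBM model and data below are made up for illustration): when SHAP returns one array per class, keep the positive-class contributions before attaching them to the dataframe.

import lightgbm as lgb
import pandas as pd
import shap
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
features = ["f0", "f1", "f2", "f3"]
df = pd.DataFrame(X, columns=features)

clf = lgb.LGBMClassifier(n_estimators=20).fit(df[features], y)

explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(df[features])

# Newer SHAP releases may return a list with one ndarray per class;
# for binary classification, keep the positive-class contributions.
if isinstance(shap_values, list):
    shap_values = shap_values[1]

df_with_shap = df.assign(shap_values=list(shap_values))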

Adapt code to avoid future and deprecation warnings from dependencies.

Instructions

  • Check all FutureWarning and DeprecationWarning raised by the test suite.
  • Adapt the code to make sure the library won't be broken by imminent changes in the dependencies' APIs.

Describe the issue

The current test suite raises a few FutureWarning and DeprecationWarning messages coming from sklearn. To avoid being caught off guard by changes in the sklearn and numpy APIs, it would be worth investigating those warnings.
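One possible way to surface these locally (a sketch, not the project's official test setup) is to escalate the relevant warnings to errors so the offending call sites fail loudly during a test run:

import warnings

# Make the warnings of interest fatal; similar in spirit to running the
# suite with pytest's -W error::FutureWarning / -W error::DeprecationWarning flags.
warnings.simplefilter("error", FutureWarning)
warnings.simplefilter("error", DeprecationWarning)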

Imputer wrong return text

Describe the documentation issue

fklearn.training.imputation.imputer
fklearn.training.imputation.placeholder_imputer

Both return-value descriptions talk about new columns with predictions from the model, which is not correct.

quantile_biner

Instructions

quantile_biner creates an extra class either for the min or for the max.

Code sample

import pandas as pd

from fklearn.training import transformation

test = pd.DataFrame({
    "id": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    "value": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
})

transformation.quantile_biner(test, columns_to_bin=['value'], q=2, right=False)[1]

Problem description

This results in 3 classes, with the min value in a separate class if right=True, and the max value in a separate class otherwise.

In fact, it creates an extra class for the min or the max for all values of q (both an int and a list of quantiles).

Expected behavior

Behave like pd.qcut: setting q=n should create exactly n classes (see the example below).
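For comparison, pd.qcut with q=2 yields exactly two bins, and both the minimum and maximum values fall inside the extreme intervals:

import pandas as pd

values = pd.Series(range(10))
binned = pd.qcut(values, q=2)

print(binned.cat.categories)  # exactly 2 intervals, e.g. (-0.001, 4.5] and (4.5, 9.0]
print(binned.value_counts())  # every value, including 0 and 9, gets a bin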

Possible solutions

Include the min and max values in the extreme intervals, regardless of the right setting.

Typo on learner_doc_string of onehot_categorizer function

Instructions

At the end of the onehot_categorizer learner we have:

onehot_categorizer

...
    if store_mapping:
        log['onehot_categorizer']['mapping'] = vec

    return p, p(df), log

quantile_biner.__doc__ += learner_return_docstring("Onehot Categorizer") 
#quantile biner?

I guess it should be:

...
    if store_mapping:
        log['onehot_categorizer']['mapping'] = vec

    return p, p(df), log

onehot_categorizer.__doc__ += learner_return_docstring("Onehot Categorizer")

Potential dependency conflicts between fklearn and numpy

Hi, as shown in the full dependency graph of fklearn below, fklearn requires numpy >=1.16.4,<1.17.0 and statsmodels >=0.9.0,<1 (statsmodels 0.11.1 will be installed, i.e., the newest version satisfying that constraint), and the direct dependency statsmodels 0.11.1 transitively introduces numpy >=1.14.

There are therefore multiple version constraints on numpy in this project. According to pip's “first found wins” installation strategy, numpy 1.16.6 (the newest version satisfying >=1.16.4,<1.17.0) is the version actually installed.

Although the first-found version, numpy 1.16.6, also satisfies the later constraint (numpy >=1.14), it sits right at the upper end of fklearn's own numpy pin, leaving little room for statsmodels to raise its requirement.

Once statsmodels publishes a new release, that newest version will be installed. If the upgraded statsmodels requires a higher numpy version, it can easily cause a dependency conflict (build failure) by violating fklearn's other constraint, numpy >=1.16.4,<1.17.0.

According to its release history, statsmodels habitually raises its numpy requirement. For instance, statsmodels v0.10.0rc1 raised the numpy constraint from >=1.09 to >=1.11, and v0.11.0rc1 raised it from >=1.11 to >=1.14.

As such, please take this as a friendly warning of a potential dependency conflict for fklearn.

Dependency tree

fklearn  - 1.18.0
| +- cloudpickle(install version:0.8.1 version range:>=0.8.0,<0.9.0)
| +- joblib(install version:0.13.2 version range:>=0.13.2,<0.14.0)
| +- numpy(install version:1.16.6 version range:>=1.16.4,<1.17.0)
| +- pandas(install version:0.24.2 version range:>=0.24.1,<0.25)
| +- scikit-learn(install version:0.21.3 version range:>=0.21.2,<0.22.0)
| +- statsmodels(install version:0.11.1 version range:>=0.9.0,<1)
| | +- numpy(install version:1.16.6 version range:>=1.14)
| | +- pandas(install version:0.24.2 version range:>=0.21)
| | +- patsy(install version:0.5.1 version range:>=0.5)
| | | +- numpy(install version:1.16.6 version range:>=1.4)
| | | +- six(install version:1.14.0 version range:*)
| | +- scipy(install version:1.2.3 version range:>=1.0)
| +- toolz(install version:0.10.0 version range:>=0.9.0,<1)

Thanks for your help.
Best,
Neolith
