Hi Hector, thank you for reaching out and for sharing your suggestions.
I agree that transformations to the data can lead to data leakage.
What is your proposal for adding sklearn pipelining and standard transforms to ppscore?
from ppscore.
Let me try over the week replacing the models, regressor/classifier, with a pipeline model including one standardization step. If this works it can be made a kwarg in predictors.
I would like to protect your time, so before you start implementing the proposal, please provide a concept (aka some examples) for the API first. This way, we can first discuss the new API (aka user experience) and when we agree on a suitable API, we can talk about the implementation.
Yes, Florian, minimal changes if it works: it could be a keyword argument addition for a list of transformations in the predictors function that reaches the VALID_CALCULATIONS dictionaries and replaces tree.DecisionTree*() with a pipeline that preprocesses using the input list of transformations. Along these lines:
I think I got it - could you still please give one detailed example with the actual syntax? I would love to have a look at what the complete code would look like.
As an example only, in calculation:

from sklearn import tree
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Change the models to:
VALID_CALCULATIONS = {
    "regression": {
        "type": "regression",
        "is_valid_score": True,
        "model_score": TO_BE_CALCULATED,
        "baseline_score": TO_BE_CALCULATED,
        "ppscore": TO_BE_CALCULATED,
        "metric_name": "mean absolute error",
        "metric_key": "neg_mean_absolute_error",
        "model": Pipeline([("scaler", StandardScaler()), ("tree", tree.DecisionTreeRegressor())]),
        "score_normalizer": _mae_normalizer,
    },
    "classification": {
        "type": "classification",
        "is_valid_score": True,
        "model_score": TO_BE_CALCULATED,
        "baseline_score": TO_BE_CALCULATED,
        "ppscore": TO_BE_CALCULATED,
        "metric_name": "weighted F1",
        "metric_key": "f1_weighted",
        "model": Pipeline([("scaler", StandardScaler()), ("tree", tree.DecisionTreeClassifier())]),
        "score_normalizer": _f1_normalizer,
    },
}
This produces slightly different results for some data sets. The idea is to enable the "predictors" function to replace the model keys with a constructed pipeline, whose constructor is a little awkward as each step is a (name, transformer) tuple. The pipeline should take care of non-leaking cross-validation scores.
A call to predictors would look like:
transformers = [StandardScaler(), MinMaxScaler()]
predictors(df, "column", transformers=transformers)
Here predictors (or another function) would have to create the pipeline step list.
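A minimal sketch of how predictors could assemble that pipeline from the transformers kwarg; the helper name build_model is hypothetical and not ppscore's actual code:

```python
from sklearn import tree
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

def build_model(transformers, base_estimator):
    # One (name, transformer) step per input transformer, with the
    # original estimator kept as the mandatory last step of the pipeline
    steps = [(type(t).__name__.lower(), t) for t in transformers]
    steps.append(("estimator", base_estimator))
    return Pipeline(steps)

model = build_model([StandardScaler(), MinMaxScaler()], tree.DecisionTreeRegressor())
print([name for name, _ in model.steps])
# ['standardscaler', 'minmaxscaler', 'estimator']
```

Deriving the step names from the transformer class names keeps the user-facing API a plain list of transformer instances, so the awkward (name, transformer) tuples stay internal.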
Hi Hector, thank you for the example, and I like the transformers API.
When I thought about this proposal, I was unsure which problem it should solve exactly. What is the scenario that the user is in, and why does the user use ppscore in that scenario?
When did you last have this scenario yourself? How did you solve it then?
Maybe you can explain this a little more - this would help my understanding.
Hello Florian,
the use case is having feature data that may exhibit outliers, a skewed distribution, or any other anomaly that can be improved by a transformation instead of dropping the offenders. In this specific case I was looking for the best predictors among thousands of time series with several anomalies and had to run transformations; I transformed them and then ran PPS, contaminating the internal cross-validation. I manually changed the cv PPS uses to a time-series split and pipelined the data. Users may also want to min-max scale the data, or perform more complex transformations that they could pipeline if they are looking for quick comparisons. There were changes in the PPS score ranking with and without the transformations that may be significant.
As a sideline, the cv object could also be exposed as a kwarg in the predictors function to accept other splits; stratified k-fold comes to mind for very unbalanced datasets.
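As a sketch of that cv kwarg idea, the cross-validation object could be threaded straight into the scoring call; the pps_like_score helper below is hypothetical and not ppscore's internal function:

```python
from sklearn import tree
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Hypothetical sketch: the internal scoring with the cv object exposed
# as a keyword argument instead of a hard-coded splitter
def pps_like_score(model, X, y, cv):
    return cross_val_score(model, X, y, cv=cv, scoring="f1_weighted").mean()

cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=1)
score = pps_like_score(tree.DecisionTreeClassifier(random_state=1), X, y, cv)
assert 0.0 < score <= 1.0
```

The same call would accept a TimeSeriesSplit for the time-series case described above, since cross_val_score takes any splitter via its cv parameter.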
These are the two operations I had to perform manually in this case; exposing the transformations and the cv object as kwargs would automate them and make PPS more flexible.
This enables quick checks; PPS_standard is pps with the pipeline added:
import ppscore as pps
import PPS_standard as pps_s  # local copy of ppscore with the pipeline added
import pandas as pd
import numpy as np
import sklearn.datasets as ds

diabetes = ds.load_diabetes()
df = pd.DataFrame(
    data=np.c_[diabetes["data"], diabetes["target"]],
    columns=diabetes["feature_names"] + ["target"],
)
print(pps_s.predictors(df, y="target")[["x", "ppscore"]].head())
print(pps.predictors(df, y="target")[["x", "ppscore"]].head())
Thank you for the explanation.
Wouldn't it make more sense then to just pipe the cross-validation object into ppscore?
Because in the end you are concerned about an invalid cross-validation.
Did you generate a cross-validation object at the end of your pipeline?
Hello Florian, an sklearn pipeline requires the last element to be the estimator, which in PPS will be the automated choice of regressor or classifier, so I have not found any other way to feed it in but to overwrite the whole model with a pipeline that has the original estimator as its last element.
The whole pipeline could be an input to PPS; the user would have to decide between regression or classification in this case, or the logic of _determine_case_and_prepare_df would have to be extended so that it selects from multiple models that are either classifiers or regressors. On the other hand, this would allow comparing PPS using multiple different models, not only trees.
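For what it's worth, sklearn can already detect the task type of a full pipeline, because a Pipeline reports the estimator type of its final step. A quick check (a sketch, not ppscore code):

```python
from sklearn import tree
from sklearn.base import is_classifier, is_regressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([("scaler", StandardScaler()),
                 ("tree", tree.DecisionTreeClassifier())])

# A Pipeline delegates its estimator type to the final step, so the
# regression/classification decision could still be made automatically
# even if the user passes a full pipeline
print(is_classifier(pipe), is_regressor(pipe))
# True False
```

So _determine_case_and_prepare_df could in principle accept a user-supplied pipeline and branch on is_classifier/is_regressor instead of hard-coding the tree models.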
Hi Hector, I wish you a good start to the new year and sorry for the late reply - I have been on vacation.
Thank you for clarifying that the model must be the last step of the pipeline for cross-validation, and thus it is not possible to pass a full cv object separately.
If you want, you can go ahead and open a PR.
Happy New Year Florian. I will open the PR and propose the changes.
Hello Florian,
The changes I made require the model (the tree regressor or classifier) within VALID_CALCULATIONS to be re-initialized every time the API is called, in order to include the pipeline object.
This brings no noticeable computational cost, but it cannot pass this test:
line 156 of tests:
assert pps.score(df, "x", "y", random_seed=1) == pps.score(
df, "x", "y", random_seed=1
)
This is because the model object at the 'model' entry of the dictionary is a different instance of a model with the same parameters. The contents are the same in every other entry of the dict, but the model is not and so the assertion fails. This test (and subsequent result comparisons) could be modified to compare the dicts excluding the 'model' entry, just as a suggestion.
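A sketch of that suggestion, comparing two score dicts while ignoring the 'model' entry; the helper and the toy dicts are illustrative, not the actual test code:

```python
def scores_equal(a, b):
    # Compare two score-result dicts while ignoring the re-initialized
    # model instance stored under the 'model' key
    keys = (set(a) | set(b)) - {"model"}
    return all(a.get(k) == b.get(k) for k in keys)

# Toy dicts standing in for two pps.score(...) results with equal
# contents but distinct model instances
s1 = {"ppscore": 0.42, "metric_name": "weighted F1", "model": object()}
s2 = {"ppscore": 0.42, "metric_name": "weighted F1", "model": object()}

assert s1 != s2            # plain equality fails on the model instances
assert scores_equal(s1, s2)
```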
Thank you for the heads-up. We can easily adjust that test.