Comments (9)
Doesn't this carry the danger that the imputation method is chosen based on its performance on the test set?
Instead, we can combine the most frequently used imputation methods for numerical and categorical variables and create a report with the CV scores.
The users can then see which method works, choose it, and use it in the final pipeline.
What do you think ?
from probatus.
-
Indeed, if the user uses the test data to select the method, it will introduce bias. However, the user can also run this experiment on a randomly sampled validation set (without touching the test holdout). The best way would probably indeed be cross-validation.
-
When it comes to the report, I think it is very dependent on the dataset and the type of model you use. It might be better to provide a module that simply runs an experiment. But we could indeed use the methods that the literature finds most efficient, present them in a similar plot, and let the user choose among them.
from probatus.
1. The cross-validated performance should be used to evaluate the imputation method.
2. Yes, an experiment and a corresponding report should be the outcome. The users can then decide which method they want to go for.
from probatus.
We can create a new module, probatus.impute, for comparing the performance of various imputation strategies. The pseudocode could be as follows:
class CompareImputationStrategies:

    def __init__(self):
        ...

    def fit(self, X, y, clf, strategies=['No Imputation', 'Simple', 'KNN'],
            cv=5, scoring='roc_auc'):
        """
        The fit method mainly validates the parameters and checks that
        the data is correct.

        For now we will handle missing values in categorical variables
        with KNN and a missing-value indicator. Later we can use more
        complex methods such as
        [MCA](https://napsterinblue.github.io/notes/stats/techniques/mca/).

        X : training set
        y : target
        clf : classifier used to evaluate the strategies
        strategies : list of imputation strategies to compare
        cv : cross-validation scheme to use
        scoring : scoring metric
        """

    def compute(self):
        """
        The major computation is done in the compute method:

        for each strategy:
            create a pipeline with the imputer and the classifier
            evaluate it using sklearn.model_selection.cross_val_score
            store the results
        plot the results
        """

    def fit_compute(self, X, y, clf, **kwargs):
        """
        Fit and compute in one call.
        """
        self.fit(X, y, clf, **kwargs)
        self.compute()

    def plot(self):
        """
        Plot the results of the comparison,
        very similar to the sklearn example.
        """
Thanks to this, the users can:
- Check which imputation strategy works best for their data by calling a few lines of code.
- Later extend this implementation to include more complex imputation strategies.
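As a concrete illustration of the idea behind the proposed class, here is a minimal, self-contained sketch of what it would do under the hood; the dataset, imputers, and classifier below are illustrative choices, not part of the proposal:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy dataset with ~10% missing values injected at random.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
rng = np.random.RandomState(0)
X[rng.rand(*X.shape) < 0.1] = np.nan

clf = LogisticRegression(max_iter=1000)
strategies = {
    "Simple": SimpleImputer(strategy="mean"),
    "KNN": KNNImputer(n_neighbors=3),
}

# For each strategy: pipeline of imputer + classifier, scored with CV,
# so the imputer is only ever fit on the training folds.
results = {}
for name, imputer in strategies.items():
    pipe = make_pipeline(imputer, clf)
    scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
    results[name] = scores.mean()

for name, score in results.items():
    print(f"{name}: {score:.3f}")
```

Running the comparison through pipelines inside cross-validation avoids leaking test-fold statistics into the imputation.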
from probatus.
Overall looks good!
Some comments:
- for consistency with other modules, let's take clf and strategy as input to init. Please also check whether other parameters like cv and scoring typically go in init or fit.
- We should allow the user to pass sklearn objects doing the imputation as the strategy parameter. This way we will allow for more flexibility, e.g. for simple imputation you can impute with 0, -1, etc. If the user would pass None, we would use no imputation as strategy, or something like that. Possibly we could also have a default set of imputation objects, that the user could use, with strategy = "default".
- I would do the major computation in fit, and then compute basically presents the report to the user. The idea is that the user could run fit once and compute multiple times to get the report.
- Compute should return the dataframe with the report e.g. val_score, train_score, and rows are names of the methods used.
- You can inherit from BaseFitComputePlotClass to ensure consistency.
from probatus.
Thanks for the quick comments. Good point about the sklearn objects.
from probatus.
Hi Anil, couple more points that just popped to my mind:
- good to document well, and in tutorials, that clf may be a sklearn Pipeline that performs e.g. one-hot encoding of categoricals and then applies the model. Two simple use cases for a complex dataset are trying out clf=XGBoost and clf=Pipeline(OneHotEncoding + LogisticRegression).
- A good idea would be to allow the user to provide a list of clfs in the clf parameter. This way you compute the imputation of X only once per imputation method, and then try out multiple models on the imputed datasets, instead of rerunning the whole computation for each model you want to try. You can also try to use the same cross-validation splits. The logic would be:
1. Use Cross-Validation to compute X_imputed for each method
2. Use Cross-Validation (same splits) to get scores for each clf on each X_imputed.
This way the user can provide clf=[XGBClassifier, Pipeline(OneHotEncoding + LogisticRegression)]. One issue that needs to be solved is how to plot the names of the models correctly. Maybe we can add an optional parameter clf_name=None in the init, just for convenience in the report and plotting.
- Another thing to consider: if the user wants to try imputation_strategy=[None, ...] with multiple clfs, some models will allow for that and some will not. We could try to detect that and run it only for the models that do, e.g. XGBoost.
- I would propose using two extra parameters to be consistent with the other features: verbose for printing warnings and random_state to ensure reproducibility of the results. Please have a look at how these are used in the other features.
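The two-step logic above could be sketched roughly as follows (dataset, imputer, and models are illustrative; the real module would still need to handle the report and plotting):

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
rng = np.random.RandomState(0)
X[rng.rand(*X.shape) < 0.1] = np.nan

imputer = KNNImputer(n_neighbors=3)  # stand-in for a costly imputer
clfs = {
    "LogReg": LogisticRegression(max_iter=1000),
    "Tree": DecisionTreeClassifier(random_state=0),
}

# random_state fixes the splits so every clf sees the same folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = {name: [] for name in clfs}
for train_idx, test_idx in cv.split(X, y):
    # Step 1: impute once per fold (fit on the train part only).
    imp = clone(imputer).fit(X[train_idx])
    X_tr, X_te = imp.transform(X[train_idx]), imp.transform(X[test_idx])
    # Step 2: score every classifier on the same imputed fold.
    for name, clf in clfs.items():
        model = clone(clf).fit(X_tr, y[train_idx])
        proba = model.predict_proba(X_te)[:, 1]
        scores[name].append(roc_auc_score(y[test_idx], proba))

for name, s in scores.items():
    print(f"{name}: {np.mean(s):.3f}")
```

The costly imputation runs once per fold regardless of how many classifiers are compared, which is the advantage over looping the whole comparison per model.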
from probatus.
Good points. With the current implementation we would be able to achieve most of the above points, and it is in line with the Probatus interfaces.
Point 2 is a good idea; however, it would complicate the implementation and may confuse the users.
To keep the implementation simple and make it do only one thing, for now the users can pass a single classifier and multiple strategies to test.
In case a user plans to test many classifiers, they can run the comparison within a loop. That way the users can keep track of the models and the imputation results.
from probatus.
It might complicate the clf parameter indeed. However, I think if we allow this as an option, next to just passing the model normally, it should not be that bad. Maybe we can pass it as a dict, the same way we pass imputation strategies now.
The main advantage of having it like this, instead of the loop, is that you only have to apply each imputation strategy once. If you use a loop over every model, then e.g. iterative KNN imputation has to be run several times, and it is a very costly one. What do you think?
We could also have it as a possible future improvement.
from probatus.