This repo is a complement to my other git repository, where I trained models with XGBoost and LightGBM to predict the churn rate of sample insurance clients. In this repo the same data set has been used to train a model with CatBoost.
- Comparing CatBoost with XGBoost and LightGBM.
- Building soft and hard voters manually, saving the trained models to disk, and applying an ensemble of trained models to the test data.
- CatBoost has been used together with cross-validation.
- scikit-learn's train_test_split is used to create a 70% train, 15% validation, and 15% test split.
- A CatBoost model is trained for each fold of the k-fold cross-validation (see the first sketch after this list).
- f1_score is used to check model accuracy on each fold.
- Feature importance is calculated across all the folds together.
- The features are sorted by their importance for the target value and visualized with seaborn (second sketch below).
- Each fold's trained model is saved separately with pickle.
- All saved models are loaded back and used to predict the target value of the test data set.
- All models' predictions are averaged together and rounded to 0 if the average is less than or equal to 0.5, and to 1 otherwise (third sketch below).
- A single CatBoost performed more accurately than the ensemble of 5 CatBoosts, each trained on a different fold.
- Training a CatBoost model on all of the training data, without the split, improved performance when tested on the df_test data frame. The problem with this method is that there is no way to be sure the final model is not overfit.
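
Below is a minimal sketch of the split and per-fold training described above. The file name `insurance_churn.csv` and the target column `churn` are hypothetical placeholders, and the hyperparameters are illustrative rather than the ones used in the notebooks.

```python
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, train_test_split

# Hypothetical file and target names; adjust to the actual data set.
df = pd.read_csv("insurance_churn.csv")
X, y = df.drop(columns=["churn"]), df["churn"]
cat_cols = X.select_dtypes(include="object").columns.tolist()

# 70% train, 15% validation, 15% test via a two-step split.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42
)

# One CatBoost model per fold, with f1_score checked on each held-out fold.
models = []
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (tr_idx, va_idx) in enumerate(skf.split(X_train, y_train)):
    model = CatBoostClassifier(
        iterations=500, cat_features=cat_cols, random_seed=42, verbose=0
    )
    model.fit(
        X_train.iloc[tr_idx], y_train.iloc[tr_idx],
        eval_set=(X_train.iloc[va_idx], y_train.iloc[va_idx]),
    )
    preds = model.predict(X_train.iloc[va_idx])
    print(f"fold {fold}: f1 = {f1_score(y_train.iloc[va_idx], preds):.4f}")
    models.append(model)
```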
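For the feature-importance step, one way to combine the folds is to average each model's importances and plot the sorted result with seaborn. Averaging is an assumption here; the notebook may aggregate the folds differently. This continues from the sketch above.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Average the per-fold importances from the models trained above.
importances = np.mean([m.get_feature_importance() for m in models], axis=0)
imp_df = (
    pd.DataFrame({"feature": X_train.columns, "importance": importances})
    .sort_values("importance", ascending=False)
)

# Horizontal bar plot, most important feature on top.
sns.barplot(data=imp_df, x="importance", y="feature")
plt.tight_layout()
plt.show()
```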
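The pickle round trip and the two manual voters could look like this, continuing from the sketches above. The `catboost_fold_{i}.pkl` file names are placeholders.

```python
import pickle

import numpy as np

# Save each fold's trained model separately.
for i, m in enumerate(models):
    with open(f"catboost_fold_{i}.pkl", "wb") as f:
        pickle.dump(m, f)

# Load every saved model back.
loaded = []
for i in range(len(models)):
    with open(f"catboost_fold_{i}.pkl", "rb") as f:
        loaded.append(pickle.load(f))

# Soft voter: average the predicted probabilities, then threshold at 0.5.
proba = np.mean([m.predict_proba(X_test)[:, 1] for m in loaded], axis=0)
soft_vote = (proba > 0.5).astype(int)

# Hard voter: average the 0/1 labels; an average <= 0.5 rounds to 0, else 1.
votes = np.mean([m.predict(X_test).astype(int) for m in loaded], axis=0)
hard_vote = (votes > 0.5).astype(int)
```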
CatBoost can handle categorical data natively and does not require encoding. Still, we check the impact of different encodings on CatBoost (a comparison sketch follows).
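A sketch of such a comparison, assuming the variables from the sketches above: train once with CatBoost's native categorical handling and once on a one-hot encoded copy, then compare validation f1 scores. One-hot encoding stands in for "different encodings" here; the notebook may try others (e.g. label or target encoding).

```python
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.metrics import f1_score

# Native handling: pass the categorical columns straight to CatBoost.
native = CatBoostClassifier(iterations=500, cat_features=cat_cols, verbose=0)
native.fit(X_train, y_train, eval_set=(X_val, y_val))

# One-hot encoding: expand the categoricals with pandas instead.
X_train_ohe = pd.get_dummies(X_train, columns=cat_cols, dtype=int)
X_val_ohe = pd.get_dummies(X_val, columns=cat_cols, dtype=int).reindex(
    columns=X_train_ohe.columns, fill_value=0
)
onehot = CatBoostClassifier(iterations=500, verbose=0)
onehot.fit(X_train_ohe, y_train, eval_set=(X_val_ohe, y_val))

# Compare validation f1 for the two encodings.
for name, model, X_v in [("native", native, X_val), ("one-hot", onehot, X_val_ohe)]:
    print(name, f1_score(y_val, model.predict(X_v)))
```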