
SyriaTel Customer Churn Project

Author: Jason Lombino

For more information on this project please see my Presentation or Jupyter Notebook.

Business Problem

Reducing customer churn is an important part of running a successful business. Acquiring new customers through advertising and promotions costs more than retaining existing ones, and it is often the highest-paying customers who are quickest to switch to a competitor for better pricing. Telecommunications company SyriaTel would like to focus on retaining customers by offering discounted rates to those who are likely to leave soon. To do this, SyriaTel needs a model that predicts which customers are likely to churn.

Provided with data on customers' accounts, the model should be helpful in answering the following questions:

  1. Will a given customer leave SyriaTel soon?
  2. Which account features best predict whether a customer will soon churn?

SyriaTel should be able to use this model to target all customers who will churn with discounted rates while avoiding discounting rates for customers who will not. The primary metric for the model will be Recall Score because it is most important that the model correctly identifies as many churning customers as possible. Precision Score will be a secondary focus to avoid giving out unnecessary discounts to customers who will not churn. Therefore an F-beta score weighted to favor recall will be the optimization target for models.
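The recall-weighted F-beta target described above can be sketched with scikit-learn. The labels below are toy values, not the project's actual predictions:

```python
# Sketch: F-beta with beta > 1 weights recall more heavily than precision.
# beta=2 is an illustrative choice; the toy labels stand in for real data.
from sklearn.metrics import fbeta_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0]  # 1 = churned
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]

# With beta=2, recall counts roughly four times as heavily as precision.
score = fbeta_score(y_true, y_pred, beta=2)
print(score)
```

Setting beta above 1 pushes the optimization toward catching churners (recall) at the cost of some unnecessary discounts (precision), which matches the business goal stated above.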

Data Understanding

The following dataset was provided by SyriaTel for modeling. It contains information on the account usage and history of 3300 SyriaTel customers in the United States. The target column, churn, shows whether a given customer left SyriaTel during a one-month time frame. The dataset can be found in this repository at ./data/s_tel.csv.

Some of the features present in the data set are:

  • Churn (Target)
  • Account length
  • Total day charge
  • Total international charge
  • Customer service calls

Class Imbalance

One major consideration is that there are six times more non-churn than churn customers in the dataset. This can be a problem for many models and will need to be addressed using a method such as weighting the data points by class.
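Class weighting, one of the correction methods mentioned above, can be sketched with scikit-learn's balanced weighting. The toy labels below mimic the ~6:1 ratio; they are not the real churn column:

```python
# Sketch: "balanced" class weights reweight each class inversely to its
# frequency, so the rare churn class counts more during training.
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 12 + [1] * 2)  # toy labels at roughly a 6:1 ratio

weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))
```

Most scikit-learn classifiers accept the same idea directly via `class_weight="balanced"` in the constructor.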


Exploratory Data Analysis

Before any models are run, it is always useful to take a look at the features of the dataset.

Feature Distributions

The following plot shows how each of the features in the dataset is distributed. Everything except voicemail messages, customer service calls, and international calls looks close to normally distributed.


Feature Correlation Coefficients

The following plot shows the correlations between each pair of predictor columns. Some of the columns are perfectly correlated: each total charge is a fixed per-minute multiple of the corresponding minutes feature. To avoid issues with multicollinearity, all of the minutes features should be dropped in favor of the charge features.

Duplicate features to be dropped:

  • Total day minutes
  • Total eve minutes
  • Total night minutes
  • Total intl minutes
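Dropping the duplicated columns listed above is a one-liner in pandas. The DataFrame below is a toy stand-in with the dataset's column names and made-up values:

```python
# Sketch: drop the perfectly-correlated minutes columns, keeping the charges.
import pandas as pd

df = pd.DataFrame({
    "total day minutes": [100.0], "total day charge": [17.0],
    "total eve minutes": [200.0], "total eve charge": [17.0],
    "total night minutes": [150.0], "total night charge": [6.75],
    "total intl minutes": [10.0], "total intl charge": [2.7],
})

# Select every minutes column by suffix rather than listing them by hand.
minute_cols = [c for c in df.columns if c.endswith("minutes")]
df = df.drop(columns=minute_cols)
print(list(df.columns))
```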


Selecting Useful Features

The difference between churn and non-churn customers can be considered for each feature separately. Comparisons between the two groups can then be made to determine whether a given feature will be useful for separating them. Features that separate the groups well are more likely to be useful to a model predicting which group a customer belongs to.

Total Day Charge

Total Day Charge is an example of a feature that separates the churn and non-churn groups well as there is a significant difference in the mean value for each group.


Total Day Calls

Total Day Calls is an example of a feature that does not separate the churn and non-churn groups well. There is no significant difference in the mean value for each group.
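One way to formalize "significant difference in the mean value" is a two-sample t-test. The synthetic samples below stand in for the real day-charge values and are shaped to mimic a separating feature:

```python
# Sketch: test whether a feature's mean differs between churn and non-churn
# groups. The distributions here are synthetic, chosen so churners pay more.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
churn_day_charge = rng.normal(35, 8, 200)       # toy churn group
non_churn_day_charge = rng.normal(30, 8, 1200)  # toy non-churn group

# Welch's t-test (equal_var=False) does not assume equal group variances.
t_stat, p_value = stats.ttest_ind(churn_day_charge, non_churn_day_charge, equal_var=False)
print(p_value < 0.05)  # a small p-value suggests the feature separates the groups
```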


Features that separate the data well include:

  • International plan
  • Voice mail plan
  • Number vmail messages
  • Total day charge
  • Total eve charge
  • Total night charge
  • Total intl calls
  • Total intl charge
  • Customer service calls

Features that do not separate the data well include:

  • State
  • Account length
  • Total day calls
  • Total eve calls
  • Total night calls

Feature Engineering

Creating new features based on existing features may be useful for separating the customers into churn and non-churn groups.

Total Charge

Total Charge is the sum of the total day, eve, night, and international charge features and appears to separate the customers into groups well.
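The engineered feature is a row-wise sum over the four charge columns. A minimal sketch with toy values:

```python
# Sketch: engineer a total charge feature as the sum of the four charge
# columns. Values below are illustrative, not real customer data.
import pandas as pd

df = pd.DataFrame({
    "total day charge": [30.0, 45.0],
    "total eve charge": [17.0, 20.0],
    "total night charge": [9.0, 11.0],
    "total intl charge": [2.7, 3.0],
})

charge_cols = ["total day charge", "total eve charge",
               "total night charge", "total intl charge"]
df["total charge"] = df[charge_cols].sum(axis=1)
print(df["total charge"].tolist())
```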


Preliminary Modeling

The primary metric used to evaluate models for this project is F-beta Score weighted for recall. Class weights and SMOTE were used to correct for the ~6:1 class imbalance where appropriate.

The following preliminary models were used to attempt to classify customers into churn and non-churn groups:

  1. Logistic Regression
  2. Decision Tree
  3. K Neighbors
  4. Extra Trees
  5. Random Forest
  6. AdaBoost
  7. XGBoost
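The preliminary comparison can be sketched as a loop that cross-validates each candidate with the recall-weighted scorer. Synthetic imbalanced data and a subset of the models stand in for the real setup:

```python
# Sketch: score several candidate classifiers with a recall-weighted F-beta
# under cross-validation. Data and model subset are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Imbalanced toy data: ~85% majority class, echoing the churn imbalance.
X, y = make_classification(n_samples=600, weights=[0.85], random_state=42)
scorer = make_scorer(fbeta_score, beta=2)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(random_state=42),
}
scores = {name: cross_val_score(m, X, y, cv=5, scoring=scorer).mean()
          for name, m in models.items()}
print(scores)
```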

The three top performing models with default parameters were:

  • Decision Tree
  • Random Forest
  • XGBoost

The random forest and XGBoost models were selected for optimization because they had the most parameters available to tune for performance. Both the random forest and XGBoost models performed better using SMOTE than class weights in cross validation.
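The resampling idea behind SMOTE can be sketched with plain random oversampling. Real SMOTE (from the imbalanced-learn package) goes further by interpolating synthetic minority points between neighbors, but the effect on class balance is the same:

```python
# Sketch: naive random oversampling of the minority class, a crude stand-in
# for SMOTE. All arrays here are synthetic toy data.
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(42)
X = rng.normal(size=(140, 3))
y = np.array([0] * 120 + [1] * 20)  # ~6:1 imbalance, like the churn data

X_min, X_maj = X[y == 1], X[y == 0]
# Sample the minority class with replacement up to the majority count.
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=42)

X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.array([0] * len(X_maj) + [1] * len(X_min_up))
print(np.bincount(y_bal))  # → [120 120]
```

Note that resampling belongs inside the cross-validation loop (applied to training folds only); resampling before the split would leak information into validation folds.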

Model Tuning

The main problem with both models is overfitting, evidenced by training scores that are significantly higher than the cross-validation scores. This can be addressed by reducing the number of features each model trains on and by using a grid search to find optimal hyperparameters for each model.

Features to drop were selected based on the exploratory data analysis and feature importances in the original models. Feature selection alone did appear to improve the predictions of the models, but did not solve the problem of overfitting to the training data.

Grid search with cross validation was used to loop over a variety of hyperparameters and optimize each model. Feature selection and grid search together significantly reduced the models' overfitting.
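The tuning step can be sketched with scikit-learn's GridSearchCV. A random forest and a small grid stand in here; the project's XGBoost grid would follow the same pattern with its own hyperparameters:

```python
# Sketch: grid search over a few regularizing hyperparameters, scored with
# the recall-weighted F-beta. Data, model, and grid are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=400, weights=[0.85], random_state=0)

# Shallower trees and fewer estimators act as regularizers against overfitting.
param_grid = {"max_depth": [3, 5], "n_estimators": [50, 100]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring=make_scorer(fbeta_score, beta=2),
    cv=3,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```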

Final Model Evaluation

The XGBoost model was selected as the final model because the two models' cross-validation scores were nearly identical and XGBoost tends to overfit the training data less than the random forest.

XGBoost test set scores:

  • Accuracy: 0.972
  • Precision: 0.964
  • Recall: 0.835
  • F1: 0.895

Here is the confusion matrix for the optimized XGBoost model on the test set.
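A confusion matrix like the one shown can be produced with scikit-learn; the toy labels below are illustrative, not the project's actual test-set predictions:

```python
# Sketch: confusion matrix for a classifier's test-set predictions.
# Rows are true classes, columns are predicted classes.
from sklearn.metrics import confusion_matrix

y_test = [0, 0, 0, 0, 1, 1, 1, 0]  # toy true labels (1 = churned)
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]  # toy model predictions

cm = confusion_matrix(y_test, y_pred)
print(cm)  # [[TN, FP], [FN, TP]]
```

False negatives (missed churners) are the costliest cell here, which is why recall was the primary metric.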


Feature Importances

The most important features for predicting churn are:

  1. Total Charge
  2. Customer Service Calls
  3. International Plan
  4. Total International Calls
  5. Total International Charge

Using SHAP (SHapley Additive exPlanations)

This shows how much each feature contributed to a prediction for the XGBoost model on average.


This plot shows the contribution of each feature for every prediction on the train set. Color is the value of the feature and position is the contribution of that feature for a given prediction.


This plot shows the feature contributions for one model prediction from the train set.


Conclusion

The final model can be used by SyriaTel to predict whether a customer will churn soon with 83% Recall and 96% Precision.

The most important features for predicting churn are:

  1. Total Charge
  2. Customer Service Calls
  3. International Plan
  4. Total International Calls
  5. Total International Charge

SyriaTel can use this model to offer discounts to customers who are likely to churn soon while avoiding offering unnecessary discounts to customers who are unlikely to do so.

Repository Information

├── README.md                        <- The top-level README you are currently reading
├── Final_Notebook.ipynb             <- Jupyter notebook with my full analysis
├── Final_Notebook.pdf               <- PDF version of project Jupyter notebook
├── slides.pdf                       <- PDF version of project presentation
├── data                             <- Project data provided by upstream
└── images                           <- Graphs generated from code

