
SyriaTel Customer Churn Project

Author: Jason Lombino

For more information on this project please see my Presentation or Jupyter Notebook.

Business Problem

Reducing customer churn is an important part of running a successful business. Acquiring new customers through advertising and promotions costs more than retaining existing ones, and it is often the highest-paying customers who are quickest to switch to a competitor for better pricing. Telecommunications company SyriaTel would like to focus on retaining customers by offering discounted rates to those who are likely to leave soon. To do this, SyriaTel needs a model that predicts which customers are likely to churn.

Provided with data on customers' accounts, the model should be helpful in answering the following questions:

  1. Will a given customer leave SyriaTel soon?
  2. Which account features best predict whether a customer will soon churn?

SyriaTel should be able to use this model to target all customers who will churn with discounted rates while avoiding discounting rates for customers who will not. The primary metric for the model will be Recall Score because it is most important that the model correctly identifies as many churning customers as possible. Precision Score will be a secondary focus to avoid giving out unnecessary discounts to customers who will not churn. Therefore an F-beta score weighted to favor recall will be the optimization target for models.
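The recall-weighted F-beta target described above can be sketched with scikit-learn. The labels below are toy values, not the project's actual predictions:

```python
# Sketch: F-beta with beta > 1 weights recall more heavily than precision.
# beta=2 is an illustrative choice; the toy labels stand in for real data.
from sklearn.metrics import fbeta_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0]  # 1 = churned
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]

# With beta=2, recall counts roughly four times as heavily as precision.
score = fbeta_score(y_true, y_pred, beta=2)
print(score)
```

Setting beta above 1 pushes the optimization toward catching churners (recall) at the cost of some unnecessary discounts (precision), which matches the business goal stated above.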

Data Understanding

The following dataset was provided by SyriaTel for modeling. It contains information on the account usage and history of 3300 SyriaTel customers in the United States. The target column, churn, shows whether a given customer left SyriaTel during a one-month time frame. The dataset can be found in this repository at ./data/s_tel.csv.

Some of the features present in the data set are:

  • Churn (Target)
  • Account length
  • Total day charge
  • Total international charge
  • Customer service calls

Class Imbalance

One major consideration is that there are six times more non-churn than churn customers in the dataset. This can be a problem for many models and will need to be addressed using a method such as weighting the data points by class.
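Class weighting, one of the correction methods mentioned above, can be sketched with scikit-learn's balanced weighting. The toy labels below mimic the ~6:1 ratio; they are not the real churn column:

```python
# Sketch: "balanced" class weights reweight each class inversely to its
# frequency, so the rare churn class counts more during training.
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 12 + [1] * 2)  # toy labels at roughly a 6:1 ratio

weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))
```

Most scikit-learn classifiers accept the same idea directly via `class_weight="balanced"` in the constructor.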


Exploratory Data Analysis

Before any models are run, it is always useful to take a look at the features of the dataset.

Feature Distributions

The following plot shows how each of the features in the dataset is distributed. Everything except voicemail messages, customer service calls, and international calls looks close to normally distributed.


Feature Correlation Coefficients

The following plot shows the correlations between each pair of predictor columns. Some of the columns are perfectly correlated: each total charge is a fixed per-minute multiple of the corresponding minutes feature. To avoid issues with multicollinearity, all of the minutes features should be dropped in favor of the charge features.

Duplicate features to be dropped:

  • Total day minutes
  • Total eve minutes
  • Total night minutes
  • Total intl minutes
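Dropping the duplicated columns listed above is a one-liner in pandas. The DataFrame below is a toy stand-in with the dataset's column names and made-up values:

```python
# Sketch: drop the perfectly-correlated minutes columns, keeping the charges.
import pandas as pd

df = pd.DataFrame({
    "total day minutes": [100.0], "total day charge": [17.0],
    "total eve minutes": [200.0], "total eve charge": [17.0],
    "total night minutes": [150.0], "total night charge": [6.75],
    "total intl minutes": [10.0], "total intl charge": [2.7],
})

# Select every minutes column by suffix rather than listing them by hand.
minute_cols = [c for c in df.columns if c.endswith("minutes")]
df = df.drop(columns=minute_cols)
print(list(df.columns))
```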


Selecting Useful Features

The difference between churn and non-churn customers can be considered for each feature separately. Comparisons between the two groups can then be made to determine whether a given feature will be useful for separating them. Features that separate the groups well are more likely to be useful to a model predicting which group a customer belongs to.

Total Day Charge

Total Day Charge is an example of a feature that separates the churn and non-churn groups well as there is a significant difference in the mean value for each group.


Total Day Calls

Total Day Calls is an example of a feature that does not separate the churn and non-churn groups well. There is no significant difference in the mean value for each group.
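One way to formalize "significant difference in the mean value" is a two-sample t-test. The synthetic samples below stand in for the real day-charge values and are shaped to mimic a separating feature:

```python
# Sketch: test whether a feature's mean differs between churn and non-churn
# groups. The distributions here are synthetic, chosen so churners pay more.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
churn_day_charge = rng.normal(35, 8, 200)       # toy churn group
non_churn_day_charge = rng.normal(30, 8, 1200)  # toy non-churn group

# Welch's t-test (equal_var=False) does not assume equal group variances.
t_stat, p_value = stats.ttest_ind(churn_day_charge, non_churn_day_charge, equal_var=False)
print(p_value < 0.05)  # a small p-value suggests the feature separates the groups
```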


Features that separate the data well include:

  • International plan
  • Voice mail plan
  • Number vmail messages
  • Total day charge
  • Total eve charge
  • Total night charge
  • Total intl calls
  • Total intl charge
  • Customer service calls

Features that do not separate the data well include:

  • State
  • Account length
  • Total day calls
  • Total eve calls
  • Total night calls

Feature Engineering

Creating new features based on existing features may be useful for separating the customers into churn and non-churn groups.

Total Charge

Total Charge is the sum of the total day, eve, night, and international charge features and appears to separate the customers into groups well.
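The engineered feature is a row-wise sum over the four charge columns. A minimal sketch with toy values:

```python
# Sketch: engineer a total charge feature as the sum of the four charge
# columns. Values below are illustrative, not real customer data.
import pandas as pd

df = pd.DataFrame({
    "total day charge": [30.0, 45.0],
    "total eve charge": [17.0, 20.0],
    "total night charge": [9.0, 11.0],
    "total intl charge": [2.7, 3.0],
})

charge_cols = ["total day charge", "total eve charge",
               "total night charge", "total intl charge"]
df["total charge"] = df[charge_cols].sum(axis=1)
print(df["total charge"].tolist())
```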


Preliminary Modeling

The primary metric used to evaluate models for this project is F-beta Score weighted for recall. Class weights and SMOTE were used to correct for the ~6:1 class imbalance where appropriate.

The following preliminary models were used to attempt to classify customers into churn and non-churn groups:

  1. Logistic Regression
  2. Decision Tree
  3. K Neighbors
  4. Extra Trees
  5. Random Forest
  6. AdaBoost
  7. XGBoost
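The preliminary comparison can be sketched as a loop that cross-validates each candidate with the recall-weighted scorer. Synthetic imbalanced data and a subset of the models stand in for the real setup:

```python
# Sketch: score several candidate classifiers with a recall-weighted F-beta
# under cross-validation. Data and model subset are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Imbalanced toy data: ~85% majority class, echoing the churn imbalance.
X, y = make_classification(n_samples=600, weights=[0.85], random_state=42)
scorer = make_scorer(fbeta_score, beta=2)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(random_state=42),
}
scores = {name: cross_val_score(m, X, y, cv=5, scoring=scorer).mean()
          for name, m in models.items()}
print(scores)
```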

The three top performing models with default parameters were:

  • Decision Tree
  • Random Forest
  • XGBoost

The random forest and XGBoost models were selected for optimization because they had the most parameters available to tune for performance. Both the random forest and XGBoost models performed better using SMOTE than class weights in cross validation.
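The resampling idea behind SMOTE can be sketched with plain random oversampling. Real SMOTE (from the imbalanced-learn package) goes further by interpolating synthetic minority points between neighbors, but the effect on class balance is the same:

```python
# Sketch: naive random oversampling of the minority class, a crude stand-in
# for SMOTE. All arrays here are synthetic toy data.
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(42)
X = rng.normal(size=(140, 3))
y = np.array([0] * 120 + [1] * 20)  # ~6:1 imbalance, like the churn data

X_min, X_maj = X[y == 1], X[y == 0]
# Sample the minority class with replacement up to the majority count.
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=42)

X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.array([0] * len(X_maj) + [1] * len(X_min_up))
print(np.bincount(y_bal))  # → [120 120]
```

Note that resampling belongs inside the cross-validation loop (applied to training folds only); resampling before the split would leak information into validation folds.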

Model Tuning

The main problem with both models is overfitting, evidenced by training scores that are significantly higher than the cross-validation scores. This can be addressed by reducing the number of features each model trains on and by using a grid search to find optimal hyperparameters for each model.

Features to drop were selected based on the exploratory data analysis and feature importances in the original models. Feature selection alone did appear to improve the predictions of the models, but did not solve the problem of overfitting to the training data.

Grid search with cross validation was used to loop over a variety of hyperparameters and optimize each model. Feature selection and grid search together significantly reduced the models' overfitting.
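The tuning step can be sketched with scikit-learn's GridSearchCV. A random forest and a small grid stand in here; the project's XGBoost grid would follow the same pattern with its own hyperparameters:

```python
# Sketch: grid search over a few regularizing hyperparameters, scored with
# the recall-weighted F-beta. Data, model, and grid are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=400, weights=[0.85], random_state=0)

# Shallower trees and fewer estimators act as regularizers against overfitting.
param_grid = {"max_depth": [3, 5], "n_estimators": [50, 100]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring=make_scorer(fbeta_score, beta=2),
    cv=3,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```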

Final Model Evaluation

The XGBoost model was selected as the final model because the two models' cross-validation scores were nearly identical and XGBoost tends to overfit the training data less than the random forest.

XGBoost test set scores:

  • Accuracy: 0.972
  • Precision: 0.964
  • Recall: 0.835
  • F1: 0.895

Here is the confusion matrix for the optimized XGBoost model on the test set.
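A confusion matrix like the one shown can be produced with scikit-learn; the toy labels below are illustrative, not the project's actual test-set predictions:

```python
# Sketch: confusion matrix for a classifier's test-set predictions.
# Rows are true classes, columns are predicted classes.
from sklearn.metrics import confusion_matrix

y_test = [0, 0, 0, 0, 1, 1, 1, 0]  # toy true labels (1 = churned)
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]  # toy model predictions

cm = confusion_matrix(y_test, y_pred)
print(cm)  # [[TN, FP], [FN, TP]]
```

False negatives (missed churners) are the costliest cell here, which is why recall was the primary metric.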


Feature Importances

The most important features for predicting churn are:

  1. Total Charge
  2. Customer Service Calls
  3. International Plan
  4. Total International Calls
  5. Total International Charge

Using SHAP (SHapley Additive exPlanations)

This shows how much each feature contributed to a prediction for the XGBoost model on average.


This plot shows the contribution of each feature for every prediction on the train set. Color is the value of the feature and position is the contribution of that feature for a given prediction.


This plot shows the feature contributions for one model prediction from the train set.


Conclusion

The final model can be used by SyriaTel to predict whether a customer will churn soon with 83% Recall and 96% Precision.

The most important features for predicting churn are:

  1. Total Charge
  2. Customer Service Calls
  3. International Plan
  4. Total International Calls
  5. Total International Charge

SyriaTel can use this model to offer discounts to customers who are likely to churn soon while avoiding offering unnecessary discounts to customers who are unlikely to do so.

Repository Information

├── README.md                        <- The top-level README you are currently reading
├── Final_Notebook.ipynb             <- Jupyter notebook with my full analysis
├── Final_Notebook.pdf               <- PDF version of project Jupyter notebook
├── slides.pdf                       <- PDF version of project presentation
├── data                             <- Project data provided by upstream
└── images                           <- Graphs generated from code

