ava33343 / risky_business

Machine learning for credit card default. Precision and recall are calculated because the data are imbalanced. Confusion matrices and test statistics are compared across logistic regression with over- and under-sampling, a decision tree, SVM, and ensemble learning with Random Forest, AdaBoost, and Gradient Boosting. The Easy Ensemble AdaBoost classifier appears to be the best-fitting model for the given data.

License: Apache License 2.0

Jupyter Notebook 100.00%
machine-learning scikit-learn statistics recall-precision roc-curves svm random-forest-classifier ensemble-learning adaboost gradient-boosting

risky_business's Introduction

Risky Business?

No worries. Machine Learning has you covered!

In this assignment, we predict loan status by classifying data through:

  • Logistic regression
    • its output ranges from 0 to 1 and can be interpreted as a probability
    • in this case, 0 represents low risk and 1 represents high risk
  • Process
    • we select models suited to the imbalanced data structure, as good loans far outnumber bad loans
    • train the model on the data with its associated parameters
    • interpret the test statistics and compare the models to choose the best one
    • use the most robust model to predict future cases
      • before being used to make decisions, models need to be tested first

Test Statistics

Confusion Matrix

|            | Predicted 0 | Predicted 1 |
|------------|-------------|-------------|
| Actually 0 | TN          | FP          |
| Actually 1 | FN          | TP          |
  • Accuracy = (TP+TN)/(TP+TN+FP+FN)

    • It is biased in this case: it treats FP and FN equally, and our data are highly imbalanced, weighted heavily toward true negatives (TN) for low-risk loans.
    • Therefore, other test statistics need to be considered
  • Precision = TP/(TP+FP)

    • Out of all the "high-risk" predictions for loan statuses, how many are actually high risk?
    • It focuses on the "high-risk" class, which has significantly fewer observations, and uses the figures in the second column of the confusion matrix.
  • Recall = TP/(TP+FN)

    • How many actual "high-risk" loans are predicted correctly?
    • It features the second row of the confusion matrix, which consists of the lightly weighted high-risk loan data
    • Recall is also the sensitivity of the testing model
  • Specificity = TN/(TN+FP)

    • How many of the actual low-risk loan statuses are predicted correctly?
    • It spotlights the first row of our confusion matrix and examines only the more heavily weighted negative observations in our data.
  • F1 = 2 x (Precision x Recall)/(Precision + Recall)

    • F1 score is the harmonic mean of precision and recall.
    • As precision and recall usually move in opposite directions, the F1 score is a good balance between the two.
    • F1 leverages the second row and column for actual and predicted "high risk" loan statuses.
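All five statistics above follow directly from the four confusion-matrix cells. A minimal sketch with made-up counts (illustrative only, not the repo's actual figures):

```python
# Hypothetical confusion-matrix counts, for illustration only.
TN, FP, FN, TP = 900, 30, 10, 60

accuracy    = (TP + TN) / (TP + TN + FP + FN)
precision   = TP / (TP + FP)
recall      = TP / (TP + FN)            # a.k.a. sensitivity
specificity = TN / (TN + FP)
f1          = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, specificity, f1)
```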

Models


For the imbalanced loan-status data (more good loans than bad), we use the following practices and select classifiers to make logistic regression models more robust to our imbalanced observations:

  1. Resampling
  2. Ensemble Learning

Resampling

We use the imbalanced-learn library to resample the LendingClub data.

a. Oversample the data using

i. the `Naive Random Oversampler` 
  • The balanced accuracy score is 0.8301954901422769.

cm_oversample cr_oversample
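Naive random oversampling simply duplicates minority rows until the classes balance, which is the idea behind imbalanced-learn's `RandomOverSampler`. A plain-NumPy sketch on toy data (not the LendingClub set):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy imbalanced data: 95 "low risk" (0) rows, 5 "high risk" (1) rows.
X = np.vstack([rng.normal(0, 1, (95, 2)), rng.normal(3, 1, (5, 2))])
y = np.array([0] * 95 + [1] * 5)

# Draw minority rows with replacement until the classes balance.
minority = np.flatnonzero(y == 1)
extra = rng.choice(minority, size=95 - 5, replace=True)
X_res = np.vstack([X, X[extra]])
y_res = np.concatenate([y, y[extra]])

print((y_res == 0).sum(), (y_res == 1).sum())
```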

ii. `SMOTE` algorithms.
  • The balanced accuracy score is 0.8085877231650871.

cm_smote_oversample cr_smote_oversample
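SMOTE instead synthesizes new minority points by interpolating between a minority sample and one of its k nearest minority neighbors. A minimal NumPy sketch of that interpolation step on toy data (real SMOTE implementations add bookkeeping this sketch omits):

```python
import numpy as np

rng = np.random.default_rng(2)
X_min = rng.normal(3, 1, (5, 2))   # toy minority-class points
n_new, k = 90, 3                   # synthesize 90 points, 3 neighbors

# Pick a minority point, pick one of its k nearest minority neighbors,
# and place a synthetic point a random fraction of the way between them.
d = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=2)
nn = np.argsort(d, axis=1)[:, 1:k + 1]        # drop column 0 (self)

base = rng.integers(0, len(X_min), n_new)
nbr = nn[base, rng.integers(0, k, n_new)]
gap = rng.random((n_new, 1))
X_new = X_min[base] + gap * (X_min[nbr] - X_min[base])
```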

b. Undersample the data using the Cluster Centroids algorithm.

  • The balanced accuracy score is 0.8085877231650871.

cm_undersample cr_undersample
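Cluster Centroids undersampling replaces the majority class with k-means centroids, with k equal to the minority-class count. A sketch of that idea using scikit-learn's `KMeans` on toy data (imbalanced-learn's `ClusterCentroids` works along these lines):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
X_maj = rng.normal(0, 1, (95, 2))   # toy majority class
X_min = rng.normal(3, 1, (5, 2))    # toy minority class

# Replace the majority class with the centroids of k-means clusters,
# where k equals the minority-class size.
km = KMeans(n_clusters=len(X_min), n_init=10, random_state=0).fit(X_maj)
X_res = np.vstack([km.cluster_centers_, X_min])
y_res = np.array([0] * len(X_min) + [1] * len(X_min))
```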

c. Combined over- and under-sampling using the SMOTEENN algorithm.

  • The balanced accuracy score is 0.7897719094249999.

cm_smoteenn_combined cr_smoteenn_combined

Conclusion on Resampled Classifiers

SMOTE and the Naive Random Oversampler are our top two choices. All models feature a high precision score of 0.99.

The Naive Random Oversampler produces the highest balanced accuracy score of 0.83.

The SMOTE oversampler provides the best recall, or sensitivity, score of 0.88 by synthesizing new minority samples between each point and its k nearest neighbors.

The highest geometric mean score of 0.93 is produced by the Naive Random Oversampler.

SMOTE produces the best f1 score of 0.93.

Precision-Recall Curve

A comparison between the SMOTE oversampling and SMOTEENN combined sampling models is carried out by plotting Precision-Recall curves. The greater the area under the Precision-Recall curve, the better, or more robust, the model. As shown below, the area under the orange line marked by [x] is slightly greater than that under the blue line marked with [.]. Therefore, SMOTEENN combined sampling is the better model. pr_smote_smoteenn
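The area under a Precision-Recall curve can be computed with scikit-learn's `precision_recall_curve` and `auc`. A self-contained sketch on a synthetic 90/10 imbalanced set standing in for the LendingClub data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data, for illustration only.
X, y = make_classification(n_samples=2000, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
probs = clf.predict_proba(X_te)[:, 1]

# Area under the Precision-Recall curve: larger means a more robust model.
prec, rec, _ = precision_recall_curve(y_te, probs)
pr_auc = auc(rec, prec)
```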

Discussion

SVM

SVM is not a good model for our data based on the balanced accuracy score and confusion matrix below.

  • The balanced accuracy score is 0.5904102549108688.

cm_svm_combined cr_svm_combined

Decision Tree

Even though the recall, precision, and balanced accuracy scores are all as high as 0.99, the decision tree model is not our best choice. A picture of the tree is shown below. As we can see, this model is prone to over-fitting.

  • The balanced accuracy score is 0.9938390002906132.

cr_cm_decision_tree

decision_tree
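Over-fitting in a fully grown tree shows up as perfect training accuracy. A quick sketch on toy data contrasting an unrestricted tree with a depth-capped one (synthetic features, not the LendingClub columns):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# An unrestricted tree memorizes the training set (train accuracy 1.0);
# capping max_depth trades training fit for generalization.
full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
shallow = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)

train_acc_full = full.score(X_tr, y_tr)
train_acc_shallow = shallow.score(X_tr, y_tr)
```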

Ensemble Learning

Ensemble learning combines weak and strong learners to build "hyper-models" for our imbalanced data from LendingClub.

a. Model of balanced random forest classifier

  • The balanced accuracy score is 0.748956988570169.

cm_brf cr_brf

Feature importance rankings from highest to lowest are shown below. The top five influential features, i.e. variables, driving a loan status to be rated low or high risk are total_rec_prncp, total_pymnt_inv, total_rec_int, total_pymnt, and last_pymnt_amnt.

Among the least relevant features are tax_liens, acc_now_delinq, delinq_amnt, pymnt_plan_n, and recoveries.

Note that causal effects need to be rationalized to further justify the model.

features_brf_i features_brf_ii
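Extracting such a feature-importance ranking looks like the following sketch. Note the hedges: it uses a plain `RandomForestClassifier` with balanced class weights as a stand-in for imbalanced-learn's `BalancedRandomForestClassifier`, and synthetic features rather than the actual LendingClub columns:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 8 anonymous features, 90/10 class imbalance.
X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           weights=[0.9], random_state=0)

rf = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                            random_state=0).fit(X, y)

# Feature indices ordered from most to least important.
ranking = np.argsort(rf.feature_importances_)[::-1]
```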

b. Model of easy ensemble AdaBoost classifier.

  • The balanced accuracy score is 0.933294499197078.

cm_adaboost cr_adaboost
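A hedged sketch of the boosting step using scikit-learn's plain `AdaBoostClassifier` on synthetic data: imbalanced-learn's `EasyEnsembleClassifier` additionally trains each such AdaBoost learner on a balanced bootstrap sample of the majority class, which this sketch omits.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data, for illustration only.
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Boost 50 weak learners, then score by balanced accuracy,
# which averages recall over both classes.
ada = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
score = balanced_accuracy_score(y_te, ada.predict(X_te))
```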

Conclusion on Ensemble Learning:

The Easy Ensemble AdaBoost Classifier has a higher balanced accuracy score of 0.93.

Again, the Easy Ensemble AdaBoost Classifier has the higher recall (sensitivity) score of 0.94.

The geometric mean score of 0.93 generated by Easy Ensemble AdaBoost Classifier is the higher of the two.

Based on Balanced Random Forest classifier, the top three features affecting loan status predictions are total_rec_prncp, total_pymnt_inv and total_rec_int, in that order.

Precision-Recall Curve

A comparison between the Balanced Random Forest and Easy Ensemble AdaBoost classifiers is carried out by plotting Precision-Recall curves. The greater the area under the Precision-Recall curve, the better, or more robust, the model. As shown below, the area under the green line marked by [x] is greater than that under the purple line marked with [.]. Therefore, the Easy Ensemble AdaBoost Classifier is the better model. pr_brf_eec


Discussion

Gradient-Boosting Tree

  • The balanced accuracy score is 0.994768962510898.

cm_gradientboost cr_gradientboost
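Scoring a gradient-boosting model by balanced accuracy follows the same pattern as the other classifiers. A sketch on synthetic imbalanced data (not the LendingClub set):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data, for illustration only.
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Fit 100 boosting stages, then score by balanced accuracy.
gb = GradientBoostingClassifier(n_estimators=100,
                                random_state=0).fit(X_tr, y_tr)
score = balanced_accuracy_score(y_te, gb.predict(X_te))
```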

A GradientBoosting Tree is shown below: GradientBoosting_tree

A comparison of the precision-recall curves for the three ensemble methods is shown below. The Easy Ensemble AdaBoost Classifier is the best model, as it covers the greatest area under its green curve marked by [x]. The pink curve, marked by [^], represents the Gradient Boosting Tree classifier and has the least area underneath it among the three. Again, the conclusion is that the Easy Ensemble AdaBoost Classifier is the most robust model. pr_brf_eec_grad


Files

Resampling Notebook

Ensemble Notebook

Lending Club Loans Data


