Code Monkey home page Code Monkey logo

tec-data-17-ml's Introduction

Credit Risk Analysis

Overview

In this project we had the task to find a Machine Learning model that will help us decide on wheter or not lending loands based on credit card data from LendingClub. We decided to use various sampling techniques to overcome the unbalanced distribution of good to risky loans, as our model needs to be able to predict the later ones. We managed to collect enough information from evaluating different models so we can make decisions based on the results.

Results

Before training the models we had to clean the data, mostly separating them and converting them to numeric values so the model can process it.

img

The results were the following:

Naive Random Oversampling

Accuracy:

0.6488335193435153

Matrix:

[[   73    28]
 [ 6904 10200]]

Report:

pre rec spe f1 geo iba sup
high_risk 0.01 0.72 0.60 0.02 0.66 0.44 101
low_risk 1.00 0.60 0.72 0.75 0.66 0.43 17104
avg n total 0.99 0.60 0.72 0.74 0.66 0.43 17205

SMOTE Oversampling

Accuracy:

0.662861851549982

Matrix:

[[   64    37]
 [ 5267 11837]]

Report:

pre rec spe f1 geo iba sup
high_risk 0.01 0.63 0.69 0.02 0.66 0.44 101
low_risk 1.00 0.69 0.63 0.82 0.66 0.44 17104
avg n total 0.99 0.69 0.63 0.81 0.66 0.44 17205

Cluster Centroids Undersampling

Accuracy:

0.5447339051023905

Matrix:

[[   70    31]
 [10324  6780]]

Report:

pre rec spe f1 geo iba sup
high_risk 0.01 0.69 0.40 0.01 0.52 0.28 101
low_risk 1.00 0.40 0.69 0.57 0.52 0.27 17104
avg n total 0.99 0.40 0.69 0.56 0.52 0.27 17205

Combination (Over-Under sampling) SMOTEEN

Accuracy:

0.6488335193435153

Matrix:

[[  73   28]
 [7271 9833]]

Report:

pre rec spe f1 geo iba sup
high_risk 0.01 0.72 0.57 0.02 0.64 0.42 101
low_risk 1.00 0.57 0.72 0.73 0.64 0.41 17104
avg n total 0.99 0.58 0.72 0.73 0.64 0.41 17205

Balanced Random Forest

Accuracy:

0.7885466545953005

Matrix:

[[   71    30]
 [ 2153 14951]]

Report:

pre rec spe f1 geo iba sup
high_risk 0.03 0.70 0.87 0.06 0.78 0.60 101
low_risk 1.00 0.87 0.70 0.93 0.78 0.62 17104
avg n total 0.99 0.87 0.70 0.93 0.78 0.62 17205

AdaBoost

Accuracy:

0.9316600714093861

Matrix:

[[   93     8]
 [  983 16121]]

Report:

pre rec spe f1 geo iba sup
high_risk 0.09 0.92 0.94 0.16 0.93 0.87 101
low_risk 1.00 0.94 0.92 0.97 0.93 0.87 17104
avg n total 0.99 0.94 0.92 0.97 0.93 0.87 17205

When using the Balanced Random Forest, we were able to get a list of features sorted by importance.

img

Summary

The most successful models were the ones that used Ensemble techniques like the Balanced Random Forest and Ada Boost just because we get a much better Harmonic Mean / f1 score as well as highes precision on the under-represented “high risk” class. However we cannot recommend to use any of these models just now because the precision in the “high risk” class remains too low, even with the high scores in Recall and Accuracy. We could argue that these models can help us detect low risk loans very efficiently but we want to be able to predict the high risk ones.

tec-data-17-ml's People

Contributors

albertov5 avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.