TITLE:
Predicting Loan Repayment Ability with Grade Using Machine Learning and Deep Learning.
AIM:
The principal focus of our project is to perform data analysis and train models using popular machine learning algorithms, namely regularized logistic regression, random forests and neural networks, in order to analyze the historical data available on loan repayment.
ABSTRACT:
Evaluating and predicting the repayment ability of borrowers is important for banks to minimize the risk of loan payment default. For this reason, banks have systems in place to process loan requests based on the borrower's status, such as employment status, credit history, etc. However, the existing evaluation system might not be appropriate for evaluating the repayment ability of some borrowers, such as students or people without credit histories. In order to properly assess the repayment ability of all groups of people, we trained various machine learning models on a Lending Club dataset and evaluated the importance of all the features used. Then, based on the importance scores of the features, we analyzed and selected the most identifiable features to predict the repayment ability of the borrower.
INTRODUCTION:
Due to insufficient credit histories, many people struggle to get loans from trustworthy sources, such as banks. These people are often students or unemployed adults who might not have enough knowledge to judge the credibility of unidentified lenders. Untrustworthy lenders can take advantage of these borrowers by charging high interest rates or including hidden terms in the contract. Instead of evaluating borrowers based on their credit score, there are many alternative ways to measure or predict their repayment ability. For example, employment can be a big factor affecting a person's repayment ability, since an employed adult has a more stable income and cash flow. Other factors, such as real estate holdings, marital status and city of residence, might also be useful in the study of repayment ability. Therefore, in our project, we plan to use machine learning algorithms to study the correlations between borrower status and repayment ability. We found the dataset from Lending Club to be used in this project; this open dataset contains 100K anonymous clients with 152 unique features. By studying the correlation between these features and the repayment ability of the clients, our algorithm can help lenders evaluate borrowers along more dimensions and can also help borrowers, especially those without sufficient credit histories, to find credible lenders, leading to a win-win situation.
OVERVIEW:
Data Segmentation and Data Cleaning
- Exploratory Data Analysis using python’s data visualisation libraries.
- Training the model based on the historical data available.
DATASET OVERVIEW:
For this project, we have taken the bank loan dataset from Lending Club. This dataset gives us details about loans that were either Fully Paid or Charged Off by the customer. The dataset contains 152 independent variables describing a particular loan application. We analyzed Lending Club's dataset of roughly 100k loans issued between 2007 and 2018. We chose to analyze only loans that were paid off in full, charged off or defaulted, together with the borrower characteristics at the time of application and the loan characteristics at the time of issuance.
File name | Description | Number of features |
---|---|---|
accepted_2017_to_2018q4.csv | Information about loans accepted at the time of application. | 152 |
rejected_2017_to_2018q4.csv | Information about loan applications rejected at the time of application. | 9 |
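As a minimal sketch of the outcome filter described above, the snippet below keeps only loans with a terminal status. The data frame here is a toy stand-in for accepted_2017_to_2018q4.csv, and the column name loan_status is assumed from the standard Lending Club schema:

```python
import pandas as pd

# Toy stand-in for accepted_2017_to_2018q4.csv; the real file would be
# loaded with pd.read_csv(). 'loan_status' is assumed to be the standard
# Lending Club column name.
df = pd.DataFrame({
    "loan_amnt": [10000, 15000, 8000, 20000],
    "loan_status": ["Fully Paid", "Current", "Charged Off", "Default"],
})

# Keep only loans with a final outcome, as described above.
final = df[df["loan_status"].isin(["Fully Paid", "Charged Off", "Default"])]
print(len(final))  # 3 of the 4 toy rows have a terminal status
```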
DATA SEGMENTATION AND DATA CLEANING:
- In this project, we prepared a processed dataset from the clean data available online.
- Using a pandas data frame, we calculated the mean of every column.
- We dropped the columns containing more than 70% missing values.
- Using fillna, we filled the remaining missing cells with the column mean for numeric data.
- While computing the means, we did not include cells having a zero value.
- We then manually replaced the zeros in a column with the mean of that column.
- The original file format was XLSX; we converted it to CSV and proceeded.
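The dropping and imputation steps above can be sketched in pandas as follows; the toy data frame and its column names are hypothetical stand-ins for the real loan data:

```python
import pandas as pd
import numpy as np

# Toy frame standing in for the raw loan data (column names hypothetical).
df = pd.DataFrame({
    "annual_inc": [50000.0, np.nan, 70000.0, np.nan],
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
})

# Drop columns with more than 70% missing values ('mostly_missing' is 75% NaN).
df = df.loc[:, df.isna().mean() <= 0.70]

# Fill remaining numeric NaNs with the column mean.
df = df.fillna(df.mean(numeric_only=True))
print(df["annual_inc"].tolist())  # NaNs replaced by the column mean 60000.0
```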
We used Jupyter Notebook and Python libraries (Matplotlib, Pandas, Seaborn) for data visualization.
We first began by looking at our data to better understand our demographics, starting with the length of employment of our customers. We also explored the distribution of the loan amounts and when the amount of loans issued increased significantly, and were able to draw the following conclusions:
- Most of the loans issued were in the range of 10,000 to 20,000 USD.
- 2015 was the year in which most loans were issued.
- Loans were issued in an incremental manner. (Possibly due to a recovery in the U.S. economy.)
- The loans applied by potential borrowers, the amount issued to the borrowers and the amount funded by investors are similarly distributed, meaning that it is most likely that qualified borrowers are going to get the loan they had applied for.
Next, we took a look at the amount of bad loans Lending Club has declared so far, keeping in mind that there were still loans at risk of defaulting in the future. The amount of bad loans could also increase over time, since we still had a great number of current loans. Average annual income was an important key metric for finding possible investment opportunities in a specific region.
The conclusion that we drew from this were:
- Currently, bad loans make up 7.60% of total loans, but remember that we still have current loans which carry the risk of becoming bad loans. (So this percentage is subject to change.)
- The NorthEast region seemed to be the most attractive in terms of funding loans to borrowers.
- The SouthWest and West regions have experienced a slight increase in the "median income" in the past years.
- Average interest rates have declined since 2012, which might explain the increase in the volume of loans.
- Employment length tends to be greater in the SouthWest and West regions.
- Clients located in the NorthEast and MidWest regions have not experienced a drastic increase in debt-to-income (dti) compared to the other regions.
- Fully Paid loans tend to be smaller. This could be due to the age of the loans.
- Default has the highest count among the bad loan statuses.
- Loans In Grace Period and Late (16–30 days) have the highest total and mean loan amounts.
The next question we wanted to answer was "What kind of loans are being issued?". We decided to approach this through the grade that LendingClub assigns to the loan. The grade is a value from A to G that is the culmination of LendingClub's own analysis of the customer's ability to repay the loan. The insights that we drew from this were:
- Interest rate varied wildly, reaching nearly 30% for high-risk loans
- Grade A has the lowest interest rate around 7%
- Grade G has the highest interest rate above 25%
In the next part we analyzed the loans issued by region in order to see regional patterns that help us understand where Lending Club's lending activity is concentrated.
Summary:
- The South-East, West and North-East regions had the highest number of loans issued.
- West and South-West had a rapid increase in debt-to-income starting in 2012.
- West and South-West had a rapid decrease in interest rates (This might explain the increase in debt to income)
Deeper Look into Bad Loans:
We looked at the number of loans that were classified as bad loans for each region by loan status. We also had a closer look at the operative side of the business by state, which gave us a clearer idea of the states with higher operating activity. We focused on three key metrics: loans issued by state (total sum), average interest rate charged to customers, and average annual income of all customers by state. The purpose of this analysis was to find states that give high returns at a decent risk.
And we concluded as follows:
- The regions of the West and South-East had a higher percentage in most of the "bad" loan statuses.
- The North-East region had a higher percentage in Grace Period and Does not meet Credit Policy loan status. However, both of these are not considered as bad as default for instance.
- Based on this small and brief summary we can conclude that the West and South-East regions have the most undesirable loan status, but just by a slightly higher percentage compared to the North-East region.
- California, Texas, New York and Florida were the states in which the highest amount of loans were issued.
- Interestingly enough, all four states had an approximate interest rate of 13%, which is at the same level as the average interest rate across all states (13.24%).
- California, Texas and New York were all above the average annual income (with the exclusion of Florida), which gives a possible indication of why most loans were issued in these states.
Team B: After further analysis, cleaning of the data and filling of missing values, we came down to 2,011,813 rows × 91 columns. Carefully observing relevance, we selected 33 features and one target value from these, as listed below:
1. 'funded_amnt_inv',
2. 'int_rate',
3. 'emp_length',
4. 'annual_inc',
5. 'pymnt_plan',
6. 'dti',
7. 'delinq_2yrs',
8. 'fico_range_low',
9. 'inq_last_6mths',
10. 'open_acc',
11. 'pub_rec',
12. 'revol_bal',
13. 'revol_util',
14. 'total_acc',
15. 'initial_list_status',
16. 'total_rec_late_fee',
17. 'last_pymnt_amnt',
18. 'last_fico_range_high',
19. 'last_fico_range_low',
20. 'collections_12_mths_ex_med',
21. 'policy_code',
22. 'application_type',
23. 'acc_now_delinq',
24. 'tot_coll_amt',
25. 'tot_cur_bal',
26. 'total_bal_il',
27. 'max_bal_bc',
28. Datetime columns: 'issue_d_month', 'earliest_cr_line_month', 'earliest_cr_line_year', 'last_pymnt_d_month', 'last_credit_pull_d_month', 'last_credit_pull_d_year'
Target value: 'grade'
Next, after scaling our data using StandardScaler, we split it into training, validation and test data with a 90:10 ratio (90% training data and 10% validation and test data). We first employed regularized logistic regression to create a baseline model. After fitting the training data we checked its accuracy, which came out to be 86.74%; in order to increase the accuracy we performed hyperparameter tuning using the GridSearchCV module and obtained the best parameters, but there was no drastic change in accuracy. We also calculated precision, recall and F1 score on the test data using the baseline model, and finally the confusion matrix was computed and plotted for the various classes.
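The hyperparameter search described above might look roughly like the sketch below; the synthetic data and the grid of C values (inverse regularization strength) are illustrative stand-ins, not the parameters actually used:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Small synthetic multiclass problem standing in for the scaled loan features.
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)

# Regularized logistic regression tuned over a small illustrative grid of C.
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```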
So we went on to use a different ML algorithm. We used Random Forest, and after fitting the data and tuning it with hyperparameter search we got an excellent accuracy of 92.03%. We again calculated the same metrics as mentioned above. Next we calculated the AUC (Area Under the Curve) for both the regularized logistic regression model and the random forest model, which came out to be 98.68% and 99.7% respectively. We also plotted the ROC curve for both models.
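A minimal sketch of the random forest plus one-vs-rest AUC evaluation described above; the synthetic data and hyperparameters are illustrative, not the project's actual settings:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic multiclass data standing in for the selected loan features.
X, y = make_classification(n_samples=500, n_features=10, n_informative=6,
                           n_classes=3, random_state=1)
train_X, test_X, train_y, test_y = train_test_split(
    X, y, test_size=0.2, random_state=1)

rf = RandomForestClassifier(n_estimators=100, random_state=1)
rf.fit(train_X, train_y)

# One-vs-rest AUC, as reported for both models above.
auc = roc_auc_score(test_y, rf.predict_proba(test_X), multi_class="ovr")
print(round(auc, 3))
```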
Team B: We finally chose to work with neural networks for our model, which are computing systems inspired by the biological neural networks that constitute animal brains. A neural network is based on a collection of connected units or nodes called artificial neurons, which loosely model the neurons in a biological brain. Each connection, like the synapses in a biological brain, can transmit a signal to other neurons. An artificial neuron that receives a signal processes it and can signal the neurons connected to it.
For our model, Sequential is the Keras model class, and Dense is a layer that accepts a number of hidden units and an activation function. Our model has a single hidden layer with 64 units and ReLU activation. The output layer has 7 units, since we predict 7 classes A, B, ..., G, and the loss is categorical cross-entropy, which is appropriate because the output has 7 classes. The activation in the output layer is softmax: a softmax function normalizes the raw scores so that the class probabilities sum to one.
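The softmax normalization described above can be illustrated in a few lines of NumPy; the seven raw scores below are made-up values, one per grade A..G:

```python
import numpy as np

def softmax(z):
    """Normalize raw scores into probabilities that sum to one."""
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

# Seven raw output scores, one per loan grade A..G (values illustrative).
logits = np.array([2.0, 1.0, 0.5, 0.0, -0.5, -1.0, -2.0])
probs = softmax(logits)
print(probs.sum())  # sums to 1 (up to floating point)
```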
Finally, we saved our model as a .h5 file and used the cloud services of Heroku to create our web application.
Deployed Online url : https://loangradeprediction.herokuapp.com/
Logistic regression is a machine learning classification algorithm that is used to predict the probability of a categorical dependent variable. In logistic regression, the dependent variable is binary or multinomial, the latter involving more than two categories. Assumptions of the model:
- The independent variables should be independent of each other. That is, the model should have little or no multicollinearity.
- The independent variables are linearly related to the log odds.
- Logistic regression requires quite large sample sizes.
🧮 Mathematics behind logistic regression
Remember how linear regression often used ordinary least squares to arrive at a value? Logistic regression instead relies on the concept of 'maximum likelihood' using a sigmoid function. A 'sigmoid function' on a plot looks like an 'S' shape: it takes a value and maps it to somewhere between 0 and 1. Its curve is also called a 'logistic curve'. Its general formula is

f(x) = L / (1 + e^(-k(x - x0)))

where the sigmoid's midpoint is at x = x0, L is the curve's maximum value, and k is the curve's steepness. If the outcome of the function is more than 0.5, the label in question will be given the class '1' of the binary choice. If not, it will be classified as '0'.
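A minimal sketch of this logistic curve and the 0.5 decision threshold, using the parameters L, k and x0 described above:

```python
import math

def sigmoid(x, L=1.0, k=1.0, x0=0.0):
    """General logistic curve: maximum L, steepness k, midpoint x0."""
    return L / (1 + math.exp(-k * (x - x0)))

# With L=1, k=1, x0=0 this is the standard sigmoid used in logistic regression.
print(sigmoid(0.0))  # 0.5 exactly at the midpoint
# Outputs above 0.5 map to class '1', otherwise class '0'.
print(1 if sigmoid(2.0) > 0.5 else 0)  # 1
```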
The data set contains 7 loan grade classes, where each class is encoded as a digit (1:A, 2:B, 3:C, 4:D, 5:E, 6:F, 7:G). The objective of our model is to predict the correct grade based on the given data, and then deploy that model on Heroku using Flask to predict the loan grade.
Building a model to predict these multiple classes is straightforward in scikit-learn.
Create Test and Train Dataset

Select the variables for the classification model and split the dataset, so that one set of data can be used for training the model and one set for testing. Split into training and test sets by calling train_test_split():

from sklearn.model_selection import train_test_split
# df is the cleaned DataFrame produced in the data-cleaning step
features = df[['funded_amnt_inv', 'int_rate', 'grade', 'emp_length', 'annual_inc',
               'pymnt_plan', 'dti', 'delinq_2yrs', 'fico_range_low', 'inq_last_6mths',
               'open_acc', 'pub_rec', 'revol_bal', 'revol_util', 'total_acc',
               'initial_list_status', 'total_rec_late_fee', 'last_pymnt_amnt',
               'last_fico_range_high', 'last_fico_range_low',
               'collections_12_mths_ex_med', 'policy_code', 'application_type',
               'acc_now_delinq', 'tot_coll_amt', 'tot_cur_bal', 'total_bal_il',
               'max_bal_bc', 'issue_d_month', 'earliest_cr_line_month',
               'earliest_cr_line_year', 'last_pymnt_d_year',
               'last_credit_pull_d_month', 'last_credit_pull_d_year']]
X = features.drop(['grade'], axis=1)
y = features['grade']
# split data into training and testing data, for both features and target
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=1)
Normalization of independent features

We normalized our data using StandardScaler(). The scaler is fitted on the training data only and then applied to the test data, so that no information from the test set leaks into training:

from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
train_X = ss.fit_transform(train_X)
test_X = ss.transform(test_X)
Now train the model by calling fit() with the training data:

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.linear_model import LogisticRegression
accepted_model = LogisticRegression(penalty='none')
accepted_model.fit(train_X, train_y)
val_predictions = accepted_model.predict(test_X)
Model Score
Check the model score for the training and test data:
print('Training accuracy: {}'.format(accepted_model.score(train_X, train_y)))
print('Test accuracy: {}'.format(accepted_model.score(test_X, test_y)))
Training accuracy: 0.8697499145670882
Test accuracy: 0.8702987103684982
Confusion Matrix
- Confusion Matrix helps us to visualize the performance of model.
- The diagonal elements represent the number of points for which the predicted label equals the true label.
- Off-diagonal elements are those that are mislabeled by the classifier.
- The higher the diagonal values of the confusion matrix, the better, since they indicate many correct predictions. For our model, the confusion matrix is:
confusion = confusion_matrix(test_y, val_predictions)
print(" \n Confusion Matrix is : \n", confusion)
Confusion Matrix is :
[[ 75116 4178 0 0 0 0 0]
[ 5511 105556 7105 0 0 0 0]
[ 40 6860 103954 4483 0 0 0]
[ 27 0 6297 46486 3950 0 0]
[ 8 0 1 7558 14610 1211 0]
[ 6 0 0 11 3020 3746 503]
[ 1 0 0 0 87 1330 708]]
Classification Report
The classification report is used to measure the quality of predictions from a classification algorithm.
- Precision: indicates what proportion of predicted positives was actually correct.
- Recall: indicates what proportion of actual positives was identified correctly.
- F-score: the harmonic mean of precision and recall.
- Support: the number of occurrences of the given class in our dataset.
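These definitions can be checked on a small made-up set of counts for a single class (the numbers below are hypothetical):

```python
# Toy binary counts for one class: 40 true positives, 10 false positives,
# 5 false negatives (all values illustrative).
tp, fp, fn = 40, 10, 5

precision = tp / (tp + fp)  # fraction of predicted positives that are correct
recall = tp / (tp + fn)     # fraction of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(round(precision, 2), round(recall, 2), round(f1, 2))
```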
report = classification_report(test_y, val_predictions)
print(" \n Classification Report is : \n", report)
Classification Report is :
precision recall f1-score support
1 0.93 0.95 0.94 79294
2 0.91 0.89 0.90 118172
3 0.88 0.90 0.89 115337
4 0.79 0.82 0.81 56760
5 0.66 0.61 0.63 23388
6 0.55 0.46 0.50 7286
7 0.57 0.33 0.42 2126
accuracy 0.87 402363
macro avg 0.75 0.71 0.73 402363
weighted avg 0.86 0.87 0.87 402363
This is not a bad model; its accuracy is in the 87% range so ideally you could use it to predict the loan grade given a set of variables.
Let's do one more visualization to see the so-called 'ROC' score:
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
import matplotlib.pyplot as plt
# Grade labels: the digits 1..7 correspond to grades A..G
grade = ['A', 'B', 'C', 'D', 'E', 'F', 'G']
# Get class probability scores
grade_prob = accepted_model.predict_proba(test_X)
# Get ROC metrics for each class (column i corresponds to label i + 1)
fpr = {}
tpr = {}
thresh = {}
for i in range(len(grade)):
    fpr[i], tpr[i], thresh[i] = roc_curve(test_y, grade_prob[:, i], pos_label=i + 1)
# Plot the ROC chart, one one-vs-rest curve per grade
plt.figure(figsize=(10, 10))
colors = ['orange', 'green', 'blue', 'red', 'yellow', 'pink', 'cyan']
for i in range(len(grade)):
    plt.plot(fpr[i], tpr[i], color=colors[i], label=grade[i] + ' vs Rest')
plt.title('Multiclass ROC curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc='best')
plt.show()
ROC curves are often used to view the output of a classifier in terms of its true vs. false positives. "ROC curves typically feature true positive rate on the Y axis, and false positive rate on the X axis." To compute the actual Area Under the Curve (AUC):
roc_auc = roc_auc_score(test_y, grade_prob, multi_class='ovr')
print(" \n AUC using ROC is : ", roc_auc)
AUC using ROC is : 0.9865396293036989
The result is 0.9865396293036989. Given that the AUC ranges from 0 to 1, you want a big score, since a model that is 100% correct in its predictions will have an AUC of 1; in this case, the model is pretty good.
- What is meant by model deployment? Deploying a machine learning model, known as model deployment, simply means integrating a machine learning model into an existing production environment, where it can take in an input and return an output.
- The Lending Club Loan Grade Prediction machine learning model was trained and tested using logistic regression with 87.02% accuracy. Using the Flask framework, HTML, CSS and Python, all the necessary files were created, along with a Procfile and requirements.txt.
- Using Heroku, a Platform-as-a-Service, we successfully deployed our model on an online web server, providing ease of use to end users.
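As a rough illustration of the deployment files mentioned above, a minimal Procfile and requirements.txt might look like the sketch below; the app module name (app:app) and the unpinned package list are assumptions, not the project's actual files:

```
# Procfile — tells Heroku how to start the web process
web: gunicorn app:app

# requirements.txt — the dependencies Heroku installs
flask
gunicorn
tensorflow
scikit-learn
pandas
```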
- Deployed Online url:http://loangradeprediction-api.herokuapp.com/