- Person or organization developing model: Agnes, [email protected]
- Model date: August, 2022
- Model version: 1.0.2
- License: MIT
- Model implementation code: DNSC_6301_Project.ipynb
- Primary intended uses: This model is an example probability of default classifier, with an example use case for determining eligibility for a credit line increase.
- Primary intended users: Students in GWU DNSC 6301 bootcamp.
- Out-of-scope use cases: Any use beyond an educational example is out-of-scope.
- Data dictionary:
Name | Modeling Role | Measurement Level | Description |
---|---|---|---|
ID | ID | int | unique row identifier |
LIMIT_BAL | input | float | amount of previously awarded credit |
SEX | demographic information | int | 1 = male; 2 = female |
RACE | demographic information | int | 1 = hispanic; 2 = black; 3 = white; 4 = asian |
EDUCATION | demographic information | int | 1 = graduate school; 2 = university; 3 = high school; 4 = others |
MARRIAGE | demographic information | int | 1 = married; 2 = single; 3 = others |
AGE | demographic information | int | age in years |
PAY_0, PAY_2 - PAY_6 | inputs | int | history of past payment; PAY_0 = the repayment status in September, 2005; PAY_2 = the repayment status in August, 2005; ...; PAY_6 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; ...; 8 = payment delay for eight months; 9 = payment delay for nine months and above |
BILL_AMT1 - BILL_AMT6 | inputs | float | amount of bill statement; BILL_AMT1 = amount of bill statement in September, 2005; BILL_AMT2 = amount of bill statement in August, 2005; ...; BILL_AMT6 = amount of bill statement in April, 2005 |
PAY_AMT1 - PAY_AMT6 | inputs | float | amount of previous payment; PAY_AMT1 = amount paid in September, 2005; PAY_AMT2 = amount paid in August, 2005; ...; PAY_AMT6 = amount paid in April, 2005 |
DELINQ_NEXT | target | int | whether a customer's next payment is delinquent (late), 1 = late; 0 = on-time |
- Source of training data: GWU Blackboard, email [email protected] for more information
- How training data was divided into training and validation data: 50% training, 25% validation, 25% test
- Number of rows in training and validation data:
  - Training rows: 15,000
  - Validation rows: 7,500
- Source of test data: GWU Blackboard, email [email protected] for more information
- Number of rows in test data: 7,500
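
The 50% / 25% / 25% division and the row counts above can be illustrated with a minimal sketch. It is not taken from the project notebook: the helper `split_indices` is hypothetical, the total of 30,000 rows is just the sum of the reported counts, and the seed reuses the `random_state` value from the hyperparameters only for convenience.

```python
import random


def split_indices(n_rows, seed=12345, train_frac=0.50, valid_frac=0.25):
    """Shuffle row indices and cut them into train/validation/test parts.

    Hypothetical helper for illustration; the notebook may split differently.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # reproducible shuffle
    n_train = int(n_rows * train_frac)
    n_valid = int(n_rows * valid_frac)
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]  # whatever remains goes to test
    return train, valid, test


# Assumed total: 15,000 + 7,500 + 7,500 rows, as reported in this card.
train_idx, valid_idx, test_idx = split_indices(30_000)
print(len(train_idx), len(valid_idx), len(test_idx))  # 15000 7500 7500
```

Shuffling before slicing keeps the three partitions disjoint while avoiding any ordering effects in the source file.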
- State any differences in columns between training and test data: None
- Columns used as inputs in the final model: 'LIMIT_BAL', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6'
- Column(s) used as target(s) in the final model: 'DELINQ_NEXT'
- Type of model: Decision Tree
- Software used to implement the model: Python, scikit-learn
- Version of the modeling software: Python 3.7.13, scikit-learn 1.0.2
- Hyperparameters or other settings of your model:
DecisionTreeClassifier {'ccp_alpha': 0.0,'class_weight': None,'criterion': 'gini',
'max_depth': 12,'max_features': None,'max_leaf_nodes': None,
'min_impurity_decrease': 0.0,'min_samples_leaf': 1,'min_samples_split': 2,
'min_weight_fraction_leaf': 0.0,'random_state': 12345,'splitter': 'best'}
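
As a hedged sketch of how the settings listed above map onto scikit-learn, the snippet below passes them to `DecisionTreeClassifier` and reads them back. It assumes scikit-learn is installed, and it only constructs the estimator; it does not reproduce the notebook's training code.

```python
from sklearn.tree import DecisionTreeClassifier

# Settings copied from the hyperparameter list in this card.
params = {
    'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini',
    'max_depth': 12, 'max_features': None, 'max_leaf_nodes': None,
    'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2,
    'min_weight_fraction_leaf': 0.0, 'random_state': 12345, 'splitter': 'best',
}

# Build an unfitted tree with exactly these settings.
clf = DecisionTreeClassifier(**params)

# get_params() echoes the settings back, a quick way to confirm them.
for name in sorted(params):
    print(name, '=', clf.get_params()[name])
```

Apart from `max_depth` and `random_state`, these values match scikit-learn's defaults in version 1.0.2.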
- Correlation plot notes:
  - Lighter colors = positively correlated; darker colors = negatively correlated (when one variable goes up, the other goes down).
  - There is a correlation between RACE and the outcome, which means people in certain race groups are not accepted as often as other people. This is a problem to investigate and fix.
  - Some variables are strongly correlated with each other.
- Metrics used to evaluate the final model (AUC and AIR): confusion matrix

Confusion matrices by RACE (counts):

RACE | Group | predicted: 1, actual: 1 | predicted: 1, actual: 0 | predicted: 0, actual: 1 | predicted: 0, actual: 0 |
---|---|---|---|---|---|
1 | Hispanic | 447 | 387 | 139 | 501 |
2 | Black | 449 | 348 | 157 | 537 |
3 | White | 176 | 813 | 72 | 1228 |
4 | Asian | 186 | 784 | 59 | 1217 |

Group | Proportion accepted | AIR (group-to-white) |
---|---|---|
White | 0.568 | reference |
Hispanic | 0.434 | 0.76 |
Black | 0.465 | 0.82 |
Asian | 0.568 | 1.00 |


Confusion matrices by SEX (counts):

SEX | Group | predicted: 1, actual: 1 | predicted: 1, actual: 0 | predicted: 0, actual: 1 | predicted: 0, actual: 0 |
---|---|---|---|---|---|
1 | Male | 546 | 905 | 179 | 1292 |
2 | Female | 712 | 1427 | 248 | 2191 |

Group | Proportion accepted | AIR (group-to-male) |
---|---|---|
Male | 0.503 | reference |
Female | 0.533 | 1.06 |


Confusion matrices by EDUCATION (counts):

EDUCATION | Group | predicted: 1, actual: 1 | predicted: 1, actual: 0 | predicted: 0, actual: 1 | predicted: 0, actual: 0 |
---|---|---|---|---|---|
1 | Graduate School | 367 | 766 | 144 | 1359 |
2 | University | 640 | 1115 | 216 | 1551 |
3 | High School | 249 | 409 | 65 | 496 |
4 | Others | 0 | 9 | 0 | 19 |

Group | Proportion accepted | AIR (group-to-graduate school) |
---|---|---|
Graduate School | 0.570 | reference |
University | 0.502 | 0.88 |
High School | 0.460 | 0.81 |
Others | 0.679 | 1.19 |


Confusion matrices by MARRIAGE (counts):

MARRIAGE | Group | predicted: 1, actual: 1 | predicted: 1, actual: 0 | predicted: 0, actual: 1 | predicted: 0, actual: 0 |
---|---|---|---|---|---|
1 | Married | 593 | 1004 | 208 | 1573 |
2 | Single | 647 | 1293 | 213 | 1878 |
3 | Others | 17 | 30 | 6 | 29 |

Group | Proportion accepted | AIR (group-to-married) |
---|---|---|
Married | 0.527 | 1.00 (reference) |
Single | 0.519 | 0.98 |
Others | 0.427 | 0.81 |

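
Confusion matrix by AGE=40 (age in years, counts):

AGE | predicted: 1, actual: 1 | predicted: 1, actual: 0 | predicted: 0, actual: 1 | predicted: 0, actual: 0 |
---|---|---|---|---|
40 | 39 | 59 | 17 | 111 |
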
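
The "proportion accepted" and AIR (adverse impact ratio) figures reported above can be recomputed from the confusion-matrix counts. The sketch below does this for RACE, assuming that a prediction of 0 (not delinquent) counts as "accepted"; that assumption reproduces the reported values, but the function names are illustrative and not taken from the notebook.

```python
# Counts copied from the RACE confusion matrices in this card, ordered as
# (pred 1 & actual 1, pred 1 & actual 0, pred 0 & actual 1, pred 0 & actual 0).
race_counts = {
    'hispanic': (447, 387, 139, 501),
    'black': (449, 348, 157, 537),
    'white': (176, 813, 72, 1228),
    'asian': (186, 784, 59, 1217),
}


def proportion_accepted(counts):
    """Share of rows predicted 0, treated here as 'accepted' (assumption)."""
    pred1_act1, pred1_act0, pred0_act1, pred0_act0 = counts
    return (pred0_act1 + pred0_act0) / sum(counts)


def air(protected, reference):
    """Adverse impact ratio: protected acceptance rate over reference rate."""
    return proportion_accepted(protected) / proportion_accepted(reference)


for group in ('hispanic', 'black', 'asian'):
    ratio = air(race_counts[group], race_counts['white'])
    print(f"{group}-to-white AIR: {ratio:.2f}")
```

An AIR well below 1 means the protected group is accepted less often than the reference group; the same two functions apply unchanged to the SEX, EDUCATION, and MARRIAGE counts.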
- State the final values, neatly -- as bullets or a table, of the metrics for all data: training, validation, and test data
Depth | Training AUC | Validation AUC | Test AUC | 5-Fold SD | Hispanic-to-White AIR |
---|---|---|---|---|---|
1 | 0.645748 | 0.643880 | 0.639065 | 0.009275 | 0.894148 |
2 | 0.699912 | 0.687752 | 0.685590 | 0.012626 | 0.850871 |
3 | 0.742968 | 0.729490 | 0.728666 | 0.017375 | 0.799546 |
4 | 0.757178 | 0.741696 | 0.737322 | 0.017079 | 0.792435 |
5 | 0.769331 | 0.742480 | 0.739600 | 0.019886 | 0.829336 |
6 | 0.783722 | 0.749610 | 0.743847 | 0.017665 | 0.833205 |
7 | 0.795777 | 0.742115 | 0.737266 | 0.022466 | 0.835886 |
8 | 0.807291 | 0.739990 | 0.734446 | 0.015567 | 0.811300 |
9 | 0.822913 | 0.727224 | 0.728575 | 0.012042 | 0.811561 |
10 | 0.838052 | 0.720562 | 0.714933 | 0.013855 | 0.803621 |
11 | 0.855168 | 0.709864 | 0.702163 | 0.010405 | 0.837806 |
12 | 0.874251 | 0.688074 | 0.682614 | 0.00807 | 0.844889 |
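
One common way to read a table like this is to look for the row with the highest validation AUC and to watch the gap between training and validation AUC as a sign of overfitting. The sketch below does that with the values copied from the table, assuming the first column is the tree's `max_depth`; it illustrates the reading and is not a statement of how the final model was chosen.

```python
# (training AUC, validation AUC) per row of the table in this card.
# Assumption: the row number is the tree's max_depth.
auc_by_depth = {
    1: (0.645748, 0.643880), 2: (0.699912, 0.687752),
    3: (0.742968, 0.729490), 4: (0.757178, 0.741696),
    5: (0.769331, 0.742480), 6: (0.783722, 0.749610),
    7: (0.795777, 0.742115), 8: (0.807291, 0.739990),
    9: (0.822913, 0.727224), 10: (0.838052, 0.720562),
    11: (0.855168, 0.709864), 12: (0.874251, 0.688074),
}

# Depth with the highest validation AUC.
best_depth = max(auc_by_depth, key=lambda d: auc_by_depth[d][1])

# Training-minus-validation gap; a growing gap suggests overfitting.
gaps = {d: train - valid for d, (train, valid) in auc_by_depth.items()}

print("best validation AUC at depth", best_depth)  # depth 6
for depth in (1, 6, 12):
    print(f"depth {depth}: train-validation gap = {gaps[depth]:.6f}")
```

In these numbers the training AUC keeps rising with depth while the validation AUC peaks and then falls, so the gap widens steadily toward depth 12.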
- Plots related to the data or final model
- Describe potential negative impacts of using your model:
  - Math or software problems: the model has a 70% accuracy rate, which means 30% of its predictions are errors.
  - Real-world risks (who, what, when, or how): bias. The hispanic-to-white AIR of 0.76 means Hispanic customers are accepted less often than white customers.
- Describe potential uncertainties relating to the impacts of using your model:
  - Math or software problems: ongoing monitoring is needed because we do not know how the model will function in practice.
  - Real-world risks (who, what, when, or how): data privacy and security.
- Describe any unexpected results: the data had no missing values, and PAY_0 was disproportionately important.