Navi-P4

Home Credit Default Risk

Many people struggle to get loans due to insufficient or non-existent credit histories. And, unfortunately, this population is often taken advantage of by untrustworthy lenders.

Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.

This is a Kaggle challenge; the link to the competition is below.

  https://www.kaggle.com/c/home-credit-default-risk/

Goal: Predict if an applicant is capable of repaying a loan.

Prerequisites

Use setup.py to install all prerequisites for the project:

  python setup.py install

Data

The data set consists of 8 CSV files, listed below with their dimensions (rows * columns); child tables are nested under their parent table:

  • Application_train data (307,511 * 122)
  • Application_test data (48,744 * 121)
  • Bureau data (1,716,428 * 17)
    • Bureau_balance data (27,299,925 * 3)
  • Previous_application data (1,670,214 * 37)
    • Installments_payments data (13,605,401 * 8)
    • Credit_card_balance data (3,840,312 * 23)
    • POS_CASH_balance data (10,001,358 * 8)

The data are available for download from Kaggle, either from the competition page or via the Kaggle CLI:

  kaggle competitions download -c home-credit-default-risk
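
After downloading and placing the files in the tier1 folder (see Directory Specifications below), a quick sanity check with pandas might look like the sketch that follows. This is illustrative only; it assumes the file names shown in the directory layout and loads just three of the tables.

  import os
  import pandas as pd

  data_path = "tier1"  # assumed location of the raw CSV files

  # Load the application tables and one child table as a sanity check.
  app_train = pd.read_csv(os.path.join(data_path, "Application_train.csv"))
  app_test = pd.read_csv(os.path.join(data_path, "Application_test.csv"))
  bureau = pd.read_csv(os.path.join(data_path, "Bureau.csv"))

  # The shapes should match the row/column counts listed above.
  print(app_train.shape, app_test.shape, bureau.shape)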

Approach

This project features two end-to-end approaches to the Kaggle challenge 'Home Credit Default Risk'. The first approach uses manually engineered features; the second uses an automated feature-creation tool (featuretools). The results of the two methods are compared, and automated feature engineering is found to create superior features in a shorter amount of time.
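
For reference, the automated approach builds on Featuretools' deep feature synthesis. The snippet below is a minimal, illustrative sketch over just the application and bureau tables, written against the Featuretools 1.x API; the entity set, primitives, and options actually used in main.py may differ.

  import featuretools as ft
  import pandas as pd

  app = pd.read_csv("tier1/Application_train.csv")
  bureau = pd.read_csv("tier1/Bureau.csv")

  # Register the tables and the parent-child relationship between them
  # (SK_ID_CURR links applications to bureau records in the Kaggle schema).
  es = ft.EntitySet(id="home_credit")
  es = es.add_dataframe(dataframe_name="app", dataframe=app, index="SK_ID_CURR")
  es = es.add_dataframe(dataframe_name="bureau", dataframe=bureau, index="SK_ID_BUREAU")
  es = es.add_relationship("app", "SK_ID_CURR", "bureau", "SK_ID_CURR")

  # Deep feature synthesis with a small set of aggregation primitives.
  feature_matrix, feature_defs = ft.dfs(
      entityset=es,
      target_dataframe_name="app",
      agg_primitives=["mean", "max", "min", "sum", "count"],
      max_depth=2,
  )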

Training models:

  • Gradient Boosting Machines (GBM)
  • Random Forest
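
As a rough illustration of the GBM step, the sketch below uses LightGBM's scikit-learn interface on the tier3 training table. It assumes that tier3/Application_train.csv contains the final feature matrix with a TARGET column and that LightGBM is the GBM implementation; the library, hyperparameters, and preprocessing in main.py may differ.

  import lightgbm as lgb
  import pandas as pd
  from sklearn.model_selection import train_test_split

  # Assumption: tier3/Application_train.csv holds the final feature matrix plus TARGET.
  train = pd.read_csv("tier3/Application_train.csv")
  y = train["TARGET"]
  X = train.drop(columns=["TARGET", "SK_ID_CURR"], errors="ignore").select_dtypes("number")

  X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

  # Placeholder hyperparameters; the competition is scored on ROC AUC,
  # so predicted probabilities are what matters.
  model = lgb.LGBMClassifier(n_estimators=1000, learning_rate=0.05)
  model.fit(X_train, y_train, eval_set=[(X_val, y_val)], eval_metric="auc")

  val_pred = model.predict_proba(X_val)[:, 1]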

Directory Specifications

  • Input data should follow the structure below:
    The project directory should contain a folder named tier1 containing all 8 raw CSV files.

parent folder(project directory)
     |- tier1
         |--Application_train.csv
         |--Application_test.csv
         |--Bureau.csv
         |--Bureau_balance.csv
         |--Previous_application.csv
         |--Installments_payments.csv
         |--Credit_card_balance.csv
         |--POS_CASH_balance.csv

  • Once the entire pipeline has completed, the output will follow the structure below:
    The submission file is written inside the project directory.

parent folder(project directory)
     |- tier1
         |--Application_train.csv
         |--Application_test.csv
         |--Bureau.csv
         |--Bureau_balance.csv
         |--Previous_application.csv
         |--Installments_payments.csv
         |--Credit_card_balance.csv
         |--POS_CASH_balance.csv
     |- tier2
         |--Application_test.csv
         |--Bureau.csv
         |--Bureau_balance.csv
         |--Previous_application.csv
         |--Installments_payments.csv
         |--Credit_card_balance.csv
         |--POS_CASH_balance.csv
     |- tier3
         |--Application_train.csv
         |--Application_test.csv
     |- p4sub_gbm.csv

Running the code

python main.py -d <data_path> -m <mode> -ft <feature_type> -p <primitive_set> -i <imp_thresh>

Parameters:

<data_path> Path to the parent folder created as per the directory specifications.
Default : Current working directory.

<mode> Mode to run the code in.
Default : 'all'
Choices : 'all' - To generate features and also train the model and generate predictions,
            'features' - To only generate the features,
            'model' - To only run the model

<feature_type> Type of feature selection to implement
Default : 'auto'
Choices : 'auto' - To generate features using feature tools
            'manual' - To generate features based on manual feature engineering

<primitive_set> Set of primitives to consider while using feature tools
Default : 'some'
Choices : 'some' - To use some of the primitives while using feature tools
            'all' - To use all the primitives while using feature tools

<imp_thresh> Importance threshold to consider while doing feature selection
Default : '0'
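
For anyone extending the CLI, the options above roughly correspond to an argparse configuration like the one sketched below. This is illustrative only; the option wiring and defaults in main.py may be organized differently.

  import argparse
  import os

  parser = argparse.ArgumentParser(description="Home Credit Default Risk pipeline")
  parser.add_argument("-d", "--data_path", default=os.getcwd(),
                      help="Parent folder laid out as in Directory Specifications")
  parser.add_argument("-m", "--mode", default="all",
                      choices=["all", "features", "model"])
  parser.add_argument("-ft", "--feature_type", default="auto",
                      choices=["auto", "manual"])
  parser.add_argument("-p", "--primitive_set", default="some",
                      choices=["some", "all"])
  parser.add_argument("-i", "--imp_thresh", type=float, default=0.0,
                      help="Importance threshold used during feature selection")
  args = parser.parse_args()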

Output:

The program will output p4sub_gbm.csv in the given parent directory.
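
The submission follows the standard format for this competition: one row per SK_ID_CURR with a predicted probability in the TARGET column. A minimal sketch of how such a file could be produced is shown below, continuing from the GBM sketch in the Approach section; the column handling and file names are assumptions, and the actual code in main.py may differ.

  import pandas as pd

  # 'model' is the fitted classifier from the GBM sketch above (assumption).
  test = pd.read_csv("tier3/Application_test.csv")
  X_test = test.drop(columns=["SK_ID_CURR"], errors="ignore").select_dtypes("number")

  submission = pd.DataFrame({
      "SK_ID_CURR": test["SK_ID_CURR"],
      "TARGET": model.predict_proba(X_test)[:, 1],
  })
  submission.to_csv("p4sub_gbm.csv", index=False)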

Sample Run Commands and Code Flow:

  • To run the entire code using auto feature selection and gbm for model predictions:

The below command will use the data from tier1, generate features using featuretools, and save the feature matrix into tier3. Then, tier3 data is used to train gbm and generate predictions.
python main.py -d /Users/hemanth/Desktop/MSAI/DataSciencePracticum/Projects/p4 -m all -ft auto

  • To run the entire code using manual feature selection and gbm for model predictions:

The below command will use the data from tier1, generate features using manual feature engineering techniques, and save the transformed data into tier2. Tier2 data is then joined according to the data model and saved into tier3. Then, tier3 data is used to train gbm and generate predictions.
python main.py -d /Users/hemanth/Desktop/MSAI/DataSciencePracticum/Projects/p4 -m all -ft manual

  • To generate features alone using manual/auto feature selection:

The below command will use the data from tier1, generate features using manual feature engineering techniques, and save the transformed data into tier2. Tier2 data is then joined according to the data model and saved into tier3.
python main.py -d /Users/hemanth/Desktop/MSAI/DataSciencePracticum/Projects/p4 -m features -ft manual

The below command will use the data from tier1, generate features using featuretools, and save the feature matrix into tier3.
python main.py -d /Users/hemanth/Desktop/MSAI/DataSciencePracticum/Projects/p4 -m features -ft auto

  • To train the model on features that have already been generated:

The below command uses features generated by auto feature selection in tier3 to train gbm and generate predictions.
python main.py -d /Users/hemanth/Desktop/MSAI/DataSciencePracticum/Projects/p4 -m model -ft auto

The below command uses features generated by manual feature selection in tier3 to train gbm and generate predictions.
python main.py -d /Users/hemanth/Desktop/MSAI/DataSciencePracticum/Projects/p4 -m model -ft manual

References

See the references Wiki page for details.


Ethics Considerations

This project could be used as part of a study on an individual's loan repayment ability. With this context in mind, we have undertaken certain ethics considerations to ensure that this project cannot be misused for purposes other than the ones intended.

See the ETHICS.md file for details. Also see the Wiki Ethics page for explanations about the ethics considerations.

Contributors

See the contributors file for details.


License

This project is licensed under the MIT License. See the LICENSE.md file for details.
