dsc_study's Introduction

dsc_study's People

Contributors

yukiregista, rkondo3, sinsukehlab, dependabot[bot]

dsc_study's Issues

Exploratory data analysis

  • Read the paper and find out what is preserved in the artificial data
  • Work out what we can expect to hold in the test data
  • We do not need to focus much on improving the score on the validation dataset
  • Compare train data / validation data / test data / original data
  • List and compare all important features in the train and original data

Original Kaggle data
{private_drive}/FDUAcomp/SBAnational.csv

Correlation

  • df_train except for one-hot columns
  • df_train including one-hot columns
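
The two correlation runs above can be sketched as follows. This is a minimal illustration using only the standard library, not the team's actual notebook code. The helper names (`pearson`, `is_one_hot`, `correlation_matrix`) are hypothetical, the table is a plain dict of column name to list of numbers standing in for df_train, and a column is assumed to be one-hot if it contains only 0 and 1, which would also catch genuine binary features.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant column has no defined correlation.
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def is_one_hot(column):
    """Assumption: a column holding only 0 and 1 is treated as one-hot."""
    return set(column) <= {0, 1}


def correlation_matrix(table, include_one_hot=True):
    """Return a nested dict of pairwise correlations between columns."""
    names = [
        name
        for name, column in table.items()
        if include_one_hot or not is_one_hot(column)
    ]
    return {a: {b: pearson(table[a], table[b]) for b in names} for a in names}
```

Calling `correlation_matrix(table, include_one_hot=False)` gives the first variant and `correlation_matrix(table)` the second. With pandas, `DataFrame.corr()` on the corresponding column subset does the same job.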

Tableau

  • Install Tableau and activate account
  • Read data
  • List what can be done with Tableau
  • Check what we need to create for the prize

NA Which tools should we use, if any?

I am Takashi OTSU. I was absent from the previous two sessions.
Which tools, if any, will we use?

I am new to working with and analyzing this kind of data.
In this competition, we first need to visualize and preprocess the data. I suggest that we agree on a common method or set of tools for this team.

My suggestion is Dataiku, in which we can build workflows without writing code; it might be appropriate for making initial models.
Do you have any suggestions?

Team formation

plan A

  • Mathematical and Statistical team
  • Neural network and Applied techs team

plan B

  • Pure Data Team
  • Alternative Data team

plan C

  • language driven

Dataiku

  • Join the Dataiku Slack channel
  • Install
  • Import data
  • Check the functions of Dataiku
  • Calculate statistical information
  • Try AutoML

Data preprocessing

  • Convert dollar amounts to float
  • Split dates into year, month, day
  • Map 0, Y, N -> 0, 1, -1
  • One-hot encode place
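
The four preprocessing steps above could look roughly like this. It is a sketch under stated assumptions rather than the team's cleaning code: the function names are hypothetical, the date format defaults to `%Y-%m-%d` (the real files may use a different format, so pass `fmt` accordingly), and only the standard library is used so the snippet runs without extra packages.

```python
from datetime import datetime

# Mapping taken from the task list: 0 -> 0, Y -> 1, N -> -1.
FLAG_CODES = {"0": 0, "Y": 1, "N": -1}


def dollar_to_float(text):
    """Convert a currency string such as '$1,234.50' to a float."""
    return float(text.replace("$", "").replace(",", "").strip())


def split_date(text, fmt="%Y-%m-%d"):
    """Split a date string into (year, month, day); fmt is an assumption."""
    parsed = datetime.strptime(text, fmt)
    return parsed.year, parsed.month, parsed.day


def encode_flag(value):
    """Map 0 / 'Y' / 'N' to 0 / 1 / -1; other values raise KeyError."""
    return FLAG_CODES[str(value).strip()]


def one_hot(values, prefix):
    """One-hot encode a list of category labels into a list of dicts."""
    categories = sorted(set(values))
    return [
        {f"{prefix}_{category}": int(value == category) for category in categories}
        for value in values
    ]
```

Raising on unexpected flag values is a deliberate choice here: it surfaces dirty entries during cleaning instead of silently encoding them.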

MS (takazawa) - Try first simple model

Implement the first simple model using some natural encoding.

  • Target-encode some categorical variables with a large number of categories.
  • Use LightGBM
  • Upload the code and record submission performance
  • Make data-cleaning function
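
As an illustration of the target-encoding step only (the LightGBM training itself is not shown), here is a minimal sketch of smoothed mean target encoding. The function names and the `smoothing` parameter are hypothetical choices, not something fixed by this issue. Each category is replaced by its mean target, shrunk toward the global mean so that rare categories do not get extreme values.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets, smoothing=10.0):
    """Learn smoothed per-category target means from training rows.

    Returns (mapping, global_mean). With smoothing=0 the mapping is the
    plain per-category mean.
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    mapping = {
        category: (sums[category] + smoothing * global_mean)
        / (counts[category] + smoothing)
        for category in counts
    }
    return mapping, global_mean


def apply_target_encoding(categories, mapping, global_mean):
    """Encode categories; unseen categories fall back to the global mean."""
    return [mapping.get(category, global_mean) for category in categories]
```

To avoid leaking the target, the mapping should be fitted on the training folds only and then applied to the validation and test rows.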

MS LASSO

  • Apply LASSO with some fixed regularization
  • Hyperparameter tuning

Exploratory data analysis (EDA) and baseline of three models

  • Apart from the date features, I have analyzed the remaining features and reached the following conclusions:

  • 1. 'NewExist', 'RevLineCr' and 'Sector' are the features that show a difference in the number of repayers and defaulters. The differences are as follows:

  • 1) Companies that already exist are more likely not to repay the loan.
    2) Companies that are considered to have no credit value are the group most likely to repay the loan.
    3) Companies in sector 11 are the group most likely to repay the loan, and companies in sector 49 are the group least likely to repay it.
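
Comparisons like the ones above come down to a default rate per group. A minimal sketch of that calculation follows. The function name is hypothetical, and the inputs are assumed to be a list of group labels (for example 'Sector' values) and a parallel list of 0/1 default indicators; it does not reproduce the figures behind the conclusions above.

```python
from collections import defaultdict


def default_rate_by_group(groups, defaulted):
    """Return {group: share of rows in that group with defaulted == 1}."""
    totals = defaultdict(int)
    defaults = defaultdict(int)
    for group, flag in zip(groups, defaulted):
        totals[group] += 1
        defaults[group] += int(flag)
    return {group: defaults[group] / totals[group] for group in totals}
```

Sorting the resulting dict by value gives the most and least likely groups to default. It is worth checking the group sizes as well, since a rate computed on a handful of rows is unreliable.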

  • Things I am going to do next:

  • 1. Do more EDA on the date features

  • 2. Tune the models tried so far and try a voting classifier

  • 3. Find and try additional data (such as the data Kondo-san recommended) from outside the provided data

Data Profiling (ydata_profiling)

Profile data using ydata_profiling.

  • Generate a profile report for the training dataset and upload the widget (or HTML) here
  • Add correlation plots as well (if the report does not include them)
