minerva-ml / open-solution-home-credit

Open solution to the Home Credit Default Risk challenge 🏡

Home Page: https://www.kaggle.com/c/home-credit-default-risk

License: MIT License

Python 56.25% Jupyter Notebook 43.75%

machine-learning deep-learning kaggle pipeline feature-engineering reproducible-experiments reproducibility pipeline-framework lightgbm xgboost neptune competition credit-scoring credit-risk open-source python python3 python35

Home Credit Default Risk: Open Solution

This is an open solution to the Home Credit Default Risk challenge 🏡.

More competitions 🎇

Check the collection of public projects 🎁, where you can find multiple Kaggle competitions with code, experiments and outputs.

Our goals

We are building an entirely open solution to this competition. Specifically:

Learning from the process - following updates about new ideas, code and experiments is the best way to learn data science. Our activity is especially useful for people who want to enter the competition but lack the appropriate experience.
Encourage more Kagglers to start working on this competition.
Deliver an open source solution with no strings attached. Code is available on our GitHub repository 💻. This solution should establish a solid benchmark, as well as provide a good base for your custom ideas and experiments. We care about clean code 😃
We are opening our experiments as well: everybody can have a live preview of our experiments, parameters, code, etc. Check: Home Credit Default Risk 📈 and the screens below.

Train and validation results on folds 📊	LightGBM learning curves 📊

Disclaimer

In this open source solution you will find references to neptune.ml. It is a free platform for community users, which we use daily to keep track of our experiments. Please note that using neptune.ml is not necessary to proceed with this solution. You may run it as a plain Python script 🐍.

Note

As of 1.07.2019 we officially discontinued the neptune-cli client project, making neptune-client the only supported way to communicate with Neptune. That means you should run experiments via the python ... command or update the loggers to neptune-client. For more information about the new client, go to the neptune-client read-the-docs page.

How to start?

Learn about our solutions

Check Kaggle forum and participate in the discussions.
Check our Wiki pages 🏡, where we document our work. See solutions below:

link to code	name	CV	LB	link to description
solution 1	chestnut 🌰	?	0.742	LightGBM and basic features
solution 2	seedling 🌱	?	0.747	Sklearn and XGBoost algorithms and groupby features
solution 3	blossom 🌼	0.7840	0.790	LightGBM on selected features
solution 4	tulip 🌷	0.7905	0.801	LightGBM with smarter features
solution 5	sunflower 🌻	0.7950	0.804	LightGBM clean dynamic features
solution 6	four leaf clover 🍀	0.7975	0.806	priv. LB 0.79804, Stacking by feature diversity and model diversity

Start experimenting with ready-to-use code

You can jump-start your participation in the competition by using our starter pack. The installation instructions below will guide you through the setup.

Installation (fast track)

Clone the repository and install the requirements (use Python 3.5)

pip3 install -r requirements.txt

Register at neptune.ml (if you wish to use it)
Run an experiment based on LightGBM:

🔱

neptune account login
neptune run --config configs/neptune.yaml main.py train_evaluate_predict_cv --pipeline_name lightGBM

🐍

python main.py -- train_evaluate_predict_cv --pipeline_name lightGBM

Installation (step by step)

Step by step installation 🖥️

Hyperparameter Tuning

Various hyperparameter tuning options are available.

Random Search

configs/neptune.yaml

  hyperparameter_search__method: random
  hyperparameter_search__runs: 100

src/pipeline_config.py

    'tuner': {'light_gbm': {'max_depth': ([2, 4, 6], "list"),
                            'num_leaves': ([2, 100], "choice"),
                            'min_child_samples': ([5, 10, 15, 25, 50], "list"),
                            'subsample': ([0.95, 1.0], "uniform"),
                            'colsample_bytree': ([0.3, 1.0], "uniform"),
                            'min_gain_to_split': ([0.0, 1.0], "uniform"),
                            'reg_lambda': ([1e-8, 1000.0], "log-uniform"),
                            },
              }
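To make the random search config more concrete, here is a minimal sketch of how one candidate configuration could be drawn from a space written in this format. The meaning of each tag is an assumption made for illustration, not something confirmed by the project code: "list" is taken to mean "pick one of the listed values", "choice" an integer drawn from the range [low, high), "uniform" a float between the bounds, and "log-uniform" a float drawn uniformly on a log scale. The names SPACE and sample_params are hypothetical.

```python
import math
import random

# Search space mirroring the 'light_gbm' entry of the tuner config.
SPACE = {
    'max_depth': ([2, 4, 6], "list"),
    'num_leaves': ([2, 100], "choice"),
    'min_child_samples': ([5, 10, 15, 25, 50], "list"),
    'subsample': ([0.95, 1.0], "uniform"),
    'colsample_bytree': ([0.3, 1.0], "uniform"),
    'min_gain_to_split': ([0.0, 1.0], "uniform"),
    'reg_lambda': ([1e-8, 1000.0], "log-uniform"),
}


def sample_params(space, rng):
    """Draw one configuration; the semantics of each mode are assumed."""
    params = {}
    for name, (values, mode) in space.items():
        if mode == "list":
            # Assumed: pick one of the explicitly listed values.
            params[name] = rng.choice(values)
        elif mode == "choice":
            # Assumed: integer drawn from the half-open range [low, high).
            low, high = values
            params[name] = rng.randrange(low, high)
        elif mode == "uniform":
            low, high = values
            params[name] = rng.uniform(low, high)
        elif mode == "log-uniform":
            # Sample uniformly in log space, then map back.
            low, high = values
            params[name] = math.exp(rng.uniform(math.log(low), math.log(high)))
        else:
            raise ValueError("Unknown sampling mode: {}".format(mode))
    return params


rng = random.Random(0)  # fixed seed so the draws are reproducible
# 100 draws, matching hyperparameter_search__runs: 100
candidates = [sample_params(SPACE, rng) for _ in range(100)]
```

Each entry of candidates is a plain dict that could be merged into the LightGBM parameters for one run. Sampling reg_lambda on a log scale matters here because its bounds span eleven orders of magnitude.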

Get involved

You are welcome to contribute your code and ideas to this open solution. To get started:

Check the competition project on GitHub to see what we are working on right now.
Express your interest in a particular task by writing a comment in this task, or by creating a new one with your fresh idea.
We will get back to you quickly in order to start working together.
Check CONTRIBUTING for some more information.

User support

There are several ways to seek help:

Kaggle discussion is our primary way of communication.
Read the project's Wiki, where we publish descriptions of the code, pipelines and supporting tools such as neptune.ml.
Submit an issue directly in this repo.

how it works?
which parameters are important and why?
rules of thumb for grid search

Use new adapter syntax
run experiment in order to make sure that it works correctly.
assume that steppy-toolkit is pip installable

remaining data is data without train/test IDs
write down your ideas as a comment to this Issue

inside single step?
or n steps for n-fold CV?

look for similar competitions and analyse them

look for useful features
tricks
models
validation strategies, etc.

build first pipeline from these pieces
run grid search
submit to Kaggle :)

Data augmentation

prepare release of the solution 1, once code is ready

test with / without neptune
write Wiki documentation
write info on Kaggle
report on results

investigate age buckets
check for important dates in Spain (Studies, retirement)

port algorithms from TalkingData to steppy-toolkit

XGBoost
LogReg

Check their implementation in TalkingData and port to steppy-toolkit

Combine External sources

create features based on EXT_SOURCE_1,2,3
sum
average (weighted perhaps)
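A minimal pandas sketch of these combined features, assuming the application table is loaded as a DataFrame with the EXT_SOURCE_1, EXT_SOURCE_2 and EXT_SOURCE_3 columns. The output column names, the helper add_ext_source_features and the example weights are hypothetical. Missing scores are skipped rather than imputed, which is one possible choice among several.

```python
import numpy as np
import pandas as pd

EXT_COLS = ['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']


def add_ext_source_features(df, weights=(1.0, 1.0, 1.0)):
    """Return a copy of df with sum, mean and weighted-mean features added."""
    out = df.copy()
    ext = out[EXT_COLS]

    # min_count=1 keeps the sum NaN when all three scores are missing.
    out['EXT_SOURCES_SUM'] = ext.sum(axis=1, min_count=1)
    out['EXT_SOURCES_MEAN'] = ext.mean(axis=1)  # skips NaN by default

    # Weighted mean over the scores that are present in each row.
    w = np.asarray(weights, dtype=float)
    present = ext.notna().to_numpy()
    weighted_sum = (ext.fillna(0.0).to_numpy() * w).sum(axis=1)
    weight_total = (present * w).sum(axis=1)
    safe_total = np.where(weight_total > 0, weight_total, 1.0)
    out['EXT_SOURCES_WEIGHTED_MEAN'] = np.where(
        weight_total > 0, weighted_sum / safe_total, np.nan)
    return out


# Tiny made-up frame, only to show the call; not real competition data.
applications = pd.DataFrame({
    'EXT_SOURCE_1': [0.2, np.nan, np.nan],
    'EXT_SOURCE_2': [0.4, 0.6, np.nan],
    'EXT_SOURCE_3': [0.6, 0.2, np.nan],
})
features = add_ext_source_features(applications, weights=(2.0, 1.0, 3.0))
```

The weights here are arbitrary placeholders. In practice they would need to be chosen by validation, for example by trying a few combinations against the CV score.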

implement transformers for: random forest and support vector regression

When #4 is done:

implement a steppy transformer that trains a random forest
implement a steppy transformer that trains support vector regression
handle NaN

this is a preparation step
use our steppy lib. -> ask @pknut or @mromaniukcdl for help, as needed.

first features for training purposes

Analyze dataset to identify good candidates for simple features
- simple, direct, easy to implement with minimal effort.
Implement steppy Transformer that prepares features for sklearn regression algorithms:
- random forest
- support vector regression
- others(?)

build 2-level ensembling

1st level:

rf
svc
xgb
lightGBM
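As a rough illustration of the 2-level idea, the sketch below builds out-of-fold predictions from two first-level models and fits a second-level model on them. It uses scikit-learn only and synthetic data. The logistic regression at the second level is an assumption, since the list above names only first-level models. XGBoost and LightGBM are left out to keep the example dependency-free, but their scikit-learn-compatible classifiers could be added to the first_level dict in the same way.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.svm import SVC

# Synthetic binary classification data standing in for the real features.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# 1st level: a subset of the models listed above.
first_level = {
    'rf': RandomForestClassifier(n_estimators=50, random_state=0),
    'svc': SVC(probability=True, random_state=0),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Out-of-fold probabilities: each row is predicted by a model that did not
# see it during training, so the 2nd level is not fit on leaked labels.
oof = np.column_stack([
    cross_val_predict(model, X, y, cv=cv, method='predict_proba')[:, 1]
    for model in first_level.values()
])

# 2nd level (assumed): logistic regression on the first-level predictions.
second_level = LogisticRegression()
second_level.fit(oof, y)
```

To score new data, each first-level model would be refit on the full training set, its predictions stacked in the same column order, and the result passed to second_level.predict_proba.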

outliers
mislabeling