- Find a dataset for experimenting with tick calculations
- {GoogleDrive}/SP500_2000-2019_CMEGroup.csv https://www.kaggle.com/datasets/finnhub/sp-500-futures-tick-data-sp?resource=download
- Investigate how to randomly generate financial test data from some distribution
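For the last item, a minimal sketch that samples a synthetic price series from geometric Brownian motion, a common first choice of distribution for test price data. All parameter values (drift, volatility, step count) are illustrative assumptions:

```python
import numpy as np

def simulate_gbm_prices(s0=100.0, mu=0.05, sigma=0.2, n_steps=252, dt=1 / 252, seed=0):
    """Simulate one price path under geometric Brownian motion.

    s0: initial price, mu: annual drift, sigma: annual volatility,
    n_steps: number of steps, dt: step size in years.
    """
    rng = np.random.default_rng(seed)
    # Under GBM, log-returns are i.i.d. normal.
    log_returns = (mu - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * rng.standard_normal(n_steps)
    prices = s0 * np.exp(np.cumsum(log_returns))
    return np.concatenate([[s0], prices])

prices = simulate_gbm_prices()
print(len(prices), prices[0])
```

Swapping the normal log-returns for a heavier-tailed distribution (e.g. Student's t) would give more realistic test data for tick-level experiments.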
dsc_study's Introduction
Baseline with target encoding
- Organize code and upload
Exploratory data analysis
- Read paper and find what is preserved in the artificial data
- What can we expect to hold in the test data
- We do not need to focus that much on improving the score on the validation dataset
- train data / validation data / test data / original data
- List and compare all important features in the train / original data
original kaggle data
{private_drive}/FDUAcomp/SBAnational.csv
Compare models as baseline
Databricks
Todo
- Register on SIGNATE ⇒ wait for the Azure account to be issued
- Log in to Azure Databricks
Links
- Jan 30, 18:00-20:00: hands-on session
- Cloud version
(Reference links for usage) - Product overview
- Case studies
- Global financial use cases
MS (takazawa) - EDA: relationship of newly generated features to the dependent variable
Correlation
- df_train except for one-hot columns
- df_train including one-hot columns
List todos for the initial data analysis
Please check when you create the issue
- draw histograms
- ydata_profiling (https://docs.profiling.ydata.ai/latest/)
- dataiku
- databricks
- tableau
- create baseline with famous models
Tableau
- Install Tableau and activate account
- Read data
- List what can be done with Tableau
- Check what we need to create for the prize.
Enable dependabot and CodeQL
Enable Dependabot and CodeQL in the Security tab.
Which tools will we use?
I am Takashi OTSU. I have been absent for the past two sessions.
Which tools will we use, if any?
I am new to working with and analyzing this kind of data.
In this competition, we first need to visualize and preprocess the data, so I suggest we use a common method or common tools within this team.
My opinion is to use Dataiku, which lets us build workflows without code; it might be appropriate for making initial models.
Do you have any suggestions?
Team formation
plan A
- Mathematical and Statistical team
- Neural network and Applied techs team
plan B
- Pure Data Team
- Alternative Data team
plan C
- language driven
[MS] Features for Ooyama-san's dataiku
Please add a comment below if you passed your features to Ooyama-san, and note where the code is located.
Baseline writeups shared in the FDUA community
Shared baseline writeup
- nishimoto
- takaito
Oversampling
- SMOTE
- ADASYN
- GAN
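For SMOTE, one would normally use the imbalanced-learn library; as a self-contained illustration of the underlying idea, here is a plain-NumPy sketch (not the library implementation) that synthesizes minority samples by interpolating between nearest neighbors:

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, seed=0):
    """SMOTE sketch: create n_new synthetic minority samples by interpolating
    between a randomly picked sample and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    # Pairwise Euclidean distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(d, axis=1)[:, :k]
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)                            # random minority sample
        j = neighbors[i, rng.integers(min(k, n - 1))]  # one of its neighbors
        gap = rng.random()                             # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

X_min = np.random.default_rng(1).standard_normal((10, 3))  # toy minority class
X_new = smote_oversample(X_min, n_new=7)
print(X_new.shape)
```

ADASYN differs mainly in how many synthetic points it generates per sample (weighted by local majority-class density), so the interpolation step above carries over.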
Common Check Data Distribution
- Check distribution of each variable
- Check correlation between variables
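Both checks above are one-liners in pandas; a sketch on a toy frame (the column names are invented stand-ins for the competition data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the competition data (all column names invented).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "loan_amount": rng.lognormal(10, 1, 500),
    "term_months": rng.integers(12, 120, 500),
})
df["default_flag"] = (df["loan_amount"] > df["loan_amount"].median()).astype(int)

summary = df.describe()             # per-variable distribution summary
corr = df.corr(numeric_only=True)   # pairwise Pearson correlations
print(summary.loc[["mean", "std"]])
print(corr)
```

`df.hist()` gives quick per-column histograms for the distribution check as well.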
Dataiku
- join dataiku slack channel
- install
- Data import
- check functions of dataiku
- calculate statistical information
- try AutoML
Data preprocessing
- dollar to float
- split date to year, month, day
- 0, Y, N -> 0, 1, -1
- onehot encoding for place
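The four steps above can be sketched in pandas as follows; the column names and example values are illustrative assumptions modeled on the SBA loan table, not the exact schema:

```python
import pandas as pd

# Toy rows mimicking the loan data format (values are illustrative).
df = pd.DataFrame({
    "DisbursementGross": ["$50,000.00", "$1,250.50"],
    "ApprovalDate": ["2006-03-15", "1999-11-02"],
    "LowDoc": ["Y", "N"],
    "State": ["CA", "NY"],
})

# dollar string -> float
df["DisbursementGross"] = (
    df["DisbursementGross"].str.replace("[$,]", "", regex=True).astype(float)
)

# split date into year / month / day
dates = pd.to_datetime(df["ApprovalDate"])
df["year"], df["month"], df["day"] = dates.dt.year, dates.dt.month, dates.dt.day

# 0, Y, N -> 0, 1, -1
df["LowDoc"] = df["LowDoc"].map({0: 0, "0": 0, "Y": 1, "N": -1})

# one-hot encode the place column
df = pd.get_dummies(df, columns=["State"])
print(df.columns.tolist())
```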
MS (takazawa) - Try first simple model
Implement the first simple model using some natural encoding.
- Target-encode some categorical variables with a large number of categories.
- Use LightGBM
- Upload the code and record submission performance
- Make data-cleaning function
nishimoto
Geo Encoding
MS LASSO
- Apply LASSO with some fixed regularization
- Hyperparameter tuning
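A minimal sketch of the first step on synthetic regression data (the fixed `alpha` is an arbitrary choice to be tuned later); for the binary loan target, `LogisticRegression(penalty="l1", solver="liblinear")` would be the classification analogue:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
# Only the first two features actually matter.
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.standard_normal(200)

# LASSO is scale-sensitive, so standardize the features first.
Xs = StandardScaler().fit_transform(X)
model = Lasso(alpha=0.1).fit(Xs, y)  # alpha fixed for now; tune later
print("nonzero coefficients:", int(np.sum(model.coef_ != 0)))
```

The L1 penalty drives the coefficients of the irrelevant features to exactly zero, which is the feature-selection behavior that makes LASSO useful here.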
Normalization
MS (takazawa) - EDA: relationship to dependent variable
- Organize and upload the code
- (Get county transformation done for a better visualization).
Exploratory data analysis (EDA) and baseline of three models
Except for the date features, I have analyzed the remaining features and reached the following conclusions:
1. 'NewExist', 'RevLineCr', and 'Sector' are the features that show a difference between repayers and defaulters, as follows:
1) Companies that already exist are more likely not to repay the loan.
2) Companies considered to have no revolving line of credit are the group most likely to repay the loan.
3) Companies in sector 11 are the group most likely to repay the loan, and companies in sector 49 are the least likely.
Things I am going to do next:
1. Do more EDA on the date features
2. Tune the models tried so far and try a voting classifier
3. Find and try other extra data outside the provided data (such as the data Kondo-san recommended)
Data Profiling (ydata_profiling)
Profile data using ydata_profiling.
- Create a profile report for the training dataset and upload the widget (or HTML) here
- Add correlation plots as well (if the report does not include them)