- Find a dataset for experimenting with tick calculations
- {GoogleDrive}/SP500_2000-2019_CMEGroup.csv https://www.kaggle.com/datasets/finnhub/sp-500-futures-tick-data-sp?resource=download
- Investigate how to randomly generate financial test data from some distribution
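For the last item, a minimal sketch that samples a synthetic price series from geometric Brownian motion, a common first choice of distribution for test price data. All parameter values (drift, volatility, step count) are illustrative assumptions:

```python
import numpy as np

def simulate_gbm_prices(s0=100.0, mu=0.05, sigma=0.2, n_steps=252, dt=1 / 252, seed=0):
    """Simulate one price path under geometric Brownian motion.

    s0: initial price, mu: annual drift, sigma: annual volatility,
    n_steps: number of steps, dt: step size in years.
    """
    rng = np.random.default_rng(seed)
    # Under GBM, log-returns are i.i.d. normal.
    log_returns = (mu - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * rng.standard_normal(n_steps)
    prices = s0 * np.exp(np.cumsum(log_returns))
    return np.concatenate([[s0], prices])

prices = simulate_gbm_prices()
print(len(prices), prices[0])
```

Swapping the normal log-returns for a heavier-tailed distribution (e.g. Student's t) would give more realistic test data for tick-level experiments.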
dsc_study's Introduction
Baseline with target encoding
- Organize code and upload
Exploratory data analysis
- Read paper and find what is preserved in the artificial data
- What can we expect to hold in the test data
- We do not need to focus that much on improving the score on the validation dataset
- train data / validation data / test data / original data
- List and compare all important features in the train / original data
original kaggle data
{private_drive}/FDUAcomp/SBAnational.csv
Compare models as baseline
Databricks
Todo
- Register on SIGNATE ⇒ wait for the Azure account to be issued
- Log in to Azure Databricks
Links
- Jan 30, 18:00-20:00: hands-on session
- Cloud version
(Reference links for usage) - Product overview
- Case studies
- Global financial use cases
MS (takazawa) - EDA: relationship of newly generated features to the dependent variable
Correlation
- df_train except for one-hot columns
- df_train including one-hot columns
List todos for the initial data analysis
Please check when you create the issue
- draw histograms
- ydata_profiling (https://docs.profiling.ydata.ai/latest/)
- dataiku
- databricks
- tableau
- create baseline with famous models
Tableau
- Install Tableau and activate account
- Read data
- List what can be done with Tableau
- Check what we need to create for the prize.
Enable dependabot and CodeQL
Enable Dependabot and CodeQL in the Security tab.
Which tools will we use?
I am Takashi OTSU. I have been absent for the past two sessions.
Which tools will we use, if any?
I am new to working with and analyzing this kind of data.
In this competition, we first need to visualize and preprocess the data, so I suggest we use a common method or common tools within this team.
My opinion is to use Dataiku, which lets us build workflows without code; it might be appropriate for making initial models.
Do you have any suggestions?
Team formation
plan A
- Mathematical and Statistical team
- Neural network and Applied techs team
plan B
- Pure Data Team
- Alternative Data team
plan C
- language driven
[MS] Features for Ooyama-san's dataiku
Please add a comment below if you passed your features to Ooyama-san, and note where the code is located.
Baseline writeups shared in the FDUA community
Shared baseline writeup
- nishimoto
- takaito
Oversampling
- SMOTE
- ADASYN
- GAN
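For SMOTE, one would normally use the imbalanced-learn library; as a self-contained illustration of the underlying idea, here is a plain-NumPy sketch (not the library implementation) that synthesizes minority samples by interpolating between nearest neighbors:

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, seed=0):
    """SMOTE sketch: create n_new synthetic minority samples by interpolating
    between a randomly picked sample and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    # Pairwise Euclidean distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(d, axis=1)[:, :k]
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)                            # random minority sample
        j = neighbors[i, rng.integers(min(k, n - 1))]  # one of its neighbors
        gap = rng.random()                             # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

X_min = np.random.default_rng(1).standard_normal((10, 3))  # toy minority class
X_new = smote_oversample(X_min, n_new=7)
print(X_new.shape)
```

ADASYN differs mainly in how many synthetic points it generates per sample (weighted by local majority-class density), so the interpolation step above carries over.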
Common Check Data Distribution
- Check distribution of each variable
- Check correlation between variables
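Both checks above are one-liners in pandas; a sketch on a toy frame (the column names are invented stand-ins for the competition data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the competition data (all column names invented).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "loan_amount": rng.lognormal(10, 1, 500),
    "term_months": rng.integers(12, 120, 500),
})
df["default_flag"] = (df["loan_amount"] > df["loan_amount"].median()).astype(int)

summary = df.describe()             # per-variable distribution summary
corr = df.corr(numeric_only=True)   # pairwise Pearson correlations
print(summary.loc[["mean", "std"]])
print(corr)
```

`df.hist()` gives quick per-column histograms for the distribution check as well.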
Dataiku
- join dataiku slack channel
- install
- Data import
- check functions of dataiku
- calculate statistical information
- try AutoML
Data preprocessing
- dollar to float
- split date to year, month, day
- 0, Y, N -> 0, 1, -1
- onehot encoding for place
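The four steps above can be sketched in pandas as follows; the column names and example values are illustrative assumptions modeled on the SBA loan table, not the exact schema:

```python
import pandas as pd

# Toy rows mimicking the loan data format (values are illustrative).
df = pd.DataFrame({
    "DisbursementGross": ["$50,000.00", "$1,250.50"],
    "ApprovalDate": ["2006-03-15", "1999-11-02"],
    "LowDoc": ["Y", "N"],
    "State": ["CA", "NY"],
})

# dollar string -> float
df["DisbursementGross"] = (
    df["DisbursementGross"].str.replace("[$,]", "", regex=True).astype(float)
)

# split date into year / month / day
dates = pd.to_datetime(df["ApprovalDate"])
df["year"], df["month"], df["day"] = dates.dt.year, dates.dt.month, dates.dt.day

# 0, Y, N -> 0, 1, -1
df["LowDoc"] = df["LowDoc"].map({0: 0, "0": 0, "Y": 1, "N": -1})

# one-hot encode the place column
df = pd.get_dummies(df, columns=["State"])
print(df.columns.tolist())
```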
MS (takazawa) - Try first simple model
Implement the first simple model using some natural encoding.
- Target-encode some categorical variables with a large number of categories.
- Use LightGBM
- Upload the code and record submission performance
- Make data-cleaning function
nishimoto
Geo Encoding
MS LASSO
- Apply LASSO with some fixed regularization
- Hyperparameter tuning
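A minimal sketch of the first step on synthetic regression data (the fixed `alpha` is an arbitrary choice to be tuned later); for the binary loan target, `LogisticRegression(penalty="l1", solver="liblinear")` would be the classification analogue:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
# Only the first two features actually matter.
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.standard_normal(200)

# LASSO is scale-sensitive, so standardize the features first.
Xs = StandardScaler().fit_transform(X)
model = Lasso(alpha=0.1).fit(Xs, y)  # alpha fixed for now; tune later
print("nonzero coefficients:", int(np.sum(model.coef_ != 0)))
```

The L1 penalty drives the coefficients of the irrelevant features to exactly zero, which is the feature-selection behavior that makes LASSO useful here.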
Normalization
MS (takazawa) - EDA: relationship to dependent variable
- Organize and upload the code
- (Get county transformation done for a better visualization).
Exploratory data analysis (EDA) and baseline of three models
Except for the date features, I have analyzed the remaining features and reached the following conclusions:
1. 'NewExist', 'RevLineCr', and 'Sector' are the features that show a difference between repayers and defaulters, as follows:
1) Companies that already exist are more likely not to repay the loan.
2) Companies considered to have no revolving line of credit are the group most likely to repay the loan.
3) Companies in sector 11 are the group most likely to repay the loan, and companies in sector 49 are the least likely.
Things I am going to do next:
1. Do more EDA on the date features
2. Tune the models tried so far and try a voting classifier
3. Find and try other extra data outside the provided data (such as the data Kondo-san recommended)
Data Profiling (ydata_profiling)
Profile data using ydata_profiling.
- Create a profile report for the training dataset and upload the widget (or HTML) here
- Add correlation plots as well (if the report does not include them)