This is the repo for a machine learning model-building project that takes cryptocurrency data as input and uses supervised (regression, classification) and unsupervised (density estimation, clustering) machine learning algorithms to find interesting patterns in the dataset.
Regression (1 model, linear regression: 2-3 people):
- Function: A benchmark to show that clustering performs better.
- Tasks: a. Identify factors that influence our cryptocurrencies' prices (S&P 500, etc.). b. Choose three cryptocurrencies to present. c. Finish in one week.
1. One model
2. Focus on one kind of cryptocurrency or the top 50/100 by market cap
3. Deadline of presentation: 11/17/2022
Machine learning in finance is now considered a key aspect of several financial services and applications, including managing assets, evaluating levels of risk, calculating credit scores, and even approving loans. Machine learning is a subset of data science that provides the ability to learn and improve from experience without being explicitly programmed.
In this project, we will explore how unsupervised learning algorithms (clustering) can be applied to cryptocurrency data and assess their performance to see whether they reveal any interesting discoveries.
You can find the dataset here: web_link
The dataset has 1243590 entries and 12 columns:
- time_open
- time_high: Time the cryptocurrency reaches its highest price.
- time_low: Time the cryptocurrency reaches its lowest price.
- quote.USD.open
- quote.USD.high
- quote.USD.low
- quote.USD.close
- quote.USD.volume
- quote.USD.market_cap: The total market value of a cryptocurrency's circulating supply. It is analogous to the free-float capitalization in the stock market.
- quote.USD.timestamp
- symbol: The symbol of the cryptocurrency.
- id: Together with symbol, forms the unique identifier of a cryptocurrency.
We might not try out all machine learning algorithms in the first stage. We might focus on unsupervised learning algorithms such as clustering.
- README file and some EDA work
- Assign work
- Building models
- Interpret the results
- Make PPT
- K-means finished
- I did some EDA work and feature engineering on our data
- extracted minute and second as new features from time_high and time_low
- dropped the other categorical columns
- standardized all numerical columns, since distance matters in our model
- Remaining to do:
- result interpretation
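
The feature engineering steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual notebook: the data frame is synthetic with made-up values, the helper `add_time_features` is a hypothetical name, and `n_clusters=3` is an assumption chosen only for the example.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real dataset (all values are made up).
rng = np.random.default_rng(0)
n = 200
base = pd.Timestamp("2022-01-01")
df = pd.DataFrame({
    "quote.USD.close": rng.lognormal(3.0, 1.0, n),
    "quote.USD.volume": rng.lognormal(10.0, 1.0, n),
    "symbol": rng.choice(["BTC", "ETH", "BNB", "ADA"], n),
})
df["time_high"] = base + pd.to_timedelta(rng.integers(0, 86_400, n), unit="s")
df["time_low"] = base + pd.to_timedelta(rng.integers(0, 86_400, n), unit="s")


def add_time_features(frame):
    """Add minute and second columns derived from time_high and time_low."""
    out = frame.copy()
    for col in ("time_high", "time_low"):
        ts = pd.to_datetime(out[col])
        out[f"{col}_minute"] = ts.dt.minute
        out[f"{col}_second"] = ts.dt.second
    return out


features = add_time_features(df)

# Keep numerical columns only; this drops symbol and the raw timestamps.
numeric = features.select_dtypes(include="number")

# Standardize so that no single column dominates the Euclidean distance.
scaled = StandardScaler().fit_transform(numeric)

# n_clusters=3 is an assumption for this sketch, not a tuned value.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scaled)
print(pd.Series(labels).value_counts())
```

Standardizing before K-means matters here because columns such as volume are orders of magnitude larger than the minute and second features and would otherwise dominate the distance computation.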
- What we mainly did were the steps before Stacey's fancy plots!
- checked the raw data and dropped the missing values after testing.
- added a new column combining the symbol and id.
- extracted the date from the timestamp.
- drew a rough plot and made a few assumptions about clustering.
- Remaining to do:
- building new models
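
A rough sketch of the cleaning steps listed above, using only pandas. The rows are made up for illustration, and the column names `symbol_id` and `date` and the helper `clean` are hypothetical; the `SYMBOL_id` format is assumed from the selected coin names in this README (for example `BTC_1`).

```python
import numpy as np
import pandas as pd

# A few made-up rows with the same column names as the dataset.
raw = pd.DataFrame({
    "symbol": ["BTC", "ETH", "BNB", "ADA", "BTC"],
    "id": [1, 1027, 1839, 2010, 1],
    "quote.USD.close": [20000.0, 1500.0, np.nan, 0.4, 20500.0],
    "quote.USD.timestamp": [
        "2022-11-01T23:59:59.999Z",
        "2022-11-01T23:59:59.999Z",
        "2022-11-01T23:59:59.999Z",
        "2022-11-01T23:59:59.999Z",
        "2022-11-02T23:59:59.999Z",
    ],
})


def clean(frame):
    """Drop rows with missing values, then add symbol_id and date columns."""
    out = frame.dropna().copy()
    # Combine symbol and id into one key, e.g. "BTC_1".
    out["symbol_id"] = out["symbol"] + "_" + out["id"].astype(str)
    # Keep only the calendar date of the timestamp.
    out["date"] = pd.to_datetime(out["quote.USD.timestamp"]).dt.date
    return out.reset_index(drop=True)


cleaned = clean(raw)
print(cleaned[["symbol_id", "date", "quote.USD.close"]])
```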
- worked with Stacey and Chenxi on EDA before clustering
- Built GMM and DBSCAN models on the data frame produced by Petrick's feature engineering
- GMM package (model and probabilities)
- DBSCAN: found and visualized the best eps and min_samples
- DBSCAN result: with eps=1.5, min_samples=4, and data=df[0:10000], we have 3 clusters: cluster 0, cluster 1, and cluster 2
- DBSCAN result: Cluster -1 is the noise
- Wrote PowerPoint slides for the introduction, interpretation, and conclusion of the DBSCAN model and fixed some formatting problems in the presentation slides
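
One common way to look for a reasonable DBSCAN `eps` is the k-distance heuristic, sketched below together with GMM membership probabilities. This is only an illustration under stated assumptions: the data is synthetic (`make_blobs`), the 95th-percentile cut is a crude stand-in for reading the elbow off a plot, and it is not necessarily how the eps=1.5 value above was chosen.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Synthetic, standardized data standing in for the engineered features.
X, _ = make_blobs(n_samples=600, centers=3, cluster_std=0.6, random_state=0)
X = StandardScaler().fit_transform(X)

min_samples = 4

# k-distance heuristic: distance from each point to its min_samples-th
# nearest neighbour. Querying the training data counts each point as its
# own neighbour, which matches DBSCAN's min_samples convention.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)
k_dist = np.sort(distances[:, -1])

# Assumption: take a high quantile instead of inspecting the elbow by eye.
eps = float(np.quantile(k_dist, 0.95))

labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))  # label -1 marks noise points
print(f"eps={eps:.3f} clusters={n_clusters} noise={n_noise}")

# GMM gives soft assignments: one probability per component for each row.
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
proba = gmm.predict_proba(X)
print(proba[:3].round(3))
```

In practice the sorted `k_dist` curve would be plotted and `eps` picked near the point where it bends sharply upward.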
- EDA for the whole dataset finished
- On the basis of Steven's multi-model fitting, random forest was selected for further optimization.
- Pull BTC price data directly from the parquet file
- Routine and targeted data processing
- Group the data into seven-day windows and restructure it for feature extraction
- Extract feature values using tsfresh
- Use train_test_split to partition the data into training and testing sets
- Train the model
- Use the model to make predictions
- Evaluate the models through numerical metrics and visualization
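
The pipeline above could look roughly like the sketch below. Several things here are assumptions rather than the project's actual code: the price series is a synthetic random walk, tsfresh is replaced by a handful of hand-computed window statistics to keep the example dependency-light, the target is taken to be the next day's close, a regressor is used, and `shuffle=False` is chosen so the test set lies after the training set in time. The helper `window_features` is a hypothetical name.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic random-walk closing prices standing in for BTC (made up).
rng = np.random.default_rng(0)
close = pd.Series(20000 + rng.normal(0, 200, 400).cumsum())

WINDOW = 7  # seven-day groups, as in the steps above


def window_features(series, window=WINDOW):
    """Summarize each 7-day window; the target is the following day's close."""
    values = series.to_numpy()
    rows, targets = [], []
    for end in range(window, len(values)):
        w = values[end - window:end]
        rows.append({
            "mean": w.mean(),
            "std": w.std(),
            "min": w.min(),
            "max": w.max(),
            "last": w[-1],
            # Slope of a straight-line fit over the window.
            "slope": np.polyfit(np.arange(window), w, 1)[0],
        })
        targets.append(values[end])
    return pd.DataFrame(rows), np.array(targets)


X, y = window_features(close)

# shuffle=False keeps chronological order (an assumption for this sketch).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
pred = model.predict(X_test)

mae = mean_absolute_error(y_test, pred)
print(f"MAE on held-out windows: {mae:.2f}")
```

With the real data, the hand-computed statistics would be replaced by the tsfresh feature table, and the predictions could be plotted against `y_test` for the visual evaluation.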
Selected coins: "BTC_1", "ETH_1027", "BNB_1839", "ADA_2010"