
wsae-lstm's Introduction

wsae-lstm

Repository that aims to implement the WSAE-LSTM model and replicate the results of said model as defined in "A deep learning framework for financial time series using stacked autoencoders and long-short term memory" by Wei Bao, Jun Yue, Yulei Rao (2017).

https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0180944

This implementation of the WSAE-LSTM model aims to address potential issues in the model implementation as defined by Bao et al. (2017), while also addressing issues in previous attempts to implement and replicate the results of said model (i.e., mlpanda/DeepLearning_Financial).

Source journal (APA)

Bao W, Yue J, Rao Y (2017). "A deep learning framework for financial time series using stacked autoencoders and long-short term memory". PLOS ONE 12(7): e0180944. https://doi.org/10.1371/journal.pone.0180944

Diagram illustrating the WSAE-LSTM model at an abstract level:

wsae lstm model funnel diagram

Source journal data (saved into data/raw folder as raw_data.xlsx):

DOI:10.6084/m9.figshare.5028110 https://figshare.com/articles/Raw_Data/5028110
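
For reference, one way to load the workbook once it is saved under data/raw; the assumption that each index sits on its own sheet is mine and may need adjusting:

import pandas as pd

# sheet_name=None loads every sheet into a dict of {sheet_name: DataFrame}
raw_sheets = pd.read_excel("data/raw/raw_data.xlsx", sheet_name=None)
print(list(raw_sheets.keys()))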

Repository structure

This repository uses a directory structure based upon Cookiecutter Data Science.

Repository package requirements/dependencies are defined in requirements.txt for pip and/or environment.yml for Anaconda/Conda.
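
For example, the environment can typically be set up with pip install -r requirements.txt (pip) or conda env create -f environment.yml (Conda).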

mlpanda/DeepLearning_Financial:

Repository of an existing attempt to replicate the above paper in PyTorch (mlpanda/DeepLearning_Financial), checked out as a git-subrepo for reference in the subrepos directory. This repository, subrepos/DeepLearning_Financial, will be used as a point of reference and comparison for specific components of wsae-lstm.

wsae-lstm's People

Contributors: timothyyu

wsae-lstm's Issues

Denoising with wavelet transform

I've been thinking about the data leakage from the wavelet transform; I'm not sure how to apply it to a live data stream. Denoise with the same modes? Scaling has a similar problem...
I think I'll try building the denoising into the AE: fitting the noisy (or even raw) data with the scaled, denoised data on the other side. Maybe it's a bit voodoo, but it's one of the main use cases of AEs. What do you think?
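
A minimal sketch of that idea, assuming a PyTorch setup where x_noisy and x_denoised are aligned float tensors of shape (samples, features) built from the scaled raw data and the scaled wavelet-denoised data; all names and sizes here are hypothetical:

import torch
import torch.nn as nn

class DenoisingAE(nn.Module):
    # Small fully connected autoencoder: noisy features in, denoised features out.
    def __init__(self, n_features, n_hidden=10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, n_hidden), nn.Sigmoid())
        self.decoder = nn.Linear(n_hidden, n_features)

    def forward(self, x):
        return self.decoder(self.encoder(x))

def train_denoising_ae(x_noisy, x_denoised, epochs=50, lr=1e-3):
    # Fit the AE to map the noisy/raw inputs onto the wavelet-denoised targets.
    model = DenoisingAE(x_noisy.shape[1])
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(model(x_noisy), x_denoised)
        loss.backward()
        optimizer.step()
    return model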

move scale_periods() out of models/wavelet.py; same applies to denoise_periods()

the scale_periods() function should not be in wsae_lstm/models/wavelet.py; it should be under wsae_lstm/features/scale_dataset.py.

The same applies to the denoise_periods() function - it shouldn't be under wavelet.py.

https://github.com/timothyyu/wsae-lstm/blob/master/wsae_lstm/models/wavelet.py#L46

import copy
import pandas as pd
from sklearn import preprocessing

def scale_periods(dict_dataframes):
    # Deep-copy the nested dict ({index name: {period: list of splits}}) so the
    # original dataframes are left untouched.
    ddi_scaled = dict()
    for index_name in dict_dataframes:
        ddi_scaled[index_name] = copy.deepcopy(dict_dataframes[index_name])
    for index_name in ddi_scaled:

        scaler = preprocessing.RobustScaler(with_centering=True)

        for period in ddi_scaled[index_name]:
            # Fit the scaler on the period's training split only...
            X_train = ddi_scaled[index_name][period][1]
            X_train_scaled = scaler.fit_transform(X_train)
            X_train_scaled_df = pd.DataFrame(X_train_scaled, columns=list(X_train.columns))

            # ...then reuse it to transform the validation and test splits.
            X_val = ddi_scaled[index_name][period][2]
            X_val_scaled = scaler.transform(X_val)
            X_val_scaled_df = pd.DataFrame(X_val_scaled, columns=list(X_val.columns))

            X_test = ddi_scaled[index_name][period][3]
            X_test_scaled = scaler.transform(X_test)
            X_test_scaled_df = pd.DataFrame(X_test_scaled, columns=list(X_test.columns))

            ddi_scaled[index_name][period][1] = X_train_scaled_df
            ddi_scaled[index_name][period][2] = X_val_scaled_df
            ddi_scaled[index_name][period][3] = X_test_scaled_df
    return ddi_scaled
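
For context, a runnable toy example of the nested layout the function expects (index name -> period -> list of splits, with indices 1-3 read as train/validate/test); the column names and sizes are purely illustrative:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def make_df(n):
    return pd.DataFrame(rng.normal(size=(n, 3)), columns=["open", "high", "low"])

# Index 0 is left unused here; scale_periods() only reads indices 1-3.
dict_dataframes = {"csi300": {1: [None, make_df(100), make_df(20), make_df(20)]}}
ddi_scaled = scale_periods(dict_dataframes)
print(ddi_scaled["csi300"][1][1].head())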

Regression epoch / autoencoder train epoch / total training epoch

Hi:)

First of all, thanks so much for sharing this code. It's very helpful :)
I was a bit confused about the number of training epochs, though.

It seems like the outer loop that starts with "for n in range(num_iterations):"
gives out a window of 600 days of data (rolling with a step size of 60) each time (1 iteration).

So if I'm done with all the iterations, that would be 1 training epoch, right? (Which is different from a regression epoch.)

That's when we go through the whole dataset once.

I have been looking at the code for a while but I couldn't figure it out :'(

I'd be grateful for your help.
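
For what it's worth, a hedged sketch of the rolling scheme described above, using the window and step sizes quoted in the question; the variable names and total length are hypothetical:

# One pass over all the 600-day windows (stepping 60 days each iteration)
# covers the whole dataset once; that outer pass is what the question calls
# one training epoch, while autoencoder/regression epochs are inner loops.
window, step = 600, 60
total_days = 2400  # hypothetical dataset length
num_iterations = (total_days - window) // step + 1
for n in range(num_iterations):
    start = n * step
    end = start + window
    # data[start:end] would be the chunk trained on in this iteration
    print(n, start, end)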

New complementary tool

My name is Luis. I'm a big-data/machine-learning developer, I'm a fan of your work, and I usually check your updates.

I was afraid that my savings would be eaten by inflation, so I have created a tool based on past technical patterns (volatility, moving averages, statistics, trends, candlesticks, support and resistance, stock index indicators):
all the ones you know (RSI, MACD, STOCH, Bollinger Bands, SMA, DeMark, Japanese candlesticks, Ichimoku, Fibonacci, Williams %R, balance of power, Murrey math, etc.) and more than 200 others.

The tool creates prediction models of correct trading points (buy signals and sell signals, so every stock is traded well in time and direction).
For this I have used big-data tools such as pandas and stock-market libraries such as tablib, TAcharts, and pandas_ta for data collection and calculation,
along with machine-learning libraries such as scikit-learn (RandomForest, GradientBoosting), XGBoost, Google TensorFlow, and TensorFlow LSTM.

With models trained on a selection of the best technical indicators, the tool is able to predict trading points (where to buy, where to sell) and send real-time alerts to Telegram or email. The points are calculated from the correct trading points of the last 2 years (including the change to a bear market after the rate hike).

I think it could be useful to you. I would like to share it with you so it can improve, and if you are interested in improving it and collaborating I am also willing; if not, just file it away.

dataset scaling/normalization before wavelet transform

The author of DeepLearning_Financial decided to forgo automated scaling/normalization and instead scaled the input features/dataset manually before applying the wavelet transform:

https://github.com/mlpanda/DeepLearning_Financial/blob/7e846144629d8b49b8fd74a87d5ff047b7af55d1/run_training.py#L55 :

 # This is a scaling of the inputs such that they are in an appropriate range    
    feats["Close Price"].loc[:] = feats["Close Price"].loc[:]/1000
    feats["Open Price"].loc[:] = feats["Open Price"].loc[:]/1000
    feats["High Price"].loc[:] = feats["High Price"].loc[:]/1000
    feats["Low Price"].loc[:] = feats["Low Price"].loc[:]/1000
    feats["Volume"].loc[:] = feats["Volume"].loc[:]/1000000
    feats["MACD"].loc[:] = feats["MACD"].loc[:]/10
    feats["CCI"].loc[:] = feats["CCI"].loc[:]/100
    feats["ATR"].loc[:] = feats["ATR"].loc[:]/100
    feats["BOLL"].loc[:] = feats["BOLL"].loc[:]/1000
    feats["EMA20"].loc[:] = feats["EMA20"].loc[:]/1000
    feats["MA10"].loc[:] = feats["MA10"].loc[:]/1000
    feats["MTM6"].loc[:] = feats["MTM6"].loc[:]/100
    feats["MA5"].loc[:] = feats["MA5"].loc[:]/1000
    feats["MTM12"].loc[:] = feats["MTM12"].loc[:]/100
    feats["ROC"].loc[:] = feats["ROC"].loc[:]/10
    feats["SMI"].loc[:] = feats["SMI"].loc[:] * 10
    feats["WVAD"].loc[:] = feats["WVAD"].loc[:]/100000000
    feats["US Dollar Index"].loc[:] = feats["US Dollar Index"].loc[:]/100
    feats["Federal Fund Rate"].loc[:] = feats["Federal Fund Rate"].loc[:]

https://github.com/mlpanda/DeepLearning_Financial/blob/7e846144629d8b49b8fd74a87d5ff047b7af55d1/run_training.py#L96 :

 # REMOVED THE NORMALIZATION AND MANUALLY SCALED TO APPROPRIATE VALUES ABOVE

    """
    scaler = StandardScaler().fit(feats_train)
    feats_norm_train = scaler.transform(feats_train)
    feats_norm_validate = scaler.transform(feats_validate)
    feats_norm_test = scaler.transform(feats_test)
    """
    """
    scaler = MinMaxScaler(feature_range=(0,1))
    scaler.fit(feats_train)
    feats_norm_train = scaler.transform(feats_train)
    feats_norm_validate = scaler.transform(feats_validate)
    feats_norm_test = scaler.transform(feats_test)
    """    

My main issues/concerns are the following:

  1. Manual scaling can work when you know the exact range of the dataset you're going to be working with, but this kind of scaling would not work on a live model (whether online or continuously batch-trained). In this case, a few values outside of the defined manual ranges for OHLC and the rest of the Panel B Technical Indicators would throw the scaling off.

  2. The source article/journal (Bao et al., 2017) does not go into detail about preprocessing their dataset beyond using the wavelet transform to denoise the dataset.

  3. Scaling != normalization, and there are different ways to scale and/or normalize data depending on the nature of the problem and model (and the nature of the dataset itself).

Thus:

More research is needed on scaling/normalization in the context of time series data for machine learning. In terms of code/practical implementation, I will most likely code multiple options for different training runs (with different scaling/normalization options) and then compare.
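
A rough sketch of what comparing those options might look like, assuming a plain train/validate/test split of a feature DataFrame; the scaler registry and function name are illustrative, not the repository's final design:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

SCALERS = {
    "minmax": MinMaxScaler(feature_range=(0, 1)),
    "robust": RobustScaler(with_centering=True),
    "standard": StandardScaler(),
}

def scale_split(train, validate, test, kind="robust"):
    # Fit on the training split only, then apply the same transform to
    # validation/test so no future information leaks into the fit.
    scaler = SCALERS[kind]
    cols = list(train.columns)
    return (pd.DataFrame(scaler.fit_transform(train), columns=cols),
            pd.DataFrame(scaler.transform(validate), columns=cols),
            pd.DataFrame(scaler.transform(test), columns=cols))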

Additionally, I may decide to contact one of the authors of the source article/journal for more insight into how they preprocessed the raw data (besides using the wavelet transform to denoise).


dual stage normalization and scaling

Careful attention is required for proper scaling/normalization of the Panel B and Panel C indicators in relation to the OHLC data (Panel A). When visualizing the train-validate-test split with matplotlib, some of the index types appear to show two or more lines, which shouldn't be possible, or at least that is what I thought was the case (I was very, very wrong):

[train-validate-test split plots]

It turns out the Panel C indicators, and the Panel B indicators for the other index types, are so far out of range that they flatten the other features in the visualization/plots:

[plots showing the out-of-range Panel B/C indicators flattening the other features]
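
One way around that flattening when plotting, sketched with hypothetical column data, is to give each feature its own subplot (or min-max scale each column purely for visualization):

import matplotlib.pyplot as plt
import pandas as pd

def plot_features(df):
    # One subplot per column so features with very different ranges
    # (e.g. Panel B/C indicators vs. OHLC prices) don't flatten each other.
    fig, axes = plt.subplots(len(df.columns), 1,
                             figsize=(10, 2 * len(df.columns)), sharex=True)
    for ax, col in zip(axes, df.columns):
        ax.plot(df.index, df[col])
        ax.set_ylabel(col)
    fig.tight_layout()
    plt.show()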

conda-build + anaconda env rebuild

conda-build being out of date is causing further environment updates/changes to be out of sync with the build index/package manager.

This needs to be fixed before going forward: pyyaml/parso need a new conda-build update plus a rebuilt Anaconda base environment.

"level" parameter in waveletSmooth function

Hi Timothy.
In reviewing your code I ran into issues using the waveletSmooth function (in the directory subrepos/models/wavelet). I think it might be a difference in our pywt versions, but the function was doing the wavelet decomposition along the features axis rather than separately for each feature along its time series.
After fixing it, I noticed that the "level" parameter was only in charge of thresholding the detail coefficients using the median of the "level" detail coefficient.
I'm hardly a wavelet expert, and have only learned it now for this algorithm, but I changed your code to threshold coefficients according to their own level's median, because that was done in all the denoising sources I have seen.
Could you explain your reasoning in choosing one level for all cD thresholding?
Cheers!
Danny
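
For what it's worth, a hedged sketch of per-level thresholding applied separately to each feature's time series, assuming pywt and a (time, features) NumPy array; the wavelet, level, and threshold rule are illustrative rather than what either repository actually uses:

import numpy as np
import pywt

def wavelet_denoise_column(x, wavelet="haar", level=2):
    # Decompose one time series, soft-threshold each detail level using that
    # level's own median-based noise estimate, then reconstruct.
    coeffs = pywt.wavedec(x, wavelet, level=level)
    denoised = [coeffs[0]]  # keep the approximation coefficients as-is
    for cD in coeffs[1:]:
        sigma = np.median(np.abs(cD)) / 0.6745           # per-level noise estimate
        threshold = sigma * np.sqrt(2 * np.log(len(x)))  # universal threshold
        denoised.append(pywt.threshold(cD, threshold, mode="soft"))
    return pywt.waverec(denoised, wavelet)[: len(x)]

def wavelet_denoise(data):
    # Apply the 1-D denoising to each feature (column) separately,
    # i.e. along the time axis rather than across features.
    return np.column_stack([wavelet_denoise_column(data[:, j]) for j in range(data.shape[1])])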

how to load the scaled_denoised_data into lstm?

Hi,

Thanks for your amazing work. I appreciate the revised content a lot.

The revised pieces successfully handle the scaling, waveletSmooth, and stacked autoencoder portions.

May I ask how to integrate that output into the final LSTM portion?

Marcus
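
One common way to wire that up, sketched here with PyTorch under the assumption that the stacked autoencoder's encoded output is a (num_days, num_features) array and the LSTM predicts the next day's target from a fixed lookback window; every name and size below is hypothetical rather than the repository's actual interface:

import numpy as np
import torch
import torch.nn as nn

def make_sequences(encoded, targets, lookback=4):
    # Slice the encoded features into overlapping (lookback, num_features)
    # windows, pairing each window with the next day's target value.
    X, y = [], []
    for t in range(lookback, len(encoded)):
        X.append(encoded[t - lookback:t])
        y.append(targets[t])
    return (torch.tensor(np.array(X), dtype=torch.float32),
            torch.tensor(np.array(y), dtype=torch.float32))

class LSTMRegressor(nn.Module):
    def __init__(self, n_features, hidden_size=32):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):
        out, _ = self.lstm(x)         # out: (batch, lookback, hidden)
        return self.head(out[:, -1])  # predict from the last time step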

missing data for certain date ranges/index data (CSI300 index)

From the source article, which defines the train-validate-test split arrangement for continuous training (Fig 7):
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0180944#pone-0180944-g007

Taking a closer look at the dataset for the CSI300 index shows that data is missing for certain date ranges (screenshot omitted).

This may be an error in the authors' data-scrape methodology, a data endpoint issue, or simply that no market data was available for those ranges.

Action:
Investigate further; if possible, query the same data source the authors used and compare results.
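
A small sketch of how such gaps could be located programmatically, assuming the CSI300 frame ends up with a daily DatetimeIndex (the sheet/column names are hypothetical); note that a plain business-day calendar will also flag exchange holidays, so the output needs manual review:

import pandas as pd

def find_missing_business_days(df):
    # Compare the frame's dates against a full business-day calendar over the
    # same span; anything in the calendar but missing from the data is a gap.
    expected = pd.bdate_range(df.index.min(), df.index.max())
    return expected.difference(df.index)

# Hypothetical usage once the raw workbook is loaded:
# csi300 = pd.read_excel("data/raw/raw_data.xlsx", sheet_name="CSI300",
#                        index_col=0, parse_dates=True)
# print(find_missing_business_days(csi300))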

Meaning of Data Columns - time, Ntime and BOLL

Hi Timothy. I've looked at the dataset and couldn't understand what these columns mean. I know what Bollinger bands are, but manually calculating the top and bottom bands didn't yield the BOLL series. 'time' and 'Ntime' are counted from before there were humans... Any idea?
