
wsae-lstm's Introduction

wsae-lstm

Repository that aims to implement the WSAE-LSTM model and replicate the results of said model as defined in "A deep learning framework for financial time series using stacked autoencoders and long-short term memory" by Wei Bao, Jun Yue, Yulei Rao (2017).

https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0180944

This implementation of the WSAE-LSTM model aims to address potential issues in the model implementation as defined by Bao et al. (2017), while also addressing issues in previous attempts to implement and replicate the results of said model (i.e., mlpanda/DeepLearning_Financial).

Source journal (APA)

Bao W, Yue J, Rao Y (2017). "A deep learning framework for financial time series using stacked autoencoders and long-short term memory". PLOS ONE 12(7): e0180944. https://doi.org/10.1371/journal.pone.0180944

Diagram illustrating the WSAE-LSTM model at an abstract level:

wsae lstm model funnel diagram

Source journal data (saved into data/raw folder as raw_data.xlsx):

DOI:10.6084/m9.figshare.5028110 https://figshare.com/articles/Raw_Data/5028110
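
For reference, one way to load the workbook once it is saved under data/raw; the assumption that each index sits on its own sheet is mine and may need adjusting:

import pandas as pd

# sheet_name=None loads every sheet into a dict of {sheet_name: DataFrame}
raw_sheets = pd.read_excel("data/raw/raw_data.xlsx", sheet_name=None)
print(list(raw_sheets.keys()))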

Repository structure

This repository uses a directory structure based upon Cookiecutter Data Science.

Repository package requirements/dependencies are defined in requirements.txt for pip and/or environment.yml for Anaconda/Conda.
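
For example, the environment can typically be set up with pip install -r requirements.txt (pip) or conda env create -f environment.yml (Conda).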

mlpanda/DeepLearning_Financial:

Repository of an existing attempt to replicate the above paper in PyTorch (mlpanda/DeepLearning_Financial), checked out as a git-subrepo for reference in the subrepos directory. This repository, subrepos/DeepLearning_Financial, will be used as a point of reference and comparison for specific components of wsae-lstm.

wsae-lstm's People

Contributors: timothyyu

wsae-lstm's Issues

Denoising with wavelet transform

I've been thinking about the data leakage from the wavelet transform; I'm not sure how to apply it to a live data stream. Denoise with the same modes? Scaling has a similar problem...
I think I'll try building the denoising into the AE: fitting the noisy (or even raw) data with the scaled, denoised data on the other side. Maybe it's a bit voodoo, but it's one of the main use cases of AEs. What do you think?
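
A minimal sketch of that idea, assuming a PyTorch setup where x_noisy and x_denoised are aligned float tensors of shape (samples, features) built from the scaled raw data and the scaled wavelet-denoised data; all names and sizes here are hypothetical:

import torch
import torch.nn as nn

class DenoisingAE(nn.Module):
    # Small fully connected autoencoder: noisy features in, denoised features out.
    def __init__(self, n_features, n_hidden=10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, n_hidden), nn.Sigmoid())
        self.decoder = nn.Linear(n_hidden, n_features)

    def forward(self, x):
        return self.decoder(self.encoder(x))

def train_denoising_ae(x_noisy, x_denoised, epochs=50, lr=1e-3):
    # Fit the AE to map the noisy/raw inputs onto the wavelet-denoised targets.
    model = DenoisingAE(x_noisy.shape[1])
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(model(x_noisy), x_denoised)
        loss.backward()
        optimizer.step()
    return model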

move scale_periods() out of models/wavelet.py; same applies to denoise_periods()

the scale_periods() function should not be in wsae_lstm/models/wavelet.py; it should be under wsae_lstm/features/scale_dataset.py.

The same applies to the denoise_periods() function - it shouldn't be under wavelet.py.

https://github.com/timothyyu/wsae-lstm/blob/master/wsae_lstm/models/wavelet.py#L46

import copy
import pandas as pd
from sklearn import preprocessing

def scale_periods(dict_dataframes):
    # Deep-copy the nested dict ({index name: {period: list of splits}}) so the
    # original dataframes are left untouched.
    ddi_scaled = dict()
    for index_name in dict_dataframes:
        ddi_scaled[index_name] = copy.deepcopy(dict_dataframes[index_name])
    for index_name in ddi_scaled:

        scaler = preprocessing.RobustScaler(with_centering=True)

        for period in ddi_scaled[index_name]:
            # Fit the scaler on the period's training split only...
            X_train = ddi_scaled[index_name][period][1]
            X_train_scaled = scaler.fit_transform(X_train)
            X_train_scaled_df = pd.DataFrame(X_train_scaled, columns=list(X_train.columns))

            # ...then reuse it to transform the validation and test splits.
            X_val = ddi_scaled[index_name][period][2]
            X_val_scaled = scaler.transform(X_val)
            X_val_scaled_df = pd.DataFrame(X_val_scaled, columns=list(X_val.columns))

            X_test = ddi_scaled[index_name][period][3]
            X_test_scaled = scaler.transform(X_test)
            X_test_scaled_df = pd.DataFrame(X_test_scaled, columns=list(X_test.columns))

            ddi_scaled[index_name][period][1] = X_train_scaled_df
            ddi_scaled[index_name][period][2] = X_val_scaled_df
            ddi_scaled[index_name][period][3] = X_test_scaled_df
    return ddi_scaled
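
For context, a runnable toy example of the nested layout the function expects (index name -> period -> list of splits, with indices 1-3 read as train/validate/test); the column names and sizes are purely illustrative:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def make_df(n):
    return pd.DataFrame(rng.normal(size=(n, 3)), columns=["open", "high", "low"])

# Index 0 is left unused here; scale_periods() only reads indices 1-3.
dict_dataframes = {"csi300": {1: [None, make_df(100), make_df(20), make_df(20)]}}
ddi_scaled = scale_periods(dict_dataframes)
print(ddi_scaled["csi300"][1][1].head())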

Regression epoch / autoencoder train epoch / total training epoch

Hi:)

First of all, thanks so much for sharing this code. It's very helpful :)
I was a bit confused about the number of training epochs, though.

It seems like the outer loop that starts with "for n in range(num_iterations):"
gives out a window of 600 days of data (rolling with a step size of 60) each time (1 iteration).

So if I'm done with all the iterations, that would be 1 training epoch, right? (Which is different from a regression epoch.)

That's when we go through the whole dataset once.

I have been looking at the code for a while but I couldn't figure it out :'(

I'd be grateful for your help.
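
For what it's worth, a hedged sketch of the rolling scheme described above, using the window and step sizes quoted in the question; the variable names and total length are hypothetical:

# One pass over all the 600-day windows (stepping 60 days each iteration)
# covers the whole dataset once; that outer pass is what the question calls
# one training epoch, while autoencoder/regression epochs are inner loops.
window, step = 600, 60
total_days = 2400  # hypothetical dataset length
num_iterations = (total_days - window) // step + 1
for n in range(num_iterations):
    start = n * step
    end = start + window
    # data[start:end] would be the chunk trained on in this iteration
    print(n, start, end)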

New complementary tool

My name is Luis. I'm a big-data/machine-learning developer, I'm a fan of your work, and I usually check your updates.

I was afraid that my savings would be eaten by inflation, so I have created a tool based on past technical patterns (volatility, moving averages, statistics, trends, candlesticks, support and resistance, stock index indicators):
all the ones you know (RSI, MACD, STOCH, Bollinger Bands, SMA, DeMark, Japanese candlesticks, Ichimoku, Fibonacci, Williams %R, balance of power, Murrey math, etc.) and more than 200 others.

The tool creates prediction models of correct trading points (buy signals and sell signals, so every stock is traded well in time and direction).
For this I have used big-data tools such as pandas and stock-market libraries such as tablib, TAcharts, and pandas_ta for data collection and calculation,
along with machine-learning libraries such as scikit-learn (RandomForest, GradientBoosting), XGBoost, Google TensorFlow, and TensorFlow LSTM.

With models trained on a selection of the best technical indicators, the tool is able to predict trading points (where to buy, where to sell) and send real-time alerts to Telegram or email. The points are calculated from the correct trading points of the last 2 years (including the change to a bear market after the rate hike).

I think it could be useful to you. I would like to share it with you so it can improve, and if you are interested in improving it and collaborating I am also willing; if not, just file it away.

dataset scaling/normalization before wavelet transform

The author of DeepLearning_Financial decided to forgo automated scaling/normalization and instead scaled the input features/dataset manually before applying the wavelet transform:

https://github.com/mlpanda/DeepLearning_Financial/blob/7e846144629d8b49b8fd74a87d5ff047b7af55d1/run_training.py#L55 :

 # This is a scaling of the inputs such that they are in an appropriate range    
    feats["Close Price"].loc[:] = feats["Close Price"].loc[:]/1000
    feats["Open Price"].loc[:] = feats["Open Price"].loc[:]/1000
    feats["High Price"].loc[:] = feats["High Price"].loc[:]/1000
    feats["Low Price"].loc[:] = feats["Low Price"].loc[:]/1000
    feats["Volume"].loc[:] = feats["Volume"].loc[:]/1000000
    feats["MACD"].loc[:] = feats["MACD"].loc[:]/10
    feats["CCI"].loc[:] = feats["CCI"].loc[:]/100
    feats["ATR"].loc[:] = feats["ATR"].loc[:]/100
    feats["BOLL"].loc[:] = feats["BOLL"].loc[:]/1000
    feats["EMA20"].loc[:] = feats["EMA20"].loc[:]/1000
    feats["MA10"].loc[:] = feats["MA10"].loc[:]/1000
    feats["MTM6"].loc[:] = feats["MTM6"].loc[:]/100
    feats["MA5"].loc[:] = feats["MA5"].loc[:]/1000
    feats["MTM12"].loc[:] = feats["MTM12"].loc[:]/100
    feats["ROC"].loc[:] = feats["ROC"].loc[:]/10
    feats["SMI"].loc[:] = feats["SMI"].loc[:] * 10
    feats["WVAD"].loc[:] = feats["WVAD"].loc[:]/100000000
    feats["US Dollar Index"].loc[:] = feats["US Dollar Index"].loc[:]/100
    feats["Federal Fund Rate"].loc[:] = feats["Federal Fund Rate"].loc[:]

https://github.com/mlpanda/DeepLearning_Financial/blob/7e846144629d8b49b8fd74a87d5ff047b7af55d1/run_training.py#L96 :

 # REMOVED THE NORMALIZATION AND MANUALLY SCALED TO APPROPRIATE VALUES ABOVE

    """
    scaler = StandardScaler().fit(feats_train)
    feats_norm_train = scaler.transform(feats_train)
    feats_norm_validate = scaler.transform(feats_validate)
    feats_norm_test = scaler.transform(feats_test)
    """
    """
    scaler = MinMaxScaler(feature_range=(0,1))
    scaler.fit(feats_train)
    feats_norm_train = scaler.transform(feats_train)
    feats_norm_validate = scaler.transform(feats_validate)
    feats_norm_test = scaler.transform(feats_test)
    """    

My main issues/concerns are the following:

  1. Manual scaling can work when you know the exact range of the dataset you're going to be working with, but this kind of scaling would not work on a live model (whether online or continuously batch-trained). In this case, a few values outside of the defined manual ranges for OHLC and the rest of the Panel B Technical Indicators would throw the scaling off.

  2. The source article/journal (Bao et al., 2017) does not go into detail about preprocessing their dataset beyond using the wavelet transform to denoise the dataset.

  3. Scaling != normalization, and there are different ways to scale and/or normalize data depending on the nature of the problem and model (and the nature of the dataset itself).

Thus:

More research is needed on scaling/normalization in the context of time series data for machine learning. In terms of code/practical implementation, I will most likely code multiple options for different training runs (with different scaling/normalization options) and then compare.
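
A rough sketch of what comparing those options might look like, assuming a plain train/validate/test split of a feature DataFrame; the scaler registry and function name are illustrative, not the repository's final design:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

SCALERS = {
    "minmax": MinMaxScaler(feature_range=(0, 1)),
    "robust": RobustScaler(with_centering=True),
    "standard": StandardScaler(),
}

def scale_split(train, validate, test, kind="robust"):
    # Fit on the training split only, then apply the same transform to
    # validation/test so no future information leaks into the fit.
    scaler = SCALERS[kind]
    cols = list(train.columns)
    return (pd.DataFrame(scaler.fit_transform(train), columns=cols),
            pd.DataFrame(scaler.transform(validate), columns=cols),
            pd.DataFrame(scaler.transform(test), columns=cols))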

Additionally, I may decide to contact one of the authors of the source article/journal for more insight into how they preprocessed the raw data (besides using the wavelet transform to denoise).


dual stage normalization and scaling

Careful attention is required for proper scaling/normalization of the Panel B and Panel C indicators in relation to the OHLC data (Panel A). When visualizing the train-validate-test split with matplotlib, some of the index types appear to show two or more lines, which shouldn't be possible, or at least that is what I thought was the case (I was very, very wrong):

[train-validate-test split plots]

It turns out the Panel C indicators, and the Panel B indicators for the other index types, are so far out of range that they flatten the other features in the visualization/plots:

[plots showing the out-of-range Panel B/C indicators flattening the other features]
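
One way around that flattening when plotting, sketched with hypothetical column data, is to give each feature its own subplot (or min-max scale each column purely for visualization):

import matplotlib.pyplot as plt
import pandas as pd

def plot_features(df):
    # One subplot per column so features with very different ranges
    # (e.g. Panel B/C indicators vs. OHLC prices) don't flatten each other.
    fig, axes = plt.subplots(len(df.columns), 1,
                             figsize=(10, 2 * len(df.columns)), sharex=True)
    for ax, col in zip(axes, df.columns):
        ax.plot(df.index, df[col])
        ax.set_ylabel(col)
    fig.tight_layout()
    plt.show()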

conda-build + anaconda env rebuild

conda-build being out of date is causing further environment updates/changes to be out of sync with the build index/package manager.

This needs to be fixed before going forward: pyyaml/parso need a new conda-build update plus a rebuilt Anaconda base environment.

"level" parameter in waveletSmooth function

Hi Timothy.
In reviewing your code I ran into issues using the waveletSmooth function (in the directory subrepos/models/wavelet). I think it might be a difference in our pywt versions, but the function was doing the wavelet decomposition along the features axis rather than separately for each feature along its time series.
After fixing it, I noticed that the "level" parameter was only in charge of thresholding the detail coefficients using the median of the "level" detail coefficient.
I'm hardly a wavelet expert, and have only learned it now for this algorithm, but I changed your code to threshold coefficients according to their own level's median, because that was done in all the denoising sources I have seen.
Could you explain your reasoning in choosing one level for all cD thresholding?
Cheers!
Danny
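
For what it's worth, a hedged sketch of per-level thresholding applied separately to each feature's time series, assuming pywt and a (time, features) NumPy array; the wavelet, level, and threshold rule are illustrative rather than what either repository actually uses:

import numpy as np
import pywt

def wavelet_denoise_column(x, wavelet="haar", level=2):
    # Decompose one time series, soft-threshold each detail level using that
    # level's own median-based noise estimate, then reconstruct.
    coeffs = pywt.wavedec(x, wavelet, level=level)
    denoised = [coeffs[0]]  # keep the approximation coefficients as-is
    for cD in coeffs[1:]:
        sigma = np.median(np.abs(cD)) / 0.6745           # per-level noise estimate
        threshold = sigma * np.sqrt(2 * np.log(len(x)))  # universal threshold
        denoised.append(pywt.threshold(cD, threshold, mode="soft"))
    return pywt.waverec(denoised, wavelet)[: len(x)]

def wavelet_denoise(data):
    # Apply the 1-D denoising to each feature (column) separately,
    # i.e. along the time axis rather than across features.
    return np.column_stack([wavelet_denoise_column(data[:, j]) for j in range(data.shape[1])])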

how to load the scaled_denoised_data into lstm?

Hi,

Thanks for your amazing work. I appreciate the revised content a lot.

The revised pieces successfully handle the scaling, waveletSmooth, and stacked autoencoder portions.

May I ask how to integrate that output into the final LSTM portion?

Marcus
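
One common way to wire that up, sketched here with PyTorch under the assumption that the stacked autoencoder's encoded output is a (num_days, num_features) array and the LSTM predicts the next day's target from a fixed lookback window; every name and size below is hypothetical rather than the repository's actual interface:

import numpy as np
import torch
import torch.nn as nn

def make_sequences(encoded, targets, lookback=4):
    # Slice the encoded features into overlapping (lookback, num_features)
    # windows, pairing each window with the next day's target value.
    X, y = [], []
    for t in range(lookback, len(encoded)):
        X.append(encoded[t - lookback:t])
        y.append(targets[t])
    return (torch.tensor(np.array(X), dtype=torch.float32),
            torch.tensor(np.array(y), dtype=torch.float32))

class LSTMRegressor(nn.Module):
    def __init__(self, n_features, hidden_size=32):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):
        out, _ = self.lstm(x)         # out: (batch, lookback, hidden)
        return self.head(out[:, -1])  # predict from the last time step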

missing data for certain date ranges/index data (CSI300 index)

From the source article, which defines the train-validate-test split arrangement for continuous training (Fig 7):
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0180944#pone-0180944-g007

Taking a closer look at the dataset for the CSI300 index shows that data is missing for certain date ranges (screenshot omitted).

This may be an error in the authors' data-scrape methodology, a data endpoint issue, or simply that no market data was available for those ranges.

Action:
Investigate further; if possible, query the same data source the authors used and compare results.
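
A small sketch of how such gaps could be located programmatically, assuming the CSI300 frame ends up with a daily DatetimeIndex (the sheet/column names are hypothetical); note that a plain business-day calendar will also flag exchange holidays, so the output needs manual review:

import pandas as pd

def find_missing_business_days(df):
    # Compare the frame's dates against a full business-day calendar over the
    # same span; anything in the calendar but missing from the data is a gap.
    expected = pd.bdate_range(df.index.min(), df.index.max())
    return expected.difference(df.index)

# Hypothetical usage once the raw workbook is loaded:
# csi300 = pd.read_excel("data/raw/raw_data.xlsx", sheet_name="CSI300",
#                        index_col=0, parse_dates=True)
# print(find_missing_business_days(csi300))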

Meaning of Data Columns - time, Ntime and BOLL

Hi Timothy. I've looked at the dataset and couldn't understand what these columns mean. I know what Bollinger bands are, but manually calculating the top and bottom bands didn't yield the BOLL series. 'time' and 'Ntime' are counted from before there were humans... Any idea?
