
nowcast_lstm

New in v0.2.6: ability to produce logistic/binary classification estimates by passing torch.nn.BCELoss() to the criterion parameter.
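A minimal sketch of that usage (assuming data is a pandas DataFrame whose target column contains 0/1 values):

import torch
from nowcast_lstm.LSTM import LSTM

# pass a BCE loss as the criterion to get binary classification estimates
model = LSTM(data, "target_col_name", n_timesteps=12, criterion=torch.nn.BCELoss())
model.train()
estimates = model.predict(data)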

New in v0.2.2: ability to get uncertainty intervals for predictions and predictions on synthetic vintages.

New in v0.2.0: ability to get feature contributions to the model and perform automatic hyperparameter tuning and variable selection, no need to write this outside of the library anymore.

Installation: from the command line run:

# you may have pip3 installed, in which case run "pip3 install..."
pip install dill numpy pandas pmdarima

# PyTorch has a slightly more involved install command; this one is for Windows
pip install torch==1.8.1+cpu torchvision==0.9.1+cpu torchaudio===0.8.1 -f https://download.pytorch.org/whl/torch_stable.html

# and this one is for Linux
pip install torch==1.8.1+cpu torchvision==0.9.1+cpu torchaudio==0.8.1 -f https://download.pytorch.org/whl/torch_stable.html

# then finally
pip install nowcast-lstm

Example: nowcast_lstm_example.zip contains a Jupyter notebook with a dataset and a more detailed example of usage.

LSTM neural networks have been used for nowcasting before, combining the strengths of artificial neural networks with a temporal aspect. However, their use in nowcasting economic indicators remains limited, no doubt in part due to the difficulty of obtaining results in existing deep learning frameworks. This library seeks to streamline the process of obtaining results in the hopes of expanding the domains to which LSTMs can be applied.

While neural networks are flexible and this framework may be able to get sensible results on levels, the model architecture was developed to nowcast growth rates of economic indicators. As such, training inputs should ideally be stationary and seasonally adjusted.
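For instance, a minimal sketch (column names are illustrative) of converting level series to period-on-period growth rates with pandas before passing them to the model:

import pandas as pd

# df = DataFrame with a "date" column plus level series
value_cols = [col for col in df.columns if col != "date"]
growth = df.copy()
growth[value_cols] = growth[value_cols].pct_change()  # period-on-period growth rates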

Further explanation of the background problem can be found in this paper. Further explanation and results can be found in this paper in the Journal of Official Statistics.

R, MATLAB, and Julia wrappers

R, MATLAB, and Julia wrappers exist for this Python library. Python and some Python libraries still need to be installed on your system, but full functionality from R, MATLAB, and Julia can be obtained with the wrappers without any Python knowledge.

Quick usage

The main functionality of the library comes from the LSTM object. Given data = a pandas DataFrame of a date column + monthly data + a quarterly target series to run the model on, usage is as follows:

from nowcast_lstm.LSTM import LSTM

# note that if a column has no data in it, i.e., is all NAs, its values will be replaced with 0. This won't affect model performance and will ensure that the model can still be trained
# a list of columns with no data in them can be accessed with `model.no_data_cols`
model = LSTM(data, "target_col_name", n_timesteps=12) # default parameters with 12 timestep history

model.X # array of the transformed training dataset
model.y # array of the target values

model.mv_lstm # list of trained PyTorch network(s)
model.train_loss # list of training losses for the network(s)

model.train()
model.predict(model.data) # predictions on the training set

# predicting on a test set, which is the same dataframe as the training data + newer data
# this will give predictions for all dates, but only predictions after the training data ends should be considered for testing
model.predict(test_data)

# to gauge performance on artificial data vintages
model.ragged_preds(pub_lags, lag, test_data)

# save a trained model using dill
import dill
dill.dump(model, open("trained_model.pkl", mode="wb"))

# load a previously trained model using dill
trained_model = dill.load(open("trained_model.pkl", "rb", -1))

Model selection

To ease variable and hyperparameter selection, the library provides functionality to carry out this process automatically. See the example file or run help() on the functions for more information.

from nowcast_lstm.model_selection import variable_selection, hyperparameter_tuning, select_model

# case where given hyperparameters, want to select which variables go into the model
selected_variables = variable_selection(data, "target_col_name", n_timesteps=12) # default parameters with 12 timestep history

# case where given variables, want to select hyperparameters
performance = hyperparameter_tuning(data, "target_col_name", n_timesteps_grid=[12], n_hidden_grid=[10,20])

# case where want to select both variables and hyperparameters for the model
performance = select_model(data, "target_col_name", n_timesteps_grid=[12], n_hidden_grid=[10,20])

Prediction uncertainty

Produce estimates along with lower and upper bounds of an uncertainty interval. See the example Jupyter Notebook for more information on the methodology employed.

from nowcast_lstm.LSTM import LSTM

# where model = a trained model
model.interval_predict(
    test_data,
    interval = 0.95 # float from 0 to 1, how large to make intervals (higher = larger)
)

# predictions on synthetic vintages
model.ragged_interval_predict(
    pub_lags,
    lag,
    test_data,
    interval = 0.95
)

LSTM parameters

  • data: pandas DataFrame of the data to train the model on. Should contain a target column. Any non-numeric columns will be dropped. Its rows should be at the highest frequency present in the data. E.g. with three monthly variables, two quarterly variables, and a quarterly target series, the rows of the dataframe should be months, with the quarterly values appearing every three months (whether Q1 = Jan 1 or Mar 1 depends on the series, but generally the quarterly value should come at the end of the quarter, i.e. Mar 1), with NAs or 0s in between. The same logic applies for yearly variables. See the sketch after this list for a concrete example.
  • target_variable: a string, the name of the target column in the dataframe.
  • n_timesteps: an int, corresponding to the "memory" of the network, i.e. the target value depends on the previous n_timesteps values of the independent variables. For example, if the data is monthly, n_timesteps=12 means that the estimated target value is based on the previous year's worth of data, 24 on the last two years', etc. This is a hyperparameter that can be evaluated.
  • fill_na_func: a function used to replace missing values. Should take a column as a parameter and return a scalar, e.g. np.nanmean or np.nanmedian.
  • fill_ragged_edges_func: a function used to replace missing values at the ends of series (ragged edges). Leave blank to use the same function as fill_na_func, or pass "ARMA" to use ARMA estimation via pmdarima.arima.auto_arima.
  • n_models: int of the number of networks to train and predict on. Because neural networks are inherently stochastic, it can be useful to train multiple networks with the same hyperparameters and take the average of their outputs as the model's prediction, to smooth the output.
  • train_episodes: int of the number of training episodes/epochs. A short discussion of the topic can be found here.
  • batch_size: int of the number of observations per batch. Discussed here
  • decay: float of the rate of decay of the learning rate. Also discussed here. Set to 0 for no decay.
  • n_hidden: int of the number of hidden states in the LSTM network. Discussed here.
  • n_layers: int of the number of LSTM layers to include in the network. Also discussed here.
  • dropout: float of the proportion of layers to drop in between LSTM layers. Discussed here.
  • criterion: PyTorch loss function. Discussed here, list of available options in PyTorch here.
  • optimizer: PyTorch optimizer. Discussed here, list of available options in PyTorch here. E.g. torch.optim.SGD.
  • optimizer_parameters: dictionary of parameters for the chosen optimizer, including the learning rate. Information here. For instance, to change the learning rate (default 1e-2), pass e.g. {"lr":1e-3}; to add weight_decay for L2 regularization, pass {"lr":1e-2, "weight_decay":1e-3}. Learning rate discussed here.
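To make the expected layout of the data parameter concrete, here is a minimal sketch (column names, values, and fill choices are purely illustrative):

import numpy as np
import pandas as pd
from nowcast_lstm.LSTM import LSTM

# monthly rows; the quarterly target appears every three months, with NAs in between
data = pd.DataFrame({
    "date": pd.date_range("2019-01-01", periods=9, freq="MS"),
    "monthly_x": [0.5, 0.1, -0.2, 0.3, 0.4, -0.1, 0.2, 0.0, 0.1],
    "quarterly_target": [np.nan, np.nan, 1.2, np.nan, np.nan, 0.8, np.nan, np.nan, 1.0],
})

model = LSTM(
    data,
    "quarterly_target",
    n_timesteps=6,
    fill_na_func=np.nanmean,        # replace missings within series
    fill_ragged_edges_func="ARMA",  # or leave unset to reuse fill_na_func
)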

LSTM outputs

Assuming a model has been instantiated and trained with model = LSTM(...):

  • model.train(): trains the network. Set quiet=True to suppress printing of losses per epoch during training.
  • model.X: transformed data in the format the model was/will actually be trained on. A numpy array of dimensions n observations x n timesteps x n features.
  • model.y: one-dimensional list of target values the model was/will be trained on.
  • model.predict(model.data): given a dataframe with the same columns the model was trained on, returns a dataframe with date, actuals, and predictions; pass model.data for performance on the training set.
  • model.predict(new_data): generate dataframe of predictions on a new dataset. Generally should be the same dataframe as the training set, plus additional dates/datapoints.
  • model.mv_lstm: a list of length n_models containing the PyTorch networks.
  • model.train_loss: a list of length n_models containing the training losses of each of the trained networks.
  • model.ragged_preds(pub_lags, lag, new_data, start_date, end_date): adds artificial missing data then returns a dataframe with date, actuals, and predictions. This is especially useful as a testing mechanism, to generate datasets to see how a trained model would have performed at different synthetic vintages or periods of time in the past. pub_lags should be a list of ints (in the same order as the columns of the original data) of length n_features (i.e. excluding the target variable) dictating the normal publication lag of each of the variables. lag is an int of how many periods back we want to simulate being, interpretable as the last period relative to the target period. E.g. if we are nowcasting June, lag = -1 simulates being in May, where May data is published for variables with a publication lag of 0. Values that wouldn't have been available yet, according to each variable's publication lag plus the lag parameter, are set to missing and then filled using the method specified in the fill_ragged_edges_func parameter at model instantiation. See the usage sketch after this list.
  • model.gen_news(target_period, old_data, new_data): generates news between one data release and another, adding an element of causal inference to the network. Works by holding out new data column by column, recording the difference between this prediction and the prediction on the full data, and registering this difference as the new data's contribution to the prediction. Contributions are then scaled so that, in the aggregate, they equal the actual observed difference in prediction between the old dataset and the new dataset.
  • model.feature_contribution(): Generates a dataframe showing the relative feature importance of variables in the model using the permutation feature contribution method via RMSE on the train set.
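A brief usage sketch of the last three methods (assuming a trained model with two monthly feature columns; the publication lags and dates are illustrative):

# publication lags of the two feature columns, in the same order as in the training data
pub_lags = [1, 2]

# how the model would have performed one month before the target period
vintage_preds = model.ragged_preds(pub_lags, -1, test_data)

# contribution of each column's new data to the change in prediction between two data releases
news = model.gen_news("2023-06-01", old_data, new_data)

# permutation-based feature importances on the training set
importances = model.feature_contribution()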


nowcast_lstm's Issues

Missing values in mixed-frequency data

When applying an LSTM to mixed-frequency data, the first step is addressing missing data. This is particularly crucial when the target variable is monthly while the features are recorded daily: before any computation takes place, the missing values in the target variable need to be imputed. To illustrate, within a one-month timeframe there may be 29 missing observations of the target. My specific question is what logic the LSTM model uses to handle such a dataset.

No module named 'pmdarima'

Hello Daniel,

I would appreciate your help. After installation via pip and the following command

from nowcast_lstm.LSTM import LSTM

I get this dependency error

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/etc/share/code-server/extensions/ms-python.python-2020.10.332292344/pythonFiles/lib/python/debugpy/__main__.py", line 45, in <module>
    cli.main()
  File "/etc/share/code-server/extensions/ms-python.python-2020.10.332292344/pythonFiles/lib/python/debugpy/../debugpy/server/cli.py", line 430, in main
    run()
  File "/etc/share/code-server/extensions/ms-python.python-2020.10.332292344/pythonFiles/lib/python/debugpy/../debugpy/server/cli.py", line 267, in run_file
    runpy.run_path(options.target, run_name=compat.force_str("__main__"))
  File "/opt/conda/lib/python3.8/runpy.py", line 265, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/opt/conda/lib/python3.8/runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/jovyan/test_nowcast_lstm.py", line 3, in <module>
    from nowcast_lstm.LSTM import LSTM
  File "/opt/conda/lib/python3.8/site-packages/nowcast_lstm/LSTM.py", line 6, in <module>
    import nowcast_lstm.data_setup
  File "/opt/conda/lib/python3.8/site-packages/nowcast_lstm/data_setup.py", line 5, in <module>
    from pmdarima.arima import auto_arima, ARIMA
ModuleNotFoundError: No module named 'pmdarima'

Forecasting beyond the test dates

Suppose the daily features span from 2010 to 2023, and the target variable covers the same time period. Now, the objective is to predict target values for the initial month of 2024. Could you please guide me on implementing this for an LSTM model, considering that the target variable is recorded monthly while the features are captured daily, thus involving mixed frequencies?
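One possible approach with the existing API, sketched under the assumption that data is the training dataframe and that feature values not yet published for the new period are simply left as NAs (to be handled by the model's fill functions):

import pandas as pd

# append rows for the new target period; feature columns not supplied here remain NA
future = pd.DataFrame({"date": pd.date_range("2024-01-01", periods=1, freq="MS")})
extended = pd.concat([data, future], ignore_index=True)

preds = model.predict(extended)  # the final rows contain the out-of-sample estimates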

Lag of Target Variable

Hi. Great add-in. This is more of a question than an issue. When specifying lags (n_timesteps), does it apply this lookback window automatically to just the X variables/features, or to the target variable as well? Or would I need to create another variable based on a copy of the target to factor in the target's own lag? Also, where can I see details of the architecture/parameters the LSTM model is built on (number of layers, neurons, etc.)?

Many Thanks

Lucas
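Regarding the architecture question, the trained PyTorch networks are stored on the model object (model.mv_lstm, described above), so one way to inspect them is simply to print one (a sketch assuming a trained model):

# printing a trained network shows its layers and their sizes
print(model.mv_lstm[0])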

Code error help

pred_dict = {k: [] for k in lags}
for date in dates:
    # training the actual model
    train = test.loc[test.date <= str(pd.to_datetime(date) - pd.tseries.offsets.DateOffset(months=3))[:10], :]  # data as it would have appeared at the beginning of the prediction period

    model = LSTM(
        data = train,
        target_variable = target_variable,
        n_timesteps = 6,
        fill_na_func = np.nanmean,
        fill_ragged_edges_func = np.nanmean,
        n_models = 10,
        train_episodes = 100,
        batch_size = 50,
        decay = 0.98,
        n_hidden = 10,
        n_layers = 1,
        dropout = 0.0,
        criterion = torch.nn.MSELoss(),
        optimizer = torch.optim.Adam,
        optimizer_parameters = {"lr": 1e-2, "weight_decay": 0.0}
    )
    model.train(quiet=True)

    for lag in lags:
        # the data available for this date at this artificial vintage
        tmp_data = gen_lagged_data(metadata, test, date, lag)

        # the predict function will give a whole dataframe, only interested in the prediction for this date
        pred = model.predict(tmp_data).loc[lambda x: x.date == date, "predictions"].values[0]
        pred_dict[lag].append(pred)

There is an error when running the above code; how should I modify it? Thank you!
RuntimeError Traceback (most recent call last)
Cell In[9], line 23
4 train = test.loc[test.date <= str(pd.to_datetime(date) - pd.tseries.offsets.DateOffset(months=3))[:10],:] # data as it would have appeared at beginning of prediction period
6 model = LSTM(
7 data = train,
8 target_variable = target_variable,
(...)
21 optimizer_parameters = {"lr":1e-2, "weight_decay":0.0}
22 )
---> 23 model.train(quiet=True)
25 for lag in lags:
26 # the data available for this date at this artificial vintage
27 tmp_data = gen_lagged_data(metadata, test, date, lag)

File D:\Anaconda3\envs\Pytorch\lib\site-packages\nowcast_lstm\LSTM.py:144, in LSTM.train(self, num_workers, shuffle, quiet)
142 optimizer = instantiated["optimizer"]
143 # train the model
--> 144 trained = self.modelling.train_model(
145 self.X,
146 self.y,
147 mv_lstm,
148 criterion,
149 optimizer,
150 train_episodes=self.train_episodes,
151 batch_size=self.batch_size,
152 decay=self.decay,
153 num_workers=num_workers,
154 shuffle=shuffle,
155 quiet=quiet,
156 )
157 self.mv_lstm.append(trained["mv_lstm"])
158 self.train_loss.append(trained["train_loss"])

File D:\Anaconda3\envs\Pytorch\lib\site-packages\nowcast_lstm\modelling.py:139, in train_model(X, y, mv_lstm, criterion, optimizer, train_episodes, batch_size, decay, num_workers, shuffle, quiet)
136 batch_X, batch_y = batch_X.to(device), batch_y.to(device)
138 mv_lstm.init_hidden(batch_X.size(0))
--> 139 output = mv_lstm(batch_X)
140 loss = criterion(output.view(-1), batch_y)
142 loss.backward()

File D:\Anaconda3\envs\Pytorch\lib\site-packages\torch\nn\modules\module.py:1501, in Module._call_impl(self, *args, **kwargs)
1496 # If we don't have any hooks, we want to skip the rest of the logic in
1497 # this function, and just call forward.
1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []

File D:\Anaconda3\envs\Pytorch\lib\site-packages\nowcast_lstm\mv_lstm.py:49, in MV_LSTM.forward(self, x)
46 batch_size, n_timesteps, _ = x.size()
48 # model layers
---> 49 x, self.hidden = self.l_lstm(x, self.hidden)
50 x = x.contiguous().view(batch_size, -1) # make tensor of right dimensions
51 x = self.l_linear(x)

File D:\Anaconda3\envs\Pytorch\lib\site-packages\torch\nn\modules\module.py:1501, in Module._call_impl(self, *args, **kwargs)
1496 # If we don't have any hooks, we want to skip the rest of the logic in
1497 # this function, and just call forward.
1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []

File D:\Anaconda3\envs\Pytorch\lib\site-packages\torch\nn\modules\rnn.py:812, in LSTM.forward(self, input, hx)
810 self.check_forward_args(input, hx, batch_sizes)
811 if batch_sizes is None:
--> 812 result = _VF.lstm(input, hx, self._flat_weights, self.bias, self.num_layers,
813 self.dropout, self.training, self.bidirectional, self.batch_first)
814 else:
815 result = _VF.lstm(input, batch_sizes, hx, self._flat_weights, self.bias,
816 self.num_layers, self.dropout, self.training, self.bidirectional)

RuntimeError: Input and hidden tensors are not at the same device, found input tensor at cuda:0 and hidden tensor at cpu

Evaluation and accuracy of the model

How can I assess the performance of the model? Typically I would use a model.evaluate() method, but with your modified LSTM I'm uncertain how to obtain the model's accuracy, because we have NaN values for the actual target.
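Since model.predict() returns a dataframe of dates, actuals, and predictions, one simple approach (a sketch, not a built-in method; the column names "actuals" and "predictions" are assumed from the documentation above) is to drop the rows with missing actuals and compute an error metric on the rest:

import numpy as np

preds = model.predict(test_data)
evaluable = preds.dropna(subset=["actuals"])  # keep only periods with an observed target
rmse = np.sqrt(np.mean((evaluable["actuals"] - evaluable["predictions"]) ** 2))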

criterion requirement

Hi, I need help fixing an error. My target value is either 0 or 1. I already tried scaling my data with MinMaxScaler, but the error persists.

Thank you in advance.

def lstm_nowcast(train_data, target, new_data):
    from nowcast_lstm.LSTM import LSTM
    import pandas as pd

    train_data['date'] = pd.to_datetime(train_data.index)
    new_data['date'] = pd.to_datetime(new_data.index)

    train_data['target'] = target['high_c']
    test_data = pd.concat([train_data, new_data], ignore_index=True)

    import torch.nn as nn

    loss = nn.BCELoss()
    model = LSTM(train_data, "target", n_timesteps=12, n_models=3, train_episodes=20, batch_size=32, criterion=loss)

    model.train()
    result = model.predict(test_data)
    print(result)

error:
Traceback (most recent call last):
File "c:/Users/User/Documents/GitHub/test_lstm/main.py", line 175, in
lstm_nowcast(train_data,target,new_data)
File "c:/Users/User/Documents/GitHub/test_lstm/main.py", line 91, in lstm_nowcast
model.train()
File "C:\Users\User.conda\envs\practice\lib\site-packages\nowcast_lstm\LSTM.py", line 144, in train
trained = self.modelling.train_model(
File "C:\Users\User.conda\envs\practice\lib\site-packages\nowcast_lstm\modelling.py", line 138, in train_model
loss = criterion(output.view(-1), batch_y)
File "C:\Users\User.conda\envs\practice\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "C:\Users\User.conda\envs\practice\lib\site-packages\torch\nn\modules\loss.py", line 613, in forward
return F.binary_cross_entropy(input, target, weight=self.weight, reduction=self.reduction)
File "C:\Users\User.conda\envs\practice\lib\site-packages\torch\nn\functional.py", line 2762, in binary_cross_entropy
return torch._C._nn.binary_cross_entropy(input, target, weight, reduction_enum)
RuntimeError: all elements of input should be between 0 and 1

Early stop to prevent overfitting

Hi,

I would like to suggest adding support for early stopping, choosing the best weights based on the lowest loss, the highest accuracy, or the highest validation accuracy.

ARDL and LSTM

The integration of ARMA (AutoRegressive Moving Average) models with LSTM (Long Short-Term Memory) models is a known approach in time series forecasting. ARMA models capture linear dependencies in time series data, while LSTM models are effective at capturing non-linear and sequential patterns. Combining them can potentially improve forecasting accuracy.
Regarding your mention of ARDL (AutoRegressive Distributed Lag), VAR (Vector AutoRegression), and GARCH (Generalized AutoRegressive Conditional Heteroskedasticity) models with LSTM, it's possible to explore such combinations, but it largely depends on the specific characteristics of your data and the forecasting objectives.
The question is: is it possible to set up a hybrid model (ARDL-LSTM, GARCH-LSTM, VAR-LSTM)?

LSTM model parameters

Dear Professor Daniel Hopp, when I used your LSTM model code to make predictions, the RMSE I got was about 45,000, which was much larger than that of the ARMA model. I used weekly-frequency and daily-frequency data to predict the weekly series. May I ask which parameters in the LSTM model I should modify to improve the prediction accuracy? I would appreciate it if you could read and answer my questions carefully!
