m4metalearning

Contributors: mitchelloharawild, pmontman, robjhyndman, samuel-bohman
m4metalearning's Issues

hyperparameter_search not running over all folds?

The for loop only runs over 1:1:

for (i in 1:1) {
  dtrain <- xgboost::xgb.DMatrix(train_feat[[i]]$data)
  attr(dtrain, "errors") <- train_feat[[i]]$errors

  bst <- xgboost::xgb.train(param, dtrain, nrounds)
  preds <- M4metalearning::predict_selection_ensemble(bst, test_feat[[i]]$data)
  er <- M4metalearning::summary_performance(preds,
                                            test_ds[[i]],
                                            print.summary = FALSE)

  final_error <- c(final_error, er$weighted_error)
  final_preds <- rbind(final_preds, preds)
}

Should this maybe be 1:length(folds)?
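For reference, a hedged sketch of the same loop iterating over every fold (assuming train_feat, test_feat and test_ds are parallel lists with one element per fold, and that param and nrounds are defined as elsewhere in hyperparameter_search):

# Sketch only: identical body, but looping over all folds instead of 1:1.
final_error <- c()
final_preds <- NULL
for (i in seq_along(train_feat)) {
  dtrain <- xgboost::xgb.DMatrix(train_feat[[i]]$data)
  attr(dtrain, "errors") <- train_feat[[i]]$errors

  bst <- xgboost::xgb.train(param, dtrain, nrounds)
  preds <- M4metalearning::predict_selection_ensemble(bst, test_feat[[i]]$data)
  er <- M4metalearning::summary_performance(preds, test_ds[[i]], print.summary = FALSE)

  final_error <- c(final_error, er$weighted_error)
  final_preds <- rbind(final_preds, preds)
}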

Target variable for finding weights

I am not sure I understood the methodology for using xgboost to find the weights for each forecasting method. Is it correct that the independent variables are the generated features and the targets are the OWA errors of the forecasts?
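A minimal sketch of that reading, based only on the hyperparameter search snippet in the issue above rather than on package documentation (train_feat here stands for one fold's data): the feature matrix is the xgboost input, and the per-method OWA errors ride along as an attribute that the custom objective reads, instead of being passed as a conventional label.

# Sketch: features as predictors, per-method OWA errors attached for the
# custom objective (rows = series; columns = features / forecasting methods).
dtrain <- xgboost::xgb.DMatrix(train_feat$data)
attr(dtrain, "errors") <- train_feat$errors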

hyperparameter_search cannot handle more than 145,000 time series

Over the last few weeks I have tried to use the FFORMA system to forecast a massive block of data (about 110,000 time series, some consisting only of zeros). In doing so I found a number of problems. Most of them could be solved, but now I'm stuck.

Here is a short list of the fixed problems:

  1. The parallelisation of THA_features and calc_forecasts did not work well with Linux on IBM Power.
    1. Fixed by using foreach and doParallel (a rough sketch follows after this list).
  2. THA_features produces NaN if the time series consists entirely of zeros.
    1. Fixed by returning zeros instead of NaN from stl_features in the tsfeatures package when no seasonality can be computed.
  3. train_interval_weights did not work for time series that always have h = 24.
    1. Fixed by adding a check for NULL values.
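A rough sketch of the foreach/doParallel approach mentioned in item 1 (simplified and hypothetical: it computes the default tsfeatures feature set rather than the exact set THA_features uses, and assumes each element of dataset is a list with an $x time series):

# Parallel feature computation with foreach/doParallel instead of the
# package's built-in parallelisation.
library(foreach)
library(doParallel)

cl <- parallel::makeCluster(8)
doParallel::registerDoParallel(cl)

dataset <- foreach(entry = dataset, .packages = "tsfeatures") %dopar% {
  entry$features <- tsfeatures::tsfeatures(list(entry$x))
  entry
}

parallel::stopCluster(cl)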

Hopefully I can upload the changed code soon.
The main reason for this issue is the hyperparameter_search function. I am trying to train the FFORMA system with the 100,000 original M4 time series and my own 110,000 time series combined.
But apparently the hyperparameter search only works with fewer than 145,000 time series.
If I try to use more time series I get the following error:

cannot open compressed file '<PATH>/M4_Hyper.RData', probable reason 'No such file or directory'
Error in readChar(con, 5L, useBytes = TRUE) : cannot open the connection
elapsed = 488.97        Round = 1       max_depth = 10.0000     eta = 0.4000    subsample = 0.9000      colsample_bytree = 0.6000       nrounds = 200.0000      Value = -0.9861
elapsed = 605.43        Round = 2       max_depth = 12.0000     eta = 0.7395    subsample = 0.8938      colsample_bytree = 0.5587       nrounds = 228.0000      Value = -0.9827
elapsed = 110.10        Round = 3       max_depth = 7.0000      eta = 0.2818    subsample = 0.5933      colsample_bytree = 0.9659       nrounds = 59.0000       Value = -0.9876
elapsed = 248.60        Round = 4       max_depth = 8.0000      eta = 0.0042    subsample = 0.8924      colsample_bytree = 0.9966       nrounds = 126.0000      Value = -1.0407
elapsed = 62.96         Round = 5       max_depth = 13.0000     eta = 0.7589    subsample = 0.7186      colsample_bytree = 0.8478       nrounds = 68.0000       Value = NaN
elapsed = 84.47         Round = 6       max_depth = 7.0000      eta = 0.8266    subsample = 0.6338      colsample_bytree = 0.8825       nrounds = 110.0000      Value = NaN
Error in GP_deviance(beta = row, X = X, Y = Y, nug_thres = nug_thres,  :
  Infinite values of the Deviance Function,
            unable to find optimum parameters
Calls: source ... eval -> eval -> <Anonymous> -> apply -> FUN -> GP_deviance

Do you have any idea or direction as to how I can fix this issue?

temp_holdout removes ts start date

The temp_holdout function resets the input time series to year 1, month 1. Since it requires the input to be a time series object, it should retain the start parameter of the input.

I've modified the function below.

function (dataset) 
{
    lapply(dataset, function(seriesentry) {
        frq <- stats::frequency(seriesentry$x)
        st <- stats::start(seriesentry$x)  # keep the original start date
        if (length(seriesentry$x) - seriesentry$h < max(2 * frq + 1, 7)) {
            length_to_keep <- max(2 * frq + 1, 7)
            seriesentry$h <- length(seriesentry$x) - length_to_keep
            if (seriesentry$h < 2) {
                warning(paste("cannot subset series by", 2 - seriesentry$h,
                              "observations, adding a mean constant"))
                seriesentry$x <- stats::ts(c(seriesentry$x,
                                             rep(mean(seriesentry$x), 2 - seriesentry$h)),
                                           frequency = frq)
                seriesentry$h <- 2
            }
        }
        seriesentry$xx <- utils::tail(seriesentry$x, seriesentry$h)
        seriesentry$x <- stats::ts(utils::head(seriesentry$x, -seriesentry$h),
                                   frequency = frq, start = st)  # restore the start date
        if (!is.null(seriesentry$n)) {
            seriesentry$n <- length(seriesentry$x)
        }
        seriesentry
    })
}
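A quick check (with a hypothetical series) that the start date now survives the holdout:

series <- list(list(x = stats::ts(rnorm(48), frequency = 12, start = c(2015, 1)), h = 12))
held <- temp_holdout(series)
stats::start(held[[1]]$x)  # c(2015, 1) with the patched version; c(1, 1) with the original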

Error using THA_features()

I was following the code listed here.

The code that generates the error:
library(M4metalearning)
library(M4comp2018)
set.seed(31-05-2018)
#we start by creating the training and test subsets
indices <- sample(length(M4))
M4_train <- M4[ indices[1:15]]
M4_test <- M4[indices[16:25]]
#we create the temporal holdout version of the training and test sets
M4_train <- temp_holdout(M4_train)
M4_test <- temp_holdout(M4_test)
#this will take time
M4_train <- calc_forecasts(M4_train, forec_methods(), n.cores=3)
#once we have the forecasts, we can calculate the errors
M4_train <- calc_errors(M4_train)
M4_train <- THA_features(M4_train)
<simpleError in flist[[i]][[1]]: subscript out of bounds>
(the same error is printed once for each of the 15 training series)

Here is the sessionInfo():
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.6 LTS

Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.18.so

locale:
[1] C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] tsfeatures_0.1 M4comp2018_0.1.0
[3] M4metalearning_0.0.0.9000

loaded via a namespace (and not attached):
[1] Rcpp_1.0.0 urca_1.3-0 pillar_1.3.1 compiler_3.4.4
[5] plyr_1.8.4 iterators_1.0.10 tseries_0.10-46 tools_3.4.4
[9] xts_0.11-2 tibble_2.0.1 gtable_0.2.0 nlme_3.1-137
[13] lattice_0.20-38 pkgconfig_2.0.2 rlang_0.3.1 foreach_1.4.4
[17] curl_3.3 parallel_3.4.4 lmtest_0.9-36 grid_3.4.4
[21] nnet_7.3-12 forecast_8.5 purrr_0.3.1 ggplot2_3.1.0
[25] TTR_0.23-4 magrittr_1.5 scales_1.0.0 codetools_0.2-15
[29] quantmod_0.4-13 timeDate_3043.102 colorspace_1.4-0 fracdiff_1.4-2
[33] quadprog_1.5-5 lazyeval_0.2.1 munsell_0.5.0 crayon_1.3.4
[37] zoo_1.8-4

Error Calculation Question

Regarding error calculation, I was wondering why the OWA error for the snaive method is not equal to one for each series.
For instance, for the first series of the monthly data, the snaive error, $errors['snaive_forec'], is 2.459 instead of 1.

Thanks,
Arsa

Error when running on an M1 Mac

What code produced the error?

library(forecast)
library(M4metalearning)
load("model_M4.rda")
tr <- window(AirPassengers, end=c(1959, 12))
forecast_meta_M4(model_M4, tr, h=12)

What error was produced?

Error in if (class(newdata) != "xgb.DMatrix") newdata <- xgb.DMatrix(newdata,  : 
  the condition has length > 1

This error seems to come from line 184 of file https://github.com/pmontman/customxgboost/blob/master/R/xgb.Booster.R .
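On R >= 4.2 an if() condition of length greater than one is an error rather than a warning, which is presumably what triggers this on a recent R installation. A hedged sketch of the kind of change that would avoid it (the actual code in customxgboost may differ):

# Instead of comparing class(newdata), which can have length > 1,
# test class membership with inherits():
if (!inherits(newdata, "xgb.DMatrix"))
  newdata <- xgboost::xgb.DMatrix(newdata, missing = missing)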

Environment?

MacBook Pro, M1
RStudio 2022.07.2 Build 576
xgboost compiled after running devtools::install_github("pmontman/customxgboost"). To compile the code I used the Xcode Command Line Tools 14, which I installed by downloading them from https://download.developer.apple.com/Developer_Tools/Command_Line_Tools_for_Xcode_14.0_Release_Candidate/Command_Line_Tools_for_Xcode_14_Release_Candidate.dmg

Include forecast prediction intervals

It would be nice to have the option to request prediction intervals in ensemble_forecast. I realize PIs for ensembles can be dodgy, but having this functionality, similar to the basic forecast package, would be helpful to the extent possible. The intervals could be included as additional elements of the list, so that in addition to y_hat you also have hi_95, low_80, etc.

calc_errors() returns inconsistent results

calc_errors() sometimes returns NA for $errors and sometimes returns valid errors.
I have observed this happening depending in particular on the batch of series passed in.

Ex 1: $errors contains valid errors

z1 <- list(list(x=ts(c(20000,40000,10000,150000,650000), frequency = 4), h = 2))
z1 <- temp_holdout(z1)
z1 <- calc_forecasts(z1, forec_methods(), n.cores=4)
z1 <- calc_errors(z1)

Ex 2: using the same series as in the earlier example, plus one additional series; the errors are still valid

z2 <- list(
      list(x=ts(c(20000,40000,10000,150000,650000), frequency = 4), h = 2),
      list(x=ts(c(20000,40000,10000,150000,650000,130000,125000,325000,160000,150000), frequency = 4), h = 2)
  )
z2 <- temp_holdout(z2)
z2 <- calc_forecasts(z2, forec_methods(), n.cores=4)
z2 <- calc_errors(z2)

This seems to happen because total_snaive_errors is an average across all series in the dataset. If there is a problem in any one of the series, all $errors values become NA, causing the subsequent code to fail.

Ex 3: to the series from Ex 2 we add one more (constant) series; the resulting errors for all three series are NA

z2 <- list(
          list(x=ts(c(20000,40000,10000,150000,650000), frequency = 4), h = 2),
          list(x=ts(c(20000,40000,10000,150000,650000,130000,125000,325000,160000,150000), frequency = 4), h = 2),
          list(x=ts(c(50,50,50,50,50,50,50,50,50,50), frequency = 4), h = 2)
      )
z2 <- temp_holdout(z2)
z2 <- calc_forecasts(z2, forec_methods(), n.cores=4)
z2 <- calc_errors(z2)
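A hedged diagnostic sketch (assuming each element gains an $errors vector after calc_errors()): check which series end up with non-finite errors and which raw series are constant. A constant series such as the third one above is a plausible culprit, since its (seasonal) naive in-sample errors are all zero, which zeroes the MASE-style denominator and then poisons the dataset-wide average.

# Which series have non-finite errors after calc_errors()?
sapply(z2, function(s) any(!is.finite(s$errors)))

# Which input series are constant (zero variance)?
sapply(z2, function(s) stats::sd(s$x) == 0)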

Doubt about OWA and OWI errors

Good morning,
In the FFORMA paper the authors use the OWA (overall weighted average) as the forecast loss measure, but in the 'Forecasting Metalearning Example' on GitHub the OWI error is used to measure performance. I would just like to know whether these errors have the same meaning. Do they represent the same idea?

Thank you for your time.

Error in train_selection_ensemble()

I'm following the example in metalearning_example and got to the step of training the metalearning model. When running:
meta_model <- train_selection_ensemble(train_data$data, train_data$errors)
I encountered the following error:
Error in rowSums(preds) : 'x' must be an array of at least two dimensions

I'm not sure whether the problem is with the customxgboost package or with the error_softmax_obj function that calls rowSums(preds). Any help would be appreciated.
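A hedged first check (assuming train_data$data is the feature matrix and train_data$errors the per-method error matrix, as in the hyperparameter search snippet earlier on this page): both should be matrices with one row per series, and the xgboost in use should be the pmontman/customxgboost fork rather than the CRAN release.

dim(train_data$data)    # expected: n_series x n_features
dim(train_data$errors)  # expected: n_series x n_methods
# If installed via devtools, the package description should point at the fork:
packageDescription("xgboost")$RemoteRepo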

Trying to use on Kaggle, but getting an error when installing

I put this into a Kaggle R notebook:
devtools::install_github("robjhyndman/M4metalearning")
But I get this error:
(screenshot of the installation error omitted)
It seems like the problem is with rlang, so maybe I should be posting there instead. I know nothing of R, but
I would really like to try your package on Kaggle to predict time series. Thanks for all your contributions to the field of forecasting.

Trained model

Where can I find the model trained on the M4 data, and how can I use it to forecast different time series?
Thank you.

forecast_meta_M4() not running as expected

Objective: trying to run the reproduction example under "Simple Forecasting" section

Input:
set.seed(10-06-2019)
truex = (rnorm(60)) + seq(60)/10

#we subtract the last 10 observations to use it as 'true future' values
#and keep the rest as the input series in our method
h = 10
x <- head(truex, -h)
x <- ts(x, frequency = 1)

#forecasting with our method using our pretrained model in one line of code
#just the input series and the desired forecasting horizon
forec_result <- forecast_meta_M4(model_M4, x, h=h)

Output:
Error in is.constant(y) :
(list) object cannot be coerced to type 'double'.

Any context as to why forecast_meta_M4() is not working is greatly appreciated.

Include predicted weights in list output of ensemble_forecast

It would be helpful to include the model weights in the output of ensemble_forecast. Something like:

function (predictions, dataset, clamp_zero = TRUE) 
{
    for (i in 1:length(dataset)) {
        weighted_ff <- as.vector(t(predictions[i, ]) %*% dataset[[i]]$ff)
        if (clamp_zero) {
            weighted_ff[weighted_ff < 0] <- 0
        }
        dataset[[i]]$y_hat <- weighted_ff

        dataset[[i]]$weights <- predictions[i, ]  # added this line
    }
    dataset
}
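With that change, a hypothetical usage would be:

result <- ensemble_forecast(predictions, dataset)
result[[1]]$y_hat    # combined forecast for the first series
result[[1]]$weights  # weights applied to each method's forecast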

User-defined forecasting methods cannot be found in parallel

Because calc_forecasts uses get() to find forecasting methods, user-defined forecasting methods are not on the search path when the methods are evaluated in parallel.

It would be safer if the functions themselves were provided, rather than the names of functions to be looked up.

MRE:

library(M4metalearning)

croston_forec <- function(x, h){
  forecast::croston(x, h = h)$mean
}

calc_forecasts(M4comp2018::M4[1:3], append(forec_methods(), "croston_forec"), n.cores=1)
#> ...successful...
calc_forecasts(M4comp2018::M4[1:3], append(forec_methods(), "croston_forec"), n.cores=2)
#> Error in checkForRemoteErrors(val): 2 nodes produced errors; first error: object 'croston_forec' not found

Created on 2018-11-28 by the reprex package (v0.2.1)
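A hedged sketch of the suggested alternative (hypothetical, not the current calc_forecasts interface): resolve each method name to a function object once in the main session, so that workers receive closures instead of names they cannot look up.

# forec_methods() is assumed to return method *names*, as in the MRE above.
method_names <- append(forec_methods(), "croston_forec")
method_funs <- lapply(method_names, match.fun)
# Function objects are serialised and shipped to the PSOCK workers together
# with the call, so they no longer need to be found by name via get().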

Memory limit for a massive number of time series

It seems to me that, at the moment, the RAM of the machine I'm using is the limiting factor for the number of time series the system can be trained on. If I want to train the system with, e.g., 16 GB of time-series data, I need to have at least 16 GB of RAM.

Is there a way to get around this? Maybe it is possible to train the system in smaller batches or to use some kind of iterator. I'm trying to train the system with a lot of data obtained from a database, where I get the time series in small chunks.
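A hedged sketch of batched preprocessing (not a package feature): process the series in chunks, write each processed chunk to disk, and keep only one chunk in RAM at a time. Note this only batches the preprocessing (forecasts, errors, features); the feature and error matrices handed to the training step are far smaller than the raw series and can be assembled from the saved chunks afterwards. big_dataset is a placeholder for the full list of series.

chunk_size <- 5000
chunk_ids <- split(seq_along(big_dataset), ceiling(seq_along(big_dataset) / chunk_size))

for (k in seq_along(chunk_ids)) {
  chunk <- big_dataset[chunk_ids[[k]]]
  chunk <- temp_holdout(chunk)
  chunk <- calc_forecasts(chunk, forec_methods(), n.cores = 4)
  chunk <- calc_errors(chunk)
  chunk <- THA_features(chunk)
  saveRDS(chunk, sprintf("fforma_chunk_%03d.rds", k))
  rm(chunk); gc()
}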
