ImputeBench implements state-of-the-art recovery techniques for blocks of missing values in time series and evaluates their precision and runtime on various real-world time series datasets using different recovery scenarios. Technical details can be found in our PVLDB 2020 paper: Mind the Gap: An Experimental Evaluation of Imputation of Missing Values Techniques in Time Series. The benchmark can be easily extended with new algorithms (C/C++, Python or Matlab), new datasets and new scenarios.
- The benchmark implements the following algorithms (in C++): CDRec, DynaMMo, GROUSE, ROSL, SoftImpute, SPIRIT, STMVL, SVDImpute, SVT, TeNMF, TRMF, and TKCM.
- We recently added these new algorithms (in Python): SSA, MRNN and BRITS.
- All the datasets used in this benchmark can be found here.
- The full list of recovery scenarios can be found here.
Prerequisites | Build | Execution | Extension | Contributors | Award | Citation
- Ubuntu 16 or Ubuntu 18 (including Ubuntu derivatives, e.g., Xubuntu) or the same distribution under WSL.
- Clone this repository.
- Mono: Install mono from https://www.mono-project.com/download/stable/ and reboot.
- Build the Testing Framework using the installation script located in the root folder (takes several minutes)
$ sh install_linux.sh
- [Optional] This script installs all the extra Python packages required by the newly added algorithms (SSA, MRNN and BRITS):
$ sh install_extra.sh
$ cd TestingFramework/bin/Debug/
$ mono TestingFramework.exe [arguments]
-alg | -d | -scen |
---|---|---|
cdrec | airq | miss_perc |
dynammo | bafu | ts_length |
grouse | chlorine | ts_nbr |
rosl | climate | miss_disj |
softimp | drift10 | miss_over |
svdimp | electricity | mcar |
svt | meteo | blackout |
stmvl | temp | all |
spirit | bafu_red | |
tenmf | drift10_red | |
tkcm | all | |
trmf | | |
all | | |
New algs | | |
ssa | | |
m-rnn | | |
brits | | |
All results and plots will be added to the `Results` folder. The accuracy results of all algorithms will be sequentially added for each scenario and dataset to `Results/.../.../error/`. The runtime results of all algorithms will be added to `Results/.../.../runtime/`. The plots of the recovered blocks will be added to the folder `Results/.../.../recovery/plots/`.
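To get a quick overview of what a run produced, the folder layout described above can be summarized with a short script. This is only a sketch: it assumes nothing about the elided intermediate folders and relies solely on the directory names `error`, `runtime` and `recovery/plots` mentioned above. The function name `count_outputs` is hypothetical and not part of ImputeBench.

```python
from collections import Counter
from pathlib import Path


def count_outputs(results_root):
    """Count result files by kind, based only on the directory names in their path.

    Hypothetical helper: it does not know the intermediate folder names,
    it only looks for 'error', 'runtime' and 'recovery/plots' in each path.
    """
    root = Path(results_root)
    counts = Counter()
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        # Directory components between the root and the file itself.
        dirs = path.relative_to(root).parts[:-1]
        if ("recovery", "plots") in zip(dirs, dirs[1:]):
            counts["plots"] += 1
        elif "error" in dirs:
            counts["error"] += 1
        elif "runtime" in dirs:
            counts["runtime"] += 1
        else:
            counts["other"] += 1
    return counts

# Example (run from the folder that contains Results):
#   print(count_outputs("Results"))
```

Files that sit outside those three locations are reported under `other`, which makes unexpected output easy to spot.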
- Run a single algorithm (cdrec) on a single dataset (drift10) using one scenario (missing percentage)
$ mono TestingFramework.exe -alg cdrec -d drift10 -scen miss_perc
- Run two algorithms (cdrec, spirit) on a single dataset (drift10) using one scenario (missing percentage)
$ mono TestingFramework.exe -alg cdrec,spirit -d drift10 -scen miss_perc
- Run the previous example (two algorithms, one dataset, one scenario) without runtime results
$ mono TestingFramework.exe -alg cdrec,spirit -d drift10 -scen miss_perc -nort
- Run the whole VLDB'20 benchmark (all algorithms, all datasets, all scenarios, precision and runtime)
$ mono TestingFramework.exe -alg all -d all -scen all
Warning: Running the whole benchmark will take a sizeable amount of time (up to 4 days depending on the hardware) and will produce up to 15GB of output files with all recovered data and plots unless stopped early.
- Additional command-line parameters
$ mono TestingFramework.exe --help
Remark: Algorithms `tkcm`, `spirit`, `ssa`, `brits` and `m-rnn` cannot handle multiple incomplete time series. These algorithms will not produce results for the following scenarios: `miss_disj`, `miss_over`, `mcar` and `blackout`.
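The remark above can be turned into a small pre-flight check that lists which algorithm/scenario pairs will yield no results before a long run is started. This is a sketch only; the helper `skipped_runs` is hypothetical and not part of ImputeBench, and the algorithm names are spelled as in the argument table.

```python
# Algorithms that cannot handle multiple incomplete time series (from the remark).
SINGLE_SERIES_ONLY = {"tkcm", "spirit", "ssa", "brits", "m-rnn"}

# Scenarios for which those algorithms produce no results (from the remark).
MULTI_SERIES_SCENARIOS = {"miss_disj", "miss_over", "mcar", "blackout"}


def skipped_runs(algorithms, scenarios):
    """Return the sorted (algorithm, scenario) pairs that will produce no results."""
    return sorted(
        (alg, scen)
        for alg in algorithms
        for scen in scenarios
        if alg in SINGLE_SERIES_ONLY and scen in MULTI_SERIES_SCENARIOS
    )


print(skipped_runs(["cdrec", "spirit"], ["miss_perc", "mcar"]))
# prints: [('spirit', 'mcar')]
```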
- You can parametrize each algorithm using the command `-algx`. For example, you can run the svdimp algorithm with a reduction value of 4 on the drift10 dataset while varying the sequence length as follows:
$ mono TestingFramework.exe -algx svdimp 4 -d drift10 -scen ts_nbr
- If you want to run some algorithms with default parameters and some with customized ones, you can use `-alg` and `-algx` together. For example, you can run the stmvl algorithm with default parameters and the cdrec algorithm with a reduction value of 4 on the airq dataset while varying the sequence length as follows:
$ mono TestingFramework.exe -alg stmvl -algx cdrec 4 -d airq -scen ts_nbr
Remark: The command `-algx` cannot be applied to a group of algorithms and thus must precede the name of each algorithm.
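When launching runs from a script, the rules above (comma-separated names after `-alg`, one `-algx` per parametrized algorithm) can be captured in a small command builder. This is a minimal sketch; `build_args` is a hypothetical helper, and it assumes that each customized algorithm takes a single parameter value, as in the examples above.

```python
def build_args(default_algs, custom_algs, dataset, scenario):
    """Build the command line for TestingFramework.exe as a list of arguments.

    default_algs: algorithm names to run with default parameters (grouped under -alg)
    custom_algs:  (name, parameter) pairs; each pair gets its own -algx,
                  since -algx cannot be applied to a group
    """
    args = ["mono", "TestingFramework.exe"]
    if default_algs:
        args += ["-alg", ",".join(default_algs)]
    for name, param in custom_algs:
        args += ["-algx", name, str(param)]
    args += ["-d", dataset, "-scen", scenario]
    return args


print(" ".join(build_args(["stmvl"], [("cdrec", 4)], "airq", "ts_nbr")))
# prints: mono TestingFramework.exe -alg stmvl -algx cdrec 4 -d airq -scen ts_nbr
```

Returning a list rather than a string lets the result be passed directly to `subprocess.run` without shell quoting.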
- To add new algorithms:
- To add new datasets:
  - Import the file to `TestingFramework/bin/Debug/data/{name}/{name}_normal.txt` (`name` is the name of your data).
  - Requirements: rows >= 1'000, columns >= 10, column separator: empty space, row separator: newline
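Before importing a new dataset, the requirements above can be checked with a short script. This is a sketch under stated assumptions: `check_dataset` is a hypothetical helper, not part of ImputeBench, and it additionally assumes that every field is a plain number, which the requirements do not state explicitly.

```python
from pathlib import Path

MIN_ROWS = 1000  # "rows >= 1'000" from the requirements
MIN_COLS = 10    # "columns >= 10" from the requirements


def check_dataset(path):
    """Return a list of problems found in a candidate dataset file (empty if none)."""
    problems = []
    # Rows are separated by newlines; blank lines are ignored.
    lines = [ln for ln in Path(path).read_text().split("\n") if ln.strip()]
    if len(lines) < MIN_ROWS:
        problems.append(f"{len(lines)} rows, need at least {MIN_ROWS}")
    for number, line in enumerate(lines, start=1):
        # Columns are separated by a single space; tabs or commas are not split.
        fields = line.strip().split(" ")
        if len(fields) < MIN_COLS:
            problems.append(f"row {number}: {len(fields)} columns, need at least {MIN_COLS}")
            break
        try:
            for field in fields:
                float(field)  # assumption: every value is numeric
        except ValueError:
            problems.append(f"row {number}: non-numeric or empty field")
            break
    return problems

# Example, with {name} replaced by your dataset name:
#   print(check_dataset("TestingFramework/bin/Debug/data/{name}/{name}_normal.txt"))
```

The check stops at the first malformed row so that a badly formatted file does not produce thousands of messages.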
Mourad Khayati and Zakhar Tymchenko.
ImputeBench has received the VLDB 2020 Most Reproducible Paper Award.
@inproceedings{imputebench2020vldb,
author = {Mourad Khayati and Alberto Lerner and Zakhar Tymchenko and Philippe Cudr{\'{e}}{-}Mauroux},
title = {Mind the Gap: An Experimental Evaluation of Imputation of Missing Values Techniques in Time Series},
booktitle = {Proceedings of the VLDB Endowment},
volume = {13},
number = {5},
year = {2020}
}