ImputeBench: Benchmark of Imputation Techniques in Time Series

ImputeBench implements state-of-the-art recovery techniques for blocks of missing values in time series and evaluates their precision and runtime on various real-world time series datasets under different recovery scenarios. Technical details can be found in our PVLDB 2020 paper: Mind the Gap: An Experimental Evaluation of Imputation of Missing Values Techniques in Time Series. The benchmark can be easily extended with new algorithms (C/C++, Python or Matlab), new datasets and new scenarios.

The benchmark implements the following algorithms (in C++): CDRec, DynaMMo, GROUSE, ROSL, SoftImpute, SPIRIT, STMVL, SVDImpute, SVT, TeNMF, TRMF, and TKCM.
We recently added these new algorithms (in Python): SSA, MRNN and BRITS.
All the datasets used in this benchmark can be found here.
The full list of recovery scenarios can be found here.

Prerequisites

Ubuntu 16 or Ubuntu 18 (including Ubuntu derivatives, e.g., Xubuntu) or the same distribution under WSL.
Clone this repository.
Mono: Install mono from https://www.mono-project.com/download/stable/ and reboot.

Build

Build the Testing Framework using the installation script located in the root folder (this takes several minutes):

    $ sh install_linux.sh

[Optional] This script installs all the extra Python packages required by the newly added algorithms (SSA, MRNN and BRITS):

    $ sh install_extra.sh

Execution

    $ cd TestingFramework/bin/Debug/
    $ mono TestingFramework.exe [arguments]

Arguments

-alg	-d	-scen
cdrec	airq	miss_perc
dynammo	bafu	ts_length
grouse	chlorine	ts_nbr
rosl	climate	miss_disj
softimp	drift10	miss_over
svdimp	electricity	mcar
svt	meteo	blackout
stmvl	temp	all
spirit	bafu_red
tenmf	drift10_red
tkcm	all
trmf
all
--------	--------	--------
New algs
--------	--------	--------
ssa
m-rnn
brits

Results

All results and plots will be added to the Results folder. The accuracy results of all algorithms will be added sequentially, for each scenario and dataset, to Results/.../.../error/. The runtime results of all algorithms will be added to Results/.../.../runtime/. The plots of the recovered blocks will be added to Results/.../.../recovery/plots/.
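Because the intermediate folder names depend on the run, one way to locate the outputs afterwards is to search the Results folder for the subfolder names mentioned above. The following Python sketch does this. The function name `find_result_dirs` is hypothetical and not part of the benchmark; the sketch relies only on the folder names `error`, `runtime` and `recovery/plots` stated above and makes no assumption about the files inside them.

```python
from pathlib import Path


def find_result_dirs(results_root):
    """Group result folders found under results_root by kind.

    Relies only on the folder names given in this README:
    .../error, .../runtime and .../recovery/plots.
    """
    root = Path(results_root)
    found = {"error": [], "runtime": [], "plots": []}
    for path in sorted(root.rglob("*")):
        if not path.is_dir():
            continue
        if path.name in ("error", "runtime"):
            found[path.name].append(path)
        elif path.name == "plots" and path.parent.name == "recovery":
            # Only count plot folders that sit directly under "recovery".
            found["plots"].append(path)
    return found
```

Calling `find_result_dirs("Results")` from the folder that contains Results returns a dictionary with one list of paths per kind of output.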

Execution examples

Run a single algorithm (cdrec) on a single dataset (drift10) using one scenario (missing percentage)

    $ mono TestingFramework.exe -alg cdrec -d drift10 -scen miss_perc

Run two algorithms (cdrec, spirit) on a single dataset (drift10) using one scenario (missing percentage)

    $ mono TestingFramework.exe -alg cdrec,spirit -d drift10 -scen miss_perc

Run the previous example (cdrec and spirit on drift10) without runtime results

    $ mono TestingFramework.exe -alg cdrec,spirit -d drift10 -scen miss_perc -nort

Run the whole VLDB'20 benchmark (all algorithms, all datasets, all scenarios, precision and runtime)

    $ mono TestingFramework.exe -alg all -d all -scen all

Warning: Running the whole benchmark will take a sizeable amount of time (up to 4 days depending on the hardware) and will produce up to 15GB of output files with all recovered data and plots unless stopped early.

Additional command-line parameters

    $ mono TestingFramework.exe --help

Remark: The algorithms tkcm, spirit, ssa, brits and m-rnn cannot handle multiple incomplete time series. These algorithms will not produce results for the following scenarios: miss_disj, miss_over, mcar and blackout.
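The restriction in the remark above can be checked before launching a run. The following Python sketch filters a list of algorithm names for a given scenario. The helper `runnable_algorithms` is hypothetical and not part of the benchmark; the two sets are copied from the remark, with M-RNN spelled `m-rnn` as in the arguments table.

```python
# Algorithms that, per the remark, cannot handle multiple incomplete time series
# (M-RNN is spelled "m-rnn" here, as in the arguments table).
SINGLE_SERIES_ONLY = {"tkcm", "spirit", "ssa", "brits", "m-rnn"}

# Scenarios for which those algorithms produce no results.
MULTI_SERIES_SCENARIOS = {"miss_disj", "miss_over", "mcar", "blackout"}


def runnable_algorithms(algorithms, scenario):
    """Return the algorithms expected to produce results for the scenario."""
    if scenario not in MULTI_SERIES_SCENARIOS:
        return list(algorithms)
    return [name for name in algorithms if name not in SINGLE_SERIES_ONLY]
```

For example, `runnable_algorithms(["cdrec", "tkcm", "brits"], "mcar")` keeps only `cdrec`, while the same list is returned unchanged for `miss_perc`.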

Parametrized execution

You can parametrize each algorithm using the -algx argument. For example, you can run the svdimp algorithm with a reduction value of 4 on the drift10 dataset while varying the sequence length as follows:

    $ mono TestingFramework.exe -algx svdimp 4 -d drift10 -scen ts_nbr

If you want to run some algorithms with default parameters and others with customized ones, you can use -alg and -algx together. For example, you can run the stmvl algorithm with its default parameters and the cdrec algorithm with a reduction value of 4 on the airq dataset while varying the sequence length as follows:

    $ mono TestingFramework.exe -alg stmvl -algx cdrec 4 -d airq -scen ts_nbr

Remark: The argument -algx cannot be applied to a group of algorithms and thus must precede the name of each parametrized algorithm.
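When many runs are scripted, the rule that each parametrized algorithm needs its own -algx can be encoded once. The following Python sketch assembles the argument list from the flags shown in the examples above (-alg, -algx, -d, -scen). The function `build_arguments` is a hypothetical helper and not part of the benchmark, and it assumes one parameter value per -algx, as in the examples.

```python
def build_arguments(dataset, scenario, default_algs=(), custom_algs=()):
    """Build the argument list for TestingFramework.exe.

    default_algs: algorithm names run with default parameters (one -alg group).
    custom_algs:  (name, parameter) pairs; each pair gets its own -algx.
    """
    args = []
    if default_algs:
        args += ["-alg", ",".join(default_algs)]
    for name, param in custom_algs:
        # -algx is repeated before every parametrized algorithm.
        args += ["-algx", name, str(param)]
    args += ["-d", dataset, "-scen", scenario]
    return args


command = ["mono", "TestingFramework.exe"] + build_arguments(
    "airq", "ts_nbr", default_algs=["stmvl"], custom_algs=[("cdrec", 4)]
)
print(" ".join(command))
# prints: mono TestingFramework.exe -alg stmvl -algx cdrec 4 -d airq -scen ts_nbr
```

The resulting list can be passed to `subprocess.run` from the TestingFramework/bin/Debug/ folder if the run should be started from Python.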

Extension

To add new algorithms:
To add new datasets:
- import the file to TestingFramework/bin/Debug/data/{name}/{name}_normal.txt (name is the name of your data).
- Requirements: rows >= 1,000, columns >= 10, column separator: single space, row separator: newline.
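A new dataset file can be checked against these requirements before it is imported. The following Python sketch is one possible check. The function `check_dataset` is hypothetical and not part of the benchmark, and it additionally assumes that every entry is numeric and that all rows have the same number of columns, which the requirements above do not state explicitly.

```python
def check_dataset(path, min_rows=1000, min_cols=10):
    """Return (rows, cols) if the file meets the requirements, else raise ValueError."""
    rows = 0
    cols = None
    with open(path) as handle:
        for line_no, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines, e.g. at the end of the file
            values = line.strip().split(" ")  # columns separated by one space
            for value in values:
                float(value)  # raises ValueError on non-numeric or empty entries
            if cols is None:
                cols = len(values)
            elif len(values) != cols:
                raise ValueError(
                    f"line {line_no}: expected {cols} columns, found {len(values)}"
                )
            rows += 1
    if rows < min_rows or (cols or 0) < min_cols:
        raise ValueError(
            f"need at least {min_rows} rows and {min_cols} columns, "
            f"found {rows} x {cols or 0}"
        )
    return rows, cols
```

Because the split is on a single space, a tab-separated file or a row with doubled spaces is rejected with a `ValueError` rather than silently accepted.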

Contributors

Mourad Khayati and Zakhar Tymchenko.

Award

ImputeBench received the VLDB 2020 Most Reproducible Paper Award.

Citation

@inproceedings{imputebench2020vldb,
 author    = {Mourad Khayati and Alberto Lerner and Zakhar Tymchenko and Philippe Cudr{\'{e}}{-}Mauroux},
 title     = {Mind the Gap: An Experimental Evaluation of Imputation of Missing Values Techniques in Time Series},
 booktitle = {Proceedings of the VLDB Endowment},
 volume    = {13},
 number    = {5},
 year      = {2020}
}
