white-link / unsupervisedscalablerepresentationlearningtimeseries

Unsupervised Scalable Representation Learning for Multivariate Time Series: Experiments

License: Apache License 2.0

Jupyter Notebook 80.02% Python 19.98%

deep-learning machine-learning neural-networks neurips-2019 pytorch time-series time-series-classification triplet-loss ucr-archive uea-archive

unsupervisedscalablerepresentationlearningtimeseries's Issues

encode_window with varying length data

I am using the same configuration as in #34, except with a batch size of 1. I have padded all input data with NaNs to the same length. Fitting the encoder works without problems. However, I get the following error when using encode_window on my test data:

RuntimeError: Only zero batch or zero channel inputs are supported, but got input shape: [1, 6, 1, 0]

For context, my data is 6-dimensional. Should the test data not be multiple samples? Or should they not be padded?

Thanks in advance!

How to recover constant length TS training performance with varying length TS.

Hello again,

It turns out that training with varying-length time series is much slower (as you hinted at in the paper).
So now I'm looking at ways to recover that performance, and wondered whether you have some intuition or preference regarding the following possibilities:

1. Zero-pad all time series.
2. Repeat short time series.
3. Make all time series multivariate:
- dim0: the original TS, zero-padded
- dim1: 0 where dim0 is padding, otherwise 1

Since the model uses global max pooling, padding with zeros seems a relatively neutral and straightforward "fix".
Of course the choices can be empirically validated, however, maybe you have some intuition/preference or theoretical insight.

Thx in advance!

Best,
Aaron
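
A minimal sketch of the third option above (zero padding plus a validity channel), assuming each series is a plain list of floats. The function name pad_with_mask and the output layout (series, channel, time) are illustrative choices, not part of the repository.

```python
import numpy as np

def pad_with_mask(series_list):
    """Zero-pad univariate series and add a 0/1 validity channel.

    Returns an array of shape (n_series, 2, max_length):
    channel 0 holds the zero-padded values, channel 1 is 1 on real
    observations and 0 on padding.
    """
    max_length = max(len(s) for s in series_list)
    out = np.zeros((len(series_list), 2, max_length), dtype=np.float64)
    for i, s in enumerate(series_list):
        out[i, 0, :len(s)] = s      # original values, zeros afterwards
        out[i, 1, :len(s)] = 1.0    # mask: 1 = observed, 0 = padding
    return out

batch = pad_with_mask([[1.0, 2.0, 3.0], [4.0, 5.0]])
print(batch.shape)  # (2, 2, 3)
```

Whether the mask channel actually helps the encoder would still have to be validated empirically, as the question itself notes.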

General questions regarding hyper parameter.

Hello,

Thank you for your great paper!

While applying your software, a few questions came up:

1) Have you tried tuning the gamma parameter of the SVM? (The paper only mentions C.)
2) The paper recommends 1500-2000 optimization steps; however, what if my dataset has 40k time series?
2.1) Any recommendation on how to set or tune the number of optimization steps?
3) In the paper, representations of length 80-160 were trained; however, that seems quite long to me, given that many datasets in UCR have time series shorter than that.
3.1) Have you done any tests with 16-32 dimensional representations?
3.2) Have you examined the correlation between performance and the average time series length of the underlying dataset?
3.3) Any recommendations for a lower bound on dimensionality?

Any insight is highly appreciated!

Performance of Transfer Learning: Train encoder on different datasets

Hi, I'm doing research on transfer learning of the representations learned by the encoder.

I noticed that in your paper only the FordA dataset was used for transfer learning. Have you tried training the encoder on other datasets and then training the SVM on other UCR datasets? Also, are there any patterns in which pairs of datasets work well together, that is, where the target dataset benefits substantially from the training dataset? Another thought is that training the encoder on some datasets may give better transfer performance than training on others.

Thank you very much for discussing these thoughts with me. Your experience will help me a lot.

Visualization of classification result

Hello,

Thanks to your explanation, I managed to produce the trained model files.

I can check the accuracy; however, I don't know how to visualize the classification results.

Could you give me some advice on this?

Kind regards,

Hyoon

I don't know how to compile this code

Hi, I want to run this code, but I have run into a few problems.

When I try to run the 'HouseholdPowerConsumption.ipynb' notebook, the error message 'CUDA out of memory' appears. I added torch.cuda.empty_cache() to scikit_wrappers.py. Training then gets a few epochs further, but I still get the same 'CUDA out of memory' error. How can I solve this problem?
I also tried the usage example, but after the last epoch the error message below appears.
ㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡ
Epoch: 250
/home/sjkim/anaconda3/lib/python3.7/site-packages/sklearn/model_selection/_split.py:667: UserWarning: The least populated class in y has only 2 members, which is less than n_splits=5.
  % (min_groups, self.n_splits)), UserWarning)
/home/sjkim/anaconda3/lib/python3.7/site-packages/sklearn/model_selection/search.py:823: FutureWarning: The parameter 'iid' is deprecated in 0.22 and will be removed in 0.24.
  "removed in 0.24.", FutureWarning
Traceback (most recent call last):
  File "ucr.py", line 193, in <module>
    os.path.join(args.save_path, args.dataset)
  File "/home/sjkim/work/NIPS2019/scikit_wrappers.py", line 112, in save
    self.save_encoder(prefix_file)
  File "/home/sjkim/work/NIPS2019/scikit_wrappers.py", line 101, in save_encoder
    prefix_file + '_' + self.architecture + '_encoder.pth'
  File "/home/sjkim/anaconda3/lib/python3.7/site-packages/torch/serialization.py", line 327, in save
    with _open_file_like(f, 'wb') as opened_file:
  File "/home/sjkim/anaconda3/lib/python3.7/site-packages/torch/serialization.py", line 212, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/home/sjkim/anaconda3/lib/python3.7/site-packages/torch/serialization.py", line 193, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: '/path/to/save/models/Mallat_CausalCNN_encoder.pth'
ㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡㅡ
I don't know what to do. Please let me know how to run the whole code.
Thank you.
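
The FileNotFoundError above points at '/path/to/save/models/', which looks like a placeholder path that does not exist on disk. A minimal sketch of creating the output directory before saving, assuming this is the cause; ensure_save_dir is a hypothetical helper, and the empty file only stands in for what torch.save would write.

```python
import os
import tempfile

def ensure_save_dir(save_path):
    """Create save_path (and any missing parents) if needed; return it."""
    os.makedirs(save_path, exist_ok=True)
    return save_path

# Demo in a temporary location instead of the placeholder '/path/to/save/models'.
base = tempfile.mkdtemp()
save_path = ensure_save_dir(os.path.join(base, "models"))
model_file = os.path.join(save_path, "Mallat_CausalCNN_encoder.pth")
with open(model_file, "wb") as f:
    f.write(b"")  # stand-in for the bytes a real save call would write
```

Passing a real, writable directory as --save_path (or creating it first, as above) should avoid this particular error; the 'CUDA out of memory' problem is a separate issue and is not addressed by this sketch.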

Parameter Estimation for Linear Regression for IHEPC dataset

Is there a reason for not using deterministic OLS estimation (via the Moore-Penrose pseudoinverse, e.g. sklearn.linear_model.LinearRegression.fit()) for the linear regression on the IHEPC dataset? This may be beneficial, especially for reproducibility (even if it is assumed here that the errors are normally distributed, so that the OLS estimator coincides with the maximum-likelihood estimator). Or is OLS unfavorable because of potentially high matrix dimensions, so that the sequential Adam optimizer used here is computationally superior?

Thanks in advance!
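
For illustration, a sketch of the deterministic estimate the question refers to, on synthetic data. The feature matrix X, the weights true_w and the intercept 3.0 are made up for the example; nothing here reproduces the notebook's actual regression.

```python
import numpy as np

rng = np.random.default_rng(0)            # fixed seed, only for the toy data
X = rng.normal(size=(200, 4))             # hypothetical features
true_w = np.array([1.5, -2.0, 0.0, 0.5])  # hypothetical ground-truth weights
y = X @ true_w + 3.0 + 0.01 * rng.normal(size=200)

# Append a column of ones so the intercept is estimated jointly.
X1 = np.hstack([X, np.ones((X.shape[0], 1))])

# Closed-form least squares via the Moore-Penrose pseudoinverse.
w_pinv = np.linalg.pinv(X1) @ y

# Equivalent result from a dedicated least-squares solver.
w_lstsq, *_ = np.linalg.lstsq(X1, y, rcond=None)
```

Both routes give the same coefficients for a fixed input, with no dependence on initialization or learning rate; the cost grows with the number of features, which is the trade-off the question raises.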

About the extraction of positive and negative samples

Your paper helped me a lot, but when reading the code I was confused about how positive and negative samples are chosen. For one-dimensional time series, it seems to me that the representation and the positive representation are drawn from the same time series, while the negative samples are ten time series randomly selected from the entire training set. I am not sure whether I understand this correctly. I look forward to your reply.
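
The sampling scheme asked about here can be sketched as follows, under the assumption that the anchor and the positive are windows of the same series (the positive lying inside the anchor window) while each negative is a window of a randomly chosen training series. sample_triplet and its return format are hypothetical and do not mirror the repository's implementation.

```python
import random

def sample_triplet(train, nb_negatives=10, rng=random):
    """Return (series_index, start, end) ranges for one anchor, one
    positive and nb_negatives negatives drawn from `train`, a list of
    1-D sequences."""
    i = rng.randrange(len(train))
    n = len(train[i])
    len_pos = rng.randint(1, n)          # positive length
    len_ref = rng.randint(len_pos, n)    # anchor is at least as long
    ref_start = rng.randint(0, n - len_ref)
    # The positive is a sub-window of the anchor window.
    pos_start = rng.randint(ref_start, ref_start + len_ref - len_pos)
    anchor = (i, ref_start, ref_start + len_ref)
    positive = (i, pos_start, pos_start + len_pos)
    negatives = []
    for _ in range(nb_negatives):
        k = rng.randrange(len(train))    # any training series, at random
        m = len(train[k])
        len_neg = rng.randint(1, m)
        start = rng.randint(0, m - len_neg)
        negatives.append((k, start, start + len_neg))
    return anchor, positive, negatives

train = [[0.0] * 20, [0.0] * 7, [0.0] * 50]
anchor, positive, negatives = sample_triplet(train, rng=random.Random(0))
```

Under this reading, the anchor and positive share a source series but generally differ in length and position, and a negative may occasionally come from the same series as the anchor.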

Excellent work

Hi,

I just finished reading your paper and also saw this repository.

This repository is an excellent example of how experiment-driven research should be conducted.

Good paper; very well crafted code; a lot of experiments & datasets; archived results.

A job really well done!

Regards & thanks
Kapil

Each training result on the test set

Hello, I have encountered a new problem. If I want to output the test set results after each training epoch, how should I do it?

Missing Requirements

I wanted to test your code within a docker container and noticed that the requirements for Java and GCC were not listed. Installing these two on my docker container made the code work and I haven't found any errors yet.

I have attached my Dockerfile and docker-compose.yml for anyone who might want to use it. The sample code in the README executes fine. I haven't tested anything else yet.

Dockerfile

FROM pytorch/pytorch

# Install OpenJDK-8
RUN apt-get update && \
    apt-get install -y openjdk-8-jdk && \
    apt-get install -y ant && \
    apt-get clean;

# Fix certificate issues
RUN apt-get update && \
    apt-get install -y ca-certificates-java && \
    apt-get clean && \
    update-ca-certificates -f;

# Setup JAVA_HOME -- useful for docker commandline
ENV JAVA_HOME /usr/lib/jvm/java-8-openjdk-amd64/
RUN export JAVA_HOME

# Install gcc
RUN apt-get update && \
	apt-get install -y gcc

# install python dependencies
RUN pip install --upgrade pip
RUN pip install \
	numpy \
	matplotlib \
	PyQt5 \
	PyQtWebEngine \
	orange3 \
	pandas \
	python-weka-wrapper3 \
	scikit-learn==0.20.0 \
	scipy

docker-compose.yml

version: '2.3'
services:
  dev:
    container_name: jeanyves
    build:
      context: . # path to folder containing Dockerfile
      dockerfile: Dockerfile
    image: jeanyves:1.0.0
    volumes:
    - path/to/repo:/work # path to this git repo
    - path/to/data:/data # path to data, e.g. UCR
    - /tmp/.X11-unix:/tmp/.X11-unix
    environment:
    - DISPLAY=$DISPLAY
    - NVIDIA_VISIBLE_DEVICES=all
    runtime: nvidia
    working_dir: /work
    user: 1001:1001 # can probably be discarded
    tty: true
    stdin_open: true

To run the container make sure to install docker as well as docker-compose. Then:

cd path/to/docker-compose-yml-folder
docker-compose up dev

In another terminal:

docker exec -it jeanyves bash
python ucr.py --dataset Mallat --path /data/UCRArchive_2018 --save_path /work/results/models --hyper default_hyperparameters.json --cuda --gpu 0

Make sure to manually create the output folder specified under --save_path, if it does not already exist. Otherwise the script will fail with a no such file or directory exception.

Example of Embedding Learning

Thanks for the nice approach and for making it available!

Assume I have a set of ~100 different monthly time series, each with different lengths. I want to use your approach to learn embeddings for each of them to find groups of similarly behaved time series. It should give me something like this:

{1, 3, 8, 9} are cluster identifiers and all time series in each group should look similar.

It would be great if you could provide a minimal working example. If necessary, I can also provide the dataset. Many thanks!

args.folders

I am very interested in your article. What is the default value of args.folders in combine_ucr.py in your code? Thank you for your reply.

classifiers = [
    load_classifier(
        os.path.join(args.model_path, folder), args.dataset, args.cuda,
        args.gpu
    ) for folder in args.folders
]

Time series which have unequal lengths

Hello, thanks for your great work! Could you please give an example for time series that have unequal lengths? Thank you very much.
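
A minimal sketch of NaN padding for series of unequal lengths, assuming univariate inputs and an array layout of (series, channel, time). pad_with_nans is a hypothetical helper; only the idea of padding the ends with NaN up to the maximum length is taken from the discussion in this tracker.

```python
import numpy as np

def pad_with_nans(series_list):
    """Stack univariate series of unequal lengths into one array of
    shape (n_series, 1, max_length), padding the ends with NaN."""
    max_length = max(len(s) for s in series_list)
    out = np.full((len(series_list), 1, max_length), np.nan)
    for i, s in enumerate(series_list):
        out[i, 0, :len(s)] = s   # values first, NaN padding afterwards
    return out

train = pad_with_nans([[0.1, 0.2, 0.3, 0.4], [1.0, 2.0]])
print(train.shape)  # (2, 1, 4)
```

Note that this only covers trailing padding; NaNs in the middle of a series are a different situation and are discussed in other issues here.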

About nb_steps and early_stopping

Hello, thank you for your paper and the code. I am really interested in your paper!
Did you use the same value of nb_steps, namely 1500, for all datasets?
And did you not apply early_stopping?

out_channels vs reduced_size

Hi,

In the class CausalCNNEncoder( ), line 207: linear = torch.nn.Linear(reduced_size, out_channels). Before this line, the size of tensor should be (batch_size, out_channels). When passing this tensor to the linear layer torch.nn.Linear(in_features, out_features, bias=True), shouldn't in_features= out_channels and out_features=reduced_size? So line 207 should be: linear = torch.nn.Linear(out_channels, reduced_size).

Perhaps I am misinterpreting the meaning of these two variables. Hope you could help me with this question. Thanks in advance!
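
To make the shapes in question concrete, here is a NumPy sketch under the assumption that reduced_size is the number of channels leaving the convolutional stack and out_channels is the size of the final representation. If that reading holds, the pooled tensor has shape (batch_size, reduced_size), and Linear(reduced_size, out_channels) is dimensionally consistent. The sizes 160 and 320 are taken from a configuration posted elsewhere in this tracker; the random arrays are placeholders.

```python
import numpy as np

batch_size, reduced_size, out_channels, length = 4, 160, 320, 50
rng = np.random.default_rng(0)

# Assumed output of the convolutional stack: (batch, reduced_size, time).
cnn_output = rng.normal(size=(batch_size, reduced_size, length))

# Global max pooling over time, then drop the time axis.
pooled = cnn_output.max(axis=2)                     # (batch, reduced_size)

# A linear layer with in_features=reduced_size, out_features=out_channels.
# The weight is laid out as (out_features, in_features), as in torch.nn.Linear.
W = rng.normal(size=(out_channels, reduced_size))
b = np.zeros(out_channels)
representation = pooled @ W.T + b                   # (batch, out_channels)

print(pooled.shape, representation.shape)  # (4, 160) (4, 320)
```

If instead the tensor before the linear layer really had out_channels features, the swap proposed in the question would be needed; checking the printed shapes in the actual model settles which reading applies.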

Comparison with Timenet

Hi,

I found the reference and comparison to the Timenet paper

Malhotra, P., TV, V., Vig, L., Agarwal, P., and Shroff, G. TimeNet: Pre-trained deep recurrent neural network for time series classification.

and I found the way they construct their dataset for training, validation & testing is different from yours.

In your experiments, you train an encoder only on one dataset at a time and use it to generate the representations for any other dataset.

In Timenet, they selected 18 datasets to be used for training. Here is a paragraph from their paper

The training dataset is diverse as it contains time series belonging to 151 different classes from the 18 datasets with T varying from 24 to 512 (refer Table 2 for details). When using a dataset as part of the training or validation set, all the train and test time series from the dataset are included in the set. Each time series is normalized to have zero mean and unit variance

Unless I have misunderstood your paper & code, the size of the training dataset for your encoder(s) would be far less than the ones they used. Please correct me if my interpretation is incorrect.

If my interpretation is correct, do you think having a training setup like Timenet is of advantage to your network architecture & loss function as well?

Regards & thanks
Kapil

Curious about the loss function

Hi, sorry to bother you with one more question.

I looked at the loss function and it does not seem to be a standard triplet loss.
To me, it looks more like a kind of cross-entropy loss.

Could you explain why you chose this form for the loss function (dot product, sigmoid, etc.)?
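
As a reference point for this question, the following sketch computes a loss built from dot products and log-sigmoids: one term pulls the anchor and positive representations together, and one term per negative pushes them apart. This form is assumed from the description above (dot product and sigmoid) and resembles negative-sampling objectives such as word2vec's; the function names are illustrative.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def triplet_logistic_loss(anchor, positive, negatives):
    """-log sigma(a.p) - sum_k log sigma(-a.n_k) for plain float vectors."""
    loss = -log_sigmoid(dot(anchor, positive))
    for neg in negatives:
        loss -= log_sigmoid(-dot(anchor, neg))
    return loss

good = triplet_logistic_loss([1.0, 0.0], [1.0, 0.0], [[-1.0, 0.0]])
bad = triplet_logistic_loss([1.0, 0.0], [-1.0, 0.0], [[1.0, 0.0]])
```

Each term is a binary logistic loss on a similarity score, which is why the formula reads like a cross-entropy rather than a margin-based triplet loss.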

Wrong data link in HouseholdPowerConsumption.ipynb (?)

In the example notebook "HouseholdPowerConsumption.ipynb" you wrote:

The dataset can be found here: https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014.

This does not seem to be the data you use later in your code: you use daily windows of length 1440, which suggests a sampling frequency of one minute, whereas the source mentioned above has a sampling frequency of 15 minutes.

I assume that this source (possibly modified) is meant: https://www.kaggle.com/uciml/electric-power-consumption-data-set
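
The mismatch can be checked with simple arithmetic: a daily window of 1440 points corresponds to one sample per minute, while 15-minute sampling yields 96 points per day. The helper name below is illustrative.

```python
MINUTES_PER_DAY = 24 * 60

def samples_per_day(sampling_minutes):
    """Number of observations in one day at the given sampling interval."""
    return MINUTES_PER_DAY // sampling_minutes

per_minute = samples_per_day(1)      # 1440
quarter_hour = samples_per_day(15)   # 96
print(per_minute, quarter_hour)
```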

Multivariate dataset s.t. each variable has a distinct value pattern

Hi, I am using your work to encode multivariate time series data generated by various sensors of a machine. The value patterns of these sensors are quite different from each other. For example, one sensor records temperature (continuous data), while another records on/off status (0 or 1). In this case, is it better to train one encoder for all the sensors, or one encoder per sensor? Since it's a triplet loss model, I imagine that one encoder for all the sensors would suffice? Looking forward to hearing your suggestions. Thank you!

BUG: some variant length dataset cannot be trained successfully

Hi,
I'm a student doing research on time series representation learning. I think your paper and its code are a really good starting point. Thank you for your wonderful work!

However, when I run transfer_ucr.py and the code reaches the DodgerLoopDay dataset, it crashes with a message saying that the input contains 'NaN'.

I opened the DodgerLoopDay dataset and saw some NaNs in the TRAIN and TEST csv files, but I thought the code already handled variable-length datasets, so I don't know why this raises an error.

I then looked into scikit_wrappers.py and found, in the function encoder() (around line 370), some code handling variable-length datasets. It seems that this code only handles NaNs appearing at the end of the data, whereas in DodgerLoopDay the NaNs actually appear in the middle of the data, so I think that is why the error occurs.

I also noticed that in your paper some datasets in the UCR archive, including DodgerLoopDay, are not listed in the results table. Does this have something to do with the bug?

Sorry, my English is not so good. I hope you can understand me; I'm trying to explain it as explicitly as I can.
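
The failure mode described here can be reproduced in isolation. The sketch below assumes the valid length of a series is computed by subtracting the number of NaNs from the total length, which is only correct when all NaNs sit at the end. Function names are illustrative.

```python
import math

def length_by_nan_count(series):
    """Valid length, assuming every NaN is trailing padding."""
    return len(series) - sum(1 for v in series if math.isnan(v))

def has_nan(values):
    return any(math.isnan(v) for v in values)

nan = float("nan")
padded_only = [1.0, 2.0, 3.0, nan, nan]   # NaNs only at the end
interior = [1.0, nan, 3.0, 4.0, nan]      # one NaN in the middle

# Truncating to the computed length removes all NaNs only in the first case.
clean = padded_only[:length_by_nan_count(padded_only)]
still_dirty = interior[:length_by_nan_count(interior)]
print(has_nan(clean), has_nan(still_dirty))  # False True
```

With interior missing values, the truncated series still contains a NaN, which would then propagate into the encoder input.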

Pretrained Model

Hello,
I am a student currently exploring transfer learning for multivariate time series datasets. Could you please share your pre-trained model? I will cite your work appropriately.

How to treat incomplete time-series of unequal lengths

Many thanks for the interesting repo and paper!

Description

Going through the issues section of your repo, I would like to know how a data set with time series of unequal lengths and NAs should be treated.

Example

Suppose we have a set of univariate time series, where each series is of varying length, also with some observations randomly missing. This is illustrated as follows

The NaNs are random and do not occur at the same point in time across time series. Also, each time series is of different length. Based on these issues #19 and #18 I am not sure how to deal with this problem.

Questions

Q1:

In #18, you are stating

Note that, in the case of time series of unequal lengths, we require to pad the series with NaNs up to max_length.

Does this mean that if we have, say, two time series starting on the same date, one with length1 = 200 and the other with length2 = 180, we need to pad series2 with 20 NAs to reach the max_length of 200?
How should we deal with this example if the time series have different start dates? This matters because we cannot assume that both have the same length: one might start in, say, January 2020, while the other starts in April 2020. Padding the second, shorter time series with 20 NAs would then force the model to assume the same seasonality for both, even though they start at different times.

Q2:

In #19 you are stating

We do not plan on supporting time series containing missing values, but we are interested in possible extensions to this end. As a workaround, you might for instance replace the missing values by zeros or removing them. This might be sufficient to learn satisfying representations for these datasets.

Given the above example of time series with varying lengths and NAs, should we impute the in-between missing values with 0 and, if the time series is shorter than the longest series, pad it with NaNs up to max_length?
What would be the difference in terms of representations if we padded the series with 0s up to max_length instead of NAs?
Does your statement above imply that we should simply omit the NAs and then pad the shorter series with either NA or 0 up to max_length? If so, how could we ever learn seasonalities this way? Say, for one arbitrary time series, as below, we omit the missing values and join the ends, which would be May-2020 and August-2020. For the model, this is then one complete time series, even though June-2020 and July-2020 are missing, and we implicitly assume that August-2020 directly follows May-2020.

In summary, these are the options:

This is the original time series, with some observations missing at random
Impute the missing values with 0s
Drop the in-between NAs and create a new time series by joining the ends, risking a jump (in red) due to the discontinuity, and pad the missing observations at the end with NaN up to max_length (orange)
Drop the in-between NAs and create a new time series by joining the ends, risking a jump (in red) due to the discontinuity, and pad the missing observations at the end with 0 up to max_length (orange)

Multi-GPU parallel training

Hello, after reading your code, I found that it trains on a single GPU. If I want to train on multiple GPUs in parallel, where should I change the code to achieve this? Looking forward to your reply.

How to output the results of downstream tasks after each epoch training

I don't know how many epochs give the best and most stable results. How can I change the code so that the results of downstream tasks are output after each epoch? Looking forward to your reply.

Training fails on samples of unequal length

Hi, I'm having problems executing the training on a dataset with samples of different lengths.
My hyperparameters:

batch_size = 5,
channels = 40,
compared_length = 225000,
depth = 10,
in_channels = 6,
nb_steps = 1500,
kernel_size = 3,
penalty = null,
early_stopping = null,
lr = 0.001,
nb_random_samples = 10,
negative_penalty = 1,
out_channels = 320,
reduced_size = 160

As described in other issues as well, I have transformed the dataset to be in the shape of (, , ). I have also padded all samples to the length of the longest sample with NaNs. Interestingly, the training works on the same dataset when using two samples of the same length. With this configuration, however, and using 57 samples, I get the following error:

Traceback (most recent call last):
  File "/home/user/Dokumente/models/USRLTS/script.py", line 102, in <module>
    encoder.fit_encoder(train, save_memory=True, verbose=True)
  File "/home/user/Dokumente/models/USRLTS/scikit_wrappers.py", line 263, in fit_encoder
    loss = self.loss_varying(
  File "/home/user/.venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/.venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/Dokumente/models/USRLTS/losses/triplet_loss.py", line 238, in forward
    lengths_pos[j] = numpy.random.randint(
  File "numpy/random/mtrand.pyx", line 782, in numpy.random.mtrand.RandomState.randint
  File "numpy/random/_bounded_integers.pyx", line 1334, in numpy.random._bounded_integers._rand_int64
ValueError: low >= high

I have tried diagnosing this and found a couple of interesting things:

The batch shape given to the "TripletLossVaryingLength" loss class in triplet_loss.py is (5,6,3434578) even though I padded all samples to a length of 4293222
The lengths calculated for each batch (lengths_batch) and lengths_samples in lines 210-221 of triplet_loss.py are all 0 (this is causing the error)

However, I don't understand why the calculated length is 0, as it is subtracting the number of values being NaNs, and I have only padded the end of the samples with NaNs. Is there anything I am missing?

An error appears in uea.py: "JavaException: Attribute isn't relation-valued!", although all the arff files have the same form.

(venv1) D:\Downloads\paper\paper>python uea.py --dataset Mallat --path Mallat1/ --save_path D:/Downloads/paper/UEAmodels --hyper default_hyperparameters.json
DEBUG:weka.core.jvm:Adding bundled jars
DEBUG:weka.core.jvm:Classpath=['D:\Anaconda\envs\venv1\lib\site-packages\javabridge\jars\rhino-1.7R4.jar', 'D:\Anaconda\envs\venv1\lib\site-packages\javabridge\jars\runnablequeue.jar', 'D:\Anaconda\envs\venv1\lib\site-packages\javabridge\jars\cpython.jar', 'D:\Anaconda\envs\venv1\lib\site-packages\weka\lib\python-weka-wrapper.jar', 'D:\Anaconda\envs\venv1\lib\site-packages\weka\lib\weka.jar']
DEBUG:weka.core.jvm:MaxHeapSize=default
DEBUG:weka.core.jvm:Package support disabled
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by weka.core.WekaPackageClassLoaderManager (file:/D:/Anaconda/envs/venv1/Lib/site-packages/weka/lib/weka.jar) to method java.lang.ClassLoader.defineClass(java.lang.String,byte[],int,int,java.security.ProtectionDomain)
WARNING: Please consider reporting this to the maintainers of weka.core.WekaPackageClassLoaderManager
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Exception in thread "Thread-0" java.lang.IllegalArgumentException: Attribute isn't relation-valued!
        at weka.core.AbstractInstance.relationalValue(AbstractInstance.java:621)
        at weka.core.AbstractInstance.relationalValue(AbstractInstance.java:598)
Traceback (most recent call last):
  File "uea.py", line 158, in <module>
    args.path, args.dataset
  File "uea.py", line 55, in load_UEA_dataset
    nb_dims = train_weka.get_instance(0).get_relational_value(0).num_instances
  File "D:\Anaconda\envs\venv1\lib\site-packages\weka\core\dataset.py", line 704, in get_relational_value
    return Instances(javabridge.call(self.jobject, "relationalValue", "(I)Lweka/core/Instances;", index))
  File "D:\Anaconda\envs\venv1\lib\site-packages\javabridge\jutil.py", line 887, in call
    result = fn(*nice_args)
  File "D:\Anaconda\envs\venv1\lib\site-packages\javabridge\jutil.py", line 854, in fn
    raise JavaException(x)
javabridge.jutil.JavaException: Attribute isn't relation-valued!