
kimmeen / time-llm


[ICLR 2024] Official implementation of "🦙 Time-LLM: Time Series Forecasting by Reprogramming Large Language Models"

Home Page: https://arxiv.org/abs/2310.01728

License: Apache License 2.0

Languages: Python 85.47%, Shell 14.53%
Topics: cross-modal-learning cross-modality deep-learning language-model large-language-models machine-learning multimodal-deep-learning multimodal-time-series prompt-tuning time-series time-series-analysis time-series-forecast time-series-forecasting

time-llm's Introduction

(ICLR'24) Time-LLM: Time Series Forecasting by Reprogramming Large Language Models


🙋 Please let us know if you find a mistake or have any suggestions!

🌟 If you find this resource helpful, please consider starring this repository and citing our research:

@inproceedings{jin2023time,
  title={{Time-LLM}: Time series forecasting by reprogramming large language models},
  author={Jin, Ming and Wang, Shiyu and Ma, Lintao and Chu, Zhixuan and Zhang, James Y and Shi, Xiaoming and Chen, Pin-Yu and Liang, Yuxuan and Li, Yuan-Fang and Pan, Shirui and Wen, Qingsong},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2024}
}

Updates

🚩 News (May 2024): Time-LLM has been included in NeuralForecast. Special thanks to the contributors @JQGoh and @marcopeix!

🚩 News (March 2024): Time-LLM has been upgraded to serve as a general framework for repurposing a wide range of language models for time series forecasting. It defaults to Llama-7B and also supports two smaller PLMs (GPT-2 and BERT). Simply adjust --llm_model and --llm_dim to switch backbones.
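
For instance, a sketch of pointing the framework at GPT-2 (the backbone name GPT2 and the hidden size 768 below are assumptions based on GPT-2's architecture; check run_main.py for the exact values it accepts):

accelerate launch --num_processes 1 run_main.py --model TimeLLM --data ETTh1 --llm_model GPT2 --llm_dim 768 [remaining arguments as in ./scripts/TimeLLM_ETTh1.sh]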

Introduction

Time-LLM is a reprogramming framework that repurposes LLMs for general time series forecasting while keeping the backbone language models intact. Notably, we show that time series analysis (e.g., forecasting) can be cast as yet another "language task" that can be effectively tackled by an off-the-shelf LLM.

  • Time-LLM comprises two key components: (1) reprogramming the input time series into text prototype representations that are more natural for the LLM, and (2) augmenting the input context with declarative prompts (e.g., domain expert knowledge and task instructions) to guide LLM reasoning.

Requirements

Use Python 3.11 from Miniconda.

  • torch==2.2.2
  • accelerate==0.28.0
  • einops==0.7.0
  • matplotlib==3.7.0
  • numpy==1.23.5
  • pandas==1.5.3
  • scikit_learn==1.2.2
  • scipy==1.12.0
  • tqdm==4.65.0
  • peft==0.4.0
  • transformers==4.31.0
  • deepspeed==0.14.0
  • sentencepiece==0.2.0

To install all dependencies:

pip install -r requirements.txt
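
For a clean setup with Miniconda, a minimal sketch (the environment name time-llm is arbitrary):

conda create -n time-llm python=3.11 -y
conda activate time-llm
pip install -r requirements.txt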

Datasets

You can download the pre-processed datasets from [Google Drive], then place the downloaded contents under ./dataset.

Quick Demos

  1. Download datasets and place them under ./dataset
  2. Tune the model. We provide five experiment scripts for demonstration purposes under the folder ./scripts. For example, you can evaluate on the ETT datasets by:
bash ./scripts/TimeLLM_ETTh1.sh 
bash ./scripts/TimeLLM_ETTh2.sh 
bash ./scripts/TimeLLM_ETTm1.sh 
bash ./scripts/TimeLLM_ETTm2.sh

Detailed usage

Please refer to run_main.py, run_m4.py, and run_pretrain.py for a detailed description of each hyperparameter.
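
For reference, a sketch of a single-process training run that mirrors the arguments passed by ./scripts/TimeLLM_ETTh1.sh (the --main_process_port value here is illustrative; any free port above 1024 works):

accelerate launch --num_processes 1 --main_process_port 29500 run_main.py --task_name long_term_forecast --is_training 1 --root_path ./dataset/ETT-small/ --data_path ETTh1.csv --model_id ETTh1_512_96 --model TimeLLM --data ETTh1 --features M --seq_len 512 --label_len 48 --pred_len 96 --factor 3 --enc_in 7 --dec_in 7 --c_out 7 --des Exp --itr 1 --d_model 32 --d_ff 128 --batch_size 24 --learning_rate 0.01 --llm_layers 32 --train_epochs 100 --model_comment TimeLLM-ETTh1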

Further Reading

1, Foundation Models for Time Series Analysis: A Tutorial and Survey, in KDD 2024.

Authors: Yuxuan Liang, Haomin Wen, Yuqi Nie, Yushan Jiang, Ming Jin, Dongjin Song, Shirui Pan, Qingsong Wen*

@inproceedings{liang2024foundation,
  title={Foundation models for time series analysis: A tutorial and survey},
  author={Liang, Yuxuan and Wen, Haomin and Nie, Yuqi and Jiang, Yushan and Jin, Ming and Song, Dongjin and Pan, Shirui and Wen, Qingsong},
  booktitle={ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2024)},
  year={2024}
}

2, Position Paper: What Can Large Language Models Tell Us about Time Series Analysis, in ICML 2024.

Authors: Ming Jin, Yifan Zhang, Wei Chen, Kexin Zhang, Yuxuan Liang*, Bin Yang, Jindong Wang, Shirui Pan, Qingsong Wen*

@inproceedings{jin2024position,
   title={Position Paper: What Can Large Language Models Tell Us about Time Series Analysis}, 
   author={Ming Jin and Yifan Zhang and Wei Chen and Kexin Zhang and Yuxuan Liang and Bin Yang and Jindong Wang and Shirui Pan and Qingsong Wen},
  booktitle={International Conference on Machine Learning (ICML 2024)},
  year={2024}
}

3, Large Models for Time Series and Spatio-Temporal Data: A Survey and Outlook, in arXiv 2023. [GitHub Repo]

Authors: Ming Jin, Qingsong Wen*, Yuxuan Liang, Chaoli Zhang, Siqiao Xue, Xue Wang, James Zhang, Yi Wang, Haifeng Chen, Xiaoli Li (IEEE Fellow), Shirui Pan*, Vincent S. Tseng (IEEE Fellow), Yu Zheng (IEEE Fellow), Lei Chen (IEEE Fellow), Hui Xiong (IEEE Fellow)

@article{jin2023lm4ts,
  title={Large Models for Time Series and Spatio-Temporal Data: A Survey and Outlook}, 
  author={Ming Jin and Qingsong Wen and Yuxuan Liang and Chaoli Zhang and Siqiao Xue and Xue Wang and James Zhang and Yi Wang and Haifeng Chen and Xiaoli Li and Shirui Pan and Vincent S. Tseng and Yu Zheng and Lei Chen and Hui Xiong},
  journal={arXiv preprint arXiv:2310.10196},
  year={2023}
}

4, Transformers in Time Series: A Survey, in IJCAI 2023. [GitHub Repo]

Authors: Qingsong Wen, Tian Zhou, Chaoli Zhang, Weiqi Chen, Ziqing Ma, Junchi Yan, Liang Sun

@inproceedings{wen2023transformers,
  title={Transformers in time series: A survey},
  author={Wen, Qingsong and Zhou, Tian and Zhang, Chaoli and Chen, Weiqi and Ma, Ziqing and Yan, Junchi and Sun, Liang},
  booktitle={International Joint Conference on Artificial Intelligence (IJCAI)},
  year={2023}
}

5, TimeMixer: Decomposable Multiscale Mixing for Time Series Forecasting, in ICLR 2024. [GitHub Repo]

Authors: Shiyu Wang, Haixu Wu, Xiaoming Shi, Tengge Hu, Huakun Luo, Lintao Ma, James Y. Zhang, Jun Zhou

@inproceedings{wang2023timemixer,
  title={TimeMixer: Decomposable Multiscale Mixing for Time Series Forecasting},
  author={Wang, Shiyu and Wu, Haixu and Shi, Xiaoming and Hu, Tengge and Luo, Huakun and Ma, Lintao and Zhang, James Y and Zhou, Jun},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2024}
}

Acknowledgement

Our implementation adapts Time-Series-Library and OFA (GPT4TS) as the code base, and we have extensively modified them for our purposes. We thank the authors for sharing their implementations and related resources.

time-llm's People

Contributors

eltociear, hamba-m, kimmeen, kwuking, nickdatle, qingsongedu, yoshall


time-llm's Issues

difference between main_run.py and main_pretrain.py

I tried hard to spot the difference between the files main_run and main_pretrain, but as far as I can tell they are essentially the same, with only small differences that don't affect the logic of the code. Could we get some clarification on this?

issue with calculating the loss while using GPT-2

I'm using an Nvidia V100 GPU. I have changed --mixed_precision to FP16 instead of BF16, since this GPU can't work with BF16. I have also adjusted the line enc_out, n_vars = self.patch_embedding(x_enc.to(torch.bfloat16)) in TimeLLM.py
to be:
enc_out, n_vars = self.patch_embedding(x_enc.to(torch.float16))

I'm running this on GPT-2

Everything went well, but during the training phase I'm getting nan losses. Please see the attached screenshot.


Any thoughts on why this is happening?

script failed on single GPU

when running bash ./scripts/TimeLLM_ETTh1.sh

[2024-03-28 16:36:25,314] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
2024-03-28 16:36:26.899060: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-03-28 16:36:26.901427: I external/local_tsl/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.

my env info:

Linux hp 5.15.0-97-generic #107~20.04.1-Ubuntu SMP Fri Feb 9 14:20:11 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

| NVIDIA-SMI 535.154.05 Driver Version: 535.154.05 CUDA Version: 12.2

thanks

Error when loading M4 training data

When trying to load the M4 training data, it generates the following error:

training_values = np.array(
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (4227,) + inhomogeneous part.
Traceback (most recent call last):
  File "/data5/aamayuelasfernandez/Time-LLM/run_m4.py", line 133, in <module>
    train_data, train_loader = data_provider(args, 'train')
  File "/data5/aamayuelasfernandez/Time-LLM/data_provider/data_factory.py", line 46, in data_provider
    data_set = Data(
  File "/data5/aamayuelasfernandez/Time-LLM/data_provider/data_loader.py", line 349, in __init__
    self.__read_data__()
  File "/data5/aamayuelasfernandez/Time-LLM/data_provider/data_loader.py", line 357, in __read_data__

I loaded the data separately and it gives the same error:

Cell In[3], line 4
      1 import numpy as np
      2 seasonal_patterns = 'Hourly'
----> 4 training_values = np.array(
      5             [v[~np.isnan(v)] for v in
      6              dataset.values[dataset.groups == seasonal_patterns]])  # split different frequencies

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (414,) + inhomogeneous part.

I have debugged it a little and the files provided for M4 seem to have mixed lengths for a specific group/seasonal_pattern and therefore it cannot create the array.

Can you try running the M4 script and see if you run into the same issue?

Results not as expected (Table 1: Long-term forecasting results)

I used the default hyper-parameters from the repository and 8 A100 (80G) GPUs, but the results were below expectations.

For instance, consider these four results on ETTh1 and ETTh2:

ETTh1 512_96 MAE 0.415 (compared to 0.392 in Table 1)
ETTh1 512_192 MAE 0.428 (compared to 0.418 in Table 1)
ETTh2 512_96 MAE 0.365 (compared to 0.328 in Table 1)
ETTh2 512_192 MAE 0.389 (compared to 0.375 in Table 1)

Could it be that I missed some experimental details, or there are aspects I have not paid attention to?

The loss remains at nan:599it [06:49, 1.45it/s] iters: 600, epoch: 1 | loss: nan speed: 0.6871s/iter; left time: 9247.2732s

No matter how the learning rate is adjusted, the loss remains at nan
My training parameters are as follows:
accelerate launch --num_processes 1 --main_process_port 4096 run_main.py --task_name long_term_forecast --is_training 1 --root_path ./dataset/ --data_path ETTh1.csv --model_id ETTh1_512_96 --model TimeLLM --data ETTh1 --features M --seq_len 96 --label_len 48 --pred_len 96 --factor 3 --enc_in 7 --dec_in 7 --c_out 7 --des Exp --itr 1 --d_model 32 --d_ff 128 --batch_size 4 --learning_rate 0.001 --llm_layers 32 --train_epochs 1 --model_comment TimeLLM-ETTh1

Accelerate command not found

Hello, I am trying to run TimeLLM_ETTm2.sh, after just downloading the zip file, and am getting this error.

./scripts/TimeLLM_ETTm2.sh: line 14: accelerate: command not found
./scripts/TimeLLM_ETTm2.sh: line 40: accelerate: command not found
./scripts/TimeLLM_ETTm2.sh: line 68: accelerate: command not found
./scripts/TimeLLM_ETTm2.sh: line 96: accelerate: command not found

When running pip list, accelerate shows up as installed. I also tried setting up the accelerate config. What should I do next?

Thank you!

Running on python 3.11

Hi @kwuking - I got the code working with Python 3.11. I'm still testing all of the scripts, but would you want the updated code once I'm done testing? Python 3.11 is about 100% faster than Python 3.9, so it'll save time and money for you and for others who are training.

Error when installing Scipy

When I install scipy, I get an error that wheels can't be built for numpy. I can't seem to fix this error using the methods I find online. Any advice?

About the pre-trained word embeddings

Hi, thanks for releasing the code of this excellent work! The idea of Patch Reprogramming is very interesting.

I do have a question regarding the pre-trained word embeddings (vocabulary) $E$ used in your work. Could you please clarify the scale of the considered vocabulary?

Fail to replicate the experiment

Hi there,
I attempted to replicate the experiments using A40 GPUs, but the results are worse than those reported in the paper. I followed the same parameter settings as your code, except that I used gradient accumulation to achieve an equivalent batch size.

I have a few questions:

  1. In your code "run_main.py," both the training and testing pipelines are included, and you use accelerate with multiple processes. I noticed that the test loader is also split into multiple parts for each process, yet only the test results of the main process are outputted. Does this mean that you only evaluate your model on a portion of the test dataset?
  2. How many training epochs are typically needed? The validation loss appears to keep increasing during training, causing early stopping to terminate the training around 10 epochs, which is the value of the default patience.
  3. Was the experiment conducted with only one fixed random seed (==2021)? It seems that the random seed has a significant influence on the results. For instance, with different random seeds, my replicated results on ETTh1_96 yield MSE values of 0.392, 0.395, and 0.385 respectively (compared to the reported value in the paper of 0.362).

Are there any implementation details that I might have missed? Thank you in advance for your reply!

Undefined arguments for the forecast function

In the forecast method in TimeLLM, there are arguments that are not defined anywhere

def forecast(self, x_enc, x_mark_enc, x_dec, x_mark_dec):

What do these parameters refer to? They are not defined in the model class or in the bash scripts to run the model

what does the size for seq_len, label_len, and pred_len represent?

in the file data_loader.py

What do the numbers in "size" represent? I asked ChatGPT and Copilot, and this is the best I could come up with:

        # size: 512, 48, 96
        # self.seq_len (512): This is the length of the input sequences to the model. 
        #   In the context of time series forecasting, this would be the number of time steps from the past data that 
        #   the model will use to make its predictions.
        # self.label_len (48): This is the length of the label sequences. In time series forecasting, 
        #   this could be the number of time steps that the model is trained to predict ahead in the future.
        # self.pred_len (96): This is the length of the prediction sequences. This could be the number of time steps that
        #   the model is actually used to predict ahead in the future when it is deployed. This value can be different 
        #   from self.label_len depending on the specific forecasting strategy used. For example, in some cases, 
        #   the model might be trained to predict only the next few steps, but then used to make predictions further
        #   into the future by feeding its own predictions back as input.

Is that accurate?

Running Time-LLM on GPT2

Hello, would it be possible to share the scripts (e.g. for TimeLLM.py, modeling_utils.py, etc.) adapted for GPT2 please? I do not have enough compute power to train on LLaMA in a feasible amount of time. I have attempted to adapt it on my own but am running into a lot of errors. Thank you very much!

Multivariate timeseries

I'll say 'great job' before I dive into the code and try my own dataset. What do you think about fine-tuning the LLM and running it on two or more time series variables as input to forecast a third time series?

OSError: Can't load the configuration of '/mnt/alps/modelhub/pretrained_model/LLaMA/7B_hf/'.

I am running

!bash -x ./scripts/TimeLLM_ETTh1.sh

and am getting this error. Any ideas? Thank you in advance.

  • model_name=TimeLLM
  • train_epochs=100
  • learning_rate=0.01
  • llama_layers=32
  • master_port=00097
  • num_process=1
  • batch_size=24
  • d_model=32
  • d_ff=128
  • comment=TimeLLM-ETTh1
  • accelerate launch --num_processes 1 --main_process_port 00097 run_main.py --task_name long_term_forecast --is_training 1 --root_path ./dataset/ETT-small/ --data_path ETTh1.csv --model_id ETTh1_512_96 --model TimeLLM --data ETTh1 --features M --seq_len 512 --label_len 48 --pred_len 96 --factor 3 --enc_in 7 --dec_in 7 --c_out 7 --des Exp --itr 1 --d_model 32 --d_ff 128 --batch_size 24 --learning_rate 0.01 --llm_layers 32 --train_epochs 100 --model_comment TimeLLM-ETTh1
    [2024-02-23 21:57:50,939] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
    The following values were not passed to accelerate launch and had defaults used instead:
    --num_machines was set to a value of 1
    --mixed_precision was set to a value of 'no'
    --dynamo_backend was set to a value of 'no'
    To avoid this warning pass in values for each of the problematic parameters or run accelerate config.
    [2024-02-23 21:57:55,593] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
    Traceback (most recent call last):
    File "/usr/local/lib/python3.9/dist-packages/transformers/configuration_utils.py", line 672, in _get_config_dict
    resolved_config_file = cached_file(
    File "/usr/local/lib/python3.9/dist-packages/transformers/utils/hub.py", line 417, in cached_file
    resolved_file = hf_hub_download(
    File "/usr/local/lib/python3.9/dist-packages/huggingface_hub/utils/_validators.py", line 110, in _inner_fn
    validate_repo_id(arg_value)
    File "/usr/local/lib/python3.9/dist-packages/huggingface_hub/utils/_validators.py", line 158, in validate_repo_id
    raise HFValidationError(
    huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/mnt/alps/modelhub/pretrained_model/LLaMA/7B_hf/'. Use repo_type argument if needed.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/content/drive/MyDrive/Colab_Notebooks/Time-LLM/run_main.py", line 136, in
model = TimeLLM.Model(args).float()
File "/content/drive/MyDrive/Colab_Notebooks/Time-LLM/models/TimeLLM.py", line 44, in init
self.llama_config = LlamaConfig.from_pretrained('/mnt/alps/modelhub/pretrained_model/LLaMA/7B_hf/')
File "/usr/local/lib/python3.9/dist-packages/transformers/configuration_utils.py", line 590, in from_pretrained
config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/transformers/configuration_utils.py", line 617, in get_config_dict
config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/transformers/configuration_utils.py", line 693, in _get_config_dict
raise EnvironmentError(
OSError: Can't load the configuration of '/mnt/alps/modelhub/pretrained_model/LLaMA/7B_hf/'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure '/mnt/alps/modelhub/pretrained_model/LLaMA/7B_hf/' is the correct path to a directory containing a config.json file
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in
sys.exit(main())
File "/usr/local/lib/python3.9/dist-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/usr/local/lib/python3.9/dist-packages/accelerate/commands/launch.py", line 941, in launch_command
simple_launcher(args)
File "/usr/local/lib/python3.9/dist-packages/accelerate/commands/launch.py", line 603, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', 'run_main.py', '--task_name', 'long_term_forecast', '--is_training', '1', '--root_path', './dataset/ETT-small/', '--data_path', 'ETTh1.csv', '--model_id', 'ETTh1_512_96', '--model', 'TimeLLM', '--data', 'ETTh1', '--features', 'M', '--seq_len', '512', '--label_len', '48', '--pred_len', '96', '--factor', '3', '--enc_in', '7', '--dec_in', '7', '--c_out', '7', '--des', 'Exp', '--itr', '1', '--d_model', '32', '--d_ff', '128', '--batch_size', '24', '--learning_rate', '0.01', '--llm_layers', '32', '--train_epochs', '100', '--model_comment', 'TimeLLM-ETTh1']' returned non-zero exit status 1.

  • accelerate launch --num_processes 1 --main_process_port 00097 run_main.py --task_name long_term_forecast --is_training 1 --root_path ./dataset/ETT-small/ --data_path ETTh1.csv --model_id ETTh1_512_192 --model TimeLLM --data ETTh1 --features M --seq_len 512 --label_len 48 --pred_len 192 --factor 3 --enc_in 7 --dec_in 7 --c_out 7 --des Exp --itr 1 --d_model 32 --d_ff 128 --batch_size 24 --learning_rate 0.02 --llm_layers 32 --train_epochs 100 --model_comment TimeLLM-ETTh1
    [2024-02-23 21:58:04,007] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
    The following values were not passed to accelerate launch and had defaults used instead:
    --num_machines was set to a value of 1
    --mixed_precision was set to a value of 'no'
    --dynamo_backend was set to a value of 'no'
    To avoid this warning pass in values for each of the problematic parameters or run accelerate config.
    [2024-02-23 21:58:08,698] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
    Traceback (most recent call last):
    File "/usr/local/lib/python3.9/dist-packages/transformers/configuration_utils.py", line 672, in _get_config_dict
    resolved_config_file = cached_file(
    File "/usr/local/lib/python3.9/dist-packages/transformers/utils/hub.py", line 417, in cached_file
    resolved_file = hf_hub_download(
    File "/usr/local/lib/python3.9/dist-packages/huggingface_hub/utils/_validators.py", line 110, in _inner_fn
    validate_repo_id(arg_value)
    File "/usr/local/lib/python3.9/dist-packages/huggingface_hub/utils/_validators.py", line 158, in validate_repo_id
    raise HFValidationError(
    huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/mnt/alps/modelhub/pretrained_model/LLaMA/7B_hf/'. Use repo_type argument if needed.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/content/drive/MyDrive/Colab_Notebooks/Time-LLM/run_main.py", line 136, in
model = TimeLLM.Model(args).float()
File "/content/drive/MyDrive/Colab_Notebooks/Time-LLM/models/TimeLLM.py", line 44, in init
self.llama_config = LlamaConfig.from_pretrained('/mnt/alps/modelhub/pretrained_model/LLaMA/7B_hf/')
File "/usr/local/lib/python3.9/dist-packages/transformers/configuration_utils.py", line 590, in from_pretrained
config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/transformers/configuration_utils.py", line 617, in get_config_dict
config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/transformers/configuration_utils.py", line 693, in _get_config_dict
raise EnvironmentError(
OSError: Can't load the configuration of '/mnt/alps/modelhub/pretrained_model/LLaMA/7B_hf/'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure '/mnt/alps/modelhub/pretrained_model/LLaMA/7B_hf/' is the correct path to a directory containing a config.json file
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in
sys.exit(main())
File "/usr/local/lib/python3.9/dist-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/usr/local/lib/python3.9/dist-packages/accelerate/commands/launch.py", line 941, in launch_command
simple_launcher(args)
File "/usr/local/lib/python3.9/dist-packages/accelerate/commands/launch.py", line 603, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', 'run_main.py', '--task_name', 'long_term_forecast', '--is_training', '1', '--root_path', './dataset/ETT-small/', '--data_path', 'ETTh1.csv', '--model_id', 'ETTh1_512_192', '--model', 'TimeLLM', '--data', 'ETTh1', '--features', 'M', '--seq_len', '512', '--label_len', '48', '--pred_len', '192', '--factor', '3', '--enc_in', '7', '--dec_in', '7', '--c_out', '7', '--des', 'Exp', '--itr', '1', '--d_model', '32', '--d_ff', '128', '--batch_size', '24', '--learning_rate', '0.02', '--llm_layers', '32', '--train_epochs', '100', '--model_comment', 'TimeLLM-ETTh1']' returned non-zero exit status 1.

  • accelerate launch --num_processes 1 --main_process_port 00097 run_main.py --task_name long_term_forecast --is_training 1 --root_path ./dataset/ETT-small/ --data_path ETTh1.csv --model_id ETTh1_512_336 --model TimeLLM --data ETTh1 --features M --seq_len 512 --label_len 48 --pred_len 336 --factor 3 --enc_in 7 --dec_in 7 --c_out 7 --des Exp --itr 1 --d_model 32 --d_ff 128 --batch_size 24 --lradj COS --learning_rate 0.001 --llm_layers 32 --train_epochs 100 --model_comment TimeLLM-ETTh1
    Traceback (most recent call last):
    File "/usr/local/bin/accelerate", line 5, in
    from accelerate.commands.accelerate_cli import main
    File "/usr/local/lib/python3.9/dist-packages/accelerate/init.py", line 3, in
    from .accelerator import Accelerator
    File "/usr/local/lib/python3.9/dist-packages/accelerate/accelerator.py", line 31, in
    import torch
    File "/usr/local/lib/python3.9/dist-packages/torch/init.py", line 1465, in
    from . import _meta_registrations
    File "/usr/local/lib/python3.9/dist-packages/torch/_meta_registrations.py", line 7, in
    from torch._decomp import _add_op_to_registry, global_decomposition_table, meta_table
    File "/usr/local/lib/python3.9/dist-packages/torch/_decomp/init.py", line 169, in
    import torch._decomp.decompositions
    File "/usr/local/lib/python3.9/dist-packages/torch/_decomp/decompositions.py", line 10, in
    import torch._prims as prims
    File "/usr/local/lib/python3.9/dist-packages/torch/_prims/init.py", line 2882, in
    register_nvprims()
    File "/usr/local/lib/python3.9/dist-packages/torch/_prims/nvfuser_prims.py", line 817, in register_nvprims
    nvprim.define(main_prim.schema)
    File "/usr/local/lib/python3.9/dist-packages/torch/library.py", line 69, in define
    return self.m.define(schema, alias_analysis)
    KeyboardInterrupt
    ^C

Clarification on Dataset-Specific Prompt Utilization

Hi authors,

I've been exploring the TimeLLM model and have some questions regarding its implementation of prompts for various datasets.

I observed in the repository that there is a collection of data-specific text descriptions within dataset/prompt_bank. However, it seems the TimeLLM model doesn't leverage these descriptions automatically. For instance, the implementation appears to consistently utilize the prompt for the ETT dataset by default, as seen in the model's code here.

Additionally, while the dataset-specific prompt is passed as the content argument in the main function (example here), it does not appear to be used in the subsequent training or testing process.

Could you please provide guidance on whether the prompts in models/TimeLLM require manual updates for each dataset, or if there is an intended mechanism based on the content argument that should automatically handle this?

Thank you for your assistance.

Getting error in command : bash ./scripts/TimeLLM_ETTh1.sh

In Google Colab, after executing the command pip install -r requirements.txt, I executed bash ./scripts/TimeLLM_ETTh1.sh and then got this error:

[2024-03-16 09:45:00,430] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
2024-03-16 09:45:09.637447: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-03-16 09:45:09.637606: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-03-16 09:45:09.753286: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-03-16 09:45:12.608198: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[09:45:13] WARNING  The following values were not passed to `accelerate launch` and    

After this, the cell gets executed.
Please help me with this issue.

llama-7b command correct?

OSError: huggyllama/llama-7b does not appear to have a file named pytorch_model.bin, tf_model.h5, model.ckpt or flax_model.msgpack.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 715682 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 715683) of binary:

I am facing this error. Is the llama-7b command accurate? Can someone help me figure this out?

Thanks.

RuntimeError: expected scalar type Float but found BFloat16

I am trying to run the ETTm1 example, but despite a plethora of efforts, I keep getting:

[2024-02-07 17:07:11,875] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-02-07 17:07:12,281] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.13.1, git-hash=unknown, git-branch=unknown
[2024-02-07 17:07:12,282] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-02-07 17:07:12,282] [INFO] [comm.py:652:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
[2024-02-07 17:07:12,293] [INFO] [comm.py:702:mpi_discovery] Discovered MPI settings of world_rank=0, local_rank=0, world_size=1, master_addr=172.19.2.2, master_port=29500
[2024-02-07 17:07:12,293] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-02-07 17:07:13,600] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2024-02-07 17:07:13,601] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2024-02-07 17:07:13,601] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2024-02-07 17:07:13,602] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = Adam
[2024-02-07 17:07:13,602] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=Adam type=<class 'torch.optim.adam.Adam'>
[2024-02-07 17:07:13,603] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 2 optimizer
[2024-02-07 17:07:13,603] [INFO] [stage_1_and_2.py:143:__init__] Reduce bucket size 200000000
[2024-02-07 17:07:13,603] [INFO] [stage_1_and_2.py:144:__init__] Allgather bucket size 200000000
[2024-02-07 17:07:13,603] [INFO] [stage_1_and_2.py:145:__init__] CPU Offload: False
[2024-02-07 17:07:13,603] [INFO] [stage_1_and_2.py:146:__init__] Round robin gradient partitioning: False
[2024-02-07 17:07:13,759] [INFO] [utils.py:791:see_memory_usage] Before initializing optimizer states
[2024-02-07 17:07:13,760] [INFO] [utils.py:792:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB 
[2024-02-07 17:07:13,761] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory:  used = 2.27 GB, percent = 7.2%
[2024-02-07 17:07:13,980] [INFO] [utils.py:791:see_memory_usage] After initializing optimizer states
[2024-02-07 17:07:13,981] [INFO] [utils.py:792:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB 
[2024-02-07 17:07:13,981] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory:  used = 2.32 GB, percent = 7.4%
[2024-02-07 17:07:13,981] [INFO] [stage_1_and_2.py:533:__init__] optimizer state initialized
[2024-02-07 17:07:14,103] [INFO] [utils.py:791:see_memory_usage] After initializing ZeRO optimizer
[2024-02-07 17:07:14,104] [INFO] [utils.py:792:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB 
[2024-02-07 17:07:14,105] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory:  used = 2.32 GB, percent = 7.4%
[2024-02-07 17:07:14,107] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = Adam
[2024-02-07 17:07:14,107] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2024-02-07 17:07:14,107] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2024-02-07 17:07:14,107] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[3.9999999999999996e-05], mom=[(0.95, 0.999)]
[2024-02-07 17:07:14,108] [INFO] [config.py:984:print] DeepSpeedEngine configuration:
[2024-02-07 17:07:14,108] [INFO] [config.py:988:print]   activation_checkpointing_config  {
    "partition_activations": false, 
    "contiguous_memory_optimization": false, 
    "cpu_checkpointing": false, 
    "number_checkpoints": null, 
    "synchronize_checkpoint_boundary": false, 
    "profile": false
}
[2024-02-07 17:07:14,108] [INFO] [config.py:988:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2024-02-07 17:07:14,108] [INFO] [config.py:988:print]   amp_enabled .................. False
[2024-02-07 17:07:14,108] [INFO] [config.py:988:print]   amp_params ................... False
[2024-02-07 17:07:14,109] [INFO] [config.py:988:print]   autotuning_config ............ {
    "enabled": false, 
    "start_step": null, 
    "end_step": null, 
    "metric_path": null, 
    "arg_mappings": null, 
    "metric": "throughput", 
    "model_info": null, 
    "results_dir": "autotuning_results", 
    "exps_dir": "autotuning_exps", 
    "overwrite": true, 
    "fast": true, 
    "start_profile_step": 3, 
    "end_profile_step": 5, 
    "tuner_type": "gridsearch", 
    "tuner_early_stopping": 5, 
    "tuner_num_trials": 50, 
    "model_info_path": null, 
    "mp_size": 1, 
    "max_train_batch_size": null, 
    "min_train_batch_size": 1, 
    "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
    "min_train_micro_batch_size_per_gpu": 1, 
    "num_tuning_micro_batch_sizes": 3
}
[2024-02-07 17:07:14,109] [INFO] [config.py:988:print]   bfloat16_enabled ............. True
[2024-02-07 17:07:14,109] [INFO] [config.py:988:print]   checkpoint_parallel_write_pipeline  False
[2024-02-07 17:07:14,109] [INFO] [config.py:988:print]   checkpoint_tag_validation_enabled  True
[2024-02-07 17:07:14,109] [INFO] [config.py:988:print]   checkpoint_tag_validation_fail  False
[2024-02-07 17:07:14,109] [INFO] [config.py:988:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7985856dae60>
[2024-02-07 17:07:14,109] [INFO] [config.py:988:print]   communication_data_type ...... None
[2024-02-07 17:07:14,109] [INFO] [config.py:988:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2024-02-07 17:07:14,109] [INFO] [config.py:988:print]   curriculum_enabled_legacy .... False
[2024-02-07 17:07:14,109] [INFO] [config.py:988:print]   curriculum_params_legacy ..... False
[2024-02-07 17:07:14,109] [INFO] [config.py:988:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2024-02-07 17:07:14,109] [INFO] [config.py:988:print]   data_efficiency_enabled ...... False
[2024-02-07 17:07:14,109] [INFO] [config.py:988:print]   dataloader_drop_last ......... False
[2024-02-07 17:07:14,109] [INFO] [config.py:988:print]   disable_allgather ............ False
[2024-02-07 17:07:14,109] [INFO] [config.py:988:print]   dump_state ................... False
[2024-02-07 17:07:14,109] [INFO] [config.py:988:print]   dynamic_loss_scale_args ...... None
[2024-02-07 17:07:14,110] [INFO] [config.py:988:print]   eigenvalue_enabled ........... False
[2024-02-07 17:07:14,110] [INFO] [config.py:988:print]   eigenvalue_gas_boundary_resolution  1
[2024-02-07 17:07:14,110] [INFO] [config.py:988:print]   eigenvalue_layer_name ........ bert.encoder.layer
[2024-02-07 17:07:14,110] [INFO] [config.py:988:print]   eigenvalue_layer_num ......... 0
[2024-02-07 17:07:14,110] [INFO] [config.py:988:print]   eigenvalue_max_iter .......... 100
[2024-02-07 17:07:14,110] [INFO] [config.py:988:print]   eigenvalue_stability ......... 1e-06
[2024-02-07 17:07:14,110] [INFO] [config.py:988:print]   eigenvalue_tol ............... 0.01
[2024-02-07 17:07:14,110] [INFO] [config.py:988:print]   eigenvalue_verbose ........... False
[2024-02-07 17:07:14,110] [INFO] [config.py:988:print]   elasticity_enabled ........... False
[2024-02-07 17:07:14,110] [INFO] [config.py:988:print]   flops_profiler_config ........ {
    "enabled": false, 
    "recompute_fwd_factor": 0.0, 
    "profile_step": 1, 
    "module_depth": -1, 
    "top_modules": 1, 
    "detailed": true, 
    "output_file": null
}
[2024-02-07 17:07:14,110] [INFO] [config.py:988:print]   fp16_auto_cast ............... None
[2024-02-07 17:07:14,110] [INFO] [config.py:988:print]   fp16_enabled ................. False
[2024-02-07 17:07:14,110] [INFO] [config.py:988:print]   fp16_master_weights_and_gradients  False
[2024-02-07 17:07:14,110] [INFO] [config.py:988:print]   global_rank .................. 0
[2024-02-07 17:07:14,110] [INFO] [config.py:988:print]   grad_accum_dtype ............. None
[2024-02-07 17:07:14,110] [INFO] [config.py:988:print]   gradient_accumulation_steps .. 1
[2024-02-07 17:07:14,110] [INFO] [config.py:988:print]   gradient_clipping ............ 0.0
[2024-02-07 17:07:14,110] [INFO] [config.py:988:print]   gradient_predivide_factor .... 1.0
[2024-02-07 17:07:14,110] [INFO] [config.py:988:print]   graph_harvesting ............. False
[2024-02-07 17:07:14,111] [INFO] [config.py:988:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2024-02-07 17:07:14,111] [INFO] [config.py:988:print]   initial_dynamic_scale ........ 1
[2024-02-07 17:07:14,111] [INFO] [config.py:988:print]   load_universal_checkpoint .... False
[2024-02-07 17:07:14,111] [INFO] [config.py:988:print]   loss_scale ................... 1.0
[2024-02-07 17:07:14,111] [INFO] [config.py:988:print]   memory_breakdown ............. False
[2024-02-07 17:07:14,111] [INFO] [config.py:988:print]   mics_hierarchial_params_gather  False
[2024-02-07 17:07:14,111] [INFO] [config.py:988:print]   mics_shard_size .............. -1
[2024-02-07 17:07:14,111] [INFO] [config.py:988:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2024-02-07 17:07:14,111] [INFO] [config.py:988:print]   nebula_config ................ {
    "enabled": false, 
    "persistent_storage_path": null, 
    "persistent_time_interval": 100, 
    "num_of_version_in_retention": 2, 
    "enable_nebula_load": true, 
    "load_path": null
}
[2024-02-07 17:07:14,111] [INFO] [config.py:988:print]   optimizer_legacy_fusion ...... False
[2024-02-07 17:07:14,111] [INFO] [config.py:988:print]   optimizer_name ............... None
[2024-02-07 17:07:14,111] [INFO] [config.py:988:print]   optimizer_params ............. None
[2024-02-07 17:07:14,111] [INFO] [config.py:988:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2024-02-07 17:07:14,111] [INFO] [config.py:988:print]   pld_enabled .................. False
[2024-02-07 17:07:14,111] [INFO] [config.py:988:print]   pld_params ................... False
[2024-02-07 17:07:14,111] [INFO] [config.py:988:print]   prescale_gradients ........... False
[2024-02-07 17:07:14,112] [INFO] [config.py:988:print]   scheduler_name ............... None
[2024-02-07 17:07:14,112] [INFO] [config.py:988:print]   scheduler_params ............. None
[2024-02-07 17:07:14,112] [INFO] [config.py:988:print]   seq_parallel_communication_data_type  torch.float32
[2024-02-07 17:07:14,112] [INFO] [config.py:988:print]   sparse_attention ............. None
[2024-02-07 17:07:14,112] [INFO] [config.py:988:print]   sparse_gradients_enabled ..... False
[2024-02-07 17:07:14,112] [INFO] [config.py:988:print]   steps_per_print .............. inf
[2024-02-07 17:07:14,112] [INFO] [config.py:988:print]   train_batch_size ............. 24
[2024-02-07 17:07:14,112] [INFO] [config.py:988:print]   train_micro_batch_size_per_gpu  24
[2024-02-07 17:07:14,112] [INFO] [config.py:988:print]   use_data_before_expert_parallel_  False
[2024-02-07 17:07:14,112] [INFO] [config.py:988:print]   use_node_local_storage ....... False
[2024-02-07 17:07:14,112] [INFO] [config.py:988:print]   wall_clock_breakdown ......... False
[2024-02-07 17:07:14,112] [INFO] [config.py:988:print]   weight_quantization_config ... None
[2024-02-07 17:07:14,112] [INFO] [config.py:988:print]   world_size ................... 1
[2024-02-07 17:07:14,112] [INFO] [config.py:988:print]   zero_allow_untested_optimizer  True
[2024-02-07 17:07:14,112] [INFO] [config.py:988:print]   zero_config .................. stage=2 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=200000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=200000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2024-02-07 17:07:14,112] [INFO] [config.py:988:print]   zero_enabled ................. True
[2024-02-07 17:07:14,112] [INFO] [config.py:988:print]   zero_force_ds_cpu_optimizer .. True
[2024-02-07 17:07:14,112] [INFO] [config.py:988:print]   zero_optimization_stage ...... 2
[2024-02-07 17:07:14,113] [INFO] [config.py:974:print_user_config]   json = {
    "bf16": {
        "enabled": true, 
        "auto_cast": true
    }, 
    "zero_optimization": {
        "stage": 2, 
        "allgather_partitions": true, 
        "allgather_bucket_size": 2.000000e+08, 
        "overlap_comm": true, 
        "reduce_scatter": true, 
        "reduce_bucket_size": 2.000000e+08, 
        "contiguous_gradients": true, 
        "sub_group_size": 1.000000e+09
    }, 
    "gradient_accumulation_steps": 1, 
    "train_batch_size": 24, 
    "train_micro_batch_size_per_gpu": 24, 
    "steps_per_print": inf, 
    "wall_clock_breakdown": false, 
    "fp16": {
        "enabled": false
    }, 
    "zero_allow_untested_optimizer": true
}
0it [00:00, ?it/s]
Traceback (most recent call last):
  File "/kaggle/working/Time-LLM/run_main.py", line 208, in <module>
    outputs = model(batch_x, batch_x_mark, dec_inp, batch_y_mark)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1842, in forward
    loss = self.module(*inputs, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/kaggle/working/Time-LLM/models/Autoformer.py", line 146, in forward
    dec_out = self.forecast(x_enc, x_mark_enc, x_dec, x_mark_dec)
  File "/kaggle/working/Time-LLM/models/Autoformer.py", line 102, in forecast
    enc_out = self.enc_embedding(x_enc, x_mark_enc)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/kaggle/working/Time-LLM/layers/Embed.py", line 145, in forward
    x = self.value_embedding(x) + self.temporal_embedding(x_mark)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/kaggle/working/Time-LLM/layers/Embed.py", line 42, in forward
    x = self.tokenConv(x.permute(0, 2, 1)).transpose(1, 2)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 310, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 303, in _conv_forward
    return F.conv1d(F.pad(input, self._reversed_padding_repeated_twice, mode=self.padding_mode),
RuntimeError: expected scalar type Float but found BFloat16

bash scripts blocked due to some missing args

Hello,

I tried running the repository on Google Colab, after successfully installing all the requirements (except for SciPy, which refuses to install the specified version for unknown reasons) and running the session on a T4 GPU.

I also downloaded the datasets and uploaded them to the mentioned location (./datasets/).

I get the following output when executing any of the available bash scripts:
(screenshot of the script output attached)

It seems like it's expecting some input. I tried entering random values (integers, chars, etc.) but got no response from the script.

Any idea about the reason behind this?

Unable to run the scripts

Hi team,
can you please help me resolve these errors?

Here is the log for the error:
torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:97 (errno: 13 - Permission denied). The server socket has failed to bind to 0.0.0.0:97 (errno: 13 - Permission denied).
Can you please let me know in which file to change the port number from 97 to a value greater than 1024?
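
(For reference, judging by the bash -x output elsewhere on this page, the port is set by the master_port variable defined near the top of each script and passed to accelerate launch as --main_process_port; a sketch of the change, assuming that layout:)

master_port=29500   # any free port above 1024, replacing the script's default 00097
accelerate launch --num_processes 1 --main_process_port $master_port run_main.py [remaining arguments unchanged]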

On the Colab platform, I am not able to see the output after the following step.
[2024-03-10 02:58:00,249] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) 2024-03-10 02:58:04.804409: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261]
Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered 2024-03-10 02:58:04.804465: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607]
Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered 2024-03-10 02:58:04.805832: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515]
Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered 2024-03-10 02:58:06.139190: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38]
TF-TRT Warning: Could not find TensorRT [02:58:06] WARNING The following values were not passed to accelerate launch and

Replication set up

Hi all

Thanks for your great work on this paper. I am working on research in conflict prediction and want to try to extrapolate some historical datasets using your framework. I tried replicating it on my Mac M1, which failed miserably because I don't have an NVIDIA GPU, I assume 😂.

Now I tried to run the existing scripts using my Ubuntu laptop that has a 1040 NVIDIA GPU, but that also failed (I assume because of a lack of space).

What was your setup for these experiments? Did you use conda to configure the environment, or just pip? I might be able to get access to a compute cluster of NVIDIA GPUs, but as they are costly I would like to get some details before I get to it.

Thanks in advance!

More detailed environment configuration information

Hello, I downloaded the code and installed the corresponding packages, but it still won't run directly. Is there more detailed environment configuration information, such as the Python version and CUDA version?

Traceback (most recent call last):
  File "/home/tianpengfei1/Time-LLM/run_main.py", line 132, in <module>
    model = TimeLLM.Model(args).float()
  File "/home/tianpengfei1/Time-LLM/models/TimeLLM.py", line 53, in __init__
    self.llama = LlamaModel.from_pretrained(
  File "/home/tianpengfei1/anaconda3/envs/llmtime/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2256, in from_pretrained
    quantization_config, kwargs = BitsAndBytesConfig.from_dict(
  File "/home/tianpengfei1/anaconda3/envs/llmtime/lib/python3.9/site-packages/transformers/utils/quantization_config.py", line 189, in from_dict
    config = cls(**config_dict)
  File "/home/tianpengfei1/anaconda3/envs/llmtime/lib/python3.9/site-packages/transformers/utils/quantization_config.py", line 118, in __init__
    self.post_init()
  File "/home/tianpengfei1/anaconda3/envs/llmtime/lib/python3.9/site-packages/transformers/utils/quantization_config.py", line 144, in post_init
    if self.load_in_4bit and not version.parse(importlib.metadata.version("bitsandbytes")) >= version.parse(
  File "/home/tianpengfei1/anaconda3/envs/llmtime/lib/python3.9/importlib/metadata.py", line 569, in version
    return distribution(distribution_name).version
  File "/home/tianpengfei1/anaconda3/envs/llmtime/lib/python3.9/importlib/metadata.py", line 542, in distribution
    return Distribution.from_name(distribution_name)
  File "/home/tianpengfei1/anaconda3/envs/llmtime/lib/python3.9/importlib/metadata.py", line 196, in from_name
    raise PackageNotFoundError(name)
importlib.metadata.PackageNotFoundError: bitsandbytes
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 9021) of binary: /home/tianpengfei1/anaconda3/envs/llmtime/bin/python
Traceback (most recent call last):
  File "/home/tianpengfei1/anaconda3/envs/llmtime/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/tianpengfei1/anaconda3/envs/llmtime/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/home/tianpengfei1/anaconda3/envs/llmtime/lib/python3.9/site-packages/accelerate/commands/launch.py", line 932, in launch_command
    multi_gpu_launcher(args)
  File "/home/tianpengfei1/anaconda3/envs/llmtime/lib/python3.9/site-packages/accelerate/commands/launch.py", line 627, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/tianpengfei1/anaconda3/envs/llmtime/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/tianpengfei1/anaconda3/envs/llmtime/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/tianpengfei1/anaconda3/envs/llmtime/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

Using FP16

If I have to change the precision to FP16 to accommodate my GPU, are there other changes I have to make? I changed the mixed precision flag and the DeepSpeed config JSON shown below.

(screenshot of the modified DeepSpeed config attached)

Which parts of the model are trained and updated during training?

I have read your paper and have one question I hope you can clarify.
In the model, the "pre-trained LLM" is "Frozen", i.e. its parameters are fixed and only the trained open-source model is used as-is. So which parameters are actually being trained and updated?
Is it only the parts highlighted in red in the figure, namely the input encoding layer ("patch reprogram") and the output layer ("Output projection"), that are trained and updated?


Problem of custom data utilization when new column added

Hi,

Thanks for the great work.
I would like to apply this work to our own data. However, our data includes an extra column, labelled as 'code'. An example can be seen in the image below. Could you please advise on how to modify the process to accommodate this extra column? Thank you.
(example data screenshot attached)

About the text prototypes

Dear authors, where in the code are the text prototypes implemented? Thank you very much.

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfd in position 0: invalid start byte

I am trying to run bash ./scripts/TimeLLM_ETTh1.sh on Google Colab. The only modification I made was to use
deepspeed==0.12.0 (instead of the deepspeed==0.13.0 that you have in your requirements.txt file).

!bash -x ./scripts/TimeLLM_ETTh1.sh

  • model_name=TimeLLM
  • train_epochs=100
  • learning_rate=0.01
  • llama_layers=32
  • master_port=00097
  • num_process=8
  • batch_size=24
  • d_model=32
  • d_ff=128
  • comment=TimeLLM-ETTh1
  • accelerate launch --multi_gpu --mixed_precision bf16 --num_processes 8 --main_process_port 00097 run_main.py --task_name long_term_forecast --is_training 1 --root_path ./dataset/ETT-small/ --data_path ETTh1.csv --model_id ETTh1_512_96 --model TimeLLM --data ETTh1 --features M --seq_len 512 --label_len 48 --pred_len 96 --factor 3 --enc_in 7 --dec_in 7 --c_out 7 --des Exp --itr 1 --d_model 32 --d_ff 128 --batch_size 24 --learning_rate 0.01 --llm_layers 32 --train_epochs 100 --model_comment TimeLLM-ETTh1
    [2024-02-20 23:12:01,131] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
    The following values were not passed to accelerate launch and had defaults used instead:
    --num_machines was set to a value of 1
    --dynamo_backend was set to a value of 'no'
    To avoid this warning pass in values for each of the problematic parameters or run accelerate config.
    [2024-02-20 23:12:29,074] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
    [2024-02-20 23:12:29,090] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
    [2024-02-20 23:12:29,127] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
    [2024-02-20 23:12:29,251] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
    [2024-02-20 23:12:29,312] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
    [2024-02-20 23:12:29,322] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
    [2024-02-20 23:12:29,343] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
    [2024-02-20 23:12:29,567] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
    Traceback (most recent call last):
    File "/usr/local/lib/python3.9/dist-packages/accelerate/utils/deepspeed.py", line 52, in init
    config_decoded = base64.urlsafe_b64decode(config_file_or_dict).decode("utf-8")
    Traceback (most recent call last):
    File "/usr/local/lib/python3.9/dist-packages/accelerate/utils/deepspeed.py", line 52, in init
    config_decoded = base64.urlsafe_b64decode(config_file_or_dict).decode("utf-8")
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfd in position 0: invalid start byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/content/drive/MyDrive/Colab_Notebooks/Time-LLM/run_main.py", line 100, in
Traceback (most recent call last):
File "/usr/local/lib/python3.9/dist-packages/accelerate/utils/deepspeed.py", line 52, in init
config_decoded = base64.urlsafe_b64decode(config_file_or_dict).decode("utf-8")
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfd in position 0: invalid start byte
Traceback (most recent call last):

During handling of the above exception, another exception occurred:

[attachment: bash-TimeLLM_ETTh1.sh.txt — full log]

How many epochs do I need to train before testing?

Hi there!

I want to train and run some test inference on the included datasets. Are the weights already fine-tuned and ready to use with the included datasets, or do I need to train the model myself? If so, for how many epochs? Thanks in advance!

CUDA Out of memory, had to lower batch size (Help with Memory Optimization)

Just a general question: I had to lower the batch size to get it to run. Can someone recommend memory-optimization frameworks or techniques that would let the model run faster with a larger batch size? At the current rate it will take days to complete all epochs.

I need a bit of help with memory optimization, or to hear whether others have managed to run it reasonably on Colab. I am currently running the TimeLLM_ETTh1.sh script, and it is going too slowly.

Issue with inference on CPU

I intend to perform model inference on the CPU. However, when I instantiate the model on the CPU, I encounter the error: RuntimeError: No GPU found. A GPU is needed for quantization. Does anyone know how to resolve this issue, or does it mean that I must perform model inference on a GPU?
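A minimal sketch of one way around this, assuming the error comes from a quantized from_pretrained call (e.g. load_in_4bit/load_in_8bit) inside models/TimeLLM.py: load the backbone without bitsandbytes quantization for CPU inference. The model id below is a placeholder for your own local LLaMA path.

from transformers import LlamaModel

llama = LlamaModel.from_pretrained(
    "huggyllama/llama-7b",   # hypothetical path / Hub id
    load_in_4bit=False,      # quantization requires a GPU, so keep it off
    load_in_8bit=False,
)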

Compute environment for multivariate time series with many channels

Hi here,

Thank you for sharing this great work! We are trying to run your model on multivariate time series with a large number of channels (Traffic, ECL, and Weather). We found it very hard to fit on our compute environment (4 x 48GB A6000) even with a batch size of 1.

Could you share a rough estimate of the compute environment needed to run the above datasets?

Thank you so much!

Error in running example script

RuntimeError: CUDA error: invalid device ordinal
OSError: Incorrect path_or_model_id: '/mnt/alps/modelhub/pretrained_model/LLaMA/7B_hf/'. Please provide either the path to a local folder or the repo_id of a model on the Hub.

I just cloned the repo to my local machine. I can see my GPU with the nvidia-smi command; however, when I run the first example script,

bash ./scripts/TimeLLM_ETTh1.sh

I get this error. Does this mean I have to download LLaMA first, or are there modifications I need to make to the bash script?
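A minimal sketch of the usual fix, assuming models/TimeLLM.py hard-codes the path '/mnt/alps/modelhub/pretrained_model/LLaMA/7B_hf/' seen in the error: point it at a local folder containing HF-format LLaMA weights, or at a Hub repo id. The repo id 'huggyllama/llama-7b' is an illustrative choice, not the official weights.

from transformers import LlamaModel, LlamaTokenizer

llama_path = "huggyllama/llama-7b"   # or a local directory with the converted weights

tokenizer = LlamaTokenizer.from_pretrained(llama_path)
backbone = LlamaModel.from_pretrained(llama_path)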

Model save path

I trained the model for 10 epochs and I didn't see any weights saved anywhere, but I did see this:

if accelerator.is_local_main_process:
    path = './checkpoints'  # unique checkpoint saving path
    del_files(path)  # delete checkpoint files
    accelerator.print('success delete checkpoints')

were the checkpoints deleted after training? If so, where are the best weights?
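One workaround, as a minimal sketch: save the trained weights yourself right after training, before the cleanup block above runs. The function below is hypothetical and assumes you call it from run_main.py with the accelerator and model objects created there; the output path is a placeholder.

import os
import torch
from accelerate import Accelerator

def save_trained_weights(accelerator: Accelerator, model: torch.nn.Module,
                         path: str = './checkpoints/timellm_trained.pth'):
    """Persist the trained weights on the main process before any cleanup."""
    if accelerator.is_local_main_process:
        os.makedirs(os.path.dirname(path), exist_ok=True)
        torch.save(accelerator.unwrap_model(model).state_dict(), path)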

Multi gpu running Error

When I run the script with multiple GPUs, I encounter the following error:
[screenshot: multi-GPU error message]
My parameters are shown below, and I have 2 x 80GB A100 GPUs.
[screenshot: run parameters]

Scripts for Weather, ECL and Traffic

Only scripts for ETT and M4 were provided. Could you please provide the relevant scripts for Weather, ECL, and Traffic mentioned in the paper?

Prompt vs no prompt results?

Hello,

Thank you for the great work. I was wondering whether you ran any prior experiments without the prompts.
I am curious how much the prompts actually help (see the sketch below of how the prompt is used).

Thank you.
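For context, a minimal sketch of the "prompt-as-prefix" idea: the embedded declarative prompt is concatenated in front of the reprogrammed time-series patches before the frozen LLM is called. The shapes and variable names are illustrative assumptions, not the exact code in models/TimeLLM.py.

import torch

B, P, L, D = 4, 32, 64, 4096          # batch, prompt length, patch tokens, LLM dim
prompt_emb = torch.randn(B, P, D)     # stand-in for embedded declarative prompt
series_emb = torch.randn(B, L, D)     # stand-in for reprogrammed patch embeddings

llm_input = torch.cat([prompt_emb, series_emb], dim=1)  # prompt acts as a prefix
print(llm_input.shape)                # torch.Size([4, 96, 4096])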

Could not build wheels for numpy, which is required to install pyproject.toml-based projects

I'm trying to install the requirements inside a venv virtual environment. The Python version is 3.11.8, and I'm getting this error:
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for numpy
Failed to build numpy
ERROR: Could not build wheels for numpy, which is required to install pyproject.toml-based projects
[end of output]

ValueError: not enough values to unpack (expected 3, got 2)

        for i, x, x_mark in zip(range(len(x_enc)), x_enc, x_mark_enc):
            B, T, N = x.size()
            x = self.normalize_layers[i](x, 'norm')
            if self.channel_independent:
                x = x.permute(0, 2, 1).contiguous().reshape(B * N, T, 1)
            x_list.append(x)

x.size() returns only two values here, so how can B, T, N = x.size() work?
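A minimal sketch of one way to handle this, assuming x arrives as a 2-D tensor [B, T] (a univariate series whose channel dimension was squeezed out); the shapes below are illustrative.

import torch

x = torch.randn(24, 512)          # illustrative [batch, seq_len] input
if x.dim() == 2:
    x = x.unsqueeze(-1)           # add the missing channel dim -> [batch, seq_len, 1]
B, T, N = x.size()
print(B, T, N)                    # 24 512 1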

Question about fine-tuning

When I run the command:

bash ./scripts/TimeLLM_ETTh1.sh

Am I training the 7b_hf model to translate time series data into a format that is suitable for an LLM to do time series forecasting?

Would I then use the trained model to translate new time series data and feed that translated data into an LLM to make predictions?

RuntimeError: CUDA error: invalid device ordinal

I am running this on google colab with a T4 GPU.

Any idea why the GPU is not being recognized? I do not see any device assignments anywhere.

------------------------------------------- start of log -------------------------------------------------------

  • model_name=TimeLLM

  • train_epochs=100

  • learning_rate=0.01

  • llama_layers=32

  • master_port=00097

  • num_process=8

  • batch_size=24

  • d_model=32

  • d_ff=128

  • comment=TimeLLM-ETTh1

  • accelerate launch --multi_gpu --mixed_precision bf16 --num_processes 8 --main_process_port 00097 run_main.py --task_name long_term_forecast --is_training 1 --root_path ./dataset/ETT-small/ --data_path ETTh1.csv --model_id ETTh1_512_96 --model TimeLLM --data ETTh1 --features M --seq_len 512 --label_len 48 --pred_len 96 --factor 3 --enc_in 7 --dec_in 7 --c_out 7 --des Exp --itr 1 --d_model 32 --d_ff 128 --batch_size 24 --learning_rate 0.01 --llm_layers 32 --train_epochs 100 --model_comment TimeLLM-ETTh1
    [2024-02-22 16:17:36,426] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
    The following values were not passed to accelerate launch and had defaults used instead:
    --num_machines was set to a value of 1
    --dynamo_backend was set to a value of 'no'
    To avoid this warning pass in values for each of the problematic parameters or run accelerate config.
    [2024-02-22 16:17:55,327] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
    [2024-02-22 16:17:55,384] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
    [2024-02-22 16:17:55,515] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
    [2024-02-22 16:17:55,601] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
    [2024-02-22 16:17:55,634] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
    [2024-02-22 16:17:55,638] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
    [2024-02-22 16:17:55,751] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
    [2024-02-22 16:17:55,774] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
    [2024-02-22 16:18:23,485] [INFO] [comm.py:637:init_distributed] cdb=None
    [2024-02-22 16:18:23,491] [INFO] [comm.py:637:init_distributed] cdb=None
    [2024-02-22 16:18:23,495] [INFO] [comm.py:637:init_distributed] cdb=None
    [2024-02-22 16:18:23,502] [INFO] [comm.py:637:init_distributed] cdb=None
    [2024-02-22 16:18:23,507] [INFO] [comm.py:637:init_distributed] cdb=None
    [2024-02-22 16:18:23,507] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
    [2024-02-22 16:18:23,510] [INFO] [comm.py:637:init_distributed] cdb=None
    [2024-02-22 16:18:23,512] [INFO] [comm.py:637:init_distributed] cdb=None
    [2024-02-22 16:18:23,513] [INFO] [comm.py:637:init_distributed] cdb=None
    Traceback (most recent call last):
    File "/content/drive/MyDrive/Colab_Notebooks/Time-LLM/run_main.py", line 104, in
    Traceback (most recent call last):
    Traceback (most recent call last):
    File "/content/drive/MyDrive/Colab_Notebooks/Time-LLM/run_main.py", line 104, in
    File "/content/drive/MyDrive/Colab_Notebooks/Time-LLM/run_main.py", line 104, in
    Traceback (most recent call last):
    File "/content/drive/MyDrive/Colab_Notebooks/Time-LLM/run_main.py", line 104, in
    Traceback (most recent call last):
    File "/content/drive/MyDrive/Colab_Notebooks/Time-LLM/run_main.py", line 126, in
    Traceback (most recent call last):
    File "/content/drive/MyDrive/Colab_Notebooks/Time-LLM/run_main.py", line 104, in
    Traceback (most recent call last):
    File "/content/drive/MyDrive/Colab_Notebooks/Time-LLM/run_main.py", line 104, in
    Traceback (most recent call last):
    File "/content/drive/MyDrive/Colab_Notebooks/Time-LLM/run_main.py", line 104, in
    accelerator = Accelerator(kwargs_handlers=[ddp_kwargs], deepspeed_plugin=deepspeed_plugin)
    File "/usr/local/lib/python3.9/dist-packages/accelerate/accelerator.py", line 345, in init
    self.state = AcceleratorState(
    File "/usr/local/lib/python3.9/dist-packages/accelerate/state.py", line 680, in init
    PartialState(cpu, **kwargs)
    File "/usr/local/lib/python3.9/dist-packages/accelerate/state.py", line 182, in init
    torch.cuda.set_device(self.device)
    File "/usr/local/lib/python3.9/dist-packages/torch/cuda/init.py", line 350, in set_device
    torch._C._cuda_setDevice(device)
    RuntimeError: CUDA error: invalid device ordinal
    CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
    For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
    Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

    accelerator = Accelerator(kwargs_handlers=[ddp_kwargs], deepspeed_plugin=deepspeed_plugin)
    File "/usr/local/lib/python3.9/dist-packages/accelerate/accelerator.py", line 345, in init
    self.state = AcceleratorState(
    File "/usr/local/lib/python3.9/dist-packages/accelerate/state.py", line 680, in init
    PartialState(cpu, **kwargs)
    File "/usr/local/lib/python3.9/dist-packages/accelerate/state.py", line 182, in init
    torch.cuda.set_device(self.device)
    File "/usr/local/lib/python3.9/dist-packages/torch/cuda/init.py", line 350, in set_device
    torch._C._cuda_setDevice(device)
    RuntimeError: CUDA error: invalid device ordinal
    CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
    For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
    Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

------------------------------------------- end of log --------------------------------------------------------

CalledProcessError when running TimeLLM_ETTh1.sh

When I run bash ./scripts/TimeLLM_ETTh1.sh, I get the following error:

[2024-03-28 15:11:33,415] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Traceback (most recent call last):
  File "/root/.pyenv/versions/3.9.13/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/root/.pyenv/versions/3.9.13/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/root/.pyenv/versions/3.9.13/lib/python3.9/site-packages/accelerate/commands/launch.py", line 941, in launch_command
    simple_launcher(args)
  File "/root/.pyenv/versions/3.9.13/lib/python3.9/site-packages/accelerate/commands/launch.py", line 603, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/root/.pyenv/versions/3.9.13/bin/python3.9', 'run_main.py', '--task_name', 'long_term_forecast', '--is_training', '1', '--root_path', './dataset/ETT-small/', '--data_path', 'ETTh1.csv', '--model_id', 'ETTh1_512_96', '--model', 'TimeLLM', '--data', 'ETTh1', '--features', 'M', '--seq_len', '512', '--label_len', '48', '--pred_len', '96', '--factor', '3', '--enc_in', '7', '--dec_in', '7', '--c_out', '7', '--des', 'Exp', '--itr', '1', '--d_model', '32', '--d_ff', '128', '--batch_size', '1', '--learning_rate', '0.01', '--llm_layers', '32', '--train_epochs', '1', '--model_comment', 'TimeLLM-ETTh1']' died with <Signals.SIGKILL: 9>.

I'm currently running on a single NVIDIA GeForce RTX 3050 Ti GPU, CUDA: Build cuda_11.5.r11.5/compiler.30672275_0.

I've also adjusted the code to match the single-GPU example provided in a previous issue (#28) and set CUDA_HOME and CUDA_PATH to the nvcc directory, but I still receive the above error.

Custom data utilization

Hi,

Thanks for the great work.

I was wondering whether there are any requirements or suggestions for using Time-LLM on custom data. If I want to fit the model to new data, which parts of the code should I pay attention to at training and testing time?
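A minimal sanity-check sketch, under the assumption that the 'custom' data loader expects a 'date' column plus numeric channels (with the forecast target among them) and that the channel count must match the --enc_in/--dec_in/--c_out launch flags when using --features M; the file name is a placeholder.

import pandas as pd

df = pd.read_csv("./dataset/custom/my_data.csv")

assert "date" in df.columns, "the custom loader is assumed to expect a 'date' column"
n_channels = df.shape[1] - 1      # every column except 'date'

# These counts should agree with --enc_in / --dec_in / --c_out for this file.
print(f"channels: {n_channels}")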

Facing error while installing requirements.txt

Hi,
I don't know if it's just me, but when I try to install requirements.txt I get the error: "Failed to build numpy
ERROR: Could not build wheels for numpy, which is required to install pyproject.toml-based projects
[end of output]"
To find out where the error originates, I pip-installed all the requirements individually and found that it comes up while installing scipy.
So I just wanted to understand: did I go wrong somewhere, or have others faced a similar issue?

Update: I found that I am able to install a newer version of scipy. Is it necessary to have scipy==1.5.4?

Thanks
