I really appreciate your outstanding work! The fusion of image and time-series modalities is exactly the direction I have been researching recently, so I tested your model on my own dataset and compared it with some models I have used before. However, CrossViVit did not perform as well as expected, and I would appreciate some advice on how to improve its forecasting performance.
Similar to the dataset used in your work, mine consists of 15-minute satellite images with a single channel (image size = 96*96) and the corresponding photovoltaic power. The difference is that my data spans only 2.8 years, with the last 5 months held out for testing. The model takes the past 4 hours of data (16 steps) and predicts the next 4 hours of photovoltaic power (16 steps).
I trained the model using the default parameters of CrossViVit as in your experiments, with the following differences:
- Optical flow was not used.
- Loss criterion = `nn.MSELoss` (the paper used L1 loss?).
- AdamW optimizer with a learning rate of 0.001.
- The batch_size was 16.
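For clarity, here is a minimal sketch of the training configuration listed above (the model here is a placeholder stand-in, not CrossViVit itself, and the loop is reduced to a single dummy step):

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 16)                   # placeholder for CrossViVit

criterion = nn.MSELoss()                    # MSE, whereas the paper seems to use L1
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
batch_size = 16

# One dummy optimization step, just to show the loop shape
x = torch.randn(batch_size, 16)             # past 16 PV steps
y = torch.randn(batch_size, 16)             # future 16 PV steps
loss = criterion(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```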
During the experiments, both the train loss and the validation loss kept fluctuating and did not decrease, and I am not sure what the issue might be. Below are the train and validation loss curves from the wandb logs. I compared them with Perceiver-RNN, a model used in the OCF project: https://github.com/openclimatefix/predict_pv_yield (experiment/003*.py)
train_loss: https://api.wandb.ai/links/740402059/uji2orxi
train_step_loss: https://api.wandb.ai/links/740402059/lwz71qqh
valid_loss: https://api.wandb.ai/links/740402059/k7a2abvx
I believe the cross-attention used in CrossViVit should outperform the concatenation used in Perceiver-RNN, but I am not sure what might be preventing the loss from decreasing.