Comments (10)
Hey!
Thanks for opening the issue. I will try to help you here.
installed the latest version of PyTorch
It could be that this is already the main issue. The requirement listed is pytorch 1.4.0, but it could also be that the earlier version was just masking a bug. So let's check further.
I searched on Google and made some improvements to the line of code
q *= dkh ** -0.5
What improvements did you make? The error message sounds like you can circumvent this by assigning the result to a new variable instead of q. Have you tried this?
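As a minimal sketch of that suggestion (not code from the repository): the tensor names and shapes below are made up for illustration, and `scale_queries` is a hypothetical helper. On recent PyTorch versions the in-place multiplication on an output of `torch.split` is expected to raise the RuntimeError quoted in this thread, while the out-of-place version creates a new tensor and still supports backpropagation.

```python
import torch


def scale_queries(q, dkh):
    # Out-of-place: builds a new tensor and leaves the view `q` untouched.
    return q * dkh ** -0.5


# Stand-in for a layer output: a non-leaf tensor that tracks gradients.
x = torch.randn(2, 6, requires_grad=True)
qkv = x * 1.0
q, k, v = torch.split(qkv, [2, 2, 2], dim=1)

try:
    # In-place update of one of several views returned by split.
    q *= 2 ** -0.5
    in_place_error = None
except RuntimeError as err:
    in_place_error = str(err)

q_scaled = scale_queries(q, 2)
q_scaled.sum().backward()

print("in-place error:", in_place_error)
print("gradient shape:", tuple(x.grad.shape))
```

Whether the in-place branch fails depends on the installed PyTorch version, so the sketch only records the message instead of assuming it.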
Now it doesn't stop at 30% anymore, but it starts from 0 again
What do you mean by this? The training? I don't see how an error would reset the progress made up to that point.
from multidim_conv.
Oh my goodness, you replied to me so quickly! Thank you very much
My current version of PyTorch is 2.1.0.
I changed q *= dkh ** -0.5 in attention_augmented_conv.py to
tmp_tensor = dkh ** -0.5
q = q * tmp_tensor
OR
q = q * dkh ** -0.5
Neither works.
I made this change because of the solution I found when I searched Google for this error:
RuntimeError: Output 0 of ReshapeAliasBackward0 is a view and is being modified inplace. This view is the output of a function that returns multiple views. Such functions do not allow the output views to be modified inplace. You should replace the inplace operation by an out-of-place one.
After I changed it like this, the output changed to:
NL dataset. Step: 1
Device: cuda
----------------------------------------------------------------
Layer (type) Output Shape Param #
================================================================
Conv2d-1 [-1, 32, 5, 5] 928
Linear-2 [-1, 128] 102,528
Linear-3 [-1, 64] 8,256
Linear-4 [-1, 7] 455
DoubleDense-5 [-1, 7] 0
================================================================
Total params: 112,167
Trainable params: 112,167
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 0.01
Params size (MB): 0.43
Estimated Total Size (MB): 0.44
----------------------------------------------------------------
Epochs: 30%| | 50/150 [00:00<?, ?it/s]
Device: cuda
----------------------------------------------------------------
Layer (type) Output Shape Param #
================================================================
Conv2d-1 [-1, 32, 5, 5] 928
Linear-2 [-1, 128] 102,528
Linear-3 [-1, 64] 8,256
Linear-4 [-1, 7] 455
DoubleDense-5 [-1, 7] 0
================================================================
Total params: 112,167
Trainable params: 112,167
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 0.01
Params size (MB): 0.43
Estimated Total Size (MB): 0.44
----------------------------------------------------------------
Epochs: 0%| | 0/150 [00:00<?, ?it/s]
Do you see what I mean? The code no longer stops when it reaches 30%; it starts all over again.
I've also tried a combination of Python 3.8 + PyTorch 1.4,
but this creates a new problem:
Traceback (most recent call last):
File "D:\python_project\multidim_conv-master\multidim_conv-master\train.py", line 1, in <module>
import torch
File "C:\Users\xinyi\.conda\envs\pyt38\lib\site-packages\torch\__init__.py", line 44, in <module>
import numpy as _np
File "C:\Users\xinyi\.conda\envs\pyt38\lib\site-packages\numpy\__init__.py", line 125, in <module>
from numpy.__config__ import show as show_config
File "C:\Users\xinyi\.conda\envs\pyt38\lib\site-packages\numpy\__config__.py", line 12, in <module>
os.add_dll_directory(extra_dll_dir)
AttributeError: module 'os' has no attribute 'add_dll_directory'
So do you know what the problem is? Should I be looking for problems with the new version or go back to the old one? Thanks again.
from multidim_conv.
That is indeed weird. I do not see any reason why it should restart. Did you maybe also remove the print statements in train.py? The script trains 8 models in sequence: 4 timesteps each for the NL and DK datasets (lines 401 to 407 in 1caae1f).
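To illustrate why a progress bar going back to 0% does not have to mean a restart, here is a rough sketch of such a sequence of runs. The loop order and the `training_schedule` helper are assumptions for illustration and are not taken from train.py; only the dataset names, the four timesteps and the "NL dataset. Step: 1" message format come from this thread.

```python
def training_schedule(datasets, timesteps):
    # One training run per (dataset, timestep) pair; each run would
    # create its own progress bar that starts again at 0%.
    return [(dataset, step) for dataset in datasets for step in timesteps]


schedule = training_schedule(["NL", "DK"], [1, 2, 3, 4])

for dataset, step in schedule:
    print(f"{dataset} dataset. Step: {step}")

print("total runs:", len(schedule))
```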
I've also tried a combination of python3.8 + pytorch1.4
I think I used Python 3.7.
The error AttributeError: module 'os' has no attribute 'add_dll_directory' led me here, where it is said that add_dll_directory was only introduced in Python 3.8.
Maybe you can create a new env with 3.7 and try that again.
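A quick way to check which interpreter an environment really runs, and whether it offers this function, is sketched below. It only uses the standard library; `has_add_dll_directory` is a hypothetical helper name. Note that `os.add_dll_directory` exists only on Windows and only from Python 3.8 onwards.

```python
import os
import sys


def has_add_dll_directory():
    # True only on Windows with Python 3.8 or newer.
    return hasattr(os, "add_dll_directory")


print("python:", ".".join(str(part) for part in sys.version_info[:3]))
print("platform:", sys.platform)
print("os.add_dll_directory available:", has_add_dll_directory())
```

Running this inside the activated environment shows whether the Python version matches what the installed packages expect.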
from multidim_conv.
I have lowered the version as you instructed
I have created a new virtual environment with Python version 3.7. The versions of other packages are as follows:
Package Version
----------------------- ----------
absl-py 2.0.0
cachetools 5.3.1
certifi 2022.12.7
charset-normalizer 3.3.0
colorama 0.4.6
einops 0.2.0
google-auth 2.23.3
google-auth-oauthlib 0.4.6
grpcio 1.59.0
idna 3.4
importlib-metadata 6.7.0
Markdown 3.4.4
MarkupSafe 2.1.3
numpy 1.21.6
oauthlib 3.2.2
olefile 0.46
Pillow 9.5.0
pip 22.3.1
protobuf 3.20.3
pyasn1 0.5.0
pyasn1-modules 0.3.0
requests 2.31.0
requests-oauthlib 1.3.1
rsa 4.9
scipy 1.7.3
setuptools 65.6.3
six 1.16.0
tensorboard 2.11.2
tensorboard-data-server 0.6.1
tensorboard-plugin-wit 1.8.1
torch 1.4.0+cu92
torchsummary 1.5.1
torchvision 0.5.0+cu92
tqdm 4.66.1
typing_extensions 4.7.1
urllib3 2.0.6
Werkzeug 2.2.3
wheel 0.38.4
wincertstore 0.2
zipp 3.15.0
It reported an error at 52%.
NL dataset. Step: 1
Device: cuda
----------------------------------------------------------------
Layer (type) Output Shape Param #
================================================================
Conv2d-1 [-1, 32, 5, 5] 928
Linear-2 [-1, 128] 102,528
Linear-3 [-1, 64] 8,256
Linear-4 [-1, 7] 455
DoubleDense-5 [-1, 7] 0
================================================================
Total params: 112,167
Trainable params: 112,167
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 0.01
Params size (MB): 0.43
Estimated Total Size (MB): 0.44
----------------------------------------------------------------
Epochs: 52%|█████▏ | 78/150 [15:18<14:07, 11.78s/it]
Stopping early --> val_loss has not decreased over 20 epochs
Traceback (most recent call last):
File "D:\python_project\multidim_conv-master\multidim_conv-master\train.py", line 404, in <module>
train_wind_nl(folder+data, epochs=150, input_timesteps=6, prediction_timestep=t, dev=dev, earlystopping=20)
File "D:\python_project\multidim_conv-master\multidim_conv-master\train.py", line 307, in train_wind_nl
summary(model, (7, input_timesteps, 6), device="cpu")
File "C:\Users\xinyi\.conda\envs\pytorch\lib\site-packages\torchsummary\torchsummary.py", line 72, in summary
model(*x)
File "C:\Users\xinyi\.conda\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\xinyi\.conda\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "D:\python_project\multidim_conv-master\multidim_conv-master\models\wind_models.py", line 93, in forward
x = F.relu(self.conv1(x))
File "C:\Users\xinyi\.conda\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\xinyi\.conda\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 1568, in _call_impl
result = forward_call(*args, **kwargs)
File "D:\python_project\multidim_conv-master\multidim_conv-master\models\attention_augmented_conv.py", line 52, in forward
flat_q, flat_k, flat_v, q, k, v = self.compute_flat_qkv(x, self.dk, self.dv, self.Nh)
File "D:\python_project\multidim_conv-master\multidim_conv-master\models\attention_augmented_conv.py", line 79, in compute_flat_qkv
q *= dkh ** -0.5
RuntimeError: Output 0 of ReshapeAliasBackward0 is a view and is being modified inplace. This view is the output of a function that returns multiple views. Such functions do not allow the output views to be modified inplace. You should replace the inplace operation by an out-of-place one.
What can I do to get your code to work correctly? Thank you very much!
from multidim_conv.
I don't know why it is not working for you, but the crash does not come from the training itself. Because of the line Stopping early --> val_loss has not decreased over 20 epochs we can see that the crash happened after that.
Does the model get saved in models/trained_models/wind_model_NL_{prediction_timestep}h_{model.__class__.__name__}.pt?
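For context on why a run can end at 52% without any error, here is a generic patience-based early-stopping check. It is a sketch under assumptions, not the logic from train.py: `early_stop_epoch` is a hypothetical helper, and the exact criterion used in the repository is not shown in this thread.

```python
def early_stop_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # New best validation loss: reset the counter.
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# The loss stops improving after epoch 2, so with patience 3 training ends at epoch 5.
stop_at = early_stop_epoch([1.0, 0.8, 0.9, 0.85, 0.95], patience=3)
print("stopped at epoch:", stop_at)
```

With a patience of 20, a stop at epoch 78 would mean the best validation loss was reached around epoch 58 under this criterion.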
from multidim_conv.
Yes, there are two files in the trained_models folder:
wind_model_NL_1h_CNN2DWind_NL.pt 445kb
wind_model_NL_1h_CNN2DAttWind_NL.pt 453kb
from multidim_conv.
Can you try to start the training for 2 steps ahead? So remove the 1 from here (lines 401 to 407 in 1caae1f). It seems like the problem occurs then.
from multidim_conv.
I'm sorry for replying so late. I've been a bit busy these past couple of days.
I followed what you said and removed the 1, but the result is still the same. I'm about to give up.
for t in [2, 3, 4]:
Have you tried running your code?
from multidim_conv.
I was just able to reproduce the problem. It was indeed in this line: q *= dkh ** -0.5.
I fixed it by renaming q to q_new:
def compute_flat_qkv(self, x, dk, dv, Nh):
    qkv = self.qkv_conv(x)
    N, _, H, W = qkv.size()
    q, k, v = torch.split(qkv, [dk, dk, dv], dim=1)
    q = self.split_heads_2d(q, Nh)
    k = self.split_heads_2d(k, Nh)
    v = self.split_heads_2d(v, Nh)
    dkh = dk // Nh
    q_new = q * dkh ** -0.5
    flat_q = torch.reshape(q, (N, Nh, dk // Nh, H * W))
    flat_k = torch.reshape(k, (N, Nh, dk // Nh, H * W))
    flat_v = torch.reshape(v, (N, Nh, dv // Nh, H * W))
    return flat_q, flat_k, flat_v, q_new, k, v
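One detail worth double-checking: in the snippet above, flat_q is reshaped from the unscaled q, whereas with the earlier in-place q *= dkh ** -0.5 a reshape that followed would have seen the scaled values. If the scaled queries are also wanted in flat_q (an assumption; the thread does not say which is intended), a variant could look like the sketch below. `scale_and_flatten` is a hypothetical standalone helper, and the 5-D input shape (N, Nh, dk // Nh, H, W) is assumed from the reshape targets shown above.

```python
import torch


def scale_and_flatten(q, dkh):
    # Assumed input shape: (N, Nh, dk // Nh, H, W).
    n, nh, d, h, w = q.shape
    # Out-of-place scaling avoids modifying a view in place.
    q_scaled = q * dkh ** -0.5
    # Flatten the spatial dimensions of the scaled tensor.
    flat_q = torch.reshape(q_scaled, (n, nh, d, h * w))
    return flat_q, q_scaled


q = torch.randn(2, 4, 5, 3, 3)
flat_q, q_scaled = scale_and_flatten(q, dkh=5)
print(tuple(flat_q.shape), tuple(q_scaled.shape))
```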
When you get the latest from master, it should work. Note that when executing train.py, 5 models get trained at each time step for each dataset.
Furthermore, I activated some more prints to show progress.
from multidim_conv.
@poemon Did this fix your issue?
from multidim_conv.