
Comments (10)

HansBambel commented on September 24, 2024

Hey!

Thanks for opening the issue. I will try to help you here.

installed the latest version of PyTorch

It could be that this is already the main issue. The listed requirement is PyTorch 1.4.0, but it could also be that the earlier version was just masking a bug. So let's check further.

I searched on Google and made some improvements to the line of code q *= dkh ** -0.5

What improvements did you make? The error message suggests that you can circumvent this by assigning the result to a new variable instead of q. Have you tried that?
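
For illustration, here is a minimal sketch (not from this repo, just the general pattern) of what that error message is about and how an out-of-place version avoids it:

import torch

w = torch.randn(4, 4, requires_grad=True)
x = w * 2.0                            # non-leaf tensor tracked by autograd
q, k = torch.split(x, [2, 2], dim=0)   # split returns multiple views of x

# q *= 0.5   # in-place on one of those views raises a RuntimeError like the one you are seeing
q = q * 0.5  # out-of-place: the result is a new tensor, so autograd is fine

(q.sum() + k.sum()).backward()
print(w.grad.shape)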

Now it doesn't stop at 30% anymore, but it starts from 0 again

What do you mean by this? The training? I don't see how an error would reset the progress made up to that point.


poemon commented on September 24, 2024

Oh my goodness, you replied to me so quickly! Thank you very much.

My current version of PyTorch is 2.1.0.
I changed q *= dkh ** -0.5 in attention_augmented_conv.py to

tmp_tensor = dkh ** -0.5
q = q * tmp_tensor

OR

q = q * dkh ** -0.5

but neither works.

I made this change because it was the solution suggested when I searched Google for the problem:

RuntimeError: Output 0 of ReshapeAliasBackward0 is a view and is being modified inplace. This view is the output of a function that returns multiple views. Such functions do not allow the output views to be modified inplace. You should replace the inplace operation by an out-of-place one.

When I changed it like this, the output changed to:

NL dataset. Step:  1
Device: cuda
----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Conv2d-1             [-1, 32, 5, 5]             928
            Linear-2                  [-1, 128]         102,528
            Linear-3                   [-1, 64]           8,256
            Linear-4                    [-1, 7]             455
       DoubleDense-5                    [-1, 7]               0
================================================================
Total params: 112,167
Trainable params: 112,167
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 0.01
Params size (MB): 0.43
Estimated Total Size (MB): 0.44
----------------------------------------------------------------
Epochs:   30%|          | 50/150 [00:00<?, ?it/s]
Device: cuda
----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Conv2d-1             [-1, 32, 5, 5]             928
            Linear-2                  [-1, 128]         102,528
            Linear-3                   [-1, 64]           8,256
            Linear-4                    [-1, 7]             455
       DoubleDense-5                    [-1, 7]               0
================================================================
Total params: 112,167
Trainable params: 112,167
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 0.01
Params size (MB): 0.43
Estimated Total Size (MB): 0.44
----------------------------------------------------------------
Epochs:   0%|          | 0/150 [00:00<?, ?it/s]

Do you see what I mean? The training doesn't stop when it reaches 30%; it starts all over again.

I've also tried a combination of Python 3.8 + PyTorch 1.4.

But this creates a new problem:

Traceback (most recent call last):
  File "D:\python_project\multidim_conv-master\multidim_conv-master\train.py", line 1, in <module>
    import torch
  File "C:\Users\xinyi\.conda\envs\pyt38\lib\site-packages\torch\__init__.py", line 44, in <module>
    import numpy as _np
  File "C:\Users\xinyi\.conda\envs\pyt38\lib\site-packages\numpy\__init__.py", line 125, in <module>
    from numpy.__config__ import show as show_config
  File "C:\Users\xinyi\.conda\envs\pyt38\lib\site-packages\numpy\__config__.py", line 12, in <module>
    os.add_dll_directory(extra_dll_dir)
AttributeError: module 'os' has no attribute 'add_dll_directory' 

So do you know what the problem is? Should I be looking for problems with the new version, or go back to the old one? Thanks again.


HansBambel commented on September 24, 2024

That is indeed weird. I do not see any reason why it should restart. Did you maybe also remove the print statements in train.py? The script trains 8 models in sequence: 4 timesteps, each for both the NL and DK datasets (

multidim_conv/train.py

Lines 401 to 407 in 1caae1f

for t in [1, 2, 3, 4]:
    print("NL dataset. Step: ", t)
    data = "Wind_data_NL/dataset.pkl"
    train_wind_nl(folder+data, epochs=150, input_timesteps=6, prediction_timestep=t, dev=dev, earlystopping=20)
    print("DK dataset. Step: ", t)
    data = f"Wind_data/lag=4/step{t}.mat"
    train_wind_dk(folder+data, epochs=150, dev=dev, earlystopping=20)
).

I've also tried a combination of python3.8 + pytorch1.4

I think I used Python 3.7.

The error AttributeError: module 'os' has no attribute 'add_dll_directory' led me here, where it is said that add_dll_directory was only introduced in Python 3.8.
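
If you want to double-check that on your side, a quick sketch (nothing repo-specific) would be:

import os
import sys

# os.add_dll_directory only exists on Windows from Python 3.8 onwards
print(sys.version_info)
print(hasattr(os, "add_dll_directory"))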

Maybe you can create a new env with 3.7 and try that again.


poemon commented on September 24, 2024

I have lowered the version as you instructed and created a new virtual environment with Python 3.7. The versions of the other packages are as follows:

Package                 Version
----------------------- ----------
absl-py                 2.0.0
cachetools              5.3.1
certifi                 2022.12.7
charset-normalizer      3.3.0
colorama                0.4.6
einops                  0.2.0
google-auth             2.23.3
google-auth-oauthlib    0.4.6
grpcio                  1.59.0
idna                    3.4
importlib-metadata      6.7.0
Markdown                3.4.4
MarkupSafe              2.1.3
numpy                   1.21.6
oauthlib                3.2.2
olefile                 0.46
Pillow                  9.5.0
pip                     22.3.1
protobuf                3.20.3
pyasn1                  0.5.0
pyasn1-modules          0.3.0
requests                2.31.0
requests-oauthlib       1.3.1
rsa                     4.9
scipy                   1.7.3
setuptools              65.6.3
six                     1.16.0
tensorboard             2.11.2
tensorboard-data-server 0.6.1
tensorboard-plugin-wit  1.8.1
torch                   1.4.0+cu92
torchsummary            1.5.1
torchvision             0.5.0+cu92
tqdm                    4.66.1
typing_extensions       4.7.1
urllib3                 2.0.6
Werkzeug                2.2.3
wheel                   0.38.4
wincertstore            0.2
zipp                    3.15.0

It reported an error at 52%.

NL dataset. Step:  1
Device: cuda
----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Conv2d-1             [-1, 32, 5, 5]             928
            Linear-2                  [-1, 128]         102,528
            Linear-3                   [-1, 64]           8,256
            Linear-4                    [-1, 7]             455
       DoubleDense-5                    [-1, 7]               0
================================================================
Total params: 112,167
Trainable params: 112,167
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 0.01
Params size (MB): 0.43
Estimated Total Size (MB): 0.44
----------------------------------------------------------------
Epochs:  52%|█████▏    | 78/150 [15:18<14:07, 11.78s/it]
Stopping early --> val_loss has not decreased over 20 epochs
Traceback (most recent call last):
  File "D:\python_project\multidim_conv-master\multidim_conv-master\train.py", line 404, in <module>
    train_wind_nl(folder+data, epochs=150, input_timesteps=6, prediction_timestep=t, dev=dev, earlystopping=20)
  File "D:\python_project\multidim_conv-master\multidim_conv-master\train.py", line 307, in train_wind_nl
    summary(model, (7, input_timesteps, 6), device="cpu")
  File "C:\Users\xinyi\.conda\envs\pytorch\lib\site-packages\torchsummary\torchsummary.py", line 72, in summary
    model(*x)
  File "C:\Users\xinyi\.conda\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\xinyi\.conda\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\python_project\multidim_conv-master\multidim_conv-master\models\wind_models.py", line 93, in forward
    x = F.relu(self.conv1(x))
  File "C:\Users\xinyi\.conda\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\xinyi\.conda\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 1568, in _call_impl
    result = forward_call(*args, **kwargs)
  File "D:\python_project\multidim_conv-master\multidim_conv-master\models\attention_augmented_conv.py", line 52, in forward
    flat_q, flat_k, flat_v, q, k, v = self.compute_flat_qkv(x, self.dk, self.dv, self.Nh)
  File "D:\python_project\multidim_conv-master\multidim_conv-master\models\attention_augmented_conv.py", line 79, in compute_flat_qkv
    q *= dkh ** -0.5
RuntimeError: Output 0 of ReshapeAliasBackward0 is a view and is being modified inplace. This view is the output of a function that returns multiple views. Such functions do not allow the output views to be modified inplace. You should replace the inplace operation by an out-of-place one.

What can I do to get your code to work correctly? Thank you very much!


HansBambel commented on September 24, 2024

I don't know why it is not working for you, but the crash is not coming from the training itself. Because of Stopping early --> val_loss has not decreased over 20 epochs we can see that the crash happened after that.
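
Judging from your traceback, the crash is inside torchsummary's summary() call: summary() runs a dummy forward pass, so the q *= dkh ** -0.5 line in the attention layer is hit there as well. Roughly (a sketch with a toy model, not the repo code):

import torch.nn as nn
from torchsummary import summary

# summary() internally calls model(dummy_input), so any error inside forward() surfaces here
model = nn.Sequential(nn.Conv2d(7, 32, kernel_size=3), nn.ReLU())
summary(model, (7, 6, 6), device="cpu")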

Does the model get saved in models/trained_models/wind_model_NL_{prediction_timestep}h_{model.__class__.__name__}.pt?


poemon commented on September 24, 2024

Yes, there are two files in the trained_models folder:

wind_model_NL_1h_CNN2DWind_NL.pt  445kb
wind_model_NL_1h_CNN2DAttWind_NL.pt  453kb


HansBambel commented on September 24, 2024

Can you try to start the training for 2 steps ahead? So remove the 1 from here:

multidim_conv/train.py

Lines 401 to 407 in 1caae1f

for t in [1, 2, 3, 4]:
    print("NL dataset. Step: ", t)
    data = "Wind_data_NL/dataset.pkl"
    train_wind_nl(folder+data, epochs=150, input_timesteps=6, prediction_timestep=t, dev=dev, earlystopping=20)
    print("DK dataset. Step: ", t)
    data = f"Wind_data/lag=4/step{t}.mat"
    train_wind_dk(folder+data, epochs=150, dev=dev, earlystopping=20)

It seems like that is where the problem occurs.


poemon commented on September 24, 2024

I'm sorry for replying so late; I've been a bit busy these past couple of days.
I followed what you said and removed the 1, but the result is still the same. I'm about to give up.

for t in [2, 3, 4]: 

Have you tried running your code?


HansBambel commented on September 24, 2024

I was just able to reproduce the problem. It was indeed this line: q *= dkh ** -0.5.

I fixed it by renaming q to q_new:

def compute_flat_qkv(self, x, dk, dv, Nh):
    qkv = self.qkv_conv(x)
    N, _, H, W = qkv.size()
    q, k, v = torch.split(qkv, [dk, dk, dv], dim=1)
    q = self.split_heads_2d(q, Nh)
    k = self.split_heads_2d(k, Nh)
    v = self.split_heads_2d(v, Nh)

    dkh = dk // Nh
    # out-of-place scaling; q *= dkh ** -0.5 modified a view in place and triggered the RuntimeError
    q_new = q * dkh ** -0.5
    flat_q = torch.reshape(q, (N, Nh, dk // Nh, H * W))
    flat_k = torch.reshape(k, (N, Nh, dk // Nh, H * W))
    flat_v = torch.reshape(v, (N, Nh, dv // Nh, H * W))
    return flat_q, flat_k, flat_v, q_new, k, v

When you get the latest from master, it should work. Note that when executing train.py, 5 models get trained at each time step for each dataset.

Furthermore, I activated some more prints to show progress.


HansBambel commented on September 24, 2024

@poemon Did this fix your issue?

