Comments (6)
Great update. I don't have access to a GPU right now, so I haven't run the CUDA unit tests in a while.
One initial theory: I think CUDA batchnorm must be different from the Keras CPU batchnorm, since the latter accepts a mask and, as far as I know, CUDA batchnorm unfortunately doesn't. I'm not sure whether Keras calls the CUDA batchnorm primitives, though. Are you using batchnorm? If not, the same reasoning applies to masking in general: I would be surprised if CuDNNGRU accepts a mask the way the Keras CPU version does.
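A minimal sketch of what I mean (layer sizes and feature count are arbitrary, and this assumes the standalone Keras 2.x API from when CuDNNGRU existed):

```python
from keras.models import Sequential
from keras.layers import Masking, GRU, CuDNNGRU

n_features = 3  # arbitrary

# CPU path: the mask produced by Masking propagates through GRU, so padded
# timesteps are skipped by the recurrence and ignored by downstream losses.
cpu_model = Sequential([
    Masking(mask_value=0.0, input_shape=(None, n_features)),
    GRU(32, return_sequences=True),
])

# GPU path: CuDNNGRU does not support masking, so the equivalent stack fails
# to build, and without Masking padded timesteps get treated as real data.
# gpu_model = Sequential([
#     Masking(mask_value=0.0, input_shape=(None, n_features)),
#     CuDNNGRU(32, return_sequences=True),  # TypeError: does not support masking
# ])
```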
Another theory is that the machine epsilon differs between CPU and GPU. I recommend setting `keras.backend.set_epsilon(1e-07)`, but I'm not sure whether the GPU respects this.
As a general recommendation, clip the log-likelihood using:

```python
loss_fun = wtte.loss(kind='discrete', reduce_loss=False, clip_prob=1e-5).loss_function
```
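Wired into a model, it looks roughly like the sketch below; the network is a stand-in, and `import wtte.wtte as wtte` is the import used in the examples:

```python
import wtte.wtte as wtte
from keras.models import Sequential
from keras.layers import LSTM, Dense, Lambda

# Stand-in network (shapes arbitrary) ending in the Dense(2) -> output_lambda
# head that the loss expects.
model = Sequential([
    LSTM(8, return_sequences=True, input_shape=(None, 3)),
    Dense(2),
    Lambda(wtte.output_lambda, arguments={"init_alpha": 1.0, "max_beta_value": 2.0}),
])

# clip_prob bounds the per-step likelihood, which masks exploding
# log-likelihoods rather than curing their cause.
loss_fun = wtte.loss(kind='discrete', reduce_loss=False, clip_prob=1e-5).loss_function

# reduce_loss=False yields per-timestep losses, so masked steps can be zeroed
# out via sample weights; hence sample_weight_mode='temporal'.
model.compile(loss=loss_fun, optimizer='adam', sample_weight_mode='temporal')
```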
@ragulpr Thanks for your reply. I am not using batchnorm, and you are correct that the CuDNN layers don't accept masking. I will try the epsilon and log-likelihood clipping and let you know how it goes.
If you find anything inside WTTE not working properly on GPU it would be very good to know; thanks a lot for raising the issue. For general NaN avoidance there are many other git issues with recommendations. Some top-of-the-list remedies for further reference:
- Double-check that the mask is working properly
- Pretrain the output layer as in https://github.com/ragulpr/wtte-rnn-examples (a rough sketch follows this list)
- Clip the log-likelihood via the `clip_prob=...` flag
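A rough sketch of the pretraining idea, continuing from the compile sketch above (`x_train`, `y_train`, `sample_weights` are placeholder names, and this is not the examples' exact code):

```python
# Freeze everything except the final Dense(2) feeding the output Lambda,
# fit briefly so alpha/beta start near sensible values, then unfreeze.
for layer in model.layers:
    layer.trainable = False
model.layers[-2].trainable = True  # the Dense(2) just before the output Lambda

model.compile(loss=loss_fun, optimizer='adam', sample_weight_mode='temporal')
model.fit(x_train, y_train, sample_weight=sample_weights, epochs=2)

# Unfreeze and recompile (trainable changes only take effect after compile).
for layer in model.layers:
    layer.trainable = True
model.compile(loss=loss_fun, optimizer='adam', sample_weight_mode='temporal')
model.fit(x_train, y_train, sample_weight=sample_weights, epochs=100)
```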
Hi,
@dongishan called my attention to this post recently, and it came to mind today while working with the GPU. I have also observed numerical instabilities in the loss function when using it (we use the same cluster). I have observed this for WTTE-RNN, but also for an extension of it that I wrote with a Gaussian-based loss function.
I had not commented until now because my main hypothesis was that those instabilities were due to my data being contaminated or badly pre-processed (I use real industrial data). But today I started comparing the GPU and the CPU, and initial results show that the loss is much more stable on the CPU.
My architecture is quite simple: a large batch size, two stacked 50-neuron LSTMs with regularisation, and a TimeDistributed 100-neuron dense layer. I use tanh everywhere as the activation function.
With regard to numerical instability in the wtte-rnn case, I have usually been able to avoid it by normalizing the times to event and by using the continuous log-likelihood. For some reason, on my data sets the discrete mode was more prone to numerical instability. I prefer that to clipping.
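Concretely, something like this (a sketch; `y_train` follows wtte-rnn's (tte, uncensored-indicator) target layout and the variable names are mine):

```python
import numpy as np
import wtte.wtte as wtte

# y_train[..., 0] = time to event, y_train[..., 1] = uncensored indicator.
# Normalizing tte keeps the alpha the network must produce near 1.
tte_scale = np.nanmean(y_train[..., 0])
y_train[..., 0] = y_train[..., 0] / tte_scale

# Continuous log-likelihood; multiply predicted alpha back by tte_scale
# to recover times in the original units.
loss_fun = wtte.loss(kind='continuous', reduce_loss=False).loss_function
```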
Edit: I have now run 4 experiments (10000 epochs each) and observed some loss instabilities in the CPU case too, but to a much lesser extent than on the GPU.
Another idea I forgot: I've had problems getting the GPU to respect the random seed I set for it, but that might be a PyTorch problem. If you repeat the experiment using different seeds on the CPU, maybe you get the same NaN failures?
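For reference, this is roughly what I mean by setting the seed, using the TF-1.x-era API names:

```python
import random
import numpy as np
import tensorflow as tf

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
tf.set_random_seed(SEED)  # TF 1.x name; TF 2.x uses tf.random.set_seed

# Even with all three seeds fixed, some CuDNN kernels are non-deterministic,
# so repeated GPU runs may still diverge.
```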
There may be many reasons for numerical instability, as pointed out, so it would be very helpful to find an example that reproduces on both GPU and CPU. Could it have anything to do with the contents of your keras.json file? Maybe the GPU is float32 and the CPU is float64, or similar?
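For reference, those defaults come from `~/.keras/keras.json`; a quick check/override looks like:

```python
import keras.backend as K

print(K.floatx())   # 'float32' by default
print(K.epsilon())  # 1e-07 by default

# Force float64 everywhere to rule precision out as the CPU/GPU difference
# (at a real speed/memory cost on GPU).
K.set_floatx('float64')
K.set_epsilon(1e-07)
```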
PS. Since your final layer has 100 units, each output of the following dense(2) is a sum of 100 roughly unit-variance terms, so it is approximately Normal(0, 100) and the variance is high. I usually scale this as below:
```python
model.add(Lambda(wtte.output_lambda, arguments={
    "init_alpha": init_alpha,
    "max_beta_value": 2.0,
    # Stability heuristic: scale by the log of the number of pre-output-layer inputs
    "scalefactor": 1 / np.log(100),
}))
```