
Comments (5)

ragulpr commented on June 1, 2024

General answer

NaN is always a problem, and it's hard to debug. Some starting points:

  1. Assume the problem is leaking truth in your data!
  • If the truth about TTE is leaking, beta may tend to infinity, causing instability (unless capped with max_beta).
  • If the truth about censoring is leaking, alpha tends to infinity and/or beta to 0, eventually causing NaN.
  • Broken masks, negative TTE, or zeros in TTE if using the continuous loss function (see the sanity-check sketch after this list).
  2. Initialization is important. Gradients explode if you're too far off, causing NaN.
  3. More censored data leads to larger gradient steps, leading to a higher probability of exploding gradients (causing NaN).
  4. The learning rate depends on the data and can be in magnitudes you didn't expect. A high learning rate (w.r.t. the data) may cause NaN.
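
A hedged sketch of the sanity checks from point 1, assuming tte (time-to-event) and u (censoring indicator, 1 = event observed) are padded numpy arrays with NaN outside the observed window; the names are mine, not wtte-rnn's:

        import numpy as np

        def sanity_check(tte, u, discrete=True):
            # only look at timesteps inside the observed window
            valid = ~np.isnan(tte)
            assert np.all(tte[valid] >= 0), "negative TTE found"
            if not discrete:
                # the continuous log-likelihood contains log(tte), so tte == 0 breaks it
                assert np.all(tte[valid] > 0), "zero TTE with the continuous loss"
            # the censoring indicator / mask should only ever be 0 or 1
            assert set(np.unique(u[valid])) <= {0.0, 1.0}, "broken censoring mask"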

Some comments about what I've done about this:

If everything above is checked, your machine epsilon is a likely culprit. The warning that I put in there should fire in that case. Essentially, try calling keras.backend.set_epsilon(1e-08) to lower epsilon.
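
For example (this is the whole fix, done once before the model is built):

        from keras import backend as K

        K.set_epsilon(1e-08)   # default is 1e-07, which can be too coarse here
        print(K.epsilon())     # 1e-08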

Analysis of your problem

I assume by "number of records" you mean the number of observed datapoints (rows in the pandas dataframe), not the number of timesteps they were under observation. E.g., a 1-record datapoint may lead to hundreds of empty timesteps in the numpy array.

Since 1-record customers cause instability, I really think the problem is the data. If they log in once and nothing happens, the algorithm will likely know for sure after a few timesteps that they aren't coming back / are dead, coinciding with the fact that they are censored, making it safe to push the distribution towards infinity. Note that the biases did not go NaN (see the non-NaN weights) and that only the final layer is NaN. This hints that the upper layer hands over a representation that the output layer knows exactly what to do with (and the gradients explode with joy).

Edit:
A hacky solution against obviously dead sequences is to remove the right part of the sequence when it's clearly inferable that they are dead, e.g. after x censored timesteps following signup.
I've done successful testing with a non-hacky approach to this (using the fact that deadness coincides with being able to predict that the timestep is censored!) by predicting the probability of censoring and using it to weight away censored datapoints, but I haven't had time to do a writeup on it yet!
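
A minimal sketch of the hacky truncation (the helper name and the 30-step margin are my own assumptions, not part of wtte-rnn):

        import numpy as np

        def truncate_dead_tail(u, max_trailing_censored=30):
            # u is the per-timestep censoring indicator for one sequence (1 = event observed);
            # keep at most `max_trailing_censored` censored steps after the last observed event
            observed = np.where(u == 1)[0]
            last_event = observed[-1] if len(observed) else -1
            return min(len(u), last_event + 1 + max_trailing_censored)

        # usage: keep = truncate_dead_tail(u_seq); x_seq, tte_seq, u_seq = x_seq[:keep], tte_seq[:keep], u_seq[:keep]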


aprotopopov commented on June 1, 2024

Thanks for your responses and advice.
By number of records I mean the number of days on which users have sessions. The NaNs here are probably due to heavy censoring.

But I didn't understand where, mathematically, the NaN occurred. Possible reasons for NaNs that I see for now:

  • exploding gradients (I'm clipping them in the optimizer; see the sketch after this list)
  • a blow-up in the activation/output_lambda (I didn't find any division by zero or logarithms of zero or negative values)
  • a blow-up in the loss (I modified some code to exclude any possible blow-ups)
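
For reference, gradient clipping in a Keras optimizer looks roughly like this (model and loss_fn stand in for your own objects):

        from keras.optimizers import Adam

        # clip gradients by global norm and by value to limit single-step blow-ups
        optimizer = Adam(lr=1e-3, clipnorm=1.0, clipvalue=0.5)
        model.compile(loss=loss_fn, optimizer=optimizer)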

What other reasons could there be for NaNs?


I've done successful testing with a non-hacky approach to this (using the fact that deadness coincides with being able to predict that the timestep is censored!) by predicting the probability of censoring and using it to weight away censored datapoints, but I haven't had time to do a writeup on it yet!

It's a very interesting approach. It would be very helpful to see how you are doing that.


P.S. I think the condition for the low-epsilon warning is a bit wrong. Should it be K.epsilon() >= 1e-07 instead of K.epsilon() <= 1e-07?
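
In other words, a hedged sketch of how the corrected check could read (the actual warning text in wtte-rnn may differ):

        import warnings
        from keras import backend as K

        # warn when epsilon is too coarse, i.e. when it is *at least* 1e-07
        if K.epsilon() >= 1e-07:
            warnings.warn("Keras epsilon is high; consider keras.backend.set_epsilon(1e-08)")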

P.P.S. Hacks to change the discrete loss function:

        import numpy as np
        from keras import backend as K

        def loglik_discrete(y, u, a, b, epsilon=1e-35, lowest_val=1e-45):
            # keep alpha away from exactly 0 while preserving its sign
            a = K.sign(a + lowest_val) * K.maximum(K.abs(a), lowest_val)
            hazard0 = K.pow((y + epsilon) / a, b)
            hazard1 = K.pow((y + 1.0) / a, b)

            # clip the discrete-hazard difference away from 0 before taking the log
            log_val = K.clip(K.exp(hazard1 - hazard0) - 1.0, lowest_val, np.inf)
            loglikelihoods = u * K.log(log_val) - hazard1
            return loglikelihoods
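
A quick smoke test of the patched loss on a few hand-picked values (purely illustrative, and it assumes the block above has been run):

        # last datapoint is censored (u = 0); everything should come out finite
        y = K.constant([0.0, 3.0, 10.0])
        u = K.constant([1.0, 1.0, 0.0])
        a = K.constant([2.0, 2.0, 2.0])
        b = K.constant([1.5, 1.5, 1.5])
        print(K.eval(loglik_discrete(y, u, a, b)))   # no NaN / Inf expected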


ragulpr commented on June 1, 2024

Some numerical problems I've been thinking about for the discrete case:

  • alpha = 0 leading to divide-by-zero.
  • y == 0 leading to log(0), since K.pow may be implemented as z^b = exp[log(z) * b]. The y + epsilon is supposed to take care of this and does a good job of it.
  • alpha == Inf causing (y + epsilon) / a == 0, leading to log(0). I haven't looked into this.
  • b << 1 and b >> 1: see how 0 == K.exp(hazard1 - hazard0) - 1.0 could happen whenever beta << 1 and/or when alpha >> y and beta >> 1 (a numpy illustration follows after this list).
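
A hedged numpy illustration of the last point, using float32 to mimic the default Keras dtype:

        import numpy as np

        def hazard_diff(y, a, b, epsilon=1e-10, dtype=np.float32):
            y, a, b = dtype(y), dtype(a), dtype(b)
            hazard0 = ((y + dtype(epsilon)) / a) ** b
            hazard1 = ((y + dtype(1.0)) / a) ** b
            return np.exp(hazard1 - hazard0) - dtype(1.0)

        # beta << 1: both hazards are ~1, their difference is below float32 precision,
        # exp(diff) rounds back to 1.0 and the whole expression becomes exactly 0
        print(hazard_diff(y=5.0, a=2.0, b=1e-7))    # 0.0 -> log(0) = -inf in the loss

        # alpha >> y with beta >> 1: both hazards underflow to 0, same outcome
        print(hazard_diff(y=5.0, a=1e6, b=10.0))    # 0.0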

This does not cover what can happen in the gradients, which is another layer of complexity.

I think huge or tiny betas and alphas are up to the calling functions to take care of, i.e. having the option of applying this hack to output_lambda or penalties.
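
As a hedged sketch of what such an option could look like in the calling code (a hypothetical activation, not the package's actual output_lambda; only beta gets a hard cap here, since capping alpha hasn't helped in my experiments, see the next paragraph):

        from keras import backend as K

        def clipped_output_activation(x, max_beta=50.0, min_alpha=None):
            # x[..., 0] drives alpha, x[..., 1] drives beta (TensorFlow backend assumed)
            a = K.exp(x[..., 0])
            b = K.softplus(x[..., 1])
            if min_alpha is not None:
                a = K.maximum(a, min_alpha)       # optionally floor alpha away from 0
            b = K.clip(b, K.epsilon(), max_beta)  # cap beta, cf. max_beta above
            return K.stack([a, b], axis=-1)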

I've never looked into whether alpha = 0 is a problem, so I'd be very curious to hear whether that fix helps. I have done a good deal of experiments clipping alpha from becoming huge, and that has not been helpful. Let me know if you want more info on this.

TODO:

Numerical instability is the problem with wtte that I've spent a huge amount of time on, so I'm going to be extremely careful about changing the current working implementation without convincing tests. Due to the complexity I've been unit testing the whole thing rather than writing small edge-case tests, but such tests would be extremely helpful.

  • Need tests to decide/check the tolerance levels that lead to exploding gradients, i.e. testing loss and gradient stability with varying a, b and y (a rough sketch follows below).
  • Fix the K.epsilon() >= 1e-07 condition so that it throws the right warning :)
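
As a rough starting point for the first item, a sketch of a grid test over the loss alone (gradient stability would need a separate check; loglik_discrete refers to the function posted above):

        import numpy as np
        from keras import backend as K

        def find_bad_combinations(loglik_fn, a_vals, b_vals, y_vals, u=1.0):
            # evaluate the log-likelihood on a grid of (alpha, beta, y) and
            # collect every combination that comes out NaN or Inf
            bad = []
            for a in a_vals:
                for b in b_vals:
                    for y in y_vals:
                        val = K.eval(loglik_fn(K.constant(y), K.constant(u),
                                               K.constant(a), K.constant(b)))
                        if not np.isfinite(val):
                            bad.append((a, b, y, float(val)))
            return bad

        print(find_bad_combinations(loglik_discrete,
                                    a_vals=[1e-3, 1.0, 1e3],
                                    b_vals=[1e-3, 1.0, 1e2],
                                    y_vals=[0.0, 1.0, 100.0]))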


aprotopopov commented on June 1, 2024

Some comments about the data

I'm using per-day user records over 60 days, with a sample of 20000 users and about 100 features. Train is 16000 users, validation 4000 users. For training I'm hiding the last 0.1 fraction (6 days).

Without filtering

Without filtering users by number of records (there are a lot of users with 1 record) I get NaNs pretty fast. init_alpha ~ 107.5. Lowest val_loss ~ 0.3095 for reduce_loss=False.

Users' number of records
[figure omitted]

Weight watcher callback
[figures omitted]

With filtering

With filtering users (>= 10 records), training becomes more stable even with more LSTM neurons, init_alpha ~ 2.77. Plots are for 1 LSTM neuron. Lowest val_loss ~ 0.7786.

Users' number of records
[figure omitted]

Weight watcher callback
[figures omitted]


ragulpr commented on June 1, 2024

Also pertinent to your problem: There's a very subtle philosophical problem at the first step of sequences that may expose the truth:

If a sequence is born due to an event, the first timestep will always have TTE = 0. The data pipeline template is supposed to take care of this by shifting & removing the first timestep; it's easy to miss if you're using your own or a modified pipeline.
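
A minimal sketch of that fix for a single event-born sequence (alignment choices vary; this is just the idea, not the template's exact code):

        import numpy as np

        def shift_and_drop_first(x, tte, u):
            # the birth step has TTE == 0 by construction, so drop it from the targets
            # and let the features lag the targets by one step
            return x[:-1], tte[1:], u[1:]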

