Comments (5)
General answer
NaN is always a problem, and it's hard to debug. Some starting points:
- Assume the problem is leaky truth in your data!
- If truth about TTE is leaking, beta may tend to infinity, causing instability (unless capped with max_beta).
- If truth about censoring is leaking, alpha tends to infinity and/or beta to 0, eventually causing NaN.
- Broken masks, negative TTE, or zeros in TTE if using the continuous loss function.
- Initialization is important. Gradients explode if you're too far off, causing NaN.
- More censored data leads to larger gradient steps, leading to a higher probability of exploding gradients (causing NaN).
- The learning rate depends on the data and can be of a magnitude you didn't expect. High learning rates (w.r.t. the data) may cause NaN.
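A minimal sketch of the data checks from the list above, written in plain NumPy. The function name, its arguments and the assumption that padded/masked timesteps have already been stripped out are all hypothetical; it is not part of wtte-rnn.

```python
import numpy as np

def check_tte_arrays(tte, uncensored, continuous=True):
    """Return a list of human-readable problems found in the targets.

    tte        : time-to-event values (any shape, padding already removed)
    uncensored : 0/1 indicators with the same shape as tte
    continuous : if True, exact zeros in tte are flagged as well
    """
    tte = np.asarray(tte, dtype=float)
    uncensored = np.asarray(uncensored, dtype=float)
    problems = []
    if tte.shape != uncensored.shape:
        # Nothing else is comparable if the shapes disagree.
        return ["shape mismatch between tte and uncensored"]
    if not np.all(np.isfinite(tte)):
        problems.append("non-finite values in tte")
    if np.any(tte < 0):
        problems.append("negative tte")
    if continuous and np.any(tte == 0):
        problems.append("zeros in tte (continuous loss)")
    if not np.all(np.isin(uncensored, (0.0, 1.0))):
        problems.append("censoring indicator is not 0/1")
    return problems
```

An empty list means none of these particular problems were found; it says nothing about leaked truth in the features, which still has to be checked by hand.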
Some comments about what I've done about this:
- Initialization is stable (it's tested to initialize around init_alpha, beta=1)
- The loss function is convergent (tested to run close enough to expected values)
- The loss function is unlikely to deteriorate due to log(0) or division by zero thanks to the added epsilon, but this may happen anyway (there is no formal test, but it could be checked by rerunning the above test with an initialization far away from the expected value)
If everything above checks out, your machine epsilon is a likely culprit. The warning that I put in there should be flagged in that case. Essentially, try calling keras.backend.set_epsilon(1e-08) to lower epsilon.
Analysis of your problem
I assume by "number of records" you mean the number of observed datapoints (rows in the pandas dataframe), not the number of timesteps they were under observation. E.g. a 1-record datapoint may lead to hundreds of empty timesteps in the numpy array.
As 1-record customers cause instability, I really think the problem is the data. If they log in once and nothing happens, the algorithm will likely know for sure after a few timesteps that they aren't coming back (are dead). This coincides with them being censored, making it safe to push the distribution towards infinity. Note that the biases did not go NaN (see the non-NaN weights) and that only the final layer is NaN. This hints that the upper layer hands over a representation that the output layer knows exactly what to do with (and the gradients explode with joy).
Edit:
A hacky solution against obviously dead sequences is to remove the right part of the sequence when it's clearly inferable that they are dead, e.g. after x censored timesteps following signup.
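One way the truncation hack could look, as a sketch only. It assumes one sequence at a time with a 0/1 indicator where 1 marks an uncensored timestep; the function name and the max_censored parameter are made up for illustration.

```python
import numpy as np

def trim_censored_tail(u, max_censored):
    """Return how many leading timesteps of one sequence to keep.

    u            : 1-D array of 0/1, where 1 marks an uncensored timestep
    max_censored : keep at most this many steps of the trailing censored run
    """
    u = np.asarray(u)
    observed = np.flatnonzero(u == 1)
    # Index just past the last uncensored step (0 if every step is censored).
    tail_start = int(observed[-1]) + 1 if observed.size else 0
    return min(len(u), tail_start + max_censored)
```

The returned length would then be used to slice features and targets alike, e.g. x[:keep], y[:keep], u[:keep]. Choosing max_censored is a judgment call that depends on how quickly "dead" becomes obvious in the data.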
I've had success testing a non-hacky approach to this (using the fact that deadness coincides with being able to predict that the timestep is censored!): predict the probability of censoring and use it to weight away censored datapoints. I haven't had time to do a writeup on it yet!
from wtte-rnn.
Thanks for your responses and advice.
By number of records I mean the number of days on which users have sessions. And the NaNs here are probably due to heavy censoring.
But I don't understand where, mathematically, the NaN occurred. Possible reasons for NaNs that I see for now:
- exploding gradients (I'm clipping them in the optimizer)
- an exploit in activation/output_lambda (I didn't find any division by zero or logarithms of zero or negative values)
- an exploit in the loss (I modified some code to exclude any possible exploits)
What other reasons could there be for NaN?
"I've done successful testing with a non-hacky approach to this (using the fact that deadness coincides with that you can predict that the timestep is censored!) by predicting prob. of censoring and use it for weighting away censored datapoints but haven't had time to do a writeup on it yet!"
That's a very interesting approach. It would be very helpful to see how you are doing that.
P.S. I think the condition for lowering epsilon is a bit wrong. Should it be K.epsilon() >= 1e-07 instead of K.epsilon() <= 1e-07?
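P.P.S. Hacks to change the discrete loss function:
    import numpy as np
    from keras import backend as K

    def loglik_discrete(y, u, a, b, epsilon=1e-35, lowest_val=1e-45):
        # Push alpha away from zero (keeping its sign) to avoid division by zero.
        a = K.sign(a + lowest_val) * K.maximum(K.abs(a), lowest_val)
        hazard0 = K.pow((y + epsilon) / a, b)
        hazard1 = K.pow((y + 1.0) / a, b)
        # Clip from below so the log never sees zero or a negative value.
        log_val = K.clip(K.exp(hazard1 - hazard0) - 1.0, lowest_val, np.inf)
        loglikelihoods = u * K.log(log_val) - hazard1
        return loglikelihoods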
from wtte-rnn.
Some numerical problems I've been thinking about for the discrete case:
- alpha = 0, leading to divide by zero.
- y == 0, leading to log(0), since K.pow may be implemented as z^b = exp[log(z)b]. The y + epsilon is supposed to take care of this and does a good job at it.
- alpha == Inf, causing (y + epsilon) / a == 0, leading to log(0). Haven't been looking into this.
- b<<1 and b>>1. See how 0 == K.exp(hazard1 - hazard0) - 1.0 could happen whenever beta<<1 and/or when alpha>>y and beta>>1.
This does not cover what can happen in gradients which is another layer of complexity.
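To make the last case concrete, here is a small NumPy sketch that evaluates the term inside the log, exp(hazard1 - hazard0) - 1, for single scalar inputs. It is a transcription of the expression discussed in this thread, not the library code; the function name and the dtype and use_expm1 parameters are hypothetical. np.expm1 appears only as a comparison to show where the cancellation comes from and is not something proposed above.

```python
import numpy as np

def discrete_log_term(y, a, b, dtype=np.float32, epsilon=1e-35, use_expm1=False):
    """Evaluate exp(hazard1 - hazard0) - 1 for scalar y, alpha (a), beta (b)."""
    y, a, b = (dtype(v) for v in (y, a, b))
    hazard0 = np.power((y + dtype(epsilon)) / a, b)
    hazard1 = np.power((y + dtype(1.0)) / a, b)
    diff = hazard1 - hazard0
    # exp(diff) rounds to exactly 1 when diff is far below machine precision,
    # so the subtraction returns 0; expm1 avoids that particular rounding.
    return np.expm1(diff) if use_expm1 else np.exp(diff) - dtype(1.0)

# alpha >> y and beta >> 1: the hazard difference is tiny, exp() rounds to 1.
print(discrete_log_term(1, 100, 10))
print(discrete_log_term(1, 100, 10, use_expm1=True))

# beta << 1: both hazards round to 1 in float32, so the difference is exactly 0.
print(discrete_log_term(1, 1, 1e-9))
print(discrete_log_term(1, 1, 1e-9, dtype=np.float64))
```

In the first case the information is lost in exp(), so a more careful formulation recovers it. In the second case it is already lost in the power computation at float32 precision, so only a higher-precision dtype or a different parameterization would help.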
I think huge or tiny betas and alphas are up to the calling functions to take care of, i.e. by having the option of applying this hack to output_lambda or penalties.
I've never looked into whether alpha=0 is a problem, and would be very curious to hear if handling it is helpful. I have done a great deal of experiments clipping alpha to keep it from being huge, and this has not been helpful. Let me know if you want more info on this.
TODO:
Numerical instability is the problem with wtte that I've spent a huge amount of time on, so I'm going to be extremely careful about changing the current working implementation without convincing tests. Due to the complexity I've been unit testing the whole thing instead of writing small edge-case tests, but those would be extremely helpful.
- Need tests to decide/test the tolerance levels leading to exploding gradients, i.e. testing loss and gradient stability with varying a, b and y
- Fix K.epsilon() >= 1e-07 so that it throws the right error :)
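A rough sketch of what the first TODO item could start from: scan a grid of y, a and b and report where the loss stops being finite. It uses a NumPy float32 transcription of the discrete log-likelihood form quoted in this thread (without the clipping hack), not the library implementation, and it covers the loss only, not gradients. All names here are hypothetical.

```python
import itertools
import numpy as np

def loglik_discrete_np(y, u, a, b, epsilon=1e-7):
    """float32 version of u * log(exp(h1 - h0) - 1) - h1 for scalar inputs."""
    y, u, a, b, eps = (np.float32(v) for v in (y, u, a, b, epsilon))
    hazard0 = np.power((y + eps) / a, b)
    hazard1 = np.power((y + np.float32(1.0)) / a, b)
    return u * np.log(np.exp(hazard1 - hazard0) - np.float32(1.0)) - hazard1

def unstable_points(ys, alphas, betas, u=1.0):
    """Return the (y, a, b) grid points where the loss is inf or NaN."""
    bad = []
    with np.errstate(all="ignore"):  # silence divide/overflow warnings during the scan
        for y, a, b in itertools.product(ys, alphas, betas):
            if not np.isfinite(loglik_discrete_np(y, u, a, b)):
                bad.append((y, a, b))
    return bad

# Example scan over a small, arbitrary grid.
print(unstable_points([1.0], [1.0, 100.0], [1.0, 10.0]))
```

A real test would also need to look at gradients, for instance through the backend's automatic differentiation, and use grids that match the ranges the output layer can actually produce.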
from wtte-rnn.
Some comments about data
I'm using per-day user records for 60 days, with a sample of 20000 users and about 100 features. Train is 16000 users, validation 4000 users. For train I'm hiding the last 0.1 fraction (6 days).
Without filtering
Without filtering users by number of records (there are a lot of users with 1 record) I get NaNs pretty fast. init_alpha ~ 107.5. Lowest val_loss ~ 0.3095 for reduce_loss=False.
With filtering
With filtering users (>= 10 records) training becomes more stable, even with more LSTM neurons. init_alpha ~ 2.77. Plots with 1 LSTM neuron. Lowest val_loss ~ 0.7786
from wtte-rnn.
Also pertinent to your problem: there's a very subtle philosophical problem at the first step of sequences that may expose the truth:
If a sequence is born due to an event, the first timestep will always have TTE=0. The data pipeline template takes care of this (or is supposed to) by shifting and removing the first timestep. This is easy to miss if you are using your own or a modified pipeline.
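A quick way to check a custom pipeline for this symptom, as a sketch. It assumes the targets are stored as a 2-D array of shape (n_sequences, n_timesteps), which may not match your layout, and the function name is made up.

```python
import numpy as np

def first_step_always_zero(tte):
    """True if every sequence starts with TTE == 0.

    tte : array of shape (n_sequences, n_timesteps)
    """
    tte = np.asarray(tte, dtype=float)
    return bool(np.all(tte[:, 0] == 0))
```

If this returns True for sequences that are created by their first event, the first timestep is probably exposing the truth and should be handled the way the pipeline template does.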
from wtte-rnn.