Comments (26)
If I understood correctly, XGBoostLSS allows us to predict the entire conditional distribution of our regression target. However, you need to specify the underlying distribution first.
Yes correct. One needs to specify a distribution in the first place.
I am dealing with a problem with heavily zero-inflated data (in fact, any Box-Cox-like transformation such as the log does not change the underlying distribution at all).
Since I haven't seen the data, I can only speculate on this. But the Negative Binomial is typically a good choice for count data, potentially also with many 0s. You may want to try it as an alternative. If this is not working (let me know!), I can also add a Zero-Inflated Negative Binomial, since this is conceptually easier than implementing the Tweedie (which has been on my list for quite some time).
You may also want to read up on these:
Do We Really Need Zero-Inflated Models?
M5 Competition Uncertainty: Overdispersion, distributional forecasting, GAMLSS and beyond
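To illustrate the Negative Binomial suggestion above (my own quick sketch with SciPy; the parameter values are arbitrary, not from the thread): with strong overdispersion and a small mean, the distribution already places a lot of mass at exactly zero, which is why it is often tried before reaching for an explicitly zero-inflated model.

```python
from scipy.stats import nbinom

# SciPy's (n, p) parameterization: mean = n * (1 - p) / p.
# Small n means heavy overdispersion; target a mean of 2.0.
n, mean = 0.5, 2.0
p = n / (n + mean)  # solves n * (1 - p) / p == mean

# P(Y = 0) = p**n for the Negative Binomial: already ~45% zeros here.
print(f"P(Y = 0) = {nbinom.pmf(0, n, p):.3f}")
```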
from xgboostlss.
@edgBR Thanks for your interest.
I am not sure I understand your request. Could you please specify it more clearly?
Hi @StatMixedML
Apologies for not being specific; I was in a rush and just wanted to open this issue to develop later.
If I understood correctly, XGBoostLSS allows us to predict the entire conditional distribution of our regression target. However, you need to specify the underlying distribution first.
I am dealing with a problem with heavily zero-inflated data (in fact, any Box-Cox-like transformation such as the log does not change the underlying distribution at all). Because of that, I am using the Tweedie loss function in XGBoost and tuning the Tweedie variance power: https://xgboost.readthedocs.io/en/stable/parameter.html#parameters-for-tweedie-regression-objective-reg-tweedie.
Therefore, my assumption is that, in order to produce consistent confidence intervals for this problem, the Tweedie distribution will also need to be among the distributions available in XGBoostLSS.
BR
E
Hi @StatMixedML, the problem is that my data is actually not count data. It is payment data (continuous, positive data).
I will try it and let you know in a couple of hours. Regarding your articles: very interesting reading, thanks for that.
FYI, the method I have built is similar to this one: https://scikit-lego.netlify.app/meta.html#Zero-Inflated-Regressor, and this methodology has improved my results by 5% compared with plain XGBoost with the Tweedie loss.
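For reference, the two-model idea behind that meta-estimator can be sketched with plain scikit-learn (a minimal illustration of the approach on synthetic data, not the actual implementation from the thread or from scikit-lego):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

rng = np.random.default_rng(0)

# Synthetic payment-like data: many exact zeros, positive continuous values otherwise.
X = rng.normal(size=(500, 3))
is_positive = rng.random(500) < 1 / (1 + np.exp(-X[:, 0]))  # zero part driven by feature 0
y = np.where(is_positive, np.exp(0.5 * X[:, 1]) + rng.gamma(2.0, 0.5, 500), 0.0)

# Stage 1: classify zero vs non-zero; Stage 2: regress on the positive subset only.
clf = LogisticRegression().fit(X, (y > 0).astype(int))
reg = LinearRegression().fit(X[y > 0], y[y > 0])

# Combined prediction: regressor output where the classifier says "non-zero", else 0.
pred = np.where(clf.predict(X) == 1, np.clip(reg.predict(X), 0, None), 0.0)
print(pred[:5])
```

In practice both stages would be boosted models with their own tuned hyperparameters; the point is only the hard hand-off from classifier to regressor, which is exactly what makes the approach sensitive to classifier mistakes.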
BR
E
Hi @edgBR,
the problem is that my data is actually not count data. It is payment data (continuous, positive data).
Ok, I see. The Negative Binomial might then not be a suitable assumption.
FYI, the method I have built is similar to this one: https://scikit-lego.netlify.app/meta.html#Zero-Inflated-Regressor, and this methodology has improved my results by 5% compared with plain XGBoost with the Tweedie loss.
I am aware of this approach, and it reads like a viable one for your problem. However, I have some reservations, since you need to estimate two models: one for the 0 part and another for the non-0 part. More elegantly, you can do the same with a single zero-inflated model, which then also allows for proper uncertainty estimation. In principle, you can extend any distribution, say, Normal, Gamma, etc., to account for the excess of 0s. In any case, you need a strong feature that accounts for the 0 part.
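The single-model alternative described above boils down to a mixture likelihood. A minimal sketch of a zero-inflated Gamma negative log-likelihood (my own illustration, assuming a point mass pi at zero mixed with a Gamma for the positive part):

```python
import numpy as np
from scipy.stats import gamma

def zi_gamma_nll(y, pi, shape, scale):
    """Negative log-likelihood of a zero-inflated Gamma:
    P(Y = 0) = pi, and Y | Y > 0 ~ Gamma(shape, scale)."""
    y = np.asarray(y, dtype=float)
    ll = np.where(
        y == 0,
        np.log(pi),
        # Guard the logpdf argument so the unused branch stays finite at y == 0.
        np.log1p(-pi) + gamma.logpdf(np.where(y > 0, y, 1.0), a=shape, scale=scale),
    )
    return -ll.sum()

y = np.array([0.0, 0.0, 1.3, 4.2, 0.7])
print(zi_gamma_nll(y, pi=0.4, shape=2.0, scale=1.0))
```

In a distributional-boosting setting, pi, shape, and scale would each be modeled as functions of the features rather than fixed scalars.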
Hi @StatMixedML
That's correct, you have to estimate two models. In fact, one of the reasons I built my own implementation is that this method assumes the classifier will be perfect, which is not the case.
As of today, I have created an HPO script that tunes the classifier hyperparameters and the regressor hyperparameters according to the regressor's optimization metric. I have also played with the imperfection of the classifier and am therefore injecting some 0s into my training data (instead of training everything with positive data). So in some cases my best model is a single XGBoost regression model trained with the Tweedie loss, and sometimes it is a classifier plus an XGBoost regressor, also trained with the Tweedie loss so as not to get negative predictions.
In any case, I see value in implementing the Tweedie here.
For the moment I am using expectiles as I found this paper:
Using expectile regression for classification ratemaking
which, despite being very far from my use case, contains some interesting conclusions.
Hi @edgBR ,
thanks for your detailed answer. I was also going to suggest using XGBoostLSS's expectile estimation. Let me know how it goes.
For the moment I am getting too wide confidence intervals, and the most problematic thing is that the lower bounds contain negative values. Is there a way to put a cap on the expectiles?
@edgBR I am afraid there is none; you would need to manually clip negative values to 0.
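The suggested post-processing is a one-liner (trivial sketch with made-up numbers):

```python
import numpy as np

# Hypothetical lower-expectile predictions with negative values.
expectile_preds = np.array([-0.8, 0.0, 2.3, 5.1])
clipped = np.clip(expectile_preds, 0, None)  # cap the lower bound at 0
print(clipped)  # [0.  0.  2.3 5.1]
```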
@StatMixedML understood.
Could you describe the implementation challenges of the Tweedie distribution? Is the gradient calculation with autograd the problem?
BR
E
@edgBR There is conceptually no problem with the Tweedie. It has been on my list for some time now; it is just that I haven't spent time looking into it. Do you know of any PyTorch implementation of the Tweedie?
@StatMixedML a couple of weeks ago I found this in one of the discussions in the gluon-ts github:
https://discuss.pytorch.org/t/custom-tweedie-loss-throwing-an-error-in-pytorch/76349/6
The dispersion parameter and other useful mathematical definitions can be found here:
https://arxiv.org/pdf/1912.12356.pdf
Maybe this is helpful?
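For comparison with any autograd version, here is a sketch (my own illustration, not code from the linked thread) of the Tweedie loss kernel under a log link for power rho in (1, 2), i.e. the negative log-likelihood up to terms that do not involve the prediction, together with its hand-derived gradient checked against finite differences:

```python
import numpy as np

def tweedie_loss(y, log_mu, rho):
    """Tweedie NLL kernel under a log link, 1 < rho < 2.
    log_mu is the raw (log-scale) prediction, mu = exp(log_mu)."""
    return (-y * np.exp((1 - rho) * log_mu) / (1 - rho)
            + np.exp((2 - rho) * log_mu) / (2 - rho))

def tweedie_grad(y, log_mu, rho):
    """Analytic gradient of the loss w.r.t. log_mu."""
    return -y * np.exp((1 - rho) * log_mu) + np.exp((2 - rho) * log_mu)

y, log_mu, rho = 3.0, 0.7, 1.5
eps = 1e-6
num_grad = (tweedie_loss(y, log_mu + eps, rho)
            - tweedie_loss(y, log_mu - eps, rho)) / (2 * eps)
print(tweedie_grad(y, log_mu, rho), num_grad)  # should agree closely
```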
@edgBR I am aware of this, but I need to check whether that is the correct density for the Tweedie. Do you know which density XGBoost is using?
Hi @StatMixedML,
Unfortunately, I am unaware of what XGBoost is using underneath.
The traces in the code point me here:
(line 285 and beyond), but there I can only find the gradient and the Hessian.
In any case, this seems like a good implementation of all parameters: https://github.com/thequackdaddy/tweedie/blob/master/tweedie/tweedie_dist.py, but using NumPy.
@edgBR Thanks for sharing. Looks good!
I am currently working on a different project, but you can try implementing the Tweedie with autograd yourself and open a PR once it is working.
Hi @StatMixedML
Could you maybe provide a list of the changes that need to be implemented in order to incorporate the Tweedie?
If I understood correctly we need:
- AutoGrad Gradient Calculation for each of the parameters
- AutoGrad Hessian Calculation for each of the parameters
- The parameter list for the Tweedie distribution (in this case mu, sigma, and a power parameter between 1 and 2, since you have already implemented the Gamma)
- A way of sampling from the Tweedie distribution
Anything else?
@edgBR Besides the things you've mentioned, you would also need
- The negative log-likelihood, which is in turn based on the pdf of the distribution
- Parameter Dictionary
- Inverse Parameter Dictionary
- Starting Values
- Function for calculating quantiles from predicted distribution
The last function is needed since it is used for drawing samples from the predicted distribution. In fact, if you need quantiles rather than the samples themselves, the quantile function is more efficient.
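The gradient and Hessian items on this checklist can be obtained with PyTorch's autograd. A minimal sketch (my own illustration, using a Gaussian NLL as a stand-in; the same pattern would apply once the Tweedie density is expressed with torch tensors):

```python
import torch

# Observations and the parameter vector [mu, log_sigma] we differentiate w.r.t.
y = torch.tensor([0.5, 1.2, -0.3])
params = torch.tensor([0.0, 0.0], requires_grad=True)

mu, sigma = params[0], params[1].exp()
nll = -torch.distributions.Normal(mu, sigma).log_prob(y).sum()

# First derivatives; create_graph=True keeps the graph for second derivatives.
(grad,) = torch.autograd.grad(nll, params, create_graph=True)
# Hessian: differentiate each gradient component again.
hess = torch.stack([torch.autograd.grad(g, params, retain_graph=True)[0] for g in grad])
print(grad)
print(hess)
```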
@edgBR Kindly asking for an update. Did you manage to implement the loss?
Hi @StatMixedML
Unfortunately not, and I have decided to take a new path using MAPIE (https://mapie.readthedocs.io/en/latest/examples_regression/index.html) plus XGBoost with the Tweedie loss.
BR
Edgar
@edgBR Thanks for your feedback and for the MAPIE pointer, it looks interesting indeed.
Hi @StatMixedML
Please reopen this.
I found an implementation of Tweedie estimation in Python that is validated against the R one:
https://github.com/thequackdaddy/tweedie/blob/master/tweedie/tweedie_dist.py
Migrating this to autograd might be easier than expected.
BR
E
@edgBR Thanks for sharing; I need to look into the details of the implementation first, though.
Would you be confident giving it a try and adding the Tweedie to XGBoostLSS?
Hi @StatMixedML, I have asked my manager to allocate some time for this.
So most likely yes :)
@edgBR If you want to use PyTorch's autograd, you would first need to translate all functions (more precisely, mainly the density) into PyTorch functions using tensors. Otherwise, PyTorch cannot create the computational graph to calculate derivatives.
@edgBR Out of curiosity, how is the implementation progressing? Need advice/help?
@edgBR Kindly asking for an update on this. Need advice/help?