
Comments (26)

StatMixedML commented on May 14, 2024

If I understood correctly, XGBoostLSS allows us to predict the entire conditional distribution of our regression target. However, you need to specify the underlying distribution first.

Yes correct. One needs to specify a distribution in the first place.

I am dealing with a problem with heavily zero-inflated data (in fact, no Box-Cox-style transformation, like the log, changes the underlying distribution at all).

Since I haven't seen the data, I can only speculate. But the Negative Binomial is typically a good choice for count data, potentially also with many 0s, so you may want to try it as an alternative. If it does not work (let me know!), I can also add a Zero-Inflated Negative Binomial, since this is conceptually easier than implementing the Tweedie (which has been on my list for quite some time).
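As a quick way to check whether a plain Negative Binomial can already absorb the excess zeros, here is a minimal sketch (not from this thread; `y` is hypothetical placeholder data, fitted by the method of moments):

```python
import numpy as np
from scipy import stats

# Hypothetical count data with many zeros, standing in for the real target.
rng = np.random.default_rng(0)
y = rng.negative_binomial(n=0.8, p=0.3, size=10_000)

# Method-of-moments fit: mean = n(1-p)/p, variance = n(1-p)/p^2.
m, v = y.mean(), y.var()
assert v > m, "NB needs overdispersion (variance > mean)"
p_hat = m / v
n_hat = m * p_hat / (1 - p_hat)

# If the fitted NB already puts enough mass at zero, a zero-inflated
# extension may be unnecessary.
print("observed P(y=0):  ", np.mean(y == 0))
print("NB-implied P(y=0):", stats.nbinom.pmf(0, n_hat, p_hat))
```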

You may also want to read up on this:

Do We Really Need Zero-Inflated Models?

M5 Competition Uncertainty: Overdispersion, distributional forecasting, GAMLSS and beyond


StatMixedML commented on May 14, 2024

@edgBR Thanks for your interest.

I am not sure I understand your request. Can you please state it more clearly?


edgBR commented on May 14, 2024

Hi @StatMixedML

Apologies for not being specific; I was in a rush and just wanted to create this issue to develop later.

If I understood correctly, XGBoostLSS allows us to predict the entire conditional distribution of our regression target. However, you need to specify the underlying distribution first.

I am dealing with a problem with heavily zero-inflated data (in fact, no Box-Cox-style transformation, like the log, changes the underlying distribution at all). Because of that I am using the Tweedie loss function in XGBoost and I am tuning the Tweedie variance power: https://xgboost.readthedocs.io/en/stable/parameter.html#parameters-for-tweedie-regression-objective-reg-tweedie.
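A minimal sketch of what that sweep can look like, on synthetic placeholder data (the real `X`, `y` live elsewhere in the pipeline):

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import cross_val_score

# Synthetic placeholder data: many exact zeros plus a positive continuous part.
rng = np.random.default_rng(1)
X = rng.normal(size=(2_000, 5))
y = np.where(rng.random(2_000) < 0.6, 0.0, rng.gamma(2.0, 10.0, size=2_000))

# Sweep the variance power in (1, 2): near 1 is Poisson-like, near 2 Gamma-like.
for power in [1.1, 1.3, 1.5, 1.7, 1.9]:
    model = xgb.XGBRegressor(objective="reg:tweedie", tweedie_variance_power=power)
    score = cross_val_score(model, X, y, cv=3, scoring="neg_mean_absolute_error")
    print(power, score.mean())
```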

Therefore my assumption is that, in order to produce consistent confidence intervals for this problem, I will also need the Tweedie among the distributions available in XGBoostLSS.

BR
E


edgBR commented on May 14, 2024

Hi @StatMixedML, the problem is that my data is actually not count data. It is payment data (continuous, positive data).

I will try it and let you know in a couple of hours. Regarding your articles: very interesting reading, thanks for that.

FYI, the method I have built is similar to this one: https://scikit-lego.netlify.app/meta.html#Zero-Inflated-Regressor, and this methodology has improved my results by 5% compared with plain XGBoost with the Tweedie loss.
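For context, the scikit-lego approach looks roughly like this (a sketch only; `X_train`, `y_train` are placeholders, and the estimator choices are illustrative):

```python
from sklego.meta import ZeroInflatedRegressor
from xgboost import XGBClassifier, XGBRegressor

# Two-part model: the classifier decides zero vs. non-zero, the regressor
# (trained only on the non-zero rows) predicts the magnitude.
zir = ZeroInflatedRegressor(
    classifier=XGBClassifier(),
    regressor=XGBRegressor(objective="reg:tweedie", tweedie_variance_power=1.5),
)
# zir.fit(X_train, y_train)
# zir.predict(X_test)
```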

BR
E


StatMixedML commented on May 14, 2024

Hi @edgBR,

the problem is that my data is actually not count data. It is payment data (continuous, positive data).

Ok, I see. The Negative Binomial might then not be a suitable assumption.

FYI, the method I have built is similar to this one: https://scikit-lego.netlify.app/meta.html#Zero-Inflated-Regressor, and this methodology has improved my results by 5% compared with plain XGBoost with the Tweedie loss.

I am aware of this approach, and it reads like a viable one for your problem. However, I have some reservations, since you need to estimate two models: one for the 0 part and another for the non-0 part. More elegantly, you can do the same with a single zero-inflated model, which then also gives you proper uncertainty estimation. In principle, you can extend any distribution, say Normal, Gamma, etc., to account for the excess of 0s. In any case, you need a strong feature that accounts for the 0 part.
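A minimal sketch of what such a single zero-inflated likelihood could look like in PyTorch, using a Gamma with a point mass at 0 (the names `pi`, `alpha`, `beta` are illustrative, not XGBoostLSS's API):

```python
import torch

def zero_inflated_gamma_nll(y, pi, alpha, beta, eps=1e-8):
    """Negative log-likelihood of a Gamma distribution with a point mass at 0.

    pi    : probability of an exact zero, in (0, 1)
    alpha : Gamma concentration, > 0
    beta  : Gamma rate, > 0
    """
    gamma = torch.distributions.Gamma(alpha, beta)
    ll = torch.where(
        y == 0,
        torch.log(pi + eps),
        torch.log(1 - pi + eps) + gamma.log_prob(y.clamp(min=eps)),
    )
    return -ll.sum()

# Example: evaluate at fixed parameters for a toy target vector.
y = torch.tensor([0.0, 0.0, 3.2, 0.7])
print(zero_inflated_gamma_nll(y, pi=torch.tensor(0.5),
                              alpha=torch.tensor(2.0), beta=torch.tensor(1.0)))
```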


edgBR commented on May 14, 2024

Hi @StatMixedML

That's correct, you have to estimate two models. In fact, one of the reasons I built my own implementation is that this method assumes the classifier is perfect, which is not the case.

As of today I have created an HPO script that tunes the classifier hyperparameters and the regressor hyperparameters according to the regressor's optimization metric. I have also played with the imperfection of the classifier, and therefore I am injecting some 0s into my training data (instead of training everything with positive data). So in some cases my best model is a single XGBoost regression model trained with the Tweedie loss, and sometimes it is a classifier + an XGBoost regressor, also trained with the Tweedie loss so as not to get negative predictions.
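A rough sketch of this joint tuning, assuming Optuna for the HPO and synthetic placeholder data; the actual script differs, this only illustrates tuning both models against the regression metric:

```python
import numpy as np
import optuna
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier, XGBRegressor

# Placeholders: the real features/target come from elsewhere in the pipeline.
rng = np.random.default_rng(2)
X = rng.normal(size=(2_000, 5))
y = np.where(rng.random(2_000) < 0.6, 0.0, rng.gamma(2.0, 10.0, size=2_000))
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

def objective(trial):
    clf = XGBClassifier(max_depth=trial.suggest_int("clf_depth", 2, 8))
    reg = XGBRegressor(
        objective="reg:tweedie",
        tweedie_variance_power=trial.suggest_float("power", 1.01, 1.99),
        max_depth=trial.suggest_int("reg_depth", 2, 8),
    )
    clf.fit(X_tr, (y_tr > 0).astype(int))
    nz = y_tr > 0
    reg.fit(X_tr[nz], y_tr[nz])  # regressor only sees the non-zero rows
    pred = clf.predict(X_val) * reg.predict(X_val)
    return mean_absolute_error(y_val, pred)  # scored on the *regression* metric

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20)
```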

In any case, I see value in having the Tweedie implemented here.

For the moment I am using expectiles, as I found this paper:

Using expectile regression for classification ratemaking

which, despite being very far from my use case, contains some interesting conclusions.


StatMixedML commented on May 14, 2024

Hi @edgBR,

thanks for your detailed answer. I was also going to suggest using XGBoostLSS's expectile estimation. Let me know how it goes.


edgBR commented on May 14, 2024

@StatMixedML

For the moment I am getting too-wide confidence intervals, and the most problematic thing is that the lower bound contains negative values. Is there a way to put a cap on the expectiles?


StatMixedML commented on May 14, 2024

@edgBR I am afraid there is none, except to manually clip negative values to 0.
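For example, with NumPy (the array name is hypothetical):

```python
import numpy as np

# pred_expectiles: predicted expectiles from the model; clamp the lower
# bound at zero, since payments cannot be negative.
pred_expectiles = np.array([-3.1, 0.4, 12.7])
pred_expectiles = np.clip(pred_expectiles, 0.0, None)
print(pred_expectiles)  # [ 0.   0.4 12.7]
```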


edgBR commented on May 14, 2024

@StatMixedML understood.

Could you describe the implementation challenges of the Tweedie distribution? Is the gradient calculation with autograd the problem?

BR
E


StatMixedML commented on May 14, 2024

@edgBR There is conceptually no problem with the Tweedie; it has been on my list for some time now. It is just that I haven't spent time looking into it. Do you know of any PyTorch implementation of the Tweedie?


edgBR commented on May 14, 2024

@StatMixedML a couple of weeks ago I found this in one of the discussions on the gluon-ts GitHub:

https://discuss.pytorch.org/t/custom-tweedie-loss-throwing-an-error-in-pytorch/76349/6

The dispersion parameters and other useful mathematical definitions can be found here:

https://arxiv.org/pdf/1912.12356.pdf

Maybe this is helpful?
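For reference, the loss discussed in that thread boils down to the Tweedie negative quasi-log-likelihood for 1 < p < 2; a minimal PyTorch sketch, dropping the normalizing term that does not depend on mu (which is enough for fitting mu):

```python
import torch

def tweedie_loss(y, mu, p=1.5, eps=1e-8):
    """Tweedie negative quasi-log-likelihood for 1 < p < 2 (up to a
    constant in mu): -y * mu^(1-p)/(1-p) + mu^(2-p)/(2-p)."""
    mu = mu.clamp(min=eps)
    return (-y * mu.pow(1 - p) / (1 - p) + mu.pow(2 - p) / (2 - p)).mean()

# autograd supplies the gradients, so no hand-derived derivatives are needed.
mu = torch.tensor([0.5, 2.0, 4.0], requires_grad=True)
y = torch.tensor([0.0, 1.0, 6.0])
tweedie_loss(y, mu).backward()
print(mu.grad)
```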


StatMixedML commented on May 14, 2024

@edgBR I am aware of this, but I first need to check whether that is the correct density for the Tweedie. Do you know which density XGBoost is using?


edgBR commented on May 14, 2024

Hi @StatMixedML,

Unfortunately, I am unaware of what XGBoost is using underneath.

The traces in the code point me here:

https://github.com/tonydifranco/xgboost/blob/4f6d4eaac58c2d93671dbd827acc606b7e8fedf9/src/objective/regression_obj.cc

(line 285 and beyond), but there I can only find the gradient and the Hessian.

In any case, this seems like a good implementation of all the parameters, although it uses NumPy: https://github.com/thequackdaddy/tweedie/blob/master/tweedie/tweedie_dist.py
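As far as I can read from that source file, XGBoost works on the log link (the raw score is log(mu)); a NumPy re-statement of the gradient and Hessian as a custom-objective sketch (not verified against the C++ code):

```python
import numpy as np
import xgboost as xgb

RHO = 1.5  # tweedie_variance_power, 1 < rho < 2

def tweedie_obj(predt, dtrain):
    """Gradient/Hessian as in regression_obj.cc; predt is log(mu)."""
    y = dtrain.get_label()
    grad = -y * np.exp((1 - RHO) * predt) + np.exp((2 - RHO) * predt)
    hess = (-y * (1 - RHO) * np.exp((1 - RHO) * predt)
            + (2 - RHO) * np.exp((2 - RHO) * predt))
    return grad, hess

# dtrain = xgb.DMatrix(X, label=y)
# booster = xgb.train({"max_depth": 4}, dtrain, obj=tweedie_obj)
```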


StatMixedML commented on May 14, 2024

@edgBR Thanks for sharing. Looks good!

I am currently working on a different project, but you can try implementing the Tweedie with autograd yourself and open a PR once it is working.


edgBR commented on May 14, 2024

Hi @StatMixedML

Could you maybe provide a list of the changes that need to be implemented in order to incorporate the Tweedie?

If I understood correctly we need:

  • Autograd gradient calculation for each of the parameters
  • Autograd Hessian calculation for each of the parameters
  • The parameter list for the Tweedie distribution (in this case mu, sigma, and a power between 1 and 2, since you already implemented the Gamma)
  • A way of sampling from the Tweedie pdf (see the sketch after this list)

Anything else?
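On the sampling point: a Tweedie with 1 < p < 2 is a compound Poisson-Gamma, which gives a direct sampler. A sketch in NumPy (the function name is hypothetical):

```python
import numpy as np

def sample_tweedie(mu, phi, p, size, rng=None):
    """Sample Tweedie(mu, phi, p) for 1 < p < 2 as a compound Poisson-Gamma:
    N ~ Poisson(lam), Y = sum of N iid Gamma(shape=alpha, scale=theta), with
      lam   = mu**(2 - p) / (phi * (2 - p))
      alpha = (2 - p) / (p - 1)
      theta = phi * (p - 1) * mu**(p - 1).
    N = 0 yields an exact zero, which is where the zero mass comes from."""
    rng = rng or np.random.default_rng()
    lam = mu ** (2 - p) / (phi * (2 - p))
    alpha = (2 - p) / (p - 1)
    theta = phi * (p - 1) * mu ** (p - 1)
    n = rng.poisson(lam, size=size)
    # Sum of k iid Gamma(alpha, theta) draws is Gamma(k * alpha, theta).
    return np.array([rng.gamma(alpha * k, theta) if k > 0 else 0.0 for k in n])

draws = sample_tweedie(mu=2.0, phi=1.0, p=1.5, size=5)
```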


StatMixedML commented on May 14, 2024

@edgBR Besides the things you've mentioned, you would also need the quantile function (ppf) of the Tweedie.

The last function is needed since it is used for drawing samples from the predicted distribution. In fact, if you need quantiles rather than the samples themselves, the quantile function is more efficient.
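Assuming the NumPy implementation linked above exposes a scipy.stats-style frozen distribution (it appears to), both uses of the ppf would look roughly like this:

```python
import numpy as np
from tweedie import tweedie  # the NumPy implementation linked above

dist = tweedie(p=1.5, mu=10.0, phi=2.0)

# Quantiles directly from the quantile function (ppf) ...
q = dist.ppf([0.05, 0.5, 0.95])

# ... or samples via inverse-transform sampling: ppf of uniform draws.
u = np.random.default_rng(0).uniform(size=1_000)
samples = dist.ppf(u)
```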


StatMixedML commented on May 14, 2024

@edgBR Kindly asking for an update. Did you manage to implement the loss?


edgBR commented on May 14, 2024

Hi @StatMixedML

Unfortunately not, and I have decided to take a new path using MAPIE (https://mapie.readthedocs.io/en/latest/examples_regression/index.html) plus XGBoost with the Tweedie loss.
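A minimal sketch of that combination, assuming MAPIE's pre-1.0 `MapieRegressor` interface (`X_train` etc. are placeholders):

```python
from mapie.regression import MapieRegressor
from xgboost import XGBRegressor

# Conformal prediction intervals around a Tweedie-loss XGBoost point model.
base = XGBRegressor(objective="reg:tweedie", tweedie_variance_power=1.5)
mapie = MapieRegressor(base, method="plus", cv=5)

# mapie.fit(X_train, y_train)
# y_pred, y_pis = mapie.predict(X_test, alpha=0.1)  # 90% intervals
```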

BR
Edgar


StatMixedML commented on May 14, 2024

@edgBR Thanks for your feedback and for the MAPIE pointer, it looks interesting indeed.


edgBR commented on May 14, 2024

Hi @StatMixedML

Please reopen this.

I found an implementation of Tweedie estimation in Python which is validated against the R one:

https://github.com/thequackdaddy/tweedie/blob/master/tweedie/tweedie_dist.py

Migrating this to autograd might be easier than expected.

BR
E


StatMixedML commented on May 14, 2024

@edgBR Thanks for sharing; I need to look into the details of the implementation first, though.

Do you feel confident giving it a try and adding the Tweedie to XGBoostLSS?


edgBR commented on May 14, 2024

Hi @StatMixedML, I have asked my manager to allocate some time for this.

So most likely yes :)


StatMixedML commented on May 14, 2024

@edgBR If you want to use PyTorch's autograd, you would first need to translate all functions (more precisely, mainly the density) into PyTorch functions using tensors. Otherwise, PyTorch cannot create the computational graph needed to calculate derivatives.
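A toy illustration of the point, using a Gamma log-density written with tensors as a stand-in for the (harder) Tweedie density:

```python
import torch

# Once the log-density is expressed in torch tensors, autograd can
# differentiate it with respect to the distributional parameters.
mu = torch.tensor(2.0, requires_grad=True)
phi = torch.tensor(1.0, requires_grad=True)
y = torch.tensor(3.0)

# Mean/variance parameterization: mean = mu, variance = phi.
alpha, beta = mu.pow(2) / phi, mu / phi
log_density = torch.distributions.Gamma(alpha, beta).log_prob(y)

log_density.backward()
print(mu.grad, phi.grad)  # d log f / d mu, d log f / d phi
```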


StatMixedML commented on May 14, 2024

@edgBR Out of curiosity, how is the implementation progressing? Do you need advice or help?


StatMixedML commented on May 14, 2024

@edgBR Kindly asking for an update on this. Do you need advice or help?
