Comments (26)
If I understood correctly, XGBoostLSS allows us to predict the entire conditional distribution of our regression target. However, you need to specify the underlying distribution first.
Yes correct. One needs to specify a distribution in the first place.
I am dealing with a problem with heavily zero-inflated data (in fact, any Box-Cox-like transformation such as the log does not change the underlying distribution at all).
Since I haven't seen the data, I can only speculate on this. But the Negative Binomial is typically a good choice for count data, potentially also with many 0s. You may want to try it as an alternative. If this is not working (let me know!), I can also add a Zero-Inflated Negative Binomial, since this is conceptually easier than implementing the Tweedie (which has been on my list for quite some time).
You may also want to read up on these:
Do We Really Need Zero-Inflated Models?
M5 Competition Uncertainty: Overdispersion, distributional forecasting, GAMLSS and beyond
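To illustrate the Negative Binomial suggestion above (my own quick sketch with SciPy; the parameter values are arbitrary, not from the thread): with strong overdispersion and a small mean, the distribution already places a lot of mass at exactly zero, which is why it is often tried before reaching for an explicitly zero-inflated model.

```python
from scipy.stats import nbinom

# SciPy's (n, p) parameterization: mean = n * (1 - p) / p.
# Small n means heavy overdispersion; target a mean of 2.0.
n, mean = 0.5, 2.0
p = n / (n + mean)  # solves n * (1 - p) / p == mean

# P(Y = 0) = p**n for the Negative Binomial: already ~45% zeros here.
print(f"P(Y = 0) = {nbinom.pmf(0, n, p):.3f}")
```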
from xgboostlss.
@edgBR Thanks for your interest.
I am not sure I understand your request. Could you please specify it more clearly?
Hi @StatMixedML
Apologies for not being specific; I was in a rush and just wanted to open this issue to develop later.
If I understood correctly, XGBoostLSS allows us to predict the entire conditional distribution of our regression target. However, you need to specify the underlying distribution first.
I am dealing with a problem with heavily zero-inflated data (in fact, any Box-Cox-like transformation such as the log does not change the underlying distribution at all). Because of that, I am using the Tweedie loss function in XGBoost and tuning the Tweedie variance power: https://xgboost.readthedocs.io/en/stable/parameter.html#parameters-for-tweedie-regression-objective-reg-tweedie.
Therefore, my assumption is that, in order to produce consistent confidence intervals for this problem, the Tweedie distribution will also need to be among the distributions available in XGBoostLSS.
BR
E
Hi @StatMixedML, the problem is that my data is actually not count data. It is payment data (continuous, positive data).
I will try it and let you know in a couple of hours. Regarding your articles: very interesting reading, thanks for that.
FYI, the method I have built is similar to this one: https://scikit-lego.netlify.app/meta.html#Zero-Inflated-Regressor, and this methodology has improved my results by 5% compared with plain XGBoost with the Tweedie loss.
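For reference, the two-model idea behind that meta-estimator can be sketched with plain scikit-learn (a minimal illustration of the approach on synthetic data, not the actual implementation from the thread or from scikit-lego):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

rng = np.random.default_rng(0)

# Synthetic payment-like data: many exact zeros, positive continuous values otherwise.
X = rng.normal(size=(500, 3))
is_positive = rng.random(500) < 1 / (1 + np.exp(-X[:, 0]))  # zero part driven by feature 0
y = np.where(is_positive, np.exp(0.5 * X[:, 1]) + rng.gamma(2.0, 0.5, 500), 0.0)

# Stage 1: classify zero vs non-zero; Stage 2: regress on the positive subset only.
clf = LogisticRegression().fit(X, (y > 0).astype(int))
reg = LinearRegression().fit(X[y > 0], y[y > 0])

# Combined prediction: regressor output where the classifier says "non-zero", else 0.
pred = np.where(clf.predict(X) == 1, np.clip(reg.predict(X), 0, None), 0.0)
print(pred[:5])
```

In practice both stages would be boosted models with their own tuned hyperparameters; the point is only the hard hand-off from classifier to regressor, which is exactly what makes the approach sensitive to classifier mistakes.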
BR
E
Hi @edgBR,
the problem is that my data is actually not count data. It is payment data (continuous, positive data).
Ok, I see. The Negative Binomial might then not be a suitable assumption.
FYI, the method I have built is similar to this one: https://scikit-lego.netlify.app/meta.html#Zero-Inflated-Regressor, and this methodology has improved my results by 5% compared with plain XGBoost with the Tweedie loss.
I am aware of this approach, and it reads like a viable one for your problem. However, I have some reservations, since you need to estimate two models: one for the 0 part and another for the non-0 part. More elegantly, you can do the same with a single zero-inflated model, which then also allows for proper uncertainty estimation. In principle, you can extend any distribution, say, Normal, Gamma, etc., to account for the excess of 0s. In any case, you need a strong feature that accounts for the 0 part.
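The single-model alternative described above boils down to a mixture likelihood. A minimal sketch of a zero-inflated Gamma negative log-likelihood (my own illustration, assuming a point mass pi at zero mixed with a Gamma for the positive part):

```python
import numpy as np
from scipy.stats import gamma

def zi_gamma_nll(y, pi, shape, scale):
    """Negative log-likelihood of a zero-inflated Gamma:
    P(Y = 0) = pi, and Y | Y > 0 ~ Gamma(shape, scale)."""
    y = np.asarray(y, dtype=float)
    ll = np.where(
        y == 0,
        np.log(pi),
        # Guard the logpdf argument so the unused branch stays finite at y == 0.
        np.log1p(-pi) + gamma.logpdf(np.where(y > 0, y, 1.0), a=shape, scale=scale),
    )
    return -ll.sum()

y = np.array([0.0, 0.0, 1.3, 4.2, 0.7])
print(zi_gamma_nll(y, pi=0.4, shape=2.0, scale=1.0))
```

In a distributional-boosting setting, pi, shape, and scale would each be modeled as functions of the features rather than fixed scalars.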
Hi @StatMixedML
That's correct, you have to estimate two models. In fact, one of the reasons I built my own implementation is that this method assumes the classifier will be perfect, which is not the case.
As of today, I have created an HPO script that tunes the classifier hyperparameters and the regressor hyperparameters according to the regressor's optimization metric. I have also played with the imperfection of the classifier and am therefore injecting some 0s into my training data (instead of training everything with positive data). So in some cases my best model is a single XGBoost regression model trained with the Tweedie loss, and sometimes it is a classifier plus an XGBoost regressor, also trained with the Tweedie loss so as not to get negative predictions.
In any case, I see value in implementing the Tweedie here.
For the moment I am using expectiles as I found this paper:
Using expectile regression for classification ratemaking
which, despite being very far from my use case, contains some interesting conclusions.
Hi @edgBR ,
thanks for your detailed answer. I was also going to suggest using XGBoostLSS's expectile estimation. Let me know how it goes.
For the moment I am getting too wide confidence intervals, and the most problematic thing is that the lower bounds contain negative values. Is there a way to put a cap on the expectiles?
@edgBR I am afraid there is none; you would need to manually clip negative values to 0.
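The suggested post-processing is a one-liner (trivial sketch with made-up numbers):

```python
import numpy as np

# Hypothetical lower-expectile predictions with negative values.
expectile_preds = np.array([-0.8, 0.0, 2.3, 5.1])
clipped = np.clip(expectile_preds, 0, None)  # cap the lower bound at 0
print(clipped)  # [0.  0.  2.3 5.1]
```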
@StatMixedML understood.
Could you describe the implementation challenges of the Tweedie distribution? Is the gradient calculation with autograd the problem?
BR
E
@edgBR There is conceptually no problem with the Tweedie. It has been on my list for some time now; it is just that I haven't spent time looking into it. Do you know of any PyTorch implementation of the Tweedie?
@StatMixedML a couple of weeks ago I found this in one of the discussions in the gluon-ts github:
https://discuss.pytorch.org/t/custom-tweedie-loss-throwing-an-error-in-pytorch/76349/6
The dispersion parameter and other useful mathematical definitions can be found here:
https://arxiv.org/pdf/1912.12356.pdf
Maybe this is helpful?
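For comparison with any autograd version, here is a sketch (my own illustration, not code from the linked thread) of the Tweedie loss kernel under a log link for power rho in (1, 2), i.e. the negative log-likelihood up to terms that do not involve the prediction, together with its hand-derived gradient checked against finite differences:

```python
import numpy as np

def tweedie_loss(y, log_mu, rho):
    """Tweedie NLL kernel under a log link, 1 < rho < 2.
    log_mu is the raw (log-scale) prediction, mu = exp(log_mu)."""
    return (-y * np.exp((1 - rho) * log_mu) / (1 - rho)
            + np.exp((2 - rho) * log_mu) / (2 - rho))

def tweedie_grad(y, log_mu, rho):
    """Analytic gradient of the loss w.r.t. log_mu."""
    return -y * np.exp((1 - rho) * log_mu) + np.exp((2 - rho) * log_mu)

y, log_mu, rho = 3.0, 0.7, 1.5
eps = 1e-6
num_grad = (tweedie_loss(y, log_mu + eps, rho)
            - tweedie_loss(y, log_mu - eps, rho)) / (2 * eps)
print(tweedie_grad(y, log_mu, rho), num_grad)  # should agree closely
```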
@edgBR I am aware of this, but I need to check whether that is the correct density for the Tweedie. Do you know which density XGBoost is using?
Hi @StatMixedML,
Unfortunately, I am unaware of what XGBoost is using underneath.
The traces in the code point me here:
(line 285 and beyond), but there I can only find the gradient and the Hessian.
In any case, this seems like a good implementation of all parameters: https://github.com/thequackdaddy/tweedie/blob/master/tweedie/tweedie_dist.py, but using NumPy.
@edgBR Thanks for sharing. Looks good!
I am currently working on a different project, but you can try implementing the Tweedie with autograd yourself and open a PR once it is working.
Hi @StatMixedML
Could you maybe provide a list of the changes that need to be implemented in order to incorporate the Tweedie?
If I understood correctly we need:
- AutoGrad Gradient Calculation for each of the parameters
- AutoGrad Hessian Calculation for each of the parameters
- The parameter list for the Tweedie distribution (in this case mu, sigma, and a power parameter between 1 and 2, since you have already implemented the Gamma)
- A way of sampling from the Tweedie distribution
Anything else?
@edgBR Besides the things you've mentioned, you would also need
- The negative log-likelihood, which is in turn based on the pdf of the distribution
- Parameter Dictionary
- Inverse Parameter Dictionary
- Starting Values
- Function for calculating quantiles from predicted distribution
The last function is needed since it is used for drawing samples from the predicted distribution. In fact, if you need quantiles rather than the samples themselves, the quantile function is more efficient.
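The gradient and Hessian items on this checklist can be obtained with PyTorch's autograd. A minimal sketch (my own illustration, using a Gaussian NLL as a stand-in; the same pattern would apply once the Tweedie density is expressed with torch tensors):

```python
import torch

# Observations and the parameter vector [mu, log_sigma] we differentiate w.r.t.
y = torch.tensor([0.5, 1.2, -0.3])
params = torch.tensor([0.0, 0.0], requires_grad=True)

mu, sigma = params[0], params[1].exp()
nll = -torch.distributions.Normal(mu, sigma).log_prob(y).sum()

# First derivatives; create_graph=True keeps the graph for second derivatives.
(grad,) = torch.autograd.grad(nll, params, create_graph=True)
# Hessian: differentiate each gradient component again.
hess = torch.stack([torch.autograd.grad(g, params, retain_graph=True)[0] for g in grad])
print(grad)
print(hess)
```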
@edgBR Kindly asking for an update. Did you manage to implement the loss?
Hi @StatMixedML
Unfortunately not, and I have decided to take a new path using MAPIE (https://mapie.readthedocs.io/en/latest/examples_regression/index.html) plus XGBoost with the Tweedie loss.
BR
Edgar
@edgBR Thanks for your feedback and for the MAPIE pointer, it looks interesting indeed.
Hi @StatMixedML
Please reopen this.
I found an implementation of Tweedie estimation in Python that is validated against the R one:
https://github.com/thequackdaddy/tweedie/blob/master/tweedie/tweedie_dist.py
Migrating this to autograd might be easier than expected.
BR
E
@edgBR Thanks for sharing; I need to look into the details of the implementation first, though.
Would you be confident giving it a try and adding the Tweedie to XGBoostLSS?
Hi @StatMixedML, I have asked my manager to allocate some time for this.
So most likely yes :)
@edgBR If you want to use PyTorch's autograd, you would first need to translate all functions (more precisely, mainly the density) into PyTorch functions using tensors. Otherwise, PyTorch cannot create the computational graph to calculate derivatives.
@edgBR Out of curiosity, how is the implementation progressing? Need advice/help?
@edgBR Kindly asking for an update on this. Need advice/help?