
xgboostlss's Introduction


XGBoostLSS - An extension of XGBoost to probabilistic modelling

We introduce a comprehensive framework that models and predicts the full conditional distribution of univariate and multivariate targets as a function of covariates. Choosing from a wide range of continuous, discrete, and mixed discrete-continuous distributions, modelling and predicting the entire conditional distribution greatly enhances the flexibility of XGBoost, as it allows probabilistic forecasts to be created, from which prediction intervals and quantiles of interest can be derived.

Features

✅ Estimation of all distributional parameters.
✅ Normalizing Flows allow modelling of complex and multi-modal distributions.
✅ Mixture-Densities can model a diverse range of data characteristics.
✅ Multi-target regression allows modelling of multivariate responses and their dependencies.
✅ Zero-Adjusted and Zero-Inflated Distributions for modelling excess of zeros in the data.
✅ Automatic derivation of Gradients and Hessians of all distributional parameters using PyTorch.
✅ Automated hyper-parameter search, including pruning, is done via Optuna.
✅ The output of XGBoostLSS is explained using SHapley Additive exPlanations.
✅ XGBoostLSS provides full compatibility with all the features and functionality of XGBoost.
✅ XGBoostLSS is available in Python.

News

💥 [2024-01-19] Release of XGBoostLSS to PyPI.
💥 [2023-08-25] Release of v0.4.0 introduces Mixture-Densities. See the release notes for an overview.
💥 [2023-07-19] Release of v0.3.0 introduces Normalizing Flows. See the release notes for an overview.
💥 [2023-06-22] Release of v0.2.2. See the release notes for an overview.
💥 [2023-06-21] XGBoostLSS now supports multi-target regression.
💥 [2023-06-07] XGBoostLSS now supports Zero-Inflated and Zero-Adjusted Distributions.
💥 [2023-05-26] Release of v0.2.1. See the release notes for an overview.
💥 [2023-05-18] Release of v0.2.0. See the release notes for an overview.
💥 [2021-12-22] XGBoostLSS now supports estimating the full predictive distribution via Expectile Regression.
💥 [2021-12-20] XGBoostLSS is initialized with suitable starting values to improve convergence of estimation.
💥 [2021-12-04] XGBoostLSS now supports automatic derivation of Gradients and Hessians.
💥 [2021-12-02] XGBoostLSS now supports pruning during hyperparameter optimization.
💥 [2021-11-14] XGBoostLSS v0.1.0 is released!

Installation

To install the development version, please use

pip install git+https://github.com/StatMixedML/XGBoostLSS.git

For the PyPI version, please use

pip install xgboostlss

Available Distributions

Our framework is built upon PyTorch and Pyro, enabling users to harness a diverse set of distributional families. XGBoostLSS currently supports a wide range of distributions; see the documentation for the complete list.

How to Use

Please visit the example section for guidance on how to use the framework.
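As a minimal sketch (assuming the class-based API with a Gaussian response, as used in the example notebooks; argument names such as stabilization, response_fn, loss_fn, pred_type and n_samples are taken from those examples and may differ between versions, and X_train/y_train/X_test stand in for your own data):

import xgboost as xgb
from xgboostlss.model import XGBoostLSS
from xgboostlss.distributions.Gaussian import Gaussian

# Standard xgboost DMatrix objects built from your data.
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test)

# Model the full conditional Gaussian: each boosting round updates loc and scale.
xgblss = XGBoostLSS(Gaussian(stabilization="None", response_fn="exp", loss_fn="nll"))
xgblss.train({"eta": 0.1, "max_depth": 3}, dtrain, num_boost_round=100)

# Predicted distributional parameters and derived quantiles for each test row.
pred_params = xgblss.predict(dtest, pred_type="parameters")
pred_quantiles = xgblss.predict(dtest, pred_type="quantiles", n_samples=1000, quantiles=[0.05, 0.95])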

Documentation

For more information and context, please visit the documentation.

Feedback

We encourage you to provide feedback on how to enhance XGBoostLSS or request the implementation of additional distributions by opening a new discussion.

How to Cite

If you use XGBoostLSS in your research, please cite it as:

@misc{Maerz2023,
  author = {Alexander M\"arz},
  title = {{XGBoostLSS: An Extension of XGBoost to Probabilistic Modelling}},
  year = {2023},
  note = {GitHub repository, Version 0.4.0},
  howpublished = {\url{https://github.com/StatMixedML/XGBoostLSS}}
}

Reference Paper

The accompanying papers are available on arXiv; see the linked preprints in the repository.


xgboostlss's Issues

Feature Request: Support Lambert W x F distributions

It would be great if XGBoostLSS could support Lambert W x F distributions. Particularly useful are Lambert W x Gaussian distributions (Tukey's h is a special case for $\alpha=1$ and $h = \delta$), as they can be used to transform data to normality even when the original data is (very) heavy-tailed.

In the XGBoostLSS context, I can see this being useful whenever normal regression is too restrictive to give correct tail probability estimates (e.g., low sample size; financial data): one can inspect the $\delta$ predictions from XGBoostLSS to see which parts of the feature space have longer/heavier tails (more uncertainty) than others. Secondly, skewed/heavy-tailed Lambert W x Gamma distributions are useful for imposing a heavier right tail in survival-like problems.

I'm not aware of a PyTorch implementation of the Lambert W function, let alone of Lambert W x F distributions. TensorFlow has both implemented; scipy.special.lambertw implements the Lambert W function.

If a PyTorch implementation of the distribution is required to make this work in XGBoostLSS, then as an alternative, AFAICT this should be possible to accomplish using normalizing flows, with the heavy-tail Lambert W transformation as a specific normalizing-flow function.
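To make the normalizing-flow route concrete, below is a hypothetical sketch (not part of XGBoostLSS) of the heavy-tail Lambert W transformation z = u * exp(delta/2 * u^2) written as a torch.distributions Transform, with scipy.special.lambertw used for the inverse:

import torch
from scipy.special import lambertw
from torch.distributions import Normal, Transform, TransformedDistribution, constraints

class HeavyTailLambertWTransform(Transform):
    """z = u * exp(delta/2 * u^2): maps a Gaussian input to a heavy-tailed output."""
    domain = constraints.real
    codomain = constraints.real
    bijective = True
    sign = +1

    def __init__(self, delta):
        super().__init__()
        self.delta = delta  # delta > 0 controls tail heaviness

    def _call(self, u):
        return u * torch.exp(0.5 * self.delta * u ** 2)

    def _inverse(self, z):
        # u = sign(z) * sqrt(W(delta * z^2) / delta); SciPy's Lambert W is used here,
        # so this branch is not differentiable (sufficient for a density-evaluation sketch).
        w = torch.from_numpy(lambertw(self.delta * z.detach().numpy() ** 2).real).to(z.dtype)
        return torch.sign(z) * torch.sqrt(w / self.delta)

    def log_abs_det_jacobian(self, u, z):
        # dz/du = exp(delta/2 * u^2) * (1 + delta * u^2)
        return 0.5 * self.delta * u ** 2 + torch.log1p(self.delta * u ** 2)

# Heavy-tailed "Lambert W x Gaussian" with delta = 0.2, built as a transformed Normal.
dist = TransformedDistribution(Normal(0.0, 1.0), [HeavyTailLambertWTransform(0.2)])
samples = dist.sample((5,))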


Multi-task learning and ONNX support

Thanks for the great work!

I have two questions:

  1. Can we perform multi-task learning in a single training run, where one task is classification with categorical targets (classes) and the other is regression with continuous targets?

  2. Do XGBoostLSS models support ONNX conversion?

SHAP for categorical features

I'm getting an error when trying to conduct SHAP interpretations for a model containing categorical features:

[18:06:07] WARNING: /workspace/src/c_api/c_api.cc:1240: Saving into deprecated binary model format, please consider using `json` or `ubj`. Model format will default to JSON in XGBoost 2.2 if not specified.
<IPython.core.display.HTML object>
Error in py_call_impl(callable, call_args$unnamed, call_args$named) : 
  xgboost.core.XGBoostError: [18:06:07] /workspace/src/tree/tree_model.cc:899: Check failed: !HasCategoricalSplit(): Please use JSON/UBJSON for saving models with categorical splits.
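A possible workaround (an assumption, not a confirmed fix for the SHAP code path): models with categorical splits can only be serialized to JSON or UBJSON, so if you have a handle on the underlying xgboost Booster, save it explicitly in one of those formats (the booster attribute name below is a guess and may differ by version):

booster = xgblss.booster                   # hypothetical attribute holding the xgb.Booster
booster.save_model("xgblss_model.json")    # JSON (or "xgblss_model.ubj" for UBJSON)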

large scale values for NBI distribution

Hi,

Thank you for creating XGBoostLSS.
I have a question: when I set the distribution to "NBI" (Negative Binomial) and make predictions for some samples, I see that the scale parameter is always the same for all samples and greater than zero (for example, 88). I assumed that for the negative binomial the scale equals 1/n, and because n > 0 it should be less than 1 most of the time.

Am I misunderstanding something here?

Distribution error: Tensor of shape..

First of all, thanks for the new PyTorch version.
I've been using the previous versions; today I saw the new release and wanted to give it a try, with the data and code that worked fine in the previous version.

After adapting the old code to the new version, following the examples, I've noticed some problems with the distributions. When I pass my label data to the optimization or training, I get something along the lines of:

Expected value argument (Tensor of shape (8700,)) to be within the support (GreaterThan(lower_bound=0.0)) of the distribution LogNormal(), but found invalid values:
tensor([0.2782, 0.3064, 0.3202,  ..., 0.3338, 0.3202, 0.3202])

or

Expected parameter scale (Tensor of shape ()) of distribution Normal(loc: -974.14453125, scale: -1210.6866455078125) to satisfy the constraint GreaterThan(lower_bound=0.0), but found invalid values:
-1210.6866455078125

This happens with every distribution I've tried.
My dataset values lie between 0.09 and 0.9; I've tried similar datasets and got similar results. For one dataset I managed to run the model by multiplying the values by 10, but for the other datasets this does not work.

As a reminder, all of these datasets worked fine with the previous version. Do you know what the reason might be?
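For debugging errors like the ones above, a quick generic check (plain PyTorch, not an XGBoostLSS utility) is to verify the labels against the support of the chosen distribution before training:

import torch
from torch.distributions import LogNormal

y = torch.tensor([0.2782, 0.3064, 0.3202, 0.3338])
# LogNormal support is the positive reals; .check() flags any out-of-support labels.
print(LogNormal(0.0, 1.0).support.check(y).all())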

Reducing `install_requires` to minimum (& expand `extras_require`) and looser version ranges

At a quick glance, the current setup.py seems to list every dependency as an absolute requirement, with very specific version ranges. If at all possible, it would greatly improve compatibility with existing Python environments (and let more people use the package without having to resolve conflicts) if install_requires only specified the absolutely required modules (e.g., plotting or optuna is not really needed to use this great package) with minimum versions (>=), instead of the (approximate) ~= ranges.

See also first accepted answer here:
https://stackoverflow.com/questions/6947988/when-to-use-pip-requirements-file-versus-install-requires-in-setup-py

Curious to learn whether this package really has to be so specific/restrictive about its dependencies; optional dependencies could be declared via extras_require, as suggested in
https://stackoverflow.com/questions/10572603/specifying-optional-dependencies-in-pypi-python-setup-py
(see the sketch below).
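A hypothetical sketch of what such a slimmer dependency declaration could look like (the package names and version floors below are purely illustrative, not the project's actual requirements):

from setuptools import find_packages, setup

setup(
    name="xgboostlss",
    packages=find_packages(),
    # Only the core runtime requirements, with ">=" floors instead of "~=" ranges.
    install_requires=[
        "xgboost>=1.7",
        "torch>=1.13",
        "numpy>=1.22",
        "pandas>=1.5",
    ],
    # Optional functionality moves into extras: pip install xgboostlss[plot,tune]
    extras_require={
        "plot": ["matplotlib>=3.5", "seaborn>=0.12"],
        "tune": ["optuna>=3.0"],
    },
)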

Model Tuning: Validation Metric

I'm looking into the code for the xgboostlss class and it seems like the validation metric is hardcoded to the negative log-likelihood. Will there be flexibility to choose the validation metric (e.g., MAE)?

As of now my tuning process is returning inf for each trial.

Distribution loss_fn

Hi there,
with the recent update to model.py I've started getting an error:
'DistributionClass' object has no attribute 'loss_fn'

--> 372 pruning_callback = optuna.integration.XGBoostPruningCallback(trial, f"test-{self.dist.loss_fn}")
374 xgblss_param_tuning = self.cv(params=hyper_params,
375 dtrain=dtrain,
376 num_boost_round=num_boost_round,
(...)
381 verbose_eval=False
382 )
384 opt_rounds = xgblss_param_tuning[f"test-{self.dist.loss_fn}-mean"].idxmin() + 1

Any idea how to fix it?

Serialization issue

Dear Alexander,

We are now comparing MAPIE confidence intervals to a custom methodology we have created based on XGBoostLSS.

Basically, we are using the expectiles that XGBoostLSS provides to fit a CDF that approximates a Tweedie CDF. (This is a bit hacky, but we are getting the theoretical coverage.)

The problem is that we split this procedure into two parts and use a joblib dump of the XGBoostLSS model as the input to the CDF approximation step.

I have not seen any example of how to serialize XGBoostLSS objects but when trying with joblib:

        X_train_ci = processor_reg.transform(regression_df.drop(columns=[arguments.clf_targets] +
                                                                                [arguments.reg_targets])) 
        y_train_ci = regression_df[arguments.reg_targets]
        n_cpu = multiprocessing.cpu_count()
        dtrain = xgb.DMatrix(X_train_ci, label=y_train_ci, nthread=n_cpu)

        logging.info("Training XGBoost")
        xgboostlss_model_expectile = xgboostlss.train(hyperparameters,
                                                      dtrain,
                                                      dist=distribution_expectile,
                                                      num_boost_round=hyperparameters["opt_rounds"],
                                                      verbose_eval=True)
        
        if partial:
            logging.info("Serializing partial fit")
            joblib.dump(xgboostlss_model_expectile, 'outputs/models/zip_conf_partial_fit.joblib')
        else:
            logging.info("Serializing full fit")
            joblib.dump(xgboostlss_model_expectile, 'outputs/models/zip_conf.joblib')

We are getting a serialization error (attached as a screenshot in the original issue).

distribution_expectile is defined as:

        distribution_expectile = Expectile                                   
        distribution_expectile.expectiles =  arguments.exp_list
        distribution_expectile.stabilize = "MAD" 

Are we choosing the wrong serialization method? Do we need to save XGBoostLSS in a different way?

BR
Edgar

Constant prediction parameters

When I use the BCT distribution on a dataset, the resulting prediction parameters (location, nu, tau) are all constant. I believe nu and tau stay at their initial values (0.5, 10). Only the scale changes, as in the Gaussian example. I am wondering what this means?

The parameters I used for training are:
Best trial: Value: 125.35836839999999 Params: eta: 0.040946512023655214 max_depth: 4 gamma: 2.9779429635527073e-05 subsample: 0.2887880027234064 colsample_bytree: 0.3208686624698766 min_child_weight: 316 opt_rounds: 500

Regarding XGBoostLSS

Is it possible to do point prediction using XGBoostLSS? If it can only predict a prediction interval, then after uncertainty quantification, how could we know whether the predictions are improving? Most importantly, alongside the uncertainty quantification we need point predictions of the parameters in our research. Any help with this problem would be very helpful.
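For reference, a sketch of how point predictions can be read off the predicted parameters (assuming a Gaussian response and that predict(..., pred_type="parameters") returns a DataFrame with "loc" and "scale" columns, as in the example notebooks; xgblss and dtest are a fitted model and a test DMatrix):

pred_params = xgblss.predict(dtest, pred_type="parameters")
point_forecast = pred_params["loc"]                        # conditional mean as point prediction
lower = pred_params["loc"] - 1.96 * pred_params["scale"]   # approximate 2.5% bound
upper = pred_params["loc"] + 1.96 * pred_params["scale"]   # approximate 97.5% bound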

Bump versions and tag them

With active development, it may be helpful to bump the version after significant changes, and tag the releases. This will also help with publishing to PyPI.

Let me know if you would like any assistance with this.

Loading Models - Losing Information

When I use joblib to load a model built by XGBoostLSS, I find that the reloaded model loses some information. The reloaded model outputs different values from the original model, even for the same input. The difference between the outputs is a constant for all inputs; can you fix this? I think this is because some of the model's attributes get lost when reloading.

By the way, is it possible to add quantile regression among the distributions, and multi-task learning? I think they would be quite useful.

XGBoostLSS for Julia

Hi,
your package & paper look really promising. I can't wait to test drive it.
The README mentions that a Julia implementation is planned; that would be amazing.

May I suggest considering wrapping it in the MLJ.jl interface, which already has several boosting and probabilistic forecasting options, with more on the way...

Support for XGBoost >= 1.6

Would it be possible to support XGBoost 1.6 (or later)? If so, what would be the process of getting that out (I could possibly take a look at upgrading it / raising a PR)?

Context: we have other dependencies that require XGBoost 1.6 or greater, which clashes with the setup.py here.

Cannot Find the Source Code

Hi, this is a great library and I would like to tweak the source code for my own purposes. Do you plan to share the source code?

code upload?

Hello, I have read the paper and I have to say that it is really great work. Is it possible to upload the code, as others have asked? You have 87 stars and 14 forks without having uploaded the code. You could help a lot of people.

Publish to PyPi

For distribution, it would be very helpful to publish to PyPI. A likely prerequisite is tagging versions (see #18).

Let me know if you would like any assistance with this.

package install

What is the procedure for installing the package? I used install_github but it did not work.

Zero (and one?) adjusted Dirichlet?

Would it be possible to implement a zero- (and maybe even zero-and-one-) adjusted Dirichlet distribution, similar to:

Tsagris, M., & Stewart, C. (2018). A Dirichlet Regression Model for Compositional Data with Zeros. Lobachevskii Journal of Mathematics, 39(3), 398–412. doi:10.1134/s1995080218030198

`skpro` integration

It would be great to integrate this package, and adjacent ones like LightGBMLSS, with skpro, which in turn directly integrates with sktime for time series forecasting.
(Both of course integrate seamlessly with sklearn.)

Issue opened here: sktime/skpro#184

This is very similar to @joshdunnlime's suggestion of an sklearn interface; skpro already provides interface specifications and stringent tests (no need to write new ones!) for probabilistic tabular regressors.

What would be needed is, as far as I see it:

  • predict_proba interface
  • distributions from XGBoostLSS implemented as skpro tabular distributions

Architecturally, there are two options:

  • small changes in XGBoostLSS, and work done in skpro in interfacing
  • or, import check_estimator from skpro (works on distribution objects as well as on estimators), and use that to create fully skpro conformant interfaces within XGBoostLSS. Then have a light import wrapper in skpro.
    • it is perhaps worthy of note that skpro already has an adapter to tensorflow for distributions.

Personally, I would think option 1 is preferable at least for the distributions, since the different distribution types are of general use, including for statmixedML's other packages, so it would avoid duplication of distribution objects or interfaces.

Multi-parameter optimization with custom loss function for probabilistic forecasting

Description

Dear community,

I am currently working on a probabilistic extension of XGBoost that models all parameters of a distribution. This allows probabilistic forecasts to be created, from which prediction intervals and quantiles of interest can be derived.

The problem is that XGBoost doesn't permit optimizing over several parameters. Assume we have a Normal distribution y ~ N(µ, σ). So far, my approach is a two-step procedure: I first optimize µ with σ fixed, then optimize σ with µ fixed, and iterate between the two.

Since this is inefficient, are there any ways of simultaneously optimizing both µ and σ using a custom loss function?
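This joint optimization is essentially what XGBoostLSS automates. As a minimal sketch of the idea (not the package's actual implementation), the per-observation gradients and diagonal Hessians of a Gaussian negative log-likelihood with respect to both parameters can be obtained via PyTorch autograd and then fed into a custom XGBoost objective:

import torch

def normal_nll_grad_hess(preds, y):
    # preds has shape (n, 2): column 0 is mu, column 1 is log(sigma)
    # (a log link keeps sigma positive during boosting).
    preds = torch.tensor(preds, dtype=torch.float64, requires_grad=True)
    y = torch.tensor(y, dtype=torch.float64)
    mu, log_sigma = preds[:, 0], preds[:, 1]
    nll = -torch.distributions.Normal(mu, log_sigma.exp()).log_prob(y).sum()
    grad = torch.autograd.grad(nll, preds, create_graph=True)[0]
    # Observations are independent, so only the same-observation, same-parameter
    # second derivatives are needed (the diagonal of each per-row Hessian).
    hess = torch.stack(
        [torch.autograd.grad(grad[:, j].sum(), preds, retain_graph=True)[0][:, j]
         for j in range(2)],
        dim=1,
    )
    return grad.detach().numpy(), hess.detach().numpy()

# Example: three observations, both parameters updated simultaneously.
grad, hess = normal_nll_grad_hess([[0.0, 0.0], [0.1, 0.2], [0.3, 0.1]], [0.5, 0.2, 0.4])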

Cannot install package

Description

The original shap package is currently not maintained. Hence, in its current implementation, shap is not compatible with numpy>=1.24.0. For details, see the following issue: shap/shap#2911.

Workaround solution

Hence, XGBoostLSS currently relies on https://github.com/dsgibbons/shap.git. For this package to install properly, please avoid installing xgboostlss in a directory/path or conda environment whose name contains "xgboost", "xgboostlss", or any other xgboost-related string. Otherwise, CUDA building is not turned off in the dsgibbons/shap setup() call and xgboostlss won't install properly. See the following issue: dsgibbons/shap#50.

Beta distribution - 0 log-likelihood, 'mean of empty string' message

Hello! I've been exploring your package for a predictive modelling project, and I think I've found an issue? Either that or I've missed something important I need to do, but either way it's not working. Basically, whenever I try to train a model using a Beta distribution as the output, I get log-likelihoods of 0 every time. I've put reproducible code below, using the sklearn diabetes regression toy dataset as an example (alright, it's not a Beta distribution, but it should still return something non-zero...). The problem doesn't occur with the Gaussian distribution, and it occurs regardless of the stabilisation method and whether I'm using train, cv or hyper_opt, so it's not just a hyperparameter choice. During training you also get a message saying either "Mean of empty slice" or "All-NaN slice encountered", which probably points to the cause. I'm not entirely sure what's causing this, but it's probably an issue with the custom objective function? It could be the custom metrics, I suppose, but I think the objective function is more likely.

Anyway, if you could take a look at it and let me know if I'm doing something wrong, or if there is something weird going on under the hood, that would be great! Thanks very much.

from sklearn.datasets import load_diabetes
from xgboostlss.model import xgboostlss
from xgboostlss.distributions import Beta, Gaussian
from xgboost import DMatrix

X, y = load_diabetes(return_X_y=True, as_frame=True)
train_x = X[:-50]
train_y = y[:-50]
test_x = X[-50:]
test_y = y[-50:]
dtrain = DMatrix(train_x, train_y)
dtest = DMatrix(test_x, test_y)

beta = Beta
beta.stabilize = "L2"

single_trial_params = {
    "eta": 0.01,                   
    "max_depth": 4,
    "gamma": 1,
    "subsample": 0.6,
    "colsample_bytree": 0.7,
    "min_child_weight": 1,
}

eval_results = {}
model = xgboostlss.train(
    params=single_trial_params,
    dtrain=dtrain,
    dist=beta,
    evals=[(dtrain, 'train'), (dtest, 'test')],
    evals_result=eval_results
)
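One possible sanity check (an assumption on my part, not a confirmed diagnosis): the diabetes target lies far outside the (0, 1) support of the Beta distribution, so rescaling it into the open unit interval before building the DMatrix objects may be worth trying:

# Rescale the target strictly into (0, 1) before fitting a Beta distribution.
y_unit = (y - y.min() + 1e-3) / (y.max() - y.min() + 2e-3)
dtrain = DMatrix(train_x, y_unit[:-50])
dtest = DMatrix(test_x, y_unit[-50:])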

EvoTreesLSS?

Hi! First, I just wanted to say thank you so much for XGBoostLSS and LightGBMLSS, they're amazing packages and super useful :)

I wanted to ask, have you considered adding support for an EvoTrees backend? I think this would be incredibly helpful; EvoTrees.jl is the main package I use for regression trees. Thank you!

Out of MemoryError

Hi, I'm trying to train XGBoostLSS on the M5 dataset but I keep getting an out-of-memory error with the hyper_opt method. Is there a parameter to reduce the amount of memory used by this method?

Question on scaling to large dataset

Hi Alexander, thanks for this incredible work. I really like this package; it's really helpful in situations where providing point forecast is not enough.

However, I'm curious about scaling to large datasets. Please pardon my ignorance. I've seen the examples in the notebook section, but the datasets used there are not very large.

So, my question is: does the current implementation work in a similar fashion to the xgboost package? That is, can one leverage Spark, Dask, etc. out of the box with xgboostlss?

I'll appreciate it if you can shed some light on this.

Thanks.

XGBoostLSS for uncertainty estimation in binary classification

Hi,

Thanks for creating the XGBoostLSS (and LightGBMLSS) packages; this is an important extension to ML algorithms that only provide point estimates of the mean.

I have a question, or thought, about capturing uncertainty in binary classification. I would be interested in estimating the uncertainty in the model score of a binary classifier, as a consequence of the amount of data the algorithm has seen during training. If in a certain part of the feature space there was one positive and one negative training instance, the point estimate will be p = 0.5, but with much less certainty than if there had been a hundred positive and a hundred negative instances. One way to estimate the uncertainty in the binomial proportion p for a given number of negative (a) and positive (b) observations is to use a Beta distribution with parameters α = a + 1/2 and β = b + 1/2. This is referred to as the Jeffreys interval in this paper, for example:
https://projecteuclid.org/journals/statistical-science/volume-16/issue-2/Interval-Estimation-for-a-Binomial-Proportion/10.1214/ss/1009213286.full

I wonder if the methods of XGBoostLSS can be used, or extended, to return the α, β parameters of a Beta distribution that fits the binary data observed during training. This would be a very valuable estimate of uncertainty. I see that the Beta distribution is implemented in XGBoostLSS, but if I understand correctly, this is to fit a regression on target variables that themselves follow the Beta distribution (not for classification of binomial target variables). And I'm not sure how to obtain the NLL for the Beta distribution for a binary target variable: the Beta PDF is 0 or ∞ at x=0 and x=1.

I have to admit that I am a bit out of my depth on the mathematics of this question, so if my reasoning is entirely in the wrong direction or incompatible with the idea behind XGBoostLSS, I am also happy to hear!
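For reference, the Jeffreys interval mentioned above can be computed directly from the counts; a small sketch with SciPy, using the convention that the positive-class count enters the first Beta parameter:

from scipy.stats import beta

a, b = 1, 1   # one negative and one positive observation
# Jeffreys posterior for the positive-class proportion: Beta(b + 1/2, a + 1/2).
lower, upper = beta.ppf([0.025, 0.975], b + 0.5, a + 0.5)
print(lower, upper)   # a wide interval around 0.5, reflecting the tiny sample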

Suggestion: spin-off the `distributions` module into a shared common dependency among all *lss modules

First of all thank you for working on several boosting LSS versions (xgboost, catboost, lightgbm)!

I did notice that both the xgboost and lightgbm versions have (at first sight) the exact same distributions.py submodule, which is the torch core for any distribution supported as a boosting prediction.

Is there a specific reason not to put the distribution modules (and any other shared functionality, like plotting of distribution results) into a common Python package that is then imported here as a shared dependency? This would avoid duplication and would separate the development of new distributions from the development of the boosting parts of the modules.
