johngoertz / gumbi

Gaussian Process Model Building Interface

Home Page: https://JohnGoertz.github.io/Gumbi/

License: Apache License 2.0

Python 100.00%

bayesian-inference gaussian-processes probabilistic-programming python regression

gumbi's Issues

ParrayPlotter called on matplotlib function can't handle specifying axes

Current pattern:

plt.sca(ax)
pp = gmb.ParrayPlotter(X, Y, z)
pp(plt.contourf, levels=20, cmap='pink', norm=norm)
pp.colorbar(ax=ax)

Desired pattern:

pp = gmb.ParrayPlotter(X, Y, z)
pp(plt.contourf, levels=20, cmap='pink', norm=norm, ax=ax)
pp.colorbar(ax=ax)
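
One way the desired pattern could work is sketched below. It relies on most pyplot plotting functions (contourf, plot, scatter, and so on) having an Axes method of the same name, so the call can be redirected to the requested axes. The helper name call_with_axes is hypothetical, and RecordingAxes and the module-level contourf are stand-ins for matplotlib objects so that the sketch runs without matplotlib. None of this is Gumbi's actual implementation.

```python
def call_with_axes(func, *args, ax=None, **kwargs):
    """Call a plotting function, preferring the same-named method on `ax`.

    Hypothetical helper, not part of Gumbi or matplotlib.
    """
    if ax is None:
        # No axes requested: use the function exactly as it was passed in.
        return func(*args, **kwargs)
    method = getattr(ax, func.__name__, None)
    if method is None:
        raise AttributeError(
            f"{type(ax).__name__} has no method {func.__name__!r}"
        )
    # Redirect the call to the explicit axes instead of the current axes.
    return method(*args, **kwargs)


class RecordingAxes:
    """Stand-in for matplotlib.axes.Axes that records what was drawn."""

    def __init__(self):
        self.calls = []

    def contourf(self, *args, **kwargs):
        self.calls.append(("contourf", args, kwargs))
        return "drawn on explicit axes"


def contourf(*args, **kwargs):
    """Stand-in for plt.contourf, which draws on the current axes."""
    return "drawn on current axes"


ax = RecordingAxes()
grid = ([0, 1], [0, 1], [[1, 2], [3, 4]])
result_default = call_with_axes(contourf, *grid, levels=20)
result_on_ax = call_with_axes(contourf, *grid, levels=20, ax=ax)
```

With real matplotlib objects, the same lookup would turn plt.contourf plus ax=ax into ax.contourf. Functions without a same-named Axes method would need separate handling.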

Assertion Error when testing Multi-Output Regression example code

Hi @JohnGoertz

I began testing gumbi on Windows 10 and on Windows 10 with WSL2. On native Windows 10 my configuration still needs work, apparently on the compiler side. On WSL2 (Ubuntu 20.04 LTS with gcc 9.3.0), the basic example (see the code below) runs without any errors.

```
gp.fit(outputs=['mpg'], continuous_dims=['horsepower'])
X = gp.prepare_grid()
y = gp.predict_grid()
gmb.ParrayPlotter(X, y).plot()
sns.scatterplot(data=cars, x='horsepower', y='mpg', color=sns.cubehelix_palette()[-1], alpha=0.5);
```


Based on your suggestion at https://discourse.pymc.io/t/introducing-gumbi-the-gaussian-process-model-building-interface/8377/5, I next tested the Multi-Output Regression example code with the Cars data set.

```
gp.fit(outputs=['mpg', 'acceleration'], continuous_dims=['horsepower']);
X = gp.prepare_grid()
Y = gp.predict_grid()
axs = plt.subplots(2,1, figsize=(6, 8))[1]
for ax, output in zip(axs, gp.outputs):
    y = Y.get(output)
    gmb.ParrayPlotter(X, y).plot(ax=ax)
    sns.scatterplot(data=cars, x='horsepower', y=output, color=sns.cubehelix_palette()[-1], alpha=0.5, ax=ax);
```

I get the following **AssertionError**. The full traceback is below:

AssertionError                            Traceback (most recent call last)
Input In [8], in <module>
----> 1 gp.fit(outputs=['mpg', 'acceleration'], continuous_dims=['horsepower'])
      3 X = gp.prepare_grid()
      4 Y = gp.predict_grid()

File ~/miniconda3/envs/gumbi_env/lib/python3.10/site-packages/gumbi/regression/GP_pymc3.py:292, in GP.fit(self, outputs, linear_dims, continuous_dims, continuous_levels, continuous_coords, categorical_dims, categorical_levels, additive, seed, heteroskedastic_inputs, heteroskedastic_outputs, sparse, n_u, **MAP_kwargs)
    234 """Fits a GP surface
    235 
    236 Parses inputs, compiles a Pymc3 model, then finds the MAP value for the hyperparameters. `{}_dims` arguments
   (...)
    284 self : :class:`GP`
    285 """
    287 self.specify_model(outputs=outputs, linear_dims=linear_dims, continuous_dims=continuous_dims,
    288                    continuous_levels=continuous_levels, continuous_coords=continuous_coords,
    289                    categorical_dims=categorical_dims, categorical_levels=categorical_levels,
    290                    additive=additive)
--> 292 self.build_model(seed=seed,
    293                  heteroskedastic_inputs=heteroskedastic_inputs,
    294                  heteroskedastic_outputs=heteroskedastic_outputs,
    295                  sparse=sparse, n_u=n_u)
    297 self.find_MAP(**MAP_kwargs)
    299 return self

File ~/miniconda3/envs/gumbi_env/lib/python3.10/site-packages/gumbi/regression/GP_pymc3.py:363, in GP.build_model(self, seed, continuous_kernel, heteroskedastic_inputs, heteroskedastic_outputs, sparse, n_u)
    360 n_p = len(self.outputs)
    362 D_in = len(self.dims)
--> 363 assert X.shape[1] == D_in
    365 idx_l = [self.dims.index(dim) for dim in self.linear_dims]
    366 idx_s = [self.dims.index(dim) for dim in self.continuous_dims]

AssertionError: 

What am I doing wrong? Or is this related to any of the `numpy > 1.19.3` errors (does not look like that)?

Sree

Export Gumbi model to PyMC3 and example workflow for cross_validate

Thanks a lot for creating Gumbi. I was playing around a bit with it and I hope it will evolve further!
What I was trying to do is explained here: https://discourse.pymc.io/t/use-exact-gaussian-process-model-from-gpytorch-as-emulator-in-pymc3/8680. Do you think I can use Gumbi to do something similar to what is done in GPyTorch (https://docs.gpytorch.ai/en/stable/examples/01_Exact_GPs/Simple_GP_Regression.html)?
If yes, is there a simple way to export the fitted Gumbi model so that gumbi.predict can be used inside "pure" pymc3 again (aesara-compatible, so that it can run with the NUTS sampler)?

Apart from that, I have another question: I was able to fit a GP to my data with Gumbi, but I could not find a way to check its performance. Do you have example code that uses the cross_validate method? I did not manage to get it working. I ran:
gp.cross_validate(['melt_f', 'prcp_fac', 'temp_bias'], n_train = 200)

but then I got a TypeError: __init__() missing 1 required positional argument: 'outputs'. When I check gp.outputs, however, I get the right output name.

Thanks a lot in advance!

Add example (notebook) explaining `cross_validate`

The behavior of the cross_validate method may be confusing; a notebook or an expanded Examples section in the docstring would help.

Allow Standardizer to accept array inputs

Current pattern:

gmb.Standardizer(y={'μ':y_train.mean(), 'σ':y_train.std()})

Desired pattern:

gmb.Standardizer(y=y_train)
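
A minimal sketch of the conversion the desired pattern would need internally: reduce an array to the {'μ': ..., 'σ': ...} dictionary that the current pattern builds by hand. The function name standardizer_params is hypothetical. The population standard deviation is used on the assumption that y_train is a NumPy array, whose .std() defaults to ddof=0; a pandas Series would default to the sample standard deviation instead.

```python
from statistics import fmean, pstdev


def standardizer_params(values):
    """Return the mean/std dictionary used in the current pattern.

    Hypothetical helper, not part of Gumbi.
    """
    values = [float(v) for v in values]
    # pstdev is the population standard deviation (ddof=0).
    return {"μ": fmean(values), "σ": pstdev(values)}


y_train = [1.0, 2.0, 3.0, 4.0]
params = standardizer_params(y_train)
```

The result could then be passed on as in the current pattern, e.g. gmb.Standardizer(y=params).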

Allow posterior sampling

Right now Gumbi only allows (pymc) marginalized posterior predictions, i.e. only mean and variance rather than individual samples. We should also implement an interface for drawing individual posterior samples via .conditional.

Gumbi exposes the Pymc API, so for now the user can access the underlying pymc objects to do this:

gp = gmb.GP(...)

gp.fit(...)

gp.prepare_grid()

# add the GP conditional to the model, given the new X values
with model:
    f_pred = gp.conditional("f_pred", gp.grid_points)

# To use the MAP values, replace the trace with a length-1 list containing the MAP point (here gp.MAP)
with model:
    pred_samples = pm.sample_posterior_predictive([gp.MAP], vars=[f_pred], samples=2000)

But this approach obviously introduces complexity that Gumbi was intended to remove. In particular:

- The two-step process of declaring f_pred and then drawing samples should be reduced to a single command, maybe gp.draw(samples=2000, point='MAP').
- f_pred should be declared as a pm.Data object so that its value can be updated repeatedly, similar to the suggestion here. This should probably be done pre-emptively during initial model building.
- The output pred_samples should be reshaped and stored as a Parray, similar to how predict behaves. This will be slightly complicated by the fact that pred_samples will have an additional dimension compared to gp.grid_points, corresponding to the different samples.
  - ParrayPlotter should potentially be updated to accommodate this; otherwise the user might need to create a new ParrayPlotter instance for each sample.

Add example notebook on extending Gumbi by inheriting `Regressor`

The core functionality of Gumbi is contained in the abstract base class Regressor. This was written to allow simple extension with custom models and inference methods by defining the abstract methods fit, build_model, and predict. There should be a notebook demonstrating how this could be achieved, potentially with a Bayesian Neural Network or a Generalized Linear Model.
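
The extension pattern can be sketched with a toy base class. RegressorSketch below is a hypothetical stand-in, not Gumbi's Regressor: only the three abstract method names come from the description above, and the signatures are assumptions. The subclass fits an ordinary least-squares line, chosen only because it is short.

```python
from abc import ABC, abstractmethod
from statistics import fmean


class RegressorSketch(ABC):
    """Hypothetical stand-in for an abstract regressor base class."""

    @abstractmethod
    def build_model(self):
        """Set up the model structure before fitting."""

    @abstractmethod
    def fit(self, x, y):
        """Estimate parameters from training data."""

    @abstractmethod
    def predict(self, x):
        """Return predictions at new inputs."""


class LineRegressor(RegressorSketch):
    """Least-squares straight line, implemented with the standard library."""

    def build_model(self):
        self.slope = 0.0
        self.intercept = 0.0
        return self

    def fit(self, x, y):
        self.build_model()
        x_bar, y_bar = fmean(x), fmean(y)
        sxx = sum((xi - x_bar) ** 2 for xi in x)
        sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
        self.slope = sxy / sxx
        self.intercept = y_bar - self.slope * x_bar
        return self

    def predict(self, x):
        return [self.intercept + self.slope * xi for xi in x]


model = LineRegressor().fit([0, 1, 2, 3], [1, 3, 5, 7])
predictions = model.predict([4, 5])
```

Because all three methods are abstract, instantiating RegressorSketch directly raises a TypeError, which is what forces every custom model to supply them.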

Allow UPArray to be constructed from an array

Current pattern:

gmb.uparray('y', μ=y_out.mean(axis=1), σ2=y_out.var(axis=1), stdzr=stdzr)

Desired pattern:

gmb.uparray(y=y_out, axis=1, stdzr=stdzr)
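
For reference, the reduction that the desired pattern would perform for axis=1 is a mean and a variance over each row. The sketch below uses plain nested lists and the hypothetical name row_moments; the population variance is used to match NumPy's default for .var().

```python
from statistics import fmean, pvariance


def row_moments(rows):
    """Return per-row means and population variances (like axis=1).

    Hypothetical helper, not part of Gumbi.
    """
    means = [fmean(row) for row in rows]
    variances = [pvariance(row) for row in rows]
    return means, variances


y_out = [[1.0, 2.0, 3.0], [2.0, 2.0, 2.0]]
mu, sigma2 = row_moments(y_out)
```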

CircleCI docs build failing

https://app.circleci.com/pipelines/github/JohnGoertz/Gumbi/30/workflows/b5d1dd70-daf6-4f45-b381-783168ec0168/jobs/40

Command:

#!/bin/bash -eo pipefail
python docs/source/generate_api_rst.py

Error:

Traceback (most recent call last):
  File "/home/circleci/.pyenv/versions/3.9.10/lib/python3.9/site-packages/theano/configparser.py", line 238, in fetch_val_for_key
    return self._theano_cfg.get(section, option)
  File "/home/circleci/.pyenv/versions/3.9.10/lib/python3.9/configparser.py", line 781, in get
    d = self._unify_values(section, vars)
  File "/home/circleci/.pyenv/versions/3.9.10/lib/python3.9/configparser.py", line 1152, in _unify_values
    raise NoSectionError(section) from None
configparser.NoSectionError: No section: 'blas'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/circleci/.pyenv/versions/3.9.10/lib/python3.9/site-packages/theano/configparser.py", line 354, in __get__
    val_str = cls.fetch_val_for_key(self.name, delete_key=delete_key)
  File "/home/circleci/.pyenv/versions/3.9.10/lib/python3.9/site-packages/theano/configparser.py", line 242, in fetch_val_for_key
    raise KeyError(key)
KeyError: 'blas__ldflags'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/circleci/project/docs/source/generate_api_rst.py", line 9, in <module>
    import gumbi
  File "/home/circleci/.pyenv/versions/3.9.10/lib/python3.9/site-packages/gumbi/__init__.py", line 5, in <module>
    from .regression import *
  File "/home/circleci/.pyenv/versions/3.9.10/lib/python3.9/site-packages/gumbi/regression/__init__.py", line 1, in <module>
    from .GP_pymc3 import GP
  File "/home/circleci/.pyenv/versions/3.9.10/lib/python3.9/site-packages/gumbi/regression/GP_pymc3.py", line 6, in <module>
    import pymc3 as pm
  File "/home/circleci/.pyenv/versions/3.9.10/lib/python3.9/site-packages/pymc3/__init__.py", line 23, in <module>
    import theano
  File "/home/circleci/.pyenv/versions/3.9.10/lib/python3.9/site-packages/theano/__init__.py", line 83, in <module>
    from theano import scalar, tensor
  File "/home/circleci/.pyenv/versions/3.9.10/lib/python3.9/site-packages/theano/tensor/__init__.py", line 20, in <module>
    from theano.tensor import nnet  # used for softmax, sigmoid, etc.
  File "/home/circleci/.pyenv/versions/3.9.10/lib/python3.9/site-packages/theano/tensor/nnet/__init__.py", line 3, in <module>
    from . import opt
  File "/home/circleci/.pyenv/versions/3.9.10/lib/python3.9/site-packages/theano/tensor/nnet/opt.py", line 32, in <module>
    from theano.tensor.nnet.conv import ConvOp, conv2d
  File "/home/circleci/.pyenv/versions/3.9.10/lib/python3.9/site-packages/theano/tensor/nnet/conv.py", line 20, in <module>
    from theano.tensor import blas
  File "/home/circleci/.pyenv/versions/3.9.10/lib/python3.9/site-packages/theano/tensor/blas.py", line 163, in <module>
    from theano.tensor.blas_headers import blas_header_text, blas_header_version
  File "/home/circleci/.pyenv/versions/3.9.10/lib/python3.9/site-packages/theano/tensor/blas_headers.py", line 1016, in <module>
    if not config.blas__ldflags:
  File "/home/circleci/.pyenv/versions/3.9.10/lib/python3.9/site-packages/theano/configparser.py", line 358, in __get__
    val_str = self.default()
  File "/home/circleci/.pyenv/versions/3.9.10/lib/python3.9/site-packages/theano/link/c/cmodule.py", line 2621, in default_blas_ldflags
    blas_info = numpy.distutils.__config__.blas_opt_info
AttributeError: module 'numpy.distutils.__config__' has no attribute 'blas_opt_info'

Exited with code exit status 1

CircleCI received exit code 1
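
The final AttributeError is raised while Theano looks up default BLAS flags through numpy.distutils.__config__.blas_opt_info. A plausible cause, assuming the CI image installed a recent NumPy, is that newer NumPy releases no longer expose that attribute. Pinning an older NumPy for the docs build, or setting the blas ldflags option explicitly in the Theano configuration so the default lookup is skipped, are the workarounds usually suggested. The sketch below is a hypothetical diagnostic (the function name is made up here) that reports whether the attribute exists in the current environment.

```python
import importlib


def has_blas_opt_info():
    """Report whether numpy.distutils.__config__ exposes blas_opt_info.

    Returns None when the module cannot be imported at all (for example,
    NumPy is not installed or numpy.distutils is unavailable).
    """
    try:
        config = importlib.import_module("numpy.distutils.__config__")
    except Exception:
        # Any import failure means the check cannot be performed here.
        return None
    return hasattr(config, "blas_opt_info")


status = has_blas_opt_info()
```

A result of False in the build environment would be consistent with the traceback above.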

Allow *Parrays to be created without explicit Standardizer instance

Sometimes it's simpler to ignore the "standardizer" functionality of *Parrays and treat the variable(s) as zero-mean and unit-variance.

Current pattern:

gmb.uparray('y', μ=y_out.mean(axis=1), σ2=y_out.var(axis=1), stdzr=stdzr)

Desired pattern:

gmb.uparray('y', μ=y_out.mean(axis=1), σ2=y_out.var(axis=1))  # Internally creates a default Standardizer() instance

Implement "Law of total expectation/variance" in UPArrays

Given a list of UPArrays, y_upas, we can find total expectation/variance as:

μs = np.stack([y.μ for y in y_upas])
σ2s = np.stack([y.σ2 for y in y_upas])

total_upa = gmb.uparray('y',
                        μ = μs.mean(0),
                        σ2 = μs.var(0) + σ2s.mean(0),
                        stdzr=stdzr
                       )

Implement as something like

gmb.uparray.total(y_upas)

Where name and stdzr are inferred from, e.g., the first upa in the list.
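
A small worked instance of the formula above, written without NumPy so the arithmetic is easy to follow. The name total_moments is hypothetical, each UPArray is represented by plain lists of means and variances, and all arrays are weighted equally, as in the NumPy version.

```python
from statistics import fmean, pvariance


def total_moments(mus, sigma2s):
    """Combine per-array means and variances pointwise.

    mus, sigma2s: one list per array, all of equal length.
    total variance = variance of the means + mean of the variances.
    """
    mu_columns = list(zip(*mus))
    s2_columns = list(zip(*sigma2s))
    total_mu = [fmean(col) for col in mu_columns]
    total_s2 = [
        pvariance(m_col) + fmean(s_col)
        for m_col, s_col in zip(mu_columns, s2_columns)
    ]
    return total_mu, total_s2


# Two arrays, each with two points.
mus = [[1.0, 2.0], [3.0, 2.0]]
sigma2s = [[0.5, 1.0], [1.5, 1.0]]
total_mu, total_s2 = total_moments(mus, sigma2s)
```

At the first point the means disagree (1 and 3), so their spread adds 1.0 to the average variance of 1.0. At the second point the means agree, and the total variance is just the average variance.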

Allow formulation of a `Latent` GP

Gumbi should be able to provide a (Pymc) Latent GP implementation that enables regression with non-Normal likelihoods. This would allow use cases such as these Pymc examples with StudentT likelihood for regression and Bernoulli likelihood for classification.

The best way will probably be to have a function that returns the gp.prior object, and the user can then tack on their desired likelihood and any additional variables. The code pattern will probably look like this, but speak up, users, if you have opinions!

gp = gmb.GP(...)

gp.build_model(..., Latent=True)

with gp.model:
    f = gp.prior

    # logit link and Bernoulli likelihood
    p = pm.Deterministic("p", pm.math.invlogit(f))
    y_ = pm.Bernoulli("y", p=p, observed=y)
    
gp.sample(...)

Prediction will obviously depend on #13.
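
To make the "logit link and Bernoulli likelihood" step concrete, the sketch below computes the same quantities by hand for a few fixed latent values. It is plain Python rather than Pymc, the function names and the numbers are illustrative only, and no inference is performed.

```python
import math


def invlogit(f):
    """Map a latent value on the real line to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-f))


def bernoulli_logp(y, p):
    """Log-probability of a binary observation y given probability p."""
    return math.log(p) if y == 1 else math.log(1.0 - p)


latent_f = [-2.0, 0.0, 2.0]   # stand-in for draws from the GP prior
observed_y = [0, 1, 1]        # binary class labels

probs = [invlogit(f) for f in latent_f]
log_likelihood = sum(
    bernoulli_logp(y, p) for y, p in zip(observed_y, probs)
)
```

In the proposed pattern, Pymc would evaluate this log-likelihood for every candidate latent function while sampling, which is why the marginal (Normal-likelihood) implementation cannot be reused here.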

Error in Multi Output Regression example syntax

I'm posting this as an FYI for other users who may try to run the Multi Output Regression example syntax shared here:

[Multi Output Regression](https://johngoertz.github.io/Gumbi/notebooks/examples/Cars_Dataset.html#Correlated-multi-input-regression-accross-different-classes-in-a-category)

The posted syntax is

gp.fit(outputs=['mpg', 'acceleration'], continuous_dims=['horsepower']);

X = gp.prepare_grid()
Y = gp.predict_grid()

axs = plt.subplots(2,1, figsize=(6, 8))[1]
for ax, output in zip(axs, gp.outputs):
    y = Y.get(output)

    gmb.ParrayPlotter(X, y).plot(ax=ax)

    sns.scatterplot(data=cars, x='horsepower', y=output, color=sns.cubehelix_palette()[-1], alpha=0.5, ax=ax);

The error is in axs = plt.subplots(2,1, figsize=(6, 8))[1]

The correction is axs = plt.pyplot.subplots(2,1, figsize=(6, 8))[1]
