
Comments (8)

esantorella commented on June 7, 2024

The intent now is to use the fitted mean and covariance functions of gp2 in a new model (gp3): making use of gp2's reduced uncertainty in the regions where it has data to reduce gp1's uncertainty there, and using gp2's inflated uncertainty in other regions to inflate gp1's uncertainty in those same regions.

Why do you want to do this?

I'm wondering if a multi-fidelity model like SingleTaskMultiFidelityGP would help. Here is a tutorial. In this case, it sounds like the small dataset might be "full fidelity", providing more reliable data but at a higher cost (since it's noiseless and you can't collect many such points) and the larger data set might be lower fidelity, cheap but noisy? Then you could pass all the data to the same model, using a column of the x data to mark which observations are low-fidelity.
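A minimal sketch of that suggestion, assuming the small set is (train_X1, train_Y1) and the large set is (train_X2, train_Y2) (hypothetical names), with a fidelity column appended to the inputs; note that the data_fidelities keyword may differ between BoTorch versions:

import torch
from botorch.fit import fit_gpytorch_mll
from botorch.models import SingleTaskMultiFidelityGP
from gpytorch.mlls import ExactMarginalLogLikelihood

# Append a fidelity column: 1.0 for the small, reliable set; 0.5 for the large, noisy set.
X_hi = torch.cat([train_X1, torch.ones(train_X1.shape[0], 1, dtype=train_X1.dtype)], dim=-1)
X_lo = torch.cat([train_X2, torch.full((train_X2.shape[0], 1), 0.5, dtype=train_X2.dtype)], dim=-1)
train_X = torch.cat([X_hi, X_lo])
train_Y = torch.cat([train_Y1, train_Y2])

# The last input column marks each observation's fidelity.
model = SingleTaskMultiFidelityGP(train_X, train_Y, data_fidelities=[train_X.shape[-1] - 1])
mll = ExactMarginalLogLikelihood(model.likelihood, model)
fit_gpytorch_mll(mll)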


saitcakmak commented on June 7, 2024

I haven't checked the terminology in the paper closely, so I can't say whether what you implemented is the same as the proposed approach, but it seems to fit the framework.

Then, only the free parameters of the ScaledMatern kernel are adjusted to the 'small' dataset:

That's correct. The no_grad context should prevent gp2 hyper-params from getting updated during training.

One potential issue I noticed while looking at this is the use of input/outcome transforms and the posterior call within the kernel. In a BoTorch model with input/outcome transforms, the underlying GPyTorch components operate in the transformed space. When you're evaluating self.gp.posterior in ConditionalScaledMaternKernel, the Xs you use for evaluation will already be transformed. The self.gp.posterior call will transform these again (since self.gp also has input transforms attached), which will lead to self.gp being evaluated with different inputs than its parent model. Similarly, the Posterior returned by self.gp.posterior will be untransformed using the outcome transforms, and will get untransformed again as part of the parent model's posterior call.

To avoid these transform-related issues within ConditionalScaledMaternKernel, you need to skip the input and outcome transforms while evaluating self.gp. Replacing

        with no_grad():
            sigma = self.gp.posterior(x1).variance.sqrt() # from gp2
            sigma_ = self.gp.posterior(x2).variance.sqrt() # from gp2

with

        with no_grad():
            self.gp.eval()
            sigma = self.gp(x1).covariance_matrix.diagonal(dim1=-2, dim2=-1).sqrt() # from gp2
            sigma_ = self.gp(x2).covariance_matrix.diagonal(dim1=-2, dim2=-1).sqrt() # from gp2

should work for single output SingleTaskGP.


saitcakmak commented on June 7, 2024

Re gradients: It is interesting that you have to turn them off manually, despite using no_grad. I guess the gradients must still be accumulating on those parameters due to subsequent operations, even though the operations within the context do not populate the grad graph.
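A tiny PyTorch sketch of that point (illustrative only, not from the code in this thread): operations inside no_grad contribute nothing to the graph, but the same parameter can still receive gradients from operations outside the context, so setting requires_grad = False is the reliable way to freeze it.

import torch

p = torch.nn.Parameter(torch.tensor(1.0))
with torch.no_grad():
    _ = p * 2.0          # not tracked; contributes no gradient
loss = p * 3.0           # tracked normally
loss.backward()
print(p.grad)            # tensor(3.) -- only the tracked op contributes
p.requires_grad_(False)  # removes p from any future optimization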

Re eval: Both eval() and train() calls apply to sub-modules recursively. In fit_gpytorch_mll, we would first put the model in train(), fit the model in this mode, then call eval() on the model before returning it. In GPyTorch, ExactGP.__call__ checks for self.training and calculates the prior / posterior accordingly. Having an eval() call there would make sure that the model would always get evaluated in posterior mode. Without it, I'd expect the model to get evaluated in prior mode when the larger model is being trained but it would not make a difference in evaluations when the larger model itself is in eval / posterior mode.
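As a minimal illustration of that behaviour (assuming a fitted model gp2 and arbitrary test inputs test_x):

gp2.train()                     # recursive: all sub-modules enter train mode
prior = gp2(*gp2.train_inputs)  # ExactGP.__call__ returns the prior here
gp2.eval()                      # recursive: back to eval mode
posterior = gp2(test_x)         # now conditioned on the training data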

Which mean calculation to use

I think m = self.gp(x).mean.squeeze(-1) would be the correct computation here since it eliminates any interference from the input & outcome transforms. Both options should be equivalent if you don't use any input / outcome transforms. You could compare the difference between the two by adding some print statements into your code. I'd also recommend testing on a few different examples, which would help ensure that the behavior you're seeing is a general pattern and not due to the particular data in the example.
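For example, the comparison could be as simple as the following sketch (gp2 and the evaluation grid x are the ones used elsewhere in this thread):

from torch import no_grad

with no_grad():
    gp2.eval()
    m_raw = gp2(x).mean                         # raw GPyTorch model, transformed space
    m_post = gp2.posterior(x).mean.squeeze(-1)  # BoTorch posterior, original space
print((m_raw - m_post).abs().max())  # should be ~0 when no transforms are attached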


neildhir commented on June 7, 2024

Certainly, there are different ways to combine datasets, and this is just a method that I have been working on.

In the example above, both datasets have some noise, but it is certainly true that the smaller one represents a setting where data is a lot more scarce, whereas the larger dataset contains more samples that are ultimately just random draws from the same domain.

My next intent was actually to explore multi-task multi-fidelity modelling, but I want to get this working first. The idea above is easy enough to understand (I hope), and I thought it would have been easy to implement, but something is not going quite right in the fitting of the model, hence my hope that more knowledgeable people could help out.


saitcakmak commented on June 7, 2024

Hi @neildhir. Looks like you're training gp3 with only the data for that model, which will adapt the hyperparameters of ConditionalScaledMaternKernel to minimize the MLL with that data only. The data from gp2 is not involved in this MLL computation in any way, so transferring the information over in this way seems hard to achieve. Skipping the model training for gp3 (and the added scaled Matern kernel altogether) might be worth a try, since it will limit the correction applied by the fitting procedure to adapt the model to the training data of gp3.
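A sketch of that suggestion, reusing the class and variable names from this thread (and optionally dropping the scaled Matern term as well): construct gp3 with the gp2-derived modules and simply skip fit_gpytorch_mll.

from botorch.models import SingleTaskGP

# gp2 is fitted to the big dataset; (train_X1, train_Y1) is the small dataset.
gp3 = SingleTaskGP(
    train_X1,
    train_Y1,
    mean_module=ConditionalMean(gp2),
    covar_module=ConditionalScaledMaternKernel(gp2),
)
gp3.eval()  # no hyperparameter fitting; the gp2 information is used as-is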


neildhir commented on June 7, 2024

Hi @saitcakmak

Hmm, let me see: gp2 is trained on the 'big' dataset, which means its parameters are already fitted by the time its mean and covariance function are passed to gp3, so the data from gp2 is transferred in that sense (perhaps it is more correct to say the information). I then want gp3 to be trained on the 'small' dataset (the same as gp1, red dots), but whilst taking into account the information gained from the big dataset, through the fitted mean and covariance function from gp2. The MLL computation should only be fitting the parameters of the ScaledMaternKernel, quite right, while leaving alone the parameters of the learned covariance function of gp2. The contribution from gp2 comes in the forward pass, through:

        with no_grad():
            sigma = self.gp.posterior(x1).variance.sqrt() # from gp2
            sigma_ = self.gp.posterior(x2).variance.sqrt() # from gp2
            B = sigma @ sigma_.T

Then, only the free parameters of the ScaledMatern kernel are adjusted to the 'small' dataset:

        A = self.scaled_matern_kernel(x1, x2) # fitted to small dataset
        K = A + B # contribution from the big dataset, used for gp2, is leveraged here through the addition of the matrix B

This way the kernel ought to be leveraging the uncertainty from the big dataset in its uncertainty estimation for the small dataset. That at least was my intent.
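In equation form (matching the code above, with sigma_gp2(.) denoting the posterior standard deviation of gp2), the combined kernel is

k_gp3(x, x') = k_ScaledMatern(x, x') + sigma_gp2(x) * sigma_gp2(x')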

I am trying to reproduce the method here: https://arxiv.org/pdf/2005.11741.pdf (see Fig. 3 and equations (2) to (4)), but using BoTorch.


neildhir commented on June 7, 2024

This is great, thanks, a few questions then.

Given these issues, first: is it worth just not using any input and outcome transforms and operating on the raw data instead (or having it scaled outside BoTorch)? It seems that things get very messy with the transformations in my setting.

I thought that no_grad() would prevent the hyperparams from getting updated, but that does not seem to be the case. So I did this:

for param_name, param in mll.named_parameters():
    if 'gp' in param_name:
        # Turn off gradients for the GP trained on (train_X2, train_Y2) data
        param.requires_grad = False
        print(param_name, param.requires_grad)
fit_gpytorch_mll(mll)

to ensure that no parameters related to gp2 are updated during training. I assume that is still valid? This brings us to the next point: according to the documentation, fit_gpytorch_mll(mll) returns a GP in eval() mode, which is why I did not believe I needed to call eval() inside the kernel function. But it looks like you're saying the call is indeed needed, though rather to circumvent the issues with the transforms.

Do I need to do the same thing in the Mean function too? Apologies, lots of questions.

Finally, using your suggested change (though no change to the mean function if I understand this correctly):

        with no_grad():
            self.gp.eval()
            sigma = self.gp(x1).covariance_matrix.diagonal(dim1=-2, dim2=-1).sqrt() # from gp2
            sigma_ = self.gp(x2).covariance_matrix.diagonal(dim1=-2, dim2=-1).sqrt() # from gp2

I get the following error:

[screenshot of the error traceback]

where x = torch.linspace(bounds[0], bounds[1], 1000).unsqueeze(-1), and the error gets raised when it tries to plot the final (combined) GP: plot_gp(gp3, x, "red").


neildhir commented on June 7, 2024

Okay I did some more experimentation and managed to get some things to work. I'll try to organise this a bit more.

Working mean and kernel functions

from torch import no_grad
from gpytorch.kernels import Kernel, MaternKernel, ScaleKernel
from gpytorch.means import Mean

class ConditionalScaledMaternKernel(Kernel):
    has_lengthscale = True
    def __init__(self, gp):
        super().__init__()
        self.gp = gp  # Assumed to be in eval() mode
        self.scaled_matern_kernel = ScaleKernel(MaternKernel())

    def forward(self, x1, x2, **params):
        with no_grad():
            self.gp.eval()
            # Posterior standard deviations from gp2, computed on the raw GPyTorch
            # model so as to bypass the input/outcome transforms
            sigma = self.gp(x1).covariance_matrix.diagonal(dim1=-2, dim2=-1).sqrt().unsqueeze(-1)
            sigma_ = self.gp(x2).covariance_matrix.diagonal(dim1=-2, dim2=-1).sqrt().unsqueeze(-1)
            B = sigma @ sigma_.T  # contribution from the big dataset
        A = self.scaled_matern_kernel(x1, x2)  # free parameters, fitted to the small dataset
        K = A + B
        return K

class ConditionalMean(Mean):
    def __init__(self, gp):
        super().__init__()
        self.gp = gp  # Assumed to be in eval() mode

    def forward(self, x):
        with no_grad():
            # self.gp.eval()
            m = self.gp.posterior(x).mean.squeeze(-1)
            # m = self.gp(x).mean.squeeze(-1)
        return m
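For context, a hedged sketch of how these modules are wired into gp3 and trained (the variable names are the ones used throughout this thread):

from botorch.fit import fit_gpytorch_mll
from botorch.models import SingleTaskGP
from gpytorch.mlls import ExactMarginalLogLikelihood

gp3 = SingleTaskGP(
    train_X1,
    train_Y1,
    mean_module=ConditionalMean(gp2),
    covar_module=ConditionalScaledMaternKernel(gp2),
)
mll = ExactMarginalLogLikelihood(gp3.likelihood, gp3)
# Freeze gp2's parameters before fitting (see the Gradients section below).
for param_name, param in mll.named_parameters():
    if 'gp' in param_name:
        param.requires_grad = False
fit_gpytorch_mll(mll)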

Gradients

It seems that we explicitly have to turn off the gradients before running fit_gpytorch_mll(mll). If we don't, then we get the following behaviour.

[screenshot of the resulting posterior plot]

Here we use m = self.gp.posterior(x).mean.squeeze(-1) with gp.eval() commented out, and gradient updates are not turned off for gp2's parameters.

If, on the other hand, we do turn them off, i.e.

for param_name, param in mll.named_parameters():
    if 'gp' in param_name:
        # Turn off gradients for the GP trained on (train_X2, train_Y2) data
        param.requires_grad = False
fit_gpytorch_mll(mll)

Then we get

[screenshot of the resulting posterior plot]

Here again we use m = self.gp.posterior(x).mean.squeeze(-1) with gp.eval() commented out. In fact, gp.eval() seems to have no effect at all in the mean function, which is what I would expect, since fit_gpytorch_mll(mll) puts the GP into eval() mode at the end of that function anyway.

Which mean calculation to use?

As you can see in the mean function definition above, there are two ways to calculate the mean, with m = self.gp(x).mean.squeeze(-1) inspired by @saitcakmak's use of the same construct in the kernel definition. If we use this way of calculating the mean, we get the following posterior behaviour.

[screenshot of the resulting posterior plot]

In both of the last two plots we have turned off the gradients before running fit_gpytorch_mll(mll), which is why gp3 (in red) is shown. But now the uncertainty bounds do not fully cover the red datapoints.

Now I wonder: what is the difference between the two? The intent is to use the mean function calculated in gp2 as a prior for the mean in gp3. But m = self.gp.posterior(x).mean.squeeze(-1) and m = self.gp(x).mean.squeeze(-1) produce very different behaviours.

