
Comments (8)

esantorella commented on June 7, 2024

The intent now is to use the fitted mean and covariance functions of gp2 in a new model (gp3): making use of gp2's reduced uncertainty in the regions where it has data to reduce gp1's uncertainty there, and using gp2's inflated uncertainty in other regions to inflate gp1's uncertainty in those same regions.

Why do you want to do this?

I'm wondering if a multi-fidelity model like SingleTaskMultiFidelityGP would help. Here is a tutorial. In this case, it sounds like the small dataset might be "full fidelity", providing more reliable data but at a higher cost (since it's noiseless and you can't collect many such points) and the larger data set might be lower fidelity, cheap but noisy? Then you could pass all the data to the same model, using a column of the x data to mark which observations are low-fidelity.
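A minimal sketch of that suggestion, assuming the small set is (train_X1, train_Y1) and the large set is (train_X2, train_Y2) (hypothetical names), with a fidelity column appended to the inputs; note that the data_fidelities keyword may differ between BoTorch versions:

import torch
from botorch.fit import fit_gpytorch_mll
from botorch.models import SingleTaskMultiFidelityGP
from gpytorch.mlls import ExactMarginalLogLikelihood

# Append a fidelity column: 1.0 for the small, reliable set; 0.5 for the large, noisy set.
X_hi = torch.cat([train_X1, torch.ones(train_X1.shape[0], 1, dtype=train_X1.dtype)], dim=-1)
X_lo = torch.cat([train_X2, torch.full((train_X2.shape[0], 1), 0.5, dtype=train_X2.dtype)], dim=-1)
train_X = torch.cat([X_hi, X_lo])
train_Y = torch.cat([train_Y1, train_Y2])

# The last input column marks each observation's fidelity.
model = SingleTaskMultiFidelityGP(train_X, train_Y, data_fidelities=[train_X.shape[-1] - 1])
mll = ExactMarginalLogLikelihood(model.likelihood, model)
fit_gpytorch_mll(mll)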


saitcakmak commented on June 7, 2024

I haven't checked the terminology in the paper closely, so I can't say whether what you implemented is the same as the proposed approach, but it seems to fit the framework.

Then, only the free parameters of the ScaledMatern kernel are adjusted to the 'small' dataset:

That's correct. The no_grad context should prevent gp2 hyper-params from getting updated during training.

One potential issue I noticed while looking at this is the use of input/outcome transforms and the posterior call within the kernel. In a BoTorch model with input/outcome transforms, the underlying GPyTorch components operate in the transformed space. When you're evaluating self.gp.posterior in ConditionalScaledMaternKernel, the Xs you use for evaluation will already be transformed. The self.gp.posterior call will transform these again (since self.gp also has input transforms attached), which will lead to self.gp being evaluated with different inputs than its parent model. Similarly, the Posterior returned by self.gp.posterior will be untransformed using the outcome transforms, and will get untransformed again as part of the parent model's posterior call.

To avoid these transform-related issues within ConditionalScaledMaternKernel, you need to skip the input and outcome transforms while evaluating self.gp. Replacing

        with no_grad():
            sigma = self.gp.posterior(x1).variance.sqrt() # from gp2
            sigma_ = self.gp.posterior(x2).variance.sqrt() # from gp2

with

        with no_grad():
            self.gp.eval()
            sigma = self.gp(x1).covariance_matrix.diagonal(dim1=-2, dim2=-1).sqrt() # from gp2
            sigma_ = self.gp(x2).covariance_matrix.diagonal(dim1=-2, dim2=-1).sqrt() # from gp2

should work for single output SingleTaskGP.


saitcakmak commented on June 7, 2024

Re gradients: It is interesting that you have to turn them off manually, despite using no_grad. I guess the gradients must still be accumulating on those parameters due to subsequent operations, even though the operations within the context do not populate the grad graph.
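A tiny PyTorch sketch of that point (illustrative only, not from the code in this thread): operations inside no_grad contribute nothing to the graph, but the same parameter can still receive gradients from operations outside the context, so setting requires_grad = False is the reliable way to freeze it.

import torch

p = torch.nn.Parameter(torch.tensor(1.0))
with torch.no_grad():
    _ = p * 2.0          # not tracked; contributes no gradient
loss = p * 3.0           # tracked normally
loss.backward()
print(p.grad)            # tensor(3.) -- only the tracked op contributes
p.requires_grad_(False)  # removes p from any future optimization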

Re eval: Both eval() and train() calls apply to sub-modules recursively. In fit_gpytorch_mll, we would first put the model in train(), fit the model in this mode, then call eval() on the model before returning it. In GPyTorch, ExactGP.__call__ checks for self.training and calculates the prior / posterior accordingly. Having an eval() call there would make sure that the model would always get evaluated in posterior mode. Without it, I'd expect the model to get evaluated in prior mode when the larger model is being trained but it would not make a difference in evaluations when the larger model itself is in eval / posterior mode.
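As a minimal illustration of that behaviour (assuming a fitted model gp2 and arbitrary test inputs test_x):

gp2.train()                     # recursive: all sub-modules enter train mode
prior = gp2(*gp2.train_inputs)  # ExactGP.__call__ returns the prior here
gp2.eval()                      # recursive: back to eval mode
posterior = gp2(test_x)         # now conditioned on the training data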

Which mean calculation to use

I think m = self.gp(x).mean.squeeze(-1) would be the correct computation here since it eliminates any interference from the input & outcome transforms. Both options should be equivalent if you don't use any input / outcome transforms. You could compare the difference between the two by adding some print statements into your code. I'd also recommend testing on a few different examples, which would help ensure that the behavior you're seeing is a general pattern and not due to the particular data in the example.
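For example, the comparison could be as simple as the following sketch (gp2 and the evaluation grid x are the ones used elsewhere in this thread):

from torch import no_grad

with no_grad():
    gp2.eval()
    m_raw = gp2(x).mean                         # raw GPyTorch model, transformed space
    m_post = gp2.posterior(x).mean.squeeze(-1)  # BoTorch posterior, original space
print((m_raw - m_post).abs().max())  # should be ~0 when no transforms are attached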


neildhir commented on June 7, 2024

Certainly, there are different ways to combine datasets, and this is just a method that I have been working on.

In the example above, both datasets have some noise, but it is certainly true that the smaller one represents a setting where data is a lot more scarce, whereas the larger dataset contains more samples that are ultimately just random draws from the same domain.

My next intent was actually to explore multi-task multi-fidelity modelling, but I want to get this working first. The idea above is easy enough to understand (I hope), and I thought it would have been easy to implement, but something is not going quite right in the fitting of the model, hence my hope that more knowledgeable people could help out.


saitcakmak commented on June 7, 2024

Hi @neildhir. Looks like you're training gp3 with only the data for that model, which will adapt the hyperparameters of ConditionalScaledMaternKernel to minimize the MLL with that data only. The data from gp2 is not involved in this MLL computation in any way, so transferring the information over in this way seems hard to achieve. Skipping the model training for gp3 (and the added scaled Matern kernel altogether) might be worth a try, since it will limit the correction applied by the fitting procedure to adapt the model to the training data of gp3.
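A sketch of that suggestion, reusing the class and variable names from this thread (and optionally dropping the scaled Matern term as well): construct gp3 with the gp2-derived modules and simply skip fit_gpytorch_mll.

from botorch.models import SingleTaskGP

# gp2 is fitted to the big dataset; (train_X1, train_Y1) is the small dataset.
gp3 = SingleTaskGP(
    train_X1,
    train_Y1,
    mean_module=ConditionalMean(gp2),
    covar_module=ConditionalScaledMaternKernel(gp2),
)
gp3.eval()  # no hyperparameter fitting; the gp2 information is used as-is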


neildhir commented on June 7, 2024

Hi @saitcakmak

Hmm, let me see: gp2 is trained on the 'big' dataset, which means its parameters are already fitted by the time its mean and covariance function are passed to gp3, so the data from gp2 is transferred in that sense (perhaps it is more correct to say the information). I then want gp3 to be trained on the 'small' dataset (the same as gp1, red dots), but whilst taking into account the information gained from the big dataset, through the fitted mean and covariance function from gp2. The MLL computation should only be fitting the parameters of the ScaledMaternKernel, quite right, while leaving alone the parameters of the learned covariance function of gp2. The contribution from gp2 comes in the forward pass, through:

        with no_grad():
            sigma = self.gp.posterior(x1).variance.sqrt() # from gp2
            sigma_ = self.gp.posterior(x2).variance.sqrt() # from gp2
            B = sigma @ sigma_.T

Then, only the free parameters of the ScaledMatern kernel are adjusted to the 'small' dataset:

        A = self.scaled_matern_kernel(x1, x2) # fitted to small dataset
        K = A + B # contribution from the big dataset, used for gp2, is leveraged here through the addition of the matrix B

This way the kernel ought to be leveraging the uncertainty from the big dataset in its uncertainty estimation for the small dataset. That at least was my intent.
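In equation form (matching the code above, with sigma_gp2(.) denoting the posterior standard deviation of gp2), the combined kernel is

k_gp3(x, x') = k_ScaledMatern(x, x') + sigma_gp2(x) * sigma_gp2(x')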

I am trying to reproduce the method here: https://arxiv.org/pdf/2005.11741.pdf (see Fig. 3 and equations (2) to (4)), but using BoTorch.


neildhir commented on June 7, 2024

This is great, thanks, a few questions then.

Given these issues, first: is it worth just not using any input and outcome transforms and operating on the raw data instead (or having it scaled outside BoTorch)? It seems that things get very messy with the transformations in my setting.

I thought that no_grad() would prevent the hyperparams from getting updated, but that does not seem to be the case. So I did this:

for param_name, param in mll.named_parameters():
    if 'gp' in param_name:
        # Turn off gradients for the GP trained on (train_X2, train_Y2) data
        param.requires_grad = False
        print(param_name, param.requires_grad)
fit_gpytorch_mll(mll)

to ensure that no parameters related to gp2 are updated during training. I assume that is still valid? This brings us to the next point: according to the documentation, fit_gpytorch_mll(mll) returns a GP in eval() mode, which is why I did not believe I needed to call eval() inside the kernel function. But it looks like you're saying the call is indeed needed, though rather to circumvent the issues with the transforms.

Do I need to do the same thing in the Mean function too? Apologies, lots of questions.

Finally, using your suggested change (though no change to the mean function if I understand this correctly):

        with no_grad():
            self.gp.eval()
            sigma = self.gp(x1).covariance_matrix.diagonal(dim1=-2, dim2=-1).sqrt() # from gp2
            sigma_ = self.gp(x2).covariance_matrix.diagonal(dim1=-2, dim2=-1).sqrt() # from gp2

I get the following error:

[screenshot of the error traceback]

where x = torch.linspace(bounds[0], bounds[1], 1000).unsqueeze(-1), and the error gets raised when it tries to plot the final (combined) GP: plot_gp(gp3, x, "red").


neildhir commented on June 7, 2024

Okay I did some more experimentation and managed to get some things to work. I'll try to organise this a bit more.

Working mean and kernel functions

from torch import no_grad
from gpytorch.kernels import Kernel, MaternKernel, ScaleKernel
from gpytorch.means import Mean

class ConditionalScaledMaternKernel(Kernel):
    has_lengthscale = True
    def __init__(self, gp):
        super().__init__()
        self.gp = gp  # Assumed to be in eval() mode
        self.scaled_matern_kernel = ScaleKernel(MaternKernel())

    def forward(self, x1, x2, **params):
        with no_grad():
            self.gp.eval()
            # Posterior standard deviations from gp2, computed on the raw GPyTorch
            # model so as to bypass the input/outcome transforms
            sigma = self.gp(x1).covariance_matrix.diagonal(dim1=-2, dim2=-1).sqrt().unsqueeze(-1)
            sigma_ = self.gp(x2).covariance_matrix.diagonal(dim1=-2, dim2=-1).sqrt().unsqueeze(-1)
            B = sigma @ sigma_.T  # contribution from the big dataset
        A = self.scaled_matern_kernel(x1, x2)  # free parameters, fitted to the small dataset
        K = A + B
        return K

class ConditionalMean(Mean):
    def __init__(self, gp):
        super().__init__()
        self.gp = gp  # Assumed to be in eval() mode

    def forward(self, x):
        with no_grad():
            # self.gp.eval()
            m = self.gp.posterior(x).mean.squeeze(-1)
            # m = self.gp(x).mean.squeeze(-1)
        return m
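For context, a hedged sketch of how these modules are wired into gp3 and trained (the variable names are the ones used throughout this thread):

from botorch.fit import fit_gpytorch_mll
from botorch.models import SingleTaskGP
from gpytorch.mlls import ExactMarginalLogLikelihood

gp3 = SingleTaskGP(
    train_X1,
    train_Y1,
    mean_module=ConditionalMean(gp2),
    covar_module=ConditionalScaledMaternKernel(gp2),
)
mll = ExactMarginalLogLikelihood(gp3.likelihood, gp3)
# Freeze gp2's parameters before fitting (see the Gradients section below).
for param_name, param in mll.named_parameters():
    if 'gp' in param_name:
        param.requires_grad = False
fit_gpytorch_mll(mll)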

Gradients

It seems that we explicitly have to turn off the gradients before running fit_gpytorch_mll(mll). If we don't, then we get the following behaviour.

[screenshot of the resulting posterior plot]

Here we use m = self.gp.posterior(x).mean.squeeze(-1) with gp.eval() commented out, and gradient updates are not turned off for gp2's parameters.

If, on the other hand, we do turn them off, i.e.

for param_name, param in mll.named_parameters():
    if 'gp' in param_name:
        # Turn off gradients for the GP trained on (train_X2, train_Y2) data
        param.requires_grad = False
fit_gpytorch_mll(mll)

Then we get

[screenshot of the resulting posterior plot]

Here again we use m = self.gp.posterior(x).mean.squeeze(-1) with gp.eval() commented out. In fact, gp.eval() seems to have no effect at all in the mean function, which is what I would expect, since fit_gpytorch_mll(mll) puts the GP into eval() mode at the end of that function anyway.

Which mean calculation to use?

As you can see in the mean function definition above, there are two ways to calculate the mean, with m = self.gp(x).mean.squeeze(-1) inspired by @saitcakmak's use of the same construct in the kernel definition. If we use this way of calculating the mean, we get the following posterior behaviour.

[screenshot of the resulting posterior plot]

In both of the last two plots we have turned off the gradients before running fit_gpytorch_mll(mll), which is why gp3 (in red) is shown. But now the uncertainty bounds do not fully cover the red datapoints.

Now I wonder: what is the difference between the two? The intent is to use the mean function calculated in gp2 as a prior for the mean in gp3. But m = self.gp.posterior(x).mean.squeeze(-1) and m = self.gp(x).mean.squeeze(-1) produce very different behaviours.

