cassle's People

Contributors

donkeyshot21, vturrisi


cassle's Issues

Difficulty in reproducing the paper results (e.g., Table 4: BYOL reported 66%, measured below 60%)

Hi,

Thanks a lot for your amazing work and for releasing the code. I have been trying to reproduce your Table 4 for some time. I use the code and the scripts directly, with NO modification.

For example, in this table, the reported performance of BYOL fine-tuning on ImageNet-100 in the 5-task class-incremental setting is 66.0. Instead, I measured below 60.0, at least 6% lower. Please see the full results table below if interested (a 5 x 5 table).

results.pdf

Any idea what may be causing the gap? Are there any nuances in the evaluation method? For example, for average accuracy, I simply take the mean of the table below across all rows and columns (as also suggested by GEM, which you referenced).
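To make the averaging concrete, here is a minimal sketch of what I compute (the matrix is a random placeholder rather than my actual numbers, and all names are my own):

    import numpy as np

    # acc[i, j] = linear-evaluation accuracy on task j after training on task i
    rng = np.random.default_rng(0)
    acc = rng.uniform(40, 70, size=(5, 5))  # placeholder values, not real results

    # what I currently report: the mean over all entries of the 5 x 5 matrix
    avg_over_matrix = acc.mean()

    # the other convention I have seen: the mean of the last row only,
    # i.e., accuracy on every task after the final task has been learned
    avg_after_last_task = acc[-1].mean()

    print(avg_over_matrix, avg_after_last_task)

If the paper uses the second convention (or yet another one), that alone could explain part of the gap.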

Thanks a lot again for your response and your eye-opening work.

DomainNet dataset version

Hi,

Very interesting work.
Did you use the cleaned version of DomainNet or the original one?
The cleaned version excludes a lot of duplicate images.

Thanks

Need for checkpoints: Barlow Twins and VICReg

Hi all,

Is there a link where I can access the checkpoints of the models trained using Barlow Twins and VICReg?

I would like to evaluate this approach using different models and need the last trained checkpoint of each of these models.

Thanks.

The data for Linear Evaluation Accuracy.

Hi,

This is an exciting and enlightening work.

I wonder where the data for training the classifier comes from for the linear evaluation accuracy.
The training data of the current task?

The classifier for Linear Evaluation Accuracy

Hi,

This is an exciting and enlightening work.

I am confused by the number of classifiers for Linear Evaluation Accuracy.

In the paper, you said, "For class-incremental and data-incremental, we use the task-agnostic setting, meaning that at evaluation time we do not assume to know the task ID". As I understand it, this means that you only maintain one classifier and continuously optimize it after learning each task for linear evaluation accuracy.

However, I found in #1 that you said, "as we operate in the class-incremental setting we train one linear classifier per task."

I would appreciate a clearer explanation.

Thanks.

Why did you load the checkpoint of task 0 before training cassle?

Hello,
Congratulations on your excellent work. I have a question about the training setting.

Why did you load the checkpoint of task 0 before training cassle? I see that the first task of cassle is trained without distillers, so the setting is the same as the first task of fine-tuning. I think loading the checkpoint is unnecessary.

Looking forward to your reply,
Thanks.

problem with reproducibility

Hi,

thanks for your interesting work.

I have problems reproducing the results.

  1. Did you use DALI for all your experiments? Can we trust the results of the regular data loader? I'm getting a 6-7% accuracy drop on ImageNet when switching from DALI to the regular data loader (I needed to run the regular one for a fair comparison with my method).
  2. Also, I think there is a problem here:
    args.lr = args.lr * args.batch_size * len(args.gpus) / 256

    Why is 256 hardcoded here? (My current guess is sketched right after this list.)
  3. It would be nice to mention in the README that the batch size needs to be modified based on the number of GPUs.
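To spell out my guess from point 2: I assume 256 is the reference batch size of the usual linear scaling rule, i.e., the base learning rate is defined for an effective batch size of 256 and scaled proportionally (this is only my assumption, not something documented in the repository):

    def scale_lr(base_lr, batch_size_per_gpu, num_gpus, base_batch_size=256):
        """Linear scaling rule: scale the base LR by the ratio of the effective
        batch size (per-GPU batch size * number of GPUs) to a reference size."""
        effective_batch_size = batch_size_per_gpu * num_gpus
        return base_lr * effective_batch_size / base_batch_size

    print(scale_lr(0.3, 128, 2))  # 0.3 (effective batch size 256)
    print(scale_lr(0.3, 128, 4))  # 0.6 (effective batch size 512)

If that reading is right, it also explains why the batch size has to be adjusted when the number of GPUs changes (point 3).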

Thanks

Forward Transfer Issue

Hi! We are following your excellent work.

We would like to know more clearly the details of your experiments on CIFAR-100 for calculating Forward Transfer, such as how the accuracy of the random model on each task is obtained.

If we understand correctly, since the random seed is fixed, the accuracy of the random model should be fixed as well. Would it be possible to provide the accuracy of the random model on the five tasks for reference?
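For reference, the definition we are assuming is the GEM-style forward transfer (please correct us if the paper uses a different formulation):

    \mathrm{FT} = \frac{1}{T-1} \sum_{i=2}^{T} \left( R_{i-1,\,i} - b_i \right)

where R_{i-1,i} is the accuracy on task i evaluated right after training on task i-1, and b_i is the linear evaluation accuracy of a randomly initialized network on task i.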

Thanks!

KNN Classifier Issue

Hi,
I found this work very interesting and plan to work on similar topics. However I encounter some issues:
(1) For the fine-tuning example with Barlow Twins and CIFAR-100, should it be barlow.sh instead of barlow_distill.sh? Otherwise, we would need to provide a pretrained model in order to successfully run the code.
(2) If I enable the KNN online evaluation by setting disable_knn_eval = False, I get an error about empty test features and an expected argument in base.py at line 432. I saw a previously closed issue reporting something similar, but the error still appears even if I set a meaningful online_eval_batch_size = 256.
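For reference, this is the behaviour I expected from the online kNN evaluation (a self-contained sketch with my own names and defaults, not the repository's WeightedKNNClassifier API): both train and test features have to be collected before an accuracy can be computed, which is why an empty test-feature list breaks it.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def weighted_knn_accuracy(train_feats, train_targets, test_feats, test_targets,
                              k=20, temperature=0.07):
        """Cosine-similarity weighted kNN accuracy (generic sketch)."""
        train_feats = F.normalize(train_feats, dim=1)
        test_feats = F.normalize(test_feats, dim=1)
        k = min(k, train_feats.size(0))

        sims = test_feats @ train_feats.T                      # (n_test, n_train)
        topk_sims, topk_idx = sims.topk(k, dim=1)
        topk_labels = train_targets[topk_idx]                  # (n_test, k)

        # weight each neighbour's vote by its temperature-scaled similarity
        weights = (topk_sims / temperature).exp()
        num_classes = int(train_targets.max()) + 1
        votes = torch.zeros(test_feats.size(0), num_classes, device=test_feats.device)
        votes.scatter_add_(1, topk_labels, weights)

        preds = votes.argmax(dim=1)
        return (preds == test_targets).float().mean().item()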
Thanks for your help!

train data on DomainNet

I'm wondering whether the training procedure provided for DomainNet is correct: from main_pretrain.py, it looks like trainer.fit() is only called with the validation data, which seems to be validation rather than training. Also, is the DomainNet data the same as the DALI data?

How did you train the classifier?

Hello,

I have read your paper. It is very impressive. I have a question about the class-incremental setting and was wondering if you could answer it.

Did you train the classifier for each task only during that task's embedding training, or did you re-train all classifiers after the embedding training for all tasks has finished? I see that the embedding of a previous task may change after the next task is trained. How does the old classifier, trained on the old embedding space, handle this changed embedding? Your paper mentions "a subset, e.g., 10% of the data". Does this mean using 10% of the data to retrain the classifier at the very end?

Looking forward to your kind reply.

Thanks.

Bug in online KNN eval?

Hello,

Congrats on your paper! It touches on very interesting questions and I'd love to further study the problem of CSSL!

I am trying to execute your script for training Barlow Twins (python job_launcher.py --script bash_files/continual/cifar/barlow_distill.sh), but I might have encountered a bug: if I train with the WeightedKNNClassifier as a performance monitor, your code calls its forward here with only train_features and target_features provided.
After that, the compute function breaks here at line 89 because self.test_features is an empty list.

Am I getting something wrong? I am working in a new conda env set up as specified in your README file.

Thanks a lot!

About the Forward Transfer

Hi,
Thanks for your excellent work!
I'm curious about how to calculate the "Forward Transfer" after training. For example, I have successfully reproduced the class-incremental results for Fine-tuning and CaSSLe (with BYOL) on CIFAR-100, but I don't know how to directly check the FT results. Does it need a separate run to obtain the "linear evaluation accuracy of a random network", as stated in the paper?
BTW, just to be sure, is it right to directly read the "val_acc1" results from the wandb board as the final linear evaluation accuracy?

Some questions about lower and upper bounds

Hi,

[Screenshot attached]

I have some questions regarding the calculation of upper and lower bounds, taking class incremental learning as an example:

In supervised learning, the lower bound (Fine-tuning) is performed in a task-specific manner, i.e., Task 1 fine-tuning -> Task 2 fine-tuning ...; whereas the upper bound (offline) involves training a model by integrating all the data together.

Regarding SimCLR, my understanding is that the lower bound (Fine-tuning) corresponds to the SSL (self-supervised learning) stage, where the model undergoes Task 1 SSL -> Task 2 SSL ..., followed by linear evaluation. The upper bound (offline) involves performing SSL on the entire dataset and then conducting linear evaluation. Is my understanding correct?

Some questions about the training and evaluation process

Hello,

Thank you for your fantastic project! I have some questions regarding model evaluation.
1) Taking CIFAR-10 as an example, if there are 2 tasks, each with 5 classes, is the process shown in the following figure correct?

[Screenshot attached]

2) If it is correct, after the self-supervised continual learning part is completed, a 10-class classifier will be trained. When training this 10-class classifier, will all the data from all categories be used simultaneously?

3) Additionally, what is the overall process for Fine-tuning (using Table 2 as an example, Strategy 1 Fine-tuning)? Is it to replace CaSSLe with a non-continual-learning SSL method?

[Screenshot attached]

Thanks!

Question about contrastive distillation loss

Hi,

I have a few questions about the simclr code.

  1. logits = torch.einsum("if, jf -> ij", p, z) / temperature

    It seems that the predicted features (p) are not among the negatives, which is different from what's suggested in the paper (Appendix B). I understand that you switch p and z here (for a symmetric loss?):
    distill_loss = (
        simclr_distill_loss_func(p1, p2, frozen_z1, frozen_z2, self.distill_temperature)
        + simclr_distill_loss_func(frozen_z1, frozen_z2, p1, p2, self.distill_temperature)
    ) / 2

    but there are still no comparisons between different samples in p.

  2. In the paper, the distillation loss is applied to the two views independently. Based on the code above, does this mean that we should use them jointly to reproduce the results?

  3. logit_mask = torch.ones_like(pos_mask, device=device)
    logit_mask.fill_diagonal_(True)
    logit_mask[:, b:].fill_diagonal_(True)
    logit_mask[b:, :].fill_diagonal_(True)

    These four lines seem to leave logit_mask as an all-ones matrix. In my understanding, we should set the diagonals to False. Am I missing something?
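    To make this concrete, here is the behaviour I was expecting from the mask (a toy sketch, not your code; all names are my own):

    import torch
    import torch.nn.functional as F

    def masked_nt_xent_sketch(z1, z2, temperature=0.2):
        """Both views are concatenated, self-similarities on the diagonal are
        dropped, and the positive of each sample is its other view."""
        b = z1.size(0)
        z = F.normalize(torch.cat([z1, z2]), dim=1)   # (2b, d)
        logits = z @ z.T / temperature                # (2b, 2b)

        # self-comparisons are never valid candidates, so mask the diagonal out
        keep = torch.ones_like(logits, dtype=torch.bool)
        keep.fill_diagonal_(False)
        logits = logits.masked_fill(~keep, float("-inf"))

        # the positive of sample i is its other view, at index (i + b) mod 2b
        targets = (torch.arange(2 * b, device=z.device) + b) % (2 * b)
        return F.cross_entropy(logits, targets)

    If the diagonal were kept, each sample would also compare against itself with the maximum possible similarity, which only inflates the softmax denominator.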

TIA
