ConvMAE: Masked Convolution Meets Masked Autoencoders

License: MIT License

Python 99.80% Shell 0.20%

backbone computer-vision masked-image-modeling object-detection semantic-segmentation mae

convmae's Introduction

[NeurIPS 2022] MCMAE: Masked Convolution Meets Masked Autoencoders

Peng Gao¹, Teli Ma¹, Hongsheng Li², Ziyi Lin², Jifeng Dai³, Yu Qiao¹,

¹ Shanghai AI Laboratory, ² MMLab, CUHK, ³ Sensetime Research.

* We change the project name from ConvMAE to MCMAE.

This repo is the official implementation of MCMAE: Masked Convolution Meets Masked Autoencoders. It currently includes code and models for the following tasks:

ImageNet Pretrain: See PRETRAIN.md.
ImageNet Finetune: See FINETUNE.md.
Object Detection: See DETECTION.md.
Semantic Segmentation: See SEGMENTATION.md.
Video Classification: See VideoConvMAE.

Updates

14/Mar/2023

MR-MCMAE (a.k.a. ConvMAE-v2) paper released: Mimic before Reconstruct: Enhancing Masked Autoencoders with Feature Mimicking.

15/Sep/2022

Paper accepted at NeurIPS 2022.

9/Sep/2022

ConvMAE-v2 pretrained checkpoints are released.

21/Aug/2022

Official-ConvMAE-Det, which follows the official ViTDet codebase, is released.

08/Jun/2022

🚀FastConvMAE🚀: significantly reduces pretraining time (4000 single-GPU hours => 200 single-GPU hours). The code will be released at FastConvMAE.

27/May/2022

Code for ImageNet-1K pretraining is released.
Code and models for semantic segmentation are provided.

20/May/2022

Update results on video classification.

16/May/2022

Code and models for COCO object detection and instance segmentation are available.

11/May/2022

Pretrained models on ImageNet-1K for ConvMAE are released.
Code and models for ImageNet-1K finetuning and linear probing are provided.

08/May/2022

The preprint is publicly available on arXiv.

Introduction

The ConvMAE framework demonstrates that a multi-scale hybrid convolution-transformer can learn more discriminative representations via the masked auto-encoding scheme.

We present ConvMAE, a strong and efficient self-supervised framework that is easy to implement yet shows outstanding performance on downstream tasks.
ConvMAE naturally generates hierarchical representations and exhibits promising performance on object detection and segmentation.
ConvMAE-Base improves the ImageNet finetuning accuracy by 1.4% compared with MAE-Base. On object detection with Mask R-CNN, ConvMAE-Base achieves 53.2 box AP and 47.1 mask AP with a 25-epoch training schedule, while MAE-Base attains 50.3 box AP and 44.9 mask AP with 100 training epochs. On ADE20K with UperNet, ConvMAE-Base surpasses MAE-Base by 3.6 mIoU (51.7 vs. 48.1).

Pretrain on ImageNet-1K

The following table provides pretrained checkpoints and logs used in the paper.

	ConvMAE-Base
pretrained checkpoints	download
logs	download

The following results are for ConvMAE-v2 (pretrained for 200 epochs on ImageNet-1k).

model	pretrained checkpoints	ft. acc. on ImageNet-1k
ConvMAE-v2-Small	download	83.6
ConvMAE-v2-Base	download	85.7
ConvMAE-v2-Large	download	86.8
ConvMAE-v2-Huge	download	88.0

Main Results on ImageNet-1K

Models	#Params(M)	Supervision	Encoder Ratio	Pretrain Epochs	FT acc@1(%)	LIN acc@1(%)	FT logs/weights	LIN logs/weights
BEiT	88	DALLE	100%	300	83.0	37.6	-	-
MAE	88	RGB	25%	1600	83.6	67.8	-	-
SimMIM	88	RGB	100%	800	84.0	56.7	-	-
MaskFeat	88	HOG	100%	300	83.6	N/A	-	-
data2vec	88	RGB	100%	800	84.2	N/A	-	-
ConvMAE-B	88	RGB	25%	1600	85.0	70.9	log/weight

Main Results on COCO

Mask R-CNN

Models	Pretrain	Pretrain Epochs	Finetune Epochs	#Params(M)	FLOPs(T)	box AP	mask AP	logs/weights
Swin-B	IN21K w/ labels	90	36	109	0.7	51.4	45.4	-
Swin-L	IN21K w/ labels	90	36	218	1.1	52.4	46.2	-
MViTv2-B	IN21K w/ labels	90	36	73	0.6	53.1	47.4	-
MViTv2-L	IN21K w/ labels	90	36	239	1.3	53.6	47.5	-
Benchmarking-ViT-B	IN1K w/o labels	1600	100	118	0.9	50.4	44.9	-
Benchmarking-ViT-L	IN1K w/o labels	1600	100	340	1.9	53.3	47.2	-
ViTDet	IN1K w/o labels	1600	100	111	0.8	51.2	45.5	-
MIMDet-ViT-B	IN1K w/o labels	1600	36	127	1.1	51.5	46.0	-
MIMDet-ViT-L	IN1K w/o labels	1600	36	345	2.6	53.3	47.5	-
ConvMAE-B	IN1K w/o labels	1600	25	104	0.9	53.2	47.1	log/weight

Main Results on ADE20K

UperNet

Models	Pretrain	Pretrain Epochs	Finetune Iters	#Params(M)	FLOPs(T)	mIoU	logs/weights
DeiT-B	IN1K w/ labels	300	16K	163	0.6	45.6	-
Swin-B	IN1K w/ labels	300	16K	121	0.3	48.1	-
MoCo V3	IN1K	300	16K	163	0.6	47.3	-
DINO	IN1K	400	16K	163	0.6	47.2	-
BEiT	IN1K+DALLE	1600	16K	163	0.6	47.1	-
PeCo	IN1K	300	16K	163	0.6	46.7	-
CAE	IN1K+DALLE	800	16K	163	0.6	48.8	-
MAE	IN1K	1600	16K	163	0.6	48.1	-
ConvMAE-B	IN1K	1600	16K	153	0.6	51.7	log/weight

Main Results on Kinetics-400

Models	Pretrain Epochs	Finetune Epochs	#Params(M)	Top1	Top5	logs/weights
VideoMAE-B	200	100	87	77.8
VideoMAE-B	800	100	87	79.4
VideoMAE-B	1600	100	87	79.8
VideoMAE-B	1600	100 (w/ Repeated Aug)	87	80.7	94.7
SpatioTemporalLearner-B	800	150 (w/ Repeated Aug)	87	81.3	94.9
VideoConvMAE-B	200	100	86	80.1	94.3	Soon
VideoConvMAE-B	800	100	86	81.7	95.1	Soon
VideoConvMAE-B-MSD	800	100	86	82.7	95.5	Soon

Main Results on Something-Something V2

Models	Pretrain Epochs	Finetune Epochs	#Params(M)	Top1	Top5	logs/weights
VideoMAE-B	200	40	87	66.1
VideoMAE-B	800	40	87	69.3
VideoMAE-B	2400	40	87	70.3
VideoConvMAE-B	200	40	86	67.7	91.2	Soon
VideoConvMAE-B	800	40	86	69.9	92.4	Soon
VideoConvMAE-B-MSD	800	40	86	70.7	93.0	Soon

Getting Started

Prerequisites

Linux
Python 3.7+
CUDA 10.2+
GCC 5+

Training and evaluation

See PRETRAIN.md for pretraining.
See FINETUNE.md for pretrained model finetuning and linear probing.
See DETECTION.md for using pretrained backbone on Mask RCNN.
See SEGMENTATION.md for using pretrained backbone on UperNet.
See VideoConvMAE for video classification.

Visualization

Acknowledgement

The pretraining and finetuning of our project are based on DeiT and MAE. The object detection and semantic segmentation parts are based on MIMDet and MMSegmentation respectively. Thanks for their wonderful work.

License

ConvMAE is released under the MIT License.

Citation

@article{gao2022convmae,
  title={ConvMAE: Masked Convolution Meets Masked Autoencoders},
  author={Gao, Peng and Ma, Teli and Li, Hongsheng and Dai, Jifeng and Qiao, Yu},
  journal={arXiv preprint arXiv:2205.03892},
  year={2022}
}

convmae's Issues

Doubts about masking strategy

Hi! Thanks for the open-source code. I have a question about the masking strategy.
The paper says: "Uniformly masking stage-1 input tokens from the H/4 × W/4 feature maps would cause all tokens of stage-3 to have partially visible information and requires keeping all stage-3 tokens." Why would visible information pass to stage 3 if the image was masked at the first stage?
Thanks very much!
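
A small counting experiment may help with the quoted sentence. It is only a sketch under stated assumptions: a 224 × 224 input, a 56 × 56 stage-1 grid, a 14 × 14 stage-3 grid, 25% of tokens kept visible, and each stage-3 token treated as covering a non-overlapping 4 × 4 block of stage-1 tokens (real convolutions see a larger neighbourhood). The function names are hypothetical and are not taken from the repository.

```python
import random

def uniform_stage1_mask(grid=56, keep_ratio=0.25, seed=0):
    """Keep a random subset of stage-1 tokens, chosen without regard to stage 3."""
    rng = random.Random(seed)
    n = grid * grid
    kept = set(rng.sample(range(n), int(n * keep_ratio)))
    return [[(r * grid + c) in kept for c in range(grid)] for r in range(grid)]

def blockwise_stage1_mask(grid3=14, scale=4, keep_ratio=0.25, seed=0):
    """Choose visible tokens on the stage-3 grid, then upsample the mask by `scale`."""
    rng = random.Random(seed)
    n = grid3 * grid3
    kept = set(rng.sample(range(n), int(n * keep_ratio)))
    grid = grid3 * scale
    return [[((r // scale) * grid3 + c // scale) in kept for c in range(grid)]
            for r in range(grid)]

def stage3_fraction_with_visible_input(mask, scale=4):
    """Fraction of stage-3 tokens whose stage-1 footprint contains a visible token."""
    grid3 = len(mask) // scale
    hits = 0
    for big_r in range(grid3):
        for big_c in range(grid3):
            if any(mask[big_r * scale + dr][big_c * scale + dc]
                   for dr in range(scale) for dc in range(scale)):
                hits += 1
    return hits / (grid3 * grid3)

uniform_frac = stage3_fraction_with_visible_input(uniform_stage1_mask())
block_frac = stage3_fraction_with_visible_input(blockwise_stage1_mask())
```

With uniform masking, the chance that all 16 stage-1 tokens under one stage-3 token are masked is roughly 0.75^16, about 1%, so nearly every stage-3 token still receives some visible input and none can be dropped. When the mask is drawn on the stage-3 grid and upsampled, exactly the chosen 25% of stage-3 tokens have visible input.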

about pretrain model convmae.pth

I downloaded your pretrained model, and when I tried to load it, it gave me the following errors.

_IncompatibleKeys(missing_keys=['mask_token', 'decoder_pos_embed', 'stage1_output_decode.weight', 'stage1_output_decode.bias', 'stage2_output_decode.weight', 'stage2_output_decode.bias', 'decoder_embed.weight', 'decoder_embed.bias', 'decoder_blocks.0.norm1.weight', 'decoder_blocks.0.norm1.bias', 'decoder_blocks.0.attn.qkv.weight', 'decoder_blocks.0.attn.qkv.bias', 'decoder_blocks.0.attn.proj.weight', 'decoder_blocks.0.attn.proj.bias', 'decoder_blocks.0.norm2.weight', 'decoder_blocks.0.norm2.bias', 'decoder_blocks.0.mlp.fc1.weight', 'decoder_blocks.0.mlp.fc1.bias', 'decoder_blocks.0.mlp.fc2.weight', 'decoder_blocks.0.mlp.fc2.bias', 'decoder_blocks.1.norm1.weight', 'decoder_blocks.1.norm1.bias', 'decoder_blocks.1.attn.qkv.weight', 'decoder_blocks.1.attn.qkv.bias', 'decoder_blocks.1.attn.proj.weight', 'decoder_blocks.1.attn.proj.bias', 'decoder_blocks.1.norm2.weight', 'decoder_blocks.1.norm2.bias', 'decoder_blocks.1.mlp.fc1.weight', 'decoder_blocks.1.mlp.fc1.bias', 'decoder_blocks.1.mlp.fc2.weight', 'decoder_blocks.1.mlp.fc2.bias', 'decoder_blocks.2.norm1.weight', 'decoder_blocks.2.norm1.bias', 'decoder_blocks.2.attn.qkv.weight', 'decoder_blocks.2.attn.qkv.bias', 'decoder_blocks.2.attn.proj.weight', 'decoder_blocks.2.attn.proj.bias', 'decoder_blocks.2.norm2.weight', 'decoder_blocks.2.norm2.bias', 'decoder_blocks.2.mlp.fc1.weight', 'decoder_blocks.2.mlp.fc1.bias', 'decoder_blocks.2.mlp.fc2.weight', 'decoder_blocks.2.mlp.fc2.bias', 'decoder_blocks.3.norm1.weight', 'decoder_blocks.3.norm1.bias', 'decoder_blocks.3.attn.qkv.weight', 'decoder_blocks.3.attn.qkv.bias', 'decoder_blocks.3.attn.proj.weight', 'decoder_blocks.3.attn.proj.bias', 'decoder_blocks.3.norm2.weight', 'decoder_blocks.3.norm2.bias', 'decoder_blocks.3.mlp.fc1.weight', 'decoder_blocks.3.mlp.fc1.bias', 'decoder_blocks.3.mlp.fc2.weight', 'decoder_blocks.3.mlp.fc2.bias', 'decoder_blocks.4.norm1.weight', 'decoder_blocks.4.norm1.bias', 'decoder_blocks.4.attn.qkv.weight', 'decoder_blocks.4.attn.qkv.bias', 
'decoder_blocks.4.attn.proj.weight', 'decoder_blocks.4.attn.proj.bias', 'decoder_blocks.4.norm2.weight', 'decoder_blocks.4.norm2.bias', 'decoder_blocks.4.mlp.fc1.weight', 'decoder_blocks.4.mlp.fc1.bias', 'decoder_blocks.4.mlp.fc2.weight', 'decoder_blocks.4.mlp.fc2.bias', 'decoder_blocks.5.norm1.weight', 'decoder_blocks.5.norm1.bias', 'decoder_blocks.5.attn.qkv.weight', 'decoder_blocks.5.attn.qkv.bias', 'decoder_blocks.5.attn.proj.weight', 'decoder_blocks.5.attn.proj.bias', 'decoder_blocks.5.norm2.weight', 'decoder_blocks.5.norm2.bias', 'decoder_blocks.5.mlp.fc1.weight', 'decoder_blocks.5.mlp.fc1.bias', 'decoder_blocks.5.mlp.fc2.weight', 'decoder_blocks.5.mlp.fc2.bias', 'decoder_blocks.6.norm1.weight', 'decoder_blocks.6.norm1.bias', 'decoder_blocks.6.attn.qkv.weight', 'decoder_blocks.6.attn.qkv.bias', 'decoder_blocks.6.attn.proj.weight', 'decoder_blocks.6.attn.proj.bias', 'decoder_blocks.6.norm2.weight', 'decoder_blocks.6.norm2.bias', 'decoder_blocks.6.mlp.fc1.weight', 'decoder_blocks.6.mlp.fc1.bias', 'decoder_blocks.6.mlp.fc2.weight', 'decoder_blocks.6.mlp.fc2.bias', 'decoder_blocks.7.norm1.weight', 'decoder_blocks.7.norm1.bias', 'decoder_blocks.7.attn.qkv.weight', 'decoder_blocks.7.attn.qkv.bias', 'decoder_blocks.7.attn.proj.weight', 'decoder_blocks.7.attn.proj.bias', 'decoder_blocks.7.norm2.weight', 'decoder_blocks.7.norm2.bias', 'decoder_blocks.7.mlp.fc1.weight', 'decoder_blocks.7.mlp.fc1.bias', 'decoder_blocks.7.mlp.fc2.weight', 'decoder_blocks.7.mlp.fc2.bias', 'decoder_norm.weight', 'decoder_norm.bias', 'decoder_pred.weight', 'decoder_pred.bias'], unexpected_keys=[])
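
For context, `_IncompatibleKeys` is the report that PyTorch's `load_state_dict` returns when called with `strict=False`; it is not an exception. A quick way to read such a report is to group the key names by prefix. The sketch below does that on a shortened sample of the list above. The set of prefixes treated as decoder-side is an assumption inferred from the names in the report, and the helper names are hypothetical.

```python
from collections import Counter

# Assumed decoder-side prefixes, inferred from the names in the report above.
DECODER_PREFIXES = ("mask_token", "decoder_", "stage1_output_decode", "stage2_output_decode")

def group_keys_by_prefix(keys):
    """Count parameter names by their first dotted component."""
    return Counter(k.split(".")[0] for k in keys)

def only_decoder_missing(missing_keys, unexpected_keys):
    """True when nothing unexpected was found and every missing key is decoder-side."""
    return not unexpected_keys and all(k.startswith(DECODER_PREFIXES) for k in missing_keys)

# Shortened sample of the missing_keys list shown above.
sample_missing = [
    "mask_token", "decoder_pos_embed", "stage1_output_decode.weight",
    "decoder_embed.bias", "decoder_blocks.0.attn.qkv.weight", "decoder_norm.weight",
]
summary = group_keys_by_prefix(sample_missing)
encoder_ok = only_decoder_missing(sample_missing, [])
```

If this check passes on the full list, every parameter the checkpoint lacks is on the decoder side and no checkpoint entry went unused, which would be consistent with a checkpoint that stores only encoder weights.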

full checkpoint

Hi, @gaopengpjlab, could you kindly provide the full checkpoint (including the decoder) of ConvMAE-v2-S? Lots of thanks!

Pretraining implementation

I have implemented the pretraining code based on the MAE repo, but I have one question: in the decoder phase, do you (1) sum the features of all 3 stages and then normalize the sum, or (2) normalize the feature of the last stage and then sum it with the 2 previous ones? I got NaN loss after 270 epochs with approach (1). By the way, have you ever encountered NaN loss during training?

why still can't find the paper or details of ConvMAE-v2?

ConvMAE-v2 is much better than v1. What is the difference, please?

Questions about convmae-v2

ConvMAE-v2 is great work, and I'm very interested in some of the details of the paper. Where can I find the code for ConvMAE-v2?

Hi, I need help

Hello, I would like to ask how to display the per-class accuracy output by finetuning, and how to use the downstream-task models to visualize detections.

how many gpus and which type of gpu are needed for pretraining, down-stream task finetuning?

specifically, gpu type, gpu number and gpu training time for pretraining, detection training and segmentation training

How to pretrain CNN backbones like ResNet and EfficientNet?

Question about ConvMAE-v2

Thank you for your excellent work!

When I load the ConvMAE-v2-Base pretrained checkpoint [https://drive.google.com/file/d/1gykVKNDlRn8eiuXk5bUj1PbSnHXFzLnI/view?usp=sharing], it has a cls_token parameter, which is not in models_convmae.py.

Does the ConvMAE-v2 model differ from models_convmae.py in some details? Thanks!
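
One way to see exactly where a checkpoint and a model definition disagree is to compare their key sets before loading. The sketch below works on plain dictionaries, so it needs no deep-learning library. `split_checkpoint_keys` is a hypothetical helper, and the example keys other than `cls_token` are placeholders rather than names confirmed from the repository.

```python
def split_checkpoint_keys(checkpoint_state, model_keys):
    """Split a checkpoint into entries the model accepts, extras, and absent keys."""
    model_keys = set(model_keys)
    usable = {k: v for k, v in checkpoint_state.items() if k in model_keys}
    extra = sorted(k for k in checkpoint_state if k not in model_keys)
    absent = sorted(model_keys - set(checkpoint_state))
    return usable, extra, absent

# Placeholder values stand in for tensors; only the key names matter here.
checkpoint_state = {"cls_token": 0, "pos_embed": 1, "norm.weight": 2}
model_keys = ["pos_embed", "norm.weight", "norm.bias"]

usable, extra, absent = split_checkpoint_keys(checkpoint_state, model_keys)
```

Here `extra` lists parameters the checkpoint carries but the model does not define, and `absent` lists the reverse. The comparison does not say whether an extra parameter matters for accuracy; that still needs an answer from the authors.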

Can you give us a guide on how to finetune with your downloaded convmae.pth model?

How long will the pretraining stage take on V100?

Hi,

Thank you for your excellent work!
We would like to know how long pretraining on ImageNet-1k would take on a machine with 8 V100 GPUs.
Also, will you release your manuscript on Faster ConvMAE soon? We can't wait to learn more details about it.

refactor hard coded numbers for more control over parameters (MaskedAutoencoderConvViT)

Hi - I'd like to use patches of size 32x32, and a smaller model in general.
Anything I change breaks the entire code. It would be really helpful if you refactored out all of the places that specify 4, 2, 16, etc. throughout the code for MaskedAutoencoderConvViT.

Thanks,
Dan
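
As an illustration of how the numbers mentioned above relate to each other, the sketch below derives them from a single configuration object. It assumes three stages with per-stage downsampling factors of 4, 2 and 2 (so an overall stride of 16) and a 224-pixel input; `StageLayout` and its fields are hypothetical and do not exist in the repository.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StageLayout:
    img_size: int = 224
    patch_sizes: tuple = (4, 2, 2)  # assumed per-stage downsampling factors
    in_chans: int = 3

    @property
    def total_stride(self):
        """Overall downsampling factor, i.e. the effective patch size."""
        stride = 1
        for p in self.patch_sizes:
            stride *= p
        return stride

    @property
    def grid_sizes(self):
        """Side length of the token grid after each stage."""
        sizes, side = [], self.img_size
        for p in self.patch_sizes:
            if side % p:
                raise ValueError(f"side {side} is not divisible by patch size {p}")
            side //= p
            sizes.append(side)
        return sizes

    @property
    def pixels_per_token(self):
        """Length of the pixel vector reconstructed for one final-stage token."""
        return self.total_stride ** 2 * self.in_chans

default = StageLayout()
coarse = StageLayout(patch_sizes=(4, 2, 4))  # one possible way to reach 32x32 patches
```

Under these assumptions the default layout gives grids of 56, 28 and 14 tokens per side and 768 pixel values per token, while the coarse layout ends at 7 tokens per side. Any mask upsampling ratio or decoder output size that is currently written as a literal would have to be derived from the same object for such a change to work.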

When will you update the MR-MCMAE model?

I want to try the MR-MCMAE model, but quite confused about how to build a pretrain structure @Alpha-VL @TeleeMa

ImageNet Evaluation

Thanks for sharing the great work.
I encountered difficulties in reproducing the evaluation results in FINETUNE.md. My evaluation results are:
* Acc@1 1.090 Acc@5 2.188 loss 8.955
Accuracy of the network on the 50000 test images: 1.1%
That's obviously too big a gap.

I downloaded ImageNet-1K following your guidance and prepared it following Jasonlee1995. Are there any details I haven't noticed, or any specific requirements for preparing the dataset?

Will the MAE pretraining code be updated soon?

How can I train 200 epochs for DET?

Hi,
I want to train the pretrained model in the detectron2 framework for object detection.
But the code only trains for 1 epoch and then ends.
Is this a bug?
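
detectron2 schedules training by iteration count rather than by epoch, so a target number of epochs has to be converted into iterations and set as the iteration limit in the config. The sketch below shows the arithmetic only. The total batch size of 64 is an assumed example value, and the dataset size is the number of images in COCO train2017 (the count actually used can be slightly lower if images without annotations are filtered out).

```python
import math

COCO_TRAIN2017_IMAGES = 118_287  # number of images in the COCO train2017 split

def epochs_to_iterations(epochs, total_batch_size, dataset_size=COCO_TRAIN2017_IMAGES):
    """Number of training iterations needed to pass over the dataset `epochs` times."""
    return math.ceil(epochs * dataset_size / total_batch_size)

iters_1_epoch = epochs_to_iterations(1, total_batch_size=64)
iters_25_epochs = epochs_to_iterations(25, total_batch_size=64)
```

If a run stops after roughly `iters_1_epoch` iterations, the configured iteration limit is probably set to one epoch's worth; raising it to the value computed for the desired number of epochs is the thing to try first.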

about the training loss

Hello, dear authors! I observe that it takes many epochs for the training loss to decrease from 0.42 to 0.39. So I have a question: does it really make a big difference to the test results when the training loss decreases from 0.42 to 0.39?

output of FastconvMAE

I used your FastConvMAE to train on ImageNet data.

In your code, you said the output should be:

However, when I used the pretrained model to predict, it gave me a prediction of size torch.Size([4, 196, 768]).
I also tested MAE mode, and it gives a prediction of size torch.Size([1, 196, 768]).

Can you explain why?

Time required to train one epoch.

Dear author:
Thank you for sharing the excellent work! May I ask how the time overhead of ConvMAE pre-training compares to MAE? Can you provide the time required to train an epoch for these two methods on the same type of GPU?

Question about VideoConvMAE

Thank you for your impressive work! VideoConvMAE still seems to lack a code release. Can you update it?

When will the code for object detection be updated?

Running pretrained convvit on larger image sizes

Hi,
I am looking to see how well the pretrained base model runs on my own dataset, but the current model is configured for an image size of 224.
In the original MAE code, the 'interpolate_pos_embed' function allows the user to resize the positional embedding to accommodate larger image sizes.
In your linear probing code, that same call is commented out, and (obviously) it wouldn't function the same way, as there are multiple positional embeddings to take care of.
Do you have a function that allows the pretrained model to run on different image sizes?
Thanks
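
For readers unfamiliar with the idea, the sketch below shows what resizing a positional-embedding grid means, using plain bilinear interpolation so that it needs no dependencies. This is a stand-in for whatever interpolation the MAE helper uses, not a copy of it, and `resize_pos_embed` is a hypothetical name. It handles one square grid of embedding vectors; a model holding several grids would need the same step applied to each.

```python
def resize_pos_embed(grid, new_size):
    """Bilinearly resize a square grid of embedding vectors to new_size x new_size.

    grid[r][c] is a list of floats (one embedding vector). Corner embeddings
    are preserved exactly (the "align corners" convention).
    """
    old = len(grid)
    if old < 2 or new_size < 2:
        raise ValueError("both grids must be at least 2 x 2")
    step = (old - 1) / (new_size - 1)
    out = []
    for r in range(new_size):
        y = r * step
        y0 = min(int(y), old - 2)   # top row of the enclosing cell
        wy = y - y0                 # vertical blend weight
        row = []
        for c in range(new_size):
            x = c * step
            x0 = min(int(x), old - 2)
            wx = x - x0
            a, b = grid[y0][x0], grid[y0][x0 + 1]
            d, e = grid[y0 + 1][x0], grid[y0 + 1][x0 + 1]
            row.append([
                (1 - wy) * ((1 - wx) * a[k] + wx * b[k])
                + wy * ((1 - wx) * d[k] + wx * e[k])
                for k in range(len(a))
            ])
        out.append(row)
    return out

# Tiny example: each "embedding" is simply its own (row, col) coordinate.
small = [[[float(r), float(c)] for c in range(2)] for r in range(2)]
larger = resize_pos_embed(small, 3)
```

In the example the new centre embedding is the average of the four originals, and the four corners are unchanged.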

Have you tried a pure convolution network?

Have you tried a pure convolution network? Does this work?

When to release other model weights rather than base model?

image unpatchify related problems

There is a small problem in the open-source code: self.patch_embed is not defined in the unpatchify function of the model, so the original dimensions of the image cannot be restored. I hope it can be fixed for our convenience. Thank you for your answer.
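
Until the function is fixed, one workaround is an unpatchify that takes the patch size as an explicit argument instead of reading it from `self.patch_embed`. The sketch below is dependency-free and works on nested lists. It assumes each token stores its pixels in (row within patch, column within patch, channel) order, which is the layout I believe MAE's reference patchify uses; treat that ordering as an assumption to verify against the model you are using.

```python
def patchify(img, p):
    """img[c][y][x] -> tokens[t][e], with t = i*w + j and e = (py*p + px)*C + c."""
    channels, height, width = len(img), len(img[0]), len(img[0][0])
    assert height % p == 0 and width % p == 0
    h, w = height // p, width // p
    tokens = []
    for i in range(h):
        for j in range(w):
            tokens.append([img[c][i * p + py][j * p + px]
                           for py in range(p) for px in range(p)
                           for c in range(channels)])
    return tokens

def unpatchify(tokens, p, channels=3):
    """Inverse of patchify for a square token grid; the patch size is passed in."""
    h = w = int(round(len(tokens) ** 0.5))
    assert h * w == len(tokens)
    img = [[[0] * (w * p) for _ in range(h * p)] for _ in range(channels)]
    for t, tok in enumerate(tokens):
        i, j = divmod(t, w)               # patch row and column
        for e, value in enumerate(tok):
            pos, c = divmod(e, channels)  # position inside the patch, channel
            py, px = divmod(pos, p)
            img[c][i * p + py][j * p + px] = value
    return img

# Round trip on a tiny 3-channel 4x4 image with 2x2 patches.
image = [[[c * 100 + y * 10 + x for x in range(4)] for y in range(4)] for c in range(3)]
tokens = patchify(image, 2)
restored = unpatchify(tokens, 2)
```

For the model discussed here the patch size to pass would presumably be the overall stride of 16, giving 14 × 14 tokens of 768 values for a 224 × 224 image.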

LIN results for ConvMAE-Base pretrained for 200 epochs

Thank you for your excellent work. I noticed that in Table 6, the LIN results for ConvMAE-Base pretrained for 200 epochs are missing. May I ask what they are? Or would it be convenient for you to provide the ConvMAE-Base checkpoint pretrained for 200 epochs?

Hello, how to finetune own datasets

What should I do if I want to finetune the current pretrained model on my own dataset instead of ImageNet's val dataset? Can you answer it? Thank you very much.

ConvMAE flops for classification ?

Visualization VIT feature

Hi, author.

How can I visualize the attention maps of your results?

Use Encoder (ViT)?
Use Decoder (VIT)?

given input x -> y = encoder(x) -> decoder(y). then use final vit of decoder(y)?

Train on

Could you provide a tutorial on how to train and finetune with custom dataset? And how to modify the input image size during the detection, the current code seems not to support custom image size.

Hi

Hi,
Great work! Congratulations!
How to draw the pictures in the "Visualization" section of README.md?

Total memory consumption for training with 32 batch size.

I have tried training the convmae detector (as provided in this repository) with 2 GPUs with each 32GB (V-100). It looks like I can carry out training with only batch size = 2. Going beyond batch-size 2 raises CUDA out of memory. Also with such small batch size training does not seem to produce any well-trained model. Could you tell me the recommended memory size for training the model with batch size = 32?

Thank you so much.

mask convolution

Hi! Thanks for the open-source code.
I noticed that the masked convolution in the code only masks the residual branch, while the skip connection is not masked, as shown in line 119 of "ConvMAE/vision_transformer.py". The corresponding code is as follows:
"x = x + self.drop_path(self.conv2(self.attn(mask * self.conv1(self.norm1(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2))))) "
Will this lead to information leakage in the convolution stage?
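
A toy experiment can make this concern concrete. The sketch below is a 1D analogue of the quoted line, not the repository's code. It assumes that the only operation mixing neighbouring positions is the one applied after the multiplication by `mask`, and that everything else in the branch acts on each position separately. `masked_block` is a hypothetical name.

```python
def masked_block(x, mask, kernel=(0.25, 0.5, 0.25)):
    """Toy 1D analogue of `x + conv(mask * x)`; mask[i] is 1 if visible, 0 if masked."""
    gated = [xi * mi for xi, mi in zip(x, mask)]
    n, half = len(x), len(kernel) // 2
    out = []
    for i in range(n):
        acc = 0.0
        for k, weight in enumerate(kernel):
            j = i + k - half
            if 0 <= j < n:        # zero padding at the borders
                acc += weight * gated[j]
        out.append(x[i] + acc)    # the skip connection is NOT masked
    return out

mask = [1, 0, 1, 1, 0, 0, 1, 0]
x_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x_b = [1.0, -9.0, 3.0, 4.0, 50.0, -6.0, 7.0, 0.5]  # differs only at masked positions

# Two stacked blocks, each applying the same mask.
y_a = masked_block(masked_block(x_a, mask), mask)
y_b = masked_block(masked_block(x_b, mask), mask)

visible = [i for i, m in enumerate(mask) if m]
same_on_visible = all(y_a[i] == y_b[i] for i in visible)
```

In this toy setting, changing the input at masked positions never changes the output at visible positions, because masked values are zeroed before any spatial mixing in every block. The skip connection does carry the original values forward at the masked positions themselves, so whether anything leaks depends on those positions being discarded before later stages use them.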

Model Settings and checkpoint not match

Thanks for your great work!

But I have a problem about the model setting with your provided checkpoint.
I load your checkpoints, but the model setting that can be loaded correctly does not match what is written in the paper.

the mlp_ratio of Large and Huge
the patch_size of huge

I want to find out what's going on, thanks a lot!