Peng Gao1, Teli Ma1, Hongsheng Li2, Jifeng Dai3, Yu Qiao1,
1 Shanghai AI Laboratory, 2 MMLab, CUHK, 3 Sensetime Research.
This repo is the official implementation of ConvMAE: Masked Convolution Meets Masked Autoencoders. It currently concludes codes and models for the following tasks:
ImageNet Pretrain: See PRETRAIN.md.
ImageNet Finetune: See FINETUNE.md.
Object Detection: See DETECTION.md.
Semantic Segmentation: See SEGMENTATION.md.
Video Classification: See VideoConvMAE.
20/May/2022
Update results on video classification.
16/May/2022
The supported codes and models for COCO object detection and instance segmentation are available.
11/May/2022
- Pretrained models on ImageNet-1K for ConvMAE.
- The supported codes and models for ImageNet-1K finetuning and linear probing are provided.
08/May/2022
The preprint version is public at arxiv.
ConvMAE framework demonstrates that multi-scale hybrid convolution-transformer can learn more discriminative representations via the mask auto-encoding scheme.
- We present the strong and efficient self-supervised framework ConvMAE, which is easy to implement but show outstanding performances on downstream tasks.
- ConvMAE naturally generates hierarchical representations and exhibit promising performances on object detection and segmentation.
- ConvMAE-Base improves the ImageNet finetuning accuracy by 1.4% compared with MAE-Base. On object detection with Mask-RCNN, ConvMAE-Base achieves 53.2 box AP and 47.1 mask AP with a 25-epoch training schedule while MAE-Base attains 50.3 box AP and 44.9 mask AP with 100 training epochs. On ADE20K with UperNet, ConvMAE-Base surpasses MAE-Base by 3.6 mIoU (48.1 vs. 51.7).
The following table provides pretrained checkpoints and logs used in the paper.
ConvMAE-Base | |
---|---|
pretrained checkpoints | download |
logs | download |
Models | #Params(M) | Supervision | Encoder Ratio | Pretrain Epochs | FT acc@1(%) | LIN acc@1(%) | FT logs/weights | LIN logs/weights |
---|---|---|---|---|---|---|---|---|
BEiT | 88 | DALLE | 100% | 300 | 83.0 | 37.6 | - | - |
MAE | 88 | RGB | 25% | 1600 | 83.6 | 67.8 | - | - |
SimMIM | 88 | RGB | 100% | 800 | 84.0 | 56.7 | - | - |
MaskFeat | 88 | HOG | 100% | 300 | 83.6 | N/A | - | - |
data2vec | 88 | RGB | 100% | 800 | 84.2 | N/A | - | - |
ConvMAE-B | 88 | RGB | 25% | 1600 | 85.0 | 70.9 | log/weight |
Models | Pretrain | Pretrain Epochs | Finetune Epochs | #Params(M) | FLOPs(T) | box AP | mask AP | logs/weights |
---|---|---|---|---|---|---|---|---|
Swin-B | IN21K w/ labels | 90 | 36 | 109 | 0.7 | 51.4 | 45.4 | - |
Swin-L | IN21K w/ labels | 90 | 36 | 218 | 1.1 | 52.4 | 46.2 | - |
MViTv2-B | IN21K w/ labels | 90 | 36 | 73 | 0.6 | 53.1 | 47.4 | - |
MViTv2-L | IN21K w/ labels | 90 | 36 | 239 | 1.3 | 53.6 | 47.5 | - |
Benchmarking-ViT-B | IN1K w/o labels | 1600 | 100 | 118 | 0.9 | 50.4 | 44.9 | - |
Benchmarking-ViT-L | IN1K w/o labels | 1600 | 100 | 340 | 1.9 | 53.3 | 47.2 | - |
ViTDet | IN1K w/o labels | 1600 | 100 | 111 | 0.8 | 51.2 | 45.5 | - |
MIMDet-ViT-B | IN1K w/o labels | 1600 | 36 | 127 | 1.1 | 51.5 | 46.0 | - |
MIMDet-ViT-L | IN1K w/o labels | 1600 | 36 | 345 | 2.6 | 53.3 | 47.5 | - |
ConvMAE-B | IN1K w/o lables | 1600 | 25 | 104 | 0.9 | 53.2 | 47.1 | log/weight |
Models | Pretrain | Pretrain Epochs | Finetune Iters | #Params(M) | FLOPs(T) | mIoU | logs/weights |
---|---|---|---|---|---|---|---|
DeiT-B | IN1K w/ labels | 300 | 16K | 163 | 0.6 | 45.6 | - |
Swin-B | IN1K w/ labels | 300 | 16K | 121 | 0.3 | 48.1 | - |
MoCo V3 | IN1K | 300 | 16K | 163 | 0.6 | 47.3 | - |
DINO | IN1K | 400 | 16K | 163 | 0.6 | 47.2 | - |
BEiT | IN1K+DALLE | 1600 | 16K | 163 | 0.6 | 47.1 | - |
PeCo | IN1K | 300 | 16K | 163 | 0.6 | 46.7 | - |
CAE | IN1K+DALLE | 800 | 16K | 163 | 0.6 | 48.8 | - |
MAE | IN1K | 1600 | 16K | 163 | 0.6 | 48.1 | - |
ConvMAE-B | IN1K | 1600 | 16K | 153 | 0.6 | 51.7 | soon |
Models | Pretrain Epochs | Finetune Epochs | #Params(M) | Top1 | Top5 | logs/weights |
---|---|---|---|---|---|---|
VideoMAE-B | 200 | 100 | 87 | 77.8 | ||
VideoMAE-B | 800 | 100 | 87 | 79.4 | ||
VideoMAE-B | 1600 | 100 | 87 | 79.8 | ||
VideoMAE-B | 1600 | 100 (w/ Repeated Aug) | 87 | 80.7 | 94.7 | |
SpatioTemporalLearner-B | 800 | 150 (w/ Repeated Aug) | 87 | 81.3 | 94.9 | |
VideoConvMAE-B | 200 | 100 | 86 | 80.1 | 94.3 | Soon |
VideoConvMAE-B | 800 | 100 | 86 | 81.7 | 95.1 | Soon |
VideoConvMAE-B-MSD | 800 | 100 | 86 | 82.7 | 95.5 | Soon |
Models | Pretrain Epochs | Finetune Epochs | #Params(M) | Top1 | Top5 | logs/weights |
---|---|---|---|---|---|---|
VideoMAE-B | 200 | 40 | 87 | 66.1 | ||
VideoMAE-B | 800 | 40 | 87 | 69.3 | ||
VideoMAE-B | 2400 | 40 | 87 | 70.3 | ||
VideoConvMAE-B | 200 | 40 | 86 | 67.7 | 91.2 | Soon |
VideoConvMAE-B | 800 | 40 | 86 | 69.9 | 92.4 | Soon |
VideoConvMAE-B-MSD | 800 | 40 | 86 | 70.7 | 93.0 | Soon |
- Linux
- Python 3.7+
- CUDA 10.2+
- GCC 5+
- See PRETRAIN.md for pretraining.
- See FINETUNE.md for pretrained model finetuning and linear probing.
- See DETECTION.md for using pretrained backbone on Mask RCNN.
- See SEGMENTATION.md for using pretrained backbone on UperNet.
- See VideoConvMAE for video classification.
The pretraining and finetuning of our project are based on DeiT and MAE. The object detection and semantic segmentation parts are based on MIMDet and MMSegmentation respectively. Thanks for their wonderful work.
ConvMAE is released under the MIT License.
@article{gao2022convmae,
title={ConvMAE: Masked Convolution Meets Masked Autoencoders},
author={Gao, Peng and Ma, Teli and Li, Hongsheng and Dai, Jifeng and Qiao, Yu},
journal={arXiv preprint arXiv:2205.03892},
year={2022}
}