
mfm's Introduction

MFM

Unofficial code for the paper "Masked Frequency Modeling for Self-Supervised Visual Pre-Training" (https://arxiv.org/pdf/2206.07706.pdf).

Below are experiments with ResNet-50. Although a better result is achieved, the baseline here is also much higher than in the paper.

| method              | top-1 acc | pretrain | finetune |
|---------------------|-----------|----------|----------|
| paper scratch       | 78.1      | -        | -        |
| paper mfm pretrain  | 78.5      | -        | -        |
| scratch             | 78.542    | -        | link     |
| supervised pretrain | 78.942    | -        | link     |
| mfm pretrain        | 78.826    | link     | link     |

Note: "Supervised pretrain" means finetuning from the torchvision ResNet-50 weights (by setting pretrained=True). Here, supervised pretraining appears to work better than the proposed MFM pretraining.
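As a minimal sketch of what the three starting points in the table mean (illustrative only, not this repo's actual training entry point; the checkpoint filename below is hypothetical):

    import torch
    import torchvision

    # "supervised pretrain": start finetuning from the ImageNet-supervised
    # torchvision weights (newer torchvision versions use
    # weights=torchvision.models.ResNet50_Weights.IMAGENET1K_V1 instead).
    model = torchvision.models.resnet50(pretrained=True)

    # "scratch" starts from random init; "mfm pretrain" would instead load an
    # MFM-pretrained checkpoint (hypothetical filename):
    # model = torchvision.models.resnet50(pretrained=False)
    # model.load_state_dict(torch.load("mfm_pretrained.pth"), strict=False)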

Platform

  • pytorch 1.13.1
  • torchvision 0.14.1
  • dali 1.21.0
  • cuda 11.6
  • V100 GPU(32G) x 8
  • driver: 470.82.01

Dataset

Prepare the ImageNet train and val sets in the same way as the pytorch official classification example, then link them into the folder of this repo:

    $ mkdir -p imagenet
    $ ln -s /path/to/imagenet/train ./imagenet/train
    $ ln -s /path/to/imagenet/val ./imagenet/val

Train

The pretraining and finetuning commands are here.

More ablations

Here are some points that affect the results:

  1. finetune --val-resize-size
    When we evaluate the model after finetuning, we always resize the short side of the image to a fixed value before the center-crop operation. I find that the choice of this fixed short-side value can sometimes affect accuracy by a noticeable margin (a transform sketch follows this list). Taking "supervised pretrain" as an example:

    | val-resize-size | 234    | 235    | 236    |
    |-----------------|--------|--------|--------|
    | top-1 acc       | 78.856 | 78.942 | 78.794 |
  2. finetune with BCE loss is important
    We can see this by finetuning from scratch with CE (cross-entropy) loss versus BCE (binary cross-entropy) loss (a loss sketch follows this list):

    | loss      | CE     | BCE    |
    |-----------|--------|--------|
    | top-1 acc | 78.542 | 78.952 |
  3. pretrain random crop area
    We usually crop a region whose area is some ratio of the original image's area; the default range for this ratio in torchvision's RandomResizedCrop is 0.08-1.0. Different self-supervised learning methods prefer different ranges: for example, MAE uses 0.2-1.0, MAE3d uses 0.5-1.0, and SimMIM uses 0.67-1.0. Here I find the smaller lower bound of 0.2-1.0 is better (a crop sketch follows this list):

    | random area ratio | 0.67-1.0 | 0.2-1.0 | 0.1-1.0 |
    |-------------------|----------|---------|---------|
    | top-1 acc         | 78.770   | 78.826  | 78.842  |

    Though 0.1-1.0 is better than 0.2-1.0 here, I still use the latter, since with 0.1-1.0 the finetuning eval result is more affected by val-resize-size:

    | val-resize-size | 234    | 235    | 236    |
    |-----------------|--------|--------|--------|
    | 0.2-1.0         | 78.816 | 78.826 | 78.796 |
    | 0.1-1.0         | 78.730 | 78.842 | 78.738 |
  4. model variance
    Here I pretrain the model 4 times (2 runs on 8 V100 GPUs and 2 runs on 8 P40 GPUs) with identical configurations, then finetune each pretrained model 3 times (on 8 P40 GPUs). Results are listed below. The results vary by a large margin, so the good numbers above may partly be luck; I cannot yet claim to have certainly reproduced the results in the paper.

    | pretrain | finetune | acc1 (235) | mean/std     | overall mean/std |
    |----------|----------|------------|--------------|------------------|
    | round 1  | round 1  | 78.654     | 78.644/0.024 | 78.621/0.08      |
    |          | round 2  | 78.61      |              |                  |
    |          | round 3  | 78.668     |              |                  |
    | round 2  | round 1  | 78.646     | 78.642/0.122 |                  |
    |          | round 2  | 78.79      |              |                  |
    |          | round 3  | 78.49      |              |                  |
    | round 3  | round 1  | 78.516     | 78.612/0.073 |                  |
    |          | round 2  | 78.626     |              |                  |
    |          | round 3  | 78.694     |              |                  |
    | round 4  | round 1  | 78.608     | 78.584/0.080 |                  |
    |          | round 2  | 78.668     |              |                  |
    |          | round 3  | 78.476     |              |                  |
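Regarding point 1, here is a minimal sketch of the evaluation transform being ablated: resize the short side to --val-resize-size, then center-crop. The 224x224 crop size and the normalization constants are standard ImageNet values assumed here, not taken from this repo:

    import torchvision.transforms as T

    VAL_RESIZE_SIZE = 235  # the knob ablated above: 234 / 235 / 236

    val_transform = T.Compose([
        T.Resize(VAL_RESIZE_SIZE),  # an int resizes the short side only
        T.CenterCrop(224),          # assumed standard 224x224 eval crop
        T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])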
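Regarding point 2, a sketch of what finetuning with BCE instead of CE can look like, assuming plain one-hot targets (mixup or label smoothing, if used, would soften them):

    import torch
    import torch.nn.functional as F

    def bce_cls_loss(logits: torch.Tensor, labels: torch.Tensor,
                     num_classes: int = 1000) -> torch.Tensor:
        # BCE treats each of the 1000 classes as an independent sigmoid.
        targets = F.one_hot(labels, num_classes).float()
        return F.binary_cross_entropy_with_logits(logits, targets)

    # The CE baseline in the table above would instead be:
    # loss = F.cross_entropy(logits, labels)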
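Regarding point 3, the random area ratio is the scale argument of torchvision's RandomResizedCrop. A sketch of a pretraining crop pipeline with the chosen 0.2-1.0 range (the 224 input size and the horizontal flip are assumptions, not confirmed settings of this repo):

    import torchvision.transforms as T

    # scale=(0.2, 1.0) is the range chosen above; torchvision's default is
    # (0.08, 1.0), and a SimMIM-style setting would be (0.67, 1.0).
    pretrain_crop = T.Compose([
        T.RandomResizedCrop(224, scale=(0.2, 1.0)),
        T.RandomHorizontalFlip(),
        T.ToTensor(),
    ])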

mfm's People

Contributors

coincheung


mfm's Issues

Pretrained weights

Hi, thanks for your great reproduction. Could you provide the pretrained weights of your reproduced model?

Thanks

fft_masker.py Line 27, weights seem wrong.

Hi, the weights for converting an RGB image to a gray image should be 0.299, 0.587, and 0.114, but they are 0.229, 0.587, and 0.114 in your code. Perhaps that is a potential source of the discrepancy. Thanks for your great work.
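For reference, a minimal sketch of the standard (ITU-R BT.601) RGB-to-gray conversion the issue refers to; the function name and tensor layout here are illustrative, not the repo's actual fft_masker.py code:

    import torch

    # Correct BT.601 luma coefficients: 0.299, 0.587, 0.114
    # (the issue points out a likely typo: 0.229 instead of 0.299).
    GRAY_WEIGHTS = torch.tensor([0.299, 0.587, 0.114])

    def rgb_to_gray(img: torch.Tensor) -> torch.Tensor:
        # img: (N, 3, H, W) RGB batch -> (N, 1, H, W) gray batch.
        w = GRAY_WEIGHTS.to(img.dtype).to(img.device).view(1, 3, 1, 1)
        return (img * w).sum(dim=1, keepdim=True)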

On the focal frequency loss.

Hi,

I noticed that you removed the weight that balances the losses at different spectrum positions, which is part of the focal frequency loss.

I wonder why that is. Did you observe a performance degradation?

I'm working on a similar project that involves spectrum reconstruction and could use your advice. Intuitively, given the unbalanced distribution of the spectrum, the weight seems like a good choice.

Thank you!
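For context, a hedged sketch of the position-balancing weight the issue asks about, following the focal frequency loss of Jiang et al. (ICCV 2021); the alpha exponent and the per-sample max normalization follow that paper, not necessarily the code this repo removed:

    import torch

    def focal_frequency_loss(pred: torch.Tensor, target: torch.Tensor,
                             alpha: float = 1.0) -> torch.Tensor:
        # pred/target: (N, C, H, W) images; returns a scalar loss.
        diff = torch.fft.fft2(pred, norm="ortho") - torch.fft.fft2(target, norm="ortho")
        dist = diff.real ** 2 + diff.imag ** 2  # squared error per frequency
        # Position-balancing weight: up-weights hard (large-error) frequencies,
        # normalized to [0, 1] per sample and detached from the graph.
        weight = dist.sqrt() ** alpha
        weight = weight / weight.amax(dim=(-2, -1), keepdim=True).clamp(min=1e-8)
        return (weight.detach() * dist).mean()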
