
mfm's Introduction

MFM

Unofficial code for the paper "Masked Frequency Modeling for Self-Supervised Visual Pre-Training" (https://arxiv.org/pdf/2206.07706.pdf).

Below are experiments with ResNet-50. Although a better result is achieved, the baseline here is also much higher than in the paper.

| method              | top-1 acc | pretrain | finetune |
|---------------------|-----------|----------|----------|
| paper scratch       | 78.1      | -        | -        |
| paper mfm pretrain  | 78.5      | -        | -        |
| scratch             | 78.542    | -        | link     |
| supervised pretrain | 78.942    | -        | link     |
| mfm pretrain        | 78.826    | link     | link     |

Note: "Supervised pretrain" means finetuning from the torchvision ResNet-50 weights (by setting pretrained=True). Here, supervised pretraining appears to work better than the proposed MFM pretraining.
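As a minimal sketch of what the three starting points in the table mean (illustrative only, not this repo's actual training entry point; the checkpoint filename below is hypothetical):

    import torch
    import torchvision

    # "supervised pretrain": start finetuning from the ImageNet-supervised
    # torchvision weights (newer torchvision versions use
    # weights=torchvision.models.ResNet50_Weights.IMAGENET1K_V1 instead).
    model = torchvision.models.resnet50(pretrained=True)

    # "scratch" starts from random init; "mfm pretrain" would instead load an
    # MFM-pretrained checkpoint (hypothetical filename):
    # model = torchvision.models.resnet50(pretrained=False)
    # model.load_state_dict(torch.load("mfm_pretrained.pth"), strict=False)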

Platform

  • pytorch 1.13.1
  • torchvision 0.14.1
  • dali 1.21.0
  • cuda 11.6
  • V100 GPU(32G) x 8
  • driver: 470.82.01

Dataset

Prepare the ImageNet train and val sets in the same way as the pytorch official classification example, then link them into the folder of this repo:

    $ mkdir -p imagenet
    $ ln -s /path/to/imagenet/train ./imagenet/train
    $ ln -s /path/to/imagenet/val ./imagenet/val

Train

The pretraining and finetuning commands are here.

More ablations

Here are some points that affect the results:

  1. finetune --val-resize-size
    When we evaluate the model after finetuning, we always resize the short side of the image to a fixed value before the center-crop operation. I find that the choice of this fixed short-side value can sometimes affect accuracy by a noticeable margin (a transform sketch follows this list). Taking "supervised pretrain" as an example:

    | val-resize-size | 234    | 235    | 236    |
    |-----------------|--------|--------|--------|
    | top-1 acc       | 78.856 | 78.942 | 78.794 |
  2. finetune with BCE loss is important
    We can see this by finetuning from scratch with CE (cross-entropy) loss versus BCE (binary cross-entropy) loss (a loss sketch follows this list):

    | loss      | CE     | BCE    |
    |-----------|--------|--------|
    | top-1 acc | 78.542 | 78.952 |
  3. pretrain random crop area
    We usually crop a region whose area is some ratio of the original image's area; the default range for this ratio in torchvision's RandomResizedCrop is 0.08-1.0. Different self-supervised learning methods prefer different ranges: for example, MAE uses 0.2-1.0, MAE3d uses 0.5-1.0, and SimMIM uses 0.67-1.0. Here I find the smaller lower bound of 0.2-1.0 is better (a crop sketch follows this list):

    | random area ratio | 0.67-1.0 | 0.2-1.0 | 0.1-1.0 |
    |-------------------|----------|---------|---------|
    | top-1 acc         | 78.770   | 78.826  | 78.842  |

    Though 0.1-1.0 is better than 0.2-1.0 here, I still use the latter, since with 0.1-1.0 the finetuning eval result is more affected by val-resize-size:

    | val-resize-size | 234    | 235    | 236    |
    |-----------------|--------|--------|--------|
    | 0.2-1.0         | 78.816 | 78.826 | 78.796 |
    | 0.1-1.0         | 78.730 | 78.842 | 78.738 |
  4. model variance
    Here I pretrain the model 4 times (2 runs on 8 V100 GPUs and 2 runs on 8 P40 GPUs) with identical configurations, then finetune each pretrained model 3 times (on 8 P40 GPUs). Results are listed below. The results vary by a large margin, so the good numbers above may partly be luck; I cannot yet claim to have certainly reproduced the results in the paper.

    | pretrain | finetune | acc1 (235) | mean/std     | overall mean/std |
    |----------|----------|------------|--------------|------------------|
    | round 1  | round 1  | 78.654     | 78.644/0.024 | 78.621/0.08      |
    |          | round 2  | 78.61      |              |                  |
    |          | round 3  | 78.668     |              |                  |
    | round 2  | round 1  | 78.646     | 78.642/0.122 |                  |
    |          | round 2  | 78.79      |              |                  |
    |          | round 3  | 78.49      |              |                  |
    | round 3  | round 1  | 78.516     | 78.612/0.073 |                  |
    |          | round 2  | 78.626     |              |                  |
    |          | round 3  | 78.694     |              |                  |
    | round 4  | round 1  | 78.608     | 78.584/0.080 |                  |
    |          | round 2  | 78.668     |              |                  |
    |          | round 3  | 78.476     |              |                  |
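Regarding point 1, here is a minimal sketch of the evaluation transform being ablated: resize the short side to --val-resize-size, then center-crop. The 224x224 crop size and the normalization constants are standard ImageNet values assumed here, not taken from this repo:

    import torchvision.transforms as T

    VAL_RESIZE_SIZE = 235  # the knob ablated above: 234 / 235 / 236

    val_transform = T.Compose([
        T.Resize(VAL_RESIZE_SIZE),  # an int resizes the short side only
        T.CenterCrop(224),          # assumed standard 224x224 eval crop
        T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])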
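Regarding point 2, a sketch of what finetuning with BCE instead of CE can look like, assuming plain one-hot targets (mixup or label smoothing, if used, would soften them):

    import torch
    import torch.nn.functional as F

    def bce_cls_loss(logits: torch.Tensor, labels: torch.Tensor,
                     num_classes: int = 1000) -> torch.Tensor:
        # BCE treats each of the 1000 classes as an independent sigmoid.
        targets = F.one_hot(labels, num_classes).float()
        return F.binary_cross_entropy_with_logits(logits, targets)

    # The CE baseline in the table above would instead be:
    # loss = F.cross_entropy(logits, labels)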
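Regarding point 3, the random area ratio is the scale argument of torchvision's RandomResizedCrop. A sketch of a pretraining crop pipeline with the chosen 0.2-1.0 range (the 224 input size and the horizontal flip are assumptions, not confirmed settings of this repo):

    import torchvision.transforms as T

    # scale=(0.2, 1.0) is the range chosen above; torchvision's default is
    # (0.08, 1.0), and a SimMIM-style setting would be (0.67, 1.0).
    pretrain_crop = T.Compose([
        T.RandomResizedCrop(224, scale=(0.2, 1.0)),
        T.RandomHorizontalFlip(),
        T.ToTensor(),
    ])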

mfm's People

Contributors

coincheung


mfm's Issues

Pretrained weights

Hi, thanks for your great reproduction. Could you provide the pretrained weights of your reproduced model?

Thanks

fft_masker.py Line 27, weights seem wrong.

Hi, the weights for converting an RGB image to a gray image should be 0.299, 0.587, and 0.114, but they are 0.229, 0.587, and 0.114 in your code. Perhaps that is a potential source of the discrepancy. Thanks for your great work.
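For reference, a minimal sketch of the standard (ITU-R BT.601) RGB-to-gray conversion the issue refers to; the function name and tensor layout here are illustrative, not the repo's actual fft_masker.py code:

    import torch

    # Correct BT.601 luma coefficients: 0.299, 0.587, 0.114
    # (the issue points out a likely typo: 0.229 instead of 0.299).
    GRAY_WEIGHTS = torch.tensor([0.299, 0.587, 0.114])

    def rgb_to_gray(img: torch.Tensor) -> torch.Tensor:
        # img: (N, 3, H, W) RGB batch -> (N, 1, H, W) gray batch.
        w = GRAY_WEIGHTS.to(img.dtype).to(img.device).view(1, 3, 1, 1)
        return (img * w).sum(dim=1, keepdim=True)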

On the focal frequency loss.

Hi,

I noticed that you removed the weight that balances the losses at different spectrum positions, which is part of the focal frequency loss.

I wonder why that is. Did you observe a performance degradation?

I'm working on a similar project that involves spectrum reconstruction and could use your advice. Intuitively, given the unbalanced distribution of the spectrum, the weight seems like a good choice.

Thank you!
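For context, a hedged sketch of the position-balancing weight the issue asks about, following the focal frequency loss of Jiang et al. (ICCV 2021); the alpha exponent and the per-sample max normalization follow that paper, not necessarily the code this repo removed:

    import torch

    def focal_frequency_loss(pred: torch.Tensor, target: torch.Tensor,
                             alpha: float = 1.0) -> torch.Tensor:
        # pred/target: (N, C, H, W) images; returns a scalar loss.
        diff = torch.fft.fft2(pred, norm="ortho") - torch.fft.fft2(target, norm="ortho")
        dist = diff.real ** 2 + diff.imag ** 2  # squared error per frequency
        # Position-balancing weight: up-weights hard (large-error) frequencies,
        # normalized to [0, 1] per sample and detached from the graph.
        weight = dist.sqrt() ** alpha
        weight = weight / weight.amax(dim=(-2, -1), keepdim=True).clamp(min=1e-8)
        return (weight.detach() * dist).mean()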
