Comments (11)

IcarusWizard avatar IcarusWizard commented on August 28, 2024

Hi,

The PatchShuffle class is doing two things in sequence:

  1. Create the mask. The CNN output is used here only to determine the dimensions.
  2. Apply the mask to the input.

You can of course implement these two steps separately, as two classes or functions. I implemented them this way only for convenience. It differs from the official implementation because the official code had not yet been released when I wrote mine.
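
A minimal sketch of that two-step split, assuming NumPy and a flat sequence of patch embeddings. The function names `make_patch_mask` and `apply_patch_mask` are hypothetical and are not taken from this repository or the official code:

```python
import numpy as np

def make_patch_mask(num_patches, mask_ratio, rng):
    """Step 1: build a boolean mask where True marks a patch to hide."""
    num_masked = int(num_patches * mask_ratio)
    perm = rng.permutation(num_patches)      # random patch order
    mask = np.zeros(num_patches, dtype=bool)
    mask[perm[:num_masked]] = True           # hide the first num_masked of them
    return mask

def apply_patch_mask(patches, mask):
    """Step 2: keep only the visible patches. patches: (num_patches, dim)."""
    return patches[~mask]

rng = np.random.default_rng(0)
patches = rng.normal(size=(16 * 16, 192))    # stand-in for the conv output
mask = make_patch_mask(patches.shape[0], mask_ratio=0.75, rng=rng)
visible = apply_patch_mask(patches, mask)
print(visible.shape)  # (64, 192)
```

Only the number of patches is needed to build the mask, which is why the conv output serves purely to fix the dimensions in step 1.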

It is also straightforward to work out which patch comes from which region of the image. Say your input is a 224x224 image and the patch size is 14: the conv then gives you a 16x16 grid of patches, and each patch on this grid comes from a non-overlapping 14x14 region of the original image.
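
As a rough illustration of that mapping, the helper below (a hypothetical `patch_region`, not part of the repository) converts a grid position into pixel bounds, assuming non-overlapping patches laid out row by row:

```python
def patch_region(row, col, patch_size=14):
    """Return (top, bottom, left, right) pixel bounds, end-exclusive."""
    top = row * patch_size
    left = col * patch_size
    return top, top + patch_size, left, left + patch_size

image_size, patch_size = 224, 14
grid = image_size // patch_size
print(grid)                   # 16
print(patch_region(0, 0))     # (0, 14, 0, 14)
print(patch_region(15, 15))   # (210, 224, 210, 224)
```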

from mae.

amirrezadolatpour2000 avatar amirrezadolatpour2000 commented on August 28, 2024

Hi, thank you for sharing the code.
Why didn't you use the sine-cosine positional embedding, as mentioned in the paper?

IcarusWizard avatar IcarusWizard commented on August 28, 2024

I can't find where they mention using sin-cos positional embedding in the paper. Actually, the original ViT paper clearly states that a "learned" positional encoding is added after patchification. Also, for images it is not necessary to use sin-cos positional encoding, since there is no extrapolation beyond the trained length. Could you point out where you read it?

amirrezadolatpour2000 avatar amirrezadolatpour2000 commented on August 28, 2024

Sure, in the paper https://arxiv.org/abs/2111.06377, on page 11, first paragraph.

IcarusWizard avatar IcarusWizard commented on August 28, 2024

Ah, I see. Thanks for the reference. I didn't pay much attention to this detail. But, as I said, I don't think it will make a large difference to the result. Feel free to experiment with it.

IcarusWizard avatar IcarusWizard commented on August 28, 2024

Also, I just checked their official code, and they don't even follow this detail. The code uses the ViT model from timm, which follows the ViT paper and uses a learned positional encoding.

amirrezadolatpour2000 avatar amirrezadolatpour2000 commented on August 28, 2024

https://github.com/facebookresearch/mae/blob/main/models_mae.py
You can see that they use a frozen positional embedding built with the sine-cosine approach.
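
For reference, a fixed 2D sine-cosine embedding can be sketched roughly as follows. This is an illustrative NumPy version written for this thread, not a copy of the official code; the function names `sincos_1d` and `sincos_2d` and the choice to give half of the channels to rows and half to columns are assumptions:

```python
import numpy as np

def sincos_1d(embed_dim, positions):
    """Encode 1D positions as [sin, cos] features of size embed_dim."""
    assert embed_dim % 2 == 0
    # Frequencies fall geometrically from 1 towards 1/10000.
    omega = np.arange(embed_dim // 2, dtype=np.float64) / (embed_dim / 2)
    omega = 1.0 / (10000 ** omega)
    angles = np.outer(positions, omega)              # (N, embed_dim / 2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

def sincos_2d(embed_dim, grid_size):
    """Half of the channels encode the row index, half the column index."""
    assert embed_dim % 4 == 0
    rows, cols = np.meshgrid(
        np.arange(grid_size), np.arange(grid_size), indexing="ij"
    )
    emb_rows = sincos_1d(embed_dim // 2, rows.reshape(-1))
    emb_cols = sincos_1d(embed_dim // 2, cols.reshape(-1))
    return np.concatenate([emb_rows, emb_cols], axis=1)

pos_embed = sincos_2d(embed_dim=192, grid_size=16)
print(pos_embed.shape)  # (256, 192)
```

Because the table is computed once and never updated, it would be stored as a non-trainable buffer and added to the patch embeddings, rather than registered as a learnable parameter.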

IcarusWizard avatar IcarusWizard commented on August 28, 2024

Ah, thanks for the correction. I had looked at the wrong file. In that case, I don't know why they chose not to follow the ViT architecture precisely.

amirrezadolatpour2000 avatar amirrezadolatpour2000 commented on August 28, 2024

Based on what I have studied, there are no specific rules for choosing the positional embedding. However, I want to try the sine-cosine approach and see the result. If I test it, I will let you know.
I also want to be sure that this implementation covers the other details mentioned in the paper. I have checked it, but I want to be sure.

IcarusWizard avatar IcarusWizard commented on August 28, 2024

Oh, I don't think I followed all the details from the paper precisely. As stated in the readme, the purpose of this code is only to verify the idea of MAE, not to replicate it exactly. For example, I think I didn't implement the normalization for the reconstruction loss. There could be more details that I missed.
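
For anyone who wants to add that detail, here is a rough sketch of a per-patch normalized reconstruction loss, computed only on the masked patches as the paper describes. It assumes NumPy arrays and a boolean mask where True means "masked"; the function name `normalized_patch_mse`, the `eps` value, and the use of NumPy's default (population) variance are assumptions, so it may differ in small details from the official code:

```python
import numpy as np

def normalized_patch_mse(pred, target, mask, eps=1e-6):
    """MSE against per-patch normalized pixels, averaged over masked patches.

    pred, target: (num_patches, patch_dim); mask: (num_patches,) bool.
    """
    mean = target.mean(axis=-1, keepdims=True)
    var = target.var(axis=-1, keepdims=True)
    target = (target - mean) / np.sqrt(var + eps)    # normalize each patch
    per_patch = ((pred - target) ** 2).mean(axis=-1)
    return per_patch[mask].mean()                    # visible patches ignored

rng = np.random.default_rng(0)
target = rng.normal(size=(256, 14 * 14 * 3))         # flattened RGB patches
mask = np.zeros(256, dtype=bool)
mask[:192] = True                                    # 75% of patches masked
pred = np.zeros_like(target)
loss = normalized_patch_mse(pred, target, mask)
# An all-zero prediction gives a loss close to 1.0, since every
# normalized patch has roughly unit variance.
print(loss)
```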

