
Comments (5)

thuanz123 commented on August 23, 2024

> Thanks for the reply, I will try and see the results. There is another question about the feed-forward layers in the model. In section 3.1 of the paper, the authors say "At the output of transformer blocks, we apply a two-layer feed-forward network with a tanh activation layer in the middle." In my opinion, and in the implementation of BEIT2 (which uses a similar architecture to ViT-VQGAN) https://github.com/microsoft/unilm/blob/152193af4b295ae39cf0c2a492da3ee5cc5abe29/beit2/modeling_vqkd.py#L87, they add these layers after the encoder and the decoder, while in your implementation you just change the MLP's activation function to tanh in each transformer block. Do you have any experiments that can explain which one is better?

Yeah, BEIT2 is closer to the authors' description than mine, but my implementation is good enough for me, so I don't want any breaking changes to the model. That said, in my emails to the authors, as well as in some others' emails, the authors answered that the extra FFN, or the activation inside the FFN, does not matter very much.
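For illustration, here is a minimal PyTorch sketch of the two placements discussed above. All class names here are hypothetical, not the actual code of enhancing-transformers or BEIT2:

```python
import torch
import torch.nn as nn

# (a) BEIT2-style (hypothetical sketch): a single two-layer FFN with a tanh
#     in the middle, applied once to the output of the whole encoder/decoder.
class OutputFFN(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.Tanh(),  # the "tanh activation layer in the middle"
            nn.Linear(hidden, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# (b) enhancing-transformers-style, as described in the thread: keep the
#     usual per-block MLP but swap its activation (normally GELU) for tanh.
class BlockMLP(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)
        self.act = nn.Tanh()  # replaces the default nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(self.act(self.fc1(x)))
```

In (a) every transformer block keeps its standard MLP and the extra FFN runs once at the end; in (b) there is no extra module at all, only a different activation inside every block.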


thuanz123 commented on August 23, 2024

I will try to report the FID, but I am quite busy these days, so maybe next week I will calculate it.


thuanz123 commented on August 23, 2024

> Hi, thanks for sharing the implementation. I wonder why you both use pre-norm for each layer in the transformer block and also norm the output of both the ViT encoder and the ViT decoder.

hi @zyf0619sjtu, in all of my experiments, that last simple norm turns out to be very, very, very important. Without it, the output quality is very poor and the training is very unstable. You can try it yourself and see 😅
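To make the point concrete, here is a minimal sketch of a pre-norm ViT encoder with that extra output norm, assuming standard PyTorch modules (hypothetical names, not the repo's actual classes). The likely rationale: with pre-norm blocks, nothing normalizes the residual stream after the last block, so its magnitude can drift freely; the final LayerNorm bounds the encoder/decoder output.

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Pre-norm transformer block: LayerNorm is applied *before* each sublayer."""
    def __init__(self, dim: int, heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.Tanh(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))

class ViTEncoder(nn.Module):
    def __init__(self, dim: int = 256, depth: int = 4, heads: int = 4):
        super().__init__()
        self.blocks = nn.ModuleList(PreNormBlock(dim, heads) for _ in range(depth))
        # The "last simple norm" discussed above: with pre-norm blocks,
        # nothing else normalizes the residual stream at the output.
        self.final_norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for blk in self.blocks:
            x = blk(x)
        return self.final_norm(x)
```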



zyf0619sjtu commented on August 23, 2024

Thanks for the reply, I will try and see the results. There is another question about the feed-forward layers in the model. In section 3.1 of the paper, the authors say "At the output of transformer blocks, we apply a two-layer feed-forward network with a tanh activation layer in the middle." In my opinion, and in the implementation of BEIT2 (which uses a similar architecture to ViT-VQGAN) https://github.com/microsoft/unilm/blob/152193af4b295ae39cf0c2a492da3ee5cc5abe29/beit2/modeling_vqkd.py#L87, they add these layers after the encoder and the decoder, while in your implementation you just change the MLP's activation function to tanh in each transformer block. Do you have any experiments that can explain which one is better?

Also, in BEIT2's implementation they keep ViT's default GELU() activation function in all MLP layers.


zyf0619sjtu commented on August 23, 2024

Thanks for this information, it's a great help. Regarding your released ImageNet stage-1 models, what is the FID performance on the validation set?
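For anyone wanting to reproduce such a number: a common recipe is to save the validation images and the model's reconstructions into two folders and compare them with an off-the-shelf FID tool. A minimal sketch using the torch-fidelity package (the directory paths below are placeholders, not paths from this repo):

```python
# pip install torch-fidelity
import torch_fidelity

# Compare original ImageNet validation images against the model's
# reconstructions; both directories hold plain image files.
metrics = torch_fidelity.calculate_metrics(
    input1='imagenet_val/originals',        # placeholder path
    input2='imagenet_val/reconstructions',  # placeholder path
    cuda=True,
    fid=True,
)
print(metrics['frechet_inception_distance'])
```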

