
Comments (5)

thuanz123 commented on August 23, 2024

> Thanks for the reply, I will try and see the results. There is another question about the feed-forward layers in the model. In section 3.1 of the paper, the authors say "At the output of transformer blocks, we apply a two-layer feed-forward network with a tanh activation layer in the middle." In my opinion, and in the implementation of BEIT2 (which uses a similar architecture to ViT-VQGAN) https://github.com/microsoft/unilm/blob/152193af4b295ae39cf0c2a492da3ee5cc5abe29/beit2/modeling_vqkd.py#L87, they add these layers after the encoder and the decoder, while in your implementation you just change the MLP's activation function to tanh in each transformer block. Do you have any experiments that can explain which one is better?

Yeah, BEIT2 is closer to the authors' description than mine, but my implementation is good enough for me, so I don't want any breaking changes to the model. That said, in my emails to the authors, as well as in some others' emails, the authors answered that the extra FFN, or the activation inside the FFN, does not matter very much.
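For illustration, here is a minimal PyTorch sketch of the two placements discussed above. All class names here are hypothetical, not the actual code of enhancing-transformers or BEIT2:

```python
import torch
import torch.nn as nn

# (a) BEIT2-style (hypothetical sketch): a single two-layer FFN with a tanh
#     in the middle, applied once to the output of the whole encoder/decoder.
class OutputFFN(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.Tanh(),  # the "tanh activation layer in the middle"
            nn.Linear(hidden, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# (b) enhancing-transformers-style, as described in the thread: keep the
#     usual per-block MLP but swap its activation (normally GELU) for tanh.
class BlockMLP(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)
        self.act = nn.Tanh()  # replaces the default nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(self.act(self.fc1(x)))
```

In (a) every transformer block keeps its standard MLP and the extra FFN runs once at the end; in (b) there is no extra module at all, only a different activation inside every block.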


thuanz123 commented on August 23, 2024

I will try to report the FID, but I am quite busy these days, so maybe next week I will calculate it.


thuanz123 commented on August 23, 2024

> Hi, thanks for sharing the implementation. I wonder why you both use pre-norm for each layer in the transformer block and also norm the output of both the ViT encoder and the ViT decoder.

hi @zyf0619sjtu, in all of my experiments, that last simple norm turns out to be very, very, very important. Without it, the output quality is very poor and the training is very unstable. You can try it yourself and see 😅
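To make the point concrete, here is a minimal sketch of a pre-norm ViT encoder with that extra output norm, assuming standard PyTorch modules (hypothetical names, not the repo's actual classes). The likely rationale: with pre-norm blocks, nothing normalizes the residual stream after the last block, so its magnitude can drift freely; the final LayerNorm bounds the encoder/decoder output.

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Pre-norm transformer block: LayerNorm is applied *before* each sublayer."""
    def __init__(self, dim: int, heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.Tanh(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))

class ViTEncoder(nn.Module):
    def __init__(self, dim: int = 256, depth: int = 4, heads: int = 4):
        super().__init__()
        self.blocks = nn.ModuleList(PreNormBlock(dim, heads) for _ in range(depth))
        # The "last simple norm" discussed above: with pre-norm blocks,
        # nothing else normalizes the residual stream at the output.
        self.final_norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for blk in self.blocks:
            x = blk(x)
        return self.final_norm(x)
```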



zyf0619sjtu commented on August 23, 2024

Thanks for the reply, I will try and see the results. There is another question about the feed-forward layers in the model. In section 3.1 of the paper, the authors say "At the output of transformer blocks, we apply a two-layer feed-forward network with a tanh activation layer in the middle." In my opinion, and in the implementation of BEIT2 (which uses a similar architecture to ViT-VQGAN) https://github.com/microsoft/unilm/blob/152193af4b295ae39cf0c2a492da3ee5cc5abe29/beit2/modeling_vqkd.py#L87, they add these layers after the encoder and the decoder, while in your implementation you just change the MLP's activation function to tanh in each transformer block. Do you have any experiments that can explain which one is better?

Also, in BEIT2's implementation they keep ViT's default GELU() activation function in all MLP layers.


zyf0619sjtu commented on August 23, 2024

Thanks for this information, it's a great help. Regarding your released ImageNet stage-1 models, what is the FID performance on the validation set?
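For anyone wanting to reproduce such a number: a common recipe is to save the validation images and the model's reconstructions into two folders and compare them with an off-the-shelf FID tool. A minimal sketch using the torch-fidelity package (the directory paths below are placeholders, not paths from this repo):

```python
# pip install torch-fidelity
import torch_fidelity

# Compare original ImageNet validation images against the model's
# reconstructions; both directories hold plain image files.
metrics = torch_fidelity.calculate_metrics(
    input1='imagenet_val/originals',        # placeholder path
    input2='imagenet_val/reconstructions',  # placeholder path
    cuda=True,
    fid=True,
)
print(metrics['frechet_inception_distance'])
```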

