Comments (4)
@egg-west Well, the reason I used this LayerNorm is that the Annotated Transformer implementation of "Attention Is All You Need" used this code, and I just copied it from there. So if anyone can answer this question, that would be seriously awesome.
from bert-pytorch.
I believe they should do similar things; however, there is a difference in implementation.
For a given input:
x = torch.tensor([1.,0.,0.,0.])
The Annotated Transformer version gives the output:
tensor([ 1.5000, -0.5000, -0.5000, -0.5000], grad_fn=<ThAddBackward>)
While torch.nn.LayerNorm gives:
tensor([ 1.7320, -0.5773, -0.5773, -0.5773], grad_fn=<AddcmulBackward>)
The layer_norm implementation in PyTorch is here:
https://github.com/pytorch/pytorch/blob/cca247635c6edb323176eeac7a18d3e9ab71c558/caffe2/python/helpers/normalization.py
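The gap between the two outputs above can be reproduced without PyTorch at all. A minimal pure-Python sketch (function names are mine, not from either codebase), assuming the Annotated Transformer computes `(x - mean) / (std + eps)` with the unbiased std (what `torch.std` returns by default), while `torch.nn.LayerNorm` computes `(x - mean) / sqrt(var + eps)` with the biased variance:

```python
import math

def annotated_transformer_ln(x, eps=1e-6):
    # Unbiased std (Bessel's correction, divisor n-1), eps added OUTSIDE the sqrt,
    # mirroring the Annotated Transformer's LayerNorm (with a_2=1, b_2=0).
    n = len(x)
    mean = sum(x) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / (n - 1))
    return [(v - mean) / (std + eps) for v in x]

def torch_style_ln(x, eps=1e-5):
    # Biased variance (divisor n), eps added INSIDE the sqrt,
    # matching the torch.nn.LayerNorm formula (with weight=1, bias=0).
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    return [(v - mean) / math.sqrt(var + eps) for v in x]

x = [1.0, 0.0, 0.0, 0.0]
print([round(v, 4) for v in annotated_transformer_ln(x)])  # [1.5, -0.5, -0.5, -0.5]
print([round(v, 4) for v in torch_style_ln(x)])            # [1.732, -0.5773, -0.5773, -0.5773]
```

So the visible numerical difference comes mainly from the unbiased-vs-biased standard deviation, with the eps placement contributing a much smaller perturbation.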
@egg-west Is your question solved? 👍
Thank you for the clarification. I guess pulling the epsilon out of the sqrt may speed up the computation.
But yes, they do the same thing.
Related Issues (20)
- how to fine tune model with trained weight
- GELU is available in PyTorch HOT 1
- The trained model is saved with the suffix .model.ep*; how should it be loaded and used afterwards?
- Why Segment Embedding number only 3? HOT 1
- Clarification on Padding Process in BERT Model Construction
- how to do Ner
- The SublayerConnection class called by the forward method in transformer.py: implementation of the residual connection and normalization HOT 1
- In the Next Sentence Prediction task, the original code may choose the same line when you try to sample the negative example
- An error occurred【AttributeError: type object 'BERT' has no attribute 'hidden'】
- IndexError HOT 6
- The exact English pretraining data and Chinese pretraining data that are exact same to the BERT paper's pretraining data.
- why language_model.py has different vectors HOT 1
- Why not use torch.no_grad when evaluating test data? HOT 1
- Does dataset/dataset.py have an error? HOT 1
- It keeps trying to use CUDA despite --with_cuda False option
- Pooler layer?
- bert-vocab? HOT 1
- why specify `ignore_index=0` in the NLLLoss function in BERTTrainer? HOT 1
- What dataset did you use to train model? HOT 2