
Comments (5)

richliao commented on May 19, 2024

Every sequence of LSTM outputs is 2D, and the context vector is 1D; their product is 1D. The context vector is trained to assign weights over the 2D output, so you can think of the result as a weighted vector that, ideally, gives more weight to the important tokens.
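As a sketch of this weighted-sum idea (plain numpy with made-up dimensions, not the actual Keras layer):

```python
import numpy as np

# Made-up dimensions: 4 timesteps of LSTM output, hidden size 3.
H = np.random.randn(4, 3)   # 2D: one row per token
u = np.random.randn(3)      # 1D trained context vector

scores = H @ u                                   # one scalar score per token
weights = np.exp(scores) / np.exp(scores).sum()  # normalize so weights sum to 1
weighted = weights @ H                           # weighted combination of the rows
```

The tokens the training pushes toward larger scores end up dominating the combination.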

from textclassifier.

miclatar commented on May 19, 2024

Hi, thanks for your answer.
However, I'm afraid I already understand this concept; my issue is with the tanh activation. In the paper it is applied to the dense layer's output before the multiplication with the context vector. In your implementation it is applied to the dot product of these vectors.

According to the code, we actually stack two linear operations on top of the GRU output: first the Dense layer, then the dot product with self.W, with no non-linearity in between. Theoretically, these could be collapsed into a single linear layer (as explained here).
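The collapse of the two stacked linear operations can be checked numerically (a numpy sketch with hypothetical shapes, 100-dim input to a 200-unit Dense layer):

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.standard_normal(100)         # stands in for one GRU output vector
W = rng.standard_normal((100, 200))  # Dense layer weights (no activation)
b = rng.standard_normal(200)         # Dense layer bias
u = rng.standard_normal(200)         # stands in for the context vector self.W

two_step = (h @ W + b) @ u           # Dense, then dot product, no non-linearity
one_step = h @ (W @ u) + b @ u       # the same map as a single linear operation
assert np.allclose(two_step, one_step)
```

Without a non-linearity in between, the Dense layer adds parameters but no expressive power beyond a single linear scoring function.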

Again, maybe I'm missing something; I'd be glad for an explanation :)


richliao commented on May 19, 2024

Which equation are you referring to? The tanh activation in my code corresponds to equations (5) and (8); h_it is the GRU output.


miclatar commented on May 19, 2024

I'll try to be as rigorous as possible:

(194) l_lstm_sent = Bidirectional(GRU(100, return_sequences=True))(review_encoder)
(195) l_dense_sent = TimeDistributed(Dense(200))(l_lstm_sent)
(196) l_att_sent = AttLayer()(l_dense_sent)

These are lines 194-196 in the code, referring to the upper hierarchy layer.

(5) u_it = tanh(W_w * h_it + b_w)
(6) a_it = exp(u_it^T * u_w) / Σ_t exp(u_it^T * u_w)

And these are equations 5 and 6 from the paper. The case is the same for lines 187-189 in the code and equations 8-10, but I'll demonstrate only on this part.
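For reference, equations 5 and 6 written out directly (a numpy sketch; W_w, b_w, u_w are the paper's symbols, the shapes are hypothetical):

```python
import numpy as np

def paper_attention(H, W_w, b_w, u_w):
    """Equations (5)-(6) as stated in the paper, over all timesteps at once."""
    u = np.tanh(H @ W_w + b_w)                 # (5): u_it = tanh(W_w h_it + b_w)
    scores = u @ u_w                           # u_it^T u_w, one score per timestep
    return np.exp(scores) / np.exp(scores).sum()  # (6): softmax over t
```

The point below is where, relative to this reference, the code applies the tanh.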

As you've said, h_it is the GRU output. In line 195 it is passed through a Dense layer, which implements the W_w * h_it + b_w part. My question concerns the next step.

According to the code, this output is now passed through the attention layer. Note that line 195 has no activation, so we proceed with only the inner linear part of equation 5, rather than with u_it. More specifically, the next operation takes place in the layer's call() method:

(174) eij = K.tanh(K.dot(x, self.W))
(175)
(176) ai = K.exp(eij)
(177) weights = ai/K.sum(ai, axis=1).dimshuffle(0,'x')

x being the input of the layer, i.e. literally W_w * h_it + b_w. The first thing computed inside the parentheses of line 174 is therefore (W_w * h_it + b_w) * u_w, where u_w == self.W is the context vector. According to the paper, however, this product appears only in equation 6; we have skipped the tanh operation.

Only then, in line 174, is the tanh applied to the product. Note that in the paper this product is fed directly into the exp of equation 6, without any non-linearity in between.

To my understanding, this is a different procedure from the one in the paper. I may be wrong, or possibly this somehow leads to similar behavior, but I'd just like to hear why :)
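The difference between the two orderings can be checked numerically (a numpy sketch with made-up shapes):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal((5, 8))   # stands in for the Dense output W_w*h_it + b_w
u = rng.standard_normal(8)        # stands in for the context vector self.W

paper_scores = np.tanh(x) @ u     # paper: tanh first, then the dot product
code_scores = np.tanh(x @ u)      # code:  dot product first, then tanh

# The two orderings generally disagree; note also that code_scores is
# squashed into (-1, 1), so after exp the attention weights cannot be
# sharper than an exp(1)/exp(-1) ratio, while paper_scores is unbounded.
```

That bounded-score effect is one concrete way the two procedures behave differently, not just a reparameterization.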

Thanks!


richliao commented on May 19, 2024

Ha, you found a HUGE bug in my code that I hadn't realized. I'm quite sure you are the first one to point it out, even though someone once asked why I use the (deprecated) TimeDistributed dense function.

The bug is that I placed the tanh in the wrong place, in the wrong order. TimeDistributed(Dense(200))(l_lstm_sent) is intended to be a one-layer MLP, and as you said, there should be a tanh activation before the dot product. The solution is either:

  1. (195) l_dense_sent = TimeDistributed(Dense(200, activation='tanh'))(l_lstm_sent)
     (174) eij = K.dot(x, self.W)  (tanh removed)
     or
  2. (195) l_dense_sent = TimeDistributed(Dense(200))(l_lstm_sent)  (unchanged)
     (174) eij = K.dot(K.tanh(x), self.W)  (order swapped)
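Fix 1 can be traced end to end in plain numpy (hypothetical shapes; np.tanh stands in for the Dense activation, and the final weighted sum mirrors what the attention layer returns):

```python
import numpy as np

rng = np.random.default_rng(2)
H_dense = rng.standard_normal((10, 200))  # linear Dense output, one row per timestep
W = rng.standard_normal(200)              # stands in for the layer's self.W

# Fix 1: the tanh moves into the Dense layer, so the attention
# layer itself only scores and softmaxes.
x = np.tanh(H_dense)        # Dense(200, activation='tanh') output
eij = x @ W                 # eij = K.dot(x, self.W) -- no tanh here anymore
ai = np.exp(eij)
weights = ai / ai.sum()     # softmax over the 10 timesteps
attended = weights @ x      # weighted sum of the layer's input
```

This now matches the equation order: non-linearity inside u_it, then the dot product with the context vector, then the softmax.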

It has been so long that I had to reread the paper to bring back the memory. I hope I didn't make a mistake again. Let me know :)

