
Comments (5)

richliao commented on May 19, 2024

Every sequence of LSTM outputs is 2D, and the context vector is 1D; their product is 1D. The context vector is trained to assign weights over the 2D output, so you can think of the result as a weighted vector that, ideally, gives more weight to the important tokens.
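As a sketch of this weighted-sum idea (plain numpy with made-up dimensions, not the actual Keras layer):

```python
import numpy as np

# Made-up dimensions: 4 timesteps of LSTM output, hidden size 3.
H = np.random.randn(4, 3)   # 2D: one row per token
u = np.random.randn(3)      # 1D trained context vector

scores = H @ u                                   # one scalar score per token
weights = np.exp(scores) / np.exp(scores).sum()  # normalize so weights sum to 1
weighted = weights @ H                           # weighted combination of the rows
```

The tokens the training pushes toward larger scores end up dominating the combination.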

from textclassifier.

miclatar commented on May 19, 2024

Hi, thanks for your answer.
However, I'm afraid I already understand this concept; my issue is with the tanh activation. In the paper it is applied to the dense layer's output before the multiplication with the context vector. In your implementation it is applied to the dot product of these vectors.

According to the code, we actually stack two linear operations on top of the GRU output: first the Dense layer, then the dot product with self.W, with no non-linearity in between. Theoretically, these could be collapsed into a single linear layer (as explained here).
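The collapse of the two stacked linear operations can be checked numerically (a numpy sketch with hypothetical shapes, 100-dim input to a 200-unit Dense layer):

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.standard_normal(100)         # stands in for one GRU output vector
W = rng.standard_normal((100, 200))  # Dense layer weights (no activation)
b = rng.standard_normal(200)         # Dense layer bias
u = rng.standard_normal(200)         # stands in for the context vector self.W

two_step = (h @ W + b) @ u           # Dense, then dot product, no non-linearity
one_step = h @ (W @ u) + b @ u       # the same map as a single linear operation
assert np.allclose(two_step, one_step)
```

Without a non-linearity in between, the Dense layer adds parameters but no expressive power beyond a single linear scoring function.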

Again, maybe I'm missing something; I'd be glad for an explanation :)


richliao commented on May 19, 2024

Which equation are you referring to? The tanh activation in my code corresponds to equations (5) and (8); h_it is the GRU output.


miclatar commented on May 19, 2024

I'll try to be as rigorous as possible:

(194) l_lstm_sent = Bidirectional(GRU(100, return_sequences=True))(review_encoder)
(195) l_dense_sent = TimeDistributed(Dense(200))(l_lstm_sent)
(196) l_att_sent = AttLayer()(l_dense_sent)

These are lines 194-196 in the code, referring to the upper hierarchy layer.

(5) u_it = tanh(W_w * h_it + b_w)
(6) a_it = exp(u_it^T * u_w) / Σ_t exp(u_it^T * u_w)

And these are equations 5 and 6 from the paper. The case is the same for lines 187-189 in the code and equations 8-10, but I'll demonstrate only on this part.
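For reference, equations 5 and 6 written out directly (a numpy sketch; W_w, b_w, u_w are the paper's symbols, the shapes are hypothetical):

```python
import numpy as np

def paper_attention(H, W_w, b_w, u_w):
    """Equations (5)-(6) as stated in the paper, over all timesteps at once."""
    u = np.tanh(H @ W_w + b_w)                 # (5): u_it = tanh(W_w h_it + b_w)
    scores = u @ u_w                           # u_it^T u_w, one score per timestep
    return np.exp(scores) / np.exp(scores).sum()  # (6): softmax over t
```

The point below is where, relative to this reference, the code applies the tanh.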

As you've said, h_it is the GRU output. In line 195 it is passed through a Dense layer, which implements the W_w * h_it + b_w part. My question concerns the next step.

According to the code, this output is now passed through the attention layer. Note that line 195 has no activation, so we proceed with only the inner linear part of equation 5, rather than with u_it. More specifically, the next operation takes place in the layer's call() method:

(174) eij = K.tanh(K.dot(x, self.W))
(175)
(176) ai = K.exp(eij)
(177) weights = ai/K.sum(ai, axis=1).dimshuffle(0,'x')

x being the input of the layer, i.e. literally W_w * h_it + b_w. The first thing computed inside the parentheses of line 174 is therefore (W_w * h_it + b_w) * u_w, where u_w == self.W is the context vector. According to the paper, however, this product appears only in equation 6; we have skipped the tanh operation.

Only then, in line 174, is the tanh applied to the product. Note that in the paper this product is fed directly into the exp of equation 6, without any non-linearity in between.

To my understanding, this is a different procedure from the one in the paper. I may be wrong, or possibly this somehow leads to similar behavior, but I'd just like to hear why :)
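The difference between the two orderings can be checked numerically (a numpy sketch with made-up shapes):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal((5, 8))   # stands in for the Dense output W_w*h_it + b_w
u = rng.standard_normal(8)        # stands in for the context vector self.W

paper_scores = np.tanh(x) @ u     # paper: tanh first, then the dot product
code_scores = np.tanh(x @ u)      # code:  dot product first, then tanh

# The two orderings generally disagree; note also that code_scores is
# squashed into (-1, 1), so after exp the attention weights cannot be
# sharper than an exp(1)/exp(-1) ratio, while paper_scores is unbounded.
```

That bounded-score effect is one concrete way the two procedures behave differently, not just a reparameterization.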

Thanks!


richliao commented on May 19, 2024

Ha, you found a HUGE bug in my code that I hadn't realized. I'm quite sure you are the first one to point it out, even though someone once asked why I use the (deprecated) TimeDistributed dense function.

The bug is that I placed the tanh in the wrong place, in the wrong order. TimeDistributed(Dense(200))(l_lstm_sent) is intended to be a one-layer MLP, and as you said, there should be a tanh activation before the dot product. The solution is either:

  1. (195) l_dense_sent = TimeDistributed(Dense(200, activation='tanh'))(l_lstm_sent)
     (174) eij = K.dot(x, self.W)  (tanh removed)
     or
  2. (195) l_dense_sent = TimeDistributed(Dense(200))(l_lstm_sent)  (unchanged)
     (174) eij = K.dot(K.tanh(x), self.W)  (order swapped)
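Fix 1 can be traced end to end in plain numpy (hypothetical shapes; np.tanh stands in for the Dense activation, and the final weighted sum mirrors what the attention layer returns):

```python
import numpy as np

rng = np.random.default_rng(2)
H_dense = rng.standard_normal((10, 200))  # linear Dense output, one row per timestep
W = rng.standard_normal(200)              # stands in for the layer's self.W

# Fix 1: the tanh moves into the Dense layer, so the attention
# layer itself only scores and softmaxes.
x = np.tanh(H_dense)        # Dense(200, activation='tanh') output
eij = x @ W                 # eij = K.dot(x, self.W) -- no tanh here anymore
ai = np.exp(eij)
weights = ai / ai.sum()     # softmax over the 10 timesteps
attended = weights @ x      # weighted sum of the layer's input
```

This now matches the equation order: non-linearity inside u_it, then the dot product with the context vector, then the softmax.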

It has been so long that I had to reread the paper to bring back the memory. I hope I didn't make a mistake again. Let me know :)

