keras-attention's People

Contributors: callicles, szsen, zafarali

keras-attention's Issues

Keras in tensorflow 1.5

Your code does a good job of explaining the attention technique. After some small modifications to make it run on TensorFlow 1.5, I get another error:

code:

import tensorflow as tf
from tensorflow.python.keras._impl.keras import backend as K
from tensorflow.python.keras._impl.keras import regularizers, constraints, initializers, activations
from tensorflow.python.keras._impl.keras.layers.recurrent import Recurrent
from tensorflow.python.keras._impl.keras.layers import InputSpec
from .tdd import time_distributed_dense

tfPrint = lambda d, T: tf.Print(input_=T, data=[T, tf.shape(T)], message=d)

class AttentionDecoder(Recurrent):
    def __init__(self, units, output_dim,
                 activation='tanh',
                 return_probabilities=False,
                 name='AttentionDecoder', ....

Error:
Traceback (most recent call last):
File "/home/adrien/Tensorflow/Projets/keras-attention/run.py", line 109, in
main(args)
File "/home/adrien/Tensorflow/Projets/keras-attention/run.py", line 53, in main
return_probabilities=False)
File "/home/adrien/Tensorflow/Projets/keras-attention/models/NMT.py", line 47, in simpleNMT
trainable=trainable)(rnn_encoded)
File "/home/adrien/Tensorflow/Projets/keras-attention/models/custom_recurrents.py", line 54, in init
self.name = name
AttributeError: can't set attribute

I thank you in advance

Adrien

ValueError and TypeError in the custom_recurrents.py

Traceback (most recent call last):
File "C:\Users\Bruce Rogers\Anaconda3\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 455, in _apply_op_helper
as_ref=input_arg.is_ref)
File "C:\Users\Bruce Rogers\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py", line 1211, in internal_convert_n_to_tensor
ctx=ctx))
File "C:\Users\Bruce Rogers\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py", line 1146, in internal_convert_to_tensor
ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
File "C:\Users\Bruce Rogers\Anaconda3\lib\site-packages\tensorflow\python\framework\constant_op.py", line 229, in _constant_tensor_conversion_function
return constant(v, dtype=dtype, name=name)
File "C:\Users\Bruce Rogers\Anaconda3\lib\site-packages\tensorflow\python\framework\constant_op.py", line 208, in constant
value, dtype=dtype, shape=shape, verify_shape=verify_shape))
File "C:\Users\Bruce Rogers\Anaconda3\lib\site-packages\tensorflow\python\framework\tensor_util.py", line 430, in make_tensor_proto
raise ValueError("None values not supported.")
ValueError: None values not supported.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:/Users/Bruce Rogers/Artificial Intelligence/IMU Developer/conversation module/embedding+BidLstm+attentiondecoder/keras_FunctionAPI_model.py", line 18, in
attention_decoder_outputs = AttentionDecoder(settings.LSTM_neurons, data_setting.output_vocab_size)(encoder_bid_lstm_outputs)
File "C:\Users\Bruce Rogers\Anaconda3\lib\site-packages\keras\legacy\layers.py", line 513, in call
return super(Recurrent, self).call(inputs, **kwargs)
File "C:\Users\Bruce Rogers\Anaconda3\lib\site-packages\keras\engine\base_layer.py", line 457, in call
output = self.call(inputs, **kwargs)
File "C:\Users\Bruce Rogers\Artificial Intelligence\IMU Developer\conversation module\embedding+BidLstm+attentiondecoder\Attention.py", line 422, in call
return super(AttentionDecoder, self).call(x)
File "C:\Users\Bruce Rogers\Anaconda3\lib\site-packages\keras\legacy\layers.py", line 590, in call
input_length=timesteps)
File "C:\Users\Bruce Rogers\Anaconda3\lib\site-packages\keras\backend\tensorflow_backend.py", line 2922, in rnn
outputs, _ = step_function(inputs[0], initial_states + constants)
File "C:\Users\Bruce Rogers\Artificial Intelligence\IMU Developer\conversation module\embedding+BidLstm+attentiondecoder\Attention.py", line 466, in step
_stm = K.repeat(stm, self.timesteps)
File "C:\Users\Bruce Rogers\Anaconda3\lib\site-packages\keras\backend\tensorflow_backend.py", line 2137, in repeat
pattern = tf.stack([1, n, 1])
File "C:\Users\Bruce Rogers\Anaconda3\lib\site-packages\tensorflow\python\ops\array_ops.py", line 874, in stack
return gen_array_ops.pack(values, axis=axis, name=name)
File "C:\Users\Bruce Rogers\Anaconda3\lib\site-packages\tensorflow\python\ops\gen_array_ops.py", line 5856, in pack
"Pack", values=values, axis=axis, name=name)
File "C:\Users\Bruce Rogers\Anaconda3\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 483, in _apply_op_helper
raise TypeError("%s that don't all match." % prefix)
TypeError: Tensors in list passed to 'values' of 'Pack' Op have types [int32, <NOT CONVERTIBLE TO TENSOR>, int32] that don't all match.

about lstm layer number

hi:
thanks for sharing!
given the code
rnn_encoded = Bidirectional(LSTM(encoder_units, return_sequences=True),
name='bidirectional_1',
merge_mode='concat',
trainable=trainable)(input_embed)

Does this code mean that the LSTM has only one hidden layer, whose size is specified by encoder_units?
If so, how do I add more hidden layers?
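
Not an authoritative answer, just a hedged sketch building on the snippet above (encoder_units, trainable and input_embed as defined there): yes, that call creates a single bidirectional recurrent layer of width encoder_units. Stacking more hidden layers is usually done by feeding one recurrent layer into the next, keeping return_sequences=True on every layer whose output feeds another recurrent layer:

    rnn_encoded = Bidirectional(LSTM(encoder_units, return_sequences=True),
                                name='bidirectional_1', merge_mode='concat',
                                trainable=trainable)(input_embed)
    rnn_encoded = Bidirectional(LSTM(encoder_units, return_sequences=True),
                                name='bidirectional_2', merge_mode='concat',
                                trainable=trainable)(rnn_encoded)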

Vanishing Gradient Problem Occurred During Training

Hi, I am new to the attention mechanism and I found your codes, tutorials very helpful to beginners like me!

Currently, I am trying to use your attention decoder to do the sentiment analysis of the Sentiment140 Dataset. I have successfully constructed the following BiLSTM-with-attention model to split the positive and negative tweets:

def get_bi_lstm_with_attention_model(timesteps, features):
    input_shape = (timesteps, features)
    input = Input(shape=input_shape, dtype='float32')
    enc = Bidirectional(LSTM(100, dropout=0.2, recurrent_dropout=0.2, return_sequences=True, input_shape=input_shape),
                        merge_mode='concat', name='bidirectional_1')(input)
    y_hat = AttentionDecoder(units=100,output_dim=1, name='attention_decoder_1', activation='sigmoid')(enc)
    bilstm_attention_model = Model(inputs=input, outputs=y_hat)
    bilstm_attention_model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
    return bilstm_attention_model

However, when I use this model to fit my training data (a 1280000*50 matrix, batch_size=128; I reshape the data first to (int(1280000/5), 5, 50), following the rule input_shape = (batch_size, time_steps, input_dim)), the accuracy is very low (around 50%). My BiLSTM-without-attention model reaches at least 80% accuracy with the same hyperparameter settings. Hence, my question is: what's wrong with my current BiLSTM-with-attention model? I think it may be a vanishing gradient problem. I would really appreciate it if anyone could give me some guidelines on how to deal with this issue. Thank you very much!!

Attention model can not learn simple behaviour

For a long time I've tried to adapt your model to an OCR problem. At some point I found out that even with frozen encoder features, received from a CTC model (which performs well), I cannot reproduce the same performance. Then I made the simplest problem that any reasonable classifier should solve, a kind of autoencoder:

    import keras
    from keras.layers import Input, LSTM, Embedding, Dense
    from keras.models import Model

    import numpy as np
    n, t = 10000, 5
    n_labels = 10
    y = np.random.randint(0, n_labels, size=(n,t))

    inp = Input(shape=(t,), dtype='int64')
    emb = Embedding(n_labels, 10)(inp)
    #outp = Dense(n_labels, activation='softmax')(emb)
    outp = AttentionDecoder(10, n_labels)(emb)

    model = Model(inputs=[inp],outputs=[outp])
    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.summary()
    model.fit(y, np.expand_dims(y, -1), epochs=15)

Unsurprisingly, if Dense is used instead of AttentionDecoder, we get accuracy = 1 immediately after the first epoch. But with AttentionDecoder the model stalls at around accuracy = 0.5 with no further progress at all.

It seems to work well only if t <= 2, maybe due to the initial state, which is initialized from the first timestep: s0 = activations.tanh(K.dot(inputs[:, 0], self.W_s)), while the attention overfits on the second timestep. But even with t = 3, accuracy does not exceed 0.7, which is close to guessing two labels and returning the last one at random.

Any ideas?

Recurrent layer is no longer supported by keras

Please note that the Recurrent layer is no longer supported by Keras, and obtaining its source code to run it independently is not feasible because that source code also relies on modules which are no longer supported. How do you suggest addressing this?

Instability during training

I'm fairly new to this and for some reason I'm having crazy instability issues during training. I've seen the validation accuracy drop by over 10% at some point.

It's a many-to-many problem similar to pos tagging (vocab size much smaller). Input is an array of 40 integers (zero-padded), output is an array of 40 one-hot vectors. Any idea what I'm doing wrong?

max_seqlen = 40
s_vocabsize = 17
t_vocabsize = 124

embed_size = 64
hidden_size = 128

input_ = Input(shape=(max_seqlen,), dtype='float32')
input_embed = Embedding(s_vocabsize, embed_size, input_length=max_seqlen , mask_zero=True)(input_)

bi_lstm = Bidirectional(LSTM(HIDDEN_SIZE, dropout=0.2, recurrent_dropout=0.2, return_sequences=True), merge_mode='concat')(input_embed)
dropout = Dropout(0.8)(bi_lstm)

y_hat = AttentionDecoder(hidden_size , alphabet_size=t_vocabsize, embedding_dim=embed_size )(dropout)

model = Model(inputs=input_, outputs=y_hat)
model.compile(loss='categorical_crossentropy', optimizer="adam", metrics=["accuracy"])```

An operation has `None` for gradient

#one sample example
Input = [1015 4 2 0 0 0 0 0 0 0]
output = [ 65 116 2 0 0 0 0 0 0 0] (formatted as one-hot)

Also I have changed the following function in the AttentionDecoder, everything else remains the same:

from --> self._uxpb = _time_distributed_dense(self.x_seq, self.U_a, b=self.b_a, input_dim=self.input_dim, timesteps=self.timesteps, output_dim=self.units)

to --> dense = Dense(self.units, weights=self.U_a, input_dim=self.input_dim, bias=self.b_a)
self._uxpb = TimeDistributed(dense)(self.x_seq)

My model architecture is as follows:

encoder_inputs = keras.layers.Input(shape=(input_max_sentence_length,),name='encoder_inputs')

encoder_embedding = keras.layers.Embedding(input_dim=input_vocabulary_size, output_dim=embedding_dimension, input_length=input_max_sentence_length, mask_zero=True, name='encoder_embedding', trainable=True)(encoder_inputs)

lstm0_output_hidden_sequence = keras.layers.Bidirectional(keras.layers.LSTM(hidden_units, dropout=0, return_sequences=True, return_state=False, name='bidirectional', trainable=True))(encoder_embedding)

lstm01_output_hidden_sequence, lstm01_output_h, lstm01_output_c = keras.layers.LSTM(hidden_units, dropout=dropout_lstm_encoder, return_sequences=True, return_state=True, name='summarization')(lstm0_output_hidden_sequence)

attention_decoder = AttentionDecoder(hidden_units, output_vocabulary_size, trainable=True)(lstm01_output_hidden_sequence)

full_model = keras.models.Model(inputs=[encoder_inputs], outputs=[attention_decoder])

And I am facing the following error:

raise ValueError('An operation has None for gradient. '
ValueError: An operation has None for gradient. Please make sure that all of your ops have a gradient defined (i.e. are differentiable). Common ops without gradient: K.argmax, K.round, K.eval.

Attention on different input and output length

Hello
Thanks a lot for providing easy to understand tutorial and attention layer implementation.
I am trying to use attention on a dataset with different input and output length.
My training data sequences are of size 600*4 (600 4-dimensional points) and the one-hot encoded output is of size 70*66 (66 symbols represented in a 70-length vector). I have to map the 600-point sequence to the 70 symbols for ~15000 such sequences.
Right after the LSTM layer, I tried using a RepeatVector with the output length for a small dataset. I read that RepeatVector is used in encoder-decoder models where the output and input sequences are not of the same length. Here is what I tried:
x_train.shape=(50,600,4)
y_train.shape=(50,70,66)
inputs = Input(shape=(x_train.shape[1:]))
rnn_encoded = Bidirectional(LSTM(32, return_sequences=False),name='bidirectional_1',merge_mode='concat',trainable=True)(inputs)
encoded = RepeatVector(y_train.shape[1])(rnn_encoded)
y_hat = AttentionDecoder(70,name='attention_decoder_1',output_dim=y_train.shape[2], return_probabilities=False, trainable=True)(encoded)

But the prediction from this model always gives the same symbol in the output sequence on every run:
'decoded model output:', ['d', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I'])
('decoded original output:', ['A', ' ', 'M', 'O', 'V', 'E', ' ', 't', 'o', ' ', 's', 't', 'o', 'p', ' ', 'M', 'r', ' ', '.', ' ', 'G', 'a', 'i', 't', 's', 'k', 'e', 'l', 'l', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0'])
Could you please give an idea where I am going wrong and what can I possibly do to solve the problem?
Any help would be much appreciated.

Thanks
Aayushee

how to use pre-trained word embedding

model = Sequential()
model.add(Embedding(vocab_size, VOCAB_REP_DIM, input_length=WINDOW_SIZE, weights=[embedding_matrix]))
model.add(Bidirectional(LSTM(HIDDEN_DIM, return_sequences=True)))
model.add(AttentionDecoder(HIDDEN_DIM, vocab_size))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
model.summary()

I would like to use pre-trained word2vec embedding.
vocab_size = 149
VOCAB_REP_DIM = 100
WINDOW_SIZE = 10

But I got this error
ValueError: Error when checking input: expected embedding_1_input to have 2 dimensions, but got array with shape (152548, 10, 149)

Anyone know how to use pre-trained word embedding here?
Thanks in advance
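
A hedged note, not from the thread: judging by the error, the data passed to fit() is already one-hot encoded with shape (152548, 10, 149), while an Embedding layer expects 2-D integer index arrays of shape (samples, WINDOW_SIZE). A minimal sketch of the conversion (X_onehot is an assumed name for that array):

    import numpy as np

    # Hedged sketch: collapse the one-hot axis to get integer token indices.
    X_indices = np.argmax(X_onehot, axis=-1)   # (152548, 10, 149) -> (152548, 10)
    # model.fit(X_indices, y, ...) can then be called with the 2-D index array.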

Problem loading sample weights

I received an error when trying to load your saved weights:

padding = 50
weights = "weights/NMT.49-0.01.hdf5"
weights_file = os.path.expanduser(weights)
viz = Visualizer(padding=padding)
pred_model = simpleNMT(trainable=False,
                       pad_length=padding,
                       n_chars=viz.input_vocab.size(),
                       n_labels=viz.output_vocab.size())
pred_model.load_weights(weights_file, by_name=True)

Here's the error:

 ---------------------------------------------------------------------------
InvalidArgumentError                      Traceback (most recent call last)
/opt/conda/lib/python3.6/site-packages/tensorflow/python/framework/common_shapes.py in _call_cpp_shape_fn_impl(op, input_tensors_needed, input_tensors_as_shapes_needed, debug_python_shape_fn, require_shape_fn)
    670           graph_def_version, node_def_str, input_shapes, input_tensors,
--> 671           input_tensors_as_shapes, status)
    672   except errors.InvalidArgumentError as err:

/opt/conda/lib/python3.6/contextlib.py in __exit__(self, type, value, traceback)
     88             try:
---> 89                 next(self.gen)
     90             except StopIteration:

/opt/conda/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py in raise_exception_on_not_ok_status()
    465           compat.as_text(pywrap_tensorflow.TF_Message(status)),
--> 466           pywrap_tensorflow.TF_GetCode(status))
    467   finally:

InvalidArgumentError: Dimension 0 in both shapes must be equal, but are 1367 and 1426 for 'Assign' (op: 'Assign') with input shapes: [1367,1024], [1426,1024].

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-6-4acc701a7fe8> in <module>()
      7                        n_chars=viz.input_vocab.size(),
      8                        n_labels=viz.output_vocab.size())
----> 9 pred_model.load_weights(weights_file, by_name=True)

/opt/conda/lib/python3.6/site-packages/keras/engine/topology.py in load_weights(self, filepath, by_name)
   2568             f = f['model_weights']
   2569         if by_name:
-> 2570             load_weights_from_hdf5_group_by_name(f, self.layers)
   2571         else:
   2572             load_weights_from_hdf5_group(f, self.layers)

/opt/conda/lib/python3.6/site-packages/keras/engine/topology.py in load_weights_from_hdf5_group_by_name(f, layers)
   3109                 weight_value_tuples.append((symbolic_weights[i],
   3110                                             weight_values[i]))
-> 3111     K.batch_set_value(weight_value_tuples)

/opt/conda/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py in batch_set_value(tuples)
   2181                 assign_placeholder = tf.placeholder(tf_dtype,
   2182                                                     shape=value.shape)
-> 2183                 assign_op = x.assign(assign_placeholder)
   2184                 x._assign_placeholder = assign_placeholder
   2185                 x._assign_op = assign_op

/opt/conda/lib/python3.6/site-packages/tensorflow/python/ops/variables.py in assign(self, value, use_locking)
    514       the assignment has completed.
    515     """
--> 516     return state_ops.assign(self._variable, value, use_locking=use_locking)
    517 
    518   def assign_add(self, delta, use_locking=False):

/opt/conda/lib/python3.6/site-packages/tensorflow/python/ops/state_ops.py in assign(ref, value, validate_shape, use_locking, name)
    269     return gen_state_ops.assign(
    270         ref, value, use_locking=use_locking, name=name,
--> 271         validate_shape=validate_shape)
    272   return ref.assign(value)

/opt/conda/lib/python3.6/site-packages/tensorflow/python/ops/gen_state_ops.py in assign(ref, value, validate_shape, use_locking, name)
     43   result = _op_def_lib.apply_op("Assign", ref=ref, value=value,
     44                                 validate_shape=validate_shape,
---> 45                                 use_locking=use_locking, name=name)
     46   return result
     47 

/opt/conda/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py in apply_op(self, op_type_name, name, **keywords)
    765         op = g.create_op(op_type_name, inputs, output_types, name=scope,
    766                          input_types=input_types, attrs=attr_protos,
--> 767                          op_def=op_def)
    768         if output_structure:
    769           outputs = op.outputs

/opt/conda/lib/python3.6/site-packages/tensorflow/python/framework/ops.py in create_op(self, op_type, inputs, dtypes, input_types, name, attrs, op_def, compute_shapes, compute_device)
   2506                     original_op=self._default_original_op, op_def=op_def)
   2507     if compute_shapes:
-> 2508       set_shapes_for_outputs(ret)
   2509     self._add_op(ret)
   2510     self._record_op_seen_by_control_dependencies(ret)

/opt/conda/lib/python3.6/site-packages/tensorflow/python/framework/ops.py in set_shapes_for_outputs(op)
   1871       shape_func = _call_cpp_shape_fn_and_require_op
   1872 
-> 1873   shapes = shape_func(op)
   1874   if shapes is None:
   1875     raise RuntimeError(

/opt/conda/lib/python3.6/site-packages/tensorflow/python/framework/ops.py in call_with_requiring(op)
   1821 
   1822   def call_with_requiring(op):
-> 1823     return call_cpp_shape_fn(op, require_shape_fn=True)
   1824 
   1825   _call_cpp_shape_fn_and_require_op = call_with_requiring

/opt/conda/lib/python3.6/site-packages/tensorflow/python/framework/common_shapes.py in call_cpp_shape_fn(op, input_tensors_needed, input_tensors_as_shapes_needed, debug_python_shape_fn, require_shape_fn)
    608     res = _call_cpp_shape_fn_impl(op, input_tensors_needed,
    609                                   input_tensors_as_shapes_needed,
--> 610                                   debug_python_shape_fn, require_shape_fn)
    611     if not isinstance(res, dict):
    612       # Handles the case where _call_cpp_shape_fn_impl calls unknown_shape(op).

/opt/conda/lib/python3.6/site-packages/tensorflow/python/framework/common_shapes.py in _call_cpp_shape_fn_impl(op, input_tensors_needed, input_tensors_as_shapes_needed, debug_python_shape_fn, require_shape_fn)
    674       missing_shape_fn = True
    675     else:
--> 676       raise ValueError(err.message)
    677 
    678   if missing_shape_fn:

ValueError: Dimension 0 in both shapes must be equal, but are 1367 and 1426 for 'Assign' (op: 'Assign') with input shapes: [1367,1024], [1426,1024].


I'm using Keras 2.0.6 and tensorflow 1.2.1

Thanks

Attention probabilities

Observing your code and trying to work with different input and output lengths, I saw that in the AttentionDecoder implementation, for return_probabilities=True, the shape of the returned probabilities is (None, self.timesteps, self.timesteps).
So how do you get probabilities for varied input and output lengths?

cannot import name 'Layer' from 'keras.engine'

I want to add a custom attention layer to my model, but when I run the code in
#15 (mzbac commented on Feb 8, 2018), I received the following error:
cannot import name 'Layer' from 'keras.engine'
I think the problem is with importing Recurrent from keras.layers.recurrent.
Any suggestions to solve this issue?
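
A hedged workaround sketch, not a verified fix: the import paths moved between Keras releases, so both branches below are assumptions to check against the installed version (the keras.legacy path matches the tracebacks quoted elsewhere on this page):

    try:
        from keras.layers import Layer, InputSpec      # newer Keras releases
    except ImportError:
        from keras.engine import Layer, InputSpec      # older Keras releases
    try:
        from keras.legacy.layers import Recurrent      # Keras 2.1+ moved Recurrent to keras.legacy
    except ImportError:
        from keras.layers.recurrent import Recurrent   # very old Keras releases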

Attention Decoder for OutputDimension in tens of thousands.

Hi Zafarali,

I am trying to use your attention network to learn seq2seq machine translation with attention. My source-language vocab is of size 32,000 and the target vocab size is 34,000. The following step blows up the RAM usage while building the model (understandably, as it is trying to allocate a 34K x 34K float matrix):

	self.W_o = self.add_weight(shape=(self.output_dim, self.output_dim),
							   name='W_o',
							   initializer=self.recurrent_initializer,
							   regularizer=self.recurrent_regularizer,
							   constraint=self.recurrent_constraint)

Here is my model:
n_units:128, src_vocab_size:32000,tar_vocab_size:34000,src_max_length:11, tar_max_length:11

	def define_model(n_units, src_vocab_size, tar_vocab_size, src_max_length, tar_max_length):
		model = Sequential()
		model.add(Embedding(src_vocab_size, n_units, input_length=src_max_length, mask_zero=True))
		model.add(LSTM(n_units, return_sequences=True))
		model.add(AttentionDecoder(n_units, tar_vocab_size))
		return model

Is there any fix for this?
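
For scale (an added back-of-the-envelope estimate, not from the original thread): a 34,000 x 34,000 float32 matrix alone takes 34,000² x 4 bytes ≈ 4.6 GB before the optimizer allocates its per-weight slots, so this single W_o term explains the blow-up once the output vocabulary reaches the tens of thousands.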

Wrong (according to the article) hidden state computation

Hi!
Thank you for the code implementing attention, it was really helpful.

You are referencing the article https://arxiv.org/pdf/1409.0473.pdf
But it seems that your implementation differs from the article when computing the (proposed) hidden state.
The original formulas in the article (decoder GRU with attention, cf. appendix A.2.2):

s_i = (1 - z_i) ∘ s_{i-1} + z_i ∘ s̃_i
s̃_i = tanh(W E y_{i-1} + U [r_i ∘ s_{i-1}] + C c_i)
z_i = σ(W_z E y_{i-1} + U_z s_{i-1} + C_z c_i)
r_i = σ(W_r E y_{i-1} + U_r s_{i-1} + C_r c_i)

Note the weights in front of y_{i-1}: W is a weight matrix, but E is the embedding matrix.

Now let's look at what is implemented in AttentionDecoder:

        # now calculate the "z" gate
        zt = activations.sigmoid(
            K.dot(ytm, self.W_z)
            + K.dot(stm, self.U_z)
            + K.dot(context, self.C_z)
            + self.b_z)

        # calculate the proposal hidden state:
        s_tp = activations.tanh(
            K.dot(ytm, self.W_p)
            + K.dot((rt * stm), self.U_p)
            + K.dot(context, self.C_p)
            + self.b_p)

        # new hidden state:
        st = (1-zt)*stm + zt * s_tp

        yt = activations.softmax(
            K.dot(ytm, self.W_o)
            + K.dot(stm, self.U_o)
            + K.dot(context, self.C_o)
            + self.b_o)

So instead of the previous output word, ytm here is the softmax probability distribution over the labels from the previous timestep, and those probabilities are passed into the subsequent computation instead of the embedding vector.

This is a bug, or at least it can easily lead to overfitting - the softmax probabilities give much more information than the embedding of the previous word. Moreover, one of the weight matrices then has size (output_dim x output_dim). I suffered from this even with a small character vocabulary (hieroglyphs) of about 3000, and I can't imagine this code being used for translation with vocab sizes of hundreds of thousands.

Hope this will be fixed soon.

Significance of resetting states

In the build function, what's the significance of resetting the states ?

if self.stateful: super(AttentionDecoder, self).reset_states()

How to apply beam search ?

I want to apply beam search to the decoder, as described here. This requires some changes to the decoder but I can't figure out where. Any ideas?

UnicodeEncodeError: 'charmap' codec can't encode characters in position

python generate.py
creating dataset
Traceback (most recent call last):
File "D:\Anaconda3\lib\encodings\cp437.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 4-8: character maps to <undefined>
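
A hedged sketch of the usual fix on Windows, not a verified patch for this repo: open the output file with an explicit UTF-8 encoding instead of relying on the default cp437 codec. The write call below is copied from the UnicodeEncodeError issue further down this page; the filename is a placeholder.

    with open('training.csv', 'w', encoding='utf-8') as f:   # placeholder filename
        f.write('"' + h + '","' + m + '"\n')                 # h, m as in generate.py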

_time_distributed_dense is no longer available

_time_distributed_dense seems to be deprecated in recent versions of Keras. You are using it inside the call method in custom_recurrents.py:

self._uxpb = _time_distributed_dense(self.x_seq, self.U_a, b=self.b_a,
input_dim=self.input_dim,
timesteps=self.timesteps,
output_dim=self.units)

Would the following replacement be valid?

dense = Dense(self.units, weights=self.U_a, input_dim=self.input_dim)
self._uxpb = TimeDistributed(dense)(self.x_seq)
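
For reference, here is a minimal sketch of that helper, simplified from the version that shipped with older Keras releases (dropout handling omitted; treat it as an assumption to test rather than a drop-in from this repo). It applies the projection with backend ops, so no Dense layer has to be built inside call():

    from keras import backend as K

    def _time_distributed_dense(x, w, b=None, input_dim=None,
                                output_dim=None, timesteps=None):
        """Apply the same dense projection (x . w + b) to every timestep of a 3D tensor x."""
        if input_dim is None:
            input_dim = K.int_shape(x)[2]
        if timesteps is None:
            timesteps = K.int_shape(x)[1]
        if output_dim is None:
            output_dim = K.int_shape(w)[1]
        x = K.reshape(x, (-1, input_dim))                 # merge batch and time dimensions
        x = K.dot(x, w)                                   # (batch * timesteps, output_dim)
        if b is not None:
            x = K.bias_add(x, b)
        return K.reshape(x, (-1, timesteps, output_dim))  # split batch and time back apart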

Issue with the initial state in the attention-decoder

I left a comment on the Medium article, but I realized that it would be better placed here.

I am trying to use your simpleNMT and AttentionDecoder code and I get an error when trying to compile the model:

/PYTHONPATH/keras/layers/recurrent.py", line 314, in call
if len(initial_state) != len(self.states):
TypeError: object of type 'NoneType' has no len()

You set the states in your attentiondecoder to

self.states = [None, None] # y, s

Since the initial_state variable will always be set in the recurrent.py code, self.states must be incorrectly empty. Since len([None,None]) will still return a value of 2, something must be happening to this variable that I cannot figure out.

Long processing time when output voc size large

I tried to use this layer in a machine translation task. I set the output_dim in your Attention layer as the output voc size (10k+). It seems to take the network forever to compile on my laptop. Any advice? Thanks!

output_shapes[i] list index out of range

Hi. I am trying to do something very similar to your simpleNMT method at https://github.com/datalogue/keras-attention/blob/master/models/NMT.py . I found your AttentionDecoder and also the _time_distributed_dense() method at another site. I am getting the error below.

File "model.py", line 418, in embedding_model_lstm
recurrent_b, inner_lstmb_h, inner_lstmb_c = lstm_b(recurrent_a) ## <--- here
File "/home/dave/.local/lib/python3.6/site-packages/keras/legacy/layers.py", line 969, in call
return super(Recurrent, self).call(inputs, **kwargs)
File "/home/dave/.local/lib/python3.6/site-packages/keras/engine/topology.py", line 657, in call
arguments=user_kwargs)
File "/home/dave/.local/lib/python3.6/site-packages/keras/engine/topology.py", line 719, in _add_inbound_node
output_tensors[i]._keras_shape = output_shapes[i]
IndexError: list index out of range

some code is included below. Thanks for looking at this.

def embedding_model_lstm(words, units, tokens_per_sentence,
                         embedding_weights_a):

    lstm_unit_a = units
    lstm_unit_b = units  * 2
    embed_unit = 20000

    x_shape = (tokens_per_sentence,)
    decoder_dim = units * 2 

    valid_word_a = Input(shape=x_shape)

    embeddings_a = Embedding(words,embed_unit ,
                             weights=[embedding_weights_a],
                             input_length=tokens_per_sentence,
                             trainable=True)

    embed_a = embeddings_a(valid_word_a)

    lstm_a = Bidirectional(LSTM(units=lstm_unit_a,
                                return_sequences=True,
                                input_shape=(None,)
                                ), merge_mode='concat', trainable=True)

    recurrent_a = lstm_a(embed_a)

    lstm_b = AttentionDecoder(units=lstm_unit_b , output_dim=decoder_dim,
                  return_state=True)

    recurrent_b, inner_lstmb_h, inner_lstmb_c  = lstm_b(recurrent_a) ## <--- here

    dense_b = Dense(embed_unit, input_shape=(tokens_per_sentence,),
                    activation='softmax')

    decoder_b = dense_b(recurrent_b) 

    dropout_b = Dropout(0.15)(decoder_b)

    model = Model([valid_word_a], dropout_b)
    

Understanding the output when returning probabilities

Hi,
Thanks for this great job.
I don't fully understand the shape of the return value when returning probabilities. The shape is (self.timesteps, self.timesteps). In the visualization you squeeze this down to shape (0:output_length, 0:input_length), if I am not mistaken.
How do I relate these two? Could you elaborate a little?
I am trying to use it in a many-to-one RNN and it's a little complicated to get a useful visualization.
Thanks!

Can I visualize features selected from image sequence data

Thanks for your great work!
I am doing some research on analysing feature sequences with a CNN+RNN. The CNN model extracts features from the input sequences and sends them to my RNN model. The input sequence length for the RNN is 5, and for every time point there are 300 features. I see the example in your work is about text analysis. Can I visualize my RNN model on my image feature data?
Thanks for reading, and looking forward to your reply!

Stacked architecture

Hi Zafarali,

I was wondering how one would implement this code for an arbitrary number of stacked encoding and decoding layers. E.g. my architecture (shown below) contains 2 stacked LSTM layers, both in the encoding phase and the decoding phase (there is bi-directionality in the encoding phase):

PS: I am showing a batch model; I also have a stateful inference model, in which I transfer the weights and states over to the decoding LSTMs.

# Canonical Model
charset = list(vocab.charset)
# Callbacks
h = History()
rlr = ReduceLROnPlateau(monitor='val_loss', factor=0.5,patience=10, min_lr=0.000001, verbose=1, min_delta=1e-5)
es = EarlyStopping(monitor='val_loss', min_delta=1e-6, patience=10, verbose=0, mode='auto')
l2 = 0.00002 # Don't use a high regularization.
tb = TensorBoard(log_dir='./logs', histogram_freq=0, batch_size=250, update_freq='epoch')

#---------------------- Encoder--------------------------#
# Previously used 250
lstm_dim = 500
# Input_shape= X_train.shape[1:]

# Input shape
canonical_encoder_input = Input(shape=(None, len(charset)))

# First encoder layer
encoder_LSTM = Bidirectional(CuDNNLSTM(lstm_dim // 2, return_state=True, return_sequences=True, name='bd_enc_LSTM_01'))
encoder1_output, forward_h1, forward_c1, backward_h1, backward_c1 = encoder_LSTM (canonical_encoder_input)

# Second encoder layer
encoder_LSTM2 = Bidirectional(CuDNNLSTM(lstm_dim // 2, return_state=True, return_sequences=True, name='bd_enc_LSTM_02'))
encoder2_output, forward_h2, forward_c2, backward_h2, backward_c2 = encoder_LSTM2 (encoder1_output)

# Concatenate all states together
encoder_states = Concatenate(axis=-1)([forward_h1, forward_c1, forward_h2, forward_c2,
                                       backward_h1, backward_c1, backward_h2, backward_c2])

encoder_dense_layer = Dense(lstm_dim, kernel_regularizer=regularizers.l2(l2), activation='relu', name="enc_dense")
encoder_dense = encoder_dense_layer(encoder_states)
# Add dropout here?

print(type(encoder_dense))

#---------------------- states--------------------------#

# States for the first LSTM layer
canonical_decoder_input = Input(shape= (None, len(charset))) #teacher forcing
dense_h1 = Dense(lstm_dim, kernel_regularizer=regularizers.l2(l2), activation='relu', name="dec_dense_h1")
dense_c1 = Dense(lstm_dim, kernel_regularizer=regularizers.l2(l2), activation='relu', name="dec_dense_c1")
state_h1 = dense_h1(encoder_dense)
state_c1 = dense_c1(encoder_dense)
states1 =[state_h1, state_c1]


# States for the second LSTM layer
dense_h2 = Dense(lstm_dim, activation='relu', name="dec_dense_h2")
dense_c2 = Dense(lstm_dim, activation='relu', name="dec_dense_c2")
state_h2 = dense_h2(encoder_dense)
state_c2 = dense_c2(encoder_dense)
states2 =[state_h2, state_c2]

#------------------------Decoder------------------------#

# This goes through a decoding lstm
decoder_LSTM1 = CuDNNLSTM(lstm_dim, return_sequences=True, return_state=True, name='dec_LSTM_01')
decoder1_output,_,_ = decoder_LSTM1(canonical_decoder_input, initial_state=states1)


# Couple the first LSTM with the 2nd LSTM
decoder_LSTM2 = CuDNNLSTM(lstm_dim, return_sequences=True, return_state=True, name='dec_LSTM_02')
decoder2_output,_,_ = decoder_LSTM2(decoder1_output, initial_state=states2) 


# Pass hidden states of decoder2_outputs to dense layer with softmax
decoder_dense = Dense(len(charset), kernel_regularizer=regularizers.l2(l2),  activation='softmax', name="dec_dense_softmax")
decoder_out = decoder_dense(decoder2_output)

#----------------------compilations------------#

# Model compilation (canonical)
model = Model(inputs=[canonical_encoder_input, canonical_decoder_input], outputs=[decoder_out])
#Run training
start = time.time()

# Optimizers
#learning_rate = 0.002, #comment out if using exponential LearningRateScheduler
adam=Adam() 
rms=RMSprop() 

# Full (canonical) model
model.compile(optimizer=adam, loss='categorical_crossentropy')

# Custom exponential learning rate scheduler
lr_schedule = LearningRateSchedule(epoch_to_start=50, last_epoch=349)

lr_scheduler = LearningRateScheduler(schedule=lr_schedule.exp_decay, verbose=1)


# Fit
model.fit(x=[X_train, Y_train], 
          y=Y_train_target, 
          batch_size=250, 
          epochs=350,
          shuffle = True,
          validation_data = ([X_test, Y_test], Y_test_target),
          callbacks = [h, tb, lr_scheduler])

end = time.time()
print(end - start)
model.summary()

Is it possible to implement your code for this architecture?

best,

Dean

UnicodeEncodeError

I got this error running generate.py
UnicodeEncodeError: 'ascii' codec can't encode characters in position 4-6: ordinal not in range(128)
I solved it by replacing this line in generate.py:

f.write('"'+h + '","' + m + '"\n')
with

f.write(('"'+h + '","' + m + '"\n').encode('utf-8'))
Ivo

Bad results

Hello. I trained the model and got literally no results. I used the example data from generate.py (dates). After more than an hour of training I stopped it early (Ctrl+C) and got no useful results, as follows:

Model Compiled.
Training. Ctrl+C to end early.
Epoch 1/50
100/100 [==============================] - 593s 6s/step - loss: 0.5084 - accuracy: 0.8446 - all_acc: 0.0000e+00 - val_loss: 0.3177 - val_accuracy: 0.8925 - val_all_acc: 0.0000e+00
Epoch 2/50
100/100 [==============================] - 566s 6s/step - loss: 0.2793 - accuracy: 0.8904 - all_acc: 0.0000e+00 - val_loss: 0.2405 - val_accuracy: 0.8974 - val_all_acc: 0.0000e+00
Epoch 3/50
100/100 [==============================] - 559s 6s/step - loss: 0.2366 - accuracy: 0.8967 - all_acc: 0.0000e+00 - val_loss: 0.2349 - val_accuracy: 0.8966 - val_all_acc: 0.0000e+00
Epoch 4/50
100/100 [==============================] - 561s 6s/step - loss: 0.2362 - accuracy: 0.8959 - all_acc: 0.0000e+00 - val_loss: 0.2305 - val_accuracy: 0.8975 - val_all_acc: 0.0000e+00
Epoch 5/50
100/100 [==============================] - 567s 6s/step - loss: 0.2346 - accuracy: 0.8980 - all_acc: 0.0000e+00 - val_loss: 0.2386 - val_accuracy: 0.9010 - val_all_acc: 0.0000e+00
Epoch 6/50
100/100 [==============================] - 560s 6s/step - loss: 0.2322 - accuracy: 0.9020 - all_acc: 0.0000e+00 - val_loss: 0.2316 - val_accuracy: 0.9045 - val_all_acc: 0.0000e+00
Epoch 7/50
 77/100 [======================>.......] - ETA: 1:38 - loss: 0.2277 - accuracy: 0.9052 - all_acc: 0.0000e+00Model training stopped early.
Model training complete.
~~~~~
input: 26th January 2016
output: 1902-02-22<eot><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk>
~~~~~
input: 3 April 1989
output: 1970-01-00<eot><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk>
~~~~~
input: 5 Dec 09
output: 1970-01-01<eot><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk>
~~~~~
input: Sat 8 Jun 2017
output: 1970-01-20<eot><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk>

I'm not sure what the cause is. Maybe I should use some other dates, but when I type run_examples(model, input_vocab, output_vocab) after changing the date strings, I get NameError: name 'model' is not defined. Where is the model?

I have some files like "NMT.01-0.32.hdf5" but don't know how to reuse them.
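
A hedged sketch of how those checkpoint files are usually reloaded, modeled on the snippet in the "Problem loading sample weights" issue above (the Visualizer import and the padding value are assumptions; the vocabulary files must be the same ones the checkpoint was trained with, otherwise the shape-mismatch error described in that issue appears):

    from models.NMT import simpleNMT     # module path as in the tracebacks above
    # from ... import Visualizer         # exact import path not shown in this thread

    padding = 50
    viz = Visualizer(padding=padding)    # loads the input/output vocabularies
    model = simpleNMT(trainable=False,
                      pad_length=padding,
                      n_chars=viz.input_vocab.size(),
                      n_labels=viz.output_vocab.size())
    model.load_weights('NMT.01-0.32.hdf5', by_name=True)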

Local attention

Hi, first, thanks a lot for the tutorial and the code that works right away!
In my case, during training I have sequences of well-defined length,
but when I apply the model I work with sequences of size 100 to 100000.
So one problem is that the number of timesteps is hard-coded in the AttentionDecoder.
I could rebuild the model each time I apply it, but that is a bit heavy.

So one option would be to change the code to make it independent of self.timesteps.
Another would be to use a hard-coded window of fixed length, because I don't expect
a dependency larger than 10 to 20 letters around any given step.

What do you think would be the easiest / most interesting?

Thanks

How to use AttentionDecoder after adding embedding layer?

Hi,
I'm trying to use your AttentionDecoder, but I get stuck after adding an embedding layer.
I've used a tokenizer and pad_sequences to encode each doc (100 in total) and its corresponding label (ytrain below); what I actually want to try is echoing a partial sequence of each doc.
The shape of input data looks like:

Xtrain.shape = (100, 148)

ytrain.shape = (100, 148) -> this is the partial sequence of Xtrain

This is how I build the model:

model = Sequential()
model.add(Embedding(vocab_size, 150, input_length=max_length))
model.add(Bidirectional(LSTM(150, return_sequences=True)))
model.add(AttentionDecoder(150, n_features)) # n_features = vocab_size
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
model.summary()
model.fit(Xtrain, ytrain, epochs=epochs, verbose=1)

But it throws me an error:

ValueError: Error when checking target: expected AttentionDecoder to have 3 dimensions, but got array with shape (100, 148)

Can you tell me how to solve it? Thank you very much.
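
A hedged suggestion mirroring the trick used in the "Attention model can not learn simple behaviour" issue above: the decoder emits one distribution per timestep, so the targets need a third axis. Either one-hot encode ytrain to shape (100, 148, vocab_size) and keep categorical_crossentropy, or keep the integer labels, add a trailing axis, and switch to sparse_categorical_crossentropy:

    import numpy as np

    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['acc'])
    model.fit(Xtrain, np.expand_dims(ytrain, -1), epochs=epochs, verbose=1)  # targets now (100, 148, 1)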

Cannot reproduce visualization

Enjoyed your blog post and thank you for the code. I am, however, having trouble reproducing your visualization of the attention layer. If I use your sample vocabulary (sample_*_vocab.json) and your sample weights (sample_NMT.49.0.01.hdf5) the visualization works as expected. However, if I generate a fresh training and validation set, train for ~50 epochs (acc=0.9952), and generate the visualization plots with those weights it makes little sense. Unlike the pristine plots on your blog post, these plots seem to indicate that there are just 2 or 3 points of interest in the attention map. The predicted text however is correct every time. Here's an example - http://bit.ly/2umNcvv

Is the problem that the attention layer takes significantly longer to train, or is it something that I am missing?

Also, during training all_acc and val_all_acc are always zero. Any insights?

I am running TF-1.1.0 and Keras-2.0.6.

where is "Recurrent" in tensorflow 1.5?

Hello!
Your code does a good job of explaining attention techniques!
I'm working with the latest TensorFlow version, 1.5, which includes Keras. My issue is simple: I can't find Recurrent to import.
in the code:

import tensorflow as tf
from tensorflow.python.keras import backend as K
from tensorflow.python.keras import regularizers, constraints, initializers, activations
from tensorflow.python.keras.layers.recurrent import Recurrent   # <-- the error is here
from tensorflow.python.keras.layers import InputSpec
from .tdd import time_distributed_dense

tfPrint = lambda d, T: tf.Print(input_=T, data=[T, tf.shape(T)], message=d)

class AttentionDecoder(Recurrent): .....

The error is the following:
File "/home/adrien/Tensorflow/Projets/keras-attention/models/custom_recurrents.py", line 4, in
from tensorflow.python.keras.layers import Recurrent
ImportError: cannot import name 'Recurrent'

I thank you in advance

Working with real output

Hi - it seems that the original paper and this implementation address a target output that is one-hot encoded. To make this work with targets / y-values that are real numbers (I use LSTMs to experiment with non-quantized audio), would I just have to change the softmax activation in the yt calculation of the step() function below? E.g. change the activation to sigmoid or tanh?

        yt = activations.softmax(
            K.dot(ytm, self.W_o)
            + K.dot(st, self.U_o)
            + K.dot(context, self.C_o)
            + self.b_o)

Or does the attention concept, as described in the paper, not work with real-valued targets/outputs? I know that LSTM models typically function best with quantized data / one-hot vectors squashed with a softmax function... but real output is what I am playing with.
This is neat work... thanks!
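
A hedged sketch only, not a confirmed recipe: for real-valued targets the usual change is to make the final projection linear (or tanh) instead of softmax, and to train with a regression loss such as MSE. Inside step() that would look roughly like the fragment below (a fragment, not standalone code):

    # linear output instead of the softmax shown above
    yt = K.dot(ytm, self.W_o) + K.dot(st, self.U_o) + K.dot(context, self.C_o) + self.b_o
    # ... and the model would then be compiled with, e.g., loss='mse'.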

Sequence encoder decoder

I am using your decoder to implement my sequence encoder/decoder, but I don't know how to get the decoder output to have the same shape as my input. My input is (None, MAX_SEQ_LENGTH); these are my dialogue turns, which I encode and then decode to get the next dialogue turn, which has the same shape, i.e. (None, MAX_SEQ_LENGTH), although the decoder returns a 3D tensor because of return_sequences=True.

Can you please help me do this?

ValueError: Cannot create a tensor proto whose content is larger than 2GB.

I ran into the following issue when using the custom seq2seq attention layer (Keras 2.1.3 and Python 3.6.1). Can anyone help?

File "/apps/keras/2.1.3-py36/lib/python3.6/site-packages/keras/legacy/layers.py", line 968, in call
return super(Recurrent, self).call(inputs, **kwargs)
File "/apps/keras/2.1.3-py36/lib/python3.6/site-packages/keras/engine/topology.py", line 590, in call
self.build(input_shapes[0])
File "attention/custom_recurrents.py", line 180, in build
constraint=self.recurrent_constraint)
File "/apps/keras/2.1.3-py36/lib/python3.6/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
return func(*args, **kwargs)
File "/apps/keras/2.1.3-py36/lib/python3.6/site-packages/keras/engine/topology.py", line 414, in add_weight
constraint=constraint)
File "/apps/keras/2.1.3-py36/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 392, in variable
v = tf.Variable(value, dtype=tf.as_dtype(dtype), name=name)
File "/apps/python/3.6.1/lib/python3.6/site-packages/tensorflow/python/ops/variables.py", line 197, in init
expected_shape=expected_shape)
File "/apps/python/3.6.1/lib/python3.6/site-packages/tensorflow/python/ops/variables.py", line 285, in _init_from_args
initial_value, name="initial_value", dtype=dtype)
File "/apps/python/3.6.1/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 637, in convert_to_tensor
as_ref=False)
File "/apps/python/3.6.1/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 702, in internal_convert_to_tensor
ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
File "/apps/python/3.6.1/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py", line 110, in _constant_tensor_conversion_function
return constant(v, dtype=dtype, name=name)
File "/apps/python/3.6.1/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py", line 99, in constant
tensor_util.make_tensor_proto(value, dtype=dtype, shape=shape, verify_shape=verify_shape))
File "/apps/python/3.6.1/lib/python3.6/site-packages/tensorflow/python/framework/tensor_util.py", line 430, in make_tensor_proto
"Cannot create a tensor proto whose content is larger than 2GB.")
ValueError: Cannot create a tensor proto whose content is larger than 2GB.

Error: Dimensions must be equal, but are 33 and 4

Hi, I was trying to compile the attention decoder model:

# check to see if it compiles
if __name__ == '__main__':
    from keras.layers import Input, LSTM
    from keras.models import Model
    from keras.layers.wrappers import Bidirectional
    i = Input(shape=(100,104), dtype='float32')
    enc = Bidirectional(LSTM(64, return_sequences=True), merge_mode='concat')(i)
    dec = AttentionDecoder(32, 4)(enc)
    model = Model(inputs=i, outputs=dec)
    model.summary()

But it reported the following error:

    raise ValueError(err.message)
ValueError: Dimensions must be equal, but are 33 and 4 for 'AttentionDecoder/MatMul_4' (op: 'MatMul') with input shapes: [?,33], [4,33].

Any ideas why that would happen? I would greatly appreciate any help!

Differences with Bahdanau et al., 2014

Thank you for sharing your code!

I spotted some minor differences between your implementation and the attention mechanism described in Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio, "Neural machine translation by jointly learning to align and translate", arXiv preprint arXiv:1409.0473 (2014), https://arxiv.org/pdf/1409.0473.pdf.

I) In the initialization part, you have s0 = activations.tanh(K.dot(inputs[:, 0], self.W_s)).
If I'm not wrong, inputs[:, 0] corresponds to h_forward_1, whilst in the reference paper it should be h_backward_1.

II) In the output generation step you have yt = activations.softmax(K.dot(ytm, self.W_o)+ K.dot(stm, self.U_o)+ K.dot(context, self.C_o)+ self.b_o) whilst in the paper it's a little bit more complicated, see section A.2.2.

Thank you again for sharing this attention implementation.

Compile issue - Tensors in list passed to 'values' of 'Pack' Op have types [int32, <NOT CONVERTIBLE TO TENSOR>, int32] that don't all match

Hi,

I tried using this in a really simple model:

def simple_model(input_dim, units, output_dim=29):
    input_data = Input(name='the_input', shape=(None, input_dim))
    rnn = GRU(units, return_sequences=True, implementation=2)(input_data)
    decoded = AttentionDecoder(units,output_dim)(rnn)
    y_pred = Activation('softmax', name='softmax')(decoded)
    model = Model(inputs=input_data, outputs=y_pred)
    print(model.summary())
    return model
if __name__=='__main__':
    model_5 = simple_model(input_dim=161,units=200,output_dim=29)

and keep getting a strange error, see below. Am using keras 2.0.5, Tensorflow 1.1

Any idea what could be the matter?
Thanks a lot!

C:\Users\egork\Anaconda3\envs\aind-vui\python.exe "C:\Program Files\JetBrains\PyCharm 2017.2.2\helpers\pydev\pydevd.py" --multiproc --qt-support=auto --client 127.0.0.1 --port 50765 --file C:/Users/egork/Dropbox/GitHub/aind/AIND-VUI-Capstone/sample_models.py
pydev debugger: process 2560 is connecting

Connected to pydev debugger (build 172.3968.37)
Using TensorFlow backend.
inputs shape: (?, ?, 200)
Traceback (most recent call last):
  File "C:\Users\egork\Anaconda3\envs\aind-vui\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 435, in apply_op
    as_ref=input_arg.is_ref)
  File "C:\Users\egork\Anaconda3\envs\aind-vui\lib\site-packages\tensorflow\python\framework\ops.py", line 737, in internal_convert_n_to_tensor
    preferred_dtype=preferred_dtype))
  File "C:\Users\egork\Anaconda3\envs\aind-vui\lib\site-packages\tensorflow\python\framework\ops.py", line 676, in internal_convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "C:\Users\egork\Anaconda3\envs\aind-vui\lib\site-packages\tensorflow\python\framework\constant_op.py", line 121, in _constant_tensor_conversion_function
    return constant(v, dtype=dtype, name=name)
  File "C:\Users\egork\Anaconda3\envs\aind-vui\lib\site-packages\tensorflow\python\framework\constant_op.py", line 102, in constant
    tensor_util.make_tensor_proto(value, dtype=dtype, shape=shape, verify_shape=verify_shape))
  File "C:\Users\egork\Anaconda3\envs\aind-vui\lib\site-packages\tensorflow\python\framework\tensor_util.py", line 364, in make_tensor_proto
    raise ValueError("None values not supported.")
ValueError: None values not supported.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Program Files\JetBrains\PyCharm 2017.2.2\helpers\pydev\pydevd.py", line 1599, in <module>
    globals = debugger.run(setup['file'], None, None, is_module)
  File "C:\Program Files\JetBrains\PyCharm 2017.2.2\helpers\pydev\pydevd.py", line 1026, in run
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "C:\Program Files\JetBrains\PyCharm 2017.2.2\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "C:/Users/egork/Dropbox/GitHub/aind/AIND-VUI-Capstone/sample_models.py", line 238, in <module>
    model_5 = simple_model(input_dim=161,units=200,output_dim=29)
  File "C:/Users/egork/Dropbox/GitHub/aind/AIND-VUI-Capstone/sample_models.py", line 177, in simple_model
    decoded = AttentionDecoder(units,output_dim)(rnn)
  File "C:\Users\egork\Anaconda3\envs\aind-vui\lib\site-packages\keras\layers\recurrent.py", line 262, in __call__
    return super(Recurrent, self).__call__(inputs, **kwargs)
  File "C:\Users\egork\Anaconda3\envs\aind-vui\lib\site-packages\keras\engine\topology.py", line 596, in __call__
    output = self.call(inputs, **kwargs)
  File "C:/Users/egork/Dropbox/GitHub/aind/AIND-VUI-Capstone\custom_recurrents.py", line 210, in call
    return super(AttentionDecoder, self).call(x)
  File "C:\Users\egork\Anaconda3\envs\aind-vui\lib\site-packages\keras\layers\recurrent.py", line 341, in call
    input_length=input_shape[1])
  File "C:\Users\egork\Anaconda3\envs\aind-vui\lib\site-packages\keras\backend\tensorflow_backend.py", line 2459, in rnn
    outputs, _ = step_function(inputs[0], initial_states + constants)
  File "C:/Users/egork/Dropbox/GitHub/aind/AIND-VUI-Capstone\custom_recurrents.py", line 232, in step
    _stm = K.repeat(stm, self.timesteps)
  File "C:\Users\egork\Anaconda3\envs\aind-vui\lib\site-packages\keras\backend\tensorflow_backend.py", line 1864, in repeat
    pattern = tf.stack([1, n, 1])
  File "C:\Users\egork\Anaconda3\envs\aind-vui\lib\site-packages\tensorflow\python\ops\array_ops.py", line 856, in stack
    return gen_array_ops._pack(values, axis=axis, name=name)
  File "C:\Users\egork\Anaconda3\envs\aind-vui\lib\site-packages\tensorflow\python\ops\gen_array_ops.py", line 1949, in _pack
    result = _op_def_lib.apply_op("Pack", values=values, axis=axis, name=name)
  File "C:\Users\egork\Anaconda3\envs\aind-vui\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 463, in apply_op
    raise TypeError("%s that don't all match." % prefix)
TypeError: Tensors in list passed to 'values' of 'Pack' Op have types [int32, <NOT CONVERTIBLE TO TENSOR>, int32] that don't all match.

new keras version

In the new Keras version, for example 2.1.3, the Recurrent class lives in the legacy package (keras.legacy.layers).
Does this mean it would be better to implement this on top of the RNN class instead of Recurrent?

class AttentionDecoder(Recurrent):
class AttentionDecoder(RNN):

Concatenate two AttentionDecoders raise ValueError

Hi,

I have two networks that I want to concatenate. So, here is the piece of code

...
a = Bidirectional(LSTM(256, return_sequences=True))(input_a)
a = AttentionDecoder(128, 128)(a)
...
b = Bidirectional(LSTM(256, return_sequences=True))(input_b)
b = AttentionDecoder(128, 128)(b)
...
c = concatenate([a, b])
d = Model([input_a, input_b], c) 

This raises ValueError: The name "AttentionDecoder" is used 2 times in the model. All layer names should be unique.

Any idea how to deal with this problem? I already tried commenting out name='AttentionDecoder' inside the class/function.
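
A hedged sketch: the AttentionDecoder constructor accepts a name argument (see the __init__ signature quoted in the first issue on this page), so giving each instance a distinct name should satisfy Keras's unique-name check:

    a = AttentionDecoder(128, 128, name='attention_decoder_a')(a)
    b = AttentionDecoder(128, 128, name='attention_decoder_b')(b)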

How do I pass the output of AttentionDecoder to an RNN layer.

I am trying to pass the decoder output to another RNN layer. However, it gives me the error: TypeError: float() argument must be a string or a number, not 'Dimension'

x_in= Input(shape=(x_train.shape[1], x_train.shape[2]), name='x_in')

meta_in= Input(shape=(x_meta_train.shape[1], x_meta_train.shape[2]), name='meta_in')

x=Bidirectional(LSTM(100, input_shape=(x_train.shape[1], x_train.shape[2]), activation='tanh', return_sequences=True))(x_in)

y=LSTM(100, input_shape=(x_meta_train.shape[1], x_meta_train.shape[2]), activation='tanh', return_sequences=True)(meta_in)

x_=AttentionDecoder(50, x.shape[2], name='AD1')(x)

y_= AttentionDecoder(50, y.shape[2],name='AD2')(y)

x__=Bidirectional(LSTM(20, input_shape=(50, x_.shape[2].value), activation='tanh', return_sequences=True))(x_) #TypeError: float() argument must be a string or a number, not 'Dimension'

y__=Bidirectional(LSTM(20, input_shape=(50, y_.shape[2].value), activation='tanh', return_sequences=True))(y_)
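
A hedged guess at the cause, not a verified fix: x.shape[2] and y.shape[2] are TensorFlow Dimension objects, and Keras ends up calling float() on a shape built from them. Converting them to plain Python ints (e.g. via K.int_shape) before passing them as output_dim usually avoids this:

    from keras import backend as K

    x_ = AttentionDecoder(50, K.int_shape(x)[2], name='AD1')(x)
    y_ = AttentionDecoder(50, K.int_shape(y)[2], name='AD2')(y)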
