serversidehannes / las
tf 2.0 implementation of Listen, attend and spell
Hi. This is Yong Joon Lee. I am implementing a LAS model based on your code. I know you might not remember the actual code, since you implemented it three years ago, but I think the class att_rnn may have a small mistake in its code ordering. If you look at att_rnn's call method, you define s twice in a row and then move on to c, the attention context.
Your ordering is as below:
s = self.rnn(inputs = inputs, states = states) # s = m_{t}, [m_{t}, c_{t}] #m is memory(hidden) and c is carry(cell)
s = self.rnn2(inputs=s[0], states = s[1])[1] # s = m_{t+1}, c_{t+1}
c = self.attention_context([s[0], h])
But isn't it supposed to be as below?
s = self.rnn(inputs = inputs, states = states) # s = m_{t}, [m_{t}, c_{t}]
c = self.attention_context([s[0], h])
s = self.rnn2(inputs=s[0], states = s[1])[1] # s = m_{t+1}, c_{t+1}
As the original paper suggests, the attention context vector at timestep t is computed by applying attention to s_t and h, where h is the output of the pBLSTM. But I think that with your ordering, you are deriving the attention context vector from s_{t+1} and h instead. Thank you for your great work.
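To make the difference concrete, here is a minimal numpy sketch (not the repo's Keras code) contrasting the two orderings. The functions rnn and attention_context below are toy stand-ins for the real layers, so only the ordering of the calls is meaningful:

```python
import numpy as np

def rnn(x, m, c):
    # stand-in recurrent cell: returns new (memory, cell) state
    m_new = np.tanh(x + m)
    return m_new, c + 0.1 * m_new

def attention_context(s, h):
    # stand-in dot-product attention over the pBLSTM output h, query s
    scores = h @ s
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ h

x = np.ones(4)                 # decoder input at timestep t
m = np.zeros(4); c = np.zeros(4)
h = np.random.default_rng(0).standard_normal((6, 4))  # pBLSTM output, 6 frames

# Ordering in the repo: the second cell runs first, so attention queries s_{t+1}
m1, c1 = rnn(x, m, c)
m2, c2 = rnn(m1, m1, c1)       # stand-in for rnn2
ctx_repo = attention_context(m2, h)

# Proposed ordering: attention is computed from s_t, before the second cell
m1, c1 = rnn(x, m, c)
ctx_proposed = attention_context(m1, h)
m2, c2 = rnn(m1, m1, c1)
```

Because the two orderings attend with different query states, the resulting context vectors differ, which is the substance of the question above.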
Could you please explain why the output of the pBLSTM is reshaped by a factor of 4?
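For context, the usual LAS scheme (an assumption here, not taken from the repo's code) concatenates consecutive frames so that each pBLSTM layer halves the time axis while widening the feature axis. A minimal numpy sketch of that reshape:

```python
import numpy as np

def pyramid_reshape(x, factor=2):
    # Merge `factor` consecutive frames into one wider frame:
    # (batch, time, feat) -> (batch, time // factor, feat * factor)
    batch, time, feat = x.shape
    time = time - time % factor            # drop leftover frames if needed
    return x[:, :time].reshape(batch, time // factor, feat * factor)

x = np.zeros((1, 8, 16))
y = pyramid_reshape(x)                     # shape (1, 4, 32)
```

With factor=2 per layer, three stacked layers give the overall 8x time reduction from the paper; a factor-4 reshape would compress the time axis more aggressively.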
Is it necessary to use one-hot encoding, or can we use tf.keras.preprocessing.text.Tokenizer for encoding?
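One way to see the relationship (a plain-numpy sketch, not the repo's pipeline): a tokenizer such as tf.keras.preprocessing.text.Tokenizer maps tokens to integer IDs, and a one-hot step can still be applied on top of those IDs, so the two are complementary rather than alternatives:

```python
import numpy as np

# Toy character vocabulary standing in for a fitted tokenizer's word_index
vocab = {"<pad>": 0, "h": 1, "e": 2, "l": 3, "o": 4}

ids = np.array([vocab[ch] for ch in "hello"])   # integer encoding: [1, 2, 3, 3, 4]
one_hot = np.eye(len(vocab))[ids]               # (5, 5) one-hot matrix
```

Whether one-hot is required then depends on what the model's decoder input layer expects, not on the tokenizer itself.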
x_2 should have shape (batch_size, no_prev_tokens, no_tokens).
x_2 = np.random.random((1, 12, 16))
When you say "number of previous tokens", what exactly does it mean? At training time I would know all the tokens, right?
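Assuming the usual teacher-forcing setup (an assumption, not confirmed by the repo), the full target sequence is indeed known at training time, and x_2 holds the previous tokens as one-hot vectors. A numpy sketch matching the (1, 12, 16) example above:

```python
import numpy as np

no_prev_tokens, no_tokens = 12, 16

# Toy target sequence standing in for the known transcript's token IDs
targets = np.random.default_rng(0).integers(0, no_tokens, size=no_prev_tokens)

# Teacher-forcing decoder input: one-hot previous tokens, batch of 1
x_2 = np.eye(no_tokens)[targets][None]   # shape (1, 12, 16)
```

At inference time the previous tokens are not known, so they would instead be fed back one step at a time from the model's own predictions.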
I think the dimensions are not right in the Listen part:
x = pBLSTM( dim//2 )(input_1) # (..., audio_len//2, dim*2)
x = pBLSTM( dim//2 )(x) # (..., audio_len//4, dim*2)
x = pBLSTM( dim//4 )(x) # (..., audio_len//8, dim)
which I corrected to:
x = pBLSTM( dim//2 )(input_1) # (..., audio_len//4, dim*2)
x = pBLSTM( dim//2 )(x) # (..., audio_len//16, dim*2)
x = pBLSTM( dim//4 )(x) # (..., audio_len//64, dim)
Is it right?
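The two comment sets above differ only in the assumed per-layer time reduction. A small sketch that traces the time axis through three stacked pBLSTM layers makes both readings easy to check; the per-layer factor is exactly the point in question:

```python
def listen_shapes(audio_len, factor=2, layers=3):
    # Trace the time-axis length through `layers` stacked pBLSTM layers,
    # each shrinking the time axis by `factor`.
    lengths = []
    for _ in range(layers):
        audio_len //= factor
        lengths.append(audio_len)
    return lengths

# factor=2 (as in the original LAS paper) reproduces the repo's comments:
#   [audio_len//2, audio_len//4, audio_len//8]
# factor=4 reproduces the corrected comments:
#   [audio_len//4, audio_len//16, audio_len//64]
```

So the corrected comments hold if each layer really reduces time by 4; the repo's original comments hold for the paper's 2x-per-layer scheme.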
Have you tried to train the model on, e.g., the LibriSpeech dataset? I would like to see the word error rate.