asappresearch / sru
Training RNNs as Fast as CNNs (https://arxiv.org/abs/1709.02755)
License: MIT License
Hi,
When I use SRU in DrQA instead of LSTM, I get this error:
File "/home/hebian.ww/DrQA/drqa/reader/cuda_functional.py", line 359, in forward
stream=SRU_STREAM
File "cupy/cuda/function.pyx", line 129, in cupy.cuda.function.Function.__call__ (cupy/cuda/function.cpp:3963)
File "cupy/cuda/function.pyx", line 111, in cupy.cuda.function._launch (cupy/cuda/function.cpp:3600)
File "cupy/cuda/driver.pyx", line 127, in cupy.cuda.driver.launchKernel (cupy/cuda/driver.cpp:2541)
File "cupy/cuda/driver.pyx", line 62, in cupy.cuda.driver.check_status (cupy/cuda/driver.cpp:1446)
cupy.cuda.driver.CUDADriverError: CUDA_ERROR_INVALID_HANDLE: invalid resource handle
This is the same error as in #4, but I get it with a single GPU.
I encountered this error with Python 3.6.
Changing line 355 of cuda_functional.py into num_block = (ncols-1)//thread_per_block+1
solved the problem.
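A likely reason this change helps (an assumption, since the original line 355 is not quoted here): in Python 3 the `/` operator always returns a float, so a block count computed with `/` is no longer an integer, while `//` keeps it an int. A minimal sketch of the ceiling division with made-up sizes:

```python
def num_blocks(ncols, thread_per_block):
    """Ceiling division using integer arithmetic only."""
    return (ncols - 1) // thread_per_block + 1

print(num_blocks(1024, 512))   # 2
print(num_blocks(1025, 512))   # 3
print(type((1025 - 1) / 512))  # <class 'float'> in Python 3
```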
Hi @taolei87,
I have a question about the weight matrix dimensions.
In the SRUCell code, I found k = 4 if n_in != out_size else 3.
But in the paper there are only 3 weight matrices: W, Wf, and Wr.
I found that n_in will not equal out_size when the layer number is 0, but I don't understand why k = 4. What is the extra weight matrix besides W, Wf, and Wr?
Below is the init code:
class SRUCell(nn.Module):
    def __init__(self, n_in, n_out, dropout=0, vari_dropout=0,
                 use_tanh=1, bidirectional=False):
        ....
        out_size = n_out*2 if bidirectional else n_out
        k = 4 if n_in != out_size else 3
        self.size_per_dir = n_out*k
        self.weight = nn.Parameter(torch.Tensor(
            n_in,
            self.size_per_dir*2 if bidirectional else self.size_per_dir
        ))
        ...
Below is where n_in is not equal to out_size:
class SRU(nn.Module):
    def __init__(self, input_size, hidden_size,
                 num_layers=2, dropout=0, vari_dropout=0,
                 use_tanh=1, bidirectional=False):
        ...
        ...
        self.n_in = input_size
        self.n_out = hidden_size
        ...
        self.out_size = hidden_size*2 if bidirectional else hidden_size
        for i in range(num_layers):
            l = SRUCell(n_in=self.n_in if i == 0 else self.out_size,
                        n_out=self.n_out,
                        dropout=dropout if i+1 != num_layers else 0,
                        vari_dropout=vari_dropout,
                        use_tanh=use_tanh,
                        bidirectional=bidirectional)
            self.rnn_lst.append(l)
Thanks
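One reading of the code above, offered as an assumption rather than the author's answer: three of the blocks correspond to W, Wf, and Wr, and the fourth block that appears when n_in != out_size projects the input to the output width so that the highway term (1 - r_t) * x_t has matching shapes. The helper below only reproduces the shape arithmetic from the snippet; the function name is hypothetical.

```python
def sru_weight_shape(n_in, n_out, bidirectional=False):
    """Mirror the shape arithmetic in SRUCell.__init__ quoted above."""
    out_size = n_out * 2 if bidirectional else n_out
    k = 4 if n_in != out_size else 3
    size_per_dir = n_out * k
    n_cols = size_per_dir * 2 if bidirectional else size_per_dir
    return k, (n_in, n_cols)

# First layer: the input size differs from the hidden size, so k == 4.
print(sru_weight_shape(300, 128))                  # (4, (300, 512))
# Later layers read the previous layer's output, so k == 3.
print(sru_weight_shape(128, 128))                  # (3, (128, 384))
# Bidirectional with n_in == 2 * n_out also gives k == 3.
print(sru_weight_shape(6, 3, bidirectional=True))  # (3, (6, 18))
```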
Hi,
I saw that some people asked you to add recurrent batch norm. There is also SELU (the scaled exponential linear unit from self-normalizing neural networks), which is easy to implement and works well. Would you mind adding it?
ref: https://arxiv.org/abs/1706.02515
Would SRU provide any benefits for a decoder that decodes step by step based on the previously decoded output? It would no longer be parallelizable over time steps, so would the only latency saving come from the reduced number of operations per time step (compared to LSTM)?
i.e. Leaky ReLU
Hi!
I have some issues running the language_model example.
$ python3 train_lm.py --train train.txt --dev valid.txt --test test.txt
Namespace(batch_size=32, bias=-3, clip_grad=5, d=910, depth=6, dev='valid.txt', dropout=0.7, lr=1.0, lr_decay=0.98, lr_decay_epoch=175, lstm=False, max_epoch=300, rnn_dropout=0.2, test='test.txt', train='train.txt', unroll_size=35, weight_decay=1e-05)
WARNING: set_bias() is deprecated. use `highway_bias` option in SRUCell() constructor.
WARNING: set_bias() is deprecated. use `highway_bias` option in SRUCell() constructor.
WARNING: set_bias() is deprecated. use `highway_bias` option in SRUCell() constructor.
WARNING: set_bias() is deprecated. use `highway_bias` option in SRUCell() constructor.
WARNING: set_bias() is deprecated. use `highway_bias` option in SRUCell() constructor.
WARNING: set_bias() is deprecated. use `highway_bias` option in SRUCell() constructor.
vocab size: 10000
num of parameters: 24026720
train_lm.py:110: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
norms = [ "{:.0f}".format(x.norm().data[0]) for x in self.parameters() ]
p_norm: ['100', '45', '90', '45', '90', '45', '90', '45', '90', '45', '90', '47', '90', '0']
SRU loaded for gpu 0
train_lm.py:140: UserWarning: torch.nn.utils.clip_grad_norm is now deprecated in favor of torch.nn.utils.clip_grad_norm_.
torch.nn.utils.clip_grad_norm(model.parameters(), args.clip_grad)
Traceback (most recent call last):
File "train_lm.py", line 262, in <module>
main(args)
File "train_lm.py", line 206, in main
train_ppl = train_model(epoch, model, train)
File "train_lm.py", line 145, in train_model
p.data.add_(-lr, p.grad.data)
RuntimeError: expand(torch.cuda.FloatTensor{[2, 910]}, size=[1820]): the number of sizes provided (1) must be greater or equal to the number of dimensions in the tensor (2)
It looks like this happens because some weights and their gradients have different shapes.
Thanks for your awesome work on this model. We're currently using it to process language and visual features to generate segmentation masks from referral expressions. However, we're experiencing major issues with the latest master revision.
Up to commit 43c85ed, our model trained perfectly (without gradient clipping or any additional techniques). After updating to the latest version, however, the model fails to converge during training. Also, our old weights are not compatible with the current SRU version, and we haven't touched our code at all. Was any incompatibility introduced when the multigpu branch was merged?
Here's a little snippet depicting our current declaration and usage of the SRU model:
class Net(nn.Module):
    def __init__(self, *args, **kwargs):
        ...
        self.lang_model = SRU(emb_size, hid_size, num_layers=lang_layers)
        ...
        self.mrnn = SRU(mixed_size, hid_mixed_size,
                        num_layers=mixed_layers)

    def forward(self, vis, lang):
        ...
        lang = self.emb(lang)
        # LxB representation
        lang = torch.transpose(lang, 0, 1)
        # input has dimensions: seq_length x batch_size (1) x we_dim
        lang, _ = self.lang_model(lang)
        ...
        # input has dimensions: seq_length x batch_size (1) x mix_size
        output, _ = self.mrnn(q)
        ...
I would appreciate any guidance to solve this issue. Right now, our model takes 5 days to train and we would like to parallelize it on multiple GPUs, such that we can reduce that time.
Hi Tao,
I recently found an issue in the bidirectional case,
for example with input_size = 6, hidden_size = 3, direction_count = 2, length = 2, batch_size = 2.
In this case, k == 3.
The x matrix looks like this
(the letter f or r before each number marks the values read by the forward and reverse (flip == 1) directions):
[[[f-0.302948 f-0.255578 f-0.110915 r0.1591 r0.928114 r0.92241 ]
[f-0.50604 f0.391675 f-0.187608 r0.468802 r-0.648262 r-0.177739]]
[[ f0.50936 f0.67189 f-0.619738 r0.377355 r0.545083 r-0.971449]
[ f0.948531 f-0.551092 f0.227567 r-0.46116 r-0.496896 r-0.769874]]]
I added a print to the forward kernel, which shows:
[F] col:0 L:0 N:0 D:0 DIR:0 act:1 k:3 d:3 x: -0.302948
[F] col:1 L:0 N:0 D:1 DIR:0 act:1 k:3 d:3 x: -0.255578
[F] col:2 L:0 N:0 D:2 DIR:0 act:1 k:3 d:3 x: -0.110915
[F] col:3 L:0 N:0 D:0 DIR:1 act:1 k:3 d:3 x: 0.377355
[F] col:4 L:0 N:0 D:1 DIR:1 act:1 k:3 d:3 x: 0.545083
[F] col:5 L:0 N:0 D:2 DIR:1 act:1 k:3 d:3 x: -0.971449
[F] col:6 L:0 N:1 D:0 DIR:0 act:1 k:3 d:3 x: -0.506040
[F] col:7 L:0 N:1 D:1 DIR:0 act:1 k:3 d:3 x: 0.391675
[F] col:8 L:0 N:1 D:2 DIR:0 act:1 k:3 d:3 x: -0.187608
[F] col:9 L:0 N:1 D:0 DIR:1 act:1 k:3 d:3 x: -0.461160
[F] col:10 L:0 N:1 D:1 DIR:1 act:1 k:3 d:3 x: -0.496896
[F] col:11 L:0 N:1 D:2 DIR:1 act:1 k:3 d:3 x: -0.769874
[F] col:0 L:1 N:0 D:0 DIR:0 act:1 k:3 d:3 x: 0.509360
[F] col:1 L:1 N:0 D:1 DIR:0 act:1 k:3 d:3 x: 0.671890
[F] col:2 L:1 N:0 D:2 DIR:0 act:1 k:3 d:3 x: -0.619738
[F] col:3 L:1 N:0 D:0 DIR:1 act:1 k:3 d:3 x: 0.159100
[F] col:4 L:1 N:0 D:1 DIR:1 act:1 k:3 d:3 x: 0.928114
[F] col:5 L:1 N:0 D:2 DIR:1 act:1 k:3 d:3 x: 0.922410
[F] col:6 L:1 N:1 D:0 DIR:0 act:1 k:3 d:3 x: 0.948531
[F] col:7 L:1 N:1 D:1 DIR:0 act:1 k:3 d:3 x: -0.551092
[F] col:8 L:1 N:1 D:2 DIR:0 act:1 k:3 d:3 x: 0.227567
[F] col:9 L:1 N:1 D:0 DIR:1 act:1 k:3 d:3 x: 0.468802
[F] col:10 L:1 N:1 D:1 DIR:1 act:1 k:3 d:3 x: -0.648262
[F] col:11 L:1 N:1 D:2 DIR:1 act:1 k:3 d:3 x: -0.177739
For your reference, this is the code I added to print the values:
void sru_bi_fwd(...) {
    for (int row = 0; row < len; ++row)
    {
        ...
        ...
        *hp = (val*mask-(*xp))*g2 + (*xp);
        printf("[F] col:%d L:%d N:%d D:%d DIR:%d act:%d k:%d d:%d x:%f\n",
               col, cnt, (col/d2), (col%d), flip, activation_type, k, d, *(xp) );
I found that the forward direction (flip == 0, printed as DIR) only accesses the left half of input x, and the backward direction (flip == 1) only accesses the right half of input x.
This behavior differs from every other case in SRU (k == 3 unidirectional; k == 4 unidirectional or bidirectional): the reset gate can only carry over half of the input's information, while the activation sees all of x.
Do you think this is an issue?
Hi,
I'm trying to understand what exactly "output" is and what "hidden" is. Based on the code in cuda_functional.py, it seems that "output" corresponds to the hidden states of an LSTM, and "hidden" corresponds to the cell states of an LSTM. Is this understanding right?
Or are hidden and output in fact the same kind of state, where hidden is just the last-time-step hidden state of each layer, while output holds the hidden states of the last layer at all time steps?
output, hidden = rnn(x) # forward pass
# output is (length, batch size, hidden size * number of directions)
# hidden is (layers, batch size, hidden size * number of directions)
prevx = input
lstc = []
for i, rnn in enumerate(self.rnn_lst):
    h, c = rnn(prevx, c0[i])
    prevx = h
    lstc.append(c)
if return_hidden:
    return prevx, torch.stack(lstc)
else:
    return prevx
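Reading the loop above, "output" appears to be the per-time-step states of the last layer, while "hidden" stacks the final internal state c returned by every layer. A toy sketch of that control flow with a stand-in cell (ToyCell is hypothetical and does not implement real SRU math):

```python
class ToyCell:
    """Stand-in for SRUCell: returns (h for every step, c at the last step)."""
    def __call__(self, xs, c0):
        c, hs = c0, []
        for x in xs:               # iterate over time steps
            c = 0.5 * c + 0.5 * x  # toy internal state update
            hs.append(c + x)       # toy per-step output
        return hs, c

def stacked_forward(cells, xs, c0):
    prevx, lstc = xs, []
    for i, cell in enumerate(cells):
        h, c = cell(prevx, c0[i])
        prevx = h        # the next layer reads this layer's outputs
        lstc.append(c)   # keep the final state of every layer
    return prevx, lstc

output, hidden = stacked_forward([ToyCell(), ToyCell()], [1.0, 2.0, 3.0], [0.0, 0.0])
print(len(output))  # 3: one entry per time step, taken from the last layer
print(len(hidden))  # 2: one final state per layer
```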
Hi, I'm trying to implement a naive version of this paper in Keras, and was wondering how the case n_in != n_out is handled.
I went through the code a few times and couldn't understand the element-wise multiplication of (1 - r_t) with x_t when x_t has a different shape than r_t.
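A minimal sketch of one way to handle this, based on the k = 4 branch in the PyTorch code (an assumption about its intent, not a statement from the paper): when n_in != n_out, a fourth matrix projects x_t to n_out dimensions, and that projection replaces x_t in the highway term. The names below are hypothetical, and tanh stands in for the activation g:

```python
import math

def matvec(W, x):
    """Multiply an (n_out x n_in) matrix, stored as nested lists, by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sru_step(x, c_prev, W, Wf, Wr, bf, br, Wp=None):
    """One SRU time step. Wp is the assumed extra projection for n_in != n_out."""
    xt = matvec(W, x)
    f = [sigmoid(a + b) for a, b in zip(matvec(Wf, x), bf)]
    r = [sigmoid(a + b) for a, b in zip(matvec(Wr, x), br)]
    c = [fi * ci + (1.0 - fi) * xi for fi, ci, xi in zip(f, c_prev, xt)]
    skip = matvec(Wp, x) if Wp is not None else x
    assert len(skip) == len(r), "highway input must match the gate width"
    h = [ri * math.tanh(ci) + (1.0 - ri) * si for ri, ci, si in zip(r, c, skip)]
    return h, c

# n_in = 3 and n_out = 2, so the projection Wp is needed.
W = Wf = Wr = Wp = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.1]]
h, c = sru_step([1.0, 2.0, 3.0], [0.0, 0.0], W, Wf, Wr, [0.0, 0.0], [0.0, 0.0], Wp)
print(len(h), len(c))  # 2 2
```

When n_in == n_out, Wp can be left out and x_t is used directly in the highway term.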
I'm not sure how, but I'm seeing this error when I try to compute the backwards function. Don't know if you've come across this during your debug?
Traceback (most recent call last):
File "gan_language.py", line 341, in <module>
G.backward(one)
File "/usr/local/lib/python2.7/dist-packages/torch/autograd/variable.py", line 156, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
File "/usr/local/lib/python2.7/dist-packages/torch/autograd/__init__.py", line 98, in backward
variables, grad_variables, retain_graph)
File "/home/nick/wgan-gp/sru/cuda_functional.py", line 417, in backward
stream=SRU_STREAM
File "cupy/cuda/function.pyx", line 129, in cupy.cuda.function.Function.__call__ (cupy/cuda/function.cpp:4010)
File "cupy/cuda/function.pyx", line 111, in cupy.cuda.function._launch (cupy/cuda/function.cpp:3647)
File "cupy/cuda/driver.pyx", line 127, in cupy.cuda.driver.launchKernel (cupy/cuda/driver.cpp:2541)
File "cupy/cuda/driver.pyx", line 62, in cupy.cuda.driver.check_status (cupy/cuda/driver.cpp:1446)
cupy.cuda.driver.CUDADriverError: CUDA_ERROR_INVALID_HANDLE: invalid resource handle
"k = 4 if n_in != out_size else 3".
I notice that the variable k appears in many places and is sometimes 4 and sometimes 3. It seems to depend on the layer. I think k is related to the 3 weight matrices in the paper: W, W_f, and W_r. Could you please explain it? Thanks.
Hi
Thanks for releasing the code. I am wondering whether it is possible to release a torch binding?
Thanks!
Can you please provide more detailed instructions on how to run the speech experiments from your paper? I am a graduate student participating in the ICLR 2018 Reproducibility Challenge, and we are having some difficulty reproducing the speech experiments with the installation instructions provided in the sru/speech directory. I was able to install Kaldi and CUDA 8, but I am uncertain how to build @yzhang87's forked CNTK version. According to your instructions:
Build KaldiReader in CNTK (follow the instruction).
I assume "KaldiReader" is the name of the file in @yzhang87's fork; is this correct? Are Microsoft's official instructions the correct ones, or are there specific instructions we should look for in the forked repository? Several steps from the official instructions are no longer appropriate for the fork. Notably, the fork expects a file called mkl.h, which is missing because it is no longer provided in any of the open source MKLML releases suggested by the Microsoft instructions. Can you clarify how to obtain the correct MKLML dependencies to run your experiment? Thank you!
First, thanks for the impressive SRU code.
When will multi-GPU support for SRU be ready?
error:
Traceback (most recent call last):
File "setup.py", line 53, in <module>
version=get_version(),
File "setup.py", line 21, in get_version
with open(version_py) as fh:
FileNotFoundError: [Errno 2] No such file or directory: 'version.py'
Hi, I use SRU in TensorFlow for a seq2seq model.
I compared the loss of SRU and LSTM (same hyperparameters and network structure). After 10 epochs, SRU's convergence slows down.
At epoch 50, SRU's loss is 0.05, while LSTM's loss is only 0.01. The validation losses are very close, though.
My code is here: https://github.com/johnnykthink/SRU-Tensorflow/blob/master/sru.py
So, how can I fix it? Thanks.
The training loss graph, data pair num: 300k. (blue is SRU, gray is LSTM)
@taolei87 hi, first I want to thank you very much. Your contribution with SRU helps me a lot. Seeing models with SRU run so fast feels like a dream. So I deployed SRU in my own code, replacing LSTM (GRU). However, my new model with SRU always hits the same problem: loss=nan. I studied previous issues and found that we should use gradient clipping. Can you give an example of how to set up gradient clipping? I did not find one in the published code, though the new pull requests may cover it. I hope more and more people enjoy the speed of SRU, it's magic!
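For reference, the train_lm.py log quoted in another report here shows a call to torch.nn.utils.clip_grad_norm(model.parameters(), args.clip_grad), which newer PyTorch releases rename to clip_grad_norm_. The arithmetic behind norm clipping can be sketched in plain Python; the function below is illustrative only and is not the library implementation:

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale all gradients so that their joint L2 norm is at most max_norm."""
    total = math.sqrt(sum(g * g for vec in grads for g in vec))
    if total <= max_norm:
        return grads, total
    scale = max_norm / total
    return [[g * scale for g in vec] for vec in grads], total

grads = [[3.0, 4.0], [12.0]]  # joint norm is 13
clipped, norm_before = clip_by_global_norm(grads, 5.0)
print(norm_before)  # 13.0
```

In a PyTorch training loop, the clipping call goes between loss.backward() and optimizer.step().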
Hi Tao,
Thanks for the great work. I have implemented your paper in TensorFlow.
In a seq2seq task, SRU training is 1.6x faster than TensorFlow's BasicLSTMCell!
And the accuracy is a little better than LSTM.
But how can I get the 5-10x speedup reported in your paper? Right now I feed data to the model with feed_dict; I will try tfrecords later for comparison.
Thanks.
@taolei87 thanks for the repo again. Really good code.
One thing I noticed while analyzing it is that the SRU class doesn't seem to have skip connections. Shouldn't it be:
prevx = input
lstc = []
for i, rnn in enumerate(self.rnn_lst):
    h, c = rnn(prevx, c0[i])
    prevx += h  # you have prevx = h
This way the connections are residual, which is useful for stacking multiple layers.
Is it the same with LSTM?
In the main function of language_model, the word vocabulary should contain all words from train+dev+test; otherwise the program will throw an exception.
model = Model(train+dev+test, args)
Hi,
Would you mind sharing the --d parameter for --depth 6? I was trying to find the right one and maintain 24M parameter budget but was having some difficulties.
Thank you!
I wanted to know if there was a score available for the test set.
Hi,
Part of the experiments with SRU were conducted on speech recognition tasks (section 4.5); however, there is no work on end-to-end models like deepspeech2 evaluated with SRU. Have you tried this? Are there any architectural problems with applying SRU to the deepspeech2 model and to end-to-end speech recognition models in general?
Many thanks for a wonderful work on SRU
Best Regards,
Jacek
Hi, one of the merits of RNNs is support for variable sequence lengths. In many applications (speech, NLP etc.), a batch of samples consists of sequences that do not have the same length, thus requiring padding for batching. To avoid training on the padded and non-informative parts, PyTorch uses PackedSequence objects. With SRU, I get the following error using a PackedSequence as input:
in forward
assert input.dim() == 3 # (len, batch, n_in)
AttributeError: 'PackedSequence' object has no attribute 'dim'
Do you plan to add support for PackedSequence objects?
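Without PackedSequence, one common workaround (a sketch, not something this repository documents) is to pad each batch to a common length, feed the padded (len, batch, n_in) tensor, and use a mask to ignore padded positions when computing the loss. The helper below is hypothetical and uses plain lists of token ids to show the time-major layout:

```python
def pad_batch(seqs, pad=0):
    """Pad variable-length sequences; return time-major data and a 0/1 mask."""
    max_len = max(len(s) for s in seqs)
    data = [[s[t] if t < len(s) else pad for s in seqs] for t in range(max_len)]
    mask = [[1 if t < len(s) else 0 for s in seqs] for t in range(max_len)]
    return data, mask

data, mask = pad_batch([[5, 6, 7], [8, 9]])
print(data)  # [[5, 8], [6, 9], [7, 0]]
print(mask)  # [[1, 1], [1, 1], [1, 0]]
```

With padding at the end, a unidirectional model's outputs at the real positions do not depend on the padding, whereas the reverse pass of a bidirectional model does read the padded steps first.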
I am new to word2vec, so I'm confused about the complete process behind "Download pre-trained word embeddings such as word2vec; make it into text format" in the classification task. I have cloned the word2vec repo, followed the quick start at https://code.google.com/archive/p/word2vec/, and run the demo scripts ./demo-word.sh and ./demo-phrases.sh. But I don't know how to "make it into text format". Could you please give a more precise description?
Thanks!
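As a rough sketch of what "text format" usually means here: one header line with the vocabulary size and dimension, then one line per word with its vector values. The converter below assumes the standard word2vec binary layout (text header, each word terminated by a space, followed by little-endian float32 values); the function name is hypothetical and the example data is made up.

```python
import io
import struct

def word2vec_bin_to_text(src, dst):
    """Convert a word2vec binary stream to the plain-text format."""
    vocab_size, dim = (int(n) for n in src.readline().split())
    dst.write(f"{vocab_size} {dim}\n")
    for _ in range(vocab_size):
        word = bytearray()
        while True:
            ch = src.read(1)
            if ch == b" " or ch == b"":
                break
            if ch != b"\n":  # skip the newline that ends the previous row
                word.extend(ch)
        vec = struct.unpack(f"<{dim}f", src.read(4 * dim))
        dst.write(word.decode("utf-8") + " " + " ".join(f"{v:.6f}" for v in vec) + "\n")

# Tiny in-memory example with two 2-dimensional vectors.
raw = (b"2 2\n"
       + b"hi " + struct.pack("<2f", 1.0, 2.0) + b"\n"
       + b"yo " + struct.pack("<2f", 0.5, -1.0) + b"\n")
out = io.StringIO()
word2vec_bin_to_text(io.BytesIO(raw), out)
print(out.getvalue())
```

For a real file, open the source in "rb" mode and the destination in text mode. If you train the vectors yourself, running the word2vec tool with -binary 0 should write the text format directly.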
When I use SRU with PyTorch 0.4, I get this error:
optimizer.step()
File "/usr/local/lib/python3.5/dist-packages/torch/optim/sgd.py", line 93, in step
d_p.add_(weight_decay, p.data)
RuntimeError: The expanded size of the tensor (56) must match the existing size (112) at non-singleton dimension 1
When I replace SRU with GRU, it doesn't happen.
I'm sorry that I haven't investigated further where the error occurs; I'm just reporting it.
I do not understand how the speedup is achieved.
thanks!
I tried to implement a discriminator of WGAN-GP via SRU but failed with this error.
Hi,
when I use SRU in NMT, I get this error.
bidirectional = True
output, hidden = enc(x) # forward pass
# (4, 64, 400) - (layers x batch_size x rnn_size*2)
hidden = hidden[:, :, 0:rnn_size] + hidden[:, :, rnn_size:]
output, hidden = dec(y, hidden)
...
loss.div(x.size(1)).backward()
loss.div(x.size(1)).backward()
File "/usr/local/lib/python2.7/dist-packages/torch/autograd/variable.py", line 156, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
File "/usr/local/lib/python2.7/dist-packages/torch/autograd/__init__.py", line 98, in backward
variables, grad_variables, retain_graph)
File "/usr/local/lib/python2.7/dist-packages/torch/autograd/function.py", line 91, in apply
return self._forward_cls.backward(self, *args)
File "/usr/local/lib/python2.7/dist-packages/torch/autograd/_functions/tensor.py", line 29, in backward
grad_input[ctx.index] = grad_output
File "/usr/local/lib/python2.7/dist-packages/torch/autograd/variable.py", line 85, in __setitem__
return SetItem.apply(self, key, value)
File "/usr/local/lib/python2.7/dist-packages/torch/autograd/_functions/tensor.py", line 43, in forward
i._set_index(ctx.index, value)
RuntimeError: invalid argument 2: sizes do not match at /pytorch/torch/lib/THC/THCTensorCopy.cu:31
Hi @taolei87, do you think your SRU could improve further with the use of BatchNorm as in Cooijmans et al. (2017), which has been implemented by @jihunchoi here for an LSTM?
Thank you.
I have installed cupy with pip install cupy. However, I get No module named 'cupy'.
What is the reason?
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-2-52f03c45a3e3> in <module>()
11 from torch import optim
12 import torch.nn.functional as F
---> 13 from cuda_functional import SRU, SRUCell
/home/quoniammm/version-control/mine-pytorch-examples/torch_basic/cuda_functional.py in <module>()
7 import torch.nn as nn
8 from torch.autograd import Function, Variable
----> 9 from cupy.cuda import function
10 from pynvrtc.compiler import Program
11 from collections import namedtuple
ModuleNotFoundError: No module named 'cupy'
Terminal output of conda list:
certifi 2016.2.28 py36_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free
cupy 1.0.3 <pip>
......
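One possible cause, given that conda list shows cupy installed through pip: the notebook kernel may run a different Python interpreter than the one pip installed into. A quick check using only the standard library (the helper name is hypothetical):

```python
import importlib.util
import sys

def is_importable(name):
    """Return True if the current interpreter can locate the module."""
    return importlib.util.find_spec(name) is not None

print(sys.executable)         # the interpreter this session runs on
print(is_importable("cupy"))  # False means cupy is not visible to this interpreter
```

If this prints False, installing with that exact interpreter (running the printed path with -m pip install cupy) targets the environment the notebook actually uses.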
Here's a simple test. Train the model for 2 epochs on the SQuAD problem. Here's a bash dump from my machine.
(venv) ubuntu@x:~/work/sru/DrQA$ python train.py -e 0 -bs 32 --resume checkpoint_epoch_2.pt
seed: 937
10/09/2017 09:36:19 [program starts.]
10/09/2017 09:36:46 [Data loaded.]
10/09/2017 09:36:46 [loading previous model...]
2806118 parameters
10/09/2017 09:37:11 [dev EM: 55.638599810785244 F1: 66.81417432626549]
(venv) ubuntu@x:~/work/sru/DrQA$ python train.py -e 0 -bs 1 --resume checkpoint_epoch_2.pt
seed: 937
10/09/2017 09:37:36 [program starts.]
10/09/2017 09:38:03 [Data loaded.]
10/09/2017 09:38:03 [loading previous model...]
2806118 parameters
10/09/2017 09:39:33 [dev EM: 55.68590350047304 F1: 66.70301086732972]
(venv) ubuntu@ip-172-31-9-81:~/work/sru/DrQA$ python train.py -e 0 -bs 32 --resume checkpoint_epoch_2.pt
seed: 937
10/09/2017 09:42:17 [program starts.]
10/09/2017 09:42:45 [Data loaded.]
10/09/2017 09:42:45 [loading previous model...]
2806118 parameters
10/09/2017 09:43:10 [dev EM: 55.638599810785244 F1: 66.81417432626549]
The predictions should not depend on the batch size, right? Also, what's the point of setting the batch size to 1 here?
Does the existing codebase work with CUDA 9 as well? On the Titan V GPU we could take advantage of FP16 to further speed up training.
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1503970438496/work/torch/lib/THC/generic/THCStorage.cu line=66 error=30 : unknown error
Traceback (most recent call last):
File "/home/lai/filespace/eclipse-workpplace/sru-master/language_model/train_lm.py", line 14, in <module>
import cuda_functional as MF
File "/home/lai/filespace/eclipse-workpplace/sru-master/cuda_functional.py", line 13, in <module>
tmp_ = torch.rand(1,1).cuda()
File "/home/lai/anaconda3/lib/python3.6/site-packages/torch/_utils.py", line 66, in cuda
return new_type(self.size()).copy(self, async)
File "/home/lai/anaconda3/lib/python3.6/site-packages/torch/cuda/__init__.py", line 269, in _lazy_new
return super(_CudaBase, cls).new(cls, *args, **kwargs)
RuntimeError: cuda runtime error (30) : unknown error at /opt/conda/conda-bld/pytorch_1503970438496/work/torch/lib/THC/generic/THCStorage.cu:66
Could you please add the possibility to feed hidden states to the forward pass, as with nn.LSTMCell?
So how can we deal with the variable length of inputs to reduce the computation?
Hi,
Did you use a seed in your language model experiments? I tried to run the LSTM experiment with default parameters, but training was terminated at epoch 53. Looking at train_lm.py, I think this is because of this check:
if math.isnan(loss.data[0]) or math.isinf(loss.data[0]):
Thanks,
First I ran download.sh, and it successfully downloaded GloVe and the train/dev JSON files for SQuAD. However, python prepro.py gave me this:
Traceback (most recent call last):
File "prepro.py", line 243, in <module>
vocab_tag = list(nlp.tagger.tag_names)
AttributeError: 'Tagger' object has no attribute 'tag_names'
My spaCy version is 2.0.3, and it seems that something broke in the update from the 1.x version listed in the requirements. I didn't succeed in fixing it myself.
Any suggestions?
Hi
The current version of your paper on arxiv (https://arxiv.org/pdf/1709.02755.pdf) uses the forget gate in the output computation rather than the reset gate.
Based on your code and other formulas in the paper, it seems like it should say r there rather than f.
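For reference, an output equation consistent with the kernel line `*hp = (val*mask-(*xp))*g2 + (*xp)` quoted in another report here would read as follows, assuming g2 holds the reset gate r_t and val holds g(c_t), with the dropout mask omitted:

```latex
h_t = r_t \odot g(c_t) + (1 - r_t) \odot x_t
```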
When using SRU with DataParallel, I received the following error:
cupy.cuda.driver.CUDADriverError: CUDA_ERROR_INVALID_HANDLE: invalid resource handle
A single GPU works fine.
How can I use the ELMo vectors (https://github.com/allenai/allennlp/blob/master/tutorials/how_to/elmo.md) in the DrQA?
Hi Taolei,
In your SRU implementation, the backward step produces a grad_u matrix, but in many frameworks such as TensorFlow, the gradient operation is only required to compute the gradients of the op's inputs.
U = X.dot( [Wx, Wf, Wr] )
As I understand it, U has shape [seq_length, batch_size, n_out * 3 * direction_cnt], while W has shape [n_in, n_out * 3 * direction_cnt].
If I can compute grad_u, how can I convert it to grad_w?
I noticed that musyoku's code does a conversion like that (https://github.com/musyoku/chainer-sru/blob/master/sru/sru.py), but I have trouble understanding it.
Could you give me some advice?
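Since U = X.dot(W) is a plain matrix product, the chain rule gives grad_W = X^T.dot(grad_U) once the seq_length and batch_size dimensions of X and grad_U are flattened into rows. Frameworks with automatic differentiation normally apply this step for the matmul themselves, so a custom op only has to return grad_u. A plain-Python sketch with made-up shapes and a hypothetical function name:

```python
def grad_w_from_grad_u(X, grad_U):
    """X: rows of length n_in; grad_U: rows of length n_cols (same row count).

    Rows are the flattened (seq_length * batch_size) positions.
    Returns grad_W with shape (n_in, n_cols), i.e. X^T @ grad_U.
    """
    n_in, n_cols = len(X[0]), len(grad_U[0])
    grad_W = [[0.0] * n_cols for _ in range(n_in)]
    for x_row, g_row in zip(X, grad_U):
        for i, xv in enumerate(x_row):
            for j, gv in enumerate(g_row):
                grad_W[i][j] += xv * gv
    return grad_W

X = [[1.0, 2.0], [3.0, 4.0]]                 # 2 positions, n_in = 2
grad_U = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]  # 2 positions, n_cols = 3
print(grad_w_from_grad_u(X, grad_U))
# [[1.0, 3.0, 4.0], [2.0, 4.0, 6.0]]
```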
Just curious, how much would it impact the accuracy and speed (w/ roughly same amount of parameters)? Have you done some experiments on it?
Thanks!