asappresearch / sru

Training RNNs as Fast as CNNs (https://arxiv.org/abs/1709.02755)

License: MIT License

Python 59.90% Shell 10.32% C++ 12.67% Cuda 16.43% CMake 0.67%

pytorch deep-learning recurrent-neural-networks nlp

sru's Issues

Error when use SRU in DrQA

Hi,
When I use SRU in DrQA in place of LSTM, I get this error:
  File "/home/hebian.ww/DrQA/drqa/reader/cuda_functional.py", line 359, in forward
    stream=SRU_STREAM
  File "cupy/cuda/function.pyx", line 129, in cupy.cuda.function.Function.__call__ (cupy/cuda/function.cpp:3963)
  File "cupy/cuda/function.pyx", line 111, in cupy.cuda.function._launch (cupy/cuda/function.cpp:3600)
  File "cupy/cuda/driver.pyx", line 127, in cupy.cuda.driver.launchKernel (cupy/cuda/driver.cpp:2541)
  File "cupy/cuda/driver.pyx", line 62, in cupy.cuda.driver.check_status (cupy/cuda/driver.cpp:1446)
cupy.cuda.driver.CUDADriverError: CUDA_ERROR_INVALID_HANDLE: invalid resource handle
This is the same error as in #4, but I get it with a single GPU.

TypeError: 'float' object cannot be interpreted as an integer

I encountered this error with Python 3.6.
Changing line 355 of cuda_functional.py into num_block = (ncols-1)//thread_per_block+1 solved the problem.
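
For context, a minimal sketch of why that change helps: in Python 3 the `/` operator always returns a float, while `//` performs floor division and returns an int for int operands. The values of ncols and thread_per_block below are made up for illustration; the real ones come from the input shape.

```python
# Illustrative values only (assumptions, not taken from cuda_functional.py).
ncols = 1000
thread_per_block = 512

# Python 3: true division yields a float, which cannot be used as a block count.
num_block_float = (ncols - 1) / thread_per_block + 1

# Floor division keeps the result an int; this is ceil(ncols / thread_per_block).
num_block = (ncols - 1) // thread_per_block + 1

print(type(num_block_float).__name__, type(num_block).__name__)
print(num_block)
```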

About the k value in SRUCell: why can it be 4 when n_in != out_size?

Hi @taolei87,

I have a question about the weight matrix dimensions.
In the SRUCell code, I found k = 4 if n_in != out_size else 3.
But in the paper there are only 3 weight matrices: W, Wf, Wr.

I also found that n_in can differ from out_size when the layer index is 0, but I don't understand why k = 4 there. What is the extra weight besides W, Wf, Wr?

Below is the init code:

class SRUCell(nn.Module):
    def __init__(self, n_in, n_out, dropout=0, vari_dropout=0,
                 use_tanh=1, bidirectional=False):
 ....
        out_size = n_out*2 if bidirectional else n_out
        k = 4 if n_in != out_size else 3
        self.size_per_dir = n_out*k
        self.weight = nn.Parameter(torch.Tensor(
            n_in,
            self.size_per_dir*2 if bidirectional else self.size_per_dir
        ))
  ...

Below is where n_in is not equal to out_size:

class SRU(nn.Module):
    def __init__(self, input_size, hidden_size,
                 num_layers=2, dropout=0, vari_dropout=0,
                 use_tanh=1, bidirectional=False):
     ...
    ...
        self.n_in = input_size
        self.n_out = hidden_size
       ...
        self.out_size = hidden_size*2 if bidirectional else hidden_size

        for i in range(num_layers):
            l = SRUCell(n_in=self.n_in if i == 0 else self.out_size,
                        n_out=self.n_out,
                        dropout=dropout if i+1 != num_layers else 0,
                        vari_dropout=vari_dropout,
                        use_tanh=use_tanh,
                        bidirectional=bidirectional)
            self.rnn_lst.append(l)

Thanks
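
To make the shape arithmetic in the snippets above concrete, here is a small sketch that only mirrors the quoted lines. The helper name sru_weight_shape is hypothetical. My reading, which this thread does not confirm, is that the fourth block is an extra projection of the input used when the input and output sizes differ.

```python
def sru_weight_shape(n_in, n_out, bidirectional=False):
    """Mirror the shape arithmetic from the SRUCell snippet quoted above."""
    out_size = n_out * 2 if bidirectional else n_out
    # k counts the column blocks of the fused weight matrix per direction.
    k = 4 if n_in != out_size else 3
    size_per_dir = n_out * k
    n_cols = size_per_dir * 2 if bidirectional else size_per_dir
    return k, (n_in, n_cols)


# First layer with an embedding size that differs from the hidden size.
print(sru_weight_shape(300, 128))
# Later layer whose input already has the hidden size.
print(sru_weight_shape(128, 128))
# Bidirectional case where n_in equals 2 * n_out.
print(sru_weight_shape(6, 3, bidirectional=True))
```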

self-normalizing activation

Hi,

I saw that some people asked you to add recurrent batch norm. There is also SELU (scaled exponential linear unit, a self-normalizing activation), which is easy to implement and works well. Would you mind adding it?
ref: https://arxiv.org/abs/1706.02515
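
For readers unfamiliar with it, a minimal scalar sketch of SELU as defined in the referenced paper. This is plain Python for illustration only and says nothing about how it would be wired into the SRU kernels.

```python
import math

# Constants from the SELU paper (Klambauer et al., 2017).
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805


def selu(x):
    """Return scale * x for x > 0, and scale * alpha * (exp(x) - 1) otherwise."""
    if x > 0:
        return SELU_SCALE * x
    return SELU_SCALE * SELU_ALPHA * (math.exp(x) - 1.0)


print(selu(1.0), selu(-1.0))
```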

how does SRU work for decoder?

Would SRU provide any benefits for a decoder that decodes step by step based on the previous decoded output? It will no longer be parallelizable over time steps so the only latency saving would come from the reduced number of operations per time step (compared to LSTM)?

How to use custom activation function?

e.g. Leaky ReLU

Weight and grad shapes mismatch

Hi!

I have some issues running language_model example.

$ python3 train_lm.py --train train.txt --dev valid.txt --test test.txt 
Namespace(batch_size=32, bias=-3, clip_grad=5, d=910, depth=6, dev='valid.txt', dropout=0.7, lr=1.0, lr_decay=0.98, lr_decay_epoch=175, lstm=False, max_epoch=300, rnn_dropout=0.2, test='test.txt', train='train.txt', unroll_size=35, weight_decay=1e-05)

WARNING: set_bias() is deprecated. use `highway_bias` option in SRUCell() constructor.

WARNING: set_bias() is deprecated. use `highway_bias` option in SRUCell() constructor.

WARNING: set_bias() is deprecated. use `highway_bias` option in SRUCell() constructor.

WARNING: set_bias() is deprecated. use `highway_bias` option in SRUCell() constructor.

WARNING: set_bias() is deprecated. use `highway_bias` option in SRUCell() constructor.

WARNING: set_bias() is deprecated. use `highway_bias` option in SRUCell() constructor.
vocab size: 10000
num of parameters: 24026720
train_lm.py:110: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
  norms = [ "{:.0f}".format(x.norm().data[0]) for x in self.parameters() ]
        p_norm: ['100', '45', '90', '45', '90', '45', '90', '45', '90', '45', '90', '47', '90', '0']

SRU loaded for gpu 0
train_lm.py:140: UserWarning: torch.nn.utils.clip_grad_norm is now deprecated in favor of torch.nn.utils.clip_grad_norm_.
  torch.nn.utils.clip_grad_norm(model.parameters(), args.clip_grad)
Traceback (most recent call last):
  File "train_lm.py", line 262, in <module>
    main(args)
  File "train_lm.py", line 206, in main
    train_ppl = train_model(epoch, model, train)
  File "train_lm.py", line 145, in train_model
    p.data.add_(-lr, p.grad.data)
RuntimeError: expand(torch.cuda.FloatTensor{[2, 910]}, size=[1820]): the number of sizes provided (1) must be greater or equal to the number of dimensions in the tensor (2)

It looks like this happens because some weights and their gradients have different shapes.

Latest master SRU fails to train

Thanks for your awesome work on this model. Right now we're using it to process language and visual features to generate segmentation masks from referral expressions; however, we're experiencing major issues with the latest master revision.

Until commit 43c85ed, our model trained perfectly (without gradient clipping or any additional techniques). After updating to the latest version, however, the model fails to converge during training. Our old weights are also incompatible with the current SRU version, even though we haven't touched our code at all. Was any incompatibility introduced when the multigpu branch was merged?

Here's a little snippet depicting our current declaration and usage of the SRU model:

class Net(nn.Module):
    def __init__(self, *args, **kwargs):
        ...
        self.lang_model = SRU(emb_size, hid_size, num_layers=lang_layers)
        ...
        self.mrnn = SRU(mixed_size, hid_mixed_size,
                        num_layers=mixed_layers)

    def forward(self, vis, lang):
        ...
        lang = self.emb(lang)
        # LxB representation
        lang = torch.transpose(lang, 0, 1)
        # input has dimensions: seq_length x batch_size (1) x we_dim
        lang, _ = self.lang_model(lang)
        ...
        # input has dimensions: seq_length x batch_size (1) x mix_size
        output, _ = self.mrnn(q)
        ...

I would appreciate any guidance to solve this issue. Right now, our model takes 5 days to train and we would like to parallelize it on multiple GPUs, such that we can reduce that time.

Bidirectional forward and backward seem incorrect: each direction only captures half of input_x element-wise

Hi Tao,

I recently found an issue in the bidirectional case,
for example with input_size = 6, hidden_size = 3, direction_count = 2, length = 2, batch_size = 2.
In this case, k == 3.

The x matrix looks like this
(the letter f or r before each number marks whether it is read by the forward or the reverse (flip == 1) direction):

[[[f-0.302948 f-0.255578 f-0.110915  r0.1591    r0.928114  r0.92241 ]

  [f-0.50604   f0.391675 f-0.187608  r0.468802 r-0.648262 r-0.177739]]

 [[ f0.50936   f0.67189  f-0.619738  r0.377355  r0.545083 r-0.971449]
  [ f0.948531 f-0.551092  f0.227567 r-0.46116  r-0.496896 r-0.769874]]]

I added a print to the forward kernel:

[F] col:0  L:0 N:0 D:0 DIR:0 act:1 k:3 d:3 x: -0.302948 
[F] col:1  L:0 N:0 D:1 DIR:0 act:1 k:3 d:3 x: -0.255578 
[F] col:2  L:0 N:0 D:2 DIR:0 act:1 k:3 d:3 x: -0.110915 
[F] col:3  L:0 N:0 D:0 DIR:1 act:1 k:3 d:3 x: 0.377355 
[F] col:4  L:0 N:0 D:1 DIR:1 act:1 k:3 d:3 x: 0.545083 
[F] col:5  L:0 N:0 D:2 DIR:1 act:1 k:3 d:3 x: -0.971449 
[F] col:6  L:0 N:1 D:0 DIR:0 act:1 k:3 d:3 x: -0.506040 
[F] col:7  L:0 N:1 D:1 DIR:0 act:1 k:3 d:3 x: 0.391675 
[F] col:8  L:0 N:1 D:2 DIR:0 act:1 k:3 d:3 x: -0.187608 
[F] col:9  L:0 N:1 D:0 DIR:1 act:1 k:3 d:3 x: -0.461160 
[F] col:10  L:0 N:1 D:1 DIR:1 act:1 k:3 d:3 x: -0.496896
[F] col:11  L:0 N:1 D:2 DIR:1 act:1 k:3 d:3 x: -0.769874
[F] col:0  L:1 N:0 D:0 DIR:0 act:1 k:3 d:3 x: 0.509360 
[F] col:1  L:1 N:0 D:1 DIR:0 act:1 k:3 d:3 x: 0.671890 
[F] col:2  L:1 N:0 D:2 DIR:0 act:1 k:3 d:3 x: -0.619738 
[F] col:3  L:1 N:0 D:0 DIR:1 act:1 k:3 d:3 x: 0.159100 
[F] col:4  L:1 N:0 D:1 DIR:1 act:1 k:3 d:3 x: 0.928114 
[F] col:5  L:1 N:0 D:2 DIR:1 act:1 k:3 d:3 x: 0.922410 
[F] col:6  L:1 N:1 D:0 DIR:0 act:1 k:3 d:3 x: 0.948531 
[F] col:7  L:1 N:1 D:1 DIR:0 act:1 k:3 d:3 x: -0.551092 
[F] col:8  L:1 N:1 D:2 DIR:0 act:1 k:3 d:3 x: 0.227567 
[F] col:9  L:1 N:1 D:0 DIR:1 act:1 k:3 d:3 x: 0.468802 
[F] col:10  L:1 N:1 D:1 DIR:1 act:1 k:3 d:3 x: -0.648262 
[F] col:11  L:1 N:1 D:2 DIR:1 act:1 k:3 d:3 x: -0.177739 
F

For your reference, this is the code I added to print the values.

void sru_bi_fwd(...) {
for (int row = 0; row < len; ++row )
{
...
...
      *hp = (val*mask-(*xp))*g2 + (*xp);
      printf("[F] col:%d  L:%d N:%d D:%d DIR:%d act:%d k:%d d:%d x:%f\n",
             col, cnt, (col/d2), (col%d), flip, activation_type, k, d, *(xp) );

I found that the forward direction (flip == 0, printed as DIR) only accesses the left half of input x, and the backward direction (flip == 1) only accesses the right half of input x.

This behavior is very different from every other case in SRU (k == 3 unidirectional; k == 4 uni/bidirectional), since the reset gate can only carry over half of the input, while the activation sees all of x.

Do you think this is an issue?

output vs. hidden

Hi,

I'm trying to understand what exactly is "output", and what is "hidden". Based on the code in cuda_functional.py, it seems that "output" is corresponding to the hidden states of LSTM, and "hidden" is corresponding to the cell states of LSTM. Is this understanding right?

Or in fact, both hidden and output are the same, hidden is just the last time-step hidden state for each layer, while output is the all-time-step hidden states for the last layer?

output, hidden = rnn(x)      # forward pass

# output is (length, batch size, hidden size * number of directions)
# hidden is (layers, batch size, hidden size * number of directions)

prevx = input
lstc = []
for i, rnn in enumerate(self.rnn_lst):
        h, c = rnn(prevx, c0[i])
        prevx = h
        lstc.append(c)

if return_hidden:
        return prevx, torch.stack(lstc)
else:
        return prevx

Different input dimension compared to output dimension

Hi, I'm trying to implement a naive version of this paper in Keras, and was wondering how the case n_in != n_out is handled.

I went through the code a few times and couldn't understand the element-wise multiplication of (1 - r_t) with x_t when x_t has a different shape than r_t.
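
One way to picture the shape question is the single-step sketch below, in plain Python for one example. It assumes, based on my reading of the k = 4 branch quoted in other issues on this page and not on any confirmation here, that an extra matrix projects x_t to n_out before the (1 - r_t) term. The names sru_step and W_proj are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def sru_step(x, c_prev, W, W_f, W_r, b_f, b_r, W_proj=None):
    """One SRU time step for a single example.

    W, W_f, W_r have shape (n_out, n_in). W_proj is the assumed fourth
    matrix (k == 4) mapping x to n_out for the highway term; with
    W_proj=None (k == 3) x is used directly, which needs n_in == n_out.
    """
    x_tilde = matvec(W, x)
    f = [sigmoid(a + b) for a, b in zip(matvec(W_f, x), b_f)]
    r = [sigmoid(a + b) for a, b in zip(matvec(W_r, x), b_r)]
    c = [fi * cp + (1.0 - fi) * xt for fi, cp, xt in zip(f, c_prev, x_tilde)]
    x_res = matvec(W_proj, x) if W_proj is not None else x
    if len(x_res) != len(c):
        raise ValueError("highway input must have n_out entries")
    h = [ri * math.tanh(ci) + (1.0 - ri) * xr for ri, ci, xr in zip(r, c, x_res)]
    return h, c


# n_in = 3, n_out = 2: the projection makes the shapes line up.
W = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.2]]
h, c = sru_step([1.0, 2.0, 3.0], [0.0, 0.0], W, W, W, [0.0, 0.0], [0.0, 0.0], W_proj=W)
print(len(h), len(c))
```

Reusing the same toy matrix for every weight keeps the example short; real weights would of course be separate parameters.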

Calculating Backwards For SRU Results in CUDA error.

I'm not sure how, but I'm seeing this error when I try to compute the backwards function. Don't know if you've come across this during your debug?

Traceback (most recent call last):
  File "gan_language.py", line 341, in <module>
    G.backward(one)
  File "/usr/local/lib/python2.7/dist-packages/torch/autograd/variable.py", line 156, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
  File "/usr/local/lib/python2.7/dist-packages/torch/autograd/__init__.py", line 98, in backward
    variables, grad_variables, retain_graph)
  File "/home/nick/wgan-gp/sru/cuda_functional.py", line 417, in backward
    stream=SRU_STREAM
  File "cupy/cuda/function.pyx", line 129, in cupy.cuda.function.Function.__call__ (cupy/cuda/function.cpp:4010)
  File "cupy/cuda/function.pyx", line 111, in cupy.cuda.function._launch (cupy/cuda/function.cpp:3647)
  File "cupy/cuda/driver.pyx", line 127, in cupy.cuda.driver.launchKernel (cupy/cuda/driver.cpp:2541)
  File "cupy/cuda/driver.pyx", line 62, in cupy.cuda.driver.check_status (cupy/cuda/driver.cpp:1446)
cupy.cuda.driver.CUDADriverError: CUDA_ERROR_INVALID_HANDLE: invalid resource handle

how fast on inference/testing for seq2seq task?

Confused about the variable k in the code.

"k = 4 if n_in != out_size else 3".
I notice that many places involve the variable k, which is sometimes 4 and sometimes 3. It seems to depend on the layer. I think k is related to the 3 different W matrices in the paper, which are W, W_f, W_r. Could you please explain it? Thanks.

torch binding?

Hi
Thanks for releasing the code. I am wondering whether it is possible to release a torch binding?
Thanks!

Unable to reproduce speech experiments

Can you please provide more detailed instructions on how to run the speech experiments from your paper? I am a graduate student participating in the ICLR 2018 Reproducibility Challenge and we are having some difficulty reproducing the speech experiments as described by the installation instructions provided in the sru/speech directory. I was able to install Kaldi and CUDA 8, but I am uncertain how to build @yzhang87's forked CNTK version. According to your instructions:

Build KaldiReader in CNTK (follow the instruction).

I assume "KaldiReader" is the name of the file in @yzhang87's fork, is this correct? Are Microsoft's official instructions the correct instructions or are there specific instructions we should look for in the forked repository? Several steps from the official instructions are no longer appropriate for the fork. Notably, the fork expects a missing file called mkl.h which is no longer provided in any of the open source MKLML releases, which are suggested by the Microsoft instructions. Can you clarify how to obtain the correct MKLML dependencies to run your experiment? Thank you!

Do you plan to implement multi-gpu support ?

First, thanks for the impressive SRU code.
When will the multi-GPU version of SRU be ready?

python setup.py install fail

error:

Traceback (most recent call last):
  File "setup.py", line 53, in <module>
    version=get_version(),
  File "setup.py", line 21, in get_version
    with open(version_py) as fh:
FileNotFoundError: [Errno 2] No such file or directory: 'version.py'

libnvrtc.so: undefined symbol: nvrtcAddNameExpression

I get the following error message when running the code example you give on the readme page:

I'm using Cuda 7.5

Thanks for the help.

language_model

SRU's convergence slows down in the second half of training compared to LSTM

Hi, I use SRU in TensorFlow for a seq2seq model.

I compared the loss of SRU and LSTM (same hyperparameters and network structure). After 10 epochs, SRU's convergence slows down.
At epoch 50, SRU's loss is 0.05 while LSTM's loss is only 0.01, but the validation losses are very close.
My code is here: https://github.com/johnnykthink/SRU-Tensorflow/blob/master/sru.py

So, how can I fix it? Thanks.

Training loss graph, 300k data pairs (blue is SRU, gray is LSTM).

Validation loss graph, 30k data pairs.

loss=nan

@taolei87 Hi, first of all, thank you very much. Your contribution with SRU helps me a lot, and seeing models with SRU run so fast feels like a dream. I plugged SRU into my own code in place of LSTM (GRU). However, my new model with SRU always ends up with loss=nan. I read through previous issues and found that we should use gradient clipping. Can you give an example of how to set up gradient clipping? I could not find one in the published code, though the new pull requests may cover it. I hope more and more people enjoy the speed of SRU, it's magic!
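
As an illustration of what gradient clipping does, here is a framework-free sketch of clipping by global L2 norm. The function name and the toy gradients are made up. In PyTorch, the train_lm.py log quoted in another issue on this page shows a call of the form torch.nn.utils.clip_grad_norm(model.parameters(), args.clip_grad) placed before the parameter update, which serves a similar purpose.

```python
import math


def clip_grad_norm(grads, max_norm):
    """Scale a list of gradient vectors in place so that their global
    L2 norm is at most max_norm. Returns the norm measured before clipping."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        for vec in grads:
            for i in range(len(vec)):
                vec[i] *= scale
    return total_norm


# Toy gradients for two parameters; their global norm is 13.
grads = [[3.0, 4.0], [0.0, 12.0]]
norm_before = clip_grad_norm(grads, max_norm=5.0)
print(norm_before, grads)
```

Clipping only bounds the update size; if the loss is still nan afterwards, the learning rate or the input data would be the next things to check.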

Question about the SRU training speed in tensorflow

Hi Tao,

Thanks for the great work. I implemented your paper in TensorFlow.
In a seq2seq task, SRU training is 1.6x faster than TensorFlow's BasicLSTMCell!
And the accuracy is a little better than LSTM.

But how can I get the 5-10x speedup reported in your paper? Right now I'm feeding data to the model with feed_dict; I will try tfrecords later for comparison.

Thanks.

SRU Module Doesn't appear to Use Residual/skip connections

@taolei87 thanks for the repo again. Really good code.

One thing I noticed while analyzing the code is that the SRU class doesn't seem to have skip connections. Shouldn't it be:

        prevx = input
        lstc = []
        for i, rnn in enumerate(self.rnn_lst):
            h, c = rnn(prevx, c0[i])
            prevx += h #you have prevx = h

In this way the connections are residual which is useful for stacking multiple layers.
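
A toy sketch of the proposed stacking, written without any framework. It assumes the skip is only added when a layer's input and output sizes match, and it builds a new list instead of updating prev in place; both choices are mine, not the repository's. Note also that each SRU layer already mixes its input into its output through the (1 - r_t) * x_t term discussed in other issues on this page.

```python
def run_stack(layers, x, residual=False):
    """Feed x through a list of callables, optionally adding skip connections.

    Each layer maps a list of floats to a list of floats.
    """
    prev = x
    for layer in layers:
        h = layer(prev)
        if residual and len(h) == len(prev):
            # Residual connection: output of the block is input + layer(input).
            prev = [p + hi for p, hi in zip(prev, h)]
        else:
            prev = h
    return prev


def double(v):
    """Stand-in for a layer; a real SRUCell would go here."""
    return [2.0 * a for a in v]


plain = run_stack([double, double], [1.0, 2.0])
with_skips = run_stack([double, double], [1.0, 2.0], residual=True)
print(plain, with_skips)
```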

Use relu or tanh?

Is it the same as with LSTM?

There is a bug in language model experiment

In the main function of language_model, the word vocabulary should contain all words from train+dev+test; otherwise the program will throw an exception.

model = Model(train+dev+test, args)

A little question about the architecture

Could anyone give me some tips about the architecture?

In the last equation, does that mean input x_t should have the same size as output h_t, since it is an element-wise product?
Thanks a lot.

Hyperparameter for LM experiment with Depth 6

Hi,

Would you mind sharing the --d parameter for --depth 6? I was trying to find the right one and maintain 24M parameter budget but was having some difficulties.

Thank you!

Has the DrQA model been submitted for evaluation on the SQuAD test set?

I wanted to know if there was a score available for the test set.

question: SRU and end-to-end speech recognition models

Hi,

Part of the experiments with SRU were conducted on speech recognition tasks (section 4.5); however, no end-to-end models like deepspeech2 were evaluated with SRU. Have you tried this? Are there any architectural problems with applying SRU to the deepspeech2 model and end-to-end speech recognition models in general?

Many thanks for the wonderful work on SRU

Best Regards,
Jacek

no support for variable sequence lengths with pytorch PackedSequence

Hi, one of the merits of RNNs is support for variable sequence lengths. In many applications (speech, NLP etc.), a batch of samples consists of sequences that do not have the same length, thus requiring padding for batching. To avoid training on the padded and non-informative parts, pytorch uses PackedSequence objects. With SRU, I get the following error using a PackedSequence as input:

in forward assert input.dim() == 3 # (len, batch, n_in)
AttributeError: 'PackedSequence' object has no attribute 'dim'

Do you plan to add support for PackedSequence objects?

word2vec processing procedure

I am new to word2vec, so I'm confused about the complete process behind "Download pre-trained word embeddings such as word2vec; make it into text format" in the classification task. I have cloned the word2vec repo and followed the quick start at https://code.google.com/archive/p/word2vec/, then ran the demo scripts ./demo-word.sh and ./demo-phrases.sh. But I don't know how to "make it into text format". Could you please give a more precise description?
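
One possible reading of "make it into text format" is converting the binary vectors file into lines of the form "word v1 v2 ...". The sketch below does that with the standard library only. It assumes the layout written by the original C tool (a "vocab dim" header line, then for each word its bytes, a space, and dim little-endian float32 values). The function name is hypothetical, and the exact text layout the classification script expects (for example, whether it wants the header line) is not confirmed here.

```python
import io
import struct


def word2vec_bin_to_text(src, dst):
    """Convert word2vec binary vectors (binary file-like) to text lines."""
    vocab_size, dim = (int(n) for n in src.readline().split())
    dst.write("{} {}\n".format(vocab_size, dim))
    for _ in range(vocab_size):
        chars = []
        while True:
            ch = src.read(1)
            if ch == b" ":
                break
            if ch == b"":
                raise ValueError("unexpected end of file")
            if ch != b"\n":  # skip the newline left after the previous vector
                chars.append(ch)
        word = b"".join(chars).decode("utf-8", errors="ignore")
        values = struct.unpack("<{}f".format(dim), src.read(4 * dim))
        dst.write(word + " " + " ".join("{:.6f}".format(v) for v in values) + "\n")


# Tiny in-memory example instead of a real vectors file.
raw = (b"2 2\n"
       + b"hello " + struct.pack("<2f", 1.0, -0.5) + b"\n"
       + b"world " + struct.pack("<2f", 0.25, 2.0) + b"\n")
out = io.StringIO()
word2vec_bin_to_text(io.BytesIO(raw), out)
print(out.getvalue())
```

For a real file, src would be opened with open(path, "rb") and dst with open(path, "w", encoding="utf-8"). If you train the vectors yourself, the word2vec tool's -binary 0 option should write text output directly.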
Thanks!

Error with pytorch=0.4

When I use SRU with pytorch=0.4, I get this error:
    optimizer.step()
  File "/usr/local/lib/python3.5/dist-packages/torch/optim/sgd.py", line 93, in step
    d_p.add_(weight_decay, p.data)
RuntimeError: The expanded size of the tensor (56) must match the existing size (112) at non-singleton dimension 1
When I replace SRU with GRU, it doesn't happen.
Sorry that I haven't investigated further where the error comes from; I just wanted to report it.

Looking forward to a MNIST training example

I would like to see how much of a speedup it gives.
Thanks!

RuntimeError: SRU_Compute_GPULegacyBackward is not differentiable twice

I tried to implement a discriminator of WGAN-GP via SRU but failed with this error.

An error occurred in the backward pass after adding the slice of hidden state.

Hi,
when I use SRU in NMT
this error happened to me.

bidirectional = True
output, hidden = enc(x)      # forward pass

# (4, 64, 400) - (layers x batch_size x rnn_size*2)
hidden = hidden[:, :, 0:rnn_size] + hidden[:, :, rnn_size:]

output, hidden = dec(y, hidden)

...

loss.div(x.size(1)).backward()

loss.div(x.size(1)).backward()
  File "/usr/local/lib/python2.7/dist-packages/torch/autograd/variable.py", line 156, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
  File "/usr/local/lib/python2.7/dist-packages/torch/autograd/__init__.py", line 98, in backward
    variables, grad_variables, retain_graph)
  File "/usr/local/lib/python2.7/dist-packages/torch/autograd/function.py", line 91, in apply
    return self._forward_cls.backward(self, *args)
  File "/usr/local/lib/python2.7/dist-packages/torch/autograd/_functions/tensor.py", line 29, in backward
    grad_input[ctx.index] = grad_output
  File "/usr/local/lib/python2.7/dist-packages/torch/autograd/variable.py", line 85, in __setitem__
    return SetItem.apply(self, key, value)
  File "/usr/local/lib/python2.7/dist-packages/torch/autograd/_functions/tensor.py", line 43, in forward
    i._set_index(ctx.index, value)
RuntimeError: invalid argument 2: sizes do not match at /pytorch/torch/lib/THC/THCTensorCopy.cu:31

Recurrent BatchNorm for SRU?

Hi @taolei87, do you think your SRU could improve further with the use of BatchNorm as in Cooijmans et al. (2017), which has been implemented by @jihunchoi here for an LSTM?
Thank you.

No module named 'cupy'

I have installed cupy with pip install cupy. However, I still get No module named 'cupy'.
What could be the reason?

ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-2-52f03c45a3e3> in <module>()
     11 from torch import optim
     12 import torch.nn.functional as F
---> 13 from cuda_functional import SRU, SRUCell

/home/quoniammm/version-control/mine-pytorch-examples/torch_basic/cuda_functional.py in <module>()
      7 import torch.nn as nn
      8 from torch.autograd import Function, Variable
----> 9 from cupy.cuda import function
     10 from pynvrtc.compiler import Program
     11 from collections import namedtuple

ModuleNotFoundError: No module named 'cupy'

Terminal, conda list:

certifi                   2016.2.28                py36_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free
cupy                      1.0.3                     <pip>
......

Different F1 scores in DrQA when resuming with different batch sizes

Here's a simple test. Train the model for 2 epochs on the SQUAD problem. Here's a bash dump from my machine.

(venv) ubuntu@x:~/work/sru/DrQA$ python train.py -e 0 -bs 32  --resume checkpoint_epoch_2.pt
seed: 937
10/09/2017 09:36:19 [program starts.]
10/09/2017 09:36:46 [Data loaded.]
10/09/2017 09:36:46 [loading previous model...]
2806118 parameters
10/09/2017 09:37:11 [dev EM: 55.638599810785244 F1: 66.81417432626549]
(venv) ubuntu@x:~/work/sru/DrQA$ python train.py -e 0 -bs 1  --resume checkpoint_epoch_2.pt
seed: 937
10/09/2017 09:37:36 [program starts.]
10/09/2017 09:38:03 [Data loaded.]
10/09/2017 09:38:03 [loading previous model...]
2806118 parameters
10/09/2017 09:39:33 [dev EM: 55.68590350047304 F1: 66.70301086732972]
(venv) ubuntu@ip-172-31-9-81:~/work/sru/DrQA$ python train.py -e 0 -bs 32  --resume checkpoint_epoch_2.pt
seed: 937
10/09/2017 09:42:17 [program starts.]
10/09/2017 09:42:45 [Data loaded.]
10/09/2017 09:42:45 [loading previous model...]
2806118 parameters
10/09/2017 09:43:10 [dev EM: 55.638599810785244 F1: 66.81417432626549]

The predictions should not depend on the batch size, right? Also, what's the point of setting the batch size to 1 here?

CUDA 9 Support

Does the existing codebase work with CUDA 9 as well? On the Titan V GPU we could take advantage of FP16 to further speed up training.

I couldn't run it because of an error, and I couldn't find the cause.

THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1503970438496/work/torch/lib/THC/generic/THCStorage.cu line=66 error=30 : unknown error
Traceback (most recent call last):
  File "/home/lai/filespace/eclipse-workpplace/sru-master/language_model/train_lm.py", line 14, in <module>
    import cuda_functional as MF
  File "/home/lai/filespace/eclipse-workpplace/sru-master/cuda_functional.py", line 13, in <module>
    tmp_ = torch.rand(1,1).cuda()
  File "/home/lai/anaconda3/lib/python3.6/site-packages/torch/_utils.py", line 66, in cuda
    return new_type(self.size()).copy(self, async)
  File "/home/lai/anaconda3/lib/python3.6/site-packages/torch/cuda/__init__.py", line 269, in _lazy_new
    return super(_CudaBase, cls).__new__(cls, *args, **kwargs)
RuntimeError: cuda runtime error (30) : unknown error at /opt/conda/conda-bld/pytorch_1503970438496/work/torch/lib/THC/generic/THCStorage.cu:66

hidden states in input

Could you please add the option to feed hidden states to the forward pass, as with nn.LSTMCell?

It seems that sru doesn't support pack_padded_sequence()

So how can we deal with the variable length of inputs to reduce the computation?
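
One common workaround, sketched below in plain Python, is to pad the batch and carry a mask so that padded positions do not contribute to the loss. This is an assumption about how one might cope without PackedSequence support, not something the repository provides, and it does not save the computation spent on padded steps. The layout is time-major (len, batch), matching the shape comment in the assert quoted in another issue on this page.

```python
def pad_batch(seqs, pad_value=0):
    """Pad variable-length sequences to a (max_len, batch) layout plus a mask.

    mask[t][b] is 1 where position t of sequence b is real data, 0 where padded.
    """
    max_len = max(len(s) for s in seqs)
    padded = [[s[t] if t < len(s) else pad_value for s in seqs] for t in range(max_len)]
    mask = [[1 if t < len(s) else 0 for s in seqs] for t in range(max_len)]
    return padded, mask


def masked_mean_loss(losses, mask):
    """Average per-position losses over real (unpadded) positions only."""
    total = sum(l * m for row_l, row_m in zip(losses, mask) for l, m in zip(row_l, row_m))
    count = sum(m for row in mask for m in row)
    return total / count


padded, mask = pad_batch([[5, 6, 7], [8]])
print(padded)
print(mask)
```

Sorting or bucketing sequences by length before batching reduces the amount of padding, which is the usual way to cut the wasted computation.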

train_lm.py using LSTM results nan/inf

Hi,
Did you use a seed in your language model experiments? I tried to run the LSTM experiment with default parameters, but training was terminated at epoch 53. Looking at train_lm.py, I think this is because of:

if math.isnan(loss.data[0]) or math.isinf(loss.data[0]):

Thanks,

AttributeError when preprocessing data for DrQA

First I ran download.sh, and it successfully downloaded GloVe and the train/dev JSONs for SQuAD. However, python prepro.py gave me this:

Traceback (most recent call last):
  File "prepro.py", line 243, in <module>
    vocab_tag = list(nlp.tagger.tag_names)
AttributeError: 'Tagger' object has no attribute 'tag_names'

My spaCy version is 2.0.3, and it seems like something broke in the update from the 1.x version listed in the requirements. I didn't succeed in fixing it myself.
Any suggestions?

Pseudocode typo in output state

The current version of your paper on arxiv (https://arxiv.org/pdf/1709.02755.pdf) uses the forget gate in the output computation rather than the reset gate.

Based on your code and other formulas in the paper, it seems like it should say r there rather than f.
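
For reference, the output-state equation as this issue suggests it should read, with the reset gate r_t in both terms. The notation (g for the activation, \odot for the element-wise product) is assumed from the paper rather than quoted from this thread; the form agrees with the kernel line `*hp = (val*mask-(*xp))*g2 + (*xp);` quoted in another issue on this page, if g2 is taken to be the reset gate.

```latex
h_t = r_t \odot g(c_t) + (1 - r_t) \odot x_t
```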

Error when using DataParallel

When using with DataParallel, received the following error
cupy.cuda.driver.CUDADriverError: CUDA_ERROR_INVALID_HANDLE: invalid resource handle

single gpu works fine

using ELMo in DrQA

How can I use the ELMo vectors (https://github.com/allenai/allennlp/blob/master/tutorials/how_to/elmo.md) in the DrQA?

About grad: gradient check fails in some cases; how to correctly calculate x's gradient?

Hi Taolei,

In your SRU implementation, the backward step updates a grad_u matrix, but in many frameworks like TensorFlow, the gradient operation only requires computing the gradients of the input parameters,

U = X.dot( [Wx, Wf, Wr] )

As I understand it, U has shape [seq_length, batch_size, n_out * 3 * direction_cnt], but W has shape [n_in, n_out * 3 * direction_cnt].

If I can compute grad_u, how can I convert it to grad_w?

I noticed musyoku's code does a conversion like that (https://github.com/musyoku/chainer-sru/blob/master/sru/sru.py), but I have trouble understanding it.

Could you give me some advice?
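
As a sketch of the underlying math (standard matrix calculus, not taken from either code base): if U = X · W with X flattened to shape (seq_length * batch_size, n_in), then grad_W = Xᵀ · grad_U and grad_X = grad_U · Wᵀ. The helper names below are made up, and plain lists stand in for tensors.

```python
def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def linear_backward(X, W, grad_U):
    """Given U = X @ W and dLoss/dU, return (dLoss/dW, dLoss/dX).

    X is (length * batch, n_in): the time and batch axes are flattened,
    so the sum over all positions happens inside the matrix product.
    """
    grad_W = matmul(transpose(X), grad_U)  # shape (n_in, n_cols)
    grad_X = matmul(grad_U, transpose(W))  # shape (length * batch, n_in)
    return grad_W, grad_X


X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # 3 positions, n_in = 2
W = [[0.5, -1.0, 0.0], [2.0, 0.0, 1.0]]    # n_in = 2, n_cols = 3
grad_U = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
grad_W, grad_X = linear_backward(X, W, grad_U)
print(grad_W)
print(grad_X)
```

In a framework with automatic differentiation, returning grad_U from the custom SRU op is usually enough, because the matrix multiplication that produced U has its own gradient and turns grad_U into grad_W and grad_X.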

What if turning off the highway connection?

Just curious, how much would it impact the accuracy and speed (w/ roughly same amount of parameters)? Have you done some experiments on it?

Thanks!