Since generate.py is supposed to create .npz files that are compatible with the tutorial scripts, I was expecting the following commands to work (after putting input_chars.npz and input_chars_dict.npz in the appropriate place in DT_RNN_Tut.py, of course):
python generate.py --dest input_chars --level chars PATH_TO_TEXT_COMPRESSION_BENCHMARK
python DT_RNN_Tut.py
But it doesn't work. See below for details about what I'm doing and what I'm seeing. I have fresh versions of Theano and GroundHog from GitHub.
Is this a pilot error? Should I change something else in DT_RNN_Tut.py? The error IndexError: index 69 is out of bounds for size 50 is related to the dimensionality of the embedding layer. If I change state['n_in'] from 50 to 51, the IndexError message changes accordingly. The relevant section of DT_RNN_Tut.py:
# declare the dimensionalities of the input and output
if state['chunks'] == 'words':
    state['n_in'] = 10000
    state['n_out'] = 10000
else:
    state['n_in'] = 50
    state['n_out'] = 50
train_data, valid_data, test_data = get_text_data(state)
Similarly, if I switch from chars to words, the error becomes IndexError: index 33223 is out of bounds for size 10000, reflecting the dimensionality of the word embedding layer.
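The pattern above suggests the generated data contains token indices larger than the hard-coded embedding sizes. A minimal sketch of a sanity check (the file and key names here are invented for the demo; the real archive written by generate.py may use different keys): scan every integer array in the .npz for the largest token index, since state['n_in'] / state['n_out'] must be at least max_index + 1.

```python
import numpy as np

# Synthetic stand-in for input_chars.npz; real key names may differ.
np.savez('demo_chars.npz', train=np.array([3, 69, 12]), valid=np.array([5, 1]))

# Find the largest token index across all integer arrays in the archive.
data = np.load('demo_chars.npz')
max_index = max(int(data[name].max()) for name in data.files
                if np.issubdtype(data[name].dtype, np.integer))
vocab_size = max_index + 1
print(vocab_size)  # 70 here: an index of 69 needs an embedding table with >= 70 rows
```

If the printed value exceeds 50 for chars (or 10000 for words), the hard-coded dimensionalities in DT_RNN_Tut.py would need to be raised to match.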
Thanks!
I have train, valid, and test files derived from enwik8 from http://mattmahoney.net/dc/textdata.html:
$ wc -l ~/proj/benchmarks/large-text-compression/{train,test,valid}
44843 /home/ndronen/proj/benchmarks/large-text-compression/train
36655 /home/ndronen/proj/benchmarks/large-text-compression/test
44843 /home/ndronen/proj/benchmarks/large-text-compression/valid
126341 total
Running generate.py produces no errors, and the files input_chars.npz and input_chars_dict.npz are created:
$ python generate.py --dest input_chars --level chars ~/proj/benchmarks/large-text-compression/
Constructing the vocabulary ..
.. sorting words
.. shrinking the vocabulary size
EOL 0
Constructing train set
Constructing valid set
Constructing test set
Saving data
... Done
$ file input_chars*
input_chars_dict.npz: Zip archive data, at least v2.0 to extract
input_chars.npz: Zip archive data, at least v2.0 to extract
$ git diff DT_RNN_Tut.py
diff --git a/tutorials/DT_RNN_Tut.py b/tutorials/DT_RNN_Tut.py
index e6e83d8..c17d425 100644
--- a/tutorials/DT_RNN_Tut.py
+++ b/tutorials/DT_RNN_Tut.py
@@ -298,8 +298,10 @@ if __name__=='__main__':
state = {}
# complete path to data (cluster specific)
state['seqlen'] = 100
- state['path']= "/data/lisa/data/PennTreebankCorpus/pentree_char_and_word.npz"
- state['dictionary']= "/data/lisa/data/PennTreebankCorpus/dictionaries.npz"
+ #state['path']= "/data/lisa/data/PennTreebankCorpus/pentree_char_and_word.npz"
+ state['path']= 'input_chars.npz'
+ #state['dictionary']= "/data/lisa/data/PennTreebankCorpus/dictionaries.npz"
+ state['dictionary']= 'input_chars_dict.npz'
state['chunks'] = 'chars'
state['seed'] = 123
$ python DT_RNN_Tut.py
Using gpu device 0: GeForce GTX TITAN Black
data length is 9979512
data length is 9979512
data length is 8838862
/home/ndronen/.local/lib/python2.7/site-packages/Theano-0.6.0-py2.7.egg/theano/scan_module/scan_perform_ext.py:133: RuntimeWarning: numpy.ndarray size changed, may indicate binary incompatibility
from scan_perform.scan_perform import *
/home/ndronen/.local/lib/python2.7/site-packages/Theano-0.6.0-py2.7.egg/theano/sandbox/rng_mrg.py:1195: UserWarning: MRG_RandomStreams Can't determine #streams from size (Elemwise{Cast{int32}}.0), guessing 60*256
nstreams = self.n_streams(size)
Constructing grad function
Compiling grad function
took 0.283576965332
Validation computed every 1000
GPU status : Used 110.398 Mb Free 6033.414 Mb,total 6143.812 Mb [context start]
Saving the model...
Model saved, took 0.161453008652
Traceback (most recent call last):
File "DT_RNN_Tut.py", line 418, in <module>
jobman(state, None)
File "DT_RNN_Tut.py", line 293, in jobman
main.main()
File "/home/ndronen/proj/GroundHog/groundhog/mainLoop.py", line 293, in main
rvals = self.algo()
File "/home/ndronen/proj/GroundHog/groundhog/trainer/SGD_momentum.py", line 159, in __call__
rvals = self.train_fn()
File "/home/ndronen/.local/lib/python2.7/site-packages/Theano-0.6.0-py2.7.egg/theano/compile/function_module.py", line 605, in __call__
self.fn.thunks[self.fn.position_of_error])
File "/home/ndronen/.local/lib/python2.7/site-packages/Theano-0.6.0-py2.7.egg/theano/compile/function_module.py", line 595, in __call__
outputs = self.fn()
IndexError: index 69 is out of bounds for size 50
Apply node that caused the error: AdvancedSubtensor1(Elemwise{add,no_inplace}.0, x)
Inputs types: [TensorType(float32, matrix), TensorType(int64, vector)]
Inputs shapes: [(50, 400), (100,)]
Inputs strides: [(1600, 4), (8,)]
Inputs scalar values: ['not scalar', 'not scalar']
Backtrace when the node is created:
File "/home/ndronen/proj/GroundHog/groundhog/utils/utils.py", line 177, in dot
return matrix[inp]
Debugprint of the apply node:
AdvancedSubtensor1 [@A] ''
|Elemwise{add,no_inplace} [@B] ''
| |HostFromGpu [@C] ''
| | |W_0_emb_words [@D]
| |HostFromGpu [@E] ''
| |noise_W_0_emb_words [@F]
|x [@G]
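The Apply node in the traceback shows a (50, 400) embedding matrix being indexed by a vector of token ids (the matrix[inp] call in groundhog/utils/utils.py). A minimal NumPy reproduction of that failure mode, with shapes taken from the traceback (the Theano graph is not needed to see it):

```python
import numpy as np

# A (50, 400) embedding table, as reported in "Inputs shapes" above.
emb = np.zeros((50, 400), dtype=np.float32)
tokens = np.array([3, 69])  # 69 is a valid char id in the data, but out of range for 50 rows

# emb[tokens] is the NumPy analogue of AdvancedSubtensor1 / matrix[inp].
try:
    emb[tokens]
    raised = False
except IndexError as err:
    raised = True
    print(err)  # "index 69 is out of bounds ..." — same class of error as above
print(raised)
```

This supports the reading that the data's vocabulary is larger than the embedding table, rather than a bug in the indexing itself.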