Since generate.py is supposed to create .npz files that are compatible with the tutorial scripts, I was expecting the following commands to work (after putting input_chars.npz and input_chars_dict.npz in the appropriate place in DT_RNN_Tut.py, of course):
python generate.py --dest input_chars --level chars PATH_TO_TEXT_COMPRESSION_BENCHMARK
python DT_RNN_Tut.py
But it doesn't work. See below for details about what I'm doing and what I'm seeing. I have fresh versions of Theano and GroundHog from GitHub.
Is this a pilot error? Should I change something else in DT_RNN_Tut.py? The error IndexError: index 69 is out of bounds for size 50 is related to the dimensionality of the embedding layer. If I change state['n_in'] from 50 to 51, the IndexError message changes accordingly. The relevant section of DT_RNN_Tut.py:
# declare the dimensionalities of the input and output
if state['chunks'] == 'words':
    state['n_in'] = 10000
    state['n_out'] = 10000
else:
    state['n_in'] = 50
    state['n_out'] = 50
train_data, valid_data, test_data = get_text_data(state)
Similarly, if I switch from chars to words, the error becomes IndexError: index 33223 is out of bounds for size 10000, reflecting the dimensionality of the word embedding layer.
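The pattern above suggests the generated data contains token indices larger than the hard-coded embedding sizes. A minimal sketch of a sanity check (the file and key names here are invented for the demo; the real archive written by generate.py may use different keys): scan every integer array in the .npz for the largest token index, since state['n_in'] / state['n_out'] must be at least max_index + 1.

```python
import numpy as np

# Synthetic stand-in for input_chars.npz; real key names may differ.
np.savez('demo_chars.npz', train=np.array([3, 69, 12]), valid=np.array([5, 1]))

# Find the largest token index across all integer arrays in the archive.
data = np.load('demo_chars.npz')
max_index = max(int(data[name].max()) for name in data.files
                if np.issubdtype(data[name].dtype, np.integer))
vocab_size = max_index + 1
print(vocab_size)  # 70 here: an index of 69 needs an embedding table with >= 70 rows
```

If the printed value exceeds 50 for chars (or 10000 for words), the hard-coded dimensionalities in DT_RNN_Tut.py would need to be raised to match.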
Thanks!
I have train, valid, and test files derived from enwik8 from http://mattmahoney.net/dc/textdata.html:
$ wc -l ~/proj/benchmarks/large-text-compression/{train,test,valid}
44843 /home/ndronen/proj/benchmarks/large-text-compression/train
36655 /home/ndronen/proj/benchmarks/large-text-compression/test
44843 /home/ndronen/proj/benchmarks/large-text-compression/valid
126341 total
Running generate.py produces no errors, and the files input_chars.npz and input_chars_dict.npz are created:
$ python generate.py --dest input_chars --level chars ~/proj/benchmarks/large-text-compression/
Constructing the vocabulary ..
.. sorting words
.. shrinking the vocabulary size
EOL 0
Constructing train set
Constructing valid set
Constructing test set
Saving data
... Done
$ file input_chars*
input_chars_dict.npz: Zip archive data, at least v2.0 to extract
input_chars.npz: Zip archive data, at least v2.0 to extract
$ git diff DT_RNN_Tut.py
diff --git a/tutorials/DT_RNN_Tut.py b/tutorials/DT_RNN_Tut.py
index e6e83d8..c17d425 100644
--- a/tutorials/DT_RNN_Tut.py
+++ b/tutorials/DT_RNN_Tut.py
@@ -298,8 +298,10 @@ if __name__=='__main__':
state = {}
# complete path to data (cluster specific)
state['seqlen'] = 100
- state['path']= "/data/lisa/data/PennTreebankCorpus/pentree_char_and_word.npz"
- state['dictionary']= "/data/lisa/data/PennTreebankCorpus/dictionaries.npz"
+ #state['path']= "/data/lisa/data/PennTreebankCorpus/pentree_char_and_word.npz"
+ state['path']= 'input_chars.npz'
+ #state['dictionary']= "/data/lisa/data/PennTreebankCorpus/dictionaries.npz"
+ state['dictionary']= 'input_chars_dict.npz'
state['chunks'] = 'chars'
state['seed'] = 123
$ python DT_RNN_Tut.py
Using gpu device 0: GeForce GTX TITAN Black
data length is 9979512
data length is 9979512
data length is 8838862
/home/ndronen/.local/lib/python2.7/site-packages/Theano-0.6.0-py2.7.egg/theano/scan_module/scan_perform_ext.py:133: RuntimeWarning: numpy.ndarray size changed, may indicate binary incompatibility
from scan_perform.scan_perform import *
/home/ndronen/.local/lib/python2.7/site-packages/Theano-0.6.0-py2.7.egg/theano/sandbox/rng_mrg.py:1195: UserWarning: MRG_RandomStreams Can't determine #streams from size (Elemwise{Cast{int32}}.0), guessing 60*256
nstreams = self.n_streams(size)
Constructing grad function
Compiling grad function
took 0.283576965332
Validation computed every 1000
GPU status : Used 110.398 Mb Free 6033.414 Mb,total 6143.812 Mb [context start]
Saving the model...
Model saved, took 0.161453008652
Traceback (most recent call last):
File "DT_RNN_Tut.py", line 418, in <module>
jobman(state, None)
File "DT_RNN_Tut.py", line 293, in jobman
main.main()
File "/home/ndronen/proj/GroundHog/groundhog/mainLoop.py", line 293, in main
rvals = self.algo()
File "/home/ndronen/proj/GroundHog/groundhog/trainer/SGD_momentum.py", line 159, in __call__
rvals = self.train_fn()
File "/home/ndronen/.local/lib/python2.7/site-packages/Theano-0.6.0-py2.7.egg/theano/compile/function_module.py", line 605, in __call__
self.fn.thunks[self.fn.position_of_error])
File "/home/ndronen/.local/lib/python2.7/site-packages/Theano-0.6.0-py2.7.egg/theano/compile/function_module.py", line 595, in __call__
outputs = self.fn()
IndexError: index 69 is out of bounds for size 50
Apply node that caused the error: AdvancedSubtensor1(Elemwise{add,no_inplace}.0, x)
Inputs types: [TensorType(float32, matrix), TensorType(int64, vector)]
Inputs shapes: [(50, 400), (100,)]
Inputs strides: [(1600, 4), (8,)]
Inputs scalar values: ['not scalar', 'not scalar']
Backtrace when the node is created:
File "/home/ndronen/proj/GroundHog/groundhog/utils/utils.py", line 177, in dot
return matrix[inp]
Debugprint of the apply node:
AdvancedSubtensor1 [@A] ''
|Elemwise{add,no_inplace} [@B] ''
| |HostFromGpu [@C] ''
| | |W_0_emb_words [@D]
| |HostFromGpu [@E] ''
| |noise_W_0_emb_words [@F]
|x [@G]
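The Apply node in the traceback shows a (50, 400) embedding matrix being indexed by a vector of token ids (the matrix[inp] call in groundhog/utils/utils.py). A minimal NumPy reproduction of that failure mode, with shapes taken from the traceback (the Theano graph is not needed to see it):

```python
import numpy as np

# A (50, 400) embedding table, as reported in "Inputs shapes" above.
emb = np.zeros((50, 400), dtype=np.float32)
tokens = np.array([3, 69])  # 69 is a valid char id in the data, but out of range for 50 rows

# emb[tokens] is the NumPy analogue of AdvancedSubtensor1 / matrix[inp].
try:
    emb[tokens]
    raised = False
except IndexError as err:
    raised = True
    print(err)  # "index 69 is out of bounds ..." — same class of error as above
print(raised)
```

This supports the reading that the data's vocabulary is larger than the embedding table, rather than a bug in the indexing itself.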