localminimum / qanet

A Tensorflow implementation of QANet for machine reading comprehension

License: MIT License

Python 96.50% Shell 1.26% HTML 2.23%

cnn machine-comprehension nlp squad tensorflow

qanet's Issues

Unable to load pre-trained weights

Steps I followed:
1. Cloned the repo and downloaded the weights
2. Ran sh download.sh
3. Ran python config.py --mode prepro
4. Ran python config.py --mode demo
Error:

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/home/kamalraj/anaconda2/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/home/kamalraj/anaconda2/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/mnt/d/ML exp/Fast-Reading-Comprehension/demo.py", line 74, in demo_backend
    saver.restore(sess, tf.train.latest_checkpoint(config.save_dir))
  File "/home/kamalraj/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1755, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/home/kamalraj/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 905, in run
    run_metadata_ptr)
  File "/home/kamalraj/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1137, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/kamalraj/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1355, in _do_run
    options, run_metadata)
  File "/home/kamalraj/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1374, in _do_call
    raise type(e)(node_def, op, message)
InvalidArgumentError: Assign requires shapes of both tensors to match. lhs shape= [91588,300] rhs shape= [91589,300]
         [[Node: save/Assign_375 = Assign[T=DT_FLOAT, _class=["loc:@word_mat"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](word_mat, save/RestoreV2:375)]]

Caused by op u'save/Assign_375', defined at:
  File "/home/kamalraj/anaconda2/lib/python2.7/threading.py", line 774, in __bootstrap
    self.__bootstrap_inner()
  File "/home/kamalraj/anaconda2/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/home/kamalraj/anaconda2/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/mnt/d/ML exp/Fast-Reading-Comprehension/demo.py", line 73, in demo_backend
    saver = tf.train.Saver()
  File "/home/kamalraj/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1293, in __init__
    self.build()
  File "/home/kamalraj/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1302, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/home/kamalraj/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1339, in _build
    build_save=build_save, build_restore=build_restore)
  File "/home/kamalraj/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 796, in _build_internal
    restore_sequentially, reshape)
  File "/home/kamalraj/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 471, in _AddRestoreOps
    assign_ops.append(saveable.restore(saveable_tensors, shapes))
  File "/home/kamalraj/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 161, in restore
    self.op.get_shape().is_fully_defined())
  File "/home/kamalraj/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/state_ops.py", line 280, in assign
    validate_shape=validate_shape)
  File "/home/kamalraj/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/gen_state_ops.py", line 58, in assign
    use_locking=use_locking, name=name)
  File "/home/kamalraj/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/kamalraj/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3271, in create_op
    op_def=op_def)
  File "/home/kamalraj/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1650, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to match. lhs shape= [91588,300] rhs shape= [91589,300]
         [[Node: save/Assign_375 = Assign[T=DT_FLOAT, _class=["loc:@word_mat"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](word_mat, save/RestoreV2:375)]]

No such file or directory: 'demo.html'

I am trying to run the interactive server, but when I navigate to the server URL, the page throws up a 500 code error (Internal Server Error).

The trace for the error is:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/bottle.py", line 862, in _handle
    return route.call(**args)
  File "/usr/local/lib/python3.6/dist-packages/bottle.py", line 1740, in wrapper
    rv = callback(*a, **ka)
  File "/home/rudresh/Documents/machine_comprehension/Fast-Reading-Comprehension/demo.py", line 25, in home
    with open('demo.html', 'r') as fl:
FileNotFoundError: [Errno 2] No such file or directory: 'demo.html'
127.0.0.1 - - [07/Apr/2018 10:07:55] "GET / HTTP/1.1" 500 739
127.0.0.1 - - [07/Apr/2018 10:07:56] "GET /favicon.ico HTTP/1.1" 404 740
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/bottle.py", line 862, in _handle
    return route.call(**args)
  File "/usr/local/lib/python3.6/dist-packages/bottle.py", line 1740, in wrapper
    rv = callback(*a, **ka)
  File "/home/rudresh/Documents/machine_comprehension/Fast-Reading-Comprehension/demo.py", line 25, in home
    with open('demo.html', 'r') as fl:
FileNotFoundError: [Errno 2] No such file or directory: 'demo.html'

This is obviously caused by the missing demo.html file. Could you tell me where I can get this file?

I am running Python 3.6 on Ubuntu 16.04.

how to train by changing/adding batch_size?

I am not able to free up enough GPU resources for training, so I would like to change the batch size. How can I add or update batch_size?

Try to test ELMO language model

Hi all,
According to results from both the Google Brain team and AllenNLP, using ELMo can give a big boost in results. I noticed that AllenNLP provides some pretrained ELMo models. I would love to see some better results.
Thanks.
[1] QANet slide
[2] ELMO page

training on "Answer not available" ?

Any suggestions on how to train the network to give a "Not available" answer for questions that cannot be answered from the context?

Inference on my machine differs from other machines I tested

Hi, I trained the model on AWS (GPU instance) for 60K steps. I then tested it on several GPU/CPU instances and the results are consistent. When I deploy it locally on my Ubuntu desktop (CPU only), the inferences are totally off. I tested on an AWS GPU instance (p2.xlarge), an AWS CPU instance (c5d.4xlarge) and also on Colab. All three give consistent answers for a given context and question. Only on my desktop are the answers way off. Any input as to why this could be happening would help. Thanks!

Train Models With Macs

Hi. I have a MacBook Air (Mid 2017) and I want to train the model. It doesn't have a GPU, so how can I train the model without one?

how to predict answers for custom question and context by reusing loaded model

Hello All,

I have many JSON files in the same format as the standard train or dev file. Can I feed them to this network and get predicted answers for different input questions and contexts?

Thanks,
Sachin B. Ichake

InvalidArgumentError during evaluation

Hello,

For some questions in the SQuAD dataset I get this exception:

InvalidArgumentError (see above for traceback): num_upper must be negative or less or equal to number of columns (10) got: 30
[[Node: Output_Layer/MatrixBandPart = MatrixBandPart[T=DT_FLOAT, Tindex=DT_INT64, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Output_Layer/MatMul, Output_Layer/MatrixBandPart/num_lower, Output_Layer/MatrixBandPart/num_upper)]]

Do you know the reason for this? How can I get rid of the problem?
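
The message says the requested upper bandwidth (30) is larger than the number of columns (10). A plausible reading is that a fixed answer-length limit is being applied to a batch whose padded passage length is shorter than that limit. Below is a minimal sketch in plain Python of the band mask that MatrixBandPart computes, with the bandwidth clamped first. The names band_mask, clamped_upper and ans_limit are illustrative and not taken from the repository.

```python
def band_mask(n_rows, n_cols, num_lower, num_upper):
    """Return a 0/1 matrix keeping entry (i, j) when it lies inside the band.

    An entry is kept when (i - j) <= num_lower and (j - i) <= num_upper.
    A negative bound means "keep the whole triangle" on that side.
    """
    mask = []
    for i in range(n_rows):
        row = []
        for j in range(n_cols):
            keep_lower = num_lower < 0 or (i - j) <= num_lower
            keep_upper = num_upper < 0 or (j - i) <= num_upper
            row.append(1 if keep_lower and keep_upper else 0)
        mask.append(row)
    return mask


def clamped_upper(ans_limit, n_cols):
    """Clamp the upper bandwidth so it never exceeds the number of columns."""
    return min(ans_limit, n_cols)


length = 10      # number of columns reported in the error message
ans_limit = 30   # requested num_upper reported in the error message
mask = band_mask(length, length, 0, clamped_upper(ans_limit, length))
```

In a TensorFlow graph the same clamp could probably be written with tf.minimum on the dynamic shape of the matrix, but I have not checked that against this repository.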

OOM error while training

What are the specifications of the system you used for training?

Can you share pre-trained model weights?

about embedding matrix structure

Thanks for the brilliant code.
I noticed this sentence in the paper:

"all the out-of-vocabulary words are mapped to a token, whose embedding is trainable with random initialization." This is not in your code (they used a pretrained matrix). That seems to make sense.
Does that work for the model?

Trainable Embedding for OOV words

Hello,

I have a question about your code: all OOV words are represented by id 1, which means that all OOV words are treated as the same word, and its embedding is a zero vector. Also, this embedding is not updated during training. However, in the original paper, the authors mention that the word embeddings for OOV words are updated during training.

I think this may be one reason why the score is lower than in the original paper.
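
To make the idea concrete, here is a toy sketch in plain Python of "pretrained rows frozen, OOV row trainable": one SGD step that only touches the rows listed as trainable. The function sgd_step, the learning rate and the tiny matrix are all made up for illustration; this is not how the repository or the paper implements it.

```python
def sgd_step(emb, grads, trainable_rows, lr=0.1):
    """Apply one SGD step, updating only the rows listed in trainable_rows."""
    updated = []
    for idx, (row, grad) in enumerate(zip(emb, grads)):
        if idx in trainable_rows:
            updated.append([w - lr * g for w, g in zip(row, grad)])
        else:
            updated.append(list(row))  # pretrained rows stay frozen
    return updated


OOV_ID = 1  # the issue states that all OOV words share id 1

# Toy 3-word, 2-dimensional embedding matrix (values are invented).
emb = [[0.0, 0.0], [0.0, 0.0], [0.5, -0.5]]
grads = [[1.0, 1.0], [1.0, -1.0], [1.0, 1.0]]

new_emb = sgd_step(emb, grads, trainable_rows={OOV_ID})
```

After the step only row 1 has moved away from zero. The paper, as quoted in the issue above, describes random rather than zero initialization for that row.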

How to train in Multi GPU

I see that TensorFlow detected 2 GPUs, but training only runs on 1 GPU. Please advise.

Training stops after some time

Hello everyone,

I've been trying to train a model with different num_heads, hidden and num_steps parameters.
The default parameters in config.py work like a charm, but once I change the mentioned parameters, I get this:

Exception ignored in: <bound method tqdm.__del__ of  42%|██████████████████████▉                                | 49999/120000 [15:34:24<18:06:29,  1.07it/s]>
Traceback (most recent call last):█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 328/328 [02:05<00:00,  2.53it/s]
  File "/home/username/.virtualenvs/qanet/lib/python3.5/site-packages/tqdm/_tqdm.py", line 889, in __del__
    self.close()
  File "/home/username/.virtualenvs/qanet/lib/python3.5/site-packages/tqdm/_tqdm.py", line 1095, in close
    self._decr_instances(self)
  File "/home/username/.virtualenvs/qanet/lib/python3.5/site-packages/tqdm/_tqdm.py", line 454, in _decr_instances
    cls.monitor.exit()
  File "/home/username/.virtualenvs/qanet/lib/python3.5/site-packages/tqdm/_monitor.py", line 52, in exit
    self.join()
  File "/usr/lib/python3.5/threading.py", line 1051, in join
    raise RuntimeError("cannot join current thread")
RuntimeError: cannot join current thread

This occurred when I set num_heads to 2, 4 and 8. I could train up to 50k and 54k steps when num_heads was set to 2 and 4, and it failed from the start when num_heads was set to 8.

I'm using Ubuntu 16.04, Python 3.5.2 and training the network on a GPU. Here's the nvidia-smi and nvcc --version output if someone needs it:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.44                 Driver Version: 396.44                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   72C    P0    63W / 149W |      0MiB / 11441MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176

So what could be the real cause of this error?

Thanks in advance!

Pre-loaded glove char vectors have mismatched tensor shapes

When "pretrained_char" is set to True, there is a tensor reshape size conflict.

glove_char_file = os.path.join('data/glove', "glove.840B.300d-char.txt")
flags.DEFINE_string("glove_char_file", glove_char_file, "Glove character embedding source file")
flags.DEFINE_boolean("pretrained_char", True, "Whether to use pretrained character embedding")

Error is from model.py line 76, below. How can the reshape dimensions be adjusted?

Error:

Traceback (most recent call last):
  File "config.py", line 152, in <module>
    tf.app.run()
  File "/home/my/anaconda3/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "config.py", line 133, in main
    train(config)
  File "QANet/main.py", line 95, in train
    handle: train_handle, model.dropout: config.dropout})
  File "/home/my/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 905, in run
    run_metadata_ptr)
  File "/home/my/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1140, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/my/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1321, in _do_run
    run_metadata)
  File "/home/my/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1340, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Input to reshape is a tensor with 2445312 values, but the requested shape has 11462400
	 [[Node: Input_Embedding_Layer/Reshape = Reshape[T=DT_FLOAT, Tshape=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Input_Embedding_Layer/embedding_lookup, Input_Embedding_Layer/Reshape/shape)]]
	 [[Node: Identity/_4743 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_52979_Identity", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Caused by op 'Input_Embedding_Layer/Reshape', defined at:
  File "config.py", line 152, in <module>
    tf.app.run()
  File "/home/my/anaconda3/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "config.py", line 133, in main
    train(config)
  File "QANet/main.py", line 72, in train
    model = Model(config, iterator, word_mat, char_mat, graph = g)
  File "QANet/model.py", line 60, in __init__
    self.forward()
  File "QANet/model.py", line 76, in forward
    ch_emb = tf.reshape(tf.nn.embedding_lookup(self.char_mat, self.ch), [N * PL, CL, dc]) # 32*1000?, 16, 64 = 32768000.  Input to reshape is a tensor with 34099200 values, but the requested shape has 7274496.
  File "/home/my/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 5782, in reshape
    "Reshape", tensor=tensor, shape=shape, name=name)
  File "/home/my/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/my/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3290, in create_op
    op_def=op_def)
  File "/home/my/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1654, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): Input to reshape is a tensor with 2445312 values, but the requested shape has 11462400
	 [[Node: Input_Embedding_Layer/Reshape = Reshape[T=DT_FLOAT, Tshape=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Input_Embedding_Layer/embedding_lookup, Input_Embedding_Layer/Reshape/shape)]]
	 [[Node: Identity/_4743 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_52979_Identity", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

TODOs

This is an umbrella issue where we can collectively tackle some problems and improve general open source reading comprehension quality.

Goal
The network is already there. We just need to add more features on top of the current model.

Implement full features stated in the original paper
Achieve the EM/F1 performance stated in the original paper with a single-model setting

Model

Increase the hidden units to 128. #15 reported performance increase when the hidden units increased from 96 to 128
Increase the number of heads to 8
Add dropouts in better locations to maximize regularization
Train "unknown" word embedding

Data

Implement paraphrasing by back-translation to increase the data size

Contributions to any of these issues are welcome. Please comment on this issue to let us know if you want to work on any of these problems.

A tool to generate training data

Snorkel can generate training data, so it may be useful for data augmentation.
It uses data programming instead of translating twice.

conv_block problem

Why is dropout here not applied in every residual block?

-flags.DEFINE_integer("bucket_range", [40, 401, 40], "the range of bucket")

should be:

-flags.DEFINE_list("bucket_range", [40, 401, 40], "the range of bucket")

mask_logits function

I don't understand the purpose of the "mask_logits" function, which is called before "softmax" in various places. Can someone please explain?
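
A small plain-Python sketch may help here. Padded positions should receive zero probability after softmax, and adding a very large negative number to their logits achieves that. The formula follows the one quoted in another issue on this page (inputs + mask_value * (1 - mask)); the value -1e30 and the toy inputs are my own assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mask_logits(logits, mask, mask_value=-1e30):
    """Add a large negative value wherever mask is 0 (padding)."""
    return [v + mask_value * (1 - m) for v, m in zip(logits, mask)]


logits = [2.0, 1.0, 0.0, 0.0]   # the last two positions are padding
mask = [1, 1, 0, 0]

unmasked = softmax(logits)                    # padding still gets probability mass
masked = softmax(mask_logits(logits, mask))   # padding gets none
```

Without the mask, the two padded positions take part of the probability mass. With it, all of the mass goes to the real tokens.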

inconsistency in predictions

We have trained QANet on our own question-and-answer data. But when we run it in demo mode for prediction, it gives different results for the same question.

Sometimes it picks the correct answer for a question and sometimes it does not, but ideally it should pick the same answer every time, right? Any ideas what could cause this behaviour in the trained model?

I have commented out the section below from the test/demo code:

"""
if config.decay < 1.0:
    sess.run(model.assign_vars)
"""

how to adapt it for squad2.0 dataset?

word_embed.json missing?

I'm trying to train/demo the code and in both cases, python config.py --mode train and python config.py --mode demo I end up hitting the same error.

The last few bits of the traceback are:

  File "config.py", line 125, in main
    train(config)
  File "/home/arjoonn/Fast-Reading-Comprehension/main.py", line 19, in train
    with open(config.word_emb_file, "r") as fh:
FileNotFoundError: [Errno 2] No such file or directory: 'data/word_emb.json'

I saw some commented out things in the download.sh file, should I be un-commenting those?

Unable to preprocess data

I am getting the following error while preprocessing:
Generating word embedding...
13%|#######################3 | 296814/2200000 [00:36<03:52, 8176.12it/s]
Traceback (most recent call last):
  File "config.py", line 144, in <module>
    tf.app.run()
  File "C:\Users\chchauha\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\platform\app.py", line 126, in run
    _sys.exit(main(argv))
  File "config.py", line 127, in main
    prepro(config)
  File "F:\Synapse\QANet-master\prepro.py", line 287, in prepro
    word_counter, "word", emb_file=word_emb_file, size=config.glove_word_size, vec_size=config.glove_dim)
  File "F:\Synapse\QANet-master\prepro.py", line 99, in get_embedding
    vector = list(map(float, array[-vec_size:]))
ValueError: could not convert string to float: 'sania'
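
The ValueError means that one of the last vec_size fields on some line was the word 'sania' rather than a number. That could happen if a line is truncated or corrupted, or if vec_size does not match the file, though the real cause is not visible from the traceback alone. Here is a defensive sketch in plain Python that skips such lines instead of crashing. The function parse_embedding_line and the sample lines are hypothetical and not part of prepro.py.

```python
def parse_embedding_line(line, vec_size):
    """Split one GloVe-style line into (word, vector).

    Returns None when the last vec_size fields are not all numeric.
    """
    fields = line.rstrip().split(" ")
    if len(fields) <= vec_size:
        return None  # not enough fields for a word plus a full vector
    word = " ".join(fields[:-vec_size])  # tokens may themselves contain spaces
    try:
        vector = [float(v) for v in fields[-vec_size:]]
    except ValueError:
        return None  # a non-numeric field where a number was expected
    return word, vector


lines = [
    "hello 0.1 0.2 0.3",
    ". . . 0.4 0.5 0.6",   # a token containing spaces
    "sania 0.7 0.8",       # too few numbers for vec_size=3
]
parsed = [parse_embedding_line(l, vec_size=3) for l in lines]
skipped = sum(1 for p in parsed if p is None)
```

Counting the skipped lines is a cheap check: a handful suggests a few bad lines, while a very large number suggests the wrong vec_size or a broken download.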

Is this snippet in prepro.py correct?

for token in context_tokens:
    word_counter[token] += len(para["qas"])
    for char in token:
        char_counter[char] += len(para["qas"])

Should it be +=1?
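
To see what the two choices do, here is a small self-contained comparison in plain Python. One possible reading of the original snippet is that each paragraph is paired with every one of its questions, so a context token is counted once per question. Whether that is the author's intent is not stated here. The function count_tokens and the sample data are invented for illustration.

```python
from collections import Counter


def count_tokens(paragraphs, per_question=True):
    """Count context tokens, optionally weighting by the number of questions."""
    counter = Counter()
    for para in paragraphs:
        weight = len(para["qas"]) if per_question else 1
        for token in para["context_tokens"]:
            counter[token] += weight
    return counter


paragraphs = [
    {"context_tokens": ["the", "cat"], "qas": ["q1", "q2", "q3"]},
    {"context_tokens": ["the", "dog"], "qas": ["q4"]},
]

weighted = count_tokens(paragraphs, per_question=True)   # += len(para["qas"])
plain = count_tokens(paragraphs, per_question=False)     # += 1
```

With the weighting, "cat" counts 3 times because its paragraph has three questions. With += 1 it counts once. The difference only matters if the counts are later compared against a frequency threshold.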

This repo cannot reproduce the result of original paper

Thank you for your implementation; it is very helpful for me.
I ran this code and can get similar results when the number of heads equals 1. But I cannot get the result of the original paper (73.6/82.7) when I use 8 heads, batch size 32, 150k training steps and a char dimension of 200 (the same settings as the original paper). I can only get around 71.27/80.58.
The same thing occurred when I ran the PyTorch repo (https://github.com/andy840314/QANet-pytorch-).

Any suggestions?

Results of the original paper

Thanks for this great implementation. I noticed that you mention in the README file that the original system can achieve EM: 72.5, F1: 81.4 after 150,000 training steps, and EM: 76.2, F1: 84.6 after 340,000 training steps. But I didn't find this information in the original paper. It seems that the original system takes much longer to train? Could you show me where to find this information? Or did you infer it from other statistics?

https://nlp.stanford.edu/data/glove.840B.300d.zip

First, great job!
The file https://nlp.stanford.edu/data/glove.840B.300d.zip
could not be downloaded. Where can I download it? Thanks!

Parameter setting problem

1. What is the meaning of config.hidden used in conv(), and why is the kernel size 5 in conv()? Is it a parameter that needs to be tuned?

2. Is the conv function provided by TensorFlow, or did you write it yourself?

Did you write the highway function yourself? In the original BiDAF code, the highway function provided by Seo is different from yours. Have you already tried it and found that Seo's version does not work well?

Preprocessing

In preprocessing mode, execution stops in build_features() at
(example["y1s"][0] - example["y2s"][0]) > ans_limit
with the error: List index out of bound

Later, after commenting out that statement, it moves forward and gives another error at
start, end = example["y1s"][-1], example["y2s"][-1]
List Index out of bound

Please help. Is it because I am using SQuAD version 2.0?
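
Both failing statements index into example["y1s"] and example["y2s"], so an empty list would explain the error. SQuAD 2.0 contains questions with no answer, which may be what produces empty lists here, though I have not confirmed that against prepro.py. Below is a hedged sketch in plain Python of a guard that returns None instead of raising. The helper answer_span and the sample dictionaries are hypothetical.

```python
def answer_span(example, index=0):
    """Return (start, end) for the chosen answer, or None if there is none."""
    if not example["y1s"] or not example["y2s"]:
        return None  # no answer span recorded for this example
    return example["y1s"][index], example["y2s"][index]


answerable = {"y1s": [3, 4], "y2s": [5, 7]}
unanswerable = {"y1s": [], "y2s": []}   # assumed shape for a question without an answer

first = answer_span(answerable)            # first annotated span
last = answer_span(answerable, index=-1)   # last annotated span
missing = answer_span(unanswerable)        # None instead of IndexError
```

Skipping examples where the span is None would let preprocessing finish, but the model would then never see unanswerable questions, so this is a workaround rather than SQuAD 2.0 support.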

How to start training?

I have read the README.md file, but I still don't know how to run this project. Can anybody give more instructions?

The embedding projection

Hi, I have noticed that you have put the input projection before Highway Network. However, in the paper, it is mentioned that the input of Embedding Encoding Layer is a vector of dimension p1+p2=500 for each word, which means that the projection is placed after the Highway Network.

Have you already tried this?

layer normalization in layer?

https://github.com/NLPLearn/QANet/blob/8107d223897775d0c3838cb97f93b089908781d4/layers.py#L52

Excuse me, the paper "Layer Normalization" (Lei Jimmy Ba, Ryan Kiros, and Geoffrey E. Hinton) says that the mean and variance are computed over all the hidden units in the same layer, and that different training cases have different normalization terms. So I think the mean should be computed like this:

axes = list(range(1, x.shape.ndims))
mean = tf.reduce_mean(x, axes)

So the shape of the mean is [batch,], and the variance is also [batch,].
They are then used to compute the normalized x.

In the TensorFlow API for layer normalization, the source code is as below, and I think it does the same as mine.
norm_axes = list(range(begin_norm_axis, inputs_rank))
https://github.com/tensorflow/tensorflow/blob/c19e29306ce1777456b2dbb3a14f511edf7883a8/tensorflow/contrib/layers/python/layers/layers.py#L2311
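
To make the difference in axes concrete, here is a small plain-Python sketch on a [batch, length, hidden] input. per_example_stats reduces over every non-batch axis, as proposed above. per_position_stats reduces over the last axis only and is included for contrast. That second variant is my assumption about what the proposal is being compared against, not a statement about what layers.py does. All names are illustrative.

```python
from statistics import fmean, pvariance


def flatten(x):
    """Recursively flatten nested lists into a flat list of floats."""
    if isinstance(x, (int, float)):
        return [float(x)]
    out = []
    for item in x:
        out.extend(flatten(item))
    return out


def per_example_stats(batch):
    """Mean and variance over every non-batch axis: one pair per example."""
    return [(fmean(flatten(ex)), pvariance(flatten(ex))) for ex in batch]


def per_position_stats(batch):
    """Mean and variance over the last axis only: one pair per position."""
    return [[(fmean(pos), pvariance(pos)) for pos in ex] for ex in batch]


# Toy input of shape [batch=2, length=2, hidden=3].
x = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
     [[0.0, 0.0, 0.0], [3.0, 3.0, 3.0]]]

example_stats = per_example_stats(x)     # 2 pairs, one per training case
position_stats = per_position_stats(x)   # 2 x 2 pairs, one per token position
```

With per-example statistics, every position in a sequence shares one mean and variance, so the result depends on sequence length and padding. With per-position statistics, each token is normalized independently.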

Train with M40 card but got OOM message

I'm testing this model on an M40 device, which has 24 GB of memory on board.

What is the default batch size you used on the 1080 card? TensorFlow reports OOM when I increase the batch size to 64.

Can it support Chinese?

I just changed nlp = spacy.blank("en") to nlp = spacy.blank("zh").
Is that OK?

Memory Issue

I am using an AWS p2.xlarge, which has a Tesla K80.
While training, it still reports a memory issue. Why?
My console shows that it has 11.17 GB of memory.
Logs - attached.
logs.txt

TIA

dev set evaluation

My test and dev sets are the same. But I get different results from the evaluation at training checkpoints than from running config.py in test mode.

Shouldn't it give the same results, since we are loading the saved model and running it on the dev file again?

How can I do fine tuning using QANet?

I've trained the QANet model on SQuAD. I want to apply this SQuAD-trained model to a new dataset using fine-tuning. I need to use the weights from the SQuAD-trained model as the initialization for training on the new dataset, so that the model adapts to the new dataset.

In the train/FRC folder, I can see several checkpoint files. Which checkpoint files should I use to initialize the new model for the new dataset?

Thanks

AttributeError: 'module' object has no attribute 'blank'

I ran:
sudo pip install spacy==2.0.9

mldl@mldlUB1604:~/ub16_prj/Fast-Reading-Comprehension$ python config.py --mode prepro
Traceback (most recent call last):
  File "config.py", line 9, in <module>
    from prepro import prepro
  File "/home/mldl/ub16_prj/Fast-Reading-Comprehension/prepro.py", line 15, in <module>
    nlp = spacy.blank("en")
AttributeError: 'module' object has no attribute 'blank'

Trying to fine tune with different data, But getting dimensionality mismatched for tensor

I am getting the following error while trying to fine-tune:

InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to match. lhs shape= [326,64] rhs shape= [1427,64]
	 [[Node: save/Assign_746 = Assign[T=DT_FLOAT, _class=["loc:@char_mat"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](char_mat, save/RestoreV2:746)]]

How do I resume training off a checkpoint?

After training the model to 46 percent there was a power outage. What command do I use to resume training? I'm on checkpoint 26.

Thanks in Advance

problem about highwaynet

In a highway network, H is a non-linear function. But in this repo, H is a linear function. Why is this? Thanks!

RuntimeError('cannot join current thread',) in <object repr() failed>

(.venv) ub16c9@ub16c9-gpu:~/ub16_prj/QANet$ python config.py --mode train
Building model...
WARNING:tensorflow:From /home/ub16c9/ub16_prj/QANet/layers.py:52: calling reduce_mean (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
WARNING:tensorflow:From /home/ub16c9/ub16_prj/QANet/model.py:134: calling softmax (from tensorflow.python.ops.nn_ops) with dim is deprecated and will be removed in a future version.
Instructions for updating:
dim is deprecated, use axis instead
WARNING:tensorflow:From /home/ub16c9/ub16_prj/QANet/model.py:174: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:

Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.

See tf.nn.softmax_cross_entropy_with_logits_v2.

Total number of trainable parameters: 788673
2018-12-29 11:14:48.345129: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-12-29 11:14:48.431530: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-12-29 11:14:48.431955: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.6575
pciBusID: 0000:01:00.0
totalMemory: 10.92GiB freeMemory: 10.43GiB
2018-12-29 11:14:48.431971: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2018-12-29 11:14:48.733045: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-12-29 11:14:48.733079: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2018-12-29 11:14:48.733085: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2018-12-29 11:14:48.733318: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10086 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
2018-12-29 11:14:50.042331: W tensorflow/core/framework/allocator.cc:122] Allocation of 109906800 exceeds 10% of system memory.
2018-12-29 11:14:50.174758: W tensorflow/core/framework/allocator.cc:122] Allocation of 109906800 exceeds 10% of system memory.
2018-12-29 11:14:50.507489: W tensorflow/core/framework/allocator.cc:122] Allocation of 109906800 exceeds 10% of system memory.
2018-12-29 11:14:50.691090: W tensorflow/core/framework/allocator.cc:122] Allocation of 109906800 exceeds 10% of system memory.
2018-12-29 11:14:50.825623: W tensorflow/core/framework/allocator.cc:122] Allocation of 109906800 exceeds 10% of system memory.
55%|██████████████████████████████████████████████████████████████████████████████████████▏ | 32935/60000 [3:15:35<2:19:53, 3.22it/s] 90%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ | 53999/60000 [5:17:29<29:48, 3.36it/sException RuntimeError: RuntimeError('cannot join current thread',) in <object repr() failed> ignored██████████████████████████████████████████████████████████████████████| 328/328 [00:36<00:00, 9.07it/s]
(.venv) ub16c9@ub16c9-gpu:~/ub16_prj/QANet$

tensorflow not in requirements.txt

Is it a better way to use conv layer in highway or encoder block feed forward network rather than dense layer?

The authors didn't mention using conv layers in the paper. Thanks for any reply!

possibly insufficient driver version:

Hi,

I hit a runtime error in sess.run([]):

2018-03-23 11:55:55.959752: E tensorflow/stream_executor/cuda/cuda_dnn.cc:385] could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
2018-03-23 11:55:55.959887: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:369] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module  384.69  Wed Aug 16 19:34:54 PDT 2017
GCC version:  gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC) 
"""
2018-03-23 11:55:55.959970: E tensorflow/stream_executor/cuda/cuda_dnn.cc:393] possibly insufficient driver version: 384.69.0
2018-03-23 11:55:55.959998: E tensorflow/stream_executor/cuda/cuda_dnn.cc:352] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
2018-03-23 11:55:55.960028: F tensorflow/core/kernels/conv_ops.cc:717] Check failed: stream->parent()->GetConvolveAlgorithms( conv_parameters.ShouldIncludeWinogradNonfusedAlgo<T>(), &algorithms) 
2018-03-23 11:55:55.960040: E tensorflow/stream_executor/cuda/cuda_dnn.cc:385] could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
Aborted (core dumped)

tensorflow version: 1.5.0
CUDA: 9.0
Cudnn: 7.0
Driver Version: 384.69

Can I ask which versions you use? Should I update my driver version, or can I just change some model code?
It works without the GPU.

Thanks

Report the results

Model	Training Steps	Size	Attention Heads	Data Size (aug)	EM	F1
My Model	60,000	128	1	87k (no aug)	70.7	79.8

The results were obtained on a K80 machine. I modified the trilinear function for memory efficiency, but the results are the same as with the current version of this repository.

I'm not sure about overfitting; the model is the last checkpoint after training for 60,000 steps.

Speed ?

With num_heads 1 and hidden size 96, it does not seem faster than HKUST R-Net?
With batch size 64 I get 1.42 batch/s, while HKUST R-Net gets 2.4+ batch/s.
HKUST R-Net uses a char dim of only 8 by default and here we use 64, but I still think QANet is not as fast as Google shows in the paper?

mask_logits in layer.py

I think the line "return inputs + mask_value * (1 - mask)" should be "return inputs*mask + mask_value * (1 - mask)".
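
A quick numeric comparison of the two forms in plain Python, as a sketch. I am assuming mask_value is a very large negative number such as -1e30. The actual value in the repository is not shown in this issue, and the function names are mine.

```python
def additive_mask(inputs, mask, mask_value=-1e30):
    """Current form: inputs + mask_value * (1 - mask)."""
    return [x + mask_value * (1 - m) for x, m in zip(inputs, mask)]


def replacing_mask(inputs, mask, mask_value=-1e30):
    """Proposed form: inputs * mask + mask_value * (1 - mask)."""
    return [x * m + mask_value * (1 - m) for x, m in zip(inputs, mask)]


inputs = [2.0, 1.0, 5.0]
mask = [1, 1, 0]   # the last position is padding

a = additive_mask(inputs, mask)
b = replacing_mask(inputs, mask)
same = (a == b)
```

At this magnitude the padded logit 5.0 is lost to floating-point rounding when added to -1e30, so the two forms give identical lists. They would only differ if mask_value were small relative to the logits, or if the padded inputs could be inf or NaN.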

AssertionError

Hi,
when I try to run your code, I get an error:
Reducing Glove Matrix
100%|█████████████████████████████████████████████████| 442/442 [01:32<00:00, 4.79it/s]
100%|███████████████████████████████████████████████████| 48/48 [00:10<00:00, 4.43it/s]
Processing 91600 vocabs
Total number of lines: 91604
Reduced vocab size: 91604
Reading GloVe from: ./glove.840B.300d.txt
Processing line 91600
Reading GloVe from: ./glove.840B.300d.char.txt

Tokenizing training data.
100%|█████████████████████████████████████████████████| 442/442 [01:25<00:00, 5.19it/s]
Tokenizing dev data.
100%|███████████████████████████████████████████████████| 48/48 [00:10<00:00, 4.77it/s]
Tokenizing complete
Processing 91600 vocabs
Traceback (most recent call last):
  File "process.py", line 377, in <module>
    main()
  File "process.py", line 371, in main
    load_glove(Params.glove_dir,"glove",vocab_size = Params.vocab_size)
  File "process.py", line 203, in load_glove
    assert 0
AssertionError
Can you tell me why this happened?