
nqg's Introduction

NQG

This repository contains code for the paper "Neural Question Generation from Text: A Preliminary Study"

About this code

The experiments in the paper were done with an in-house deep learning tool, so we re-implemented the model in PyTorch as a reference.

This code implements only the NQG+ setting from the paper. Within one hour of training on a Tesla P100, the NQG+ model achieves a 12.78 BLEU-4 score on the dev set.

If you find this code useful in your research, please consider citing:

@article{zhou2017neural,
  title={Neural Question Generation from Text: A Preliminary Study},
  author={Zhou, Qingyu and Yang, Nan and Wei, Furu and Tan, Chuanqi and Bao, Hangbo and Zhou, Ming},
  journal={arXiv preprint arXiv:1704.01792},
  year={2017}
}

How to run

Prepare the dataset and code

Make an experiment home folder for NQG data and code:

NQG_HOME=~/workspace/nqg
mkdir -p $NQG_HOME/code
mkdir -p $NQG_HOME/data
cd $NQG_HOME/code
git clone https://github.com/magic282/NQG.git
cd $NQG_HOME/data
wget https://res.qyzhou.me/redistribute.zip
unzip redistribute.zip

After unzipping, the data and code under $NQG_HOME should be organized as:

nqg
├── code
│   └── NQG
│       └── seq2seq_pt
└── data
    └── redistribute
        ├── QG
        │   ├── dev
        │   ├── test
        │   ├── test_sample
        │   └── train
        └── raw

Then collect vocabularies:

python $NQG_HOME/code/NQG/seq2seq_pt/CollectVocab.py \
       $NQG_HOME/data/redistribute/QG/train/train.txt.source.txt \
       $NQG_HOME/data/redistribute/QG/train/train.txt.target.txt \
       $NQG_HOME/data/redistribute/QG/train/vocab.txt
python $NQG_HOME/code/NQG/seq2seq_pt/CollectVocab.py \
       $NQG_HOME/data/redistribute/QG/train/train.txt.bio \
       $NQG_HOME/data/redistribute/QG/train/bio.vocab.txt
python $NQG_HOME/code/NQG/seq2seq_pt/CollectVocab.py \
       $NQG_HOME/data/redistribute/QG/train/train.txt.pos \
       $NQG_HOME/data/redistribute/QG/train/train.txt.ner \
       $NQG_HOME/data/redistribute/QG/train/train.txt.case \
       $NQG_HOME/data/redistribute/QG/train/feat.vocab.txt
head -n 20000 $NQG_HOME/data/redistribute/QG/train/vocab.txt > $NQG_HOME/data/redistribute/QG/train/vocab.txt.20k
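CollectVocab.py is not reproduced here; a minimal sketch of its assumed behavior (count whitespace-separated tokens across all input files and write them to the last argument, sorted by descending frequency — the real script may differ) would look like:

```python
# Hypothetical sketch of what CollectVocab.py is assumed to do; the actual
# script in seq2seq_pt may differ in format details.
import sys
from collections import Counter

def collect_vocab(in_paths, out_path):
    # Count whitespace tokens across all input files.
    counts = Counter()
    for path in in_paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                counts.update(line.split())
    # Write "word frequency" lines, most frequent first.
    with open(out_path, "w", encoding="utf-8") as f:
        for word, freq in counts.most_common():
            f.write("%s %d\n" % (word, freq))

if __name__ == "__main__" and len(sys.argv) > 2:
    collect_vocab(sys.argv[1:-1], sys.argv[-1])
```

Under this assumption, the `head -n 20000` command above simply keeps the 20k most frequent words for the model vocabulary.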

Setup the environment

Package Requirements:

nltk scipy numpy pytorch

PyTorch version: This code requires PyTorch v0.4.0.

Python version: This code requires Python3.

Warning: Older versions of NLTK have a bug in the PorterStemmer. Therefore, a fresh installation or update of NLTK is recommended.
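After installing or updating NLTK, a quick smoke test (hypothetical, not part of the repo) confirms the stemmer behaves sanely:

```python
# Hypothetical sanity check for the NLTK installation; not part of the repo.
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem("running"))  # a correct installation prints "run"
```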

A Docker image is also provided.

Docker image

docker pull magic282/pytorch:0.4.0

Run training

The script run_squad_qg.sh is an example; modify it to match your configuration.

Without Docker

bash $NQG_HOME/code/NQG/seq2seq_pt/run_squad_qg.sh $NQG_HOME/data/redistribute/QG $NQG_HOME/code/NQG/seq2seq_pt

With Docker

nvidia-docker run --rm -ti -v $NQG_HOME:/workspace magic282/pytorch:0.4.0

Then inside the docker:

bash code/NQG/seq2seq_pt/run_squad_qg.sh /workspace/data/redistribute/QG /workspace/code/NQG/seq2seq_pt

nqg's People

Contributors

magic282

nqg's Issues

docker image empty?

Hi,

I tried to get into the Docker image after downloading it and found that it is empty; see below:

..\ngq>docker run -it 2caa29d6a3b3 /bin/bash
root@3de7bdd7d8ec:/workspace# ls
root@3de7bdd7d8ec:/workspace# exit
exit

I pulled the image with your command docker pull magic282/pytorch:0.4.0. I also tried mounting a volume into the image and found the mounted container empty as well. Please help!

wget error

When I run "wget https://res.qyzhou.me/redistribute.zip", I get this error:

--11:28:30-- https://res.qyzhou.me/redistribute.zip
=> `redistribute.zip'
Resolving host res.qyzhou.me... 104.31.76.139, 104.31.77.139, 2400:cb00:2048:1::681f:4d8b, ...
Connecting to res.qyzhou.me|104.31.76.139|:443... connected.
OpenSSL: error:14077410:SSL routines:SSL23_GET_SERVER_HELLO:sslv3 alert handshake failure
Unable to establish the SSL connection.

Testing own input

I am done with training the model but am confused about how to generate questions for my own input. Could you please tell me which file has to be run?

Code for NQG++

Would it be possible to provide the code, or at least some snippets, for the NQG++ architecture with the shared embedding matrix and pre-trained word embeddings?
That would really be a great help!
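For reference, a common way to share one embedding matrix between input lookup and output projection in PyTorch is weight tying. This is a generic sketch with hypothetical sizes, not the repo's actual NQG++ code:

```python
import torch
import torch.nn as nn

vocab_size, emb_dim = 20000, 300  # hypothetical sizes

# One embedding table used for input lookup...
shared_emb = nn.Embedding(vocab_size, emb_dim)

# ...and tied to the output projection (both modules share one Parameter,
# since nn.Linear's weight has shape (vocab_size, emb_dim) as well).
generator = nn.Linear(emb_dim, vocab_size, bias=False)
generator.weight = shared_emb.weight

tokens = torch.randint(0, vocab_size, (4, 7))   # (batch, seq_len)
logits = generator(shared_emb(tokens))          # (batch, seq_len, vocab)
print(logits.shape)  # torch.Size([4, 7, 20000])
```

Pre-trained vectors could then be copied once into shared_emb.weight.data and both sides would see them.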

RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 2. Got 6 and 64 in dimension 0

def forward(self, input, bio, feats, hidden=None):
    .....
    .....
    featsEmb = [self.feat_lut(feat) for feat in feats[0]]
    featsEmb = torch.cat(featsEmb, dim=-1)

While trying to run the above code, I got this error:
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 2. Got 6 and 1 in dimension 0

So I used pad_sequence to match the dimensions:

featsEmb = [self.feat_lut(feat) for feat in feats[0]]
featsEmb = pad_sequence(featsEmb, batch_first=True)
featsEmb = torch.cat(tuple(featsEmb), dim=-1)

But now I am getting another dimension error:
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 2. Got 6 and 64 in dimension 0

I printed and checked the dimensions, and they all look correct:

print("Shape:------------>")
print(len(featsEmb))
for i in range(len(featsEmb)):
    print("feat[%d]: %d" % (i, len(featsEmb[i])))
    for j in range(len(featsEmb[i])):
        print("feat[%d][%d]: %d" % (i, j, len(featsEmb[i][j])))
    print()
Shape:------------>
3

feat[0]: 6
feat[0][0]: 64
feat[0][1]: 64
feat[0][2]: 64
feat[0][3]: 64
feat[0][4]: 64
feat[0][5]: 64

feat[1]: 6
feat[1][0]: 64
feat[1][1]: 64
feat[1][2]: 64
feat[1][3]: 64
feat[1][4]: 64
feat[1][5]: 64

feat[2]: 6
feat[2][0]: 64
feat[2][1]: 64
feat[2][2]: 64
feat[2][3]: 64
feat[2][4]: 64
feat[2][5]: 64

I think it may be because of a different version of PyTorch: I am running the latest version, while the code was written for PyTorch v0.4.0.
How can I fix this error and concatenate the tensors correctly?

Any help would be greatly appreciated. Thank you in advance.
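For reference, torch.cat requires every dimension except the concatenation dimension to match across its inputs. A minimal illustration, with hypothetical shapes chosen to mirror the 3 × 6 × 64 printout above:

```python
import torch

# Three per-feature embeddings of shape (6, 64, 16); the sizes are
# hypothetical, chosen to mirror the printout in this issue.
feats_emb = [torch.zeros(6, 64, 16) for _ in range(3)]
cat = torch.cat(feats_emb, dim=-1)
print(cat.shape)  # torch.Size([6, 64, 48])

# A mismatch in any non-concatenation dimension raises the error above.
try:
    torch.cat([torch.zeros(6, 64, 16), torch.zeros(1, 64, 16)], dim=-1)
except RuntimeError as e:
    print("RuntimeError:", e)
```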

Can't reproduce results reported in the paper

I'm trying to reproduce the results reported in the paper but am getting a considerably lower BLEU score (by almost 1 BLEU point) on both the dev and test set.

I ran run_squad_qg.sh with the pre-defined parameters.
I then used the generated model model_e20.pt and ran translate.py with the following parameters:

python translate.py \
       -model "${MODELPATH}/model_e20.pt" \
       -src "${DATAPATH}/test/dev.txt.shuffle.test.source.txt" \
       -bio "${DATAPATH}/test/dev.txt.shuffle.test.bio" \
       -feats "${DATAPATH}/test/dev.txt.shuffle.test.pos" "${DATAPATH}/test/dev.txt.shuffle.test.ner" "${DATAPATH}/test/dev.txt.shuffle.test.case" \
       -tgt "${DATAPATH}/test/dev.txt.shuffle.test.target.txt" \
       -output "${SAVEPATH}/pred.txt" \
       -replace_unk \
       -verbose \
       -n_best 10 \
       -gpu 0

I then ran test.py in PyBLEU as

python3 test.py ../../../../data/redistribute/QG/test/dev.txt.shuffle.test.target.txt ../../../../data/generated/pred.txt

which resulted in a BLEU score of 11.26, almost one BLEU point lower than the NQG+ score reported for the test set in the paper (12.18).
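As an independent sanity check on the scorer itself, corpus-level BLEU can also be computed with NLTK. This is a toy sketch with made-up tokenized sentences; differences in tokenization and smoothing between scorers can easily account for fractions of a BLEU point:

```python
from nltk.translate.bleu_score import corpus_bleu

# Toy stand-ins for the target file and pred.txt, pre-tokenized;
# each hypothesis is paired with a list of reference sentences.
references = [[["what", "is", "the", "capital", "of", "france", "?"]]]
hypotheses = [["what", "is", "the", "capital", "of", "france", "?"]]
score = corpus_bleu(references, hypotheses)
print(score)  # 1.0 for an exact match
```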

Do you have an explanation or an intuition for this considerable discrepancy? Maybe the parameters in run_squad_qg.sh are not those that were used to compute the results for the paper?

And thanks a lot for making this code public :)

How to implement the "NQG-POS" setting in the paper

Hello,
Firstly thanks for the code and instruction, it really helps me a lot!
I am wondering how to implement the "NQG-POS" setting in the paper by modifying this repo
(so the input would be only the paragraph, NER, BIO, and case features).

Glove word embedding

Hello,
I was wondering how to specify 'pre_word_vecs_enc' for the pretrained word embeddings.
The README does not contain formatting instructions for this file.
Thank you
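In OpenNMT-style code, pre_word_vecs_enc usually points to a serialized tensor aligned with the source vocabulary; whether this repo follows that convention is an assumption to verify against its data-loading code. A hypothetical sketch of building such a tensor from a GloVe text file:

```python
# Hypothetical sketch; verify the expected format against the repo's
# data-loading code before relying on it.
import torch

def build_pretrained(glove_path, vocab_words, dim=300):
    """Build a (len(vocab_words), dim) tensor from a GloVe-format text
    file; words missing from GloVe keep zero vectors."""
    vectors = {}
    with open(glove_path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) == dim + 1:
                vectors[parts[0]] = [float(x) for x in parts[1:]]
    emb = torch.zeros(len(vocab_words), dim)
    for i, word in enumerate(vocab_words):
        if word in vectors:
            emb[i] = torch.tensor(vectors[word])
    return emb

# Hypothetical usage (file names are placeholders):
# emb = build_pretrained("glove.840B.300d.txt", vocab_words)
# torch.save(emb, "pre_word_vecs_enc.pt")
```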

Request for test module

Hi,
Firstly thanks for the code and instruction
I have trained the model following the given instructions, but I am unable to test it on real data.

If you could give me instructions on how to test my trained model (i.e., taking a source paragraph with POS, NER, BIO, and case tags) and generate a question as output, it would be very helpful.

Thanks
