
global-encoding's Introduction

Global-Encoding

This is the code for our paper Global Encoding for Abstractive Summarization, https://arxiv.org/abs/1805.03989


Requirements

  • Ubuntu 16.04
  • Python 3.5
  • PyTorch 0.4.1 (updated)
  • pyrouge

In order to use pyrouge, set rouge path with the line below:

pyrouge_set_rouge_path RELEASE-1.5.5/

Some users have met problems with pyrouge, so I have updated the script. You can put the directory "RELEASE-1.5.5" in your home directory and set the rouge path to it (or run the command "chmod 777 RELEASE-1.5.5" to fix the permissions).


Preprocessing

python3 preprocess.py -load_data path_to_data -save_data path_to_store_data 

Remember to put the data into one folder, name the files train.src, train.tgt, valid.src, valid.tgt, test.src and test.tgt, and create a new folder named data inside it.
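For example, a layout like the following matches the description above; the folder name "giga" is just a placeholder, not a path the repo requires:

```shell
# Create the expected layout ("giga" is an illustrative folder name).
mkdir -p giga/data
touch giga/train.src giga/train.tgt \
      giga/valid.src giga/valid.tgt \
      giga/test.src giga/test.tgt
ls giga
# Then point preprocess.py at it, e.g.:
# python3 preprocess.py -load_data giga/ -save_data giga/data/
```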


Training

python3 train.py -log log_name -config config_yaml -gpus id

Create your own yaml file for hyperparameter setting.
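As a starting point, a config might look like the sketch below. The field names here are illustrative guesses, not this repo's actual schema; check the example yaml files shipped with the code for the exact keys.

```yaml
# Illustrative hyperparameter sketch -- key names are hypothetical,
# not taken from this repo's config schema.
data: 'data/'            # folder produced by preprocess.py
epoch: 10
batch_size: 64
emb_size: 512
hidden_size: 512
learning_rate: 0.0003
```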


Evaluation

python3 train.py -log log_name -config config_yaml -gpus id -restore checkpoint -mode eval

Citation

If you use this code for your research, please kindly cite our paper:

@inproceedings{globalencoding,
  title     = {Global Encoding for Abstractive Summarization},
  author    = {Junyang Lin and Xu Sun and Shuming Ma and Qi Su},
  booktitle = {{ACL} 2018},
  year      = {2018}
}

global-encoding's People

Contributors

imwebson, justinlin610


global-encoding's Issues

A question about the Gigaword dataset

The paper mentions that you use the data processed by Rush et al. (2015). Where can this data be found? I looked at Rush et al. (2015) but could not find it. As for the source data, I wrote an application to LDC but got no reply. So I would like to ask how you obtained this data. Thank you very much for your work!

problem when beam is 1

When beam is 1, candidate looks like [None, None, None, ...] and causes an error, because the tensor "samples" (line 207 of train.py) is not transformed into a list.
You can change line 86 of seq2seq.py as follows to solve this problem:
sample_ids = torch.index_select(outputs, dim=1, index=reverse_indices).t().tolist()

About ROUGE evaluation on Chinese

Hi, when evaluating Chinese summaries, the following error occurs:
Illegal division by zero at /home/zhumaokun/ROUGE/RELEASE-1.5.5/ROUGE-1.5.5.pl line 2450.

Should the Chinese characters be converted to English characters?

ROUGE throws "Illegal division by zero" when training on the LCSTS dataset

The command being executed:
/root/Global-Encoding/RELEASE-1.5.5/ROUGE-1.5.5.pl -e /root/Global-Encoding/RELEASE-1.5.5/data -c 95 -2 -1 -U -r 1000 -n 4 -w 1.2 -a -m /tmp/tmphr15dtvf/rouge_conf.xml

The error message:
Illegal division by zero at /root/Global-Encoding/RELEASE-1.5.5/ROUGE-1.5.5.pl line 2450.

Return code: 255

Issue #4 discussed this, but the root cause was never made clear. According to that discussion the problem is illegal characters, so I cleaned the dataset with the code below, keeping only Chinese characters, English letters, digits, and common punctuation, but the problem still occurs.

for line in f:
    line = re.compile(u"[^]+").sub('', line)

I also checked for empty lines with grep "^$" * | wc -l, and there are none.
So I am not sure what counts as an illegal character. And if Chinese characters are mapped to numbers, how should the digits already in the text be handled?

Abnormally high ROUGE scores when reproducing the code

Using character-level data, I extracted the summaries and source texts from the dataset, saved them separately, and split them into characters; punctuation was not removed. Then I ran preprocess.py. During ROUGE evaluation, the reference and candidate are converted into numeric indices via the vocabulary before scoring, and the resulting ROUGE scores are abnormally high.
epoch: 3, loss: 23146.946, time: 3100.882, updates: 110000, accuracy: 57.57
F_measure: [57.11, 30.26, 43.7] Recall: [56.44, 29.56, 43.05] Precision: [62.93, 33.76, 48.4]
Is there a problem with my data processing? Or could you share the BLEU score of your experiments?

A validation error when computing the ROUGE score

The model trains successfully in the first epoch. However, when the validation set is used to evaluate model performance, a bug occurs. The log is as follows.

total number of parameters: 83289940

epoch: 1, loss: 51426.711, time: 2488.591, updates: 10000, accuracy: 27.78

Traceback (most recent call last):
  File "train.py", line 332, in <module>
    main()
  File "train.py", line 324, in main
    train_model(model, data, optim, i, params)
  File "train.py", line 179, in train_model
    score = eval_model(model, data, params)
  File "train.py", line 252, in eval_model
    score[metric] = getattr(utils, metric)(reference, candidate, params['log_path'], params['log'], config)
  File "/data/mmyin/Global-Encoding/utils/metrics.py", line 58, in rouge
    rouge_results = r.convert_and_evaluate()
  File "/home/mmyin/anaconda3/lib/python3.6/site-packages/pyrouge-0.1.3-py3.6.egg/pyrouge/Rouge155.py", line 367, in convert_and_evaluate
    rouge_output = self.evaluate(system_id, rouge_args)
  File "/home/mmyin/anaconda3/lib/python3.6/site-packages/pyrouge-0.1.3-py3.6.egg/pyrouge/Rouge155.py", line 342, in evaluate
    rouge_output = check_output(command, env=env).decode("UTF-8")
  File "/home/mmyin/anaconda3/lib/python3.6/subprocess.py", line 336, in check_output
    **kwargs).stdout
  File "/home/mmyin/anaconda3/lib/python3.6/subprocess.py", line 418, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['/data/mmyin/ROUGE/RELEASE-1.5.5/ROUGE-1.5.5.pl', '-e', '/data/mmyin/ROUGE/RELEASE-1.5.5/data', '-c', '95', '-2', '-1', '-U', '-r', '1000', '-n', '4', '-w', '1.2', '-a', '-m', '/tmp/tmp436z9iyl/rouge_conf.xml']' returned non-zero exit status 255.

Training time?

Roughly how long does it take to train the model on the LCSTS dataset?

ROUGE evaluation fails with non-zero exit status 2

I used srush's dataset:

“Training and evaluation data for Gigaword is available https://drive.google.com/open?id=0B6N7tANPyVeBNmlSX19Ld2xDU1E”

I renamed the downloaded files train.src, train.tgt, valid.src, valid.tgt, test.src and test.tgt, put them into a giga folder, and created an empty data folder inside it.

Then I ran python3 preprocess.py -load_data giga -save_data newgiga

After that I changed the data field in the giga yaml config to ‘newgiga/’.

The model trains and validates successfully, but when it starts computing the ROUGE score after validation, it shows:

Command '['script/RELEASE-1.5.5/ROUGE-1.5.5.pl', '-e', 'script/RELEASE-1.5.5/data', '-c', '95', '-2', '-1', '-U', '-r', '1000', '-n', '4', '-w', '1.2', '-a', '-m', '/tmp/tmp6fmy4hlt/rouge_conf.xml']' returned non-zero exit status 2

I tried the method mentioned in https://github.com/tagucci/pythonrouge/issues/11:


cd pythonrouge/RELEASE-1.5.5/data/
rm WordNet-2.0.exc.db
./WordNet-2.0-Exceptions/buildExeptionDB.pl ./WordNet-2.0-Exceptions ./smart_common_words.txt ./WordNet-2.0.exc.db

but it does not solve the problem. What should I do?

About ROUGE evaluation on LCSTS

Hi, I would like to ask about word-based versus character-based ROUGE evaluation on LCSTS. For example:
Model output: 上 市 公 司 朋 友 圈 合 作 背 后 利 益 均 沾
Reference: 揭 秘 “ 老 虎 们 ” 的 上 市 公 司 朋 友 圈 合 作 背 后 利 益 均 沾
Do you directly use files2rouge to compute ROUGE between the two? Or is some change needed, for example converting the Chinese characters to numbers?
Thanks!
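One common workaround for non-ASCII input (a general technique, not something this repo is confirmed to do) is to map each token to a numeric id before handing the files to the Perl ROUGE script: n-gram overlap counts are unchanged as long as reference and candidate share the same mapping. A minimal sketch, with a hypothetical helper name:

```python
def to_ids(lines, vocab=None):
    """Map whitespace-separated tokens to numeric ids, sharing one vocab."""
    vocab = {} if vocab is None else vocab
    out = []
    for line in lines:
        # setdefault assigns the next free id the first time a token is seen
        ids = (str(vocab.setdefault(tok, len(vocab))) for tok in line.split())
        out.append(" ".join(ids))
    return out, vocab

# Reference and candidate must use the SAME vocab so overlaps are preserved.
refs, vocab = to_ids(["上 市 公 司 朋 友 圈"])
cands, _ = to_ids(["揭 秘 上 市 公 司"], vocab)
print(refs, cands)  # shared tokens map to the same ids in both files
```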

About the test set results

Hi, I see that the code reports results on the validation set at the end. Are the results in the paper from the validation set or the test set? Looking forward to your reply, thanks!

Vocabulary size of the model in the paper

Is the vocabulary size used by the model in the paper 50K?

I am planning to run comparison experiments and want to use the same settings as the baseline model, but the vocabulary size is not given.

Gigaword Validation Data

Hi,

I have downloaded the Gigaword dataset from harvardnlp, and there are 4 directories: train, Giga, DUC2004, and DUC2003.

I use valid.title.filter.txt, valid.article.filter.txt, train.title.txt and train.article.txt in train as the tgt/src of the valid/train data, and the two txt files in Giga as test data. However, the valid data has the wrong size (189651) after preprocessing. The weird thing is that when I run preprocess.py, the output shows "(0 and 0 ignored due to length == 0 or > )".

Would you know any method to fix this?
Thanks a lot!

train error

File "train.py", line 322, in <module>
  main()
File "train.py", line 316, in main
  print_log("Best %s score: %.2f\n" % (metric, max(params[metric])))
ValueError: max() arg is an empty sequence

Bug in beam_search

Buggy code:
def unbottle(m):
    return m.view(beam_size, batch_size, -1)
Fixed:
def unbottle(m):
    return m.view(batch_size, beam_size, -1)
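To see why the order of the first two view dimensions matters, here is a pure-Python stand-in (assuming the bottled rows are laid out with each example's beams contiguous, as the proposed fix implies). Since view keeps memory order, the slowest-varying dimension must match that layout:

```python
def unbottle(rows, batch_size, beam_size):
    # Pure-Python stand-in for m.view(batch_size, beam_size, -1):
    # consecutive rows are grouped as the beams of one batch example.
    return [rows[i * beam_size:(i + 1) * beam_size] for i in range(batch_size)]

# Rows laid out example-major: beams of example 0, then example 1, ...
rows = ["ex0-beam0", "ex0-beam1", "ex1-beam0", "ex1-beam1"]
print(unbottle(rows, batch_size=2, beam_size=2))
# → [['ex0-beam0', 'ex0-beam1'], ['ex1-beam0', 'ex1-beam1']]
# Grouping with beam_size first instead would pair rows from different examples.
```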

Evaluation metric list is empty

Traceback (most recent call last):
  File "train.py", line 322, in <module>
    main()
  File "train.py", line 316, in main
    print_log("Best %s score: %.2f\n" % (metric, max(params[metric])))

Data preprocessing

Could you tell me what preprocessing steps you perform on the dataset itself before you use your tokenization script? By that I mean, how do you tokenize?

train.py gets stuck at validation

Hello, I've set up pyrouge according to the tutorial below:
https://stackoverflow.com/questions/45894212/installing-pyrouge-gets-error-in-ubuntu

And used the command you propose, having RELEASE-1.5.5 at home:
pyrouge_set_rouge_path RELEASE-1.5.5/

When trying to train, the first epoch works, but when it comes to validation the script gets stuck while dealing with the files. I'm not sure why this happens, but it always blocks on the same line of execution (as I show in the image).

I should note that I'm using a reduced training and validation set, consisting of only the first 100 lines of each of the files (.src and .tgt). I'd also like to mention that the original valid.src and valid.tgt had more than 180K lines, which doesn't match the settings you mentioned in your paper.

If you have any idea as to why this doesn't work, it would be very helpful. Thanks in advance and congratulations on the work.

About the Gigaword vocabulary

Hi, sorry to bother you.
For Gigaword, do you use all the words as the vocabulary, or only the top 50K most frequent words?

Input format

Hi, what is the format of train.src, train.tgt, and the other files?

Some trouble when running your code

“Traceback (most recent call last):
File "train.py", line 332, in
main()
File "train.py", line 324, in main
train_model(model, data, optim, i, params)
File "train.py", line 179, in train_model
score = eval_model(model, data, params)
File "train.py", line 252, in eval_model
score[metric] = getattr(utils, metric)(reference, candidate, params['log_path'], params['log'], config)
File "/home/zhengxin/Global-Encoding/Global-Encoding-master/utils/metrics.py", line 58, in rouge
rouge_results = r.convert_and_evaluate()
File "/root/miniconda3/lib/python3.7/site-packages/pyrouge/Rouge155.py", line 361, in convert_and_evaluate
rouge_output = self.evaluate(system_id, rouge_args)
File "/root/miniconda3/lib/python3.7/site-packages/pyrouge/Rouge155.py", line 336, in evaluate
rouge_output = check_output(command).decode("UTF-8")
File "/root/miniconda3/lib/python3.7/subprocess.py", line 389, in check_output
**kwargs).stdout
File "/root/miniconda3/lib/python3.7/subprocess.py", line 466, in run
with Popen(*popenargs, **kwargs) as process:
File "/root/miniconda3/lib/python3.7/subprocess.py", line 769, in __init__
restore_signals, start_new_session)
File "/root/miniconda3/lib/python3.7/subprocess.py", line 1516, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
PermissionError: [Errno 13] Permission denied: 'RELEASE-1.5.5/ROUGE-1.5.5.pl'”
I tried on several different machines but got the same error, and I'm wondering why.

Does Chinese need word segmentation?

A few questions:
1. For Chinese, do you run at word or character granularity, and which works better?
2. Does the raw data need preprocessing, such as removing punctuation?
3. Have you tested this method on headline generation?

My program is still running; I'm currently using characters. It has been running for a long time without any loss output, and I don't know why. Does it take a long time before the loss is logged? My dataset is not large; I'm trying the nlpcc2017 summarization data.

Error when running the code

subprocess.CalledProcessError: Command '['/home/ghx/PycharmProjects/pythonProject/Global-Encoding-master/RELEASE-1.5.5/ROUGE-1.5.5.pl', '-e', '/home/ghx/PycharmProjects/pythonProject/Global-Encoding-master/RELEASE-1.5.5/data', '-c', '95', '-2', '-1', '-U', '-r', '1000', '-n', '4', '-w', '1.2', '-a', '-m', '/tmp/tmpj5l8j6ve/rouge_conf.xml']' returned non-zero exit status 255

Looking for a solution to this problem, thanks

Train/test split

Hi,
I used PART I for training and the pairs in PART III with score greater than 3 for testing, and the results are very high. How did you split the training and test sets?

preprocessing

Once we downloaded the gigaword dataset, how are we supposed to preprocess it so we have the train and valid tgt and src?

Doubt on some pieces of code

  • in function sortFinished in models/beam.py, when the parameter minimum is not None, we need add from beam, but during the iteration, the beam index i is not updated
if minimum is not None:
  i = 0
  # Add from beam until we have minimum outputs.
  while len(self.finished) < minimum:
    s = self.scores[i]
    self.finished.append((s, len(self.nextYs) - 1, i))
    # update i?
    # i += 1
  • in function sample in models.seq2seq.py line 125, is the variable alignmentsi misspelled?

The generated candidate is identical to the source text

Using the full LCSTS dataset, after training for one epoch the generated summaries are identical to the source texts... Has anyone else run into this?

The one change I made is that I no longer use pyrouge, since pyrouge cannot evaluate Chinese; I use the rouge package instead, which should not make a difference.

The data processing looks fine:
(Global-Encoding) [ychuang@gpu18 data]$ cat train.src | head -n 1
新华社受权于18日全文播发修改后的《中华人民共和国立法法》,修改后的立法法分为“总则”“法律”“行政法规”“地方性法规、自治条例和单行条例、规章”“适用与备案审查”“附则”等6章,共计105条。
(Global-Encoding) [ychuang@gpu18 data]$ cat train.tgt | head -n 1
修改后的立法法全文公布
(Global-Encoding) [ychuang@gpu18 data]$ cat test.src | head -n 1
日前,方舟子发文直指林志颖旗下爱碧丽推销假保健品,引起哗然。调查发现,爱碧丽没有自己的生产加工厂。其胶原蛋白饮品无核心研发,全部代工生产。号称有“逆生长”功效的爱碧丽“梦幻奇迹限量组”售价高达1080元,实际成本仅为每瓶4元!
(Global-Encoding) [ychuang@gpu18 data]$ cat test.tgt | head -n 1
林志颖公司疑涉虚假营销无厂房无研发
