
seq2seq-keyphrase's Introduction

seq2seq-keyphrase

Note: this repository has been deprecated. Please move to our latest code/data/model release for keyphrase generation at https://github.com/memray/OpenNMT-kpg-release. Thank you.

Data

Check out all datasets at https://huggingface.co/memray/.

Introduction

This is an implementation of Deep Keyphrase Generation based on CopyNet.

One training dataset (KP20k), five testing datasets (KP20k, Inspec, NUS, SemEval, Krapivin) and one pre-trained model are provided.

Note that the model is trained on scientific papers (abstracts and keywords) in the Computer Science domain, so it is expected to work well only on CS papers.

Cite

If you use the code or datasets, please cite the following paper:

Rui Meng, Sanqiang Zhao, Shuguang Han, Daqing He, Peter Brusilovsky and Yu Chi. Deep Keyphrase Generation. 55th Annual Meeting of Association for Computational Linguistics, 2017. [PDF] [arXiv]

@InProceedings{meng-EtAl:2017:Long,
  author    = {Meng, Rui  and  Zhao, Sanqiang  and  Han, Shuguang  and  He, Daqing  and  Brusilovsky, Peter  and  Chi, Yu},
  title     = {Deep Keyphrase Generation},
  booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  month     = {July},
  year      = {2017},
  address   = {Vancouver, Canada},
  publisher = {Association for Computational Linguistics},
  pages     = {582--592},
  url       = {http://aclweb.org/anthology/P17-1054}
}

seq2seq-keyphrase's People

Contributors

memray


seq2seq-keyphrase's Issues

Is there a pretrained model available?

Hey @memray,
I have a newer version of Ubuntu and cannot really use CUDA 8.0, as it doesn't support Ubuntu 18.04. I just wanted to know whether you have a pre-trained model available for download that I could use for testing.

Absent keyphrase benchmarking numbers

Hi @memray

I am able to reproduce the numbers published in your paper for the present keyphrases, but not for the absent ones. I thought you might have insight into what I am doing wrong. The prediction data I am using for evaluation was downloaded from your Google Drive on 6 Feb 2019; the seq2seq-keyphrase repository was cloned on 12 Feb 2019.

For the absent test I use the following config parameters (see also the attached patch and the sketch below):
model = CopyRNN_absent
predict_type = generative
predict_filter = non-appear-only
target_filter = non-appear-only
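
For reference, a minimal sketch of these overrides in the repo's dict-style config (key names are taken from this issue and have not been verified against config.py):

    config['model']          = 'CopyRNN_absent'
    config['predict_type']   = 'generative'
    config['predict_filter'] = 'non-appear-only'
    config['target_filter']  = 'non-appear-only'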

The metrics I get vs the expected ones:

Dataset    CopyRNN (paper)    CopyRNN (mine)
           R@10 / R@50        R@10 / R@50
Inspec     0.051 / 0.101      0.00000 / 0.001952
Krapivin   0.116 / 0.195      0.005940 / 0.017970
NUS        0.078 / 0.144      0.002892 / 0.006479
SemEval    0.049 / 0.075      0.005540 / 0.009127
KP20k      0.115 / 0.189      0.075402 / 0.172163

Best,
Laura

CopyRNN_absent_patch.txt
experiments.baseline.id=20190212-132601.log

Environment configuration question

Hello, I would like to know what environment you ran this with: Theano version, CUDA version, and GPU model. I also ran into the same error.

Question about missing path /dataset/keyphrase/baseline-data/kp20k//plain_text/

Hello, when I tried to launch your program, I encountered this problem:

File "D:/Projects/seq2seq-keyphrase/keyphrase/keyphrase_copynet.py", line 205, in <module>
    test_sets = keyphrase_test_dataset.load_additional_testing_data(['kp20k'], idx2word, word2idx, config, postagging=False)
File "D:\Projects\seq2seq-keyphrase\keyphrase\dataset\keyphrase_test_dataset.py", line 824, in load_additional_testing_data
    records = dataloader.get_docs()
File "D:\Projects\seq2seq-keyphrase\keyphrase\dataset\keyphrase_test_dataset.py", line 642, in get_docs
    for fname in os.listdir(self.textdir):
FileNotFoundError: [WinError 3] Cannot find path: 'D:\Projects\seq2seq-keyphrase\keyphrase/dataset/keyphrase/baseline-data/kp20k//plain_text/'

It seems this directory does not exist, even though I downloaded the dataset directly from the URL you provided. Could you help me with this? Thank you!
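
For anyone debugging the same thing, a hedged sanity check of the expected layout (the path below simply mirrors the error message; adjust it to your checkout):

    import os

    # The loader apparently expects plain-text documents under this directory.
    expected = os.path.join('dataset', 'keyphrase', 'baseline-data', 'kp20k', 'plain_text')
    print(expected, 'exists:', os.path.isdir(expected))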

About inconsistent testing sets between experiment_data.zip and kp20k.zip

Dear Rui Meng,
First, thank you very much for sharing your code and dataset.
However, I found that the testing set in experiment_data.zip may be inconsistent with the testing set in kp20k.zip.
For example, I cannot find the paper "a feedback vertex set of degenerate graphs" (the "0.txt" in experiment_data...\baseline-data\kp20k) in kp20k\kp20k_testing.json.
Does this inconsistency really exist?
Best,
Wang

About the training dataset

Regarding Section 4.2 of your paper, which says:
the remaining papers are used to train the supervised baselines.
How should this be understood? Were the other four small datasets also split into train and test in order to train KEA and Maui?
Why not train them on the kp20k training data? Was it a memory limit?

Question about debugging the code

Hello,
When predicting on kp20k and the other test sets with your trained model, I also ran into the following error:

Loading testing dataset KP20k from /home/majun/keyphrase_generation/seq2seq-keyphrase-master/dataset/keyphrase/baseline-data/kp20k/
kp20k
Size of test data=0
/home/majun/anaconda2/lib/python2.7/site-packages/numpy/lib/function_base.py:1110: RuntimeWarning: Mean of empty slice.
  avg = a.mean(axis)
Traceback (most recent call last):
  File "/home/majun/anaconda2/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/home/majun/anaconda2/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/home/majun/keyphrase_generation/seq2seq-keyphrase-master/keyphrase/keyphrase_copynet.py", line 533, in <module>
    print('Avg length=%d, Max length=%d' % (np.average([len(s) for s in test_set['source']]), np.max([len(s) for s in test_set['source']])))
  File "/home/majun/anaconda2/lib/python2.7/site-packages/numpy/core/fromnumeric.py", line 2272, in amax
    out=out, **kwargs)
  File "/home/majun/anaconda2/lib/python2.7/site-packages/numpy/core/_methods.py", line 26, in _amax
    return umr_maximum(a, axis, None, out, keepdims)
ValueError: zero-size array to reduction operation maximum which has no identity

I tried the fixes from earlier issues and debugged it myself, but it still fails. How should I handle this?
Many thanks. (Theano version is 0.8.2.)

Dead link for data download

Hi, I was hoping to download the zip file you reference in the README, but the link appears to no longer be available. Are you still hosting the data somewhere so others can access it?

training problem

Hi, thanks for answering the CopyNet questions in the other issue~ O(∩_∩)O~

I followed the README to train the model and hit ValueError: too many values to unpack.
The error points to line 197:

    train_set, validation_set, test_sets, idx2word, word2idx = deserialize_from_file(config['dataset'])

But when I set config['copynet']=False, it gets past this error and hits a MemoryError instead; details follow.

    Traceback (most recent call last):
      File "theano/scan_module/scan_perform.pyx", line 397, in theano.scan_module.scan_perform.perform (/home/cc/.theano/compiledir_Linux-4.4--generic-x86_64-with-debian-stretch-sid-x86_64-3.6.2-64/scan_perform/mod.cpp:4490)
    MemoryError

I don't know whether this is related to my environment: GTX GeForce 1070 with Nvidia-375, Anaconda2-4.4, CUDA 8, cuDNN 5.1, Python 2.7, with THEANO_FLAGS "device=gpu, floatX=float32, nvcc.fastmath=True, nvcc.flags=-D_FORCE_INLINES". I would be grateful if you could share your environment settings.

BTW, how much time and memory does it take to train the model with all_600k_dataset.pkl?
Can I run training on a smaller dataset, and how would I create one? (A sketch follows below.)
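
On the last question, a hedged sketch of one way to carve out a smaller training set: load the serialized dataset, slice the training pairs, and write it back. deserialize_from_file is the loader used at line 197 above; serialize_to_file is assumed to be its counterpart in emolga.dataset.build_dataset, and the 'source'/'target' dict layout is an assumption about the pickle contents:

    from emolga.dataset.build_dataset import deserialize_from_file, serialize_to_file

    train_set, validation_set, test_sets, idx2word, word2idx = \
        deserialize_from_file('dataset/keyphrase/all_600k_dataset.pkl')

    # Keep only the first 50k source/target pairs (layout assumed).
    small_train = {key: values[:50000] for key, values in train_set.items()}

    serialize_to_file((small_train, validation_set, test_sets, idx2word, word2idx),
                      'dataset/keyphrase/small_dataset.pkl')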

Running keyphrase/baseline/evaluate.py causes an error

I would like to reproduce the results in the paper, so I ran keyphrase/baseline/evaluate.py.
I get the following error:
AttributeError: 'module' object has no attribute 'setup_keyphrase_baseline'

The current version of the config.py file does not contain a setup_keyphrase_baseline method.

Data Gathering

Hi, thank you for sharing your code and the data you gathered.
I was wondering whether you are planning to publish an article describing the kp20k dataset (to my knowledge it is the biggest available dataset for keyphrase extraction), and how, where, and when you collected the data.
Are the gold keyphrases created by authors or by users?

Issue when loading test data

Hello,

I'm trying to run the code and extract keyphrases on at least one of the provided datasets. There is apparently an issue when reading the test sets (I have tried several datasets and the error persists); here is what I get:

1/05/2018 13:10:27 [INFO] core: load the weights.
Loading testing dataset INSPEC from /home/ubuntu/work/nlp/seq2seq-keyphrase/dataset/keyphrase/testing-data/INSPEC
inspec
Size of test data=0
/home/ubuntu/.local/lib/python2.7/site-packages/numpy/lib/function_base.py:1110: RuntimeWarning: Mean of empty slice.
  avg = a.mean(axis)
Traceback (most recent call last):
  File "keyphrase/keyphrase_copynet.py", line 530, in <module>
    print('Avg length=%d, Max length=%d' % (np.average([len(s) for s in test_set['source']]), np.max([len(s) for s in test_set['source']])))
  File "/home/ubuntu/.local/lib/python2.7/site-packages/numpy/core/fromnumeric.py", line 2272, in amax
    out=out, **kwargs)
  File "/home/ubuntu/.local/lib/python2.7/site-packages/numpy/core/_methods.py", line 26, in _amax
    return umr_maximum(a, axis, None, out, keepdims)
ValueError: zero-size array to reduction operation maximum which has no identity

I am wondering why the size of the test data is 0. I downloaded the zip file, unzipped it in the main folder of the project, and strictly followed the instructions. I have tried for a few days now to figure out what is wrong. Have you seen this issue?
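
A hedged observation: "Size of test data=0" means the loader found no documents, so the later np.max runs over an empty list and throws. A guard like the following, placed before the print at line 530 of keyphrase_copynet.py (illustrative, not part of the repo), fails earlier with a clearer message:

    import numpy as np

    sources = test_set['source']
    if len(sources) == 0:
        raise RuntimeError('Test set is empty; check that the dataset zip was '
                           'unpacked into the directory the config expects.')
    print('Avg length=%d, Max length=%d'
          % (np.average([len(s) for s in sources]), np.max([len(s) for s in sources])))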

Sincere thanks.

word vector

Excuse me,

May I ask how the word vectors are trained?

Where is the code that implements this?

Thank you very much!
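
I can't point to where (or whether) this repo trains its own word vectors, but as a generic illustration, embeddings are often pre-trained with word2vec on the corpus text. A hedged gensim sketch, not the repo's confirmed method:

    from gensim.models import Word2Vec

    # Toy corpus; in practice, use tokenized titles and abstracts.
    sentences = [
        ['keyphrase', 'generation', 'with', 'copynet'],
        ['sequence', 'to', 'sequence', 'models', 'for', 'keyphrase', 'generation'],
    ]
    # vector_size in gensim>=4 (older versions call it size=).
    model = Word2Vec(sentences, vector_size=150, window=5, min_count=1)
    print(model.wv['keyphrase'])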

question about building the dataset

I referred to the dataset-building code in this repo (screenshot omitted) to build my own dataset, but I noticed that this code ignores OOV words. What is the correct way to build a dataset that enables "copy mode"?

Moreover, how does cc_martix work? It seems that "cc[k][j][i]=1" marks the copied words in the vocabulary (screenshot omitted).
About Micro and Macro scores

In your paper, there is a clarification about the evaluation scores.
I know the difference between micro and macro scores.
I was wondering why you chose the macro-averaged score in your paper.
Is there a justification for that choice?
Thanks very much!
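
For other readers, a toy illustration of the difference (my own sketch, not from the paper): macro-averaging computes precision/recall per document and then averages, while micro-averaging pools counts over all documents first.

    # Per-document counts: (num_correct, num_predicted, num_gold).
    docs = [(2, 5, 4), (0, 5, 10)]

    # Macro: average of per-document scores.
    macro_p = sum(c / p for c, p, g in docs) / len(docs)   # (0.4 + 0.0) / 2 = 0.20
    macro_r = sum(c / g for c, p, g in docs) / len(docs)   # (0.5 + 0.0) / 2 = 0.25

    # Micro: pool the counts, then divide.
    micro_p = sum(c for c, p, g in docs) / sum(p for c, p, g in docs)  # 2/10 = 0.20
    micro_r = sum(c for c, p, g in docs) / sum(g for c, p, g in docs)  # 2/14 ≈ 0.14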

ERROR when running on a Tesla GPU

Could someone help?

I tried to run this project on a GPU to get a speedup, but an error occurred. What is the correct way to do this? How should I configure the environment?

command:
THEANO_FLAGS='floatX=float32,device=cuda' CUDA_VISIBLE_DEVICES=1 python keyphrase_copynet.py

error info:
ERROR (theano.gpuarray): pygpu was configured but could not be imported or is too old (version 0.7 or higher required)
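
The message itself names the requirement: Theano's gpuarray backend needs pygpu/libgpuarray at version 0.7 or newer. A quick hedged check (this assumes pygpu is installed at all; upgrading libgpuarray/pygpu, e.g. via conda, is the usual fix):

    # If this import fails or prints a version below 0.7,
    # the gpuarray backend (device=cuda) cannot start.
    import pygpu
    print(pygpu.__version__)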

runtime error in training process

Traceback (most recent call last):
  File "keyphrase_copynet.py", line 262, in <module>
    agent.compile_('all')
  File "/home/developer/seq2seq-keyphrase/emolga/models/covc_encdec.py", line 1849, in compile_
    self.compile_train()
  File "/home/developer/seq2seq-keyphrase/emolga/models/covc_encdec.py", line 1880, in compile_train
    logPxz, logPPL = self.decoder.build_decoder(target, cc_matrix, code, c_mask)
  File "/home/developer/seq2seq-keyphrase/emolga/models/covc_encdec.py", line 996, in build_decoder
    non_sequences=[context, c_mask, context_A]
  File "/home/developer/anaconda3/lib/python3.6/site-packages/theano/scan_module/scan.py", line 1076, in scan
    scan_outs = local_op(*scan_inputs)
  File "/home/developer/anaconda3/lib/python3.6/site-packages/theano/gof/op.py", line 615, in __call__
    node = self.make_node(*inputs, **kwargs)
  File "/home/developer/anaconda3/lib/python3.6/site-packages/theano/scan_module/scan_op.py", line 546, in make_node
    inner_sitsot_out.type.dtype))
ValueError: When compiling the inner function of scan the following error has been encountered: The initial state (outputs_info in scan nomenclature) of variable IncSubtensor{Set;:int64:}.0 (argument number 5) has dtype float32, while the result of the inner function (fn) has dtype float64. This can happen if the inner function of scan results in an upcast or downcast.
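
A hedged note on the usual cause of this class of error: a scan initial state (outputs_info) built as float64, numpy's default dtype, while floatX=float32. Creating initial states with theano.config.floatX keeps the dtypes consistent; the names below are illustrative, not taken from covc_encdec.py:

    import numpy as np
    import theano

    batch_size, hidden_dim = 32, 300  # illustrative sizes
    # float32 whenever THEANO_FLAGS sets floatX=float32
    init_state = np.zeros((batch_size, hidden_dim), dtype=theano.config.floatX)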

Some questions about kp20k dataset

Hi @memray,
Thanks for releasing the kp20k dataset. After reading the paper I still have some questions about it.

  1. The paper says the training set has 527,830 articles, the validation set 20,000, and the test set 20,000, but the released dataset contains 530,809 articles. Did you remove the articles that also appear in the other test sets, such as Inspec and SemEval, to bring the training set down to 527,830?
  2. The paper also says you did not use the whole training set and trained only on the validation set (I may have misunderstood this part). Could you explain?
  3. Since my experiments are on keyphrase extraction rather than generation, do you think I can use this dataset if I remove the absent keyphrases (see the sketch below)?
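
On question 3, a hedged sketch of keeping only "present" keyphrases, i.e. those whose full string occurs in the source text. The 'title', 'abstract', and 'keyword' field names are my assumptions about the kp20k JSON; verify them against the actual files:

    def present_keyphrases(doc):
        """Keep keyphrases that appear verbatim in title + abstract."""
        text = (doc['title'] + ' ' + doc['abstract']).lower()
        return [kp.strip() for kp in doc['keyword'].split(';')
                if kp.strip().lower() in text]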

Question about extracting keyphrases

I want to extract keyphrases from my own data using the model.

But I am confused by 'testing_datasets': should it point to 'testing-data/' or to 'baseline-data/'?

Should I convert my data into both the testing-data and baseline-data formats?

I am also hitting this error.

Question about code

Hi Memray,

Sorry to bother you again; I have two questions about the code.

  1. What is the difference between covc_encdec.py and encdec.py?
  2. I am a little confused about the "context" parameter in embeddings.py and "use_context" in the encoder.

Thank you for the explanation.

training dataset

Thanks for sharing your code. I'm a little confused about a JSON file referenced in the code (screenshot omitted) which I cannot find in the folder /dataset/keyphrase.

Training set question

Hi, what is the purpose of each of the files in 'punctuation-20000validation-20000testing'? I am new to NLP and don't quite understand. Is all_600k_dataset.pkl all of the raw data? If so, how can I use files in another format directly as the training set, for example the JSON files in the kp20k dataset? (A sketch follows below.)
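
A hedged sketch of reading kp20k-style JSON (one object per line, with 'title', 'abstract', and 'keyword' fields, as I understand the release) into raw text pairs; the repo's pkl pipeline would still need to tokenize and index these:

    import json

    pairs = []
    with open('kp20k_training.json') as f:
        for line in f:
            doc = json.loads(line)
            source = doc['title'] + ' . ' + doc['abstract']
            targets = [kp.strip() for kp in doc['keyword'].split(';')]
            pairs.append((source, targets))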

About dataset usage during training

In your paper, experiments use four datasets besides kp20k (NUS, etc.). What I would like to know is: did you use those other four datasets during CopyRNN training as well, or did you train CopyRNN on kp20k only and use all five datasets just for testing? Looking forward to your answer, thank you very much.

A question

Hello! When generating a keyphrase, how is the end of generation determined? In other words, how is the generated length decided? Is there an end-of-sequence token, and generation stops when that token has the highest predicted probability? (See the sketch below.)
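
A hedged sketch of the standard answer: at each step the decoder predicts a distribution over the vocabulary, and decoding stops when the end-of-sequence token is selected (or a maximum length is reached). Illustrative code, not the repo's decoder:

    def greedy_decode(step_fn, start_id, eos_id, max_len=8):
        """step_fn(prev_id) -> id of the most probable next token."""
        out, tok = [], start_id
        for _ in range(max_len):      # hard cap on keyphrase length
            tok = step_fn(tok)
            if tok == eos_id:         # stop once EOS is the top prediction
                break
            out.append(tok)
        return out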

A question about the dataset and generating keyphrases

Hello. Your paper mentions that the dataset has been released; where can I get the 500k+ paper dataset? I also have a question about the model. You use a seq2seq model to generate keyphrases, and during preprocessing one paper is split into several training examples, so during training the same encoded text is decoded into several different keyphrases. In machine-translation terms, that would be one source text with many different reference translations, is that right? Then at test time, given a new text, how does the decoder produce multiple keyphrases? For example, when computing F@5, do you encode the input text, decode with beam search, and return the top 5 from the final heap? Thanks :)
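
On the last point, a hedged sketch of taking the top-k candidates from beam-search output, assuming beam search returns completed hypotheses with scores (illustrative only):

    def top_k_phrases(beams, k=5):
        """beams: list of (phrase, score) pairs from beam search."""
        return [phrase for phrase, score
                in sorted(beams, key=lambda b: b[1], reverse=True)[:k]]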

inspec

Hello, could you send me a copy of the Inspec dataset? With kp20k alone the model does not seem to generalize: it fits my training set reasonably well, but scores on the test set come out far too low. Thanks.

Prediction results differ from your saved prediction results

Hello,
I ran the prediction part of your code, loading the trained model you provide (experiments.keyphrase-all.one2one.copy.id=20170106-025508.epoch=4.batch=1000.pkl) on the test data in seq2seq-keyphrase/dataset/keyphrase/baseline-data/kp20k, i.e. I ran it once without changing any parameters. When I evaluated, the predictions were completely different from the prediction files you provide, and clearly wrong: they are entirely unrelated to the text, and the reported F1 score is 0.
I would like to know why this happens. Is the loaded model wrong, or do I need to set some other parameters? Thanks.

generate_multiple() does not accept the 'return_encoding' parameter

TypeError: generate_multiple() got an unexpected keyword argument 'return_encoding'

The failing code (the 'if' that matches the 'else' below was not included in the report):

        input_encoding, prediction, score, output_encoding = agent.generate_multiple(
            inputs_unk[None, :], return_all=True, return_encoding=True)
        input_encodings.append(input_encoding)
        output_encodings.append(output_encoding)
    else:
        prediction, score = agent.generate_multiple(inputs_unk[None, :], return_encoding=False)

Some questions about OOV handling in the model

Hello. During training, the encoder uses a fixed vocab_size. How does the copy part of the CopyNet decoder identify and copy words outside that vocabulary, given that out-of-vocabulary words are all represented as unknown during training? (See the sketch below.)
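
A hedged sketch of how copy mechanisms typically handle this: source-side OOV words get temporary ids beyond the fixed vocabulary, so the copy distribution can point at them even though the generate distribution cannot. Illustrative only; the repo's cc_martix-based implementation may differ in detail:

    def extend_vocab(source_tokens, word2idx, vocab_size):
        """Map source tokens to ids, giving OOVs temporary ids >= vocab_size."""
        ext_ids, oovs = [], []
        for w in source_tokens:
            if w in word2idx and word2idx[w] < vocab_size:
                ext_ids.append(word2idx[w])
            else:
                if w not in oovs:
                    oovs.append(w)
                ext_ids.append(vocab_size + oovs.index(w))
        return ext_ids, oovs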

About environment configuration

Hello, I would like to know the deep-learning environment you ran this program with: Theano version, CUDA version, GPU model, and so on. I am not sure whether my machine can run this program successfully. When running copynet.py I am also hitting ValueError: zero-size array to reduction operation maximum which has no identity (issue #10). I suspect it may be related to my environment configuration. Thank you very much!
