edinburghnlp / xsum

Topic-Aware Convolutional Neural Networks for Extreme Summarization

License: MIT License

Python 98.58% C++ 1.42%

abstractive-summarization convolutional-neural-networks extreme-summarization topic-aware

xsum's Issues

Mail address not found!

We just sent an email to the address mentioned, [email protected], to request the dataset for our university project.
But a failure message came back: Diagnostic-Code: smtp; 550 5.1.1 <[email protected]>... User unknown

Maybe the address changed or there is a typo? If you could give us an address to contact, we would be happy to send our request there!

Model Training causing issue AttributeError: 'Tensor' object has no attribute 'conv_tbc'

File "C:\Users\Gangolian's PC\Desktop\Text Summary\WikiHow-Dataset-master\Topic-ConvS2S\fairseq\models\fconv.py", line 154, in forward
x = conv(x)
File "C:\ProgramData\Anaconda3\envs\tensorflow\lib\site-packages\torch\nn\modules\module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "C:\Users\Gangolian's PC\Desktop\Text Summary\WikiHow-Dataset-master\Topic-ConvS2S\fairseq\modules\conv_tbc.py", line 32, in forward
return input.contiguous().conv_tbc(self.weight, self.bias, self.padding[0])
AttributeError: 'Tensor' object has no attribute 'conv_tbc'

CUDA_VISIBLE_DEVICES=1 python XSum-Topic-ConvS2S/train.py ./data-topic-convs2s --source-lang document --target-lang summary --doctopics doc-topics --max-sentences 32 --arch fconv --criterion label_smoothed_cross_entropy --max-epoch 200 --clip-norm 0.1 --lr 0.10 --dropout 0.2 --save-dir ./checkpoints-topic-convs2s --no-progress-bar --log-interval 10

torch version = 1.0.0
python version = 3.5.6

The pre-trained LDA model is not working

Loading LDA model from ./lda-train-document-lemma-topic-512-iter-1000...
Start decoding in ./xsum-preprocessed/document-lemma-topic-512-iter-1000...
Traceback (most recent call last):
File "scripts/lda-gensim-decoding-document-lemma.py", line 62, in <module>
bow = lda.id2word.doc2bow(doc)
AttributeError: 'NoneType' object has no attribute 'doc2bow'

Dataset

Thanks for your excellent work.

Would you mind providing the XSum dataset directly, just like the CNN/Daily Mail dataset that we are familiar with? I believe it would save time and be more convenient for experiments.

I'd appreciate any help you could give. Thanks~

Some issue with generation

Hey,

I was trying to run Topic-ConvS2S's generate.py on only the test data. I preprocessed only the test data, following the guidelines in the XSum-Dataset README, with the pretrained LDA model you provided there. However, when I run generation I get some errors.

Here is the command line I used:

CUDA_VISIBLE_DEVICES=1 python XSum-Topic-ConvS2S/generate.py ./data-topic-convs2s \
  --path ./topic-convs2s-emnlp18/checkpoints-topic-convs2s/checkpoint_best.pt \
  --batch-size 1 \
  --beam 10 \
  --replace-unk \
  --source-lang document \
  --target-lang summary \
  --doctopics doc-topics \
  --encoder-embed-dim 512 > test-output-topic-convs2s-checkpoint-best.pt

And here is the error message:

Traceback (most recent call last):
  File "XSum-Topic-ConvS2S/generate.py", line 166, in <module>
    main(args)
  File "XSum-Topic-ConvS2S/generate.py", line 43, in main
    models, _ = utils.load_ensemble_for_inference(args.path, dataset.src_dict, dataset.dst_dict)
  File "XSum-Topic-ConvS2S/fairseq/utils.py", line 146, in load_ensemble_for_inference
    model.load_state_dict(state['model'])
  File "XSum-Topic-ConvS2S/fairseq/models/fairseq_model.py", line 69, in load_state_dict
    super().load_state_dict(state_dict, strict)
  File "../anaconda3/envs/thisenv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 847, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for FConvModel:
	size mismatch for encoder.embed_tokens.weight: copying a param with shape torch.Size([50004, 512]) from checkpoint, the shape in current model is torch.Size([4, 512]).
	size mismatch for decoder.embed_tokens.weight: copying a param with shape torch.Size([50004, 512]) from checkpoint, the shape in current model is torch.Size([4, 512]).
	size mismatch for decoder.fc3.bias: copying a param with shape torch.Size([50004]) from checkpoint, the shape in current model is torch.Size([4]).
	size mismatch for decoder.fc3.weight_g: copying a param with shape torch.Size([50004, 1]) from checkpoint, the shape in current model is torch.Size([4, 1]).
	size mismatch for decoder.fc3.weight_v: copying a param with shape torch.Size([50004, 256]) from checkpoint, the shape in current model is torch.Size([4, 256]).

Do you know what the error could be? Thanks in advance.
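
One thing that may be worth checking for a report like the one above: a vocabulary dimension of 4 in the current model, against 50004 in the checkpoint, would be consistent with dictionary files that contributed no entries beyond a few reserved symbols. The sketch below counts the types a dictionary file would yield. It assumes one token per non-empty line and four reserved special symbols; both are assumptions rather than something this thread confirms, and `count_dictionary_types` is a hypothetical helper, not part of the repository.

```python
def count_dictionary_types(lines, num_special=4):
    """Count vocabulary types for a dictionary given as an iterable of lines.

    Assumption: every non-empty line holds one token (optionally followed by
    a count), and `num_special` reserved symbols are added on top.
    """
    entries = sum(1 for line in lines if line.strip())
    return entries + num_special


if __name__ == "__main__":
    import sys

    # Usage: python count_types.py path/to/dict.document.txt
    with open(sys.argv[1], encoding="utf-8") as handle:
        print(count_dictionary_types(handle))
```

If this prints 4 for the dictionary in the data directory passed to generate.py, the dictionary there is effectively empty and cannot match a checkpoint trained with 50004 types.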

About collecting data

I want to know if I can crawl BBC articles as a dataset at will. Won't there be a copyright issue?

Error in downloading data

~/XSum/XSum-Dataset$ python scripts/download-bbc-articles.py
Creating the download directory.
Downloading URLs from the XSum-WebArxiveUrls.txt file:
urls left to download: 226711
5.19% [11765/226711]
Traceback (most recent call last):
File "scripts/download-bbc-articles.py", line 253, in <module>
main()
File "scripts/download-bbc-articles.py", line 248, in main
DownloadMode(urls_file_to_download, missing_urls_file, downloads_dir, args.request_parallelism, args.timestamp_exactness)
File "scripts/download-bbc-articles.py", line 198, in DownloadMode
for url, story_html in results:
File "/usr/lib/python2.7/multiprocessing/pool.py", line 668, in next
raise value
requests.exceptions.TooManyRedirects: Exceeded 30 redirects.

I could probably fix this issue myself, but I want to make sure my copy of the dataset is exactly the same as the one you used in the paper. Could you upload the dataset to Dropbox or Google Cloud, or fix this issue? Thanks.

download the dataset

Thanks for your excellent work!

When I run download-bbc-articles.py, it shows that

I want to know why. Thanks for your help~

AttributeError: function 'bleu_zero_init' not found

python XSum-Topic-ConvS2S/generate.py D:\NLP\XSum\XSUM-EMNLP18-topic-convs2s\topic-convs2s-emnlp18\data-topic-convs2s --path D:\NLP\XSum\XSUM-EMNLP18-topic-convs2s\topic-convs2s-emnlp18\checkpoints-topic-convs2s\checkpoint_best.pt --batch-size 1 --beam 10 --replace-unk --source-lang document --target-lang summary --doctopics doc-topics --encoder-embed-dim 512 > test-output-topic-convs2s-checkpoint-best.pt
Traceback (most recent call last):
File "XSum-Topic-ConvS2S/generate.py", line 164, in <module>
main(args)
File "XSum-Topic-ConvS2S/generate.py", line 88, in main
scorer = bleu.Scorer(dataset.dst_dict.pad(), dataset.dst_dict.eos(), dataset.dst_dict.unk())
File "D:\PycharmFile\XSum\XSum-Topic-ConvS2S\fairseq\bleu.py", line 44, in __init__
self.reset()
File "D:\PycharmFile\XSum\XSum-Topic-ConvS2S\fairseq\bleu.py", line 50, in reset
C.bleu_zero_init(ctypes.byref(self.stat))
File "D:\python3.6.8\lib\ctypes\__init__.py", line 361, in __getattr__
func = self.__getitem__(name)
File "D:\python3.6.8\lib\ctypes\__init__.py", line 366, in __getitem__
func = self._FuncPtr((name_or_ordinal, self))
AttributeError: function 'bleu_zero_init' not found

And the output file "test-output-topic-convs2s-checkpoint-best.pt":

Loading D:\NLP\XSum\XSUM-EMNLP18-topic-convs2s\topic-convs2s-emnlp18\data-topic-convs2s\dict.document-lemma.lda.txt
Done!
Loading D:\NLP\XSum\XSUM-EMNLP18-topic-convs2s\topic-convs2s-emnlp18\data-topic-convs2s\test.document
Done!
Loading D:\NLP\XSum\XSUM-EMNLP18-topic-convs2s\topic-convs2s-emnlp18\data-topic-convs2s\test.summary
Done!
Loading D:\NLP\XSum\XSUM-EMNLP18-topic-convs2s\topic-convs2s-emnlp18\data-topic-convs2s\test.document-lemma
Done!
Loading D:\NLP\XSum\XSUM-EMNLP18-topic-convs2s\topic-convs2s-emnlp18\data-topic-convs2s\test.doc-topics
Done!
| loading model(s) from D:\NLP\XSum\XSUM-EMNLP18-topic-convs2s\topic-convs2s-emnlp18\checkpoints-topic-convs2s\checkpoint_best.pt
| [document] dictionary: 50004 types
| [summary] dictionary: 50004 types
| D:\NLP\XSum\XSUM-EMNLP18-topic-convs2s\topic-convs2s-emnlp18\data-topic-convs2s test 5 examples

Too much non-corresponding data in this dataset

RESTBODY in 36369226.summary is

[SN]RESTBODY[SN]
Share this with
Email
Facebook
Messenger
Messenger
Twitter
Pinterest
WhatsApp
LinkedIn
Copy this link

And 34618290.summary is

[SN]RESTBODY[SN]
23 October 2015 Last updated at 18:50 BST
Able Seaman Albert McKenzie took part in the Zeebrugge raid on 23 April 1918.
BBC London reporter Sarah Harris reports.

These are not the only two; many files have this problem.
Please check and fix it!

noise in the dataset

Thanks for the wonderful repo! I have a question in the data collection part.

I followed the instructions exactly and found that there is a lot of noisy text in the dataset, such as "Share this with email facebook messenger ..... copy this link" at the head of many documents.

Did you train your model on the dataset with this noise? Were the numbers in the paper obtained with the noisy text? Please let me know if I am wrong. I just need clarification so we can filter the noise out for a valid replication.

Thank you again for your amazing work!
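
For anyone who wants to filter this block out before training, here is a minimal sketch. It assumes the widget always starts with a line reading "Share this with" and ends with a line reading "Copy this link", as in the excerpts quoted in these issues. `strip_share_widget` is a hypothetical helper, not part of the repository, and whether the authors applied any such filtering is not stated in this thread.

```python
SHARE_START = "Share this with"
SHARE_END = "Copy this link"


def strip_share_widget(lines):
    """Drop every block running from SHARE_START through SHARE_END.

    A block that is opened but never closed is kept unchanged, so real
    article text is not silently discarded.
    """
    cleaned = []
    pending = None  # lines of a widget block that has not been closed yet
    for line in lines:
        text = line.strip()
        if pending is None:
            if text == SHARE_START:
                pending = [line]
            else:
                cleaned.append(line)
        else:
            pending.append(line)
            if text == SHARE_END:
                pending = None  # block is complete: discard it
    if pending is not None:
        cleaned.extend(pending)  # unterminated block: restore it
    return cleaned
```

Exact line matching is deliberately conservative; a sentence that merely contains the phrase "share this with" is left alone.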

Bad Output

Hi! I followed the instructions for the ConvS2S model completely, i.e. data preprocessing, model training, generation, and extracting the final hypotheses. I'm wondering why the output file is in .pt format (final-test-output-convs2s-checkpoint-best.pt)?
Isn't it supposed to be a .txt file, since it contains generated summaries?
And when I open the generated final-test-output-convs2s-checkpoint-best.pt file, it looks like:

Joke of a repo

This is a joke of a repo. There is no code and no pretrained model. I guess the research must be the same.

where to see the results

I followed the instructions and ran the pretrained models, but where can I see the results? All the document and summary files in data-convs2s remain unchanged.
If anyone has succeeded in running this, please help me.
This is the notebook link (it is the section titled EdinburghNLP/XSum):
https://colab.research.google.com/drive/10XEKdP2UdoiwaoZgWfH0fydqrz7rhrJa?usp=sharing

PyTorch version compatibility

Hi,

firstly, thanks very much for the great work.

The issue I raise here is about the PyTorch version that makes XSum work. When I installed my PyTorch in my conda environment using "pip install torch", it installed the latest torch 1.0.1.post2. This specific torch version seemed to cause an AttributeError, which I believe is a compatibility issue. In contrast, torch 0.4.0 or 0.4.1 works, and no compatibility issue exists.

Therefore, does "Our code requires PyTorch version >= 0.4.0" actually mean PyTorch version 0.4.0 or 0.4.1?
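
Until that is clarified, a guard like the following could fail early on an untested version. It is a sketch only: the set of working versions is taken from this report (0.4.0 and 0.4.1) and is not an official compatibility list, and `parse_version` and `is_reported_working` are hypothetical helpers.

```python
import re

# Versions this issue reports as working; not an official support matrix.
REPORTED_WORKING = {(0, 4, 0), (0, 4, 1)}


def parse_version(version):
    """Return the leading numeric part of a version string as a 3-tuple."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError("unrecognised version string: %r" % (version,))
    # A missing patch component (e.g. "0.4") is treated as 0.
    return tuple(int(part) for part in match.groups(default="0"))


def is_reported_working(version):
    """True if `version` is one of the versions reported to work here."""
    return parse_version(version) in REPORTED_WORKING
```

In a training script this would be called as `is_reported_working(torch.__version__)` after `import torch`; suffixes such as `.post2` are ignored by the parser.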

download-bbc-articles.py: error: unrecognized arguments: [--timestamp_exactness 14]

Hi, thanks for your work.
When I ran "python download-bbc-articles.py [--timestamp_exactness 14]", I encountered this error:
usage: download-bbc-articles.py [-h]
[--request_parallelism REQUEST_PARALLELISM]
[--context_token_limit CONTEXT_TOKEN_LIMIT]
[--timestamp_exactness TIMESTAMP_EXACTNESS]
download-bbc-articles.py: error: unrecognized arguments: [--timestamp_exactness 14]
Could you help me?
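
In argparse usage messages, square brackets mark an argument as optional; they are not meant to be typed. The sketch below rebuilds a parser with the three flags shown in the usage text above and compares the two invocations. The `type=int` choice and the absence of defaults are assumptions for illustration; the real script may define them differently.

```python
import argparse


def build_parser():
    """A stand-in parser exposing the flags listed in the usage message."""
    parser = argparse.ArgumentParser(prog="download-bbc-articles.py")
    parser.add_argument("--request_parallelism", type=int)   # type assumed
    parser.add_argument("--context_token_limit", type=int)   # type assumed
    parser.add_argument("--timestamp_exactness", type=int)   # type assumed
    return parser


parser = build_parser()

# Typed with literal brackets: neither token starts with "-", so both are
# treated as stray positional arguments and reported as unrecognized.
_, unknown = parser.parse_known_args(["[--timestamp_exactness", "14]"])

# Typed without brackets: the flag and its value parse normally.
args = parser.parse_args(["--timestamp_exactness", "14"])
```

Here `unknown` holds the two bracketed tokens, while `args.timestamp_exactness` is the integer 14. On the command line this corresponds to running `python download-bbc-articles.py --timestamp_exactness 14`.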

How to train a new model using my data

Thank you very much for this great work.
I would like to ask how I can train a new model on a new dataset.
I just want to know what form the data should be in so that the model can be trained on it.
Thanks in advance.

Some bugs in generation phase

Hi,

Thanks for your excellent work.

I downloaded the data and processed it as described in the instructions. (A few articles could not be downloaded, so I modified the data processing step to skip them; I guess this should not be a big issue.) Then I trained a new model and everything looked good. However, when I run the generation script, it gives me the following errors:

The command:
CUDA_VISIBLE_DEVICES=1 python XSum-ConvS2S/generate.py ./data-convs2s --path ./checkpoints-convs2s/checkpoint-best.pt --batch-size 1 --beam 10 --replace-unk --source-lang document --target-lang summary > test-output-convs2s-checkpoint-best.pt

Output:
0%| | 0/11334 [00:00<?, ?it/s]
/home/rasmlnlp/zhiyu/anaconda3/envs/tf1.12/lib/python3.5/site-packages/torch/autograd/function.py:41: UserWarning: mark_shared_storage is deprecated. Tensors with shared storages are automatically tracked. Note that calls to set_() are not tracked
'mark_shared_storage is deprecated. '
Traceback (most recent call last):
File "XSum-ConvS2S/generate.py", line 161, in <module>
main(args)
File "XSum-ConvS2S/generate.py", line 96, in main
for sample_id, src_tokens, target_tokens, hypos in translations:
File "/home/rasmlnlp/zhiyu/XSum/XSum-ConvS2S/fairseq/sequence_generator.py", line 77, in generate_batched_itr
prefix_tokens=s['target'][:, :prefix_size] if prefix_size > 0 else None,
File "/home/rasmlnlp/zhiyu/XSum/XSum-ConvS2S/fairseq/sequence_generator.py", line 90, in generate
return self._generate(src_tokens, src_lengths, beam_size, maxlen, prefix_tokens)
File "/home/rasmlnlp/zhiyu/XSum/XSum-ConvS2S/fairseq/sequence_generator.py", line 250, in _generate
tokens[:, :step+1], encoder_outs, incremental_states)
File "/home/rasmlnlp/zhiyu/XSum/XSum-ConvS2S/fairseq/sequence_generator.py", line 413, in _decode
decoder_out, attn = model.decoder(tokens, encoder_out, incremental_states[model])
File "/home/rasmlnlp/zhiyu/anaconda3/envs/tf1.12/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/home/rasmlnlp/zhiyu/XSum/XSum-ConvS2S/fairseq/models/fconv.py", line 266, in forward
x, attn_scores = attention(x, target_embedding, (encoder_a, encoder_b))
File "/home/rasmlnlp/zhiyu/anaconda3/envs/tf1.12/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/home/rasmlnlp/zhiyu/XSum/XSum-ConvS2S/fairseq/models/fconv.py", line 160, in forward
x = self.bmm(x, encoder_out[0])
File "/home/rasmlnlp/zhiyu/anaconda3/envs/tf1.12/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/home/rasmlnlp/zhiyu/XSum/XSum-ConvS2S/fairseq/modules/beamable_mm.py", line 34, in forward
input1 = input1[:, 0, :].unfold(0, beam, beam).transpose(2, 1)
RuntimeError: invalid argument 3: out of range at /opt/conda/conda-bld/pytorch_1535490206202/work/aten/src/THC/generic/THCTensor.cpp:444

Is there a bug here? Or is there another way to reproduce the results in the paper? (Training only reports the loss and perplexity on the validation set, not ROUGE scores.)

Thank you.

Running pretrained models on Other datasets

Hi,
You did some great work here! Is there a way to run your pretrained models on another dataset? I tried just replacing the train.document and train.summary files with other data, but the results in final-test-output-convs2s-checkpoint-best.pt were totally unrelated and repetitive. It seems it is still trying to map my custom values to previously seen titles?

Here's what I did:
I was not sure which file the test data is read from, so I replaced train.document, test.document, valid.document, and validation.document with the texts (the same in each), and train.summary, test.summary, valid.summary, and validation.summary with the titles (the same in each). I copied the dict.document.txt and dict.summary.txt from your original tar.

Then I ran

cd XSum-ConvS2S
python generate.py ./convs2s-emnlp18/data-convs2s --path ./convs2s-emnlp18/checkpoints-convs2s/checkpoint-best.pt --batch-size 1 --beam 10 --replace-unk --source-lang document --target-lang summary > test-output-convs2s-checkpoint-best.pt
cd ..
python scripts/extract-hypothesis-fairseq.py -o XSum-ConvS2S/test-output-convs2s-checkpoint-best.pt -f final-test-output-convs2s-checkpoint-best.pt

Increasing loss in a new dataset

Thank you for sharing your repo! I am running your code on CNN/Daily Mail and ran into an issue similar to the one you encountered in facebookresearch/fairseq#118. Could you let me know how you resolved it in the end?

Error in scripts/parse-bbc-html-data.py

(2.7.18/envs/venv) mittu1008@katfuji-7:~/XSum/XSum-Dataset$ python -m scripts/parse-bbc-html-data
226711
Traceback (most recent call last):
File "/home/mittu1008/.pyenv/versions/2.7.18/lib/python2.7/runpy.py", line 174, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/home/mittu1008/.pyenv/versions/2.7.18/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/home/mittu1008/XSum/XSum-Dataset/scripts/parse-bbc-html-data.py", line 198, in <module>
story_title, story_introduction, story_restcontent = GenerateMapper((webarxivid, "bbc", htmldata))
File "/home/mittu1008/XSum/XSum-Dataset/scripts/parse-bbc-html-data.py", line 143, in GenerateMapper
story_title = ParseHtml(raw_story, corpus).getstory_title()
File "/home/mittu1008/XSum/XSum-Dataset/scripts/parse-bbc-html-data.py", line 44, in __init__
self.parser = html.HTMLParser(encoding=chardet.detect(self.story.html)['encoding'])
File "/home/mittu1008/.pyenv/versions/2.7.18/envs/venv/lib/python2.7/site-packages/lxml/html/__init__.py", line 1910, in __init__
super(HTMLParser, self).__init__(**kwargs)
File "src/lxml/parser.pxi", line 1728, in lxml.etree.HTMLParser.__init__
File "src/lxml/parser.pxi", line 840, in lxml.etree._BaseParser.__init__
LookupError: unknown encoding: 'IBM852'
(2.7.18/envs/venv)

What should I do about this error?
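
One possible workaround, offered as a sketch rather than a confirmed fix, is to decode the raw bytes in Python first and hand the parser text instead of an encoding name. It relies on Python's own codec registry knowing the detected name (IBM852 is an alias of cp852 there) even when the HTML parser rejects it. `decode_html` is a hypothetical helper, the UTF-8 fallback is an assumption, and the code is written for Python 3, whereas the traceback above comes from Python 2.7.

```python
import codecs


def decode_html(raw_bytes, detected_encoding, fallback="utf-8"):
    """Decode raw HTML bytes using a detected encoding name.

    Falls back to `fallback` when the name is missing or unknown to
    Python's codec registry; undecodable bytes are replaced, not raised.
    """
    try:
        codec_name = codecs.lookup(detected_encoding or fallback).name
    except LookupError:
        codec_name = fallback
    return raw_bytes.decode(codec_name, errors="replace")
```

The decoded string could then be passed to the HTML parser without an `encoding` argument, although that change to the script is untested here.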

How to use dataset

Hi,

thanks for providing the dataset as a download. I downloaded the dataset from the location mentioned in #12 (comment)
But it appears that the format of the dataset is different from the files you get if you download the data yourself.

See this gist: the first file, 12092740.data, I downloaded myself from archive.org, while the second file was part of the downloaded dataset.

As you can see the downloaded file contains the attributes [XSUM]URL[XSUM], [XSUM]INTRODUCTION[XSUM] and [XSUM]RESTBODY[XSUM]. But the file from the dataset has [SN]URL[SN], [SN]TITLE[SN], [SN]FIRST-SENTENCE[SN] and [SN]RESTBODY[SN].

My problem is that if I follow the tutorial at https://github.com/EdinburghNLP/XSum/tree/master/XSum-Dataset the scripts don't work with the unmodified files.

Which changes do I need to make to the scripts?

Best,
Pyfisch
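
A reader that accepts either marker style avoids having to edit the files themselves. The sketch below assumes each marker sits on a line of its own with its content on the following lines, as in the excerpts quoted elsewhere in these issues. `parse_sections` is a hypothetical helper and is not how the repository's scripts are written.

```python
import re

# Matches "[SN]NAME[SN]" or "[XSUM]NAME[XSUM]" on a line of its own.
MARKER = re.compile(r"^\[(SN|XSUM)\](.+)\[\1\]$")


def parse_sections(text):
    """Split a data file into a dict mapping section name to its text."""
    sections = {}
    current = None
    for line in text.splitlines():
        match = MARKER.match(line.strip())
        if match:
            current = match.group(2)
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {name: "\n".join(body).strip() for name, body in sections.items()}
```

With this, a file using `[SN]FIRST-SENTENCE[SN]` and one using `[XSUM]INTRODUCTION[XSUM]` both come back as plain dictionaries, and the differing section names can be mapped onto each other in one place.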

How to make this project support other languages such as Chinese

Thanks.

Just a comment (poor results) ...

... very interesting paper (http://aclweb.org/anthology/D18-1206 , with included examples), but I tried the online demo, http://cohort.inf.ed.ac.uk/xsum.html , with rather appalling results.

The textual sources for my tests were the abstract for one of my published papers, and the Google infobox for Nova Scotia.

Does the code for the online demo faithfully replicate the code in the paper?

Reference summary and Source text

Hello,
Thank you very much for sharing this work and dataset.
Currently, I am working on abstractive summarization and I wish to evaluate my model on XSum dataset.

While I was processing the dataset, I was wondering whether [SN]TITLE[SN] is used or not.
It appears that the dataset uses:

[SN]FIRST-SENTENCE[SN] as Reference summary
[SN]RESTBODY[SN] as Source text
[SN]TITLE[SN] : Not used

May I request your confirmation of the above information?

(I am using the dataset downloaded from http://kinloch.inf.ed.ac.uk/public/XSUM-EMNLP18-Summary-Data-Original.tar.gz )

Thank you very much and I look forward to hearing your reply,
Best,
Wonjin

Raw dataset

I saw the dataset is available at http://kinloch.inf.ed.ac.uk/public/direct/XSUM-EMNLP18-Summary-Data-Original.tar.gz (link given by @shashiongithub )

I tried to use this dataset to reproduce the BART results, but I couldn't. According to one of the BART authors in this issue (fairseq repository), this is because I'm using an already processed version of the dataset.

Is it possible to have a link to the raw dataset (with no postprocessing of any kind)?

results on other dataset

Hello, great work!
Did you or anyone else train the model on CNN/DM or Gigaword and get results?

Hoping for a reply!