edinburghnlp / xsum
Topic-Aware Convolutional Neural Networks for Extreme Summarization
License: MIT License
We just sent an email to the address mentioned, [email protected], to request the dataset for our university project, but a delivery failure notice came back:
Diagnostic-Code: smtp; 550 5.1.1 <[email protected]>... User unknown
Has the address changed, or is there a typo? If you could give us an address to contact, we would be happy to send our request there.
File "C:\Users\Gangolian's PC\Desktop\Text Summary\WikiHow-Dataset-master\Topic-ConvS2S\fairseq\models\fconv.py", line 154, in forward
x = conv(x)
File "C:\ProgramData\Anaconda3\envs\tensorflow\lib\site-packages\torch\nn\modules\module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "C:\Users\Gangolian's PC\Desktop\Text Summary\WikiHow-Dataset-master\Topic-ConvS2S\fairseq\modules\conv_tbc.py", line 32, in forward
return input.contiguous().conv_tbc(self.weight, self.bias, self.padding[0])
AttributeError: 'Tensor' object has no attribute 'conv_tbc'
CUDA_VISIBLE_DEVICES=1 python XSum-Topic-ConvS2S/train.py ./data-topic-convs2s --source-lang document --target-lang summary --doctopics doc-topics --max-sentences 32 --arch fconv --criterion label_smoothed_cross_entropy --max-epoch 200 --clip-norm 0.1 --lr 0.10 --dropout 0.2 --save-dir ./checkpoints-topic-convs2s --no-progress-bar --log-interval 10
torch version =1.0.0
python version= 3.5.6
Loading LDA model from ./lda-train-document-lemma-topic-512-iter-1000...
Start decoding in ./xsum-preprocessed/document-lemma-topic-512-iter-1000...
Traceback (most recent call last):
File "scripts/lda-gensim-decoding-document-lemma.py", line 62, in <module>
bow = lda.id2word.doc2bow(doc)
AttributeError: 'NoneType' object has no attribute 'doc2bow'
Thanks for your excellent work.
Would you mind providing the XSum dataset directly, just like the CNN/Daily Mail dataset that we are familiar with? I believe it would save time and be more convenient for experiments.
I'd appreciate any help you could give. Thanks~
Hey,
I was trying to run Topic-ConvS2S's generate.py on only the test data. I preprocessed only the test data, following the guidelines in the XSum_Dataset README, with the pretrained LDA model you provided there. However, when I run the generation I get some errors.
Here is the command line I used:
CUDA_VISIBLE_DEVICES=1 python XSum-Topic-ConvS2S/generate.py ./data-topic-convs2s \
--path ./topic-convs2s-emnlp18/checkpoints-topic-convs2s/checkpoint_best.pt \
--batch-size 1 \
--beam 10 \
--replace-unk \
--source-lang document \
--target-lang summary \
--doctopics doc-topics \
--encoder-embed-dim 512 > test-output-topic-convs2s-checkpoint-best.pt
And here is the error message:
Traceback (most recent call last):
File "XSum-Topic-ConvS2S/generate.py", line 166, in <module>
main(args)
File "XSum-Topic-ConvS2S/generate.py", line 43, in main
models, _ = utils.load_ensemble_for_inference(args.path, dataset.src_dict, dataset.dst_dict)
File "XSum-Topic-ConvS2S/fairseq/utils.py", line 146, in load_ensemble_for_inference
model.load_state_dict(state['model'])
File "XSum-Topic-ConvS2S/fairseq/models/fairseq_model.py", line 69, in load_state_dict
super().load_state_dict(state_dict, strict)
File "../anaconda3/envs/thisenv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 847, in load_state_dict
self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for FConvModel:
size mismatch for encoder.embed_tokens.weight: copying a param with shape torch.Size([50004, 512]) from checkpoint, the shape in current model is torch.Size([4, 512]).
size mismatch for decoder.embed_tokens.weight: copying a param with shape torch.Size([50004, 512]) from checkpoint, the shape in current model is torch.Size([4, 512]).
size mismatch for decoder.fc3.bias: copying a param with shape torch.Size([50004]) from checkpoint, the shape in current model is torch.Size([4]).
size mismatch for decoder.fc3.weight_g: copying a param with shape torch.Size([50004, 1]) from checkpoint, the shape in current model is torch.Size([4, 1]).
size mismatch for decoder.fc3.weight_v: copying a param with shape torch.Size([50004, 256]) from checkpoint, the shape in current model is torch.Size([4, 256]).
Do you know what the cause of this error could be? Thanks in advance.
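One possible reading of the size mismatch above: the checkpoint was trained with a 50004-entry vocabulary, while the model built at load time has only 4 entries, which would be consistent with a dictionary file in the data directory that is empty or was not found. A minimal sketch for checking this, assuming one dictionary entry per line and four reserved symbols (a number inferred from the shapes in the traceback, not confirmed by the authors); the helper name and the demo file are hypothetical:

```python
import os
import tempfile

# Assumption: number of reserved symbols (pad, eos, unk, ...) that the toolkit
# adds on top of the tokens listed in the dictionary file.
NUM_SPECIAL = 4


def expected_vocab_size(dict_path, num_special=NUM_SPECIAL):
    """Count non-empty lines in a dictionary file and add the reserved symbols."""
    with open(dict_path, encoding="utf-8") as handle:
        entries = sum(1 for line in handle if line.strip())
    return entries + num_special


# Tiny stand-in dictionary so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dict.document.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("the 120\nof 80\nsummary 5\n")
    demo_size = expected_vocab_size(path)

print(demo_size)  # 3 entries plus 4 reserved symbols
```

If the value computed for the real dictionary files in ./data-topic-convs2s is not 50004, the data directory probably does not match the released checkpoint.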
I want to know if I can crawl BBC articles as a dataset at will. Won't there be a copyright issue?
~/XSum/XSum-Dataset$ python scripts/download-bbc-articles.py
Creating the download directory.
Downloading URLs from the XSum-WebArxiveUrls.txt file:
urls left to download: 226711
5.19% [11765/226711]
Traceback (most recent call last):
File "scripts/download-bbc-articles.py", line 253, in <module>
main()
File "scripts/download-bbc-articles.py", line 248, in main
DownloadMode(urls_file_to_download, missing_urls_file, downloads_dir, args.request_parallelism, args.timestamp_exactness)
File "scripts/download-bbc-articles.py", line 198, in DownloadMode
for url, story_html in results:
File "/usr/lib/python2.7/multiprocessing/pool.py", line 668, in next
raise value
requests.exceptions.TooManyRedirects: Exceeded 30 redirects.
I could probably work around this myself, but I want to make sure my copy of the dataset is exactly the same as the one used in the paper. Could you upload the dataset to Dropbox or Google Cloud, or fix this issue? Thanks.
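The traceback above shows a single failing URL aborting the whole run. A minimal sketch of one way to record such failures and keep going, not the repository's own fix: `download_all`, `fake_fetch` and the local `TooManyRedirects` class are hypothetical stand-ins (the last one for `requests.exceptions.TooManyRedirects`), so the example runs without network access.

```python
def download_all(urls, fetch, skip_on=(Exception,)):
    """Fetch every URL, recording failures instead of aborting the whole run."""
    pages, missing = {}, []
    for url in urls:
        try:
            pages[url] = fetch(url)
        except skip_on as error:
            # Keep the URL and the error type so the gap can be audited later.
            missing.append((url, type(error).__name__))
    return pages, missing


class TooManyRedirects(Exception):
    """Stand-in for requests.exceptions.TooManyRedirects (assumption for the demo)."""


def fake_fetch(url):
    # Hypothetical fetcher: pretends that URLs containing "loop" redirect forever.
    if "loop" in url:
        raise TooManyRedirects("Exceeded 30 redirects.")
    return "<html>" + url + "</html>"


pages, missing = download_all(
    ["http://example.org/a", "http://example.org/loop"],
    fake_fetch,
    skip_on=(TooManyRedirects,),
)
print(missing)
```

Skipping URLs means the resulting copy may differ from the one used in the paper, which is exactly the concern raised above, so the list of missing URLs is worth keeping.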
python XSum-Topic-ConvS2S/generate.py D:\NLP\XSum\XSUM-EMNLP18-topic-convs2s\topic-convs2s-emnlp18\data-topic-convs2s --path D:\NLP\XSum\XSUM-EMNLP18-topic-convs2s\topic-convs2s-emnlp18\checkpoints-topic-convs2s\checkpoint_best.pt --batch-size 1 --beam 10 --replace-unk --source-lang document --target-lang summary --doctopics doc-topics --encoder-embed-dim 512 > test-output-topic-convs2s-checkpoint-best.pt
Traceback (most recent call last):
File "XSum-Topic-ConvS2S/generate.py", line 164, in <module>
main(args)
File "XSum-Topic-ConvS2S/generate.py", line 88, in main
scorer = bleu.Scorer(dataset.dst_dict.pad(), dataset.dst_dict.eos(), dataset.dst_dict.unk())
File "D:\PycharmFile\XSum\XSum-Topic-ConvS2S\fairseq\bleu.py", line 44, in __init__
self.reset()
File "D:\PycharmFile\XSum\XSum-Topic-ConvS2S\fairseq\bleu.py", line 50, in reset
C.bleu_zero_init(ctypes.byref(self.stat))
File "D:\python3.6.8\lib\ctypes\__init__.py", line 361, in __getattr__
func = self.__getitem__(name)
File "D:\python3.6.8\lib\ctypes\__init__.py", line 366, in __getitem__
func = self._FuncPtr((name_or_ordinal, self))
AttributeError: function 'bleu_zero_init' not found
And the output file "test-output-topic-convs2s-checkpoint-best.pt":
Loading D:\NLP\XSum\XSUM-EMNLP18-topic-convs2s\topic-convs2s-emnlp18\data-topic-convs2s\dict.document-lemma.lda.txt
Done!
Loading D:\NLP\XSum\XSUM-EMNLP18-topic-convs2s\topic-convs2s-emnlp18\data-topic-convs2s\test.document
Done!
Loading D:\NLP\XSum\XSUM-EMNLP18-topic-convs2s\topic-convs2s-emnlp18\data-topic-convs2s\test.summary
Done!
Loading D:\NLP\XSum\XSUM-EMNLP18-topic-convs2s\topic-convs2s-emnlp18\data-topic-convs2s\test.document-lemma
Done!
Loading D:\NLP\XSum\XSUM-EMNLP18-topic-convs2s\topic-convs2s-emnlp18\data-topic-convs2s\test.doc-topics
Done!
| loading model(s) from D:\NLP\XSum\XSUM-EMNLP18-topic-convs2s\topic-convs2s-emnlp18\checkpoints-topic-convs2s\checkpoint_best.pt
| [document] dictionary: 50004 types
| [summary] dictionary: 50004 types
| D:\NLP\XSum\XSUM-EMNLP18-topic-convs2s\topic-convs2s-emnlp18\data-topic-convs2s test 5 examples
RESTBODY in 36369226.summary is
[SN]RESTBODY[SN]
Share this with
Messenger
Messenger
Copy this link
And 34618290.summary is
[SN]RESTBODY[SN]
23 October 2015 Last updated at 18:50 BST
Able Seaman Albert McKenzie took part in the Zeebrugge raid on 23 April 1918.
BBC London reporter Sarah Harris reports.
These are not the only two files affected; many others have the same problem.
Please check and fix this!
Thanks for the wonderful repo! I have a question in the data collection part.
I followed the instructions exactly and found a lot of noisy text in the dataset, such as "Share this with email facebook messenger ..... copy this link", at the head of many documents.
Did you train your model on the dataset with this noise? Are the numbers in the paper based on the noisy text? Please let me know if I am wrong. I just need clarification so we can filter it out for a valid replication.
Thank you again for your amazing work!
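A minimal sketch of the kind of filtering asked about above, assuming the noise always appears as whole lines matching the share-widget text quoted in these reports. The list of lines and the helper name are hypothetical, and whether the paper's numbers were computed with or without such filtering is not stated here.

```python
# Assumption: these lines are share-widget residue, taken from the examples
# quoted in the reports above; the real list may be longer.
SHARE_WIDGET_LINES = {
    "share this with",
    "email",
    "facebook",
    "messenger",
    "copy this link",
}


def strip_share_widget(lines):
    """Drop lines that exactly match known share-widget text (case-insensitive)."""
    return [line for line in lines if line.strip().lower() not in SHARE_WIDGET_LINES]


sample = [
    "Share this with",
    "Messenger",
    "Messenger",
    "Copy this link",
    "Able Seaman Albert McKenzie took part in the Zeebrugge raid on 23 April 1918.",
]
cleaned = strip_share_widget(sample)
print(cleaned)
```

Exact whole-line matching is deliberately conservative: a sentence that merely mentions Facebook is kept, at the cost of missing variants of the widget text.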
Hi! I followed the instructions for the ConvS2S model completely, i.e. data preprocessing, model training, generation, and extraction of the final hypotheses. I'm wondering why the output file is in .pt format (final-test-output-convs2s-checkpoint-best.pt).
Isn't it supposed to be a .txt file, since it contains the generated summaries?
And, when I open the generated final-test-output-convs2s-checkpoint-best.pt file, it looks like:
This is a joke of a repo. There is no code and no pretrained model. I guess the research must be the same.
I followed the instructions and ran the pretrained models, but where can I see the results? All the document and summary files in data-convs2s remain unchanged.
If anyone has succeeded in running this, please help me.
This is the notebook link (see the section titled EdinburghNLP/XSum):
https://colab.research.google.com/drive/10XEKdP2UdoiwaoZgWfH0fydqrz7rhrJa?usp=sharing
Hi,
firstly, thanks very much for the great work.
The issue I raise here is about the PyTorch version that makes XSum work. When I installed my PyTorch in my conda environment using "pip install torch", it installed the latest torch 1.0.1.post2. This specific torch version seemed to cause an AttributeError, which I believe is a compatibility issue. In contrast, torch 0.4.0 or 0.4.1 works, and no compatibility issue exists.
Therefore, does "Our code requires PyTorch version >= 0.4.0" actually mean PyTorch version 0.4.0 or 0.4.1?
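A minimal sketch of a guard based only on the versions named in this report (0.4.0 and 0.4.1 worked, 1.0.1.post2 did not); other versions are untested here, and the helper names are hypothetical. In real use the string would come from `torch.__version__`.

```python
import re


def major_minor(version):
    """Return (major, minor) from a version string such as '1.0.1.post2'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognised version string: %r" % version)
    return int(match.group(1)), int(match.group(2))


def is_reported_working(version):
    """True only for the 0.4.x series, which the report above found to work."""
    return major_minor(version) == (0, 4)


print(is_reported_working("0.4.1"))        # True
print(is_reported_working("1.0.1.post2"))  # False
```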
Hi, thanks for your work.
When I ran "python download-bbc-articles.py [--timestamp_exactness 14]", I encountered this error:
usage: download-bbc-articles.py [-h]
[--request_parallelism REQUEST_PARALLELISM]
[--context_token_limit CONTEXT_TOKEN_LIMIT]
[--timestamp_exactness TIMESTAMP_EXACTNESS]
download-bbc-articles.py: error: unrecognized arguments: [--timestamp_exactness 14]
Could you help me?
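In usage messages like the one above, square brackets conventionally mark an argument as optional; they are not meant to be typed. A minimal sketch reproducing the behaviour with `argparse`: the option names are copied from the usage text, while the `int` types and default values are assumptions made for illustration.

```python
import argparse


def build_parser():
    # Option names come from the usage message above; the types and
    # defaults are assumptions made for this sketch.
    parser = argparse.ArgumentParser(prog="download-bbc-articles.py")
    parser.add_argument("--request_parallelism", type=int, default=200)
    parser.add_argument("--context_token_limit", type=int, default=2000)
    parser.add_argument("--timestamp_exactness", type=int, default=14)
    return parser


# Accepted: the option and its value are passed without brackets.
args = build_parser().parse_args(["--timestamp_exactness", "14"])
print(args.timestamp_exactness)

# Rejected: literal brackets become part of the tokens, so argparse exits.
try:
    build_parser().parse_args(["[--timestamp_exactness", "14]"])
    rejected = False
except SystemExit:
    rejected = True
print(rejected)
```

On the command line this corresponds to running `python download-bbc-articles.py --timestamp_exactness 14`, without the brackets.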
Thank you very much for this great work.
I would like to ask how I can train a new model on a new dataset.
I just want to know what format the data should be in so that the model can be trained on it.
Thanks in advance.
Hi,
Thanks for your excellent work.
I downloaded the data and processed it as described in the instructions (a few of the articles could not be downloaded, so I modified the data processing to skip them, which I guess should not be a big issue). Then I trained a new model and everything looked good. However, when I run the generation script, it gives me the following errors:
The command:
CUDA_VISIBLE_DEVICES=1 python XSum-ConvS2S/generate.py ./data-convs2s --path ./checkpoints-convs2s/checkpoint-best.pt --batch-size 1 --beam 10 --replace-unk --source-lang document --target-lang summary > test-output-convs2s-checkpoint-best.pt
Output:
0%| | 0/11334 [00:00<?, ?it/s]
/home/rasmlnlp/zhiyu/anaconda3/envs/tf1.12/lib/python3.5/site-packages/torch/autograd/function.py:41: UserWarning: mark_shared_storage is deprecated. Tensors with shared storages are automatically tracked. Note that calls to set_() are not tracked
'mark_shared_storage is deprecated. '
Traceback (most recent call last):
File "XSum-ConvS2S/generate.py", line 161, in <module>
main(args)
File "XSum-ConvS2S/generate.py", line 96, in main
for sample_id, src_tokens, target_tokens, hypos in translations:
File "/home/rasmlnlp/zhiyu/XSum/XSum-ConvS2S/fairseq/sequence_generator.py", line 77, in generate_batched_itr
prefix_tokens=s['target'][:, :prefix_size] if prefix_size > 0 else None,
File "/home/rasmlnlp/zhiyu/XSum/XSum-ConvS2S/fairseq/sequence_generator.py", line 90, in generate
return self._generate(src_tokens, src_lengths, beam_size, maxlen, prefix_tokens)
File "/home/rasmlnlp/zhiyu/XSum/XSum-ConvS2S/fairseq/sequence_generator.py", line 250, in _generate
tokens[:, :step+1], encoder_outs, incremental_states)
File "/home/rasmlnlp/zhiyu/XSum/XSum-ConvS2S/fairseq/sequence_generator.py", line 413, in _decode
decoder_out, attn = model.decoder(tokens, encoder_out, incremental_states[model])
File "/home/rasmlnlp/zhiyu/anaconda3/envs/tf1.12/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/home/rasmlnlp/zhiyu/XSum/XSum-ConvS2S/fairseq/models/fconv.py", line 266, in forward
x, attn_scores = attention(x, target_embedding, (encoder_a, encoder_b))
File "/home/rasmlnlp/zhiyu/anaconda3/envs/tf1.12/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/home/rasmlnlp/zhiyu/XSum/XSum-ConvS2S/fairseq/models/fconv.py", line 160, in forward
x = self.bmm(x, encoder_out[0])
File "/home/rasmlnlp/zhiyu/anaconda3/envs/tf1.12/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/home/rasmlnlp/zhiyu/XSum/XSum-ConvS2S/fairseq/modules/beamable_mm.py", line 34, in forward
input1 = input1[:, 0, :].unfold(0, beam, beam).transpose(2, 1)
RuntimeError: invalid argument 3: out of range at /opt/conda/conda-bld/pytorch_1535490206202/work/aten/src/THC/generic/THCTensor.cpp:444
Is there a bug here? Or is there another way to reproduce the results in the paper? (Training only reports the loss and perplexity on the validation set, not ROUGE scores.)
Thank you.
Hi,
You've done some great work here! Is there a way to run your pretrained models on another dataset? I tried just replacing the train.document and train.summary files with other data, but the results in final-test-output-convs2s-checkpoint-best.pt were totally unrelated and repetitive. It seems the model is still trying to map my custom inputs to previously seen titles.
Here's what I did:
I was not sure which file the test data is read from, so I replaced train.document, test.document, valid.document and validation.document with the texts (the same in each), and train.summary, test.summary, valid.summary and validation.summary with the titles (the same in each). I copied the dict.document.txt and dict.summary.txt files from your original tar.
Then I ran
cd XSum-ConvS2S
python generate.py ./convs2s-emnlp18/data-convs2s --path ./convs2s-emnlp18/checkpoints-convs2s/checkpoint-best.pt --batch-size 1 --beam 10 --replace-unk --source-lang document --target-lang summary > test-output-convs2s-checkpoint-best.pt
cd ..
python scripts/extract-hypothesis-fairseq.py -o XSum-ConvS2S/test-output-convs2s-checkpoint-best.pt -f final-test-output-convs2s-checkpoint-best.pt
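The file written by the shell redirect above is plain text regardless of its .pt suffix; the name is only what was given to the redirect. A minimal sketch of how hypotheses might be pulled out of such a log, assuming the fairseq convention of tab-separated `H-<id>`, score and text fields. This is an illustration with made-up sample lines, not the contents of extract-hypothesis-fairseq.py.

```python
def extract_hypotheses(log_lines):
    """Return hypothesis texts ordered by sample id from generate.py-style output."""
    hypotheses = {}
    for line in log_lines:
        if not line.startswith("H-"):
            continue  # skip source lines, target lines and status messages
        tag, _score, text = line.rstrip("\n").split("\t", 2)
        hypotheses[int(tag[2:])] = text
    return [hypotheses[key] for key in sorted(hypotheses)]


# Made-up log lines in the assumed format.
sample_log = [
    "| [document] dictionary: 50004 types\n",
    "S-1\tsource text of the second example\n",
    "H-1\t-0.52\tsecond summary\n",
    "S-0\tsource text of the first example\n",
    "H-0\t-0.31\tfirst summary\n",
]
summaries = extract_hypotheses(sample_log)
print(summaries)
```

Sorting by sample id matters because batched generation does not necessarily emit examples in file order.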
Thank you for sharing your repo! I am running your code on CNN/Daily Mail and hit an issue similar to the one you encountered in facebookresearch/fairseq#118. Could you let me know how you resolved it in the end?
(2.7.18/envs/venv) mittu1008@katfuji-7:~/XSum/XSum-Dataset$ python -m scripts/parse-bbc-html-data
226711
Traceback (most recent call last):
File "/home/mittu1008/.pyenv/versions/2.7.18/lib/python2.7/runpy.py", line 174, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/home/mittu1008/.pyenv/versions/2.7.18/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/home/mittu1008/XSum/XSum-Dataset/scripts/parse-bbc-html-data.py", line 198, in <module>
story_title, story_introduction, story_restcontent = GenerateMapper((webarxivid, "bbc", htmldata))
File "/home/mittu1008/XSum/XSum-Dataset/scripts/parse-bbc-html-data.py", line 143, in GenerateMapper
story_title = ParseHtml(raw_story, corpus).getstory_title()
File "/home/mittu1008/XSum/XSum-Dataset/scripts/parse-bbc-html-data.py", line 44, in __init__
self.parser = html.HTMLParser(encoding=chardet.detect(self.story.html)['encoding'])
File "/home/mittu1008/.pyenv/versions/2.7.18/envs/venv/lib/python2.7/site-packages/lxml/html/__init__.py", line 1910, in __init__
super(HTMLParser, self).__init__(**kwargs)
File "src/lxml/parser.pxi", line 1728, in lxml.etree.HTMLParser.__init__
File "src/lxml/parser.pxi", line 840, in lxml.etree._BaseParser.__init__
LookupError: unknown encoding: 'IBM852'
What should I do about this error?
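The traceback shows the HTML parser rejecting the encoding name reported by chardet. Python's own codec registry does accept IBM852 (as an alias of cp852), so one possible workaround is to decode the bytes in Python first and fall back to a default codec when decoding fails. A minimal sketch under those assumptions: `decode_html` is a hypothetical helper, the code is Python 3 whereas the report uses Python 2.7, and it has not been tested against the repository's parsing script.

```python
def decode_html(raw_bytes, detected_encoding, fallback="utf-8"):
    """Decode raw HTML bytes, falling back when the detected codec is unusable."""
    try:
        return raw_bytes.decode(detected_encoding or fallback)
    except (LookupError, UnicodeDecodeError):
        # Unknown codec name or undecodable bytes: keep going with replacements.
        return raw_bytes.decode(fallback, errors="replace")


raw = "łódź".encode("cp852")
text_ok = decode_html(raw, "IBM852")
text_fallback = decode_html(b"plain ascii", "no-such-codec")
print(text_ok, text_fallback)
```

The decoded text could then be handed to the parser in place of raw bytes plus an encoding name, although whether the rest of the script accepts text input is an open question.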
Hi,
thanks for providing the dataset as a download. I downloaded the dataset from the location mentioned in #12 (comment).
However, the format of the dataset appears to differ from the files you get if you download the data yourself.
See this gist: I downloaded the first file, 12092740.data, myself from archive.org, while the second file was part of the downloaded dataset.
As you can see, the file I downloaded myself contains the attributes [XSUM]URL[XSUM], [XSUM]INTRODUCTION[XSUM] and [XSUM]RESTBODY[XSUM], but the file from the dataset has [SN]URL[SN], [SN]TITLE[SN], [SN]FIRST-SENTENCE[SN] and [SN]RESTBODY[SN].
My problem is that if I follow the tutorial at https://github.com/EdinburghNLP/XSum/tree/master/XSum-Dataset the scripts don't work with the unmodified files.
Which changes do I need to make to the scripts?
Best,
Pyfisch
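A minimal sketch of a reader that accepts both marker styles described above, assuming each marker sits on its own line and the section text follows it. The function name and sample strings are hypothetical, and treating INTRODUCTION and FIRST-SENTENCE as the same field is an assumption, not something the authors have confirmed.

```python
import re

# A line consisting solely of a marker such as [SN]URL[SN] or [XSUM]URL[XSUM];
# the backreference requires the closing tag to match the opening one.
MARKER = re.compile(r"^\[(SN|XSUM)\](?P<name>[A-Z-]+)\[\1\]$")


def split_sections(text):
    """Map each section name to the non-empty text lines that follow its marker."""
    sections, current = {}, None
    for line in text.splitlines():
        stripped = line.strip()
        match = MARKER.match(stripped)
        if match:
            current = match.group("name")
            sections[current] = []
        elif current is not None and stripped:
            sections[current].append(stripped)
    return {name: "\n".join(lines) for name, lines in sections.items()}


released = (
    "[SN]URL[SN]\nhttp://example.org/1\n\n"
    "[SN]TITLE[SN]\nA title\n\n"
    "[SN]FIRST-SENTENCE[SN]\nThe summary.\n\n"
    "[SN]RESTBODY[SN]\nBody line one.\nBody line two.\n"
)
crawled = (
    "[XSUM]URL[XSUM]\nhttp://example.org/1\n\n"
    "[XSUM]INTRODUCTION[XSUM]\nThe summary.\n\n"
    "[XSUM]RESTBODY[XSUM]\nBody line one.\n"
)

print(split_sections(released)["FIRST-SENTENCE"])
print(split_sections(crawled)["INTRODUCTION"])
```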
thank
... very interesting paper (http://aclweb.org/anthology/D18-1206 , with included examples), but I tried the online demo, http://cohort.inf.ed.ac.uk/xsum.html , with rather appalling results.
The textual sources for my tests were the abstract for one of my published papers, and the Google infobox for Nova Scotia.
Does the code for the online demo faithfully replicate the code in the paper?
Hello,
Thank you very much for sharing this work and dataset.
Currently, I am working on abstractive summarization and I wish to evaluate my model on XSum dataset.
While I was processing the dataset, I was wondering if the [SN]title[SN] is used or not.
It appears in here that the dataset uses :
May I request your confirmation of the above information?
(I am using the dataset downloaded from http://kinloch.inf.ed.ac.uk/public/XSUM-EMNLP18-Summary-Data-Original.tar.gz )
Thank you very much and I look forward to hearing your reply,
Best,
Wonjin
I saw the dataset is available at http://kinloch.inf.ed.ac.uk/public/direct/XSUM-EMNLP18-Summary-Data-Original.tar.gz (link given by @shashiongithub )
I tried to use this dataset to reproduce the BART results, but I couldn't. According to one of the BART authors in this issue (fairseq repository), this is because I'm using an already processed version of the dataset.
Is it possible to have a link to the raw dataset (no postprocess of any kind) ?
Hello, this is very good work!
Did you or anyone else train the model on CNN/DM or Gigaword and get results?
I look forward to your reply!