XSum's Issues

Mail address not found!

We just sent a mail to the listed address [email protected] to request the dataset for our university project, but it bounced with a "Diagnostic-Code: smtp; 550 5.1.1 <[email protected]>... User unknown" failure mail.

Maybe the address has changed, or there is a typo? If you could give us an address to contact, we would be happy to send our mail there!

Model training causes AttributeError: 'Tensor' object has no attribute 'conv_tbc'

File "C:\Users\Gangolian's PC\Desktop\Text Summary\WikiHow-Dataset-master\Topic-ConvS2S\fairseq\models\fconv.py", line 154, in forward x = conv(x) File "C:\ProgramData\Anaconda3\envs\tensorflow\lib\site-packages\torch\nn\modules\module.py", line 489, in call result = self.forward(*input, **kwargs) File "C:\Users\Gangolian's PC\Desktop\Text Summary\WikiHow-Dataset-master\Topic-ConvS2S\fairseq\modules\conv_tbc.py", line 32, in forward return input.contiguous().conv_tbc(self.weight, self.bias, self.padding[0]) AttributeError: 'Tensor' object has no attribute 'conv_tbc'

CUDA_VISIBLE_DEVICES=1 python XSum-Topic-ConvS2S/train.py ./data-topic-convs2s --source-lang document --target-lang summary --doctopics doc-topics --max-sentences 32 --arch fconv --criterion label_smoothed_cross_entropy --max-epoch 200 --clip-norm 0.1 --lr 0.10 --dropout 0.2 --save-dir ./checkpoints-topic-convs2s --no-progress-bar --log-interval 10

torch version = 1.0.0
python version = 3.5.6
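
For what it's worth, the Tensor.conv_tbc method was removed around PyTorch 1.0, which matches the torch version reported above; the function form torch.conv_tbc still exists. A hedged sketch of the usual workaround, as an untested edit to the forward method in fairseq/modules/conv_tbc.py, is:

    import torch

    def forward(self, input):
        # PyTorch 1.0 dropped the Tensor.conv_tbc method; call the torch-level
        # function with the tensor as the first argument instead
        return torch.conv_tbc(input.contiguous(), self.weight, self.bias, self.padding[0])

Alternatively, downgrading to a 0.4.x torch build sidesteps the API change entirely.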

The pre-trained LDA model is not working

Loading LDA model from ./lda-train-document-lemma-topic-512-iter-1000...
Start decoding in ./xsum-preprocessed/document-lemma-topic-512-iter-1000...
Traceback (most recent call last):
  File "scripts/lda-gensim-decoding-document-lemma.py", line 62, in <module>
    bow = lda.id2word.doc2bow(doc)
AttributeError: 'NoneType' object has no attribute 'doc2bow'
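
The traceback says the loaded model's id2word attribute is None, i.e. the model file was saved without its dictionary. A hedged workaround (the dictionary path below is illustrative, assuming the dictionary was saved separately during training) is to load the gensim Dictionary yourself and attach it:

    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    lda = LdaModel.load("./lda-train-document-lemma-topic-512-iter-1000")
    if lda.id2word is None:
        # attach the dictionary the model was trained with (illustrative path)
        lda.id2word = Dictionary.load("./lda-train-document-lemma.dict")
    bow = lda.id2word.doc2bow("an example lemmatised document".split())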

Dataset

Thanks for your excellent work.

Would you mind providing the XSum dataset directly, just like the CNN/Daily Mail dataset we are all familiar with? I believe it would save time and be more convenient for experiments.

I would appreciate any help you could give. Thanks!

Some issue with generation

Hey,

I was trying to run Topic-ConvS2S's generate.py on only the test data. I preprocessed just the test data, following the guidelines in the XSum-Dataset README, with the pretrained LDA model you provided there. However, when I run generation I get the following error.

Here is the command line I used:

CUDA_VISIBLE_DEVICES=1 python XSum-Topic-ConvS2S/generate.py ./data-topic-convs2s \
> --path ./topic-convs2s-emnlp18/checkpoints-topic-convs2s/checkpoint_best.pt \
> --batch-size 1 \
> --beam 10 \
> --replace-unk \
> --source-lang document \
> --target-lang summary \
> --doctopics doc-topics \
> --encoder-embed-dim 512 > test-output-topic-convs2s-checkpoint-best.pt

And here is the error message:

Traceback (most recent call last):
  File "XSum-Topic-ConvS2S/generate.py", line 166, in <module>
    main(args)
  File "XSum-Topic-ConvS2S/generate.py", line 43, in main
    models, _ = utils.load_ensemble_for_inference(args.path, dataset.src_dict, dataset.dst_dict)
  File "XSum-Topic-ConvS2S/fairseq/utils.py", line 146, in load_ensemble_for_inference
    model.load_state_dict(state['model'])
  File "XSum-Topic-ConvS2S/fairseq/models/fairseq_model.py", line 69, in load_state_dict
    super().load_state_dict(state_dict, strict)
  File "../anaconda3/envs/thisenv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 847, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for FConvModel:
	size mismatch for encoder.embed_tokens.weight: copying a param with shape torch.Size([50004, 512]) from checkpoint, the shape in current model is torch.Size([4, 512]).
	size mismatch for decoder.embed_tokens.weight: copying a param with shape torch.Size([50004, 512]) from checkpoint, the shape in current model is torch.Size([4, 512]).
	size mismatch for decoder.fc3.bias: copying a param with shape torch.Size([50004]) from checkpoint, the shape in current model is torch.Size([4]).
	size mismatch for decoder.fc3.weight_g: copying a param with shape torch.Size([50004, 1]) from checkpoint, the shape in current model is torch.Size([4, 1]).
	size mismatch for decoder.fc3.weight_v: copying a param with shape torch.Size([50004, 256]) from checkpoint, the shape in current model is torch.Size([4, 256]).

Do you know what could be causing this error? Thanks in advance.
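
A dictionary size of 4 is just the special symbols, which suggests generate.py built the model against empty or missing dict files rather than the 50004-type dictionaries the checkpoint was trained with. A hedged way to confirm this, using the paths from the command above, is:

    import torch

    # checkpoint side: embedding rows = training-time vocabulary size
    state = torch.load("./topic-convs2s-emnlp18/checkpoints-topic-convs2s/checkpoint_best.pt",
                       map_location="cpu")
    print(state["model"]["encoder.embed_tokens.weight"].shape)  # expect [50004, 512]

    # data side: the dict files should have roughly 50000 entries; adjust the
    # file name to whichever dict files your generation log reports loading
    print(sum(1 for _ in open("./data-topic-convs2s/dict.document.txt")))

If the dict files are tiny or absent, copying the ones shipped with the pretrained model into the data directory should resolve the mismatch.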

About collecting data

I want to know if I can crawl BBC articles as a dataset at will. Won't there be a copyright issue?

Error in downloading data

~/XSum/XSum-Dataset$ python scripts/download-bbc-articles.py
Creating the download directory.
Downloading URLs from the XSum-WebArxiveUrls.txt file:
urls left to download: 226711
5.19% [11765/226711]
Traceback (most recent call last):
  File "scripts/download-bbc-articles.py", line 253, in <module>
    main()
  File "scripts/download-bbc-articles.py", line 248, in main
    DownloadMode(urls_file_to_download, missing_urls_file, downloads_dir, args.request_parallelism, args.timestamp_exactness)
  File "scripts/download-bbc-articles.py", line 198, in DownloadMode
    for url, story_html in results:
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 668, in next
    raise value
requests.exceptions.TooManyRedirects: Exceeded 30 redirects.

I know I can probably fix this issue myself, but I want to make sure my copy of the dataset is exactly the same as the one used in the paper. Could you upload the dataset to Dropbox or Google Cloud, or fix this issue? Thanks.
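
For anyone else blocked here before an official mirror appears, a hedged sketch of a per-URL guard (the function and names are illustrative, not the script's actual ones) that lets the crawl continue past redirect loops:

    import requests

    def fetch(url, timeout=30):
        # a few archived BBC URLs redirect in a loop; log and skip them
        # instead of letting one URL abort the whole download
        try:
            return requests.get(url, timeout=timeout).text
        except requests.exceptions.TooManyRedirects:
            print("skipping (redirect loop): " + url)
            return None

Skipped URLs can be retried later or fetched by hand, though this does mean a local copy may differ slightly from the paper's.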

download the dataset

Thanks for your excellent work!

When I run download-bbc-articles.py, it fails with the error shown in the attached screenshot (image not preserved here).

I want to know why; thanks for your help!

AttributeError: function 'bleu_zero_init' not found

python XSum-Topic-ConvS2S/generate.py D:\NLP\XSum\XSUM-EMNLP18-topic-convs2s\topic-convs2s-emnlp18\data-topic-convs2s --path D:\NLP\XSum\XSUM-EMNLP18-topic-convs2s\topic-convs2s-emnlp18\checkpoints-topic-convs2s\checkpoint_best.pt --batch-size 1 --beam 10 --replace-unk --source-lang document --target-lang summary --doctopics doc-topics --encoder-embed-dim 512 > test-output-topic-convs2s-checkpoint-best.pt
Traceback (most recent call last):
  File "XSum-Topic-ConvS2S/generate.py", line 164, in <module>
    main(args)
  File "XSum-Topic-ConvS2S/generate.py", line 88, in main
    scorer = bleu.Scorer(dataset.dst_dict.pad(), dataset.dst_dict.eos(), dataset.dst_dict.unk())
  File "D:\PycharmFile\XSum\XSum-Topic-ConvS2S\fairseq\bleu.py", line 44, in __init__
    self.reset()
  File "D:\PycharmFile\XSum\XSum-Topic-ConvS2S\fairseq\bleu.py", line 50, in reset
    C.bleu_zero_init(ctypes.byref(self.stat))
  File "D:\python3.6.8\lib\ctypes\__init__.py", line 361, in __getattr__
    func = self.__getitem__(name)
  File "D:\python3.6.8\lib\ctypes\__init__.py", line 366, in __getitem__
    func = self._FuncPtr((name_or_ordinal, self))
AttributeError: function 'bleu_zero_init' not found

And the output file "test-output-topic-convs2s-checkpoint-best.pt":

Loading D:\NLP\XSum\XSUM-EMNLP18-topic-convs2s\topic-convs2s-emnlp18\data-topic-convs2s\dict.document-lemma.lda.txt
Done!
Loading D:\NLP\XSum\XSUM-EMNLP18-topic-convs2s\topic-convs2s-emnlp18\data-topic-convs2s\test.document
Done!
Loading D:\NLP\XSum\XSUM-EMNLP18-topic-convs2s\topic-convs2s-emnlp18\data-topic-convs2s\test.summary
Done!
Loading D:\NLP\XSum\XSUM-EMNLP18-topic-convs2s\topic-convs2s-emnlp18\data-topic-convs2s\test.document-lemma
Done!
Loading D:\NLP\XSum\XSUM-EMNLP18-topic-convs2s\topic-convs2s-emnlp18\data-topic-convs2s\test.doc-topics
Done!
| loading model(s) from D:\NLP\XSum\XSUM-EMNLP18-topic-convs2s\topic-convs2s-emnlp18\checkpoints-topic-convs2s\checkpoint_best.pt
| [document] dictionary: 50004 types
| [summary] dictionary: 50004 types
| D:\NLP\XSum\XSUM-EMNLP18-topic-convs2s\topic-convs2s-emnlp18\data-topic-convs2s test 5 examples
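
For what it's worth, ctypes failing to find bleu_zero_init usually means fairseq's libbleu C extension was never built, or did not export its symbols (a common problem on Windows). Assuming this fork builds the same way upstream fairseq-py did, compiling the extensions before running generation may help:

python setup.py build
python setup.py develop

Since BLEU is not the metric of interest for summarization anyway, another untested option is to comment out the BLEU-scoring block in generate.py and evaluate the extracted hypotheses with ROUGE afterwards.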

Corrupted content in much of this dataset

The RESTBODY in 36369226.summary is:

[SN]RESTBODY[SN]
Share this with
Email
Facebook
Messenger
Messenger
Twitter
Pinterest
WhatsApp
LinkedIn
Copy this link

And in 34618290.summary it is:

[SN]RESTBODY[SN]
23 October 2015 Last updated at 18:50 BST
Able Seaman Albert McKenzie took part in the Zeebrugge raid on 23 April 1918.
BBC London reporter Sarah Harris reports.

It is not just these two files; many files have this problem.
Please check and fix it!

noise in the dataset

Thanks for the wonderful repo! I have a question in the data collection part.

I followed the instructions exactly and found that there are many noisy texts in the dataset, such as "Share this with Email Facebook Messenger ... Copy this link" at the head of many documents.

Did you train your model on the dataset with this noise included? Were the numbers in the paper obtained on the noisy text? Please let me know if I am wrong; I just need clarification so we can filter the noise out for a valid replication.

Thank you again for your amazing work!
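
For anyone filtering this boilerplate before training, a minimal sketch; the phrase list is taken from the examples in the previous issue and is surely incomplete:

    # social-sharing boilerplate observed at the head of affected documents
    BOILERPLATE = {
        "Share this with", "Email", "Facebook", "Messenger", "Twitter",
        "Pinterest", "WhatsApp", "LinkedIn", "Copy this link",
    }

    def strip_share_block(lines):
        return [line for line in lines if line.strip() not in BOILERPLATE]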

Bad Output

Hi! I completely followed the instructions for the ConvS2S model, i.e. data preprocessing, model training, generation, and extracting the final hypotheses. I'm wondering why the output file is in .pt format (final-test-output-convs2s-checkpoint-best.pt). Isn't it supposed to be a .txt file, since it contains generated summaries?
And when I open the generated final-test-output-convs2s-checkpoint-best.pt file, it looks like the attached screenshot (image not preserved here).

Joke of a repo

This is a joke of a repo. There is no code and no pretrained model. I guess the research must be the same.

PyTorch version compatibility

Hi,

firstly, thanks very much for the great work.

The issue I raise here is about the PyTorch version that makes XSum work. When I installed my PyTorch in my conda environment using "pip install torch", it installed the latest torch 1.0.1.post2. This specific torch version seemed to cause an AttributeError, which I believe is a compatibility issue. In contrast, torch 0.4.0 or 0.4.1 works, and no compatibility issue exists.

Therefore, does "Our code requires PyTorch version >= 0.4.0" actually mean PyTorch version 0.4.0 or 0.4.1?
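
Given the conv_tbc report above, the same reading applies: the vendored fairseq predates the PyTorch 1.0 API changes, so in practice the requirement is the 0.4.x line (e.g. pip install torch==0.4.1) rather than anything >= 0.4.0. A quick guard for an environment, as a sketch:

    import torch

    # the vendored fairseq appears to target the 0.4.x API; 1.0+ removed
    # pieces it relies on (e.g. Tensor.conv_tbc)
    assert torch.__version__.startswith("0.4"), torch.__version__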

download-bbc-articles.py: error: unrecognized arguments: [--timestamp_exactness 14]

Hi, thanks for your work.
When I ran "python download-bbc-articles.py [--timestamp_exactness 14]", I encountered this error:

usage: download-bbc-articles.py [-h]
                                [--request_parallelism REQUEST_PARALLELISM]
                                [--context_token_limit CONTEXT_TOKEN_LIMIT]
                                [--timestamp_exactness TIMESTAMP_EXACTNESS]
download-bbc-articles.py: error: unrecognized arguments: [--timestamp_exactness 14]
Could you help me?
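
The square brackets here are argparse's usage notation for an optional argument, not literal characters to type. Dropping them should work:

python scripts/download-bbc-articles.py --timestamp_exactness 14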

How to train a new model using my data

Thank you very much for this great work.
I just need to ask how I can train a new model on a new dataset.
I want to know what form the data should take so the model can be trained on it.
Thanks in advance.
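
Not an official answer, but judging from the preprocessing scripts, the ConvS2S pipeline consumes line-aligned parallel text files: line i of *.document is the (tokenized, single-line) article and line i of *.summary is its one-line summary. A minimal sketch of producing such a pair (tokenization is left out here):

    documents = ["first article , already tokenized ...", "second article ..."]
    summaries = ["first one-line summary", "second one-line summary"]

    # line i of train.document must pair with line i of train.summary
    with open("train.document", "w") as fd, open("train.summary", "w") as fs:
        for doc, summ in zip(documents, summaries):
            fd.write(doc.replace("\n", " ").strip() + "\n")
            fs.write(summ.replace("\n", " ").strip() + "\n")

The same layout is needed for the validation and test splits before running preprocess.py.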

Some bugs in the generation phase

Hi,

Thanks for your excellent work.

I downloaded the data and processed it as in the instructions (a few articles could not be downloaded, and I modified the data-processing step to skip them, which I guess should not be a big issue). Then I trained a new model and everything looked good. However, when I run the generation script, it gives me the following error:

The command:
CUDA_VISIBLE_DEVICES=1 python XSum-ConvS2S/generate.py ./data-convs2s --path ./checkpoints-convs2s/checkpoint-best.pt --batch-size 1 --beam 10 --replace-unk --source-lang document --target-lang summary > test-output-convs2s-checkpoint-best.pt

Output:
0%| | 0/11334 [00:00<?, ?it/s]
/home/rasmlnlp/zhiyu/anaconda3/envs/tf1.12/lib/python3.5/site-packages/torch/autograd/function.py:41: UserWarning: mark_shared_storage is deprecated. Tensors with shared storages are automatically tracked. Note that calls to set_() are not tracked
  'mark_shared_storage is deprecated. '
Traceback (most recent call last):
  File "XSum-ConvS2S/generate.py", line 161, in <module>
    main(args)
  File "XSum-ConvS2S/generate.py", line 96, in main
    for sample_id, src_tokens, target_tokens, hypos in translations:
  File "/home/rasmlnlp/zhiyu/XSum/XSum-ConvS2S/fairseq/sequence_generator.py", line 77, in generate_batched_itr
    prefix_tokens=s['target'][:, :prefix_size] if prefix_size > 0 else None,
  File "/home/rasmlnlp/zhiyu/XSum/XSum-ConvS2S/fairseq/sequence_generator.py", line 90, in generate
    return self._generate(src_tokens, src_lengths, beam_size, maxlen, prefix_tokens)
  File "/home/rasmlnlp/zhiyu/XSum/XSum-ConvS2S/fairseq/sequence_generator.py", line 250, in _generate
    tokens[:, :step+1], encoder_outs, incremental_states)
  File "/home/rasmlnlp/zhiyu/XSum/XSum-ConvS2S/fairseq/sequence_generator.py", line 413, in _decode
    decoder_out, attn = model.decoder(tokens, encoder_out, incremental_states[model])
  File "/home/rasmlnlp/zhiyu/anaconda3/envs/tf1.12/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/rasmlnlp/zhiyu/XSum/XSum-ConvS2S/fairseq/models/fconv.py", line 266, in forward
    x, attn_scores = attention(x, target_embedding, (encoder_a, encoder_b))
  File "/home/rasmlnlp/zhiyu/anaconda3/envs/tf1.12/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/rasmlnlp/zhiyu/XSum/XSum-ConvS2S/fairseq/models/fconv.py", line 160, in forward
    x = self.bmm(x, encoder_out[0])
  File "/home/rasmlnlp/zhiyu/anaconda3/envs/tf1.12/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/rasmlnlp/zhiyu/XSum/XSum-ConvS2S/fairseq/modules/beamable_mm.py", line 34, in forward
    input1 = input1[:, 0, :].unfold(0, beam, beam).transpose(2, 1)
RuntimeError: invalid argument 3: out of range at /opt/conda/conda-bld/pytorch_1535490206202/work/aten/src/THC/generic/THCTensor.cpp:444

Is something buggy here? Or is there another way to reproduce the results in the paper? (Training only reports loss/perplexity on the validation set, not ROUGE scores.)

Thank you.
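
One hedged observation: the mark_shared_storage deprecation warning at the top of the log means this torch build is newer than the one the vendored fairseq was written against, which matches the version-compatibility issue reported above; ruling that out first is cheap:

    import torch

    # if this prints anything newer than the 0.4.x line the repo targets,
    # retrying with torch 0.4.0/0.4.1 is a reasonable elimination step
    print(torch.__version__)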

Running pretrained models on Other datasets

Hi,
You've done some great work here! Is there a way to run your pretrained models on another dataset? I tried just replacing the train.document and train.summary files with other data, but the final-test-output-convs2s-checkpoint-best.pt results were totally unrelated and repetitive. It seems it is still trying to map my custom inputs to previously seen titles?

Here's what I did:
I was not sure which file the test data is read from, so I replaced train.document, test.document, valid.document, and validation.document all with the texts (the same in each), and train.summary, test.summary, valid.summary, and validation.summary with the titles (the same in each). I copied the dict.document.txt and dict.summary.txt from your original tar.

Then I ran

cd XSum-ConvS2S
python generate.py ./convs2s-emnlp18/data-convs2s --path ./convs2s-emnlp18/checkpoints-convs2s/checkpoint-best.pt --batch-size 1 --beam 10 --replace-unk --source-lang document --target-lang summary > test-output-convs2s-checkpoint-best.pt
cd ..
python scripts/extract-hypothesis-fairseq.py -o XSum-ConvS2S/test-output-convs2s-checkpoint-best.pt -f final-test-output-convs2s-checkpoint-best.pt
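
A hedged guess at the cause: generation reads the preprocessed data directory, so swapping the raw .document/.summary text files has no effect until they are re-preprocessed, and any new preprocessing must reuse the pretrained model's dictionaries (otherwise token IDs no longer line up, which would explain the unrelated, repetitive outputs). Assuming the vendored preprocess.py keeps upstream fairseq-py's --srcdict/--tgtdict options (unverified for this fork), something like:

python XSum-ConvS2S/preprocess.py --source-lang document --target-lang summary --trainpref ./my-data/train --validpref ./my-data/validation --testpref ./my-data/test --destdir ./convs2s-emnlp18/data-convs2s --srcdict dict.document.txt --tgtdict dict.summary.txt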

Error in scripts/parse-bbc-html-data.py

(2.7.18/envs/venv) mittu1008@katfuji-7:~/XSum/XSum-Dataset$ python -m scripts/parse-bbc-html-data
226711
Traceback (most recent call last):
  File "/home/mittu1008/.pyenv/versions/2.7.18/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/home/mittu1008/.pyenv/versions/2.7.18/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/home/mittu1008/XSum/XSum-Dataset/scripts/parse-bbc-html-data.py", line 198, in <module>
    story_title, story_introduction, story_restcontent = GenerateMapper((webarxivid, "bbc", htmldata))
  File "/home/mittu1008/XSum/XSum-Dataset/scripts/parse-bbc-html-data.py", line 143, in GenerateMapper
    story_title = ParseHtml(raw_story, corpus).getstory_title()
  File "/home/mittu1008/XSum/XSum-Dataset/scripts/parse-bbc-html-data.py", line 44, in __init__
    self.parser = html.HTMLParser(encoding=chardet.detect(self.story.html)['encoding'])
  File "/home/mittu1008/.pyenv/versions/2.7.18/envs/venv/lib/python2.7/site-packages/lxml/html/__init__.py", line 1910, in __init__
    super(HTMLParser, self).__init__(**kwargs)
  File "src/lxml/parser.pxi", line 1728, in lxml.etree.HTMLParser.__init__
  File "src/lxml/parser.pxi", line 840, in lxml.etree._BaseParser.__init__
LookupError: unknown encoding: 'IBM852'
(2.7.18/envs/venv)

What should I do about this error?
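
chardet can report encodings (here 'IBM852') that lxml's parser does not know. A hedged patch for the ParseHtml constructor is to fall back to a default encoding when the detected one is rejected:

    import chardet
    from lxml import html

    def make_parser(raw_bytes):
        detected = chardet.detect(raw_bytes)['encoding']
        try:
            return html.HTMLParser(encoding=detected)
        except LookupError:
            # fall back when chardet's guess is not a codec lxml supports
            return html.HTMLParser(encoding='utf-8')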

How to use dataset

Hi,

Thanks for providing the dataset as a download. I downloaded the dataset from the location mentioned in #12 (comment).
But it appears that the format of the dataset is different from the files you get if you download the data yourself.

See this gist: the first file, 12092740.data, I downloaded myself from archive.org, while the second file was part of the downloaded dataset.

As you can see the downloaded file contains the attributes [XSUM]URL[XSUM], [XSUM]INTRODUCTION[XSUM] and [XSUM]RESTBODY[XSUM]. But the file from the dataset has [SN]URL[SN], [SN]TITLE[SN], [SN]FIRST-SENTENCE[SN] and [SN]RESTBODY[SN].

My problem is that if I follow the tutorial at https://github.com/EdinburghNLP/XSum/tree/master/XSum-Dataset the scripts don't work with the unmodified files.

Which changes do I need to make to the scripts?

Best,
Pyfisch
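
An untested workaround, assuming [XSUM]INTRODUCTION[XSUM] corresponds to [SN]FIRST-SENTENCE[SN] (note the self-downloaded files appear to carry no TITLE field at all), is to rewrite the markers before feeding the files to the scripts:

    import re

    # assumed correspondence between the two marker vocabularies
    FIELD_MAP = {"URL": "URL", "INTRODUCTION": "FIRST-SENTENCE", "RESTBODY": "RESTBODY"}

    def convert(text):
        # [XSUM]KEY[XSUM] -> [SN]MAPPED-KEY[SN]
        return re.sub(r"\[XSUM\]([A-Z-]+)\[XSUM\]",
                      lambda m: "[SN]%s[SN]" % FIELD_MAP.get(m.group(1), m.group(1)),
                      text)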

Reference summary and Source text

Hello,
Thank you very much for sharing this work and dataset.
Currently, I am working on abstractive summarization and I wish to evaluate my model on XSum dataset.

While I was processing the dataset, I was wondering if the [SN]title[SN] is used or not.
It appears here that the dataset uses:

  • [SN]FIRST-SENTENCE[SN] as Reference summary
  • [SN]RESTBODY[SN] as Source text
  • [SN]TITLE[SN] : Not used

May I request your confirmation of the above information?

(I am using the dataset downloaded from http://kinloch.inf.ed.ac.uk/public/XSUM-EMNLP18-Summary-Data-Original.tar.gz )
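
For concreteness, a minimal reader under exactly that assumption (the field roles are the unconfirmed ones listed above):

    def read_story(path):
        # split a dataset file on its [SN]KEY[SN] markers
        fields, key = {}, None
        for line in open(path, encoding="utf-8"):
            stripped = line.strip()
            if stripped.startswith("[SN]") and stripped.endswith("[SN]"):
                key = stripped[4:-4]
                fields[key] = []
            elif key is not None and stripped:
                fields[key].append(stripped)
        # under the assumption above: FIRST-SENTENCE = reference, RESTBODY = source
        reference = " ".join(fields.get("FIRST-SENTENCE", []))
        source = " ".join(fields.get("RESTBODY", []))
        return reference, source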

Thank you very much and I look forward to hearing your reply,
Best,
Wonjin

results on other dataset

Hello, this is very good work!
Did you or anyone else train the model on CNN/DM or Gigaword and get results?

Looking forward to your reply!
