abisee / cnn-dailymail
Code to obtain the CNN / Daily Mail dataset (non-anonymized) for summarization
License: MIT License
Hi,
In the *.story files, the titles of the news articles are absent. Is there a way to get the titles?
It actually shows up in batcher.py:
self._batch_queue = Queue.Queue(self.BATCH_QUEUE_MAX)
print(self.BATCH_QUEUE_MAX, self._hps.batch_size)
self._example_queue = Queue.Queue(self.BATCH_QUEUE_MAX * self._hps.batch_size)
At line 236 it reports that an int cannot be used as an operand with a Flag object, but I can't work out how to fix it.
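For anyone hitting the same error: this usually means self._hps.batch_size is a TensorFlow Flag object rather than a plain int, which can happen with newer TF versions that store Flag objects in FLAGS.__flags. A minimal sketch of the usual workaround, applied where the hyperparameter dict is built (unwrap_flags is a hypothetical helper name; the flags dict and hyperparameter list come from the surrounding pointer-generator code):

def unwrap_flags(flags_dict, hparam_names):
    # Newer TF versions store Flag objects in FLAGS.__flags, while older
    # ones store raw values; unwrap so that arithmetic like
    # BATCH_QUEUE_MAX * hps.batch_size sees plain ints.
    hps = {}
    for key, val in flags_dict.items():
        if key in hparam_names:
            hps[key] = val.value if hasattr(val, 'value') else val
    return hps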
This is a notification that the code to obtain the CNN / Daily Mail dataset unfortunately had a bug which caused the untokenized data to be written to the .bin
files (not the tokenized data, as intended). The fix has been committed here.
If you've already created your .bin and vocab files, I advise you to recreate them. To do this:
1. Get the latest version of the cnn-dailymail repo.
2. Delete everything in your finished_files directory (but keep the cnn_stories_tokenized and dm_stories_tokenized directories).
3. Comment out tokenize_stories(cnn_stories_dir, cnn_tokenized_stories_dir) and tokenize_stories(dm_stories_dir, dm_tokenized_stories_dir) (lines 178 and 179) of make_datafiles.py. This is because you don't need to retokenize the data.
4. Run make_datafiles.py. This will create the new .bin and vocab files.
If you've already begun training with the Tensorflow code, I advise you to restart training with the new datafiles. Switching the vocab and .bin files mid-training will not work.
Apologies for the inconvenience.
Tagging people to whom this may be relevant: @prokopevaleksey @tianjianjiang @StevenLOL @MrGLaDOS @hate5six @liuchen11 @bugtig @ayushoriginal @BenJamesbabala @BinbinBian @caomw @halolimat @ml-lab @ParseThis @qiang2100 @scylla @tonydeep @yiqingyang2012 @YuxuanHuang @Rahul-Iisc @pj-parag
While processing the data myself on my local machine, I encountered the following issue:
Traceback (most recent call last):
File "make_datafiles.py", line 241, in
write_to_bin(all_test_urls, os.path.join(finished_files_dir, "test.bin"))
File "make_datafiles.py", line 186, in write_to_bin
tf_example.features.feature['article'].bytes_list.value.extend([article])
TypeError: "-lrb- cnn -rrb- -- spirit airlines said thursday it 's sorry that hundreds of passengers had flight has type str, but expected one of: bytes
I did modify the code a bit so that I only process the CNN data (not DM),
but the above error doesn't seem to come from my modification.
Can anyone share their insight into this issue?
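In case it helps anyone: this is the usual Python 3 str-vs-bytes issue. A tf.train.Example bytes_list can only hold bytes, so the strings need to be encoded before being written. A minimal sketch of the fix at the failing line in write_to_bin, assuming the variable names from make_datafiles.py:

# article and abstract are Python 3 str; bytes_list requires bytes
tf_example.features.feature['article'].bytes_list.value.extend([article.encode('utf-8')])
tf_example.features.feature['abstract'].bytes_list.value.extend([abstract.encode('utf-8')])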
@abisee Could you help with running the neural network on our own data? How do we generate .bin files for our own articles?
I have a clear idea about the tokenization, but what about the URL mapping? How is it done?
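Not the author, but for anyone with the same question: the URL mapping only exists to give each CNN/DM story a stable hashed filename and a train/val/test split, so for your own articles you can skip it and write the .bin files directly. A minimal sketch, assuming a list of (article, abstract) string pairs; write_bin is a hypothetical helper name, but the record format (an 8-byte length followed by a serialized tf.Example with 'article' and 'abstract' features) matches what make_datafiles.py produces:

import struct
from tensorflow.core.example import example_pb2

def write_bin(pairs, out_file):
    # pairs: iterable of (article, abstract) strings
    with open(out_file, 'wb') as writer:
        for article, abstract in pairs:
            tf_example = example_pb2.Example()
            tf_example.features.feature['article'].bytes_list.value.extend([article.encode('utf-8')])
            tf_example.features.feature['abstract'].bytes_list.value.extend([abstract.encode('utf-8')])
            tf_example_str = tf_example.SerializeToString()
            # each record: 8-byte length prefix, then the serialized Example
            writer.write(struct.pack('q', len(tf_example_str)))
            writer.write(struct.pack('%ds' % len(tf_example_str), tf_example_str))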
Thanks for the great work! Could you please add a license to this repo so I can use it?
I ran make_datafiles.py to generate raw text files for BART preprocessing, but I hit the following issue:
python make_datafiles.py ./cnn/stories ./dailymail/stories/
Making bin file for URLs listed in url_lists/all_test.txt...
Traceback (most recent call last):
File "make_datafiles.py", line 138, in
write_to_bin(all_test_urls, os.path.join(finished_files_dir, "test"))
File "make_datafiles.py", line 84, in write_to_bin
url_list = read_text_file(url_file)
File "make_datafiles.py", line 26, in read_text_file
with open(text_file, "r") as f:
FileNotFoundError: [Errno 2] No such file or directory: 'url_lists/all_test.txt'
Then I assumed it was because all_test_urls doesn't point to the URL file in the dataset, i.e., wayback_test_urls.txt. So I renamed the file to all_test.txt and put it in the folder ./cnn/url_lists, but the code still gave the same error. So I checked the source again and found something wrong in the following line.
url_list = read_text_file(url_file)
So I changed it to:
url_list = read_text_file(os.path.join('./cnn', url_file))
This way, I think all the source and target files are generated from the CNN dataset only. Am I right?
I used Python 3's 2to3.py utility to convert your file to Python 3, but I am getting many encoding issues. Kindly provide a Python 3 version of the file, or suggest the appropriate changes so that I can use it with Python 3, please.
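A hedged note for others doing the conversion: beyond what 2to3 handles, most of the encoding issues come from implicit text encodings, so be explicit when opening files (and encode strings before hashing or writing them into the protobuf, as covered in the other issues on this page). For example, a sketch of the read_text_file change:

def read_text_file(text_file):
    # be explicit about the encoding so Python 3 doesn't fall back to
    # the platform default
    with open(text_file, 'r', encoding='utf-8') as f:
        return [line.strip() for line in f]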
Hi, thanks for your hard work! Could you provide the processed data via Google Drive or Dropbox?
Hi,
Thanks for the links to the dataset. I see it has a folder of several .story files; however, it's not clear to me how to get the processed dataset with source/target parts. Thanks a lot for your help.
I opened the file with 'rb', and it contains many unconverted characters:
with open('/users/cheng/NLP/Data/finished_files/chunked/test_000.bin', 'rb') as file:
for line in file:
print(line)
b'R\x1e\x00\x00\x00\x00\x00\x00\n'
b'\xcf<\n'
b'\xf0\x02\n'
b'\x08abstract\x12\xe3\x02\n'
b'\xe0\x02\n'
b"\xdd\x02<s> marseille prosecutor says `` so far no videos were used in the crash investigation '' despite media reports . </s> <s> journalists at bild and paris match are `` very confident '' the video clip is real , an editor says . </s> <s> andreas lubitz had informed his lufthansa training school of an episode of severe depression , airline says . </s>\n"
b'\xd99\n'
b'\x07article\x12\xcd9\n'
b'\xca9\n'
Then I tried to process them myself, splitting the article and abstract and writing them to separate files, but I got this error after processing most of the files:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 3131: invalid start byte
How can I get a clean article and abstract from these files?
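For anyone else stuck here: the .bin files are not line-oriented, so iterating over "lines" splits records at arbitrary \n bytes, which is where the undecodable fragments come from. Each record is an 8-byte length followed by a serialized tf.Example, so read it that way. A minimal sketch, following the example_generator in the companion pointer-generator code's data.py:

import struct
from tensorflow.core.example import example_pb2

def read_bin(path):
    # yields (article, abstract) string pairs from a .bin file
    with open(path, 'rb') as reader:
        while True:
            len_bytes = reader.read(8)
            if not len_bytes:
                break  # end of file
            str_len = struct.unpack('q', len_bytes)[0]
            example_str = struct.unpack('%ds' % str_len, reader.read(str_len))[0]
            e = example_pb2.Example.FromString(example_str)
            article = e.features.feature['article'].bytes_list.value[0].decode('utf-8')
            abstract = e.features.feature['abstract'].bytes_list.value[0].decode('utf-8')
            yield article, abstract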
Sorry, I want to know: how can I get a URL if I pack my own article into a .story file?
Thanks for your script!
Are the data files generated by this script not usable for Google's textsum?
Finished tokenizing!
Making bin file for url_lists/all_test.txt...
Writing story 0 of 11490; 0.00 percent done
make_datafiles.py:68: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
if line[-1] in END_TOKENS: return line
Writing story 1000 of 11490; 8.70 percent done
Traceback (most recent call last):
File "make_datafiles.py", line 182, in <module>
write_to_bin(all_test_urls, os.path.join(finished_files_dir, "test.bin"))
File "make_datafiles.py", line 127, in write_to_bin
raise Exception("can't find story file")
Exception: can't find story file
Question:
What should the output filenames produced by tokenize_stories() look like?
It seems like write_to_bin() expects hashed names in this directory, which I'm not producing directly from tokenize_stories() (i.e. from PTBTokenizer).
Context:
On Mac OS, using stanford-corenlp-full-2016-10-31/stanford-corenlp-3.7.0.jar.
Example:
If one of the "stories" input file names is "A" then after executing tokenize_stories(), a file called "A" appears in the corresponding tokenized_stories_dir, as opposed to hashhex("A").
It seems PTBTokenizer is working (at least partially) since the tokenized "A" does have, for example, spaces between punctuation marks and -LRB- for left parenthesis.
Outlook:
Specifically, in write_to_bin(), there is
story_fnames = [s+".story" for s in url_hashes]
However, if hashed names are not produced by tokenize_stories(), then a "fix" is
story_fnames = [s + ".story" for s in url_list]
If I have article content that is not from CNN or DM, how should I process the data?
Making bin file for URLs listed in /home/demo/Downloads/pointer-generator-master/data/url_lists/all_test.txt...
Traceback (most recent call last):
File "make_datafiles.py", line 238, in
write_to_bin(all_test_urls, os.path.join(finished_files_dir, "test.bin"))
File "make_datafiles.py", line 153, in write_to_bin
url_hashes = get_url_hashes(url_list)
File "make_datafiles.py", line 105, in get_url_hashes
return [hashhex(url) for url in url_list]
File "make_datafiles.py", line 105, in
return [hashhex(url) for url in url_list]
File "make_datafiles.py", line 100, in hashhex
h.update(s)
TypeError: Unicode-objects must be encoded before hashing
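This is the standard Python 3 fix: hashlib requires bytes, so encode the URL before updating the hash. A sketch of the patched hashhex from make_datafiles.py:

import hashlib

def hashhex(s):
    # SHA1 of the URL string, hex-encoded; under Python 3, s must be
    # encoded to bytes before hashing
    h = hashlib.sha1()
    h.update(s.encode('utf-8'))
    return h.hexdigest()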
What's the license for the code in this repository?
Here is the error.
python make_datafiles.py cnn/stories/ dailymail/stories/
Preparing to tokenize cnn/stories/ to cnn_stories_tokenized...
Making list of files to tokenize...
Tokenizing 92579 files in cnn/stories/ and saving in cnn_stories_tokenized...
Exception in thread "main" java.io.IOException: Stream closed
at java.io.BufferedWriter.ensureOpen(BufferedWriter.java:116)
at java.io.BufferedWriter.write(BufferedWriter.java:221)
at java.io.Writer.write(Writer.java:157)
at edu.stanford.nlp.process.PTBTokenizer.tokReader(PTBTokenizer.java:505)
at edu.stanford.nlp.process.PTBTokenizer.tok(PTBTokenizer.java:450)
at edu.stanford.nlp.process.PTBTokenizer.main(PTBTokenizer.java:813)
Stanford CoreNLP Tokenizer has finished.
Traceback (most recent call last):
File "make_datafiles.py", line 238, in <module>
tokenize_stories(cnn_stories_dir, cnn_tokenized_stories_dir)
File "make_datafiles.py", line 86, in tokenize_stories
raise Exception("The tokenized stories directory %s contains %i files, but it should contain the same number as %s (which has %i files). Was there an error during tokenization?" % (tokenized_stories_dir, num_tokenized, stories_dir, num_orig))
Exception: The tokenized stories directory cnn_stories_tokenized contains 1 files, but it should contain the same number as cnn/stories/ (which has 92579 files). Was there an error during tokenization?
Kindly help me.
Hello, I was impressed by your paper and code.
Can I create a branch for Python 3?
If you don't mind, I'd like to contribute the conversion. Thanks.
@abisee This is the error that I get when I run the command python make_datafiles.py cnn/stories dailymail/stories:
Preparing to tokenize cnn/stories to cnn_stories_tokenized...
Making list of files to tokenize...
Tokenizing 92579 files in cnn/stories and saving in cnn_stories_tokenized...
Exception in thread "main" java.io.IOException: Stream closed
at java.io.BufferedWriter.ensureOpen(BufferedWriter.java:116)
at java.io.BufferedWriter.write(BufferedWriter.java:221)
at java.io.Writer.write(Writer.java:157)
at edu.stanford.nlp.process.PTBTokenizer.tokReader(PTBTokenizer.java:505)
at edu.stanford.nlp.process.PTBTokenizer.tok(PTBTokenizer.java:450)
at edu.stanford.nlp.process.PTBTokenizer.main(PTBTokenizer.java:813)
Stanford CoreNLP Tokenizer has finished.
Traceback (most recent call last):
File "make_datafiles.py", line 235, in <module>
tokenize_stories(cnn_stories_dir, cnn_tokenized_stories_dir)
File "make_datafiles.py", line 86, in tokenize_stories
raise Exception("The tokenized stories directory %s contains %i files, but it should contain the same number as %s (which has %i files). Was there an error during tokenization?" % (tokenized_stories_dir, num_tokenized, stories_dir, num_orig))
Exception: The tokenized stories directory cnn_stories_tokenized contains 1 files, but it should contain the same number as cnn/stories (which has 92579 files). Was there an error during tokenization?
Hi,
I would like to ask about the file type. The data given is binary; when I convert it to string or text with Python, some of it comes out garbled. May I access a "txt" version directly, or could you offer some suggestions? Thanks very much!
I have set up this repo. Now all my outputs are selections of sentences from the original text. How can this be abstractive?
Even the test data produces the same kind of output.
Example
Decoded Summary:
obama says he is `` absolutely committed to making sure '' israel maintains a military advantage over iran .
his comments to the new york times published on sunday , come amid criticism from israeli prime minister benjamin netanyahu .
Original Text:
washington ( cnn ) president barack obama says he is `` absolutely committed to making sure '' israel maintains a military advantage over iran
. his comments to the new york times , published on sunday , come amid criticism from israeli prime minister benjamin netanyahu
of the deal that the united states and five other world powers struck with iran . tehran agreed to halt the country 's nuclear ambitions , and in exchange , western powers would drop sanctions that have hurt the iran 's economy . obama said he understands and respects netanyahu 's stance that israel is particularly vulnerable and does n't have the luxury of testing these propositions '' in the deal .
but what i would say to them is that not only am i absolutely committed to making sure they maintain their qualitative military edge , and that they can deter any potential future attacks , but what i 'm willing to do is to make the kinds of commitments that would give everybody in the neighborhood , including iran , a clarity that if israel were to be attacked by any state , that we would stand by them , '' obama said . that , he said , should be sufficient to take advantage of this once-in-a-lifetime opportunity to see whether or not we can at least take the nuclear issue off the table , '' he said . the framework negotiators announced last week would see iran reduce its centrifuges from 19,000 to 5,060 , limit the extent to which uranium necessary for nuclear weapons can be enriched and increase inspections . the talks over a final draft are scheduled to continue until june 30 . but netanyahu and republican critics in congress have complained that iran wo n't have to shut down its nuclear facilities and that the country 's leadership is n't trustworthy enough for the inspections to be as valuable as obama says they are . obama said even if iran ca n't be trusted , there 's still a case to be made for the deal .
in fact , you could argue that if they are implacably opposed to us , all the more reason for us to want to have a deal in which we know what they 're doing and that , for a long period of time , we can prevent them from having a nuclear weapon , '' obama said .
Hi, I see that a lot of the input texts start with "( cnn )". Don't we need to delete the "( cnn )" at the beginning of the text? Thanks.
I have two files in Bengali: article.txt and summary.txt. How can I convert them to the corresponding train.bin, val.bin, and test.bin? I just couldn't understand how to process my Bengali corpus for this summarization pipeline. Thanks in advance.
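A hedged sketch of how those files could be packed, reusing the write_bin helper sketched under the earlier .bin question on this page, and assuming one example per line in article.txt and summary.txt (the 90/5/5 split is arbitrary):

# read parallel article/summary lines (one example per line)
with open('article.txt', encoding='utf-8') as fa, open('summary.txt', encoding='utf-8') as fs:
    pairs = [(a.strip(), s.strip()) for a, s in zip(fa, fs)]

# arbitrary 90/5/5 split into train/val/test
n = len(pairs)
write_bin(pairs[:int(0.9 * n)], 'train.bin')
write_bin(pairs[int(0.9 * n):int(0.95 * n)], 'val.bin')
write_bin(pairs[int(0.95 * n):], 'test.bin')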
@abisee Did you write code for generating the anonymized version of the CNN-DailyMail summarization dataset?
I get 287,113 pairs in 'train.bin', which is not equal to the 287,226 mentioned in Get To The Point: Summarization with Pointer-Generator Networks. Does anyone have the same problem?
For simplicity, we originally provided code to produce a single train.bin, val.bin and test.bin file for the data. However, in our experiments for the paper we split the data into several chunks, each containing 1000 examples (i.e. train_000.bin, train_001.bin, ..., train_287.bin). In the interest of reproducibility, make_datafiles.py has now been updated to also produce chunked data that's saved in finished_files/chunked, and the README for the Tensorflow code now gives instructions for chunked data. If you've already run make_datafiles.py to obtain train.bin/val.bin/test.bin files, then just run
import make_datafiles
make_datafiles.chunk_all()
in Python, from the cnn-dailymail directory, to get the chunked files (it takes a few seconds).
To use your chunked datafiles with the Tensorflow code, set e.g.
--data_path=/path/to/chunked/train_*
You don't have to restart training from the beginning to switch to the chunked datafiles.
Why does it matter?
The multi-threaded batcher code is originally from the TextSum project. The idea is that each input thread calls example_generator, which randomizes the chunks and then reads from the chunks in that order. Thus 16 threads concurrently fill the input queue with examples drawn from different, randomly-chosen chunks. If your data is in a single file, however, then the multi-threaded batcher will result in 16 threads concurrently filling the input queue with examples drawn in order from the same single .bin file. Firstly, this might produce batches containing more duplicate examples than we want. Secondly, reading through the dataset in order may produce different training results than reading through it in randomized chunks.
Alternatively, if you want to keep your data in a single file, you can make the batcher single-threaded by setting:
self._num_example_q_threads = 1 # num threads to fill example queue
self._num_batch_q_threads = 1 # num threads to fill batch queue
(From a speed point of view, the multi-threaded batcher is probably unnecessary for many systems anyway.)
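For reference, the chunking itself is simple: read records off the single .bin file and start a new output file every 1000 examples. A sketch of roughly what make_datafiles.chunk_all does for one split (details may differ slightly from the actual source):

import os
import struct

CHUNK_SIZE = 1000  # examples per chunk

def chunk_file(in_file, chunks_dir, set_name):
    # splits e.g. train.bin into train_000.bin, train_001.bin, ...
    reader = open(in_file, 'rb')
    chunk, finished = 0, False
    while not finished:
        chunk_fname = os.path.join(chunks_dir, '%s_%03d.bin' % (set_name, chunk))
        with open(chunk_fname, 'wb') as writer:
            for _ in range(CHUNK_SIZE):
                len_bytes = reader.read(8)
                if not len_bytes:
                    finished = True
                    break
                str_len = struct.unpack('q', len_bytes)[0]
                example_str = struct.unpack('%ds' % str_len, reader.read(str_len))[0]
                # copy the record verbatim: length prefix, then payload
                writer.write(struct.pack('q', str_len))
                writer.write(struct.pack('%ds' % str_len, example_str))
        chunk += 1
    reader.close()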
Why is, for example, "0800 555 111 356" included in the generated vocab file? This entry is at line 23163. Or is it just me who has this problem?
>>> with open('data/cnn-dailymail/vocab', 'r') as vocab_f:
... for line in vocab_f:
... pieces = line.split()
... if len(pieces) != 2:
... print(pieces)
...
['0800', '555', '111', '356']
['1800', '333', '000', '139']
['2', '1/2', '124']
['3', '1/2', '86']
['1', '1/2', '68']
['0800', '555111', '59']
['4', '1/2', '47']
['0844', '472', '4157', '41']
['5', '1/2', '39']
['7', '1/2', '25']
['6', '1/2', '24']
['9', '1/2', '21']
['020', '7629', '9161', '19']
['8', '1/2', '19']
['0300', '123', '8018', '19']
['0808', '800', '5000', '19']
['11', '1/2', '18']
['0844', '493', '0787', '14']
['1300', '659', '467', '13']
['16', '1/2', '12']
['13', '1/2', '12']
['1800', '273', '8255', '11']
['18', '1/2', '10']
['0300', '1234', '999', '10']
['0845', '790', '9090', '10']
['0845', '634', '1414', '9']
['14', '1/2', '8']
['0207', '938', '6364', '8']
['0207', '938', '6683', '8']
['310', '642', '2317', '7']
['at', 'uefa.com', '7']
['0207', '386', '0868', '7']
['0808', '800', '2222', '6']
['0800', '789', '321', '6']
['0800', '854', '440', '6']
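Not just you. A handful of "words" in the vocab contain internal whitespace, most likely because the source text has non-breaking spaces or similar characters inside tokens such as phone numbers; PTBTokenizer leaves them intact, and Python 3's line.split() then splits on them. If you just need to parse the file, split from the right so that only the final field is treated as the count. A hedged sketch:

def load_vocab(path):
    # parses a 'word count' vocab file, tolerating odd whitespace
    # inside the word itself
    vocab = {}
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip('\n').rsplit(None, 1)
            if len(parts) != 2:
                continue  # skip malformed lines
            word, count = parts
            vocab[word] = int(count)
    return vocab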
I ran make_datafiles.py, but it hits an error:
Preparing to tokenize /home/ztl/Downloads/cnn_stories/cnn/stories to cnn_stories_tokenized...
Making list of files to tokenize...
Tokenizing 92579 files in /home/ztl/Downloads/cnn_stories/cnn/stories and saving in cnn_stories_tokenized...
Error: Could not find or load main class edu.stanford.nlp.process.PTBTokenizer
Caused by: java.lang.ClassNotFoundException: edu.stanford.nlp.process.PTBTokenizer
Stanford CoreNLP Tokenizer has finished.
Traceback (most recent call last):
However, I can run echo "Please tokenize this text." | java edu.stanford.nlp.process.PTBTokenizer from the root directory.
I don't know how to deal with this. Thanks a lot.
Thanks for the code, abisee.
I have tried to use the new Stanford CoreNLP 3.8, but it seems it can only tokenize the first data file, i.e. the first line in mapping.txt, and thus:
Exception in thread "main" java.io.IOException: Stream closed
at java.io.BufferedWriter.ensureOpen(BufferedWriter.java:116)
at java.io.BufferedWriter.write(BufferedWriter.java:221)
at java.io.Writer.write(Writer.java:157)
at edu.stanford.nlp.process.PTBTokenizer.tokReader(PTBTokenizer.java:505)
at edu.stanford.nlp.process.PTBTokenizer.tok(PTBTokenizer.java:450)
at edu.stanford.nlp.process.PTBTokenizer.main(PTBTokenizer.java:813)
CoreNLP 3.7 works perfectly though. Just thought you might want to know this.