abisee / cnn-dailymail
Code to obtain the CNN / Daily Mail dataset (non-anonymized) for summarization
License: MIT License
Hi,
In the *.story files, the titles of the news articles are absent. Is there a way to get the titles?
It actually shows up in batcher.py:
self._batch_queue = Queue.Queue(self.BATCH_QUEUE_MAX)
print(self.BATCH_QUEUE_MAX, self._hps.batch_size)
self._example_queue = Queue.Queue(self.BATCH_QUEUE_MAX * self._hps.batch_size)
At line 236 it reports that an int cannot be used as an operand with a Flag object, but I can't work out how to fix it.
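For anyone hitting the same error: this usually means self._hps.batch_size is a TensorFlow Flag object rather than a plain int, which can happen with newer TF versions that store Flag objects in FLAGS.__flags. A minimal sketch of the usual workaround, applied where the hyperparameter dict is built (unwrap_flags is a hypothetical helper name; the flags dict and hyperparameter list come from the surrounding pointer-generator code):

def unwrap_flags(flags_dict, hparam_names):
    # Newer TF versions store Flag objects in FLAGS.__flags, while older
    # ones store raw values; unwrap so that arithmetic like
    # BATCH_QUEUE_MAX * hps.batch_size sees plain ints.
    hps = {}
    for key, val in flags_dict.items():
        if key in hparam_names:
            hps[key] = val.value if hasattr(val, 'value') else val
    return hps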
This is a notification that the code to obtain the CNN / Daily Mail dataset unfortunately had a bug which caused the untokenized data to be written to the .bin
files (not the tokenized data, as intended). The fix has been committed here.
If you've already created your .bin and vocab files, I advise you to recreate them. To do this:
1. Get the latest version of the cnn-dailymail repo.
2. Delete everything in your finished_files directory (but keep the cnn_stories_tokenized and dm_stories_tokenized directories).
3. Comment out tokenize_stories(cnn_stories_dir, cnn_tokenized_stories_dir) and tokenize_stories(dm_stories_dir, dm_tokenized_stories_dir) (lines 178 and 179) of make_datafiles.py. This is because you don't need to retokenize the data.
4. Run make_datafiles.py. This will create the new .bin and vocab files.
If you've already begun training with the Tensorflow code, I advise you to restart training with the new datafiles. Switching the vocab and .bin files mid-training will not work.
Apologies for the inconvenience.
Tagging people to whom this may be relevant: @prokopevaleksey @tianjianjiang @StevenLOL @MrGLaDOS @hate5six @liuchen11 @bugtig @ayushoriginal @BenJamesbabala @BinbinBian @caomw @halolimat @ml-lab @ParseThis @qiang2100 @scylla @tonydeep @yiqingyang2012 @YuxuanHuang @Rahul-Iisc @pj-parag
While processing the data myself on my local machine, I encountered the following issue:
Traceback (most recent call last):
File "make_datafiles.py", line 241, in
write_to_bin(all_test_urls, os.path.join(finished_files_dir, "test.bin"))
File "make_datafiles.py", line 186, in write_to_bin
tf_example.features.feature['article'].bytes_list.value.extend([article])
TypeError: "-lrb- cnn -rrb- -- spirit airlines said thursday it 's sorry that hundreds of passengers had flight has type str, but expected one of: bytes
I did modify the code a bit so that I only process the CNN data (not DM),
but the above error doesn't seem to come from my modification.
Can anyone share their insight into this issue?
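In case it helps anyone: this is the usual Python 3 str-vs-bytes issue. A tf.train.Example bytes_list can only hold bytes, so the strings need to be encoded before being written. A minimal sketch of the fix at the failing line in write_to_bin, assuming the variable names from make_datafiles.py:

# article and abstract are Python 3 str; bytes_list requires bytes
tf_example.features.feature['article'].bytes_list.value.extend([article.encode('utf-8')])
tf_example.features.feature['abstract'].bytes_list.value.extend([abstract.encode('utf-8')])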
@abisee Could you help with running the neural network on our own data? How do we generate .bin files for our own articles?
I have a clear idea about the tokenization, but what about the URL mapping? How is it done?
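Not the author, but for anyone with the same question: the URL mapping only exists to give each CNN/DM story a stable hashed filename and a train/val/test split, so for your own articles you can skip it and write the .bin files directly. A minimal sketch, assuming a list of (article, abstract) string pairs; write_bin is a hypothetical helper name, but the record format (an 8-byte length followed by a serialized tf.Example with 'article' and 'abstract' features) matches what make_datafiles.py produces:

import struct
from tensorflow.core.example import example_pb2

def write_bin(pairs, out_file):
    # pairs: iterable of (article, abstract) strings
    with open(out_file, 'wb') as writer:
        for article, abstract in pairs:
            tf_example = example_pb2.Example()
            tf_example.features.feature['article'].bytes_list.value.extend([article.encode('utf-8')])
            tf_example.features.feature['abstract'].bytes_list.value.extend([abstract.encode('utf-8')])
            tf_example_str = tf_example.SerializeToString()
            # each record: 8-byte length prefix, then the serialized Example
            writer.write(struct.pack('q', len(tf_example_str)))
            writer.write(struct.pack('%ds' % len(tf_example_str), tf_example_str))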
Thanks for the great work! Could you please add a license to this repo so I can use it?
I ran make_datafiles.py to generate raw text files for BART preprocessing, but I hit the following issue:
python make_datafiles.py ./cnn/stories ./dailymail/stories/
Making bin file for URLs listed in url_lists/all_test.txt...
Traceback (most recent call last):
File "make_datafiles.py", line 138, in
write_to_bin(all_test_urls, os.path.join(finished_files_dir, "test"))
File "make_datafiles.py", line 84, in write_to_bin
url_list = read_text_file(url_file)
File "make_datafiles.py", line 26, in read_text_file
with open(text_file, "r") as f:
FileNotFoundError: [Errno 2] No such file or directory: 'url_lists/all_test.txt'
Then I assumed it was because all_test_urls doesn't point to the URL file in the dataset, i.e., wayback_test_urls.txt. So I renamed the file to all_test.txt and put it in the folder ./cnn/url_lists, but the code still gave the same error. So I checked the source again and found something wrong in the following line.
url_list = read_text_file(url_file)
So I changed it to:
url_list = read_text_file(os.path.join('./cnn', url_file))
This way, I think all the source and target files are generated from the CNN dataset only. Am I right?
I used Python 3's 2to3.py utility to convert your file to Python 3, but I am getting many encoding issues. Kindly provide a Python 3 version of the file, or suggest the appropriate changes so that I can use it with Python 3, please.
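A hedged note for others doing the conversion: beyond what 2to3 handles, most of the encoding issues come from implicit text encodings, so be explicit when opening files (and encode strings before hashing or writing them into the protobuf, as covered in the other issues on this page). For example, a sketch of the read_text_file change:

def read_text_file(text_file):
    # be explicit about the encoding so Python 3 doesn't fall back to
    # the platform default
    with open(text_file, 'r', encoding='utf-8') as f:
        return [line.strip() for line in f]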
Hi, thanks for your hard work! Could you provide the processed data via Google Drive or Dropbox?
Hi,
Thanks for the links to the dataset. I see it has a folder of several .story files; however, it's not clear to me how to get the processed dataset with source/target parts. Thanks a lot for your help.
I opened the file with 'rb', and it contains many unconverted characters:
with open('/users/cheng/NLP/Data/finished_files/chunked/test_000.bin', 'rb') as file:
for line in file:
print(line)
b'R\x1e\x00\x00\x00\x00\x00\x00\n'
b'\xcf<\n'
b'\xf0\x02\n'
b'\x08abstract\x12\xe3\x02\n'
b'\xe0\x02\n'
b"\xdd\x02<s> marseille prosecutor says `` so far no videos were used in the crash investigation '' despite media reports . </s> <s> journalists at bild and paris match are `` very confident '' the video clip is real , an editor says . </s> <s> andreas lubitz had informed his lufthansa training school of an episode of severe depression , airline says . </s>\n"
b'\xd99\n'
b'\x07article\x12\xcd9\n'
b'\xca9\n'
Then I tried to process them myself, splitting the article and abstract and writing them to separate files, but I got this error after processing most of the files:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 3131: invalid start byte
How can I get a clean article and abstract from these files?
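For anyone else stuck here: the .bin files are not line-oriented, so iterating over "lines" splits records at arbitrary \n bytes, which is where the undecodable fragments come from. Each record is an 8-byte length followed by a serialized tf.Example, so read it that way. A minimal sketch, following the example_generator in the companion pointer-generator code's data.py:

import struct
from tensorflow.core.example import example_pb2

def read_bin(path):
    # yields (article, abstract) string pairs from a .bin file
    with open(path, 'rb') as reader:
        while True:
            len_bytes = reader.read(8)
            if not len_bytes:
                break  # end of file
            str_len = struct.unpack('q', len_bytes)[0]
            example_str = struct.unpack('%ds' % str_len, reader.read(str_len))[0]
            e = example_pb2.Example.FromString(example_str)
            article = e.features.feature['article'].bytes_list.value[0].decode('utf-8')
            abstract = e.features.feature['abstract'].bytes_list.value[0].decode('utf-8')
            yield article, abstract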
Sorry, I want to know: how can I get a URL if I pack my own article into a .story file?
Thanks for your script!
Are the data files generated by this script not usable for Google's textsum?
Finished tokenizing!
Making bin file for url_lists/all_test.txt...
Writing story 0 of 11490; 0.00 percent done
make_datafiles.py:68: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
if line[-1] in END_TOKENS: return line
Writing story 1000 of 11490; 8.70 percent done
Traceback (most recent call last):
File "make_datafiles.py", line 182, in <module>
write_to_bin(all_test_urls, os.path.join(finished_files_dir, "test.bin"))
File "make_datafiles.py", line 127, in write_to_bin
raise Exception("can't find story file")
Exception: can't find story file
Question:
What should the output filenames produced by tokenize_stories() look like?
It seems like write_to_bin() expects hashed names in this directory, which I'm not producing directly from tokenize_stories() (i.e. from PTBTokenizer).
Context:
On Mac OS, using stanford-corenlp-full-2016-10-31/stanford-corenlp-3.7.0.jar.
Example:
If one of the "stories" input file names is "A" then after executing tokenize_stories(), a file called "A" appears in the corresponding tokenized_stories_dir, as opposed to hashhex("A").
It seems PTBTokenizer is working (at least partially) since the tokenized "A" does have, for example, spaces between punctuation marks and -LRB- for left parenthesis.
Outlook:
Specifically, in write_to_bin(), there is
story_fnames = [s+".story" for s in url_hashes]
However, if hashed names are not produced by tokenize_stories(), then a "fix" is
story_fnames = [s + ".story" for s in url_list]
If I have article content that is not from CNN or DM, how should I process the data?
Making bin file for URLs listed in /home/demo/Downloads/pointer-generator-master/data/url_lists/all_test.txt...
Traceback (most recent call last):
File "make_datafiles.py", line 238, in
write_to_bin(all_test_urls, os.path.join(finished_files_dir, "test.bin"))
File "make_datafiles.py", line 153, in write_to_bin
url_hashes = get_url_hashes(url_list)
File "make_datafiles.py", line 105, in get_url_hashes
return [hashhex(url) for url in url_list]
File "make_datafiles.py", line 105, in
return [hashhex(url) for url in url_list]
File "make_datafiles.py", line 100, in hashhex
h.update(s)
TypeError: Unicode-objects must be encoded before hashing
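This is the standard Python 3 fix: hashlib requires bytes, so encode the URL before updating the hash. A sketch of the patched hashhex from make_datafiles.py:

import hashlib

def hashhex(s):
    # SHA1 of the URL string, hex-encoded; under Python 3, s must be
    # encoded to bytes before hashing
    h = hashlib.sha1()
    h.update(s.encode('utf-8'))
    return h.hexdigest()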
What's the license for the code in this repository?
Here is the error.
python make_datafiles.py cnn/stories/ dailymail/stories/
Preparing to tokenize cnn/stories/ to cnn_stories_tokenized...
Making list of files to tokenize...
Tokenizing 92579 files in cnn/stories/ and saving in cnn_stories_tokenized...
Exception in thread "main" java.io.IOException: Stream closed
at java.io.BufferedWriter.ensureOpen(BufferedWriter.java:116)
at java.io.BufferedWriter.write(BufferedWriter.java:221)
at java.io.Writer.write(Writer.java:157)
at edu.stanford.nlp.process.PTBTokenizer.tokReader(PTBTokenizer.java:505)
at edu.stanford.nlp.process.PTBTokenizer.tok(PTBTokenizer.java:450)
at edu.stanford.nlp.process.PTBTokenizer.main(PTBTokenizer.java:813)
Stanford CoreNLP Tokenizer has finished.
Traceback (most recent call last):
File "make_datafiles.py", line 238, in <module>
tokenize_stories(cnn_stories_dir, cnn_tokenized_stories_dir)
File "make_datafiles.py", line 86, in tokenize_stories
raise Exception("The tokenized stories directory %s contains %i files, but it should contain the same number as %s (which has %i files). Was there an error during tokenization?" % (tokenized_stories_dir, num_tokenized, stories_dir, num_orig))
Exception: The tokenized stories directory cnn_stories_tokenized contains 1 files, but it should contain the same number as cnn/stories/ (which has 92579 files). Was there an error during tokenization?
Kindly help me.
Hello, I was impressed by your paper and code.
Can I create a branch for Python 3?
If you don't mind, I'd like to contribute the conversion. Thanks.
@abisee This is the error that I get when I run the command python make_datafiles.py cnn/stories dailymail/stories:
Preparing to tokenize cnn/stories to cnn_stories_tokenized...
Making list of files to tokenize...
Tokenizing 92579 files in cnn/stories and saving in cnn_stories_tokenized...
Exception in thread "main" java.io.IOException: Stream closed
at java.io.BufferedWriter.ensureOpen(BufferedWriter.java:116)
at java.io.BufferedWriter.write(BufferedWriter.java:221)
at java.io.Writer.write(Writer.java:157)
at edu.stanford.nlp.process.PTBTokenizer.tokReader(PTBTokenizer.java:505)
at edu.stanford.nlp.process.PTBTokenizer.tok(PTBTokenizer.java:450)
at edu.stanford.nlp.process.PTBTokenizer.main(PTBTokenizer.java:813)
Stanford CoreNLP Tokenizer has finished.
Traceback (most recent call last):
File "make_datafiles.py", line 235, in <module>
tokenize_stories(cnn_stories_dir, cnn_tokenized_stories_dir)
File "make_datafiles.py", line 86, in tokenize_stories
raise Exception("The tokenized stories directory %s contains %i files, but it should contain the same number as %s (which has %i files). Was there an error during tokenization?" % (tokenized_stories_dir, num_tokenized, stories_dir, num_orig))
Exception: The tokenized stories directory cnn_stories_tokenized contains 1 files, but it should contain the same number as cnn/stories (which has 92579 files). Was there an error during tokenization?
Hi,
I would like to ask about the file type. The data given is binary; when I convert it to string or text with Python, some of it comes out garbled. May I access a "txt" version directly, or could you offer some suggestions? Thanks very much!
I have set up this repo. Now all my outputs are selections of sentences from the original text. How can this be abstractive?
Even the test data produces the same kind of output.
Example
Decoded Summary:
obama says he is `` absolutely committed to making sure '' israel maintains a military advantage over iran .
his comments to the new york times published on sunday , come amid criticism from israeli prime minister benjamin netanyahu .
Original Text:
washington ( cnn ) president barack obama says he is `` absolutely committed to making sure '' israel maintains a military advantage over iran
. his comments to the new york times , published on sunday , come amid criticism from israeli prime minister benjamin netanyahu
of the deal that the united states and five other world powers struck with iran . tehran agreed to halt the country 's nuclear ambitions , and in exchange , western powers would drop sanctions that have hurt the iran 's economy . obama said he understands and respects netanyahu 's stance that israel is particularly vulnerable and does n't have the luxury of testing these propositions '' in the deal .
but what i would say to them is that not only am i absolutely committed to making sure they maintain their qualitative military edge , and that they can deter any potential future attacks , but what i 'm willing to do is to make the kinds of commitments that would give everybody in the neighborhood , including iran , a clarity that if israel were to be attacked by any state , that we would stand by them , '' obama said . that , he said , should be sufficient to take advantage of this once-in-a-lifetime opportunity to see whether or not we can at least take the nuclear issue off the table , '' he said . the framework negotiators announced last week would see iran reduce its centrifuges from 19,000 to 5,060 , limit the extent to which uranium necessary for nuclear weapons can be enriched and increase inspections . the talks over a final draft are scheduled to continue until june 30 . but netanyahu and republican critics in congress have complained that iran wo n't have to shut down its nuclear facilities and that the country 's leadership is n't trustworthy enough for the inspections to be as valuable as obama says they are . obama said even if iran ca n't be trusted , there 's still a case to be made for the deal .
in fact , you could argue that if they are implacably opposed to us , all the more reason for us to want to have a deal in which we know what they 're doing and that , for a long period of time , we can prevent them from having a nuclear weapon , '' obama said .
Hi, I see that a lot of the input texts start with "( cnn )". Don't we need to delete the "( cnn )" at the beginning of the text? Thanks.
I have two files in Bengali: article.txt and summary.txt. How can I convert them to the corresponding train.bin, val.bin, and test.bin? I just couldn't understand how to process my Bengali corpus for this summarization pipeline. Thanks in advance.
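A hedged sketch of how those files could be packed, reusing the write_bin helper sketched under the earlier .bin question on this page, and assuming one example per line in article.txt and summary.txt (the 90/5/5 split is arbitrary):

# read parallel article/summary lines (one example per line)
with open('article.txt', encoding='utf-8') as fa, open('summary.txt', encoding='utf-8') as fs:
    pairs = [(a.strip(), s.strip()) for a, s in zip(fa, fs)]

# arbitrary 90/5/5 split into train/val/test
n = len(pairs)
write_bin(pairs[:int(0.9 * n)], 'train.bin')
write_bin(pairs[int(0.9 * n):int(0.95 * n)], 'val.bin')
write_bin(pairs[int(0.95 * n):], 'test.bin')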
@abisee Did you write code for generating the anonymized version of the CNN-DailyMail summarization dataset?
I get 287,113 pairs in 'train.bin', which is not equal to the 287,226 mentioned in Get To The Point: Summarization with Pointer-Generator Networks. Does anyone have the same problem?
For simplicity, we originally provided code to produce a single train.bin, val.bin and test.bin file for the data. However, in our experiments for the paper we split the data into several chunks, each containing 1000 examples (i.e. train_000.bin, train_001.bin, ..., train_287.bin). In the interest of reproducibility, make_datafiles.py has now been updated to also produce chunked data that's saved in finished_files/chunked, and the README for the Tensorflow code now gives instructions for chunked data. If you've already run make_datafiles.py to obtain train.bin/val.bin/test.bin files, then just run
import make_datafiles
make_datafiles.chunk_all()
in Python, from the cnn-dailymail directory, to get the chunked files (it takes a few seconds).
To use your chunked datafiles with the Tensorflow code, set e.g.
--data_path=/path/to/chunked/train_*
You don't have to restart training from the beginning to switch to the chunked datafiles.
Why does it matter?
The multi-threaded batcher code is originally from the TextSum project. The idea is that each input thread calls example_generator, which randomizes the chunks and then reads from the chunks in that order. Thus 16 threads concurrently fill the input queue with examples drawn from different, randomly-chosen chunks. If your data is in a single file, however, then the multi-threaded batcher will result in 16 threads concurrently filling the input queue with examples drawn in order from the same single .bin file. Firstly, this might produce batches containing more duplicate examples than we want. Secondly, reading through the dataset in order may produce different training results than reading through it in randomized chunks.
Alternatively, if you want to keep your data in a single file, you can make the batcher single-threaded by setting:
self._num_example_q_threads = 1 # num threads to fill example queue
self._num_batch_q_threads = 1 # num threads to fill batch queue
(From a speed point of view, the multi-threaded batcher is probably unnecessary for many systems anyway.)
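For reference, the chunking itself is simple: read records off the single .bin file and start a new output file every 1000 examples. A sketch of roughly what make_datafiles.chunk_all does for one split (details may differ slightly from the actual source):

import os
import struct

CHUNK_SIZE = 1000  # examples per chunk

def chunk_file(in_file, chunks_dir, set_name):
    # splits e.g. train.bin into train_000.bin, train_001.bin, ...
    reader = open(in_file, 'rb')
    chunk, finished = 0, False
    while not finished:
        chunk_fname = os.path.join(chunks_dir, '%s_%03d.bin' % (set_name, chunk))
        with open(chunk_fname, 'wb') as writer:
            for _ in range(CHUNK_SIZE):
                len_bytes = reader.read(8)
                if not len_bytes:
                    finished = True
                    break
                str_len = struct.unpack('q', len_bytes)[0]
                example_str = struct.unpack('%ds' % str_len, reader.read(str_len))[0]
                # copy the record verbatim: length prefix, then payload
                writer.write(struct.pack('q', str_len))
                writer.write(struct.pack('%ds' % str_len, example_str))
        chunk += 1
    reader.close()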
Why is, for example, "0800 555 111 356" included in the generated vocab file? This entry is at line 23163. Or is it just me who has this problem?
>>> with open('data/cnn-dailymail/vocab', 'r') as vocab_f:
... for line in vocab_f:
... pieces = line.split()
... if len(pieces) != 2:
... print(pieces)
...
['0800', '555', '111', '356']
['1800', '333', '000', '139']
['2', '1/2', '124']
['3', '1/2', '86']
['1', '1/2', '68']
['0800', '555111', '59']
['4', '1/2', '47']
['0844', '472', '4157', '41']
['5', '1/2', '39']
['7', '1/2', '25']
['6', '1/2', '24']
['9', '1/2', '21']
['020', '7629', '9161', '19']
['8', '1/2', '19']
['0300', '123', '8018', '19']
['0808', '800', '5000', '19']
['11', '1/2', '18']
['0844', '493', '0787', '14']
['1300', '659', '467', '13']
['16', '1/2', '12']
['13', '1/2', '12']
['1800', '273', '8255', '11']
['18', '1/2', '10']
['0300', '1234', '999', '10']
['0845', '790', '9090', '10']
['0845', '634', '1414', '9']
['14', '1/2', '8']
['0207', '938', '6364', '8']
['0207', '938', '6683', '8']
['310', '642', '2317', '7']
['at', 'uefa.com', '7']
['0207', '386', '0868', '7']
['0808', '800', '2222', '6']
['0800', '789', '321', '6']
['0800', '854', '440', '6']
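Not just you. A handful of "words" in the vocab contain internal whitespace, most likely because the source text has non-breaking spaces or similar characters inside tokens such as phone numbers; PTBTokenizer leaves them intact, and Python 3's line.split() then splits on them. If you just need to parse the file, split from the right so that only the final field is treated as the count. A hedged sketch:

def load_vocab(path):
    # parses a 'word count' vocab file, tolerating odd whitespace
    # inside the word itself
    vocab = {}
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip('\n').rsplit(None, 1)
            if len(parts) != 2:
                continue  # skip malformed lines
            word, count = parts
            vocab[word] = int(count)
    return vocab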
I ran make_datafiles.py, but it hits an error:
Preparing to tokenize /home/ztl/Downloads/cnn_stories/cnn/stories to cnn_stories_tokenized...
Making list of files to tokenize...
Tokenizing 92579 files in /home/ztl/Downloads/cnn_stories/cnn/stories and saving in cnn_stories_tokenized...
Error: Could not find or load main class edu.stanford.nlp.process.PTBTokenizer
Caused by: java.lang.ClassNotFoundException: edu.stanford.nlp.process.PTBTokenizer
Stanford CoreNLP Tokenizer has finished.
Traceback (most recent call last):
However, I can run echo "Please tokenize this text." | java edu.stanford.nlp.process.PTBTokenizer from the root directory.
I don't know how to deal with this. Thanks a lot.
Thanks for the code, abisee.
I have tried to use the new Stanford CoreNLP 3.8, but it seems it can only tokenize the first data file, i.e. the first line in mapping.txt, and thus:
Exception in thread "main" java.io.IOException: Stream closed
at java.io.BufferedWriter.ensureOpen(BufferedWriter.java:116)
at java.io.BufferedWriter.write(BufferedWriter.java:221)
at java.io.Writer.write(Writer.java:157)
at edu.stanford.nlp.process.PTBTokenizer.tokReader(PTBTokenizer.java:505)
at edu.stanford.nlp.process.PTBTokenizer.tok(PTBTokenizer.java:450)
at edu.stanford.nlp.process.PTBTokenizer.main(PTBTokenizer.java:813)
CoreNLP 3.7 works perfectly though. Just thought you might want to know this.