dongjun-lee / text-summarization-tensorflow

Tensorflow seq2seq Implementation of Text Summarization.

License: MIT License

Python 100.00%

tensorflow text-summarization seq2seq encoder-decoder

text-summarization-tensorflow's Introduction

tensorflow-text-summarization

Simple Tensorflow implementation of text summarization using seq2seq library.

Model

Encoder-Decoder model with attention mechanism.

Word Embedding

Used Glove pre-trained vectors to initialize word embedding.

Encoder

Used LSTM cell with stack_bidirectional_dynamic_rnn.

Decoder

Used LSTM BasicDecoder for training, and BeamSearchDecoder for inference.

Attention Mechanism

Used BahdanauAttention with weight normalization.

Requirements

Python 3
Tensorflow (>=1.8.0)
pip install -r requirements.txt

Usage

Prepare data

The dataset is available at harvardnlp/sent-summary. Place the summary.tar.gz file in the project root directory, then run

$ python prep_data.py

To use Glove pre-trained embedding, download it via

$ python prep_data.py --glove

Train

We used sumdata/train/train.article.txt and sumdata/train/train.title.txt for training data. To train the model, use

$ python train.py

To use Glove pre-trained vectors as initial embedding, use

$ python train.py --glove

Additional Hyperparameters

$ python train.py -h
usage: train.py [-h] [--num_hidden NUM_HIDDEN] [--num_layers NUM_LAYERS]
                [--beam_width BEAM_WIDTH] [--glove]
                [--embedding_size EMBEDDING_SIZE]
                [--learning_rate LEARNING_RATE] [--batch_size BATCH_SIZE]
                [--num_epochs NUM_EPOCHS] [--keep_prob KEEP_PROB] [--toy]

optional arguments:
  -h, --help            show this help message and exit
  --num_hidden NUM_HIDDEN
                        Network size.
  --num_layers NUM_LAYERS
                        Network depth.
  --beam_width BEAM_WIDTH
                        Beam width for beam search decoder.
  --glove               Use glove as initial word embedding.
  --embedding_size EMBEDDING_SIZE
                        Word embedding size.
  --learning_rate LEARNING_RATE
                        Learning rate.
  --batch_size BATCH_SIZE
                        Batch size.
  --num_epochs NUM_EPOCHS
                        Number of epochs.
  --keep_prob KEEP_PROB
                        Dropout keep prob.
  --toy                 Use only 5K samples of data

Test

Generate summary of each article in sumdata/train/valid.article.filter.txt by

$ python test.py

This writes the generated summaries to result.txt. ROUGE scores between result.txt and sumdata/train/valid.title.filter.txt can then be computed with pltrdy/files2rouge.
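For a quick sanity check before running the full ROUGE toolkit, unigram overlap can be computed directly. The sketch below is a simplified ROUGE-1 (whitespace tokens, no stemming), so its numbers are not expected to match files2rouge exactly. The function name rouge_1 is illustrative and not part of this repository.

```python
from collections import Counter


def rouge_1(candidate, reference):
    """Simplified ROUGE-1: clipped unigram overlap on whitespace tokens."""
    cand_counts = Counter(candidate.split())
    ref_counts = Counter(reference.split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# One of the sample pairs listed in this README (model output vs. actual title).
scores = rouge_1("hong kong shares open higher",
                 "hong kong shares open higher as rate worries ease")
print(scores)
```

Averaging these scores over all lines of result.txt gives a rough picture; the reported metric should still come from the ROUGE toolkit.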

Sample Summary Output

"general motors corp. said wednesday its us sales fell ##.# percent in december and four percent in #### with the biggest losses coming from passenger car sales ."
> Model output: gm us sales down # percent in december
> Actual title: gm december sales fall # percent

"japanese share prices rose #.## percent thursday to <unk> highest closing high for more than five years as fresh gains on wall street fanned upbeat investor sentiment , dealers said ."
> Model output:  tokyo shares close # percent higher
> Actual title: tokyo shares close up # percent

"hong kong share prices opened #.## percent higher thursday on follow-through interest in properties after wednesday 's sharp gains on abating interest rate worries , dealers said ."
> Model output: hong kong shares open higher
> Actual title: hong kong shares open higher as rate worries ease

"the dollar regained some lost ground in asian trade thursday in what was seen as a largely technical rebound after weakness prompted by expectations of a shift in us interest rate policy , dealers said ."
> Model output: dollar stable in asian trade
> Actual title: dollar regains ground in asian trade

"the final results of iraq 's december general elections are due within the next four days , a member of the iraqi electoral commission said on thursday ."
> Model output: iraqi election results due in next four days
> Actual title: iraqi election final results out within four days

"microsoft chairman bill gates late wednesday unveiled his vision of the digital lifestyle , outlining the latest version of his windows operating system to be launched later this year ."
> Model output: bill gates unveils new technology vision
> Actual title: gates unveils microsoft 's vision of digital lifestyle

Pre-trained Model

To test with the pre-trained model, download pre_trained.zip and place it in the project root directory. Then,

$ unzip pre_trained.zip
$ python test.py

text-summarization-tensorflow's Issues

For training the model

For how many epochs should I train the model? I used 5,000 articles for training and 2,000 for testing. After 400 epochs I am still not getting proper summaries (they are not related to the articles).

Drive link not working

BasicLSTMCell->LSTMCell

Hi,

Am I the only one having trouble with BasicLSTMCell?
I resolved the error by using LSTMCell(num_units=self.num_hidden,name='basic_lstm_cell'). But is it correct to use num_units=self.num_hidden?

Model question

Hello, it is very beneficial to read your article. How should I change it if I want to use RNN encoder instead of BiRNN encoder? Thank you!

    with tf.name_scope("encoder"):
        fw_cells = [self.cell(self.num_hidden) for _ in range(self.num_layers)]
        bw_cells = [self.cell(self.num_hidden) for _ in range(self.num_layers)]
        fw_cells = [rnn.DropoutWrapper(cell) for cell in fw_cells]
        bw_cells = [rnn.DropoutWrapper(cell) for cell in bw_cells]

        encoder_outputs, encoder_state_fw, encoder_state_bw = tf.contrib.rnn.stack_bidirectional_dynamic_rnn(
            fw_cells, bw_cells, self.encoder_emb_inp,
            sequence_length=self.X_len, time_major=True, dtype=tf.float32)
        self.encoder_output = tf.concat(encoder_outputs, 2)
        encoder_state_c = tf.concat((encoder_state_fw[0].c, encoder_state_bw[0].c), 1)
        encoder_state_h = tf.concat((encoder_state_fw[0].h, encoder_state_bw[0].h), 1)
        self.encoder_state = rnn.LSTMStateTuple(c=encoder_state_c, h=encoder_state_h)

How can this project be made to support Chinese? Thanks.

Calculate loss during testing

How do you calculate loss during testing?

Any help would be greatly appreciated!! Thanks

replace unk

Please give me an idea of how to replace unk using the source document.

Dataset used on the Pre-trained model

Could you please tell me which dataset this pre-trained model was trained on? I am a little confused.

Waiting too long for the rouge assessment in Colab

Hi, I ran into this issue when using files2rouge in Colab. The ROUGE assessment has been running for more than 4 hours without any further feedback. Does anyone know how to fix this? Many thanks.

Input text for summarizing

What if I wanted to summarize larger text data using this summarizer? How can I do that?
And will it give larger summaries rather than just headlines?

missing positional argument: 'state'

Hi,

When I run the code, I get the following error:

Loading Glove vectors...
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<timed exec> in <module>()

<ipython-input-168-5fd352f14db2> in __init__(self, reversed_dict, article_max_len, summary_max_len, args, forward_only)
     35 
     36         with tf.name_scope("encoder"):
---> 37             fw_cells = [self.cell(self.num_hidden) for _ in range(self.num_layers)]
     38             bw_cells = [self.cell(self.num_hidden) for _ in range(self.num_layers)]
     39             fw_cells = [rnn.DropoutWrapper(cell) for cell in fw_cells]

<ipython-input-168-5fd352f14db2> in <listcomp>(.0)
     35 
     36         with tf.name_scope("encoder"):
---> 37             fw_cells = [self.cell(self.num_hidden) for _ in range(self.num_layers)]
     38             bw_cells = [self.cell(self.num_hidden) for _ in range(self.num_layers)]
     39             fw_cells = [rnn.DropoutWrapper(cell) for cell in fw_cells]

TypeError: __call__() missing 1 required positional argument: 'state'

Unfortunately, I can't resolve it.

ModuleNotFoundError: No module named 'tensorflow.contrib'

It seems that this code was developed against older version.
Apparently in tensorflow 2.4.0 there is no tensorflow.contrib module.
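tf.contrib was removed in TensorFlow 2.0, and the README asks for TensorFlow >= 1.8.0, so the code needs a 1.x release. Below is a minimal sketch of a version guard, assuming plain dotted version strings. The helper name is hypothetical and not part of the repository.

```python
def is_supported_tf_version(version):
    """Return True for versions in [1.8, 2.0).

    Lower bound: the README's stated minimum. Upper bound: tf.contrib,
    which the model code imports, no longer exists from 2.0 onward.
    """
    parts = version.split(".")
    major, minor = int(parts[0]), int(parts[1])
    return (1, 8) <= (major, minor) < (2, 0)


for v in ("1.8.0", "1.15.5", "2.4.0"):
    print(v, is_supported_tf_version(v))
```

In a real script the argument would be tf.__version__, checked before the model is built.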

What is the version number of TensorFlow and NumPy used in this code?

What is the version number of TensorFlow and NumPy used in this code? I got a lot of errors when I tested the code.
Thanks a lot!

Replacing <unk>

Please anyone give me an idea about how to replace "unk" in the generated summary with the actual word from the article(source).
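One simple heuristic, sketched below under stated assumptions, is to substitute each unknown token with the next article word that is missing from the vocabulary. This is not something the repository implements. A more faithful approach would pick the source position with the highest attention weight, which this sketch does not have access to. The function name and the toy data are made up for illustration.

```python
def replace_unk(summary_tokens, article_tokens, vocab, unk="<unk>"):
    """Replace each unk token with the next out-of-vocabulary article word.

    Heuristic only: it assumes unknown words appear in the summary in the
    same order as in the article, which is not guaranteed.
    """
    oov_words = iter([w for w in article_tokens if w not in vocab])
    result = []
    for token in summary_tokens:
        if token == unk:
            # Fall back to the unk token itself when no OOV word is left.
            token = next(oov_words, unk)
        result.append(token)
    return result


# Toy data, not real model output.
vocab = {"shares", "close", "higher", "in", "<unk>"}
article = "nikkei shares close higher in tokyo".split()
summary = "<unk> shares close higher".split()
restored = replace_unk(summary, article, vocab)
print(" ".join(restored))
```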

change the way of embedding

Hello, excuse me. I want to use this on Chinese text, and instead of GloVe I want to use TenCent_chinses_embedding for the embeddings. Can you advise me on how to change the code? Thanks very much!

The training and testing loss has been declining. I tested the model from the 85th epoch, but the results are not very good. How many epochs are needed to get better results?

Error while trying to run test.py on pre-trained model

Hi,
I'm getting a KeyError: -1 at line 36 of test.py when trying to run the pre-trained model.

  File "test.py", line 36, in <module>
    prediction_output = list(map(lambda x: [reversed_dict[y] for y in x], prediction[:, 0, :]))
KeyError: -1

May I know why it is throwing this error?

How long did it take you to train the model?

How long did it take you to train the model, and what are the specifications of your machine?

Summarized Text is having no relevance with the input text

The summary obtained in result.txt has no relevance to the input text in valid.article.filter.txt.
Is there anything else that I need to provide for this summarizer?

what if we use word2vec rather than GloVe

I am new to TensorFlow and NLP.
Will it make any difference if we use word2vec rather than GloVe, and can we write the same model in Keras?

Using previous trained models

Hi.

Your current code always trains from the beginning. But TensorFlow allows training to continue from a saved model or checkpoint, which is really useful.

I added a few lines to my train.py. I don't think this deserves a pull request, so I am just leaving the lines here for later use, adaptation, inspiration, or ignoring.

Add an argument:

    parser.add_argument("--reuse", action="store_true", help="Reuse previously trained model")

Check the checkpoint and the argument. After

    if not os.path.exists("saved_model"):
        os.mkdir("saved_model")

add

    else:
        if args.reuse:
            old_model_checkpoint_path = open('saved_model/checkpoint', 'r')
            old_model_checkpoint_path = "".join(["saved_model/", old_model_checkpoint_path.read().splitlines()[0].split('"')[1]])

Restore the model to the session. Near saver = tf.train.Saver(tf.global_variables()), add

    if 'old_model_checkpoint_path' in globals():
        print("Continuing from previous trained model ", old_model_checkpoint_path, "...")
        saver.restore(sess, old_model_checkpoint_path)
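The checkpoint lookup in this suggestion can be written a little more defensively. The sketch below parses the same saved_model/checkpoint index file, closes the file handle, and returns None when nothing has been saved yet. The helper name latest_checkpoint_path is hypothetical, and the demo writes a throwaway index file rather than using a real training run.

```python
import os
import re
import tempfile


def latest_checkpoint_path(model_dir="saved_model"):
    """Return the path recorded under model_checkpoint_path, or None."""
    index_file = os.path.join(model_dir, "checkpoint")
    if not os.path.exists(index_file):
        return None
    with open(index_file, "r") as f:  # the handle is closed on exit
        for line in f:
            match = re.match(r'\s*model_checkpoint_path:\s*"(.*)"', line)
            if match:
                return os.path.join(model_dir, match.group(1))
    return None


# Demo with a throwaway index file in the format TensorFlow writes.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "checkpoint"), "w") as f:
        f.write('model_checkpoint_path: "model.ckpt-100"\n')
        f.write('all_model_checkpoint_paths: "model.ckpt-100"\n')
    found = latest_checkpoint_path(tmp)
print(found)
```

In TensorFlow 1.x, tf.train.latest_checkpoint("saved_model") should return the same kind of path without manual parsing, and the result can be passed to saver.restore as in the lines above.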

do you have preprocessing code for cnndm.tar.gz?

error test file

When I tried to run the test file , it gives keyerror. Please help me.

Final Output

I have tried to run the files, but after testing, the final output file result.txt is empty; nothing has been written to it.

License

Hi,could you please add a license under this repo? Thank you.

Calculate accuracy

How do you calculate accuracy at the end of each epoch?

Any help would be greatly appreciated!! Thanks

why there is KeyError: -1 in prediction

When the code runs here:

prediction_output = [[reversed_dict[y] for y in x] for x in prediction[:, 0, :]]

    with open("result.txt", "a") as f:
        for line in prediction_output:
            summary = list()
            for word in line:
                if word == "</s>":
                    break
                if word not in summary:
                    summary.append(word)
            print(" ".join(summary), file=f)

print('Summaries are saved to "result.txt"...')

the following error occurs:

KeyError                                  Traceback (most recent call last)
in <module>
     27 print('prediction:', prediction.shape)
     28
---> 29 prediction_output = [[reversed_dict[y] for y in x] for x in prediction[:, 0, :]]
     30
     31

in <listcomp>(.0)
     27 print('prediction:', prediction.shape)
     28
---> 29 prediction_output = [[reversed_dict[y] for y in x] for x in prediction[:, 0, :]]
     30
     31

KeyError: -1
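The traceback shows the lookup failing on id -1, so the prediction array contains -1 and reversed_dict has no such key. A likely explanation, not confirmed in this thread, is that the beam search decoder pads finished sequences with -1 in some TensorFlow versions. Under that assumption, a minimal workaround is to skip ids that are missing from the dictionary. The function name and the toy dictionary below are hypothetical.

```python
def ids_to_words(id_rows, reversed_dict, end_token="</s>"):
    """Map rows of token ids to words, skipping ids missing from the dictionary."""
    sentences = []
    for row in id_rows:
        words = []
        for token_id in row:
            word = reversed_dict.get(int(token_id))
            if word is None:  # e.g. -1 used as padding
                continue
            if word == end_token:
                break
            words.append(word)
        sentences.append(words)
    return sentences


# Toy dictionary and ids; the repository's real id assignment may differ.
toy_dict = {0: "<padding>", 1: "<unk>", 2: "<s>", 3: "</s>", 4: "tokyo", 5: "shares"}
rows = [[4, 5, 3, -1, -1], [5, -1, -1, -1, -1]]
decoded = ids_to_words(rows, toy_dict)
print(decoded)
```

With this in place, the list comprehension in test.py would be replaced by a call such as ids_to_words(prediction[:, 0, :], reversed_dict).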

setting an array element with a sequence.

hi,

When I run train.py, at _, step, loss = sess.run([model.update, model.global_step, model.loss],feed_dict=train_feed_dict) I get the error ValueError: setting an array element with a sequence.
Can you help me fix this?
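This NumPy error typically appears when a nested list whose rows have different lengths is converted to an array. One thing to check (an assumption, since the feed data is not shown) is whether every row in the batch has the same length. The sketch below truncates or pads rows to a fixed length. The helper name pad_batch and the padding id 0 are illustrative choices, not taken from the repository.

```python
def pad_batch(sequences, max_len, pad_id=0):
    """Truncate or right-pad every sequence to exactly max_len ids."""
    padded = []
    for seq in sequences:
        row = list(seq[:max_len])               # truncate long rows
        row += [pad_id] * (max_len - len(row))  # right-pad short rows
        padded.append(row)
    return padded


batch = pad_batch([[5, 6, 7], [8], [1, 2, 3, 4, 5]], max_len=4)
print(batch)
```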

How to increase the article size ?

As of now it only reads small articles. How can I increase the article size to handle larger ones?

About base paper

What is the base paper for this model? Can you please, give me your research paper link?

AttributeError: 'Model' object has no attribute 'update'

When I change the parameter forward_only to True and then run train.py, I get this error. I don't know why. Can you help me?

Request for Model file

Hi,

Does anyone happen to have the model file saved? Training is taking a very long time to run, and I currently don't have a GPU on my device.

Predicting own sentences

How does one supply own sentences to this model to get summarized?
As far as I am seeing, the prediction uses the title as well as the sentence in the test set to create a new sentence. Is there any way to bypass the usage of titles and just feed sentences?

Number of output (in result.txt) is much smaller than number of input data.

I am a newbie in deep learning. While self-studying seq2seq model, I try to modify this code so that it can be applicable to another language. However, I faced one critical issue that the number of my output data (in result.txt) generated is much shorter than the number of input data.

Here are some differences in my code.

tensorflow version 1.10.
pretrained fasttext embedding instead of glove.

I first want to make the overall process work then optimize or train the model with more data.
But, I got stuck in this issue for a couple days.
Wonder if you have any idea what can possibly cause above issue.

Run on GPUs

How can I run it on GPU?

Model is not training from previous saved model.

I have a saved model from previous training but when I start my training it starts from the beginning.

on CNN Data set

Could you let me know what the results would look like when you use the CNN dataset instead? If there are pre-trained models for it, please let me know.

How to use Beam Search Decoder?

It seems that to use BeamSearchDecoder forward_only must be set to True
model = Model(reversed_dict, article_max_len, summary_max_len, config, forward_only=True)

But loss and update are defined only when forward_only=False
So I get this error

 _, step, loss = sess.run([model.update, model.global_step, model.loss], feed_dict=train_feed_dict)
AttributeError: 'Model' object has no attribute 'update'

What will be the loss and update when forward_only=True
Any help would be greatly appreciated! Thank you!

Can you please provide a sample data set so that code can be run without downloading ?

I have limited compute power on my laptop. It would be really useful for studying the code if you could provide a small dataset that I can use to run and debug the code locally.
Thanks
Shakti

{ with open("args.pickle", "rb") as f: args = pickle.load(f) } What is "args.pickle"? Is that a dataset? Please help me. Thanks.

Training times

Would anyone care to comment on training times for this model? CPU/GPU/TPU?
I'm thinking of creating a Google Cloud ML Engine package for this (for both training and serving).

Extractive text summariser

I am doing research that requires extractive summarization. Is it possible to convert this to an extractive summarizer?

zipfile.BadZipFile: File is not a zip file

Hello when I ran my code on tensorflow I am getting that error. Can you help me please?

> Traceback (most recent call last):
>   File "model.py", line 3, in <module>
>     from utils import get_init_embedding
>   File "../Desktop/text-summarization-tensorflow-master/utils.py", line 6, in <module>
>     from gensim.models.keyedvectors import KeyedVectors
>   File "../anaconda3/envs/tf/lib/python3.6/site-packages/gensim/__init__.py", line 6, in <module>
>     from gensim import parsing, matutils, interfaces, corpora, models, similarities, summarization, utils  # noqa:F401
>   File "../anaconda3/envs/tf/lib/python3.6/site-packages/gensim/corpora/__init__.py", line 14, in <module>
>     from .wikicorpus import WikiCorpus  # noqa:F401
>   File "../anaconda3/envs/tf/lib/python3.6/site-packages/gensim/corpora/wikicorpus.py", line 471, in <module>
>     class WikiCorpus(TextCorpus):
>   File "../anaconda3/envs/tf/lib/python3.6/site-packages/gensim/corpora/wikicorpus.py", line 504, in WikiCorpus
>     def __init__(self, fname, processes=None, lemmatize=utils.has_pattern(), dictionary=None,
>   File "../anaconda3/envs/tf/lib/python3.6/site-packages/gensim/utils.py", line 1529, in has_pattern
>     from pattern.en import parse  # noqa:F401
>   File "../anaconda3/envs/tf/lib/python3.6/site-packages/pattern/text/en/__init__.py", line 61, in <module>
>     from pattern.text.en.inflect import (
>   File "../anaconda3/envs/tf/lib/python3.6/site-packages/pattern/text/en/__init__.py", line 80, in <module>
>     from pattern.text.en import wordnet
>   File "../anaconda3/envs/tf/lib/python3.6/site-packages/pattern/text/en/wordnet/__init__.py", line 57, in <module>
>     nltk.data.find("corpora/" + token)
>   File "../anaconda3/envs/tf/lib/python3.6/site-packages/nltk/data.py", line 653, in find
>     return find(modified_name, paths)
>   File "../anaconda3/envs/tf/lib/python3.6/site-packages/nltk/data.py", line 639, in find
>     return ZipFilePathPointer(p, zipentry)
>   File "../anaconda3/envs/tf/lib/python3.6/site-packages/nltk/compat.py", line 221, in _decorator
>     return init_func(*args, **kwargs)
>   File "../anaconda3/envs/tf/lib/python3.6/site-packages/nltk/data.py", line 486, in __init__
>     zipfile = OpenOnDemandZipFile(os.path.abspath(zipfile))
>   File "../anaconda3/envs/tf/lib/python3.6/site-packages/nltk/compat.py", line 221, in _decorator
>     return init_func(*args, **kwargs)
>   File "../anaconda3/envs/tf/lib/python3.6/site-packages/nltk/data.py", line 1012, in __init__
>     zipfile.ZipFile.__init__(self, filename)
>   File "../anaconda3/envs/tf/lib/python3.6/zipfile.py", line 1108, in __init__
>     self._RealGetContents()
>   File "../anaconda3/envs/tf/lib/python3.6/zipfile.py", line 1175, in _RealGetContents
>     raise BadZipFile("File is not a zip file")
> zipfile.BadZipFile: File is not a zip file

TRAIN ON GPU

How do we train this on gpu?

Could anyone provide pre-trained model?

Training has run for a whole day on my MBP and is still unfinished. I have no idea how long it will take.

No module named "scipy.sparse" running python train.py

anyone with the same issue?

  File "C:\ProgramData\Anaconda3\lib\site-packages\gensim\__init__.py", line 6, in <module>
    from gensim import parsing, matutils, interfaces, corpora, models, similarities, summarization, utils  # noqa:F401
  File "C:\ProgramData\Anaconda3\lib\site-packages\gensim\parsing\__init__.py", line 4, in <module>
    from .preprocessing import (remove_stopwords, strip_punctuation, strip_punctuation2,  # noqa:F401
  File "C:\ProgramData\Anaconda3\lib\site-packages\gensim\parsing\preprocessing.py", line 40, in <module>
    from gensim import utils
  File "C:\ProgramData\Anaconda3\lib\site-packages\gensim\utils.py", line 39, in <module>
    import scipy.sparse
ModuleNotFoundError: No module named 'scipy.sparse'

UnicodeDecodeError using the sample data

Hello,
I'm having some issues with preparing the data
Steps taken :

Clone repo
Unzip sample_data.zip
python train.py

This is the error I'm getting

Building dictionary...
Traceback (most recent call last):
  File "/Developer/Machine-Learning/Shared-folder/text-summarization-tensorflow/train.py", line 35, in <module>
    word_dict, reversed_dict, article_max_len, summary_max_len = build_dict("train", args.toy)
  File "/Developer/Machine-Learning/Shared-folder/text-summarization-tensorflow/utils.py", line 32, in build_dict
    train_article_list = get_text_list(train_article_path, toy)
  File "/Developer/Machine-Learning/Shared-folder/text-summarization-tensorflow/utils.py", line 25, in get_text_list
    return list(map(lambda x: clean_str(x.strip()), f.readlines()))
  File "/anaconda3/envs/deeplearning/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 5241: ordinal not in range(128)
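The traceback ends in encodings/ascii.py, which suggests the file is opened without an explicit encoding and the environment's default is ASCII. Byte 0xc2 is the lead byte of a two-byte UTF-8 sequence, so the data is probably UTF-8. A minimal sketch of reading with an explicit encoding follows. The helper name read_lines_utf8 is hypothetical, and the repository's get_text_list would additionally apply clean_str to each line.

```python
import os
import tempfile


def read_lines_utf8(path):
    """Read stripped lines, decoding as UTF-8 regardless of the system locale."""
    with open(path, "r", encoding="utf-8") as f:
        return [line.strip() for line in f]


# Demo: 0xc2 0xa3 is the UTF-8 encoding of U+00A3 (the pound sign).
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.txt")
    with open(sample, "wb") as f:
        f.write(b"price rose to \xc2\xa3 5\n")
    lines = read_lines_utf8(sample)
print(lines)
```

Passing encoding="utf-8" to the open call in utils.py, or running with a UTF-8 locale, are the two usual ways to apply this.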

Freeze a model to serve within API

Hi.

I successfully tested a portuguese corpus I prepared and trained ( change line in utils.py for word in word_tokenize(sentence, language='portuguese'): ).

I'd like to have a frozen model in a single .pb file in order to serve within an API. I tried several approaches, like this: https://blog.metaflow.fr/tensorflow-how-to-freeze-a-model-and-serve-it-with-a-python-api-d4f3596b3adc

But unsuccessfully.

Would you consider providing some method to export a saved model? Or point me to the right direction?

Thanks!