text-summarization-tensorflow's Introduction

tensorflow-text-summarization

Simple Tensorflow implementation of text summarization using seq2seq library.

Model

Encoder-Decoder model with attention mechanism.

Word Embedding

Used Glove pre-trained vectors to initialize word embedding.

Encoder

Used LSTM cell with stack_bidirectional_dynamic_rnn.

Decoder

Used LSTM BasicDecoder for training, and BeamSearchDecoder for inference.

Attention Mechanism

Used BahdanauAttention with weight normalization.
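
As a rough illustration of what additive (Bahdanau) attention with weight normalization computes at a single decoder step, here is a minimal pure-Python sketch. It is not the TensorFlow implementation used by this project; the function name `additive_attention`, the parameter names (`W_q`, `W_k`, `v`, `g`, `b`) and the toy values are assumptions made for this example.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def additive_attention(query, keys, W_q, W_k, v, g=1.0, b=None):
    """Return (weights, context) for one decoder step.

    score_i = v_hat . tanh(W_k key_i + W_q query + b), with v_hat = g * v / ||v||
    (the weight-normalized form of the score vector).
    """
    b = b if b is not None else [0.0] * len(v)
    v_norm = math.sqrt(sum(x * x for x in v))
    v_hat = [g * x / v_norm for x in v]

    q_proj = matvec(W_q, query)
    scores = []
    for key in keys:
        k_proj = matvec(W_k, key)
        hidden = [math.tanh(k + q + bias) for k, q, bias in zip(k_proj, q_proj, b)]
        scores.append(sum(vh * h for vh, h in zip(v_hat, hidden)))

    # Numerically stable softmax over the encoder positions.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Context vector: attention-weighted sum of the encoder states.
    context = [sum(w * key[d] for w, key in zip(weights, keys))
               for d in range(len(keys[0]))]
    return weights, context


# Toy example (hypothetical values): 3 encoder states, 2 attention units.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [0.5, -0.5]
W_q = [[1.0, 0.0], [0.0, 1.0]]
W_k = [[1.0, 0.0], [0.0, 1.0]]
v = [3.0, 4.0]

weights, context = additive_attention(query, keys, W_q, W_k, v)
print(weights, context)
```

Because `v` is rescaled to length `g`, multiplying `v` by a constant leaves the attention weights unchanged; only `g` controls the sharpness of the scores.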

Requirements

  • Python 3
  • Tensorflow (>=1.8.0)
  • pip install -r requirements.txt

Usage

Prepare data

The dataset is available at harvardnlp/sent-summary. Place the summary.tar.gz file in the project root directory. Then,

$ python prep_data.py

To use Glove pre-trained embeddings, download them via

$ python prep_data.py --glove

Train

We used sumdata/train/train.article.txt and sumdata/train/train.title.txt for training data. To train the model, use

$ python train.py

To use Glove pre-trained vectors as initial embedding, use

$ python train.py --glove

Additional Hyperparameters

$ python train.py -h
usage: train.py [-h] [--num_hidden NUM_HIDDEN] [--num_layers NUM_LAYERS]
                [--beam_width BEAM_WIDTH] [--glove]
                [--embedding_size EMBEDDING_SIZE]
                [--learning_rate LEARNING_RATE] [--batch_size BATCH_SIZE]
                [--num_epochs NUM_EPOCHS] [--keep_prob KEEP_PROB] [--toy]

optional arguments:
  -h, --help            show this help message and exit
  --num_hidden NUM_HIDDEN
                        Network size.
  --num_layers NUM_LAYERS
                        Network depth.
  --beam_width BEAM_WIDTH
                        Beam width for beam search decoder.
  --glove               Use glove as initial word embedding.
  --embedding_size EMBEDDING_SIZE
                        Word embedding size.
  --learning_rate LEARNING_RATE
                        Learning rate.
  --batch_size BATCH_SIZE
                        Batch size.
  --num_epochs NUM_EPOCHS
                        Number of epochs.
  --keep_prob KEEP_PROB
                        Dropout keep prob.
  --toy                 Use only 5K samples of data

Test

Generate a summary for each article in sumdata/train/valid.article.filter.txt by running

$ python test.py

This writes the generated summaries to result.txt. Compute ROUGE metrics between result.txt and sumdata/train/valid.title.filter.txt using pltrdy/files2rouge.
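
For a quick sanity check on a single summary, unigram overlap can be computed by hand. The sketch below is a simplified ROUGE-1 on whitespace tokens, not a substitute for files2rouge, and its numbers will generally differ from the official scorer; the function name `rouge_1` is an assumption made for this example.

```python
from collections import Counter


def rouge_1(candidate, reference):
    """Return (precision, recall, f1) based on clipped unigram overlap."""
    cand = Counter(candidate.split())
    ref = Counter(reference.split())
    overlap = sum((cand & ref).values())  # clipped counts of shared tokens
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# One of the sample pairs from this README.
p, r, f = rouge_1(
    "hong kong shares open higher",
    "hong kong shares open higher as rate worries ease",
)
print(p, r, f)
```

Here all five output tokens appear in the nine-token reference, so precision is 1 and recall is 5/9.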

Sample Summary Output

"general motors corp. said wednesday its us sales fell ##.# percent in december and four percent in #### with the biggest losses coming from passenger car sales ."
> Model output: gm us sales down # percent in december
> Actual title: gm december sales fall # percent

"japanese share prices rose #.## percent thursday to <unk> highest closing high for more than five years as fresh gains on wall street fanned upbeat investor sentiment , dealers said ."
> Model output: tokyo shares close # percent higher
> Actual title: tokyo shares close up # percent

"hong kong share prices opened #.## percent higher thursday on follow-through interest in properties after wednesday 's sharp gains on abating interest rate worries , dealers said ."
> Model output: hong kong shares open higher
> Actual title: hong kong shares open higher as rate worries ease

"the dollar regained some lost ground in asian trade thursday in what was seen as a largely technical rebound after weakness prompted by expectations of a shift in us interest rate policy , dealers said ."
> Model output: dollar stable in asian trade
> Actual title: dollar regains ground in asian trade

"the final results of iraq 's december general elections are due within the next four days , a member of the iraqi electoral commission said on thursday ."
> Model output: iraqi election results due in next four days
> Actual title: iraqi election final results out within four days

"microsoft chairman bill gates late wednesday unveiled his vision of the digital lifestyle , outlining the latest version of his windows operating system to be launched later this year ."
> Model output: bill gates unveils new technology vision
> Actual title: gates unveils microsoft 's vision of digital lifestyle

Pre-trained Model

To test with the pre-trained model, download pre_trained.zip and place it in the project root directory. Then,

$ unzip pre_trained.zip
$ python test.py

text-summarization-tensorflow's People

Contributors

dongjun-lee, gogasca, wassimseif

text-summarization-tensorflow's Issues

For training the model

For how many epochs should I train the model? I have been using 5,000 articles as training data and 2,000 as test data. After 400 epochs I am still not getting proper summaries (they are unrelated to the articles).

BasicLSTMCell->LSTMCell

Hi,

Am I the only one having trouble with BasicLSTMCell?
I resolved the error by using LSTMCell(num_units=self.num_hidden,name='basic_lstm_cell'). But is it correct to use num_units=self.num_hidden?

Model question

Hello, it is very beneficial to read your article. How should I change it if I want to use RNN encoder instead of BiRNN encoder? Thank you!

    with tf.name_scope("encoder"):
        fw_cells = [self.cell(self.num_hidden) for _ in range(self.num_layers)]
        bw_cells = [self.cell(self.num_hidden) for _ in range(self.num_layers)]
        fw_cells = [rnn.DropoutWrapper(cell) for cell in fw_cells]
        bw_cells = [rnn.DropoutWrapper(cell) for cell in bw_cells]

        encoder_outputs, encoder_state_fw, encoder_state_bw = tf.contrib.rnn.stack_bidirectional_dynamic_rnn(
            fw_cells, bw_cells, self.encoder_emb_inp,
            sequence_length=self.X_len, time_major=True, dtype=tf.float32)
        self.encoder_output = tf.concat(encoder_outputs, 2)
        encoder_state_c = tf.concat((encoder_state_fw[0].c, encoder_state_bw[0].c), 1)
        encoder_state_h = tf.concat((encoder_state_fw[0].h, encoder_state_bw[0].h), 1)
        self.encoder_state = rnn.LSTMStateTuple(c=encoder_state_c, h=encoder_state_h)

replace unk

Please give me an idea of how to replace unk with words from the source document.

Waiting too long for the rouge assessment in Colab

Hi, I ran into this issue when using files2rouge in Colab. The ROUGE assessment has been running for more than 4 hours without any further feedback. Does anyone know how to fix this? Many thanks.

Input text for summarizing

What if I wanted to summarize larger text data using this summarizer? How can I do that?
And will it give larger summaries rather than just headlines?

missing positional argument: 'state'

Hi,

When I run the code, I get the following error:

Loading Glove vectors...
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<timed exec> in <module>()

<ipython-input-168-5fd352f14db2> in __init__(self, reversed_dict, article_max_len, summary_max_len, args, forward_only)
     35 
     36         with tf.name_scope("encoder"):
---> 37             fw_cells = [self.cell(self.num_hidden) for _ in range(self.num_layers)]
     38             bw_cells = [self.cell(self.num_hidden) for _ in range(self.num_layers)]
     39             fw_cells = [rnn.DropoutWrapper(cell) for cell in fw_cells]

<ipython-input-168-5fd352f14db2> in <listcomp>(.0)
     35 
     36         with tf.name_scope("encoder"):
---> 37             fw_cells = [self.cell(self.num_hidden) for _ in range(self.num_layers)]
     38             bw_cells = [self.cell(self.num_hidden) for _ in range(self.num_layers)]
     39             fw_cells = [rnn.DropoutWrapper(cell) for cell in fw_cells]

TypeError: __call__() missing 1 required positional argument: 'state'

Unfortunately, I can't resolve it.
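
One possible reading of this traceback, offered as an assumption rather than a confirmed diagnosis: `self.cell` holds a cell instance instead of a cell class, so `self.cell(self.num_hidden)` invokes the instance's `__call__(inputs, state)` rather than a constructor. The sketch below reproduces that pattern with a stand-in class; `ToyCell` is hypothetical and only mimics the call signature of an RNN cell.

```python
class ToyCell:
    """Stand-in for an RNN cell: constructed with a size, called with (inputs, state)."""

    def __init__(self, num_units):
        self.num_units = num_units

    def __call__(self, inputs, state):
        return inputs, state


# Storing the class works: each call constructs a fresh cell.
cell_factory = ToyCell
cells = [cell_factory(8) for _ in range(2)]

# Storing an instance fails: the call is routed to __call__, which also needs `state`.
cell_instance = ToyCell(8)
try:
    cell_instance(8)
except TypeError as err:
    message = str(err)

print(message)
```

If this is the cause, checking where `self.cell` is assigned (class versus instance) would be the first thing to look at.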

Replacing <unk>

Can anyone please give me an idea of how to replace "unk" in the generated summary with the actual word from the article (source)?
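
One common approach, sketched here under stated assumptions, is to copy the source word that received the highest attention weight at the step where `<unk>` was produced. This assumes per-step attention weights over the source tokens are available, which the current inference path may not expose; `replace_unk` and the toy data are hypothetical.

```python
def replace_unk(summary_tokens, source_tokens, alignments, unk="<unk>"):
    """Replace each unk token with the most-attended source token.

    alignments[t][i] is the attention weight on source position i at output step t.
    Only the first len(source_tokens) weights are considered, so padded positions
    are ignored.
    """
    result = []
    for token, weights in zip(summary_tokens, alignments):
        if token == unk:
            best = max(range(len(source_tokens)), key=lambda i: weights[i])
            token = source_tokens[best]
        result.append(token)
    return result


# Toy data (hypothetical attention weights).
source = ["japanese", "share", "prices", "rose", "sharply"]
summary = ["tokyo", "shares", "close", "<unk>", "higher"]
alignments = [
    [0.6, 0.1, 0.1, 0.1, 0.1],
    [0.1, 0.6, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.2, 0.5, 0.1],
    [0.0, 0.1, 0.1, 0.1, 0.7],
    [0.1, 0.1, 0.1, 0.6, 0.1],
]

fixed = replace_unk(summary, source, alignments)
print(" ".join(fixed))
```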

change the way of embedding

Hello, I want to use this on Chinese text, and instead of Glove I want to use TenCent_chinses_embedding for the embeddings. Can you give me some advice on how to change the code? Thanks very much!

Error while trying to run test.py on pre-trained model

Hi,
I'm getting a KeyError: -1 at line 36 of test.py when trying to run the pre-trained model.

File "test.py", line 36, in
prediction_output = list(map(lambda x: [reversed_dict[y] for y in x], prediction[:, 0, :]))
KeyError: -1

May I know why it is throwing this error?
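
A plausible explanation, stated as an assumption rather than a confirmed diagnosis: the beam search output can contain -1 at positions past the end of a finished hypothesis, and `reversed_dict` has no entry for -1. A defensive decoding loop that skips negative ids might look like the sketch below; `ids_to_words` and the toy dictionary are hypothetical.

```python
def ids_to_words(id_rows, reversed_dict, end_token="</s>"):
    """Map rows of token ids to words, skipping negative ids and stopping at end_token."""
    decoded = []
    for row in id_rows:
        words = []
        for idx in row:
            idx = int(idx)
            if idx < 0:  # assumed padding value, not a vocabulary id
                continue
            word = reversed_dict[idx]
            if word == end_token:
                break
            words.append(word)
        decoded.append(words)
    return decoded


# Toy vocabulary and predictions (hypothetical values).
toy_reversed_dict = {0: "<padding>", 1: "</s>", 2: "dollar", 3: "stable"}
toy_prediction = [[2, 3, 1, -1, -1], [3, -1, -1, -1, -1]]

decoded = ids_to_words(toy_prediction, toy_reversed_dict)
print(decoded)
```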

Using previous trained models

Hi.

Your current code always trains from scratch. But TensorFlow allows training to continue from a saved model, or checkpoint, and this is really useful.

I added a few lines to my train.py. I don't think this deserves a pull request, so I am just leaving the lines here for later use, adaptation, inspiration, or ignoring.

Add an argument:

    parser.add_argument("--reuse", action="store_true", help="Reuse previously trained model")

Check checkpoint and argument: after

    if not os.path.exists("saved_model"):
        os.mkdir("saved_model")

add

    else:
        if args.reuse:
            old_model_checkpoint_path = open('saved_model/checkpoint', 'r')
            old_model_checkpoint_path = "".join(["saved_model/", old_model_checkpoint_path.read().splitlines()[0].split('"')[1]])

Restore model to session: near saver = tf.train.Saver(tf.global_variables()) add

    if 'old_model_checkpoint_path' in globals():
        print("Continuing from previous trained model ", old_model_checkpoint_path, "...")
        saver.restore(sess, old_model_checkpoint_path)
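
The checkpoint lookup above can be made a little more defensive: close the file after reading and return nothing when no checkpoint exists. The sketch below does this in plain Python; `latest_checkpoint_path` and the file name `model.ckpt-1000` are hypothetical, and the parsing assumes the state file contains a line of the form `model_checkpoint_path: "..."`. TensorFlow also provides `tf.train.latest_checkpoint(checkpoint_dir)` for the same purpose.

```python
import os
import re
import tempfile


def latest_checkpoint_path(model_dir):
    """Return the path named by model_checkpoint_path in <model_dir>/checkpoint, or None."""
    state_file = os.path.join(model_dir, "checkpoint")
    if not os.path.isfile(state_file):
        return None
    with open(state_file, "r", encoding="utf-8") as f:
        for line in f:
            match = re.match(r'\s*model_checkpoint_path:\s*"(.*)"\s*$', line)
            if match:
                return os.path.join(model_dir, match.group(1))
    return None


# Demonstration with a temporary directory standing in for saved_model/.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "checkpoint"), "w", encoding="utf-8") as f:
        f.write('model_checkpoint_path: "model.ckpt-1000"\n')
        f.write('all_model_checkpoint_paths: "model.ckpt-1000"\n')
    found = latest_checkpoint_path(tmp)
    expected = os.path.join(tmp, "model.ckpt-1000")
    missing = latest_checkpoint_path(os.path.join(tmp, "does_not_exist"))

print(found == expected, missing)
```

Returning `None` lets the caller decide between restoring and initializing from scratch without consulting `globals()`.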

error test file

When I tried to run the test file, it gave a KeyError. Please help me.

Final Output

I have tried to run the files, but after testing, result.txt (the final output) is empty; nothing has been written to the file.

License

Hi, could you please add a license to this repo? Thank you.

Calculate accuracy

How do you calculate accuracy at the end of each epoch?

Any help would be greatly appreciated!! Thanks

why there is KeyError: -1 in prediction

When the code reaches this point:

prediction_output = [[reversed_dict[y] for y in x] for x in prediction[:, 0, :]]

    with open("result.txt", "a") as f:
        for line in prediction_output:
            summary = list()
            for word in line:
                if word == "</s>":
                    break
                if word not in summary:
                    summary.append(word)
            print(" ".join(summary), file=f)

print('Summaries are saved to "result.txt"...')

this error occurs:


KeyError Traceback (most recent call last)
in
27 print('prediction:', prediction.shape)
28
---> 29 prediction_output = [[reversed_dict[y] for y in x] for x in prediction[:, 0, :]]
30
31

in (.0)
27 print('prediction:', prediction.shape)
28
---> 29 prediction_output = [[reversed_dict[y] for y in x] for x in prediction[:, 0, :]]
30
31

in (.0)
27 print('prediction:', prediction.shape)
28
---> 29 prediction_output = [[reversed_dict[y] for y in x] for x in prediction[:, 0, :]]
30
31

KeyError: -1

setting an array element with a sequence.

hi,

When I run train.py, at the line _, step, loss = sess.run([model.update, model.global_step, model.loss], feed_dict=train_feed_dict) I get the error ValueError: setting an array element with a sequence.
Can you help me fix this?
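
A frequent cause of this NumPy error is a "ragged" batch, where the rows fed into a fixed-shape placeholder have different lengths. Whether that applies here is an assumption; the sketch below only shows how to find such rows before calling `sess.run`. The helper `ragged_rows` is hypothetical.

```python
def ragged_rows(batch):
    """Return indices of rows whose length differs from the first row's length."""
    if not batch:
        return []
    expected = len(batch[0])
    return [i for i, row in enumerate(batch) if len(row) != expected]


# Toy batch (hypothetical ids): the second row is one element short.
toy_batch = [[2, 3, 4], [5, 6], [7, 8, 9]]
bad = ragged_rows(toy_batch)
print(bad)
```

Any index reported here points at a sequence that was not padded or truncated to the common length.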

About base paper

What is the base paper for this model? Can you please give me a link to your research paper?

Request for Model file

Hi,

Does anyone happen to have the model file saved? Training takes a very long time to run, and I currently don't have a GPU on my device.

Predicting own sentences

How does one supply one's own sentences to this model to get them summarized?
As far as I can see, the prediction uses the title as well as the sentence in the test set to create a new sentence. Is there any way to bypass the use of titles and just feed sentences?
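
Whatever the inference script expects, a custom sentence first has to be turned into the same kind of fixed-length id sequence as the test articles. The sketch below shows that step in isolation; the helper `encode_sentence`, the `<padding>` token name, the toy dictionary, and the whitespace tokenization are assumptions made for this example and may not match the project's own preprocessing.

```python
def encode_sentence(sentence, word_dict, max_len, unk="<unk>", pad="<padding>"):
    """Lowercase, split on whitespace, map words to ids, then truncate/pad to max_len."""
    tokens = sentence.lower().split()
    ids = [word_dict.get(tok, word_dict[unk]) for tok in tokens][:max_len]
    return ids + [word_dict[pad]] * (max_len - len(ids))


# Toy vocabulary (hypothetical ids).
toy_dict = {"<padding>": 0, "<unk>": 1, "dollar": 2, "regains": 3, "ground": 4}
encoded = encode_sentence("Dollar regains lost ground", toy_dict, max_len=6)
print(encoded)
```

Words missing from the vocabulary ("lost" here) map to the unk id, which is also why `<unk>` can show up in generated summaries.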

Number of output (in result.txt) is much smaller than number of input data.

I am a newbie in deep learning. While self-studying the seq2seq model, I tried to modify this code so that it can be applied to another language. However, I ran into one critical issue: the number of outputs generated (in result.txt) is much smaller than the number of inputs.

Here are some differences in my code.

  1. tensorflow version 1.10.
  2. pretrained fasttext embedding instead of glove.

I first want to make the overall process work, then optimize or train the model with more data.
But I have been stuck on this issue for a couple of days.
I wonder if you have any idea what could cause it.

on CNN Data set

Could you let me know what the results would look like if you used the CNN dataset instead? If there are pre-trained models for it, please let me know.

How to use Beam Search Decoder?

It seems that to use BeamSearchDecoder, forward_only must be set to True:

    model = Model(reversed_dict, article_max_len, summary_max_len, config, forward_only=True)

But loss and update are defined only when forward_only=False, so I get this error:

    _, step, loss = sess.run([model.update, model.global_step, model.loss], feed_dict=train_feed_dict)
    AttributeError: 'Model' object has no attribute 'update'

What will the loss and update be when forward_only=True?
Any help would be greatly appreciated! Thank you!

Training times

Would anyone care to comment on training times for this model (CPU/GPU/TPU)?
I'm thinking of creating a Google Cloud ML Engine package for this (for both training and serving).

Extractive text summariser

I am doing research that requires extractive summarization. Is it possible to convert this into an extractive summarizer?

zipfile.BadZipFile: File is not a zip file

Hello, when I run my code with TensorFlow I get this error. Can you help me, please?

> Traceback (most recent call last):
>   File "model.py", line 3, in <module>
>     from utils import get_init_embedding
>   File "../Desktop/text-summarization-tensorflow-master/utils.py", line 6, in <module>
>     from gensim.models.keyedvectors import KeyedVectors
>   File "../anaconda3/envs/tf/lib/python3.6/site-packages/gensim/__init__.py", line 6, in <module>
>     from gensim import parsing, matutils, interfaces, corpora, models, similarities, summarization, utils  # noqa:F401
>   File "../anaconda3/envs/tf/lib/python3.6/site-packages/gensim/corpora/__init__.py", line 14, in <module>
>     from .wikicorpus import WikiCorpus  # noqa:F401
>   File "../anaconda3/envs/tf/lib/python3.6/site-packages/gensim/corpora/wikicorpus.py", line 471, in <module>
>     class WikiCorpus(TextCorpus):
>   File "../anaconda3/envs/tf/lib/python3.6/site-packages/gensim/corpora/wikicorpus.py", line 504, in WikiCorpus
>     def __init__(self, fname, processes=None, lemmatize=utils.has_pattern(), dictionary=None,
>   File "../anaconda3/envs/tf/lib/python3.6/site-packages/gensim/utils.py", line 1529, in has_pattern
>     from pattern.en import parse  # noqa:F401
>   File "../anaconda3/envs/tf/lib/python3.6/site-packages/pattern/text/en/__init__.py", line 61, in <module>
>     from pattern.text.en.inflect import (
>   File "../anaconda3/envs/tf/lib/python3.6/site-packages/pattern/text/en/__init__.py", line 80, in <module>
>     from pattern.text.en import wordnet
>   File "../anaconda3/envs/tf/lib/python3.6/site-packages/pattern/text/en/wordnet/__init__.py", line 57, in <module>
>     nltk.data.find("corpora/" + token)
>   File "../anaconda3/envs/tf/lib/python3.6/site-packages/nltk/data.py", line 653, in find
>     return find(modified_name, paths)
>   File "../anaconda3/envs/tf/lib/python3.6/site-packages/nltk/data.py", line 639, in find
>     return ZipFilePathPointer(p, zipentry)
>   File "../anaconda3/envs/tf/lib/python3.6/site-packages/nltk/compat.py", line 221, in _decorator
>     return init_func(*args, **kwargs)
>   File "../anaconda3/envs/tf/lib/python3.6/site-packages/nltk/data.py", line 486, in __init__
>     zipfile = OpenOnDemandZipFile(os.path.abspath(zipfile))
>   File "../anaconda3/envs/tf/lib/python3.6/site-packages/nltk/compat.py", line 221, in _decorator
>     return init_func(*args, **kwargs)
>   File "../anaconda3/envs/tf/lib/python3.6/site-packages/nltk/data.py", line 1012, in __init__
>     zipfile.ZipFile.__init__(self, filename)
>   File "../anaconda3/envs/tf/lib/python3.6/zipfile.py", line 1108, in __init__
>     self._RealGetContents()
>   File "../anaconda3/envs/tf/lib/python3.6/zipfile.py", line 1175, in _RealGetContents
>     raise BadZipFile("File is not a zip file")
> zipfile.BadZipFile: File is not a zip file

No module named "scipy.sparse" when running python train.py

Does anyone have the same issue?

File "C:\ProgramData\Anaconda3\lib\site-packages\gensim\__init__.py", line 6, in
from gensim import parsing, matutils, interfaces, corpora, models, similarities, summarization, utils # noqa:F401
File "C:\ProgramData\Anaconda3\lib\site-packages\gensim\parsing\__init__.py", line 4, in
from .preprocessing import (remove_stopwords, strip_punctuation, strip_punctuation2, # noqa:F401
File "C:\ProgramData\Anaconda3\lib\site-packages\gensim\parsing\preprocessing.py", line 40, in
from gensim import utils
File "C:\ProgramData\Anaconda3\lib\site-packages\gensim\utils.py", line 39, in
import scipy.sparse
ModuleNotFoundError: No module named 'scipy.sparse'

UnicodeDecodeError using the sample data

Hello,
I'm having some issues with preparing the data.
Steps taken:

  1. Clone repo
  2. Unzip sample_data.zip
  3. python train.py

This is the error I'm getting:

Building dictionary...
Traceback (most recent call last):
  File "/Developer/Machine-Learning/Shared-folder/text-summarization-tensorflow/train.py", line 35, in <module>
    word_dict, reversed_dict, article_max_len, summary_max_len = build_dict("train", args.toy)
  File "/Developer/Machine-Learning/Shared-folder/text-summarization-tensorflow/utils.py", line 32, in build_dict
    train_article_list = get_text_list(train_article_path, toy)
  File "/Developer/Machine-Learning/Shared-folder/text-summarization-tensorflow/utils.py", line 25, in get_text_list
    return list(map(lambda x: clean_str(x.strip()), f.readlines()))
  File "/anaconda3/envs/deeplearning/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 5241: ordinal not in range(128)
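
The traceback shows the file being decoded with the platform's default 'ascii' codec. A common remedy, assuming the data files are UTF-8 encoded, is to pass the encoding explicitly when opening them. The helper `read_lines` and the sample text below are hypothetical and only illustrate the idea.

```python
import os
import tempfile


def read_lines(path, encoding="utf-8"):
    """Read a text file with an explicit encoding instead of the platform default."""
    with open(path, "r", encoding=encoding) as f:
        return [line.strip() for line in f]


# Demonstration: the pound sign is encoded as bytes 0xc2 0xa3 in UTF-8.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.txt")
    with open(sample_path, "wb") as f:
        f.write("price rose to \u00a35\n".encode("utf-8"))
    lines = read_lines(sample_path)

print(lines)
```

Setting a UTF-8 locale for the Python process is another option, but an explicit `encoding=` argument keeps the behavior independent of the environment.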

Freeze a model to serve within API

Hi.

I successfully tested a Portuguese corpus that I prepared and trained on (changing the line in utils.py to for word in word_tokenize(sentence, language='portuguese'): ).

I'd like to have a frozen model in a single .pb file in order to serve it within an API. I tried several approaches, such as this one, but without success: https://blog.metaflow.fr/tensorflow-how-to-freeze-a-model-and-serve-it-with-a-python-api-d4f3596b3adc

Would you consider providing some method to export a saved model? Or point me in the right direction?

Thanks!
