caption_generator's Introduction

caption_generator: An image captioning project

license

Note: This project is no longer under active development. However, queries and pull requests will be responded to. Thanks!

caption_generator generates a caption for any image in natural language (English). The model architecture is inspired by [1], Vinyals et al. The module is built using Keras, the deep learning library.

This repository serves two purposes:

  • present and discuss my model and the results I obtained
  • provide a simple architecture for image captioning to the community

Model

The image captioning model has been implemented using the Sequential API of Keras. It consists of three components:

  1. An encoder CNN model: A pre-trained CNN is used to encode an image into its feature vector. In this implementation, the VGG16 model[d] is used as the encoder, with its pretrained weights loaded. The last softmax layer of VGG16 is removed, and the vector of dimension (4096,) is obtained from the second-to-last layer.

    To speed up training, I pre-encoded each image into its feature vector. This is done in prepare_dataset.py, which produces the pickle file encoded_images.p. In the current version, the image model takes the (4096,)-dimensional encoded image vector as input. This can be overridden by uncommenting the VGG model lines in caption_generator.py. There is no fine-tuning in the current version, but it can be added.

  2. A word embedding model: Since the number of unique words can be large, a one-hot encoding of the words is not a good idea. An embedding model is trained that takes a word and outputs an embedding vector of dimension (1, 128).

    Pre-trained word embeddings can also be used.

  3. A decoder RNN model: An LSTM network is employed for the task of generating captions. It takes the image vector and the partial caption at the current timestep as input and generates the next most probable word as output.

The overall architecture of the model, including the input and output dimensions of each layer, is shown in the repository's architecture diagram (not reproduced here). A sketch of how the three components fit together follows.
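As a companion to the diagram, here is a minimal sketch of the three components wired together, written with the Keras 2 functional API rather than the repo's Sequential code; the layer sizes shown (128-dimensional embeddings, a 256-unit caption LSTM, a 1000-unit decoder LSTM) and the vocab_size/max_caption_len values are assumptions for illustration:

    from keras.layers import Input, Dense, RepeatVector, Embedding, LSTM, TimeDistributed, concatenate
    from keras.models import Model

    max_caption_len = 40   # assumed; the repo derives this from the training captions
    vocab_size = 8256      # assumed; the repo derives this from the training captions

    # 1. Encoder: project the pre-extracted (4096,) VGG16 feature vector and
    #    repeat it once per caption timestep.
    img_input = Input(shape=(4096,))
    img_emb = Dense(128, activation='relu')(img_input)
    img_seq = RepeatVector(max_caption_len)(img_emb)

    # 2. Word embedding: map each word index of the partial caption to a
    #    128-dimensional vector and run an LSTM over the sequence.
    cap_input = Input(shape=(max_caption_len,))
    cap_emb = Embedding(vocab_size, 128, input_length=max_caption_len)(cap_input)
    cap_seq = LSTM(256, return_sequences=True)(cap_emb)
    cap_seq = TimeDistributed(Dense(128))(cap_seq)

    # 3. Decoder: concatenate both streams and predict the next word.
    merged = concatenate([img_seq, cap_seq])
    hidden = LSTM(1000, return_sequences=False)(merged)
    next_word = Dense(vocab_size, activation='softmax')(hidden)

    model = Model(inputs=[img_input, cap_input], outputs=next_word)
    model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])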



Dataset

The model has been trained and tested on the Flickr8k dataset[2]. Many other datasets can be used as well, such as:

  • Flickr30k
  • MS COCO
  • SBU
  • Pascal

Experiments and results

The model has been trained for 50 epochs, which brings the loss down to 2.6465. With a larger dataset, it might be necessary to train the model for at least 50 more epochs.

With the current training on the Flickr8k dataset, running the test on the 1000 test images yields BLEU ≈ 0.57.

Some captions generated by the model are shown in the repository (sample images not reproduced here).




Requirements

  • tensorflow
  • keras
  • numpy
  • h5py
  • pandas
  • Pillow

These requirements can be easily installed by: pip install -r requirements.txt

Scripts

  • caption_generator.py: The base script that contains functions for model creation, the batch data generator, etc.
  • prepare_dataset.py: Prepares the dataset for training. This script must be modified if a new dataset is to be used.
  • train_model.py: Module for training the caption generator.
  • test_model.py: Contains the module for testing the performance of the caption generator; currently it implements the [BLEU](https://en.wikipedia.org/wiki/BLEU) metric (see the sketch below). New metrics can be added.
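For reference, a minimal sketch of a corpus-level BLEU computation with NLTK; NLTK and the variable names here are assumptions, and the exact metric code in test_model.py may differ:

    from nltk.translate.bleu_score import corpus_bleu

    # references: per test image, a list of tokenized ground-truth captions
    # hypotheses: one tokenized generated caption per test image
    references = [[['a', 'dog', 'runs', 'in', 'the', 'grass']]]
    hypotheses = [['a', 'dog', 'is', 'running', 'in', 'the', 'grass']]
    print(corpus_bleu(references, hypotheses))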

Usage

After the requirements have been installed, the process from training to testing is fairly easy. The commands to run:

  1. python prepare_dataset.py
  2. python train_model.py
  3. python test_model.py

References

[1] Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan. Show and Tell: A Neural Image Caption Generator

[2] Cyrus Rashtchian, Peter Young, Micah Hodosh, and Julia Hockenmaier. Collecting Image Annotations Using Amazon's Mechanical Turk. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk.


Acknowledgements

[a] I am thankful to my project guide Prof. NK Bansode, and a big shoutout to my project teammates. We have also developed an implementation of [1] in TensorFlow, available at image-caption-generator, which has been trained and tested on the MS COCO dataset.

[b] Special thanks to Ashwanth Kumar for helping me with the resources and effort to train my models.

[c] Keras: Deep Learning library for Theano and TensorFlow: Thanks to François Chollet for developing and maintaining such a wonderful library.

[d] deep-learning-models: Thanks to François Chollet for providing pretrained VGG16 model and weights.


caption_generator's Issues

Running caption_generator on Google Compute Engine

I am attempting to get this up and running on a Google Compute Engine (GCE) VM (Debian 4.9.51-1, x86_64). I ran sudo pip install -r requirements.txt, and everything installed correctly.

I then attempt to run python caption_generator/prepare_dataset.py. This outputs the following error:

Using TensorFlow backend.
RuntimeError: module compiled against API version 0xb but this version of numpy is 0xa
RuntimeError: module compiled against API version 0xb but this version of numpy is 0xa
Traceback (most recent call last):
  File "caption_generator/prepare_dataset.py", line 2, in <module>
    from keras.preprocessing import image
  File "/usr/local/lib/python2.7/dist-packages/keras/__init__.py", line 3, in <module>
    from . import activations
  File "/usr/local/lib/python2.7/dist-packages/keras/activations.py", line 4, in <module>
    from .utils.generic_utils import deserialize_keras_object
  File "/usr/local/lib/python2.7/dist-packages/keras/utils/__init__.py", line 6, in <module>
    from . import io_utils
  File "/usr/local/lib/python2.7/dist-packages/keras/utils/io_utils.py", line 10, in <module>
    import h5py
  File "/usr/local/lib/python2.7/dist-packages/h5py/__init__.py", line 31, in <module>
    from .highlevel import *
  File "/usr/local/lib/python2.7/dist-packages/h5py/highlevel.py", line 13, in <module>
    from ._hl.base import is_hdf5, HLObject
  File "/usr/local/lib/python2.7/dist-packages/h5py/_hl/base.py", line 78, in <module>
    dlapl = default_lapl()
  File "/usr/local/lib/python2.7/dist-packages/h5py/_hl/base.py", line 65, in default_lapl
    lapl = h5p.create(h5p.LINK_ACCESS)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5p.pyx", line 131, in h5py.h5p.create
  File "h5py/h5p.pyx", line 72, in h5py.h5p.propwrap
ValueError: Not a property list class (Not a property list class)

Any help would be much appreciated.

ValueError:

You are trying to load a weight file containing 19 layers into a model with 16 layers. (prepare_dataset.py, TensorFlow backend)

Can you please share a copy of your 'weights-improvement-XX.hdf5'?

I find that training takes a lot of time. Can you share a copy of your model weights?
We hope to check whether we have deployed your code successfully.
So far, we have only run 1 epoch, and the test shows that not a single caption can be produced.
Is this normal? Will more epochs bring reasonable output instead of empty output?
Thank you for your cool work.

Network does not converge, bad captions

Hello,

I've followed your instructions and started training the network. The loss reaches its minimum value after about 5 epochs and then it starts to diverge again.

After 50 epochs, the generated captions of the best epoch (5th or 6th) look like this:

Predicting for image: 992
2351479551_e8820a1ff3.jpg : exercise lamb Fourth headphones facing pasta soft her soft her soft her soft her soft her dads college soft her dads college soft her her her her her soft her her her her her soft her her her her
Predicting for image: 993
3514179514_cbc3371b92.jpg : fist graffitti soft her soft her Hollywood Fourth Crowd soft her her soft her her her her her soft her her her her her her soft her her her her soft her her her her soft her her her
Predicting for image: 994
1119015538_e8e796281e.jpg : closeout security soft her soft her security fall soft her her her her her fall soft her her her her her her soft her her her her her soft her her her her soft her her her her her
Predicting for image: 995
3727752439_907795603b.jpg : roots college Fourth tree-filled o swing-set places soft her soft her her soft her her soft her her college soft her her her her her her her soft her her her her soft her her her her her her

Any idea what's wrong?

AttributeError: 'module' object has no attribute 'CaptionGenerator'

Hi, after running prepare_dataset.py I tried to run train_model.py, but it gives this error message:

  File "train_model.py", line 25, in <module>
    train_model(epochs=50)
  File "train_model.py", line 6, in train_model
    cg = caption_generator.CaptionGenerator()
AttributeError: 'module' object has no attribute 'CaptionGenerator'

raise RuntimeError('You must compile your model before using it.')

Hey, when I try to train the model I get the following error:

Using TensorFlow backend.
413439
WARNING:tensorflow:From C:\Users\UserName\Anaconda3\lib\site-packages\tensorflow\python\framework\op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
Model created!
Traceback (most recent call last):
  File "D:\Image-caption\Image-Captioning-master\train.py", line 14, in <module>
    train(int(sys.argv[1]))
  File "D:\Image-caption\Image-Captioning-master\train.py", line 9, in train
    model.fit_generator(sd.data_process(batch_size=batch_size), steps_per_epoch=sd.no_samples/batch_size, epochs=epoch, verbose=2, callbacks=None)
  File "C:\Users\UserName\Anaconda3\lib\site-packages\keras\legacy\interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\UserName\Anaconda3\lib\site-packages\keras\engine\training.py", line 1418, in fit_generator
    initial_epoch=initial_epoch)
  File "C:\Users\UserName\Anaconda3\lib\site-packages\keras\engine\training_generator.py", line 40, in fit_generator
    model._make_train_function()
  File "C:\Users\UserName\Anaconda3\lib\site-packages\keras\engine\training.py", line 496, in _make_train_function
    raise RuntimeError('You must compile your model before using it.')
RuntimeError: You must compile your model before using it.

I have been trying to solve it with no luck; I have been stuck on this problem for two weeks. Please help me out here.

Pre-trained models

Hi,

The model works well, and I would like to know if you can share a pre-trained model.

Thanks

Bad captions

I have trained the model; the loss is 1.16 and the accuracy is 0.7574. But when I use the model to generate captions, the captions are rambling and make no sense.
(Screenshot omitted.)
What should I do?

Module Error

Which version of Keras is used? I'm getting an error saying "'module' object is not callable".

GPU not working

Hi!
When I ran train_model.py, Keras did not use the GPU automatically. However, when I ran MNIST example code, Keras used the GPU automatically. I can't tell why. Can somebody enlighten me, please?

Thanks

A small doubt about implementation.

So I have read the paper and have a small doubt: the authors just initialise the CNN with pre-trained (ImageNet) weights and don't change the weights further? Can you tell me your approach for the CNN: you just initialised it with ImageNet weights and never altered them, right?

Thank You.
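For reference, freezing a pretrained CNN in Keras is a one-liner per layer; a sketch (this repo sidesteps the question entirely by pre-extracting features in prepare_dataset.py, so the CNN weights are never updated):

    from keras.applications.vgg16 import VGG16

    vgg = VGG16(weights='imagenet')
    for layer in vgg.layers:
        layer.trainable = False  # keep the ImageNet weights fixed during training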

caption_generator.py: How do you break out of "while 1"?

In caption_generator.py, in the function "data_generator" at line 81, there is a

while 1:

During training I only see thousands of executions of the statement

print "yielding count: "+str(gen_count)

which is within this loop.

Since I don't see any exit conditions or break within this loop, I am wondering when and how do we break out of this loop.

Thanks
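For context, Keras's fit_generator is designed around generators that never return: it draws exactly steps_per_epoch batches per epoch and stops after the requested number of epochs, so the loop needs no exit condition. A sketch of the pattern, where make_batches and the surrounding names are hypothetical:

    def data_generator(batch_size):
        while 1:  # intentionally infinite
            for inputs, targets in make_batches(batch_size):  # make_batches is hypothetical
                yield inputs, targets

    model.fit_generator(data_generator(128),
                        steps_per_epoch=total_samples // 128,  # defines the epoch boundary
                        epochs=50)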

Final Model

Hey would it be possible for you to upload the final model?

Thanks,
Rohin

ValueError

ValueError: decode_predictions expects a batch of predictions (i.e. a 2D array of shape (samples, 1000)). Found array with shape: (1, 14, 14, 512)

When I ran:

# (imports added for completeness; the original post omitted them)
import numpy as np
from keras.applications.vgg16 import VGG16, preprocess_input, decode_predictions
from keras.preprocessing import image

model = VGG16(include_top=False, weights='imagenet')
img_path = 'Elephant.jpg'
img = image.load_img(img_path, target_size=(224, 224))
x = image.img_to_array(img)
x = np.expand_dims(x, axis=0)
x = preprocess_input(x)
print('Input image shape:', x.shape)
preds = model.predict(x)
print('Predicted:', decode_predictions(preds))

I got this error; can you help me fix it, please?
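For reference, decode_predictions translates a (samples, 1000) array of ImageNet class probabilities into labels, and that array only exists when the classification head is kept; with include_top=False, VGG16 outputs convolutional feature maps instead, hence the shape error. A hedged fix sketch:

    model = VGG16(include_top=True, weights='imagenet')  # keep the 1000-way softmax head
    preds = model.predict(x)                             # shape (samples, 1000)
    print('Predicted:', decode_predictions(preds))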

Running on Mac OS X (10.12.6)

I would like to get this running on my local machine before pushing it up to Google Compute Engine.

I set up a new virtual environment and cloned the repo. I then ran pip install -r requirements.txt and everything installed correctly.

I then tried running python caption_generator/prepare_dataset.py and got the following error:

Using TensorFlow backend.
Traceback (most recent call last):
  File "prepare_dataset.py", line 5, in <module>
    from imagenet_utils import preprocess_input
ImportError: No module named imagenet_utils

I was under the impression that imagenet_utils was included in Keras?
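For reference, the standalone imagenet_utils module comes from the deep-learning-models repository[d] rather than from Keras itself; in Keras 2 an equivalent helper is bundled under keras.applications, so one hedged workaround is to change the import:

    from keras.applications.imagenet_utils import preprocess_input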

Missing InceptionV3

prepare_dataset.py has finished. I am trying to run train_model.py; however, it gives me the following error:

ImportError: No module named inception_v3

I thought that the inception_v3 module was included in Tensorflow?
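Likewise, inception_v3 ships with Keras (under keras.applications) rather than with TensorFlow; a hedged workaround, assuming Keras 2:

    from keras.applications.inception_v3 import InceptionV3, preprocess_input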

When to use pretrained word embeddings?

Hello, I was wondering when you think it is worth using a pretrained word embedding model. I am facing a one-to-many problem as well, where my "many" are text paragraphs (~80 words). I have 100K training instances. What do you think?

Also, if I were to use a pretrained word embedding model, where should I insert it in your code?

Thanks in advance.
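For reference, the usual Keras pattern is to pass a pretrained weight matrix to the Embedding layer; embedding_matrix below is an assumed (vocab_size, 128) array built from, e.g., GloVe vectors, and all names are illustrative:

    from keras.layers import Embedding

    embedding = Embedding(vocab_size, 128,
                          weights=[embedding_matrix],   # pretrained vectors (assumption)
                          input_length=max_caption_len,
                          trainable=False)              # set True to fine-tune them instead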

Installing TensorFlow

It looks like this project is implemented in Python 2.7. I'm facing issues installing TensorFlow. Am I right about the Python version?

I couldn't get the weights-improvement-48.hdf5

After training, I noticed that only the following weight files were generated:
weights-improvement-01.hdf5
weights-improvement-02.hdf5
weights-improvement-03.hdf5
weights-improvement-04.hdf5
Apart from those, no other hdf5 files were generated; can anyone tell me what the problem is?
I have checked that the path to my training dataset is right and that the epoch count is 50.

Why don't you validate during training?

Hello, I was wondering why you don't add a validation generator during training. How do you check that the model doesn't overfit too much and is able to generalize?
Thanks in advance
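For reference, fit_generator accepts a held-out generator directly; a sketch with assumed names:

    model.fit_generator(train_generator,
                        steps_per_epoch=train_samples // batch_size,
                        epochs=epochs,
                        validation_data=val_generator,              # held-out data
                        validation_steps=val_samples // batch_size,
                        callbacks=callbacks_list)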

Installation

Hi, I was looking for a solution that would generate captions for pictures and came across your project. But I have not been able to run this application on my laptop. Are there more detailed instructions that describe, step by step, exactly what needs to be done to run the program?

Low accuracy on MSCOCO

Hi, I'm following your code and trying to train the network on MSCOCO.
Here is my code:

# (Imports reconstructed for completeness; the post omitted them.
#  Merge requires Keras 1.x. `path`, `Embedding_dim`, `step_size`, and
#  `v_step_size` are defined elsewhere in the poster's code.)
import os
import pickle
import random

import numpy as np
from keras.models import Sequential
from keras.layers import Dense, RepeatVector, Embedding, LSTM, TimeDistributed, Activation, Merge
from keras.optimizers import SGD, RMSprop
from keras.callbacks import ModelCheckpoint
from keras.preprocessing import sequence


class Caption_Model:
    def __init__(self, char_to_int, int_to_char, vocab_size=26688, max_caption_len=20,
                 folder_path=path, epochs=10, batch_size=64):
        self.img_model = Sequential()
        self.text_model = Sequential()
        self.model = Sequential()
        self.vocab_size = vocab_size
        self.max_caption_len = max_caption_len
        self.folder_path = folder_path
        self.data = {}
        self.char_to_int = char_to_int
        self.int_to_char = int_to_char
        self.batch_size = batch_size
        self.epochs = epochs

    def get_image_model(self):
        self.img_model.add(Dense(Embedding_dim, input_dim=4096, activation='relu'))
        self.img_model.add(RepeatVector(self.max_caption_len + 1))
        # self.img_model.summary()
        return self.img_model

    def get_text_model(self):
        self.text_model.add(Embedding(self.vocab_size, 256, input_length=self.max_caption_len + 1))
        self.text_model.add(LSTM(512, return_sequences=True))
        # self.text_model.add(Dropout(0.2))
        self.text_model.add(TimeDistributed(Dense(Embedding_dim, activation='relu')))
        # self.text_model.summary()
        return self.text_model

    def get_caption_model(self, predict=False):
        self.get_image_model()
        self.get_text_model()
        self.model.add(Merge([self.img_model, self.text_model], mode='concat'))
        self.model.add(LSTM(1000, return_sequences=False))
        self.model.add(Dense(self.vocab_size))
        self.model.add(Activation('softmax'))
        print "Now model.model"
        sgd = SGD(lr=1e-3, decay=1e-6, momentum=0.99, nesterov=True)
        rms = RMSprop(lr=0.005)
        if predict:
            return
        else:
            # weight = '/home/paperspace/Document/DeepLearning/ImageCaption/code/Models/checkpoint/weights-improvement-02-5.2473.hdf5'
            # self.model.load_weights(weight)
            self.model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])

    def load_data(self, set_type='train'):
        data = {}
        with open(self.folder_path + set_type + '.processed_img.2.pkl') as f:
            data['imgs'] = pickle.load(f)
        with open(os.path.join(self.folder_path, 'all%spartial_sentences_0.pkl' % set_type)) as f:
            data['partial_sentences'] = pickle.load(f)
        return data

    def data_generator(self, set_type='train'):
        data = self.load_data(set_type)
        j = 0
        temp = data['partial_sentences'].keys()
        partial_sentences, images = [], []
        next_words = np.zeros((self.batch_size, self.vocab_size)).astype(float)
        count = 0
        round_count = 0
        while True:
            round_count += 1
            random.shuffle(temp)
            print "the %d round!" % round_count
            for key in temp:
                image = data['imgs'][key]
                for sen in data['partial_sentences'][key]:
                    for k in range(len(sen)):
                        count += 1
                        partial = sen[:k + 1]
                        partial_sentences.append(partial)
                        images.append(image)
                        # print "index is: ", count - 1
                        if k == len(sen) - 1:
                            next_words[count - 1][self.char_to_int['<end>']] = 1
                        else:
                            next_words[count - 1][sen[k + 1]] = 1
                        if count >= self.batch_size:
                            partial_sentences = sequence.pad_sequences(partial_sentences, maxlen=self.max_caption_len + 1, padding='post')
                            partial_sentences = np.asarray(partial_sentences)
                            images = np.asarray(images)
                            # partial_sentences = partial_sentences / float(self.vocab_size)
                            # print partial_sentences
                            count = 0
                            yield [images, partial_sentences], next_words
                            partial_sentences, images = [], []
                            next_words = np.zeros((self.batch_size, self.vocab_size)).astype(float)
            j += 1

    def train(self):
        self.get_caption_model()
        filepath = "Models/checkpoint/weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"
        checkpoint = ModelCheckpoint(filepath, monitor='val_loss', verbose=1, save_best_only=True, mode='min')
        callbacks_list = [checkpoint]

        self.model.fit_generator(self.data_generator('train'), steps_per_epoch=step_size / self.batch_size,
                                 epochs=self.epochs, validation_data=self.data_generator('val'),
                                 validation_steps=v_step_size / self.batch_size, callbacks=callbacks_list)
        # self.model.fit_generator(self.data_generator('train'), steps_per_epoch=step_size/self.batch_size, epochs=self.epochs, callbacks=callbacks_list)

        try:
            self.model.save('Models/WholeModel.h5', overwrite=True)
            self.model.save_weights('Models/Weights.h5', overwrite=True)
        except:
            print "Error in saving model."
        print "After training model...\n"

Accuracy plateaus at about 35% by the end, and the training loss stays around 3.xxx.
I just cannot figure out what's wrong with the code.
Could you please offer some help?
Thank you so much!

Awful Captions Generated when Testing

After finally getting it to work (with basically no changes to the code) and letting it run for 25 epochs, I tried running the test_model.py script. It produces really bad results. I changed the weights file in the script to use the most recent weights file generated, which for some reason was weights-improvement-03.hdf5. During training, the accuracy was not increasing and the loss was increasing.

Here are some of the captions generated when I use that weights file:

Predicting for image: 0
3385593926_d3e9c21170.jpg : A black and white dog jumps in the grass .
Predicting for image: 1
2677656448_6b7e7702af.jpg : A black and white dog is running in the grass .
Predicting for image: 2
311146855_0b65fdb169.jpg : A man wearing a black shirt and blue hair and blue hair and blue hair and blue hair and blue hair and blue hair and blue hair and blue hat and blue shirt and black shirt is people
Predicting for image: 3
1258913059_07c613f7ff.jpg : A man in a black shirt is sitting in a mountain .
Predicting for image: 4
241347760_d44c8d3a01.jpg : A man in a red shirt is playing in the field .
Predicting for image: 5
2654514044_a70a6e2c21.jpg : A black and white dog is running on a field .
Predicting for image: 6
2339106348_2df90aa6a9.jpg : A man wearing a black shirt and blue hair and blue hair and blue hair and blue hair and blue hair and blue hair and blue hair and blue hair and blue hair and white shirt
Predicting for image: 7
256085101_2c2617c5d0.jpg : A black and white dog jumps in the grass .
Predicting for image: 8
280706862_14c30d734a.jpg : A black and white dog running in the snow .
Predicting for image: 9
3072172967_630e9c69d0.jpg : A man in a red shirt is sitting in the background .
Predicting for image: 10
3482062809_3b694322c4.jpg : A man wearing a black shirt and blue hair and blue shorts and blue shorts and blue shorts and blue shirt and blue hair and blue shirt and blue shirt and white shirt and black shirt is people

RuntimeError: You must compile your model before using it.

Running prepare_dataset.py is okay.
When I run train_model.py, I get the following error.
How should I change the code? Please guide me!

C:\Anaconda3\envs\caption_generator-master\python.exe D:/code/Python/caption_generator-master/caption_generator/train_model.py
Using TensorFlow backend.
Total samples : 383454
Vocabulary size: 8256
Maximum caption length: 40
Variables initialization done!
Model created!
Traceback (most recent call last):
  File "D:/Python/caption_generator-master/caption_generator/train_model.py", line 26, in <module>
    train_model(epochs=50)
  File "D:/code/Python/caption_generator-master/caption_generator/train_model.py", line 17, in train_model
    model.fit_generator(cg.data_generator(batch_size=batch_size), steps_per_epoch=cg.total_samples/batch_size, epochs=epochs, verbose=2, callbacks=callbacks_list)
  File "C:\Anaconda3\envs\caption_generator-master\lib\site-packages\keras\legacy\interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "C:\Anaconda3\envs\caption_generator-master\lib\site-packages\keras\engine\training.py", line 1420, in fit_generator
    initial_epoch=initial_epoch)
  File "C:\Anaconda3\envs\caption_generator-master\lib\site-packages\keras\engine\training_generator.py", line 40, in fit_generator
    model._make_train_function()
  File "C:\Anaconda3\envs\caption_generator-master\lib\site-packages\keras\engine\training.py", line 496, in _make_train_function
    raise RuntimeError('You must compile your model before using it.')
RuntimeError: You must compile your model before using it.

Process finished with exit code 1

Generating Captions on non-Flickr8k images

How do you go about generating captions on images of your choice?

The testing code seems to rely on the pre-generated encoded_images.p file, which only contains encodings for Flickr8k images in this case. The prepare_dataset.py file does not seem easy to adapt to encode images of your choice into encoded_images.p.

Any help would be appreciated.
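For reference, a hedged sketch of encoding an arbitrary image the same way the Flickr8k images are encoded (VGG16 fc2 features), assuming Keras 2; encode_image is an illustrative name, not a function from the repo:

    import numpy as np
    from keras.applications.vgg16 import VGG16, preprocess_input
    from keras.models import Model
    from keras.preprocessing import image

    base = VGG16(weights='imagenet')
    encoder = Model(inputs=base.input, outputs=base.get_layer('fc2').output)

    def encode_image(img_path):
        img = image.load_img(img_path, target_size=(224, 224))
        x = np.expand_dims(image.img_to_array(img), axis=0)
        x = preprocess_input(x)
        return encoder.predict(x).reshape((4096,))  # same shape as the encoded_images.p entries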
