fg91 / devise-zero-shot-classification
DeViSE model (zero-shot learning) trained on ImageNet and deployed on AWS using Docker
I see that in the notebook you used a cosine loss function, but the paper uses a more hinge-like ranking loss. Why did you use cosine loss?
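For context, the difference can be sketched in a few lines of numpy. This is a simplified illustration, not the repo's actual training code: cosine loss only pulls the prediction toward the correct label's vector, while the DeViSE-style hinge rank loss also pushes it away from the wrong labels' vectors until the correct one wins by a margin.

```python
import numpy as np

def cosine_loss(pred, target):
    # 1 - cosine similarity: only pulls pred toward the true label's vector
    pred_n = pred / np.linalg.norm(pred)
    target_n = target / np.linalg.norm(target)
    return 1.0 - pred_n @ target_n

def hinge_rank_loss(pred, label_vec, other_vecs, margin=0.1):
    # DeViSE-style rank loss: penalize any wrong label whose score comes
    # within `margin` of the true label's score
    pos = pred @ label_vec
    return sum(max(0.0, margin - pos + pred @ v) for v in other_vecs)
```

The rank loss is zero only once every negative label is outscored by the margin, which is what makes it "more hinge-like" than a plain similarity objective.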
Hi Fabio.
I read your article on Medium. For some reason I am not able to post a response there. I enjoyed reading your explanation of the paper. Can you point to recent advancements in this space? I see this paper was published in 2013, but it still looks relevant. Simple and powerful.
Hey Fabio, I read the full article, maybe the most interesting post that I have read on Medium. I didn't know something like this existed! Some decades down the line, I can see novels converted into movies by NNs.
Here's what I felt about DeViSE:
I think there can be an improvement here: you are doing two things at once, first finding a vector space to represent your image, and second mapping it to the vector space of word vectors.
There is no reason this can't be done separately. As you know, Variational Autoencoders (or whatever their latest improvement is) are better suited to finding a continuous vector space in which the image can be reconstructed. Since the space is continuous, it has properties similar to word vectors (man + glasses -> man with glasses, you get the idea). Word vectors also have this property (king - man + woman -> queen).
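The analogy arithmetic mentioned above can be made concrete with a toy example. The embeddings below are hand-made purely for illustration; real word vectors (word2vec, GloVe) learn such offsets from large corpora.

```python
import numpy as np

# Toy, hand-crafted 3-d "word vectors" chosen so that the gender offset
# (dims 2 and 3) is the same for king/queen as for man/woman.
vecs = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.8, 0.1]),
    "woman": np.array([0.5, 0.1, 0.8]),
    "queen": np.array([0.9, 0.1, 0.8]),
}

def nearest(query, vocab, exclude=()):
    # cosine-similarity nearest neighbour over the vocabulary
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in vocab if w not in exclude),
               key=lambda w: cos(query, vocab[w]))

q = vecs["king"] - vecs["man"] + vecs["woman"]
print(nearest(q, vecs, exclude={"king", "man", "woman"}))  # queen
```

It is exactly this shared linear structure that makes mapping between an image latent space and a word-vector space plausible at all.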
Here is what I suggest: what if you try to map these two continuous vector spaces using a NN? You may be able to generate a large amount of training data. For example, if you have black cat as a class, you can find a derived word vector representation for it (black + cat, if the individual words are present as classes for images) and a derived latent space representation for an image expressing the same idea. You should be able to generate a huge number of such combinations (more data, better NN!) that you can use to train a NN and find better transformations between the two continuous vector spaces.
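A minimal sketch of the proposed mapping, under entirely synthetic assumptions: here the image latents and word targets are random data related by an unknown linear transform, and the mapper is fit by least squares instead of the MLP a real implementation would train by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 8-d image latents (from the first VAE's encoder)
# and 5-d word vectors, related by an unknown linear transform.
true_map = rng.normal(size=(8, 5))
image_latents = rng.normal(size=(200, 8))   # generated training pairs
word_targets = image_latents @ true_map     # derived word vectors per class combination

# Fit the image-latent -> word-vector map on the generated pairs.
learned_map, *_ = np.linalg.lstsq(image_latents, word_targets, rcond=None)

# The fitted map generalizes to an unseen latent.
test_latent = rng.normal(size=(1, 8))
pred = test_latent @ learned_map
```

With noise-free synthetic pairs the least-squares fit recovers the transform exactly; the point of the sketch is only that abundant generated (latent, word-vector) pairs make the mapping problem well posed.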
When testing on an image, you can use the encoder part of the Autoencoder to generate the encoding and then use your newly trained NN to find the word vector representation for it. What's interesting is that you can do the opposite as well if you want: generate a latent space representation from any word vector and then use the decoder part of the Autoencoder to generate an image.
Two Autoencoders should also be able to do all of this. The first VAE finds a latent representation for the image. After the first VAE has been thoroughly trained, you can train a second VAE, which finds the word representation for the latent space representation of the image. You should then be able to map word vectors to images and images to word vectors.
The loss for the second VAE could be a combination of how well it reconstructs the latent space and how well it generates an encoding similar to some word vector. The data used to train the second VAE would be generated from the class combinations for which we can also generate a word vector representation; both the source and target vector representations are generated as in the examples above.
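The combined objective described above could look roughly like the following. Everything here is a hypothetical sketch of the commenter's proposal (the function name, the `alpha` weight, and the use of mean squared error are all assumptions), not code from the repo.

```python
import numpy as np

def second_vae_loss(reconstructed, original, encoding, word_vectors, alpha=0.5):
    """Hypothetical combined objective for the proposed second VAE:
    reconstruct the first VAE's latent while keeping the bottleneck
    encoding close to *some* valid word vector. `alpha` weights the
    word-vector term against the reconstruction term."""
    recon = np.mean((reconstructed - original) ** 2)
    # distance from the encoding to the nearest word vector in the vocabulary
    word_term = min(np.mean((encoding - w) ** 2) for w in word_vectors)
    return recon + alpha * word_term
```

Taking the minimum over the vocabulary encodes "similar to any word vector" rather than forcing the encoding toward one fixed target.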
You may say that we could use one VAE instead of two for this, but then your first VAE won't train properly, since it loses all information at the layer where it has to represent a state with the same size as a word vector. Then how come the second VAE will train properly? It won't, but we should still get a good enough encoder and decoder (better than DeViSE?); the advantage of decoupling is a better latent space representation for the images.
Imagine a future where your NN generates an image for a complex word vector like 'woman riding a white horse'. Maybe you will need a Transformer for that one.
I haven't thought about whether we can get GANs to work with this, since we can't access the internal space representation for images. But since GANs are good at mapping even noise to a continuous space that is not accessible to us, maybe the latent space for an image, found by decoding a word vector with the VAE, could be transformed by a GAN to generate a realistic image. Unfortunately, I am on the poor side of the globe; you have the passion as well as a GPU. I hope you will entertain this idea or build upon its flaws. I am open to your thoughts on this.