fg91 / devise-zero-shot-classification
DeViSE model (zero-shot learning) trained on ImageNet and deployed on AWS using Docker
I see that in the notebook you used a cosine loss function, but the paper uses a more hinge-like ranking loss. Why did you use cosine loss?
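For context, the difference can be sketched in a few lines of numpy. This is a simplified illustration, not the repo's actual training code: cosine loss only pulls the prediction toward the correct label's vector, while the DeViSE-style hinge rank loss also pushes it away from the wrong labels' vectors until the correct one wins by a margin.

```python
import numpy as np

def cosine_loss(pred, target):
    # 1 - cosine similarity: only pulls pred toward the true label's vector
    pred_n = pred / np.linalg.norm(pred)
    target_n = target / np.linalg.norm(target)
    return 1.0 - pred_n @ target_n

def hinge_rank_loss(pred, label_vec, other_vecs, margin=0.1):
    # DeViSE-style rank loss: penalize any wrong label whose score comes
    # within `margin` of the true label's score
    pos = pred @ label_vec
    return sum(max(0.0, margin - pos + pred @ v) for v in other_vecs)
```

The rank loss is zero only once every negative label is outscored by the margin, which is what makes it "more hinge-like" than a plain similarity objective.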
Hi Fabio.
I read your article on Medium. For some reason I am not able to post a response there. I enjoyed reading your explanation of the paper. Can you point to recent advancements in this space? I see this paper was published in 2013, but it still looks relevant. Simple and powerful.
Hey Fabio, I read the full article, maybe the most interesting post that I have read on Medium. I didn't know something like this existed! Some decades down the line, I can see novels converted into movies by NNs.
Here's what I felt about DeViSE:
I think there can be an improvement here: you are doing two things at once, first finding a vector space to represent your image, and second mapping it to the vector space of word vectors.
There is no reason this can't be done separately. As you know, Variational Autoencoders (or whatever their latest improvement is) are better suited to finding a continuous vector space in which the image can be reconstructed. Since the space is continuous, it has properties similar to word vectors (man + glasses -> man with glasses, you get the idea). Word vectors also have this property (king - man + woman -> queen).
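The analogy arithmetic mentioned above can be made concrete with a toy example. The embeddings below are hand-made purely for illustration; real word vectors (word2vec, GloVe) learn such offsets from large corpora.

```python
import numpy as np

# Toy, hand-crafted 3-d "word vectors" chosen so that the gender offset
# (dims 2 and 3) is the same for king/queen as for man/woman.
vecs = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.8, 0.1]),
    "woman": np.array([0.5, 0.1, 0.8]),
    "queen": np.array([0.9, 0.1, 0.8]),
}

def nearest(query, vocab, exclude=()):
    # cosine-similarity nearest neighbour over the vocabulary
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in vocab if w not in exclude),
               key=lambda w: cos(query, vocab[w]))

q = vecs["king"] - vecs["man"] + vecs["woman"]
print(nearest(q, vecs, exclude={"king", "man", "woman"}))  # queen
```

It is exactly this shared linear structure that makes mapping between an image latent space and a word-vector space plausible at all.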
Here is what I suggest: what if you try to map these two continuous vector spaces using a NN? You may be able to generate a large amount of training data. For example, if you have black cat as a class, you can find a derived word vector representation for it (black + cat, if the individual words are present as classes for images) and a derived latent space representation for an image expressing the same idea. You should be able to generate a huge number of such combinations (more data, better NN!) that you can use to train a NN and find better transformations between the two continuous vector spaces.
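A minimal sketch of the proposed mapping, under entirely synthetic assumptions: here the image latents and word targets are random data related by an unknown linear transform, and the mapper is fit by least squares instead of the MLP a real implementation would train by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 8-d image latents (from the first VAE's encoder)
# and 5-d word vectors, related by an unknown linear transform.
true_map = rng.normal(size=(8, 5))
image_latents = rng.normal(size=(200, 8))   # generated training pairs
word_targets = image_latents @ true_map     # derived word vectors per class combination

# Fit the image-latent -> word-vector map on the generated pairs.
learned_map, *_ = np.linalg.lstsq(image_latents, word_targets, rcond=None)

# The fitted map generalizes to an unseen latent.
test_latent = rng.normal(size=(1, 8))
pred = test_latent @ learned_map
```

With noise-free synthetic pairs the least-squares fit recovers the transform exactly; the point of the sketch is only that abundant generated (latent, word-vector) pairs make the mapping problem well posed.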
When testing on an image, you can use the encoder part of the Autoencoder to generate the encoding and then use your newly trained NN to find the word vector representation for it. What's interesting is that you can do the opposite as well if you want: generate a latent space representation from any word vector and then use the decoder part of the Autoencoder to generate an image.
Two Autoencoders should also be able to do all of this. The first VAE finds a latent representation for the image. After the first VAE has been thoroughly trained, you can train a second VAE, which finds the word representation for the latent space representation of the image. You should then be able to map word vectors to images and images to word vectors.
The loss for the second VAE could be a combination of how well it reconstructs the latent space and how well it generates an encoding similar to some word vector. The data used to train the second VAE would be generated from the class combinations for which we can also generate a word vector representation; both the source and target vector representations are generated as in the examples above.
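The combined objective described above could look roughly like the following. Everything here is a hypothetical sketch of the commenter's proposal (the function name, the `alpha` weight, and the use of mean squared error are all assumptions), not code from the repo.

```python
import numpy as np

def second_vae_loss(reconstructed, original, encoding, word_vectors, alpha=0.5):
    """Hypothetical combined objective for the proposed second VAE:
    reconstruct the first VAE's latent while keeping the bottleneck
    encoding close to *some* valid word vector. `alpha` weights the
    word-vector term against the reconstruction term."""
    recon = np.mean((reconstructed - original) ** 2)
    # distance from the encoding to the nearest word vector in the vocabulary
    word_term = min(np.mean((encoding - w) ** 2) for w in word_vectors)
    return recon + alpha * word_term
```

Taking the minimum over the vocabulary encodes "similar to any word vector" rather than forcing the encoding toward one fixed target.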
You may say that we could use one VAE instead of two for this, but then your first VAE won't train properly, since it loses all information at the layer where it has to represent a state with the same size as a word vector. Then how come the second VAE will train properly? It won't, but we should still get a good enough encoder and decoder (better than DeViSE?); the advantage of decoupling is a better latent space representation for the images.
Imagine a future where your NN generates an image for a complex word vector like 'woman riding a white horse'. Maybe you will need a Transformer for that one.
I haven't thought about whether we can get GANs to work with this, since we can't access the internal space representation for images. But since GANs are good at mapping even noise to a continuous space that is not accessible to us, maybe the latent space for an image, found by decoding a word vector with the VAE, could be transformed by a GAN to generate a realistic image. Unfortunately, I am on the poor side of the globe; you have the passion as well as a GPU. I hope you will entertain this idea or build upon its flaws. I am open to your thoughts on this.