
CLIP-implement

An attempt to get familiar with the CLIP neural network

CLIP: Connecting Text and Images

blog | paper | code

  • Problem:

    • Typical vision datasets are labor intensive and costly to create while teaching only a narrow set of visual concepts
    • Standard vision models are good at one task and one task only, and require significant effort to adapt to a new task
    • Models that perform well on benchmarks have disappointingly poor performance on stress tests
  • Introduction:

    • CLIP (Contrastive Language–Image Pre-training) builds on a large body of work on zero-shot transfer, natural language supervision, and multimodal learning.
    • Zero-shot Learning: Zero-shot learning is a promising learning method in which the classes covered by the training instances and the classes we aim to classify are disjoint. In other words, zero-shot learning applies knowledge gained from supervised training to new classes without any additional labeled examples.
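As a concrete sketch of how CLIP performs zero-shot classification: an image embedding is compared against the embeddings of text prompts (one per candidate class), and the most similar prompt wins. The toy vectors and prompt wording below are illustrative only, not CLIP's actual encoders or weights.

```python
import numpy as np

def zero_shot_classify(image_embedding, text_embeddings):
    """Pick the class whose text embedding is most similar to the image.

    Hypothetical inputs: `image_embedding` is a (d,) vector from an image
    encoder; `text_embeddings` is an (n_classes, d) matrix of encoded
    prompts such as "a photo of a {label}".
    """
    # Normalize so dot products become cosine similarities
    img = image_embedding / np.linalg.norm(image_embedding)
    txt = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    similarities = txt @ img          # one score per candidate class
    return int(np.argmax(similarities))

# Toy example: the image embedding points toward the second class
labels = ["cat", "dog", "car"]
image_vec = np.array([0.1, 0.9, 0.0])
text_vecs = np.array([[1.0, 0.0, 0.0],   # "a photo of a cat"
                      [0.0, 1.0, 0.0],   # "a photo of a dog"
                      [0.0, 0.0, 1.0]])  # "a photo of a car"
print(labels[zero_shot_classify(image_vec, text_vecs)])  # prints "dog"
```

Because the classifier is built entirely from text prompts, swapping in a new label set requires no retraining, which is what makes the transfer "zero-shot."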
  • Approach

    • Training task for CLIP: given an image, predict which of 32,768 randomly sampled text snippets was actually paired with it in our dataset.

    • Mitigate some major problems in the standard deep learning approach to computer vision:

      • Costly datasets: CLIP learns from text-image pairs that are already publicly available on the internet. Reducing the need for expensive large labeled datasets has been extensively studied by prior works.
      • Narrow: CLIP can be adapted to perform a wide variety of visual classification tasks without needing additional training examples. To apply CLIP to a new task, all we need to do is “tell” CLIP’s text-encoder the names of the task’s visual concepts, and it will output a linear classifier of CLIP’s visual representations. The accuracy of this classifier is often competitive with fully supervised models.
      • Poor real-world performance: There is a gap between “benchmark performance” and “real performance.” We conjecture that this gap occurs because the models “cheat” by only optimizing for performance on the benchmark, much like a student who passed an exam by studying only the questions on past years’ exams. In contrast, the CLIP model can be evaluated on benchmarks without having to train on their data, so it can’t “cheat” in this manner. This results in its benchmark performance being much more representative of its performance in the wild.
    • Limitations

      • CLIP struggles on tasks such as counting the number of objects in an image and on more complex tasks such as predicting how close the nearest car is in a photo.
      • CLIP also still has poor generalization to images not covered in its pre-training dataset.
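The contrastive training task described in the Approach section can be sketched as a symmetric cross-entropy over an image-text similarity matrix: for a batch of N matched pairs, the i-th image should score highest against the i-th text and vice versa. This is a minimal NumPy sketch; the embeddings and the temperature value are illustrative, not CLIP's trained parameters.

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over pairwise image-text similarities.

    `image_emb` and `text_emb` are (N, d) arrays where row i of each
    forms a matched pair; the diagonal of the similarity matrix holds
    the correct pairings.
    """
    # L2-normalize, then scale cosine similarities by the temperature
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature            # (N, N) similarity matrix

    def cross_entropy(l):
        # Cross-entropy against the diagonal (the correct pairings)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

# Perfectly aligned pairs give a near-zero loss; mismatched pairs do not
emb = np.eye(3)
aligned = clip_contrastive_loss(emb, emb)
shuffled = clip_contrastive_loss(emb, emb[[1, 2, 0]])
assert aligned < shuffled
```

Driving down this loss pushes each image embedding toward its paired text snippet and away from the other snippets in the batch, which is why the learned text encoder can later stand in for a classifier head.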
