Comments (9)

arampacha commented on July 29, 2024

@iejMac and I are going to give this a shot. Our current understanding of what needs to be done for a start can be summarized as follows:

Create a version of hf_model.py, by analogy to timm_model.py, with a PreTrainedTextEncoder class whose interface is compatible with TextTransformer from this PR:

  • Wrapper class: takes a base model from HF, adds a proper pooler, and handles input/output formats (see the sketch after this list)
  • Pooler classes, similar to those in the sentence-transformers library and EleutherAI's CARP
  • List of supported HF architectures (start with RoBERTa, and add more as we go)
  • Ensure tokenizer compatibility
  • Verify that gradient checkpointing works correctly with HF models
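
To make the plan concrete, here is a minimal sketch of what the wrapper and pooler could look like. PreTrainedTextEncoder is the name from the plan above, but MeanPooler, output_dim, and the rest are placeholder names, not a settled open_clip API; the sketch assumes the transformers library and a base model that returns last_hidden_state.

from torch import nn
from transformers import AutoModel


class MeanPooler(nn.Module):
    # Mean-pool token embeddings, ignoring padded positions.
    def forward(self, last_hidden_state, attention_mask):
        mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
        summed = (last_hidden_state * mask).sum(dim=1)
        return summed / mask.sum(dim=1).clamp(min=1e-9)


class PreTrainedTextEncoder(nn.Module):
    # Hypothetical wrapper: HF base model + pooler + projection to the CLIP embed dim.
    def __init__(self, model_name: str, output_dim: int):
        super().__init__()
        self.transformer = AutoModel.from_pretrained(model_name)
        self.pooler = MeanPooler()
        self.proj = nn.Linear(self.transformer.config.hidden_size, output_dim, bias=False)

    def set_grad_checkpointing(self, enable: bool = True):
        # delegate to HF's built-in gradient checkpointing support
        if enable:
            self.transformer.gradient_checkpointing_enable()

    def forward(self, input_ids, attention_mask):
        out = self.transformer(input_ids=input_ids, attention_mask=attention_mask)
        pooled = self.pooler(out.last_hidden_state, attention_mask)
        return self.proj(pooled)

The last list item would then mostly be a matter of verifying that HF's gradient_checkpointing_enable() behaves correctly inside such a wrapper.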

If you have any comments, recommendations, etc., feel free to ping us here or on LAION's Discord (Maciek and Arto). Progress can be followed in this PR.


rwightman commented on July 29, 2024

@Zasder3 I moved the timm models to their own file in a recent PR. I can take a crack at moving the text transformer to a submodule in CLIP, and adding the checkpoint loader / adaptation for backwards weight compatibility, if it'd help.

I'd leave it in a branch / open PR that you could build from... it may reduce some of the workload for getting HF text models in there.


rwightman commented on July 29, 2024

@Zasder3 that could be useful, although as far as pretrained weights are concerned, the LiT paper suggests it's much more useful to use pretrained + (optionally) frozen vision backbones while keeping the language tower trainable. That said, a larger variety of options for the language tower would be useful (a minimal freezing sketch follows).
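
For reference, freezing the vision tower LiT-style is a one-liner over the image tower's parameters; a minimal sketch, assuming an open_clip-style model with a visual attribute:

# Freeze the pretrained vision tower; the text tower stays trainable.
for param in model.visual.parameters():
    param.requires_grad = False
model.visual.eval()  # keep dropout / norm layers fixed in the frozen tower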

I'd proceed with a wrapper of sorts like the TimmModel, so say a `HuggingFaceModel` / `HuggingFaceTransformer`, keeping dependencies optional, etc. We should probably start splitting the timm code and HF code into files separate from the main model if we add more...


rwightman commented on July 29, 2024

Some additional thoughts on this...

  • TimmModel and related imports should move from open_clip/model.py to open_clip/timm_model.py
  • create a huggingface_model.py with maybe a HuggingFaceTransformer (or HfTextTransformer?). This adapter module will:
    • create a HF model using their interface
    • add or remove any layers as needed
  • Similar to the timm_ fields in ClipVisionCfg, we can add some hf_ / huggingface_ fields to ClipTextCfg that specify the config values necessary for creating a Hugging Face text model (see the sketch after this list)
  • create some configs named roughly image_tower_name--text_tower_name.json
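
A rough sketch of what those cfg fields could look like; the hf_model_name / hf_tokenizer_name names are assumptions mirroring the existing timm_ fields, not a settled schema:

from dataclasses import dataclass
from typing import Optional


@dataclass
class ClipTextCfg:
    context_length: int = 77
    vocab_size: int = 49408
    width: int = 512
    heads: int = 8
    layers: int = 12
    # hypothetical HF fields, mirroring the timm_ fields in ClipVisionCfg
    hf_model_name: Optional[str] = None      # e.g. "roberta-base"
    hf_tokenizer_name: Optional[str] = None  # falls back to hf_model_name if unset

and a hypothetical ViT-B-32--roberta-base.json following that naming:

{
  "embed_dim": 512,
  "vision_cfg": {"image_size": 224, "layers": 12, "width": 768, "patch_size": 32},
  "text_cfg": {"hf_model_name": "roberta-base"}
}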

The biggest mess in all of this is going to be the tokenizer. I haven't really figured out how this will work. The OpenAI CLIP tokenizer is a bit special relative to most other text transformers. It's unique in being set up to clean very ugly input. It's also a BPE tokenizer, but not quite matching the usual ones. A pretrained model is very much dependent on its tokenizer, so, using the default tokenizers for pretrained models in HF, we'd have to determine appropriate input cleaning, etc. that works with the defaults...
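
For illustration, a minimal sketch of leaning on a pretrained model's default HF tokenizer, padded/truncated to CLIP's usual 77-token context; what input cleaning to layer on top of this is exactly the open question above:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

def hf_tokenize(texts, context_length: int = 77):
    # Returns input_ids and attention_mask tensors of shape [batch, context_length].
    return tokenizer(
        texts,
        return_tensors="pt",
        max_length=context_length,
        padding="max_length",
        truncation=True,
    )

batch = hf_tokenize(["a photo of a cat", "a photo of a dog"])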

@Zasder3


rwightman commented on July 29, 2024

Oh yeah, and there was something annoying getting in the way of the text tower encapsulation...

Some of the logic and params for the text tower are directly in the main model class right now (encode_text, etc.), which differs from the image tower, which is wholly self-contained.

So, we'll need an OpenClipTextTransformer for CLIP.transformer that includes this...

def encode_text(self, text):
    x = self.token_embedding(text)  # [batch_size, n_ctx, d_model]

    x = x + self.positional_embedding
    x = x.permute(1, 0, 2)  # NLD -> LND
    x = self.transformer(x, attn_mask=self.attn_mask)
    x = x.permute(1, 0, 2)  # LND -> NLD
    x = self.ln_final(x)

    # x.shape = [batch_size, n_ctx, transformer.width]
    # take features from the eot embedding (eot_token is the highest number in each sequence)
    x = x[torch.arange(x.shape[0]), text.argmax(dim=-1)] @ self.text_projection

    return x

and related params...

The base model should just call self.text(text)

However, I don't know how to do this without breaking compatibility with existing weights :( I might have to think through this a bit more, or create a completely new CLIP class for custom image and text towers...

A state_dict adapter could work: we can provide a utility function that maps old state dicts to new ones for pretrained or other checkpoints passed in... if it detects the old names, it can shift them all under the text. prefix, i.e. transformer. to text.transformer., token_embedding to text.token_embedding, etc. (a sketch follows).
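
A minimal sketch of that utility, assuming the text-tower param names from the code above; the exact prefix list would need to match the final module layout:

# Keys assumed to move under the new `text.` submodule.
_TEXT_KEYS = ("transformer.", "token_embedding", "positional_embedding",
              "ln_final", "text_projection", "attn_mask")

def convert_to_new_text_state_dict(state_dict: dict) -> dict:
    # Remap an old flat CLIP checkpoint to the new text.-prefixed layout.
    if any(k.startswith("text.") for k in state_dict):
        return state_dict  # already in the new layout
    return {
        ("text." + k if k.startswith(_TEXT_KEYS) else k): v
        for k, v in state_dict.items()
    }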


rwightman commented on July 29, 2024

As a further comment on #88, and more detail on my previous comment:

There is some groundwork mixed up in my grad cache PR https://github.com/rwightman/open_clip/blob/grad_cache/src/open_clip/model.py

For implementing gradient caching, it was easier to treat the towers independently, so it was a bit cleaner to move the text params under .text ...

I'm not sure if grad caching is ready to merge though, so I could pull out the model refactoring as a separate PR if this is a high priority.


dmarx commented on July 29, 2024

Here's an example of how another CLIP variant implements this: https://github.com/EleutherAI/magiCARP/blob/main/carp/pytorch/model/encoders/pool_encoder.py


rwightman commented on July 29, 2024

@arampacha @iejMac that sounds good. Whether or not I merge the grad caching branch (I likely will, as I feel the rest of the changes are sound and the grad caching part can be refined), the interface for the separate text transformer will remain the same...


rom1504 commented on July 29, 2024

done now

