Comments (9)
@iejMac and I are going to give this a shot. Our current understanding of what needs to be done for a start can be summarized as follows:
Create a version of `hf_model.py`, by analogy to `timm_model.py`, with some `PreTrainedTextEncoder` class whose interface is compatible with `TextTransformer` from this PR:
- Wrapper class: takes a base model from HF, adds a proper pooler, and handles input/output formats
- Pooler classes: similar to those in the sentence-transformers library and Eleuther's CARP
- List of supported HF architectures (start with RoBERTa, add more as we go)
- Ensure tokenizer compatibility
- Verify that gradient checkpointing works correctly with HF models
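For the pooler piece, a minimal sketch of what such classes might look like (class names and signatures are our own illustration, loosely following sentence-transformers' masked mean pooling; nothing here is from the open_clip codebase yet):

```python
import torch
import torch.nn as nn

class MeanPooler(nn.Module):
    # Masked mean over token states, sentence-transformers style:
    # padding positions (attention_mask == 0) are excluded from the average.
    def forward(self, last_hidden_state, attention_mask):
        mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # [B, L, 1]
        summed = (last_hidden_state * mask).sum(dim=1)
        counts = mask.sum(dim=1).clamp(min=1e-9)  # avoid div-by-zero on empty rows
        return summed / counts

class ClsPooler(nn.Module):
    # Take the first ([CLS]) token, the usual choice for BERT-style models.
    def forward(self, last_hidden_state, attention_mask):
        return last_hidden_state[:, 0]
```

The wrapper class would then hold an HF base model plus one of these poolers and return a single embedding per sequence.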
If you have any comments, recommendations, etc., feel free to ping us here or on LAION's Discord (Maciek and Arto). The progress can be followed in this PR
from open_clip.
@Zasder3 I moved the timm models to their own file in a recent PR; I can take a crack at moving the text transformer to a submodule in CLIP, and adding the checkpoint loader / adaptation for backwards weight compat if it'd help.
I'd leave it in a branch / open PR that you could build from ... may reduce some of the workload for getting HF text models in there
@Zasder3 that could be useful, although as far as pretrained models are concerned, the LiT paper suggests it's much more useful to use pretrained + (optionally) frozen vision backbones while keeping the language tower trainable. However, a larger variety of options for the language side would be useful.
I'd proceed with a wrapper of sorts like the `TimmModel`, so say a `HuggingFaceModel`/`Transformer`, keep dependencies optional, etc. We should probably start splitting the timm code and HF code into separate files from the main model if we add more...
Some additional thoughts on this...
- `TimmModel` and related imports should move from `open_clip/model.py` to `open_clip/timm_model.py`
- create a `huggingface_model.py` with maybe a `HuggingFaceTransformer` (or `HfTextTransformer`?). This adapter module will...
  - create a HF model using their interface
  - add or remove any layers as needed
- Similar to the `timm_` fields in `ClipVisionCfg`, we can add some `hf_`/`huggingface_` fields to `ClipTextCfg` which specify the necessary config values for creating a hugging face text model
- create some configs with naming roughly `image_tower_name--text_tower_name.json`
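To make the config idea concrete, such a file might look roughly like the following. Every field name here (`hf_model_name`, `hf_pooler_type`, etc.) is an illustrative guess mirroring the existing `timm_` pattern, not an agreed-upon schema:

```python
import json

# Hypothetical "vit_base_patch32_224--roberta-base.json" contents.
# Field names under text_cfg are placeholders, not real open_clip keys.
roberta_clip_cfg = json.loads("""
{
    "embed_dim": 512,
    "vision_cfg": {
        "image_size": 224,
        "timm_model_name": "vit_base_patch32_224",
        "timm_model_pretrained": true
    },
    "text_cfg": {
        "hf_model_name": "roberta-base",
        "hf_pooler_type": "mean"
    }
}
""")
```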
The biggest mess in all of this is going to be the tokenizer. I haven't really figured out how this will work. The OpenAI CLIP tokenizer is a bit special relative to most other text transformers: it's unique in being set up to clean very ugly input, and it's a BPE tokenizer, but not quite matching the usual. A pretrained model is very much dependent on its tokenizer, so, using the default tokenizers for pretrained models in HF, we'd have to determine appropriate input cleaning, etc. that works with the defaults...
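For reference, the cleaning that CLIP's tokenizer applies before BPE looks roughly like this (the real `basic_clean` additionally runs `ftfy.fix_text` first; that step is omitted here to keep the sketch stdlib-only):

```python
import html
import re

def basic_clean(text):
    # real tokenizer runs ftfy.fix_text(text) here first (non-stdlib, omitted)
    text = html.unescape(html.unescape(text))  # unescape twice for doubly-encoded entities
    return text.strip()

def whitespace_clean(text):
    # collapse any run of whitespace to a single space
    return re.sub(r'\s+', ' ', text).strip()

caption = "  A   photo of a &amp;amp; dog\n"
cleaned = whitespace_clean(basic_clean(caption)).lower()  # "a photo of a & dog"
```

HF tokenizers for pretrained models generally do none of this, so scraped captions would hit them raw unless we keep an equivalent cleaning step in front.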
Oh yeah, and there was something annoying getting in the way of the text tower encapsulation... some of the logic and params for the text tower are directly in the main model class right now (`encode_text`, etc.), which differs from the image tower, which is wholly self-contained.
So, we'll need an `OpenClipTextTransformer` for `CLIP.transformer` that includes this...
```python
def encode_text(self, text):
    x = self.token_embedding(text)  # [batch_size, n_ctx, d_model]
    x = x + self.positional_embedding
    x = x.permute(1, 0, 2)  # NLD -> LND
    x = self.transformer(x, attn_mask=self.attn_mask)
    x = x.permute(1, 0, 2)  # LND -> NLD
    x = self.ln_final(x)
    # x.shape = [batch_size, n_ctx, transformer.width]
    # take features from the eot embedding (eot_token is the highest number in each sequence)
    x = x[torch.arange(x.shape[0]), text.argmax(dim=-1)] @ self.text_projection
    return x
```
and related params...
The base model should just call `self.text(text)`
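A rough sketch of what that encapsulated module could look like. This uses `nn.TransformerEncoder` as a stand-in for open_clip's own `Transformer`, and the class and argument names are placeholders, not the final API:

```python
import torch
import torch.nn as nn

class TextTransformer(nn.Module):
    # Placeholder encapsulation of CLIP's text tower; submodule names mirror
    # the existing state_dict keys (token_embedding, transformer, ln_final, ...).
    def __init__(self, vocab_size, context_length, width, heads, layers, embed_dim):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, width)
        self.positional_embedding = nn.Parameter(
            torch.empty(context_length, width).normal_(std=0.01))
        layer = nn.TransformerEncoderLayer(width, heads, dim_feedforward=width * 4)
        self.transformer = nn.TransformerEncoder(layer, layers)  # stand-in for open_clip's Transformer
        self.ln_final = nn.LayerNorm(width)
        self.text_projection = nn.Parameter(
            torch.empty(width, embed_dim).normal_(std=width ** -0.5))
        # causal mask: -inf strictly above the diagonal
        mask = torch.full((context_length, context_length), float('-inf')).triu_(1)
        self.register_buffer('attn_mask', mask)

    def forward(self, text):
        x = self.token_embedding(text) + self.positional_embedding
        x = x.permute(1, 0, 2)  # NLD -> LND
        x = self.transformer(x, mask=self.attn_mask)
        x = x.permute(1, 0, 2)  # LND -> NLD
        x = self.ln_final(x)
        # pool at the EOT token (highest id in each sequence), then project
        x = x[torch.arange(x.shape[0]), text.argmax(dim=-1)] @ self.text_projection
        return x
```

With this shape, an HF-backed text tower only has to match the `forward(text) -> [batch, embed_dim]` contract to be a drop-in replacement.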
However, I don't know how to do this without breaking compatibility with existing weights :( might have to think through this a bit more or create a completely new CLIP class for having custom image and text towers...
A state_dict adapter could work; we can provide a utility function that maps old state dicts to new ones for pretrained or other checkpoints passed in... if it detects the old names, it can shift them all from the root to `text.`, i.e. `transformer.` to `text.transformer.`, `token_embedding` to `text.token_embedding`, etc.
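Such an adapter could be as simple as the following sketch (the function name and exact key list are illustrative, not from the codebase):

```python
def remap_text_tower_keys(state_dict):
    # Backwards-compat sketch: prefix old text-tower keys with 'text.' so
    # pre-refactor checkpoints load into the new layout. Vision-tower keys
    # ('visual.*') and others are left untouched.
    text_keys = ('transformer.', 'token_embedding', 'positional_embedding',
                 'ln_final', 'text_projection')
    if any(k.startswith('text.') for k in state_dict):
        return state_dict  # already in the new layout, nothing to do
    return {('text.' + k) if k.startswith(text_keys) else k: v
            for k, v in state_dict.items()}
```

Calling it unconditionally in the checkpoint loader keeps both old and new checkpoints working, since remapping an already-converted dict is a no-op.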
As a further comment on #88, and in more detail on my previous comment: there is some groundwork mixed up in my grad cache PR https://github.com/rwightman/open_clip/blob/grad_cache/src/open_clip/model.py
For implementing gradient caching it was easier to treat the towers independently, so it was a bit cleaner to move text -> `.text`...
I'm not sure if grad caching is ready to merge though, so I could pull the model refactoring out into a separate PR if there is high priority for this.
Here's an example of how another CLIP variant implements this: https://github.com/EleutherAI/magiCARP/blob/main/carp/pytorch/model/encoders/pool_encoder.py
@arampacha @iejMac that sounds good, whether or not I merge the grad caching branch (I likely will, as I feel the rest of the changes are sound and the grad caching part can be refined), the interface for the separate text transformer will remain the same.
done now