Comments (9)
@iejMac and I are going to give this a shot. Our current understanding of what needs to be done for a start can be summarized as follows:
Create a version of `hf_model.py`, by analogy to `timm_model.py`, with some `PreTrainedTextEncoder` class whose interface is compatible with `TextTransformer` from this PR:
- Wrapper class: takes a base model from HF, adds a proper pooler, and handles input/output formats
- Pooler classes: similar to those in the sentence-transformers library and Eleuther's CARP
- List of supported HF architectures (start with RoBERTa, add more as we go)
- Ensure tokenizer compatibility
- Verify that gradient checkpointing works correctly with HF models
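For the pooler piece, a minimal sketch of what such classes might look like (class names and signatures are our own illustration, loosely following sentence-transformers' masked mean pooling; nothing here is from the open_clip codebase yet):

```python
import torch
import torch.nn as nn

class MeanPooler(nn.Module):
    # Masked mean over token states, sentence-transformers style:
    # padding positions (attention_mask == 0) are excluded from the average.
    def forward(self, last_hidden_state, attention_mask):
        mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # [B, L, 1]
        summed = (last_hidden_state * mask).sum(dim=1)
        counts = mask.sum(dim=1).clamp(min=1e-9)  # avoid div-by-zero on empty rows
        return summed / counts

class ClsPooler(nn.Module):
    # Take the first ([CLS]) token, the usual choice for BERT-style models.
    def forward(self, last_hidden_state, attention_mask):
        return last_hidden_state[:, 0]
```

The wrapper class would then hold an HF base model plus one of these poolers and return a single embedding per sequence.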
If you have any comments, recommendations, etc., feel free to ping us here or on LAION's Discord (Maciek and Arto). The progress can be followed in this PR
from open_clip.
@Zasder3 I moved the timm models to their own file in a recent PR; I can take a crack at moving the text transformer to a submodule in CLIP, and adding the checkpoint loader / adaptation for backwards weight compat if it'd help.
I'd leave it in a branch / open PR that you could build from ... may reduce some of the workload for getting HF text models in there
@Zasder3 that could be useful, although as far as pretrained models are concerned, the LiT paper suggests it's much more useful to use pretrained + (optionally) frozen vision backbones while keeping the language tower trainable. However, a larger variety of options for the language side would be useful.
I'd proceed with a wrapper of sorts like the `TimmModel`, so say a `HuggingFaceModel`/`Transformer`, keep dependencies optional, etc. We should probably start splitting the timm code and HF code into separate files from the main model if we add more...
Some additional thoughts on this...
- `TimmModel` and related imports should move from `open_clip/model.py` to `open_clip/timm_model.py`
- create a `huggingface_model.py` with maybe a `HuggingFaceTransformer` (or `HfTextTransformer`?). This adapter module will...
  - create a HF model using their interface
  - add or remove any layers as needed
- Similar to the `timm_` fields in `ClipVisionCfg`, we can add some `hf_`/`huggingface_` fields to `ClipTextCfg` which specify the necessary config values for creating a hugging face text model
- create some configs with naming roughly `image_tower_name--text_tower_name.json`
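To make the config idea concrete, such a file might look roughly like the following. Every field name here (`hf_model_name`, `hf_pooler_type`, etc.) is an illustrative guess mirroring the existing `timm_` pattern, not an agreed-upon schema:

```python
import json

# Hypothetical "vit_base_patch32_224--roberta-base.json" contents.
# Field names under text_cfg are placeholders, not real open_clip keys.
roberta_clip_cfg = json.loads("""
{
    "embed_dim": 512,
    "vision_cfg": {
        "image_size": 224,
        "timm_model_name": "vit_base_patch32_224",
        "timm_model_pretrained": true
    },
    "text_cfg": {
        "hf_model_name": "roberta-base",
        "hf_pooler_type": "mean"
    }
}
""")
```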
The biggest mess in all of this is going to be the tokenizer. I haven't really figured out how this will work. The OpenAI CLIP tokenizer is a bit special relative to most other text transformers: it's unique in being set up to clean very ugly input, and it's a BPE tokenizer, but not quite matching the usual. A pretrained model is very much dependent on its tokenizer, so, using the default tokenizers for pretrained models in HF, we'd have to determine appropriate input cleaning, etc. that works with the defaults...
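For reference, the cleaning that CLIP's tokenizer applies before BPE looks roughly like this (the real `basic_clean` additionally runs `ftfy.fix_text` first; that step is omitted here to keep the sketch stdlib-only):

```python
import html
import re

def basic_clean(text):
    # real tokenizer runs ftfy.fix_text(text) here first (non-stdlib, omitted)
    text = html.unescape(html.unescape(text))  # unescape twice for doubly-encoded entities
    return text.strip()

def whitespace_clean(text):
    # collapse any run of whitespace to a single space
    return re.sub(r'\s+', ' ', text).strip()

caption = "  A   photo of a &amp;amp; dog\n"
cleaned = whitespace_clean(basic_clean(caption)).lower()  # "a photo of a & dog"
```

HF tokenizers for pretrained models generally do none of this, so scraped captions would hit them raw unless we keep an equivalent cleaning step in front.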
Oh yeah, and there was something annoying getting in the way of the text tower encapsulation... some of the logic and params for the text tower are directly in the main model class right now (`encode_text`, etc.), which differs from the image tower, which is wholly self-contained.
So, we'll need an `OpenClipTextTransformer` for `CLIP.transformer` that includes this...
```python
def encode_text(self, text):
    x = self.token_embedding(text)  # [batch_size, n_ctx, d_model]
    x = x + self.positional_embedding
    x = x.permute(1, 0, 2)  # NLD -> LND
    x = self.transformer(x, attn_mask=self.attn_mask)
    x = x.permute(1, 0, 2)  # LND -> NLD
    x = self.ln_final(x)
    # x.shape = [batch_size, n_ctx, transformer.width]
    # take features from the eot embedding (eot_token is the highest number in each sequence)
    x = x[torch.arange(x.shape[0]), text.argmax(dim=-1)] @ self.text_projection
    return x
```
and related params...
The base model should just call `self.text(text)`
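A rough sketch of what that encapsulated module could look like. This uses `nn.TransformerEncoder` as a stand-in for open_clip's own `Transformer`, and the class and argument names are placeholders, not the final API:

```python
import torch
import torch.nn as nn

class TextTransformer(nn.Module):
    # Placeholder encapsulation of CLIP's text tower; submodule names mirror
    # the existing state_dict keys (token_embedding, transformer, ln_final, ...).
    def __init__(self, vocab_size, context_length, width, heads, layers, embed_dim):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, width)
        self.positional_embedding = nn.Parameter(
            torch.empty(context_length, width).normal_(std=0.01))
        layer = nn.TransformerEncoderLayer(width, heads, dim_feedforward=width * 4)
        self.transformer = nn.TransformerEncoder(layer, layers)  # stand-in for open_clip's Transformer
        self.ln_final = nn.LayerNorm(width)
        self.text_projection = nn.Parameter(
            torch.empty(width, embed_dim).normal_(std=width ** -0.5))
        # causal mask: -inf strictly above the diagonal
        mask = torch.full((context_length, context_length), float('-inf')).triu_(1)
        self.register_buffer('attn_mask', mask)

    def forward(self, text):
        x = self.token_embedding(text) + self.positional_embedding
        x = x.permute(1, 0, 2)  # NLD -> LND
        x = self.transformer(x, mask=self.attn_mask)
        x = x.permute(1, 0, 2)  # LND -> NLD
        x = self.ln_final(x)
        # pool at the EOT token (highest id in each sequence), then project
        x = x[torch.arange(x.shape[0]), text.argmax(dim=-1)] @ self.text_projection
        return x
```

With this shape, an HF-backed text tower only has to match the `forward(text) -> [batch, embed_dim]` contract to be a drop-in replacement.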
However, I don't know how to do this without breaking compatibility with existing weights :( might have to think through this a bit more or create a completely new CLIP class for having custom image and text towers...
A state_dict adapter could work; we can provide a utility function that maps old state dicts to new ones for pretrained or other checkpoints passed in... if it detects the old names, it can shift them all from the root to `text.`, i.e. `transformer.` to `text.transformer.`, `token_embedding` to `text.token_embedding`, etc.
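Such an adapter could be as simple as the following sketch (the function name and exact key list are illustrative, not from the codebase):

```python
def remap_text_tower_keys(state_dict):
    # Backwards-compat sketch: prefix old text-tower keys with 'text.' so
    # pre-refactor checkpoints load into the new layout. Vision-tower keys
    # ('visual.*') and others are left untouched.
    text_keys = ('transformer.', 'token_embedding', 'positional_embedding',
                 'ln_final', 'text_projection')
    if any(k.startswith('text.') for k in state_dict):
        return state_dict  # already in the new layout, nothing to do
    return {('text.' + k) if k.startswith(text_keys) else k: v
            for k, v in state_dict.items()}
```

Calling it unconditionally in the checkpoint loader keeps both old and new checkpoints working, since remapping an already-converted dict is a no-op.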
As a further comment on #88, and in more detail on my previous comment: there is some groundwork mixed up in my grad cache PR https://github.com/rwightman/open_clip/blob/grad_cache/src/open_clip/model.py
For implementing gradient caching it was easier to treat the towers independently, so it was a bit cleaner to move text -> `.text`...
I'm not sure if grad caching is ready to merge though, so I could pull the model refactoring out into a separate PR if there is high priority for this.
Here's an example of how another CLIP variant implements this: https://github.com/EleutherAI/magiCARP/blob/main/carp/pytorch/model/encoders/pool_encoder.py
@arampacha @iejMac that sounds good, whether or not I merge the grad caching branch (I likely will, as I feel the rest of the changes are sound and the grad caching part can be refined), the interface for the separate text transformer will remain the same.
done now