sense-gvt / declip
Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm
Hi, I want to reproduce the zero-shot result of DeCLIP-88M with ResNet50 on ImageNet-1K (62.5 in the table), but the evaluation result I got is 7.264, which is far too low. The ViT-B32 result is correct, however. I also found a problem while loading the ResNet50 checkpoint:
size mismatch for module.logit_scale: copying a param with shape torch.Size([]) from checkpoint, the shape in current model is torch.Size([1]).
I didn't change any code of the model.
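For what it's worth, a minimal workaround sketch for the shape mismatch, assuming a standard PyTorch state_dict (the key name is taken from the error message; whether the rest of the checkpoint then loads cleanly is untested):

```python
import torch

# The checkpoint stores module.logit_scale as a 0-dim tensor (torch.Size([])),
# while the model expects shape [1]; reshaping it before load_state_dict avoids
# the size-mismatch error. The dict below stands in for the real checkpoint.
state = {"module.logit_scale": torch.tensor(4.6052)}  # 0-dim scalar

key = "module.logit_scale"
if key in state and state[key].dim() == 0:
    state[key] = state[key].reshape(1)  # now torch.Size([1])

print(state[key].shape)  # torch.Size([1])
```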
Another question: why does the run.sh of declip-88m-resnet50 use clip_solver while the other run.sh files use declip_solver? I ran the evaluation for DeCLIP-88M-ResNet50 with declip_solver by replacing the yaml file. The following figure shows the results reproduced on my own compute resources:
Do you have any ideas? Thanks!
What is the preprocessing like? For example, the resize shape, and the mean and std.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/Spring_Prototype-3.0.0b0-py3.8.egg/prototype/solver/declip_solver.py", line 22, in <module>
    from prototype.model import model_entry
  File "/opt/conda/lib/python3.8/site-packages/Spring_Prototype-3.0.0b0-py3.8.egg/prototype/model/__init__.py", line 5, in <module>
    from .declip import declip_res50, declip_vitb32
  File "/opt/conda/lib/python3.8/site-packages/Spring_Prototype-3.0.0b0-py3.8.egg/prototype/model/declip.py", line 11, in <module>
    from .image_encoder.visual_transformer import visual_transformer_B32, visual_transformer_B16
ModuleNotFoundError: No module named 'prototype.model.image_encoder'
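One hedged way to diagnose this: check where (if anywhere) the subpackage would be imported from. If it is absent from the installed Spring_Prototype egg but present in the cloned repo, running from the repo root (so the local `prototype` package shadows the installed one) may work around it. This is a guess at the cause, not a confirmed fix:

```python
import importlib.util

def locate(name):
    """Report where a (sub)module would be imported from, if anywhere."""
    try:
        spec = importlib.util.find_spec(name)
    except ModuleNotFoundError:
        return "parent package not importable"
    return spec.origin if spec else "missing -- likely absent from the installed egg"

# If this prints an .egg path without image_encoder (or "missing"), the egg is
# incomplete; running from the cloned repo root lets the local package win.
print(locate("prototype.model.image_encoder"))
```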
How can this problem be solved?
It would be awesome if you'd release your code for preprocessing and training :)
I am using nvidia-dali-cuda110 version 1.14.0 and get the error: module 'nvidia.dali.ops' has no attribute 'McReader'.
The requirements ask for nvidia-dali 0.14, but there is no nvidia-dali==0.14 release.
Hi~ @zlccccc @SlotherCui
I notice that there isn't a BPE file here. Your token embedding weight has shape [49409, 512], but the shape in CLIP is [49408, 512]. Is your BPE file consistent with CLIP's?
If I missed something, please comment~ Thanks a lot!
The get_started.md file is empty. When will you upload this file?
Hi, authors! Thanks for your awesome work!
I'm confused about the usage of the fused AdamW_SGD optimizer described in the paper's Appendix C, in the implementation-details paragraph.
It says you use AdamW with lr 1e-3 and weight decay 0.05 for the ViT vision encoder, and SGD with lr 0.02 and weight decay 1e-4 for the text transformer.
However, in your configuration, ViT-B/32 is also optimized by SGD rather than the fused AdamW_SGD. So which optimizer did you actually use in the experiments?
And if you did use the fused AdamW_SGD optimizer as stated in the paper, why? CLIP uses only AdamW. Is AdamW_SGD beneficial compared to that?
Looking forward to your reply! 😁
Hey,
Thanks for the great work.
I noted that the 15M subset of YFCC you use is significantly different from the subset that OpenAI uses and the Quality not Quantity paper uses. To compare the proportion of matching samples, I just did a quick test and saw that the overall stats for the three datasets are:
declip json: 15,388,848 samples
quality-not-quantity csv: 14,825,236 samples
open-ai csv: 14,829,396 samples
The difference between the quality-not-quantity and openai csvs can simply be attributed to link rot.
Further, when I take an intersection between your photo-ids and the photo-ids used by OpenAI / Quality not Quantity:
declip intersection with openai: 6,642,077 matches
declip intersection with quality-not-quantity: 6,640,264 matches
It is interesting that there are still so many matches (~40%). I just wanted to add this information here since I found it quite hard to figure out the exact differences and intersections between the different YFCC subsets. So, in case people are trying to use YFCC15M subsets based on the OpenAI-CLIP subset, it is useful to keep in mind that the DeCLIP subset is substantially different. This is also mentioned in the DeCLIP paper, Appendix F and Table 8.
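For anyone repeating this comparison, the overlap check reduces to a set intersection over photo-ids. A toy sketch, with made-up in-memory id sets standing in for the ids parsed from the real DeCLIP json and OpenAI / quality-not-quantity csv files:

```python
# Toy stand-ins for the photo-id sets parsed from the annotation files.
declip_ids = {"1001", "1002", "1003", "1004", "1005"}
openai_ids = {"1002", "1003", "1006"}

# Intersection of photo-ids, and the share of the DeCLIP subset it covers.
matches = declip_ids & openai_ids
share = len(matches) / len(declip_ids)
print(f"{len(matches)} matches ({share:.0%} of the DeCLIP ids)")
```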
The code here
https://github.com/Sense-GVT/DeCLIP/blob/main/prototype/model/image_encoder/modified_resnet.py#L103
calls an undefined method (new_group).
DeCLIP/prototype/model/clip.py, lines 136 to 146 at commit 9d9e25d
Hello authors, I am fine-tuning the yfcc15m pre-trained weights on my own dataset, but the loss never converges. What could be the cause?
It seems that the URL for downloading the model weights doesn't work.
Hi, I just realized that you have FILIP and DeFILIP implementations, very interesting.
Do you already have results on how they compare with CLIP and DeCLIP, with respect to benchmark scores and compute efficiency? :)
May I also ask about the details of the hardware you used for training?
… And, do you have code for multi-node training?
Can you provide the installation instructions or dockerfile?
Hello, I saw that you used 32 A100s. May I ask how much memory each A100 has? 80 GB?
If we use 8 V100 cards with 32 GB each, how long would it take to complete YFCC-15M training?
In fact, after reading the paper I still don't understand whether this code trains from scratch or fine-tunes CLIP with some parameters frozen. Sorry, I have just started learning deep learning; could some kind person tell me the answer?
Thank you for this exciting repository. Can you provide a simple example of how I might be able to load the models you provide in your model zoo?
Something along the lines of what is provided by the timm (pytorch-image-models) model repository:
import timm
from timm.data import resolve_data_config
from timm.data.transforms_factory import create_transform

model_name = 'ghostnet_100'
model = timm.create_model(model_name, pretrained=True)
model.eval()

config = resolve_data_config({}, model=model)
transform = create_transform(**config)
Ideally, this would allow us to use the models in a jupyter notebook or other interactive context.
Thanks in advance!
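Not an official answer, but a hedged loader sketch along those lines. The builder names (`declip_res50`, `declip_vitb32` in `prototype.model.declip`) appear in this repo's import paths; the checkpoint layout (a possible "model" key, a possible "module." prefix, as seen in other issues here) is an assumption:

```python
import torch

def load_declip(builder, ckpt_path):
    """Hedged loader sketch: builder is a model constructor such as
    prototype.model.declip.declip_vitb32; the checkpoint layout handled
    below is an assumption, not a documented format."""
    model = builder()
    state = torch.load(ckpt_path, map_location="cpu")
    state = state.get("model", state)  # some checkpoints nest under "model"
    # Strip a possible DataParallel/DistributedDataParallel "module." prefix.
    state = {k[len("module."):] if k.startswith("module.") else k: v
             for k, v in state.items()}
    model.load_state_dict(state, strict=False)
    return model.eval()
```

Usage would then be e.g. `model = load_declip(declip_vitb32, "declip_vitb32.pth")`, after which the model can be used in a notebook like a timm model.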
Hi, thanks for the great work. After downloading the provided YFCC15M label file, I can see there are three keys in each label: caption, filename, and url. How should we find the corresponding YFCC image from your label, i.e., which key should we use to align with the YFCC data?
I use the following command to run zero-shot evaluation:
python -m prototype.solver.clip_solver --config ./experiments/declip_experiments/declip88m/declip88m_r50_declip/config.yaml --evaluate
And then it reports this error:
import FusedFP16SGD failed, FusedFP16AdamW replace
slurm
Traceback (most recent call last):
  File "/opt/conda/envs/openmmlab/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/envs/openmmlab/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/apdcephfs/share_1227775/mingzhenzhu/DeCLIP/prototype/solver/clip_solver.py", line 769, in <module>
    main()
  File "/apdcephfs/share_1227775/mingzhenzhu/DeCLIP/prototype/utils/dist.py", line 11, in wrapper
    dist_init()
  File "/apdcephfs/share_1227775/mingzhenzhu/DeCLIP/prototype/utils/dist.py", line 21, in dist_init
    proc_id = int(os.environ['SLURM_PROCID'])
  File "/opt/conda/envs/openmmlab/lib/python3.7/os.py", line 681, in __getitem__
    raise KeyError(key) from None
KeyError: 'SLURM_PROCID'
How to fix it? Thanks!
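A possible workaround sketch, assuming the evaluation is meant to run as a single process: `dist_init()` in the traceback reads `SLURM_PROCID` from the environment, so the SLURM variables can be faked before launching. Only `SLURM_PROCID` is confirmed by the traceback; the other variable names are guesses at what else the code may read:

```python
import os

# Fake the SLURM environment dist_init() expects before a single-process run.
# Shell equivalent: `export SLURM_PROCID=0` (etc.) before `python -m ...`.
os.environ.setdefault("SLURM_PROCID", "0")    # confirmed by the traceback
os.environ.setdefault("SLURM_NTASKS", "1")    # guess at another expected var
os.environ.setdefault("SLURM_NODELIST", "localhost")  # guess as well

print(os.environ["SLURM_PROCID"])  # 0
```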
Are there any plans on hosting the weights on huggingface? I am willing to help if need be.
In the paper "Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm", training on the YFCC_V2 dataset gives CLIP and DeCLIP zero-shot ImageNet performance of 31.3 and 41.9, but 37.3 and 44.4 are reported in "Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision". What is the difference between them?
Hello, thank you for your work. I wonder how to deal with this problem: how do I install dataflow==1.2.1?
When I install dataflow==1.2.1 with pip, the error is:
ERROR: Could not find a version that satisfies the requirement dataflow==1.2.1 (from versions: 0.1.1)
ERROR: No matching distribution found for dataflow==1.2.1
Hi, when I run run.sh, I got errors: