SLIP framework

What you can find in this repo:

  • Pre-trained CLIP, SimCLR, and SLIP checkpoints (see Results and Pre-trained Models below)
  • Pre-training code for SLIP, CLIP, and SimCLR (Section 2)
  • Evaluation code for zero-shot transfer, linear classification, and end-to-end finetuning (Sections 3-5)

Updates:

Jan 18 2022: Added support for training on RedCaps

Jan 17 2022: Released CC3M/CC12M CLIP/SLIP ViT-B checkpoints

Results and Pre-trained Models

The following models are pre-trained on YFCC15M and evaluated on ImageNet-1K (ILSVRC2012). The 0-shot, Linear, and Finetuned columns report top-1 accuracy (%).

ViT-Small (MoCo v3 version w/ 12 vs. 6 heads)

| Method | Epochs | 0-shot | Linear | Finetuned | Weights |
|--------|--------|--------|--------|-----------|---------|
| CLIP   | 25     | 32.7   | 59.3   | 78.2      | url     |
| SimCLR | 25     | -      | 58.1   | 79.9      | url     |
| SLIP   | 25     | 38.3   | 66.4   | 80.3      | url     |
| SLIP   | 50     | 39.3   | 67.6   | 80.7      | url     |
| SLIP   | 100    | 39.5   | 68.3   | 80.7      | url     |

ViT-Base

| Method | Epochs | 0-shot | Linear | Finetuned | Weights |
|--------|--------|--------|--------|-----------|---------|
| CLIP   | 25     | 37.6   | 66.5   | 80.5      | url     |
| SimCLR | 25     | -      | 64.0   | 82.5      | url     |
| SLIP   | 25     | 42.8   | 72.1   | 82.6      | url     |
| SLIP   | 50     | 44.1   | 73.0   | 82.9      | url     |
| SLIP   | 100    | 45.0   | 73.6   | 83.4      | url     |

ViT-Large

| Method | Epochs | 0-shot | Linear | Finetuned | Weights |
|--------|--------|--------|--------|-----------|---------|
| CLIP   | 25     | 40.4   | 70.5   | 81.0      | url     |
| SimCLR | 25     | -      | 66.7   | 84.0      | url     |
| SLIP   | 25     | 46.2   | 76.0   | 84.2      | url     |
| SLIP   | 50     | 47.4   | 75.8   | 84.7      | url     |
| SLIP   | 100    | 47.9   | 75.1   | 84.8      | url     |

Additional Datasets and Models

| Dataset | Method | Model | Epochs | 0-shot | Linear | Finetuned | Weights |
|---------|--------|-------|--------|--------|--------|-----------|---------|
| CC3M    | CLIP   | ViT-B | 40     | 17.1   | 53.3   | 79.5      | url     |
| CC3M    | SLIP   | ViT-B | 40     | 23.0   | 65.4   | 81.4      | url     |
| CC12M   | CLIP   | ViT-B | 35     | 36.5   | 69.0   | 82.1      | url     |
| CC12M   | SLIP   | ViT-B | 35     | 40.7   | 73.7   | 83.1      | url     |

1. Setup

Install PyTorch and timm. The code has been tested with CUDA 11.3/CuDNN 8.2.0, PyTorch 1.10.0 and timm 0.5.0.
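
For reference, a minimal install sketch, assuming a CUDA 11.3 environment (adjust the wheel index and exact versions to match your setup):

pip install torch==1.10.0+cu113 torchvision==0.11.1+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html
pip install timm==0.5.0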

1.1. YFCC15M Setup

Download the YFCC100M dataset. Our dataloader expects the following dataset directory structure with 100 folders containing 1000 zip archives of 1000 images each. The concatenation of the folder, archive, and file names is the index of the image (i.e. image 12345678 is stored as 678.jpg within 12/345.zip):

/path/to/yfcc100m/
├── images/
│   ├── 00/
│   │   ├── 000.zip
│   │   │   ├── 000.jpg
│   │   │   ├── ...
│   │   │   └── 999.jpg
│   │   ├── ...
│   │   └── 999.zip
│   ├── ...
│   └── 99/
...
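
As an illustration of this layout (a hypothetical helper, not part of the repo), the 8-digit image index splits into folder/archive/file as follows:

def yfcc_path(index):
    # assume the index is zero-padded to 8 digits, as in the example above
    s = str(index).zfill(8)
    folder, archive, filename = s[:2], s[2:5], s[5:]
    return f"{folder}/{archive}.zip", f"{filename}.jpg"

print(yfcc_path(12345678))  # ('12/345.zip', '678.jpg')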

Prepare the YFCC15M subset metadata pickle:

  1. Download and compile a list of downloaded images to flickr_unique_ids.npy (ours)
  2. Download OpenAI's list of captioned YFCC100M images according to instructions here
  3. Run python make_dataset.py to create the yfcc15m.pkl metadata pickle

When pre-training with YFCC15M, set --dataset yfcc15m --root /path/to/yfcc100m --metadata /path/to/yfcc15m.pkl.

1.2. COCO Captions Setup

Download and unzip the 2017 Train images and annotations. When pre-training on COCO, set --dataset coco --root /path/to/coco --metadata /path/to/captions_train2017.json.

1.3. Conceptual Captions Setup

CC3M and CC12M are published as tsv files listing original image urls and processed captions. Download images and collect the captions of all available images (many will be missing due to broken links) into cc3m.npy and cc12m.npy.

For CC3M our dataloader expects cc3m.npy to contain a NumPy array of dicts in the following format:

{
  'image_id': 1510438788,  # local file path relative to root
  'captions': ['large field with pink tulips on a clear sunny summer day with a blue sky']
}

For CC12M our dataloader expects cc12m.npy to contain a NumPy array of dicts in the following format:

{
  'image_name': '0.jpg',  # local file path relative to root
  'image_id': 0,
  'captions': ['Metal Design Within Reach Ivory Slipper Chairs - a Pair For Sale - Image 7 of 10']
}
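
As a sketch of writing these metadata files (assuming you have already collected the entries into a Python list in the formats shown above; the collection pipeline itself is up to you):

import numpy as np

# one dict per downloaded image, in the CC3M or CC12M format shown above
entries = [
    {'image_name': '0.jpg', 'image_id': 0, 'captions': ['example caption']},
]
np.save('/path/to/cc12m.npy', np.array(entries, dtype=object))
# note: reading the file back requires np.load(..., allow_pickle=True)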

When pre-training on CC3M set --dataset cc3m --root /path/to/cc3m --metadata /path/to/cc3m.npy, and when pre-training on CC12M set --dataset cc12m --root /path/to/cc12m --metadata /path/to/cc12m.npy.

1.4. RedCaps Setup

RedCaps is published as a list of JSON annotation files containing image urls and raw/processed captions. Images can be downloaded from these annotations with a helpful downloader tool. Then merge all per-subreddit annotations into a single file with the combine_captions.py script:

python redcaps/combine_captions.py --input /path/to/redcaps/annotations --output /path/to/redcaps_v1.json

To pre-train on RedCaps set --dataset redcaps --root /path/to/redcaps --metadata /path/to/redcaps_v1.json.

1.5. Downstream Dataset Setup

Zero-shot (in main.py and eval_zeroshot.py) and linear (in main_linear.py) evaluations read dataset paths from dataset_catalog.json. Zero-shot evaluations read CLIP's class labels and caption templates from labels.json and templates.json. If just pre-training models on YFCC15M, only the ImageNet path is required for model validation between training epochs. See Section 3 below on zero-shot transfer evaluation for dataset preparation details.

2. Pre-training

We use the following pre-training recipes for SLIP, CLIP, and SimCLR. See main.py for the full list of default arguments. We use the same lr and wd settings for all model sizes within the same training framework, and different model sizes can be selected by passing in different strings to the --model argument such as SLIP_VITS16 or SLIP_VITL16.

In our workflow we use submitit, which interfaces nicely with Slurm. For local training with the torchrun utility (supersedes torch.distributed.launch), replace python run_with_submitit.py with torchrun --nproc_per_node=8 main.py. Local multi-node training with torchrun should also be possible.

We train most of our models on 8x 8-gpu nodes, but training with fewer gpus is possible by reducing the batch size and setting the --update-freq argument above 1 to enable gradient accumulation. Note that gradient accumulation will increase the variance of minibatch statistics and alter the training dynamics of batchnorm, which is used in SLIP and SimCLR.
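
For example, a hypothetical single-node run that preserves the 4096 effective batch size (assuming the per-GPU batch size of 64 implied by 4096 samples over 64 GPUs) would use 8 gradient accumulation steps:

torchrun --nproc_per_node=8 main.py \
  --root /path/to/yfcc100m \
  --model SLIP_VITB16 \
  --lr 3e-3 --wd 0.1 \
  --batch-size 64 --update-freq 8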

SLIP ViT-Base with 8-nodes (batch size 4096)

python run_with_submitit.py \
  --root /path/to/yfcc100m \
  --model SLIP_VITB16 \
  --lr 3e-3 --wd 0.1

CLIP ViT-Base with 8-nodes (batch size 4096)

python run_with_submitit.py \
  --root /path/to/yfcc100m \
  --model CLIP_VITB16 \
  --lr 5e-4 --wd 0.5

SimCLR ViT-Base with 8-nodes (batch size 4096)

python run_with_submitit.py \
  --root /path/to/yfcc100m \
  --model SIMCLR_VITB16 \
  --ssl-mlp-dim 4096 --ssl-emb-dim 256 --ssl-temp 0.1 \
  --lr 3.2e-3 --wd 0.1 

Some important arguments:

--dataset: pre-training dataset name. Choices include yfcc15m, coco, cc3m, cc12m, and redcaps.

--root: path to dataset root

--metadata: path to metadata file (see section 1 for details)

--ssl-mlp-dim: hidden dim of SimCLR mlp projection head

--ssl-emb-dim: output embed dim of SimCLR mlp projection head

--ssl-scale: loss scale for SimCLR objective

--ssl-temp: softmax temperature for SimCLR objective

--batch-size: number of samples per-device/per-gpu

--lr-start: initial warmup lr

--lr-end: minimum final lr

--update-freq: optimizer update frequency, i.e. gradient accumulation steps

--disable-amp: disable mixed-precision training (requires more memory and compute)

3. Evaluation: Zero-shot Transfer

First, prepare additional downstream classification datasets:

  • MNIST, CIFAR-10/100, STL-10: Automatic download via torchvision datasets
  • HatefulMemes: Manual download from official website and sort images according to train.jsonl/dev.jsonl into train/dev folder
  • Rendered SST2, Country211: Manual download from CLIP repo
  • Other datasets: Use scripts from VISSL

Then set all dataset paths in dataset_catalog.json.

Evaluate zero-shot transfer to various classification benchmarks with eval_zeroshot.py, which reads labels and templates from labels.json/templates.json and dataset paths from dataset_catalog.json. Inference is performed with a single gpu. By default, the script iterates through all datasets in dataset_catalog.json and evaluates zero-shot in order. Evaluation can be limited to a subset of datasets by replacing for d in datasets: with for d in ['imagenet']: on line 78.

python eval_zeroshot.py --resume /path/to/checkpoint.pt
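
For reference, a schematic sketch of the zero-shot scoring that eval_zeroshot.py performs (illustrative only, not the repo's exact code; `model` and `tokenizer` are assumed to be a loaded SLIP/CLIP model and a SimpleTokenizer that accepts a list of strings):

import torch

@torch.no_grad()
def zeroshot_predict(model, tokenizer, images, classnames, templates, device='cuda'):
    # Build one text embedding per class by averaging over the caption templates.
    class_embeddings = []
    for name in classnames:
        texts = tokenizer([t.format(name) for t in templates]).to(device)
        feats = model.encode_text(texts)
        feats = feats / feats.norm(dim=-1, keepdim=True)
        mean = feats.mean(dim=0)
        class_embeddings.append(mean / mean.norm())
    classifier = torch.stack(class_embeddings, dim=1)   # (embed_dim, num_classes)

    # Score images by cosine similarity with the class embeddings.
    img = model.encode_image(images.to(device))
    img = img / img.norm(dim=-1, keepdim=True)
    return (img @ classifier).argmax(dim=-1)             # predicted class indices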

4. Evaluation: Linear Classification

We use a modified version of the MoCo v3 ImageNet linear classification script, main_linear.py. We use the same single node 8-gpu recipe for all model sizes. See main_linear.py for the full list of default arguments. As with pre-training, our workflow uses submitit. For local training with torchrun, replace python run_with_submitit_linear.py with torchrun --nproc_per_node=8 main_linear.py. This script reads the ImageNet dataset path from the dataset catalog (dataset_catalog.json), which must be set properly before training.

python run_with_submitit_linear.py  \
  --arch vit_base_patch16_224 --dataset imagenet \
  --pretrained /path/to/checkpoint.pt

To evaluate linear classification on other datasets, set --dataset to the corresponding dataset name listed in dataset_catalog.json.

5. Evaluation: End-to-End Finetuning

We use a modified version of the ImageNet finetuning script from BeiT. Our code has been tested with commit f8f3df8. We have removed the explicit torch, torchvision, and timm dependencies from beit_finetuning/requirements.txt, as they conflict with the versions used in our SLIP code (CUDA 11.3/CuDNN 8.2.0, PyTorch 1.10.0 and timm 0.5.0). The finetuning code has been modified and tested to work with these versions.

5.1. Setup

To evaluate end-to-end finetuning on ImageNet, first clone the BeiT repo and checkout the correct commit:

git clone git@github.com:microsoft/unilm.git
cd unilm/beit
git checkout f8f3df8

Now copy over modified files from our beit_finetuning directory:

cp beit_finetuning/* unilm/beit
cd unilm/beit

Install pip dependencies and Nvidia Apex:

pip install -r requirements.txt
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

5.2. Commands

As with pre-training, our workflow uses submitit. For local training with torchrun, replace python run_with_submitit_finetune.py with torchrun --nproc_per_node=8 run_class_finetuning.py. We established finetuning recipes based on the BeiT recipes with some light additional hyperparameter tuning. We increase regularization with model size: ViT-S uses drop_path=0 and layer_decay=0.65, ViT-B uses drop_path=0.1 and layer_decay=0.65, and ViT-L uses drop_path=0.1 and layer_decay=0.75. Note the use of the --finetune argument instead of --resume.

ViT-Small (MoCo v3 version w/ 12 vs. 6 heads)

python run_with_submitit_finetune.py \
    --batch_size 128 --enable_deepspeed \
    --epochs 100 --warmup_epochs 20 \
    --model beit_small_patch16_224 --nb_classes 1000 \
    --imagenet_default_mean_and_std \
    --model_key state_dict --model_prefix module.visual. \
    --disable_rel_pos_bias --abs_pos_emb --use_cls \
    --mixup 0.8 --cutmix 1 \
    --layer_scale_init_value 0 \
    --lr 4e-3 --drop_path 0 --layer_decay 0.65 \
    --output_dir /path/to/output_dir --finetune /path/to/checkpoint.pt

ViT-Base

python run_with_submitit_finetune.py \
    --batch_size 128 --enable_deepspeed \
    --epochs 100 --warmup_epochs 20 \
    --model beit_base_patch16_224 --nb_classes 1000 \
    --imagenet_default_mean_and_std \
    --model_key state_dict --model_prefix module.visual. \
    --disable_rel_pos_bias --abs_pos_emb --use_cls \
    --mixup 0.8 --cutmix 1 \
    --layer_scale_init_value 0 \
    --lr 4e-3 --drop_path 0.1 --layer_decay 0.65 \
    --output_dir /path/to/output_dir --finetune /path/to/checkpoint.pt

ViT-Large

python run_with_submitit_finetune.py \
    --batch_size 128 --enable_deepspeed \
    --epochs 50 --warmup_epochs 5 \
    --model beit_large_patch16_224 --nb_classes 1000 \
    --imagenet_default_mean_and_std \
    --model_key state_dict --model_prefix module.visual. \
    --disable_rel_pos_bias --abs_pos_emb --use_cls \
    --mixup 0.8 --cutmix 1 \
    --layer_scale_init_value 0 \
    --lr 4e-3 --drop_path 0.1 --layer_decay 0.75 \
    --output_dir /path/to/output_dir --finetune /path/to/checkpoint.pt

License

This project is under the MIT license. See LICENSE for details.

Citation

@Article{mu2021slip,
  author  = {Norman Mu and Alexander Kirillov and David Wagner and Saining Xie},
  title   = {SLIP: Self-supervision meets Language-Image Pre-training},
  journal = {arXiv preprint arXiv:2112.12750},
  year    = {2021},
}

slip's People

Contributors

normster, s9xie

slip's Issues

Possible bug in Tokenizer about max sequence length

I found here that the token sequence is truncated to context_length = 77. The issue is that the truncation is done after wrapping the original tokens with [SOT] and [EOT], so the [EOT] token can be cut off if the original token sequence is too long. Since the text transformer uses the embedding of [EOT] as the feature representation, I suspect something would go wrong for long text inputs.

Am I understanding this correctly?
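
To make the concern concrete, a hypothetical illustration (the caption length is made up; 49406/49407 are the [SOT]/[EOT] ids of the CLIP BPE vocabulary):

context_length = 77
sot, eot = 49406, 49407
bpe_tokens = list(range(100))                        # pretend the caption encodes to 100 tokens
tokens = ([sot] + bpe_tokens + [eot])[:context_length]
print(eot in tokens)                                 # False: [EOT] was truncated away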

Error while installing SLIP

I am using Fedora 37 Linux and getting the following error:

Collecting slip
  Using cached SLIP-20191113.tar.gz (17.5 MB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [12 lines of output]
      /tmp/pip-install-_az8uh86/slip_dee631c0e5734df79ab8107394ad9e3a/SLIP/SLIP.py:807: SyntaxWarning: "is not" with a literal. Did you mean "!="?
        if not figpath is '':
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-install-_az8uh86/slip_dee631c0e5734df79ab8107394ad9e3a/setup.py", line 6, in <module>
          import SLIP
        File "/tmp/pip-install-_az8uh86/slip_dee631c0e5734df79ab8107394ad9e3a/SLIP/__init__.py", line 7, in <module>
          from .SLIP import Image, imread
        File "/tmp/pip-install-_az8uh86/slip_dee631c0e5734df79ab8107394ad9e3a/SLIP/SLIP.py", line 75, in <module>
          from NeuroTools.parameters import ParameterSet
      ModuleNotFoundError: No module named 'NeuroTools'
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

SyncBatchNorm causing NaN predictions during pretraining

Hello everyone,

I am running the pretraining of the SLIP model on the ISIC pathology dataset and noticed NaN predictions for some images. The images have been appropriately normalised, so there is no issue with the dataset.
If I remove the SyncBatchNorm layers, there are no NaN predictions, but the model gradients are very small, causing no learning during pretraining.
It is strange, since the model trains well for some iterations and eventually fails after the epoch is about 40% complete.

Any help/suggestions would be appreciated!

Checkpoint for RedCaps

Congratulations on this great work!

I am wondering if you have plans to release the CLIP/SLIP models trained on the RedCaps dataset in the near future.

Thanks.

When pretraining on COCO, the accuracy is low

When pre-training on COCO, after 9 epochs of training, the zero-shot accuracy of ViT-B is still 0.100.
I followed the instructions: --lr 3e-3 --wd 0.1.
Since I have 2 GPUs, I set the batch size to 8 and --update-freq to 256 to keep the total batch size at 4096.

TypeError: 'NoneType' object is not callable


class FairSlipLoaderBase(BaseMmcLoader):
    """
    SLIP models via https://github.com/facebookresearch/SLIP
    """
    def __init__(
        self,
        id,
        architecture,
    ):
        self.architecture = architecture
        self.publisher = 'facebookresearch'
        self.id = id
        self.modalities = (TEXT, IMAGE)
    def _napm_install(self):
        logger.debug('using napm to "install" facebookresearch/SLIP')
        url = "https://github.com/facebookresearch/SLIP"
        napm.pseudoinstall_git_repo(url, env_name='mmc', add_install_dir_to_path=True)
        napm.populate_pythonpaths('mmc')
        from SLIP.models import (
            SLIP_VITS16,
            SLIP_VITB16, 
            SLIP_VITL16
            )

    def load(self, device=DEVICE):
        """
        Returns the MMC associated with this loader.
        """
        self._napm_install()

        model_factory = model_factory_from_id(self.id)
        logger.debug(f"model_factory: {model_factory}")
        ckpt_url = url_from_id(self.id)
        ckpt = fetch_weights(
            url=ckpt_url, 
            namespace='fair_slip', 
            device=device,
            )
        d_args = vars(ckpt['args'])
        kwargs = {k:d_args[k] for k in ('ssl_emb_dim', 'ssl_mlp_dim') if k in d_args}
        logger.debug(kwargs)
        fix_param_names(ckpt)
        model = model_factory(**kwargs)
        model.load_state_dict(ckpt['state_dict'], strict=True)
        model = model.eval().to(device)

        from SLIP.tokenizer import SimpleTokenizer
        tokenizer = SimpleTokenizer()

        def preprocess_image_extended(*args, **kwargs):
            x = val_transform(*args, **kwargs)
            if x.ndim == 3:
                logger.debug("adding batch dimension")
                x = x.unsqueeze(0)
            return x.to(device)
        #logger.debug(model)
        mmc = MultiModalComparator(name=str(self), device=device)
        mmc.register_modality(modality=TEXT, projector=model.encode_text, preprocessor=tokenizer)
        mmc.register_modality(modality=IMAGE, projector=model.encode_image, preprocessor= preprocess_image_extended)
        mmc._model = model
        return mmc

Training recipe for ImageNet 1k

Hi, thanks for the great work!

I am trying to reproduce the results reported in Table 1 of the paper.
I was wondering if you could provide the training recipe for pretraining SimCLR and MoCo v3 on ImageNet-1K?
Thank you!

How to use SLIP with text to image?

Hi,

I was testing CLIP and implemented text-to-image search there.

I was wondering how I can do the same with SLIP?

(I'm just confused about where to start.)

Any help would be highly appreciated.

ResNet-50 codes and pretrained weights

In your paper, you state: "Our improved training procedure achieves 34.6% zero-shot transfer to ImageNet with a modified ResNet-50, exceeding the original result of 31.3%."

Could you provide the ResNet-50 code and pre-trained weights? Thank you!

Can the author provide the YFCC-100M data downloader?

Can the author or someone provide the YFCC-100M data downloader?

The README mentions that the YFCC-100M data must be organized as follows:
'''
Download the YFCC100M dataset. Our dataloader expects the following dataset directory structure with 100 folders containing 1000 zip archives of 1000 images each. The concatenation of the folder, archive, and file names is the index of the image (i.e. image 12345678 is stored as 678.jpg within 12/345.zip):
'''

This does not seem to be the original data collection format.

Thank you.

Slow data loading

Hi
I am trying to use torchrun --nproc_per_node=8 to train SimCLR on ImageNet using 8 GPUs in parallel. I am using this command to distribute the model

model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu], bucket_cap_mb=200)

The problem is that every couple of batches the data loading is very slow, and not only for the first batch (which I guess is normal).

Here are the (data_time, batch_time) values with 8 data-loading workers (j=8):

142.89722776412964 172.6102113723755
0.0011818408966064453 1.149657964706421
0.0005125999450683594 0.5908513069152832
0.000789642333984375 0.5789885520935059
0.0006721019744873047 0.5847604274749756
0.0006527900695800781 0.5850253105163574
0.0006945133209228516 0.5916690826416016
0.0006604194641113281 0.5858132839202881
33.176689863204956 113.46687459945679
0.0006616115570068359 3.237961530685425
0.0004947185516357422 0.5849909782409668
0.0006363391876220703 0.5738029479980469
0.0005011558532714844 0.5823519229888916
0.0004696846008300781 0.5900559425354004
0.0006518363952636719 0.5800015926361084
0.0006239414215087891 0.5876047611236572
0.0006771087646484375 70.80163884162903
0.0009722709655761719 0.5865006446838379
0.0006647109985351562 0.5858757495880127
0.0006537437438964844 1.4045929908752441
0.0006687641143798828 32.53645730018616
0.0008571147918701172 0.5862514972686768
0.0006554126739501953 0.5853633880615234
0.0006177425384521484 16.638994216918945
0.0009629726409912109 66.59664487838745
0.001010894775390625 0.588956356048584
0.0006909370422363281 0.5857172012329102
0.0006442070007324219 0.5854201316833496
0.0006425380706787109 29.423442363739014
0.0008411407470703125 0.5904080867767334
0.0007281303405761719 0.5878152847290039
0.0007319450378417969 47.11639881134033
0.0008819103240966797 23.894486904144287
0.0007138252258300781 0.578960657119751
0.0005040168762207031 0.5796892642974854
0.0004954338073730469 0.5855348110198975
0.0004782676696777344 66.34711623191833
0.0006091594696044922 0.5864002704620361
0.0005247592926025391 0.5784976482391357
0.0005164146423339844 0.6909780502319336
0.0006034374237060547 26.061028718948364
0.0008776187896728516 0.5903408527374268
0.0006973743438720703 0.584754467010498
0.0006537437438964844 0.5849916934967041
…
36.196861028671265 36.79169154167175
0.0008547306060791016 0.5894830226898193
0.0008087158203125 0.5778903961181641
0.0006210803985595703 0.5889377593994141
0.0008003711700439453 0.5878915786743164
0.00077056884765625 0.5870087146759033
0.0007855892181396484 93.86848998069763
0.0007998943328857422 0.5807287693023682
17.311529874801636 17.90171504020691
0.0007803440093994141 9.284274816513062
0.0008406639099121094 0.5794563293457031
0.0008251667022705078 0.6089217662811279
0.00078582763671875 0.5598442554473877
0.0007565021514892578 0.5864059925079346
0.0007340908050537109 42.826006174087524
0.0010673999786376953 0.5904500484466553
23.019705295562744 59.32295536994934
0.0007565021514892578 31.347289085388184
0.0006775856018066406 0.5731685161590576
0.0007195472717285156 0.5763015747070312
0.0005919933319091797 0.5776708126068115
0.0005700588226318359 0.5778248310089111
0.0006148815155029297 7.108304738998413
0.0005848407745361328 0.5788106918334961
0.0006554126739501953 32.21546387672424
0.0007257461547851562 88.52377581596375
0.0008158683776855469 0.5769295692443848

I also noticed that although the GPUs are 100% utilized, their power draw is only around 80/350 W.

Training recipe for cc3m and cc12m

Hi, thanks for the great work!

I was wondering if you could provide the training recipe for CC3M and CC12M (lr, wd, batch size, etc.)?

I am trying to reproduce the reported 0-shot results on CC3M for CLIP and SLIP with the same hyperparameters as YFCC-100M, but I am getting 14.1 and 18.4 (compared to the reported 17.1 and 23.0). The environment (OS, PyTorch, CUDA versions) has been double-checked.

Thanks!

Number of text transformer layers

Hi, thanks for the amazing code!

I have one question about the number of text transformer layers. The paper says the text transformer contains 38M parameters. However, your code seems to use the 12-layer, 512-wide model with 8 attention heads, which contains 63M parameters according to the CLIP paper. May I know which one is used?

Thanks a lot!

Best,
Junnan

Multiple GPU training

Thanks for sharing the code. I was going to use your code to train SimCLR on ImageNet-1K, but could not use multiple GPUs on one machine. Could you let me know how to use multiple GPUs, and how the hyperparameters change with the number of GPUs?

Secondly, what do you recommend for ImageNet-1K hyperparameters, e.g. learning rate, batch size, etc.?

The CLIP loss implementation seems not completely right

Hi,

Thank you for sharing your code!

If I am not missing something, I think the CLIP loss implementation here is not completely correct, in the sense that one should gather with gradient. Do you have a specific reason for not doing so (since you have already implemented it for SimCLR), or have you back-propagated the gradient somewhere that I have missed?
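
For reference, a common "gather with gradient" pattern looks roughly like this (illustrative only, not code from this repo):

import torch
import torch.distributed as dist

class GatherWithGrad(torch.autograd.Function):
    """All-gather features across ranks while keeping a gradient path."""
    @staticmethod
    def forward(ctx, x):
        gathered = [torch.zeros_like(x) for _ in range(dist.get_world_size())]
        dist.all_gather(gathered, x)
        return torch.cat(gathered, dim=0)

    @staticmethod
    def backward(ctx, grad):
        # all_gather itself is not differentiable: sum the incoming gradients
        # across ranks and return the slice for this rank's local inputs.
        grad = grad.contiguous()
        dist.all_reduce(grad)
        chunk = grad.shape[0] // dist.get_world_size()
        rank = dist.get_rank()
        return grad[rank * chunk:(rank + 1) * chunk]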

CC3M results cannot be reproduced

Thanks for the great paper. However, I cannot reproduce the results based on this repo, and it would be greatly appreciated if more details could be provided.

Similar to issue #9, I also cannot reproduce the CC3M results with 64 GPUs. Concretely, I trained the model on CC3M for 40 epochs with your recommended hyper-parameters (weight decay 0.1, learning rate 3e-3, and 2 warmup epochs), but on the ImageNet-1K linear probing task it reaches only ~50% top-1 accuracy instead of the 65.4% reported in the paper.

Besides, I also noticed that the hyper-parameters are not fully contained in the 'args' key of your released checkpoint; e.g., ssl_scale is missing, which could be essential for reproduction. The CC3M checkpoint also indicates that the weight decay was set to 0.5, whereas the paper and README both suggest 0.1. You also answered that the best result is achieved well before the last epoch, and the checkpoint indicates epoch 36; I tested the model from the 36th epoch as well, and the result is still far from 65.4%.

I hope you can share the hyper-parameters used for pre-training on CC3M. As stated in issue #9, it would be very helpful if the training log could be provided; if wandb was used, the log should already be uploaded to the cloud and easy to find.

About license

Thanks for the great work! The project is released under the MIT license. Does that mean the pre-trained models are also released under the MIT license? Thanks.

Load pretrained model from CLIP repo to this SLIP repo

Hi.

Is there a code snippet for loading a pretrained model from the CLIP repo into this SLIP repo?

  • The left one is the CLIP pretrained model from this repo, and the right one is a CLIP pretrained model from the CLIP repo.
  • As you can see, the key names are different, and some tensors have different shapes (768 vs. [1, 1, 768]).

(screenshot comparing the two state dicts)

Thank you!

Is there a way to display a table in a cell?

Hi, thanks for providing this great tool. I am wondering whether there is a way to display a table in a text cell. I have tried to create a table string as input to the text, but it cannot render the '\n'. For example, I want to put a caption-frequency table in a cell; the string in the second row below actually has a '\n' after 'caption frequency', but the line break is not rendered, so the table-like string cannot be shown properly. Is there a way to show a table per cell?


How to use SLIP to classify a specific picture?

How can I use SLIP to classify a specific picture, as in this CLIP example:

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("D:\OD\CLIP\ViT-B-32.pt", device=device)

image = preprocess(Image.open("fuliqiang.png")).unsqueeze(0).to(device)
text = clip.tokenize(["sleep", "play cellphone", "work"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probs:", probs)  # prints: [[0.9927937 0.00421068 0.00299572]]

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.