lilt's Introduction

Paper Conference

LilT

Contrastive Alignment of Vision to Language Through Parameter-Efficient Transfer Learning [ICLR 23]

Setup

Dependencies

conda env create --name lilt --file environment.yaml
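
You can then activate the environment with conda activate lilt before running any of the commands below.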

Data

See individual sections below for instructions.

Weights

The links to the weights all point to Google Drive. To download them without a browser, pip install gdown and use the ID of the file you want to download. For example, to download the model weights at https://drive.google.com/file/d/1mdXQK9Jidk97FrLR2Bqzhbuptt07BPo4/view?usp=drive_link, you would run gdown 1mdXQK9Jidk97FrLR2Bqzhbuptt07BPo4. The weights were saved with torch.save() and can be loaded with torch.load().
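
For example, a minimal sketch (not part of the repo) for downloading and inspecting a checkpoint from Python, using the file ID from the link above and an assumed local filename:

# Hypothetical helper: download a checkpoint with gdown and inspect it.
import gdown
import torch

file_id = "1mdXQK9Jidk97FrLR2Bqzhbuptt07BPo4"  # ID from the example link above
output = "lilt_checkpoint.pth"                 # assumed local filename
gdown.download(f"https://drive.google.com/uc?id={file_id}", output, quiet=False)

checkpoint = torch.load(output, map_location="cpu")
print(checkpoint.keys())  # inspect what the checkpoint contains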

Pretraining

Pretraining Data

The data format used for pretraining looks like this:

pretraining_data: List[Dict]  = [
    {
        "image": "/media/coco2017/coco-images/000000203564.jpg",
        "image_id": 203564,
        "caption": "A bicycle replica with a clock as the front wheel."
    },
    {
        "image": "/media/coco2017/coco-images/000000322141.jpg",
        "image_id": 322141,
        "caption": "A room with blue walls and a white sink and door."
    }, 
    ...
]

The image_id field does not need to correspond to anything for pretraining (in this example, it is the COCO image ID). The image field must contain an absolute path, i.e. one that starts with a slash /. You can download examples of pretraining files from Salesforce Research, but note that the path for each image will need to be changed to match your local setup.
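
If you need to build such a file yourself (e.g. from COCO captions), a minimal sketch along these lines works; the annotation path, image directory, and output filename below are assumptions, not files shipped with this repo:

# Hypothetical helper: convert a COCO captions annotation file into the
# list-of-dicts format shown above.
import json
from pathlib import Path

COCO_ANNOTATIONS = Path("/media/coco2017/annotations/captions_train2017.json")  # assumption
COCO_IMAGE_DIR = Path("/media/coco2017/coco-images")                            # assumption
OUTPUT_FILE = Path("./pretrain-pairs.json")                                     # assumption

with open(COCO_ANNOTATIONS) as f:
    coco = json.load(f)

# Map each COCO image id to its absolute path on disk.
id_to_path = {img["id"]: str(COCO_IMAGE_DIR / img["file_name"]) for img in coco["images"]}

# One entry per caption; images with multiple captions appear multiple times.
pretraining_data = [
    {
        "image": id_to_path[ann["image_id"]],
        "image_id": ann["image_id"],
        "caption": ann["caption"],
    }
    for ann in coco["annotations"]
]

with open(OUTPUT_FILE, "w") as f:
    json.dump(pretraining_data, f)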

You can download COCO from https://cocodataset.org/#home.

The other pretraining datasets can be downloaded from Huggingface.

Pretraining Commands

See examples/pretrain_{clip, lilt, lit}.sh. In general, the pretraining commands look something like this:

python -m torch.distributed.launch --master_port=43770 --nproc_per_node=4 \
    --use_env PretrainHydra.py --config CLIPAdjustable \
    --output_dir ./storage/lilt_cache/clip_example \
    --overrides text_encoder=base vision_encoder=base \
    +save_last_only=True disable_wandb=False

The model checkpoints will be placed in --output_dir. The --overrides flag can be used to specify overrides for items in the configuration, following the syntax of Hydra. For example, the default size (in the config) for the text encoder is small, but we are overriding it to base from the command line.
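
The overrides are plain Hydra dot-list assignments applied on top of the config. Here is a minimal illustration with OmegaConf (which Hydra builds on); the default value shown for vision_encoder is an assumption:

from omegaconf import OmegaConf

# Assumed defaults from the config file.
cfg = OmegaConf.create({"text_encoder": "small", "vision_encoder": "small"})

# What `--overrides text_encoder=base vision_encoder=base` amounts to.
overrides = OmegaConf.from_dotlist(["text_encoder=base", "vision_encoder=base"])
cfg = OmegaConf.merge(cfg, overrides)

print(cfg.text_encoder)  # "base"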

The code was tested with V100, A6000, and A100 GPUs. You will need around 32GB of GPU memory (with 4x GPUs) to match the training settings of the CLIP model exactly, but the LilT and LiT models can be trained on 4x GPUs with much less memory. Of course, you can always lower the batch size if you want to train on a single GPU.

Evaluation

Classification

First, set up the ImageNetV2 dataset as described here. Next, edit classification.py, specifically this part:

test_dataset = ImageNetV2Dataset(
    location="./storage/10/imagenet_v2", transform=test_transform
)

to point to wherever you downloaded ImageNetV2. To run multimodal classification, you can use a command like the following:

python -m torch.distributed.launch --nproc_per_node=1 --use_env classification.py \
    --config Retrieval_AdjustableCLIP_Flickr \
    --output_dir ./clf_output \
    --checkpoint ./storage/lilt_example/checkpoint_14.pth \
    --evaluate \
    --overrides text_encoder=base vision_encoder=base

The value of the --config flag is not a typo; we reuse the same config for classification and retrieval.
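
If you want to sanity-check the ImageNetV2 location before launching the distributed job, a minimal sketch like the following can help; it assumes classification.py uses the imagenetv2_pytorch package (which exposes the ImageNetV2Dataset constructor shown above) and a generic test transform:

# Minimal sanity check for the ImageNetV2 path (assumes the imagenetv2_pytorch package).
from imagenetv2_pytorch import ImageNetV2Dataset
from torchvision import transforms

test_transform = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
dataset = ImageNetV2Dataset(location="./storage/10/imagenet_v2", transform=test_transform)
image, label = dataset[0]
print(len(dataset), image.shape, label)  # expect 10,000 images with integer labels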

Retrieval

Follow the instructions in codezakh/SIMLA#zero-shot-flickr to set up the data for Flickr. The only change is that you should edit configs-v2/Retrieval_AdjustableCLIP_Flickr.yaml (rather than configs/Retrieval_flickr.yaml) to point to the downloaded dataset.

See examples/evaluate_{clip, lilt, lit}.sh for evaluation scripts.

Multilingual Retrieval

Download the XTD10 dataset from the official repo. Then, run the script process_xd10.py, making sure to edit the paths at the top of the file:

XTD10_DIR = Path("/home/zkhan/Cross-lingual-Test-Dataset-XTD10/XTD10")
COCO_TRAIN_DIR = Path("./storage/10/coco2014/train2014")
OUTPUT_DIR = Path("./storage/10/multilingual_coco2014_xtd10")

so that XTD10_DIR and COCO_TRAIN_DIR point to where you have downloaded the respective datasets.

You can then evaluate a pretrained model like this:

python -m torch.distributed.launch --master_port=40770 \
    --nproc_per_node=1 \
    --use_env zero_shot_retrieval.py \
    --config Retrieval_AdjustableCLIP_COCO \
    --output_dir ./eval_output \
    --checkpoint ./path/to/checkpoint.pth \
    --evaluate --overrides text_encoder=base_multilingual \
    vision_encoder=base \
    test_file=$XTD10_OUTPUT_DIR/val.json

where XTD10_OUTPUT_DIR is the directory you told the process_xd10.py script to write the preprocessed dataset to (OUTPUT_DIR at the top of the script).

To evaluate on a specific language as done in the paper, change test_file=$XTD10_OUTPUT_DIR/val.json to test_file=$XTD10_OUTPUT_DIR/val_{lang_abbrv}.json, where lang_abbrv is one of the following:

LANGUAGES = ("es", "it", "ko", "pl", "ru", "tr", "zh")
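
To sweep all of the languages in one go, a small driver script along these lines works; it is a sketch that assumes the same flags as the command above and that XTD10_OUTPUT_DIR is exported in your environment:

# Hypothetical driver: run the zero-shot retrieval evaluation once per XTD10 language.
import os
import subprocess

languages = ("es", "it", "ko", "pl", "ru", "tr", "zh")
xtd10_dir = os.environ["XTD10_OUTPUT_DIR"]  # where process_xd10.py wrote its output

for lang in languages:
    subprocess.run(
        [
            "python", "-m", "torch.distributed.launch", "--master_port=40770",
            "--nproc_per_node=1", "--use_env", "zero_shot_retrieval.py",
            "--config", "Retrieval_AdjustableCLIP_COCO",
            "--output_dir", f"./eval_output_{lang}",
            "--checkpoint", "./path/to/checkpoint.pth",
            "--evaluate",
            "--overrides", "text_encoder=base_multilingual", "vision_encoder=base",
            f"test_file={xtd10_dir}/val_{lang}.json",
        ],
        check=True,
    )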

Citation

@inproceedings{ContrastiveAlignmentVisionFu2023,
  title = {Contrastive Alignment of Vision to Language Through Parameter-Efficient Transfer Learning},
  booktitle = {The Eleventh International Conference on Learning Representations},
  author = {Khan, Zaid and Fu, Yun},
  year = {2023},
}

Acknowledgements

This code borrows heavily from https://github.com/salesforce/ALBEF.


lilt's Issues

Release of pre-trained models

Hi,

This is really valuable work; in particular, I like that you have results for various degrees of finetuning of the language and vision encoders.

I am interested in evaluating some of your models on additional tasks, and for that purpose it would be useful if the weights for all the models mentioned in Table 1 were released. Is there a plan to do so?

Thanks

Acquisition of the "coco2017/pretrain-pairs.json" pretraining data

Thank you for your work, which has inspired me a lot. However, I have a question I hope you can help with. The pretraining data example you provided uses COCO2014, but during pretraining the configuration file uses coco2017/pretrain-pairs.json. Could you please provide the corresponding script, or point me to any reference information? Thank you!

Have you tried adding adapters to CLIP?

Hello, I'm interested in your nice work and have some questions about the paper. Have you tried adding adapters to CLIP, the same way you applied them to BERT-ViT? CLIP and BERT-ViT seem to have different pretraining data in Table 1; as we know, the original CLIP uses 400M image-text pairs for training. Or does the CLIP in Table 1 actually refer to the BERT-ViT model with all parameters finetuned?

Also, why didn't you fully finetune BERT-ViT on COCO2014 and make that the 100%-trained version in Table 1? I think such a model may represent the 100%-trained model more precisely than CLIP.
