
MUST's Introduction

Masked Unsupervised Self-training for Zero-shot Image Classification

This is the PyTorch code for the MUST paper. The repository supports fine-tuning a CLIP model on unlabeled images from a target domain.

Requirements

  • pytorch 1.10.0
  • timm 0.4.12
  • tensorboardX
  • ftfy

Dataset Setup

Dataset paths are stored in dataset_catalog.json, which needs to be modified to point to your local paths. The ImageNet dataset follows the standard folder structure. For the other datasets, please refer to the scripts from VISSL to download and prepare them. CLIP's class labels and prompt templates are stored in classes.json and templates.json.
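
The exact schema of dataset_catalog.json is defined by this repository, so the snippet below is only a minimal sketch, assuming the file maps dataset names to entries that contain local paths (the key handling is an assumption, not taken from the repo). It can be used to sanity-check the edited catalog before launching a long distributed run:

import json
from pathlib import Path

with open("dataset_catalog.json") as f:
    catalog = json.load(f)

for name, entry in catalog.items():
    # An entry may be a plain path string or a dict containing paths; both
    # cases are handled here as assumptions about the file's layout.
    paths = [entry] if isinstance(entry, str) else [v for v in entry.values() if isinstance(v, str)]
    for p in paths:
        status = "ok" if Path(p).exists() else "MISSING"
        print(f"{name}: {p} [{status}]")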

Training

Run the following command on 16 A100 GPUs:

python -m torch.distributed.run --nproc_per_node=16 train.py --dataset [name_of_dataset] --clip_model ViT-B/16 
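
If you have fewer GPUs, the same launcher can be used with a smaller --nproc_per_node; note that this changes the effective batch size, so reproduced accuracy may differ from the paper (see the issues below). A hypothetical 4-GPU run (the dataset name assumes an 'imagenet' entry in dataset_catalog.json):

python -m torch.distributed.run --nproc_per_node=4 train.py --dataset imagenet --clip_model ViT-B/16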

Results

ViT-B/16 (top-1 accuracy, %):

Method ImageNet SUN397 Food101 GTSRB DTD UCF101
CLIP 68.3 64.4 88.7 43.4 44.7 68.8
MUST 77.7 71.8 92.7 65.5 54.1 81.1

ViT-L/14 (top-1 accuracy, %):

Method ImageNet SUN397 Food101 GTSRB DTD UCF101
CLIP 75.5 67.4 92.9 50.6 55.4 77.0
MUST 82.1 74.6 95.3 68.7 62.6 85.7

Citation

@inproceedings{li2022masked,
      title={Masked Unsupervised Self-training for Label-Free Image Classification}, 
      author={Junnan Li and Silvio Savarese and Steven C. H. Hoi},
      year={2023},
      booktitle={ICLR},
}

MUST's Issues

How to set hyperparameters when using the ResNet backbone (pre-trained CLIP)?

Hi, thanks for the nice code!
On the UCF101 dataset I used the pre-trained ResNet-50 (CLIP) backbone (the ViT backbone works well), but found that the fine-tuning performance drops a lot at the beginning, by about 10% compared to CLIP, and then increases only slowly. How should I set hyperparameters such as the learning rate to mitigate the drop in performance at the beginning?

How to organize the dataset

Thanks for the nice work.
I would like to know how to organize the data. I currently have the UCF101 data; the default data structure is "ucf101_rawframes\rawframes\<101 class folders>".
Because I don't have a big enough graphics card (only one 2060 with 6 GB) but still want to learn your code by debugging, I changed the command from the distributed launch to this:
"train.py --dataset ucf101_frames --clip_model ViT-B/16"
but now it raises this error: [WinError 3] The system cannot find the path specified. : '/export/share/datasets/vision/image_datasets/ucf101\train'
I would like to know how to organize the data to match your code. I found the dataset structure given in the comments of the 'find_classes' function, but I don't know how to get files with extensions like .ext.
Can you give me some help? Thanks a lot.
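
For reference (not an official answer), the 'find_classes' comment describes torchvision's ImageFolder convention: one sub-folder per class under each split, with ordinary image files inside, where '.ext' is just a placeholder for an extension such as .jpg or .png. A minimal sketch to check that layout, assuming from the error above that the loader expects <dataset_root>/train/<class_name>/<frame>.jpg (the root path below is hypothetical):

from pathlib import Path

# Hypothetical local root; point this at the folder referenced in dataset_catalog.json.
root = Path(r"D:\datasets\ucf101")
split = root / "train"

# ImageFolder-style layout: one directory per class, image files inside.
classes = sorted(d.name for d in split.iterdir() if d.is_dir())
print(f"found {len(classes)} class folders, e.g. {classes[:3]}")

for cls in classes[:3]:
    # '.ext' in the find_classes docstring just means any image extension.
    n = sum(1 for p in (split / cls).glob("*") if p.suffix.lower() in {".jpg", ".jpeg", ".png"})
    print(f"{cls}: {n} frames")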

Templates for different datasets

Hi, great job!
I am wondering how you generated the templates used in CLIP for the different datasets. Did you write these templates yourself, or is there a public source?
Thanks for your reply.
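
For context (not an official answer), such templates follow the prompt-ensembling recipe published with the original CLIP repository: each class name is inserted into every template, the normalized text embeddings are averaged, and the result becomes that class's zero-shot classifier weight. A minimal sketch, assuming classes.json and templates.json map a dataset name to its class names and template strings (that layout is an assumption):

import json
import torch
import clip  # OpenAI CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)

# Assumed file layout: {"ucf101": [...]} in both JSON files.
classes = json.load(open("classes.json"))["ucf101"]
templates = json.load(open("templates.json"))["ucf101"]

with torch.no_grad():
    weights = []
    for name in classes:
        prompts = [t.format(name) for t in templates]    # fill each template with the class name
        tokens = clip.tokenize(prompts).to(device)
        emb = model.encode_text(tokens)
        emb = emb / emb.norm(dim=-1, keepdim=True)        # normalize each prompt embedding
        emb = emb.mean(dim=0)                             # ensemble over templates
        weights.append(emb / emb.norm())
    zeroshot_weights = torch.stack(weights, dim=1)        # (embed_dim, num_classes)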

Timeout error during training to reproduce results

Hi,

I have been trying to train the model on the ImageNet dataset using 16 V100 GPUs.
I am getting a timeout error during evaluation in the training script after the first epoch. It occurs at exactly the same point, iteration [2100/10010] of the evaluation. Any idea as to why this is occurring?

STACK TRACE:

Test: [ 2090/10010] eta: 1:53:32 acc1: 73.3203 (76.6279) ema_acc1: 77.7734 (76.4956) time: 0.8568 data: 0.0001 max mem: 14135
[E ProcessGroupNCCL.cpp:587] [Rank 12] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800172 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 15] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800768 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:587] [Rank 9] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1801901 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 8] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1801862 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1801679 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1802173 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 11] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1802073 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1801667 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1801777 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1801821 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 10] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1802399 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1802025 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1802703 milliseconds before timing out.
Test: [ 2100/10010] eta: 1:53:23 acc1: 79.5312 (76.6506) ema_acc1: 81.9922 (76.5298) time: 0.8566 data: 0.0001 max mem: 14135
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1546 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1549 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1553 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1554 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1555 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1556 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1558 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1562 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1563 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1564 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1565 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1566 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1570 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1574 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 12 (pid: 1567) of binary: /nfs/users/ext_jameel.hassan/anaconda3/envs/must/bin/python
Traceback (most recent call last):
  File "/nfs/users/ext_jameel.hassan/anaconda3/envs/must/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/nfs/users/ext_jameel.hassan/anaconda3/envs/must/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/nfs/users/ext_jameel.hassan/anaconda3/envs/must/lib/python3.8/site-packages/torch/distributed/run.py", line 723, in <module>
    main()
  File "/nfs/users/ext_jameel.hassan/anaconda3/envs/must/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/nfs/users/ext_jameel.hassan/anaconda3/envs/must/lib/python3.8/site-packages/torch/distributed/run.py", line 719, in main
    run(args)
  File "/nfs/users/ext_jameel.hassan/anaconda3/envs/must/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/nfs/users/ext_jameel.hassan/anaconda3/envs/must/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/nfs/users/ext_jameel.hassan/anaconda3/envs/must/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train.py FAILED

Failures:
[1]:
time : 2023-01-02_15:24:36
host : p4-r66-a.g42cloud.net
rank : 15 (local_rank: 15)
exitcode : -6 (pid: 1575)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 1575

Root Cause (first observed failure):
[0]:
time : 2023-01-02_15:24:36
host : p4-r66-a.g42cloud.net
rank : 12 (local_rank: 12)
exitcode : -6 (pid: 1567)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 1567
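
Not a fix from the authors, but a common workaround for this class of failure is to raise the process-group timeout above the default 30 minutes (the 1800000 ms limit in the log), so a slow all_gather during evaluation does not trip the NCCL watchdog. A hedged sketch, assuming the repo creates its process group with torch.distributed.init_process_group:

import datetime
import torch.distributed as dist

# Hypothetical change (not taken from the MUST code): allow collectives to
# wait up to 2 hours instead of the default 30 minutes before the watchdog fires.
dist.init_process_group(
    backend="nccl",
    init_method="env://",
    timeout=datetime.timedelta(hours=2),
)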

Great work was not accepted~

I didn't see this work in the NeurIPS acceptance list. I can't believe that such great work was not accepted. I think the reviewers misunderstood the downstream tasks of CLIP, because there is no previous work that fine-tunes CLIP in an unsupervised fashion.

Model weights

Hi,
Thank you for making this work public. Are the trained MUST model weights available?

SUN397 .npy file format

Hi! Thanks for releasing this great work to the public!

I followed the instructions and prepared the datasets with VISSL. But here is the problem I ran into: VISSL converts the SUN397 dataset into some '.npy' files, unlike the simple image folders used for the other datasets, and 'train.py' cannot work with '.npy'.

So I want to ask whether you used another special 'train.py' for SUN397, or prepared the SUN397 dataset differently from VISSL.

Thank you!
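
Not an official answer, but VISSL's preparation scripts usually write .npy files containing arrays of image paths and labels (its disk_filelist format) rather than copying images into class folders. If that is what was produced, one hedged option is to materialize an image-folder layout from those arrays so the repository's loader can read SUN397 like the other datasets; the file names and array contents below are assumptions:

import shutil
from pathlib import Path
import numpy as np

# Assumed VISSL outputs: one array of image path strings, one array of labels.
paths = np.load("train_images.npy", allow_pickle=True)
labels = np.load("train_labels.npy", allow_pickle=True)

out = Path("sun397/train")
for img_path, label in zip(paths, labels):
    dst = out / str(label)                 # one folder per class
    dst.mkdir(parents=True, exist_ok=True)
    shutil.copy(img_path, dst / Path(img_path).name)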

Discrepancy in Accuracy without Distributed Mode

Hello, I am currently running your code on a single GPU (without distributed mode). However, the results are significantly different from those presented in your paper. Is it expected for the results to vary? For instance, the result on a single GPU for the DTD dataset is 50.1%, whereas your paper reports 54.1% with ViT-B/16.

Training on small datasets

Hi, thanks for your work. Are 16 A100 GPUs and a large batch size necessary on small datasets such as Caltech101 and UCF101?

Global-local Feature Alignment

Hi,
Have you tried an InfoNCE loss for the global-local feature alignment?

[CLS] and [MSK] from the same image constitute positive pairs
[CLS] and [MSK] from different images constitute negative pairs
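
For illustration only, here is the InfoNCE variant this issue proposes (it is not the loss used in the MUST code): within a batch, the i-th global [CLS] embedding and the i-th pooled [MSK] embedding form the positive pair, and all other pairings act as negatives.

import torch
import torch.nn.functional as F

def info_nce(cls_emb: torch.Tensor, msk_emb: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """cls_emb, msk_emb: (batch, dim) features from the same batch of images."""
    cls_emb = F.normalize(cls_emb, dim=-1)
    msk_emb = F.normalize(msk_emb, dim=-1)
    logits = cls_emb @ msk_emb.t() / temperature               # (batch, batch) similarities
    targets = torch.arange(cls_emb.size(0), device=cls_emb.device)
    # Symmetric contrastive loss: [CLS] -> [MSK] and [MSK] -> [CLS].
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Example with random features:
loss = info_nce(torch.randn(8, 512), torch.randn(8, 512))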
