Hi! Thanks for releasing this great work to the public!
I followed the instructions to prepare the datasets as described by VISSL. But I ran into a problem: VISSL converts the SUN397 dataset into '.npy' files, unlike the other datasets, which are plain image folders, and 'train.py' cannot work with '.npy'.
So I want to ask whether you used a special 'train.py' for SUN397, or prepared a SUN397 dataset differently from VISSL.
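In case it helps others with the same problem, here is a minimal sketch of a dataset that reads VISSL-style '.npy' filelists (one array of image paths plus one array of integer labels). The file names `train_images.npy` / `train_labels.npy` and the exact array layout are assumptions; check what VISSL actually wrote to disk for SUN397.

```python
import numpy as np

class NpyFilelistDataset:
    """Minimal dataset over VISSL-style .npy filelists: one array of
    image paths and one array of integer labels (names are assumptions)."""

    def __init__(self, images_npy, labels_npy):
        # allow_pickle=True because the path arrays are usually object arrays
        self.paths = np.load(images_npy, allow_pickle=True)
        self.labels = np.load(labels_npy, allow_pickle=True)
        assert len(self.paths) == len(self.labels)

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # real code would open the image here (e.g. PIL.Image.open)
        return str(self.paths[idx]), int(self.labels[idx])
```

Wrapping this in a `torch.utils.data.Dataset` that actually decodes the image should let the existing 'train.py' pipeline consume the '.npy' files without converting them back to folders.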
Hi, thanks for the nice code!
On the UCF101 dataset, I used a pre-trained ResNet-50 (CLIP) backbone (the ViT backbone works well), but found that the fine-tuning performance drops a lot at the beginning, by about 10% compared to CLIP, and then increases only slowly. How should I set hyperparameters such as the learning rate to mitigate the drop in performance at the beginning?
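A common mitigation for an early accuracy collapse when fine-tuning a pretrained backbone is a lower base learning rate combined with a linear warmup, so the first updates do not destroy the pretrained features. This is a generic recipe, not necessarily the schedule this repo uses; a sketch:

```python
import math

def lr_at_step(step, total_steps, base_lr, warmup_steps):
    """Linear warmup followed by cosine decay (a common fine-tuning
    recipe; the schedule in this repo's configs may differ)."""
    if step < warmup_steps:
        # ramp linearly from base_lr/warmup_steps up to base_lr
        return base_lr * (step + 1) / warmup_steps
    # cosine decay from base_lr down to 0 over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

If the ViT backbone trains fine at the current settings, trying a 5-10x smaller base learning rate for the ResNet run (with the same warmup) is a reasonable first experiment.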
I have been trying to train the model on the ImageNet dataset using 16 V100 GPUs.
I am getting a timeout error during evaluation in the training script after the first epoch. It occurs at exactly the same point, iteration [2100/10010] of the evaluation. Any idea why this is occurring?
STACK TRACE:
Test: [ 2090/10010] eta: 1:53:32 acc1: 73.3203 (76.6279) ema_acc1: 77.7734 (76.4956) time: 0.8568 data: 0.0001 max mem: 14135
[E ProcessGroupNCCL.cpp:587] [Rank 12] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800172 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 15] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800768 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:587] [Rank 9] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1801901 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 8] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1801862 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1801679 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1802173 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 11] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1802073 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1801667 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1801777 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1801821 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 10] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1802399 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1802025 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1802703 milliseconds before timing out.
Test: [ 2100/10010] eta: 1:53:23 acc1: 79.5312 (76.6506) ema_acc1: 81.9922 (76.5298) time: 0.8566 data: 0.0001 max mem: 14135
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1546 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1549 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1553 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1554 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1555 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1556 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1558 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1562 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1563 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1564 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1565 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1566 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1570 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1574 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 12 (pid: 1567) of binary: /nfs/users/ext_jameel.hassan/anaconda3/envs/must/bin/python
Traceback (most recent call last):
  File "/nfs/users/ext_jameel.hassan/anaconda3/envs/must/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/nfs/users/ext_jameel.hassan/anaconda3/envs/must/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/nfs/users/ext_jameel.hassan/anaconda3/envs/must/lib/python3.8/site-packages/torch/distributed/run.py", line 723, in <module>
    main()
  File "/nfs/users/ext_jameel.hassan/anaconda3/envs/must/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/nfs/users/ext_jameel.hassan/anaconda3/envs/must/lib/python3.8/site-packages/torch/distributed/run.py", line 719, in main
    run(args)
  File "/nfs/users/ext_jameel.hassan/anaconda3/envs/must/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/nfs/users/ext_jameel.hassan/anaconda3/envs/must/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/nfs/users/ext_jameel.hassan/anaconda3/envs/must/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train.py FAILED
Failures:
[1]:
time : 2023-01-02_15:24:36
host : p4-r66-a.g42cloud.net
rank : 15 (local_rank: 15)
exitcode : -6 (pid: 1575)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 1575
Root Cause (first observed failure):
[0]:
time : 2023-01-02_15:24:36
host : p4-r66-a.g42cloud.net
rank : 12 (local_rank: 12)
exitcode : -6 (pid: 1567)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 1567
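For anyone hitting the same 30-minute ALLGATHER timeout: one common cause (a guess here, not a confirmed diagnosis of this repo) is that ranks run different numbers of evaluation batches when the validation set does not divide evenly across GPUs, so the ranks that finish early block in a collective the slow ranks never enter. A toy sketch, not this repo's actual sampler, of why padding the shards to equal size helps:

```python
import math

def iters_per_rank(num_samples, world_size, batch_size, pad=True):
    """Iterations each rank runs during evaluation. Without padding,
    a non-divisible dataset can give some ranks one extra batch, so
    the other ranks hang in the next all_gather (illustration of one
    common cause of this kind of timeout)."""
    if pad:
        # DistributedSampler-style: pad so every rank sees the same count
        per_rank = math.ceil(num_samples / world_size)
        return [math.ceil(per_rank / batch_size)] * world_size
    # naive split: leftover samples go to the first few ranks
    base, extra = divmod(num_samples, world_size)
    counts = [base + (1 if r < extra else 0) for r in range(world_size)]
    return [math.ceil(c / batch_size) for c in counts]
```

Raising the `timeout` (a `datetime.timedelta`) passed to `torch.distributed.init_process_group` only postpones the abort; equalizing the per-rank batch counts removes the hang.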
Hello, I am currently running your code on a single GPU (without distributed mode). However, the results are significantly different from those presented in your paper. Is it expected for the results to vary? For instance, my single-GPU result on the DTD dataset is 50.1%, whereas your paper reports 54.1% with ViT-B/16.
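One frequent source of such gaps (a guess, not a confirmed diagnosis) is that moving from multi-GPU to a single GPU shrinks the effective batch size while the learning rate in the config stays tuned for the large batch. The linear scaling rule is a common first adjustment; `base_batch` below stands for whatever total batch size the paper's multi-GPU config used, which is an assumption here:

```python
def scaled_lr(base_lr, base_batch, actual_batch):
    """Linear scaling rule (Goyal et al., 'Accurate, Large Minibatch
    SGD'): keep the lr-to-batch-size ratio constant when the effective
    batch size changes."""
    return base_lr * actual_batch / base_batch
```

For example, dropping from a total batch of 256 to 16 would scale a 0.1 learning rate down to 0.00625 under this rule.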
Hi, great job!
I am wondering how you generated the templates used in CLIP for the different datasets: did you write these templates yourself, or is there a public source?
Thanks for your reply.
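For reference, OpenAI's CLIP repository publishes per-dataset prompt template lists (in its prompt-engineering notebook), and papers typically reuse those rather than writing new ones. A sketch of how such templates expand a class name into text prompts; the specific template strings below are illustrative examples in the style of the published lists:

```python
# Illustrative templates in the style of the lists published in
# OpenAI's CLIP repo; the repo's per-dataset lists are longer.
TEMPLATES = [
    "a photo of a {}.",
    "a bad photo of a {}.",
    "a low resolution photo of a {}.",
]

def build_prompts(classname, templates=TEMPLATES):
    """Expand one class name into one text prompt per template;
    the text encoder embeds each and the embeddings are averaged."""
    return [t.format(classname) for t in templates]
```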
I didn't see this work in the NeurIPS acceptance list. I can't believe that such great work was not accepted. I think the reviewers misunderstood the downstream tasks of CLIP, because there is no previous work that fine-tunes CLIP in an unsupervised fashion.
Thanks for the nice work.
I would like to know how to organize the data. I currently have the UCF101 data; the default structure is: "ucf101_rawframes\rawframes\<101 class folders>".
Because I don't have a powerful enough graphics card (only one 2060 with 6 GB), but I want to study your code by debugging, I changed the command from the distributed version to this:
"train.py --dataset ucf101_frames --clip_model ViT-B/16"
but now it raises this error: [WinError 3] The system cannot find the path specified: '/export/share/datasets/vision/image_datasets/ucf101\train'
I would like to know how to organize the data to match your code. I found the dataset structure given in the comments of the 'find_classes' function, but I don't know how to get files with extensions like '.ext'.
Can you give me some help? Thanks a lot.
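In case it helps: the `.ext` in such comments is usually a placeholder for real image extensions (`.jpg`, `.png`, ...), not a literal suffix, and the expected layout is typically torchvision's `ImageFolder` convention of `root/<split>/<class_name>/<image>.jpg`. A sketch of a walker under that assumption (this mirrors the convention, not necessarily this repo's exact `find_classes`):

```python
import os

# ".ext" in the code comments is a placeholder for extensions like these
IMG_EXTENSIONS = (".jpg", ".jpeg", ".png")

def find_classes(root):
    """One subdirectory per class under root, sorted for stable indices."""
    classes = sorted(d.name for d in os.scandir(root) if d.is_dir())
    return classes, {c: i for i, c in enumerate(classes)}

def list_samples(root):
    """Collect (path, class_index) pairs for every image under root."""
    classes, class_to_idx = find_classes(root)
    samples = []
    for c in classes:
        for dirpath, _, filenames in sorted(os.walk(os.path.join(root, c))):
            for f in sorted(filenames):
                if f.lower().endswith(IMG_EXTENSIONS):
                    samples.append((os.path.join(dirpath, f), class_to_idx[c]))
    return samples
```

So moving the 101 class folders under a `train` (and `val`/`test`) directory, and pointing the dataset root argument at your local path instead of the hard-coded '/export/share/...' default, should match what the loader expects.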