
text4vis's Introduction

Hi, I'm Wenhao Wu 👋

Links: Zhihu · GitHub · LinkedIn · Google Scholar · X

I (Wenhao Wu, 吴文灏) am a Ph.D. student in the School of Computer Science at The University of Sydney, supervised by Prof. Wanli Ouyang. I collaborate closely with the Department of Computer Vision Technology (VIS) at Baidu, led by Dr. Jingdong Wang (IEEE Fellow). I received my M.S.E. degree from the Multimedia Laboratory (MMLab@SIAT), University of Chinese Academy of Sciences, supervised by Prof. Shifeng Chen and Prof. Yu Qiao. I was also fortunate to intern or work as a research assistant at MMLab@CUHK, Baidu, iQIYI, SenseTime, Samsung Research, and the Chinese Academy of Sciences. I am honored to have received the 11th Baidu Scholarship (2023).

My current research interests include Cross-Modal Learning and Video Understanding. I have published 20+ papers at top international CV/AI conferences and journals, such as CVPR, ICCV, ECCV, AAAI, IJCAI, ACM MM, and IJCV.


🔭 Research Interest

My research interests broadly lie in the areas of Computer Vision and Deep Learning, including:

  • Cross-Modal Learning (2022-Present): Video-Language Matching, Multimodal Large Language Model (MLLM)
  • Video Foundation Model (2017-Present): Video Recognition, Efficient Video Tuning
  • Video-related Applications (2017-2022): Video Sampler, Temporal Action Detection, Anomaly Detection in Video
  • Self-supervised Learning (2021-2022): Contrastive Video Learning, Masked Video Modeling
  • Low-level Vision (2021-2022): Image Colorization, Style Transfer, Image Rescaling

🔥 News

  • 2024.01: I am honored to receive the 11th 🎖Baidu Scholarship🎖, a prestigious fellowship awarding 200,000 RMB (about $30,000) to 10 select Ph.D. students worldwide in Artificial Intelligence, chosen from thousands of applicants.
  • 2023.11: We release GPT4Vis, which provides a quantitative evaluation of GPT-4 for visual understanding across images, videos, and point clouds, spanning 16 popular datasets.
  • 2023.11: We release Side4Video, a spatial-temporal side network for memory-efficient image-to-video transfer learning, which significantly reduces training memory cost for action recognition (↓75%) and text-video retrieval (↓30%).
  • 2023.08: The extension of Text4Vis has been accepted by IJCV.
  • 2023.07: Two first-author papers (Temporal Modeling: ATM; Cross-Modal Retrieval: UA) are accepted by ICCV 2023.
  • 2023.02: Two first-author papers on video understanding (BIKE, Cap4Video) are accepted by CVPR 2023. Cap4Video, which uses GPT to enhance text-video learning, is selected as a 🎉Highlight paper🎉 (top 2.5%).
  • 2022.11: Two papers (Video Recognition: Text4Vis; Style Transfer: AdaCM) are accepted by AAAI 2023.
  • 2022.07: Three papers (Video Sampling: NSNet, TSQNet; Cross-Modal Learning: CODER) are accepted by ECCV 2022.
  • 2022.06: Our MaMiCo, a new video self-supervised learning work, is accepted by ACM MM 2022 (🎉Oral Presentation🎉).


text4vis's Issues

The OneDrive checkpoint link cannot be opened

Thanks for providing a great resource for video recognition.

However, the OneDrive link cannot be opened. Is it invalid, or could you provide a new link?
Looking forward to your reply. Thank you very much!

Training and testing on single dGPU

Hi Guys

Looks like a great project. Nice work. I am trying to explore it for zero-shot recognition. I have an NVIDIA GeForce RTX 3080 in my machine. The instructions say to run training on a single machine with 8 GPUs. Is it possible or worthwhile to train and test the model on my machine? If so, any pointers would be much appreciated.
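For reference, a common way to approximate an 8-GPU recipe on a single GPU is gradient accumulation: keep the global batch size by accumulating gradients over 8 steps instead of spreading the batch across 8 devices. A minimal, self-contained sketch (the model, batch size, and class count are illustrative stand-ins, not this repo's actual code):

import torch
import torch.nn as nn

model = nn.Linear(512, 400).cuda()                 # stand-in for the video model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
accum_steps = 8                                    # emulates 8-way data parallelism

optimizer.zero_grad()
for step in range(32):
    x = torch.randn(16, 512).cuda()                # stand-in mini-batch of features
    y = torch.randint(0, 400, (16,)).cuda()        # stand-in labels
    loss = nn.functional.cross_entropy(model(x), y) / accum_steps
    loss.backward()                                # gradients accumulate across steps
    if (step + 1) % accum_steps == 0:
        optimizer.step()                           # one update per 8 micro-batches
        optimizer.zero_grad()

The RTX 3080's memory (10-12 GiB) will likely still limit the backbone size; a ViT-B variant is more realistic than ViT-L/14.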

Thanks.

Imran

Regression task

Hello! Your project is very interesting. I would like to adapt it for a regression task on my own dataset. Is that possible? If so, which parts should be modified, and how?

Pre-trained models

@whwu95

Are there any pre-trained models publicly available that I can use to test zero-shot performance? If not, are you open to sharing the models trained on Kinetics-400 for testing purposes?

Thanks,

Imran

About the K400 dataset

Hello!
May I ask whether the K400 dataset used in your experiments is the complete version, or one with missing videos?

About OSError: [Errno 5] Input/output error

Hello!
During training, I ran into the following error:
OSError: [Errno 5] Input/output error: '/opt/data/private/dataset/k400_frame/train/auctioning/97nosiYXJm8_000087_000097'
(The error occurs at random points in training, and the failing file is not fixed; it is not always the same few files.)
The training machine has two 32G Tesla V100 GPUs.
My preliminary analysis is that reading many small files (the extracted frame images) is slow. Have you encountered this error before? I am not yet sure whether it is a system issue or whether the GPU resources for training are insufficient.
Thanks!
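One common workaround for transient I/O errors on slow shared storage is to retry the read inside the dataset's frame loader. A minimal sketch (the function name and retry policy are illustrative, not the repo's actual code):

import time
from PIL import Image

def robust_frame_load(path, retries=3, delay=1.0):
    # Retry transient OSErrors when reading many small JPEG frame files
    # from slow or overloaded storage.
    for attempt in range(retries):
        try:
            with Image.open(path) as img:
                return img.convert("RGB")
        except OSError:
            if attempt == retries - 1:
                raise                     # give up after the last attempt
            time.sleep(delay)             # back off before retrying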

About multimodal fusion and reproducing the results

Hello! I read your paper and found it very inspiring; I have two questions.
1. I have successfully reproduced the code. Using the ViT-L/14 pre-trained model on two 4090 GPUs, I get Top-1 95.3% / Top-5 99.2%, which may still fall short of your results.
2. When fusing the visual and textual features, you use CLIP's default cosine-similarity computation, but I do not fully understand this code; it does not seem to match the pseudocode in the original CLIP paper. Could you explain what logit_scale is, what it does, and why it is initialized this way?
self.logit_scale = nn.Parameter(torch.ones([]) * np.log(1 / 0.07))  # learnable inverse temperature
logit_scale = self.logit_scale.exp()                                # exp(log(1/0.07)) = 1/0.07 ≈ 14.3
logits = logit_scale * image_emb @ text_emb.t()                     # scaled pairwise similarities
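For context, this matches the CLIP paper: logit_scale is a learnable inverse temperature, initialized so that exp(log(1/0.07)) = 1/0.07 ≈ 14.3, i.e. CLIP's initial temperature of 0.07; the dot product is a cosine similarity because both embeddings are L2-normalized first. A self-contained sketch with illustrative shapes:

import numpy as np
import torch
import torch.nn as nn

logit_scale = nn.Parameter(torch.ones([]) * np.log(1 / 0.07))  # exp() = 1/0.07 ≈ 14.3

image_emb = torch.randn(8, 512)   # illustrative batch of visual features
text_emb = torch.randn(8, 512)    # illustrative batch of textual features

# L2-normalize so the dot product below is exactly cosine similarity,
# as in the CLIP paper's pseudocode.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

logits = logit_scale.exp() * image_emb @ text_emb.t()   # [8, 8] similarity matrix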

model without text

Thank you for your impressive work,

Could you provide your pre-trained model without text on HMDB, as shown in Table 6? Thank you very much.

Kind Regards,

lda_0.1.pt and related files

Hello, I am very interested in your work and have two questions.
1. How is the classifier obtained (i.e., what is the training procedure)? Specifically, how is the lda_0.1.pt file produced by transferring visual statistical knowledge (LDA), and how are the classes_features produced by transferring textual semantic knowledge?
2. The related files distilbert-base-k400.pt and lda_0.1.pt are not provided.
Looking forward to your reply!
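For what it's worth, a classifier distilled from visual feature statistics via LDA could be produced along these lines; this is only a guess at the procedure (the feature source, the meaning of the 0.1 suffix, and the saved format are all assumptions, not the repo's actual code):

import torch
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Illustrative stand-ins for pre-extracted visual features and their labels.
feats = torch.randn(4000, 512).numpy()
labels = torch.randint(0, 400, (4000,)).numpy()

# Guess: the "0.1" in lda_0.1.pt could be a shrinkage coefficient.
lda = LinearDiscriminantAnalysis(solver="lsqr", shrinkage=0.1)
lda.fit(feats, labels)

# Save the LDA-derived classifier weights as a .pt file.
torch.save({"weight": torch.from_numpy(lda.coef_),
            "bias": torch.from_numpy(lda.intercept_)}, "lda_0.1.pt")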

Could you provide the code/setting/threshold for producing Figure 1?

Thanks for your great work in shedding new light on the transfer of VL models!

Could you please provide the code for plotting Figure 1? I would like to test this idea on different tasks/datasets.

Or maybe you can share your experience with plotting, e.g., choosing which set of classes to plot, deciding on the right threshold, etc.

Thanks again!

Data prep and training time

@whwu95

I have set up an 8-GPU instance on AWS and I am trying to download the Kinetics-400 dataset. A couple of questions:

  1. Approximately how long does it take to prepare the dataset for training, i.e., extract and resize the frames?
  2. Approximately how long will it take to train the model on the Kinetics-400 dataset? The machine specs are given below.
Compute                  Value
vCPUs                    96
Memory (GiB)             384.0
Memory per vCPU (GiB)    4.0
Physical Processor       Intel Xeon Family
Clock Speed (GHz)        2.5
CPU Architecture         x86_64
GPUs                     8
GPU Architecture         NVIDIA T4 Tensor Core
Video Memory (GiB)       128
GPU Compute Capability   7.5
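For reference, frame extraction for K400 is commonly done with ffmpeg, resizing the short side to 256. A minimal sketch (the paths, size, and naming scheme are illustrative, and the repo's exact preprocessing may differ); extraction is CPU-bound, so it parallelizes well across many vCPUs:

import subprocess
from pathlib import Path

def extract_frames(video_path, out_dir, short_side=256):
    # Decode a video into JPEG frames, scaling the height to `short_side`
    # while keeping the aspect ratio (for landscape clips, height is the
    # short side).
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run([
        "ffmpeg", "-i", str(video_path),
        "-vf", f"scale=-2:{short_side}",
        "-q:v", "2",                         # high-quality JPEG output
        f"{out_dir}/img_%05d.jpg",
    ], check=True)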

Thanks,

Imran

CoopCLIP

Hello, I recently read your paper. May I ask whether CoopCLIP is used in the code for this paper? The class is defined in modules/coop.py.

CoOP

May I ask how the CoOp in the paper is implemented? Is there a tutorial available?
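For background, CoOp (Zhou et al.) learns a set of continuous context vectors that replace a hand-written prompt such as "a photo of a {class}" in front of each class name. A minimal sketch of the idea with illustrative dimensions, not this repo's actual implementation:

import torch
import torch.nn as nn

n_ctx, dim, n_cls, name_len = 16, 512, 400, 4      # illustrative sizes

ctx = nn.Parameter(torch.empty(n_ctx, dim))        # shared learnable context vectors
nn.init.normal_(ctx, std=0.02)

name_emb = torch.randn(n_cls, name_len, dim)       # frozen class-name token embeddings
prompts = torch.cat([ctx.unsqueeze(0).expand(n_cls, -1, -1), name_emb], dim=1)
print(prompts.shape)                               # torch.Size([400, 20, 512])

The concatenated prompts are then fed through the frozen text encoder, and only ctx is updated during training.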

Model zoo links expired

Thanks for providing a great resource for video recognition.

It looks like the OneDrive links have expired, and I am no longer able to download models from the model zoo.

About the dataloader.

Hi, thanks for your great work! I notice that there are two ways to load the video data: (1) pre-extracted frames, (2) on-the-fly decoding.
Could you please provide some details about the difference in their loading speed? And how much disk space do the extracted frames take up (e.g., for K400)?
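For context, the two loading styles usually look like the sketch below (the file names and 8-frame sampling are illustrative). Pre-extracted frames trade disk space and small-file I/O for decode-free reads; on-the-fly decoding keeps one file per video but pays the decode cost per batch:

import numpy as np
from decord import VideoReader
from PIL import Image

# On-the-fly decoding: one compressed file per video, decoded at load time.
vr = VideoReader("video.mp4")
idx = np.linspace(0, len(vr) - 1, num=8).astype(int)   # uniformly sample 8 frames
clip = vr.get_batch(idx.tolist()).asnumpy()            # array of shape [8, H, W, 3]

# Pre-extracted frames: many small JPEGs; each read is cheap, but extraction
# at K400 scale inflates storage considerably and stresses small-file I/O.
frames = [Image.open(f"frames/img_{i + 1:05d}.jpg") for i in idx]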

Activitynet dataset

Could you tell me how keyframes are extracted from the ActivityNet dataset, and what the rules are?

Zero-shot Video Performance.

Hello,

Thanks for your work!

I used the Kinetics-400 pre-trained model (ViT-L/14 with 8 frames, downloaded from https://drive.google.com/file/d/1tGfE6HDjTGZ7-y6XM7D6UJAx1Esj-q7u/view?usp=share_link) to perform cross-dataset zero-shot evaluation on the UCF-101 dataset, and I get the following results:

-----Full-classes Evaluation------
Overall Top1 57.928% Top5 88.071%
-----Half-classes Evaluation-----
Top1: mean 69.553%, std 6.283%
Top5: mean 92.737%, std 1.957%

In your paper, the reported zero-shot performance is much higher (screenshot of the paper's results table omitted).

These results are far below what the paper reports, so I would like to know which model was used for zero-shot video recognition.

Thanks,

MS

Access to checkpoints

Hi Wenhao,
Thanks for your great work!

I was trying to download the ViT/L-14-f8 checkpoint, but I encountered the following message: "We're sorry, but ****@gmail.com can't be found in the unisyd-my.sharepoint.com directory. Please try again later, while we try to automatically fix this for you."

Do you have any idea how to solve this?
Thanks!
