
text4vis's Introduction

Hi, I'm Wenhao Wu 👋

Links: Zhihu · GitHub · LinkedIn · Google Scholar · X

I (Wenhao Wu, 吴文灏) am a Ph.D. student in the School of Computer Science at The University of Sydney, supervised by Prof. Wanli Ouyang. I collaborate closely with the Department of Computer Vision Technology (VIS) at Baidu, led by Dr. Jingdong Wang (IEEE Fellow). I received my M.S.E. degree from the Multimedia Laboratory (MMLab@SIAT), University of Chinese Academy of Sciences, supervised by Prof. Shifeng Chen and Prof. Yu Qiao. I was also fortunate to intern or work as a research assistant at MMLab@CUHK, Baidu, iQIYI, SenseTime, Samsung Research, and the Chinese Academy of Sciences. I am honored to have received the 11th Baidu Scholarship (2023).

My current research interests include Cross-Modal Learning and Video Understanding. I have published 20+ papers at top international CV/AI conferences and journals, such as CVPR, ICCV, ECCV, AAAI, IJCAI, ACM MM, and IJCV.


🔭 Research Interest

My research interests broadly lie in the areas of Computer Vision and Deep Learning, including:

  • Cross-Modal Learning (2022-Present): Video-Language Matching, Multimodal Large Language Model (MLLM)
  • Video Foundation Model (2017-Present): Video Recognition, Efficient Video Tuning
  • Video-related Applications (2017-2022): Video Sampler, Temporal Action Detection, Anomaly Detection in Video
  • Self-supervised Learning (2021-2022): Contrastive Video Learning, Masked Video Modeling
  • Low-level Vision (2021-2022): Image Colorization, Style Transfer, Image Rescaling

🔥 News

  • 2024.01: I am honored to receive the 11th 🎖Baidu Scholarship🎖, a prestigious fellowship awarding 200,000 RMB (about $30,000) to 10 select Ph.D. students worldwide in Artificial Intelligence, chosen from thousands of applicants.
  • 2023.11: We release GPT4Vis, which provides a quantitative evaluation of GPT-4 for visual understanding across images, videos, and point clouds, spanning 16 popular datasets.
  • 2023.11: We release Side4Video, a spatial-temporal side network for memory-efficient image-to-video transfer learning, which significantly reduces training memory cost for action recognition (↓75%) and text-video retrieval (↓30%).
  • 2023.08: The extension of Text4Vis has been accepted by IJCV.
  • 2023.07: Two first-author papers (Temporal Modeling: ATM; Cross-Modal Retrieval: UA) are accepted by ICCV 2023.
  • 2023.02: Two first-author papers on video understanding (BIKE, Cap4Video) are accepted by CVPR 2023. Cap4Video, which uses GPT to enhance text-video learning, is selected as a 🎉Highlight paper🎉 (top 2.5%).
  • 2022.11: Two papers (Video Recognition: Text4Vis; Style Transfer: AdaCM) are accepted by AAAI 2023.
  • 2022.07: Three papers (Video Sampling: NSNet, TSQNet; Cross-Modal Learning: CODER) are accepted by ECCV 2022.
  • 2022.06: Our MaMiCo, a new video self-supervised learning work, is accepted by ACM MM 2022 (🎉Oral Presentation🎉).


text4vis's Issues

The OneDrive checkpoint link cannot be opened

Thanks for providing a great resource for video recognition.

However, the OneDrive link cannot be opened. Is it invalid, or could you provide a new link?
Looking forward to your reply. Thank you very much!

Training and testing on single dGPU

Hi Guys

Looks like a great project. Nice work. I am trying to explore it for zero-shot recognition. I have an NVIDIA GeForce RTX 3080 in my machine. The instructions say to run training on a single machine with 8 GPUs. Is it possible or worthwhile to train and test the model on my machine? If so, any pointers would be much appreciated.
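For reference, a common way to approximate an 8-GPU recipe on a single GPU is gradient accumulation: keep the global batch size by accumulating gradients over 8 steps instead of spreading the batch across 8 devices. A minimal, self-contained sketch (the model, batch size, and class count are illustrative stand-ins, not this repo's actual code):

import torch
import torch.nn as nn

model = nn.Linear(512, 400).cuda()                 # stand-in for the video model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
accum_steps = 8                                    # emulates 8-way data parallelism

optimizer.zero_grad()
for step in range(32):
    x = torch.randn(16, 512).cuda()                # stand-in mini-batch of features
    y = torch.randint(0, 400, (16,)).cuda()        # stand-in labels
    loss = nn.functional.cross_entropy(model(x), y) / accum_steps
    loss.backward()                                # gradients accumulate across steps
    if (step + 1) % accum_steps == 0:
        optimizer.step()                           # one update per 8 micro-batches
        optimizer.zero_grad()

The RTX 3080's memory (10-12 GiB) will likely still limit the backbone size; a ViT-B variant is more realistic than ViT-L/14.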

Thanks.

Imran

Regression task

Hello! Your project is very interesting. I would like to adapt it for a regression task on my own dataset. Is that possible? If so, which parts should be modified, and how?

Pre-trained models

@whwu95

Are there any pre-trained models publicly available that I can use to test zero-shot performance? If not, are you open to sharing the models trained on Kinetics-400 for testing purposes?

Thanks,

Imran

About the K400 dataset

Hello!
May I ask whether the K400 dataset used in your experiments is the complete version, or one with missing videos?

About OSError: [Errno 5] Input/output error

Hello!
During training, I ran into the following error:
OSError: [Errno 5] Input/output error: '/opt/data/private/dataset/k400_frame/train/auctioning/97nosiYXJm8_000087_000097'
(The error occurs at random points in training, and the failing file is not fixed; it is not always the same few files.)
The training machine has two 32G Tesla V100 GPUs.
My preliminary analysis is that reading many small files (the extracted frame images) is slow. Have you encountered this error before? I am not yet sure whether it is a system issue or whether the GPU resources for training are insufficient.
Thanks!
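One common workaround for transient I/O errors on slow shared storage is to retry the read inside the dataset's frame loader. A minimal sketch (the function name and retry policy are illustrative, not the repo's actual code):

import time
from PIL import Image

def robust_frame_load(path, retries=3, delay=1.0):
    # Retry transient OSErrors when reading many small JPEG frame files
    # from slow or overloaded storage.
    for attempt in range(retries):
        try:
            with Image.open(path) as img:
                return img.convert("RGB")
        except OSError:
            if attempt == retries - 1:
                raise                     # give up after the last attempt
            time.sleep(delay)             # back off before retrying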

About multimodal fusion and reproducing the results

Hello! I read your paper and found it very inspiring; I have two questions.
1. I have successfully reproduced the code. Using the ViT-L/14 pre-trained model on two 4090 GPUs, I get Top-1 95.3% / Top-5 99.2%, which may still fall short of your results.
2. When fusing the visual and textual features, you use CLIP's default cosine-similarity computation, but I do not fully understand this code; it does not seem to match the pseudocode in the original CLIP paper. Could you explain what logit_scale is, what it does, and why it is initialized this way?
self.logit_scale = nn.Parameter(torch.ones([]) * np.log(1 / 0.07))  # learnable inverse temperature
logit_scale = self.logit_scale.exp()                                # exp(log(1/0.07)) = 1/0.07 ≈ 14.3
logits = logit_scale * image_emb @ text_emb.t()                     # scaled pairwise similarities
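For context, this matches the CLIP paper: logit_scale is a learnable inverse temperature, initialized so that exp(log(1/0.07)) = 1/0.07 ≈ 14.3, i.e. CLIP's initial temperature of 0.07; the dot product is a cosine similarity because both embeddings are L2-normalized first. A self-contained sketch with illustrative shapes:

import numpy as np
import torch
import torch.nn as nn

logit_scale = nn.Parameter(torch.ones([]) * np.log(1 / 0.07))  # exp() = 1/0.07 ≈ 14.3

image_emb = torch.randn(8, 512)   # illustrative batch of visual features
text_emb = torch.randn(8, 512)    # illustrative batch of textual features

# L2-normalize so the dot product below is exactly cosine similarity,
# as in the CLIP paper's pseudocode.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

logits = logit_scale.exp() * image_emb @ text_emb.t()   # [8, 8] similarity matrix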

model without text

Thank you for your impressive work,

Could you provide your pre-trained model without text on HMDB, as shown in Table 6? Thank you very much.

Kind Regards,

lda_0.1.pt and related files

Hello, I am very interested in your work and have two questions.
1. How is the classifier obtained (i.e., what is the training procedure)? Specifically, how is the lda_0.1.pt file produced by transferring visual statistical knowledge (LDA), and how are the classes_features produced by transferring textual semantic knowledge?
2. The related files distilbert-base-k400.pt and lda_0.1.pt are not provided.
Looking forward to your reply!
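For what it's worth, a classifier distilled from visual feature statistics via LDA could be produced along these lines; this is only a guess at the procedure (the feature source, the meaning of the 0.1 suffix, and the saved format are all assumptions, not the repo's actual code):

import torch
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Illustrative stand-ins for pre-extracted visual features and their labels.
feats = torch.randn(4000, 512).numpy()
labels = torch.randint(0, 400, (4000,)).numpy()

# Guess: the "0.1" in lda_0.1.pt could be a shrinkage coefficient.
lda = LinearDiscriminantAnalysis(solver="lsqr", shrinkage=0.1)
lda.fit(feats, labels)

# Save the LDA-derived classifier weights as a .pt file.
torch.save({"weight": torch.from_numpy(lda.coef_),
            "bias": torch.from_numpy(lda.intercept_)}, "lda_0.1.pt")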

Could you provide the code/setting/threshold for producing Figure 1?

Thanks for your great work in shedding new light on the transfer of VL models!

Could you please provide the code for plotting Figure 1? I would like to test this idea on different tasks/datasets.

Or maybe you can share your experience with plotting, e.g., choosing which set of classes to plot, deciding on the right threshold, etc.

Thanks again!

Data prep and training time

@whwu95

I have set up an 8-GPU instance on AWS and I am trying to download the Kinetics-400 dataset. A couple of questions:

  1. Approximately how long does it take to prepare the dataset for training, i.e., extract and resize the frames?
  2. Approximately how long will it take to train the model on the Kinetics-400 dataset? The machine specs are given below.
Compute                  Value
vCPUs                    96
Memory (GiB)             384.0
Memory per vCPU (GiB)    4.0
Physical Processor       Intel Xeon Family
Clock Speed (GHz)        2.5
CPU Architecture         x86_64
GPUs                     8
GPU Architecture         NVIDIA T4 Tensor Core
Video Memory (GiB)       128
GPU Compute Capability   7.5
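For reference, frame extraction for K400 is commonly done with ffmpeg, resizing the short side to 256. A minimal sketch (the paths, size, and naming scheme are illustrative, and the repo's exact preprocessing may differ); extraction is CPU-bound, so it parallelizes well across many vCPUs:

import subprocess
from pathlib import Path

def extract_frames(video_path, out_dir, short_side=256):
    # Decode a video into JPEG frames, scaling the height to `short_side`
    # while keeping the aspect ratio (for landscape clips, height is the
    # short side).
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run([
        "ffmpeg", "-i", str(video_path),
        "-vf", f"scale=-2:{short_side}",
        "-q:v", "2",                         # high-quality JPEG output
        f"{out_dir}/img_%05d.jpg",
    ], check=True)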

Thanks,

Imran

CoopCLIP

Hello, I recently read your paper. May I ask whether CoopCLIP is used in the code for this paper? The class is defined in modules/coop.py.

CoOP

May I ask how the CoOp in the paper is implemented? Is there a tutorial available?
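For background, CoOp (Zhou et al.) learns a set of continuous context vectors that replace a hand-written prompt such as "a photo of a {class}" in front of each class name. A minimal sketch of the idea with illustrative dimensions, not this repo's actual implementation:

import torch
import torch.nn as nn

n_ctx, dim, n_cls, name_len = 16, 512, 400, 4      # illustrative sizes

ctx = nn.Parameter(torch.empty(n_ctx, dim))        # shared learnable context vectors
nn.init.normal_(ctx, std=0.02)

name_emb = torch.randn(n_cls, name_len, dim)       # frozen class-name token embeddings
prompts = torch.cat([ctx.unsqueeze(0).expand(n_cls, -1, -1), name_emb], dim=1)
print(prompts.shape)                               # torch.Size([400, 20, 512])

The concatenated prompts are then fed through the frozen text encoder, and only ctx is updated during training.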

Model zoo links expired

Thanks for providing a great resource for video recognition.

It looks like the OneDrive links have expired, and I am no longer able to download models from the model zoo.

About the dataloader.

Hi, thanks for your great work! I notice that there are two ways to load the video data: (1) pre-extracted frames, (2) on-the-fly decoding.
Could you please provide some details about the difference in their loading speed? And how much disk space do the extracted frames take up (e.g., for K400)?
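For context, the two loading styles usually look like the sketch below (the file names and 8-frame sampling are illustrative). Pre-extracted frames trade disk space and small-file I/O for decode-free reads; on-the-fly decoding keeps one file per video but pays the decode cost per batch:

import numpy as np
from decord import VideoReader
from PIL import Image

# On-the-fly decoding: one compressed file per video, decoded at load time.
vr = VideoReader("video.mp4")
idx = np.linspace(0, len(vr) - 1, num=8).astype(int)   # uniformly sample 8 frames
clip = vr.get_batch(idx.tolist()).asnumpy()            # array of shape [8, H, W, 3]

# Pre-extracted frames: many small JPEGs; each read is cheap, but extraction
# at K400 scale inflates storage considerably and stresses small-file I/O.
frames = [Image.open(f"frames/img_{i + 1:05d}.jpg") for i in idx]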

Activitynet dataset

Could you tell me how keyframes are extracted from the ActivityNet dataset, and what the rules are?

Zero-shot Video Performance.

Hello,

Thanks for your work!

I used the Kinetics-400 pre-trained model (ViT-L/14 with 8 frames, downloaded from https://drive.google.com/file/d/1tGfE6HDjTGZ7-y6XM7D6UJAx1Esj-q7u/view?usp=share_link) to perform cross-dataset zero-shot evaluation on the UCF-101 dataset, and I get the following results:

-----Full-classes Evaluation------
Overall Top1 57.928% Top5 88.071%
-----Half-classes Evaluation-----
Top1: mean 69.553%, std 6.283%
Top5: mean 92.737%, std 1.957%

In your paper, the reported zero-shot performance is much higher (screenshot of the paper's results table omitted).

These results are far below what the paper reports, so I would like to know which model was used for zero-shot video recognition.

Thanks,

MS

Access to checkpoints

Hi Wenhao,
Thanks for your great work!

I was trying to download the ViT/L-14-f8 checkpoint, but I encountered the following message: "We're sorry, but ****@gmail.com can't be found in the unisyd-my.sharepoint.com directory. Please try again later, while we try to automatically fix this for you."

Do you have any idea how to solve this?
Thanks!
