
cap4video's Introduction

Hi, I'm Wenhao Wu 👋


I am Wenhao Wu (吴文灏🇨🇳), a Ph.D. student in the School of Computer Science at The University of Sydney, supervised by Prof. Wanli Ouyang. I collaborate closely with the Department of Computer Vision Technology (VIS) at Baidu, led by Dr. Jingdong Wang (IEEE Fellow). I received my M.S.E. degree from the Multimedia Laboratory (MMLab@SIAT), University of Chinese Academy of Sciences, supervised by Prof. Shifeng Chen and Prof. Yu Qiao. I was also fortunate to intern or work as a research assistant at MMLab@CUHK, Baidu, iQIYI, SenseTime, Samsung Research, and the Chinese Academy of Sciences. I am honored to have been awarded the 11th Baidu Scholarship (2023).

My current research interests include Cross-Modal Learning and Video Understanding. I have published 20+ papers at top international CV/AI conferences and journals such as CVPR/ICCV/ECCV/AAAI/IJCAI/ACMMM/IJCV.


🔭 Research Interest

My research interests broadly lie in the areas of Computer Vision and Deep Learning, including:

  • Cross-Modal Learning (2022-Present): Video-Language Matching, Multimodal Large Language Model (MLLM)
  • Video Foundation Model (2017-Present): Video Recognition, Efficient Video Tuning
  • Video-related Applications (2017-2022): Video Sampler, Temporal Action Detection, Anomaly Detection in Video
  • Self-supervised Learning (2021-2022): Contrastive Video Learning, Masked Video Modeling
  • Low-level Vision (2021-2022): Image Colorization, Style Transfer, Image Rescaling

🔥 News

  • 2024.01: I am honored to receive the 11th 🎖Baidu Scholarship🎖, a prestigious fellowship that awards 200,000 RMB (about $30,000) to 10 Ph.D. students in Artificial Intelligence worldwide, selected from thousands of applicants.
  • 2023.11: We release GPT4Vis, which provides a quantitative evaluation of GPT-4's visual understanding across images, videos, and point clouds, spanning 16 popular datasets.
  • 2023.11: We release Side4Video, a Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning, which significantly reduces training memory cost for action recognition (↓75%) and text-video retrieval (↓30%).
  • 2023.08: The extension of Text4Vis has been accepted by IJCV.
  • 2023.07: Two first-author papers (Temporal Modeling: ATM, Cross-Modal Retrieval: UA) are accepted by ICCV 2023.
  • 2023.02: Two first-author papers on video understanding (BIKE, Cap4Video) are accepted by CVPR 2023. Cap4Video, which leverages GPT to enhance text-video learning, is selected as a 🎉Highlight paper🎉 (Top 2.5%).
  • 2022.11: Two papers (Video Recognition: Text4Vis, Style Transfer: AdaCM) are accepted by AAAI 2023.
  • 2022.07: Three papers (Video Sampling: NSNet, TSQNet; Cross-Modal Learning: CODER) are accepted by ECCV 2022.
  • 2022.06: Our MaMiCo, a new video self-supervised learning work, is accepted by ACMMM 2022 (🎉Oral Presentation🎉).

cap4video's People

Contributors

whwu95


cap4video's Issues

Question about implementation details.

Hello, I agree that this is nice work.
However, in the code you set batch_size=256, while the paper states that it is 128 (maybe the version of the paper I downloaded is wrong? I downloaded it from arXiv).
I reproduced the code and found that with batch_size=256 the accuracy on MSRVTT-1kA matches the paper, but with batch_size=128 it is only about 47%.

Resume training

Hi, resuming training only loads the optimizer state, and the loss does not start from where it stopped.
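
For reference, a minimal sketch of a fuller save/resume cycle that also restores the model, scheduler, and starting epoch; the checkpoint keys below are illustrative and are not the repo's actual format:

```python
import torch

def save_checkpoint(path, model, optimizer, scheduler, epoch):
    # Store everything needed to continue training exactly where it stopped.
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "scheduler": scheduler.state_dict(),
        "epoch": epoch,
    }, path)

def resume(path, model, optimizer, scheduler):
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    scheduler.load_state_dict(ckpt["scheduler"])
    return ckpt["epoch"] + 1  # epoch to resume from
```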

Training on other datasets

Hello, in your paper every dataset has multiple captions per video. If my dataset has only one caption per video, will that affect the results? Currently I am training on my own dataset; training looks normal, but the test results are very poor, with R@1 = 1.0 (on 100 test samples).

Some questions

Does the C here refer to the number of auxiliary captions generated for this video?
In the final ablation study, you say that the best result is already obtained with a single auxiliary caption.
So how should this C be understood?

Questions about [SEP] token

In the code, both the query-video branch and the query-caption branch use the [SEP] embedding as the global feature of the query or caption, but the paper mentions the [CLS] embedding. So should I use the [SEP] embedding or the [CLS] embedding? Thank you.
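
For what it's worth, OpenAI's CLIP text encoder takes its global sentence feature at the end-of-text (EOT) token, which is what the code appears to call [SEP]; CLIP's tokenizer uses start-of-text and end-of-text tokens rather than a BERT-style [CLS]. A minimal sketch mirroring CLIP's encode_text, assuming the standard openai/CLIP package:

```python
import torch
import clip  # OpenAI CLIP

model, _ = clip.load("ViT-B/32", device="cpu")
tokens = clip.tokenize(["a man is playing guitar"])            # (1, 77)

x = model.token_embedding(tokens).type(model.dtype)            # (1, 77, d)
x = x + model.positional_embedding.type(model.dtype)
x = model.transformer(x.permute(1, 0, 2)).permute(1, 0, 2)     # transformer expects (L, N, d)
x = model.ln_final(x).type(model.dtype)

# The EOT token has the largest id, so argmax over token ids finds its
# position; its hidden state becomes the global text feature.
text_feat = x[torch.arange(x.shape[0]), tokens.argmax(dim=-1)] @ model.text_projection
```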

Preprocess for other datasets

Hi, this is nice work!
I want to run the code on other video retrieval datasets, so I would like to know how to convert the raw videos into frames,
e.g., the sampling rate and the size of each frame used in your code.
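
For reference, a minimal frame-sampling sketch assuming decord and torchvision are available; the frame count (12) and resolution (224) are illustrative defaults, not necessarily the settings used in this repo:

```python
import numpy as np
from decord import VideoReader, cpu
from PIL import Image
from torchvision import transforms

def sample_frames(video_path, num_frames=12, size=224):
    vr = VideoReader(video_path, ctx=cpu(0))
    # Spread num_frames indices uniformly over the whole clip.
    idx = np.linspace(0, len(vr) - 1, num_frames).astype(int)
    frames = vr.get_batch(idx).asnumpy()                        # (T, H, W, 3) uint8
    tf = transforms.Compose([
        transforms.Resize(size),
        transforms.CenterCrop(size),
        transforms.ToTensor(),
        transforms.Normalize((0.48145466, 0.4578275, 0.40821073),
                             (0.26862954, 0.26130258, 0.27577711)),  # CLIP statistics
    ])
    return [tf(Image.fromarray(f)) for f in frames]             # T tensors of (3, size, size)
```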

> In our paper, the query-video branch and the query-caption branch are trained separately. We first train the query-video branch for 5 epochs. Once that branch is trained, we continue with the query-caption branch.

I looked at your code and noticed that captions are already used in train_video.py, so how should I understand your statement that the first 5 epochs train the query-video branch? (In my understanding, if the first 5 epochs are meant to train only the query-video branch, captions should not appear; if captions are present, the query encoder also processes caption information, and then there is no real sense in which the first five epochs train only the query-video branch.)
I am not sure whether my understanding is correct. I am quite confused about this part and look forward to your reply.

Originally posted by @shams2023 in #4 (comment)

Some questions about the file.

Good afternoon, I'm reading your code and trying to run it with the files you uploaded. Unfortunately, I could not run it successfully, perhaps because the "sim_matrix" file is not provided. Also, may I ask when the pre-extracted video frame features will be uploaded? Hoping for your reply, and wishing you well.

Checkpoint weights

Hi,

Great work and thanks for sharing your code!
Just wondering, do you have plans to release the checkpoint weights of the models you have already trained, so that we can directly run inference with them?

Thanks!

Question of the caption file.

Hi,

Thanks for releasing the code and data.
I have checked the provided caption data and found that there are two additional keys in the dataset, 'title' and 'titles'.
Could you provide some explanation of them? For example, how did you obtain this data: from URLs or from a captioning model? And what is the difference between the two sets?

Thank you!

Caption encoder and query encoder share weights?

I am very confused: the caption encoder and the query encoder share weights, so what are the parameters being optimized for the QC matching? And why do we need to pass the C×D caption embeddings through MHA and multiply the result with the query embedding?
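
For intuition, a minimal sketch of aggregating C caption embeddings with multi-head attention before matching them against the query; the module and shapes are illustrative, and the MHA weights are the extra learnable parameters sitting on top of the shared text encoder (this is not the repo's exact implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CaptionAggregator(nn.Module):
    """Pools C caption embeddings into one vector per video via self-attention."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.mha = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, cap_emb):                    # cap_emb: (B, C, D)
        attn_out, _ = self.mha(cap_emb, cap_emb, cap_emb)
        return attn_out.mean(dim=1)                # (B, D)

agg = CaptionAggregator()
query = torch.randn(4, 512)                        # (B, D) query embeddings
caps = torch.randn(4, 30, 512)                     # (B, C, D) caption embeddings
cap_feat = agg(caps)
# Query-caption similarity matrix used for the QC matching.
sim_qc = F.normalize(query, dim=-1) @ F.normalize(cap_feat, dim=-1).t()
```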

Sample inference code

Hi,
Do you have sample inference code to load the model, preprocess a video and text, and get the similarity score?

Thanks!
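
For reference, a minimal zero-shot-style sketch of scoring one video against one query with a plain CLIP backbone (encode sampled frames, mean-pool them into a video feature, take cosine similarity with the text feature); the trained Cap4Video heads are not modeled here and the frame paths are placeholders:

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

frames = [Image.open(f"frame_{i}.jpg") for i in range(12)]     # pre-extracted frames
pixels = torch.stack([preprocess(f) for f in frames]).to(device)
text = clip.tokenize(["a man is cooking pasta"]).to(device)

with torch.no_grad():
    frame_feat = model.encode_image(pixels)                    # (T, D)
    video_feat = frame_feat.mean(dim=0, keepdim=True)          # (1, D) mean pooling
    text_feat = model.encode_text(text)                        # (1, D)
    video_feat = video_feat / video_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    print("similarity:", (text_feat @ video_feat.t()).item())  # cosine similarity
```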

How the entire dataset is converted into captions

Thank you very much for your work!
How did you convert the videos of an entire dataset into captions? I currently want to convert all the images or videos in an entire dataset into captions, but the code from the article [ZeroCap: Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic] only converts one image at a time, so I would really like to know what I need to do to caption an entire dataset.
I really hope to receive your guidance. Thank you again.
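
For reference, a minimal batch-captioning sketch: iterate over every video in a directory, sample a few frames, and caption each frame. `generate_caption` below is a hypothetical wrapper around whatever single-image captioner you use (ZeroCap or otherwise); its real API will differ:

```python
import glob, json, os

import numpy as np
from decord import VideoReader, cpu
from PIL import Image

def caption_dataset(video_dir, out_json, num_frames=3):
    results = {}
    for path in sorted(glob.glob(os.path.join(video_dir, "*.mp4"))):
        vid = os.path.splitext(os.path.basename(path))[0]
        vr = VideoReader(path, ctx=cpu(0))
        idx = np.linspace(0, len(vr) - 1, num_frames).astype(int)
        frames = [Image.fromarray(f) for f in vr.get_batch(idx).asnumpy()]
        # generate_caption is a placeholder for the image captioner of your choice.
        results[vid] = [generate_caption(f) for f in frames]
    with open(out_json, "w") as f:
        json.dump(results, f, indent=2)
```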

Low R1 performance in the 2nd stage

Thanks for sharing your code. Is it normal to get R@1 = 30 with train_titles.py? After running the score fusion, the title matrix does not improve on the video matrix.
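
For reference, a minimal sketch of score-level fusion between the query-video and query-caption similarity matrices followed by R@1; the file names and the fusion weight alpha are assumptions, and alpha would normally be tuned on a validation set:

```python
import numpy as np

def recall_at_1(sim):
    # sim: (num_queries, num_videos) with ground truth on the diagonal.
    top1 = (-sim).argsort(axis=1)[:, 0]
    return float((top1 == np.arange(sim.shape[0])).mean())

sim_qv = np.load("sim_matrix_video.npy")       # assumed file names
sim_qc = np.load("sim_matrix_caption.npy")
alpha = 0.8
fused = alpha * sim_qv + (1 - alpha) * sim_qc
print("R@1:", recall_at_1(fused))
```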

Two branch or two loss

In the paper, you mention that "To reduce conflict between the two branches, the query-video branch is trained first, followed by the query-caption branch." However, you also mention that "The total loss L is the sum of the Query-Video loss L_{QV} and the Query-Caption loss L_{QC}." Are the two branches trained separately? My question is: what loss is used when the query-video branch is trained first, and what loss when the query-caption branch is trained afterwards? In addition, how many epochs does it take to train the query-video branch first?
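
For intuition, a minimal sketch of how the two objectives could be combined, assuming symmetric InfoNCE losses over the in-batch query-video and query-caption similarity matrices; optimizing only loss_qv in stage 1 and then adding loss_qc in stage 2 is one way to read the paper's description, and this is an illustration rather than the repo's exact code:

```python
import torch
import torch.nn.functional as F

def infonce(sim, temperature=0.05):
    # Symmetric cross-entropy over a (B, B) similarity matrix with matched pairs on the diagonal.
    sim = sim / temperature
    labels = torch.arange(sim.size(0), device=sim.device)
    return 0.5 * (F.cross_entropy(sim, labels) + F.cross_entropy(sim.t(), labels))

sim_qv = torch.randn(8, 8)                     # in-batch query-video similarities
sim_qc = torch.randn(8, 8)                     # in-batch query-caption similarities
loss = infonce(sim_qv) + infonce(sim_qc)       # L = L_QV + L_QC
```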

Inference on pretrained model

Hi,

Your work is very nice.

I wonder if it would be possible for you to share the pre-trained model and instructions on how to use it, or at least instructions for running the pre-trained model on a test set of data. For example, given a set of 20 videos, their captions, and a query text, retrieve the video closest to the query text.

Thank you!
