
cap4video's Introduction

Hi, I'm Wenhao Wu 👋


I am Wenhao Wu (吴文灏🇨🇳), a Ph.D. student in the School of Computer Science at The University of Sydney, supervised by Prof. Wanli Ouyang. I collaborate closely with the Department of Computer Vision Technology (VIS) at Baidu, led by Dr. Jingdong Wang (IEEE Fellow). I received my M.S.E. degree from the Multimedia Laboratory (MMLab@SIAT), University of Chinese Academy of Sciences, supervised by Prof. Shifeng Chen and Prof. Yu Qiao. I was also fortunate to intern or work as a research assistant at MMLab@CUHK, Baidu, iQIYI, SenseTime, Samsung Research, and the Chinese Academy of Sciences. I am honored to have been awarded the 11th Baidu Scholarship (2023).

My current research interests include Cross-Modal Learning and Video Understanding. I have published 20+ papers at top international CV/AI conferences and journals such as CVPR/ICCV/ECCV/AAAI/IJCAI/ACMMM/IJCV.


🔭 Research Interest

My research interests broadly lie in the areas of Computer Vision and Deep Learning, including:

  • Cross-Modal Learning (2022-Present): Video-Language Matching, Multimodal Large Language Model (MLLM)
  • Video Foundation Model (2017-Present): Video Recognition, Efficient Video Tuning
  • Video-related Applications (2017-2022): Video Sampler, Temporal Action Detection, Anomaly Detection in Video
  • Self-supervised Learning (2021-2022): Contrastive Video Learning, Masked Video Modeling
  • Low-level Vision (2021-2022): Image Colorization, Style Transfer, Image Rescaling

🔥 News

  • 2024.01: I am honored to receive the 11th 🎖Baidu Scholarship🎖, a prestigious fellowship that awards 200,000 RMB (about $30,000) to 10 Ph.D. students in Artificial Intelligence worldwide, selected from thousands of applicants.
  • 2023.11: We release GPT4Vis, which provides a quantitative evaluation of GPT-4's visual understanding across images, videos, and point clouds, spanning 16 popular datasets.
  • 2023.11: We release Side4Video, a Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning, which significantly reduces training memory cost for action recognition (↓75%) and text-video retrieval (↓30%).
  • 2023.08: The extension of Text4Vis has been accepted by IJCV.
  • 2023.07: Two first-author papers (Temporal Modeling: ATM, Cross-Modal Retrieval: UA) are accepted by ICCV 2023.
  • 2023.02: Two first-author papers on video understanding (BIKE, Cap4Video) are accepted by CVPR 2023. Cap4Video, which leverages GPT to enhance text-video learning, is selected as a 🎉Highlight paper🎉 (Top 2.5%).
  • 2022.11: Two papers (Video Recognition: Text4Vis, Style Transfer: AdaCM) are accepted by AAAI 2023.
  • 2022.07: Three papers (Video Sampling: NSNet, TSQNet; Cross-Modal Learning: CODER) are accepted by ECCV 2022.
  • 2022.06: Our MaMiCo, a new video self-supervised learning work, is accepted by ACMMM 2022 (🎉Oral Presentation🎉).

cap4video's People

Contributors

whwu95


cap4video's Issues

Question about implementation details.

Hello, I agree that this is nice work.
However, in the code you set batch_size=256, while the paper states that it is 128 (maybe the version of the paper I downloaded is wrong? I downloaded it from arXiv).
I reproduced the code and found that with batch_size=256 the accuracy on MSRVTT-1kA matches the paper, but with batch_size=128 it is only about 47%.

Resume training

Hi, resuming training only loads the optimizer state, and the loss does not start from where it stopped.
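
For reference, a minimal sketch of a fuller save/resume cycle that also restores the model, scheduler, and starting epoch; the checkpoint keys below are illustrative and are not the repo's actual format:

```python
import torch

def save_checkpoint(path, model, optimizer, scheduler, epoch):
    # Store everything needed to continue training exactly where it stopped.
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "scheduler": scheduler.state_dict(),
        "epoch": epoch,
    }, path)

def resume(path, model, optimizer, scheduler):
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    scheduler.load_state_dict(ckpt["scheduler"])
    return ckpt["epoch"] + 1  # epoch to resume from
```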

Training on other datasets

Hello, in your paper every dataset has multiple captions per video. If my dataset has only one caption per video, will that affect the results? Currently I am training on my own dataset; training looks normal, but the test results are very poor, with R@1 = 1.0 (on 100 test samples).

Some questions

Does the C here refer to the number of auxiliary captions generated for this video?
In the final ablation study, you say that the best result is already obtained with a single auxiliary caption.
So how should this C be understood?

Questions about [SEP] token

In the code, both the query-video branch and the query-caption branch use the [SEP] embedding as the global feature of the query or caption, but the paper mentions the [CLS] embedding. So should I use the [SEP] embedding or the [CLS] embedding? Thank you.
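
For what it's worth, OpenAI's CLIP text encoder takes its global sentence feature at the end-of-text (EOT) token, which is what the code appears to call [SEP]; CLIP's tokenizer uses start-of-text and end-of-text tokens rather than a BERT-style [CLS]. A minimal sketch mirroring CLIP's encode_text, assuming the standard openai/CLIP package:

```python
import torch
import clip  # OpenAI CLIP

model, _ = clip.load("ViT-B/32", device="cpu")
tokens = clip.tokenize(["a man is playing guitar"])            # (1, 77)

x = model.token_embedding(tokens).type(model.dtype)            # (1, 77, d)
x = x + model.positional_embedding.type(model.dtype)
x = model.transformer(x.permute(1, 0, 2)).permute(1, 0, 2)     # transformer expects (L, N, d)
x = model.ln_final(x).type(model.dtype)

# The EOT token has the largest id, so argmax over token ids finds its
# position; its hidden state becomes the global text feature.
text_feat = x[torch.arange(x.shape[0]), tokens.argmax(dim=-1)] @ model.text_projection
```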

Preprocess for other datasets

Hi, this is nice work!
I want to run the code on other video retrieval datasets, so I would like to know how to convert the raw videos into frames,
e.g., the sampling rate and the size of each frame used in your code.
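
For reference, a minimal frame-sampling sketch assuming decord and torchvision are available; the frame count (12) and resolution (224) are illustrative defaults, not necessarily the settings used in this repo:

```python
import numpy as np
from decord import VideoReader, cpu
from PIL import Image
from torchvision import transforms

def sample_frames(video_path, num_frames=12, size=224):
    vr = VideoReader(video_path, ctx=cpu(0))
    # Spread num_frames indices uniformly over the whole clip.
    idx = np.linspace(0, len(vr) - 1, num_frames).astype(int)
    frames = vr.get_batch(idx).asnumpy()                        # (T, H, W, 3) uint8
    tf = transforms.Compose([
        transforms.Resize(size),
        transforms.CenterCrop(size),
        transforms.ToTensor(),
        transforms.Normalize((0.48145466, 0.4578275, 0.40821073),
                             (0.26862954, 0.26130258, 0.27577711)),  # CLIP statistics
    ])
    return [tf(Image.fromarray(f)) for f in frames]             # T tensors of (3, size, size)
```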

> In our paper, the query-video branch and the query-caption branch are trained separately. We first train the query-video branch for 5 epochs. Once that branch is trained, we continue with the query-caption branch.

I looked at your code and noticed that captions are already used in train_video.py, so how should I understand your statement that the first 5 epochs train the query-video branch? (In my understanding, if the first 5 epochs are meant to train only the query-video branch, captions should not appear; if captions are present, the query encoder also processes caption information, and then there is no real sense in which the first five epochs train only the query-video branch.)
I am not sure whether my understanding is correct. I am quite confused about this part and look forward to your reply.

Originally posted by @shams2023 in #4 (comment)

Some questions about the file.

Good afternoon, I'm reading your code and trying to run it with the files you uploaded. Unfortunately, I could not run it successfully, perhaps because the "sim_matrix" file is not provided. Also, may I ask when the pre-extracted video frame features will be uploaded? Hoping for your reply, and wishing you well.

Checkpoint weights

Hi,

Great work and thanks for sharing your code!
Just wondering, do you have plans to release the checkpoint weights of the models you have already trained, so that we can directly run inference with them?

Thanks!

Question of the caption file.

Hi,

Thanks for releasing the code and data.
I have checked the provided caption data and found that there are two additional keys in the dataset, 'title' and 'titles'.
Could you provide some explanation of them? For example, how did you obtain this data: from URLs or from a captioning model? And what is the difference between the two sets?

Thank you!

Caption encoder and query encoder share weights?

I am very confused: the caption encoder and the query encoder share weights, so what are the parameters being optimized for the QC matching? And why do we need to pass the C×D caption embeddings through MHA and multiply the result with the query embedding?
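
For intuition, a minimal sketch of aggregating C caption embeddings with multi-head attention before matching them against the query; the module and shapes are illustrative, and the MHA weights are the extra learnable parameters sitting on top of the shared text encoder (this is not the repo's exact implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CaptionAggregator(nn.Module):
    """Pools C caption embeddings into one vector per video via self-attention."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.mha = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, cap_emb):                    # cap_emb: (B, C, D)
        attn_out, _ = self.mha(cap_emb, cap_emb, cap_emb)
        return attn_out.mean(dim=1)                # (B, D)

agg = CaptionAggregator()
query = torch.randn(4, 512)                        # (B, D) query embeddings
caps = torch.randn(4, 30, 512)                     # (B, C, D) caption embeddings
cap_feat = agg(caps)
# Query-caption similarity matrix used for the QC matching.
sim_qc = F.normalize(query, dim=-1) @ F.normalize(cap_feat, dim=-1).t()
```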

Sample inference code

Hi,
Do you have sample inference code to load the model, preprocess a video and text, and get the similarity score?

Thanks!
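
For reference, a minimal zero-shot-style sketch of scoring one video against one query with a plain CLIP backbone (encode sampled frames, mean-pool them into a video feature, take cosine similarity with the text feature); the trained Cap4Video heads are not modeled here and the frame paths are placeholders:

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

frames = [Image.open(f"frame_{i}.jpg") for i in range(12)]     # pre-extracted frames
pixels = torch.stack([preprocess(f) for f in frames]).to(device)
text = clip.tokenize(["a man is cooking pasta"]).to(device)

with torch.no_grad():
    frame_feat = model.encode_image(pixels)                    # (T, D)
    video_feat = frame_feat.mean(dim=0, keepdim=True)          # (1, D) mean pooling
    text_feat = model.encode_text(text)                        # (1, D)
    video_feat = video_feat / video_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    print("similarity:", (text_feat @ video_feat.t()).item())  # cosine similarity
```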

How the entire dataset is converted into captions

Thank you very much for your work!
How did you convert the videos of an entire dataset into captions? I currently want to convert all the images or videos in an entire dataset into captions, but the code from the article [ZeroCap: Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic] only converts one image at a time, so I would really like to know what I need to do to caption an entire dataset.
I really hope to receive your guidance. Thank you again.
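
For reference, a minimal batch-captioning sketch: iterate over every video in a directory, sample a few frames, and caption each frame. `generate_caption` below is a hypothetical wrapper around whatever single-image captioner you use (ZeroCap or otherwise); its real API will differ:

```python
import glob, json, os

import numpy as np
from decord import VideoReader, cpu
from PIL import Image

def caption_dataset(video_dir, out_json, num_frames=3):
    results = {}
    for path in sorted(glob.glob(os.path.join(video_dir, "*.mp4"))):
        vid = os.path.splitext(os.path.basename(path))[0]
        vr = VideoReader(path, ctx=cpu(0))
        idx = np.linspace(0, len(vr) - 1, num_frames).astype(int)
        frames = [Image.fromarray(f) for f in vr.get_batch(idx).asnumpy()]
        # generate_caption is a placeholder for the image captioner of your choice.
        results[vid] = [generate_caption(f) for f in frames]
    with open(out_json, "w") as f:
        json.dump(results, f, indent=2)
```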

Low R1 performance in the 2nd stage

Thanks for sharing your code. Is it normal to get R@1 = 30 with train_titles.py? After running the score fusion, the title matrix does not improve on the video matrix.
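
For reference, a minimal sketch of score-level fusion between the query-video and query-caption similarity matrices followed by R@1; the file names and the fusion weight alpha are assumptions, and alpha would normally be tuned on a validation set:

```python
import numpy as np

def recall_at_1(sim):
    # sim: (num_queries, num_videos) with ground truth on the diagonal.
    top1 = (-sim).argsort(axis=1)[:, 0]
    return float((top1 == np.arange(sim.shape[0])).mean())

sim_qv = np.load("sim_matrix_video.npy")       # assumed file names
sim_qc = np.load("sim_matrix_caption.npy")
alpha = 0.8
fused = alpha * sim_qv + (1 - alpha) * sim_qc
print("R@1:", recall_at_1(fused))
```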

Two branch or two loss

In the paper, you mention that "To reduce conflict between the two branches, the query-video branch is trained first, followed by the query-caption branch." However, you also mention that "The total loss L is the sum of the Query-Video loss L_{QV} and the Query-Caption loss L_{QC}." Are the two branches trained separately? My question is: what loss is used when the query-video branch is trained first, and what loss when the query-caption branch is trained afterwards? In addition, how many epochs does it take to train the query-video branch first?
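
For intuition, a minimal sketch of how the two objectives could be combined, assuming symmetric InfoNCE losses over the in-batch query-video and query-caption similarity matrices; optimizing only loss_qv in stage 1 and then adding loss_qc in stage 2 is one way to read the paper's description, and this is an illustration rather than the repo's exact code:

```python
import torch
import torch.nn.functional as F

def infonce(sim, temperature=0.05):
    # Symmetric cross-entropy over a (B, B) similarity matrix with matched pairs on the diagonal.
    sim = sim / temperature
    labels = torch.arange(sim.size(0), device=sim.device)
    return 0.5 * (F.cross_entropy(sim, labels) + F.cross_entropy(sim.t(), labels))

sim_qv = torch.randn(8, 8)                     # in-batch query-video similarities
sim_qc = torch.randn(8, 8)                     # in-batch query-caption similarities
loss = infonce(sim_qv) + infonce(sim_qc)       # L = L_QV + L_QC
```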

Inference on pretrained model

Hi,

Your work is very nice.

I wonder if it would be possible for you to share the pre-trained model and instructions on how to use it, or at least instructions for running the pre-trained model on a test set of data. For example, given a set of 20 videos, their captions, and a query text, retrieve the video closest to the query text.

Thank you!
