Comments (17)
The code is based on CLIP4Clip; the torch version is 1.11.0 and CUDA is 11.6. @Tiiivoo
from cap4video.
Hello, I think this is good work. However, in the code you set batch_size=256, while the paper states 128 (maybe the version of the paper I downloaded is wrong? I downloaded it from arXiv). I reproduced the code and found that with batch_size=256 the accuracy on MSRVTT-1kA matches the paper, but with batch_size=128 it is only about 47%.
Hello, what do you need to prepare to run this project? I have been trying for several days but still can't run it successfully. How can I run this project?
from cap4video.
Hello, may I ask what version of PyTorch you are using? Have you encountered any issues when using batch_first=True?
from cap4video.
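For context on the batch_first question above: `nn.MultiheadAttention` only accepts `batch_first=True` from PyTorch 1.9 onward; on older versions the constructor rejects the keyword with a TypeError, which matches the reported error. A minimal sketch (not from the repo's code):

```python
import torch

# batch_first=True was added to nn.MultiheadAttention in PyTorch 1.9;
# older versions reject the keyword, which matches the reported error.
attn = torch.nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)

x = torch.randn(2, 10, 64)      # (batch, seq, dim) layout, no transpose needed
out, _ = attn(x, x, x)
print(out.shape)                # torch.Size([2, 10, 64])
```

If this raises a TypeError about an unexpected `batch_first` argument, the installed torch is older than 1.9.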
I just downloaded the code and data. It looks like 8 GPUs with batch_size=256 are essential for reproducing the project. @shams2023
from cap4video.
> The code is based on CLIP4Clip; the torch version is 1.11.0 and CUDA is 11.6.
Thank you for your answer!
The author mentions in the paper that the interaction module used is a co-attention transformer. In which part of the code is it implemented?
from cap4video.
@sweet132 Have you checked how much GPU memory is used when batch_size=128? Even when I turn down batch_size_val, I still get the error "CUDA out of memory when evaluating. Testing model at the end!".
from cap4video.
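One common way around the evaluation OOM (a generic sketch, independent of this repo's actual evaluation code; all names below are illustrative) is to build the text-video similarity matrix in chunks under `torch.no_grad()` and move each block to the CPU immediately:

```python
import torch

def chunked_sim(text_feats, video_feats, chunk=128):
    """Compute a text-video similarity matrix in chunks to limit peak GPU memory.

    text_feats:  (N_t, D) L2-normalized text embeddings
    video_feats: (N_v, D) L2-normalized video embeddings
    """
    sims = []
    with torch.no_grad():             # no autograd graph is kept during evaluation
        for i in range(0, text_feats.size(0), chunk):
            block = text_feats[i:i + chunk] @ video_feats.t()
            sims.append(block.cpu())  # move each block off the GPU right away
    return torch.cat(sims, dim=0)
```

The result is identical to the full matrix multiply, but only `chunk` rows of the similarity matrix live on the GPU at any time.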
The modeling code is in modeling.py, where you can find what you want. @shams2023
from cap4video.
With 8 GPUs and batch_size=256, GPU memory usage will be around 20GB. You can use that setting as a reference. I am not sure why it takes up so much memory, since CLIP4Clip only needs around 11GB. @shallowdream66
from cap4video.
> With 8 GPUs and batch_size=256, GPU memory usage will be around 20GB. You can use that setting as a reference. I am not sure why it takes up so much memory, since CLIP4Clip only needs around 11GB. @shallowdream66
I am also very confused. Compared to CLIP4Clip, it takes up much more memory and training time.
from cap4video.
> Hello, I think this is good work. However, in the code you set batch_size=256, while the paper states 128 (maybe the version of the paper I downloaded is wrong? I downloaded it from arXiv). I reproduced the code and found that with batch_size=256 the accuracy on MSRVTT-1kA matches the paper, but with batch_size=128 it is only about 47%.
The code is based on CLIP4Clip; the torch version is 1.11.0 and CUDA is 11.6. @Tiiivoo
Hello, regarding the 'msrvtt_train_with_vitb32_max1_title_titles.json' file, I don't understand where the 'titles' data comes from; the MSR-VTT dataset doesn't seem to include it. If the 'titles' were obtained through web crawling, why are there 30 of them?
from cap4video.
> Hello, I think this is good work. However, in the code you set batch_size=256, while the paper states 128 (maybe the version of the paper I downloaded is wrong? I downloaded it from arXiv). I reproduced the code and found that with batch_size=256 the accuracy on MSRVTT-1kA matches the paper, but with batch_size=128 it is only about 47%.
I'm glad to hear that you've successfully reproduced our results. Regarding the batch size issue, we apologize for any confusion, and it may indeed be an oversight in the paper. Please consider our code as the practical reference.
from cap4video.
Thank you for your reply. Although I achieved results similar to the paper on MSRVTT, I got poor results on MSVD (46.1), where I trained directly on the raw data, while for the VATEX (62.0) dataset I used the extracted frames you uploaded. I'm not sure why that is.
@whwu95
from cap4video.
Hello, I suggest you refer to the paper; the titles are generated by a model (GPT-2 or CLIP). @Tiiivoo
from cap4video.
The modeling code is in modeling.py, where you can find what you want. @shams2023
How can this be done on a single RTX 3090?
from cap4video.
+1
from cap4video.
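For the single-3090 question above: one option (a generic sketch, not the repo's actual training loop) is gradient accumulation, which trades training time for memory. Note that with a contrastive loss each optimizer step still only sees the per-step batch as in-batch negatives, so results may not exactly match a true batch of 256:

```python
import torch

def train_with_accumulation(model, loader, optimizer, accum_steps=8):
    """Simulate a large effective batch on one GPU via gradient accumulation.

    With a per-step batch of 32 and accum_steps=8, the effective batch is 256.
    Caveat: a contrastive loss only sees the per-step batch as negatives,
    so this is not exactly equivalent to a true 256 batch.
    """
    model.train()
    optimizer.zero_grad()
    for step, (x, y) in enumerate(loader):
        loss = model(x, y) / accum_steps   # scale so accumulated gradients average correctly
        loss.backward()                    # gradients add up across steps
        if (step + 1) % accum_steps == 0:
            optimizer.step()               # update once per accum_steps mini-batches
            optimizer.zero_grad()
```

Here `model(x, y)` is assumed to return a scalar loss; adapt the call to the actual model interface.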
> Hello, I think this is good work. However, in the code you set batch_size=256, while the paper states 128 (maybe the version of the paper I downloaded is wrong? I downloaded it from arXiv). I reproduced the code and found that with batch_size=256 the accuracy on MSRVTT-1kA matches the paper, but with batch_size=128 it is only about 47%.
I would like to ask for your help!
In train_video.py, the first 5 epochs are used to train the video-query branch, so why is the caption value computed in the model's forward pass?
As shown in the figure below:
Aren't the first 5 epochs supposed to train only the text (query) encoder? If captions are included at this point, doesn't that mean the caption encoder is also being trained? I am confused about this part and hope for your help!
Thank you again, and sorry for taking up your time!
from cap4video.
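On the two-stage question above: computing caption features in `forward` does not by itself train the caption encoder; if its parameters have `requires_grad=False` (or its output is detached), no gradients reach it. A hypothetical sketch of such a stage schedule, where `video_branch` and `caption_branch` are made-up attribute names rather than the actual ones in modeling.py:

```python
import torch

def set_stage(model, stage):
    """Two-stage schedule sketch: stage 1 trains the Query-Video branch,
    stage 2 the Query-Caption branch. `video_branch` / `caption_branch`
    are hypothetical attribute names, not the real ones in modeling.py.
    """
    train_video = (stage == 1)
    for p in model.video_branch.parameters():
        p.requires_grad = train_video        # frozen in stage 2
    for p in model.caption_branch.parameters():
        p.requires_grad = not train_video    # frozen in stage 1
```

A frozen branch can still run in `forward` (e.g. to produce features for logging or for the other branch's loss) without its weights being updated.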
> Hello, I think this is good work. However, in the code you set batch_size=256, while the paper states 128 (maybe the version of the paper I downloaded is wrong? I downloaded it from arXiv). I reproduced the code and found that with batch_size=256 the accuracy on MSRVTT-1kA matches the paper, but with batch_size=128 it is only about 47%.
Can you share it for me to refer to? I don't know how to define the storage locations of the variables here. As shown in the figure below (it can also be co_train_msrvtt.sh):
from cap4video.
Related Issues (20)
- Instructions to run MSVD, DiDeMo datasets HOT 4
- How the entire dataset is converted into captions HOT 1
- In preprocessing, which part of the video captions code is in your folder? HOT 1
- 1
- Which part of the code is the interaction module implemented in? HOT 1
- The parameter batch_first=True is causing an error
- How do I know if the video features will be better after interacting with caption? HOT 7
- Training requirements HOT 4
- > In our paper, the Query-Video branch and the Query-Caption branch are trained separately. We first train the Query-Video branch for 5 epochs. Once that branch is trained, we continue training the Query-Caption branch. HOT 1
- Training script for MSVD, DiDeMo, VATEX
- Requesting code to generate the frames from the MSRVTT dataset HOT 2
- Python, Pytorch, Torchvision, Cudatoolkit versions
- Resume training HOT 1
- some questions HOT 1
- Preprocess for other datasets HOT 1
- Questions about [SEP] token
- Inference on pretrained model
- Training on other datasets
- Low R1 performance in the 2nd stage HOT 11