Comments (17)
The code is based on CLIP4Clip; the torch version is 1.11.0 and CUDA is 11.6. @Tiiivoo
from cap4video.
Hello, I think this is good work. However, in the code you set batch_size=256, while the paper states 128 (maybe the version of the paper I downloaded is wrong? I downloaded it from arXiv). I reproduced the code and found that with batch_size=256 the accuracy on MSRVTT-1kA matches the paper, but with batch_size=128 it is only about 47%.
Hello, what do you need to prepare to run this project? I have been trying for several days but still can't run it successfully. How can I run this project?
from cap4video.
Hello, may I ask what version of PyTorch you are using? Have you encountered any issues when using batch_first=True?
from cap4video.
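For context on the batch_first question above: `nn.MultiheadAttention` only accepts `batch_first=True` from PyTorch 1.9 onward; on older versions the constructor rejects the keyword with a TypeError, which matches the reported error. A minimal sketch (not from the repo's code):

```python
import torch

# batch_first=True was added to nn.MultiheadAttention in PyTorch 1.9;
# older versions reject the keyword, which matches the reported error.
attn = torch.nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)

x = torch.randn(2, 10, 64)      # (batch, seq, dim) layout, no transpose needed
out, _ = attn(x, x, x)
print(out.shape)                # torch.Size([2, 10, 64])
```

If this raises a TypeError about an unexpected `batch_first` argument, the installed torch is older than 1.9.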
I just downloaded the code and data. It looks like 8 GPUs with batch_size=256 are essential for reproducing the project. @shams2023
from cap4video.
> The code is based on CLIP4Clip; the torch version is 1.11.0 and CUDA is 11.6.
Thank you for your answer!
The author mentions in the paper that the interaction module used is a co-attention transformer. In which part of the code is it implemented?
from cap4video.
@sweet132 Have you checked how much GPU memory is used when batch_size=128? Even when I turn down batch_size_val, I still get the error "CUDA out of memory when evaluating. Testing model at the end!".
from cap4video.
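One common way around the evaluation OOM (a generic sketch, independent of this repo's actual evaluation code; all names below are illustrative) is to build the text-video similarity matrix in chunks under `torch.no_grad()` and move each block to the CPU immediately:

```python
import torch

def chunked_sim(text_feats, video_feats, chunk=128):
    """Compute a text-video similarity matrix in chunks to limit peak GPU memory.

    text_feats:  (N_t, D) L2-normalized text embeddings
    video_feats: (N_v, D) L2-normalized video embeddings
    """
    sims = []
    with torch.no_grad():             # no autograd graph is kept during evaluation
        for i in range(0, text_feats.size(0), chunk):
            block = text_feats[i:i + chunk] @ video_feats.t()
            sims.append(block.cpu())  # move each block off the GPU right away
    return torch.cat(sims, dim=0)
```

The result is identical to the full matrix multiply, but only `chunk` rows of the similarity matrix live on the GPU at any time.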
The modeling code is in modeling.py, where you can find what you want. @shams2023
from cap4video.
With 8 GPUs and batch_size=256, GPU memory usage will be around 20GB. You can use that setting as a reference. I am not sure why it takes up so much memory, since CLIP4Clip only needs around 11GB. @shallowdream66
from cap4video.
> With 8 GPUs and batch_size=256, GPU memory usage will be around 20GB. You can use that setting as a reference. I am not sure why it takes up so much memory, since CLIP4Clip only needs around 11GB. @shallowdream66
I am also very confused. Compared to CLIP4Clip, it takes up much more memory and training time.
from cap4video.
> Hello, I think this is good work. However, in the code you set batch_size=256, while the paper states 128 (maybe the version of the paper I downloaded is wrong? I downloaded it from arXiv). I reproduced the code and found that with batch_size=256 the accuracy on MSRVTT-1kA matches the paper, but with batch_size=128 it is only about 47%.
The code is based on CLIP4Clip; the torch version is 1.11.0 and CUDA is 11.6. @Tiiivoo
Hello, regarding the 'msrvtt_train_with_vitb32_max1_title_titles.json' file, I don't understand where the 'titles' data comes from; the MSR-VTT dataset doesn't seem to include it. If the 'titles' were obtained through web crawling, why are there 30 of them?
from cap4video.
> Hello, I think this is good work. However, in the code you set batch_size=256, while the paper states 128 (maybe the version of the paper I downloaded is wrong? I downloaded it from arXiv). I reproduced the code and found that with batch_size=256 the accuracy on MSRVTT-1kA matches the paper, but with batch_size=128 it is only about 47%.
I'm glad to hear that you've successfully reproduced our results. Regarding the batch size issue, we apologize for any confusion, and it may indeed be an oversight in the paper. Please consider our code as the practical reference.
from cap4video.
Thank you for your reply. Although I achieved results similar to the paper on MSRVTT, I got poor results on MSVD (46.1), where I trained directly on the raw data, while for the VATEX (62.0) dataset I used the extracted frames you uploaded. I'm not sure why that is.
@whwu95
from cap4video.
Hello, I suggest you refer to the paper; the titles are generated by a model (GPT-2 or CLIP). @Tiiivoo
from cap4video.
The modeling code is in modeling.py, where you can find what you want. @shams2023
How can this be done on a single RTX 3090?
from cap4video.
+1
from cap4video.
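For the single-3090 question above: one option (a generic sketch, not the repo's actual training loop) is gradient accumulation, which trades training time for memory. Note that with a contrastive loss each optimizer step still only sees the per-step batch as in-batch negatives, so results may not exactly match a true batch of 256:

```python
import torch

def train_with_accumulation(model, loader, optimizer, accum_steps=8):
    """Simulate a large effective batch on one GPU via gradient accumulation.

    With a per-step batch of 32 and accum_steps=8, the effective batch is 256.
    Caveat: a contrastive loss only sees the per-step batch as negatives,
    so this is not exactly equivalent to a true 256 batch.
    """
    model.train()
    optimizer.zero_grad()
    for step, (x, y) in enumerate(loader):
        loss = model(x, y) / accum_steps   # scale so accumulated gradients average correctly
        loss.backward()                    # gradients add up across steps
        if (step + 1) % accum_steps == 0:
            optimizer.step()               # update once per accum_steps mini-batches
            optimizer.zero_grad()
```

Here `model(x, y)` is assumed to return a scalar loss; adapt the call to the actual model interface.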
> Hello, I think this is good work. However, in the code you set batch_size=256, while the paper states 128 (maybe the version of the paper I downloaded is wrong? I downloaded it from arXiv). I reproduced the code and found that with batch_size=256 the accuracy on MSRVTT-1kA matches the paper, but with batch_size=128 it is only about 47%.
I would like to ask for your help!
In train_video.py, the first 5 epochs are used to train the video-query branch, so why is the caption value computed in the model's forward pass?
As shown in the figure below:
Aren't the first 5 epochs supposed to train only the text (query) encoder? If captions are included at this point, doesn't that mean the caption encoder is also being trained? I am confused about this part and hope for your help!
Thank you again, and sorry for taking up your time!
from cap4video.
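On the two-stage question above: computing caption features in `forward` does not by itself train the caption encoder; if its parameters have `requires_grad=False` (or its output is detached), no gradients reach it. A hypothetical sketch of such a stage schedule, where `video_branch` and `caption_branch` are made-up attribute names rather than the actual ones in modeling.py:

```python
import torch

def set_stage(model, stage):
    """Two-stage schedule sketch: stage 1 trains the Query-Video branch,
    stage 2 the Query-Caption branch. `video_branch` / `caption_branch`
    are hypothetical attribute names, not the real ones in modeling.py.
    """
    train_video = (stage == 1)
    for p in model.video_branch.parameters():
        p.requires_grad = train_video        # frozen in stage 2
    for p in model.caption_branch.parameters():
        p.requires_grad = not train_video    # frozen in stage 1
```

A frozen branch can still run in `forward` (e.g. to produce features for logging or for the other branch's loss) without its weights being updated.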
> Hello, I think this is good work. However, in the code you set batch_size=256, while the paper states 128 (maybe the version of the paper I downloaded is wrong? I downloaded it from arXiv). I reproduced the code and found that with batch_size=256 the accuracy on MSRVTT-1kA matches the paper, but with batch_size=128 it is only about 47%.
Can you share it for me to refer to? I don't know how to define the storage locations of the variables here. As shown in the figure below (it can also be co_train_msrvtt.sh):
from cap4video.
Related Issues (20)
- Instructions to run MSVD, DiDeMo datasets HOT 4
- How the entire dataset is converted into captions HOT 1
- In preprocessing, which part of the video captions code is in your folder? HOT 1
- 1
- Which part of the code is the interaction module implemented in? HOT 1
- The parameter batch_first=True is causing an error
- How do I know if the video features will be better after interacting with caption? HOT 7
- Training requirements HOT 4
- > In our paper, the Query-Video branch and the Query-Caption branch are trained separately. We first train the Query-Video branch for 5 epochs. Once that branch is trained, we continue training the Query-Caption branch. HOT 1
- Training script for MSVD, DiDeMo, VATEX
- Requesting code to generate the frames from the MSRVTT dataset HOT 2
- Python, Pytorch, Torchvision, Cudatoolkit versions
- Resume training HOT 1
- some questions HOT 1
- Preprocess for other datasets HOT 1
- Questions about [SEP] token
- Inference on pretrained model
- Training on other datasets
- Low R1 performance in the 2nd stage HOT 11