Based on my understanding, you use only 8 frames sampled uniformly across the video. I


frame rate and duration of videos about video-llama

damo-nlp-sg commented on May 28, 2024 2

frame rate and duration of videos

from video-llama.

hangzhang-nlp commented on May 28, 2024 4

Increasing the total number of frames will indeed increase the training time. Although it won't add to learnable parameters, the amount of computation involved in encoding video will increase.
You can try to fix the frame rate. This modification is relatively easy to make in our framework. Since the video lengths in the training set vary significantly, some truncation and padding operations will be required.
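To make the truncation and padding concrete, here is a minimal sketch of fixed-rate frame selection. It is not code from our repository: the function names, the choice to pad by repeating the last frame, and the 0/1 mask are all assumptions for illustration. `uniform_indices` shows one simple convention for the current uniform sampling, for contrast.

```python
def uniform_indices(num_frames, n):
    """Pick n frame indices spread evenly over the whole video (one simple convention)."""
    if num_frames <= 0 or n <= 0:
        raise ValueError("num_frames and n must be positive")
    return [int(i * num_frames / n) for i in range(n)]


def fixed_rate_indices(num_frames, native_fps, target_fps, max_frames):
    """Pick frame indices at a fixed target frame rate.

    Long videos are truncated to max_frames; short videos are padded by
    repeating the last sampled index (an assumed padding scheme).
    Returns (indices, mask), where mask is 1 for real frames and 0 for padding.
    """
    if min(num_frames, native_fps, target_fps, max_frames) <= 0:
        raise ValueError("all arguments must be positive")
    step = native_fps / target_fps  # source frames between two sampled frames
    indices = []
    k = 0
    while len(indices) < max_frames:  # truncation: stop at max_frames
        idx = int(k * step)
        if idx >= num_frames:  # ran past the end of the video
            break
        indices.append(idx)
        k += 1
    mask = [1] * len(indices)
    pad = max_frames - len(indices)
    indices += [indices[-1]] * pad  # padding: repeat the last real frame
    mask += [0] * pad
    return indices, mask


# A 10 s clip at 30 fps sampled at 2 fps is truncated to the first 8 samples.
long_idx, long_mask = fixed_rate_indices(300, 30, 2, 8)
# A 2 s clip at 30 fps yields only 4 real samples and 4 padded slots.
short_idx, short_mask = fixed_rate_indices(60, 30, 2, 8)
```

With a fixed rate, a long video only contributes its first `max_frames / target_fps` seconds, so you may also want to pick a random start offset during training. The mask would need to be passed on to the video-level fusion so padded slots can be ignored.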
Pre-training Video-LLaMA with more frames per video may work better, but training will take longer. Some video pre-training work, for example Figure 3 of the CLIP4Clip paper, has observed that going beyond eight frames yields only a marginal improvement.
In the visual encoder (ViT and Q-Former), the frames are encoded independently. However, in the video Q-Former, the features of these frames are fused, resulting in an overall representation of the video.
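The two stages can be illustrated with a toy sketch in plain Python. This is not our implementation: `encode_frame` is a hypothetical stand-in for the frozen per-frame encoder, and `fuse_video` replaces the Video Q-Former with a single cross-attention step over learnable queries. The sizes are arbitrary. It only shows where information from different frames starts to mix.

```python
import math
import random


def encode_frame(frame, num_tokens=2, dim=4):
    """Hypothetical stand-in for the frozen per-frame encoder: it sees ONE frame only."""
    mean = sum(frame) / len(frame)
    return [[mean * (k + 1) + 0.1 * d for d in range(dim)] for k in range(num_tokens)]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_video(frame_tokens, pos_emb, queries):
    """Simplified fusion: each query attends over the tokens of ALL frames."""
    dim = len(queries[0])
    # Add a per-frame position embedding, then flatten frames into one sequence.
    seq = [
        [x + p for x, p in zip(tok, pos_emb[t])]
        for t, toks in enumerate(frame_tokens)
        for tok in toks
    ]
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, tok)) / math.sqrt(dim) for tok in seq]
        weights = softmax(scores)
        out.append([sum(w * tok[d] for w, tok in zip(weights, seq)) for d in range(dim)])
    return out


rng = random.Random(0)
T, N, D, K = 8, 2, 4, 3  # frames, tokens per frame, feature dim, video queries
video = [[rng.random() for _ in range(16)] for _ in range(T)]

# Stage 1: every frame is encoded on its own.
frame_tokens = [encode_frame(f, N, D) for f in video]

# Stage 2: position embeddings mark frame order, then the queries fuse all frames.
pos_emb = [[rng.gauss(0, 0.02) for _ in range(D)] for _ in range(T)]
queries = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(K)]
video_repr = fuse_video(frame_tokens, pos_emb, queries)  # K vectors of size D
```

Changing one frame leaves the stage-1 tokens of every other frame untouched, but it changes the fused video representation, since the queries attend across the whole sequence.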
Sure, you can use pre-extracted features from other models to replace the BLIP-2 module in our architecture.
The learnable modules in the Vision-Language Branch are: a) the position embedding layer, b) the Video Q-Former, and c) the linear layer. These modules are trained to connect the output of the frozen visual encoder (BLIP-2) to the frozen LLM (large language model).
