
MovieChat's Introduction

I am a master's student in the Information Processing Lab at the University of Washington, currently working on embodied agents and video understanding. Have a look at my homepage for more details.

When I am not doing research, I like photography, traveling, and singing.





Updates:

  • 03/2024: One paper accepted to the LLM Agents workshop at ICLR 2024.
  • 02/2024: Two papers accepted to CVPR 2024.
  • 02/2024: Invited talk at the IMAGEOMICS workshop at AAAI 2024.
  • 12/2023: One paper accepted to ICASSP 2024.
  • 12/2023: One paper accepted to AAAI 2024.
  • 11/2023: Two papers accepted to WACV 2024 and its CV4Smalls workshop.
  • 09/2023: One paper accepted to the TNGCV-DataComp workshop at ICCV 2023.
  • 09/2023: One paper accepted to IEEE T-MM.
  • 08/2023: One paper accepted to BMVC 2023.
  • 07/2023: Two papers accepted to ACM MM 2023.
  • 07/2023: Finished my research internship at Microsoft Research Asia (MSRA), Beijing.
  • 07/2023: Two papers accepted to ICCV 2023.


MovieChat's Issues

YAML file

In the YAML file, should llama_model point to the --target directory produced by apply_delta.py, or to the --delta directory?

Evaluation json file

Hi, may I know which file is used for "--gt_file"? Is it the train, test, or val.json of MSVD-QA?
I want to replicate the results reported in the paper.

python run_inference_qa_msvd.py \
    --cfg-path eval_configs/MovieChat.yaml \
    --gpu-id 0 \
    --num-beams 1 \
    --temperature 1.0 \
    --video-path /path/to/your/video \
    --gt_file "/path/to/your/question and answer file" \
    --output_dir /path/to/your/output \
    --output_name msvd-qa \
    --fragment-video-path src/video_fragment/output.mp4

Have you released the model weights?

Thank you for your great work! I found that the pre-trained weight link points to the Video-LLaMA weights. Have you released the model weights for this work?

About quantitative evaluation

Hi, thanks for your work.
I am curious to know what you did to improve the quantitative evaluation results from v1 to v2.
Thanks.

Is positional embedding applied as intended when batch_size > 1?

The code assumes batch_size == 1, so in practice this is not a big deal, but the code is misleading.

Reproducing the pos emb expansion in MovieChat could be:

import torch
import torch.nn as nn

batch_size = 2       # > 1 to reproduce the problem; the repo effectively always uses batch_size == 1
max_frame_pos = 32
hidden_size = 768    # self.Qformer.config.hidden_size
alpha = 0.01
video_frame_position_embedding = nn.Embedding(max_frame_pos, hidden_size)  # [32, 768]

n_position = 8
position_ids = torch.arange(n_position).long()                    # [8]
position_ids = position_ids.unsqueeze(0).expand(batch_size, -1)   # [batch_size, 8]
p = video_frame_position_embedding(position_ids).squeeze(0)       # [batch_size, 8, 768] <- squeeze has no effect here if batch_size > 1

u = []
for p_i in p:  # <- this loops over the batch dimension, so all p_i are identical
    u_i = (p_i - alpha * p[0]) / (1 - alpha)
    u.append(u_i)
# so here, all u_i are identical, e.g., torch.all(u[0] == u[1]) is True

frame_position_embeddings = []
for i in range(n_position):
    for j in range(n_position):
        q_i = alpha * u[i] + (1 - alpha) * u[j]  # if 1 < batch_size < n_position, an IndexError occurs here
        q_i = q_i.unsqueeze(0)
        frame_position_embeddings.append(q_i)

frame_position_embeddings = torch.cat(frame_position_embeddings, dim=0)

In addition, if self.video_frame_position_embedding is fixed, why do we need to recompute frame_position_embeddings on every inference?
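A possible simplification, just a sketch on my side (it assumes the interpolation really only depends on the fixed embedding table): precompute the expanded embeddings once, e.g. in __init__, and reuse them.

import torch
import torch.nn as nn

# Hypothetical caching sketch; names mirror the snippet above.
n_position, hidden_size, alpha = 8, 768, 0.01
video_frame_position_embedding = nn.Embedding(32, hidden_size)

with torch.no_grad():
    base = video_frame_position_embedding(torch.arange(n_position))    # [8, 768]
    u = (base - alpha * base[0]) / (1 - alpha)                          # [8, 768]
    cached = torch.stack([alpha * u[i] + (1 - alpha) * u[j]
                          for i in range(n_position)
                          for j in range(n_position)])                  # [64, 768], computed once and reused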

How to Run the Demo

While running the inference exactly as recommended on the main page, using a random video test.mp4:

python inference.py --cfg-path eval_configs/MovieChat.yaml --gpu-id 0 --num-beams 1 --temperature 1.0 --text-query "What is he doing?" --video-path src/examples/test.mp4 --fragment-video-path src/video_fragment/output.mp4 --cur-min 1 --cur-sec 1 --middle-video 1

it crashes with the following error:

Traceback (most recent call last):
  File "MovieChat/inference.py", line 363, in <module>
    cv2.imwrite(temp_frame_path, frame)
cv2.error: OpenCV(4.7.0) /io/opencv/modules/imgcodecs/src/loadsave.cpp:783: error: (-215:Assertion failed) !_img.empty() in function 'imwrite'

Using these models in the .yaml:
llama_model: "ckpt/Llama-2-7b-hf"
llama_proj_model: 'ckpt/minigpt4/pretrained_minigpt4.pth'
ckpt: "ckpt/finetune-vicuna7b-v2.pth"

Q: How to resolve this?
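For context, that assertion fires when cv2.imwrite is handed an empty image, i.e. no frame was decoded from the video. A minimal check (hypothetical snippet, not from the repo) to confirm whether OpenCV can read the file at all:

import cv2

cap = cv2.VideoCapture("src/examples/test.mp4")
ok, frame = cap.read()
cap.release()
if not ok or frame is None:
    # No frame was decoded: the path is wrong or the codec is unsupported,
    # which is what later triggers the !_img.empty() assertion in imwrite.
    raise RuntimeError("OpenCV could not decode a frame from the video")
cv2.imwrite("src/video_fragment/frame_check.jpg", frame)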

Inference fails

I tried running inference with a sample and I got the following error:
Traceback (most recent call last):
  File "/l/users/mohamed.imam/MovieChatV2/MovieChat/inference.py", line 363, in <module>
    cv2.imwrite(temp_frame_path, frame)
cv2.error: OpenCV(4.7.0) /io/opencv/modules/imgcodecs/src/loadsave.cpp:783: error: (-215:Assertion failed) !_img.empty() in function 'imwrite'

Inference fails

When I run inference.py, it reaches this point and then the process gets killed. Can I know the reason for it?

Initializing Chat
Loading VIT
Loading VIT Done
Loading Q-Former
Using pad_token, but it is not set yet.
Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

How much memory is required? I tried on a machine with 100 GB of RAM and it still failed.
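For reference, a back-of-envelope estimate of the LLM weight footprint alone (my own rough arithmetic, not an official requirement):

# A 7B-parameter LLM needs ~14 GB in fp16 and ~28 GB in fp32 for the weights alone,
# before the visual encoder, Q-Former, activations, or temporary copies made while
# loading checkpoint shards.
params = 7e9
print(params * 2 / 1e9)  # fp16 bytes -> ~14 GB
print(params * 4 / 1e9)  # fp32 bytes -> ~28 GB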

errors when using the global mode

When using the global mode, how should the parameters be set?
I set:
"--cfg-path", "eval_configs/MovieChat.yaml",
"--gpu-id", "0",
"--num-beams", "1" ,
"--temperature", "1.0" ,
"--text-query", "Summarize the video" ,
"--video-path", "/home/develop/fyy/MovieChat-main/src/examples/11.mp4" ,
"--fragment-video-path", "src/video_fragment/output.mp4" ,
"--cur-min", "0" ,
"--cur-sec", "0" ,
"--middle-video", "0"
but I still get an error:
[screenshot of the error message]

Runtime Error

I got the following error when running the inference script:
Traceback (most recent call last):
  File "/home/mohamed.imam/Projects/MovieChat/inference.py", line 384, in <module>
    msg = chat.upload_video_without_audio(
  File "/home/mohamed.imam/Projects/MovieChat/inference.py", line 277, in upload_video_without_audio
    video_emb, _ = self.model.encode_long_video(cur_image, middle_video)
  File "/home/mohamed.imam/Projects/MovieChat/MovieChat/models/moviechat.py", line 337, in encode_long_video
    cur_short = torch.cat(self.temp_short_memory, dim = 0)
RuntimeError: torch.cat(): expected a non-empty list of Tensors

I skipped the cur_short assignment line, but in the next line the variable video_features is used before assignment:

if len(self.long_memory_buffer) == 0:
    self.temp_short_memory = [i.unsqueeze(0) for i in self.temp_short_memory]
    cur_short = torch.cat(self.temp_short_memory, dim=0)
    video_features = torch.cat([video_features, cur_image], dim=0)  # <- video_features is not defined in this branch
else:
    cur_video = torch.cat(self.long_memory_buffer, dim=0)
    self.temp_short_memory = [i.unsqueeze(0) for i in self.temp_short_memory]
    cur_short = torch.cat(self.temp_short_memory, dim=0)
    video_features = torch.cat([cur_video, cur_short], dim=0)
    video_features = torch.cat([video_features, cur_image], dim=0)
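As a possible workaround (my own guess at the intended behavior, not an official fix), an extra branch could cover the case where both memories are still empty, so video_features is always assigned before use:

# Hypothetical guard, not from the repo: fall back to the current image features
# when both the long-term buffer and the temporary short-term memory are empty.
if len(self.long_memory_buffer) == 0 and len(self.temp_short_memory) == 0:
    video_features = cur_image
elif len(self.long_memory_buffer) == 0:
    cur_short = torch.cat([i.unsqueeze(0) for i in self.temp_short_memory], dim=0)
    video_features = torch.cat([cur_short, cur_image], dim=0)
else:
    cur_video = torch.cat(self.long_memory_buffer, dim=0)
    cur_short = torch.cat([i.unsqueeze(0) for i in self.temp_short_memory], dim=0)
    video_features = torch.cat([cur_video, cur_short, cur_image], dim=0)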

inference bug

There are many SyntaxErrors in inference.py:
File "inference.py", line 82
def call(self, input_ids: torch.LongTensor, scores: torch.FloatTensor):
^
SyntaxError: invalid syntax

File "inference.py", line 134
msg = f"The video contains {len(indices)} frames sampled at {sec} seconds. "
^
SyntaxError: invalid syntax
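For what it's worth, both reported lines are valid Python 3 syntax (parameter annotations since Python 3.0, f-strings since 3.6), so a SyntaxError on them usually means the script is being run with an older interpreter. A quick sanity check (not from the repo):

import sys

# f-strings require Python 3.6+; a SyntaxError on otherwise valid code often
# means the script was launched with an older interpreter (e.g. Python 2).
print(sys.version)
assert sys.version_info >= (3, 6), "Run inference.py with Python 3.6 or newer"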

Breakpoint mode code questions?

Hi, thank you for the great work! I would appreciate it if you could address some of my questions. For the breakpoint mode, from the code below (see the screenshot after the questions),

  1. The num_frames is the number of segments before the current second, and the cur_frame is the frame index in the current segment. Do I understand it correctly?
  2. It seems that only the video fragments before the queried time are considered in the long-term memory. Is this true? If so, should it actually incorporate all fragments in the video?
  3. Why, when calling self.model.encode_short_memory_frame(video_fragment, cur_frame) in line 271, do we only consider the first cur_frame frames of each segment? Should this only apply to the last segment? In other words, for the segments before the last one, should we adopt line 273, which considers all frames in the segment?

[screenshot of the referenced code]

The inference output is garbled text; what went wrong?

[screenshot of the garbled output]
python inference.py --cfg-path eval_configs/MovieChat.yaml --gpu-id 0 --num-beams 1 --temperature 1.0 --text-query "describe this video" --video-path src/examples/p1472zrh7a0.mp4 --fragment-video-path src/video_fragment/output.mp4 --cur-min 1 --cur-sec 1 --middle-video 1
The output is all garbled. Could you take a look?

Pretraining details

Hi, the paper doesn't contain any explicit pretraining details. Where can I find them?

Ambiguous vicuna version

Awesome work.

May I confirm that the Vicuna weights are v0, not v1?
There may be a typo, as shown below.
[screenshot of the suspected typo]

how to correctly run the demo

I want to use the demo to get answers about a video. Unfortunately, the output I got from your demo does not seem to answer the input text query (the last line in the screenshot is the output):

[screenshot of the demo output]

Question About Similarity Formulation in Token Merge

Hello, in your paper the similarity is calculated as:
$$s=\frac{1}{N}\sum_{j=1}^{N} [\cos (x_i^j,x_{i+1}^j)]$$

However, according to the code,
the similarity is actually computed as:
$$s=\frac{1}{N^2}\sum_{j=1}^{N} \sum_{k=1}^{N} [\cos (x_i^j,x_{i+1}^k)]$$

This confused me. Which one is correct, or am I missing something? Thank you.
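For concreteness, here is a small sketch of the two computations as I understand them (the token shapes are hypothetical):

import torch
import torch.nn.functional as F

N, D = 32, 768                                    # hypothetical tokens per frame and feature dim
x_i, x_next = torch.randn(N, D), torch.randn(N, D)

# Paper, Eq. (3): average cosine over matched token pairs (j, j) -> (1/N) * sum_j cos(x_i^j, x_{i+1}^j)
s_paper = F.cosine_similarity(x_i, x_next, dim=-1).mean()

# Code, as I read it: average cosine over all N*N pairs (j, k) -> (1/N^2) * sum_{j,k} cos(x_i^j, x_{i+1}^k)
s_code = (F.normalize(x_i, dim=-1) @ F.normalize(x_next, dim=-1).T).mean()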

Can not run local demo

Dear author,

I encountered the following error when running the local Gradio demo:

IndexError: index 64 is out of bounds for dimension 0 with size 64

Any hints?

Thanks

Model is being loaded into RAM not GPU memory

Loading checkpoint shards gets killed at 0%.
After debugging, I found that the model is being loaded into RAM rather than GPU memory, despite GPU memory being available. Has anyone faced this issue or know why it is happening?
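As a possible direction (a sketch only; it assumes the LLM is loaded through Hugging Face transformers with accelerate installed, which I have not verified against this repo's loading code):

import torch
from transformers import AutoModelForCausalLM

# low_cpu_mem_usage streams the shards instead of materializing the whole model
# in CPU RAM first, and device_map places the weights directly on GPU 0.
# The checkpoint path below is a placeholder.
model = AutoModelForCausalLM.from_pretrained(
    "ckpt/Llama-2-7b-hf",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map={"": 0},
)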

About short_memory_merge and long_memory_length

From the inference code, it can be seen that the encode_short_memory_frame function uses short_memory_merge for merging; however, the long_memory_length parameter, which looks more like the RL mentioned in the paper, is not used. Could you provide a detailed explanation of how the memory mechanism operates, for example as pseudocode or an algorithm? The descriptions in the paper and the code really confuse me.
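For reference, here is my rough reading of the consolidation step as a sketch (function and parameter names are mine, not the repo's), in case it helps frame the question:

import torch
import torch.nn.functional as F

def merge_most_similar(frames):
    # Greedily merge the most similar adjacent pair of frame features (ToMe-style).
    # frames: list of [N, D] tensors; purely illustrative.
    sims = torch.stack([F.cosine_similarity(frames[i], frames[i + 1], dim=-1).mean()
                        for i in range(len(frames) - 1)])
    i = int(sims.argmax())
    merged = (frames[i] + frames[i + 1]) / 2
    return frames[:i] + [merged] + frames[i + 2:]

def memory_step(frame_feat, short_mem, long_mem, short_len=16, merged_len=2):
    # One frame update as I understand it: append to short-term memory; once it is
    # full, repeatedly merge similar neighbours down to merged_len features and
    # move them into long-term memory.
    short_mem.append(frame_feat)
    if len(short_mem) >= short_len:
        while len(short_mem) > merged_len:
            short_mem = merge_most_similar(short_mem)
        long_mem.extend(short_mem)
        short_mem = []
    return short_mem, long_mem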

Could the example videos be made available?

The two example videos in the paper are quite interesting. Would it be convenient to share links so they can be accessed?
src/examples/Cooking_cake.mp4
src/examples/goblin.mp4

Can I use Llama 2?

Hi,
I can't seem to get the link to download the LLaMA weights.
I filled out the form but didn't receive the link.

Thanks,

Token Merging

When you compare x1 with x2, equation (3) says that you compare N tokens and take the average.
My question is: x1 has a shape of C×N×D, so doesn't it have C×N tokens to compare with x2?

Unable to access LLaMA weights to build Vicuna-7B

Dear Team,

Thank you for sharing this great work with the community.

I am trying to set up the inference model for MovieChat, but I am having difficulties downloading the official LLaMA-7B weights, as they are not officially available from Meta.

The links in the instructions below do not cover LLaMA 1 models; only LLaMA 2 models can be downloaded from them.

Get the original LLaMA weights in the Hugging Face format by following the instructions here.

Could you upload the LLaMA-7B and LLaMA-13B weights so we can use them directly?

Thank you in advance!

Kind regards,
Muhammad Uzair

Will train/finetuning code be published?

Hi, thank you so much for this very interesting paper.

Are there any plans to release the code for training/finetuning MovieChat on other datasets?

Thank you so much!

Frame-aware?

Hello! I wanted to know whether this model is frame-aware. Can I ask questions like "When does the person wearing a yellow jacket appear in this video?" Models like VideoChat don't seem able to do this based on the Hugging Face demos, but in the paper I saw a figure where Video-LLaMA could say which frame something occurred in. Is MovieChat able to do that?
