
MovieChat's Introduction

I am a master's student in the Information Processing Lab at the University of Washington, currently working on embodied agents and video understanding. Have a look at my homepage for more details.

When I am not doing research, I like photography, traveling, and singing.





Updates:

  • 03/2024: One paper accepted to the LLM Agents workshop at ICLR 2024.
  • 02/2024: Two papers accepted to CVPR 2024.
  • 02/2024: Invited talk at the IMAGEOMICS workshop at AAAI 2024.
  • 12/2023: One paper accepted to ICASSP 2024.
  • 12/2023: One paper accepted to AAAI 2024.
  • 11/2023: Two papers accepted to WACV 2024 and its CV4Smalls workshop.
  • 09/2023: One paper accepted to the TNGCV-DataComp workshop at ICCV 2023.
  • 09/2023: One paper accepted to IEEE T-MM.
  • 08/2023: One paper accepted to BMVC 2023.
  • 07/2023: Two papers accepted to ACM MM 2023.
  • 07/2023: Finished my research internship at Microsoft Research Asia (MSRA), Beijing.
  • 07/2023: Two papers accepted to ICCV 2023.


MovieChat's Issues

YAML file

In the YAML file, should llama_model point to the --target directory produced by apply_delta.py, or to the --delta directory?

Evaluation json file

Hi, may I know which file is used for "--gt_file"? Is it the train, test, or val.json of MSVD-QA?
I want to replicate the results reported in the paper.

python run_inference_qa_msvd.py \
    --cfg-path eval_configs/MovieChat.yaml \
    --gpu-id 0 \
    --num-beams 1 \
    --temperature 1.0 \
    --video-path /path/to/your/video \
    --gt_file "/path/to/your/question and answer file" \
    --output_dir /path/to/your/output \
    --output_name msvd-qa \
    --fragment-video-path src/video_fragment/output.mp4

Have you released the model weights?

Thank you for your great work! I found that the pre-trained weight link points to the Video-LLaMA weights. Have you released the model weights for this work?

About quantitative evaluation

Hi, thanks for your work.
I am curious to know what you did to improve the quantitative evaluation results from v1 to v2.
Thanks.

Is positional embedding applied as intended when batch_size > 1?

The code assumes batch_size == 1, so in practice this is not a big deal, but the code is misleading.

Reproducing the pos emb expansion in MovieChat could be:

import torch
import torch.nn as nn

batch_size = 2       # > 1 to reproduce the problem; the repo effectively always uses batch_size == 1
max_frame_pos = 32
hidden_size = 768    # self.Qformer.config.hidden_size
alpha = 0.01
video_frame_position_embedding = nn.Embedding(max_frame_pos, hidden_size)  # [32, 768]

n_position = 8
position_ids = torch.arange(n_position).long()                    # [8]
position_ids = position_ids.unsqueeze(0).expand(batch_size, -1)   # [batch_size, 8]
p = video_frame_position_embedding(position_ids).squeeze(0)       # [batch_size, 8, 768] <- squeeze has no effect here if batch_size > 1

u = []
for p_i in p:  # <- this loops over the batch dimension, so all p_i are identical
    u_i = (p_i - alpha * p[0]) / (1 - alpha)
    u.append(u_i)
# so here, all u_i are identical, e.g., torch.all(u[0] == u[1]) is True

frame_position_embeddings = []
for i in range(n_position):
    for j in range(n_position):
        q_i = alpha * u[i] + (1 - alpha) * u[j]  # if 1 < batch_size < n_position, an IndexError occurs here
        q_i = q_i.unsqueeze(0)
        frame_position_embeddings.append(q_i)

frame_position_embeddings = torch.cat(frame_position_embeddings, dim=0)

In addition, if self.video_frame_position_embedding is fixed, why do we need to recompute frame_position_embeddings on every inference?
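A possible simplification, just a sketch on my side (it assumes the interpolation really only depends on the fixed embedding table): precompute the expanded embeddings once, e.g. in __init__, and reuse them.

import torch
import torch.nn as nn

# Hypothetical caching sketch; names mirror the snippet above.
n_position, hidden_size, alpha = 8, 768, 0.01
video_frame_position_embedding = nn.Embedding(32, hidden_size)

with torch.no_grad():
    base = video_frame_position_embedding(torch.arange(n_position))    # [8, 768]
    u = (base - alpha * base[0]) / (1 - alpha)                          # [8, 768]
    cached = torch.stack([alpha * u[i] + (1 - alpha) * u[j]
                          for i in range(n_position)
                          for j in range(n_position)])                  # [64, 768], computed once and reused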

How to Run the Demo

While running the inference exactly as recommended on the main page, using a random video test.mp4:

python inference.py --cfg-path eval_configs/MovieChat.yaml --gpu-id 0 --num-beams 1 --temperature 1.0 --text-query "What is he doing?" --video-path src/examples/test.mp4 --fragment-video-path src/video_fragment/output.mp4 --cur-min 1 --cur-sec 1 --middle-video 1

it crashes with the following error:

Traceback (most recent call last):
  File "MovieChat/inference.py", line 363, in <module>
    cv2.imwrite(temp_frame_path, frame)
cv2.error: OpenCV(4.7.0) /io/opencv/modules/imgcodecs/src/loadsave.cpp:783: error: (-215:Assertion failed) !_img.empty() in function 'imwrite'

Using these models in the .yaml:
llama_model: "ckpt/Llama-2-7b-hf"
llama_proj_model: 'ckpt/minigpt4/pretrained_minigpt4.pth'
ckpt: "ckpt/finetune-vicuna7b-v2.pth"

Q: How to resolve this?
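For context, that assertion fires when cv2.imwrite is handed an empty image, i.e. no frame was decoded from the video. A minimal check (hypothetical snippet, not from the repo) to confirm whether OpenCV can read the file at all:

import cv2

cap = cv2.VideoCapture("src/examples/test.mp4")
ok, frame = cap.read()
cap.release()
if not ok or frame is None:
    # No frame was decoded: the path is wrong or the codec is unsupported,
    # which is what later triggers the !_img.empty() assertion in imwrite.
    raise RuntimeError("OpenCV could not decode a frame from the video")
cv2.imwrite("src/video_fragment/frame_check.jpg", frame)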

Inference fails

I tried running inference with a sample and I got the following error:
Traceback (most recent call last):
  File "/l/users/mohamed.imam/MovieChatV2/MovieChat/inference.py", line 363, in <module>
    cv2.imwrite(temp_frame_path, frame)
cv2.error: OpenCV(4.7.0) /io/opencv/modules/imgcodecs/src/loadsave.cpp:783: error: (-215:Assertion failed) !_img.empty() in function 'imwrite'

Inference fails

When I run inference.py, it reaches this point and then the process gets killed. Can I know the reason for it?

Initializing Chat
Loading VIT
Loading VIT Done
Loading Q-Former
Using pad_token, but it is not set yet.
Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

How much memory is required? I tried on a machine with 100 GB of RAM and it still failed.
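For reference, a back-of-envelope estimate of the LLM weight footprint alone (my own rough arithmetic, not an official requirement):

# A 7B-parameter LLM needs ~14 GB in fp16 and ~28 GB in fp32 for the weights alone,
# before the visual encoder, Q-Former, activations, or temporary copies made while
# loading checkpoint shards.
params = 7e9
print(params * 2 / 1e9)  # fp16 bytes -> ~14 GB
print(params * 4 / 1e9)  # fp32 bytes -> ~28 GB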

errors when using the global mode

When using the global mode, how should the parameters be set?
I set:
"--cfg-path", "eval_configs/MovieChat.yaml",
"--gpu-id", "0",
"--num-beams", "1" ,
"--temperature", "1.0" ,
"--text-query", "Summarize the video" ,
"--video-path", "/home/develop/fyy/MovieChat-main/src/examples/11.mp4" ,
"--fragment-video-path", "src/video_fragment/output.mp4" ,
"--cur-min", "0" ,
"--cur-sec", "0" ,
"--middle-video", "0"
but I still get an error:
[screenshot of the error message]

Runtime Error

I got the following error when running the inference script:
Traceback (most recent call last):
  File "/home/mohamed.imam/Projects/MovieChat/inference.py", line 384, in <module>
    msg = chat.upload_video_without_audio(
  File "/home/mohamed.imam/Projects/MovieChat/inference.py", line 277, in upload_video_without_audio
    video_emb, _ = self.model.encode_long_video(cur_image, middle_video)
  File "/home/mohamed.imam/Projects/MovieChat/MovieChat/models/moviechat.py", line 337, in encode_long_video
    cur_short = torch.cat(self.temp_short_memory, dim = 0)
RuntimeError: torch.cat(): expected a non-empty list of Tensors

I skipped the cur_short assignment line, but in the next line the variable video_features is used before assignment:

if len(self.long_memory_buffer) == 0:
    self.temp_short_memory = [i.unsqueeze(0) for i in self.temp_short_memory]
    cur_short = torch.cat(self.temp_short_memory, dim=0)
    video_features = torch.cat([video_features, cur_image], dim=0)  # <- video_features is not defined in this branch
else:
    cur_video = torch.cat(self.long_memory_buffer, dim=0)
    self.temp_short_memory = [i.unsqueeze(0) for i in self.temp_short_memory]
    cur_short = torch.cat(self.temp_short_memory, dim=0)
    video_features = torch.cat([cur_video, cur_short], dim=0)
    video_features = torch.cat([video_features, cur_image], dim=0)
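As a possible workaround (my own guess at the intended behavior, not an official fix), an extra branch could cover the case where both memories are still empty, so video_features is always assigned before use:

# Hypothetical guard, not from the repo: fall back to the current image features
# when both the long-term buffer and the temporary short-term memory are empty.
if len(self.long_memory_buffer) == 0 and len(self.temp_short_memory) == 0:
    video_features = cur_image
elif len(self.long_memory_buffer) == 0:
    cur_short = torch.cat([i.unsqueeze(0) for i in self.temp_short_memory], dim=0)
    video_features = torch.cat([cur_short, cur_image], dim=0)
else:
    cur_video = torch.cat(self.long_memory_buffer, dim=0)
    cur_short = torch.cat([i.unsqueeze(0) for i in self.temp_short_memory], dim=0)
    video_features = torch.cat([cur_video, cur_short, cur_image], dim=0)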

inference bug

There are many SyntaxErrors in inference.py:
File "inference.py", line 82
def call(self, input_ids: torch.LongTensor, scores: torch.FloatTensor):
^
SyntaxError: invalid syntax

File "inference.py", line 134
msg = f"The video contains {len(indices)} frames sampled at {sec} seconds. "
^
SyntaxError: invalid syntax
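For what it's worth, both reported lines are valid Python 3 syntax (parameter annotations since Python 3.0, f-strings since 3.6), so a SyntaxError on them usually means the script is being run with an older interpreter. A quick sanity check (not from the repo):

import sys

# f-strings require Python 3.6+; a SyntaxError on otherwise valid code often
# means the script was launched with an older interpreter (e.g. Python 2).
print(sys.version)
assert sys.version_info >= (3, 6), "Run inference.py with Python 3.6 or newer"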

Breakpoint mode code questions?

Hi, thank you for the great work! I would appreciate it if you could address some of my questions. For the breakpoint mode, from the code below (see the screenshot after the questions),

  1. The num_frames is the number of segments before the current second, and the cur_frame is the frame index in the current segment. Do I understand it correctly?
  2. It seems that only the video fragments before the queried time are considered in the long-term memory. Is this true? If so, should it actually incorporate all fragments in the video?
  3. Why, when calling self.model.encode_short_memory_frame(video_fragment, cur_frame) in line 271, do we only consider the first cur_frame frames of each segment? Should this only apply to the last segment? In other words, for the segments before the last one, should we adopt line 273, which considers all frames in the segment?

[screenshot of the referenced code]

The inference output is garbled text; what went wrong?

[screenshot of the garbled output]
python inference.py --cfg-path eval_configs/MovieChat.yaml --gpu-id 0 --num-beams 1 --temperature 1.0 --text-query "describe this video" --video-path src/examples/p1472zrh7a0.mp4 --fragment-video-path src/video_fragment/output.mp4 --cur-min 1 --cur-sec 1 --middle-video 1
The output is all garbled. Could you take a look?

Pretraining details

Hi, the paper doesn't contain any explicit pretraining details. Where can I find them?

Ambiguous vicuna version

Awesome work.

May I confirm that the Vicuna weights are v0, not v1?
There may be a typo, as shown below.
[screenshot of the suspected typo]

how to correctly run the demo

I want to use the demo to get answers about a video. Unfortunately, the output I got from your demo does not seem to answer the input text query (the last line in the screenshot is the output):

[screenshot of the demo output]

Question About Similarity Formulation in Token Merge

Hello, in your paper the similarity is calculated as:
$$s=\frac{1}{N}\sum_{j=1}^{N} [\cos (x_i^j,x_{i+1}^j)]$$

However, according to the code,
the similarity is actually computed as:
$$s=\frac{1}{N^2}\sum_{j=1}^{N} \sum_{k=1}^{N} [\cos (x_i^j,x_{i+1}^k)]$$

This confused me. Which one is correct, or am I missing something? Thank you.
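For concreteness, here is a small sketch of the two computations as I understand them (the token shapes are hypothetical):

import torch
import torch.nn.functional as F

N, D = 32, 768                                    # hypothetical tokens per frame and feature dim
x_i, x_next = torch.randn(N, D), torch.randn(N, D)

# Paper, Eq. (3): average cosine over matched token pairs (j, j) -> (1/N) * sum_j cos(x_i^j, x_{i+1}^j)
s_paper = F.cosine_similarity(x_i, x_next, dim=-1).mean()

# Code, as I read it: average cosine over all N*N pairs (j, k) -> (1/N^2) * sum_{j,k} cos(x_i^j, x_{i+1}^k)
s_code = (F.normalize(x_i, dim=-1) @ F.normalize(x_next, dim=-1).T).mean()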

Can not run local demo

Dear author,

I encountered the following error when running the local Gradio demo:

IndexError: index 64 is out of bounds for dimension 0 with size 64

Any hints?

Thanks

Model is being loaded into RAM not GPU memory

Loading checkpoint shards gets killed at 0%.
After debugging, I found that the model is being loaded into RAM rather than GPU memory, despite GPU memory being available. Has anyone faced this issue or know why it is happening?
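As a possible direction (a sketch only; it assumes the LLM is loaded through Hugging Face transformers with accelerate installed, which I have not verified against this repo's loading code):

import torch
from transformers import AutoModelForCausalLM

# low_cpu_mem_usage streams the shards instead of materializing the whole model
# in CPU RAM first, and device_map places the weights directly on GPU 0.
# The checkpoint path below is a placeholder.
model = AutoModelForCausalLM.from_pretrained(
    "ckpt/Llama-2-7b-hf",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map={"": 0},
)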

About short_memory_merge and long_memory_length

From the inference code, it can be seen that the encode_short_memory_frame function uses short_memory_merge for merging; however, the long_memory_length parameter, which looks more like the RL mentioned in the paper, is not used. Could you provide a detailed explanation of how the memory mechanism operates, for example as pseudocode or an algorithm? The descriptions in the paper and the code really confuse me.
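For reference, here is my rough reading of the consolidation step as a sketch (function and parameter names are mine, not the repo's), in case it helps frame the question:

import torch
import torch.nn.functional as F

def merge_most_similar(frames):
    # Greedily merge the most similar adjacent pair of frame features (ToMe-style).
    # frames: list of [N, D] tensors; purely illustrative.
    sims = torch.stack([F.cosine_similarity(frames[i], frames[i + 1], dim=-1).mean()
                        for i in range(len(frames) - 1)])
    i = int(sims.argmax())
    merged = (frames[i] + frames[i + 1]) / 2
    return frames[:i] + [merged] + frames[i + 2:]

def memory_step(frame_feat, short_mem, long_mem, short_len=16, merged_len=2):
    # One frame update as I understand it: append to short-term memory; once it is
    # full, repeatedly merge similar neighbours down to merged_len features and
    # move them into long-term memory.
    short_mem.append(frame_feat)
    if len(short_mem) >= short_len:
        while len(short_mem) > merged_len:
            short_mem = merge_most_similar(short_mem)
        long_mem.extend(short_mem)
        short_mem = []
    return short_mem, long_mem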

Could the example videos be made available?

The two example videos in the paper are quite interesting. Would it be convenient to share links so they can be accessed?
src/examples/Cooking_cake.mp4
src/examples/goblin.mp4

Can I use Llama 2?

Hi,
I can't seem to get the link to download the LLaMA weights.
I filled out the form but didn't receive the link.

Thanks,

Token Merging

When you compare x1 with x2, equation (3) says that you compare N tokens and take the average.
My question is: x1 has a shape of C×N×D, so doesn't it have C×N tokens to compare with x2?

Unable to access LLaMA weights to build Vicuna-7B

Dear Team,

Thank you for sharing this great work with the community.

I am trying to set up the inference model for MovieChat, but I am having difficulties downloading the official LLaMA-7B weights, as they are not officially available from Meta.

The links in the instructions below do not cover LLaMA 1 models; only LLaMA 2 models can be downloaded from them.

Get the original LLaMA weights in the Hugging Face format by following the instructions here.

Could you upload the LLaMA-7B and LLaMA-13B weights so we can use them directly?

Thank you in advance!

Kind regards,
Muhammad Uzair

Will train/finetuning code be published?

Hi, thank you so much for this very interesting paper.

Are there any plans to release the code for training/finetuning MovieChat on other datasets?

Thank you so much!

Frame-aware?

Hello! I wanted to know whether this model is frame-aware. Can I ask questions like "When does the person wearing a yellow jacket appear in this video?" Models like VideoChat don't seem able to do this based on the Hugging Face demos, but in the paper I saw a figure where Video-LLaMA could say which frame something occurred in. Is MovieChat able to do that?
